Understanding Hidden Failures in RCM Analyses

EP Editorial Staff | January 1, 2003

Addressing hidden failure modes is a key aspect for successfully achieving plant reliability.

Reliability Centered Maintenance (RCM) is not new. Airline Maintenance Steering Group (MSG) Logic, the predecessor to RCM, has existed since the early 1960s. F. Stanley Nowlan and Howard Heap of United Airlines introduced formal RCM to the commercial aviation industry in 1978, when the first edition of Reliability Centered Maintenance was published, and airline reliability is primarily based on this work. Their vision is as relevant today as it was then.

Today, almost everyone in a manufacturing, power-generating, or technological environment is familiar with the concept of RCM. However, the perceived degree of familiarity with RCM may be deceptive. RCM is simple in concept but subtle and sophisticated in its application.

As with many processes, a simplistic and limited understanding of RCM may prove more problematic than beneficial. It is unrealistic to take false comfort in the naïve belief that a superficial implementation of the process will be a panacea for plant equipment problems, and then to depend on that process to produce significant reliability results.

Analyzing a system
The simple understanding of RCM consists of identifying system functions, functional failures, the consequences of those failures, and so on. However, Nowlan and Heap placed great importance on understanding hidden failures, which are not widely understood and are often overlooked when an RCM analysis is performed.

The true reliability benefits of RCM become evident only with a thorough understanding of how to functionally analyze a system. Understanding hidden failure modes, understanding when a single-failure analysis is not acceptable, and understanding when run-to-failure (RTF) is acceptable are the real cornerstones of RCM. Additionally, the subtle but important distinction between true redundancy and redundant components fulfilling a backup function is a key to reliability success.

Many utilities and other industries have implemented an RCM program only to find that they continued to have fundamental reliability issues that were not addressed by their analysis. The primary reason is the lack of a grass-roots philosophical understanding of the principles governing the analysis.

Identifying important equipment
Optimizing a preventive maintenance program consists of three phases: Phase 1, identifying equipment that is important to plant safety, operation, and asset protection; Phase 2, specifying the requisite PM tasks for the equipment identified in Phase 1; and Phase 3, properly executing the tasks specified in Phase 2.

At the very least, identifying equipment important to plant safety, operation, and asset protection rests on three programmatic principles that must be well understood before an RCM analysis begins:

  1. Understand the cornerstones for developing an effective RCM program.
  2. Identify the defensive strategies for maintaining an effective RCM program.
  3. Identify when a component can be classified as RTF and understand the limitations governing RTF components.

A detailed look at each of these principles illustrates the key areas for successfully achieving plant reliability and maximizing cost containment efforts.

Understand the cornerstones
There are three cornerstones that must be understood for developing an effective RCM program:

  • Know when a single-failure analysis is not acceptable.
  • Identify hidden failures.
  • Know when a multiple-failure analysis is required.

A single-failure analysis is not acceptable when the occurrence of the failure is hidden. When a component is required to perform its function and the occurrence of its failure is not evident to operating personnel (that is, the immediate overall operation of the system remains unaffected in either the normal or the demand mode of operation), the failure mode is defined as hidden.

A multiple-failure analysis is required when the occurrence of a single failure is hidden. Addressing hidden failure modes is a key aspect for maintaining plant reliability.
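The hidden-failure test and its consequence for the analysis can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the author's procedure: the `FailureMode` record and its two flags are hypothetical names that encode whether operating personnel would notice the failure in the normal and in the demand mode of operation.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """Hypothetical record describing one failure mode of a component."""
    name: str
    evident_in_normal_mode: bool  # would operators notice it during normal operation?
    evident_in_demand_mode: bool  # would operators notice it when the function is demanded?


def is_hidden(mode: FailureMode) -> bool:
    # Hidden: system operation stays unaffected, so operators see nothing,
    # in both the normal and the demand mode of operation.
    return not (mode.evident_in_normal_mode or mode.evident_in_demand_mode)


def required_analysis(mode: FailureMode) -> str:
    # A single-failure analysis is not acceptable for a hidden failure;
    # it must be analyzed in combination with one or more additional failures.
    return "multiple-failure" if is_hidden(mode) else "single-failure"


standby = FailureMode("redundant pump fails, no individual indication", False, False)
running = FailureMode("sole supply pump trips", True, True)
print(required_analysis(standby))  # multiple-failure
print(required_analysis(running))  # single-failure
```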

Identify the defensive strategies
There are three distinct lines of defense for maintaining an effective RCM program. The first strategy for defending a plant against unplanned equipment failures is identifying critical components. These are components where a single failure will result in one or more consequences similar to the following:

  • A direct impact to personnel or plant safety.
  • A plant trip or shutdown of a manufacturing facility.
  • A power reduction, down power, or the loss of a facility’s operational capability.
  • An inadvertent actuation of a safety system.
  • An unplanned forced outage.
  • Other (depending on specific type of plant or industry)

The second line of defense for protecting a plant or facility is to identify what this author refers to as potentially critical components. If one of these components fails when called upon to function, the failure is hidden and has no immediate effect on the plant. However, the hidden failure in combination with one or more additional failures will result in consequences similar to the following:

  • A direct impact to personnel or plant safety.
  • A plant trip or shutdown of a manufacturing facility.
  • A power reduction, down power, or the loss of a facility’s operational capability.
  • An inadvertent actuation of a safety system.
  • An unplanned forced outage.
  • Other (depending on specific type of plant or industry)

Note the similarities between critical and potentially critical components. The only difference is that critical failures manifest themselves immediately, while failures of potentially critical components are hidden and do not manifest themselves until a second failure occurs.
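The distinction between the first two lines of defense reduces to two questions: does a single failure by itself produce one of the listed consequences, and if not, is the failure hidden and would it produce one of them in combination with further failures? A minimal sketch follows, assuming a hypothetical `Consequence` enumeration built from the bullet lists above (the open-ended "Other" item is left out); the function name and return strings are illustrative.

```python
from __future__ import annotations

from enum import Enum


class Consequence(Enum):
    # Hypothetical encoding of the consequence lists above.
    SAFETY = "direct impact to personnel or plant safety"
    TRIP = "plant trip or facility shutdown"
    CAPABILITY_LOSS = "power reduction or loss of operational capability"
    SAFETY_SYSTEM_ACTUATION = "inadvertent actuation of a safety system"
    FORCED_OUTAGE = "unplanned forced outage"


def defensive_category(
    single_failure: set[Consequence],
    failure_hidden: bool,
    with_additional_failures: set[Consequence],
) -> str:
    """Classify a component by the consequences of its failure."""
    if single_failure:
        # One failure alone produces the consequence, so it appears immediately.
        return "critical"
    if failure_hidden and with_additional_failures:
        # No immediate effect, but the hidden failure plus a later failure
        # produces a plant consequence.
        return "potentially critical"
    return "not critical or potentially critical"


print(defensive_category({Consequence.TRIP}, False, set()))  # critical
print(defensive_category(set(), True, {Consequence.TRIP}))   # potentially critical
```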

To better understand the concept of potentially critical components (which is entirely different from the potential failure of a given component), consider the following example.

When two or more components (valves, pumps, motors, etc.) operate in parallel flow paths to supply a function, only one component is required to fulfill the function, and there is no individual indication of failure for each component, then a failure of one of the components will be hidden (nothing indicates that it has failed) and will not result in a plant effect. However, if the second component should also fail, a plant-affecting consequence would occur. Hence, the component is considered to be potentially critical.
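The sequence in this example can be traced with a short simulation. It is a sketch under stated assumptions: the components are interchangeable, `required` of them must work to satisfy the function, and none has its own failure indication, so the only symptom operators can observe is loss of the function itself. The function name and messages are illustrative.

```python
from __future__ import annotations


def observe_failures(components: list[str], required: int, failure_sequence: list[str]) -> list[str]:
    """Return what operators would observe as parallel components fail one at a time.

    Assumes no component has an individual failure indication, so a failure is
    observable only when fewer than `required` components remain working.
    """
    working = set(components)
    observations = []
    for failed in failure_sequence:
        working.discard(failed)
        if len(working) >= required:
            observations.append(f"{failed} failed: no indication (hidden)")
        else:
            observations.append(f"{failed} failed: functional failure, plant effect")
    return observations


for line in observe_failures(["pump A", "pump B"], required=1, failure_sequence=["pump A", "pump B"]):
    print(line)
# pump A failed: no indication (hidden)
# pump B failed: functional failure, plant effect
```

The first failure is reported as hidden because the function is still satisfied; the plant effect appears only with the second failure, which is why the first component is treated as potentially critical.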

Another example involves a pump discharge check valve. If two pumps are operating at the same time, a failure of one pump's discharge check valve in the open position will be hidden. Only when the pump served by that check valve fails will the unwanted reverse flow path through the failed-open check valve become evident.
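The check valve case can be expressed the same way. This sketch assumes, as a simplification, that reverse flow occurs through the discharge line of a pump that has stopped while another pump keeps running, and that a failed-open check valve on that line is the only thing allowing it; the function and dictionary names are hypothetical.

```python
from __future__ import annotations


def reverse_flow_evident(running: dict[str, bool], check_valve_failed_open: dict[str, bool]) -> bool:
    """True if a failed-open discharge check valve would show up as reverse flow.

    Reverse flow needs a stopped pump whose check valve is failed open and at
    least one other pump still running to push flow back through that line.
    """
    for pump, is_running in running.items():
        others_running = any(r for p, r in running.items() if p != pump)
        if not is_running and check_valve_failed_open[pump] and others_running:
            return True
    return False


valves = {"pump A": False, "pump B": True}  # pump B's check valve has failed open
print(reverse_flow_evident({"pump A": True, "pump B": True}, valves))   # False: still hidden
print(reverse_flow_evident({"pump A": True, "pump B": False}, valves))  # True: now evident
```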

How prevalent are hidden failures? Extremely. Just a few examples include main turbine overspeed components, many check valves, diesel generator fuel oil pumps, and emergency diesel generator shutdown components. Identifying potentially critical components affords perhaps the greatest degree of reliability protection for a plant or facility.

Hidden failures are typically failures of one or more components aligned in parallel with no indication of failure for each individual component. In Fig. 1, for example, one of the two components could fail, but since each one by itself can satisfy the function, the functional failure will become evident only when the second one fails; therefore, the failure of the first component is potentially critical.

How important is this concept? Very. There are many examples in industry where a designer intentionally builds in multiple redundancy to ensure reliable system operation. Unfortunately, if a failure of the redundant equipment has no way of manifesting itself, a plant-affecting consequence can occur with the second failure.

There is a vast difference between a component that performs a backup function and one that does not (Fig. 2). In Example 2, the component is an RTF component, while the component in Example 4 is critical.

The third line of defense to protect a plant is to identify economically significant components. These are components whose failure will not be critical or potentially critical, but will result in one or more of these economic concerns:

  • An unacceptable cost of replacement or restoration.
  • An unacceptable corrective maintenance history.
  • A long lead time for replacement parts.
  • An obsolescence issue.
  • Other (depending on specific type of plant or industry)

Failures of economic components have no effect on plant safety or operability. Economic failures will result only in labor and/or parts replacement costs. It is important to keep this economic categorization separate from critical and potentially critical components to enable a prioritization of work.

Note: If a major piece of equipment fails (even one that is economically significant) and the failure affects plant safety or operation or causes a plant outage, it is more than merely an economic consideration. It would be captured as either a critical or a potentially critical consequence of failure.
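Keeping the categories separate makes it straightforward to rank work, and the note above amounts to a precedence rule: when a component qualifies for more than one category, the plant-affecting category governs. A minimal sketch, assuming a hypothetical numeric ranking (lower number means higher priority) and a hypothetical backlog in which each component is already tagged with every category that applies to it:

```python
from __future__ import annotations

# Hypothetical ranking: plant-affecting categories ahead of purely economic ones.
PRIORITY = {
    "critical": 1,
    "potentially critical": 2,
    "economically significant": 3,
}
OTHER_RANK = len(PRIORITY) + 1  # components in none of the three categories go last


def governing_category(applicable: set[str]) -> str | None:
    """Return the highest-precedence category that applies, or None if none does."""
    ranked = sorted(applicable.intersection(PRIORITY), key=PRIORITY.__getitem__)
    return ranked[0] if ranked else None


def prioritize(backlog: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Order components by governing category; sorted() is stable, so ties keep input order."""
    tagged = [(name, governing_category(cats) or "other") for name, cats in backlog.items()]
    return sorted(tagged, key=lambda item: PRIORITY.get(item[1], OTHER_RANK))


# Hypothetical backlog: each component tagged with every category that applies.
backlog = {
    "feed pump motor": {"economically significant", "critical"},
    "lube oil cooler": {"economically significant"},
    "standby pump discharge check valve": {"potentially critical"},
    "hallway light fixture": set(),
}
for name, category in prioritize(backlog):
    print(f"{category}: {name}")
```

The feed pump motor is both costly and plant-affecting, so it is ranked as critical rather than filed as merely economic.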

Identify RTF components, understand limitations
RTF in its most basic definition means PMs are not required prior to failure. This does not imply that the component is unimportant and never needs to be fixed. Corrective maintenance is required in a timely manner after failure to restore the component to an operable status. RTF components are understood to have no safety, operational, commitment, or economic consequences as the result of a single failure. Also, the occurrence of failure must be evident to operations personnel.

RTF components are designated as such because a failure is evident and there is no significant consequence from a single failure. If it does not matter whether a failed component is ever restored to an operable status, one would question why that component is even installed in the plant.
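The RTF criteria above form a conjunction: every consequence of a single failure must be absent, and the failure must be evident. A minimal checklist sketch follows; the `SingleFailureProfile` fields are hypothetical names mirroring the four consequence types and the evidence condition in the text. Returning the reasons, rather than a bare yes or no, shows why a component was rejected.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SingleFailureProfile:
    """Hypothetical summary of what one failure of a component would cause."""
    safety_consequence: bool
    operational_consequence: bool
    commitment_consequence: bool
    economic_consequence: bool
    failure_evident: bool  # would operations personnel know it has failed?


def rtf_disqualifiers(p: SingleFailureProfile) -> list[str]:
    """List the reasons a component cannot be run to failure (empty means it can)."""
    reasons = []
    for kind in ("safety", "operational", "commitment", "economic"):
        if getattr(p, f"{kind}_consequence"):
            reasons.append(f"{kind} consequence from a single failure")
    if not p.failure_evident:
        reasons.append("failure is hidden")
    return reasons


# RTF means no PM before failure; corrective maintenance is still needed afterward.
indicator_lamp = SingleFailureProfile(False, False, False, False, failure_evident=True)
standby_valve = SingleFailureProfile(False, False, False, False, failure_evident=False)
print(rtf_disqualifiers(indicator_lamp))  # []
print(rtf_disqualifiers(standby_valve))   # ['failure is hidden']
```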

The heart of reliability is a sound preventive maintenance program, and RCM provides the most prudent approach for establishing an effective PM program. MT

Neil Bloom is program manager, RCM and preventive maintenance programs, at Southern California Edison’s San Onofre Nuclear Generating Station. He previously worked in the commercial airline industry in both maintenance and engineering management positions. He can be reached at Mail Unit K-50, P.O. Box 128, San Clemente, CA 92672; (949) 368-6378



