Failure to define failure leads to confusion
EP Editorial Staff | November 2, 2000
Failure modes, failure causes, and failure effects are important concepts in reliability-centered maintenance (RCM) and similar processes. Without a clear understanding of these failure terms, analyses often become confusing and can lead to incorrect decisions.
For as long as I can recall, there have been varying degrees of confusion about what people mean when they use terminology that involves the word “failure.”
Failure is an unpleasant word, and we often use substitute words such as anomaly, defect, discrepancy, irregularity, etc., because they tend to sound less threatening or less severe.
The spectrum of interpretations for failure runs from negligible glitch to catastrophe. Might I suggest that the meaning is really quite simple:
Failure is the inability of a piece of equipment, a system, or a plant to meet its expected performance.
This expectation is always spelled out in a specification in our engineering world, and, when properly written, leaves no doubt as to exactly where the limits of satisfactory performance reside. So, failure is the inability to meet specifications. Simple enough, I believe, to avoid much of the initial confusion.
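To make the definition concrete, here is a minimal sketch in Python of "failure is the inability to meet specifications." The class name, the parameter, and the numeric limits are all hypothetical and chosen only for illustration; a real specification may set limits on many parameters at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Specification:
    """Acceptable performance band for one measured parameter (hypothetical)."""
    parameter: str
    lower_limit: float
    upper_limit: float

    def is_failure(self, measured: float) -> bool:
        # Failure means performance falls outside the specified limits.
        return not (self.lower_limit <= measured <= self.upper_limit)


# Illustrative numbers only: a pump specified to deliver 95 to 105 gal/min.
pump_flow = Specification("discharge flow (gal/min)", 95.0, 105.0)

print(pump_flow.is_failure(100.0))  # False: within specification
print(pump_flow.is_failure(90.0))   # True: below the lower limit
```

The point of the sketch is that the limits come from the specification, not from anyone's opinion of how bad the shortfall feels.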
Additionally, there are several important and frequently used phrases that include the word failure: failure symptom, failure mode, failure cause, and failure effect.
Failure symptom: This is a telltale indicator that alerts us (usually the operator) to the fact that a failure is about to occur. Our senses or instruments are the primary source of such indication. Failure symptoms may or may not tell us exactly where the pending failure is located or how close to the full failure condition we might be. In many cases, there is no failure symptom (or warning) at all. Once the failure has occurred, any indication of its presence is no longer a symptom; we now observe its effect.
Failure mode: This is a brief description of what is wrong. It is extremely important for us to understand this simple definition because, in the maintenance world, it is the failure mode that we try to prevent, or, failing that, what we have to physically fix.
There are hundreds of simple words that we use to develop appropriate failure mode descriptions: jammed, worn, frayed, cracked, bent, nicked, leaks, clogged, sheared, scored, ruptured, eroded, shorted, split, open, torn, and so forth. The main confusion here is clearly distinguishing between failure mode and failure cause—and understanding that failure mode is what we need to prevent or fix.
Failure cause: This is a brief description of why it went wrong. Failure cause is often very difficult to fully diagnose or hypothesize. If we wish to attempt a permanent prevention of the failure mode, we usually need to understand its cause (thus the term, root cause failure analysis). Even though we may know the cause, we may not be able to totally prevent the failure mode, or it may cost too much to pursue such a path.
As a simple illustration, a gate valve jams “closed” (failure mode), but why did this happen? Let’s say that this valve sits in a very humid outside environment, so “humidity-induced corrosion” is the failure cause. We could opt to replace the valve with a high-grade stainless steel model that would resist (perhaps stop) the corrosion (a design fix), or, from a maintenance point of view, we could periodically lubricate and operate the valve to mitigate the corrosive effect, but there is nothing we can do to eliminate the natural humid environment. Thus, PM tasks cannot fix the cause; they can address only the mode. This is an important distinction, and many people do not clearly understand it.
Failure effect: Finally, we briefly describe the consequence of the failure mode should it occur. To be complete, this is usually done at three levels of assembly—local, system, and plant. In describing the effect in this fashion, we clearly see the buildup of the consequences. With our jammed gate valve, the local effect at the valve is “stops all flow.” At the system level, “no fluid passes on to the next step in the process,” and finally, at the plant level, “product production ceases (downtime) until the valve can be restored to operation.”
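The four terms can be kept separate by giving each its own field in a record. The sketch below, in Python, captures the jammed gate valve example; the class and field names are hypothetical and do not come from any RCM standard or software package.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailureEffect:
    """Consequence of the failure mode at three levels of assembly."""
    local: str
    system: str
    plant: str


@dataclass(frozen=True)
class FailureRecord:
    """One analysis entry; names are illustrative assumptions."""
    component: str
    mode: str                      # what is wrong (what we prevent or fix)
    cause: str                     # why it went wrong
    effect: FailureEffect          # what happens if the mode occurs
    symptom: Optional[str] = None  # None when there is no warning at all


gate_valve = FailureRecord(
    component="gate valve",
    mode="jams closed",
    cause="humidity-induced corrosion",
    effect=FailureEffect(
        local="stops all flow",
        system="no fluid passes on to the next step in the process",
        plant="product production ceases (downtime) until the valve "
              "can be restored to operation",
    ),
)

# Walk up the levels of assembly to see the consequences build.
for level in ("local", "system", "plant"):
    print(f"{level}: {getattr(gate_valve.effect, level)}")
```

Keeping mode and cause in separate fields makes it harder to write a cause where a mode belongs, which is the confusion described above.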
Thus, without a clear understanding of failure terminology, reliability analyses not only become confusing, but also can lead to decisions that are incorrect.
Anthony M. “Mac” Smith, San Jose, CA, is a pioneer in the application of Reliability-Centered Maintenance (RCM) to complex plants and facilities. Mac has 47 years of engineering experience, the past 18 of which focused on RCM program installation. He is recognized internationally for his book Reliability-Centered Maintenance.