Looking For Failure In All The Wrong Places

EP Editorial Staff | October 6, 2017

Focus on ‘zones’ when reliability of plant equipment is concerned.

By Randall Noon, P.E.

When new equipment is put into service, it’s usually integrated into a plant’s equipment-monitoring program. The purpose of this program is to check the progress of degradation, i.e., to look for failure. Typically, that’s a two-step process. The first step is to measure important operating parameters at regular intervals. The second is to plot the measurements against time to understand how things are progressing. When a measured parameter exceeds its safe operating limit, the equipment is removed from service and refurbished. There is, however, an error trap in the first step: It involves “regular intervals.”

Most “regular intervals” are based on monitoring equipment as it performs in Zone II of its Weibull probability density curve. Failure, though, typically occurs in Zone III. Sampling periods sufficient to monitor equipment performance in Zone II are typically insufficient to provide adequate warning of impending failure when the equipment is operating in Zone III. In short, you’re looking for failure in the wrong places.

What’s happening?

Let’s start this discussion with a refresher on Zones I, II, and III.

As well-informed reliability and maintenance engineers know, a plot of equipment failure rate versus its service time tends to follow a mixed Weibull probability density curve (see above), otherwise known as the bathtub curve.

While the information needed to generate a specific bathtub curve from its fundamental Weibull probability density function depends upon several parameters, the principles embedded in a generalized curve of this type are fundamental to understanding equipment reliability.

A typical bathtub curve is composed of three distinct parts, or zones.

Zone I has a Weibull function β shape factor, sometimes called the Weibull slope, equal to about 0.5. This portion of the bathtub curve is typically associated with equipment infant mortality failures, where the failure rate decreases with time in service. Failures in this part of the curve are the result of equipment installation errors, equipment assembly errors at the factory, or part defects. This is sometimes referred to as the break-in or shake-down period. Normally, this reflects the shortest portion of the bathtub curve with respect to time. In general, the most reliable operating time of the equipment, i.e., its lowest failure rate, occurs directly after the equipment has passed through Zone I and is beginning to operate in Zone II.

Zone II has a Weibull β shape factor equal to about 1.0. This portion of the bathtub curve is typically associated with normal service life and the failure rate in this zone is relatively low. It is basically a random failure rate function that is more or less linear. (Referring to the accompanying Sidebar, note how the equation simplifies when β = 1 and γ = 0.) The failure-rate-versus-time curve in Zone II is either flat, with little increase over time, or the failure rate increase over time is gradual. This is the longest portion of the bathtub curve and represents the period of time when the equipment provides reliable service with normal maintenance and care.

Zone III has a Weibull function β shape factor equal to about 3.0. A significant change in the β shape factor signals a significant change in the failure rate. This portion of the bathtub curve is associated with end-of-life failures due to age and wear. As the equipment is operated further into this region of the curve, the failure rate climbs steeply. This rise is often described loosely as exponential; strictly, for β = 3 and γ = 0, the Weibull failure rate grows with the square of time in service.
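The three zone descriptions above can be checked numerically. The following is a minimal sketch, assuming the Sidebar equation is the standard three-parameter Weibull failure-rate function, h(t) = (β/η)·((t − γ)/η)^(β − 1). The symbol η (scale parameter, or characteristic life) and the value chosen for it are assumptions made for illustration; they do not come from the article.

```python
def weibull_hazard(t, beta, eta, gamma=0.0):
    """Weibull failure rate h(t) = (beta/eta) * ((t - gamma)/eta)**(beta - 1).

    beta  -- shape factor (the article's Weibull slope)
    eta   -- scale parameter, assumed here; not named in the article
    gamma -- location parameter (0 means the clock starts at installation)
    """
    if t <= gamma:
        raise ValueError("t must be greater than the location parameter gamma")
    return (beta / eta) * ((t - gamma) / eta) ** (beta - 1.0)


# Hypothetical characteristic life in hours, chosen only for illustration.
ETA = 1000.0

for label, beta in (("Zone I", 0.5), ("Zone II", 1.0), ("Zone III", 3.0)):
    rates = [weibull_hazard(t, beta, ETA) for t in (100.0, 200.0, 400.0)]
    print(label, ["%.2e" % r for r in rates])
```

With β = 0.5, the rate falls as service time grows. With β = 1 and γ = 0, the time-dependent term drops out and the rate is the constant 1/η. With β = 3, each doubling of service time quadruples the rate.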

Some contend that the optimum time to replace or refurbish equipment to maximize its service time and minimize risk, based upon the bathtub curve, is when the failure rate in Zone III becomes equal to or slightly greater than the maximum failure rate in Zone I. That means replacing equipment when the chance of failure due to age effects becomes equal to or slightly greater than the chance of failure due to infant mortality. Otherwise, according to this argument, there’s a greater risk of failure in replacing the equipment than just letting it remain in place. While inspection of the bathtub curve in Fig. 1 indicates the basic merit of this idea, at least as a starting point, there are two factors to consider. They could lead to a modification of this strategy.

During the Zone I infant mortality period, the equipment often isn’t actually in service. In recognition of the potential for installation errors, assembly errors, or defective parts, many organizations require that equipment be run or tested without being in a production mode to check for such latent deficiencies.

This is the reasoning behind static-pressure testing, load testing, post-maintenance testing, and circuit-board burn-in periods. The idea is to stress the equipment in a short period of time prior to being put into actual service to force any potential infant-mortality failures to occur. This removes some, but not necessarily all, of the infant-mortality risk. If such testing is sufficiently thorough, however, the remaining small failure rate for Zone I may be less than the risk at some point during normal operation of the equipment, i.e., in Zone II.

Zone III end-of-life failures often proceed exponentially through their failure stages. Because internal positive-feedback mechanisms kick in once a certain threshold has been crossed, initial small deviations from normal can progress rapidly from a barely detectable precursor to complete catastrophic failure.

Thus, trying to squeeze out that last bit of service life from equipment before total failure occurs is a risky move. The steep increase in the curve in Zone III indicates that the failure rate changes exponentially, and rapidly increases in a short amount of time. This introduces an increasing amount of uncertainty in any time-to-failure projections.

That said, the Zone III end-of-life-failures item noted above basically sums up the previously stated point of this article: A regular condition-monitoring schedule that’s capable of tracking and predicting equipment performance in Zone II is inadequate when the equipment is operating in Zone III.

In Zone II, the failure rate is stable and predictable with only occasional random failures occurring. Regular condition monitoring, performed every three months, six months, or annually, may be sufficient to verify that the equipment is slowly and predictably showing the expected effects of age and wear. This data can then be readily extrapolated to predict, with reasonable accuracy, what the values of the monitored parameters will be when the equipment is next monitored. Simple linear extrapolation often works reasonably well in Zone II.
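As a rough illustration of the Zone II extrapolation described above, the sketch below fits a straight line to a handful of readings by ordinary least squares, then projects the next reading and the time at which the trend would reach an operating limit. The readings, the limit, and the function names are all hypothetical; this is one simple way to do the projection, not a prescribed method.

```python
def fit_line(times, values):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def predict(times, values, t_next):
    """Projected value of the monitored parameter at time t_next."""
    slope, intercept = fit_line(times, values)
    return slope * t_next + intercept


def time_to_limit(times, values, limit):
    """Time at which the fitted line reaches the limit, or None if not rising."""
    slope, intercept = fit_line(times, values)
    if slope <= 0:
        return None
    return (limit - intercept) / slope


# Hypothetical quarterly readings: months in service vs. a measured parameter.
months = [0, 3, 6, 9, 12]
readings = [2.0, 2.3, 2.6, 2.9, 3.2]

print("Projected reading at 15 months:", round(predict(months, readings, 15), 2))
print("Months until limit of 5.0:", round(time_to_limit(months, readings, 5.0), 1))
```

A straight-line projection of this kind is only as good as the assumption that the trend stays linear, which is the assumption that stops holding once the equipment moves into Zone III.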

In Zone III, however, the failure rate increases exponentially with time. Catastrophic failure can occur in a short period of time: from no indication at all to complete failure in just a week, a couple of days, or even just a few hours. The data collected from regularly spaced condition monitoring in Zone II is generally insufficient to accurately indicate impending end-of-life failure in Zone III unless the monitoring periods in Zone III are also exponentially shortened to match the exponentially increasing failure rates. This is the significance of the β shape factor being 3 instead of 1.

Keeping up with the rapidly changing failure rate in Zone III is difficult. Some companies attempt to do this by approximating exponentially changing failure rates with a series of step functions. For example, they might shorten a quarterly condition-monitoring period to a monthly period, then shorten that to a weekly period, and, perhaps, eventually shorten it to every four hours. There’s a significant problem in this approach, however.
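The step-function approach just described might be sketched as follows. The idea assumed here is to pick the longest standard monitoring period for which the approximate chance of failure between inspections, taken as failure rate times interval, stays at or below a chosen ceiling. The list of periods, the ceiling, and the Weibull parameters are hypothetical values for illustration, and the failure rate is evaluated at the start of each interval, which understates the risk when the rate is rising quickly.

```python
# Standard monitoring periods in hours, longest first (hypothetical choices):
# quarterly, monthly, weekly, daily, every four hours.
STEPS_HOURS = [2190.0, 730.0, 168.0, 24.0, 4.0]


def weibull_hazard(t, beta, eta):
    """Two-parameter Weibull failure rate (location parameter taken as 0)."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)


def monitoring_interval(t, beta, eta, risk_per_interval):
    """Longest standard period with hazard(t) * period <= risk_per_interval.

    Falls back to the shortest period if none satisfies the ceiling.
    """
    h = weibull_hazard(t, beta, eta)
    for step in STEPS_HOURS:
        if h * step <= risk_per_interval:
            return step
    return STEPS_HOURS[-1]


# Hypothetical unit: beta = 3, characteristic life of 300,000 operating hours,
# and a ceiling of 0.5% approximate failure probability between inspections.
for age in (100_000.0, 200_000.0, 300_000.0):
    print(age, "hours in service ->", monitoring_interval(age, 3.0, 300_000.0, 0.005))
```

In this made-up case the period steps down from quarterly to monthly to weekly as the unit ages. The steps only help if the organization actually changes the schedule as soon as the ceiling is crossed.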

What’s the problem?

Most organizations don’t respond quickly enough to modify the monitoring period as needed to match the exponentially “moving target” failure rate in Zone III. Since non-linear extrapolation methods have to be used to predict equipment conditions in Zone III, and since small errors in numerically determining the exponential function produce large errors in the resulting prediction of when failure may occur, failure may still occur when it is not expected.

Consider, for example, how high-voltage oil-insulated current transformers (HVCTs) fail. Such units can degrade over time, primarily due to thermal, electrical, and environmental effects. Many have a useful service life of 30 to 40 years. When they do fail, however, a significant number explode and catch fire. Those explosions, in turn, often produce flying debris that can cause collateral damage to nearby equipment and personnel.

HVCT-condition-monitoring methods include partial discharge, gas in oil, and power factors. All of these techniques provide a good picture of the HVCT condition when it’s in Zone II. But, when the failure rate begins to rapidly change, as it does in Zone III, the usual sampling periods associated with these methods are often too long to detect and warn of impending failures that can occur in a matter of hours.

A common failure mechanism in an HVCT is dielectric breakdown of the oil. In Zone II, the dielectric properties of the oil change more or less regularly with service time and can be adequately monitored by gas-in-oil analyses. With continued service into Zone III, though, at some point the dielectric eventually degrades to the point where it breaks down electrically and suddenly allows arcing within the HVCT. This arcing heats and vaporizes oil and decomposes the oil vapor into gas. The resulting gas bubble creates internal pressure and sometimes causes the HVCT to rupture.

The exact point at which the dielectric fails while in service and allows internal electrical arcing to occur is difficult to predict. It depends upon various factors affecting the oil, such as moisture content, temperature, particle count, amount of metals in the particles, gases in the oil, amount of added inhibitors, and oil acid number. Different combinations of these factors change the point at which actual breakdown occurs. Prior to breakdown, changes in each of these factors generally occur over time in a relatively slow fashion. Once breakdown occurs, however, the equipment will fail relatively quickly.

Using the condition-monitoring methods applicable to Zone II to predict when failure will occur in Zone III is possible, of course, but it costs more, frequently calls for more monitoring sensors and control equipment, and will likely require more personnel. The question is, “How much is it worth to your plant in risk and resources to operate right up to the edge of failure?” Do you really want to replace a major transformer, motor, or engine in the middle of a production run or an operating cycle?

As a practical matter, replacement or refurbishment of significant production equipment should be done while the equipment is still operating in Zone II, with the time and place chosen by appropriate plant personnel. Many operations essentially schedule end-of-life replacement of critical equipment to coincide with an outage period just before the end of Zone II, or perhaps even two outages before Zone II ends. This type of scheduling provides flexibility for dealing with unforeseen circumstances.

If that’s how replacements or refurbishments are scheduled at your plant, you need to know when Zone II ends and Zone III begins. This can be determined by consulting various databases with statistical details on when equipment fails. EPRI (Electric Power Research Institute, epri.com) offers a fine database for most plant-equipment items, some of which are even subdivided by brand and model number. (Many power plants base their preventive and predictive maintenance programs on EPRI’s database.)

The current edition of the IEEE Standard 500 Reliability Manual is another good source, as are publicly available, taxpayer-funded military and government publications such as MIL-HDBK-338B, Electronic Reliability Design Handbook, and AD-A273 174, Naval Surface Warfare Center, Handbook of Reliability Prediction. These and similar databases are available on the Internet.

Finally, don’t overlook accountants’ amortization tables for major equipment purchases. They also can be helpful. One place to find them is in the U.S. Federal Government IRS Code, “Table of Class Lives and Recovery Periods.” (Banks don’t like to lend money on major pieces of equipment if the equipment is no longer serviceable.)

Keep in mind

When it comes to achieving reasonable reliability of plant equipment:

Don’t try to predict failure in Zone III of the bathtub curve by using condition-monitoring methods more appropriate for Zone II.

If at all possible, don’t operate critical equipment in Zone III where predictability is dicey.

Avoidance of Zone III can be accomplished by consulting various databases that provide a statistical basis for estimating useful service life. Choose a confidence level appropriate to the level of risk your organization is willing to accept with respect to service life, and schedule replacement or refurbishment of the equipment accordingly.
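One way to turn that last point into a number is sketched below. It assumes the chosen level is read as a target survival probability R, and that database-derived Weibull parameters are available for the equipment. Solving the Weibull reliability function R(t) = exp(−((t − γ)/η)^β) for t gives the service age at which that fraction of units is still expected to be running. The parameter values and the symbol η (scale parameter) are hypothetical and are not taken from any of the databases named above.

```python
import math


def weibull_life_at_reliability(reliability, beta, eta, gamma=0.0):
    """Service age at which the Weibull survival probability equals `reliability`.

    Inverts R(t) = exp(-((t - gamma) / eta) ** beta) for t.
    """
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must be strictly between 0 and 1")
    return gamma + eta * (-math.log(reliability)) ** (1.0 / beta)


# Hypothetical wear-out parameters: beta = 3, characteristic life of 40 years.
for target in (0.99, 0.95, 0.90):
    years = weibull_life_at_reliability(target, 3.0, 40.0)
    print("Replace by about %.1f years to keep survival at %.0f%%" % (years, target * 100))
```

A stricter target pulls the replacement date earlier, which is the trade between service life and risk that the plant has to decide for itself.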

Randall Noon is a registered professional engineer and author of several books and articles about failure analysis. He has conducted root-cause investigations for four decades, for nuclear and non-nuclear power facilities. Contact him at noon@carsoncomm.com.