The Shortest Distance Between Success And Failure (It’s Not A Straight Line)



EP Editorial Staff | March 21, 2012

No matter where you start, one of your main objectives will always be to get the most service life possible from your equipment.

By Randall Noon, P.E.

The basis for most condition-monitoring programs for critical plant equipment consists of two fundamental parts:

1. The periodic measurement of key performance parameters of the equipment.

2. A comparison of the equipment’s key performance parameters with an acceptance standard.

In essence, this basis contains an assumed correlation: If the measured key performance parameters are in line with accepted standards, then the equipment as a whole has a high probability of performing well.

Fig. 1. Simplified bathtub curve (equipment-failure rate vs. time)

From a statistical standpoint, this is a reasonable way to determine the current condition of plant equipment with a minimum of work or interruption of operations. If key performance parameters—usually things like temperature, vibration or lubrication chemistry—are also collected and maintained in a historical database, the progressive degradation of equipment performance due to wear and age effects can be plotted, trended and quantified.

Excluding statistical variations inherent in measuring performance parameters, unexpected deviations from either the acceptance standard or the expected degradation trend usually indicate developing problems that may portend failure of the equipment. Because this approach has been shown many times to be much more cost-effective than the old “run to failure and ensure a series of stomach-ulcer-generating crises” method of plant maintenance, most industries in the 21st century have adopted it.

At this point in the process, the condition-monitoring data that’s been collected over time can be used to turn a condition-monitoring program into a predictive-maintenance program. This is accomplished by leveraging the historical data to predict, by numerical extrapolation, when equipment failure can be expected. Based on this type of prediction, repair or replacement of the equipment can be conveniently scheduled prior to failure.

When condition monitoring and predictive maintenance are done well, the run-time of equipment is maximized, and unexpected plant downtime and labor overtime costs are minimized. This, in turn, boosts a company’s bottom line, improves the quality of work life for employees and, as Gershwin opined, “…the livin’ is easy.”

Unfortunately, the transition from condition monitoring to predictive maintenance is the point where most mistakes in a predictive maintenance system are made. The three most common mistakes are:

1) Linear extrapolation
2) A rigid data-sampling period
and
3) Incomplete inspection feedback

Mistake #1: Linear extrapolation
Fundamentally, extrapolation using linear functions is usually inaccurate for predicting failure.

If the performance parameter of a piece of equipment that is running well was “A” six months ago and is now “B,” connecting “A” to “B” with a straight line that is then extended into the future until it crosses the “unacceptable” performance level typically won’t provide a useful answer about when failure may occur. Generally, it overestimates when failure might occur—make that “grossly” overestimates.

During the period when a particular piece of equipment is operating well, it often appears that the observed degradation trend is linear. That’s because the key performance parameters being measured may not change much over a relatively long period, especially when the equipment is operating in the flat part of its “bathtub curve.” This is the portion of the curve where failures are primarily due to random causes, not wear or end-of-service-life effects. When equipment is operating in this type of zone, performance-parameter changes can reasonably be approximated by a line. Furthermore, during this particular operating period, the dominant degradation mechanism may indeed be a linear, or nearly linear, physical function.

Consider the simplified bathtub curve depicted in Fig. 1. The “time” axis has been greatly shortened due to space; for this reason it is often plotted on a logarithmic scale. Typically, the infant mortality zone will be a few days or weeks, the random failure zone will be years and the wear-out failure zone will be days or weeks. Note that the observed overall failure rate—the blue line—is composed of three independent factors that are summed:

1. Infant mortality failures (red dotted line)
2. Wear-out failures (orange dotted line)
and
3. Random failures (green solid line)

As the bathtub curve suggests, the time between the infant-mortality zone and the wear-out failure zone constitutes the major portion of an item’s service life. This is the area of operation where the likelihood of failure is lowest, and operating performance parameters don’t change much for a long time.
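The summation behind a bathtub curve can be sketched numerically. The following is a minimal illustration only: it assumes Weibull hazard functions for the infant-mortality and wear-out contributions and a constant random-failure rate, and every shape, scale and rate value in it is hypothetical rather than taken from Fig. 1.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), valid for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_rate(t_months, random_rate=0.002):
    """Overall failure rate as the sum of three independent contributions."""
    # Hypothetical parameters, chosen only to produce a bathtub-like shape.
    infant = weibull_hazard(t_months, shape=0.5, scale=200.0)   # falls with time
    wear_out = weibull_hazard(t_months, shape=5.0, scale=60.0)  # rises with time
    return infant + random_rate + wear_out


# Print the summed rate at a few service times (in months).
for t in (0.1, 1, 12, 36, 60, 72):
    print(f"{t:6.1f} months: {bathtub_rate(t):.4f} failures per month")
```

With these assumed values the summed rate is high at the start, drops to a low plateau and then climbs again late in life, which is the qualitative shape the curve describes.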

As Fig. 1 implies, when failure occurs, it’s usually a combination of several physical mechanisms interacting with one another. The net effect of this interaction is to accelerate the failure process. Consequently, as failure approaches, performance parameters often degrade exponentially.

Consider the following example that involved a newly installed, 2-pole, 3-phase motor. The motor was rated for 2500 hp at 4160 v.a.c. and 300 amperes, operated at 3570 rpm and rated for a temperature rise of 50 C degrees. It ran constantly when the plant was operating. The rated resistance from terminal to terminal in a phase was 0.0329 ohms.

Measured motor-testing data collected during each scheduled plant outage indicated that the motor had increased its phase-to-phase resistance imbalance from 0.0 to 0.70% in its first 18 months of service. The resistance imbalance then increased again from 0.70% to 1.98% in the next 18-month period (after a total of 36 months of service). Since measurements were made at the motor’s junction box, the resistance measurements also included the power supply cables.

The company had previously established an acceptance standard of 2.00% resistance imbalance. As long as this imbalance was less than 2.00%, the condition of the motor would be considered acceptable. (If resistance imbalance was 2.00% or higher, the company’s procedure would require checking for possible high-resistance connection problems.) Since the resistance imbalance was 1.98%, the measurement was considered to be within the acceptance criterion, and the motor was put back in service for another 18 months.

Fig. 2. Plot of the referenced motor resistance-imbalance measurements

The blue line in Fig. 2 plots the three resistance-imbalance measurements versus service time. The three points are connected with a straight line. The slope of the line in the first 18 months showed a 0.0388% increase per month. At this rate, the 2% acceptance criterion would be crossed after a service time of 51.4 months.

Since the next scheduled plant outage was at 36 months, the initial linear extrapolation indicated that the motor would reach that outage without crossing the acceptance criterion. Typically, for motors rated at 600 v.a.c. or more, a measured resistance imbalance of 3% or higher is well into the range for immediate failure. Imbalances that high cause large negative sequence currents that overheat insulation. Since the initial linear extrapolation predicted a resistance imbalance of 2.09% at 54 months (just a bit over the 2% acceptance criterion), the motor would also probably operate well enough to last that long.

During the next outage, resistance imbalance was measured to be 1.98%. The expected value provided by the initial linear extrapolation was 1.40% resistance imbalance. During the second period of 18 months, the linear rate of resistance-imbalance had significantly changed from an increase of 0.0388% per month to an increase of 0.0711% per month.
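The arithmetic behind these linear estimates can be reproduced in a few lines. This is only a sketch using the measurements quoted in the text; the function names are illustrative, and the results match the quoted figures to within rounding (the unrounded slope gives about 2.10% at 54 months, where the text's rounded slope of 0.0388% per month gives 2.09%).

```python
def linear_rate(t0, y0, t1, y1):
    """Slope of the straight line through two measurements (% imbalance per month)."""
    return (y1 - y0) / (t1 - t0)


def months_to_limit(rate, limit, t_ref=0.0, y_ref=0.0):
    """Service time at which a line through (t_ref, y_ref) with this slope reaches the limit."""
    return t_ref + (limit - y_ref) / rate


first_rate = linear_rate(0, 0.0, 18, 0.70)      # first 18 months
second_rate = linear_rate(18, 0.70, 36, 1.98)   # second 18 months
crossing = months_to_limit(first_rate, 2.00)    # when the first line reaches 2.00%
predicted_36 = first_rate * 36                  # linear prediction at the 36-month outage
predicted_54 = first_rate * 54                  # linear prediction at the 54-month outage

print(f"first-period rate:  {first_rate:.4f} % per month")
print(f"second-period rate: {second_rate:.4f} % per month")
print(f"2.00% crossed at:   {crossing:.1f} months")
print(f"predicted at 36 mo: {predicted_36:.2f} %")
print(f"predicted at 54 mo: {predicted_54:.2f} %")
```

The second-period rate is nearly double the first, which is the early sign that a single straight line no longer describes the trend.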

Before the third outage at 54 months was reached, the motor failed—after 42 months of service. The insulation covering one of the power supply cables inside the connection box on the motor broke down. Arcing occurred between one of the cables and the grounded connection box. The cable had been bent with too tight a radius inside the box. Figure 3 shows what the failure looked like when it was discovered.

Before repairs were made, the resistance imbalance at the junction box was measured and found to be 3.85%. All four resistance-imbalance measurements are plotted in Fig. 4. As can be inferred from Fig. 4, the rate of increase in resistance imbalance accelerated as the failure time approached.

Fig. 4. Plot of all four resistance-imbalance measurements of the referenced motor

The data plot in Fig. 4, even with straight lines from point to point, outlines an exponential function of the form y = Ae^(st) – B. This is a typical function associated with dielectric breakdown. If the first three measurements of resistance imbalance had been used to solve the exponential function, the following would have been concluded:

The above equation would have predicted 0.73% imbalance after the first 18 months of service and 1.98% imbalance after the first 36 months—which is an excellent fit to the actual measurements. A short extrapolation of the exponential curve indicates that just one week after the 1.98% measurement was made at 36 months, the motor would have crossed the 2.00% resistance imbalance threshold.

Similarly, using the exponential equation to solve for resistance imbalance just before the failure occurred at 42 months finds that the predicted resistance imbalance would have been 2.47%, which is a resistance-imbalance value midway between the caution set point of 2% and the alarm set point of 3% recommended by some motor-analysis handbooks. In simple terms, extrapolation of the exponential model accurately indicated that the motor would fail shortly after the last measurement was made after 36 months of service, and that the motor had no hope of making it to the next outage at 54 months.
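One way to solve an exponential of this form from the first three measurements is sketched below. It is not the fitted equation referred to in the text: it assumes the curve passes exactly through all three points, which forces A = B because the imbalance was 0.0% at time zero, and it relies on the two later measurements being equally spaced. Under those assumptions it returns about 2.6% at 42 months rather than 2.47%, and it likewise puts the 2.00% crossing roughly a week after the 36-month measurement.

```python
import math


def fit_exponential(t1, y1, t2, y2):
    """Fit y = A*(exp(s*t) - 1) through (0, 0), (t1, y1) and (t2, y2), with t2 == 2*t1.

    With x = exp(s*t1), y2/y1 = (x**2 - 1)/(x - 1) = x + 1, so x = y2/y1 - 1.
    """
    if not math.isclose(t2, 2 * t1):
        raise ValueError("this closed-form solution needs equally spaced measurements")
    x = y2 / y1 - 1.0
    if x <= 1.0:
        raise ValueError("measurements are not accelerating; no growing exponential fits")
    s = math.log(x) / t1
    a = y1 / (x - 1.0)
    return a, s


def predict(a, s, t):
    """Imbalance (%) predicted by the fitted curve at service time t (months)."""
    return a * (math.exp(s * t) - 1.0)


def time_to_reach(a, s, level):
    """Service time (months) at which the fitted curve reaches the given imbalance."""
    return math.log(level / a + 1.0) / s


a, s = fit_exponential(18, 0.70, 36, 1.98)
print(f"A = B = {a:.4f}, s = {s:.5f} per month")
print(f"predicted at 42 months: {predict(a, s, 42):.2f} %")
print(f"2.00% reached at: {time_to_reach(a, s, 2.00):.2f} months")
```

A least-squares fit with A and B left independent would give slightly different coefficients; the exact-fit version is used here only because it can be solved by hand.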

Mistake #2: Rigid data-sampling periods
Many condition-monitoring systems check performance parameters of equipment at regular, unvarying intervals. When the equipment is operating in the random-failure zone, as shown in Fig. 1, this is not necessarily a problem. However, as shown in Figs. 2 and 4, when the time to failure is approached, changes in performance parameters accelerate. Consequently, if the monitoring interval in which performance parameters are checked is too long, the relatively short period when degradation rapidly accelerates to failure may be missed. In short, the condition-monitoring system may provide no warning of an impending failure.

To prevent such a chain of events, it’s useful to have a flexible monitoring interval. Monitoring intervals during the random failure period may be longer than those closer to the expected wear-out or end-of-service-life period. For this reason, some companies begin to shorten monitoring intervals as a piece of equipment begins to approach its statistical end-of-service-life point. Translation: These companies are looking for the first indications of significant non-linear deviations from the previously established degradation rates. As can be seen in Fig. 4, knowing that the recent measurement of resistance imbalance indicated a significantly higher rate of increase, it would have been prudent to schedule another monitoring of resistance imbalance, perhaps three or four months after the one completed at 36 months. This shortened monitoring period would have indicated that failure was imminent since the 3.0% imbalance level would already have been passed.
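A flexible interval of this kind can be expressed as a simple rule. The sketch below is one hypothetical way to do it, not a procedure described in the text: the function name, the 1.5x rate-change trigger, the 90% margin on the acceptance limit and the shortened interval of one-fifth the base period are all assumed values chosen for illustration.

```python
def next_interval(history, base_interval, limit,
                  rate_ratio=1.5, margin=0.9, short_fraction=0.2):
    """Return the number of months to wait before the next measurement.

    history: list of (service_months, value) tuples, oldest first.
    All threshold defaults are hypothetical and would need plant-specific tuning.
    """
    if len(history) < 3:
        return base_interval  # not enough data to compare two rates
    (t0, y0), (t1, y1), (t2, y2) = history[-3:]
    previous_rate = (y1 - y0) / (t1 - t0)
    latest_rate = (y2 - y1) / (t2 - t1)
    # Trigger 1: the degradation rate has jumped relative to the prior period.
    accelerating = previous_rate > 0 and latest_rate > rate_ratio * previous_rate
    # Trigger 2: the latest value is close to the acceptance limit.
    near_limit = y2 >= margin * limit
    if accelerating or near_limit:
        return base_interval * short_fraction
    return base_interval


motor_history = [(0, 0.0), (18, 0.70), (36, 1.98)]
print(next_interval(motor_history, base_interval=18, limit=2.00))
```

Applied to the motor measurements above, both triggers fire, and the 18-month interval shrinks to between three and four months.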

Mistake #3: Incomplete inspection feedback
The primary point of a predictive-maintenance program is to reduce the cost of operating a plant by getting the most value out of equipment while minimizing downtime and overtime. To avoid failure, most companies establish regular periods for replacing or repairing critical equipment. Many, however, do not closely inspect the old part to determine whether the replacement or repair period is appropriate. Due to the press of work schedules, an unfortunately common practice is to simply replace the old part with a new one and toss the old one away with only a cursory check. This is a wasted opportunity to save money and further optimize your maintenance program.

If the part being replaced or repaired is in excellent condition, especially in critical areas, and performance parameters are steady and have ample margin before approaching the established acceptance standard limits, consideration should be given to increasing its time in service. On the other hand, if the part shows unexpected wear in critical areas, perhaps consideration should be given to shortening the service period. In either case, inspections save money. They save money by allowing longer use of a part, or by allowing users to replace a part sooner and avoid an unexpected failure.

Randall Noon is a root-cause team leader at Nebraska’s Cooper Nuclear Station. A licensed professional engineer in several states, he’s been investigating failures for more than 30 years. He’s also the author of several articles and texts, including: The Engineering Analysis of Fires and Explosions; Forensic Engineering Investigations and Scientific Method: Applications in Failure Investigation and Forensic Science. Email: rknoon@nppd.com.