The Benefits of Detailed Failed-Part Analysis
EP Editorial Staff | March 12, 2015
Analyze your failed parts as a doctor would conduct an autopsy, and you’ll learn much about your operational effectiveness.
By Cody Hostick, P.E., CMRP, Pacific Northwest National Laboratory
A robust failed-part analysis of field returns is a well-recognized approach for identifying part-hardening and design-improvement opportunities. Part and equipment autopsies are analogous to medical autopsies and seek the same objective: understanding the root cause of failure. In addition, a well-executed failed-part analysis generates benefits that support maintenance optimization, including understanding your current level of repair, characterizing your troubleshooting effectiveness, providing a top-down approach for improvement, and understanding operating-environment stressors.
Level of repair
Understanding the level-of-repair decisions made by the maintenance organization helps determine whether appropriate repair or discard decisions are being made, and whether these decisions are made consistently. This assists in level-of-repair optimization and provides the information needed to ensure your spare-part strategy, test equipment and training support the level of repair desired.
Results can vary considerably, especially when geographically dispersed maintenance organizations have their own established level-of-repair strategies. For example, in analyses of field returns of failed parts from multiple locations, some maintenance organizations replaced and discarded entire electronic assemblies, while others consistently identified and replaced individual boards within the assemblies (Fig. 1). For the same equipment, some maintenance organizations did not generate any field returns because they completed repairs at the component level on the board.
Analysis of repair-level variability across different organizations identified the need for standardization to reduce replacement spare-part costs and ensure that the level of repair can be supported by maintenance personnel. In this case, board-level replacement was found to be the best choice, and standardization is now pursued through training, troubleshooting guides and procedures that reflect board-level replacement.
Another example of repair-level disparity was found in the servicing of 2000 VA/1200W uninterruptible power supplies (UPS). Some maintenance organizations were replacing an entire unit at a cost of $650, instead of replacing batteries at a cost of $60. Follow-up investigation found that no official guidance on the level of repair had been provided to maintenance personnel.
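The cost gap in the UPS example can be worked through with a few lines of arithmetic. The sketch below uses the two prices quoted above; the function name, the number of unit swaps and the fraction of units that only needed batteries are hypothetical inputs for illustration, not data from the analysis.

```python
# Prices quoted in the article for the 2000 VA/1200W UPS.
UNIT_COST = 650.0     # USD, replace the entire unit
BATTERY_COST = 60.0   # USD, replace the batteries only


def avoidable_cost(unit_swaps: int, battery_fixable_fraction: float = 1.0) -> float:
    """Estimate spend avoided if battery-fixable units get new batteries instead.

    unit_swaps and battery_fixable_fraction are assumed inputs; the article
    does not report how many units were swapped or how many needed only batteries.
    """
    if unit_swaps < 0:
        raise ValueError("unit_swaps must be non-negative")
    if not 0.0 <= battery_fixable_fraction <= 1.0:
        raise ValueError("battery_fixable_fraction must be between 0 and 1")
    fixable_units = unit_swaps * battery_fixable_fraction
    return fixable_units * (UNIT_COST - BATTERY_COST)


# Each unit swap that a battery change would have fixed costs $590 more than needed.
per_repair_saving = avoidable_cost(1)
```

Even a rough estimate of this kind can help justify issuing formal level-of-repair guidance.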
Characterizing troubleshooting effectiveness
The analysis and retest of parts removed from service provides insight into a maintenance organization’s troubleshooting effectiveness in two ways: percent of field returns with No Fault Found and the mix of parts replaced for a single repair action.
About 30% of field returns are No Fault Found. Some of these returns are unavoidable, especially when equipment is susceptible to moisture-induced intermittent faults. However, many of the returned parts are not known to have intermittent faults and were likely removed from service incorrectly. The reasons vary. With cabling connectors, for example, the act of disconnecting and reconnecting during part replacement may have resolved a connector problem, so the removed part retests as No Fault Found. The systematic retesting of parts removed from service provides valuable feedback. Ideally, test equipment enables the maintenance team to perform their own rechecks of removed parts, which helps improve their troubleshooting skills.
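Tracking the No Fault Found percentage is straightforward once retest results are logged. The following is a minimal sketch, assuming each retested return is recorded with a part type and a flag for whether the retest confirmed a fault; the record layout and function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RetestRecord:
    """One retested field return (hypothetical record layout)."""
    part_type: str
    fault_confirmed: bool


def overall_nff_rate(records):
    """Fraction of all retested returns with No Fault Found."""
    records = list(records)
    if not records:
        return 0.0
    return sum(not r.fault_confirmed for r in records) / len(records)


def nff_rate_by_part(records):
    """No Fault Found fraction for each part type."""
    totals = Counter()
    nff = Counter()
    for r in records:
        totals[r.part_type] += 1
        if not r.fault_confirmed:
            nff[r.part_type] += 1
    return {part: nff[part] / totals[part] for part in totals}
```

Breaking the rate out by part type helps separate parts with known intermittent faults from parts that are simply being removed in error.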
Another area where field returns provide insight into the effectiveness of a maintenance organization is the mix of parts removed from service. One area of concern is when the parts removed during a single repair cannot be attributed to a single fault. This is a symptom of a shotgun approach to corrective maintenance. One such example is shown in Fig. 2. A large number of field returns consist of power supplies and battery-backup units, with the battery-backup units typically operational when retested. It appears that when a power supply fails, the maintenance team replaces both the power supply and the battery-backup unit as a set. Part of this particular problem may stem from power-supply failures causing the battery-backup units to display fault lights.
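A shotgun pattern of this kind can be screened for in return records. The sketch below assumes each return carries a repair-action identifier, a part type and the retest result; these fields and the function name are hypothetical. It flags repair actions where several parts were removed and at least one retested as operational.

```python
from collections import defaultdict


def flag_shotgun_repairs(returns):
    """Map repair-action ID to the parts that retested as operational.

    returns: iterable of (repair_id, part_type, fault_confirmed) tuples.
    Only repair actions with more than one removed part, at least one of
    which showed no fault on retest, are included.
    """
    by_repair = defaultdict(list)
    for repair_id, part_type, fault_confirmed in returns:
        by_repair[repair_id].append((part_type, fault_confirmed))

    flagged = {}
    for repair_id, parts in by_repair.items():
        operational = sorted(p for p, failed in parts if not failed)
        if len(parts) > 1 and operational:
            flagged[repair_id] = operational
    return flagged
```

Repair actions flagged this way are candidates for follow-up with the maintenance team, not proof of poor troubleshooting, because some multi-part replacements are legitimate.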
The battery-backup units that were confirmed as failed also illustrate the need for level-of-repair improvements. For example, Fig. 3 shows a discarded unit that was returned to service by replacing a plug-in 4-amp fuse.
A top-down approach for improvement
Most comprehensive failure-analysis campaigns can be pursued using either a bottom-up approach, such as a failure-modes-and-effects analysis, or a top-down approach, such as a fault-tree analysis. While a failure-modes-and-effects analysis is robust, it can be a challenge to undertake in terms of time and resources if a broad mix of equipment is involved.
Using the results of failure analysis to drive a top-down approach is less rigorous, but provides a faster path to address problems experienced in the field. This is certainly true if there is a high concentration of failures of a certain type. High-frequency failures are the most commonly investigated. Another advantage of driving improvement efforts with failure analysis is that field returns often provide insight into the unforeseen stressors of the operating environment, which can be challenging to adequately account for in a failure-modes-and-effects analysis.
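One common way to find a high concentration of failures is a Pareto ranking of the failure causes recorded during analysis. The sketch below is an illustration under assumptions: the function name, the cause labels and the 80% coverage threshold are hypothetical choices, not part of the analysis described here.

```python
from collections import Counter


def pareto(failure_causes, coverage=0.8):
    """Rank failure causes by count until the cumulative share reaches `coverage`.

    Returns a list of (cause, count, cumulative_share) tuples, most frequent first.
    """
    counts = Counter(failure_causes)
    total = sum(counts.values())
    ranked = []
    cumulative = 0
    for cause, n in counts.most_common():
        cumulative += n
        ranked.append((cause, n, cumulative / total))
        if cumulative / total >= coverage:
            break
    return ranked
```

The causes at the top of the ranking are natural starting points for a top-down investigation.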
Field returns can also be used to help prioritize improvements to maintenance procedures and troubleshooting guides. For example, persistent troubleshooting mistakes and inappropriate levels of repair both point to the need to strengthen maintenance documentation and associated training.
Understanding operating-environment stressors
Unexpected failure rates of parts and equipment are often the result of unanticipated stressors encountered in the operating environment. Failed-parts analysis can provide insight into the nature of these stressors. This insight can result in hardening equipment in the field as well as improved design and/or equipment selection and testing for new installations. This applies not only to hardware and electronics, but also to software. Environmental stressors that can result in unforeseen software problems include events that lead to a lack of graceful shutdowns of computer applications, resulting in restart difficulties. Other issues are unforeseen failure modes of ancillary equipment that can result in saturation of databases due to excessive error messages or data-stream disruptions that create processing errors.
Environmental stressors can also include unforeseen corrosive environments (Fig. 4), excessive dust and grime leading to corrosion of your electronics (Fig. 5), and a broad array of problems related to electrical voltage or current overstress (Fig. 6). Although solutions are all issue-specific (e.g., alternative material selection to address corrosive environments, air-filtration to address dirt and grime, and voltage surge protection and board redesign to address electrical overstress), failure analysis accelerates the corrective-action process.
Failed-part analysis does more than identify failure mechanisms of field returns. A well-executed, routine, failed-part analysis can guide repair-level standardization, refinements in spares strategy to support standardized level of repair, and improvements in maintenance documentation and training to address weaknesses in maintenance performance. Finally, by providing insight on environmental stressors associated with operating environments, equipment hardening can be pursued for installed equipment, and more representative environmental conditions can be factored into design/build or procure/test processes for new installations.
Cody Hostick is an engineer with the National Security Directorate at the Pacific Northwest National Laboratory (PNNL) in Richland, WA. For more details, contact email@example.com or visit pnnl.gov.
About the Pacific Northwest National Laboratory
The mission of the Pacific Northwest National Laboratory (PNNL) is to “transform the world through courageous discovery and innovation,” according to its Website (pnnl.gov). The Department of Energy (DOE)-managed organization does so by providing the facilities and equipment that allow scientists and engineers to conduct research that will strengthen U.S. scientific foundations; increase U.S. energy capacity; reduce the effects of energy-generation and use on the environment; and prevent and counter acts of terrorism. Rigorous failed-part analysis is a routine component of its ongoing efforts.
Recent PNNL innovations include a high energy-density zinc-polyiodide flow battery designed for storing renewable energy in densely populated cities; an injectable tracking device for fish; and a mobile-app guide to biodetection technology for first responders.
Based in Richland, WA, PNNL is one of 10 DOE National Laboratories managed by DOE’s Office of Science. In addition to solutions for the DOE, PNNL contributes to solutions for the U.S. Department of Homeland Security, the National Nuclear Security Administration, other government agencies, universities and industry.