Comprehensive Failure Investigation

Kathy | June 1, 2007

Looking for a basic, step-by-step approach to improved problem solving around your plant? Start here.

Failure analysis, incident investigation and root cause analysis are among the terms used by organizations to refer to their various problem-solving approaches. Regardless of the name, these types of investigations typically boil down to three basic questions:

What’s the problem?
Why did it happen?
What should be done to prevent it from recurring?

These questions, or steps, are the framework for information collection that is then organized with the help of tools such as timelines, diagrams/photos and process maps. Together, these steps and tools lead to comprehensive failure investigation, which can be defined as the collection and organization of all necessary information to answer the three questions thoroughly and completely, supported by clear, concise documentation of the incident.

Two important points apply to every aspect of an investigation: focusing on principles and being specific.

Principles…
Principles are constants. They do not change from problem to problem. The cause-and-effect principle is fundamental to all investigations. This principle applies to equipment failures, supply chain problems, production outages, customer service issues and people problems the same way. By focusing on the principle of cause-and-effect, an organization can develop a consistent approach to investigating and solving all problems. If you confront a problem that appears to contradict basic physics, check your assumptions, because some are not accurate.

Remember: There are no equipment failures or problems in a facility that defy the laws of physics and chemistry.

There always will be an explanation or truth to what has already happened. Think of this in terms of the terrain in your hometown. The map of the town represents the terrain. The creation of the town map should be an objective exercise because the roads already fit together in a particular way. The map should match the actual terrain, just as the investigation should match the incident that occurred.

Many people think of cause-and-effect as a linear relationship, where an effect has a cause. In fact, cause-and-effect is an example of a system. A system has parts just like an effect has causes. The equipment downtime came about because a part failed. We find that the part failed because of fatigue. The next question is: “Why did it fatigue?”… and the why questions just keep coming. Most organizations mistakenly believe that an investigation is about finding the one cause, or “root cause.”

Remember: An effect doesn’t have a single cause—it has “causes,” which reveal different ways to solve the problem.

Be specific…
The word “analysis” means to “break down into parts.” Failure analysis, problem analysis and root cause analysis all start with a problem, then break it down into parts—which are the causes. Identifying the causes reveals additional ways that the problem may be solved. As the causes become more specific (detailed), the solutions also become more specific.

Remember: Problems are solved when specific action is taken. Problems are not solved in general—the devil is in the details.

One common mistake that many organizations make is trying to group an entire investigation into one category. This makes the incident more general, not more specific. The five most common generalizations are: human error, procedure not followed, equipment failure, inadequate training and design. Many groups believe that the end of an investigation has been reached if they can get to one of these five categories.

Remember: Don’t generalize an investigation—ask more “why” questions and be specific.

Conducting the investigation

Step #1: What’s the problem? (the definition)
Everyone seems to know that defining the problem is the first step in an investigation. How this is done varies widely. Some groups write a lengthy problem statement and then debate the wording for 30 minutes or more. A facilitator should remember that people see problems differently. When someone states his/her view of the problem, be prepared for the fact that someone else is going to disagree and offer a different problem. The word “problem” itself is problematic, in that people use it for whatever they see as the “bad thing.” To accurately define a failure, there are four simple questions to answer:

What’s the problem?
When did it happen?
Where did it happen?
How were the overall goals impacted?

Instead of writing a long problem description, simply answer these four questions in an outline format—and don’t write responses as complete sentences, just short phrases.

The question, “How were the overall goals impacted?” captures the overall magnitude of any issue. While the first question (“What’s the problem?”) reflects individual views of the problem, the company is going to view the problem as any deviation from the ideal state. For a manufacturing company, the overall goals (or ideal states) typically include: no safety injuries, no environmental issues, no customer service issues, no production problems and no excess materials or labor spending.

The overall goals that were negatively impacted by a failure incident provide the starting point for the “why” questions. Step #2 does not start with what people see as “the problem,” rather, it begins with the impact to the overall goals (the 4th question). People see problems differently, but defining every failure by how it negatively impacts the goals provides a consistent starting point. Start with the impact to the overall goals to define your next problem.
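The four-question outline can be sketched as a small record. This is only an illustration: the class name, field names and the incident details (pump P-101, Line 3, the downtime figure) are hypothetical and not taken from any real investigation.

```python
from dataclasses import dataclass


@dataclass
class ProblemDefinition:
    """Answers to the four Step #1 questions, kept as short phrases."""
    what: list[str]               # people see problems differently, so allow several views
    when: str
    where: str
    goal_impacts: dict[str, str]  # overall goal -> how it was negatively impacted

    def outline(self) -> str:
        """Render the definition as a short outline rather than a paragraph."""
        lines = [
            "What: " + "; ".join(self.what),
            "When: " + self.when,
            "Where: " + self.where,
            "Goals impacted:",
        ]
        lines += [f"  {goal}: {impact}" for goal, impact in self.goal_impacts.items()]
        return "\n".join(lines)

    def starting_points(self) -> list[str]:
        """The impacted goals are where the 'why' questions of Step #2 begin."""
        return list(self.goal_impacts)


# Hypothetical incident, for illustration only.
definition = ProblemDefinition(
    what=["pump seal leak", "line shut down"],
    when="June 1, 9:05 AM",
    where="Line 3, pump P-101",
    goal_impacts={
        "Production": "4 hours downtime",
        "Materials/labor": "seal replacement, overtime",
    },
)
print(definition.outline())
```

Note that the "what" field accepts more than one answer. The group does not need to agree on a single statement of the problem, because the analysis starts from the impacted goals instead.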

Step #2: Why did it happen? (the analysis)
It’s important to remember in this step that every effect has “causes” (plural). While people may try to identify the single cause of an issue (commonly referred to as the “root cause”), the fact is there is not just one cause of an incident—there are causes.

As the fire triangle in Fig. 2 shows, there is no single cause for a fire; there are causes: heat, fuel and oxygen. Controlling any one of these causes will reduce the risk of the fire. Most people mistakenly believe oxygen is a “contributing factor” to a fire, meaning on its own it can’t produce a fire. In reality, there is no difference between a contributing factor and a cause. A cause, by definition, is required to produce an effect. Oxygen is required for fire; therefore, it is a cause of fire. On its own, oxygen will not produce a fire. Neither will heat nor fuel. Fire requires all three causes: heat, fuel and oxygen. Every effect requires all of its causes.
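The fire-triangle logic can be written down in a few lines. The sketch below is a deliberately simplified model (the function names are made up for illustration): an effect is treated as occurring only when every required cause is present, so removing any single cause is a possible way to prevent it.

```python
FIRE_CAUSES = {"heat", "fuel", "oxygen"}


def effect_occurs(required_causes: set[str], present: set[str]) -> bool:
    """An effect occurs only when every one of its required causes is present."""
    return required_causes <= present


def possible_controls(required_causes: set[str]) -> list[str]:
    """List the causes whose removal, one at a time, prevents the effect."""
    return sorted(
        cause
        for cause in required_causes
        if not effect_occurs(required_causes, required_causes - {cause})
    )


print(effect_occurs(FIRE_CAUSES, {"heat", "fuel", "oxygen"}))  # all three present
print(effect_occurs(FIRE_CAUSES, {"oxygen"}))                  # oxygen on its own
print(possible_controls(FIRE_CAUSES))
```

Every cause shows up in the list of possible controls, which is the practical reason not to separate “contributing factors” from causes: each one is a different way to solve the problem.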

The most effective way to communicate all causes of an incident is through a visual format, similar to that in Fig. 2. The cause-and-effect analysis should start with a discussion of the goals that were impacted, followed by the asking of “why” questions moving to the right. The simple convention is effect on left, cause on right. “Why” questions take us backwards through the failure. Visually breaking down the cause-and-effect relationships is the simplest way to document an incident during the investigation.

The focus of Step #2 is on generating an accurate cause-and-effect analysis with a sufficient level of detail. During this step, detail is added to the timeline, diagrams and photographs are utilized, and specific steps of the processes are identified to ensure that the analysis is accurate. The facilitator is typically moving back and forth between the different tools and the cause-and-effect analysis as information becomes available. A complete analysis identifies the causes and validates them with evidence.
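A cause-and-effect analysis with evidence attached can be sketched as a simple tree. This is one possible representation, not a prescribed format: the class, the function names and the evidence strings are hypothetical, and the downtime example echoes the part-fatigue chain described earlier. Indentation stands in for moving to the right on the page.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    """One box in the analysis: an effect plus the answers to 'why?'."""
    description: str
    evidence: list[str] = field(default_factory=list)
    causes: list["Cause"] = field(default_factory=list)


def render(node: Cause, depth: int = 0) -> list[str]:
    """Effect on the left, causes indented to the right."""
    lines = ["    " * depth + node.description]
    for cause in node.causes:
        lines.extend(render(cause, depth + 1))
    return lines


def unvalidated(node: Cause) -> list[str]:
    """Causes that still have no supporting evidence."""
    found = [] if node.evidence else [node.description]
    for cause in node.causes:
        found.extend(unvalidated(cause))
    return found


# Hypothetical example; the evidence entries are placeholders.
fatigue = Cause("Part fatigued")  # why? still open, no evidence yet
part_failed = Cause("Part failed", ["photo of fractured part"], [fatigue])
downtime = Cause("Production goal impacted: equipment downtime",
                 ["production log"], [part_failed])

print("\n".join(render(downtime)))
print("Needs evidence:", unvalidated(downtime))
```

A cause with no evidence is not wrong; it is simply not yet validated, and the list of such causes tells the facilitator which tool (timeline, photo, process review) to go back to next.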

Step #3: What should be done? (the solutions)
The solutions step is where specific actions are defined to prevent the issue from occurring. This step begins once Step #2, the analysis step, is complete. The solutions step breaks into two parts:

possible solutions are identified first;
then they are pared down to the best solutions.

The analysis step is objective and based on evidence, while the solutions step is subjective and creative.

Possible solutions are the different ideas that people think up by examining each of the causes. Ideas come from those who are involved with the problem. Managers, engineers and supervisors will have some ideas, as will designers, manufacturers and vendors. People who operate and maintain the system or equipment on a daily basis also will have ideas. To get their ideas, ask—most importantly, ask those who are closest to the work. It is crucial for people who are involved in the problem to be part of the problem-solving process. There is a significant amount of knowledge and brainpower within organizations that is underutilized because it is not asked for regularly.

The best solutions are selected based on how effective they are and the level of effort required for their implementation. The effectiveness of a solution is a function of how much it reduces the impact on the overall goals, while the level of effort is a function of the resources, cost and time to implement the solution. Possible solutions can be ranked based on effectiveness and effort so that the best ones are revealed. These best solutions become the action plan with specific owners and due dates.
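One way to rank possible solutions is sketched below. The 1-to-10 scales and the effectiveness-to-effort ratio are assumptions chosen for illustration; the article does not prescribe a scoring formula, and a group may prefer a simple high/low grid. The solutions listed are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Solution:
    action: str
    cause: str          # the specific cause this action addresses
    effectiveness: int  # 1 (low) to 10 (high), assumed scale
    effort: int         # 1 (low) to 10 (high), assumed scale


def rank(solutions: list[Solution]) -> list[Solution]:
    """Highest effectiveness per unit of effort first; lower effort breaks ties."""
    return sorted(solutions, key=lambda s: (-(s.effectiveness / s.effort), s.effort))


# Hypothetical possible solutions, one per cause.
possible = [
    Solution("Redesign bracket", "stress concentration", 9, 9),
    Solution("Add inspection step to PM route", "crack not detected", 6, 2),
    Solution("Add torque check to install checklist", "bolts under-torqued", 7, 3),
]

for solution in rank(possible):
    print(f"{solution.action} (addresses: {solution.cause})")
```

The ranking is subjective input turned into an ordered list; the top entries still need an owner and a due date before they count as an action plan.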

Organizing the investigation
Defining the failure and its impact on the overall goals in Step #1 is based on answering a very specific set of four questions, something that typically takes less than five minutes. In the analysis step (Step #2), when the cause-and-effect relationships are being identified, information is being discussed using timelines, diagrams and processes. People may offer some causes, explain the sequence of events, then review a process step, draw a picture and then go back to discussing causes. Regardless of what people offer, it should be captured with the appropriate tool.

Some information will appear in both the timeline and the cause-and-effect analysis. A diagram may contain a drawing of the part; the timeline may contain some history about the part and when it failed; the cause-and-effect analysis will contain the causes of why the part failed.

The facilitator’s role is to keep the group focused on those three basic questions common to every investigation (“What’s the problem?”, “Why did it happen?” and “What can be done to prevent it from recurring?”) and to appropriately organize all information. The following notes highlight the tools needed for organization of the collected information:

Capture the timeline…
A timeline, also known as a sequence of events, defines the chronological order of occurrences for a given issue. The simplest way to create a timeline is in a table format with date, time and description headers. Each entry, which should be a short phrase, not a complete sentence, corresponds to a specific date and time.

The timeline shows what happened at a specific date and time, but it does not explain why it happened. A timeline is dependent on time. A cause-and-effect analysis is dependent on causes (the “why” questions). The timeline entry may be “9:05 AM, Valve opened,” but the causes of why the valve opened are located in the cause-and-effect analysis.

A timeline should always be constructed for larger issues. Background information also can be added to the timeline instead of being written in a separate paragraph. The time scale on a timeline can be based on years, days, hours, minutes or seconds, and it can change throughout the timeline, as long as each entry is placed in the proper chronological order.
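A timeline table along these lines can be sketched in a few lines of code. The function name, the dates and all entries other than “Valve opened” at 9:05 AM are hypothetical. Entries may be added in any order and at any time scale, including background from years earlier; sorting puts them in chronological order.

```python
from datetime import datetime


def build_timeline(entries: list[tuple[datetime, str]]) -> list[str]:
    """Return table rows (date, time, description) in chronological order."""
    rows = sorted(entries, key=lambda entry: entry[0])
    lines = [f"{'Date':<12}{'Time':<8}Description"]
    for when, description in rows:
        date_text = when.strftime("%Y-%m-%d")
        time_text = when.strftime("%H:%M")
        lines.append(f"{date_text:<12}{time_text:<8}{description}")
    return lines


# Hypothetical entries, captured in the order people mentioned them.
entries = [
    (datetime(2007, 6, 1, 9, 7), "High-level alarm"),
    (datetime(2005, 3, 15, 0, 0), "Valve installed"),  # background information
    (datetime(2007, 6, 1, 9, 5), "Valve opened"),
]

print("\n".join(build_timeline(entries)))
```

The table records that the valve opened and when, in short phrases. Why it opened is deliberately absent, because that belongs in the cause-and-effect analysis.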

Timelines are very helpful tools in investigations. They complement thorough cause-and-effect analyses, but they don’t replace them. Many organizations mistakenly consider a timeline the analysis of the failure. Make sure that your organization doesn’t.

Remember: Simply identifying the sequence of events does not explain the cause-and-effect relationships.

Use diagrams, drawings and photos…
Visual tools, such as diagrams, drawings, sketches and photographs, give people a common view of the issue. Without these, everyone has his/her own mental picture of the failure. A simple sketch on paper or a dry-erase board immediately provides a group with a picture that everyone can edit, improve, point to and comment on.

Don’t overlook the importance of a simple sketch. Mechanical drawings and diagrams from manuals or the original equipment manufacturers also can be used during the investigation to improve the accuracy of the analysis.

Photographs are especially helpful because they create such a simple and accurate record. Digital cameras allow people to take plenty of pictures so that the most relevant can be selected later. Digital photos easily can become part of the investigation record.

Remember: The more detail that’s included in a diagram, drawing, sketch or photo, the more specific the discussion can be.

Review the processes…
Identifying the processes that were in place before the failure occurred is extremely important in order to prevent the incident from happening again. Recurring problems are symptomatic of not managing by process.

A thorough investigation includes a review of the processes that produced the failure. (It’s much like a mechanic who must know how the transmission works in order to explain why the transmission failed.) First, during the investigation, a clear understanding of the current work process helps explain what specifically was being done leading up to the failure. Second, the process needs to be well understood so that specific improvements can be made within the process to ensure that the failure doesn’t happen again.

Remember: The best solutions are all actions that will be implemented within the work processes.

A complete investigation
The ultimate output of an investigation is the implementation of the action items to prevent failures from occurring. The purpose of a comprehensive investigation is a thorough and accurate understanding of the incident so that the most effective solutions can be identified. An investigation has a very specific purpose. Everyone participating in the investigation should be focused on positively impacting the overall goals of the organization.

The steps and tools covered in this article are all parts of a complete investigation. Each step and each tool has a specific way of capturing and presenting information. They are intended to simplify and organize all of the different pieces of information that become part of an investigation.

Documenting the investigation as it is being conducted plays a significant role in how well people understand what happened and why. How clearly the incident is documented can affect how well the investigation goes. The rate at which information is collected also can affect how well the investigation goes. The point of all this is that effective investigations cause organizations to become more effective. Likewise, ineffective investigations are a competitive disadvantage.

Experiment with and practice any one or all of the steps and tools in this article. Begin, now, to improve the way your group analyzes, documents, communicates and solves problems.

Mark Galley’s practical investigation experience spans many different types of industries and issues. He has been leading investigations and Cause Mapping workshops for ThinkReliability since 2000. Prior to that, he worked for almost nine years as a reliability engineer with the Dow Chemical Company. Galley holds a B.S. in Mechanical Engineering from the University of Colorado, and is a Certified Reliability Engineer through the American Society for Quality. E-mail: mark.galley@thinkreliability.com