Reliability Engineers Are STO Essentials

EP Editorial Staff | May 27, 2019

Risk management, a reliability engineer’s specialty, provides the analytical foundation for a well-planned STO event.

Don’t overlook the value reliability engineering can deliver to your shutdown/turnaround/outage events.

By Drew D. Troyer, CRE, CMRP, Contributing Editor

The plant shutdown/turnaround/outage (STO) is an occasional maintenance event intended to restore the facility to “like new” performance. It’s also the time to execute major plant upgrades and expansive capital projects that add new production capabilities. When the STO goes well, the objectives are met at an economically justified cost to the business. When things go wrong, however, significant cost and time overruns are incurred.

Fig. 1. Cost-control results are starkly different when comparing the best- and worst-prepared STO events.

Research on STOs suggests that the best-prepared organizations come in at or under budget, while those who are unprepared see cost and time overruns of 60%. As an example, for a major STO with a projected cost of $100 million, the difference between the best- and worst-prepared plants could be as much as $81 million (see Fig. 1). Here’s where the money goes:

• excessive labor due to poor wrench-time performance
• expediting premiums for parts and tool hire
• additional manpower and overtime premiums to avoid schedule overruns
• poor anticipation of emergent and contingency work.

STO-cost overrun, though, is just the tip of the iceberg. A poorly executed event leads to schedule overruns and a bevy of startup and work-quality-related failures that can plague the site for months—or even years—after the STO. In some cases, the STO is so poorly managed and executed that it can’t reliably deliver in the plant’s expected operating window. This could then require the organization to schedule another one of these disruptive events sooner than desired.

The difference between the best and worst performance on STO events comes down to preparedness, including governance, coordination, scope freeze, control, planning, and scheduling. Frequently, dedicated teams of STO managers, planners, schedulers, and coordinators oversee such events. However, other personnel should also be involved, specifically members of a plant’s reliability-engineering team. That’s because risk management, which is a reliability engineer’s specialty, provides the analytical foundation to ensure that an STO is optimized and well planned.

A plant’s reliability engineer(s) can play a pivotal role in de-risking an STO, assuring that the event achieves its objective, and helping avoid a scenario where, to put it bluntly, “the wheels come off.” They possess detailed knowledge about the facility’s design and configuration, and the context and history of maintenance and operations. They also possess special skills in risk management, including FMEA and RCA (failure mode and effects analysis and root-cause analysis, respectively), as well as the physics of failure, reliability analytics/metrics, and other skills that enable these engineers to intervene before, during, and after the STO.

Fig. 2. A reliability engineer can reduce risk at every step of the STO process, but his or her intervention has the greatest effect early on.

Let’s examine the key elements of the reliability engineer’s role at each stage of the STO process. The goal is to ensure that the event delivers operational reliability; comes in safe, on time, on budget; and results in a smooth startup with minimal work-quality issues. (See Fig. 2 above.)


A reliability engineer should be involved in defining the mission, strategy, and premise for the STO event because he or she has a well-analyzed understanding of the operational risks and their relative priorities. This helps to avoid the scenario in which the squeaky wheel gets the grease because someone complains the loudest and most often.

Once the initial premise is established, it’s time for the reliability engineer to conduct a pre-STO-process FMEA (PFMEA). An FMEA employs inductive reasoning to anticipate what could go wrong, estimate the consequences and the likelihood of occurrence, and evaluate the effectiveness of existing controls. Each of the three categories (consequence, likelihood, and control effectiveness) is assigned a score of 1 to 10, with 10 representing the highest risk.

The product of multiplying these scores provides a risk-priority number ranging from 1 to 1,000. The FMEA method was initially developed to evaluate risks with products such as airplanes or cars, to prioritize them for improvement, and to measure risk-mitigation progress. It can easily be applied to processes such as managing an STO event. Sometimes, the PFMEA is called a pre-mortem root-cause analysis (RCA).
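As a minimal sketch of the scoring arithmetic described above, the following Python snippet multiplies three 1-to-10 scores into a risk-priority number and ranks the results. The class name, field names, and sample entries are hypothetical and chosen only for illustration; the field names follow the common FMEA labels (severity, occurrence, detection) rather than anything specified in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One line of a process FMEA; each score runs from 1 (best) to 10 (worst)."""
    description: str
    severity: int    # consequence if the failure occurs
    occurrence: int  # likelihood that the failure occurs
    detection: int   # weakness of existing controls (10 = no effective control)

    def __post_init__(self) -> None:
        # Reject scores outside the 1-to-10 scale.
        for name in ("severity", "occurrence", "detection"):
            score = getattr(self, name)
            if not 1 <= score <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {score}")

    @property
    def rpn(self) -> int:
        """Risk-priority number: product of the three scores (1 to 1,000)."""
        return self.severity * self.occurrence * self.detection


# Hypothetical entries, for illustration only.
failure_modes = [
    FailureMode("Long-lead parts arrive after shutdown start", 8, 5, 4),
    FailureMode("Scaffolding crew unavailable on day one", 6, 3, 2),
    FailureMode("Discovery work exceeds contingency allowance", 9, 6, 7),
]

# Highest RPN first, so mitigation effort goes to the largest risks.
ranked = sorted(failure_modes, key=lambda fm: fm.rpn, reverse=True)
for fm in ranked:
    print(f"{fm.rpn:>4}  {fm.description}")
```

Re-scoring the same list after mitigation actions are in place gives a simple way to measure risk-mitigation progress between planning reviews.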

Evaluating lessons learned from a site’s most recently completed STO is a good place to start. This essentially involves conducting a formal review of a log with analytical notes of everything that went wrong during that event (see “Manage STO Risk”). Because STOs are conducted only periodically, it’s difficult to ingrain a corporate memory about how to manage them. It’s not uncommon to see situations where half the staff that worked on the previous STO have retired or left the company and are, therefore, not available to support the current event. Consequently, a site must depend on records and detailed notes from that event in the planning of the upcoming STO.

Reliability engineers who are trained and, ideally, certified in the use of FMEA, RCA, and related tools are ideal candidates for performing the pre-STO PFMEA. This type of pre-event risk analysis is not trivial. Allocating sufficient resources to do the PFMEA correctly will result in fewer surprises during the upcoming STO event.


Excessive scope and poor scope control are STO-performance killers. In most plants, reliability engineers use data from various inspections and monitoring to evaluate the condition of the equipment and decide which work should be included and which should be left out. The schedule for many shutdowns, though, is driven by time-based jobs. 

In some cases, time-based jobs must be performed to retain a plant’s license to operate. In other cases, those jobs are reliability driven, wherein the organization has decided to replace or rebuild equipment based upon calendar time and/or operating time. Unfortunately, the interval for many time-based jobs is guesswork. In the worst instances, they’re the byproduct of a knee-jerk reaction to a failure that bit the organization years or even decades ago. In some cases, the organization doesn’t even know why a particular job is performed at a particular interval; it’s just the way they’ve always done it. Reliability engineers are trained to evaluate the validity of these jobs and to optimize the interval, convert the job to a condition-based one, or address the root cause and eliminate or reduce the risk. A reliability-engineering challenge can go a long way toward reducing the scope of an STO event.

Keep in mind that a major source of scope creep is break-in work, that is, work requested after the scope lock-down date. To be included in the scope, break-in work must pass a business challenge that confirms the job belongs in the event. In short, we must determine whether the job is important enough, whether it must be done during an STO rather than as run-time maintenance, and whether it must be included in the current STO event. The reliability engineer must be involved in this process.
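The three-question challenge for break-in work can be sketched as a simple gate. This is only an illustration: the `BreakInRequest` structure, its field names, and the sample request are hypothetical, and a real challenge would rest on the reliability engineer’s risk assessment rather than on three yes/no flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakInRequest:
    """A job requested after the scope lock-down date (hypothetical structure)."""
    title: str
    important_enough: bool          # does the risk justify doing the job at all?
    requires_shutdown: bool         # can it only be done during an STO, not at run time?
    cannot_wait_for_next_sto: bool  # must it be done in the current event?


def passes_business_challenge(request: BreakInRequest) -> bool:
    """A break-in job enters the scope only if all three answers are yes."""
    return (
        request.important_enough
        and request.requires_shutdown
        and request.cannot_wait_for_next_sto
    )


# Hypothetical request: important, but it can be handled as run-time maintenance.
request = BreakInRequest(
    title="Replace spare-pump coupling guard",
    important_enough=True,
    requires_shutdown=False,
    cannot_wait_for_next_sto=False,
)
print(passes_business_challenge(request))
```

Requiring every answer to be yes keeps the default outcome at "excluded," which is the point of freezing the scope in the first place.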

Another variation of the FMEA can be used to rank jobs in terms of priority. In this case, however, we add the elements of time and cost to ensure the jobs included in the final scope mitigate the greatest amount of operational risk, relative to the allocated time and money.
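One way to picture this ranking is a greedy selection that favors jobs with the most risk reduction per unit of cost, while keeping the total within the available budget and labor hours. The sketch below is an assumed heuristic, not the method described in this article; the `CandidateJob` fields, the `select_scope` function, and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateJob:
    name: str
    rpn_before: int  # risk-priority number if the job is left out
    rpn_after: int   # expected risk-priority number once the job is done
    cost: float      # estimated cost (must be positive), any consistent unit
    hours: float     # estimated labor hours

    @property
    def risk_reduction(self) -> int:
        return self.rpn_before - self.rpn_after


def select_scope(jobs, budget, hours_available):
    """Greedy heuristic: take jobs in order of risk reduction per unit cost,
    skipping any job that would exceed the remaining budget or hours."""
    ranked = sorted(jobs, key=lambda j: j.risk_reduction / j.cost, reverse=True)
    selected, spent, used = [], 0.0, 0.0
    for job in ranked:
        if spent + job.cost <= budget and used + job.hours <= hours_available:
            selected.append(job)
            spent += job.cost
            used += job.hours
    return selected


# Hypothetical candidate jobs, for illustration only.
jobs = [
    CandidateJob("Replace exchanger bundle", 400, 100, 50_000, 300),
    CandidateJob("Rebuild feed pump", 240, 60, 20_000, 120),
    CandidateJob("Repaint pipe rack", 40, 30, 15_000, 200),
]

scope = select_scope(jobs, budget=75_000, hours_available=500)
for job in scope:
    print(f"{job.name}: risk reduction {job.risk_reduction}")
```

A greedy pass like this is easy to explain in a scope-review meeting, though it does not guarantee the best possible combination of jobs; license-to-operate work would also need to be treated as mandatory rather than ranked.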


During the planning and preparing stage, reliability engineers can provide a wide range of services, including, but not limited to:

• supporting sustaining and expansive capital-projects teams to ensure that design for reliability, maintainability, and operability elements are prioritized, incorporated in the design, and will deliver the desired results

• working with planners to incorporate enhancements that will improve, among other things, reliability, maintainability, operability, and inspectability, where possible

• performing data analysis to improve projections about discovery work personnel might expect to find (what, where, why, how much), so as to reduce the number of surprises and better prepare for such work

• conducting condition monitoring of parts and materials in the storeroom to assure that they haven’t “shelf-degraded” to a point where they’re unsuitable for use.


Once the STO is over, the reliability engineer must perform a formal post-event RCA to identify lessons learned. This closes the loop from the previously discussed pre-STO PFMEA. Because the STO is not a routine activity, a post-STO RCA serves as the corporate memory. Again, evaluating the lessons learned from each completed STO should be the first step when planning the next one.

There are too many instances in which reliability is on the sidelines when it comes to managing an STO. Reliability engineers have skills and training that enable them to analyze and mitigate risks that threaten the success of these labor-intensive, disruptive events. Plan to get this part of your team off the sidelines and into the game to ensure that the STO comes in safe, on budget, on schedule, and delivers the reliability required during the next operating cycle.

Contributing editor Drew Troyer is a senior manager with T.A. Cook Consultants Inc., The Woodlands, TX.


