Root Cause Analysis Made Simple


“Rules of Thumb for Maintenance and Reliability Engineers”

Chapter Title:

Root Cause Analysis (RCA), also called Root Cause Failure Analysis (RCFA), is defined as the systematic evaluation of problems to find the basic causes that, when corrected, will prevent or significantly reduce the likelihood of a recurrence. These basic causes are called “root causes”. It is important to realize that most problems have more than one contributing cause, and if one of these contributing causes were eliminated, the problem would not recur.  

Unexpected equipment failures are not normal and should not be tolerated. Too often, equipment problems are not sufficiently defined and are solved only well enough to get the equipment back up and running. We therefore need to understand why the failure occurred and fix the cause, not just the failed equipment.

A structured RCA process is needed to make sure we understand the true root cause(s) of a problem instead of the more obvious symptoms. If our solution addresses only the symptoms, the problem is likely to reappear at some point in the future, or the implemented solution may create other problems.
Every failure affords us the opportunity to learn. If we ignore these opportunities, we will miss the chance to extend equipment life, decrease repair frequency, and improve profitability. There are four fundamental steps to RCA:
  • Quantify the magnitude of the problem and decide on the resources required to resolve it     
  • Perform the analysis by selecting the appropriate technique     
  • Develop a list of options for solving the problem and implement the most cost-effective solution     
  • Document the results of the analysis in the appropriate format
Four methods are generally used, either singly or in combination, for conducting a Root Cause Analysis. These methods include: 
  1. Plan, Do, Check, Act (PDCA)

The following section of the manual provides definitions and the sequence of activities that apply in each step of the PDCA cycle. It will be helpful to keep an eye on the outline as you proceed through this material. You will begin to see the usefulness of each step as a foundation for succeeding steps.

Plan

The "Plan" phase of PDCA includes:

Clearly Define the Problem.

"A problem clearly stated is a problem half solved". Although it seems like a trivial step, the team should not take this step lightly. It is important to begin this problem-solving journey with a clear, concise problem statement. If this is not done properly, it could lead to one of the following: excessive time spent on cause identification because the problem statement is too broad, a team predisposed to a particular solution, or problem solving that turns into solution implementation rather than root-cause identification and remedy. A problem can occur in one of two contexts, and the context should be reflected in the problem statement:
  • A specific set of conditions is preventing a desired result from occurring.     
  • A specific set of conditions is causing an undesired result to occur.     
When writing the problem statement:
  • Avoid solution statements (e.g. "Not enough backup instrumentation", "No in-house support").
  • Include two parts: a description of the undesirable condition(s), along with what it is causing or preventing from happening, and the consequence of the problem.
  • If the team has data at this point, quantify the size of the problem within the problem statement (e.g. "Process is available 45% of the time" as opposed to "Process is available a large amount of the time").
  • Define the problem as narrowly as possible (e.g., Is the problem on one maker or across all makers? Are material specs obsolete or missing for all materials, or which ones in particular are causing problems? Is parameter availability a problem across all shifts and all days, or can it be more specifically defined?).
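A figure such as the availability percentage quoted in a problem statement can be computed directly from logged uptime and downtime. The sketch below is only an illustration: the function name and the hour figures are hypothetical and do not come from any real process.

```python
def availability_percent(uptime_hours: float, downtime_hours: float) -> float:
    """Return availability as a percentage of total scheduled time."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total scheduled time must be positive")
    return 100.0 * uptime_hours / total


# Hypothetical log for one week (168 h): 75.6 h running, 92.4 h down.
pct = availability_percent(75.6, 92.4)
print(f"Process is available {pct:.0f}% of the time")
```

Stating the number this way, together with the period it covers, keeps the problem statement measurable and easy to re-check later.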

Collect Evidence of Problem 

This activity focuses on obtaining information/data to clearly demonstrate that the problem does exist.  In the case of team problem solving, this should be a quick exercise since the reliability engineering function must have been looking at data in order to create the team.  However, it is important that the team gathers and views this data to answer the following questions:
  • Does the problem truly exist?     
  • Is the problem measurable?     
  • Is the problem chronic?     
  • Does the data show the problem existing over an extended period?    
  • Is the problem significant? If solved, will it result in significant improvement or savings in time, money, morale and/or resources?
The output of this activity will be a list of evidence statements (or graphs) to illustrate that the problem exists, its size and the chronic nature of it.

Identification of Impacts or Opportunities 

This part of the Plan segment focuses on identifying the benefits if this problem solving is successful.  This activity needs to be thought of in two different perspectives because Project Team work can take the form of control work (fixing a problem that stands in the way of expected results) or pure improvement (attempting to take results to a new level of performance.)
  • Impacts: For control-type work, what is the consequence of not solving this problem?     
  • Opportunities: For pure improvement work, what is the lost opportunity if this work is not initiated?
In each case the output of this activity will be a list of statements.  The impact statements and opportunity statements should be stated in terms of loss of dollars, time, "product", rework, processing time and/or morale.

Measurement of Problem  

Before problem solving proceeds, it is important for the team to do a quick check on the issue of how valid or reliable the data is on which the team is making the decision to tackle the problem.   For the parameter(s) that are being used as evidence of the problem, is there any information known by the team that would question the validity, accuracy or reliability of the data?  
This question should be examined whether we are relying on an instrument, a recorder or people to record information or data. If the team suspects that there are significant issues that "cloud" the data, then these measurement problems need to be addressed, fixed and new measures obtained before proceeding with the other segments of PDCA.

Measure(s) of Effectiveness  

At this point, the team needs to identify how they intend to measure success of their problem solving efforts.  This is one of the most important steps in PDCA and one that certainly differentiates it from "traditional" problem solving. The strategy is to agree on the What and How, obtain the benchmark “before” reading, perform the PDCA activities and re-measure or obtain the “after” measure.  It is at that point that the team will need to decide whether they need to "recycle" through PDCA in order to achieve their pre-stated objective. 
  • The team needs to determine what appropriate measures would directly reflect improvement. Don't worry about the How at this point.     
  • Look at TQMs, customer feedback results, costs and off-line parameters as possible measures. Typically 1 to 3 measures are sufficient.
  • The team then decides how the measure should be obtained and expressed (e.g., Pareto form, control chart form, survey and tabulation of results). Guideline: Get creative!
  • Someone should be assigned the task of obtaining the “before” measure. Maybe the data exists and it just needs to be researched.  Perhaps the vehicle to begin capturing the data needs to be developed/implemented.

Key Point:

  • Only after the “before snapshot” has been taken and the team has reviewed the data, should the next step of Setting the Objective occur.
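One of the forms mentioned above for expressing a measure is the Pareto form. As a rough sketch of how a "before" snapshot could be tabulated that way, the code below ranks categories by count and adds a cumulative percentage. The category labels and counts are hypothetical.

```python
from collections import Counter


def pareto_table(events):
    """Rank categories by count and add cumulative percentages.

    `events` is an iterable of category labels, one per occurrence.
    Returns a list of (category, count, cumulative_percent) tuples.
    """
    counts = Counter(events)
    total = sum(counts.values())
    rows = []
    running = 0
    # most_common() yields categories in descending order of count
    for category, count in counts.most_common():
        running += count
        rows.append((category, count, round(100.0 * running / total, 1)))
    return rows


# Hypothetical "before" log: one label per recorded stoppage.
stoppages = ["jam"] * 12 + ["sensor fault"] * 5 + ["changeover"] * 2 + ["other"] * 1
for category, count, cum_pct in pareto_table(stoppages):
    print(f"{category:<14}{count:>4}{cum_pct:>8.1f}%")
```

The same table, rebuilt from the "after" data, gives a like-for-like comparison once the PDCA activities are complete.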

Team Objective Setting  

When the team has determined how it will measure success and has obtained the "before" or current level of performance of those measures, they can effectively set the team objective for improvement.  Knowing the amount of effort and resources that will be utilized on this problem, what amount of improvement in the Measure of Effectiveness would provide a good return on investment?  To do this activity productively requires that the team leader has had dialogue with the QC, PMT, SIT or DMT, can fairly represent their expectations and knows the general time frame for problem solving activities.

Key Points

  • The objective should be to significantly reduce the problem, not necessarily to totally eliminate it.     
  • The team should be aware of effort/return ratio. That is, ensure that the expected benefit level is a meaningful return for the time/energy/resources to be expended in solving the problem.     
  • The objective should be stated in terms of % reduction in average level or % change in Measure of Effectiveness, % reduction in variation, % reduction in cost or time.     
  • Objectives should be set so they can be achieved in a reasonable period of time.
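Because the objective is stated as a percentage change in the Measure of Effectiveness, it helps to translate it into the level the measure must reach. The following is a minimal sketch with hypothetical function names and figures.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percent reduction of a measure relative to its "before" level."""
    if before == 0:
        raise ValueError('"before" level must be non-zero')
    return 100.0 * (before - after) / before


def target_level(before: float, reduction_pct: float) -> float:
    """Level the measure must reach to meet a stated % reduction objective."""
    return before * (1.0 - reduction_pct / 100.0)


# Hypothetical figures: 40 downtime hours/month before, objective of a 50% reduction.
before_level = 40.0
target = target_level(before_level, 50.0)
achieved = percent_reduction(before_level, 26.0)
print(f"target level: {target:.1f} h/month")
print(f"achieved so far: {achieved:.0f}% reduction")
```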

Rough PDCA Timetable 

For purposes of resource planning and team plan updating, the team projects a rough timetable for completing each segment of the PDCA cycle; by this point it has gathered enough information to do so. The team must recognize that this is a preliminary estimate based on the information currently available and that it will need to be revised as the team progresses.  The format is simple:
The information that the team should use to make these estimates is:
  • Size of the problem.
  • How much of the problem the team is attempting to solve (team objective). 
  • Complexity of the problem. 
  • Other conflicts for time that team members will have. 
  • QC, PMT, SIT or DMT priority for resolution of the problem
This timetable will be used by the team in several ways:  
  • Serve as team objectives for getting work done.     
  • Planning purposes so other activities can be scheduled with this timing objective in mind.     
  • QC, PMT, SIT or DMT planning. With this information the Management Team can better manage the team plan by knowing when other teams need to begin, the resources available, etc.

Management Team Approval and Review

During the sequence of work that the team will be performing, it will be necessary to maintain quality communications with the Management Team, whether it be Line Management Team, SIT, DMT, PMT, or QC.  The nature of this communication could be: 
  • To inform the Management Team of team progress or results.      
  • To review team plans or to obtain Management Team approval to carry out changes that the team has concluded are necessary.

The principle of this activity is not that the Management Team will be "looking over the shoulder" of the team in an attempt to solve problems, but rather that the Management Team should be concerned with process (how the team does its work) as well as results. The more informed the Management Team is, the better job it can do in prioritizing and coordinating efforts and in optimizing the allocation of a limited number of problem solving resources.


These sessions with the Management Team could be done by the team leader, a rotating representative of the team or in some cases, it might be appropriate for the entire team to be present.


The standardized forms to capture team outputs should be the basis for the presentations. Talking from these forms should give the Management Team a good sense of the process (quality of problem solving efforts) as well as the key conclusions at which the team has arrived. In the Check segment especially, it might be necessary to use additional exhibits to demonstrate how conclusions were drawn.


The content of these sessions will vary depending on the stage of PDCA.  For this first Management Team approval, the major items to be covered are: 
  • Team problem statement.     
  • Impact information.     
  • Measure(s) of Effectiveness with "before" measurement(s).     
  • Team objective.     
  • Rough PDCA timetable.

Generate Possible Causes

To avoid falling into the mode of solution implementation or trial and error problem solving, the team needs to start with a "blank slate" and from a fresh perspective lay out all possible causes of the problem.  From this point, the team can use data, its collective knowledge and experience to sort through the most feasible or likely major causes.  Proceeding in this manner will help ensure that the team will ultimately get at root causes of problems and won't stop at the treatment of other symptoms.  The best tool to facilitate this thinking is the Cause and Effect Diagram done by those people most knowledgeable of and closest to the problem. 
After constructing the Cause & Effect Diagram, the leader should obtain any available “clean” data that illustrates relationships between possible causes and the effect (or dependent variable).
  • If no data is available, go to Step 3 immediately.
  • Step 3 is based on any data shared by the leader along with the team's collective knowledge and experience.
  • In Step 3, suspected major causes are identified which the team will initially investigate. Indicate these by circling those bones on the Cause and Effect Diagram.
  • As a guideline, you should aim to circle 2 to 6 bones. These are the 1st/2nd level causes that, if real, will collectively reduce the problem to the level stated in the objective.
  • In the absence of existing data (Step 2), Step 3 often becomes an exercise to determine which variables need to be included in a designed experiment.
  • Keep in mind, if these PDCA activities do not turn out to meet the objective, the team would come back to this Cause & Effect Diagram and identify other suspected causes.

Broke-Need-Fixing Causes Identified, Worked On

Before proceeding to carry out either an Action Plan (for Cause Remedies) or an Experimental Test Plan, there are often parts of the process that are "broke".  This could take on many different forms.  For example:
  • Mechanical part known to be defective or incorrect.     
  • Piece of equipment not functioning as intended or designed.     
  • Erratic behavior of a piece of equipment.     
  • Temporary replacement of a part/piece of equipment that is not equivalent to the requirement.     
  • A method or procedural change made temporarily to "get around" a problem.
In most of these cases, the items are obvious to the team as something "they have had to just live with".  These items, if not fixed, might obscure any experimental results or limit the amount of improvement realized in the Action Plan.  A few key guidelines to remember in performing this activity:
  • Only focus on items obvious to the team.     
  • Do not work on an item unless there is clear consensus within the team.
  • Only address those items that can be fixed in a short period of time (weeks, not months).

Write Experimental Test Or Action Plan

Depending upon the type of problem being worked on, the PDCA strategy will take one of two different directions at this point.  The direction is based on whether it is a "data-based" problem or "data-limited" problem.  Shown in the table below is the distinction between these two strategies and in particular, the difference between an Action Plan and Experimental Test Plan.  Note that in some cases, it will be necessary to use a combination of Action Plans and Experimental Test Plans.  That is, for some cause areas an Action Plan is appropriate and for other causes within the same problem, carrying out an Experimental Test Plan is the best route. 
In order to get to the point of writing the Action Plan, the team needs to brainstorm possible solutions or remedies for each of the "cause areas" (circled bones) and reach consensus on the prioritized solutions.  This work can be carried out as a team or split into sub-teams.  Either way, the entire team will have to reach agreement on proposed remedies and agree to the Action Plan.  The Action Plan will be implemented in the Check segment.  Who, When, and What to be done is what you are attempting to spell out in the Action Plan.  The format for this Plan is:

Write Experimental Test Plan

The  Experimental Test Plan is a document which shows the experimental test(s) to be carried out.  This will verify whether a root cause that has been identified really does impact the dependent variable of interest.  Sometimes this can be one test that will test all causes at once or it could be a series of tests.  Note:  If there is a suspicion that there is an interaction between causes, those causes should be included in the same test.

The Experimental Test Plan should reflect:
  • Time/length of test.     
  • How the cause factors will be altered during the trials.     
  • Dependent variable of interest (the variable the team is interested in affecting).
  • Any noise variables that must be tracked.     
  • Items to be kept constant.
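When several suspected causes may interact and therefore belong in the same test, one common way to lay out the trials is a full factorial design, in which every combination of factor levels is run. The source does not prescribe a particular design, so the sketch below is only an assumption about how such a plan might be enumerated. The factor names, levels and held-constant items are hypothetical.

```python
from itertools import product


def full_factorial(factors):
    """Enumerate every combination of factor levels.

    `factors` maps a factor name to the list of levels to be tried.
    Returns a list of dicts, one per experimental run.
    """
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


# Hypothetical cause factors and levels; held-constant items are listed separately.
cause_factors = {
    "line_speed": ["low", "high"],
    "material_lot": ["A", "B"],
}
held_constant = {"operator": "same crew", "shift": "day"}

runs = full_factorial(cause_factors)
for i, run in enumerate(runs, start=1):
    print(i, run, "| constant:", held_constant)
```

The number of runs grows quickly with the number of factors and levels, which is one reason to limit the test to the prioritized causes.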
Everyone involved in the Experimental Test Plan(s) should be informed before the test is run.  This should include:
  • Purpose of the test.     
  • Experimental Test Plan (details).     
  • How they will be involved.     
  • Key factors to ensure good results. 
When solutions have been worked up, the team should coordinate trial implementation of the solutions and the "switch on/off" data analysis technique. 

Resources Identified

Once the Experimental Test Plan or the Action Plan is written, it will be fairly obvious to the team what resources are needed to conduct the work.  For resources not on the team, the team should construct a list of who is needed, for what reason, the time frame and the approximate amount of time that will be needed.  This information will be given to the Management Team.

Revised PDCA Timetable

At this point, the team has a much better feel for what is to be involved in the remainder of its PDCA activities.  They should adjust the rough timetables that had been projected in the Plan segment.  This information should be updated on the team Plan, as well as taken to the Management Team.

Management Team Review/Approval  

The team has reached a critical point in the PDCA cycle.  The activities they are about to carry out will have obvious impact and consequences to the department.  For this reason, it is crucial to make a presentation to the Management Team before proceeding.  This can be done by the team leader or the entire team.  The content/purpose of this presentation is:
  • Present team outputs to date.    
  • Explain logic leading up to the work completed to date.     
  • Present and get Management Team approval for:
      • Measure of Effectiveness with "before" measure.
      • Priority causes.
      • Action Plan (for Cause Remedies) or Experimental Test Plan.
      • Revised PDCA timetable.

Carry Out Experimental Test or Action Plan

Depending upon the nature of the problem, the team will be carrying out either of these steps:
  • Conduct Experimental Test Plan(s) to test and verify root causes or     
  • Work through the details of the appropriate solutions for each cause area. Then, through data, verify to see if those solutions were effective. 
Below, we will look at some general information and key points to remember for both of these strategies.

Carry Out Action Plan

In the case of Action Plans, where solutions have been worked up and agreed to by the team, the "switch on/switch off" techniques will need to be used to verify that the solutions are appropriate and effective.  To follow this strategy, the team needs to identify the dependent variable—the variable that the team is trying to impact through changes in cause factors. 
When using this strategy remember these important points:
  • Collect data on the dependent variable for a "representative" period before the test period. It should be comparable in length to the test period.     
  • Test for normality, develop control limits to define typical performance under the "old system".     
  • Test period: Implement solutions. (Ensure the window of the test period is long enough to capture most sources of variation.)
  • Compare test period data against already defined limits from the "before" data.
  • Check to see if the level has shifted significantly, as evidenced by out-of-control (OOC) points.
  • "Switch off": undo the changes. See if the performance returns to the "before" level.
Note:  The purpose of the "switch on/off" technique is to guard against the situation in which the implemented changes had a positive effect, but the results did not show it because "new" causes (time related) entered the process and offset the positive effects of the planned changes.  The data, in that case, would show no change.  However, using the "switch on/off" technique will help to overcome that phenomenon.
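The "switch on/off" comparison can be sketched in code. The example below assumes an individuals (XmR) control chart, whose limits are the mean plus or minus 2.66 times the average moving range. The source does not say which chart type to use, so this is one reasonable choice, not the prescribed method. All data values are hypothetical.

```python
from statistics import mean


def individuals_limits(data):
    """Control limits for an individuals (XmR) chart.

    Uses the standard constant 2.66 (= 3 / 1.128) times the average
    moving range between consecutive points.
    """
    if len(data) < 2:
        raise ValueError("need at least two points")
    center = mean(data)
    moving_ranges = [abs(b - a) for a, b in zip(data, data[1:])]
    spread = 2.66 * mean(moving_ranges)
    return center - spread, center, center + spread


def out_of_control(data, lower, upper):
    """Return the points that fall outside the control limits."""
    return [x for x in data if x < lower or x > upper]


# Hypothetical daily scrap percentages (lower is better).
before = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.0, 4.7, 5.1, 4.9]
switch_on = [3.9, 4.0, 3.8, 4.1, 3.7, 3.9, 4.0, 3.8, 3.9, 4.0]
switch_off = [5.0, 4.9, 5.2, 4.8, 5.1, 5.0, 4.9, 5.1, 5.0, 4.8]

lcl, center, ucl = individuals_limits(before)  # limits from the "old system"
on_ooc = out_of_control(switch_on, lcl, ucl)
off_ooc = out_of_control(switch_off, lcl, ucl)

print(f"limits from 'before' data: {lcl:.2f} to {ucl:.2f}")
print(f"OOC points while switched on:  {len(on_ooc)} of {len(switch_on)}")
print(f"OOC points after switching off: {len(off_ooc)} of {len(switch_off)}")
```

In this made-up data the level shifts while the changes are in place and returns to the "before" band once they are undone, which is the pattern the technique looks for.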

Carry Out Experimental Test Plan

During the Check segment, the Experimental Tests to check all of the major prioritized causes are to be conducted, data analyzed and conclusions drawn and agreed to by the team.  A few key points to remember in doing this series of activities:
  • The team leader should be conferring with the TQI Specialist to ensure appropriate data analysis techniques are used. The tools used will depend upon the nature of the Test Plan.  Typically, the techniques of Q3 are the most often used.     
  • While the test(s) are being run, complete documentation is kept. This information will help the team decide if the results are valid.     
  • Clear, simple, concise, data recording sheets shall be constructed to ensure the right information is recorded and correct experimental conditions are set.     
  • Team members should be assigned to closely monitor test conditions to ensure the Experimental Test Plan is followed as designed.

Analyze Data From Experimental or Action Plan  

Typically, one person from the team is assigned the responsibility to perform the analysis of the data from the Test Plan.  When necessary, this person should use the department or plant resource available to give guidance on the proper data analysis tools and/or the interpretation of outputs.  The specific tools that should be used will depend upon the nature of the Test Plan.  
Some of the most frequently used techniques include:
  • Analysis of variance (one-way, multifactor).
  • Post-hoc techniques (Tukey, Scheffé).
  • Significance testing of means (t test).
  • Significance testing of variation (F test).
  • Regression fitting.
  • Chi-square analysis.
  • Fractional analysis of variance.
  • Correlation analysis.
  • Discriminant analysis.
  • Non-parametric techniques (non-normal data).
  • "Switch on/off" comparisons.
  • Response surface.
  • Stepwise regression.
Note: Most of these techniques will be covered later in this paper.

In most cases, a combination of several techniques will be used to analyze the data. The use of each of these techniques yields very specific outputs which need to be interpreted and conclusions drawn from them. These conclusions need to be clearly documented and then shared with the team. It is important that the team understand how these conclusions were reached, based on the raw data. Typically, these conclusions center around answering the following:
  • Which (if any) causes demonstrated a significant impact (mean, variation) on the dependent variable?
  • Were there any interactions? Was the difference created not by one cause alone, but by a combination of causes?
  • What is an accurate estimate of the expected impact on the dependent variable if the cause were eliminated?
Team members should be careful not to "force" conclusions or try to creatively look at the data to create a difference. If the results of the technique applied indicate no significant impact, accept that conclusion and move on. Often, careful data analysis yields evidence of a cause variable that was not part of the design. This information could then be used when going back to the Do segment.
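To illustrate how one might check whether a cause has a significant impact on the mean of the dependent variable, the sketch below uses an exact permutation test. This is a non-parametric stand-in chosen because it needs only the Python standard library; it is not the t test or analysis of variance named in the list above. The function name and data are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_test(group_a, group_b):
    """Exact two-sided permutation test on the difference in means.

    Enumerates every way of splitting the pooled data into groups of the
    original sizes, so it is only practical for small samples.
    Returns (observed_difference, p_value).
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    observed = mean(group_a) - mean(group_b)
    total_sum = sum(pooled)
    extreme = 0
    count = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = sum_a / n_a - (total_sum - sum_a) / n_b
        # small tolerance so splits tied with the observed value are counted
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
        count += 1
    return observed, extreme / count


# Hypothetical cycle times (minutes) with the suspected cause present and removed.
cause_present = [12.1, 11.8, 12.4, 12.0, 12.3, 11.9]
cause_removed = [11.0, 11.2, 10.9, 11.3, 11.1, 10.8]

diff, p_value = permutation_test(cause_present, cause_removed)
print(f"estimated impact: {diff:.2f} min, p = {p_value:.4f}")
```

The observed difference in means doubles as a simple estimate of the expected impact if the cause were eliminated. A large p-value should be accepted as "no significant impact", in line with the caution above about forcing conclusions.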

Decisions-Back to “DO” Stage or Proceed

After reviewing the data analysis conclusions about the suspected causes or solutions that were tested, the team needs to make a critical decision of what action to take based on this information. 

The data analysis step could have been performed in either of the following contexts:
  • After the Action Plan (solutions) was carried out, data analysis was performed to see if the dependent variable was impacted. If the conclusions were favorable, the team could then go on to develop the Implementation Plan.
  • The Experimental Test Plan was conducted and data was analyzed to verify causes. If the conclusions were favorable (significant causes identified), the team must then develop solutions to overcome those causes before proceeding to develop the Implementation Plan. (e.g., It was just discovered through the Test Plan that technician differences contribute to measurement error. Next, the team would need to identify ways to eliminate these differences.)

Implementation Plan To Make Change Permanent

The Implementation Plan, to make the changes permanent, should cover the following areas.  In each case, clear accountability for carrying out that function and activity should be identified.
  • Changes needed to equipment, procedures, processes— What, Who, When, How     
  • Training needs     
  • Communication needs     
  • Approval steps to get changes made
To write an Implementation Plan, the team should ask the following critical questions:

Critical Questions to Write Implementation Plan

  •  What procedures need to be permanently modified?     
  • Who needs to be trained and in what to make this permanent?     
  • Who will do the training?     
  • What equipment needs to be modified, altered or added?     
  • What job responsibilities need to be modified, added, deleted?     
  • What work processes need to be altered? How can we document these changes?
  • Who needs to approve these changes?     
  • How will the changes be permanently implemented? When?      
  • How will they be phased in?     
  • Who needs to know that these changes are taking place? Who will  do the communication?
Once these questions are answered thoroughly, the team can construct an Implementation Plan to make the necessary changes permanent.

Implementation Plan  

It is absolutely critical that this plan be carefully and thoroughly prepared by the team to ensure that the proven remedies can be implemented smoothly, as intended, and with the support and buy-in of those involved.

Force Field On Implementation   

Once the Implementation Plan is written, the team should do a Force Field Analysis on factors pulling for and factors pulling against a successful implementation – success in the sense that the results seen in the test situation will be realized on a permanent basis once the solutions are implemented.

As a result of this activity, the team should ask two questions:
1. Given these results, is the probability of success high enough to proceed?  
2. Looking at the factors on the right-hand side (those pulling against), what can be added (if anything) to the Implementation Plan to minimize the effects of these negative factors?
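A Force Field Analysis is often summarized by giving each factor a strength score and comparing the two sides. The source does not describe a scoring scheme, so the 1 to 5 scale, the factor names and the scores below are all assumptions made for illustration.

```python
def force_field_balance(driving, restraining):
    """Sum the scores on each side and return (for, against, net)."""
    total_for = sum(driving.values())
    total_against = sum(restraining.values())
    return total_for, total_against, total_for - total_against


# Hypothetical factors, scored 1 (weak) to 5 (strong) by team consensus.
driving = {"proven in trial": 5, "operator support": 3, "low cost": 4}
restraining = {"retraining time": 3, "spare part lead time": 2}

total_for, total_against, net = force_field_balance(driving, restraining)
print(f"for={total_for} against={total_against} net={net}")

# The strongest restraining factors are candidates for extra steps in the plan.
ranked = sorted(restraining.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print(f"  address: {name} (score {score})")
```

The net score is only a prompt for discussion; the team still has to judge whether the probability of success is high enough to proceed.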

The Implementation Plan should then be revised as needed and finalized.

Management Team Review/Approval

The team has reached a very critical point in the PDCA cycle and needs to meet with the Management Team before proceeding.  This meeting is extremely important, because the team will be going forward with permanent changes to be made in operations.  The Management Team not only needs to approve  these changes but also the way in which they will be implemented.

Purpose of Management Team Approval

The purpose of this session is to:
  • Provide details of the solutions developed as part of the Action Plan.
  • Present the data and logic involved in the team conclusions which were drawn from the data analysis. In the case of Test Plans, present the solutions developed to overcome significant causes.
  • Obtain approval to make the necessary changes permanent by carrying out the Implementation Plan, or obtain approval to return to the Do segment.
The key outputs or information to be presented in this session should include the following:
  • Experimental Test Plan data analysis and list of conclusions, or Action Plan outputs (details of solutions for each cause area).
  • In the case of Action Plans: "Switch on/off" results after trial.
  • Implementation Plan.
  • Force Field on implementation.

Carry Out Implementation Plan

If the team has written a complete, clear and well-thought-out Implementation Plan, it will be very obvious what work needs to be done, by whom and when, to carry out the Act segment of the PDCA cycle. The team should give significant attention to ensuring that communications and training are carried out thoroughly, so department members will know what is changing, why the change is being made and what they need to do specifically to make implementation a success.

Post-Measure of Effectiveness  

After all changes have been made and sufficient time has passed for the results of these changes to have an effect, the team needs to go out and gather data on all of the Measures of Effectiveness.  The data then needs to be analyzed to see if a significant shift has occurred.  To accomplish this a team could do the following:
  • Establish control limits for the measure based on the "before" data.     
  • Extend the limits.     
  • Plot post measures on the same graph. Check to see if the chart goes OOC on the favorable side. See figure 5.1.
Figure 5.1 Measure of Effectiveness  

Analyze Results versus Team Objectives  

In the previous step, the team looked at whether the Measure(s) of Effectiveness had been impacted in any significant way by the permanent implementation of the changes.  The team cannot stop here.  If the answer to that question is favorable, then the team needs to verify if the amount of improvement was large enough to meet the team objective.

To answer this question, the team should use tools such as hypothesis testing or confidence intervals. Note: The only evidence that should be accepted as proof that the team has “done its work” is a significant shift in the Measure of Effectiveness. Until this happens, the team should not close out.
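As a rough sketch of the confidence-interval approach, the code below estimates the reduction in a measure between the "before" and "after" data and checks whether even the lower bound of that estimate meets a percentage objective. It assumes a normal approximation, which is only reasonable for larger samples; a t-based interval would be wider for small ones. The function names and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def reduction_interval(before, after, confidence=0.95):
    """Approximate confidence interval for mean(before) - mean(after).

    Uses a normal approximation with unpooled variances.
    """
    diff = mean(before) - mean(after)
    se = sqrt(stdev(before) ** 2 / len(before) + stdev(after) ** 2 / len(after))
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return diff - z * se, diff + z * se


def meets_objective(before, after, objective_pct, confidence=0.95):
    """True if the lower bound of the reduction meets the % objective."""
    lower, _ = reduction_interval(before, after, confidence)
    required = mean(before) * objective_pct / 100.0
    return lower >= required


# Hypothetical weekly downtime hours before and after implementation.
before = [40, 42, 38, 41, 39, 43, 37, 40, 41, 39, 42, 38]
after = [30, 29, 31, 28, 32, 30, 29, 31, 30, 28, 31, 31]

low, high = reduction_interval(before, after)
print(f"reduction: {low:.2f} to {high:.2f} hours/week (approx. 95% interval)")
print("meets 20% objective:", meets_objective(before, after, 20.0))
print("meets 25% objective:", meets_objective(before, after, 25.0))
```

Requiring the lower bound to clear the objective is a deliberately conservative choice; a team could instead compare the point estimate, as long as the rule is agreed before the data is examined.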

Team Feedback Gathered  

Once the team decision has been made that the PDCA cycle has been successfully completed (based on Measure of Effectiveness change), the team needs to present this information to the Management Team.  Before this is done, the team leader needs to gather feedback from the team.  This feedback will be in the form of a questionnaire that all team members (including the team leader) should fill out.  The results will be tallied by the team leader and recorded.  

The Team Leader should then call a meeting to review the results as part of the team close-out. The results will also be shared with the Management Team at the Management Team close-out meeting. The feedback questionnaire will attempt to assess team members' perceptions in the following areas:
  • How well did the team follow and use PDCA?     
  • How efficient was the team?
  • How effective and efficient were the team meetings?
  • How effectively was the team led?     
  • How much did the team members learn or grow in PDCA and use of the tools and techniques?
Not only can the Management Team use these results, over time, as a measure of PDCA progress, but they can also provide valuable information to team leaders, other resources and the Management Team, so that they may take appropriate steps to ensure the mastery of PDCA.  There is a strong correlation between the degree to which employees can effectively use PDCA and the continuous improvement of processes, products, work life and costs.

Management Team Close-out Meeting  

Before disbanding, the team needs to conduct a close-out meeting with the Management Team.  The major areas to be covered in this meeting are:
  • Wrap up any implementation loose ends.     
  • Review Measure of Effectiveness results, compare to team objective.     
  • Ensure team documentation is complete and in order.     
  • Share team member feedback on team experiences (standardized forms and informal discussion).
Note:  A composite picture of team member feedback should be given to the Management Team.
