Difference between revisions of "Root Cause Analysis"

From CitconWiki

Latest revision as of 16:48, 20 November 2011

09:00 on Saturday, Nov 12, 2011 morning in Space Invaders (the big room)

Squirrel has slides on how to go about doing a root cause analysis. (PJ reminder: get the slides from Squirrel to attach to the wiki.) (Squirrel: I can't figure out how to attach the slides. Help! In the meantime, you can watch a video that contains the slides: http://skillsmatter.com/podcast/agile-testing/talk-by-squirrel)

"Rules" for root-cause analyses

These are really suggestions for discussion. This is how TIM Group (http://www.timgroup.com) do root-cause analyses (RCAs for short).

Target a specific event

If you want to complete the analysis in a short meeting (30-60 minutes), it's best to focus on a specific recent event, such as a production bug or outage. It helps to start by understanding the level of pain the event caused. One person said he had seen a root cause analysis on a "big" event or series of events completed over time (it took a month and was part of a master's thesis).

Everyone affected attends

The "feature" team attends (typically developers), as well as senior managers and representatives from other areas of the business, e.g. client support or operations. It is not always feasible to get "everyone" in the room. One technique is to give absentees the results and tasks from an RCA they did not show up for. You can't assign actions to someone not in the room, though, so you have to do something like "visit X daily for a week to ensure she does Y".

No blame

Some participants may tend toward blame, depending on your company culture. Set the session up ahead of time to avoid this: "inoculate" people against blame with a discussion at the beginning of the session. A related anti-pattern: "as long as it isn't MY discipline, then I have gotten what I want out of this session and I don't have to do anything else."

Poll to identify problems

Go around the entire room and ask each person, "Please list all the problems event X caused". It is OK if others have already said your items; this naturally happens. The poll ensures everyone gets a say and no one is left out. Other methods include private ballots on post-its or email solicitation. Try to avoid proxies. Get the right people in the room (see above).

Write a lot

When not sure what to do, write! The aim is to fill the board and capture all the ideas. It is OK to have many whys for one item; it is rare that there is a single chain of whys.

Move down then across

Ask why again and again. People will resist going to the fifth why (it may actually take more). Write down items that are not in the current why chain and promise to return to them later (be sure you actually do so, to build trust). Push to get to a cultural or training issue.
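
As an illustration only, the "down then across" order can be sketched as a depth-first walk over a tree of whys: one chain is followed to the bottom before returning to the items parked beside it. The `Why` class and `down_then_across` function are hypothetical names invented for this sketch, and the sample board is loosely based on the snowman exercise further down these notes, not a record of the real board.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Why:
    """One item on the board; its causes are the answers to the next "why?"."""
    text: str
    causes: list[Why] = field(default_factory=list)

    def add(self, text: str) -> Why:
        """Record an answer to "why?" under this item and return it."""
        child = Why(text)
        self.causes.append(child)
        return child


def down_then_across(node: Why, depth: int = 0) -> Iterator[tuple[int, str]]:
    """Yield (depth, text) depth-first: finish one chain before its siblings."""
    yield depth, node.text
    for cause in node.causes:
        yield from down_then_across(cause, depth + 1)


# Hypothetical board, loosely based on the snowman exercise in these notes.
problem = Why("Lost good snowman")
problem.add("In wrong place")
no_see = problem.add("Couldn't see")
no_see.add("Didn't look").add("Hard to look").add("Van too big")

order = list(down_then_across(problem))
for depth, text in order:
    print("  " * depth + text)
```

Keeping parked items as sibling nodes, rather than dropping them, is one way to make sure the promise to return to them can actually be kept.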

If it doesn't hurt, then you aren't doing it right

There is typically a big pause before someone says, "Well, maybe it's that we don't value testing highly enough" or something like that. Wait for the pause and let them squirm.

Proportionate tasks

If you are rewriting your entire app because of 3 minutes of downtime, then you are not doing the right thing. It's OK to do part of a task ("Write the outline of a training programme for new sysadmins on our Puppet setup and add to induction wiki page"); if the problem occurs again, you can do the next step ("Fill in the deployment section of the outline").

All tasks done in a week

Identify tasks at many levels of the why chain (not just at the fifth why). Every task agreed to:

1) Has to be do-able in one week
2) Has to actually be done in one week

Someone (typically the one running the RCA) has to chase this to ensure it happens. Keeping to one week helps ensure proportionality and completion.
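
For whoever is doing the chasing, a minimal sketch of the one-week rule might look like the following. The `Task` class, the `to_chase` function, the owners, and the dates are all assumptions made up for this example; the task descriptions borrow from the snowman actions listed later in these notes.

```python
from dataclasses import dataclass
from datetime import date, timedelta

ONE_WEEK = timedelta(days=7)


@dataclass
class Task:
    """An action agreed in an RCA session (hypothetical structure)."""
    description: str
    owner: str
    agreed_on: date
    done: bool = False

    @property
    def due(self) -> date:
        """Every task is due one week after it was agreed."""
        return self.agreed_on + ONE_WEEK


def to_chase(tasks: list[Task], today: date) -> list[Task]:
    """Return the tasks that are still open after their one-week deadline."""
    return [t for t in tasks if not t.done and today > t.due]


# Illustrative data only: owners and dates are invented.
agreed = date(2011, 11, 12)
tasks = [
    Task("Fit a reverse warning to the van", "Wallace", agreed, done=True),
    Task("Book sign language classes", "Gromit", agreed),
]

overdue = to_chase(tasks, today=date(2011, 11, 20))
for task in overdue:
    print(f"Chase {task.owner}: {task.description} (was due {task.due})")
```

A task that cannot plausibly hit the `due` date is a sign it should be cut down to a smaller first step before the session ends.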

Questions and Comments

How does this compare to retrospectives? Retros are related to teams; the pain is more direct.

Other techniques for NOT losing focus? Keep it short term. The next root cause analysis might highlight the "next" step, but for now, "all we have to do now is take this first step".

Vote every day on actions from retrospectives to determine whether or not they are being actioned. Smiley faces or sad faces, depending on the votes.

Bickering can be a problem. Having a senior person present helps defuse these types of arguments.

Good article on what to avoid in an RCA (the above process is designed to avoid most of these errors): Cult of the Root Cause by Don Reinertsen (http://www.reinertsenassociates.com/#tipofmonth).

RCA for Funny Video

  • Building snowmen
  • Squirrel divided up the group into two: Wallace & Gromit
  • Squirrel: Did anyone take a photo of the board? Would be good to include here (sorry I didn't think of it).

Bad things that happened

  • Snowman destroyed - lost good snowman
  • Wallace covered in snowman
  • Got a cold
  • Wasted Gromit time and unhappy

Lost good snowman

  • In wrong place (Wallace's garden)
  • Wallace inconsiderate?
  • Couldn't see - didn't look - Hard to look - Van too big - Wanted impressive snowman -

30 to 60 second pause is "good" (it has to hurt a little)

Worked down to Competition and Dog Can't Talk at the end of 7 whys

Actions

  • Video, Mirrors, Reverse warning
  • Lightning talk on snowman
  • Board Agenda: Profit Sharing
  • Daily meetings (standups), sign language classes for Gromit

(take volunteers for each action)