Normal Accidents and Root Cause Analysis: Difference between revisions

Revision as of 02:58, 21 September 2014

Normal Accidents book: http://press.princeton.edu/titles/6596.html

Perrow categorizes systems along two axes: interactions that are simple (linear) vs complex, and coupling that is tight vs loose.

There are a few different versions of the quadrant:
- http://paei.wdfiles.com/local--files/perrow-charles-normal-accident-theory/PAEI_043_Perrow_Normal_Accident_Theory.gif
- https://www.flickr.com/photos/metanick/139214026/
- http://media.peakprosperity.com/images/3-Perrow-from-Accidents-Normal.png
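The two axes can be written down as a small lookup. This is only an illustrative sketch: the names `Interactions`, `Coupling` and `quadrant` are made up for this page, and the labels paraphrase the notes below rather than quote Perrow.

```python
from enum import Enum


class Interactions(Enum):
    SIMPLE = "simple"
    COMPLEX = "complex"


class Coupling(Enum):
    LOOSE = "loose"
    TIGHT = "tight"


def quadrant(interactions: Interactions, coupling: Coupling) -> str:
    """Return a short label for the quadrant a system falls into."""
    if interactions is Interactions.COMPLEX and coupling is Coupling.TIGHT:
        return "complex and tightly coupled: accidents should be expected"
    if interactions is Interactions.SIMPLE and coupling is Coupling.TIGHT:
        return "simple but tightly coupled: easy to reason about, little chance to recover"
    if interactions is Interactions.COMPLEX and coupling is Coupling.LOOSE:
        return "complex but loosely coupled: there is a chance to warn and react"
    return "simple and loosely coupled: the quadrant to aim for"


# Examples taken from the notes on this page.
print(quadrant(Interactions.SIMPLE, Coupling.TIGHT))   # the dam
print(quadrant(Interactions.COMPLEX, Coupling.LOOSE))  # the CITCON open space evening
```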

Douglas Squirrel talking about root-cause analysis: https://skillsmatter.com/skillscasts/1986-talk-by-squirrel

Notes on Squirrel's talk: http://www.markhneedham.com/blog/2011/12/10/the-5-whysroot-cause-analysis-douglas-squirrel/

Notes from John Bradshaw:

Normal accidents:

Three Mile Island accident: the operators were blamed
Any system can and will fail, and you should plan for it to fail
2-axis graph
- Interactions: complex -> simple
- Coupling: loose -> tight
- Complex & tightly coupled = accident
A complex system that is loosely coupled: the CITCON open space set-up evening
- We did not all rush to get food and beer
E.g. had there been a lion in there, one person could have warned the rest
Chance to warn of danger
Simple but tightly coupled = dam
- The accident is water getting through the dam
- If anything goes wrong with the dam, e.g. a hole, there is no chance to resolve it
- Simple to reason about: a wall of rock with a hole in it
- But it is high risk
In the nuclear plant accident, the cooling system was near the radioactive rods
- Operators could see there was a leak but had no context, e.g. they could not see that it was leaking near/into the radioactive rod storage, which would lead to an accident
Book to Read: Normal Accidents by Perrow
Are micro services tightly coupled and complex?
- Depends
- It's down to design and implementation
Always strive to be in the bottom right corner of the graph: low complexity, loosely coupled
How do people plan for failure?
- Rob: we go through a certification process to get into Retail
Each system that could fail is tested, e.g. Chaos Monkey style, where someone manually takes down services
The internal team runs the same tests before handing over to the external certification team
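The "take a service down and see what happens" drill mentioned above can be rehearsed in miniature. The sketch below is an assumption-heavy illustration, not the certification process described by Rob: `FlakyDependency`, `PriceService` and `run_failure_drill` are hypothetical names, and the fallback-to-cache behaviour is just one possible way for a service to degrade.

```python
from typing import Dict, List, Optional


class FlakyDependency:
    """Stand-in for a downstream service that a tester can switch off."""

    def __init__(self) -> None:
        self.up = True

    def fetch_price(self, item: str) -> float:
        if not self.up:
            raise ConnectionError("dependency is down")
        return 9.99


class PriceService:
    """Service under test: falls back to the last known price on failure."""

    def __init__(self, dependency: FlakyDependency) -> None:
        self.dependency = dependency
        self.last_known: Dict[str, float] = {}

    def price(self, item: str) -> Optional[float]:
        try:
            value = self.dependency.fetch_price(item)
        except ConnectionError:
            # Degrade instead of crashing: serve the cached value, if any.
            return self.last_known.get(item)
        self.last_known[item] = value
        return value


def run_failure_drill() -> List[Optional[float]]:
    """Query while the dependency is up, take it down, then query again."""
    dependency = FlakyDependency()
    service = PriceService(dependency)
    results = [service.price("widget")]      # dependency is up
    dependency.up = False                    # manually "take the service down"
    results.append(service.price("widget"))  # a cached value is available
    results.append(service.price("gadget"))  # nothing cached for this item
    return results
```

Running the same drill internally before an external team does is then a matter of calling `run_failure_drill()` from the test suite and asserting on the results.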

How do you verify or even test your logging? Example: an instance of a service logged on every failure in a tight loop and filled the disks, leading to further failure = simple, tightly coupled system
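One way to make logging testable is to put the throttling in a `logging.Filter` and count what gets through with an in-memory handler. This is a sketch under assumptions: `ThrottleFilter`, `ListHandler` and `count_logged_failures` are hypothetical names, the 60-second interval is arbitrary, and the clock is faked so the test does not have to wait.

```python
import logging
import time


class ThrottleFilter(logging.Filter):
    """Let a given message through at most once per `interval` seconds."""

    def __init__(self, interval: float, clock=time.monotonic) -> None:
        super().__init__()
        self.interval = interval
        self.clock = clock
        self._last_seen = {}

    def filter(self, record: logging.LogRecord) -> bool:
        key = (record.levelno, record.getMessage())
        now = self.clock()
        last = self._last_seen.get(key)
        if last is not None and now - last < self.interval:
            return False  # drop the repeat
        self._last_seen[key] = now
        return True


class ListHandler(logging.Handler):
    """Collects messages in memory so a test can count them."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def count_logged_failures(attempts: int) -> int:
    """Simulate a tight failure loop and return how many lines were logged."""
    fake_now = [0.0]
    handler = ListHandler()
    handler.addFilter(ThrottleFilter(interval=60.0, clock=lambda: fake_now[0]))
    logger = logging.getLogger("throttle_demo")
    logger.propagate = False
    logger.setLevel(logging.ERROR)
    logger.addHandler(handler)
    try:
        for _ in range(attempts):
            logger.error("cannot connect to database")
            fake_now[0] += 0.001  # pretend each attempt takes 1 ms
    finally:
        logger.removeHandler(handler)
    return len(handler.messages)
```

With the filter in place, ten thousand failures inside the interval produce a single log line instead of ten thousand, which is something a unit test can assert.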

Root Cause Analysis

Scenario: the database is deliberately down for maintenance. An instance of a service logged every failure to connect to the database in a tight loop and filled the disks, leading to further failure.
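The failure in this scenario came from retrying and logging with no delay. A minimal sketch of the opposite behaviour is a retry loop with exponential backoff. Everything here is an assumption for illustration: `connect_with_backoff` and `backoff_delays` are hypothetical helpers, the default delays are arbitrary, and `connect` stands for whatever call opens the database connection.

```python
import time
from typing import Callable, List, Optional


def backoff_delays(base: float, factor: float, cap: float, attempts: int) -> List[float]:
    """Delay after each failed attempt: base, base*factor, ... capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def connect_with_backoff(
    connect: Callable[[], object],
    attempts: int = 5,
    base: float = 1.0,
    factor: float = 2.0,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    log: Callable[[str], None] = print,
) -> Optional[object]:
    """Try `connect` up to `attempts` times, sleeping after each failure."""
    delays = backoff_delays(base, factor, cap, attempts)
    for attempt, delay in enumerate(delays, start=1):
        try:
            return connect()
        except ConnectionError as exc:
            # One log line per failed attempt, always followed by a pause.
            log(f"attempt {attempt}/{attempts} failed: {exc}; waiting {delay}s")
            sleep(delay)
    log("giving up; database still unreachable")
    return None
```

Passing `sleep` and `log` in as parameters keeps the function testable: a test can substitute list-appending functions and check both the delays and the number of log lines without waiting.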

Basic principles
- Everybody who was affected comes to the meeting
To identify cultural or people problems
Not allowed to place blame
Ask/poll everyone what the problem was
- Customer:
  - No system, was down, can't log on
- Operations:
  - Confused by phone call
- Customer Service:
  - Angry calls from customers, did not know what was going on
- Developer:
  - Database down, no disk space
Then ask why:
- Customer:
- Operations:
- Customer Service:
- Developer:

- Why: Maintenance on database, database down
- Why: Analysed log files, saw huge files, checked code, logged with no delay
- Why: Developer skills lacking
- Why: No code review/inspection
- Why: Test for this logging case lacking
- When QA tested, the database was running
- QA too busy to investigate database failure cases
- No new blood in organisation
- QA assigned/overbooked to too many projects
- Action: Maintenance on DB, have a redundant database to switch to
- Action: QA involved earlier

- Actions must be assigned and completed within a timeframe, e.g. 1 week
- When you hit that uncomfortable silence halfway down, keep pushing
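Assigned actions with a deadline are easy to track in a small structure. The sketch below is illustrative only: the `Action` fields, the owners and the meeting date are hypothetical, and the one-week deadline follows the example in the notes.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class Action:
    description: str
    owner: str
    due: date
    done: bool = False


def overdue(actions: List[Action], today: date) -> List[Action]:
    """Actions that are not done and whose deadline has passed."""
    return [a for a in actions if not a.done and a.due < today]


meeting_day = date(2014, 1, 6)  # hypothetical meeting date
actions = [
    Action(
        "Have a redundant database to switch to during maintenance",
        "ops (hypothetical owner)",
        meeting_day + timedelta(weeks=1),
    ),
    Action(
        "Involve QA earlier",
        "QA lead (hypothetical owner)",
        meeting_day + timedelta(weeks=1),
    ),
]

# Nothing is overdue on the day of the meeting itself.
print(overdue(actions, meeting_day))
```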

The root cause of failure is always the culture in an organisation
- It’s always about people, e.g.
  - The developer adding no delay to logging
  - Lack of testing
Create an RCA timeline of the failure
- At what time did the system go down
- At what time did customers complain
- At what time did developers react
- At what time was the system back up
- Etc.
Do as much technical investigation as possible before the RCA meeting
- E.g. this was the problem
- We had these tests
  - But we didn’t have one for this scenario
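The RCA timeline can be assembled before the meeting from whatever timestamps people bring. This is a minimal sketch: the event labels mirror the questions above, but all times are invented, and `build_timeline` and `elapsed` are hypothetical helper names.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Event = Tuple[datetime, str]


def build_timeline(events: List[Event]) -> List[Event]:
    """Sort events chronologically, whatever order they were reported in."""
    return sorted(events, key=lambda event: event[0])


def elapsed(timeline: List[Event], start: str, end: str) -> timedelta:
    """Time between two labelled events on the timeline."""
    times = {label: when for when, label in timeline}
    return times[end] - times[start]


# Invented timestamps, in the order people happened to report them.
events = [
    (datetime(2014, 1, 6, 9, 40), "developers react"),
    (datetime(2014, 1, 6, 9, 0), "system down"),
    (datetime(2014, 1, 6, 11, 0), "system back up"),
    (datetime(2014, 1, 6, 9, 15), "customers complain"),
]

timeline = build_timeline(events)
for when, label in timeline:
    print(when.strftime("%H:%M"), label)
print("outage lasted", elapsed(timeline, "system down", "system back up"))
```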

@@ Line 12: / Line 12: @@
 Normal accidents:
-·         3 Mile Island Accident - Blamed Operators
+* 3 Mile Island Accident - Blamed Operators
-·         Any system can and will fail, and you should plan for it to fail
+* Any system can and will fail, and you should plan for it to fail
-·         2 Axis graph
+* 2 Axis graph
-o    Complexity -> Simple
+** Complexity -> Simple
-o    Loose Coupling -> Tight Coupling
+** Loose Coupling -> Tight Coupling
-o    Complex & Tightly Coupled = Accident
+**    Complex & Tightly Coupled = Accident
-·         Complex system that is Loosely coupled is the CITCON open space set up evening
+*        Complex system that is Loosely coupled is the CITCON open space set up evening
-o    We did not all rush to get food and beer
+**    We did not all rush to get food and beer
-·         E.g had there been a Lion in there, 1 person could have warned rest
+*         E.g had there been a Lion in there, 1 person could have warned rest
-·         Chance to warn of danger
+*         Chance to warn of danger
-·         Simple but tightly coupled = Dam
+*         Simple but tightly coupled = Dam
-o    Accident is water gets through the damn
+**    Accident is water gets through the damn
-o    Anything goes wrong with dam e.g. hole, no chance to resolve
+**    Anything goes wrong with dam e.g. hole, no chance to resolve
-o    Simple to reason about, wall of rock with a hole in
+**   Simple to reason about, wall of rock with a hole in
-o    But is high risk
+**    But is high risk
-·         In nuclear plant accident, cooling system near radioactive rods
+*         In nuclear plant accident, cooling system near radioactive rods
-o    Operators can see there was a leak, but no context e.g. they can see the leak is leaking near/into the radioactive rod storage which would lead to an accident
+**    Operators can see there was a leak, but no context e.g. they can see the leak is leaking near/into the radioactive rod storage which would lead to an accident
-·         Book to Read: Normal Accidents by Perrow
+*         Book to Read: Normal Accidents by Perrow
-·         Are micro services tightly coupled and complex?
+*         Are micro services tightly coupled and complex?
-o    Depends
+**    Depends
-o    It's down to design and implementation
+**    It's down to design and implementation
-·         Always strive to be in the bottom right corner of the graph, low complexity loosely coupled
+*        Always strive to be in the bottom right corner of the graph, low complexity loosely coupled
-·         How do people plan for failure?
+*         How do people plan for failure?
-o    Rob - We go through a certification process to get into Retail
+**    Rob - We go through a certification process to get into Retail
-·         Each system that could fail is tested, e.g. chaos monkey style someone will manually go take down services
+*         Each system that could fail is tested, e.g. chaos monkey style someone will manually go take down services
-·         Internal team will run same tests internally before handing over to external certification team
+*         Internal team will run same tests internally before handing over to external certification team
 How do you verify or even test your logging? Instance of a service that logged every time on failure, in a tight loop and filled the disks leading to further failure = Simple Tightly Coupled System
@@ Line 46: / Line 46: @@
 Scenario:  Database deliberately down for maintenance. Instance of a service that logged every time on failure connecting to database, in a tight loop and filled the disks leading to further failure
-·         Basic principals
+*         Basic principals
-o    Everybody who was affected comes to the meeting
+**    Everybody who was affected comes to the meeting
-·         To identity cultural or people problems
+*         To identity cultural or people problems
-·         Not allowed to place blame
+*         Not allowed to place blame
-·         Ask/poll everyone what was the problem
+*         Ask/poll everyone what was the problem
-§  Customer:
+**  Customer:
-·         No system, was down, can't log on
+***        No system, was down, can't log on
-§  Operations:
+**  Operations:
-·         Confused by phone call
+***        Confused by phone call
-§  Customer Service:
+**  Customer Service:
-·         Angry calls from customers, did not know what was going on
+***         Angry calls from customers, did not know what was going on
-§  Developer:
+**  Developer:
-·         Database down, no disk space
+***        Database down, no disk space
-·         Then ask why:
+***         Then ask why:
-§  Customer:
+** Customer:
-§  Operations:
+**  Operations:
-§  Customer Service:
+**  Customer Service:
-§  Developer:
+**  Developer:
 ·         Why: Maintenance on database, database down
 ·         Why: Analysed log files, saw huge files, checked code,  logged with no delay
