Normal Accidents and Root Cause Analysis: Difference between revisions

Revision as of 02:54, 21 September 2014

Normal Accidents book: http://press.princeton.edu/titles/6596.html

Systems are categorized along two axes: whether their interactions are simple or complex, and whether they are tightly or loosely coupled.

There are a few different versions of the quadrant: http://paei.wdfiles.com/local--files/perrow-charles-normal-accident-theory/PAEI_043_Perrow_Normal_Accident_Theory.gif https://www.flickr.com/photos/metanick/139214026/ http://media.peakprosperity.com/images/3-Perrow-from-Accidents-Normal.png

Douglas Squirrel talking about root-cause analysis: https://skillsmatter.com/skillscasts/1986-talk-by-squirrel

Notes on Squirrel's talk: http://www.markhneedham.com/blog/2011/12/10/the-5-whysroot-cause-analysis-douglas-squirrel/

Notes from John Bradshaw:

Normal accidents:
· Three Mile Island accident: the operators were blamed
· Any system can and will fail, and you should plan for it to fail
· 2-axis graph
  o Complexity -> Simple
  o Loose Coupling -> Tight Coupling
  o Complex & Tightly Coupled = Accident
· A complex system that is loosely coupled: the CITCON open space set-up evening
  o We did not all rush to get food and beer
    · E.g. had there been a lion in there, one person could have warned the rest
    · There is a chance to warn of danger
· Simple but tightly coupled = Dam
  o The accident is water getting through the dam
  o If anything goes wrong with the dam, e.g. a hole, there is no chance to resolve it
  o Simple to reason about: a wall of rock with a hole in it
  o But it is high risk
· In the nuclear plant accident, the cooling system was near the radioactive rods
  o Operators could see there was a leak, but had no context, e.g. they could not see that it was leaking near/into the radioactive rod storage, which would lead to an accident
· Book to read: Normal Accidents by Perrow
· Are microservices tightly coupled and complex?
  o It depends
  o It's down to design and implementation
· Always strive to be in the bottom right corner of the graph: low complexity, loosely coupled
· How do people plan for failure?
  o Rob: We go through a certification process to get into Retail
    · Each system that could fail is tested, e.g. chaos monkey style: someone will manually take down services
    · The internal team runs the same tests before handing over to the external certification team

How do you verify or even test your logging? One instance: a service logged on every failure in a tight loop and filled the disks, leading to further failure. This is a simple, tightly coupled system.
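One way to approach the logging question is to cap how much disk the logs can use and then test that cap under a simulated failure storm. The sketch below is not from the talk; it assumes Python's standard `logging.handlers.RotatingFileHandler`, and the logger name, file name, and size limits are hypothetical values chosen for illustration.

```python
import logging
import logging.handlers
import os
import tempfile


def make_bounded_logger(log_dir, max_bytes=10_000, backup_count=2):
    """Return (logger, handler) writing to a size-capped, rotating log file.

    The logger name and file name are hypothetical, for illustration only.
    """
    logger = logging.getLogger("bounded_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.handlers.RotatingFileHandler(
        os.path.join(log_dir, "service.log"),
        maxBytes=max_bytes,        # roll over before a file reaches this size
        backupCount=backup_count,  # keep at most this many rotated files
    )
    logger.addHandler(handler)
    return logger, handler


def total_log_bytes(log_dir):
    """Sum the sizes of all files in the log directory."""
    return sum(
        os.path.getsize(os.path.join(log_dir, name))
        for name in os.listdir(log_dir)
    )


def simulate_failure_storm(log_dir, messages=5000):
    """Log one error per simulated failure and return the disk space used."""
    logger, handler = make_bounded_logger(log_dir)
    try:
        for attempt in range(messages):
            logger.error("database connection failed (attempt %d)", attempt)
    finally:
        handler.close()
        logger.removeHandler(handler)
    return total_log_bytes(log_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print("bytes on disk after storm:", simulate_failure_storm(tmp))
```

With these settings the logs should stay within roughly `max_bytes * (backup_count + 1)` bytes however many failures are logged, which is a property a test can assert directly.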

Root Cause Analysis

Scenario: The database was deliberately down for maintenance. A service logged every failed attempt to connect to the database in a tight loop and filled the disks, leading to further failure.
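The "logged with no delay" problem in this scenario can be illustrated with a retry loop that waits between attempts. This is a minimal sketch, not the code from the incident: `connect_with_backoff`, its parameters, and the logger name are hypothetical, and exponential backoff is just one reasonable choice of delay.

```python
import logging
import time

logger = logging.getLogger("db_retry_demo")  # hypothetical logger name


def connect_with_backoff(connect, max_attempts=5, base_delay=0.5,
                         max_delay=30.0, sleep=time.sleep):
    """Call connect() until it succeeds, waiting longer after each failure.

    One log line is written per failed attempt, and the attempts are spaced
    out, so a long outage cannot flood the disk with log output.
    `sleep` is injectable so tests do not have to wait in real time.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError as exc:
            logger.warning("attempt %d/%d failed: %s",
                           attempt, max_attempts, exc)
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
```

Because the delay doubles after each failure and the number of attempts is bounded, both the log volume and the load on the recovering database stay small during planned maintenance.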

· Basic principles
  o Everybody who was affected comes to the meeting
    · To identify cultural or people problems
    · Not allowed to place blame
    · Ask/poll everyone what the problem was
      § Customer:
        · No system, it was down, can't log on
      § Operations:
        · Confused by phone call
      § Customer Service:
        · Angry calls from customers, did not know what was going on
      § Developer:
        · Database down, no disk space
    · Then ask why:
      § Customer:
      § Operations:
      § Customer Service:
      § Developer:
        · Why: Maintenance on database, database down
        · Why: Analysed log files, saw huge files, checked code, logged with no delay
        · Why: Developer skills lacking
        · Why: No code review/inspection
        · Why: Test for this logging case lacking
          · When QA tested, the database was running
          · QA too busy to investigate database failure cases
          · No new blood in organisation
          · QA assigned/overbooked to too many projects
        · Action: Maintenance on DB, have a redundant database to switch to
        · Action: QA involved earlier
      § Actions must be assigned and completed within a timeframe, e.g. 1 week
      § When you hit that uncomfortable silence halfway down, keep pushing

· The root cause of failure is always the culture in an organisation
  o It’s always about people, e.g.
    · The developer adding no delay to logging
    · Lack of testing
· Create an RCA timeline of the failure
  o At what time did the system go down
  o At what time did customers complain
  o At what time did developers react
  o At what time was the system back up
  o Etc.
· Do as much technical investigation as possible before the RCA meeting
  o E.g. this was the problem
  o We had these tests
    · But we didn’t have one for this scenario
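The RCA timeline can be prepared before the meeting as a small, sorted list of events. The sketch below is one possible shape for it; `TimelineEvent`, the labels, and the timestamps in the usage example are hypothetical placeholders, not data from a real incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime  # when the event happened
    label: str    # short identifier, e.g. "system_down"


def build_timeline(events):
    """Return the events in chronological order."""
    return sorted(events, key=lambda event: event.at)


def elapsed_between(timeline, start_label, end_label) -> timedelta:
    """Return the time from the start_label event to the end_label event."""
    times = {event.label: event.at for event in timeline}
    return times[end_label] - times[start_label]


if __name__ == "__main__":
    # Placeholder timestamps, invented purely to show the shape of the data.
    timeline = build_timeline([
        TimelineEvent(datetime(2014, 1, 1, 9, 40), "system_up"),
        TimelineEvent(datetime(2014, 1, 1, 9, 0), "system_down"),
        TimelineEvent(datetime(2014, 1, 1, 9, 20), "developers_react"),
        TimelineEvent(datetime(2014, 1, 1, 9, 5), "customers_complain"),
    ])
    for event in timeline:
        print(event.at.isoformat(), event.label)
    print("time to recover:",
          elapsed_between(timeline, "system_down", "system_up"))
```

Computing intervals such as "down until developers reacted" from the timeline gives the meeting concrete numbers to discuss rather than recollections.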
