Normal Accidents and Root Cause Analysis

Normal Accidents book: http://press.princeton.edu/titles/6596.html

Systems are categorized by Interactions that are Simple vs Complex, and Tightly Coupled vs Loosely Coupled.

There are a few different versions of the quadrant: http://paei.wdfiles.com/local--files/perrow-charles-normal-accident-theory/PAEI_043_Perrow_Normal_Accident_Theory.gif https://www.flickr.com/photos/metanick/139214026/ http://media.peakprosperity.com/images/3-Perrow-from-Accidents-Normal.png

Douglas Squirrel talking about root-cause analysis: https://skillsmatter.com/skillscasts/1986-talk-by-squirrel

Notes on Squirrel's talk: http://www.markhneedham.com/blog/2011/12/10/the-5-whysroot-cause-analysis-douglas-squirrel/

Notes from John Bradshaw:

Normal accidents:

3 Mile Island Accident - Blamed Operators
Any system can and will fail, and you should plan for it to fail
2 Axis graph
- Complexity -> Simple
- Loose Coupling -> Tight Coupling
- Complex & Tightly Coupled = Accident
Complex system that is Loosely coupled is the CITCON open space set up evening
- We did not all rush to get food and beer
E.g had there been a Lion in there, 1 person could have warned rest
Chance to warn of danger
Simple but tightly coupled = Dam
- Accident is water gets through the damn
- Anything goes wrong with dam e.g. hole, no chance to resolve
- Simple to reason about, wall of rock with a hole in
- But is high risk
In nuclear plant accident, cooling system near radioactive rods
- Operators can see there was a leak, but no context e.g. they can see the leak is leaking near/into the radioactive rod storage which would lead to an accident
Book to Read: Normal Accidents by Perrow
Are micro services tightly coupled and complex?
- Depends
- It's down to design and implementation
Always strive to be in the bottom right corner of the graph, low complexity loosely coupled
How do people plan for failure?
- Rob - We go through a certification process to get into Retail
Each system that could fail is tested, e.g. chaos monkey style someone will manually go take down services
Internal team will run same tests internally before handing over to external certification team

How do you verify or even test your logging? Instance of a service that logged every time on failure, in a tight loop and filled the disks leading to further failure = Simple Tightly Coupled System

Root Cause Analysis

Scenario: Database deliberately down for maintenance. Instance of a service that logged every time on failure connecting to database, in a tight loop and filled the disks leading to further failure

Basic principals
- Everybody who was affected comes to the meeting
To identity cultural or people problems
Not allowed to place blame
Ask/poll everyone what was the problem
- Customer:
  - No system, was down, can't log on
- Operations:
  - Confused by phone call
- Customer Service:
  - Angry calls from customers, did not know what was going on
- Developer:
  - Database down, no disk space
  - Then ask why:
- Customer:
- Operations:
- Customer Service:
- Developer:
  - Why: Maintenance on database, database down
  - Why: Analysed log files, saw huge files, checked code, logged with no delay
  - Why: Developer skills lacking
  - Why: No code review/inspection
  - Why: Test for this logging case lacking
  - When QA tested database was running
  - QA too busy to investigate database failures cases
  - No new blood in organisation
  - QA assigned/overbooked to too many projects
  - Action: Maintenance on DB, have redundant database to switch to
  - Action: QA involved earlier

Actions must be assigned and completed with a timeframe e.g. 1 week
When you hit that uncomfortable silence half way down, keep pushing

The root cause of failure is always the culture in an organisation
It’s always about people e.g.
- The developer adding no delay to logging
- Lack of testing
Create a RCA timeline of failure
- At what time did system go down
- At what time did customers complain
- At what time did developers react
- At what time was the system back up
- Etc
Do as much technical investigation as possible before the RCA meeting
- Eg this was the problem
- We had these tests
  - But we didn’t have one for this scenario

Normal Accidents and Root Cause Analysis

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

Tools