MDD: Monitoring Driven Development
From the Rubber Chicken to MDD
- James Shore's Rubber Chicken
- physical token you had to get to commit (push) to main (it was svn back then), and you ran the build/tests before commit
- had to use a separate physical machine (solving the 'It works on my machine' problem)
- can run more stuff now (fast tests, slow tests) - but separate build for deploy
- pipelines with artifact passing
- promoting to test/prod
- CD - blue-green deploy - rolling back based on KPIs; CI + monitoring now controls production
- if any step fails, the change is automatically reverted
- if it made it to prod, but business metrics are down
- not reverting code
- take out from production cluster to investigate
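A minimal sketch of the decision described above (failed step: revert; business metric down in prod: pull the instance aside instead of reverting). The names `RolloutStatus`, `decide` and the 5% `max_drop` tolerance are assumptions made for illustration, not something from the session.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"        # keep rolling the new version out
    ROLL_BACK = "roll_back"    # a pipeline step failed: revert the change
    QUARANTINE = "quarantine"  # take the instance out of the cluster to investigate


@dataclass
class RolloutStatus:
    pipeline_passed: bool  # build, fast tests, slow tests, deploy steps all passed
    baseline_kpi: float    # business KPI on the old (blue) version
    candidate_kpi: float   # same KPI measured on the new (green) version


def decide(status: RolloutStatus, max_drop: float = 0.05) -> Decision:
    """Return what the pipeline should do with the candidate version."""
    if not status.pipeline_passed:
        return Decision.ROLL_BACK
    if status.baseline_kpi > 0:
        # Relative drop of the KPI compared to the old version.
        drop = (status.baseline_kpi - status.candidate_kpi) / status.baseline_kpi
        if drop > max_drop:
            # The code is not reverted; the instance is kept for investigation.
            return Decision.QUARANTINE
    return Decision.PROMOTE
```

Which KPI to compare and how much of a drop to tolerate are exactly the "monitoring first" questions the next section raises.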
State of the monitoring (first)
- metrics used in monitoring are not specific (high level business metric down, must have been this change)
- just like adding tests after writing the code is hard, so is adding monitoring/metrics
- who tried monitoring first?
- zsoldosp - checklist item in the issue template, but it didn't apply to too many issues, so it kinda got ignored after a while on that project
- PJ/intent media
- monitoring can stop deploy/rollout
- stopped doing acceptance tests in favor of monitoring
- aparker / TIM - failure analysis: we built it, now that we know how it works, let's figure out
- how could it fail
- what impact it would have
- how would we know (from customers? )
- is it worth adding it? (metric, alert)
- how many alerts should we create
- high level? e.g.: number failed API requests?
- more specific - e.g.: after debugging we know that it failed 'coz the middleware failed. Should we monitor the middleware?
- metrics vs. monitoring
- monitoring triggers someone to look at it
- metrics - kinda like classic ops - collect data, don't attach alerts, just eyeball "looks to be an unusual shape, let's investigate"
- who should we call (e.g.: if only high level metrics, who should the alerts wake up?)
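One way to picture the metrics vs. monitoring split and the "who should we call" question: a metric only collects samples, while a monitor adds a condition and an owner. This is a sketch under assumptions; `Metric`, `Monitor`, the thresholds and the owner names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Metric:
    """Collected data only; nobody is notified, people eyeball the shape."""
    name: str
    samples: List[float] = field(default_factory=list)

    def record(self, value: float) -> None:
        self.samples.append(value)


@dataclass
class Monitor:
    """A metric plus a condition plus someone to call."""
    metric: Metric
    is_unhealthy: Callable[[float], bool]
    owner: str  # hypothetical team or on-call rotation

    def check(self) -> Optional[str]:
        """Return who to notify if the latest sample is unhealthy, else None."""
        if self.metric.samples and self.is_unhealthy(self.metric.samples[-1]):
            return self.owner
        return None


# High level: number of failed API requests, owned by whoever is on call.
failed_requests = Metric("failed_api_requests")
high_level = Monitor(failed_requests, lambda v: v > 100, owner="general-on-call")

# More specific: the middleware that was found to be the cause while debugging.
middleware_errors = Metric("middleware_errors")
specific = Monitor(middleware_errors, lambda v: v > 0, owner="middleware-team")
```

With only the high-level monitor, every failure lands on the general on-call; the specific monitor knows its owner but only covers a failure mode somebody already thought of.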
"Failure Friday" practice
- during work hours!
- we think this should be redundant, so let's shut this off and see the team recover
- important: do it when you expect the exercise to be successful
Feature validation / AB testing
not the same as monitoring
- it's not always binary (on/off)
- normal is not the same as yesterday/last week / last year
- seasonality - e.g.: Black Friday, but can be different for each industry. And you kinda know it: "Mondays are usually about this many pageloads"
- event driven - e.g.: if you publish tips, it depends on what happens in the world
- factor into
- what can we measure
- what should we alert on (i.e.: wake people up). Some things can wait till the next business day - use different channels
- make them actionable
- link to the wiki or runbook on how to fix it
- write it for your future self who gets alerted at 2am at a party, not for your present self who knows the context of the feature you just implemented
- metrics you don't use are inventory, thus not useful
(question: any logging frameworks that would only flush logs on exceptions? but then on DEBUG level?)
- should we alert on causes (disk full) or symptoms (user can't login) (symptoms more useful? some tools allow dependencies, i.e.: if this is down, these others will be down too, don't alert on those)
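The dependency idea from the last point (if this is down, these others will be down too, don't alert on those) could be sketched like this. The `DEPENDS_ON` map and the check names are made up for illustration; real tools express dependencies in their own configuration formats.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: check -> checks it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "user_can_login": ["auth_service"],
    "auth_service": ["database"],
    "database": ["disk_space"],
    "disk_space": [],
}


def alerts_to_send(failing: Set[str], depends_on: Dict[str, List[str]]) -> Set[str]:
    """Keep only failing checks with no failing dependency, direct or transitive."""

    def has_failing_dependency(check: str, seen: Set[str]) -> bool:
        for dep in depends_on.get(check, []):
            if dep in seen:
                continue  # already visited; also guards against cycles
            seen.add(dep)
            if dep in failing or has_failing_dependency(dep, seen):
                return True
        return False

    return {c for c in failing if not has_failing_dependency(c, set())}
```

If everything from `disk_space` up to `user_can_login` is failing, only `disk_space` alerts. If only `user_can_login` is failing, the symptom itself alerts, so an unknown cause still wakes someone up.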
Workshop on MDD - 2 minutes to dropped jaws
Story: Given that currently our support lines are overwhelmed, if we added an FAQ about it, support calls would drop back to manageable levels
what can we measure?
- nr of FAQ views
- # of calls
- ask support reps to ask if caller read the FAQ & feed that back to the system?
- instead of "was this helpful" "yes/no" maybe we could have "yes/Call support (link/phone number)" (talk to UX before doing this at home :-))
=> the way you think of validation/measuring changes the product
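A rough sketch of checking the story's hypothesis against the measurements listed above. `Period`, `evaluate_story` and the `manageable_calls` threshold are hypothetical; how the counts get collected (web analytics, support reps feeding answers back) is left open.

```python
from dataclasses import dataclass


@dataclass
class Period:
    """Counts for one period (e.g. a week); all fields are hypothetical inputs."""
    support_calls: int
    faq_views: int = 0
    callers_who_read_faq: int = 0  # fed back by support reps


def evaluate_story(before: Period, after: Period, manageable_calls: int) -> dict:
    """Compare the period before the FAQ with the period after it."""
    return {
        "faq_was_viewed": after.faq_views > 0,
        "calls_dropped": after.support_calls < before.support_calls,
        "calls_manageable": after.support_calls <= manageable_calls,
        # A high share here suggests the FAQ is read but does not answer the question.
        "share_of_callers_who_read_faq": (
            after.callers_who_read_faq / after.support_calls
            if after.support_calls else 0.0
        ),
    }
```

The last field only exists if the product collects it, which is the point of the exercise: deciding how to validate changes what gets built.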
Monitoring Embedded into Business
- SRE handbook only focuses on the tech
- if decision makers use monitoring data, it's important for the business, thus no need to justify why monitoring
- My Philosophy on Alerting (based on my observations while I was a Site Reliability Engineer at Google) by Rob Ewaschuk: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
- Patrick Debois: Codifying devops practices: https://jedi.be/blog/2012/05/12/codifying-devops-area-practices/
- Timothy Fitz: Doing the impossible fifty times a day: http://timothyfitz.com/2009/02/10/continuous-deployment-at-imvu-doing-the-impossible-fifty-times-a-day/
good questions to ask:
- what does this data mean?
- If we are not watching it -> delete it?
- Should we try "Failure Friday"?
- Should we use "Daily Red"?
- Is this indicator fast enough (leading or lagging indicator) to react?