MDD: Monitoring Driven Development
From the Rubber Chicken to MDD
@jtf's "presentation"
- James Shore's Rubber Chicken
- physical token you had to get to commit (push) to main (it was svn back then), and you ran the build/tests before commit
- had to use a separate physical machine (solving the 'It works on my machine' problem)
- CI
- can run more stuff now (fast tests, slow tests) - but separate build for deploy
- pipelines with artifact passing
- promoting to test/prod
- CD - blue/green deploy - rolling back based on KPIs; CI + monitoring now controls production
- if any step fails, the change is automatically reverted
- if it made it to prod, but business metrics are down
- the code is not reverted
- instead, take it out of the production cluster to investigate (see the sketch after this list)
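A minimal sketch of that "monitoring controls production" idea, assuming hypothetical hooks fetch_kpi() and remove_from_cluster() into your own metrics store and deploy tooling (these names are not from the session):

```python
import time

KPI_BASELINE = 0.95        # e.g. the checkout success rate we normally see
OBSERVATION_MINUTES = 15   # how long to watch the KPI after cutover


def fetch_kpi() -> float:
    """Placeholder: query your metrics store for the business KPI."""
    raise NotImplementedError


def remove_from_cluster(node: str) -> None:
    """Placeholder: pull the suspect node out of the load balancer."""
    raise NotImplementedError


def post_deploy_gate(new_node: str) -> None:
    """Watch the KPI after a blue/green cutover. If it drops, don't revert
    the code - just take the new node out of production to investigate."""
    for _ in range(OBSERVATION_MINUTES):
        if fetch_kpi() < KPI_BASELINE:
            remove_from_cluster(new_node)
            return
        time.sleep(60)
    # KPI stayed healthy for the whole window: keep the release in rotation.
```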
State of the monitoring (first)
- metrics used in monitoring are often not specific (a high-level business metric is down, so "it must have been this change")
- just like adding tests after writing the code is hard, so is adding monitoring/metrics
- who tried monitoring first?
- zsoldosp - checklist item in the issue template, but there were too many issues it didn't apply to, so it kinda got ignored on that project after a while
- PJ/intent media
- monitoring can stop deploy/rollout
- stopped doing acceptance tests in favor of monitoring
- aparker / TIM - failure analyses: we built it, now that we know how it works, let's figure out
- how could it fail
- what impact it would have
- how would we know? (from customers?)
- is it worth adding it? (a metric, an alert) - see the sketch after this list
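One way to codify that failure-analysis checklist - a hypothetical FailureMode record, not a tool that was named in the session:

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    how_it_could_fail: str    # how could it fail?
    impact: str               # what impact would it have?
    detection: str            # how would we know? (from customers?)
    worth_monitoring: bool    # is it worth adding a metric/alert?


# Example entry (made up) for a payment middleware dependency:
middleware_down = FailureMode(
    how_it_could_fail="payment middleware stops accepting connections",
    impact="checkouts fail, revenue drops",
    detection="spike in failed API requests, support calls",
    worth_monitoring=True,
)
```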
Alerting
- how many alerts should we create
- high level? e.g.: number of failed API requests?
- more specific - e.g.: we know after debugging that it failed because the middleware failed. Should we monitor the middleware?
- metrics vs. monitoring
- monitoring triggers someone to look at it
- metrics - kinda like classic Ops - collect data, don't attach alerts, just eyeball "looks to be an unusual shape, let's investigate" (see the sketch after this list)
- who should we call (e.g.: if only high level metrics, who should the alerts wake up?)
- (pagerduty.com)
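A minimal sketch of that metrics vs. monitoring split. The statsd line protocol used here ("api.requests.failed:1|c" over UDP) is real; page_oncall() is a hypothetical stand-in for a paging integration such as PagerDuty:

```python
import socket

STATSD_ADDR = ("localhost", 8125)
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


def record_failed_api_request() -> None:
    """Metric: just collect the data point; nobody gets woken up."""
    _sock.sendto(b"api.requests.failed:1|c", STATSD_ADDR)


def page_oncall(message: str) -> None:
    """Placeholder for whatever actually wakes a human up."""
    raise NotImplementedError


def check_failure_rate(failed: int, total: int, threshold: float = 0.05) -> None:
    """Monitoring: attach a rule to the metric and trigger someone to look."""
    if total and failed / total > threshold:
        page_oncall(f"API failure rate {failed}/{total} is above {threshold:.0%}")
```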
"Failure Friday" practice
- during work hours!
- we think this component should be redundant, so let's shut it off and see whether the team recovers
- important: do it when you expect the exercise to be successful
Feature validation / AB testing
not the same as monitoring
Alert thresholds
- it's not always binary (on/off)
- normal is not the same as yesterday/last week / last year
- seasonality - e.g.: Black Friday, but it can be different for each industry. And you kinda know it: "Mondays are usually about this many pageloads"
- event driven - e.g.: if you publish tips, it depends on what happens in the world
- factor this into:
- what can we measure
- what should we alert on (i.e.: wake people up)? Some things can wait till the next business day - use different channels (see the sketch after this list)
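A rough sketch of a threshold that compares against "normal for this time" rather than an absolute number; get_pageloads() is a hypothetical metric query:

```python
from datetime import datetime, timedelta
from typing import Optional


def get_pageloads(start: datetime, end: datetime) -> int:
    """Placeholder: query your metrics store for pageloads in this window."""
    raise NotImplementedError


def pageload_drop_alert(now: datetime, max_drop: float = 0.5) -> Optional[str]:
    """Compare the last hour with the same hour a week ago, so Mondays are
    judged against Mondays; yearly events like Black Friday would need a
    longer lookback."""
    current = get_pageloads(now - timedelta(hours=1), now)
    week_ago = now - timedelta(days=7)
    baseline = get_pageloads(week_ago - timedelta(hours=1), week_ago)
    if baseline and current < baseline * (1 - max_drop):
        return f"pageloads down to {current} (same hour last week: {baseline})"
    return None  # within the normal seasonal range; nothing to wake anyone for
```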
Improving Alerts
- make them actionable
- link to wiki of runbook how to fix
- write it for your future self who gets alerted at 2am at a party, not with your present knowledge of the context of the feature you just implemented
- metrics you don't use are inventory, thus not useful
(question: is there any logging framework that would only flush logs on exceptions, but buffer them at DEBUG level? - see the sketch after this list)
- should we alert on causes (disk full) or symptoms (user can't log in)? (symptoms are more useful? some tools allow dependencies, i.e.: if this is down, these others will be down too - don't alert on those)
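Re: the logging question above - Python's standard library already supports this pattern: logging.handlers.MemoryHandler buffers records (including DEBUG) and only writes them out when a record at flushLevel or above arrives, or when the buffer fills up. A minimal sketch:

```python
import logging
import logging.handlers

target = logging.StreamHandler()              # where flushed records end up
buffering = logging.handlers.MemoryHandler(
    capacity=1000,                            # also flushes when this fills up
    flushLevel=logging.ERROR,                 # e.g. a logged exception
    target=target,
)

log = logging.getLogger("myapp")
log.setLevel(logging.DEBUG)                   # keep DEBUG detail in the buffer
log.addHandler(buffering)

log.debug("stays in memory for now")
try:
    1 / 0
except ZeroDivisionError:
    log.exception("boom")                     # ERROR level -> buffer gets flushed
```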
Workshop on MDD - 2 minutes to dropped jaws
Story: Given that our support lines are currently overwhelmed, if we added an FAQ about it, support calls would drop back to manageable levels
what can we measure? (a rough sketch of the counters follows after this list)
- nr of FAQ views
- # of calls
- ask support reps to ask whether the caller read the FAQ & feed that back into the system?
- instead of "was this helpful" "yes/no" maybe we could have "yes/Call support (link/phone number)" (talk to UX before doing this at home :-))
=> the way you think of validation/measuring changes the product
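A hypothetical sketch of those counters - none of these names come from the session, just an illustration of measuring the FAQ's effect:

```python
from collections import Counter
from datetime import date

faq_views = Counter()        # per-day FAQ page views
support_calls = Counter()    # per-day call volume
calls_after_faq = Counter()  # callers who say they already read the FAQ


def record_faq_view(day: date) -> None:
    faq_views[day] += 1


def record_support_call(day: date, caller_read_faq: bool) -> None:
    # caller_read_faq is what the support rep asks and feeds back in.
    support_calls[day] += 1
    if caller_read_faq:
        calls_after_faq[day] += 1


def calls_per_faq_view(day: date) -> float:
    # If the FAQ works, raw call volume (and this ratio) should drop back
    # toward manageable levels after it ships.
    views = faq_views[day]
    return support_calls[day] / views if views else float("inf")
```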
Monitoring Embedded into Business
- the SRE handbook only focuses on the tech
- if decision makers use monitoring data, it's important for the business, so there's no need to justify monitoring
Links
- My Philosophy on Alerting (based on my observations while I was a Site Reliability Engineer at Google) by Rob Ewaschuk: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
- Patrick Debois: Codifying devops practices: https://jedi.be/blog/2012/05/12/codifying-devops-area-practices/
- Doing the impossible fifty times a day: http://timothyfitz.com/2009/02/10/continuous-deployment-at-imvu-doing-the-impossible-fifty-times-a-day/
Good questions to ask:
- what does this data mean?
- If we are not watching it -> delete it?
- Should we try "Failure Friday"?
- Should we use "Daily Red"?
- Is this indicator fast enough to react to (leading or lagging indicator)?