MDD Monitoring Driven Development

From the Rubber Chicken to MDD

@jtf's "presentation"

James Shore's Rubber Chicken

- physical token you had to hold before you could commit (push) to main (it was svn back then); you ran the build and tests before committing
- the build had to run on a separate physical machine (solving the "it works on my machine" problem)

CI

- can run more stuff now (fast tests, slow tests) - but separate build for deploy

pipelines with artifact passing
promoting to test/prod
CD - blue-green deploy - rolling back based on KPIs
CI + monitoring now controls production

- if any step fails, the change is automatically reverted
- if it made it to prod but business metrics are down
  - don't revert the code
  - take it out of the production cluster to investigate
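
A minimal sketch of the decision described above, assuming the pipeline reports one pass/fail flag per step and a single business KPI is compared against a baseline. The names `Action`, `decide` and `max_drop`, and the 10% figure, are made up for illustration and were not part of the session:

```python
from enum import Enum


class Action(Enum):
    PROMOTE = "promote"
    REVERT = "revert"
    QUARANTINE = "quarantine"  # pull from the cluster, keep it around to investigate


def decide(step_results, in_production, kpi_now, kpi_baseline, max_drop=0.10):
    """Pick the next action for a change moving through the pipeline.

    step_results: list of booleans, one per pipeline step (hypothetical input).
    max_drop: tolerated relative KPI drop; 10% is an arbitrary placeholder.
    """
    # Any failed step means the change never goes further.
    if not all(step_results):
        return Action.REVERT
    # In production, a KPI drop does not revert the code; the instance is
    # taken out of rotation so the cause can be investigated.
    if in_production and kpi_baseline > 0:
        drop = (kpi_baseline - kpi_now) / kpi_baseline
        if drop > max_drop:
            return Action.QUARANTINE
    return Action.PROMOTE
```

Keeping the decision in one pure function makes it easy to test the rollout policy without a real pipeline.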

State of the monitoring (first)

metrics used in monitoring are not specific (a high-level business metric is down, so it must have been this change)
just as adding tests after writing the code is hard, so is adding monitoring/metrics afterwards
who has tried monitoring first?
- zsoldosp - checklist item in the issue template, but it didn't apply to too many issues, so it mostly got ignored on that project
- PJ/intent media
  - monitoring can stop deploy/rollout
  - stopped doing acceptance tests in favor of monitoring
- aparker / TIM - failure analysis: we built it, and now that we know how it works, let's figure out
  - how could it fail
  - what impact would it have
  - how would we know (from customers?)
  - is it worth adding it? (metric, alert)

alerting

how many alerts should we create?
- high level? e.g.: number of failed API requests?
- more specific? e.g.: after debugging we know it failed because the middleware failed. Should we monitor the middleware?
metrics vs. monitoring
- monitoring triggers someone to look at it
- metrics - kinda like classic ops - collect data, don't attach alerts, just eyeball it: "looks to be an unusual shape, let's investigate"
who should we call? (e.g.: if there are only high-level metrics, who should the alerts wake up?)
(pagerduty.com)

"Failure Friday" practice

during work hours!
we think this should be redundant, so let's shut this off and see the team recover
important: do it when you expect the exercise to be successful

Feature validation / AB testing

not the same as monitoring

Alert thresholds

it's not always binary (on/off)
normal is not the same as yesterday / last week / last year
- seasonality - e.g.: Black Friday, but it can be different for each industry. And you kinda know it: "Mondays are usually about this many pageloads"
- event driven - e.g.: if you publish tips, it depends on what happens in the world
factor in
- what can we measure
- what should we alert on (i.e.: wake people up). Some things can wait till the next business day - use different channels
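
One way to sketch the "Mondays are usually about this many pageloads" idea: compare the current value with the median of the same weekday in earlier weeks, then pick a channel. The function names, the 30% tolerance and the channel labels are assumptions for illustration, not something the group agreed on:

```python
from statistics import median


def is_unusual(current, same_weekday_history, tolerance=0.3):
    """Return True when `current` deviates from the typical value for this weekday.

    same_weekday_history: values seen on the same weekday in earlier weeks.
    tolerance: allowed relative deviation; 0.3 (30%) is only a placeholder.
    """
    if not same_weekday_history:
        return False  # no baseline yet, so nothing to compare against
    baseline = median(same_weekday_history)
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / baseline > tolerance


def channel_for(unusual, user_facing):
    """Wake someone up only when users are affected; the rest can wait."""
    if not unusual:
        return None
    return "page" if user_facing else "next-business-day"
```

This does not cover event-driven spikes; those would need an extra input (e.g. a calendar of known events) that is left out here.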

Improving Alerts

make them actionable
- link to a wiki page or runbook on how to fix it
- write it for your future self who gets alerted at 2am at a party, not with your present knowledge of the context of the feature you just implemented
metrics you don't use are inventory, thus not useful

(question: any logging frameworks that would only flush logs on exceptions? but then on DEBUG level?)
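
A possible answer for Python, as a sketch: the standard library's `logging.handlers.MemoryHandler` buffers records and forwards them to a target handler once a record at `flushLevel` or above arrives. Note that it also flushes when the buffer reaches `capacity`, so it is not strictly "only on exceptions". `make_logger` and `ListHandler` are hypothetical helpers written for this example:

```python
import logging
import logging.handlers


class ListHandler(logging.Handler):
    """Illustrative target that collects formatted messages in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def make_logger(name, target, capacity=1000):
    """Buffer all records, including DEBUG, and emit them once an ERROR shows up."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep the buffered records away from the root logger
    buffered = logging.handlers.MemoryHandler(
        capacity=capacity,         # the buffer also flushes when it reaches this size
        flushLevel=logging.ERROR,  # a record at this level or above triggers a flush
        target=target,
        flushOnClose=False,        # drop buffered records on a clean shutdown
    )
    logger.addHandler(buffered)
    return logger
```

`logger.exception(...)` logs at ERROR level, so calling it inside an `except` block would flush the preceding DEBUG lines along with the traceback message.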

should we alert on causes (disk full) or symptoms (user can't login) (symptoms more useful? some tools allow dependencies, i.e.: if this is down, these others will be down too, don't alert on those)
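
The dependency idea mentioned above could look roughly like this: alert only on failing checks whose own direct dependencies are healthy. This is a sketch, not how any particular tool implements it; `alerts_to_send` and the check names are invented for the example:

```python
def alerts_to_send(failing, depends_on):
    """Keep only failing checks whose direct dependencies are all healthy.

    failing: set of check names that are currently failing.
    depends_on: dict mapping a check to the checks it directly depends on.
    """
    return {
        check
        for check in failing
        if not any(dep in failing for dep in depends_on.get(check, ()))
    }


# Hypothetical setup: login needs the auth database, which needs disk space.
DEPENDS_ON = {"user_login": ["auth_db"], "auth_db": ["disk_space"]}
```

With all three checks failing, only `disk_space` would alert; if only `user_login` fails, the symptom itself alerts.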

Workshop on MDD - 2 minutes to dropped jaws

Story: Given that our support lines are currently overwhelmed, if we added an FAQ about it, support calls would drop back to manageable levels

what can we measure?

number of FAQ views
number of calls
ask support reps to ask if caller read the FAQ & feed that back to the system?
instead of "was this helpful" "yes/no" maybe we could have "yes/Call support (link/phone number)" (talk to UX before doing this at home :-))

=> the way you think of validation/measuring changes the product

Monitoring Embedded into Business

SRE handbook only focuses on the tech
if decision makers use monitoring data, it's important for the business, so there is no need to justify monitoring

Links

My Philosophy on Alerting (based on my observations while I was a Site Reliability Engineer at Google) by Rob Ewaschuk: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
Patrick Debois: Codifying devops practices: https://jedi.be/blog/2012/05/12/codifying-devops-area-practices/
Doing the impossible fifty times a day: http://timothyfitz.com/2009/02/10/continuous-deployment-at-imvu-doing-the-impossible-fifty-times-a-day/

good questions to ask:

what does this data mean?
If we are not watching it -> delete it?
Should we try "Failure Friday"?
Should we use "Daily Red"?
Is this indicator fast enough (leading or lagging indicator) to react?
