- Splunk stack and a HashiCorp wrapper for monitoring
- NOC (Network Operations Center)
- New bank - common stack, Clojure, AWS
- Netflix also started with the same stack, but it diverged, assuming commonality of contracts. Netflix does have fantastic monitoring practices, though.
- Challenge - in a microservices environment, how do you catch issues before production if all tests passed in the individual environments?
- Heavy Monitoring
- Learning the process for a year before automating it was helpful (some automations are not needed)
- Disney provides client libraries. Versioning - each library declares a min and a max supported version. Netflix started with that approach, but had to stop because there were too many different libraries.
- Sometimes new practices require a change in culture.
- Deploying across different geographic zones has to be staggered - deploy to EU, wait for 6 hours, confirm it works, deploy to next region
- Microservices can become a distributed monolith; one reason is relying on a shared datastore.
- Rather than trying to find every issue in test by reproducing production (too expensive), Netflix invested in monitoring, in fixing issues quickly, and in implementing "roll back" and "roll forward".
- In the context of "roll back", do people use SLAs or KPIs? SLAs help make data-driven decisions about investing in improving certain services, but they can also become a stick for management to use in a toxic way.
- Is there a gold standard for data around service performance? An alternative - find outliers by comparing each service against the average performance of similar services.
- With error budgets, in some cases you'd halt deployment if a service isn't complying with specific metrics. You can override the halt, but that is manual effort, and across timezones you sometimes need a duplicate team to do it. Netflix decided against error budgets to avoid introducing the additional burden of bureaucracy around them.
- When we need teams to update their services, "should" didn't help - teams already have a lot of things they "need" to do. What worked for some folks - include this kind of work in the plan, and set the expectation with the company and leadership that engineering will spend 70% of its time on features and 30% on all kinds of "stuff that comes up". It still takes time, but it works. The culture change needs to come from the top down, though.
- So, what do you do when stuff still goes down in prod? There's an incident commander / coordinator / collator who finds out who needs to roll back or roll forward (or do whatever else needs to be done), and then coordinates a post mortem, which everyone can join. Action items are separated into short-term, medium-term, and long-term.
- COE (correction of error) can become a cause of tension: who's responsible for fixing things? How do you ensure teams are motivated and enabled to spend time fixing their things?
- A lot of engineers understand technical KPIs but not business KPIs. Helping developers see how their work affects actual business metrics proved helpful. And the severity of an incident must depend on the business impact, not on internal politics (e.g. trying to hide incidents under "p2" to stay off the CEO's radar).
- "If it's painful - do it more often" principle
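
The staggered regional deployment from the notes (deploy to one region, wait, confirm, move on) could be sketched roughly as below. This is only an illustration: `staggered_rollout`, the callback names, and the region names are hypothetical, and the `soak` callback stands in for the multi-hour wait rather than actually sleeping.

```python
from typing import Callable, List, Optional, Tuple


def staggered_rollout(
    regions: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    soak: Callable[[str], None],
) -> Tuple[List[str], Optional[str]]:
    """Deploy region by region; stop and roll back at the first unhealthy region.

    Returns (regions confirmed healthy, region that failed or None).
    """
    confirmed: List[str] = []
    for region in regions:
        deploy(region)
        soak(region)  # in a real pipeline: wait (e.g. 6 hours) while monitoring
        if not is_healthy(region):
            rollback(region)  # only the failing region is rolled back here
            return confirmed, region
        confirmed.append(region)
    return confirmed, None


# Demo with stand-in callbacks that just record what happened.
events: List[str] = []
confirmed, failed = staggered_rollout(
    regions=["eu", "us-east", "apac"],
    deploy=lambda r: events.append(f"deploy {r}"),
    is_healthy=lambda r: r != "us-east",  # pretend us-east fails its checks
    rollback=lambda r: events.append(f"rollback {r}"),
    soak=lambda r: events.append(f"soak {r}"),
)
print(confirmed, failed)
print(events)
```

Whether earlier regions should also be rolled back when a later one fails is a policy choice this sketch leaves open.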
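
The "find outliers compared to similar services" idea from the notes can be illustrated with a simple z-score check. This is a minimal sketch, not something described in the session: the function name `find_outliers`, the threshold of 2 standard deviations, and the latency figures are all assumptions made up for the example.

```python
from statistics import mean, pstdev
from typing import Dict, List


def find_outliers(latencies_ms: Dict[str, float], threshold: float = 2.0) -> List[str]:
    """Return services whose latency is more than `threshold` standard
    deviations away from the mean of the peer group."""
    values = list(latencies_ms.values())
    avg = mean(values)
    spread = pstdev(values)  # population standard deviation of the group
    if spread == 0:
        return []  # all services behave identically; nothing stands out
    return sorted(
        name
        for name, value in latencies_ms.items()
        if abs(value - avg) / spread > threshold
    )


# Hypothetical p99 latencies (ms) for a group of similar services.
peers = {
    "checkout": 120.0,
    "search": 110.0,
    "profile": 115.0,
    "cart": 118.0,
    "recs": 112.0,
    "billing": 400.0,
}
print(find_outliers(peers))  # ['billing']
```

With small groups a single extreme value inflates the mean and standard deviation, so a median-based measure may be a more robust choice in practice.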
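
For readers unfamiliar with error budgets, the deployment gate with a manual override mentioned in the notes might look roughly like this. It is a sketch under assumptions: the names `error_budget_remaining` and `may_deploy`, the 99.9% SLO, and the request counts are hypothetical and not taken from any team in the discussion.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    The budget is the number of failures the SLO tolerates in the window.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        # No failures are tolerated (or no traffic): any failure exhausts the budget.
        return 0.0 if failed_requests == 0 else -1.0
    return 1.0 - failed_requests / allowed_failures


def may_deploy(slo_target: float, total_requests: int, failed_requests: int,
               manual_override: bool = False) -> bool:
    """Allow deployment while budget remains, or when a human overrides the halt."""
    if manual_override:
        return True
    return error_budget_remaining(slo_target, total_requests, failed_requests) > 0


# 99.9% SLO over 1,000,000 requests tolerates roughly 1,000 failures.
print(may_deploy(0.999, 1_000_000, 250))                         # budget left
print(may_deploy(0.999, 1_000_000, 1_200))                       # budget overspent
print(may_deploy(0.999, 1_000_000, 1_200, manual_override=True)) # human override
```

The override path is where the process cost noted above shows up: someone has to be available, in the right timezone, to make that call.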