Testing Microservices
From CitconWiki
Revision as of 12:05, 4 February 2023
Conversation points
- Splunk stack and hashi wrapper for monitoring
- NOC (Network Ops Center) -
- New bank - common stack, Clojure, AWS
- Netflix also started with the same stack, but it diverged while assuming commonality of contracts. Netflix does, however, have fantastic monitoring practices
- Challenge - in a microservices environment, how do you catch issues before production if all tests pass in the individual environments?
- Heavy Monitoring
- Learning the process for a year before automating it was helpful (it showed that some automations are not needed)
- Disney provides client libraries, with versioning expressed as a min and max supported version. Netflix started with that approach but had to stop because there were too many different libraries (see the version-range sketch after this list).
- Sometimes new practices require a change in culture.
- Deploying across different geographic zones has to be staggered - deploy to the EU, wait 6 hours, confirm it works, then deploy to the next region (a staged-rollout sketch follows this list)
- Microservices can become a distributed monolith; one reason is relying on a shared datastore.
- Rather than trying to find issues in test by reproducing production (too expensive), Netflix invested in monitoring, in expediting fixes, and in implementing "roll back" and "roll forward".
- In the context of "roll back", do people use SLAs or KPIs? SLAs are helpful for making data-driven decisions about investing in improving certain services, but they can also become a stick for management to use in a toxic way.
- Is there a gold standard for data around service performance? An alternative is to find outliers by comparing against the average performance of similar services (see the outlier sketch after this list).
- With error budgets, in some cases you would halt deployment if a service isn't complying with specific metrics. Overrides are possible, but they are manual effort, and with timezones you sometimes need a duplicate team to handle them. Netflix decided against error budgets so as not to introduce the additional burden of bureaucracy around them (a deploy-gate sketch follows this list).
- When we need teams to update their services, saying they "should" didn't help - teams already have plenty of things they "need" to do. What worked for some folks: include this kind of work in the plan, and set the expectation with the company and leadership that engineering will spend 70% of its time on features and 30% on all the "stuff that comes up". It still takes time, but it works. The culture change, though, needs to come from the top down.
- So, what do you do when stuff still goes down in prod? There's an incident commander / coordinator / collator who finds who needs to roll back / roll forward whatever needs to be done, and then coordinates a post-mortem, which everyone can join. Action items are separated into short-term, medium-term, and long-term.
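To illustrate the min/max versioning point: a minimal sketch, in Python, of checking an installed client-library version against a declared range. The range, version numbers, and library behaviour are assumptions for illustration; the session did not describe Disney's or Netflix's actual tooling.

<pre>
# Minimal sketch: checking a client-library version against a declared
# min/max range, as in the Disney-style versioning note above.
# The range and version numbers are hypothetical.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# Each consumer declares the range of the shared client library it supports.
SUPPORTED_RANGE = SpecifierSet(">=2.3,<4.0")  # min and max

def is_supported(installed: str) -> bool:
    """Return True if the installed client-library version falls in the range."""
    return Version(installed) in SUPPORTED_RANGE

if __name__ == "__main__":
    for candidate in ["2.5.1", "4.1.0"]:
        print(candidate, "supported" if is_supported(candidate) else "out of range")
</pre>

The trade-off raised in the session is that once there are many such libraries, keeping every consumer inside every range becomes the burden Netflix walked away from.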
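A sketch of the staggered regional rollout described above: deploy one region, let it bake, confirm health, then move on. The region names, bake time, and deploy/health functions are hypothetical placeholders, not details from the session.

<pre>
# Sketch of a staggered rollout across geographic zones.
# Regions, bake time, and the deploy/health functions are hypothetical.
import time

REGIONS = ["eu-west-1", "us-east-1", "ap-southeast-1"]
BAKE_SECONDS = 6 * 60 * 60  # "wait 6 hours" from the notes

def deploy(region: str, version: str) -> None:
    print(f"deploying {version} to {region}")  # stand-in for the real deploy call

def healthy(region: str) -> bool:
    return True  # stand-in for a real monitoring check

def rollout(version: str, bake_seconds: int = BAKE_SECONDS) -> None:
    for region in REGIONS:
        deploy(region, version)
        time.sleep(bake_seconds)  # bake period before judging health
        if not healthy(region):
            print(f"{region} unhealthy, halting rollout")
            return  # the roll back / roll forward decision happens here
        print(f"{region} confirmed, moving to the next region")

if __name__ == "__main__":
    rollout("1.42.0", bake_seconds=1)  # short bake for a dry run
</pre>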
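A sketch of the "compare to similar services" idea: flag a service whose metric sits far from the average of its peer group rather than measuring against a fixed gold standard. The service names, metric, and 1.5-sigma threshold are invented for illustration.

<pre>
# Sketch: flag services whose metric is an outlier relative to the average
# of similar services. Service names, metrics, and the 1.5-sigma threshold
# are hypothetical.
from statistics import mean, pstdev

error_rates = {  # errors per 1000 requests, per service
    "catalog": 1.2,
    "search": 1.5,
    "checkout": 1.3,
    "recommendations": 6.8,
}

def outliers(metrics: dict[str, float], sigmas: float = 1.5) -> list[str]:
    avg = mean(metrics.values())
    spread = pstdev(metrics.values())
    return [name for name, value in metrics.items()
            if spread > 0 and abs(value - avg) > sigmas * spread]

if __name__ == "__main__":
    print(outliers(error_rates))  # "recommendations" stands out from its peers
</pre>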
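A sketch of the kind of deploy gate the error-budget discussion describes (and which Netflix chose not to adopt): deployment halts when a service is out of compliance with its metrics unless someone performs a manual override. The metric names, thresholds, and override flag are assumptions, not anything the participants specified.

<pre>
# Sketch of an error-budget-style deploy gate: block the deploy when the
# service is out of compliance, unless a human explicitly overrides it.
# Metric names, thresholds, and the override flag are hypothetical.
from dataclasses import dataclass

@dataclass
class ServiceMetrics:
    availability: float      # fraction of successful requests over the window
    p99_latency_ms: float

# Hypothetical compliance targets for the gate.
MIN_AVAILABILITY = 0.999
MAX_P99_LATENCY_MS = 300.0

def within_budget(m: ServiceMetrics) -> bool:
    return m.availability >= MIN_AVAILABILITY and m.p99_latency_ms <= MAX_P99_LATENCY_MS

def may_deploy(m: ServiceMetrics, manual_override: bool = False) -> bool:
    """Halt the deploy when metrics are out of compliance, unless overridden."""
    if within_budget(m):
        return True
    return manual_override  # the manual effort (and timezone pain) from the notes

if __name__ == "__main__":
    degraded = ServiceMetrics(availability=0.995, p99_latency_ms=410.0)
    print(may_deploy(degraded))                        # False -> deployment halted
    print(may_deploy(degraded, manual_override=True))  # True -> someone signed off
</pre>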