Difference between revisions of "Testing Microservices"

From CitconWiki

Revision as of 12:05, 4 February 2023

Conversation points

  • Splunk stack and hashi wrapper for monitoring
  • NOC (Network Ops Center) -
  • New bank - common stack, Clojure, AWS
  • Netflix also started with the same stack, but it diverged, assuming commonality of contracts. That said, Netflix has fantastic monitoring practices.
  • Challenge - in a microservices environment, how do you catch issues before production if all tests passed in the individual environments?
  • Heavy Monitoring
  • Learning the process for a year before automating it was helpful (some automations are not needed)
  • Disney provides client libraries, with versioning that specifies min and max. Netflix started with that, but had to stop because there were too many different libraries.
  • Sometimes new practices require change in culture.
  • Deploying across different geographic zones has to be staggered - deploy to EU, wait 6 hours, confirm it works, then deploy to the next region.
  • Microservices could become a distributed monolith; one reason is relying on a shared datastore.
  • Rather than trying to find issues in test and trying to reproduce production (too expensive), Netflix invested in monitoring, in expediting fixes, and in implementing "roll back" and "roll forward".
  • In the context of "roll back", do people use SLAs or KPIs? SLAs help make data-driven decisions about investing in improving certain services, but they can also become a stick for management to use in a toxic way.
  • Is there a gold standard for data around service performance? An alternative is to find outliers by comparing each service against the average performance of similar services.
  • With error budgets, in some cases you'd halt deployment if a service isn't complying with specific metrics. You could override it, but that's manual effort, and with timezones you sometimes need a duplicate team. Netflix decided against error budgets to avoid introducing the additional burden of bureaucracy around them.
  • When we need teams to update their services, "should" didn't help - teams already have a lot of things they "need" to do. What worked for some folks: include this kind of work in the plan, and set the expectation with the company and leadership that engineering will spend 70% of its time on features and 30% on all kinds of "stuff comes up" work. It still takes time, but it works. The culture change needs to come from the top down.
  • So, what do you do when stuff still goes down in prod? There's an incident commander / coordinator / collator who finds who needs to roll back or roll forward whatever needs to be done, and then coordinates a post mortem, which everyone can join. Action items are separated into short-term, medium-term, and long-term.
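
Two of the points above (finding outliers against similar services, and halting deployment on an error budget) can be sketched roughly in code. This is only an illustration, not what any of the companies in the session actually run: the function names, the "twice the peer average" rule, the service names, and all numbers are hypothetical. The error-budget check assumes the common definition where the budget is the share of requests allowed to fail under a target (1 minus the target).

```python
from statistics import mean


def find_outliers(latency_ms, max_ratio=2.0):
    """Flag services much slower than the average of their peers.

    latency_ms maps a (hypothetical) service name to its latency.
    Each service is compared against the mean of the *other* services,
    so one bad service does not hide itself by inflating the average.
    """
    flagged = []
    for name, value in latency_ms.items():
        peers = [v for other, v in latency_ms.items() if other != name]
        if not peers:
            continue  # nothing to compare against
        peer_avg = mean(peers)
        if peer_avg > 0 and value > max_ratio * peer_avg:
            flagged.append(name)
    return sorted(flagged)


def deploy_allowed(target, total_requests, failed_requests, override=False):
    """Return False when the error budget for the window is used up.

    target is the success-rate objective (e.g. 0.999). The override flag
    stands in for the manual override mentioned in the notes.
    """
    if override or total_requests == 0:
        return True
    allowed_failures = (1 - target) * total_requests
    return failed_requests <= allowed_failures


if __name__ == "__main__":
    # Hypothetical latencies for four similar services.
    services = {"cart": 120, "search": 110, "profile": 130, "billing": 480}
    print(find_outliers(services))
    print(deploy_allowed(0.999, 100_000, 150))
    print(deploy_allowed(0.999, 100_000, 150, override=True))
```

The override path is where the manual effort (and the timezone problem) mentioned in the notes would show up: someone has to be awake to approve it.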