"Shift left" observability - Get feedback at early stages

The difference between monitoring and observability (AWS): https://aws.amazon.com/compare/the-difference-between-monitoring-and-observability/

CITCON 2022 musings (s4nchez on Medium): https://s4nchez.medium.com/citcon-2022-musings-6e676166beec

Monitoring Driven Development (MDD) on Hacker News: https://news.ycombinator.com/item?id=9137021 (discussion of the blog post https://benjiweber.co.uk/blog/2015/03/02/monitoring-check-smells/)

See notes from CITCON NA 2023: https://citconf.com/wiki/index.php?title=Monitoring_driven_development

Related: https://citconf.com/wiki/index.php?title=Risk_management_and_voodoo_charms (Failure Analysis -> Impact Analysis -> Risk Analysis -> Business Case)

OpenTelemetry demo application: https://github.com/open-telemetry/opentelemetry-demo

Observability Driven Development (sponsored by Sumo Logic): https://stackoverflow.blog/2022/10/12/how-observability-driven-development-creates-elite-performers/

Failure Friday: https://www.pagerduty.com/blog/failure-fridays-four-years/

Doing the impossible 50 times a day: http://timothyfitz.com/2009/02/10/continuous-deployment-at-imvu-doing-the-impossible-fifty-times-a-day/ (continuous deployment at IMVU: a regression in business metrics halts the rollout of a new deployment; see the sketch below)
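
A minimal sketch of that kind of guard, assuming a hypothetical fetch_metric() helper in place of a real metrics backend (Prometheus, CloudWatch, Datadog, ...); the metric name, 5% threshold, and soak time are illustrative, not from the session:

    import random
    import sys
    import time

    def fetch_metric(name: str) -> float:
        # Placeholder: query your monitoring system here. The random value
        # only keeps this sketch self-contained and runnable.
        return 100.0 * random.uniform(0.9, 1.0)

    def guard_rollout(metric: str = "signups_per_minute",
                      max_drop: float = 0.05,
                      soak_seconds: int = 1) -> None:
        baseline = fetch_metric(metric)
        time.sleep(soak_seconds)       # let the new version take traffic
        current = fetch_metric(metric)
        drop = (baseline - current) / baseline
        if drop > max_drop:
            print(f"{metric} dropped {drop:.1%}; halting rollout")
            sys.exit(1)                # non-zero exit fails the deploy step
        print(f"{metric} changed {drop:.1%}; rollout continues")

    if __name__ == "__main__":
        guard_rollout()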


A ChatGPT summary of the session: during the discussion of shifting observability left, several key points were raised:

   Monitoring vs Observability: The distinction between monitoring metrics and user-centric observability was highlighted. Observability emphasizes understanding the "why" behind something going wrong, especially in distributed systems, whereas monitoring focuses on the "what" and real-time alerting.
   Testing infrastructure vs production infrastructure: The differences in scaling, failure rates, data issues, and capabilities between development environments and production environments were addressed. It was suggested that investigating the tradeoffs of using production-like infrastructure in testing could be valuable.
   SLO/SLA tests in the pipeline: The benefits of implementing Service Level Objective/Agreement (SLO/SLA) tests in the development pipeline were discussed, with the acknowledgement that they take real effort to implement (a sketch of such a test appears after this list).
   Time to feedback: The importance of reducing the time it takes to receive feedback, both from production and from continuous integration (CI) environments, was emphasized. Merging developer, tester, and integrator groups was suggested as a way to shorten the feedback loop.
   Development practices and context: The need to consider scalability and non-functional requirements (e.g., performance, security, incident management) during development and the importance of capturing user requirements in test cases were highlighted. The importance of context, such as team size and business uncertainty, was also emphasized in determining the level of investment in a scalable solution.
   Shift left: The concept of observability-driven development was introduced: teaching developers defensive programming and having them write logs and traces so that code behavior can be verified later in production. The aim is to reduce cognitive load and to detect and respond to issues earlier in the development process (a tracing sketch appears at the end of this page).
   Failure analysis and performance testing: Techniques such as creating architecture diagrams, performing failure analysis along the user journey, and involving key personas in performance testing were discussed. The goal is to ensure that new features do not break existing functionality and that they have the intended impact.
   Building a monitoring culture: The importance of making monitoring a first-class citizen in the development process, integrating metrics into canary deployments (as in the rollout-halt sketch above), and celebrating failures in testing as opportunities for learning and improvement was emphasized.
   Regression testing and simulating production conditions: The value of regression tests that simulate unexpected production conditions was highlighted, enabling teams to address potential issues proactively.
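
To make the SLO/SLA-tests point concrete, here is a minimal sketch of a latency SLO check that can run as an ordinary pytest test in CI. handle_request() is a hypothetical stand-in for the code path under test, and the 200 ms p95 objective and 100-sample count are illustrative:

    import time

    P95_SLO_SECONDS = 0.200
    SAMPLES = 100

    def handle_request() -> None:
        # Placeholder for the endpoint or code path the SLO applies to.
        time.sleep(0.01)

    def p95(values: list) -> float:
        # Nearest-rank p95 over the collected samples.
        ordered = sorted(values)
        return ordered[int(0.95 * (len(ordered) - 1))]

    def test_p95_latency_meets_slo() -> None:
        latencies = []
        for _ in range(SAMPLES):
            start = time.perf_counter()
            handle_request()
            latencies.append(time.perf_counter() - start)
        observed = p95(latencies)
        assert observed <= P95_SLO_SECONDS, (
            f"p95 latency {observed:.3f}s exceeds the {P95_SLO_SECONDS}s objective")

Run in the pipeline like any other test (e.g. pytest); a failing assertion fails the build, which is what puts the SLO check "in the pipeline".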

Overall, the session focused on the importance of shifting observability left in the development process, empowering developers to take responsibility for observability, and integrating monitoring and testing practices to improve the overall quality and reliability of systems.
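
As a companion to the "shift left" point above, a minimal sketch of instrumenting a code path with traces while writing it, using the OpenTelemetry Python SDK (see the demo application linked above; requires the opentelemetry-sdk package). The service, span, and attribute names are invented for illustration:

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    # Print spans to stdout for this sketch; a real service would export
    # them to a collector instead.
    trace.set_tracer_provider(TracerProvider())
    trace.get_tracer_provider().add_span_processor(
        SimpleSpanProcessor(ConsoleSpanExporter()))
    tracer = trace.get_tracer("checkout-service")

    def apply_discount(cart_total: float, code: str) -> float:
        # The span records the inputs and outcome a future on-call
        # engineer will need to answer "why did this go wrong?".
        with tracer.start_as_current_span("apply_discount") as span:
            span.set_attribute("cart.total", cart_total)
            span.set_attribute("discount.code", code)
            if code == "WELCOME10":
                result = cart_total * 0.9
            else:
                span.add_event("unknown discount code; charging full price")
                result = cart_total
            span.set_attribute("cart.total_after_discount", result)
            return result

    if __name__ == "__main__":
        apply_discount(50.0, "WELCOME10")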