Monitoring driven development
Context: There was a company that, from the outside, was doing all the agile things; from the inside, however... It was a SaaS product, and they were incredibly immature in terms of running a production system - lots of tribal knowledge, lots of reliance on that one guy. We walked through the "marketecture" and went through every line, asking "what would happen if this arrow didn't work?". Things broke down very quickly. Then I thought about it from a TDD background - how would we develop software differently if we thought up front about how we are going to monitor it?
- Would this replace testing? Maybe some of it, maybe acceptance testing, but not unit testing or TDD. Or maybe running your acceptance tests in production is a form of monitoring: not replacing or removing acceptance testing before production, but running the acceptance tests in prod in addition to running them in the pipeline.
- Why isn't it done already? Probably because there has traditionally been a separation of concerns between the dev team and the ops/maintenance team.
- So, what are the goals? E.g. unit testing = code works as expected, acceptance testing = product works as expected ==> monitoring = user receives value as expected.
- E.g. at the origin of DevOps there was the idea of using what's happening in production to inform the next work in development: not only whether features are broken or not, but also whether features are used.
- Staggered rollout process: if unit tests or system tests fail - roll back the commit. If they pass - roll the change out to one node, monitor the number of dollars per minute that node produces, and compare it to the other nodes. If it produces fewer dollars - do not roll out the change more widely.
- Bringing telemetry into the conversation, the questions come up: when do you analyze it? And who gets alerted? For monitoring and alerting, you can't add them after the fact if the data doesn't exist; you need to design the system with alerting and monitoring in mind.
- You can also do something like that with a test customer account and feature flags. The number of feature flags creeps up though, and becomes very challenging to manage. There's a policy to remove feature flags that are older than 6 months, and there's an internal module to manage feature flags and remove flags that are older than the threshold.
- Sometimes it's easier to add monitoring to the existing code than to find seams, extract code, wrap it in tests, etc.
- Honeycomb was mentioned several times.
- The data volume problem shows up: what do you do when you have 1 TB of data per day (which happens if you have too much data)?
- Sounds like lots of us are doing it, so what's the problem? - We aren't discussing monitoring from the start, and we aren't developing with monitoring in mind. Monitoring is supposed to answer the question of whether the product is bringing the value it was intended to bring.
- Done vs Done-done: on one of the kanban boards we expanded it all the way to "1st customer used it", and we wouldn't take a card off the board until the first customer had used it.
- Product management spends a lot of time thinking about what to build, and might even think about market adoption - but rarely goes back once the feature is in use to monitor how well it performs. That leaves product management with no accountability. This could be a healthy thing to look at when the pressure is always on developers to build faster: it gives PMs accountability for choosing what to build. Because if we build trash faster, it won't make a difference.
- Another approach - check usage by month. How many people used the product in at least one month during the year? In two months? How come this feature, which people should use every day, is used in only one month of the year? So, distinguish between "is it working now" and "is it working over time".
- How to get PMs to pay attention? - When you're talking about a new feature, ask "How would we know that it works in production? How would we know it makes the impact we hope it makes?"
- "We are gonna A/B test it" - how many interactions do you need to see for a statistically significant result? How many users do you have? How much improvement do you want to see? Because in some situations it's going to take a year to get that information, and then A/B testing isn't a good solution.
- How much does a process help to solve a problem? Thinking about it from the Cynefin framework perspective: a situation can be simple, complicated, complex, or chaotic - in some situations a process is helpful, and in others you need the mastery of a person with expertise.
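
Sketch for the staggered rollout note above: a minimal gate that compares the dollars per minute of the one updated node against the other nodes. The function name, the 5% tolerance, and the use of a plain average are assumptions made for illustration; nothing in the session prescribed them.

```python
from statistics import mean


def should_widen_rollout(canary_dollars_per_min, baseline_dollars_per_min,
                         tolerance=0.05):
    """Decide whether to roll a change out beyond the first node.

    canary_dollars_per_min: samples from the node running the new change.
    baseline_dollars_per_min: samples from nodes still on the old version.
    tolerance: assumed acceptable relative drop (hypothetical default).
    """
    if not canary_dollars_per_min or not baseline_dollars_per_min:
        # No data is not evidence of health, so stay on the safe side.
        return False
    canary = mean(canary_dollars_per_min)
    baseline = mean(baseline_dollars_per_min)
    # Widen only if the canary earns at least (1 - tolerance) of the baseline.
    return canary >= baseline * (1 - tolerance)
```

A real gate would probably also want a minimum observation window and some handling of noisy traffic, which this sketch leaves out.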
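
Sketch for the feature flag note above: one way a flag-management module might list flags that are past the age threshold. The `FeatureFlag` record, the `stale_flags` helper, and the 183-day approximation of "6 months" are all hypothetical; the internal module mentioned in the session was not described in detail.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: "6 months" is approximated as 183 days.
MAX_FLAG_AGE = timedelta(days=183)


@dataclass(frozen=True)
class FeatureFlag:
    name: str
    created_on: date


def stale_flags(flags, today, max_age=MAX_FLAG_AGE):
    """Return the names of flags older than max_age, in input order."""
    return [flag.name for flag in flags if today - flag.created_on > max_age]
```

Running something like this in the pipeline and failing the build (or opening a ticket) for each stale flag is one way to keep the policy from depending on someone remembering it.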
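
Sketch for the "check usage by month" note above: counting in how many distinct months each user touched a feature, to separate "is it working now" from "is it working over time". The event shape (a user id plus a date) and both function names are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date


def active_month_counts(events):
    """Map each user id to the number of distinct months with any usage.

    events: iterable of (user_id, day) pairs, where day is a datetime.date.
    """
    months_by_user = defaultdict(set)
    for user_id, day in events:
        months_by_user[user_id].add((day.year, day.month))
    return {user: len(months) for user, months in months_by_user.items()}


def users_active_at_least(events, min_months):
    """Count users who used the feature in at least min_months months."""
    counts = active_month_counts(events)
    return sum(1 for n in counts.values() if n >= min_months)
```

Calling `users_active_at_least(events, 1)`, then with 2, 3, and so on gives the "one month? two months?" breakdown from the discussion.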
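
Sketch for the A/B testing note above: a rough estimate of how many users each variant needs, and how long that takes at a given traffic level. It uses the standard normal-approximation formula for comparing two proportions; the function names, the 5% significance level, and the 80% power are assumed defaults, not numbers from the session.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, expected_rate,
                            alpha=0.05, power=0.8):
    """Approximate users needed per variant for a two-sided test
    comparing two conversion rates (normal approximation)."""
    effect = expected_rate - baseline_rate
    if effect == 0:
        raise ValueError("expected_rate must differ from baseline_rate")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


def days_to_finish(n_per_variant, daily_users, variants=2):
    """Days needed if daily_users are split evenly across the variants."""
    return math.ceil(n_per_variant * variants / daily_users)
```

With these assumptions, detecting a lift from a 5% to a 6% conversion rate needs on the order of eight thousand users per variant; at 50 users a day that is most of a year, which is the situation where A/B testing stops being a practical answer.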