<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://citconf.com/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Treaz</id>
	<title>CitconWiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://citconf.com/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Treaz"/>
	<link rel="alternate" type="text/html" href="https://citconf.com/wiki/index.php?title=Special:Contributions/Treaz"/>
	<updated>2026-05-08T13:37:22Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.35.11</generator>
	<entry>
		<id>https://citconf.com/wiki/index.php?title=MDD_Monitoring_Driven_Development&amp;diff=16635</id>
		<title>MDD Monitoring Driven Development</title>
		<link rel="alternate" type="text/html" href="https://citconf.com/wiki/index.php?title=MDD_Monitoring_Driven_Development&amp;diff=16635"/>
		<updated>2022-10-18T08:47:36Z</updated>

		<summary type="html">&lt;p&gt;Treaz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== From the Rubber Chicken to MDD ==&lt;br /&gt;
&lt;br /&gt;
@jtf&amp;#039;s &amp;quot;presentation&amp;quot; &lt;br /&gt;
&lt;br /&gt;
# James Shore&amp;#039;s Rubber Chicken &lt;br /&gt;
** physical token you had to hold in order to commit (push) to main (it was svn back then); you ran the build/tests before committing&lt;br /&gt;
** had to use a separate physical machine (solving the &amp;#039;It works on my machine&amp;#039; problem)&lt;br /&gt;
# CI&lt;br /&gt;
** can run more stuff now (fast tests, slow tests) - but separate build for deploy&lt;br /&gt;
# pipelines with artifact passing&lt;br /&gt;
# promoting to test/prod&lt;br /&gt;
# CD - blue green deploy - rolling back based on KPIs &amp;#039;&amp;#039;&amp;#039;CI + monitoring now controls production&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
** if any step fails, the change is automatically reverted&lt;br /&gt;
** if it made it to prod, but business metrics are down&lt;br /&gt;
*** not reverting code&lt;br /&gt;
*** take out from production cluster to investigate&lt;br /&gt;
&lt;br /&gt;
== State of the monitoring (first) ==&lt;br /&gt;
&lt;br /&gt;
* metrics used in monitoring are not specific (high level business metric down, must have been this change)&lt;br /&gt;
* just like adding tests after writing the code is hard, so is adding monitoring/metrics&lt;br /&gt;
* who tried monitoring first?&lt;br /&gt;
** zsoldosp - checklist item in the issue template, but it didn&amp;#039;t apply to too many issues, so it kinda got ignored on that project afterwards&lt;br /&gt;
** PJ/intent media&lt;br /&gt;
*** monitoring can stop deploy/rollout&lt;br /&gt;
*** stopped doing acceptance tests in favor of monitoring&lt;br /&gt;
** aparker / TIM - failure analysis: we built it, now that we know how it works, let&amp;#039;s figure out:&lt;br /&gt;
*** how could it fail&lt;br /&gt;
*** what impact it would have&lt;br /&gt;
*** how would we know (from customers? )&lt;br /&gt;
*** is it worth adding it? (metric, alert)&lt;br /&gt;
&lt;br /&gt;
== alerting ==&lt;br /&gt;
&lt;br /&gt;
* how many alerts should we create&lt;br /&gt;
** high level? e.g.: number failed API requests?&lt;br /&gt;
** more specific - e.g.: we know after debugging that it failed &amp;#039;coz the middleware failed. Should we monitor the middleware?&lt;br /&gt;
* metrics vs. monitoring &lt;br /&gt;
** monitoring triggers someone to look at it&lt;br /&gt;
** metrics - kinda like classic Ops - collect data, don&amp;#039;t attach metrics, just eyeball &amp;quot;looks to be an unusual shape, let&amp;#039;s investigate&amp;quot;&lt;br /&gt;
* who should we call (e.g.: if only high level metrics, who should the alerts wake up?)&lt;br /&gt;
* (pagerduty.com)&lt;br /&gt;
&lt;br /&gt;
== &amp;quot;Failure Friday&amp;quot; practice ==&lt;br /&gt;
* during work hours!&lt;br /&gt;
* we think this should be redundant, so let&amp;#039;s shut this off and see the team recover&lt;br /&gt;
* important: do it when you expect the exercise to be successful&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Feature validation / AB testing ==&lt;br /&gt;
&lt;br /&gt;
not the same as monitoring&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Alert thresholds ==&lt;br /&gt;
&lt;br /&gt;
* it&amp;#039;s not always binary (on/off) &lt;br /&gt;
* normal is not the same as yesterday / last week / last year&lt;br /&gt;
** seasonality - e.g.: black friday, but can be different for each industry. And you kinda know it: &amp;quot;Mondays are usually about this many pageloads&amp;quot;&lt;br /&gt;
** event driven - e.g.: if you publish tips, it depends on what happens in the world&lt;br /&gt;
* factor into&lt;br /&gt;
** what can we measure&lt;br /&gt;
** what should we alert on (i.e.: wake people up). Some things can wait till the next business day - use different channels&lt;br /&gt;
&lt;br /&gt;
== Improving Alerts ==&lt;br /&gt;
&lt;br /&gt;
* make them actionable&lt;br /&gt;
** link to a wiki page or runbook on how to fix it&lt;br /&gt;
** write it for your future self who gets alerted at 2am at a party, not with your present knowledge of the context of the feature you just implemented&lt;br /&gt;
* metrics you don&amp;#039;t use are inventory, thus not useful&lt;br /&gt;
&lt;br /&gt;
(question: any logging frameworks that would only flush logs on exceptions? but then on DEBUG level?)&lt;br /&gt;
&lt;br /&gt;
* should we alert on causes (disk full) or symptoms (user can&amp;#039;t login) (symptoms more useful? some tools allow dependencies, i.e.: if this is down, these others will be down too, don&amp;#039;t alert on those)&lt;br /&gt;
&lt;br /&gt;
== Workshop on MDD - 2 minutes to dropped jaws ==&lt;br /&gt;
&lt;br /&gt;
Story: Given that our support lines are currently overwhelmed, if we added an FAQ about it, support calls would drop back to manageable levels&lt;br /&gt;
&lt;br /&gt;
what can we measure?&lt;br /&gt;
&lt;br /&gt;
* nr of FAQ views&lt;br /&gt;
* # of calls&lt;br /&gt;
* ask support reps to ask if caller read the FAQ &amp;amp; feed that back to the system?&lt;br /&gt;
* instead of &amp;quot;was this helpful&amp;quot; &amp;quot;yes/no&amp;quot; maybe we could have &amp;quot;yes/Call support (link/phone number)&amp;quot; (talk to UX before doing this at home :-))&lt;br /&gt;
&lt;br /&gt;
=&amp;gt; the way you think of validation/measuring changes the product&lt;br /&gt;
&lt;br /&gt;
== Monitoring Embedded into Business ==&lt;br /&gt;
&lt;br /&gt;
* SRE handbook only focuses on the tech&lt;br /&gt;
* if decision makers use monitoring data, it&amp;#039;s important for the business, so there is no need to justify monitoring&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
&lt;br /&gt;
* My Philosophy on Alerting (based on my observations while I was a Site Reliability Engineer at Google) by Rob Ewaschuk: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit&lt;br /&gt;
* Patrick Debois: Codifying devops practices: https://jedi.be/blog/2012/05/12/codifying-devops-area-practices/&lt;br /&gt;
* Doing the impossible fifty times a day: http://timothyfitz.com/2009/02/10/continuous-deployment-at-imvu-doing-the-impossible-fifty-times-a-day/&lt;br /&gt;
&lt;br /&gt;
== good questions to ask: ==&lt;br /&gt;
* what does this data mean?&lt;br /&gt;
* If we are not watching it -&amp;gt; delete it?&lt;br /&gt;
* Should we try &amp;quot;Failure Friday&amp;quot;?&lt;br /&gt;
* Should we use &amp;quot;Daily Red&amp;quot;?&lt;br /&gt;
* Is this indicator fast enough (leading or lagging indicator) to react?&lt;/div&gt;</summary>
		<author><name>Treaz</name></author>
	</entry>
	<entry>
		<id>https://citconf.com/wiki/index.php?title=Self_care_part_2&amp;diff=16613</id>
		<title>Self care part 2</title>
		<link rel="alternate" type="text/html" href="https://citconf.com/wiki/index.php?title=Self_care_part_2&amp;diff=16613"/>
		<updated>2022-10-15T10:25:47Z</updated>

		<summary type="html">&lt;p&gt;Treaz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Attendees:&lt;br /&gt;
* [[Horia Constantin]]&lt;br /&gt;
* [[Katri Ordning]]&lt;br /&gt;
&lt;br /&gt;
Practiced for 5 mins doing 4:6 breathing&lt;br /&gt;
&lt;br /&gt;
Practiced 1 session of Wim Hof Method breathing.&lt;br /&gt;
&lt;br /&gt;
Discussed the simple ritual of daily journaling of unpleasant thoughts.&lt;/div&gt;</summary>
		<author><name>Treaz</name></author>
	</entry>
	<entry>
		<id>https://citconf.com/wiki/index.php?title=Self_care_part_2&amp;diff=16612</id>
		<title>Self care part 2</title>
		<link rel="alternate" type="text/html" href="https://citconf.com/wiki/index.php?title=Self_care_part_2&amp;diff=16612"/>
		<updated>2022-10-15T10:25:35Z</updated>

		<summary type="html">&lt;p&gt;Treaz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Attendees:&lt;br /&gt;
* [[Horia Constantin]]&lt;br /&gt;
* [[Katri Ordning]]&lt;br /&gt;
&lt;br /&gt;
Practiced for 5 mins doing 4:6 breathing&lt;br /&gt;
Practiced 1 session of Wim Hof Method breathing.&lt;br /&gt;
Discussed the simple ritual of daily journaling of unpleasant thoughts.&lt;/div&gt;</summary>
		<author><name>Treaz</name></author>
	</entry>
	<entry>
		<id>https://citconf.com/wiki/index.php?title=Self_care_part_2&amp;diff=16611</id>
		<title>Self care part 2</title>
		<link rel="alternate" type="text/html" href="https://citconf.com/wiki/index.php?title=Self_care_part_2&amp;diff=16611"/>
		<updated>2022-10-15T10:24:58Z</updated>

		<summary type="html">&lt;p&gt;Treaz: Created page with &amp;quot;Attendees: * Horia Constantin * Kata  Practiced for 5 mins doing 4:6 breathing Practiced 1 session of Wim Hof Method breathing. Discussed the simple ritual of daily jo...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Attendees:&lt;br /&gt;
* [[Horia Constantin]]&lt;br /&gt;
* [[Kata]]&lt;br /&gt;
&lt;br /&gt;
Practiced for 5 mins doing 4:6 breathing&lt;br /&gt;
Practiced 1 session of Wim Hof Method breathing.&lt;br /&gt;
Discussed the simple ritual of daily journaling of unpleasant thoughts.&lt;/div&gt;</summary>
		<author><name>Treaz</name></author>
	</entry>
	<entry>
		<id>https://citconf.com/wiki/index.php?title=Horia_Constantin&amp;diff=16600</id>
		<title>Horia Constantin</title>
		<link rel="alternate" type="text/html" href="https://citconf.com/wiki/index.php?title=Horia_Constantin&amp;diff=16600"/>
		<updated>2022-10-15T06:00:14Z</updated>

		<summary type="html">&lt;p&gt;Treaz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;https://horiaconstantin.com/&lt;br /&gt;
&lt;br /&gt;
https://mastodon.online/web/@treaz&lt;br /&gt;
&lt;br /&gt;
https://twitter.com/ConstantinHoria&lt;/div&gt;</summary>
		<author><name>Treaz</name></author>
	</entry>
	<entry>
		<id>https://citconf.com/wiki/index.php?title=Horia_Constantin&amp;diff=16564</id>
		<title>Horia Constantin</title>
		<link rel="alternate" type="text/html" href="https://citconf.com/wiki/index.php?title=Horia_Constantin&amp;diff=16564"/>
		<updated>2020-05-30T11:02:26Z</updated>

		<summary type="html">&lt;p&gt;Treaz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;https://horiaconstantin.com/&lt;br /&gt;
&lt;br /&gt;
https://twitter.com/ConstantinHoria&lt;/div&gt;</summary>
		<author><name>Treaz</name></author>
	</entry>
</feed>