Difference between revisions of "Configuration Changes"

From CitconWiki
(minutes of the session on configuration rollback/rollforward)
 
(Incorporated other notes (Peter Zsoldos))

Revision as of 11:24, 16 November 2011

Configuration changes/Release Rolling Back vs. Rolling Forward Session

Planning release steps while maintaining invariants

  • Andy Parker's master thesis
  • https://github.com/zaphod42/Coeus
  • start from a cluster configuration
  • how do you roll out a new service while keeping things going
  • There can be no SPOF
  • Define a new, Prolog-like language to describe rules/policies -> turn the declarative, policy-based language into a Puppet execution plan
  • match execution plans (goal based) against policies
  • provisioning large systems
  • idea: model-checking failing systems
  • applying model-checking to sysadmin work
  • real world failures are complex, how do you model them? Problem with all model checking approaches
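
The policy-checking idea above can be sketched in a few lines of Python. This is purely illustrative (not the actual Coeus implementation): a "no SPOF" policy is re-checked after every step of a candidate execution plan, so a rollout order that would briefly leave a service on a single node is rejected.

```python
# Illustrative sketch: match an execution plan against a declarative
# "no single point of failure" policy. Cluster layout and plan format
# are hypothetical, chosen only to demonstrate the technique.

def violates_no_spof(cluster, service):
    """A service is a SPOF if it runs on fewer than two nodes."""
    nodes = [n for n, services in cluster.items() if service in services]
    return len(nodes) < 2

def plan_is_safe(cluster, plan, service):
    """Apply each (node, action) step and re-check the policy invariant."""
    state = {node: set(svcs) for node, svcs in cluster.items()}
    for node, action in plan:
        if action == "stop":
            state[node].discard(service)
        elif action == "start":
            state[node].add(service)
        if violates_no_spof(state, service):
            return False
    return True

cluster = {"a": {"web"}, "b": {"web"}, "c": set()}
# Rolling upgrade that starts a new instance before stopping an old one:
safe_plan = [("c", "start"), ("a", "stop"), ("a", "start"), ("c", "stop")]
# Naive plan that stops both old instances first:
bad_plan = [("a", "stop"), ("b", "stop"), ("c", "start")]

print(plan_is_safe(cluster, safe_plan, "web"))  # True
print(plan_is_safe(cluster, bad_plan, "web"))   # False
```

A goal-based planner would search over step orderings and emit only plans for which this check holds, which is the "turn declarative policies into an execution plan" idea in miniature.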

Why does it seem to be that so few plan for reverting releases?

  • discussion starting point: tools are concerned with going forward, and so are teams. There are usually no explicit backout plans, or if there are, they are rarely tested - even though some of the work could be automated (contrast this with the 3am call and hacking a bugfix forward)
  • problematic naming: rollback is a bad word, backout is better
  • you cannot roll-back time
  • continuous rolling forward: what happens when something goes wrong during deployment?
  • migrating databases? Whole concept of db refactoring (see Scott W. Ambler's book)
  • django south, rails migrations, etc.
  • link w/ application architecture: isolating things prevents failure propagation
  • migrations use different data models
  • need a 3rd pipeline to build the data (after code, infrastructure)
  • e.g. anonymizing data: cannot be rolled back, needs to be done in production
  • once you have gone forward there are too many paths to go back
  • depends on your scenario? What's the difference between roll-forward/roll-back
  • fail in unexpected way (corrupting data could affect your application)
  • "stopping time" by switching systems (maintain parallel installations of systems)
  • easy to have a default rollback for the mainline scenario without losing newly gathered data (e.g. we added a new field to the signup form and it needs to be backed out; we can remove the field and keep all data, even from customers who signed up after the release)
  • what about feature toggles? They could be used to handle such cases.
  • basic issue w/ the idea of rolling back: it means losing data; you cannot roll back your data
  • you should implement a rollback scenario if you can (depends on the risk, costs...)
  • the effort to do it correctly is much higher than most people do
  • snapshot: need to be in a consistent state
  • no way to roll back after some time has passed (e.g. deploy at the weekend, failure occurs on weekdays)
  • if rollback is not possible, be aware of it and prepared to roll forward
  • come up with a design where you don't have to do it: lowers the risk enough...
  • clever systems allow doing it by connection, by user, by feature
  • allows tuning for some users, e.g. providing a resource-consuming feature to part of the user base, not to all users
  • DI is better than a feature branch for doing that
  • deploy schema changes alongside the code
  • just add to the database, do not remove anything - all older versions of the app can use the new schema (consider meaningful defaults, and beware of the potential performance hit from increased record size)
  • featuretoggles used to test new database
  • deploying schemas in advance gives you more confidence (but does not solve the rollback problem) - database shadowing, so it's like additive-only schema changes, just temporary rather than forever
  • running live data through a secondary installation that contains the old version
  • event sourcing provides the ability to replay stuff
  • problem: how much time does it take?
  • but the events have schemas themselves...
  • finding ways to mitigate your inability to do anything about something going wrong
  • reducing the barrier to going in production: being minutes away from delivering
  • how do we make people more aware of the problem? lot of developers have not worked on the ops part, dealing with the unexpected
  • Google engineers are on ops for a month after pushing a new release of a piece of software
  • product teams actually run the software (not always feasible due to regulations)
  • the whole forward/backwards discussion is not concerned with undoing multiple releases
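
The additive-only schema change described above can be demonstrated with SQLite (a stand-in here for whatever relational database is in use; the table and column names are made up for the example). The new column gets a meaningful default, so the old version of the application keeps working against the new schema, and backing out the release loses no data.

```python
# Minimal sketch of an additive-only schema change: add a new column
# with a meaningful default, never drop or rename existing columns.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signup (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO signup (email) VALUES ('old@example.com')")

# The release adds the new field; existing rows take the default.
conn.execute("ALTER TABLE signup ADD COLUMN referral TEXT DEFAULT 'unknown'")

# The new app version writes the new field...
conn.execute(
    "INSERT INTO signup (email, referral) VALUES ('new@example.com', 'ad')"
)
# ...while the old app version's INSERTs (e.g. after a backout) still
# work, because the column has a default.
conn.execute("INSERT INTO signup (email) VALUES ('rolled-back@example.com')")

rows = conn.execute("SELECT email, referral FROM signup ORDER BY id").fetchall()
for row in rows:
    print(row)
```

Because no column was removed, "rolling back" here is just redeploying the old code; the data gathered after the release (including the new column's values) stays intact for a later roll-forward.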


Some given scenarios that you can't recover from in a planned way

  • the new release of the application starts to generate gibberish data. How do you downgrade to the previous version and restore old data and clean data that has been generated since?
  • does your backout script work when the release has not completed, but failed halfway through?
  • what do you do with large amounts of data (though this might already be a problem for the actual release)?
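
The half-finished-release question above is usually answered by making every backout step idempotent: each undo checks the current state before acting instead of blindly reversing, so the script works whether the release completed, failed halfway, or was already backed out once. The sketch below assumes a hypothetical symlink-based deployment layout.

```python
# Hedged sketch: an idempotent backout step for a symlink-based deploy.
# Running it twice, or running it when the failed release never got as
# far as flipping the symlink, leaves the system in the same state.
import os
import tempfile

def backout_current_symlink(link, previous_release):
    """Point the 'current' symlink back at the previous release,
    whatever state the failed release left it in."""
    if os.path.islink(link):
        if os.readlink(link) == previous_release:
            return  # already backed out (or the release never flipped it)
        os.remove(link)
    os.symlink(previous_release, link)

with tempfile.TemporaryDirectory() as root:
    old = os.path.join(root, "release-1")
    new = os.path.join(root, "release-2")
    link = os.path.join(root, "current")
    os.mkdir(old)
    os.mkdir(new)
    os.symlink(new, link)  # the failed release flipped the symlink

    backout_current_symlink(link, old)
    backout_current_symlink(link, old)  # safe to run again
    print(os.readlink(link) == old)  # True
```

The same check-then-act pattern applies to restoring config files or re-enabling services: a backout script built from such steps can be run from any intermediate state, which is exactly what a release that "failed halfway through" requires.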


And unfortunately database-level application integration (many apps reading and writing the same database tables) is not yet extinct.