Configuration Changes

From CitconWiki
Revision as of 12:24, 16 November 2011 by Zsoldosp (talk | contribs) (Incorporated other notes (Peter Zsoldos))

Configuration changes/Release Rolling Back vs. Rolling Forward Session

Planning release steps while maintaining invariants

  • Andy Parker's master's thesis
  • https://github.com/zaphod42/Coeus
  • start from a cluster configuration
  • how do you roll out a new service while keeping things going
  • There can be no SPOF
  • Define a new language (think Prolog-like) to describe rules/policies -> turn the declarative, policy-based language into a Puppet execution plan
  • match execution plans (goal based) against policies
  • provisioning large systems
  • idea: model-checking failing systems
  • applying model checking to sysadmin
  • real-world failures are complex, how do you model them? This is a problem with all model-checking approaches
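
As an illustration of checking a release plan against an invariant, here is a minimal sketch. It is not Coeus and not the approach from the thesis: the state model (service name -> set of running nodes), the "start"/"stop" steps and the function names are all assumptions made up for this example. The "no SPOF" rule is simplified to "every service keeps at least two running nodes after every step".

```python
from typing import Dict, List, Set, Tuple

State = Dict[str, Set[str]]   # hypothetical model: service name -> nodes serving it
Step = Tuple[str, str, str]   # (action, service, node); action is "start" or "stop"


def no_spof(state: State, min_instances: int = 2) -> bool:
    """Invariant: every service keeps at least `min_instances` running nodes."""
    return all(len(nodes) >= min_instances for nodes in state.values())


def apply_step(state: State, step: Step) -> State:
    """Return a new state with one step applied; the input state is not modified."""
    action, service, node = step
    new_state = {svc: set(nodes) for svc, nodes in state.items()}
    nodes = new_state.setdefault(service, set())
    if action == "start":
        nodes.add(node)
    elif action == "stop":
        nodes.discard(node)
    else:
        raise ValueError(f"unknown action: {action}")
    return new_state


def first_violation(state: State, plan: List[Step]) -> int:
    """Index of the first step that breaks the invariant, or -1 if the plan is safe."""
    for index, step in enumerate(plan):
        state = apply_step(state, step)
        if not no_spof(state):
            return index
    return -1


cluster = {"web": {"n1", "n2"}}

# Naive rolling upgrade: restart each node in turn.
naive_plan = [("stop", "web", "n1"), ("start", "web", "n1"),
              ("stop", "web", "n2"), ("start", "web", "n2")]

# Same upgrade, but bring up a spare node first.
safe_plan = [("start", "web", "n3")] + naive_plan

print(first_violation(cluster, naive_plan))  # index of the offending step
print(first_violation(cluster, safe_plan))   # -1 means no step breaks the invariant
```

The naive plan is rejected at its very first step, because stopping n1 leaves a single node; adding a spare node up front makes the same sequence pass. A real planner would search for such a plan rather than just validate one.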

Why does it seem to be that so few plan for reverting releases?

  • discussion starting point: tools are concerned with going forward, and so are teams. Usually there are no explicit backout plans, or if there are, they are rarely tested - even though some of it can be automated (contrast this with the 3am call and hacking a bugfix forward)
  • problematic naming: rollback is a bad word, backout is better
  • you cannot roll back time
  • continuous rolling forward: what happens when something goes wrong during deployment?
  • migrating databases? Whole concept of db refactoring (see Scott W. Ambler's book)
  • Django South, Rails migrations, etc.
  • link w/ application architecture: isolating things prevents failure propagation
  • migrations use different data models
  • need a 3rd pipeline to build the data (after code, infrastructure)
  • e.g. anonymizing data: cannot be rolled back, needs to be done in production
  • once you have gone forward there are too many paths to go back
  • depends on your scenario? What's the difference between roll-forward/roll-back
  • failing in unexpected ways (corrupted data could affect your application)
  • "stopping time" by switching systems (maintain parallel installations of systems)
  • easy to have a default rollback for the mainline scenario, without losing newly gathered data (e.g. a new field was added to the signup form and needs to be backed out: we can remove the field and keep all the data, even for customers that have signed up after the release)
  • what about feature toggles? Could be used to handle such cases.
  • basic issue w/ the idea of rolling-back: means losing data, you cannot rollback your data
  • you should implement a rollback scenario if you can (depends on the risk, costs...)
  • the effort to do it correctly is much higher than most people put in
  • snapshots: need to be in a consistent state
  • no way to roll back after some time has passed (e.g. deploy at the weekend, failure occurs on a weekday)
  • if rollback is not possible, be aware of it and prepared to roll forward
  • come up with a design where you don't have to do it: lowers the risk enough...
  • clever systems allow you to do it by connection, by user, by feature
  • allows tuning for some users, e.g. providing a resource-consuming feature to part of the users, not to all users
  • DI is better than feature branch for doing that
  • deploy schema changes alongside the code
  • just add to the database, do not remove anything - all older versions of the app can use the new schema (consider meaningful defaults, and beware of the potential performance hit you are taking with increased record size)
  • feature toggles used to test the new database
  • deploying schemas in advance gives you more confidence (but does not solve the rollback problem) - database shadowing, so it's like the additive-only schema changes, just temporary rather than permanent
  • running live data through a secondary installation that contains the old version
  • event sourcing provides the ability to replay stuff
  • problem: how much time does it take?
  • but the events have schemas themselves...
  • finding ways to mitigate your inability to do anything about something going wrong
  • reducing the barrier to going in production: being minutes away from delivering
  • how do we make people more aware of the problem? a lot of developers have not worked on the ops side, dealing with the unexpected
  • Google engineers are on ops for a month after pushing a new release of a piece of software
  • product teams actually run the software (not always feasible due to regulations)
  • the whole forward/backwards discussion is not concerned with undoing multiple releases
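
The "by user, by feature" toggling mentioned in the list above could look roughly like the sketch below. Everything in it is an assumption for illustration (the `Toggle` class, its fields, the hashing scheme); it is not taken from any particular feature-toggle library. The point is that backing a feature out becomes a configuration change instead of a redeploy.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Toggle:
    """Hypothetical feature toggle with a kill switch and a per-user rollout."""
    name: str
    enabled: bool = True          # kill switch: set to False to back the feature out
    percent: int = 0              # share of users (0-100) who get the feature
    allow_users: Set[str] = field(default_factory=set)  # always on, e.g. testers

    def _bucket(self, user_id: str) -> int:
        # Stable hash so a given user always lands in the same bucket (0-99).
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode("utf-8")).hexdigest()
        return int(digest, 16) % 100

    def is_on(self, user_id: str) -> bool:
        if not self.enabled:
            return False
        if user_id in self.allow_users:
            return True
        return self._bucket(user_id) < self.percent


new_signup = Toggle("new-signup-form", percent=0, allow_users={"alice"})
print(new_signup.is_on("alice"))  # on: explicitly allowed
print(new_signup.is_on("bob"))    # off: rollout is at 0 percent

new_signup.percent = 100          # roll out to everyone
print(new_signup.is_on("bob"))

new_signup.enabled = False        # back out without redeploying
print(new_signup.is_on("alice"))
```

Hashing the toggle name together with the user id keeps each user's experience stable between requests while the percentage is raised. Note that this only switches code paths; data written while the feature was on is still there after the toggle is turned off.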


Some scenarios given that you can't recover from in a planned way

  • the new release of the application starts to generate gibberish data. How do you downgrade to the previous version, restore the old data and clean up the data that has been generated since?
  • does your backout script work when the release has not completed, but failed halfway through?
  • what do you do with large amounts of data (though this might already be a problem for the actual release)?
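
For the "failed halfway through" question, one possible shape of a backout is sketched below: each release step is paired with an undo action, and on failure only the steps that actually completed are undone, newest first. The step names and the `release` function are hypothetical, and the sketch assumes each step is atomic (it either completes or leaves nothing behind), which is a strong assumption in practice.

```python
from typing import Callable, List, Tuple

# Hypothetical step format: (name, apply, undo)
ReleaseStep = Tuple[str, Callable[[], None], Callable[[], None]]


def release(steps: List[ReleaseStep]) -> List[str]:
    """Apply steps in order; on failure, undo only the completed ones, newest first.

    Returns a log of what happened.
    """
    log: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, apply, undo in steps:
        try:
            apply()
        except Exception as exc:
            log.append(f"failed {name}: {exc}")
            for done_name, done_undo in reversed(completed):
                done_undo()
                log.append(f"undid {done_name}")
            return log
        completed.append((name, undo))
        log.append(f"applied {name}")
    return log


state = {"schema": 1, "code": "v1"}


def migrate() -> None:
    state["schema"] = 2


def unmigrate() -> None:
    state["schema"] = 1


def deploy() -> None:
    raise RuntimeError("disk full")  # simulated failure in the middle of the release


def undeploy() -> None:
    state["code"] = "v1"


log = release([("migrate schema", migrate, unmigrate),
               ("deploy code", deploy, undeploy)])
for line in log:
    print(line)
print(state)
```

The undo of the failed step itself is never run here, so a step that can fail after doing part of its work needs an undo that is safe to run against a partial result. Testing the backout means exercising exactly these failure paths, not only the case where the whole release succeeded.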


And unfortunately database level application integration (many apps read-write the same database tables) is not yet extinct.
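
When several applications (or several versions of one application) share a table, the additive-only schema rule from the discussion is what keeps the older readers and writers working. A minimal sketch with Python's built-in sqlite3 module follows; the `signup` table, the `referrer` column and the two insert functions are invented for this example, and other databases have different rules and costs for adding columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signup (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")


def old_app_insert(email: str) -> None:
    # Old release (or another app): names its columns and knows nothing about `referrer`.
    conn.execute("INSERT INTO signup (email) VALUES (?)", (email,))


def new_app_insert(email: str, referrer: str) -> None:
    # New release: also writes the newly added column.
    conn.execute("INSERT INTO signup (email, referrer) VALUES (?, ?)", (email, referrer))


old_app_insert("a@example.com")

# Additive migration: a new column with a meaningful default, nothing removed.
conn.execute("ALTER TABLE signup ADD COLUMN referrer TEXT NOT NULL DEFAULT 'unknown'")

new_app_insert("b@example.com", "newsletter")

# The code is backed out, the schema stays: the old release keeps working.
old_app_insert("c@example.com")

rows = conn.execute("SELECT email, referrer FROM signup ORDER BY id").fetchall()
old_view = conn.execute("SELECT email FROM signup ORDER BY id").fetchall()
print(rows)
print(old_view)
```

The signup made by the new release survives the backout, and rows written by the old code pick up the default. This only holds as long as every writer names its columns explicitly and no reader relies on `SELECT *` returning a fixed shape.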