Configuration Changes

From CitconWiki
Revision as of 09:58, 13 November 2011 by ArnaudBailly (minutes of the session on configuration rollback/rollforward)

Configuration changes Session

  • https://github.com/zaphod42/Coeus
  • start from a cluster configuration
  • how do you roll out a new service while keeping things going
  • There can be no SPOF (single point of failure)
  • Define a new language to describe rules/policies -> turn a declarative, policy-based language into a Puppet execution plan (see the policy-to-plan sketch after these notes)
  • match execution plans (goal-based) against policies
  • provisioning large systems
  • idea: model-checking failing systems
  • applying model-checking to sysadmin work
  • real-world failures are complex; how do you model them?
  • you cannot roll-back time
  • continuous rolling forward: what happens when something goes wrong during deployment?
  • migrating databases? The whole concept of DB refactoring
  • link w/ application architecture: isolating things prevents failure propagation
  • migrations use different data models
  • need a 3rd pipeline to build the data (after code and infrastructure)
  • e.g. anonymizing data: cannot be rolled back, needs to be done in production
  • once you go forward there are too many paths to go back
  • depends on your scenario? What's the difference between roll-forward/roll-back
  • failing in unexpected ways (corrupted data could affect your application)
  • "stopping time" by switching systems
  • easy to have a default rollback for the mainline scenario
  • what about feature toggles? They could be used to handle such cases (see the feature-toggle sketch after these notes)
  • basic issue w/ the idea of rolling back: it means losing data; you cannot roll back your data
  • you should implement a rollback scenario if you can (depends on the risk, costs...)
  • the effort to do it correctly is much higher than most people put in
  • snapshot: need to be in a consistent state
  • no way to roll back after some time has passed (e.g. deploy on the weekend, failure occurs on weekdays)
  • if rollback is not possible, be aware of it and be prepared to roll forward
  • come up with a design where you don't have to do it: lowers the risk enough...
  • a clever system allows doing it by connection, by user, by feature
  • allows tuning for some users, e.g. providing a resource-consuming feature to a subset of users, not to all of them
  • DI is better than a feature branch for doing that
  • deploy schema changes alongside the code
  • just add to the database, do not remove anything (see the additive schema-change sketch after these notes)
  • feature toggles used to test the new database
  • deploying schemas in advance gives you more confidence (but does not solve the rollback problem)
  • event sourcing provides the ability to replay events (see the event-replay sketch after these notes)
  • problem: how much time does it take?
  • but the events have schemas themselves...
  • finding ways to mitigate your inability to do anything about something going wrong
  • reducing the barrier to going into production: being minutes away from delivering
  • how do we make people more aware of the problem? Lots of developers have not worked on the ops side, dealing with the unexpected
  • Google engineers are on ops duty for a month after pushing a new release of a piece of software
  • product teams actually run the software (not always feasible due to regulations)
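
A toy sketch of the policy-to-plan idea noted above: compare a declarative description of the desired state with the observed state and derive the actions to execute. This is not Coeus or Puppet code; the services, states and function names are made up for illustration.

  # Turn a declarative policy (desired state) into an execution plan (actions).
  desired = {"nginx": "running", "postgres": "running", "memcached": "absent"}
  observed = {"nginx": "stopped", "postgres": "running", "memcached": "running"}

  def plan(desired, observed):
      """Compare the desired policy with the observed state and emit the needed actions."""
      actions = []
      for service, wanted in desired.items():
          current = observed.get(service, "absent")
          if wanted == current:
              continue
          if wanted == "running":
              actions.append(("start", service))
          elif wanted == "absent":
              actions.append(("remove", service))
          else:
              actions.append(("stop", service))
      return actions

  print(plan(desired, observed))  # [('start', 'nginx'), ('remove', 'memcached')]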
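A minimal per-user feature-toggle sketch, along the lines of tuning a feature by user discussed above: the class, feature name and bucketing scheme are assumptions, not something presented in the session.

  import hashlib

  class FeatureToggles:
      def __init__(self):
          self._rules = {}  # feature name -> predicate deciding which users see it

      def enable_for(self, feature, predicate):
          self._rules[feature] = predicate

      def is_enabled(self, feature, user):
          predicate = self._rules.get(feature)
          return bool(predicate and predicate(user))

  def in_bucket(user_id, percent):
      """Stable bucketing: the same user always lands in the same bucket."""
      return hashlib.sha1(user_id.encode()).digest()[0] % 100 < percent

  toggles = FeatureToggles()
  # Give a resource-consuming feature to roughly 10% of users, not to all of them.
  toggles.enable_for("expensive_reports", lambda user: in_bucket(user["id"], 10))

  def render_dashboard(user):
      if toggles.is_enabled("expensive_reports", user):
          return "dashboard with expensive reports"
      return "plain dashboard"

  print(render_dashboard({"id": "user-42"}))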
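A sketch of the "just add, never remove" schema approach: the new column is added alongside the old data, so the previous version of the code keeps working and rolling the code back does not require undoing the schema change. Table and column names are hypothetical; sqlite is used only to keep the example self-contained.

  import sqlite3

  conn = sqlite3.connect("app.db")
  conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")

  # Expand step: add the new column if it is not there yet; never drop or rename old ones.
  columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
  if "email" not in columns:
      conn.execute("ALTER TABLE users ADD COLUMN email TEXT")  # nullable, old rows stay valid

  # New code writes the new column; old code simply ignores it.
  conn.execute("INSERT INTO users (name, email) VALUES (?, ?)", ("alice", "alice@example.org"))
  conn.commit()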
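An event-replay sketch for the event-sourcing point: current state is derived by replaying the event log rather than by undoing writes, which is what makes replay (and the "how much time does it take?" concern) relevant. Event types and fields are invented for illustration.

  events = [
      {"type": "AccountOpened", "account": "A-1", "balance": 0},
      {"type": "MoneyDeposited", "account": "A-1", "amount": 100},
      {"type": "MoneyWithdrawn", "account": "A-1", "amount": 30},
  ]

  def apply_event(state, event):
      if event["type"] == "AccountOpened":
          state[event["account"]] = event["balance"]
      elif event["type"] == "MoneyDeposited":
          state[event["account"]] += event["amount"]
      elif event["type"] == "MoneyWithdrawn":
          state[event["account"]] -= event["amount"]
      return state

  def replay(events):
      """Rebuild the current state from scratch by folding over the whole event log."""
      state = {}
      for event in events:
          state = apply_event(state, event)
      return state

  print(replay(events))  # {'A-1': 70}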