Configuration Changes
Configuration changes session: rollback and roll-forward
- https://github.com/zaphod42/Coeus
- start from a cluster configuration
- how do you roll out a new service while keeping things going
- There can be no SPOF (single point of failure)
- Define a new language to describe rules/policies -> turn the declarative, policy-based language into a Puppet execution plan
- match execution plans (goal-based) against policies (see the policy-to-plan sketch after these notes)
- provisioning large systems
- idea: model-checking failing systems
- applying model-checking to system administration
- real world failures are complex, how do you model them?
- you cannot roll-back time
- continuous rolling forward: what happens when something goes wrong during deployment?
- migrating the database? the whole concept of DB refactoring
- link w/ application architecture: isolating things prevents failure propagation
- migrations use different data models
- need a 3rd pipeline to build the data (after code and infrastructure)
- e.g. anonymizing data: cannot be rolled back, needs to be done in production
- once you have gone forward, there are too many paths to go back
- depends on your scenario? what's the difference between roll-forward and roll-back
- failures happen in unexpected ways (corrupting data could affect your application)
- "stopping time" by switching systems
- easy to have a default rollback for the mainline scenario
- what about feature toggles? They could be used to handle such cases (see the feature-toggle sketch after these notes)
- basic issue with the idea of rolling back: it means losing data; you cannot roll back your data
- you should implement a rollback scenario if you can (depends on the risk, costs...)
- the effort to do it correctly is much higher than most people put in
- snapshots: need to be taken in a consistent state
- no way to roll back after some time has passed (e.g. deploy at the weekend, failure occurs during the week)
- if rollback is not possible, be aware of it and prepared to roll forward
- come up with a design where you don't have to do it: lowers the risk enough...
- clever systems allow toggling by connection, by user, by feature
- allows tuning for some users: provide a resource-consuming feature to part of the users, not to all of them
- DI is better than a feature branch for doing that
- deploy schema changes alongside the code
- just add to the database, do not remove anything (see the expand-only schema sketch after these notes)
- feature toggles used to test the new database
- deploying schemas in advance gives you more confidence (but does not solve the rollback problem)
- event sourcing provides the ability to replay stuff (see the replay sketch after these notes)
- problem: how much time does a replay take?
- but the events have schemas themselves...
- finding ways to mitigate your inability to do anything about something going wrong
- reducing the barrier to going to production: being minutes away from delivering
- how do we make people more aware of the problem? a lot of developers have not worked on the ops side, dealing with the unexpected
- Google engineers are on ops for a month after pushing a new release of a piece of software
- product teams actually run the software (not always feasible due to regulations)
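
Policy-to-plan sketch (referenced above). A minimal Python illustration, not taken from Coeus or Puppet, of turning a declarative policy into an ordered execution plan and checking that plan against the goal before applying it; the state model and all names are invented for the example.

    POLICY = {                      # desired state, declarative
        "web": {"version": "2.1", "running": True},
        "db":  {"version": "9.0", "running": True},
    }

    CURRENT = {                     # observed state of the cluster
        "web": {"version": "2.0", "running": True},
        "db":  {"version": "9.0", "running": False},
    }

    def plan(policy, current):
        """Derive the steps needed to move from the current state to the policy."""
        steps = []
        for svc, want in policy.items():
            have = current.get(svc, {})
            if have.get("version") != want["version"]:
                steps.append(("upgrade", svc, want["version"]))
            if want["running"] and not have.get("running"):
                steps.append(("start", svc))
        return steps

    def simulate(current, steps):
        """Apply steps to a copy of the state, without touching real systems."""
        state = {svc: dict(attrs) for svc, attrs in current.items()}
        for step in steps:
            if step[0] == "upgrade":
                state[step[1]]["version"] = step[2]
            elif step[0] == "start":
                state[step[1]]["running"] = True
        return state

    def satisfies(policy, state):
        """Goal check: does a (simulated) end state match the policy?"""
        return all(state.get(svc) == want for svc, want in policy.items())

    steps = plan(POLICY, CURRENT)
    assert satisfies(POLICY, simulate(CURRENT, steps))
    print(steps)   # [('upgrade', 'web', '2.1'), ('start', 'db')]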
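
Feature-toggle sketch (referenced above). A rough Python sketch of per-user toggles: a stable hash buckets users into a percentage rollout, with an allow-list for named users, so a risky or resource-hungry feature can be given to part of the users and switched off without redeploying. Toggle names, percentages and the two backends are made up for the example.

    import hashlib

    TOGGLES = {
        # feature name -> (percentage of users, explicit allow-list)
        "new_search": (10, {"alice"}),       # 10% rollout plus named users
        "heavy_report": (0, {"ops-team"}),   # only the allow-list gets it
    }

    def is_enabled(feature, user_id):
        percentage, allow_list = TOGGLES.get(feature, (0, set()))
        if user_id in allow_list:
            return True
        # Stable hash so the same user always gets the same answer.
        bucket = int(hashlib.sha1(f"{feature}:{user_id}".encode()).hexdigest(), 16) % 100
        return bucket < percentage

    def old_search_backend(query):
        return f"old:{query}"    # known-good code path

    def new_search_backend(query):
        return f"new:{query}"    # new, possibly risky code path

    def search(user_id, query):
        if is_enabled("new_search", user_id):
            return new_search_backend(query)
        return old_search_backend(query)

    print(search("alice", "kittens"))   # alice is on the allow-list -> new path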
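
Expand-only schema sketch (referenced above). A small Python/sqlite3 illustration of "just add, do not remove": the new column is added ahead of the code change, old code keeps working because it only uses the original columns, and new code backfills and reads the new column with a fallback, so rolling the code back never meets a schema it cannot use. sqlite3 is only a stand-in for whatever database is actually involved.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

    # Expand step, deployed ahead of (or alongside) the code change.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

    # Old code keeps working: it only uses the original columns.
    conn.execute("INSERT INTO users (name) VALUES ('alice')")

    # New code backfills the new column and reads it with a fallback.
    conn.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")
    row = conn.execute("SELECT COALESCE(display_name, name) FROM users").fetchone()
    print(row[0])   # 'alice'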
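
Replay sketch (referenced above). A toy Python example of the event-sourcing point: current state is rebuilt by replaying the event log, and the "how much time does it take" question is usually answered with snapshots (replay from a saved intermediate state). The account/balance domain is invented; versioning of the event schemas themselves, also raised above, is not shown.

    from dataclasses import dataclass

    @dataclass
    class Event:
        kind: str      # e.g. "deposited", "withdrawn"
        amount: int

    LOG = [
        Event("deposited", 100),
        Event("withdrawn", 30),
        Event("deposited", 5),
    ]

    def apply(balance, event):
        if event.kind == "deposited":
            return balance + event.amount
        if event.kind == "withdrawn":
            return balance - event.amount
        return balance   # unknown event kinds are skipped, not lost

    def replay(events, start=0):
        balance = start
        for e in events:
            balance = apply(balance, e)
        return balance

    print(replay(LOG))           # 75: full rebuild from the log
    print(replay(LOG[2:], 70))   # 75: replay from a snapshot taken after two events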