What are you going to monitor? How will you know when what you're monitoring happens? What will you do if it happens?
Continuous deployment: Will the site keep working while you deploy the new version? Or do you take the site down while you deploy? What happens to the users on the site while you're deploying?
If you're deploying manually, then you could think about these questions, but it's not mandatory.
The Ops job was to be a buffer between devs and remote system administrators.
This particular Ops team were not sysadmins, and there was no monitoring in place. They were doing market fixes and providing a buffer between devs and remote sysadmins.
Internal staging site, but it has a different topology. Production has 3 machines behind a load balancer. Because of the remote relationship, they couldn't change the app except through the database.
Company is successful, but there's a lack of growing up. They plug gaps with people instead of systems.
Co-workers checked in code that didn't work. Code should compile before you check it in. Read Scott Adams: "Goals are for losers, winners build systems." They now have a common deployment contract across deployments, and it can run transformations and run rules. They now have blue/green deployments. They can ask the service, "Are you ready to shut down?" They now have init scripts to start and stop the service. They now have metrics in place.
An internal API loads the data warehouse, and its traffic is much higher than the live site's. Running the data extract slows down the production site. Goal: run the extract during running hours, and be able to monitor that performance isn't impacted.
PJ session on Anti-patterns
Anti-patterns: They read a book on CI, and they want a plan. Provide a 2-year plan. Work with them for a year and a half, and after 1.5 years they say that they need 5 years. What do they have in place, and what do they need to have in place? If they don't have it in place, then it's an anti-pattern. What have you seen where people think they have it right, but it's actually wrong?
Want to do continuous delivery. Deliver to production every 2 weeks. Have a build script? CI running? Have developers check in frequently? 300-400 devs, and they want a roadmap. One year into it, they kind of have CI in place. Check in, a build happens, and it's red or green. They don't follow industry-standard CI practices; for example, the build doesn't always include unit tests. You're not doing CI if you're not running unit tests. If you say CI, then people assume unit tests are included in the build. Lack of commitment to a green build. Can check in and have red for weeks at a time, and it's acceptable within the org. You'll never achieve continuous delivery if you don't commit to a green build.
Need to fix very quickly within a couple of hours. Whoever broke it is responsible to fix it.
Don't break your CI process. Run your build before you check in. Use a rubber chicken.
USB nerd control that will target developer.
Break build, but if not fixed in 15 minutes, then we'll roll back. If unit tests fail, then it's reverted.
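The two rules above can be written down as a small decision function. This is only a sketch under stated assumptions: the function name `decide`, its return values, and the way a CI server would call it are hypothetical; only the 15-minute window and the "failing unit tests get reverted" rule come from the notes.

```python
from datetime import datetime, timedelta
from typing import Optional

FIX_WINDOW = timedelta(minutes=15)  # grace period quoted in the notes


def decide(unit_tests_ok: bool, build_ok: bool,
           broken_since: Optional[datetime], now: datetime) -> str:
    """Return 'revert', 'wait', or 'ok' for the latest check-in.

    Hypothetical policy function; a real CI server would call something
    like this after every build.
    """
    if not unit_tests_ok:
        return "revert"  # failing unit tests are reverted straight away
    if build_ok:
        return "ok"
    if broken_since is not None and now - broken_since >= FIX_WINDOW:
        return "revert"  # the fix window has run out
    return "wait"  # still inside the window: whoever broke it fixes it
```

Keeping the policy in one pure function means it can be unit tested without a CI server, and the window can be changed in one place.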
How to prevent people from not checking in. Can't solve stupidity or malice. Only build systems that support good intentions.
Unstable test problems: the build will fail. Elaborate build radiator, and mark the test as flickering. Then call it a "bad test." Not obviously my problem; next time it runs, it'll go green. Run it 3-4 times and get different results. Tests that have non-deterministic behavior are an anti-pattern.
How to determine a non-deterministic test? Run again and it works.
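A minimal sketch of "run again and see": call the same test several times and flag it when the outcomes disagree. The name `classify` and the three labels are assumptions made for illustration; a real runner would also reset fixtures between runs.

```python
def classify(test, runs=4):
    """Run a zero-argument test callable several times and label it.

    Hypothetical helper: returns 'stable-pass', 'stable-fail', or
    'flickering' when the same test both passed and failed.
    """
    outcomes = set()
    for _ in range(runs):
        try:
            test()
            outcomes.add("pass")
        except Exception:  # any exception counts as a failed run
            outcomes.add("fail")
    if outcomes == {"pass"}:
        return "stable-pass"
    if outcomes == {"fail"}:
        return "stable-fail"
    return "flickering"
```

A test that only fails once in a hundred runs will slip through four runs, so this catches the obvious cases only.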
Dependencies, like data or class dependencies. Create an object and it's not created the first time. Race conditions are another cause. Tests that don't clean up after themselves. Running the tests in a random order can help make sure there are no dependencies.
Suggestion to write less end-to-end and write more unit tests.
Run forwards. Run backwards. Run in random order. Detect tests that don't clean up. Tests sometimes leak database connections; detect when database leaks are happening.
Red, yellow and green systems. Build scripts? CI? Frequent check-ins? Couldn't do deployment consistently. Needed to solidify the system. Started with monthly deployments. Do a server every two days, and then move forward. Then moved to bi-weekly. Brought ops and developers together to build their own deployment system.
Needed ops and dev collaboration in order to get to CI.
What makes CD unique from CI?
Why create user stories? Because the tech spec was huge. You could create a spec and meet it, but it didn't bring any value. So a user story was needed. Similarly, a developer could meet the requirement, but it still wouldn't get to CD.
Devs build something useful to the ops team.
Cucumber and Nagios provide ops-friendly output. Is it useful to bridge the gap between ops and devs? Yes. Ops were not familiar with Chef or Puppet. Only familiar with WebSphere and native WebSphere tools.
Being able to reproduce infrastructure from the command line. Need to collaborate with script automation and site operations. The way to communicate with them was to write Cucumber tests (ATDD). Collaborated with them to create the tests. The Cucumber tests were for the infrastructure: "Given I have a VM with an operating system and Chef, when I run install_websphere.rv, then I go to this URL and should see an admin screen."
Use Cucumber to monitor and do a virtual install. More often you're installing an application, and with the WebSphere installers they were monitoring how well it was going. Given deploy_foo.sh, then I should NOT see X message. Or I should see Y message. Checking the log for details when something failed should not be a person's job. Use Cucumber. Send the Cucumber output to Nagios for the ops people.
Will a human ever check the log? Just for exploratory purposes. From a systems POV, look at the log and know what's up. Suggestion: look at the log to see if we're blind to anything the testing isn't covering. Showing all logs all the time is an anti-pattern. Don't plug gaps with people; do it with systems.
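The "should see Y, should NOT see X" check is easy to automate outside Cucumber too. A rough Python sketch of the same idea follows; the function name and the message strings are placeholders, and the notes do not say what the real log messages were.

```python
def check_deploy_log(log_text, must_see=(), must_not_see=()):
    """Return a list of problems found in a deployment log.

    Hypothetical helper: an empty list means the log looks as expected.
    """
    problems = []
    for message in must_see:
        if message not in log_text:
            problems.append(f"missing expected message: {message}")
    for message in must_not_see:
        if message in log_text:
            problems.append(f"found forbidden message: {message}")
    return problems


# Placeholder log content; real output of deploy_foo.sh is not in the notes.
sample_log = "starting deploy\nY message\ndone"
problems = check_deploy_log(sample_log,
                            must_see=["Y message"],
                            must_not_see=["X message"])
```

The returned list can be printed in whatever format the ops tooling expects, so nobody has to read the log by hand after a deploy.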
Deployment monitoring was an issue. Monitoring failed from the beginning. Eventually added hooks within the system. Ping the system for a health check. Put the output into a Nagios alert.
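A health-check ping can be wrapped as a Nagios-style plugin: print one status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The sketch below assumes the application exposes an HTTP health endpoint; the URL, the one-second slowness threshold, and all function names are hypothetical.

```python
import time
import urllib.error
import urllib.request

# Exit codes used by Nagios plugins.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(status_code, elapsed_s, warn_after_s=1.0):
    """Turn a probe result into (exit_code, one-line message)."""
    if status_code is None:
        return CRITICAL, "CRITICAL - health endpoint unreachable"
    if status_code != 200:
        return CRITICAL, f"CRITICAL - health endpoint returned HTTP {status_code}"
    if elapsed_s > warn_after_s:
        return WARNING, f"WARNING - healthy but slow ({elapsed_s:.2f}s)"
    return OK, f"OK - healthy ({elapsed_s:.2f}s)"


def probe(url, timeout_s=5.0):
    """Ping the health endpoint; return (status code or None, seconds taken)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            status = response.status
    except urllib.error.HTTPError as error:  # server answered with 4xx/5xx
        status = error.code
    except (urllib.error.URLError, OSError):  # DNS failure, refused, timeout
        status = None
    return status, time.monotonic() - start


def main(url):
    """Print the status line and return the exit code for sys.exit()."""
    code, message = evaluate(*probe(url))
    print(message)
    return code
```

Splitting `evaluate` from `probe` keeps the alerting rules testable without a network. A wrapper script would call `sys.exit(main(url))` with whatever endpoint the hooks expose.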
Direction: Should only do manual testing as exploratory testing, to discover unknown things that might be wrong. Regression testing is a confirmation that it works. See places where only 5% of testing is unit testing and 95% is manual regression testing. "Testing" could mean either "checking" or "exploring." Jeff would insist that "testing" means exploring, but can't change industry systems. The only thing that should not already be automated is looking at the system, or finding new ways to understand the system.
If it's a continual thing, then have automated ways to find it. If a human is testing, then it's to find out what needs to be tested. If it's a new feature, then have humans who didn't design it use it; that's usability testing. Can't do UX testing in CI. Do regular checks in the system. If you roll out a new feature, then have humans use it in order to figure out what needs checking.
There is a test framework by Llewellyn Falco to test visual appearance: "Approval testing." Run the test, and it records the state of the system. If the state changes later, then you detect it. A hybrid between manual and automated testing.
Areas that are "hard" to test. You should test everything. Even if it's hard to test, there could be tests. Layout testing in the browser isn't done well.
Take the current build as the golden build. Create a number of test cases, and take a snapshot of the current state. Put the new version of the code on the other side, and then compare the test results according to the DOM. Spot the differences. Then a human can detect a CSS problem. Can do this cross-browser as well. Much less effort to spot UI issues.
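The golden-build comparison can be sketched with nothing more than a text diff. This assumes each test case yields a DOM snapshot as a string, keyed by test case name; how the snapshots are captured from a browser is outside the sketch, and `compare_builds` is a hypothetical name, not part of any approval-testing library.

```python
import difflib


def compare_builds(golden, candidate):
    """Return {test case: unified diff lines} for every snapshot that changed.

    `golden` and `candidate` map test case names to DOM snapshots (strings).
    A case missing on one side is treated as an empty snapshot.
    """
    changed = {}
    for case in sorted(golden.keys() | candidate.keys()):
        old = golden.get(case, "")
        new = candidate.get(case, "")
        if old != new:
            changed[case] = list(difflib.unified_diff(
                old.splitlines(), new.splitlines(),
                fromfile="golden", tofile="candidate", lineterm=""))
    return changed


# Made-up snapshots for illustration only.
golden = {"home": "<div class='a'>Hi</div>"}
candidate = {"home": "<div class='b'>Hi</div>"}
changed = compare_builds(golden, candidate)
```

Only the changed cases reach a human, who then approves the new snapshot as the next golden build or files a bug.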
Identify stuff that's in the way of CI, and then identify the ingenuity of the solutions that come from a commitment to CD. Non-deterministic tests usually have bugs. If the code is right, then why try to write tests? There's a barrier to commitment to CD/CI. Need to share ingenuity so that it's easier to do CD.
Treat test code as seriously as production code. If the test is MORE difficult to write than the production code, then it becomes hard to justify.
Shore: "Agile doesn't work if you don't have self-discipline." If you have a non-deterministic failure and don't fix it within a couple of weeks, then you're accumulating debt.
But if you find a non-deterministic failure, then put another hour into it. Then they will eventually give up. May put 6-8 hours into a non-deterministic issue, and then give up.
Half of flickering tests are poorly written tests, and half are really difficult problems in the code.
Turn on Code coverage before and after to detect issues.
Writing a book on how to detect flickering tests would be a best seller.
A database needs to follow an evolutionary design pattern. Duplicate data, maintain it, and then migrate it. Book on "Database continuous integration: Evolutionary database design." If you make a change to the database, then write a delta script, e.g. with Liquibase.
Don't want downtime. Need to decouple the structure of the database. Example: an address field, split into two address fields. Write a migration that creates the new stuff. Write the new data, but only read the old data. Need to have multiple versions of your code talking to the database. Mention of a "Refactoring Databases" book.
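A rough sketch of the address split using SQLite from the Python standard library. The table name, column names, and the rule for splitting on the first comma are all assumptions made for illustration. The point is the order of steps: add the new columns, write both old and new, keep reading the old column until every code version has moved over.

```python
import sqlite3


def migrate_expand(conn):
    """Delta script: add the new columns, leave the old one untouched."""
    conn.execute("ALTER TABLE customer ADD COLUMN address_line1 TEXT")
    conn.execute("ALTER TABLE customer ADD COLUMN address_line2 TEXT")


def split_address(address):
    """Hypothetical rule: everything after the first ', ' goes to line 2."""
    line1, _, line2 = address.partition(", ")
    return line1, line2


def backfill(conn):
    """Copy existing data into the new columns."""
    rows = conn.execute(
        "SELECT id, address FROM customer WHERE address_line1 IS NULL"
    ).fetchall()
    for customer_id, address in rows:
        line1, line2 = split_address(address)
        conn.execute(
            "UPDATE customer SET address_line1 = ?, address_line2 = ? "
            "WHERE id = ?", (line1, line2, customer_id))


def save_address(conn, customer_id, address):
    """Transitional writer: keeps the old and new columns in step."""
    line1, line2 = split_address(address)
    conn.execute(
        "UPDATE customer SET address = ?, address_line1 = ?, "
        "address_line2 = ? WHERE id = ?",
        (address, line1, line2, customer_id))


def read_address(conn, customer_id):
    """Transitional reader: still reads only the old column."""
    row = conn.execute(
        "SELECT address FROM customer WHERE id = ?", (customer_id,)
    ).fetchone()
    return row[0]
```

Because old code keeps reading and writing `address`, old and new versions of the application can run against the same database during a deploy. Dropping the old column is a later, separate delta script.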
Anti-pattern: Ivory tower DBA. Submit a ticket to make a change to the database. The DBA is a bottleneck to the organization. Very hard to reproduce the database. In order to reproduce the database, you have to take the db and reproduce it entirely. Takes a lot of time. If you use an evolutionary database, then it's easier to grab 3% of the database or a specific portion of the db.
Instead of integrating with the database, integrate with services. Decouple: a database per service. Avoid having JOINs in reporting software. Pretend you have persistent memory: would it be okay for objects to break the database? Have a well-defined API. Have code that could read the alpha database. Versioned by class. Was there code complexity to deal with it? No, there were abstractions that dealt with it.
NoSQL databases will defend them on a version. NoSQL migrations can be really difficult.
Collaboration between Dev and Ops
"Failure Mode and Effects Analysis" -- Failure analysis: What could go wrong with system? If it went wrong, then how would we know? How quickly could we fix it? Do risk and impact analysis, and add issues to the backlog.
Blue/Green deployments as a prerequisite? CD is defined differently per organization.
Do releases at 11 pm, after US market close and before AU market open.
Do releases under load on the site while users are on the system.
You have to be testing what you're releasing. You need the actual packages that you're going to deploy. If you're not testing those, then you'd have to put in a commit to bump the version number. Maven snapshots. The code signing process brings ambiguity about whether it's the same artifact.