Many people have experienced random build failures.
Random build failures reduce the motivation for developers to take notice of the build status, make it harder to work out whether the build is really broken or not, and make it harder to work out why the build is broken if it really is broken.
This session was about what causes random build failures, and solutions that people have used, or could potentially use, to deal with them.
On the flip chart, we wrote up some causes and solutions, which I will transcribe here without much in the way of embedded comments. I'll also comment on some of the discussion that we had, and the "ah ha" moment that I experienced. (Note: this is not quite 100% as on the flip chart, but very close. Sorry, I didn't write down solutions for all causes. I personally don't approve of all the suggested solutions.)
Causes                                              Potential Solutions
-----------------------------------------------------------------------
non-deterministic random shit                       exorcism/comment well known flickering tests
multi-threaded apps                                 test on multi-core machines, slow machines etc to try to make the build fail if the "random" failure is actually a real failure
selenium                                            adding waits/using existence of ids/don't make assertions for things that don't matter
real problems                                       try to make the build fail if the "random" failure is actually a real failure/have a "zero tolerance" approach to "random" failures (aka "broken windows")/KISS
environmental differences                           virtualization
external dependencies                               stub them/use a clean version of the external dependency
state of machine/environment/dependencies           (none recorded)
interactions and dependencies between tests         write isolated tests/run tests in random order to make sure such tests are exposed
windows file handles                                auto reboot regularly
microsoft                                           (none recorded)
out of memory                                       show what memory usage is
tomcat/redeploying                                  use clean undeploy-deploy rather than redeploy/jetty
incremental builds                                  (none recorded)
infrastructure start time, e.g. spring, hibernate   don't use those things
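As an illustration of the "run tests in random order" suggestion, here is a minimal sketch using Python's standard unittest module. The function names `shuffled_suite` and `run_in_random_order` are my own, not something from the session. The idea is to shuffle the test order on every build but print the seed, so that an order-dependent failure can be replayed exactly rather than written off as "random".

```python
import random
import unittest


def shuffled_suite(test_case_classes, seed):
    """Build a suite whose test order is fully determined by `seed`."""
    loader = unittest.TestLoader()
    tests = []
    for cls in test_case_classes:
        # Iterating a loaded suite yields the individual test instances.
        tests.extend(loader.loadTestsFromTestCase(cls))
    # A private Random instance keeps the shuffle independent of any
    # other code that uses the global random module.
    random.Random(seed).shuffle(tests)
    return unittest.TestSuite(tests)


def run_in_random_order(test_case_classes, seed=None):
    """Run the tests in shuffled order, printing the seed for replay."""
    if seed is None:
        seed = random.randrange(2**32)
    print(f"test order seed: {seed}")
    suite = shuffled_suite(test_case_classes, seed)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

If a build fails under a particular seed, passing that same seed back to `run_in_random_order` should reproduce the same ordering, which is usually enough to find the pair of tests that interfere with each other.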
Other things that came up in discussion or as suggestions:
- Test your build scripts. Treat them like other production code.
- Have a look at Rake, http://rake.rubyforge.org/ and Buildr, http://incubator.apache.org/buildr/
- Write the build in Java/language of your choice. (I suggested it might be possible to do this using Ant code, as it's written in Java, to save reimplementing lots of things, i.e. effectively using Ant but from a Java program that you can write tests for.)
- Know why your build fails.
- If something is causing the build to fail randomly, don't use it.
- Keep it simple, stupid! (KISS). If the build fails randomly because the configuration/architecture etc of your application is really complicated, then make it simpler.
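To make "test your build scripts" concrete, here is a small sketch of a build step written as an ordinary function with a unit test beside it. It is in Python rather than Java purely for brevity. The function `clean_deploy`, the file name `app.war` and the directory name `webapps` are all hypothetical, chosen to echo the "clean undeploy-deploy rather than redeploy" suggestion from the flip chart.

```python
import shutil
import tempfile
import unittest
from pathlib import Path


def clean_deploy(artifact: Path, deploy_dir: Path) -> Path:
    """Remove any previous deployment, then copy the new artifact in."""
    if deploy_dir.exists():
        # Start from an empty directory so stale files cannot leak
        # from one build into the next.
        shutil.rmtree(deploy_dir)
    deploy_dir.mkdir(parents=True)
    return Path(shutil.copy2(artifact, deploy_dir / artifact.name))


class CleanDeployTest(unittest.TestCase):
    def test_leftovers_from_previous_deploy_are_removed(self):
        with tempfile.TemporaryDirectory() as tmp:
            root = Path(tmp)
            artifact = root / "app.war"
            artifact.write_text("new build")
            deploy_dir = root / "webapps"
            deploy_dir.mkdir()
            (deploy_dir / "stale.txt").write_text("old build")

            deployed = clean_deploy(artifact, deploy_dir)

            self.assertEqual(deployed.read_text(), "new build")
            self.assertFalse((deploy_dir / "stale.txt").exists())
```

Saved as a module, this can be run with `python -m unittest`. The point is only that a build step expressed as a function can be exercised against a temporary directory, without touching a real server.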
Some stuff that came up in discussion, including some comments by me:
Take a zero tolerance approach: treat every failure as a failure rather than ignoring things that look like a "random failure". Work out how to either stop them happening, or make the random failure happen consistently (perhaps the more insightful option, and the "ah ha" that I experienced; I think it was from http://citconf.com/wiki/index.php?title=Jason_Sankey). Many of the participants agreed that they had seen "random" failures that were actually real errors. The existence of such "random" failures is the reason you need to make sure you understand what has caused the "random" failure, and why it's a good thing to make such failures happen more rather than less if they expose a real problem. "Random" failures which are not due to a real error need to be eliminated so that you really do treat all failures as failures and don't just press "rebuild" on reflex.
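One simple way to push a "random" failure towards happening consistently is to run the suspect check many times in a single build, so that a failure seen once in every few runs shows up nearly every time. The sketch below is only an illustration of that idea. `run_repeatedly` and `LeakyCache` are hypothetical names, and `LeakyCache` simulates an intermittent bug with a counter so that the example is deterministic.

```python
def run_repeatedly(check, runs):
    """Call `check` `runs` times; return (run_index, exception) pairs."""
    failures = []
    for i in range(runs):
        try:
            check()
        except Exception as exc:  # AssertionError is an Exception subclass
            failures.append((i, exc))
    return failures


class LeakyCache:
    """Hypothetical stand-in for code that misbehaves only occasionally."""

    def __init__(self):
        self.calls = 0

    def lookup(self):
        self.calls += 1
        # Simulated bug: every fifth call returns a stale value.
        return "stale" if self.calls % 5 == 0 else "fresh"


cache = LeakyCache()


def check_lookup_is_fresh():
    assert cache.lookup() == "fresh"


failures = run_repeatedly(check_lookup_is_fresh, runs=20)
failing_runs = [index for index, _ in failures]
print(f"{len(failures)} of 20 runs failed, at runs {failing_runs}")
```

A single run of `check_lookup_is_fresh` would pass four times out of five, which is exactly the kind of failure that invites pressing "rebuild". Repeating it makes the failure visible in every build and also reveals its pattern.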