FlickeringBuilds

From CitconWiki
Revision as of 19:53, 5 October 2008 by 121.44.45.33 (Talk)

Many people have experienced random build failures.

Random build failures reduce developers' motivation to pay attention to the build status. They also make it harder to tell whether the build is really broken and, if it is, why.

This session was about what causes random build failures, and solutions that people have used, or could potentially use, to deal with them.

On the flip chart, we wrote up some causes and solutions, which I will transcribe here without much in the way of embedded comments. I'll also comment on some of the discussion that we had, and the "ah ha" moment that I experienced. (Note - the table is not quite 100% as on the flip chart, but very close. Sorry, I didn't write down solutions for all causes. I personally don't approve of all the suggested solutions.)

Causes                                              Potential Solutions
--------------------------------------------------------------------------------------------------------------------
non-deterministic random shit                       exorcism/comment well known flickering tests
multi-threaded apps                                 test on multi-core machines, slow machines etc. to try to make the
                                                    build fail if the "random" failure is actually a real failure
Selenium                                            add waits/use existence of ids/don't make assertions for things
                                                    that don't matter
real problems                                       try to make the build fail if the "random" failure is actually a
                                                    real failure/have a "zero tolerance" approach to "random" failures
                                                    (aka "broken windows")/KISS
environmental differences                           virtualization
external dependencies                               stub them/use a clean version of the external dependency
state of machine/environment/dependencies
interactions and dependencies between tests         write isolated tests/run tests in random order to make sure such
                                                    tests are exposed
Windows file handles                                auto reboot regularly
Microsoft
out of memory                                       show what memory usage is
Tomcat/redeploying                                  use clean undeploy-deploy rather than redeploy/Jetty
incremental builds
infrastructure start time, e.g. Spring, Hibernate   don't use those things
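
As an illustration of the "run tests in random order" suggestion in the table, here is a minimal sketch using Python's standard unittest module. It was not part of the session. The names shuffled_suite, run_order and ExampleTests are hypothetical, and the sketch assumes that recording the seed is enough to replay an ordering that exposed a dependency between tests.

```python
import random
import unittest


class ExampleTests(unittest.TestCase):
    """Hypothetical stand-in for a project's real test cases."""

    def test_alpha(self):
        self.assertEqual(1 + 1, 2)

    def test_beta(self):
        self.assertTrue("build".startswith("b"))

    def test_gamma(self):
        self.assertIn(3, [1, 2, 3])


def shuffled_suite(test_case_classes, seed):
    """Build a suite whose test order is shuffled by a known seed."""
    loader = unittest.TestLoader()
    tests = []
    for cls in test_case_classes:
        # Iterating a loaded suite yields the individual test instances.
        tests.extend(loader.loadTestsFromTestCase(cls))
    # Sort first so the starting point is stable, then shuffle with a
    # seeded generator so the same seed always gives the same order.
    tests.sort(key=lambda t: t.id())
    random.Random(seed).shuffle(tests)
    return unittest.TestSuite(tests)


def run_order(suite):
    """Return the test ids in the order the suite will run them."""
    return [t.id() for t in suite]
```

A build could pick a fresh seed for each run, print it in the build log, and pass the suite to unittest.TextTestRunner().run(). If a run fails, rerunning with the logged seed replays the same order, which turns an apparently random failure into a repeatable one.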

Other things that came up in discussion or as suggestions:

Test your build scripts. Treat them like other production code.
Have a look at Rake, http://rake.rubyforge.org/ and Buildr, http://incubator.apache.org/buildr/
Write the build in Java or the language of your choice. (I suggested it might be possible to do this by
calling Ant's own code, since Ant is written in Java, to save reimplementing lots of things, i.e. effectively
using Ant, but from a Java program that you can write tests for.)
Know why your build fails.
If something is causing the build to fail randomly, don't use it.
Keep it simple, stupid! (KISS). If the build fails randomly because the configuration, architecture etc. of
your application is really complicated, then make it simpler.
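
To make "test your build scripts" concrete, here is a minimal sketch in Python of a build step written as an ordinary function with a check alongside it. It was not shown in the session. The function names and file names are hypothetical, and the step itself (delete the old deployment, then copy the new one) is only an assumed reading of the "clean undeploy-deploy rather than redeploy" suggestion from the flip chart.

```python
import shutil
import tempfile
from pathlib import Path


def clean_deploy(source_dir, target_dir):
    """Remove any previous deployment, then copy the new build in."""
    target = Path(target_dir)
    if target.exists():
        # Start from nothing so leftovers cannot affect the new deployment.
        shutil.rmtree(target)
    shutil.copytree(source_dir, target)


def check_clean_deploy():
    """Deploy over a directory holding a stale file; return what remains."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "build"
        target = Path(tmp) / "deployed"
        source.mkdir()
        (source / "app.txt").write_text("new")
        target.mkdir()
        (target / "stale.txt").write_text("old")  # left by an earlier deploy
        clean_deploy(source, target)
        return sorted(p.name for p in target.iterdir())
```

Because the step is a plain function, a check like check_clean_deploy() can run alongside the application's own tests, so a change that lets stale files survive a deploy shows up as a consistent failure rather than an occasional one.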

Some points that came up in discussion, including some comments of my own:

Take a zero tolerance approach: treat every failure as a failure rather than ignoring things that look
like a "random failure". Work out how to either stop them happening or make the random failure happen
consistently. The second option is perhaps the more insightful one, and it is the "ah ha" that I experienced
(I think it was from http://citconf.com/wiki/index.php?title=Jason_Sankey ). Many of the participants agreed
that they had seen "random" failures that were actually real errors. Because such failures exist, you need
to make sure you understand what has caused each "random" failure, and it is a good thing to make such a
failure happen more rather than less often if it exposes a real problem. "Random" failures which are not due
to a real error need to be eliminated, so that you really do treat all failures as failures and don't just
press "rebuild" on reflex.
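
One way to make a "random" failure happen consistently, sketched here in Python as an illustration rather than as anything shown in the session: run the suspect check many times, each with its own seeded random generator, and record the seeds that fail. The names find_failing_seeds and flaky_check are hypothetical, and the sketch assumes the randomness can be driven by a generator that the test passes in. Timing and threading problems need other techniques.

```python
import random


def find_failing_seeds(check, seeds):
    """Run check(rng) once per seed and collect the seeds that fail."""
    failing = []
    for seed in seeds:
        rng = random.Random(seed)
        try:
            check(rng)
        except AssertionError:
            failing.append(seed)
    return failing


def flaky_check(rng):
    """Hypothetical check that only fails for some random inputs."""
    values = [rng.randint(0, 9) for _ in range(3)]
    # Hidden assumption: the three values are distinct. Sometimes they
    # are not, which is the "real error" behind the flickering.
    assert len(set(values)) == len(values), values
```

Calling find_failing_seeds(flaky_check, range(200)) gives a list of seeds, and any seed in that list fails every time it is replayed. That is a repeatable failure to investigate instead of a reason to press "rebuild".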