Alan Ogilvie works in a division of IBM responsible for testing IBM’s Java SE product. Some numbers from his presentation:
- A build for testing is about 500MB (takes 17 min to download to a test machine)
- There are 20 different versions (combinations of AIX, Linux, Windows, z/OS with x86, POWER, zSeries)
- The different teams create 80 to 200 builds every day
- The tests run on heaps from 32MB to 500GB
- They use hardware with 1 to 128+ cores
- 4 GC policies
- More than 1000 different combinations of command line options
- Some tests have to be repeated many times to catch rare “1 out of 100” failures
That amounts to millions of test cases that run every month.
1% of them fail.
To tame this beast, the team uses two approaches:
- Automated failure analysis that can match error messages from the test case to known bugs
- Not all of the tests are run every time
The first approach makes sure that most test failures can be handled automatically. If a test exists to trigger a known bug, its failure shouldn’t take any time from a human, unless the test suddenly succeeds.
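The talk summary does not say how the matching is implemented, so the following is only a minimal sketch of the idea, assuming known bugs can be recognized by a regular expression over the failure message. The bug IDs, the patterns, and the names `KnownBug`, `match_known_bug` and `triage` are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class KnownBug:
    bug_id: str          # hypothetical tracker ID
    pattern: re.Pattern  # regex that identifies this bug's failure message


# Hypothetical entries; a real list would come from the bug tracker.
KNOWN_BUGS = [
    KnownBug("BUG-101", re.compile(r"OutOfMemoryError: Java heap space")),
    KnownBug("BUG-202", re.compile(r"timeout after \d+ ms")),
]


def match_known_bug(message: str) -> Optional[KnownBug]:
    """Return the first known bug whose pattern occurs in the message."""
    for bug in KNOWN_BUGS:
        if bug.pattern.search(message):
            return bug
    return None


def triage(passed: bool, message: str, expected_bug: Optional[str] = None) -> str:
    """Return 'ok', 'known:<id>' or 'needs-human' for one test result."""
    if passed:
        # A test that exists to trigger a known bug should not pass silently.
        return "needs-human" if expected_bug else "ok"
    bug = match_known_bug(message)
    return f"known:{bug.bug_id}" if bug else "needs-human"
```

With this shape, only two kinds of results reach a person: failures that match no known bug, and tests tied to a known bug that unexpectedly pass.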
The second approach is more interesting: They run only a small fraction of the tests every time the test suite is started. How can that possibly work?
If you ran a test yesterday and it succeeded, you will have some confidence that it still works today. You’re not 100% sure but, well, maybe 99.5%. So you might skip this test today and mark it as “light green” in the test results (as opposed to “full green” for a test that has actually been run this time).
What about the next day? You’re still 98% sure. And the day after that? Well, our confidence is waning fast, but we’re still pretty sure: 90%.
The same goes for tests that fail. Unless someone did something about them (and requested that this specific test be run again), you can be pretty sure that the test would fail again. So it is marked “light red”, as opposed to “full red” for the tests that actually failed today.
This way, most tests only have to be run once every 4-5 days during development.
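The scheduling rule can be sketched roughly as follows. This is an illustration, not the team's actual implementation: the confidence figures for 1, 2 and 3 days are the ones quoted above, while the 90% rerun threshold, the choice to treat any older result as having no confidence, and the reuse of the same table for failed tests are assumptions. The names `CONFIDENCE_BY_AGE`, `RERUN_THRESHOLD` and `status` are hypothetical.

```python
# Confidence that the last result still holds, by age of that result in days.
# Ages 1 to 3 use the figures from the text; age 0 means "run this time".
CONFIDENCE_BY_AGE = {0: 1.0, 1: 0.995, 2: 0.98, 3: 0.90}

# Assumption: rerun once confidence drops below 90%.
RERUN_THRESHOLD = 0.90


def confidence(age_days: int) -> float:
    """Confidence in a result of the given age; unknown ages count as 0."""
    return CONFIDENCE_BY_AGE.get(age_days, 0.0)


def status(last_passed: bool, age_days: int, rerun_requested: bool = False) -> str:
    """Return 'run' if the test must be executed, else the colour to report."""
    if rerun_requested or confidence(age_days) < RERUN_THRESHOLD:
        return "run"
    if age_days == 0:
        return "green" if last_passed else "red"
    return "light green" if last_passed else "light red"


if __name__ == "__main__":
    for age in range(6):
        print(age, status(True, age))
```

Under these assumptions a passing test is reported as light green for three days and is executed again on the fourth, which is in line with the "once every 4-5 days" figure.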
Why would they care?
For a release, all tests need to be run. That takes three weeks.
They can’t possibly run all tests all the time.