(Finally a tech blog post …)
Stop me if you’ve heard this one before!
The web application has excellent unit and integration test coverage, and the tests run continuously. But some time ago (weeks, actually) a number of tests began failing with strange state errors. In our case, out of 138 test classes and 1176 tests, 82 would error out.
The errors were all strange platform-related things, like:
org.springframework.transaction.IllegalTransactionStateException: Pre-bound JDBC Connection found! JpaTransactionManager does not support running within DataSourceTransactionManager ...
Naturally the failing tests all work when they’re run individually. Heard that one before?
I started out trying different combinations of tests. I could list all the test classes that ran by grepping for “^Running ” in the test output log, and then use the Maven option “-Dtest=TestClassOne,TestClassTwo,…” to run them in different combinations. Most of the time, the erroring tests would work perfectly. When they didn’t, the errors would occur in different tests or be different errors.
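For anyone who wants to script that step, here is a minimal sketch of it in Java. It assumes the log contains one "Running <class name>" line per test class, which is what I was grepping for. The class and method names (RunningTests, runOrder, dTestArgument) and the com.example package are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RunningTests {
    // Matches log lines of the form "Running com.example.SomeTest".
    private static final Pattern RUNNING = Pattern.compile("^Running (\\S+)\\s*$");

    /** Returns the test class names in the order they appear in the log. */
    static List<String> runOrder(List<String> logLines) {
        List<String> classes = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = RUNNING.matcher(line);
            if (m.matches()) {
                classes.add(m.group(1));
            }
        }
        return classes;
    }

    /** Builds a -Dtest=A,B,C argument from the simple (unqualified) class names. */
    static String dTestArgument(List<String> classes) {
        List<String> simpleNames = new ArrayList<>();
        for (String c : classes) {
            simpleNames.add(c.substring(c.lastIndexOf('.') + 1));
        }
        return "-Dtest=" + String.join(",", simpleNames);
    }

    public static void main(String[] args) {
        // Hypothetical log excerpt, not taken from the real build.
        List<String> log = Arrays.asList(
                "Running com.example.TestClassOne",
                "Tests run: 4, Failures: 0, Errors: 0, Skipped: 0",
                "Running com.example.TestClassTwo");
        System.out.println(dTestArgument(runOrder(log)));
        // prints: -Dtest=TestClassOne,TestClassTwo
    }
}
```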
Now the failure was non-deterministic! One of the difficulties was that Maven/Surefire would run the tests in whatever order it wanted to. That approach wasn’t going to work at all.
From studying the Spring reference documentation a little, I understood that Spring caches the application contexts created for tests in order to improve their run time. Wiring up a large application is slow when it’s done once; multiply that by 138 test classes and a slow test suite becomes glacial.
Clearly some test class being run before the erroring tests was corrupting the cached Spring context and ruining the downstream environment. Spring provides a @DirtiesContext annotation specifically to label tests that require Spring to reload the application context. The problem is finding the test doing the dirty work!
I needed to make the test runs deterministic: run the test classes in the same order, and start eliminating classes one at a time from the top of the order. The Surefire version we were using didn’t have a property to exclude a test on the command line, so it required editing the POM file to exclude each test class in turn from the top.
It was a tedious task, as hunting hidden software problems often is. I had to keep careful track of the list of test classes, and change the <exclude>TestClassExample</exclude> element for each test run. Fortunately each test run only required about two and a half minutes. After each run with one class excluded, I would examine the final result line for any change.
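I did all of this by hand, but the search itself is simple enough to sketch in Java. In the sketch below, the suitePasses predicate is a hypothetical stand-in for "edit the exclude, run Maven, read the final result line", and the class names are placeholders. It assumes a single class is responsible, which is what I was betting on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class CulpritSearch {
    /**
     * Excludes one class at a time, working down from the top of the fixed
     * run order, and returns the first class whose exclusion makes the suite
     * pass. Returns null if no single exclusion fixes the run.
     */
    static String findCulprit(List<String> runOrder, Predicate<List<String>> suitePasses) {
        for (String candidate : runOrder) {
            List<String> remaining = new ArrayList<>(runOrder);
            remaining.remove(candidate); // same order, minus one class
            if (suitePasses.test(remaining)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> order = Arrays.asList("TestClassOne", "TestClassTwo", "TestClassThree");
        // Fake suite for the demo: it passes only when TestClassTwo is left out.
        Predicate<List<String>> fakeSuite = classes -> !classes.contains("TestClassTwo");
        System.out.println(findCulprit(order, fakeSuite));
        // prints: TestClassTwo
    }
}
```

At two and a half minutes per run, each call to the real predicate is expensive, so checking from the top only pays off if the culprit really does run early.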
I was pretty confident that the culprit had to be an early test in the sequence, so I should only have to go about halfway through the successful test classes. Finally, thirty-four classes into the list, I had my culprit.
Ironically enough, the test class causing all the problems was named TestSpringConfigurations. It had two tests that would simply verify that all of our wiring would successfully produce an application context. Marking the tests with the @DirtiesContext annotation made all of the erroring tests that followed run successfully.
Actually the @DirtiesContext annotation wasn’t necessary. The tests themselves included one fatal line: context.close(). Once the tests stopped closing the contexts after loading them, the cached application context was just fine for all the following tests.
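To make the mechanism concrete, here is a toy model in plain Java. This is not Spring’s actual implementation, just a sketch of the idea under the assumption that one context instance is shared per configuration. FakeContext, ContextCache, markDirty and the applicationContext.xml name are all invented for the example; markDirty only loosely mirrors what @DirtiesContext asks for.

```java
import java.util.HashMap;
import java.util.Map;

class ContextCacheSketch {
    /** Stand-in for an application context; it only tracks whether it was closed. */
    static class FakeContext {
        private boolean open = true;

        void close() {
            open = false;
        }

        String getBean(String name) {
            if (!open) {
                throw new IllegalStateException("context closed: cannot get " + name);
            }
            return name + "-bean";
        }
    }

    /** Toy cache: one shared context per configuration location. */
    static class ContextCache {
        private final Map<String, FakeContext> cache = new HashMap<>();
        int loads = 0; // how many times a context had to be built

        FakeContext get(String configLocation) {
            return cache.computeIfAbsent(configLocation, k -> {
                loads++;
                return new FakeContext();
            });
        }

        /** Evicts the context so the next caller gets a freshly built one. */
        void markDirty(String configLocation) {
            cache.remove(configLocation);
        }
    }

    public static void main(String[] args) {
        ContextCache cache = new ContextCache();

        // An early test loads the context, checks it, and closes it.
        cache.get("applicationContext.xml").close();

        // A later test receives the same, now closed, instance.
        try {
            cache.get("applicationContext.xml").getBean("dataSource");
        } catch (IllegalStateException e) {
            System.out.println("later test fails: " + e.getMessage());
        }

        // Evicting the entry forces a reload, and the next test works again.
        cache.markDirty("applicationContext.xml");
        System.out.println(cache.get("applicationContext.xml").getBean("dataSource"));
        System.out.println("loads = " + cache.loads);
    }
}
```

The two fixes correspond to the two options in the model: either tell the cache to drop the context after the offending test, or simply never close a context that the cache still hands out.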
One might argue that this class is pointless when run as part of a large test suite, since earlier tests have already loaded the application context. When Surefire arrives at TestSpringConfigurations, it is only using the already-cached context rather than loading a new one. Good point. But having the test in the suite also gives us a quick way to verify changes made to the application context configuration without running the whole set.
And finally the punch line: When the Spring configuration errors were finally vanquished, four test failures were revealed that were legitimately testing application code. Those test failures were completely masked by the Spring context corruption error.
Oh yes, that TestSpringConfigurations class has been in the suite for many months. Why did we only recently find it causing this corruption? No one here is quite positive, but the only major platform change we can point to is a switch from Java 5 to a Java 6 runtime. Maybe that triggered the original problem, and maybe sometime I’ll be interested enough to test that proposition.