Recently in our current project we ran up against a luxury problem: the number of our tests has grown so much that it has become infeasible to run them on a per-commit basis.
There are also so-called 'long running' tests that exercise processes with a standard duration of several minutes [1]. We actually want to run all tests on all builds, but we were faced with a few hurdles.
The build used for the tests is big - meaning it contains the whole system and thus has among the slowest build times. Running the tests in the same build job only exacerbates the problem, blowing the self-imposed 10-minute limit sky high.
Currently the "monster" build (all platforms, all components) clocks in at ~30 minutes under high load or ~12 minutes if it gets the build server to itself [2]. The build required for tests comes second at ~8 minutes.
Throw hardware at the problem
The conventional approach fits our problem well: split the tests into separate test suites and throw hardware at the problem. Since we are using VMs for our development environment, it is actually very easy to instantiate a VM for every test suite and let the testing commence in parallel (for other reasons we cannot run test suites in parallel in the same VM).
Where it gets a bit complicated is coordinating the builds and the tests. We use Jenkins for build management, and it is feasible to create a master-slave configuration that feeds the build package to the test jobs, but it still strikes me as unnecessarily complicated [3].
It also means that the data on the dependencies between the different build jobs is distributed across the Jenkins instances. I have a particular aversion to this type of configuration: ideally any job management system should be able to function without built-in knowledge of its place in the system. It just needs to know where to go for the information, and we get to manage it centrally [4].
What do you actually want?
In order to kick off the tests all we need is a URL pointing to the build package and a bit of metadata, in this case the revision the build corresponds to.
Once the tests are executed we need to know the status of the tests for the revision and a URL pointing to the detailed logs.
Given that we want to test all change sets and that test suites have greatly varying execution times, we want to know as soon as possible which revision breaks a test suite...per test suite.
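To illustrate how little data is involved, the two messages could be plain JSON documents. This is only a sketch: the field names (`url`, `revision`, `suite`, `status`, `log_url`), the helper names and the example.com URLs are assumptions made for illustration, not a format that rplex prescribes.

```ruby
require 'json'

# Hypothetical message announcing a finished build to the testers.
def build_message(url, revision)
  JSON.generate('url' => url, 'revision' => revision)
end

# Hypothetical message reporting the outcome of one test suite for one revision.
def result_message(suite, revision, passed, log_url)
  JSON.generate(
    'suite'    => suite,
    'revision' => revision,
    'status'   => passed ? 'passed' : 'failed',
    'log_url'  => log_url
  )
end

# Round trip: what the build side sends and what a tester would send back.
build  = JSON.parse(build_message('http://builds.example.com/system-1234.zip', '1234'))
result = JSON.parse(result_message('long_running', build['revision'], false,
                                   'http://logs.example.com/long_running/1234'))
```

Carrying the suite name in the result is what would let a radiator show the first broken revision per test suite.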
Please join the queue in an orderly fashion
So we want to distribute one build to N testers without knowing when they are going to be available to process it. Sounds like a background job thingy, doesn't it?
Set up a queue for each tester, publish the build information (URL & revision) and let the tester grab the build and do its thing. Fire and forget!
Still too complicated. The existing background job libraries need Redis, or Rails, or some SQL database, or they tie you to a POSIX OS [5]. You get all kinds of goodies like priority scheduling, persistence and monitoring (Resque is my particular favorite, with its nifty web GUI), but honestly this is the best case of YAGNI I have yet encountered.
Not created here
Why can't I find something simple, with minimal dependencies, that will run everywhere? Probably because it takes less time to build it than to look for it!
How it works
The rplex service waits. In Jenkins, at the end of the build, we simply post the data to rplex.
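The idea of one queue per tester with fire-and-forget posting can be sketched in a few lines. The class below is a hypothetical in-memory stand-in written for illustration, not rplex's actual code or API; the class name, method names and worker names are all assumptions, and it ignores thread safety and persistence.

```ruby
# Hypothetical in-memory stand-in for the queue service (not rplex itself):
# one first-in, first-out list of pending messages per worker.
class FanOutQueues
  def initialize(workers)
    @queues = {}
    workers.each { |name| @queues[name] = [] }
  end

  # Copy the message into the queue of every targeted worker.
  # Without explicit targets the message goes to all known workers.
  def post(message, targets = @queues.keys)
    targets.each { |name| @queues.fetch(name).push(message.dup) }
  end

  # Hand the oldest pending message to the named worker, or nil if none waits.
  def fetch(worker)
    @queues.fetch(worker).shift
  end

  # Number of messages the named worker has not picked up yet.
  def backlog(worker)
    @queues.fetch(worker).size
  end
end

queues = FanOutQueues.new(%w[integration system radiator])
build = { 'url' => 'http://builds.example.com/system-1234.zip', 'revision' => '1234' }

# Build side: announce the build to the two test workers only.
queues.post(build, %w[integration system])
```

Each worker drains its own queue at its own pace, so a slow suite never holds up a fast one.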
On the tester side an endless loop (implemented in Rplex::Processor) runs the tests whenever a build appears.
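The shape of such a loop might look roughly like the following. This is not the `Rplex::Processor` implementation: `process_builds`, its parameters and the `max_polls` escape hatch (there only so the example terminates) are hypothetical names chosen for this sketch.

```ruby
# Hypothetical polling loop: ask for a job, run the block on it, repeat.
# fetch_job is any callable returning the next job or nil when idle.
# With max_polls left at nil the loop never ends.
def process_builds(fetch_job, idle_sleep: 5, max_polls: nil)
  processed = 0
  polls = 0
  while max_polls.nil? || polls < max_polls
    polls += 1
    job = fetch_job.call
    if job
      yield job            # run the test suite against this build
      processed += 1
    else
      sleep(idle_sleep)    # nothing queued, wait before asking again
    end
  end
  processed
end

# Usage with an in-memory list standing in for the queue service.
pending = [{ 'revision' => '1233' }, { 'revision' => '1234' }]
tested = []
count = process_builds(-> { pending.shift }, idle_sleep: 0, max_polls: 3) do |job|
  tested << job['revision']
end
```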
How do we get the test results? Same concept: At the end of the test run we post the data back to rplex, this time targeting a single rplex client. That process just updates our information radiator.
We have a whole bunch of rplex clients now but nowhere near the maintenance overhead of Jenkins jobs. The process looks like this:
- Start a VM
- Start the rake task with the Rplex::Processor
- Do something else
Fiddling with the data format allows us to do everything we need without touching rplex. If we get a huge backlog for slow test suites, we simply restart rplex and empty the queues [6].
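Restarting the service is the blunt way to clear a backlog. A gentler variant, offered here purely as an assumption and not as something rplex does, would collapse a worker's queue to its newest entry, since only the last revision matters once patches pile up. The helper name and the in-memory array are hypothetical.

```ruby
# Hypothetical helper: keep only the most recently queued build and
# report how many older entries were discarded.
def keep_latest!(queue)
  return 0 if queue.size <= 1

  dropped = queue.size - 1
  queue.replace([queue.last])
  dropped
end

# Five builds queued in arrival order for a slow test suite.
backlog = %w[1230 1231 1232 1233 1234].map { |rev| { 'revision' => rev } }
dropped = keep_latest!(backlog)
```

This assumes builds are queued in revision order, so the last entry is also the newest revision.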
Come to think of it, I could replace Jenkins completely using rplex as the core...hmmmm!
[1] We're talking integration and system tests here. Where unit tests are concerned we exert draconian control to ensure fast build times.
[2] Network based license schemes for compilers suck big time!
[3] Some initial problems with setting up master-slave in Jenkins with our firewall and network zone configuration did not help.
[4] Which means we get to version control a single source of configuration and handle redundancy by cloning instances.
[5] Oh Windows! Thou art the bane of my developer life!
[6] In theory you would not want to leave any build untested. But when development is in full swing and you get 5 or 10 patches in the space of 5 minutes - just because somebody was careless - you only care about the last one.