The OpenStack project has a really impressive continuous integration system, which is one of its core strengths as a project. Every proposed change to our gerrit review system is subjected to a battery of tests on each commit, which has grown dramatically with time, and after formal review by core contributors, we run them all again before the merge.
These tests take on the order of 1 hour to run on a commit, which would make you immediately think the most code that OpenStack could merge in a day would be 24 commits. So how did Nova itself manage to merge 94 changes since Monday (not to mention all the other projects, which adds up to ~200 in 3 days)? The magic of this is Zuul, the gatekeeper.
Zuul is a queuing system for CI jobs, written and maintained by the OpenStack infrastructure team. It does many cool things, but what I want to focus on is the gate queue. When the gate queue is empty (yes it does happen some times), the job is simple: add a new commit, run the tests, and we're off. What happens if there are already 5 jobs ahead of you in the gate? Let's take a concrete example of nova.
By the time a commit has gotten this far, it's already passed the test suites at least once, and has had at least 2 core contributors sign off on the change in code review. So Zuul assumes everything ahead of the change in the gate will succeed, and starts the tests immediately cherry picking this change on top everything that's ahead of it in the queue.
That means that merge time on the gate is O(1), that is merging 10 changes takes the same time as 1 change. If the queue gets too big, we do eventually run out of devstack nodes, so the ability to run tests is not strictly constant time. On the run up to grizzly-3 both the cloud providers (HP and Rackspace) which contribute these VMs provided some extra quota to the OpenStack team to help keep things moving. So we had an elastic burst of OpenStack CI onto additional OpenStack public cloud resources, which is just fun to think about.
Speculation Can Fail
Of course, speculation can fail. Maybe change 3 doesn't merge because something goes wrong in the tests. If that happens we then kick the change out of the queue, and then all the changes behind it have to be reset to pull change 3 out of the speculation. This is the dreaded gate reset, because when gate resets happen, all the time spent on speculative tests behind the failure is lost, and the jobs need to restart.
Speculation failures largely fall into a few core classes:
Jenkins crashes - it doesn't happen often, but Jenkins is software too, and OpenStack CI tends to drive software really hard, so we force out edge cases everywhere.
Upstream service failures - we try to isolate ourselves from upstream failures as much as possible. Our git trees pull from our gerrit, not directly from github. Our apt repository is a Rackspace local mirror, not generically upstream. And the majority of pip python packages come from our own proxy server. But if someone adds a new python dependency, or a version of one updates and we don't yet have it cached, we pass through to pypi for that pip install. On Tuesday pypi converted from HTTP to HTTPS, and didn't fully grok the load implications, which broke OpenStack CI (as well as lots of other python developers) for a few hours when pypi effectively was down from load.
Transient OpenStack bugs - OpenStack is complicated software, 7 core components interacting with each other asynchronously over REST web services. Each core component being a collection of daemons that interact with each other asynchronously. Sometimes, something goes wrong. It's a real bug, but only shows up under very specific timing and state conditions. Because OpenStack CI runs so many tests every day (OpenStack CI may be one of the largest creators of OpenStack guests in the world every day), very obscure edge and race conditions can be exposed in the system. We try to track these as recheck bugs, and are making them high priority to address. By definition they are hard to track down (they expose themselves on maybe 1 out of 1000 or fewer test runs), so the logs captured in OpenStack CI are the tools to get to the bottom of these.
Towards an Even Better Gate
In my year working on OpenStack I've found the unofficial motto of the project to be "always try to make everything better". Continuous improvement is not just left to the code, and the tests, but the infrastructure as well.
We're trying to get more urgency and eyes on the transient failures, coming up with ways to discover the patterns from the 1 in 1000 fails. After you get two or three that fail in the same way it helps triangulate the core issue. Core developers from all the projects are making these high priority items to fix.
On the upstream service failures the OpenStack infrastructure team already has proxies sitting in front of many of the services, but the pypi outage showed we probably need something even more robust to handle that upstream service outage, possibly rotating between pypi mirrors on the fall-through case, or a better proxy model. The team is already actively exploring solutions to prevent that from happening again.
As always, everyone is welcomed to come help us make everything better. Take a look at the recheck bugs and help us solve them. Join us on #openstack-infra and help with Zuul. Check out what the live Zuul queue looks like. All the code for this system is open source, and available under either the openstack, or openstack-infra github accounts. Patches are always welcome!