OpenStack Havana – the Quality Perspective

Like a lot of others, I’m currently trying to catch my breath after the incredible OpenStack Havana release. One of the key reasons that OpenStack is able to evolve as fast as it does without the whole thing falling apart is the incredible preemptive integration gate that we have (think continuous integration++).

In Havana, beyond just increasing the number of tests we run, we made some changes in the nature of what we do in the gate. These changes are easy to overlook, so I wanted to highlight some of my favorites, and give a perspective on everything that’s going on behind the scenes when you try to land code in OpenStack.

Parallel Test Runner

Every proposed commit to an OpenStack project needs to survive being integrated into a single node devstack install and hit with 1300 API & integration tests from Tempest, but until Havana these were run serially. Right before the Havana 3 milestone we merged parallel Tempest testing for most of our jobs. This cut their run time in half, but more importantly it meant all our testing was defaulting to 4 simultaneous requests, as well as running every test under tenant isolation, where a separate tenant is created for every test group. Every time you ratchet up testing like this you expose new race conditions, which is exactly what we saw. That made for a rough RC phase (the gate was a sad panda for many days), but everyone buckled down to fix these new issues, which had previously been visible only to large OpenStack installations. The result: everyone wins.

This work was a long time coming: it was started in the Grizzly cycle by Chris Yeoh and spearheaded to completion by Matt Treinish.
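To make the combination of parallelism and tenant isolation concrete, here is a toy sketch in Python. It is not how Tempest is implemented; `run_group`, `run_parallel`, and the tenant naming scheme are hypothetical names chosen for illustration. The only details taken from the description above are the 4 simultaneous workers and the fresh tenant per test group.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

WORKERS = 4  # the gate defaulted to 4 simultaneous requests


def run_group(group_name, tests):
    """Run one test group inside its own throwaway tenant (hypothetical)."""
    # A fresh tenant name per group, so groups cannot trip over shared state.
    tenant = f"{group_name}-{uuid.uuid4().hex[:8]}"
    results = {}
    for test_name, test_fn in tests.items():
        try:
            test_fn(tenant)
            results[test_name] = "pass"
        except AssertionError:
            results[test_name] = "fail"
    return tenant, results


def run_parallel(groups):
    """Run test groups concurrently, at most WORKERS at a time."""
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        futures = [
            pool.submit(run_group, name, tests)
            for name, tests in groups.items()
        ]
        # Collect results in submission order.
        return [future.result() for future in futures]
```

Because each group gets its own tenant, any failure that only appears under `run_parallel` is more likely to be a genuine race in the service under test than two tests fighting over the same resources.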

Large Ops Testing

A really clever idea was spawned this summer by Joe Gordon: could we actually manage to run Tempest tests on a devstack with a fake virt driver that would always “succeed” and do so instantaneously? In doing so we could turn the pressure up on the control plane in OpenStack without the overhead of real virt drivers slowing down control plane execution enough that bugs could hide. Again, the first time we cranked this to 11, lots of interesting results fell out, including some timeout and deadlock situations. All hands went on deck, the issues were addressed, and now Large Ops Testing is part of our arsenal, run on every single proposed commit.
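The idea can be sketched in a few lines of Python. This is a simplified illustration, not the real driver interface: `FakeVirtDriver`, `ControlPlane`, and `boot_many` are hypothetical stand-ins, and the real control plane involves far more than a dictionary of records.

```python
class FakeVirtDriver:
    """Hypothetical driver: every operation 'succeeds' immediately."""

    def __init__(self):
        self.instances = {}

    def spawn(self, instance_id):
        # No hypervisor call, no image download: just record success.
        self.instances[instance_id] = "ACTIVE"

    def destroy(self, instance_id):
        self.instances.pop(instance_id, None)


class ControlPlane:
    """Stand-in for the scheduling and bookkeeping code under test."""

    def __init__(self, driver):
        self.driver = driver
        self.records = {}

    def boot(self, instance_id):
        self.records[instance_id] = "BUILD"
        self.driver.spawn(instance_id)
        self.records[instance_id] = "ACTIVE"

    def delete(self, instance_id):
        self.driver.destroy(instance_id)
        del self.records[instance_id]


def boot_many(plane, count):
    """Boot `count` instances back to back and return how many are tracked."""
    for i in range(count):
        plane.boot(f"vm-{i}")
    return len(plane.records)
```

With the driver out of the way, `boot_many(ControlPlane(FakeVirtDriver()), 1000)` spends all of its time in the bookkeeping code, which is where the pressure is meant to land.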

Upgrade Testing

Most people familiar with OpenStack are familiar with Devstack, the opinionated installer for OpenStack from upstream git. Devstack actually forms the base of our QA system, because it can build a single node environment from git trees. Lesser known is its sister tool, Grenade. Grenade uses 2 devstack trees (the last stable and master) to build an OpenStack at the previous version, inject some data, then shut down everything, and try to restart it with the latest version of OpenStack. This ensures that config files roll forward smoothly (or have specific minimal upgrade scripts in Grenade), that database schemas roll forward smoothly, and that we don’t violate certain deprecation guarantees.

Grenade was created by Dean Troyer. I did a lot of work towards the end of Grizzly to get it in the gate, and Adalberto Medeiros took it the final mile in Havana and got it running on every proposed commit.
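The shape of that upgrade check (install old, inject data, shut down, start new, verify) can be sketched as a small Python model. This is only an illustration of the sequence, not Grenade itself, which drives real devstack trees; the dictionary "database", the schema numbers, and every function name here are hypothetical.

```python
LATEST_SCHEMA = 2


def migrate_1_to_2(db):
    # Hypothetical schema change: each server record gains a 'status' field.
    for server in db["servers"].values():
        server.setdefault("status", "ACTIVE")
    db["schema"] = 2


# Maps a schema version to the function that upgrades it by one step.
MIGRATIONS = {1: migrate_1_to_2}


def install_old():
    """Stand-in for building the previous stable release."""
    return {"schema": 1, "servers": {}, "running": True}


def inject_data(db):
    """Create a resource that must survive the upgrade."""
    db["servers"]["vm-1"] = {"name": "canary"}


def shut_down(db):
    db["running"] = False


def start_new(db):
    """Stand-in for starting the latest code against the old data."""
    while db["schema"] < LATEST_SCHEMA:
        MIGRATIONS[db["schema"]](db)
    db["running"] = True


def upgrade_check():
    db = install_old()
    inject_data(db)
    shut_down(db)
    start_new(db)
    survived = db["servers"]["vm-1"]["name"] == "canary"
    return db["running"] and survived and db["schema"] == LATEST_SCHEMA
```

The check passes only if the data injected before the shutdown is still intact once the new version is running on the migrated schema.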

New Tools for an Asynchronous World

September was the 30th anniversary of the GNU project. I remember some time in the late 90s reading or watching something about Richard Stallman and GNU Hurd. The biggest challenge of building a system with dozens of daemons sending asynchronous messages is having any idea what broke when something goes wrong. They just didn’t have the tools or methods to make consistent forward progress. Linux emerged with a simpler model which could make progress, and the rest is history.

If you zoom out on OpenStack, this is exactly what we are building: a data center OS microkernel. And as I can attest, debugging is often “interesting”. Without the preemptive integration system, we’d never be able to keep up our rate of change. However, as the number of integrated projects has increased, we’ve definitely seen emergent behavior that is not straightforward to track down.

Jobs in our gate will fail, seemingly at random. People unfamiliar with the situation will complain about “flaky tests” or a “flaky gate”, and just recheck their patch and see it pass on the second attempt. Most of the time neither the gate nor the tests are to blame, but the core of OpenStack itself. We managed to trigger a race condition that shows up maybe 1% of the time in our configuration. We have moved to a world where test results aren’t binary, pass or fail, but are better classified with a race percentage.
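A little arithmetic shows why a 1% race is not a small problem. The sketch below assumes each run hits the race independently with the same probability, which is a simplification; the function names are hypothetical.

```python
def failure_chance(race_rate, runs):
    """Chance that at least one of `runs` independent runs hits the race."""
    return 1 - (1 - race_rate) ** runs


def observed_race_rate(failures, total_runs):
    """Estimate a race percentage from how often a failure has been seen."""
    if total_runs == 0:
        raise ValueError("need at least one run to estimate a rate")
    return failures / total_runs
```

Under that assumption, a race that fires 1% of the time is hit at least once in roughly 9.6% of batches of 10 runs (`failure_chance(0.01, 10)`), and a handful of such races quickly adds up to a gate that fails "at random" many times a day.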

This is a problem we’ve been mulling over for nearly a year, and the solution which has been created is ElasticRecheck, a toolchain that uses Elasticsearch on our test logs to check new failures against known failures. While finding a “fingerprint” for a failure is still a manual step, the tool was of dramatic benefit to the release process. It got us out of thinking that there were only a couple of race conditions we were hitting, and into realizing there were dozens of very specific races, each with its own fix. It also gave us a systematic way of determining which race conditions were most impacting us, so they could be prioritized and fixed.
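The fingerprint idea can be illustrated with a toy Python sketch. This is not ElasticRecheck's code: the real tool queries Elasticsearch, whereas this version just runs regular expressions over log text held in memory. The bug labels and log messages are made up for the example.

```python
import re
from collections import Counter

# Hypothetical fingerprints: a known bug label mapped to a log pattern.
FINGERPRINTS = {
    "bug-A": re.compile(r"Timed out waiting for server .* to become ACTIVE"),
    "bug-B": re.compile(r"Lock wait timeout exceeded"),
}


def classify(log_text):
    """Return the known bugs whose fingerprint appears in a failed job's log."""
    return sorted(
        bug for bug, pattern in FINGERPRINTS.items()
        if pattern.search(log_text)
    )


def rank_races(failed_logs):
    """Count how often each known bug is hit, most frequent first."""
    counts = Counter()
    for log_text in failed_logs:
        matches = classify(log_text)
        # Failures with no matching fingerprint still need a human to look.
        counts.update(matches or ["unclassified"])
    return counts.most_common()
```

Ranking failures this way is what turns "the gate is flaky" into a prioritized list of specific races, plus a pile of unclassified failures that still need fingerprints written for them.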

This work was spearheaded by Joe Gordon and Matt Treinish, and leveraged some background work that Clark Boylan and I had done early in the cycle. ElasticRecheck is exciting enough technology all by itself that it deserves its own detailed dive, but that is for another day.

And many more…

These are just some of the sexiest highlights from the Havana release on the quality front.

The number of tests in Tempest that we run on every proposed patch has risen from 800 to 1300 during the cycle. This included new scenarios and a massive enhancement on coverage in all our services. 100 different developers contributed to Tempest during the Havana release (up from 60 in the Grizzly release), enhancing our integration suite. We’ve got a new stress framework which can provide load generation to burn in your cloud, which I expect will make an appearance in our gate during Icehouse.

The point being, lots of people, from lots of places, contributed heavily to make the Havana release the most solid release we’ve ever had from OpenStack. They did this not just with new features that make for good press releases, but also with contributions to the overall system that validates our software not once a day, not even once an hour, but on every single proposed patch.

So to everyone who contributed to this extraordinary effort: THANK YOU!

And I look forward, excitedly, to what we’ll create for the Icehouse release.

