OpenStack Failures

Last week we had the bulk of the brain power of the OpenStack QA and Infra teams all in one room, which gave us a great opportunity to spend a bunch of time diving deep into the current state of the Gate, figure out what's going on, and how we might make things better.

Over the course of 45 minutes we came up with this picture of the world.


We have a system that's designed to merge good code, and keep bugs out. The problem is that while it's doing a great job of keeping big bugs out, subtle bugs, ones that are low percentage (like show up in only 1% of test runs) can slip through. These bugs don't go away, they instead just build up inside of OpenStack.

As OpenStack expands in scope and function, these bugs increase as well. They might grow or shrink based on seemingly unrelated changes, dependency changes (which we don't gate on), timing impacts by anything in the underlying OS.

As OpenStack has grown no one has a full view of the system any more, so even identifying that a bug might or might not be related to their patch is something most developers can't do. The focus of an individual developer is typically just wanting to land their code, not diving into the system as a whole. This might be because they are on a schedule, or just that landing code feels more fun and productive, than digging into existing bugs.

From a social aspect we seem to have found that there is some threshold failure rate in the gate that we always return to. Everyone ignores base races until we get to that failure rate, and once we get above it for long periods of time, everyone assumes fixing it is someone else's responsibility. We had an interesting experiment recently where we dropped 300 Tempest tests in turning off Nova v3 by default, which gave us a short term failure drop, but within a couple months we're back up to our unpleasant failure rate in the gate.

Part of the visibility question is also that most developers in OpenStack don't actually understand how the CI system works today, so when it fails, they feel powerless. It's just a big black box blocking their code, and they don't know why. That's incredibly demotivating.

Towards Solutions

Every time the gate fail rates get high, debates show up in IRC channels and on the mailing list with ideas to fix it. Many of these ideas are actually features that were added to the system years ago. Some are ideas that are provably wrong, like autorecheck, which would just increase the rate of bug accumulation in the OpenStack code base.

A lot of good ideas were brought up in the room, over the next week Jim Blair and I are going to try to turn these into something a little more coherent to bring to the community. The OpenStack CI system tries to be the living and evolving embodiment of community values at any point in time. One of the important things to remember is those values aren't fixed points either.

The gate doesn't exist to serve itself, it exists because before OpenStack had one, back in the Diablo days, OpenStack simply did not work. HP Cloud had 1000 patches to Diablo to be able to put it into production, and took 2 years to migrate from it to another version of OpenStack.

CC BY 4.0
This work is licensed under a Creative Commons Attribution 4.0 International License.

One thought on “OpenStack Failures”

Comments are closed.