
OpenStack Failures

Last week we had the bulk of the brain power of the OpenStack QA and Infra teams all in one room, which gave us a great opportunity to spend a bunch of time diving deep into the current state of the Gate, figuring out what's going on, and thinking about how we might make things better.

Over the course of 45 minutes we came up with this picture of the world.

[Whiteboard photo: our diagram of the gate system]

We have a system that's designed to merge good code and keep bugs out. The problem is that while it does a great job of keeping big bugs out, subtle, low-frequency bugs (ones that show up in only 1% of test runs, say) can slip through. These bugs don't go away; they just build up inside of OpenStack.

As OpenStack expands in scope and function, these bugs increase as well. They might grow or shrink based on seemingly unrelated changes, dependency changes (which we don't gate on), or timing impacts from anything in the underlying OS.

As OpenStack has grown, no one has a full view of the system any more, so even identifying whether a bug might be related to their patch is something most developers can't do. The focus of an individual developer is typically just landing their code, not diving into the system as a whole. That might be because they are on a schedule, or just because landing code feels more fun and productive than digging into existing bugs.

From a social aspect, we seem to have found that there is some threshold failure rate in the gate that we always return to. Everyone ignores the races until we get to that failure rate, and once we are above it for long periods of time, everyone assumes fixing it is someone else's responsibility. We had an interesting experiment recently where we dropped 300 Tempest tests by turning off Nova v3 by default, which gave us a short-term drop in failures, but within a couple of months we were back up to our unpleasant failure rate in the gate.

Part of the visibility question is also that most developers in OpenStack don’t actually understand how the CI system works today, so when it fails, they feel powerless. It’s just a big black box blocking their code, and they don’t know why. That’s incredibly demotivating.

Towards Solutions

Every time the gate failure rate gets high, debates show up in IRC channels and on the mailing list with ideas to fix it. Many of these ideas are actually features that were added to the system years ago. Some are ideas that are provably wrong, like autorecheck, which would just increase the rate of bug accumulation in the OpenStack code base.

A lot of good ideas were brought up in the room; over the next week Jim Blair and I are going to try to turn them into something a little more coherent to bring to the community. The OpenStack CI system tries to be the living and evolving embodiment of community values at any point in time. One of the important things to remember is that those values aren't fixed points either.

The gate doesn't exist to serve itself; it exists because before OpenStack had one, back in the Diablo days, OpenStack simply did not work. HP Cloud had 1000 patches to Diablo to be able to put it into production, and it took 2 years to migrate from it to another version of OpenStack.

IPython Notebook Experiments

A week of vacation at home means some organizing, physical and logical, some spending time with friends, and some just letting your brain wander on whatever it wants to. One of the problems I've been scratching my head over is having a sane basis for doing data analysis for elastic-recheck, our tooling for automatically categorizing races in the OpenStack code base.

Right before I went on vacation I discovered Pandas, the Python data analysis library. The online docs are quite good. However, for something this dense, having a great O'Reilly book is even better. It has a bunch of really good examples working with public data sets, like the census naming data. It also has very detailed background on the IPython Notebook framework, which is used for the whole book, and which is frankly quite amazing. It brought back the best of my physics days using Mathematica.

With the notebook server, IPython isn't just a better interactive Python shell. It's also a powerful web UI, including Python autocomplete. There is even good emacs integration, which includes support for the inline graphing toolkit. Anything that's created in a cell is available to future cells, and cells are individually executable. In the example above, I'm setting up the basic JSON return from Elastic Search, which I only need to do once after starting the notebook.
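As a rough sketch (not the exact cell from the screenshot), that first cell might look something like the following, assuming the requests library and a placeholder Elastic Search endpoint and query:

    # First cell: fetch the raw JSON from Elastic Search once, then
    # reuse the result in later cells.
    import json
    import requests

    # Placeholder endpoint and query; the real elastic-recheck data sits
    # behind the OpenStack logstash Elastic Search cluster.
    ES_URL = "http://localhost:9200/logstash/_search"
    query = {
        "size": 10000,
        "query": {"query_string": {"query": "build_status:*"}},
    }

    response = requests.post(ES_URL, data=json.dumps(query),
                             headers={"Content-Type": "application/json"})
    hits = response.json()["hits"]["hits"]  # one dict per matching document
    len(hits)  # last line of the cell, so the notebook echoes it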

Pandas is all about data series. It's really a mere-mortals interface on top of numpy, with a bunch of statistics and time series convenience functions added in. You'll find yourself doing data transforms a lot in it. Like my college physics professors used to say, all problems are trivial in the right coordinate space. Getting there is the hard part.
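As a toy illustration of what a transform means here (unrelated to the gate data), it's usually just a vectorized expression or a convenience method on a Series:

    import pandas as pd

    s = pd.Series([200, 180, 220, 240],
                  index=pd.date_range("2014-07-01", periods=4))
    pct = s / s.max()               # elementwise math, numpy style
    smooth = pct.rolling(2).mean()  # one of the time series conveniences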

With the Elastic Search data, a bit of massaging is needed to get a list of dictionaries that is easily convertible into a Pandas data set. In order to do interesting time series things I also needed to create a new column that was a datetime conversion of @timestamp, and pivot it out into an index.
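Roughly, and assuming the hits variable from the earlier cell plus the field names elastic-recheck uses (@timestamp, build_name, build_status), that massaging looks something like:

    import pandas as pd

    # Flatten the Elastic Search hits into plain dicts Pandas understands.
    records = [hit["_source"] for hit in hits]
    df = pd.DataFrame(records)

    # @timestamp comes back as a string; convert it to real datetimes
    # and pivot it out into the index so time series operations work.
    df["date"] = pd.to_datetime(df["@timestamp"])
    df = df.set_index("date")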

You also get a good glimpse of the output facilities. By default the last line of an In[] block is output to the screen. There is a nice convenience method called head() that gives you a summary view (useful for sanity checking). Also, this data actually has about 20 columns, so before sanity checking I sliced it down to the 2 relevant ones just to make the output easier to grok.
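Something like this, where build_name and build_status are the two columns of interest (again, assumed names):

    # Slice down to the two relevant columns before sanity checking;
    # the full frame has ~20 columns and is hard to read otherwise.
    slim = df[["build_name", "build_status"]]
    slim.head()  # last line of the cell, rendered as a small table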

It took a couple of days to get this far. Again, this is about data transforms, and figuring out how to get from point A to point Z. That might include building an index, doing a transform on it (to reduce the resolution to the day level), then resetting the index, building some computed columns, rolling everything back up in groupby clauses to compute the total number of successes and runs for each job on a given day, and doing another computed column in this format. Here I'm also slicing out only the jobs that didn't have a 100% success rate.
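Condensed into a single cell, and still under the same assumed column names, that pipeline looks roughly like this (the real notebook spreads it across several cells):

    # Reset the index so the timestamp is an ordinary column again,
    # then reduce it to day resolution.
    daily = slim.reset_index()
    daily["day"] = daily["date"].apply(lambda ts: ts.date())

    # Computed column: 1 for a successful run, 0 otherwise.
    daily["success"] = (daily["build_status"] == "SUCCESS").astype(int)

    # Roll everything back up: total runs and successes per job per day.
    grouped = daily.groupby(["build_name", "day"])["success"].agg(["count", "sum"])
    grouped.columns = ["runs", "successes"]

    # Another computed column, the per-day success rate, then keep only
    # the jobs that were not at 100%.
    grouped["rate"] = grouped["successes"] / grouped["runs"]
    failures = grouped[grouped["rate"] < 1.0].reset_index()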

And nothing would be quite complete without being able to visualize the data inline. These are the same graphs that John Dickinson was creating from graphite, except at day resolution. The data here is coming from Elastic Search, so we do miss a class of failures where the console logs never make it. That difference should be small at this point.
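With inline plotting turned on, the last step is just a pivot and a plot of the frame built above (again a sketch, same assumptions):

    # %matplotlib inline   # in the notebook, so plots render in the page

    # One column per job, indexed by day, holding the daily success rate.
    rates = failures.pivot(index="day", columns="build_name", values="rate")
    rates.plot(figsize=(12, 6), title="Per-job gate success rate by day")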

Overall this has been a pretty fruitful experiment. Once I'm back in the saddle I'll be porting a bunch of these ideas back into Elastic Recheck itself. I actually think this will make it easier to ask the interesting follow-on questions, like "why does a particular job fail 40% of the time?", because we can compare it to known ER bugs, as well as figure out what our unclassified percentages look like.

For anyone who wants to play, here is the raw IPython Notebook.