With the OpenStack summit only a few days away, I've been preparing materials for my Elastic Recheck talk. Elastic Recheck is a system we built over the last 7 months to mine test results from the OpenStack test system for patterns in failures.
OpenStack is a complicated system, with lots of components working in an asynchronous way. This means that small timing changes can often expose some interesting issues. This is especially true in an environment like the upstream gate where we are running tests in parallel.
A good example of this is a currently open bug. If you run a security group list against all tenants at the same time someone is deleting a security group, the listing returns a 404. This is because of a nesting behavior in the list, which runs a db get over every item in the list to fetch additional details. There is an exposure window there: a security group is in the list, it's deleted by another user, then we go back to get it, and the get fails. That failure currently propagates a set of exceptions that become a 404 to the end user, which is totally unexpected.
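To make that window concrete, here is a minimal sketch of the list-then-get pattern. It is not the actual OpenStack code: FakeDB, NotFound, and both list functions are hypothetical names, and the between_steps hook just stands in for another user's delete landing at the wrong moment. The tolerant variant shows one possible way to close the window, not necessarily the fix the project will adopt.

```python
class NotFound(Exception):
    """Hypothetical stand-in for the 'no such security group' error."""


class FakeDB:
    """Tiny in-memory stand-in for the database layer (illustration only)."""

    def __init__(self, groups):
        self._groups = dict(groups)

    def list_ids(self):
        return list(self._groups)

    def get(self, group_id):
        try:
            return self._groups[group_id]
        except KeyError:
            raise NotFound(group_id)

    def delete(self, group_id):
        self._groups.pop(group_id, None)


def list_groups_racy(db, between_steps=lambda: None):
    """List ids, then fetch details one by one; fails if an id vanishes."""
    ids = db.list_ids()
    between_steps()  # stands in for a concurrent request from another user
    return [db.get(group_id) for group_id in ids]


def list_groups_tolerant(db, between_steps=lambda: None):
    """Same pattern, but a group deleted mid-listing is simply skipped."""
    ids = db.list_ids()
    between_steps()
    details = []
    for group_id in ids:
        try:
            details.append(db.get(group_id))
        except NotFound:
            continue  # deleted after we listed it; leave it out
    return details
```

Calling list_groups_racy(db, lambda: db.delete("b")) on a database that contains group "b" raises NotFound, which is the error that ends up as a 404 for the whole listing. The tolerant version returns the remaining groups instead.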
That window seems really small, right? Like it never could actually happen. Well, in the gate, even with only 2 to 4 API calls happening at a time, we see this 7 times a day.
During the Havana RC phase, we started turning this into a search problem. Using Logstash and Elasticsearch on the back end, we find fingerprints for known bugs. These fingerprints are queries that return only the test runs that seem to have failed on a particular bug.
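The fingerprint idea can be sketched without a search cluster. The example below is an illustration in plain Python, not the real query format or the Elastic Recheck code: the Fingerprint class, the event field names (build_id, filename, message), and the bug label are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    """Hypothetical fingerprint: a log file plus a message fragment."""
    bug: str
    filename: str
    message_contains: str


def failed_runs_matching(events, fingerprint):
    """Return the sorted build ids whose log events match the fingerprint.

    Each event is a dict with 'build_id', 'filename' and 'message' keys
    (assumed field names for this sketch).
    """
    return sorted({
        event["build_id"]
        for event in events
        if event["filename"] == fingerprint.filename
        and fingerprint.message_contains in event["message"]
    })


# Made-up log events from three failed test runs.
events = [
    {"build_id": "run-1", "filename": "api.log",
     "message": "Security group 42 could not be found"},
    {"build_id": "run-2", "filename": "api.log",
     "message": "Timeout waiting for server to become ACTIVE"},
    {"build_id": "run-3", "filename": "api.log",
     "message": "Security group 7 could not be found"},
]

secgroup_race = Fingerprint(
    bug="example-bug",  # placeholder, not a real bug number
    filename="api.log",
    message_contains="could not be found",
)

matching = failed_runs_matching(events, secgroup_race)
```

A good fingerprint is narrow enough that runs failing for other reasons (run-2 here) never match it.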
The system includes real time reporting to Gerrit and IRC when we detect that a job failed with a known issue, and bulk reporting every 30 minutes to let us understand trends and classification rates.
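As a rough sketch of what a classification rate measures, assume it means the share of failed runs that matched at least one known fingerprint. The function name and inputs below are hypothetical, not the project's actual reporting code.

```python
def classification_rate(failed_builds, classified_builds):
    """Fraction of failed builds that matched at least one fingerprint.

    Both arguments are iterables of build ids. Classified ids that are
    not in the failed set are ignored.
    """
    failed = set(failed_builds)
    if not failed:
        return 0.0  # nothing failed, so there is nothing to classify
    return len(failed & set(classified_builds)) / len(failed)


# Four failed runs, two of which matched a known bug.
rate = classification_rate(["a", "b", "c", "d"], ["a", "c"])
```

The unclassified remainder is the interesting part: those are failures nobody has a fingerprint for yet.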
Overall this has been a huge boon for identifying some of the key issues we expose during normal testing. What's also been really interesting is how having a system like this affects the way people write core project code, so that errors are more uniquely discoverable. That is a win not only for our detection, but also for debugging OpenStack in a production environment.
If you are going to be in Atlanta, and would like to know more, you'll have lots of opportunities.
My summit talk is going to be an overview intended for people who want to learn more about the project and technique.
Elastic Recheck - Tools for Finding Race Conditions in OpenStack
Date: Thursday, May 15th
Track: Related Open Source Projects
We'll also be doing a design summit session for people that are interested in contributing to the project, and helping us set priorities for the next cycle. Wed, 9:50am in the Infrastructure Track.
Also, feel free to find me anywhere to chat about Elastic Recheck. I'm always happy to talk about it, especially if you are interested in getting involved in the effort.
I believe the summit talk will be recorded, and I'll post links to the video once it's online for people that can't make it to Atlanta.