A week of vacation at home means some organizing, physical and logical, some spending times with friends, and some just letting your brain wander on things it wants to. One of the problems that I've been scratching my head over is having a sane basis for doing data analysis for elastic recheck, our tooling for automatically categorizing races in the OpenStack code base.
Right before I went on vacation I discovered Pandas, the python data analysis library. The online docs are quite good. However on something this dense having a great O'Reilly Book is even better. It has a bunch of really good examples working with public data sets, like census naming data. It also has very detailed background on the iPython data notebook framework, which is used for the whole book, and is frankly quite amazing. It brought back the best of my physics days using Mathematica.
With the notebook server iPython isn't just a better interactive python shell. It's also a powerful webui, including python autocomplete. There is even good emacs integration, which includes supporting the inline graphing toolkit. Anything that's created in a cell will be available to future cells, and cells are individually executable. Looking at the example above, I'm setting up the basic json return from elastic search, which I only need to do once after starting the notebook.
Pandas is all about data series. It's really a mere mortals interface on top of numpy, with a bunch of statistics and timeseries convenience functions added in. You'll find yourself doing data transforms a lot in it. Like my college physics professors used to say, all problems are trivial in the right coordinate space. Getting there is the hard part.
With the elastic search data, a bit of massaging is needed to get the list of dictionaries that is easily convertable into a Pandas data set. In order to do interesting time series things I also needed to create a new column that was a datetime convert of @timestamp, and pivot it out into an index.
You also get a good glimpse of the output facilities. By default the last line of an In block is output to the screen. There is a nice convenience method called head() to give you a summary view (useful for sanity checking). Also, this data actually has about 20 columns, so before sanity checking I sliced it down to 2 relevant ones just to make the output easier to grok.
It took a couple of days to get this far. Again, this is about data transforms, and figuring out how to get from point a to point z. That might include include building and index, doing a transform on it (to reduce the resolution to day level), then resetting the index, building some computed columns, rolling everything back up in groupby clauses to compute the total number of successes and runs for each job on a certain day, and doing another computed column in this format. Here I'm also only slicing out only the jobs that didn't have a 100% success rate.
And nothing would be quite complete without being able to inline visualize data. This is the same graphs that John Dickinson was creating from graphite, except on day resolution. The data here is coming from Elastic Search so we do miss a class of failures where the console logs never make it. That difference should be small at this point.
Overall this has been a pretty fruitful experiment. Once I'm back in the saddle I'll be porting a bunch of these ideas back into Elastic Recheck itself. I actually think this will make asking the interesting follow on questions on "why does a particular job fail 40% of the time?" because we can compare it to known ER bugs, as well as figure out what our unclassified percentages look like.
For anyone that wants to play, here is the iPython Notebook raw.