Tag Archives: data

IPython Notebook Experiments

A week of vacation at home means some organizing, physical and logical, some spending time with friends, and some just letting your brain wander on things it wants to. One of the problems that I’ve been scratching my head over is having a sane basis for doing data analysis for elastic recheck, our tooling for automatically categorizing races in the OpenStack code base.

Right before I went on vacation I discovered Pandas, the python data analysis library. The online docs are quite good. However, on something this dense, having a great O’Reilly Book is even better. It has a bunch of really good examples working with public data sets, like census naming data. It also has very detailed background on the iPython notebook framework, which is used for the whole book, and is frankly quite amazing. It brought back the best of my physics days using Mathematica.

[screenshot: notebook cell setting up the json return from elastic search]

With the notebook server, iPython isn’t just a better interactive python shell. It’s also a powerful web UI, including python autocomplete. There is even good emacs integration, which includes supporting the inline graphing toolkit. Anything that’s created in a cell will be available to future cells, and cells are individually executable. Looking at the example above, I’m setting up the basic json return from elastic search, which I only need to do once after starting the notebook.
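A minimal sketch of what that one-time setup cell produces. The real notebook issues the query over HTTP against the elastic recheck index; here I’m just showing the shape of the returned json with made-up field names and values, and pulling out the list of per-run dictionaries that later cells work with:

```python
# Illustrative shape of a parsed Elastic Search response -- the field
# names and values here are invented, not the real elastic recheck data.
results = {
    "hits": {
        "total": 2,
        "hits": [
            {"_source": {"@timestamp": "2014-01-01T12:00:00Z",
                         "build_name": "gate-tempest", "build_status": "SUCCESS"}},
            {"_source": {"@timestamp": "2014-01-01T13:00:00Z",
                         "build_name": "gate-tempest", "build_status": "FAILURE"}},
        ],
    }
}

# Pull the per-run dictionaries out of the envelope; this list is what
# gets turned into a Pandas data set in later cells.
records = [hit["_source"] for hit in results["hits"]["hits"]]
```

Because cells persist, `records` stays available to every cell run afterwards without re-querying the server.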

[screenshot: transforming the elastic search results into a Pandas data set]

Pandas is all about data series. It’s really a mere-mortals interface on top of numpy, with a bunch of statistics and time series convenience functions added in. You’ll find yourself doing data transforms a lot in it. Like my college physics professors used to say, all problems are trivial in the right coordinate space. Getting there is the hard part.

With the elastic search data, a bit of massaging is needed to get the list of dictionaries that is easily convertible into a Pandas data set. In order to do interesting time series things I also needed to create a new column that was a datetime conversion of @timestamp, and pivot it out into an index.
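That massaging step looks roughly like this. The run records here are made up, but they have the same shape as the `_source` dictionaries pulled out of the elastic search response:

```python
import pandas as pd

# Hypothetical run records in the shape of the elastic search _source dicts.
records = [
    {"@timestamp": "2014-01-01T12:00:00Z", "build_name": "gate-tempest",
     "build_status": "SUCCESS"},
    {"@timestamp": "2014-01-02T09:30:00Z", "build_name": "gate-tempest",
     "build_status": "FAILURE"},
]

# A list of flat dictionaries converts straight into a DataFrame.
df = pd.DataFrame(records)

# New column that is a real datetime, then pivot it out into the index
# so that the time series machinery works.
df["timestamp"] = pd.to_datetime(df["@timestamp"])
df = df.set_index("timestamp")
```

Once the datetime is the index, day-level resampling and date slicing both come for free.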

You also get a good glimpse of the output facilities. By default the last line of an In[] block is output to the screen. There is a nice convenience method called head() to give you a summary view (useful for sanity checking). Also, this data actually has about 20 columns, so before sanity checking I sliced it down to the 2 relevant ones just to make the output easier to grok.
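The slicing and sanity-check pattern, with an invented frame standing in for the ~20 column elastic search data:

```python
import pandas as pd

# Illustrative frame with more columns than we care about.
df = pd.DataFrame({
    "build_name": ["gate-tempest", "gate-grenade", "gate-tempest"],
    "build_status": ["SUCCESS", "FAILURE", "SUCCESS"],
    "build_node": ["node1", "node2", "node3"],
    "message": ["...", "...", "..."],
})

# Slice down to the 2 relevant columns; head() shows just the first
# few rows, and as the last line of a cell it renders automatically.
slim = df[["build_name", "build_status"]]
slim.head()
```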

[screenshot: groupby transforms computing per-job daily success rates]

It took a couple of days to get this far. Again, this is about data transforms, and figuring out how to get from point a to point z. That might include building an index, doing a transform on it (to reduce the resolution to day level), then resetting the index, building some computed columns, rolling everything back up in groupby clauses to compute the total number of successes and runs for each job on a certain day, and building one more computed column from those totals. Here I’m also slicing out only the jobs that didn’t have a 100% success rate.
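A compressed sketch of that transform chain on hypothetical run data (job names, statuses, and timestamps are all invented):

```python
import pandas as pd

# Hypothetical run data: job name, status, timestamp.
df = pd.DataFrame({
    "build_name": ["jobA", "jobA", "jobA", "jobB", "jobB"],
    "build_status": ["SUCCESS", "FAILURE", "SUCCESS", "SUCCESS", "SUCCESS"],
    "timestamp": pd.to_datetime([
        "2014-01-01 03:00", "2014-01-01 15:00", "2014-01-02 08:00",
        "2014-01-01 10:00", "2014-01-02 11:00"]),
})

# Computed columns: reduce the resolution to day level, and turn the
# status into something summable.
df["day"] = df["timestamp"].dt.normalize()
df["success"] = (df["build_status"] == "SUCCESS").astype(int)

# Roll everything back up with groupby to get successes and total runs
# per job per day, then compute the rate as one more column.
daily = df.groupby(["build_name", "day"]).agg(
    successes=("success", "sum"), runs=("success", "count"))
daily["rate"] = daily["successes"] / daily["runs"]

# Slice out only the jobs/days that didn't have a 100% success rate.
flaky = daily[daily["rate"] < 1.0]
```

With the sample data above, only jobA on the first day shows up as flaky (1 success out of 2 runs).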

[screenshot: inline graphs of job success rates by day]

And nothing would be quite complete without being able to visualize data inline. These are the same graphs that John Dickinson was creating from graphite, except at day resolution. The data here is coming from Elastic Search, so we do miss a class of failures where the console logs never make it there. That difference should be small at this point.
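Plotting from a DataFrame is a one-liner once the index is a datetime; in the notebook the graph renders inline under the cell. The per-day rates here are invented for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; the notebook uses inline rendering
import pandas as pd

# Hypothetical per-day success rates for a couple of jobs.
rates = pd.DataFrame(
    {"jobA": [0.9, 0.6, 0.8], "jobB": [1.0, 0.95, 0.97]},
    index=pd.to_datetime(["2014-01-01", "2014-01-02", "2014-01-03"]))

# One line per job, dates on the x axis.
ax = rates.plot(title="Success rate by day")
ax.set_ylabel("success rate")
```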

Overall this has been a pretty fruitful experiment. Once I’m back in the saddle I’ll be porting a bunch of these ideas back into Elastic Recheck itself. I actually think this will make it easier to ask the interesting follow-on questions, like “why does a particular job fail 40% of the time?”, because we can compare it to known ER bugs, as well as figure out what our unclassified percentages look like.

For anyone that wants to play, here is the iPython Notebook raw.

2 Gigs of Data

Interesting data point on whether 2 GB / month of data is enough on cell plans. At our MHVLUG meeting the local wireless wasn’t working, so I instead just turned on the wifi access point on my phone for my laptop and tablet so I could do a little bit of live demo. It was on for 2.5 hours, with a total data usage of 46MB (which is about 15% of what I’ve used total this month).
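A back-of-the-envelope check on those numbers (all figures from the post itself):

```python
# Figures from the meeting described above.
meeting_mb = 46          # data used while tethering
meeting_hours = 2.5
share_of_month = 0.15    # the meeting was ~15% of the month's usage

rate_mb_per_hour = meeting_mb / meeting_hours  # ~18 MB/hour tethered
monthly_mb = meeting_mb / share_of_month       # ~307 MB for the month
cap_mb = 2 * 1024                              # a 2 GB cap
```

So even with regular tethering at that rate, the extrapolated monthly total sits well under a 2 GB cap.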

Yes, bandwidth caps basically kill mobile streaming as a business (mobile Pandora, Hulu, and Netflix are definitely being hurt by this), but for non-streaming interactions, 2GB is way more than I’ll use in a month.

Maybe it’s time to take a statistics class

From Wired’s Why We Should Learn the Language of Data:

Statistics is hard. But that’s not just an issue of individual understanding; it’s also becoming one of the nation’s biggest political problems. We live in a world where the thorniest policy issues increasingly boil down to arguments over what the data mean. If you don’t understand statistics, you don’t know what’s going on — and you can’t tell when you’re being lied to. Statistics should now be a core part of general education. You shouldn’t finish high school without understanding it reasonably well — as well, say, as you can compose an essay.

It goes on to explain a number of policy issues that are being argued with badly understood data.

On a related note: It’s dark out, that is proof the Sun has been destroyed.

The importance of getting government data online

Wired has a great interview with the Federal Government CIO, which actually dates back to just prior to data.gov‘s launch. It’s definitely worth a read.

I firmly believe that this is the most important change that the current administration can make. The Federal government did a tailspin into secrecy over the past couple of decades, and while I believe the previous administration took this to a new height, it seems like it was part of a trend that definitely predates them. Secrecy breeds distrust in government, as well as bad decisions, as people don’t have access to all the facts.

Sunlight is definitely the best disinfectant, and nothing has quite the same power of light as the whole of the internet gazing in.