Never known an open web

Recently, a lot of people that I admire and look up to have raised their voices, advocating for getting the Internet back to what it once was. An open web. A web we shared and owned together. The old web was awesome.

It sure sounds awesome. Currently, our networks and our personal data are controlled by major corporations with no respect for privacy. Silicon Valley, that so-called tech hotbed of “innovation” and “disruption,” is by most reports becoming a culture of inequality and vapidity. Getting back to the founding open standards of the web is, I’m told, a solution to all of this. The web should be a place where we can own our data, where our best developers focus on solving the problems we need to solve as a democratic society. An open web accepts all people and creates a culture of inclusion.

Again, sounds great. As a webmaker, I want an open web. But as someone who has never experienced that, I don’t know where to begin in making it. I’m not sure simply reverting back to what we had is the right path if we want to include people who have never experienced the open web or understand its principles.

via I’m 22 years old and what is this. — Medium.

It's interesting to realize that digital natives have basically only known a SaaS web, and to ask how we can move forward when the expectation is that the platform is closed and controlled by a small number of interests.

IPython Notebook Experiments

A week of vacation at home means some organizing, physical and logical, some spending time with friends, and some just letting your brain wander on things it wants to. One of the problems that I've been scratching my head over is having a sane basis for doing data analysis for elastic recheck, our tooling for automatically categorizing races in the OpenStack code base.

Right before I went on vacation I discovered Pandas, the Python data analysis library. The online docs are quite good. However, for something this dense, having a great O'Reilly book is even better. It has a bunch of really good examples working with public data sets, like census naming data. It also has very detailed background on the IPython data notebook framework, which is used for the whole book, and is frankly quite amazing. It brought back the best of my physics days using Mathematica.


With the notebook server, IPython isn't just a better interactive Python shell. It's also a powerful web UI, including Python autocomplete. There is even good Emacs integration, which includes support for the inline graphing toolkit. Anything that's created in a cell will be available to future cells, and cells are individually executable. Looking at the example above, I'm setting up the basic JSON return from elastic search, which I only need to do once after starting the notebook.


Pandas is all about data series. It's really a mere mortal's interface on top of NumPy, with a bunch of statistics and time series convenience functions added in. You'll find yourself doing data transforms a lot in it. Like my college physics professors used to say, all problems are trivial in the right coordinate space. Getting there is the hard part.

With the elastic search data, a bit of massaging is needed to get a list of dictionaries that is easily convertible into a Pandas data set. In order to do interesting time series things I also needed to create a new column that was a datetime conversion of @timestamp, and pivot it out into an index.
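A minimal sketch of that massaging step, for readers who want something to paste into a cell. The response layout follows the usual Elasticsearch `hits.hits[]._source` shape, but the sample values and the `build_name` / `build_status` field names are assumptions for illustration, not the real elastic recheck schema.

```python
import pandas as pd

# Hypothetical, trimmed-down search response. Real hits carry many more fields.
response = {
    "hits": {
        "hits": [
            {"_source": {"@timestamp": "2013-12-21T08:00:00.000Z",
                         "build_name": "job-a",
                         "build_status": "SUCCESS"}},
            {"_source": {"@timestamp": "2013-12-20T09:15:00.000Z",
                         "build_name": "job-a",
                         "build_status": "FAILURE"}},
            {"_source": {"@timestamp": "2013-12-20T10:05:00.000Z",
                         "build_name": "job-b",
                         "build_status": "SUCCESS"}},
        ]
    }
}


def hits_to_frame(response):
    """Flatten the hits into a DataFrame indexed by a parsed timestamp."""
    # One dictionary per hit is exactly the shape DataFrame() accepts.
    records = [hit["_source"] for hit in response["hits"]["hits"]]
    frame = pd.DataFrame(records)
    # New column holding real datetimes instead of ISO strings.
    frame["timestamp"] = pd.to_datetime(frame["@timestamp"])
    # Move that column into the index so time series operations work on it.
    return frame.set_index("timestamp").sort_index()


df = hits_to_frame(response)
print(df)
```

Sorting the index is optional, but it keeps later time-based slicing predictable.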

You also get a good glimpse of the output facilities. By default the last line of an In[] block is output to the screen. There is a nice convenience method called head() to give you a summary view (useful for sanity checking). Also, this data actually has about 20 columns, so before sanity checking I sliced it down to 2 relevant ones just to make the output easier to grok.
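As a rough illustration of that slice-then-`head()` sanity check, here is a small sketch. The frame and its column names are made up for the example (the real data has about 20 columns).

```python
import pandas as pd

# Hypothetical stand-in for the wide frame built from the search results.
df = pd.DataFrame({
    "build_name": ["job-a", "job-b"] * 4,
    "build_status": ["SUCCESS", "FAILURE", "SUCCESS", "SUCCESS"] * 2,
    "build_queue": ["gate"] * 8,
    "log_url": ["(elided)"] * 8,
})

# Slice down to the two columns that matter for the check.
narrow = df[["build_name", "build_status"]]

# head() returns the first five rows by default. As the last expression in a
# notebook cell it would be displayed automatically; print() does the same here.
preview = narrow.head()
print(preview)
```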


It took a couple of days to get this far. Again, this is about data transforms, and figuring out how to get from point a to point z. That might include building an index, doing a transform on it (to reduce the resolution to day level), then resetting the index, building some computed columns, rolling everything back up in groupby clauses to compute the total number of successes and runs for each job on a certain day, and doing another computed column in this format. Here I'm also slicing out only the jobs that didn't have a 100% success rate.
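The chain of transforms described above could look roughly like the following. This is a sketch under assumptions: the sample rows, the job names, and the `success` / `run` / `success_rate` column names are all invented for the example, and the notebook itself may well do it differently.

```python
import pandas as pd

# Hypothetical job results: (timestamp, job name, status).
records = [
    ("2013-12-20T09:15:00", "job-a", "SUCCESS"),
    ("2013-12-20T13:40:00", "job-a", "FAILURE"),
    ("2013-12-20T10:05:00", "job-b", "SUCCESS"),
    ("2013-12-20T18:30:00", "job-b", "SUCCESS"),
    ("2013-12-21T08:00:00", "job-a", "SUCCESS"),
    ("2013-12-21T11:20:00", "job-a", "SUCCESS"),
    ("2013-12-21T16:45:00", "job-b", "FAILURE"),
]
df = pd.DataFrame(records, columns=["timestamp", "build_name", "build_status"])
df["timestamp"] = pd.to_datetime(df["timestamp"])
df = df.set_index("timestamp")

# Reduce the index to day resolution, then turn it back into a column.
df.index = df.index.normalize()
df.index.name = "day"
df = df.reset_index()

# Computed columns: one counts successes, the other counts every run.
df["success"] = (df["build_status"] == "SUCCESS").astype(int)
df["run"] = 1

# Roll up to one row per (day, job) with total successes and total runs.
daily = df.groupby(["day", "build_name"])[["success", "run"]].sum().reset_index()
daily["success_rate"] = daily["success"] / daily["run"]

# Keep only the jobs that did not pass 100% of the time on a given day.
flaky = daily[daily["success_rate"] < 1.0]
print(flaky)
```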


And nothing would be quite complete without being able to visualize data inline. These are the same graphs that John Dickinson was creating from graphite, except at day resolution. The data here is coming from Elastic Search, so we do miss a class of failures where the console logs never make it. That difference should be small at this point.
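For the graphing step, one plausible way to shape the data is to pivot it so each job becomes its own column, which is the layout the Pandas plotting helpers expect. The `fail_rate` column and the sample numbers below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical per-day, per-job failure rates.
daily = pd.DataFrame({
    "day": pd.to_datetime(["2013-12-20", "2013-12-20",
                           "2013-12-21", "2013-12-21"]),
    "build_name": ["job-a", "job-b", "job-a", "job-b"],
    "fail_rate": [0.5, 0.0, 0.0, 1.0],
})

# One row per day, one column per job.
chart_data = daily.pivot(index="day", columns="build_name", values="fail_rate")
print(chart_data)

# In a notebook with inline plotting enabled (and matplotlib installed),
# chart_data.plot() would draw one line per job.
```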

Overall this has been a pretty fruitful experiment. Once I'm back in the saddle I'll be porting a bunch of these ideas back into Elastic Recheck itself. I actually think this will make it easier to ask interesting follow-on questions like "why does a particular job fail 40% of the time?" because we can compare the failures to known ER bugs, as well as figure out what our unclassified percentages look like.

For anyone who wants to play, here is the IPython Notebook raw.

Moving off GMail

In early December I finally decided it was time to move my primary email out of Google. There were a few reasons to do it, though the practical (reaching the limits on their filtering) largely outweighed the ideological.

Movable Email

If email is important to you, you should really register your own domain name, so you have a permanent address. I got mine back in 1999 to create a permanent home for my identity. This has meant that over the years the backend for it has changed at least 5 times, including me hosting it myself for a large number of years.

My Requirements

  • Can host email on my own domain - as I'd be moving
  • Web UI - because sometimes I want to access my email via Chromebook
  • Good Search - because there are times I fall back to full text search to find things
  • IMAP - because most of the time I'll be accessing via Thunderbird or Kaiten Mail
  • Good spam filtering
  • Good generic filtering on the server side - My daily mail volume is north of 1000 messages (40% spam), I need good filtering otherwise I drown

I eventually landed on FastMail, which I've been watching for a lot of years. They are fairly priced ($40/yr), their company contributes back to the open source software they run their business on, and because they are actually an Australian company, you'll get disclosure if some agency is accessing your accounts. They also give you a 60 day free trial, so you can do a slow migration over, and see if it will meet your needs.


Once I was sure I was going to do this, I created my account, and then pulled and configured imapsync to sync my existing Gmail content over. I have a couple of GB of email, which means an IMAP sync takes a good 24 hours at this point. Imapsync is rerunnable, so run it once, wait until it finishes, then run it a second time to pick up the changes. Once it seems like you've basically closed the gap between the two accounts, you can change your MX records and start getting email at the new service provider.
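To make the run-it-twice idea concrete, here is a small sketch that only assembles an imapsync command line and prints it. The `--host1/--user1/--passfile1/--ssl1` options (and their `2` counterparts) are standard imapsync flags; the destination hostname, account names, and password file paths are placeholders, not the values from my setup.

```python
import shlex


def build_imapsync_command(host1, user1, passfile1, host2, user2, passfile2):
    """Return an argv list that copies mail from account 1 to account 2."""
    return [
        "imapsync",
        "--host1", host1, "--user1", user1, "--passfile1", passfile1, "--ssl1",
        "--host2", host2, "--user2", user2, "--passfile2", passfile2, "--ssl2",
    ]


cmd = build_imapsync_command(
    "imap.gmail.com", "me@example.com", "/path/to/old.pass",      # source
    "imap.example.net", "me@example.com", "/path/to/new.pass",    # placeholder destination
)

# Print a copy-pasteable shell command instead of running anything.
print(" ".join(shlex.quote(part) for part in cmd))
```

Because imapsync skips messages that already exist on the destination, running the same command again after the first pass only transfers what arrived in the meantime. Passing `cmd` to `subprocess.run(cmd, check=True)` would execute it for real.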

For safety, the first thing I do once this has happened is build a forward rule from the new provider to the old one. Then if something goes horribly wrong, all my email remains in both locations for a while. A month later I'm still running that forward, though I will be disconnecting it soon.

So far so good

The webmail for FastMail is really solid; honestly I like it better than Gmail's web UI, which has become incredibly cluttered over the years. This is just email, which is good. It also has a search facility which is on par with Google's. Search is also available as part of the IMAP protocol, which means real searching from Kaiten Mail on Android. Switching from the Gmail app to Kaiten Mail on my phone took about 10 minutes. And it means I can actually customize the things I get alerted to, which Gmail broke at some point. The Thunderbird transition was simple.

I had gotten used to Rapportive on Gmail, which would give me people's pictures alongside their email. I found the Ldap Info Show extension for Thunderbird, which looks people up on various social networks and gives you the pictures they have made public.

Lacking APIs

The one complaint I have with FastMail is that it's lacking APIs to handle your data. For instance, my filtering rules are complex: 342 lines of sieve and counting at this point. This is managed via a web form, but copy / paste on every change is something I'm not really into. I solved this by writing a Python mechanize sync script so I can manage the rules locally, version controlled, then sync them up afterwards.
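The sync script itself isn't shown here, so the following is only a sketch of the local half of that workflow: keep the sieve file on disk and push it only when it has changed. The `upload` callable is a stand-in for whatever drives the web form (mechanize in my case), and the file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sync_rules(rules_path, state_path, upload):
    """Upload the sieve script if it changed since the last sync.

    Returns True when an upload happened, False when nothing changed.
    """
    rules_path, state_path = Path(rules_path), Path(state_path)
    script = rules_path.read_text(encoding="utf-8")
    digest = hashlib.sha256(script.encode("utf-8")).hexdigest()

    # The state file remembers the hash of the last script that was pushed.
    if state_path.exists() and state_path.read_text(encoding="utf-8").strip() == digest:
        return False

    upload(script)  # hypothetical hook: submit the script through the web form
    state_path.write_text(digest + "\n", encoding="utf-8")
    return True


# Demo with a fake uploader that just records what it was given.
uploaded = []
with tempfile.TemporaryDirectory() as tmp:
    rules = Path(tmp) / "rules.sieve"
    state = Path(tmp) / "rules.sha256"
    rules.write_text('require ["fileinto"];\n', encoding="utf-8")
    first = sync_rules(rules, state, uploaded.append)
    second = sync_rules(rules, state, uploaded.append)

print(first, second, len(uploaded))
```

Keeping the upload step behind a plain callable means the rules file can live in version control regardless of how the provider ends up accepting it.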

The address book has some issues, and I've not built a workaround. The sieve rules they give you whitelist everyone in your address book so their mail is never treated as spam, so it's something I'd like to keep in sync. However, without an API it's not really worth it.

Overall: Very Good

Overall I'm very happy with the move. My biggest complaints are around the API issue, which I hope they correct in the future.

Planet Money T-Shirt Project

If you haven't been following along: Planet Money has been making a t-shirt, and working to follow its creation throughout the global supply chain. This has led to a ton of podcast episodes that trace that path, cradle to grave.

They also have this really great mixed media summary of the whole story, which includes a look at everything from the raw cotton to the process of getting the shirt to you, including great video of the machines used along the way.

spinning yarn


On this Christmas Eve take a minute to follow something as simple as a t-shirt through the global economy, and realize how connected we are, even through seemingly everyday things.