The OpenStack Gate

The OpenStack project has a really impressive continuous integration system, which is one of its core strengths as a project. Every proposed change to our gerrit review system is subjected to a battery of tests on each commit, which has grown dramatically with time, and after formal review by core contributors, we run them all again before the merge.

These tests take on the order of 1 hour to run on a commit, which would make you immediately think the most code that OpenStack could merge in a day would be 24 commits. So how did Nova itself manage to merge 94 changes since Monday (not to mention all the other projects, which adds up to ~200 in 3 days)? The magic of this is Zuul, the gatekeeper.

Zuul is a queuing system for CI jobs, written and maintained by the OpenStack infrastructure team. It does many cool things, but what I want to focus on is the gate queue. When the gate queue is empty (yes it does happen some times), the job is simple: add a new commit, run the tests, and we’re off. What happens if there are already 5 jobs ahead of you in the gate? Let’s take a concrete example of nova.

Speculative Merge

By the time a commit has gotten this far, it’s already passed the test suites at least once, and has had at least 2 core contributors sign off on the change in code review. So Zuul assumes everything ahead of the change in the gate will succeed, and starts the tests immediately cherry picking this change on top everything that’s ahead of it in the queue.

zuul-working

That means that merge time on the gate is O(1), that is merging 10 changes takes the same time as 1 change. If the queue gets too big, we do eventually run out of devstack nodes, so the ability to run tests is not strictly constant time. On the run up to grizzly-3 both the cloud providers (HP and Rackspace) which contribute these VMs provided some extra quota to the OpenStack team to help keep things moving. So we had an elastic burst of OpenStack CI onto additional OpenStack public cloud resources, which is just fun to think about.

Speculation Can Fail

Of course, speculation can fail. Maybe change 3 doesn’t merge because something goes wrong in the tests. If that happens we then kick the change out of the queue, and then all the changes behind it have to be reset to pull change 3 out of the speculation. This is the dreaded gate reset, because when gate resets happen, all the time spent on speculative tests behind the failure is lost, and the jobs need to restart.

zuul-reset

Speculation failures largely fall into a few core classes:

Jenkins crashes – it doesn’t happen often, but Jenkins is software too, and OpenStack CI tends to drive software really hard, so we force out edge cases everywhere.

Upstream service failures – we try to isolate ourselves from upstream failures as much as possible. Our git trees pull from our gerrit, not directly from github. Our apt repository is a Rackspace local mirror, not generically upstream. And the majority of pip python packages come from our own proxy server. But if someone adds a new python dependency, or a version of one updates and we don’t yet have it cached, we pass through to pypi for that pip install. On Tuesday pypi converted from HTTP to HTTPS, and didn’t fully grok the load implications, which broke OpenStack CI (as well as lots of other python developers) for a few hours when pypi effectively was down from load.

Transient OpenStack bugs – OpenStack is complicated software, 7 core components interacting with each other asynchronously over REST web services. Each core component being a collection of daemons that interact with each other asynchronously. Sometimes, something goes wrong. It’s a real bug, but only shows up under very specific timing and state conditions. Because OpenStack CI runs so many tests every day (OpenStack CI may be one of the largest creators of OpenStack guests in the world every day), very obscure edge and race conditions can be exposed in the system. We try to track these as recheck bugs, and are making them high priority to address. By definition they are hard to track down (they expose themselves on maybe 1 out of 1000 or fewer test runs), so the logs captured in OpenStack CI are the tools to get to the bottom of these.

Towards an Even Better Gate

In my year working on OpenStack I’ve found the unofficial motto of the project to be “always try to make everything better”. Continuous improvement is not just left to the code, and the tests, but the infrastructure as well.

We’re trying to get more urgency and eyes on the transient failures, coming up with ways to discover the patterns from the 1 in 1000 fails. After you get two or three that fail in the same way it helps triangulate the core issue. Core developers from all the projects are making these high priority items to fix.

On the upstream service failures the OpenStack infrastructure team already has proxies sitting in front of many of the services, but the pypi outage showed we probably need something even more robust to handle that upstream service outage, possibly rotating between pypi mirrors on the fall-through case, or a better proxy model. The team is already actively exploring solutions to prevent that from happening again.

As always, everyone is welcomed to come help us make everything better. Take a look at the recheck bugs and help us solve them. Join us on #openstack-infra and help with Zuul. Check out what the live Zuul queue looks like. All the code for this system is open source, and available under either the openstack, or openstack-infra github accounts. Patches are always welcome!

Two Solitudes – CS and Software

Via channels I can’t now remember, I came across this presentation about the very unsolved issue of how Computer Science as a field of study relates in any way to creating software.

With so many colleges in the area, and having a number of friends that are CS professors, and other IT staff at colleges, it continues to amaze me how disconnected these worlds are. All made the stranger by coming into the field sideways from a physics degree.

Plus, I love the term software carpentry.

Wise words about software

From a former Firefox developer, truer words were never spoken:

Software companies would do well to learn this lesson: anything with the phrase “users love our product” in it isn’t a strategy, it’s wishful thinking. Your users do not ”love” your software. Your users are temporarily tolerating your software because it’s the least horrible option they have – for now – to meet some need. Developers have an emotional connection to the project; users don’t.

All software sucks. Users would be a lot happier if they had to use a lot less of it. They may be putting up with yours for now, but assume they will ditch you the moment something 1% better comes along — or the moment you make your product 1% worse.

I gave up Firefox as my main browser when Chrome decided to support WebGL on Linux, and Firefox kept back burnering it. Mozilla is now deprecating Thunderbird, which is the only product of theirs that I still use regularly. Guess that means I’m going to have to get used to GMail’s web interface eventually.

Sad to see something as critical as Mozilla, the beast that cracked the IE hegemony, become an also ran.

Drupal Meetup Events Module

I just released version 1.1 of the meetup_events module for Drupal. I started building this about 6 months ago when we started using Meetup for Mid Hudson Astronomical Association to draw in new members.

I hate data entry. I find entering the same data twice into a computer one of the most soul sucking things I could do. So having both events on our website, and on meetup, meant I needed to automate things.

Meetup events was thus born. The model is simple, your Drupal site is considered the authoritative resource. You select which node types (which have date fields) you want to sync to meetup, and whenever you create or update an event of that type a meetup entry is made (or updated) accordingly. The body content that is synced is tokenized (and I added a few tokens for getting nodereference content). Venues are a little hokey right now, but I just provide a selector for a numeric field either on the main type or on a nodereference which maps to it.

meetup_events settings

Once you set up the module, including your API key and group url, you pretty much forget it is there. Then you just edit as per normal. Any time you save a type that you are syncing you’ll notice this:

Which includes a link to the meetup event that was just saved.

One of the more recent things I worked on was integration with views so that there is now a Meetup Events: Meetup Link view field for nodes. If you add that to a node listing (and you’ve registered for an oauth key) you’ll get the meetup dynamic RSVP button for your event.

That required a bit of tweaking of the rsvp javascript to make it play nice with Drupal, but given that the meetup team was kind enough to release that code under open source.

There is plenty of work remaining here, ways this could be nicer, but I’m pretty pleased with the results so far. It’s made me learn a bunch about Drupal’s ctools module, more than I ever wanted to know about the drupal settings system, and how to pull in incompatible versions of jquery in custom tools.

I’ve also been really happy with my experience with the Meetup team. They’ve added multiple API calls for me to make my life easier. Thanks guys. If every developer facing team acted like that, the world would be a much better place.

The meetup platform is turning out to be a really great way for our Astronomy club to draw in new people, and I’ve started using it for MHVLUG now as well. Now that I’ve got meetup_events, I can do that seamlessly without data duplication, or degrading our native experience. If you are interested in doing the same, check out the module. Bugs and patches welcomed.

Unity and Pidgin

One of the things that happened once getting to Ubuntu 12.04 was that gnome-do started acting up on me. Given that it’s a very minimally maintained project, I decided it was time to move on. Ubuntu’s dash provides a lot of the same functionality, so I finally started using it. But, I missed a few things.

Gnome-do isn’t just a launcher for programs, it’s an actions engine. It has support for Pidgin buddy lists, an I even wrote an NX launcher for it. I didn’t really want to give either of those up, so I started trying to figure out how to add those to Unity’s launcher itself.

Unity lenses, the plugins that support results, are written in either vala or python. Given that I’m trying to reflex my python muscles now, I decided that was my way in. After a few false starts I found the One Hundred Scopes project, which is an attempt to build a whole set of Unity lenses to add functions and examples for the world.

The wikipedia example is a good starting point. It gives you an idea of how to build a custom search and return results. That enabled me to build a basic launcher that fed up file urls for launching nx sessions, and I created an pushed unity-lens-nx that implements that.

But, pidgin is a little harder. There is no file to open for pidgin, this is about communicating with another program over dbus, and catching the action to do something else with dbus. Fortuntately through the help of David Callé I figured it out. Also, once I had it I found some good tricks (and a couple of undocumented dbus methods) here –  https://github.com/gregorl/Unity-Pidgin-Lens which I used as inspiration.

The net result is unity-lens-pidgin.

2 key MHVLUG people online right now

Super + b and search you buddy list. It only displays currently online buddies, and available buddies are preferred over unavailable ones. You can get it via ppa here.

There is plenty more to do. The search results should be smarter, especially taking into account most recently contacted buddies, which means integrating with zeitgeist or something equivalent. Ideas floating around there. I’d like to do some overlays with status icons, just to give a visual clue on either protocol or current status state.

If anyone else wants to help, or has their own ideas, I’d encourage you to join in on the conversation. The One Hundred Scopes community is pretty cool, and I’m happy to make their vision a little closer to reality.

Getting Involved in Open Source

The first week of 2012 was pretty jam packed for me, which is a good thing. One of the many things that made this week busy was my talk, entitled “Getting Involved in Open Source” at MHVLUG.

This presentation was one of the hardest I’ve had to pull together, as well as one of the most fun to give. I had 3 entirely different slidedecks, each with their own narratives, each with their own dry runs, before I found something I felt would keep everyone engaged, not be too abstract, and time in at 1 hour. (The final dry run was 1:03, the live presentation came in at 1:05). That left plenty of time for questions, and still the ability to end the meeting by the advertised 8pm.

The focus on this talk wasn’t building your own open source project, but really about interacting with various communities. I told stories about reporting bugs, fixing small features in projects, getting into flame wars, getting ignored, and becoming the accidental maintainer of projects. The core center of the talk was a tale of 3 projects: 3 drupal modules that I’ve submitted issues and code to, that have gone in completely different directions. This was to make the most important point of the talk:

Open Source, it’s made of people!

When folks get involved in Open Source, they think that it’s all about code. My experience has been that while the code is very important, the people are just as important. Understanding how to interact with a wide range of personality types is one of the most important skills for an open source developer. How do you get conversations rolling? How do you get your ideas listened to? When do you know it’s not going to work, and a new approach is required? When do you just walk away from an idea, because it won’t fit in this community?

With 37 folks in attendance, this was one of the larger MHVLUG meetings. The fact that it was also in our new location, made me really happy with those numbers. A very good way to kick off the new year.

 

Steve Yegge and the Google Platform issue

Steve Yegge is one of the most insightful people on the internet. I was really bummed when he stopped blogging, because his posts were always well thought out, funny, and really got to the heart of some key issues in software development.

Last night he posted publicly, by accident, Google’s current biggest issue, a complete lack of a platform. He’s really dead on.

I hope Google internalizes that post and does something about it.