Source Forge Open Source Again

Apparently Source Forge has gone open source again, and is even an incubating project at Apache. The source code is in git, and the new Source Forge looks like it's all written in Python instead of PHP.

Source Forge has had a pretty storied history. It started out Open Source all those years ago; then, in the dot com collapse, they stopped releasing source and instead tried to sell an onsite hosted solution, Source Forge Enterprise Edition. The Linux Technology Center was one of their few customers, providing an internal source forge for the rest of IBM. I had the "opportunity" to help debug some of that code for performance reasons, and discovered that a lot of Source Forge's slowness was due to a major lack of understanding by the development team of how database indexes work. Those fixes flowed upstream.

Later, one of the key developers from Source Forge forked GForge from the last open source release, so we had an Open Source "source forge" again. Then a couple of years later the GForge team pulled the same stunt as Source Forge: they tried to monetize and sealed off the source code.

Then git happened, and all these CVS / SVN based hosting solutions looked really quaint. A couple years later we had github, and the center of gravity of Open Source has been migrating ever since.

Source Forge's current owner is Dice, the job search company, so the economics of keeping it Open Source are a little different. "What's your github id?" is now a standard job interview question, so I can imagine the new Source Forge team has a pretty broad brush to just make Source Forge as good as they can.

I wish them luck.

The OpenStack Gate

The OpenStack project has a really impressive continuous integration system, which is one of its core strengths as a project. Every change proposed to our gerrit review system is subjected to a battery of tests on each commit, a battery that has grown dramatically over time, and after formal review by core contributors we run all of those tests again before the merge.

These tests take on the order of 1 hour to run on a commit, which might make you think the most code OpenStack could merge in a day is 24 commits. So how did Nova alone manage to merge 94 changes since Monday (not to mention all the other projects, which add up to ~200 in 3 days)? The magic behind this is Zuul, the gatekeeper.
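A quick back-of-the-envelope check shows why something clever has to be going on; the numbers below are just the rough figures from this paragraph:

```python
# Rough throughput arithmetic, using the approximate numbers above.
test_hours_per_commit = 1                              # roughly one hour of tests per commit
serial_ceiling_per_day = 24 // test_hours_per_commit   # limit if the gate merged one change at a time

merged_changes = 200                                   # ~200 changes across all projects...
elapsed_days = 3                                       # ...in roughly 3 days

observed_per_day = merged_changes / elapsed_days
print(f"serial ceiling: {serial_ceiling_per_day}/day, observed: ~{observed_per_day:.0f}/day")
# ~67 merges/day against a 24/day serial ceiling: the gate must be testing
# many changes in parallel to hit that rate.
```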

Zuul is a queuing system for CI jobs, written and maintained by the OpenStack infrastructure team. It does many cool things, but what I want to focus on is the gate queue. When the gate queue is empty (yes, it does happen sometimes), the job is simple: add a new commit, run the tests, and we're off. What happens if there are already 5 jobs ahead of you in the gate? Let's take nova as a concrete example.

Speculative Merge

By the time a commit has gotten this far, it's already passed the test suites at least once, and has had at least 2 core contributors sign off on the change in code review. So Zuul assumes everything ahead of the change in the gate will succeed, and starts the tests immediately, cherry-picking the change on top of everything that's ahead of it in the queue.

[Figure: zuul-working]

That means merge time on the gate is O(1): merging 10 changes takes about the same wall-clock time as merging 1. If the queue gets too big we do eventually run out of devstack nodes, so the ability to run tests is not strictly constant. In the run up to grizzly-3, both of the cloud providers (HP and Rackspace) that contribute these VMs provided some extra quota to the OpenStack team to help keep things moving. So we had an elastic burst of OpenStack CI onto additional OpenStack public cloud resources, which is just fun to think about.
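To make the shape of that concrete, here's a toy model of speculative gating. The names and structure are made up for illustration; this is not Zuul's actual code or API.

```python
# A toy model of speculative gating (made-up names, not Zuul's actual code):
# each change starts its tests immediately, against the tree as it would look
# if everything ahead of it in the gate queue merges successfully.

gate_queue = []                        # change ids, in queue order

def enqueue(change_id):
    ahead = list(gate_queue)           # assume everything ahead will merge
    gate_queue.append(change_id)
    # kick off the test job right away, cherry-picked on top of `ahead`
    print(f"change {change_id}: testing on top of {ahead}")

for change_id in range(1, 6):
    enqueue(change_id)
# All five test runs are in flight at once, so the wall-clock time to merge
# five changes is about one test run, not five: the O(1) behavior described above.
```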

Speculation Can Fail

Of course, speculation can fail. Maybe change 3 doesn't merge because something goes wrong in the tests. If that happens, we kick the change out of the queue, and all the changes behind it have to be reset to pull change 3 out of the speculation. This is the dreaded gate reset: when it happens, all the time spent on speculative tests behind the failure is lost, and those jobs have to start over.

[Figure: zuul-reset]
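A similarly toy sketch of what a gate reset does (again, hypothetical names, nothing from Zuul itself):

```python
# A toy gate reset (hypothetical names, not Zuul's real behavior): the failing
# change is removed, and every change behind it restarts its speculative run.

gate_queue = [1, 2, 3, 4, 5]           # change ids, in queue order

def gate_reset(failed_id):
    idx = gate_queue.index(failed_id)
    gate_queue.remove(failed_id)
    for pos in range(idx, len(gate_queue)):
        # the old run assumed failed_id would merge; it didn't, so the job
        # restarts against the corrected view of the queue
        print(f"change {gate_queue[pos]}: restarting on top of {gate_queue[:pos]}")

gate_reset(3)                          # change 3 fails its tests
# change 4 restarts on top of [1, 2], change 5 on top of [1, 2, 4]; all the time
# already spent testing them behind change 3 is thrown away.
```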

Speculation failures largely fall into a few core classes:

Jenkins crashes - it doesn't happen often, but Jenkins is software too, and OpenStack CI tends to drive software really hard, so we flush out edge cases everywhere.

Upstream service failures - we try to isolate ourselves from upstream failures as much as possible. Our git trees pull from our gerrit, not directly from github. Our apt repository is a local Rackspace mirror, not the generic upstream one. And the majority of pip python packages come from our own proxy server. But if someone adds a new python dependency, or a new version of one is released that we don't yet have cached, we pass through to pypi for that pip install (see the sketch after this list). On Tuesday pypi converted from HTTP to HTTPS and didn't fully grok the load implications, which broke OpenStack CI (along with lots of other python developers) for a few hours while pypi was effectively down from load.

Transient OpenStack bugs - OpenStack is complicated software: 7 core components interacting with each other asynchronously over REST web services, with each core component itself being a collection of daemons that interact with each other asynchronously. Sometimes, something goes wrong. It's a real bug, but it only shows up under very specific timing and state conditions. Because OpenStack CI runs so many tests every day (OpenStack CI may be one of the largest creators of OpenStack guests in the world every day), very obscure edge and race conditions get exposed in the system. We try to track these as recheck bugs, and are making them high priority to address. By definition they are hard to track down (they expose themselves on maybe 1 out of 1000 or fewer test runs), so the logs captured in OpenStack CI are the tools we have to get to the bottom of them.
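Coming back to the upstream service failures item above, the pip fall-through is roughly this shape. The mirror URL is a hypothetical placeholder, and this is a sketch of the idea rather than our actual proxy configuration:

```python
# A sketch of the fall-through: serve from the local mirror when we have the
# package, and only reach out to pypi for things we have never cached.
# The mirror URL is a hypothetical placeholder, not our real proxy.

import urllib.error
import urllib.request

LOCAL_MIRROR = "http://pypi-mirror.example.org/simple"   # hypothetical local cache
UPSTREAM_PYPI = "https://pypi.python.org/simple"         # upstream pypi

def fetch_index(package):
    for base in (LOCAL_MIRROR, UPSTREAM_PYPI):
        try:
            with urllib.request.urlopen(f"{base}/{package}/", timeout=10) as resp:
                return resp.read()
        except urllib.error.URLError:
            continue      # not in the local mirror (or mirror down): fall through
    raise RuntimeError(f"{package} unavailable from both the mirror and pypi")

# Most installs are answered locally; a brand new dependency, or a version we
# haven't cached yet, is exactly the case that exposes us to a pypi outage.
```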

Towards an Even Better Gate

In my year working on OpenStack I've found the unofficial motto of the project to be "always try to make everything better". Continuous improvement is not just applied to the code and the tests, but to the infrastructure as well.

We're trying to get more urgency and eyes on the transient failures, and to come up with ways to discover the patterns in the 1 in 1000 fails. Once you have two or three that fail in the same way, it becomes much easier to triangulate the core issue. Core developers from all the projects are treating these as high priority items to fix.

On the upstream service failures, the OpenStack infrastructure team already has proxies sitting in front of many of the services, but the pypi outage showed we probably need something even more robust to handle an upstream outage, possibly rotating between pypi mirrors in the fall-through case, or a better proxy model. The team is already actively exploring solutions to prevent that from happening again.
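Rotating across mirrors in the fall-through path might look something like the sketch below. The mirror names are placeholders, and this is purely an illustration of the idea, not what the infra team has actually decided to build:

```python
# A sketch of "rotate between pypi mirrors on the fall-through case": instead
# of a single upstream, try each mirror in turn before giving up.

import itertools
import urllib.error
import urllib.request

FALL_THROUGH_MIRRORS = [
    "https://pypi.python.org/simple",        # canonical upstream
    "http://mirror-a.example.org/simple",    # hypothetical secondary mirrors
    "http://mirror-b.example.org/simple",
]

def fetch_with_rotation(package, attempts=6):
    # round-robin across mirrors so a single outage can't take the gate down
    for base in itertools.islice(itertools.cycle(FALL_THROUGH_MIRRORS), attempts):
        try:
            with urllib.request.urlopen(f"{base}/{package}/", timeout=10) as resp:
                return resp.read()
        except urllib.error.URLError:
            continue
    raise RuntimeError(f"all mirrors failed for {package}")
```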

As always, everyone is welcome to come help us make everything better. Take a look at the recheck bugs and help us solve them. Join us on #openstack-infra and help with Zuul. Check out what the live Zuul queue looks like. All the code for this system is open source, and available under either the openstack or openstack-infra github accounts. Patches are always welcome!

Refactoring LibreOffice

The FOSDEM 2013 talks are up now, and this one on LibreOffice refactoring really hit an interesting mark. The LibreOffice team has been aggressively rebuilding a culture of rapid change as a road to quality, bringing in a test and test automation culture, and leaving almost no part of the code sacred.

It's interesting that LibreOffice seems to be doing a much better job than OpenOffice at removing technical debt. I think we're already seeing the effect of that cultural split, and I expect that in the future this is going to get far more obvious.

I can't wait to get the Android remote working for future presentations. It will be a lot of fun to drive my talks that way.

Software Engineering Talk at Vassar College

While I've been giving talks at conferences and user groups for the last decade, I leveled up a little on Friday as an invited speaker in the Vassar College Computer Science Asprey Lecture Series. The topic was Software Engineering at Scale, using the OpenStack project as an example.

I gave the folks there a glimpse of what's behind a successful project that is able to integrate code from over 400 unique developers in 5 months' time. I talked about planning, the design summits, and the contribution and code review tools we use. But, as every time I talk about OpenStack, the thing that really wows people is the testing infrastructure we've got. The students and the CIS staff in the room latched onto it equally.

Code Contribution Path in OpenStack

On every code submission we run style checks and unit tests (5000 of them in Nova now), then spin up a full OpenStack install and hit it with a nearly 700-test integration suite, all before the first humans start looking at the code for manual review. It's an incredibly empowering system: it means developers are held to a high bar of submitting working code that doesn't break the behavior of the system. And it means that by the time expert eyes do code review, the kinds of problems they are looking for are much more interesting.
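The ordering is the important part: machines do the repetitive checks before any human spends time on the change. A rough sketch of that flow, where the command names are illustrative of the kinds of jobs rather than the exact job definitions OpenStack CI uses:

```python
# A sketch of the automated gauntlet a change runs before human review.
# The commands are illustrative, not the exact OpenStack CI job definitions.

import subprocess

AUTOMATED_CHECKS = [
    ["tox", "-e", "pep8"],        # style checks
    ["tox", "-e", "py27"],        # unit tests (~5000 in Nova)
    ["./run-integration.sh"],     # hypothetical stand-in for devstack + ~700 integration tests
]

def ready_for_human_review(workdir):
    """Only changes that pass every automated check get reviewer attention."""
    for cmd in AUTOMATED_CHECKS:
        if subprocess.run(cmd, cwd=workdir).returncode != 0:
            print(f"failed: {' '.join(cmd)} -- back to the author, no reviewer time spent")
            return False
    return True
```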

Just this morning it meant I could look through a new proposed extension in gerrit and focus on some of the functional behavior, including understanding which kinds of code the test system has a harder time touching. As a reviewer, the confidence that everything isn't on the verge of breaking all the time is enormous.

I've submitted a similar talk to the OpenStack summit, with a slightly different perspective: educating new developers on the process that takes an idea to code landing in the OpenStack tree. I'm hoping it gets selected, as it should be a good talk, and it would give me an excuse to polish some of my code flow diagrams a bit more.