Tag Archives: openstack

REST API Microversions

This is the version of the talk I gave at API Strat back in November about OpenStack’s API microversions implementation, mostly trying to frame the problem statement. The talk was only 20 minutes long, so you can only get so far into explaining the ramifications of the solution.

REST_API_MV-0

This talk dives into an interesting issue we ran into with OpenStack, an open source project that is designed to be deployed by public and private clouds, and exposes a REST API that users will consume directly. But in order to understand why we had to do something new, it’s first important to understand basic assumptions on REST API versioning, and where those break down.

REST_API_MV-1

There are some generally agreed upon rules for REST API versioning. Additive operations, like adding a key, shouldn’t break clients, because clients shouldn’t care about extra data they get back. If you add new code that adds a new attribute, like description, you can make these changes, roll them out to users, they get an extra thing, and life is good.
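For example, with a hypothetical minimal server resource:

GET /servers/1234 before the change:

{"server": {"id": "1234", "name": "web01"}}

GET /servers/1234 after the additive change:

{"server": {"id": "1234", "name": "web01", "description": "my web host"}}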

This mostly works as long as you have a single API instance, so new attributes show up “all at once” for users, and as long as you don’t roll back an API change after it’s been in production for any real length of time.

REST_API_MV-2

This itself is additive, you can keep adding more attributes over time. Lots of public web services use this approach. And this mostly just works.

REST_API_MV-3

It’s worth thinking for a moment about the workflow that supports this kind of approach. There is some master branch where features are being written and enabled, and at key points in time these features are pushed to production, where those features show up for users. Basic version control, nothing too exciting here.

REST_API_MV-4

But what does this look like in an Open Source project? In open source we’ve got a master upstream branch, and perhaps stable releases that happen from time to time. Here we’ll get specific and show the OpenStack release structure. OpenStack creates a stable release every 6 months, while main development continues on in git master. These are named with letters of the alphabet, with Liberty, Mitaka, Newton, Ocata, and Pike being the last 5 releases.

REST_API_MV-5

But releasing stable code doesn’t get it into the hands of users. Clouds have to deploy that code. Some may do that right away; others may take a while. Exactly when a release gets into the users’ hands is an unknown. This gets even more tricky when you consider private clouds that are consuming something like a Linux distro’s version of OpenStack. There is a delay getting into the distro, then another delay while deployers decide to upgrade.

REST_API_MV-6

End users have no visibility into the versions of OpenStack that operators have deployed, they only know about the API. So when they are viewing the world across clouds at a specific point in time (T1, T2, or T3) they will experience different versions of the API.

REST_API_MV-7

Let’s take that T3 example. If a user starts by writing their software to Cloud A, the description field is there. They are going to assume that’s just part of the base API. They then connect their software to Cloud B, and all is fine. But, when later in the week they point it at Cloud C, the API has magically “removed” attributes. Removing attributes in the server is never considered safe.

This might blow up immediately, or it might just carry a null value that fails very late in the processing of the data.

REST_API_MV-8

The lesson here is that the assumed “good enough” rules don’t account for the software being developed by a different team than the one deploying it. Sure, you say, that’s not good Dev Ops, of course it’s not supported. But be careful what you are saying there, because we deploy software we don’t develop all the time: 3rd party open source. And if you’ve come down firmly on the side that open source is not part of Dev Ops, I think a lot of people would look at you a bit funny.

REST_API_MV-9

I’m responsible for two names that took off in OpenStack; one is Microversions. Like any name that catches on, you regret it later, because it misses the important subtlety of what is going on. Don’t ask me to name things.

But besides the name, what are microversions?

REST_API_MV-10

Let’s look at that example again, if we experience the world at time T3, with Clouds A, B, and C, the real issue is that hitting Cloud C appears to make time “go backwards”. We’ve gone back in time and experienced the software at a much earlier version. How do we avoid going backwards?

REST_API_MV-11

We introduce a header for the “microversion” we want. If we don’t pass one, we get the minimum version the server supports. Everything after that we have to opt into. If we ask for a microversion the server doesn’t support we get a hard 400 fail on the request. This lets us fail early, which is more developer friendly than giving back unexpected data which might corrupt things much later in the system.
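As a sketch of what that looks like on the wire (the endpoint and token here are hypothetical; the header names are the ones Nova actually uses):

# ask for a specific microversion
curl -s https://cloud-a.example.com/compute/v2.1/servers \
    -H "X-Auth-Token: $TOKEN" \
    -H "OpenStack-API-Version: compute 2.53"

# older clients used the Nova-specific form of the header instead:
#   -H "X-OpenStack-Nova-API-Version: 2.53"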

REST_API_MV-12

Roughly, microversions are inspired by HTTP content negotiation, where you can ask for different types of content from the same url, and the server will give you the best it can (you define “best” with a quality value). Because most developers implementing REST clients aren’t deeply knowledgeable about low-level HTTP details, we went for simplicity and did this with a dedicated header. For simplicity we also made this a globally incrementing value across all resources, instead of per resource. We wanted there to be no confusion about what version 2.5 was.
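For contrast, classic content negotiation looks something like this (URL hypothetical), with the quality values defining “best”:

# ask for JSON, with XML as a lower-preference fallback
curl -s https://api.example.com/servers \
    -H "Accept: application/json;q=1.0, application/xml;q=0.5"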

The versions that a service supports are discoverable by hitting the root document of the API service. The other important note is that services are expected to continue to support old versions for a very long time. In Nova today we’re up to about 2.53, and we still support everything back down to 2.0. That represents about 2.5 years of API changes.
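Discovery looks roughly like this (URL and version numbers are illustrative):

curl -s https://cloud-a.example.com/compute/ | python -m json.tool

# abbreviated response:
# {
#     "versions": [
#         {
#             "id": "v2.1",
#             "status": "CURRENT",
#             "min_version": "2.1",
#             "version": "2.53"
#         }
#     ]
# }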

There are a lot more details on the justification for the approach, but not enough time today to go into them. If you want to learn more, I’ve got a blog writeup from when we first did this that dives in pretty deep, including showing the personas you’d expect to interact with this system.

REST_API_MV-13

Thus far this has worked pretty well. About 1/2 the base services in OpenStack have implemented a version of this. Most are pretty happy with the results. That version document I talked about can be seen here.

There are open questions for the future, mostly around raising minimum versions. No one has done it yet, though there are some thoughts about how you do that.

REST_API_MV-14

Since I’m here, and there are a lot of OpenAPI experts around, I wanted to talk just briefly about OpenStack and OpenAPI.

REST_API_MV-15

The OpenStack APIs date back to about 2010, when the state of the art for doing REST APIs was WADL, a now long-dead proposed specification from Sun Microsystems. Lots of XML. But those were the constraints at the time, and they are different constraints than OpenAPI’s.

One of those issues is our actions API, where we use the same url with very different payloads to do non-RESTy function calls, like rebooting a server. The other is Microversions, which don’t have any real way to map to OpenAPI without using vendor extensions, at which point you lose most of the interesting tooling in the ecosystem.

There is an open question in my mind about whether the microversion approach is interesting enough that it’s something we could consider for OpenAPI. OpenStack could easily microversion itself out of the actions API to something more OpenAPI friendly, but without microversion support there isn’t much point.

REST_API_MV-16

There was a talk yesterday about “Never have a breaking API change again”, which followed about 80% of our story, but didn’t need something like microversions because it was Azure, and they controlled when code got deployed to users.

There are very specific challenges for Open Source projects that expose a public web services API, and expect to be deployed directly by cloud providers. We’re all used to open source behind the scenes, as the plumbing of our services. But Open Source is growing into more areas. It is our services now. With things like OpenStack, Kubernetes, OpenWhisk… Open Source projects now are defining that user consumable API. If we don’t come up with common patterns for how to handle it, then we’re putting Open Source at a disadvantage.

I’ve been involved in Open Source for close to two decades, and I strongly believe we should make Open Source able to play on the same playing field as proprietary services. The only way we can do this is think about if our tools and standards support Open Source all the way out to the edge.

Questions

Question 1: How long do you expect to have to keep the old code around, and how bad is it to manage that?

Answer: Definitely a long time, minimum a few years. The implementation of all of this is in python and we’ve got a bunch of pretty good decorators and documentation that makes it pretty easy to compartmentalize the code. No one has yet lifted a minimum version, as the amount of work to support the old code hasn’t really been burdensome as of yet. We’ll see how that changes in the future.

Question 2: Does GraphQL solve this problem in a way that you wouldn’t need microversions?

Answer: That’s a very interesting question. When we got started GraphQL was pretty nascent so not really on our radar. I’ve spent some time looking at GraphQL recently, and I think the answer is “yes, in theory, no in practice”, and this is why.

Our experience with the OpenStack API over the last 7 years is that no one consumes your API directly. They almost always use some 3rd party SDK that gives them nice bindings which feel native to their programming language of choice. GraphQL is great when you are going to think about your interaction with the service in a really low level way, and ask only for the minimal data you need to do your job. But these SDK writers don’t know what you need, so when they build their object models, they just do so by putting everything in. At which point you are pretty much back to where we started.

I think GraphQL is going to work out really well for really popular services (like GitHub) where people are willing to take the hit to go super low level, or where you know the backend details well enough to understand the cost differential of asking for different attributes. But I don’t think it obviates trying to come up with server side versioning for APIs in Open Source.


Notes from API Strat

Back in November I had the pleasure to attend API Strat for the first time. It was 2 days of short (20 minute) sessions running in 3 tracks with people discussing web service API design, practice, and related topics. My interest was to get wider exposure to the API Microversions work that we did in OpenStack, and get out of that bubble to see what else was going on in the space.

Events on the Web

Event technologies being used by different web services

There were lots of talks that brought up the problem of getting real time events back to clients. Clients talking to servers is a pretty solved problem with RESTful interfaces, but the other direction is far from solved. The 5 leading contenders are Webhooks (over http), HTTP long polling, Web Sockets, AMQP, and MQTT. Each has its boosters, and its place, but this is going to be a messy space for the next few years.

OpenAPI’s version 3 specification includes webhooks, though with version 3 there is no simultaneously launched tooling. It will take some time before people build implementations around that. That’s a boost in that direction. Nginx is adding MQTT proxy support. That’s a boost in that direction.

Webhooks vs. Serverless

Speaking of webhooks, the keynote from Glenn Block of Auth0 brought up an interesting point: serverless effectively lives in the eventing space as well.

Webhooks are all fine and good for making your platform efficient and scalable, but if clients now have to run their own redundant, highly available services to catch events, that’s a lot of work, and many will just skip it. They found that once they built out a serverless platform where they could host their clients’ code, they got much more uptake on their event API. And, more importantly, they found that their power user customers were actually building out important features of their platform. He made a good case that every online service should really be considering an embedded serverless environment.

API Microversions

I was ostensibly there to talk about API Microversions, an approach we did in OpenStack to handle the fact that deployments of OpenStack upgrade at very different cadences. The talk went pretty well.

20 minutes was a challenge to explain something that took us all 6 months to get our heads around. I do think I managed to communicate the key challenge: when you build an open source system with a user facing API, how do users control what they get?  A lot of previous “good enough” rules fall down.

Darrel Miller had given a talk “How to never make another breaking API change“. His first 5 minutes were really similar to mine, and then, because this was about Azure, with a single controlled API instance, the solution veered in a different direction. It was solid reinforcement of the fact that we were on the right path here, and that the open source solution has a different additional constraint.

One of the key questions I got in Q&A is one I’d been thinking about. Does GraphQL make this all obsolete? GraphQL was invented by Facebook to get away from the HTTP GET/POST model of passing data around, and let you specify a pretty structured query about the data you need from the server. On paper, it solves a similar problem as microversions, because if you are really careful with your GraphQL you can ask for the minimum data you need, and are unlikely to get caught by things coming and going in the API. However, in practice, I’m not convinced it would work. In OpenStack we saw that most API usage was not raw API calls, it was through an SDK provided by someone in the community. If you are an SDK writer, it’s a lot harder to make assumptions about what parts of objects people want, so you’d tend to return everything. And that puts you right back with the same problem we have in REST in OpenStack.

API Documentation

There were many talks on better approaches for documentation, which resonated with me after the great OpenStack docs migration.

Taylor Barnett’s talk “Things I Wish People Told Me About Writing Docs” was one of my favorites. It included real user studies on what people actually read in documentation. It turns out that people don’t read your API documentation, they skim hard. They will read your code snippets as long as they aren’t too long. But they won’t read the paragraph before it, so if there is something really critical about the code, make it a comment in the code snippet itself. There was also a great cautionary tale to stop using phrases like “can be easily done”. People furiously hunting around your site trying to get code working are ramping up quickly. Words like “easy” make them feel dumb and frustrated when they don’t get it working on the first try. Having a little more empathy for the state of mind of the user when they show up goes a long way towards building a relationship with them, and making them more bought into your platform.

Making New Friends

I also managed to have an incredible dinner the first night I was in town, set up by my friend Chris Aedo. Both the food and conversation were amazing; I learned about Wordnik, distributed data systems, and that you can lose a year of research because ferrets bred for specific traits might be too dumb to be trained.

Definitely a lovely conference, and one I hope to make it back to next year.

Triple Bottom Line in Open Source

One of the more thought provoking things that came out of the OpenStack leadership training at Zingerman’s last year was the idea of the Triple Bottom Line. It’s something I continue to ponder regularly.

The Zingerman’s family of businesses definitely exists to make money, there are no apologies for that. However, it’s not the only bottom line they measure themselves against. The full bottom line they’ve defined for themselves is “Great Food, Great Service, Great Finance.” In practice this means you have to ensure that all three are being met, and not sacrifice the food and service just to make a buck.

If you look at Open Source through this kind of lens, a lot of the trade offs that successful projects make start to make a lot more sense. The TBL for OpenStack would probably be something like: Code, Community, Contributors. Yes, this is about building great code, to make a great cloud, but it’s also really critical to grow the community, and to mentor and grow individual contributors as well. Those contributors might stay in OpenStack, or they might go on to use their skills to help other Open Source projects be better in the future. All of these are measures of success.

This was one of the reasons we recently switched the development tooling in OpenStack (DevStack) to using systemd more natively. Not only did it solve a bunch of long standing technical issues that had really ugly workarounds, but it also meant enhancing our contributors. Systemd and the journal are the default in every new Linux environment now, so skills that our contributors gain working with DevStack directly transfer to any Linux environment. It makes them better Linux users in any context, not just OpenStack. It also makes the environment easier for people coming from the outside to understand, because it looks more like what they are used to.

While I don’t have enough data to back it up, it feels like this idea is central to success in Open Source: “In order to be successful in this project you must learn X, which will be useful in these other contexts outside of the project.” X has to be small enough to be learnable, but also has to be useful in other contexts, so the time invested has larger payoffs. That’s what growing a contributor looks like: they don’t just become better at your project, they become a better developer for everything they touch in the future.

My IRC proxy setup

IRC (Internet Relay Chat) is a pretty important communication medium for a lot of Open Source projects nowadays. While email is universal and lives forever, IRC is the equivalent of the hallway chat you’d have with a coworker to bounce ideas around. IRC has the advantage of being a reasonably simple and open (and old) protocol, so writing things that interface with it is about as easy as email clients. But, it has a pretty substantial drawback: you only get messages when you are connected to the channels in question.

Again, because it’s an open protocol, this is actually a solvable problem: have a piece of software on an always-on system somewhere that remains connected for you. There are 2 schools of thought here:

  • Run a text IRC client in screen or tmux on a system, and reconnect to the terminal session when you come in. WeeChat falls into this camp.
  • Run an irc proxy on a server, and have your IRC client connect to the proxy which replays all the traffic since the last time you were connected. Bip, ZNC, and a bunch of others fall into this camp.

I’m in Camp #2, because I find my reading comprehension of fixed width fonts is far worse than with variable width ones. I need my IRC client to be in a variable width font, which means console solutions aren’t going to help me.

ZNC

ZNC is my current proxy of choice. I’ve tried a few others, and dumped them for reasons I don’t entirely remember at this point. So ZNC it is.

I have a long standing VPS with Linode to host a few community websites. For something like ZNC you don’t need much horsepower, and could use cloud instances anywhere. If you are running debian or ubuntu on this cloud instance, apt-get install znc gets you rolling.

Run ZNC from the command line and you’ll get something like this:

znc fail

That’s because the first time up it needs to create a base configuration. Fortunately, it’s pretty straightforward what that needs to be.

znc --makeconf takes you through a pretty interactive configuration screen to build a base configuration. The defaults are mostly fine. The only thing to keep in mind is what port you make ZNC listen on, as you’ll have to remember to punch that port open in the firewall / security group for your cloud instance.

I also find the default of 50 lines of scrollback to be massively insufficient. I usually bounce that to 5000 or 10000.

Now connect your client to the server and off you go. If you have other issues with basic ZNC configuration, I’d suggest checking out the project website.

ZNC as a service

The one place ZNC kind of falls down is that out of the box (at least on ubuntu) it doesn’t have init scripts. Part of this is because the configuration file is very user specific and, as we saw with the interactive mode, is designed around asking you a bunch of questions. That means if your cloud instance reboots, your ZNC doesn’t come back.

I fixed this particular shortcoming with Monit. Monit is a program that monitors other programs on your system and starts or restarts them if they have faulted out. You can apt-get install it on debian/ubuntu.

Here is my base znc monit script:

znc monit
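In rough form, the stanza looks like the following (the paths, user, and memory threshold here are illustrative, not my exact config):

check process znc matching "znc"
    start program = "/usr/bin/znc --datadir=/home/irc/.znc"
        as uid irc and gid irc
    stop program = "/usr/bin/killall znc"
    if memory usage > 150.0 MB for 5 cycles then restart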

Because znc doesn’t do pid files right, this just matches on a process name. It has a start command, which includes the user / group to run as, a stop command, and some out of bounds criteria. All in a nice little DSL.

All that above will get you a basic ZNC server running, surviving cloud instance reboots, and make sure you never miss a minute of IRC.

But… what if we want to go further.

ZNC on ZNC

The idea for this comes from Dan Smith, so full credit where it is due.

If you regularly connect to IRC from more than one computer, but only have one ZNC proxy set up, the issue is that the scrollback gets replayed to the first computer that connects to the proxy. So jumping between computers to have conversations ends up being a very fragmented experience.

ZNC presents as just an IRC Server to your client. So you can layer ZNC on top of ZNC to create independent scrollback buffers for every client device. My setup looks something like this:

ZNC on ZNC

Which means that all devices have all the context for IRC, but I’m only presented as a single user on the freenode network.

Going down this path requires a bit more effort, which is why I’ve got the whole thing automated with puppet: znc-puppet.tar. You’ll probably need to do a little bit of futzing with it to make it work for your puppet managed servers (you do puppet all your systems, right?), but hopefully this provides a good starting point.

IRC on Mobile

Honestly, the Android IRC experience is… lacking. Most of the IRC applications out there for Android provide what is very much a desktop experience, which works poorly on a small phone.

Monty Taylor pointed me at IRCCloud, which is a service that provides a lot of the same offline connectivity as the ZNC stack. They have a web UI and an Android app, which actually provides a really great mobile experience. So if mobile is a primary endpoint for you, it’s probably worth checking out.

IRC optimizations for the Desktop

In the one last thing category, I should share the last piece of glue that I created.

I work from home, with a dedicated home office in the house. Most days I’m working on my desktop. I like to have IRC make sounds when my nick is mentioned, mostly so that I have some awareness that someone wants to talk to me. I rarely flip over to IRC right then; it just registers as a “will get to it later”, so I can largely keep my concentration wherever I’m at.

That being said, OpenStack is a 24hr a day project. People ping me in the middle of the night. And if I’m not at my computer, I don’t want it making noise. Ideally I’d even like them to see me as ‘away’ in IRC.

Fortunately, most desktop software in Linux integrates with a common messaging bus: dbus. The screensaver in Ubuntu emits a signal on lock and unlock. So I created a custom script that mutes audio on screen lock, unmutes it on screen unlock, as well as sends ‘AWAY’ and ‘BACK’ commands to xchat for those state transitions.

You can find the script as a gist.
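In rough form, the core of that script is a loop like this (the screensaver interface and mixer commands are assumptions for illustration; the real gist also handles the xchat AWAY/BACK piece):

#!/bin/bash
# watch for screen lock / unlock signals and toggle audio mute
dbus-monitor --session \
    "type='signal',interface='org.gnome.ScreenSaver',member='ActiveChanged'" |
while read -r line; do
    case "$line" in
        *"boolean true"*)    # screen locked
            amixer -q set Master mute
            ;;
        *"boolean false"*)   # screen unlocked
            amixer -q set Master unmute
            ;;
    esac
done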

So… this was probably a lot to take in. However, hopefully getting an idea of what an advanced IRC workflow looks like will give folks ideas. As always, I’m interested in hearing about other things people have done. Please leave a comment if you’ve got an interesting productivity hack around IRC.

OpenStack as Layers

Last week at LinuxCon I gave a presentation on DevStack which gave me the proper excuse to turn an idea that Dean Troyer floated a year ago about OpenStack Layers into pictures (I highly recommend reading that for background, I won’t justify every part of that here again). This abstraction has been something that’s actually served us well as we think about projects coming into DevStack.

OpenStack_Layers

Some assumptions are made here about what the essential services are as we build up the model.

Layer 1: Base Compute Infrastructure

We assume that compute infrastructure is the common starting point for the minimum functional OpenStack that people are deploying. The output of the last OpenStack User Survey shows that the top 3 deployed services, regardless of type of cloud (Dev/Test, POC, or Production), are Nova / Glance / Keystone. So I don’t think this is a huge stretch. There are definitely users that take other slices (like Swift only), but compute is what the majority of people coming to OpenStack seem to be focused on.

Basic compute infrastructure needs 3 services to get running: Nova, Glance, and Keystone. That will give you a stateless compute cloud, which is a starting point for many people getting into the space for the first time.
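Since this came out of a DevStack talk: in DevStack terms a Layer 1 cloud is roughly the following kind of localrc (service names from memory of that era, so treat them as illustrative):

# minimal compute cloud: Keystone, Glance, Nova, plus supporting infra
ENABLED_SERVICES=rabbit,mysql,key
ENABLED_SERVICES+=,g-api,g-reg
ENABLED_SERVICES+=,n-api,n-cpu,n-sch,n-cond,n-net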

Layer 2: Extended Infrastructure

Once you have a basic bit of compute infrastructure in place, there are some quite common features that you really do need to do more interesting work. These are basically enhancements of the Storage, Networking, or Compute aspects of OpenStack. Looking at the User Survey, these are all deployed by people, in various ways, at a pretty high rate.

This is the first place we see new projects integrating into OpenStack. Ironic extends the compute infrastructure to baremetal, and Designate adds a missing piece of the networking side with DNS management as part of your compute creation.

Hopefully nothing all that controversial here.

Layer 3: Optional Enhancements

Now we get a set of currently integrated services that integrate both northbound and southbound. Horizon integrates with the northbound APIs of all the services, so it requires services further down in the layers (today it also integrates with pieces further up that are integrated). Ceilometer consumes southbound parts of OpenStack (notifications) and polls northbound interfaces.

From the user survey, Horizon is deployed a ton. Ceilometer, not nearly as much. Part of this is due to how long things have been integrated, but even if you do analysis like taking the Cinder / Neutron numbers and deleting all the Folsom deploys (the first release where those projects were integrated), you still see a picture where Ceilometer is behind on adoption. Recent mailing list discussions hint at why, including some of the scaling issues, and a number of alternative communities in this space.

Let’s punt on Barbican, because honestly, it’s new since we came up with this map, and maybe it’s really a layer 2 service.

Layer 4: Consumption Services

I actually don’t like this name, but I failed to come up with something better. Layer 4 in Dean’s post was “Turtles all the way down”, which isn’t a great description either.

This is a set of things which consume other OpenStack services to create new services. Trove is the canonical example: create a database as a service by orchestrating Nova compute instances with mysql installed in them.

The rest of the layer 4 services all fit the same pattern, even Heat. Heat really is about taking the rest of the components in OpenStack and building a super API for their creation; it also includes auto scaling functionality based on this. All of these integrated services need a guest agent to do a piece of their function, which means when testing them in OpenStack we don’t get very far with the Cirros minimal guest that we use for Layer 3 and down.

But again, as we look at the user survey we can see deployment of all of these Layer 4 services is lighter again. And this is what you’d expect as you go up these layers. These are all useful services to a set of users, but they aren’t all useful to all users.

I’d argue that the confusion around Marconi’s place in the OpenStack ecosystem comes with the fact that by analogy it looks and feels like a Layer 4 service like Trove (where a starting point would be allocating computes), but is implemented like a Layer 2 one (straight up raw service expected to be deployed on bare metal out of band). And yet it’s not consumable as the Queue service for the other Layer 1 & 2 services.

Leaky Taxonomy

This is not the be-all and end-all way to look at OpenStack. However, this layered view of the world confuses people a lot less than the normal view we show them, the giant spider diagram (aka the mandatory architecture slide for all OpenStack presentations):

OpenStack_Spider_Diagram

This picture is in every deep dive on OpenStack, and scares the crap out of people who think they might want to deploy it. There is no starting point, there is no end point. How do you bite that off in a manageable chunk as the spider grows?

I had one person come up to me after my DevStack talk giving a big thank you. He’d seen a presentation on Cloudstack and OpenStack previously and OpenStack’s complexity from the outside so confused him that he’d run away from our community. Explaining this with the layer framing, and showing how you could experiment with this quickly with DevStack cleared away a ton of confusion and fear. And he’s going to go dive in now.

Tents and Ecosystems

Today the OpenStack Technical Committee is in charge of deciding the size of the “tent” that is OpenStack. The approach to date has been a big tent philosophy, where anything that’s related, and has a REST API, is free to apply to the TC for incubation.

But a big Tent is often detrimental to the ecosystem. A new project’s first goal often seems to be getting incubated, to earn the gold star of TC legitimacy that they believe is required to build a successful project. But as we’ve seen recently, a TC star doesn’t guarantee success, and honestly, the constraints on being inside the tent are actually pretty high.

And then there is a language question, because OpenStack’s stance on everything being in Python is pretty clear. An ecosystem that only exists to spawn incubated projects, and incubated projects only being allowed to be in Python, basically means an ecosystem devoid of interesting software in other languages. That’s a situation that I don’t think any of us want.

So what if OpenStack were a smaller tent, and not all the layers that are in OpenStack today were part of the integrated release in the future? Projects could be considered a success based on their users and usage out of the ecosystem, and not whether they have a TC gold star. Stackforge wouldn’t have some stigma of “not cool enough”, it would be the normal place to exist as part of the OpenStack ecosystem.

Mesos is an interesting cloud community that functions like that today. Mesos has a small core framework, and a big ecosystem. The smaller core actually helps grow the ecosystem by not making the ecosystem 2nd class citizens.

I think that everyone who works on OpenStack itself, and on all the ecosystem projects, wants this whole thing to be successful. We want a future with interoperable, stable, open source cloud fabric as a given. There are lots of thoughts on how we get there, and as no one has ever created a universal open source cloud fabric that lets users move freely between providers, public and private, it’s no surprise that as a community we haven’t figured everything out yet.

But here’s another idea into the pool, under the assumption that we are all smarter together with all the ideas on the table, than any of us are on our own.

Splitting up Git Commits

Human review of code takes a bunch of time. It takes even longer if the proposed code has a bunch of unrelated things going on in it. A very common piece of review commentary is “this is unrelated, please put it in a different patch”. You may be thinking to yourself “gah, so much work”, but it turns out git has built-in tools to do this. Let me introduce you to git add -p.

Let’s look at this Grenade review – https://review.openstack.org/#/c/109122/1. This was the result of a day’s worth of hacking to get some things in order. Joe correctly pointed out there was at least 1 unrelated change in that patch (I think he was being nice, there were probably at least 4 things going on that should have been separate). Those things are:

  • The quiesce time for shutdown, which actually fixes bug 1285323 all on its own.
  • The reordering on the directory creates so it works on a system without /opt/stack
  • The conditional upgrade function
  • The removal of the stop short circuits (which probably shouldn’t have been done)

So how do I turn this 1 patch, which is at the bottom of a patch series, into 3 patches, plus drop out the bit that I did wrong?

Step 1: rebase -i master

I start by running git rebase -i master on the tree to put myself into interactive rebase mode. In this case I want to be editing the first commit, to split it out.

screenshot_171
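The todo list in that editor session looks roughly like this (hashes and subjects hypothetical); changing pick to edit on the bottom commit tells git to stop there and let me rework it:

edit 1a2b3c4 grenade: a day's worth of cleanups
pick 5d6e7f8 second patch in the series
pick 9a0b1c2 third patch in the series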

Step 2: reset the changes

git reset ##### will unstage all the changes back to the referenced commit, so I’ll be working from a blank slate to add the changes back in. So in this case I need to figure out the last commit before the one I want to change, and do a git reset to that hash.

screenshot_173
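Because the rebase has stopped with HEAD sitting on the commit being split, the reset is just one step back:

# move the branch pointer back one commit, leaving all of that
# commit's changes unstaged in the working tree
git reset HEAD^
git status    # every file from the original commit now shows as modified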

Step 3: commit in whole files

Unrelated change #1 was fully isolated in a whole file (stop-base), so that’s easy enough to do a git add stop-base and then git commit to build a new commit with those changes. When splitting commits always do the easiest stuff first to get it out of the way for tricky things later.

Step 4: git add -p 

In this change, grenade.sh needs to be split up all by itself, so I ran git add -p to start the interactive git add process. You will be presented with a series of patch hunks and a prompt about what to do with them: y = yes, add it; n = no, don’t; and lots of other options for being trickier.

screenshot_176
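The prompt looks roughly like this (the exact option letters vary with your git version and the hunk’s position in the file):

$ git add -p
diff --git a/grenade.sh b/grenade.sh
--- a/grenade.sh
+++ b/grenade.sh
@@ -210,6 +210,14 @@
 ...hunk contents...
Stage this hunk [y,n,q,a,d,/,e,?]?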

In my particular case the first hunk is actually 2 different pieces of function, so y/n isn’t going to cut it. In that case I can type ‘e’ (edit), and I’m dumped into my editor staring at the patch, which I can interactively modify to be the patch I want.

screenshot_177

I can then delete the pieces I don’t want in this commit. Those deleted pieces will still exist in the uncommitted work, so I’m not losing any work, I’m just not yet dealing with it.

screenshot_178

Ok, that looks like just the part I want, as I’ll come back to the upgrade_service function in patch #3. So save it, and find all the other hunks in the file that are related to that change, to add them to this patch as well.

screenshot_179

Yes, to both of these, as well as one other towards the end, and this commit is ready to be ‘git commit’ed.

Now what’s left is basically just the upgrade_service function changes, which means I can git add grenade.sh as a whole. I actually decided to fix up the stop calls before doing that, just by editing grenade.sh before adding the final changes. After it’s done, git rebase --continue rebases the rest of the changes on top of this, giving me a shiny new 5 patch series that’s a lot clearer than the 3 patch one I had before.

Step 5: Don’t forget the idempotent ID

One last important thing. This was a patch in gerrit before, which means that when I started, there was an idempotent ID (the gerrit Change-Id) on the change. In splitting 1 change into 3, I added that ID back to patch #3 so that reviewers would understand this was an update to something they had reviewed before.

It’s almost magic

As a git user, git add -p is one of those things like git rebase -i that you really need in your toolkit to work with anything more than trivial patches. It takes practice to have the right intuition here, but once you do, you can really slice up patches in a way that are much easier for reviewers to work with, even if that wasn’t how the code was written the first time.

Code that is easier for reviewers to review wins you lots of points, and will help with landing your patches in OpenStack faster. So taking the time upfront to get used to this is well worth your time.

OpenStack Failures

Last week we had the bulk of the brain power of the OpenStack QA and Infra teams all in one room, which gave us a great opportunity to spend a bunch of time diving deep into the current state of the Gate, figure out what’s going on, and how we might make things better.

Over the course of 45 minutes we came up with this picture of the world.

14681027401_327a720647_o

We have a system that’s designed to merge good code, and keep bugs out. The problem is that while it’s doing a great job of keeping big bugs out, subtle bugs, ones that are low percentage (like show up in only 1% of test runs) can slip through. These bugs don’t go away, they instead just build up inside of OpenStack.

As OpenStack expands in scope and function, these bugs increase as well. They might grow or shrink based on seemingly unrelated changes: dependency changes (which we don’t gate on), or timing impacts from anything in the underlying OS.

As OpenStack has grown, no one has a full view of the system any more, so even identifying whether a bug might or might not be related to their patch is something most developers can’t do. The focus of an individual developer is typically just wanting to land their code, not diving into the system as a whole. This might be because they are on a schedule, or just that landing code feels more fun and productive than digging into existing bugs.

From a social aspect, we seem to have found that there is some threshold failure rate in the gate that we always return to. Everyone ignores the underlying races until we get to that failure rate, and once we are above it for long periods of time, everyone assumes fixing it is someone else’s responsibility. We had an interesting experiment recently where we dropped 300 Tempest tests by turning off Nova v3 by default, which gave us a short term drop in failures, but within a couple of months we were back up to our unpleasant failure rate in the gate.

Part of the visibility question is also that most developers in OpenStack don’t actually understand how the CI system works today, so when it fails, they feel powerless. It’s just a big black box blocking their code, and they don’t know why. That’s incredibly demotivating.

Towards Solutions

Every time the gate fail rates get high, debates show up in IRC channels and on the mailing list with ideas to fix it. Many of these ideas are actually features that were added to the system years ago. Some are ideas that are provably wrong, like autorecheck, which would just increase the rate of bug accumulation in the OpenStack code base.

A lot of good ideas were brought up in the room, over the next week Jim Blair and I are going to try to turn these into something a little more coherent to bring to the community. The OpenStack CI system tries to be the living and evolving embodiment of community values at any point in time. One of the important things to remember is those values aren’t fixed points either.

The gate doesn’t exist to serve itself, it exists because before OpenStack had one, back in the Diablo days, OpenStack simply did not work. HP Cloud had 1000 patches to Diablo to be able to put it into production, and took 2 years to migrate from it to another version of OpenStack.

Processing OpenStack GPG keys in Thunderbird

If you were part of the OpenStack keysigning party from the summit, you are currently probably getting a bunch of emails sent by caff. This is an easy way to let a key signer send you your signed key.

These are really easy to process if you are using Thunderbird + Enigmail as your signed/encrypted mail platform. Just open up the mail attachments, right click, and import key:

screenshot_161

Once you’ve done this you’ll have imported the signatures into your local database. Then from the command line you can:

gpg --send-key YOURKEYID

And then you are done.
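If you want to confirm what went out, listing the signatures now attached to your key is easy:

gpg --list-sigs YOURKEYID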

Happy GPGing!

Helpful Gerrit Queries (Gerrit 2.8 edition)

Gerrit got a very nice upgrade recently which brings in a whole new host of features that are really interesting. Here are some of the things you should know to make use of these new features. You might want to read up on the basics of gerrit searches here: Gerrit queries to avoid review overload, before getting started.

Labels

Gone are the days of -CodeReview-1; we now have a more generic mechanism called labels. Labels are a lot more powerful because they can specify ranges as well as specific users!

For instance, to select everything without negative code reviews:

status:open NOT label:Code-Review<=-1

Because we now have operators, we can select for a range of values, so any negative vote (-1, -2, or any higher negative value should it get implemented in the future) matches. Negation is done with the ‘NOT’ keyword, and it’s notable that CodeReview becomes label:Code-Review in the new system.

Labels exist for all three columns. Verified is what CI bots vote in, and Workflow is a combination of the Work in Progress (Workflow=-1) and Approved (Workflow=1) states that we used to have.

Labels with Users

Labels get really powerful when you start adding users to them. Now that we have a ton of CI bots voting, with regular issues in their systems, you might want to filter by changes that Jenkins currently has a positive vote on.

status:open label:Verified>=1,jenkins

This means that changes which do not yet have a Jenkins +1 or +2 won’t be shown in your list, hiding patches which are currently blocked by Jenkins, or which it hasn’t reported on yet. If you want to see changes Jenkins hasn’t voted on yet, you could change that to >=0.

Labels with Self

This is where it gets really fun. There is a special user, self, which means your logged in id.

status:open NOT label:Code-Review>=0,self label:Verified>=1,jenkins NOT label:Code-Review<=-1

This is a list of all changes that ‘you have not yet commented on’, that don’t have negative code reviews, and that Jenkins has passing results. That means this query becomes a todo list, because as you comment on changes, positive, negative, or otherwise, they drop out of this query.

If you also drop all the work in progress patches:

status:open NOT label:Code-Review>=0,self label:Verified>=1,jenkins NOT label:Code-Review<=-1 
  NOT label:Workflow<=-1

then I consider this a basic “Inbox zero” review query. You can apply this to specific projects with “project:openstack/nova”, for instance. Out of this basic chunk I’ve built a bunch of local links to work through reviews.

File Matching

With this version of gerrit we get a thing called secondary indexes, which basically means we also have a search engine for certain other types of queries. This includes matching changes against files.

status:open file:^.*/db/.*/versions/.* project:^openstack.*

is a query that looks at all the outstanding changes in OpenStack that change a database migration. It’s currently showing glance, heat, nova, neutron, trove, and storyboard changes.

Very helpful if as a reviewer you want to keep an eye on a cross section of changes regardless of project.

Learning more

There are also plenty of other new parts of this query language. You can learn all the details in the gerrit documentation.

We’re also going to work at making some of these “inbox zero” queries available in the gerrit review system as a custom dashboard, making it easy to use it on any project in the system without building local bookmarks to queries.

Happy reviewing!


Bash trick of the week – call stacks

For someone that used to be very vocal about hating shell scripting, I seem to be building more and more tools related to it every day. The latest is caller (from “man bash”):

caller [expr]
Returns the context of any active subroutine call (a shell function or a script executed with the . or source builtins). Without expr, caller displays the line number and source filename of the current subroutine call. If a non-negative integer is supplied as expr, caller displays the line number, subroutine name, and source file corresponding to that position in the current execution call stack. This extra information may be used, for example, to print a stack trace. The current frame is frame 0. The return value is 0 unless the shell is not executing a subroutine call or expr does not correspond to a valid position in the call stack.
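A minimal sketch of what that gives you (file name and line number hypothetical):

#!/bin/bash
# demo.sh - caller reports where the current function was invoked from
function inner {
    caller 0    # prints: <line> <calling function> <file>
}
function outer {
    inner
}
outer           # outputs something like: 7 outer ./demo.sh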

This means that if your bash code makes heavy use of functions, you can get the call stack back out. This turns out to be really handy for things like writing testing scripts. I recently added some more unit testing to devstack-gate, and used this to make it easy to see what was going on:

# Utility function for tests
function assert_list_equal {
    local source=$(echo $1 | awk 'BEGIN{RS=",";} {print $1}' | sort -V | xargs echo)
    local target=$(echo $2 | awk 'BEGIN{RS=",";} {print $1}' | sort -V | xargs echo)
    if [[ "$target" != "$source" ]]; then
        echo -n `caller 0 | awk '{print $2}'`
        echo -e " - ERRORn $target n != $source"
        ERRORS=1
    else
    # simple backtrace progress detector
        echo -n `caller 0 | awk '{print $2}'`
        echo " - ok"
    fi
}

The output ends up looking like this:

ribos:~/code/openstack/devstack-gate(master)> ./test-features.sh 
test_full_master - ok
test_full_feature_ec - ok
test_neutron_master - ok
test_heat_slow_master - ok
test_grenade_new_master - ok
test_grenade_old_master - ok
test_full_havana - ok

I never thought I’d know this much bash, and I still think data structure manipulation in bash is craziness, but for imperative programming that’s largely a lot of command calls, it works pretty well.