All posts by Sean Dague

When algorithms surprise us

Machine learning algorithms are not like other computer programs. In the usual sort of programming, a human programmer tells the computer exactly what to do. In machine learning, the human programmer merely gives the algorithm the problem to be solved, and through trial-and-error the algorithm has to figure out how to solve it.

This often works really well - machine learning algorithms are widely used for facial recognition, language translation, financial modeling, image recognition, and ad delivery. If you’ve been online today, you’ve probably interacted with a machine learning algorithm.

But it doesn’t always work well. Sometimes the programmer will think the algorithm is doing really well, only to look closer and discover it’s solved an entirely different problem from the one the programmer intended. For example, I looked earlier at an image recognition algorithm that was supposed to recognize sheep but learned to recognize grass instead, and kept labeling empty green fields as containing sheep.

Source: Letting neural networks be weird • When algorithms surprise us

There are so many really interesting examples she has collected here, and they show us the power and danger of black boxes. In a lot of ways machine learning is just an extreme case of all software. People tend to write software on an optimistic path, and ship it after it looks like it's doing what they intended. When it doesn't, we call that a bug.

The difference between traditional approaches and machine learning is that debugging machine learning is far harder. You can't just put in an extra if condition, because the logic to get an answer isn't expressed that way. It's expressed in 100,000 weights on a 4-level convolutional network. Which means QA is much harder, and machine learning is far more likely to surprise you with unexpected wrong answers on edge conditions.
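To make that concrete, here's a toy sketch (assuming PyTorch, which isn't something the post itself uses) of what that kind of model looks like; there's nowhere to put an if condition:

```python
# A toy 4-layer convolutional network whose entire "logic" is roughly 100,000
# learned weights, with no human-readable rules to patch when it misbehaves.
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    nn.Conv2d(128, 10, 3, padding=1),
)

print(sum(p.numel() for p in net.parameters()))  # ~105,000 parameters
```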

Does the CAFE standard of 55 mpg seem high? It's not the real number, and the real number is a lot more interesting.

If automakers complied with the rules solely by improving the fuel economy of their engines, new cars and light trucks on the road would average more than 50 miles per gallon by 2025 (the charts here break out standards for cars and light trucks separately). But automakers in the United States have some flexibility in meeting these standards. They can, for instance, get credit for using refrigerants in vehicle air-conditioning units that contribute less to global warming, or get credit for selling more electric vehicles.

Once those credits and testing procedures are factored in, analysts expected that new cars and light trucks sold in the United States would have averaged about 36 miles per gallon on the road by 2025 under the Obama-era rules, up from about 24.7 miles per gallon in 2016. Automakers like Tesla that sold electric vehicles also would have benefited from the credit system.

Source: How U.S. Fuel Economy Standards Compare With the Rest of the World’s - The New York Times

This is one of those areas where most reporting on the CAFE standard rollback has been terrible. You tell people the new CAFE standard is 55 mpg, and they look at their SUV and say, that's impossible. With diesel off the table after the VW scandal, only the best hybrids today are in that 55 mpg range. How could that be the average?

But it's not; it's 55 mpg equivalent. You get credit for lots of other things: EVs in the fleet, doing a better job on the refrigerant switchover. 2025 would see a real fleet average of around 36 mpg if this were kept in place.

More importantly, rolling back this standard is going to make US car companies less competitive. The rest of the world is moving in this direction, and US companies that don't hit these marks will have a shrinking global market.

The future of scientific papers

The more sophisticated science becomes, the harder it is to communicate results. Papers today are longer than ever and full of jargon and symbols. They depend on chains of computer programs that generate data, and clean up data, and plot data, and run statistical models on data. These programs tend to be both so sloppily written and so central to the results that it’s contributed to a replication crisis, or put another way, a failure of the paper to perform its most basic task: to report what you’ve actually discovered, clearly enough that someone else can discover it for themselves.

Perhaps the paper itself is to blame. Scientific methods evolve now at the speed of software; the skill most in demand among physicists, biologists, chemists, geologists, even anthropologists and research psychologists, is facility with programming languages and “data science” packages. And yet the basic means of communicating scientific results hasn’t changed for 400 years. Papers may be posted online, but they’re still text and pictures on a page.

Source: The Scientific Paper Is Obsolete. Here's What's Next. - The Atlantic

The scientific paper is definitely being strained in its ability to vet ideas. The article gives a nice narrative through the invention of Mathematica and then Jupyter as the path forward. The digital notebook is an incredibly useful way to share data analysis, as long as the data sets are made easily available. The DAT project has some thoughts on making that easier.

The one gripe I've got with it is that it could be a bit more clear that Mathematica was never going to be the future here. Wolfram has tons of great ideas, and Mathematica is really great stuff. I loved using it in college 20 years ago on SGI Irix systems. But one of the critical parts of science is sharing and longevity, and doing that on top of a proprietary software platform is not a foundation for building the next 400 years of science. A driving force behind Jupyter is that, being open source all the way down, it's reasonably future proof.

Electricity Map

In looking for information related to my ny-power demo (which shows the real-time CO2 intensity on the New York power grid), I discovered Electricity Map. It is doing a similar thing, but at a global scale. It started primarily focused on Europe, but it is an open source project and has contributions from all over the world. I recently helped with some accounting and references for the NY ISO region.

You'll notice a lot of the map is grey in the US. That's because while most of the public ISOs publish their real-time data on the web, private power entities tend not to. It's a shame, because without that data you can't get a complete picture.

What's also notable is how different the power profile looks between different regions in the US.

It's also really interesting if you take a look at Europe.

Germany is quite bad on its CO2 profile compared to neighboring countries. That's because they've been turning coal plants back on as they shut down their nuclear facilities. Coal makes up a surprisingly high part of their grid now.

The entire map is interactive and a great way to explore how energy systems are working around the world.

Climate change goes to court

Alsup insisted that this tutorial was a purely educational opportunity, and his enjoyment of the session was obvious. (For the special occasion, he wore his “science tie” under his robes, printed with a graphic of the Solar System.) But the hearing could have impacts beyond the judge’s personal edification, Wentz says. “It’s a matter of public record, so you certainly could refer to it in a court of public opinion, or the court of law in the future,” she says. Now, Wentz says, there’s a formal declaration in the public record from a Chevron lawyer, stating once and for all: “It is extremely likely that human influence has been the dominant cause of the observed warming since the mid-20th century.”

Source: Chevron’s lawyer says climate change is real, and it’s your fault - The Verge

This week Judge Alsup held a personal education session for himself on the upcoming case in which several California cities are suing the major fossil fuel companies, on the theory that the companies knew climate change was a real threat back in the 80s and 90s and actively spread disinformation to sow doubt. This is one of many cases going forward under similar framing.

What makes this one different is Alsup. He was the judge who handled the Oracle vs. Google case, where he taught himself programming to be sure he was getting it right. For this case, he held a 5-hour education session covering every question he could imagine about climate change and geology. The whole article is amazing, and Alsup is really a treasure to have on the bench.

The 10,000 Year Clock Under Construction

A clock designed to ring once a year for the next 10,000 years has begun installation in the mountains of west Texas. This is a project of the Long Now Foundation, a group dedicated to promoting long-term thinking. Human civilization is roughly 10,000 years old, so let's think about what the next 10,000 years might bring.

Clock of the Long Now - Installation Begins from The Long Now Foundation on Vimeo.

I really love this clock. I love the hope that it represents. We have a lot of challenges to solve to get there, but setting a milestone like this puts a stake in the ground that we're going to fight to ensure there is someone to hear it in 10,000 years.

The Long Now also has an excellent podcast / lecture series which you should add to your rotation.


MQTT, Kubernetes, and CO2 in NY State

Back in November we decided to stop waiting for our Tesla Model 3 (ever-changing estimates) and bought a Chevy Bolt EV (which we could do right off the lot). A week later we had a level 2 charger installed at home, and a work order in for a time-of-use meter. Central Hudson's current time-of-use peak window is just 2 - 7pm on weekdays; everything else is considered off peak. That's very easy to avoid charging during, but is it actually the optimal time to charge? Especially if you are trying to limit the CO2 footprint of your electricity? How would we find out?

The NY Independent System Operator (NYISO) manages the generation of between 75% and 85% of the electricity used in the state at any given time. For the electricity it manages, it provides some very detailed views of what is going on.

There is no public API for this data, but they do publish CSV files at 5-minute resolution on a public site that you can ingest. For the current day they are updated every 5 to 20 minutes, so you can get a near real-time view of the world. That view shows a much more complicated mix of energy demand over the course of the day, which isn't just about avoiding the 2 - 7pm window.
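For example, pulling the current-day fuel mix into pandas looks roughly like this (the URL pattern and column names here are from memory and may not be exact):

```python
# A sketch of ingesting the NYISO real-time fuel mix CSV for the current day.
from datetime import date

import pandas as pd

RTFUELMIX = "http://mis.nyiso.com/public/csv/rtfuelmix/%srtfuelmix.csv"


def latest_fuel_mix():
    df = pd.read_csv(RTFUELMIX % date.today().strftime("%Y%m%d"),
                     parse_dates=["Time Stamp"])
    # Keep just the most recent 5-minute interval, one row per fuel category.
    latest = df[df["Time Stamp"] == df["Time Stamp"].max()]
    return latest[["Fuel Category", "Gen MW"]]


if __name__ == "__main__":
    print(latest_fuel_mix())
```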

Building a public event stream

With my upcoming talk at IndexConf next week on MQTT, this jumped out as an interesting demonstration of it: turn these public polling data sets into an MQTT live stream, and add some calculation on top to estimate the CO2 emitted per kWh at any given moment. The entire system is written as a set of microservices on IBM Cloud running in Kubernetes.

The services are as follows:

  • ny-power-pump - a polling system that looks for newly published content and publishes it to the MQTT bus (a rough sketch of the publish side follows this list)
  • ny-power-mqtt - a Mosquitto MQTT server (exposed at mqtt.ny-power.org). It can be anonymously read by anyone
  • ny-power-archive - an MQTT client that watches the MQTT event stream and sends data to InfluxDB for time series calculations. It also exposes recent time series as additional MQTT messages.
  • ny-power-influx - the InfluxDB time series database.
  • ny-power-api - serves up a sample webpage that runs an MQTT-over-websockets bit of JavaScript (available at http://ny-power.org)
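To make the pump concrete, here's a rough sketch of its publish side, assuming paho-mqtt; the topic names are illustrative rather than the demo's actual ones:

```python
# Publish any CSV rows we haven't seen before onto the MQTT bus.
import json

import paho.mqtt.publish as publish

MQTT_HOST = "mqtt.ny-power.org"
seen = set()


def publish_new_rows(rows):
    """rows: parsed CSV rows, e.g. {"ts": "...", "fuel": "Hydro", "mw": 2345.0}"""
    for row in rows:
        key = (row["ts"], row["fuel"])
        if key in seen:
            continue  # only publish intervals we haven't already sent
        seen.add(key)
        publish.single(
            "ny-power/upstream/fuel-mix/%s" % row["fuel"],
            json.dumps(row),
            hostname=MQTT_HOST,
            retain=True,  # late subscribers immediately get the latest value
        )
```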

Why MQTT?

MQTT is a lightweight messaging protocol built around a publish / subscribe server. It's extremely popular in the Internet of Things space because of how simple the protocol is. That simplicity lets it be embedded in microcontrollers like the Arduino.

MQTT has the advantage of being something you can just subscribe to, and then take action only when interesting information arrives. For a slow-changing data stream like this, giving applications access to an open event stream means they can start doing something more quickly. It also drastically reduces network traffic: instead of constantly downloading and comparing CSV files, the application gets a few bytes when something relevant happens.
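And here's roughly what a consumer looks like (paho-mqtt 1.x callback style; the topic filter is an assumption, the broker hostname is the one above):

```python
# Subscribe to the stream and react to messages as they arrive.
import paho.mqtt.client as mqtt


def on_connect(client, userdata, flags, rc):
    client.subscribe("ny-power/#")


def on_message(client, userdata, msg):
    # Runs only when a message arrives; no CSV polling loop needed.
    print(msg.topic, msg.payload.decode())


client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("mqtt.ny-power.org")
client.loop_forever()
```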

The Demo App

The demo app shows the current instantaneous fuel mix, as well as the estimated CO2 per kWh being emitted. That estimate is made through a set of simplifying assumptions based on 2016 historic data (explained here; any better assumptions would be welcomed).
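A sketch of that calculation: a weighted average of the instantaneous fuel mix against per-fuel emission factors. The factors and category names below are rough illustrative values, not the ones the demo derives from the 2016 data:

```python
# Illustrative emission factors in kg CO2 per kWh generated.
KG_CO2_PER_KWH = {
    "Natural Gas": 0.45,
    "Dual Fuel": 0.55,
    "Other Fossil Fuels": 0.90,
    "Nuclear": 0.0,
    "Hydro": 0.0,
    "Wind": 0.0,
    "Other Renewables": 0.0,
}


def co2_per_kwh(mix_mw):
    """mix_mw: dict of fuel category -> current generation in MW."""
    total_mw = sum(mix_mw.values())
    weighted = sum(mw * KG_CO2_PER_KWH.get(fuel, 0.0)
                   for fuel, mw in mix_mw.items())
    return weighted / total_mw
```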

The demo app also includes an MQTT console, where you can see the messages coming in that are feeding it as well.

The code for the Python applications running in the services is open source here. The code for deploying the microservices will be open sourced in the near future, after some terrible hardcoding is removed (so others can more easily replicate it).

The Verdict

While NY State does have variability in its fuel mix, especially depending on how much wind is available, there is a pretty good fixed point: finish charging by 5am. That's when natural gas infrastructure ramps up to support people waking up in the morning. Completing charging before that means the grid is largely nuclear, hydro, and whatever wind is available that day, with natural gas filling in some gaps.

Once I got that answer, I set the departure charging schedule in my Chevy Bolt. If the car had a more dynamic charging API, you could do better, and trigger charging once demand flatlined around 1am, or once the CO2 intensity dropped below a certain threshold.

Learn more at IndexConf

On Feb 22nd I'll be diving into the MQTT protocol, and applications like this one, at IndexConf in San Francisco. If you'd like to discuss turning public data sets into public event streams with the cloud, come check it out.

Power usage after going Geothermal and EV

In November 2017 we replaced our fuel oil heating system with a geothermal one from Dandelion and bought a Chevy Bolt EV, which we're using as the primary car in the house. For us that means about 1000 miles a month on it. Central Hudson never actually read our meter in January, so they applied an estimate based on our old usage. We finally got a meter reading, so we now have two months of power usage that I can compare to the last couple of years.

By the Numbers

4700 kWh.

That seems like a lot, but I do have counters on both the furnace and the EV, which came to ~2200 kWh and ~800 kWh respectively during this time period. That leaves 1700 kWh for the rest of our load, which compares to 1600 kWh last year and 1500 kWh the year before.

There is also new electric load in the hot water system, which seems to be running pretty efficiently because it gets dumped waste heat from the water furnace.

This includes the stretch of time when we had a 14-day cold snap with temperatures 20 degrees below average (ending with a record low). So while it's hard to compare to last year directly, it's pretty favorable. I'm sure that were we on oil, we'd have had at least one tank fill during that window, if not two; the oil trucks have been running pretty constantly in the neighborhood.


Opening the power bill had a momentary "oh wow". But then realizing we no longer have an oil bill, and that we've only paid for 1 or 2 tanks of gas in the Subaru in this window, puts the whole thing in perspective.

Getting to a Zero Carbon Grid

This talk by Jesse Jenkins at UPenn is one of the best looks at what deep decarbonization of the grid really looks like. Jenkins is a PhD candidate at MIT researching realistic paths to get our electricity sector down to zero carbon emissions.

Price vs. Value

He starts with the common and simple refrain we all have: research investments in solar have driven the cost below that of fossil fuels, that crossover point has happened, and renewables will just take off and take over.

But that's the wrong model. Because of the intermittency of wind and solar, after a certain saturation point the wholesale value of each new MWh of their energy keeps decreasing. This has already been seen in practice in energy markets with high penetration.

Sources of Energy

The biggest challenge is that not all sources of energy are the same.

Jenkins bundles these into 3 categories. Renewables are great at fuel savings, providing us a way not to burn some fuel. We also need a certain amount of fast-burst capacity on the grid; today that's provided by natural gas peaker plants, but demand response, hydro, and energy storage fit that bill as well. In both of these categories we are making good progress on new technologies.

However, in the flexible base category, we are not. Today that role is filled by natural gas and coal plants, and some aging nuclear that's struggling to compete with so much cheap natural gas on the market.

How the mix changes under different limits

He ran a series of simulations of what a price-optimal grid looks like under different emissions limits, given current price curves.

Under a relatively high emissions threshold, the most cost-efficient approach is about 40% renewables on the grid, with some room for storage. The rest of the power comes from natural gas. 16% of solar power ends up being curtailed over the course of the year, which means you had to overbuild solar capacity to get there.

Crank down the emissions limit and you get more solar and wind, but you also get a lot of curtailment. This is a 70% renewable grid. It also has a ton of overbuild, which is where all that curtailment comes from.

But if you want to take the CO2 down further, things get interesting. 

Because of the difference between price and value, relatively high-priced nuclear makes a return (nuclear is a stand-in for any flexible base source; it's just the only one we currently have in production that works in all 50 states). There is still a lot of overbuild on solar and wind, and huge amounts of curtailment. And if you go for a basically zero carbon grid, you get something a little surprising.

The share of renewables goes down. They are used more efficiently, and there is less curtailment. These are "cost optimal" projections with emissions targets fixed; they represent the cheapest way to get to a goal.

The important takeaway is that we're at a very interesting point in our grid evolution, where cheap natural gas is driving other zero carbon sources out of business because we aren't pricing carbon (either through caps or direct fees). A 40 - 60% renewables grid can definitely emerge naturally in this market, but you are left with a lot of entrenched natural gas. Taking that last bit off the board with renewables is really expensive, which makes that path unlikely.

But 100% Renewables?

This is in contrast to the Mark Jacobson 100% renewables paper. Jenkins points out that there have really been two camps of study: one trying to demonstrate the technical ability to have 100% renewables, the other looking at realistic pathways to a zero carbon grid. Proving that 100% renewables is technically possible is a good exercise, but it doesn't mean that it's feasible from a land management, transmission upgrade, or price of electricity perspective. Notably, none of the studies looking at realistic paths landed on a 100% renewables option.

Jenkins did his simulation with the 100% renewables constraint, and this is what it looked like.

When you pull out the flexible base, you end up requiring a massive overbuild of solar to charge storage during the day. Much of the time you are dumping that energy because there is no place for it to go. You also require storage at a scale that we don't really know how to build.

Storage Reality Check

The Jacobson study (and others) assume seasonal storage of electricity on the order of 12 - 14 weeks. What does that look like? Pumped hydro is currently the largest-capacity and most efficient way to store energy. Basically you pump water up behind a dam when you have extra / cheap energy, then release it back through the hydro facility when you need it. It's really straightforward tech, and we have some on our grid already. But scale matters.

The top 10 pumped hydro facilities combined provide us 43 minutes of grid power.

One of the larger facilities, in Washington state, is a reservoir 27 miles long; you can see it from space. It provides 3 1/2 minutes of average grid power demand.
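A quick back-of-the-envelope on that gap, assuming roughly 450 GW of average US grid demand (my number, not one from the talk):

```python
# 43 minutes of storage vs. a 13-week seasonal storage target.
avg_demand_gw = 450

top10_pumped_hydro_gwh = avg_demand_gw * 43 / 60      # 43 minutes -> ~320 GWh
seasonal_target_gwh = avg_demand_gw * 24 * 7 * 13     # 13 weeks  -> ~980,000 GWh

print(seasonal_target_gwh / top10_pumped_hydro_gwh)   # roughly a 3000x shortfall
```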

Pumped hydro storage is great, where the geography supports it. But the number of those places is small, and it's hard to see their build out increasing dramatically over time.

Does it have to be Nuclear?

No. All through Jenkins' presentation, nuclear was a stand-in for any zero carbon flexible base power source. It's just the only one we have working at scale right now. There are other potential technologies, including burning fossil fuels with carbon capture and storage, as well as engineered geothermal.

Engineered geothermal was something new to me. Geothermal electricity generation today is very geographically limited: you need a place with a geologic hot spot and an underground water reserve that turns into steam you can run through generators. That's pretty rare in the US. Iceland gets about 25% of its power this way, but it has pretty unique geology.

However, the fracking technology that created the natural gas boom has opened a door here. You can pump water two miles down into the earth and artificially create the conditions to produce steam and harvest it. It does come with the same increase in seismic activity that we've seen with fracking, but there are thoughts on mitigation.

It's all trade offs

I think the most important takeaway is that there is no silver bullet on this path forward. Everything has downsides. The land use requirements for solar and wind are big: in Jenkins' home state of Massachusetts, getting to 100% renewables would take 7% of the land area. That number seems small, until you try to find it. On the ground you can see lots of people opposing build-outs in their area (I saw a solar project for our school district get scuttled this way).

In the Northeast we actually have a ton of existing zero carbon energy available from Hydro Quebec that's trapped behind a lack of transmission capacity. Massachusetts just attempted to move forward with the Northern Pass transmission project to replace the closing Pilgrim Nuclear facility, but New Hampshire's approval board unanimously voted against it.

Vermont's shutdown of the Vermont Yankee nuclear plant in 2014 caused a 2.9% increase in CO2 emissions in the New England ISO region, as the power was replaced by natural gas. That's the wrong direction for us to be headed.

The important thing about imperfect solutions is to keep as many options on the table as long as you can. Future conditions might change in ways that make some of these options more appealing as we strive to get closer to a zero carbon grid. R&D is critical.

That makes the recent 2018 budget, with increased investment credits for carbon capture and storage and small-scale nuclear, pretty exciting from a policy perspective. These keep some future doors open.

Final Thoughts


Jenkins' presentation was really excellent. I look forward to seeing more of his work in the future, and to wider exposure of the fact that the path to a zero carbon grid is not a straight line. Techniques that get us to a 50% clean grid don't work to get us past 80%. Managing that complex transition is important, and keeping all the options on the table is critical to getting there.

Python functions on OpenWhisk

Part of the wonderful time I had at North Bay Python was also getting to represent IBM on stage for a few minutes as part of our sponsorship of the conference. The thing I showed during those few minutes was writing some Python functions running in OpenWhisk on IBM's Cloud Functions service.

A little bit about OpenWhisk

OpenWhisk is an Apache Foundation open source project to build a serverless / functions-as-a-service environment. It uses Docker containers as the foundation, spinning up either predefined or custom containers, running them to completion, then exiting. It was started before Kubernetes, so it has its own Docker orchestration built in.

In addition to just the runtime, it also has pretty solid logging and interactive editing through the web UI. This becomes critical when you do anything more than trivial with cloud functions, because the execution environment looks very different from your laptop.
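For reference, a Python action is just a module with a main() function: the platform hands it a dict of parameters per invocation and expects a dict back. A minimal sketch:

```python
# Minimal shape of an OpenWhisk Python action.
def main(params):
    name = params.get("name", "world")
    return {"greeting": "Hello %s" % name}
```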

What are Cloud Functions good for?

Cloud Functions are really good when you have code that you want to run after some event has occurred, and you don't want to maintain a daemon sitting around polling or waiting for that event. A good concrete instance of this is Github Webhooks.

If you have a repository where you'd like some things to happen automatically on a new issue or PR, doing that with Cloud Functions means you don't need to maintain a full system just to run a small bit of code on these events.

They can also be used kind of like a web cron, so that you don't need a full VM running if there is just something you want to fire off once a week to do 30 seconds of work.

Github Helpers

I wrote a few example uses of this for my open source work. Because my default mode for writing source code is open source, I have quite a few open source repositories on GitHub. They are all under very low levels of maintenance. That's a thing I know, but others don't. So instead of having PRs just sit in the void for a month, I thought it would be nice to auto-respond to folks (especially new folks) with the state of the world.

Pretty basic: it responds within a second or two of folks posting to an issue, telling them what's up. While you can do a lightweight version of this with templates natively in GitHub, using a cloud functions platform lets you be more specific to individuals based on their previous contribution rates. You can also see how you might extend it to do different things based on the content of the PR itself.
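Here's a simplified sketch of the shape of such an auto-responder (not the actual code in the repo). It assumes the webhook's JSON body arrives merged into the params dict, that a GITHUB_TOKEN default parameter is bound to the action, and it uses PyGithub from the custom image described in the next section:

```python
# Comment on newly opened issues to set expectations about response time.
from github import Github  # PyGithub


def main(params):
    if params.get("action") != "opened" or "issue" not in params:
        return {"body": "ignored"}

    gh = Github(params["GITHUB_TOKEN"])
    repo = gh.get_repo(params["repository"]["full_name"])
    issue = repo.get_issue(number=params["issue"]["number"])

    # Let folks know this repo is maintained in spare time, so responses are slow.
    issue.create_comment(
        "Thanks for the report! This project is maintained on a best-effort "
        "basis, so it may be a little while before you hear back."
    )
    return {"body": "commented on issue %d" % issue.number}
```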

Using a Custom Docker Image

IBM's Cloud Functions provides a set of Docker images for different programming languages (JavaScript, Java, Go, Python 2, Python 3). In my case I needed more libraries than were available in the Python 3 base image.

The entire system runs on Docker images, so extending them is straightforward. Here is roughly what that Dockerfile looks like (a sketch; the exact base image tag is an assumption):
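```dockerfile
# Sketch: extend the upstream Python 3 action runtime image.
FROM openwhisk/python3action

# PyGithub for the GitHub API, plus a small utility library for talking to the
# OpenWhisk environment (name omitted here).
RUN pip install pygithub
```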

This builds on the base image and installs 2 additional Python libraries: PyGithub, to make GitHub API access (especially paging) easier, and a utility library I put up on GitHub to keep from repeating code that interacts with the OpenWhisk environment.

When you create your actions in Cloud Functions, you just have to specify the Docker image instead of the language environment.

Weekly Emails

My spare-time open source work mostly ends up falling between the hours of 6 and 8am on Saturdays and Sundays, when I'm awake before the rest of the family. One of the biggest problems is figuring out what I should look at then, because if I spend an hour figuring that out, there isn't much time left to do anything that requires code. So I set up 2 weekly emails to myself using Cloud Functions.

The first email looks at all the projects I own, and provides a list of all the open issues & PRs for them. These are issues coming in from other folks that I should probably respond to, or make some progress on. Even just tackling one a week would get me to a zero-issue state by the middle of spring. That's one of my 2018 goals.

The second does a keyword search on Home Assistant's issue tracker for components I wrote, or that I run in my house and am pretty familiar with. Those are issues that I can probably meaningfully contribute to. Home Assistant is a big enough project now that, as a part-time contributor, finding a narrower slice is important to getting anything done.
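The shape of both actions is a GitHub search plus an email hand-off; a simplified sketch using PyGithub's search API (the username and keyword come in as parameters here, and the real actions hand the body off to a mail service):

```python
# Build a weekly summary of open issues/PRs worth looking at.
from github import Github


def main(params):
    gh = Github(params["GITHUB_TOKEN"])

    # Open issues and PRs across my own repositories.
    mine = gh.search_issues("user:%s is:open" % params["GITHUB_USER"])

    # Home Assistant issues mentioning a component keyword I care about
    # (repo name as it was in 2018).
    ha = gh.search_issues('repo:home-assistant/home-assistant is:open "%s"'
                          % params["COMPONENT_KEYWORD"])

    lines = ["%s  %s" % (i.html_url, i.title) for i in list(mine) + list(ha)]
    return {"body": "\n".join(lines)}
```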

Those show up at 5am in my inbox on Saturday, so they'll be at the top of my email when I wake up, and a good reminder to have a look.

The Unknown Unknowns

This was my first dive down the functions-as-a-service rabbit hole, and it was a very educational one. The biggest challenge I had was getting into a workflow of iterative development. The execution environment here is pretty specialized, including a bunch of environmental setup.

I did not realize how truly valuable a robust web IDE and detailed log server are in these environments. As someone who would typically just run a VM and put some code under cron, or run a daemon, I'm used to keeping all my normal tools. But the trade-off of getting rid of a server you need to keep patched is sometimes worth it. I think that as we see a lot of new entrants into the functions-as-a-service space, what makes or breaks them is going to be how good their tooling is for interactive debugging and iterative development.

Replicate and Extend

I've got a pretty detailed write-up in the README for how all this works, and how you would replicate it yourself. Pull requests are welcome, and so are discussions of related things you might be doing.

This is code that I'll continue to run to make my GitHub experience better. The pricing on IBM's Cloud Functions means that this kind of basic usage works fine at the free tier.