Okay, so my name is Zane Selvans and I am the chief data wrangler, or vaquero de datos, for Catalyst Cooperative, and I'm going to talk about public utility data sets in the US, mostly related to electricity but more generally for energy. I definitely want to acknowledge my partner in crime who's not here with us today, Christina Gosnell, who has really been a co-equal partner in this process for more than five years. She's actually the president of our little employee-owned consulting co-op, so you can follow us both.

I came to this work from kind of a weird place, or at least I think it's a weird place. I used to work on space exploration. I went to grad school originally because I wanted to terraform Mars. I read some science fiction books by Kim Stanley Robinson and I was like, that, I'm gonna do that, screw this boring tech job in Silicon Valley. I successfully got in and started working with NASA data, and ended up working on Mars and on the icy moons of the outer solar system. Then I took an atmospheric physics class and realized that we were actually already involved in a terraforming project; it was just on the wrong planet, and also in the wrong direction. I kind of had a meltdown and decided, after finishing my PhD in like a pit of despair, that I wanted to actually work on addressing this problem, so I went into climate and energy policy after graduating.

Since you're here, I assume that you also like data, and probably also live on Earth, and may have an interest in similar topics. When we talk to data people, they often think, wow, you know a lot about electricity, and when we talk to electricity people, they're like, oh wow, you know a lot about data. But really we're just stubborn amateurs on both sides, and that combination is valuable. There don't seem to be a lot of people who are aware of both the policy and advocacy universe and the data and coding universe, so we're trying
desperately to bring these two groups together. What I'm hoping we, and maybe you, will get out of this talk is this: we want more data people to understand what energy and electricity data there is, so that we have more people informed and playing with it and doing interesting things with it. We also want to get more activists and policy-involved people literate with the huge data sets that exist and that could help inform their work, so I'm very interested to talk to somebody from the Carpentries about setting up another domain-specific track for this kind of basic literacy for working with data. (It's blank? Yes, it's supposed to be blank. Just, you know, stripes.) We also need a lot of help telling these stories, and we haven't been able to make any good connections yet with data journalists who want to work with data on energy and climate stories, so if anybody has connections like that, we'd love to know more.

Does anybody recognize these stripes? Do people know what this is? This is a color-coded time series of the average global temperature on Earth, starting from 1850 on the far side to the present on this side. So there's some crazy stuff happening right now, and I periodically feel as if I am a space explorer who has been marooned on a planet with a civilization that doesn't know what it's doing, and then I talk about humans in the third person. This is from Ed Hawkins and Climate Lab Book, and he has lots of great graphics related to this.

So who here has been to a public utility commission meeting or hearing? Oh my god, that's great. Usually the answer is zero. This is where the magic happens in the US as far as deciding what our energy system looks like. It looks like a bunch of lawyers because it is a bunch of lawyers, and it's basically a courtroom-like proceeding in which utilities typically come in and say, we want to build these
things, we think people use this much energy, and we think we should make an 11% return on the billion dollars I'm going to invest. And you get into a very adversarial kind of conversation, with discovery, collecting data, and making cases against each other. The utilities really have a very unfair advantage here: they have a huge information asymmetry. They have all of the information about their system, and they have all the resources in the world to throw at it, so every proceeding will typically have its own team of attorneys that doesn't have to work on anything else. Whereas on the advocate side, you're probably working for some scrappy nonprofit, or maybe in the public sector, and maybe there are eight proceedings going on simultaneously with the same utility, and they have eight attorneys on every one of those eight proceedings, and you have the same attorney, who sleeps on a couch in a basement, trying to coordinate all of those things from the other side.

This asymmetry really leads to some problems as far as the equity and reasonableness of the decisions that get made. The PUC, they're often not domain experts in this, and they certainly aren't fully briefed on the opaque models and the proprietary data that's being used to make the case from the utility's point of view. When they get some wild-eyed hippies on one side, and a bunch of suits from the utility on the other side with apparently all of the data and the expert witnesses getting paid $500 an hour to testify, they typically go, the utility probably knows what it's talking about, we're gonna go with their suggestion. And what could possibly go wrong?

So these are some of the things that can go wrong. This is just the last couple of years, and you can see we're not talking about chump change here. This is real money.
South Carolina has decided to just totally bail on a nuclear plant, nine billion dollars in the hole, literally. There is the first at-scale carbon capture and sequestration plant, in the Deep South, seven and a half billion dollars; they totally scrapped it, and they're now being sued by the DOJ for fraudulent use of federal loan guarantees. These are spectacular failures where they're actually getting some comeuppance. In South Carolina, typically a fairly conservative state, the legislature wants to rip the face off of the CEO of that company, and there will actually be consequences. But this is atypical. Usually they get away with it: they can just convince the PUC that what they want to build is a reasonable thing, they pass the cost through to the ratepayers, and these nine-billion-dollar, seven-billion-dollar piles of infrastructure are what you're paying for in your monthly bill.

I doubt that we'll ever be able to produce data and models that are on par with what the utilities have, but I don't think that actually matters. So who recognizes this guy? Anybody? I'm probably gonna date myself here, but this is Hank Paulson, the Treasury Secretary under Bush the second, and he presided over the mortgage implosion a decade or more ago. Was he incompetent? Did he not understand how markets work? No, he understood very well how markets work, because he's also Hank Paulson, CEO of Goldman Sachs. The problem here isn't that he didn't know what he was doing. The problem is that his interests and motivations and affiliations, really his allegiance, was maybe not to the public interest. So you can be the best in the world at what you do, and if you have the wrong incentives, the wrong motives, the wrong interests, it doesn't matter how competent you are. And conversely, if you get people in the room with the right
incentives and at least a little bit of material to work with, you can make real change. I'll go through some examples of that later on: places where this has been successful, or where it could be successful in the future.

So, our origin story. We all started at this tiny little nonprofit called Clean Energy Action. It was basically unstaffed, and we got very familiar with the Colorado utility landscape. As a result of that, we were invited to help with a foundation project to understand the finances of their coal plants, and it went well. We understood how much it cost to generate electricity from the coal plants, and it was clear that it was more expensive than building new renewable facilities. The foundation, which for some reason wants to remain nameless, was like, that's great, do more of that, we'd like to do 10 or 15 more utilities. But we had scraped all that data by hand from PDFs and from crappy spreadsheets, and we were like, oh, there's no way, we are not doing that by hand. We'd rather just do it generally, for all of the data, for all of the uses that anybody might want to use this for. And it seemed at the time, in a kind of brazen act of hubris, that that would be about the same amount of work. The nonprofit that we were working for was like, that's not grassrootsy enough, it seems kind of arcane, we're not really into that. But there was a $70,000 grant on the table to do it, and we were like, well, we would like to do that, so we're gonna spin off our own little thing called Catalyst to do this work on our own. That was our first foray into trying to rectify all this horrible data. What's my time? I didn't start the timer. Okay, great.

So really we're a pile of activists who have self-familiarized with the policy universe and also with the data universe and the tools that are necessary to work with the data. So now I want to talk a little bit about what is the
data: where does it come from, what's inside there, and why is it maybe interesting and something that people should be paying attention to and playing with?

Most of the data we work with comes from the federal government, from these different agencies, and this is scattered all across the huge government bureaucracy. You've got the EPA, and the Department of the Interior, which controls the Bureau of Reclamation, which does a lot of hydro projects. PHMSA, which probably nobody's heard of, is the Pipeline and Hazardous Materials Safety Administration, part of the Department of Transportation. EIA and FERC, the Federal Energy Regulatory Commission, are both part of the Department of Energy. We've got the Department of Labor with the Mine Safety and Health Administration, and the castle in the corner is the Army Corps of Engineers: Department of War, I mean Defense.

Together these agencies have mandates, typically legal mandates, to publish vast quantities of data, and they do that. But often the mandates are not specific enough to really make sure that the data comes out in a way that anybody wants to use it or is able to use it, and that's very frustrating. A really nice thing about this, though, is that it's all public domain. Anything that comes from the federal government, you can do whatever you want with it. It's yours to play with, if you can get it out of the box.

There are other big sources of data too. This is a map of the ISOs and RTOs, which are the independent system operators and regional transmission organizations. These are the organizations that run the grid in the colored areas here, and typically they operate in places with competitive markets where, at least in theory, the people that are generating electricity are competing with each other to do that cheaply and efficiently. The beige areas are a mix of traditional monopolies, where you have, as in Colorado, where we
started doing this work, somebody with a government mandate to provide electricity and basically a guaranteed rate of return (and there are all kinds of interesting problems that come along with that model), or public power agencies, like the Bonneville Power Administration in the Pacific Northwest and the Tennessee Valley Authority in Tennessee: these almost Depression-era large public power associations that are truly public entities, so their data is often quite clean and very accessible. But the ISOs are not .gov; they're quasi-governmental and don't really land in the public sector, so they all have their own licensing requirements, and it can be very challenging to get them to agree to let you officially do stuff with their data. But it's all available, it's online, it's there via APIs, and it's terabytes of information. These people publish, every five minutes, the prices and a bunch of other interesting attributes about electricity at about 13,000 different locations across the US. That piles up very quickly, and it also gives an insanely detailed picture of what is going on with the electricity system day in, day out, every five minutes: how much does it cost, how much is it going to cost tomorrow, who's producing, what kind of power are they producing, where's it coming from, where's there congestion, all that kind of stuff.

I wanted to note a little bit about the data sets that these folks produce. EIA is a creature of the Carter administration and does a pretty good job. It loves Excel spreadsheets: almost everything is in an Excel spreadsheet, and they change format every year. Every year has a different Excel spreadsheet format, and they have like 15 tabs and no NA values and all of the wonderful fun things that I'm sure everybody in the room has dealt with at some point. But they're the good kid. FERC,
especially FERC Form 1, it's just an archaic DBF file format, a binary database format that is undocumented. They don't give you a schema for what the database looks like or what the names of the columns are. We wrote a scraper that greps the binary files for strings so that you can then associate the names of the columns with the data that's in them. But it works, and we actually had somebody we were working with, who had been nominated by Obama to be the chair of FERC, who was like, oh my god, you have the FERC Form 1 data, I've always wanted to access it. So the potential chair of FERC, who is a former regulator in Colorado, couldn't access the FERC Form 1 data. That's ridiculous on the one hand, but on the other hand, once you break that out, you're the only one that has it, at least in the civic sphere, and people are very excited to finally get access to it. It gives a lot of information; this is a lot of financial information. EIA has a lot of financial information and also some operational information about coal plants, or all kinds of power plants: where do they get their fuel from, how much did it cost, how much electricity did they generate, that kind of stuff.

The PHMSA data that we're interested in is about natural gas pipelines. Every year they get a report from all the natural gas utilities about where the pipelines are, how long they are, what they're made of, and how old they are, so you can guess at when the pipelines will need to be replaced, and if you don't want to replace them, because that's a climate-changing piece of infrastructure, you know what your timeline on that is. The Bureau of Rec and the Army Corps are all about hydro from our point of view, and MSHA has data on mine production and safety. So that's interesting, and a bunch of different characters. There's also a lot of economic information in here
about what communities will be impacted by changes in the energy system: people that depend on mining, or that depend on power plants for the tax revenue for their county, where it's like 90 percent of the property tax revenue for the county, and they will implode when you shut down the coal plant. That's important from a political point of view and also from a human decency point of view.

Oh, and just the scale of this: all together there are billions of records, probably on the order of 10 billion records between all of this, and several terabytes of data. I don't know if it's really big data; it's kind of medium-sized data. It's definitely annoying to work with on a laptop, and once we integrated the first 100-gigabyte data set we were like, this is not going to be a laptop thing anymore, what do we do? We had to start exploring out-of-memory computation and other things that we really still don't have any idea about.

Or, if you don't want to deal with all of that difficult, tedious data munging, you can go to S&P Global, which will, for $20,000 a year, give you a subscription to a terminal that has most of that data. They've done cleanup, and they've interpolated and filled in blank values, and it's relatively easy to use, but it's aimed at a Wall Street market, so it's not even necessarily the right thing for advocates or policymakers to use. And also it's totally opaque: they don't give you any information about how they cleaned the data or what their model was for interpolating or filling in those NA values. They have a very traditional platform-monopoly business model, and every time a new small startup is founded to do the same thing, they buy them, and they've been very successful in the North American market at pretty much completely controlling access to this information. Boom. Okay, so
what can we do with this data if it's available and usable in the public interest? What kinds of things are possible?

Does anybody recognize this dam? Probably not, but this is a local project. This is the lower Snake River. I don't know if it's in this state or not, but it's flowing into the Columbia. Several years ago there were people saying, do we really need these dams? We'd like to restore the salmon run; maybe some of them should be removed. And the utility was like, no, no, no, you can't remove the dams, we need the dams, the dams are very important, they produce three gigawatts of power, and to replace that with another carbon-free power source would be very expensive, you wouldn't want to do that. Typically a utility commission would be like, well, I guess we can't replace the dams or take them out, it's too bad. But the Northwest Energy Coalition actually had data from the Bonneville Power Administration in this case, and they could see the generation that the dams had put out for a decade or more, and it was clear that this was a bluff on the part of the utility (which is a public entity, so I'm not totally sure what the incentives are there that make them want to keep the dams in place). The dams on average over the course of the year only generate about one gigawatt of power, and in reality, when the Northwest needs the power, when demand is high and supply is low, because there's so much hydro up here, they're only generating 500 megawatts. So that's a factor of six different, functionally, from what the utility was saying. And this kicked off a whole new environmental impact statement process. And how many bytes are there in the data that they used to make that case, for not maintaining infrastructure that costs a quarter of a billion dollars a year just to maintain, and instead to
explore other zero-carbon opportunities in the Pacific Northwest? It was a megabyte, or a couple of megabytes, of data that they happened to have access to because it was published as a CSV file by a public agency that cares about data.

Another example, and this is the thing that we started working on (okay, five minutes), is looking at the economics of coal plants in Colorado. In 2013 we realized, based on data that the utility had submitted, that the coal plants were no longer economic. So at that point we didn't have to argue with the utility; we could just be like, hey, look, the thing that you already submitted says the more renewable energy you add, the cheaper the electricity gets. That was a relatively easy case to make. We still didn't win, but it did inspire people to get involved in trying to make these cases more quantitatively and change the utility's mind. And finally, six years later, just last Friday, the Colorado legislature passed a bill that will allow these coal plants to be refinanced and shut down very early, along with some very aggressive climate goals. We've been providing data to the people doing the analysis behind this for like five years, so it's very gratifying to finally see something work out politically.
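The dam comparison above comes down to three numbers that anyone can compute from a small CSV of hourly generation: the claimed capacity, the average output, and the output during the hours when demand is highest. Here is a minimal sketch of that calculation. The column names, the `summarize_output` helper, and every number in `SAMPLE_CSV` are made up for illustration; this is not Bonneville's real data or file layout.

```python
import csv
import io
from statistics import mean

# Made-up hourly records for illustration only (NOT real Bonneville data).
# The column names are assumptions, not an actual published schema.
SAMPLE_CSV = """hour,generation_mw,regional_demand_mw
0,1800,5200
1,1500,5000
2,900,6100
3,400,7400
4,600,7900
5,800,6400
"""


def summarize_output(csv_text, nameplate_mw, peak_hours):
    """Compare claimed capacity with average output and output at peak demand."""
    rows = [
        {"gen": float(r["generation_mw"]), "demand": float(r["regional_demand_mw"])}
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    # Average output over every hour in the file.
    average_mw = mean(r["gen"] for r in rows)
    # Average output during the hours with the highest regional demand.
    by_demand = sorted(rows, key=lambda r: r["demand"], reverse=True)
    peak_average_mw = mean(r["gen"] for r in by_demand[:peak_hours])
    return {
        "average_mw": average_mw,
        "capacity_factor": average_mw / nameplate_mw,
        "peak_hours_average_mw": peak_average_mw,
    }


summary = summarize_output(SAMPLE_CSV, nameplate_mw=3000, peak_hours=2)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The point of the sketch is how little machinery is needed: once the generation record is public in a plain format, the gap between nameplate capacity and what is delivered when it matters is a few lines of arithmetic.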
This is complicated; I'm gonna skip it. Another project that we're interested in doing going forward: the EPA data says how much power every power plant produced every hour for the last 20 years. This is a graph of a particular power plant, put together by Joe Daniels. The blue is profitable, the red is not so profitable, and the yellow is startup and shutdown costs. You can see they're operating in the profitable regime, and then occasionally not, but that's because they have these startup and shutdown costs. But if you were to put a renewable energy source in the same location on the grid that generated power at zero marginal cost (every additional kilowatt-hour is free, because that's how renewables typically work), and you made it compete with these power plants when they would be profitable, maybe you could force them offline purely economically by strategically locating renewables nearby. That's interesting because it then opens up transmission capacity, which is a limiting factor for many places now. As renewables get cheaper and cheaper, it's not cost, it's, well, what do we connect it to?

So there are these terabytes of data, and we're trying to break them out and put them into the real, functional public domain. And even if we're successful at that, which hopefully we will be over time, this problem isn't going away. It's only going to get more complicated, because the amount of data that's necessary to run the electricity system is growing exponentially: you have people installing batteries, you have electric vehicles charging, you have dispatchable demand for electricity, where the utility can be like, no, no, no, don't consume now; okay, now consume. Instead of having a few thousand entities nationwide that are trading electricity and putting it onto the grid, you might have a few million, or even, if
it's at the device level, a billion devices in the U.S. that are interacting with the power system and the markets that have been constructed to serve the power system. That's going to be truly large data, and they're starting to roll out trials of this kind of stuff. There's a group at Berkeley that's been deploying sensors that take 120 measurements every second, so twice the frequency of the AC circuits, and they can do things like this: I'm sure you're all aware of the fires in California due to PG&E's and some other power line failures. The system can sense that there is a broken wire, and before the wire has hit the ground it can de-energize the line. It knows that it's broken, and it can prevent that fire from starting, because at 120 hertz it's checking in on the system. That's a huge amount of data, and people are working on how to visualize that. The markets that are built around this are going to be another opportunity for a different kind of electricity monopoly, a different kind of platform monopoly, if we don't do a much better job at understanding what data should be public, how it should be publicized, and how people can and should be working with it. So we're hopeful that this project will lead into a deeper discussion about the future of energy data and electricity data, and how we can avoid that kind of captive environment in the future.

And I'm out of time, so I'm going to skip these pipeline questions. This is our original pipeline: mess, pandas, Postgres, Jupyter. But it turns out everybody just wants to download a CSV, advanced and beginning users alike, so eventually we went to mess, pandas, and now we're starting to implement data packages instead. And then, okay, whatever you want. Okay, 30 seconds. And thank you to all of the organizations that have supported us in the past.
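To put rough numbers on the data rates mentioned in the talk, here is a back-of-the-envelope sketch. The 120 measurements per second and the roughly 13,000 locations reporting every five minutes come from the talk itself; the sensor count and the bytes per measurement are purely hypothetical placeholders, not figures from the Berkeley project or from any ISO.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def records_per_day(interval_seconds, locations):
    """Records published per day when every location reports once per interval."""
    return (SECONDS_PER_DAY // interval_seconds) * locations


def samples_per_day(rate_hz, sensors):
    """Measurements per day for a fleet of sensors sampling at rate_hz."""
    return rate_hz * SECONDS_PER_DAY * sensors


# Figures quoted in the talk.
iso_daily = records_per_day(interval_seconds=5 * 60, locations=13_000)
sensor_daily = samples_per_day(rate_hz=120, sensors=1)

# Hypothetical deployment size and record width, for illustration only.
HYPOTHETICAL_SENSORS = 1_000
HYPOTHETICAL_BYTES_PER_SAMPLE = 16
fleet_bytes_per_day = (
    samples_per_day(rate_hz=120, sensors=HYPOTHETICAL_SENSORS)
    * HYPOTHETICAL_BYTES_PER_SAMPLE
)

print(f"ISO price records per day: {iso_daily:,}")
print(f"Measurements per sensor per day: {sensor_daily:,}")
print(f"Hypothetical fleet volume per day: {fleet_bytes_per_day / 1e9:.1f} GB")
```

Swapping in different values for the two hypothetical constants shows how quickly device-level sensing outgrows the five-minute market data.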