Okay. Hi, everyone. I'm Zane Selvans, and this is Christina Gosnell, and we are both founders of the Catalyst Cooperative, which is a little worker-owned data wrangling organization. We mostly work with US energy system data, supporting researchers and clean energy advocates. Also, just FYI, we are currently hiring for two positions, so if you're interested in the kinds of things we're talking about here and the tools we're using, please get in touch with us afterwards. I'll hand it off to Christina to set up our story.

Yeah. Welcome, everyone. Thanks for having us. My video looks dinky right now, but hopefully it'll resolve. In this talk, we're going to do a little bit of context setting, then go over when we think data is actually useful in a changemaking context, actually making data work for change, and getting data into the right hands. Then we'll leave some time at the end for questions, so be prepared.

A little background: Catalyst sprang out of advocacy and our need for better access to data about the energy system. We were involved in a group of advocates that was pushing for early retirement of coal plants. Our strategy was to target expensive and unprofitable coal plants for early retirement. For us, that looked like a lot of hand scraping of PDFs and cobbling together datasets from different government sources. It was very manual and tedious, but the resulting advocacy was very positive. Our larger group wanted to expand and look at more than one utility, and we realized at that point that we needed to employ a different tactic. So we created Catalyst to tackle this larger mission of data access and to automate a lot of this data curation.

So let's start by figuring out when something actually is, or could be, a data problem. I know that we all love data here, myself included, but not every problem has a data or technical solution. Often the problem is political: egalitarian solutions are very often not what is best for those wielding power. In a similar vein, we all have deeply ingrained ideological frameworks that keep us from integrating data lessons that go against our worldview. And even when there is a moving data story to be told, the work of communication, dissemination, and advocacy doesn't just happen spontaneously because you have good data. Data can be a piece of a broader strategy, but it doesn't speak on its own. Data also doesn't make decisions for us. We use data to be informed about the world around us, but ultimately we make decisions based on our values. Lastly, we are a fully open-source shop and believe in that strongly, but open data doesn't necessarily produce the best societal results. It very much depends on the context: Who has the data currently? What are their motivations? What is the economic sustainability of maintaining open data?

So when is data actually useful in a change-making setting? If decisions are happening in highly technocratic spaces, arming advocates with data can be vital for the credibility of their message. Also, decision-makers often at least purport to want to make data-informed decisions, so it can be important to speak in that language, though it's always important to keep in mind the deep, deep preference for confirmation bias. Another important indicator is a lot of information asymmetry: when one actor has all of the information, they often also hold a lot of the power and decision-making ability.
In those instances, data can be very useful to level the playing field. In our work, we've found that even a little bit of data can go a long way toward calling into question the assumptions of incumbent actors. Data can also be very empowering for movements and can bolster public support. Lastly, there has to be some available public data that's pertinent to the question you're trying to ask. Availability doesn't necessarily mean accessibility, which we'll get into next. But if there is no available public data, you have a data collection problem, not a curation problem.

So in our context, we saw the use for data in advocacy efforts, but we also saw many others struggling with the same problem. Advocates, researchers, and journalists alike were all trying to answer questions about the energy system. And only after talking with lots of folks and exploring other related initiatives did it become clear to us that there was enough support and enough need to justify common infrastructure, and that the extra effort required to create a common resource would be far less than the duplicated toil that we were all individually carrying out. I'll pass it over to Zane.

OK. So if you've determined that you really do have a data problem, an issue that can be addressed by data, and there's enough of a community to justify centralizing that effort, what are you really trying to accomplish by taking on that responsibility for the community? In a lot of cases, in our experience, it boils down to reducing overall user toil. And what do I mean by toil? It's the manual, repetitive work that can often be automated with the right tools, but frequently ends up getting done on a one-off basis. Data wrangling is notoriously filled with these tasks: things that you can do by hand once, but that take some extra effort to automate. And when you do it by hand, you often end up generating unreproducible results, which is not great in a legal context.

When open data isn't analysis-ready, when every user has to do some of the same work before they can actually move on to solving their own problems, the data isn't really free, even if it is openly available. Or it's only free in the sense that the puppy in this warning sign is free: it comes with its own set of obligations attached. And that work that every single person has to do after they get their hands on the data is effectively a paywall; it just might not be a financial one. In a lot of cases, it will prevent some people from using the data at all, either because they don't have the time or because they don't have the specialized skills required to clean it up and use it. So what we're trying to do is centralize that work, do it once, do it well, and then share the results with everyone who can benefit from them.

However, in providing analysis-ready data, you inevitably end up making some choices on behalf of your users. All messy data requires interpretation before you can actually work with it, and cleaning, reshaping, removing outliers, filling in missing values: all of those things require judgment calls. So there's a tension between making messy data easily usable and preserving its original contents and structure. And that means that archiving and documenting the process, the collection of choices that you've made and applied to the data, becomes at least as important as preserving the data products themselves.
Otherwise, people can end up losing trust in the data when they find unexplained discrepancies, or they may end up using the data in ways that just aren't appropriate given where it came from. So over the course of our journey, treating the data more and more like a process, and less like a static end product, has pushed us toward using a lot of the continuous integration and deployment tools that are often associated with larger projects. But as a small team, we need to automate whatever we can, because even a few small manual tasks will quickly accumulate and end up eating a big chunk of our time. That's definitely still a work in progress; we have a lot of automation left to do, but that's the general arc we're trying to be on. One side effect of this is that the tools we find ourselves using day to day to produce the data have diverged substantially from the tools most of our users are comfortable with in their day to day. So it's important for us to make sure that we're bridging that technical gap, and not just replacing user toil with technical barriers that prevent the people we want to have access to the data from using it.

Great. So how do we actually get people access to the data? We've more and more been thinking about data access as rings that build on one another. The inner rings here contain the work that we've primarily been focused on so far: we pull disparate datasets into one standard database format that is clean and tidy, with the datasets connected to each other. The next set of rings encompasses all of the derived values that are generated from that processed data. This is where a lot of the tension between usable and pristine data that Zane mentioned comes in. We try to integrate vetted methodologies for calculating commonly derived values and to impute missing values to fill in the gaps. Currently, all of the more complex analyses and imputations are done in a software layer that reads data from our database. We did this originally to have a very clear separation between the pristine data and the derived values, but it does create an additional layer of friction for our users, because they have to both be aware of these post-ETL tools and be able to use them. So we're currently planning on migrating a lot of these derived tables into the same database, with well-labeled table names, flags for different methodologies and confidence levels, and accompanying metadata that explains the provenance and assumptions that go into them (there's a rough sketch of what querying such a table might look like below).

But then, how are users actually interacting with the data itself? Our previous philosophy led us to employ a single access mode, using Frictionless Data packages, that could serve many different types of users. But we've more and more come to believe that it is both preferable and possible to pipe the standard outputs from our data processing pipeline into many different access modes that serve different types of users a little bit better. These blue and green inner rings contain the production pipeline, which generates a small number of outputs. Those outputs include a SQLite database and Parquet files, for the smaller and larger datasets respectively, dynamically generated metadata, and Docker containers that preserve the software environment.
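As a rough illustration of what consuming those standard outputs can look like from Python, here's a minimal sketch, assuming pandas, SQLAlchemy, and pyarrow are available. The file, table, and column names are placeholders for illustration, not the project's actual artifact names:

```python
import pandas as pd
import pyarrow.parquet as pq

# Smaller datasets ship in a single SQLite database; pandas can read a
# whole table through a database URI (this requires SQLAlchemy).
# File and table names here are placeholders.
plants = pd.read_sql_table("plants", "sqlite:///pudl.sqlite")

# Larger datasets ship as Parquet files; reading only the columns you
# need is much cheaper than parsing an entire file.
hourly = pq.read_table(
    "hourly_emissions.parquet",
    columns=["plant_id", "datetime", "co2_mass_tons"],
).to_pandas()

print(plants.shape, hourly.shape)
```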
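And since the derived tables described above are a plan rather than a shipped feature, here is an equally hypothetical sketch of what querying one of them, with its methodology and confidence flags, might look like; every table and column name is made up for illustration:

```python
import sqlite3

import pandas as pd

# Connect to the database that would hold both the cleaned original
# tables and the derived tables. The filename is a placeholder.
conn = sqlite3.connect("pudl.sqlite")

# Select derived values, filtering on the columns that flag which
# methodology produced each estimate and how much confidence it
# carries. All names here are hypothetical.
query = """
    SELECT plant_id, report_year, fuel_cost_per_mwh, methodology, confidence
    FROM fuel_cost_derived
    WHERE methodology = 'capacity_weighted'
      AND confidence = 'high'
"""
derived = pd.read_sql_query(query, conn)
conn.close()
print(derived.head())
```

Keeping those flags as ordinary columns would let users filter down to only the estimates whose assumptions they're comfortable with, using plain SQL.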
Those standard outputs are built nightly, they integrate lots of tests, both unit tests and data validation tests (a sketch of one such check appears at the end of this section), and they're cloud accessible, so they're ready for distribution. And this is the point at which we switch from our means of production to our means of distribution. I see the distribution modes as tentacles flowing out from our data pipeline, able to be nimble and flexible in reaching many different folks, who have many different use cases, skills, and tools that they're comfortable using. And those dissemination tools are what Zane is going to get into next.

Yeah, so once we've got these standard SQLite and Parquet files and Docker images generated from the nightly builds, we wrap them in a bunch of different accessibility tools. One of them is Datasette, which you may have heard Simon talk about yesterday. It lets us embed a SQLite database in a friendly web interface for folks who just want to browse around in it online, or maybe download smaller parts of it to work with in spreadsheets locally. And one nice thing about having the data web-accessible like that is that we can link to it directly from our documentation and data dictionaries, which helps bring the documentation to life and makes it a little less abstract.

We're also using Intake, a library initially developed by Anaconda, which lets you wrap references to either local or remote datasets in a conda package, creating what they call a data catalog. Conda then manages the versioning and software dependencies, and Intake provides a standard software interface for accessing the data and metadata. It also does local caching if the data is stored remotely (there's a sketch of the user-side workflow at the end of this section, too).

And then we're using Jupyter and Docker in a couple of different ways. First, we want to provide self-contained archives for long-term accessibility that include all of the data and the software environment required to work with it interactively. So a user can download a single 10 gigabyte tarball, run Docker Compose, and get dropped into a series of tutorial notebooks that walk them through examples of how to use the data. And then we're using exactly the same Docker image, data catalogs, and notebooks to provide a hosted JupyterHub in collaboration with 2i2c. It runs remotely, can potentially provide scalable compute resources, and requires no setup at all by the user, except for getting a login from us.

But in working with a lot of these tools for data distribution, we've found that we're in a gap between small and big data. The medium-sized data we're working with is in the 10 to 100 gigabyte range, which is too big to host in a GitHub repo. So we've been drawn toward a lot of the bigger data tools that are out there, since they make automation and data distribution easier. And for the amount of data we're trying to work with and distribute, the storage and the compute are really quite cheap; they're not a big expense. There seems to be a baked-in assumption that the infrastructure costs will end up being large, and so it's okay if the setup and maintenance are difficult or expensive, which isn't really true in our case.
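For a flavor of the data validation side mentioned above, here is a minimal pytest-style sketch of the kind of check a nightly build can run. The database URI, table, and column names are made-up placeholders, not the project's actual test suite:

```python
import pandas as pd

# Placeholder URI for the nightly build's SQLite output.
DB_URI = "sqlite:///pudl.sqlite"


def test_capacity_is_non_negative():
    """No plant should ever report a negative generating capacity."""
    plants = pd.read_sql_table("plants", DB_URI)
    assert (plants["capacity_mw"].dropna() >= 0).all()


def test_plant_ids_are_complete():
    """Every record should be attributable to a specific plant."""
    plants = pd.read_sql_table("plants", DB_URI)
    assert plants["plant_id"].notna().all()
```

Running checks like these on every build means a bad upstream data release gets caught before it reaches users.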
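And here is roughly what the Intake workflow looks like from the user's side. The catalog filename and source name are hypothetical; the general pattern of `open_catalog` plus `read()` is Intake's standard interface:

```python
import intake

# Open the data catalog. In the setup described above, the catalog
# itself arrives as a conda package, so its software dependencies are
# already managed. The filename here is hypothetical.
cat = intake.open_catalog("pudl_catalog.yml")

# List the data sources the catalog exposes.
print(list(cat))

# Read one source into a DataFrame. If the underlying files live in
# remote storage, Intake caches them locally on first access.
df = cat.hourly_emissions.read()
```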
So we're also working with 2i2c to try to figure out what an off-the-shelf, medium-data cloud environment looks like: something that's really optimized for easy setup with minimal configuration options, and that can make automated data processing and distribution both cheap and easy. And I suspect that there are other communities of data users that would benefit from a similar kind of setup. Christina?

Great. Oh, sorry, I skipped over our... I just wanted to quickly acknowledge some of the clients we've been working with, who have helped us build a lot of the analysis that sits on top of the data, as well as some partners and foundational support, especially the Sloan Foundation, for helping us build out a lot of this foundational infrastructure. So I'm just gonna put our takeaways on the slide and open up the space for questions, if anybody has any.

That's great. I was just gonna chime in and say four minutes left, because I forgot to say five minutes. We do have a question. I believe this is from Jonathan, who is our chat moderator. Jonathan says: this is such an interesting process and mission. Have you had any users that you hadn't expected to access or use your product?

I'm sure we have. We don't have the greatest communication with our users, because everything is just totally open. So we will occasionally find out that someone we've never heard of is using the data and doing research with it. We've had some small renewable energy developers work with the data in some cases, and that's not really a market we've gone after directly.

That's interesting. I have a follow-up there. What are the markets that you have gone after directly, in terms of working with communities or publicity?

Yeah, so when we first started, we naturally attracted a fair amount of folks who were doing research, like doing their PhD or a postdoc or something like that, and we continue to be very excited about supporting those folks. But as I said, we really started out in the advocacy community, so a lot of our client work is hand-tailoring data products for advocates for their advocacy use. Often a client wants us to integrate a new dataset that we would like to have available for everyone, but that we haven't gotten around to yet. So a client interaction will often give us the initial money to do a rough draft of what the data looks like and do their analysis, and then we can sometimes use foundation grant money to more deeply integrate that information into the underlying platform.

Awesome. So there's one more question. (There were a few journalists, too. Oh yeah, of course, yeah.) We have about two more minutes, so I want to get to this question from William Lachance, who asks: did you measure which methods of accessing your data were the most used, and if so, what was your interpretation of that?

This is a great question. Right now, a lot of the access modes are still in active development, so we really only have two or three main access modes. The JupyterHub gives us the most data about who is using it and how often, but our first and main access point right now really is just downloading the data from Zenodo. So right now we have minimal information about who uses things, but we're really excited that a lot of these new methods will give us much better data about which ones are most useful. Yeah, the Datasette instance and the JupyterHub are both kind of in beta right now.
So they haven't really been pushed out very much yet, but we're looking forward to having much more detailed information about who and how many people are using each of the different access modes, and potentially even which data they're using.

Great. And in the 30 seconds remaining, do you want to tell everyone what roles you're hiring for?

Sure. We have one that's more of a data wrangler and data analyst, who will probably have more client interactions and work on prototyping new datasets and analyses. And the other one is a software engineer slash data engineer, who will probably be working more on the infrastructure, the data pipeline, and the distribution process.