Thanks for attending. My name is Erik Erlandson, and I'm going to be talking about how we're trying to scale climate data for fintech using an open source data mesh. Just to get things started: like many of us, I burned carbon and put it into the atmosphere just to be here today. I used an online carbon calculator to see what it was, and it's just shy of 350 kilograms of carbon to get here and back. I bring this up not to instill guilt, but to point out that all of our economic activities have various kinds of climate and environmental impacts, and it's more important these days than ever that we actually try to measure them and do something about it.

So what is the actual goal of the open source, or OS-Climate, project? As of 2020, estimated total financial investments on planet Earth were in excess of 200 trillion euros. The goal here is to align all of that human economic activity and investment to try to control global temperature rise. In a sense, what we're really trying to do is scale the climate impact of human investment, and scaling is going to be a theme that comes up throughout this talk, relating to things like data meshes. So, as you can see, we're swinging for the fences.

What could this look like? Here's one example: a screenshot of a portfolio alignment tool for carbon. If you look down in the lower left, you can see that we're selecting a benchmark, saying, hey, let's try to meet the Paris Agreement's 1.5 degree Celsius maximum temperature rise. Given that policy goal, how is this set of investments looking these days? You can see up here that it's actually not meeting that goal: using current measurements, whatever the list of investments in this portfolio is, it's actually on track for 2.14 degrees Celsius. So this is an example of using tooling to assess whether the investments we're making as individuals, groups, or organizations are helping or hurting our ability to meet different kinds of climate targets.

When you talk about asset alignment, there are a couple of different kinds of risk: there's physical risk, and there's what's called transition risk. In some ways physical risk is easier to understand. Imagine, again, you have a bunch of investments, so there's a data set describing all the different companies you might be invested in and their assets. You combine that with location data: where are these assets located? Is it a factory on a river? Is it someplace high and dry? Is it near a seaboard? Location matters. And then, thirdly, you combine this with hazard data. If I'm investing in a company, and the company has a bunch of factories on a river, and that river is likely to start flooding more and more often because of climate change, that impacts the actual risk to my investments, and it may change how I want to allocate them. You can see that you're federating at least three different types of data here to achieve this result, and that federation theme is going to show up repeatedly.
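To make that three-way federation concrete, here is a minimal, purely illustrative Python sketch of joining asset, location, and hazard data. The companies, coordinates, and flood likelihoods are invented for the example; real portfolios and hazard data sets are far larger and messier.

```python
# Illustrative only: join three hypothetical data sets (portfolio assets,
# their physical locations, and flood-hazard projections) to estimate
# which holdings sit in higher-risk settings. All values are made up.
import pandas as pd

assets = pd.DataFrame({
    "asset_id": ["A1", "A2", "A3"],
    "company": ["RiverSteel", "DryGoods Co", "CoastChem"],
    "value_musd": [120.0, 45.0, 300.0],
})

locations = pd.DataFrame({
    "asset_id": ["A1", "A2", "A3"],
    "lat": [50.1, 39.7, 29.9],
    "lon": [8.7, -104.9, -90.1],
    "setting": ["riverbank", "high plains", "coastal"],
})

hazards = pd.DataFrame({
    "setting": ["riverbank", "coastal", "high plains"],
    "flood_likelihood_2030": [0.35, 0.50, 0.05],  # hypothetical probabilities
})

# Federate the three sources, then compute a simple exposure figure.
exposure = assets.merge(locations, on="asset_id").merge(hazards, on="setting")
exposure["value_at_risk_musd"] = (
    exposure["value_musd"] * exposure["flood_likelihood_2030"]
)

print(exposure[["company", "setting", "flood_likelihood_2030",
                "value_at_risk_musd"]])
```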
Transition risk is a slightly different animal. You have to imagine different possible policy regimes, all the different climate-related or environmental policies that nations either choose to adopt or don't, and how those impact the available energy sources, natural resource extraction, or just the available materials, and then consider how all of that affects the macroeconomic climate. So you can think of transition risk as flowing either from policies down to actual investments or, frankly, the reverse: how different investments and their locations impact what kinds of policies might actually get made.

As you can see, all of these are environmental, social, and governance considerations, or ESG. How many people, show of hands, are working with ESG considerations or know what ESG is? Okay, a few. ESG is a term you should learn. Climate-related impacts fall under it, but so do other kinds of environmental factors and, frankly, even social considerations; it's a very broad term.

This creates a lot of challenges if you're a financial company or fintech. The regulatory roadmap for how we're expected to report all these things is evolving pretty quickly, and the regulations that get adopted have big implications for companies. Financial institutions need to be able to consider multiple scenarios, because we're all trying to predict the future, and we all know how that goes. You try to imagine different kinds of regulatory environments and different kinds of climate change scenarios, and then evaluate all of your investments against them to get a broad picture of the possibilities. As I mentioned, you're trying to federate a lot of different kinds of data, and a lot of it is technically publicly available, but actually assembling it in a usable form is a huge technical challenge. If you're a very large financial institution, you might have the resources to do this yourself, but a lot of financial institutions are smaller and don't, and even if they did, it would be a massive duplication of effort. And lastly, of course, a lot of this is based on company reports, and company reports can be wrong, either accidentally or deliberately, or misleading, or simply contain mistakes. So there are a lot of data quality, governance, and trust issues around the data. So much so that the SEC, for one, is evolving its reporting requirements to try to address these uncertainties: requiring investors to state which ESG factors they're considering. Are they talking about carbon? Are they talking about other kinds of pollutants? What exactly are they trying to assess with the report? And then the details: how did they compute it, and what metrics are they actually using? Because that determines what kind of results you get, and you need transparency about how you're arriving at the numbers in these reports.

Computing these kinds of impacts requires complex relation graphs. If you imagine something like Scope 1 emissions, which is just the emissions that I as a company create directly, that doesn't require super complex relations. But if you go to Scope 2 or 3, you're trying to account not just for what I'm directly emitting, but for the emissions of the power I purchase, or my supply chain: if I'm buying things from a supply chain, what are its suppliers' emissions?
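As a rough illustration of why this turns into a graph problem, here is a small Python sketch that rolls emissions up an ownership tree. The companies, ownership shares, and emissions figures are invented, and this is not the formal GHG Protocol consolidation method, just the shape of the computation.

```python
# Illustrative only: aggregate emissions across a hypothetical corporate
# ownership graph, the kind of relationship data that identifier sets
# such as GLEIF make traceable. All entities and numbers are made up.
ownership = {                 # parent -> list of (subsidiary, ownership share)
    "HoldCo":   [("SteelWorks", 1.0), ("PowerGen", 0.6)],
    "PowerGen": [("CoalMine", 0.8)],
}
scope1_tonnes = {             # direct (Scope 1) emissions per entity, tCO2e
    "HoldCo": 1_000, "SteelWorks": 250_000,
    "PowerGen": 900_000, "CoalMine": 400_000,
}

def rolled_up_emissions(company: str) -> float:
    """Own Scope 1 emissions plus ownership-weighted emissions of
    subsidiaries, followed recursively down the graph."""
    total = float(scope1_tonnes.get(company, 0.0))
    for subsidiary, share in ownership.get(company, []):
        total += share * rolled_up_emissions(subsidiary)
    return total

print(f"HoldCo consolidated emissions: {rolled_up_emissions('HoldCo'):,.0f} tCO2e")
```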
If you're going to answer those questions, you have to understand a lot of complex graph relationships about corporate ownership and supply chains. One example of a data set we've been invested in providing comes from the Global Legal Entity Identifier Foundation, or GLEIF. The GLEIF data set encodes things like which companies own other companies, and it allows you to start with one company and meaningfully trace its total carbon impacts. So all these data problems are there in the background if you're going to address this, and we at OS-Climate have been investigating how to do it all with a data mesh architecture.

So what does that mean? In my opinion there are three basic properties that are important to data meshes. One is federated data ownership, and possibly federated physical location of the data: a lot of different people own the production of pieces of data that need to be brought together to answer these kinds of questions, and that data may not all be in the same place. Second, there's federated governance of the data: you have a federation of people trying to produce data products, so how do you scalably govern the data and the people who are coming together to do this? And lastly, the data should be provided in a way that allows self-service use, so community members can come and, as much as possible, understand on their own how to find what they're looking for and how to use it. I'll talk about how we're trying to enable all three of these characteristics.

Here's a high-level architecture diagram of OS-Climate. On the left you have what I would call the federated data: a lot of different kinds of data sources, databases, raw files, data pulled off of APIs, streaming data, the whole nine yards. Again, this can be owned by disparate organizations and may not live inside your direct control. You can also see the federated governance: the different domains of data being provided are owned by independent organizations, they have their own rules about what they're producing, and they're managing how they define it, but somehow you need to make it transparent how that's working. And lastly there's the self-service aspect: community members can come with different kinds of ways to access the data, things like Jupyter and pandas, SQL clients, or applications pulling it programmatically via APIs; all of those are possibilities. We want to enable all these people to do this as much as possible on their own so that we can scale.

So not only is this a data mesh architecture, we're really trying to present an open community. By open, I mean we're definitely focusing on open source tooling and open data wherever possible, but also open operations: how do you actually deploy these things on clusters in an open way, and how do you govern in an open and transparent way? There are different ways to slice this kind of open stack. One way is through data science workflows: in terms of data pipelines and ingest, you have things like JupyterHub, Elyra pipelines, or dbt.
You can train models for extraction, actually trying to get this information out of unstructured reports, using things like TensorFlow or PyTorch. You can build deployment pipelines with things like Kubeflow Pipelines. And of course there's actually understanding the data itself using metadata technology; we're using tooling like OpenMetadata and Great Expectations for data quality assessments. And then, lastly, you might be serving all these applications or models in the cloud.

Another way to slice this is as layers of providing data. At the very bottom, we're doing a lot with Iceberg and Parquet, with the data living on object storage like S3. At the federation layer, again, we're using OpenMetadata, and we're federating a lot using Trino, which I'll talk about a little bit. And at the highest level, the application layer, we're doing a lot with standard Kubernetes deployment tools, Istio, Envoy, and business intelligence tooling like Superset, and I'll be showing some of these applications.

I want to talk a little bit about Trino, because Trino is one of the core tools of the federation. What is Trino? Trino is basically a way to take SQL and, under the hood, run containerized workers in parallel, so when you execute a SQL command, the execution can happen in parallel to speed things up. Anybody who has worked with Spark might recognize that as being similar to Spark SQL. How many people actually work with Spark or Spark SQL? Okay, so Trino operates in that same general space. The other thing Trino is pretty good at, also frankly a little bit similar to Spark, is being able to scale out with data federation. There are dozens of connectors that let you connect your Trino install to different kinds of data sources, and they don't have to be co-located with you. One we're doing a lot with is raw Parquet files: you can expose Parquet as an actual database to run SQL on. We've federated with Google BigQuery, which I'm going to expand on a bit. We've done Kafka, which allows you to do streaming queries, again a little bit like Spark Structured Streaming. And with Iceberg tables, we're doing a lot with Iceberg's versioning and rollback capabilities.

So, about the Google BigQuery one: one of our big federation success stories is integrating a connector for Riskthinking.AI, which is its own separate company. Again, this is the federated governance idea: they provide their own data, and we just declared a connector to it and got it working, and now we can see it as if it were part of our own Trino install. If I log on to our Trino and execute something like this little SQL query, under the hood it goes off, talks to Riskthinking's BigQuery data, and returns results. One interesting thing you can see here on the left is that the primary index key is actually an H3 geolocation code, which is turning out to be a very interesting geolocation format, and it works well as a database key.
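To give a feel for what that looks like from the client side, here is a minimal sketch using the Trino Python client. The host, catalog, schema, table, and column names are placeholders, not the real OS-Climate deployment, and authentication is omitted for brevity.

```python
# Minimal sketch: run a federated SQL query through Trino from Python.
# "risk_thinking" stands in for a BigQuery-backed catalog; the table and
# column names are hypothetical. Authentication details are omitted.
import trino

conn = trino.dbapi.connect(
    host="trino.example.org",        # placeholder endpoint
    port=443,
    http_scheme="https",
    user="demo-user",
    catalog="risk_thinking",         # hypothetical federated catalog
    schema="hazard_data",
)

cur = conn.cursor()
# "h3_index" stands in for the H3 geolocation key mentioned above.
cur.execute("""
    SELECT h3_index, hazard_type, projected_value
    FROM flood_projections
    WHERE scenario = 'rcp8.5'
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```

The point is that the query is plain SQL; Trino's connectors decide whether it runs against BigQuery, Parquet on object storage, Kafka, or anything else.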
So again, I've talked a lot about self-service, and we've had a lot of success producing self-service environments. If you're somebody out there in the community, you can authenticate onto our cluster; until recently that was done with GitHub. That automatically spins up your own environment with JupyterHub, you get access to Superset, and both of those are backed by the Trino deployment I just talked about. On the right here, all those individual components have been provided using Open Data Hub, and if you're not familiar with Open Data Hub, think of it as an open source downstream of Kubeflow.

One of the things that was important for self-service was unified authentication. If you want people to actually be able to do this, you want them to authenticate once and get onto all of your tooling without having to fuss around with each piece. Like I said, originally we got quite a bit of mileage doing this with GitHub: we were able to authenticate once with GitHub and get access to all these tools, Superset, JupyterHub, Trino, and the actual OpenShift cluster itself. But it began to show some cracks as we tried to extend our tooling stack. Trino itself was always a little bit clunky; it definitely worked, but you had to do a multi-step process to get a token. And when it came time to try to do this with OpenMetadata, it just didn't work very well at all. So we had to do something if we were going to continue scaling our tool stack. The something we're doing is migrating authentication to HashiCorp Vault. Now your single sign-on is to Vault, that gives you access to all the tooling, and it's much more extensible. So that's one way we're improving the functionality of our data mesh architecture. You could also use something like Keycloak to do this.

In terms of understanding the data and allowing people to search it, the core of that right now is OpenMetadata, and it provides a lot of nice functionality. You can see down in the lower left that it can pull basic things like column names and their data types straight from Trino, which is good. But it does much more than that: if you're running data ingestion pipelines with dbt, it also allows you to add things like descriptive fields and searchable tags. So it enables us to stand up a more and more comprehensive, searchable metadata environment.

A lot of what we want to do is leverage data-as-code techniques for maximum reproducibility and transparency. The first way we were mostly doing this in the community was with Elyra pipelines, so you could write Jupyter notebooks and, once you had them working, turn them into repeatable nodes in a pipeline. You can see down here in the lower left a directed acyclic graph workflow, pretty typical: notebook A runs, then notebooks B and C can run after that, and lastly, once those have run, notebook D runs. We were backing this with Tekton Pipelines to actually manage the execution. More recently, we've been migrating to dbt and, instead of Tekton, backing it with Apache Airflow. It's basically the same kind of model, except that instead of Jupyter notebooks you're creating dbt models in the same kind of directed acyclic graph pipeline. Apache Airflow is nice because it turns out OpenMetadata requires it anyway, so we had it installed already, and it's turning out to be a nice unification of pipeline tooling.
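As a sketch of the kind of directed acyclic graph I'm describing, here is what a minimal Airflow DAG driving dbt models might look like. The DAG id, model names, and project path are hypothetical, not our actual pipeline definitions.

```python
# Minimal sketch (not the OS-Climate pipeline itself): three dbt models
# strung into a directed acyclic graph under Apache Airflow. The model
# names and project directory are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="esg_dbt_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Each task runs one dbt model; Airflow handles ordering and scheduling.
    ingest = BashOperator(
        task_id="ingest_raw_reports",
        bash_command="dbt run --select raw_company_reports --project-dir /opt/dbt",
    )
    normalize = BashOperator(
        task_id="normalize_emissions",
        bash_command="dbt run --select emissions_normalized --project-dir /opt/dbt",
    )
    publish = BashOperator(
        task_id="publish_portfolio_marts",
        bash_command="dbt run --select portfolio_alignment --project-dir /opt/dbt",
    )

    # Run ingest first, then normalize, then publish, like the notebook DAG.
    ingest >> normalize >> publish
```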
So what is a dbt model? What's nice about them is that a dbt model is nothing but a SQL operation. You can see here, this is a relatively simple one: you're just doing a field projection, selecting a bunch of fields and providing them. But because it's basically SQL, you can also write much more complicated SQL, joins if you want to; all of SQL is at your disposal here. So it's a pretty powerful formalism.

I want to talk a little bit more about physical risk and get to some of the actual applications we're constructing on top of this stack. Again, physical risk is a data federation problem: you need to federate data about actual climate change hazards and their locations, and map that in geographic space to the exposure of your assets. What are the actual physical locations of your investment assets, and how do they relate to all those hazards? Then you produce vulnerability maps, and if you can federate all of that, you can produce a risk model for your investments.

Here's a screenshot of one of these models. This particular one is basically a work-loss-due-to-heat model: the idea is that as the world warms, it becomes harder and harder to do work because it's too hot. This is a pretty aggressive model. It's basically saying that by 2030 things are not looking good at the equator, with 80 to 100% work loss for at least part of the year. Frankly, if you look up at the United States, a lot of the Southeast isn't looking that great either. So here you have a physical model of a particular climate change hazard, but you don't yet see a mapping to actual assets. So we're also wiring up ways for people to enter similar information for their assets. Here you can see, this is basically a demo, but these are actual physical locations of things like factories and warehouses, in lat/long. We're in the process of connecting these two things right now, so what I'm showing you here is a conceptual mockup, but once we have it, then depending on where your assets are, the risk under this particular model scenario might be moderate, or if you're in Florida maybe more severe, and frankly, if you're too close to the equator, it would be catastrophic: total productivity loss due to this problem.

This tooling, as you can see here, is basically running on REST APIs. That's important because we're not just federating things via Trino; we're also federating things using more traditional REST APIs and microservice architectures that are out running on the cluster using standard Kubernetes-available tooling.
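Here is a purely conceptual sketch of what consuming such a physical-risk service over REST could look like from Python. The endpoint URL, request payload, and response fields are hypothetical stand-ins, not the actual API of this tooling.

```python
# Conceptual sketch only: score a portfolio's physical assets against a
# heat-related work-loss hazard via a hypothetical REST endpoint. The URL,
# payload shape, and response fields are placeholders.
import requests

assets = [
    {"name": "Gulf Coast warehouse", "lat": 29.7, "lon": -95.4},
    {"name": "Equatorial plant",     "lat": 0.3,  "lon": 32.6},
]

for asset in assets:
    resp = requests.post(
        "https://physrisk.example.org/api/v1/impact",   # placeholder endpoint
        json={
            "hazard": "chronic_heat_work_loss",
            "scenario": "ssp585",
            "year": 2030,
            "latitude": asset["lat"],
            "longitude": asset["lon"],
        },
        timeout=30,
    )
    resp.raise_for_status()
    impact = resp.json()          # e.g. {"work_loss_fraction": 0.42}
    print(asset["name"], impact)
```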
And lastly, I want to talk about another exciting new project, which we've been calling the data exchange. This is also super important, we think, for self-service community use. The idea of the data exchange is that all the different data products people in the community provide get registered here, and a user can come and browse them; each one gets its own little tile, and you can see six of these tiles here, all searchable. If you zoom in on one, you see lots of important information. We talked about federated governance: if you're providing a resource to the community, you have to register yourself as an owner, so people know who to talk to with questions and problems. It tells you what kinds of subsystems it could be useful for. Up above, you can see an actual description of what you're looking at. Then there are things like lifecycle, so here's a maturity model; this one is experimental, a beta kind of thing. And lastly, like OpenMetadata, it has searchable tags. You can expand these entries to get the individual product artifacts; a data product consists of multiple individual artifacts. Here's an example of one, which happens to be a SQL query. That kind of example is nice for showing integration with Trino, but it doesn't have to be that: it could be REST endpoints, it could be raw files. So there's a ton of heterogeneous data products that can be handled by this kind of system, and it's all searchable. Again, you can authenticate with Vault to get onto the data exchange and find what you need by yourself, without getting bottlenecked by the OS-Climate core staff. And the exchange doesn't just have a user interface, it has its own programmatic API, so you can write tooling that talks to the data exchange and exposes your own user interfaces. It's going to be a potentially extremely powerful piece of infrastructure.

So I've been talking about a bunch of pieces, tooling, microservices, and all of this is out running on our clusters. As I mentioned earlier, we want this to be an open community, and not just open source, but open deployment. What does that mean? We're doing it by managing all of these deployments using GitOps out on public GitHub. So if you're an OS-Climate community person and you want to make a change to a deployment, you can actually do that in a standard open source way: as a user, I can go make a pull request against the repo that controls our clusters. Here's an example of a pull request updating the version of Trino. As with all open source, I can make a proposal: I'd like to update Trino from 373 to 380. Trino is already into the 400s now, but yeah. And like all pull requests, it can be reviewed by the community and discussed, and if it's adopted and merged, then our Argo CD continuous deployment will pick that up and automatically make the change to the clusters. So if they merge my pull request, it will automatically upgrade Trino, and that becomes the new version running on the actual clusters. All in the open. And you can go back and look at it historically too, so you get the full value of Git: issues, pull requests, discussions, and the historical record.

Lastly, I want to talk about the value. I said we're trying to build this as an open data mesh, and I'll put my Red Hat hat on briefly and talk about the value of open solutions. While I was presenting, you probably saw me mention different kinds of alternatives to different solutions. For instance, you could say, instead of Trino, maybe I'd actually like to use Spark SQL. Or instead of Elyra, I could use dbt. All the substitutions you see here could be made, and probably more that I'm not thinking of.
And because it's an open architecture, if you wanted to deploy your own version of this kind of data mesh, you could make a lot of your own decisions about these choices. It's the classic open solution value proposition of not having tool lock-in.

So anyway, I hope that's been interesting for you. If you're interested in exploring OS-Climate further, these links are on the deck. There's the OS-Climate website. As I said, we're fully open source, so we have our own GitHub org, which you can look at to check out our code. And for a lot of the tooling deployments I just described, there's actually documentation the community produced where you can read how to deploy all of this, all the way from building the cluster itself to things like deploying Vault and getting it to work with Trino. So if you're interested in actually trying to make this happen for yourself, that bottom link is a great resource. And thanks very much. If you have any questions, I'm happy to try and answer them.

Yes. Is there a microphone?

This is all really cool and really impressive, both from the perspective of the climate community and the integration into tech, and from the perspective of creating a data mesh architecture. Where would be a good place to go to learn about these systems and get training for creating systems like this? A lot of this is really intriguing and really cool, but a lot of it is also kind of going over my head. The organization I work for is also oriented toward creating communal data stores similar to this, not necessarily around financial risk, but shared data pools between organizations.

Right. I think these resources would actually be a good place to start, because, as I said, we are building this all out in the open: all the code we're writing, all the open tooling. Like I said, that bottom link is literally actual instructions that some of the community members are building out, and they're very detailed instructions; you can follow them and build the deployments. So that would be one place to look. I'm sure there are others; frankly, I can't say off the top of my head that you should go to X, Y, or Z, but I'd be happy to look into it, and if you send me an email and reach out, I can try to find some other resources, they must be out there. But we are actually trying to build this as that kind of thing: one of the goals is that anybody can come and recreate the deployments.

Yeah, so my question is: you talk about fintech, financial technology, and all the technology, and obviously the financial industry is a key stakeholder here, an interested party. I know you're in the process of developing this open source technology, but has there been engagement with the financial community specifically related to climate risk, with the TCFD, the Task Force on Climate-related Financial Disclosures, which basically sets a variety of reporting standards for reporting risks, strategies, and metrics for companies? Are you aware of the organization, or have you had any exposure to them?

That's not ringing a bell for me. I'm going to come clean: my expertise tends to be more on the platform end, but Heather Akin, who's in here, may be able to speak to that.
Yeah, well, it's a set of recommendations, right, that more and more organizations are adopting, and this appears to be a tool that would be very useful to organizations in achieving those recommendations from the FSB, the Financial Stability Board.

Oh, good to know. I'll definitely be visiting those links over the next couple of weeks. All right. Well, again, thanks very much. I appreciate your questions.