Hi everyone. This is Vincent here. Today we are going to discuss our involvement in a project called Open Source Climate, and in particular we'll be covering how we are helping to build a climate-related data platform using open source technology and a data mesh architecture approach. In this agenda we'll spend a bit of time going through the context, what problem we are trying to solve, and why open source as an approach to solving the climate data problem. The focus of the presentation is going to be on the platform architecture, trying to explain how we are approaching the climate data problem and why it's a novel way to do so. And Eric will spend a fair bit of time demonstrating some of the platform capabilities.

So first, I actually work in the financial services team, and quite often I get asked the question: why is financial services critical to the climate issue? I'm just back from Glasgow, where COP26, the 26th climate conference, took place over the past two weeks, and one big, important change in the way the climate problem is now being approached is the involvement of the corporate sector in resolving it. If you think about it, the way for us to transition to a net-zero carbon emissions pathway is really to motivate and encourage corporations to all make efforts in this direction. And the financial sector, and in particular the banking sector, being responsible for about 90-plus percent of financing to corporations, is extremely important to financing a net-zero pathway.

What's interesting is that for financial institutions, climate is in itself an important challenge, because the investments these institutions are making potentially translate into climate risk. Let's say you are financing energy production and you start to lend money for the next 25 to 30 years. It becomes extremely important to understand how climate risk could affect the business you're investing in and, obviously, its ability to service the loans you are making. That means being able to understand financial risk in terms of climate drivers from the physical risk point of view: what happens if an extreme climate event such as a tornado, or a slow-moving change such as rising sea levels, endangers the physical assets that the company you are investing in possesses? So financial institutions pay attention to modeling climate risk to secure their own investments.

In addition to this, regulators now realize that the public sector is not going to be able to drive very significant climate change mitigation on its own and that it requires the help of the private sector, and financial institutions are very key to this because they are the main financing institutions. And all of us as investors are increasingly aware of what climate change represents as a risk to our planet, and we are also starting to put pressure on financial institutions to disclose the nature of their investments and make sure they invest in green businesses.

So now let's look at OS-Climate. The purpose of OS-Climate, in the grand scheme of things, is to give the corporate sector as well as the financial sector an open collaboration opportunity around one main challenge in modeling climate risk, which is acquiring data. And for this, there are really three main tenets to this approach.
The first one is that, in the past, a lot of financial institutions, or any sort of NGO or non-profit working in the field of climate data, tended to try to solve these problems on their own. We believe that using an open source approach and an open source community to collaborate, with multiple stakeholders investing their intellectual property and sharing the cost and burden, is actually a more efficient alternative. With this open source approach, we are looking at building mainly two capabilities: one is the global Data Commons, a curated library of public and private data for climate, which is what we are going to cover today; the other is helping institutions build scenario analytics across the different climate dimensions.

Why is it important to adopt a community-based open source approach? Mostly because we want to be efficient at solving problems, share the solutions, and obviously try to solve the climate data problem faster if everybody is pulling the rope in the same direction. Running this as open source means we have shared, structured governance around the project, so the key industry players can agree on what main data challenges they have and what the focus of this engagement is. We also create a very high-trust collaboration structure between players. This includes some commercial data providers who are sharing their data on what we call a pre-competitive basis, which means they share the data with us to be able to train new data models and analytics models and then spread them across the industry. And of course, from a licensing point of view, the code and platform design are open source across the community.

Now let's focus on the climate data challenges. There are three main challenges that we're trying to solve. The first one is data availability, and interestingly there are really two problems here. One is that there's actually a huge volume of data generated by the industry, and the data is extremely distributed, so it's very hard to connect datasets to each other. At the same time, although the data volume is high, for certain regions or certain industries the data has insufficient granularity or limited coverage, so there can be missing data. One example: you could have public utilities disclosing energy production and CO2 emissions for energy providers in the United States, but no such data available for China. So you may be missing some geographic or industry data.

Now, even if you have the data, another problem is that because it is distributed, it is extremely hard to compare across different sources, because they may not follow the same disclosure standard or the same data format. And last but not least, another problem is data reliability. As you use this data for your models, you want to make sure you understand how the data has been produced. One typical issue is that there are a lot of communities out there that build their own models and generate data, but nobody is able to understand transparently whether the data that has been generated can be trusted.

So let's move now to the architecture. The idea is that we have an extremely high number of potential data providers, and we want to help by building a platform that solves this problem of data availability, comparability, and reliability. For data availability, our approach is to use a self-service data management platform.
What we are doing really is to make a collection of well-integrated, standardized open source tools available for the community to ingest, process, and distribute data. The idea here is to standardize this tooling and make it easily available to data scientists, typically people who don't necessarily have the knowledge of how to install the infrastructure behind it, so that they can start to build those data integrations on their own.

For data comparability, we use a data mesh architecture, which essentially means that we empower the different data domain owners to solve the problem on their own in a distributed and federated way. They can build their own pipelines, ingesting, processing, and distributing data in a fairly independent way, and they basically own the data pretty much like a product.

The last dimension is reliability. This is interesting, because in a distributed system you typically get a lot of agility, but then the problem can be: how do you ensure that people follow a certain set of standards and also respect certain governance and security around the data? Here we use a federated governance approach, which is pretty much twofold. The first part is that we treat data very much like code. We use processes that have been used in the open source community for years, extremely mature, well-proven processes for managing and versioning code, and we are extending those processes to the data itself. So datasets are managed like code: they are transparently generated, and everybody can see how the data is produced and shared. But at the same time, because there are certain governance issues around the data, we also manage access control and data governance centrally across all the different domains. This helps us to turn what I would call a data mess into a data mesh: we have an extremely high number of heterogeneous data sources and data contributors, and instead of building yet another source or another database, we are standardizing the way people consume the data. We are federating across data sources and standardizing the tooling and the approach to ingest, process, and distribute data.

I will now leave the next section to Eric to present how we are actually using open source tooling to do this. From the Red Hat perspective, we are really building on an existing open source project called Open Data Hub. Open Data Hub is an AI platform powered by open source: we are literally selecting multiple upstream projects in the open source community and putting together a platform that can easily be deployed by data scientists across the globe to build their own capabilities. Eric?

Thanks, Vincent. So yes. Oh, I get to show my screen. Okay, I was expecting Vincent to drive. So, riffing off what Vincent said, there are a couple of things to say about Open Data Hub. One important thing is that it's an open source downstream, in a very typical Red Hat-ish open source downstream way; in this case, it's a downstream of Kubeflow. We are using it internally and externally as a reference architecture for machine learning workflows on Kubernetes. And lastly, it's federated, which is to say that the actual components are selectable and mixable, and also, just by virtue of the Kubernetes and OpenShift platform, it's very easy to add other components to the system if you want to.
The coverage of the available tools is actually quite good in both dimensions: in terms of process, all the way from business goals through data preparation, model development, deployment, and finally monitoring; and along the axis of personas, it similarly covers use cases for business leaders, data engineering, data science, all the way through app development and operations. I've emphasized in this diagram a few of the tools you're going to see later today: Superset for dashboarding, and Jupyter for both exploratory data science and also pipelining. And then Trino, which is a great example of the advantages of a federated architecture: Trino is not actually a formalized component of Open Data Hub, but it's being deployed in quite a lot of our use cases, including the Data Commons for OS-Climate.

So again, Open Data Hub is leveraging everything that's good about open source, community-driven upstreams, in this case in the artificial intelligence and machine learning space. We are maintaining the actual cluster using another Red Hat-affiliated project called Operate First, and I'll show you a little bit more about what that means later. Fundamentally, what we're trying to do is take the popularity of GitOps and extend it: whereas open source has traditionally been excellent at accumulating knowledge about writing software, we're trying to extend that into accumulating knowledge in open communities about actually operating that same software. So you're going to see that the Data Commons is actually being managed via Operate First.

These two projects are coming together into the OS-Climate Data Commons, which Vincent has been describing, and I'll be demonstrating a few components of that later. The goal is to be able to manage data as a product, just like we manage software as products. We're hoping this architecture can show that we can service all the different use cases and personas, all the way from raw data providers to data engineering and structuring, and then to actually doing data science and maintaining the quality of the results. And we're trying to achieve this by treating data as code, just as we treat our software. Now, what does that mean? It means that we're federating very disparate data sources with repeatable pipelines that run in orchestrated cloud environments. So if you want to update your data, you actually update the software that ingests it or processes it. And we're doing this all in open source, so the entire community will be able to participate and understand exactly what software is being run, how it's being versioned, and how it can be improved. And I'll hand it back to Vincent to talk more about how we're leveraging this architecture.

Sure. Thank you, Eric. So from a platform perspective, now looking at the typical flow and capabilities required to build data pipelines, and what components of a data science platform are required: as explained by Eric, we are leveraging Kubeflow heavily as a machine learning platform, and it provides a number of components used for building data pipelines, model training, and model deployment and serving. A lot of the work is typically done in Jupyter notebooks provided on a self-service basis in JupyterHub, and those notebooks can be stitched together into end-to-end data pipelines using Elyra. Hmm, I think I'm on the wrong screen.
Okay, that should be better. Sorry for that. However, as we'll see later when I cover this topic, there are still a number of components that we're trying to add to the platform to provide better manageability across the community. These components are really around the management of experiments and data experiments, as well as the metadata associated with them. So we are now looking at the Pachyderm project as well as DataHub from LinkedIn to get better automation around the metadata management of all these data streams.

In terms of the data science roadmap, and please don't go into the detail here, this is really just to show that we currently have a working platform and that a number of streams in the OS-Climate community are already building data pipelines for various datasets and data sources. However, we still have a lot of work to do, in particular in the domain of metadata management, and also trying as much as possible to automate some of the data catalog management. One of the big issues once you produce data, from a community perspective, is making it discoverable, and one of our objectives in the months to come will be to try to create catalogs automatically so that people can easily discover the data across the community.

Now let's move to another set of problems, really more on the governance side of things. Here we are looking at different personas. Typically, one of the challenges for a data engineer or data user is the multitude of different data sources and the difficulty of merging them and using them together to make sense of the data; typically you need a common referential to do so. That's the first kind of problem we have to deal with. The challenge is that a number of these datasets we consume may be public, but others may be competitive datasets provided by a commercial data provider, which may be okay with the data being used to improve and tune certain data or risk models but is absolutely not okay with its data just being distributed totally in the open. So they want security and compliance to be managed. As I've explained before, what we're trying to do here is create a single layer of compliance and security across the pipeline.

What this diagram shows is a very high-level skeleton of what a data pipeline could typically look like. You have some primary source of data, which is nice but not necessarily usable as-is by the community, and then you try to merge it with some kind of reference dataset that will help people make sense of the data. One example: let's say you have carbon emissions of factories in the United States. You don't necessarily know which company actually owns each factory, or how to aggregate all this data in a way that is meaningful for, say, an investor who will be looking at emissions for a specific, maybe public, company and also trying to understand how those emissions relate to, say, the benchmark in the industry: is this company doing well or not? So a lot of processing is done to integrate the datasets together, but once you reach step number two in this diagram, you are basically building your own new dataset with your own data model, and that needs to be documented.
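To make that factory-emissions example a bit more concrete, here is a minimal, purely illustrative sketch of the "merge with a reference dataset, then aggregate" step being described. The dataset and column names are hypothetical, not the actual OS-Climate schemas.

```python
# Illustrative sketch only: hypothetical column and dataset names, showing the
# kind of "join primary data with reference data, then aggregate" step
# described above.
import pandas as pd

# Primary source: facility-level CO2 emissions (e.g. a public disclosure dataset)
facility_emissions = pd.DataFrame({
    "facility_id": ["F1", "F2", "F3"],
    "co2_tonnes": [120_000, 45_000, 310_000],
})

# Reference dataset: which company owns which facility
facility_ownership = pd.DataFrame({
    "facility_id": ["F1", "F2", "F3"],
    "company_lei": ["LEI-AAA", "LEI-AAA", "LEI-BBB"],
})

# Step 1: merge the primary data with the reference data
merged = facility_emissions.merge(facility_ownership, on="facility_id", how="left")

# Step 2: aggregate to the company level -- this produces a new, derived
# dataset with its own data model, which is exactly what needs to be
# documented and versioned downstream
company_emissions = merged.groupby("company_lei", as_index=False)["co2_tonnes"].sum()
print(company_emissions)
```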
So the first thing we're trying to do here is to build a catalog service that will keep track of dataset versions, dataset metadata, and also what we call pipeline metadata: every time you generate a new dataset with a bit of code, you want to keep the code version and associate the data that is produced with the version of the code that produced it, so people understand how the dataset is actually being built. As you do so, as you can imagine, a lot of datasets get produced. The question and challenge then is: how do you make sure that when data needs to be governed, and some security and access needs to be managed, we can do this consistently?

This is where a single layer of data federation and security based on Trino allows us to build a single logic of data access across multiple datasets. Right now we are using Trino to do exactly this, to make sure that at the data element level we have consistent security at any point of the pipeline. Let's say that some of this primary dataset is actually commercial and restricted in access: we can make sure that the access is consistent on the primary dataset layer, the source data, but also when the data is distributed. That is extremely important.

This is what we have today, but something we are still working on, going forward, is being able to manage access at the dataset level but also at the metadata level. For example, a specific dataset could be accessible to the public: maybe you have data that is not up to date, say data from six months ago, that you want to make available to the public, but for the latest data you need a commercial license. So you would use metadata tagging to tag your latest dataset and make it subject to licensing, while the public dataset holds the older data, just as an example.

How do we make this happen? Again, by leveraging open source projects to build and integrate security and governance across three layers: the physical data layer, which is how we store and serve data; the virtual layer, which is how we consume and query data in the pipeline; and the access layer, which is, once data is ready to be distributed, how you make it accessible to tools, data portals, and different data users.
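As a purely hypothetical illustration of what such pipeline metadata might capture (none of these field names come from the actual OS-Climate catalog), a record linking a dataset version to the code version that produced it, plus an access tag for governance, could look roughly like this:

```python
# Hypothetical sketch of a catalog entry linking a dataset version to the
# exact code version that produced it. The names are illustrative only,
# not a real OS-Climate catalog schema.
from dataclasses import dataclass, field

@dataclass
class DatasetCatalogEntry:
    name: str                  # logical dataset name, e.g. "company_emissions"
    version: str               # version of the dataset produced by this pipeline run
    pipeline_repo: str         # git repository holding the pipeline code
    pipeline_commit: str       # commit SHA of the code that produced the data
    source_datasets: list = field(default_factory=list)  # upstream inputs used
    access_tag: str = "public"  # e.g. "public" vs "licensed" for governance

entry = DatasetCatalogEntry(
    name="company_emissions",
    version="2021.11.0",
    pipeline_repo="https://github.com/example-org/emissions-pipeline",  # hypothetical repo
    pipeline_commit="a1b2c3d",
    source_datasets=["facility_emissions:2021.10", "facility_ownership:2021.09"],
    access_tag="licensed",  # e.g. the latest data sits behind a commercial license
)
print(entry)
```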
A lot of what we've built so far is basically the first three sublayers that you see at the bottom of this diagram. We use container storage technology based on Ceph and object storage, and data serving relying on Parquet and Apache Iceberg. Apache Iceberg is basically a big data table format that gives us capabilities such as ACID transactionality on the data itself, which means we have consistent updates and reads across big datasets, and the underlying format is actually Parquet, which makes it very scalable to add gigabytes or potentially petabytes of data for a specific dataset. On top of this we are using Trino as a distributed SQL engine; Trino helps us federate queries of data across a very high number of data sources.

The work we have not done yet is what you see above: building a metadata platform and a data security layer. We are already experimenting with LinkedIn's DataHub project as well as Apache Ranger to build those layers, and a lot of the work we are going to do in the next 12 months or so will be on those two layers. So that is our roadmap for data access management: as you can see, today we already have very granular access capability with the Trino layer, and Eric will demonstrate this in the demo, but some of the things we are missing are how to tie the security to the metadata management layer and, obviously, how to generate catalogs automatically. So let's go now into the demo, and I will leave it to Eric again to share his screen.

Thanks, Vincent. I don't know if the moderator has to... yeah, I've made you presenter. Oh wait, there it goes. Perfect, excellent. So I'm just going to briefly give you a high-level view of what I'm about to try and demonstrate. As I mentioned before, we're intending to make heavy use of Elyra's capability of supporting Jupyter environments for actual data exploration, data science, and prototyping, and also to then take the same notebooks and be able to run them in repeatable pipelines, and so sort of bridge the gap between actual data science and DevOps. Along the way I'll show how we're using GitHub and some of Red Hat's continuous integration bots, as well as Quay. And similarly, from the slide before, with Elyra we're hoping to represent actual repeatable ETL pipelines, ingesting from disparate data sources, including things like object store, then exposing these as actual SQL databases; and from SQL it's easy to consume through Jupyter or tools like Superset to do actual analysis.

So, just briefly, all of this is being driven through the OS-Climate GitHub organization, and we're making use of both GitHub projects and teams. This has been a real help in onboarding different community members who, as Vincent mentioned, are not really conversant in open source, and being able to give them a pathway to private repos on GitHub and then eventually to public repos has been extremely successful; I'm actually very happy with how much work is now being done completely out in the open by the different organization members. Similarly, I've had a lot of success unifying identity management through GitHub. Here you can see that getting onto the actual cluster is done using GitHub, and if you're a member of the organization with the correct team membership, you can actually get onto the cluster just by authenticating to GitHub. Here's the Trino space; most people who get onto the cluster are not cluster admins, of course. But it goes beyond that.
We are also logging into the JupyterHub environment itself using GitHub, so if I want to log back in, I click the same GitHub button and it lets me in just like that, because I've already authenticated to GitHub. The unified GitHub OAuth has actually been quite nice in terms of cleaning up the identity management. We've even got the Trino access control working the same way: we have a JWT token server, and to get onto that you also have to authenticate, and it gives you a JWT token (no peeking). And here's where we've already begun to interact upstream: we want to unify Trino's group definitions with our GitHub team identities, so I've been prototyping an actual Trino plugin that does group provisioning. It's extremely helpful; we're hopefully going to test this out and get it upstream directly into the main Trino repository.

I mentioned the Operate First management: we're actually managing the entire cluster configuration through GitHub. Here's just one example where I was resyncing our Trino user groups with the GitHub teams, and it just took the form of a pull request; in standard open source fashion we discussed it, talked about a few alternatives, and then it was eventually merged, and now the new group configuration is automatically deployed onto the cluster on merge. So the Operate First principles and the GitOps principles have been a definite success.

Now I wanted to talk about how we're actually leveraging the platform, and obviously one of the central features is the Jupyter environment. You can do pip installs, but I already have these packages installed on the image. We're managing actual credential storage with python-dotenv; if you're familiar with it, it's basically a way you can load credentials without having them show up in the code, and of course if you're storing your code, including notebooks, up on GitHub, keeping your credentials out is extremely important. Once you have these, you can connect to the actual Trino data service, and we can do things like connect to S3 buckets using boto3.

Let's now look at some data. Here we have some implied temperature rise, or ITR, data that I'm loading straight off our Trino database. Its job is to map different investments, so these are different companies, in the energy space or the non-energy space as well, that are associated with an implied temperature rise. Over here on the far right there's an achieved reduction targets column, so these companies are at various points of progress in achieving their targets, and each of them is associated with a standard identification number, an ISIN. We can make use of these keys to do some interesting things. Now, a completely separate dataset, contributed by the GLEIF organization, provides some key data services, like mapping these ISIN investment identifiers to the legal entities that actually issued them; the legal entity identifier, or LEI, is something we're planning to leverage heavily as a community resource for getting people to move their non-standard ways of tracking investments onto standard ones. And I see that time is passing, so: we also have a table that can map direct issuers, as you saw above, to ultimate parents. This is to track corporations and their subsidiaries' ownership, so that you can do a better job of tracking true climate impact up through company ownership chains.
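As a rough sketch of the connection pattern Eric describes, credentials kept in a .env file via python-dotenv, a JWT-authenticated Trino connection, and a query pulled into pandas, the code might look something like this. The host, catalog, schema, and table names are placeholders, not the actual OS-Climate deployment values.

```python
# Sketch: load credentials from a .env file, connect to Trino with a JWT
# token, and pull an (assumed) ITR table into a pandas DataFrame.
import os

import pandas as pd
import trino
from dotenv import load_dotenv

load_dotenv()  # loads e.g. TRINO_HOST and TRINO_JWT_TOKEN without hard-coding them

conn = trino.dbapi.connect(
    host=os.environ["TRINO_HOST"],
    port=443,
    user=os.environ.get("TRINO_USER", "demo-user"),
    http_scheme="https",
    auth=trino.auth.JWTAuthentication(os.environ["TRINO_JWT_TOKEN"]),
    catalog="osc_demo",   # placeholder catalog name
    schema="demo",        # placeholder schema name
)

cur = conn.cursor()
cur.execute("SELECT * FROM itr_scores LIMIT 10")  # placeholder table name
rows = cur.fetchall()
columns = [c[0] for c in cur.description]
itr_df = pd.DataFrame(rows, columns=columns)
print(itr_df.head())
```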
So what if we wanted to take these three datasets, which come from two completely different organization members, and get some kind of value out of them? Here I have a SQL query which maps the company names and their reduction progress to the LEI legal identifier and the legal identifier of the parent. I can run this query, and it gives me back a pandas data frame, and you can see that by the time it's done, if the join works, I get seven results. The parents are all empty, which is to say these are all top-level corporations. So the main value here is that I've gone from standard investment ISIN numbers to standardized LEI legal identifiers to get the achieved reductions. And of course there's nothing special about the SQL; what's really special is that we've already been able to federate datasets from two different community members to support value-add using standard data science techniques.

So now suppose I wanted to save this off: I can take the same query and save the result as an actual new SQL table. There it is, that's the query, and it should tell me that it wrote seven records... yes, I got all seven records. So now I've done some data science in a notebook. Again, as I told you, the Elyra Jupyter environment can do a lot of fancy things, and the fanciest is that you can take these notebooks, so here's the notebook I just showed you, copy it into a pipeline editor, and give it different dependency injections. What I can do is give it a new table name to output to for the pipeline, make sure I save the pipeline, and then run it. We're going to use Kubeflow Pipelines, give it my demo config, and it gives me a link to the actual pipeline console spinning up here; it'll show me an actual graph of the pipeline node. There it goes, so it's running, and I can look at its log output to see what it's doing. I want to stress that it's actually running the notebook I just showed you, exploratory data science and all; of course you could remove some of that to make things go faster, but the main message is that it literally gives you a seamless transition from data science notebooks into repeatable pipelines. And it should shortly give me a message that it succeeded... yes, the pipeline ran. So now if I go back, I can convince myself that I got a new data table from running that. Here we go, let's see if everything went correctly... yes. So that pipeline reran that entire logic off on the system, and I got a new table.
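For reference, the kind of federated join and "save it off as a new table" step Eric just ran might look roughly like the sketch below. All table and column names are hypothetical stand-ins for the ITR, GLEIF ISIN-to-LEI, and parent-mapping datasets, not the actual Data Commons schemas.

```python
# Sketch: join the (assumed) ITR table with the (assumed) GLEIF ISIN-to-LEI
# mapping and parent table, then persist the result as a new table via CTAS.
import os

import trino
from dotenv import load_dotenv

load_dotenv()
conn = trino.dbapi.connect(
    host=os.environ["TRINO_HOST"],
    port=443,
    user=os.environ.get("TRINO_USER", "demo-user"),
    http_scheme="https",
    auth=trino.auth.JWTAuthentication(os.environ["TRINO_JWT_TOKEN"]),
    catalog="osc_demo",   # placeholder catalog name
    schema="demo",        # placeholder schema name
)
cur = conn.cursor()

join_sql = """
SELECT itr.company_name,
       itr.achieved_reduction,
       gleif.lei,
       parents.parent_lei
FROM itr_scores AS itr
JOIN isin_to_lei AS gleif ON itr.isin = gleif.isin
LEFT JOIN lei_to_parent AS parents ON gleif.lei = parents.lei
"""

# Persist the joined result as a new table (CREATE TABLE AS SELECT); on an
# Iceberg-backed catalog this writes a new versioned, Parquet-backed table.
cur.execute("CREATE TABLE itr_with_lei AS " + join_sql)
print(cur.fetchall())  # Trino reports the number of rows written
```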
And we are at time. I could try to demonstrate the... if we have time... Superset is not cooperating. Well, that's okay, we're at the end of time, so I will close out the demo and hand it back to Vincent for closing remarks. Eric, we still have a few minutes if you need to finish the demo. Let me try, let me try one thing; I'll see if I can bring the link back up. You need to share your screen again, Eric. Sorry. That's okay. Are you seeing it? I think... yeah, excellent. Okay, well, I'll see if I can get to Superset here. There it goes.

Okay, so Superset is another great open source tool for building sort of low-code or no-code dashboards. Let's imagine I want to do a fast dashboard off of that data table I just generated. I have a predefined plot, and you can see here I actually plotted the achieved reduction metric against company name and then sorted it in decreasing order. So you can see that GlaxoSmithKline has almost completed its target reductions, and some of the others aren't doing so great. Barclays, shame on you. I shouldn't say that; I'm sure they're doing fine, this is just old demo data. But anyway, there you can see the entire pipeline, from exploratory data science to repeatedly executable pipelines through to actual end-user dashboarding and charting, all on the same open-source-based platform.

Oh, and I want to show the actual image. The container image I used to run that pipeline was also built off the OS-Climate repositories, using Red Hat's continuous integration build robots. Here you can see I basically filed a GitHub issue saying, hey, could you please build me the latest image, it's version 0.1.1, and the bots went off, built it, and put it up here on the Quay repository. You can see it was built yesterday, I think, and then it came back and said, oh, I successfully built it, here's the link, and that was the image I actually used to run this. And yes, that was the completion, so I can truly hand it off to Vincent.

Thank you, Eric. So let me conclude with some next steps and reference links. You can find out more about the OS-Climate initiative on the OS-Climate website under the Linux Foundation; the link is here for your reference. In terms of upstream projects, we have an Open Data Hub community page that shows the architecture as well as the ongoing effort to integrate more open source projects into the Open Data Hub initiative. And as you can probably infer from this presentation, we still have a lot of work to do to make this happen with the community, so if any aspiring data scientist wants to work with us on the latest open source technology and become a contributor, don't hesitate at all to reach out directly to us. I've actually put my personal email out there, and you can also contact Eric, so don't hesitate to reach out; we are actively integrating community members into this initiative now. Thank you everyone for your attention today.