All right, I think we can start. Good afternoon, good morning, good evening, wherever you are in the world. My name is Ricardo. I'm a senior software engineer at Red Hat, specifically on the Red Hat OpenShift Data Science (RHODS) product. A bit more about me: as I just said, I work as a senior software engineer on RHODS, and I'm currently a master's student at the University of São Paulo. Because of work and school, I'd say I'm a big fan of big data and data engineering. I'm also a piano student. And as you can see, I'm Brazilian, so feel free to ask me about capoeira, cachaça, or caipirinha. So, what we'll discuss today: we'll talk a little bit about Open Data Hub, which is the open source project behind RHODS. My main skills are in data engineering and data governance, so I'll talk a little about both topics, and we'll discuss a bit about Spark and Spark SQL, Trino, data analytics, and the future. One second, please, I'm having a problem with the slides. Something happened when sharing; I'll share the whole screen instead of just the browser window. OK, we can see the image now. Sorry about that, folks. So that's the picture, and it basically represents the whole Open Data Hub project. Open Data Hub is meant to be a hybrid cloud AI/ML platform: a complete platform where you can manage the whole data lifecycle, from ingesting the data, transforming it, creating models, running experiments, deploying them as a service, and gathering metrics and results.
So in Open Data Hub we can talk about transforming data, creating models, running experiments, deploying them as a service, and gathering metrics and results. The focus of this talk is transforming data, not creating models, because that's the data scientist's responsibility; the idea here is to talk about the data engineering tasks. All right, moving on with the slides, and now I suppose we won't have any problems sharing the other pictures. So, what's data engineering? Data engineering is a new and cool discipline within the overall data science field, and it's about ingesting data, transforming it, managing the metadata when you need to, storing it in better formats, and eventually visualizing it. One of the biggest skills data engineers need is software engineering, along with SQL and data warehousing skills. OK, so for data governance, just like any other body of knowledge, we have one specific to data management, which we call the DMBOK, the Data Management Body of Knowledge. The DMBOK lays out all the disciplines around data governance, and I'm emphasizing here some of the things Open Data Hub has been working on through all these quarters. We're working on a data architecture for storing data, enforcing data security, and managing data in a way that enables data warehousing and business intelligence. OK, so for our data architecture we started with a few requirements. The first is performance: it must perform well with multiple sources of data, all of them part of one single data architecture, in multiple formats, providing structured data built from unstructured or semi-structured data.
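The ingest, transform, and store steps described above can be sketched in a few lines. This is only an illustrative toy, not anything from Open Data Hub itself; the field names and the CSV-to-JSON conversion are invented for the example.

```python
import csv
import io
import json

# Hypothetical raw input: semi-structured CSV, as it might land in a bucket.
raw = """id,name,signup
1,Alice,2021-03-01
2,Bob,2021-03-02
"""

def ingest(text):
    """Parse the raw CSV into dict records."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(records):
    """Normalize types: ids become integers, names lowercase."""
    return [
        {"id": int(r["id"]), "name": r["name"].lower(), "signup": r["signup"]}
        for r in records
    ]

def store(records):
    """Serialize to a better (structured, typed) format, JSON lines here."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)

print(store(transform(ingest(raw))))
```

In a real pipeline the ingest side would read from object storage and the store side would write a columnar format like Parquet, but the shape of the workflow is the same.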
Security is the biggest requirement and the biggest challenge we've had and are still facing: enforcing confidentiality on data, getting the right people and the right teams access to the right data, and deciding whether data, or parts of it, can be shared publicly. With that comes policy management, which stands out given all the data protection laws around the globe, so that you can have global management of all the data architecture components. That was the data engineering side. For data governance, we were looking to have multiple sources of data as part of that data architecture, where ETL workflows may not always be the best solution. So those sources should not need to be changed; they just need to be added into the architecture so the data can be queried at the right source. We're not talking about transforming data just to move it from one source to another. Also, what we expect from leveraging this architecture for a broad data governance solution is global data security, no matter what the data assets are. By the way, "data asset" is a DMBOK term that says every entity representing data is a data asset: it can be data stored in an S3 bucket, a table in a relational or NoSQL data store, or a workflow. Whatever it is, everything is a data asset. So that's the idea of global data security: the right access to the data assets for an individual or a team. Also, as we were talking about moving data through what we call transformations, like transforming an S3 bucket into a table, these transformations must be documented, or at least tracked, by some solution. That's what we call data lineage. And finally, since we're talking about a vast amount of data coming from multiple sources in multiple formats, whether structured, semi-structured, or even unstructured, we must ensure data quality no matter what the data is or how it looks.
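The lineage idea above, tracking every transformation between data assets, boils down to recording edges in a graph and walking them back. A minimal sketch, with invented asset names and assuming the graph has no cycles:

```python
# Minimal lineage tracker: every transformation between two data assets
# (a bucket, a table, a workflow, anything) is recorded as an edge.
lineage = []  # list of (source_asset, transformation, target_asset)

def record(source, transformation, target):
    lineage.append((source, transformation, target))

def upstream(asset):
    """Walk the recorded edges back to every asset that feeds `asset`.
    Assumes the lineage graph is acyclic."""
    parents = {s for s, _, t in lineage if t == asset}
    for p in set(parents):
        parents |= upstream(p)
    return parents

# Hypothetical flow: an S3 bucket is turned into a table, then aggregated.
record("s3://raw/events", "csv-to-parquet", "warehouse.events")
record("warehouse.events", "daily-rollup", "warehouse.events_daily")

print(upstream("warehouse.events_daily"))
```

Real lineage tools store much richer records (who ran the transformation, when, with which code version), but the graph-of-assets model is the core of it.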
So Forbes has a good article about the three pillars of real-world data engineering, which are metadata, data lineage, and data quality. In order to reach data lineage, you must have a good metadata management system, which gathers information about all the data you have in your organization. And well-structured metadata and lineage, with a good process for both, end up giving you a good data quality process. All right, so the question we asked a couple of years ago was: how do we work with multiple sources of data, enforcing access and data quality, with a central management console for metadata? One other requirement I didn't put in that question: this must be a hybrid cloud solution. So here comes our first architecture, which we called the data catalog, and more specifically one component we call the Spark SQL Thrift server. Spark is known as a good framework for parallel, distributed data processing, so we decided to use Spark, as we already had a good amount of development effort in the radanalytics project, and we used all that experience to create an architecture around Spark. All we needed was to add this additional component, the Thrift server, which is just a Spark application that mimics the Apache Hive Thrift endpoint. Because of that, we have a separate Hive metastore where the Thrift server stores all the metadata about the tables created by the data catalog architecture. The Thrift server is then used as a central endpoint for clients, the data analyst in Superset or the data engineer in Hue, to reach the data through a Spark cluster querying the data lake. This was good, but then we had our first problems: although we know about Spark's capabilities for processing data, we couldn't get the same performance for querying data.
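The separate Hive metastore in that setup holds only table metadata, such as name, storage location, format, and schema, which the SQL endpoint resolves at query time while the bytes stay in the data lake. Roughly the idea, with invented names and a dictionary standing in for the real service:

```python
# A Hive-metastore-like registry: tables are metadata only; the data
# itself stays in the lake and is resolved at query time.
metastore = {}

def register_table(name, location, fmt, schema):
    metastore[name] = {"location": location, "format": fmt, "schema": schema}

def resolve(name):
    """What a SQL endpoint needs in order to plan a scan of `name`."""
    meta = metastore[name]
    return meta["location"], meta["format"]

register_table(
    "sales",
    location="s3://lake/sales/",  # hypothetical bucket path
    fmt="parquet",
    schema={"id": "bigint", "amount": "double"},
)

print(resolve("sales"))
```

This is why the metastore can be shared between engines later in the talk: any engine that understands the metadata can scan the same underlying files.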
We had some suspicions about that additional component, the Thrift server: that it might be a single point of failure, or require as many resources as the Spark cluster itself. But no matter what we did, we couldn't find a good combination of resource configuration between the Spark cluster and the Thrift server to get good query performance. At this point we're not even talking about billions of rows; with just millions of rows, we didn't get good performance. One of our biggest problems, and our biggest concern in adopting the Spark architecture, is that security in Spark is not implemented by design, and most of the security aspects we needed for our overall data catalog and data governance solution would require custom development. We did most of it, but not with the global security we expected. Also because of that, we had no way to enable policy management throughout this architecture. With those problems, we looked for other architectures that could solve the biggest ones, and we found another project called Trino, formerly known as PrestoSQL, now the open source project backed by Starburst. One of the biggest differences in the architecture is that we don't need an intermediate component like we did with the Thrift server. Users, the data analyst in Superset or the data engineer in Hue, can query the cluster directly, with no intermediary component in between. With that, we also separated the Hive metastore from the cluster, and we can connect other clusters to it, even a Thrift server, so we can have a mix of Trino and Spark in some situations. All right, so we got a better solution, and also better problems. That doesn't mean that with Trino we implemented everything we need for data governance, but at least for performance we noticed much better scaling when querying data.
Thanks to that, a query test that used to take around 30 minutes can now be done in 30 seconds, which is a huge improvement. But we didn't run experiments with multiple sources of data; we used a single source. And by multiple sources I mean not only multiple S3 buckets, but mixing S3 with relational and NoSQL data stores. That would be a good combination for us, but we haven't experimented with it yet. One important thing for performance in Trino is that statistics must be collected for all the objects it creates, because with no statistics we'd end up with the same performance as the Thrift server. We also got better security. Well, actually, we simply have security in Trino, because we don't need to develop anything additional. We can create a good ACL configuration for all the objects managed by Trino, even granting partial access to data by masking sensitive columns, or letting a user query only specific data through a row filter. But still, it's only applicable to data assets managed by Trino. With that said, policy management got better in things like integrating with Superset and Hue, where you can use those two components for policy management. We have better integration because Trino doesn't try to mimic Hive; it exposes its own endpoint, without the limitations that the Hive emulation imposed on those specific clients. So it integrates better with both components, but still, there's no global policy management. That's one of the things we're working on currently. So what would be our next big thing? We're looking to enable a full data governance solution by using Open Policy Agent to provide global data security, through a plug-in in Trino that we'll need to develop. And with Open Policy Agent we can connect Apache Atlas to provide a data cataloging solution, using a GUI like Amundsen.
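The column masking and row filtering mentioned above are the kind of per-user, per-table rules that Trino's built-in file-based access control expresses as JSON. The same idea can be sketched in plain Python; the users, tables, and columns here are all invented for illustration:

```python
import re

# ACL-style rules in the spirit of Trino's file-based access control:
# per-(user, table) column masks and row filters (all names invented).
RULES = {
    ("analyst", "customers"): {
        "mask": {"ssn": lambda v: re.sub(r"\d", "*", v)},   # hide digits
        "row_filter": lambda row: row["country"] == "BR",   # row-level filter
    },
}

def query(user, table, rows):
    """Apply the user's rules to a table scan; deny if no rule exists."""
    rule = RULES.get((user, table))
    if rule is None:
        raise PermissionError(f"{user} may not read {table}")
    out = []
    for row in rows:
        if not rule["row_filter"](row):
            continue  # row hidden from this user
        masked = {c: rule["mask"].get(c, lambda v: v)(v) for c, v in row.items()}
        out.append(masked)
    return out

rows = [
    {"name": "Ana", "country": "BR", "ssn": "123-45"},
    {"name": "Eve", "country": "US", "ssn": "987-65"},
]
print(query("analyst", "customers", rows))
```

The point of moving this logic into a policy engine like Open Policy Agent is exactly what the talk describes: the rules become a documented, centrally managed artifact instead of per-engine configuration.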
And for data quality, we're still looking at other solutions, but maybe Kubeflow Pipelines will do it. Again, we'd still have the problem that data is sent to a central place for processing rather than being processed in a distributed fashion, but we're looking at other integrations, like ODF or other solutions that can cover multiple data repositories. For global data security, we believe that with OPA added to the solution, our proposal for data security becomes real: we can have a documented set of rules that grant or deny partial or full access to data, which will be good. For data lineage, we're still looking for solutions, but we'd like to see solutions that could be added to this architecture to do some kind of data versioning, as well as keep track of all the transformations made to the data assets. That would be a good thing. As for data quality, like I said, we can ensure data quality through workflows, but not yet in an automated way. So this is one proposal we have for the data governance story in Open Data Hub, and spoiler alert: there's only one person who knows about this roadmap, and that's me, so it's not guaranteed we'll stick to these dates. We want to finish our idea of a global data security solution, maybe with OPA, Trino, and Superset. Then we'll move on to the data cataloging area, where we'll finally have a good UI for users to look for the data available in the organization, see who the owners of that data are, request access, and maybe have an automated way to approve data access for that specific person. We also want to add data lineage and data quality, and at last we're looking at some data mesh solutions to implement data governance at scale. All right, that's basically it. My idea here was just to show all the experience we've gained in the Open Data Hub work within the data engineering and data governance areas. If you have any questions, just let me know. Thank you so much.
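The "data quality through automated workflows" idea above is essentially a set of declared expectations run against every batch, failing the pipeline when a rule is violated. A minimal sketch, with made-up rules and records:

```python
# Data quality as a workflow step: declare expectations once, run them
# on every batch, and report (or fail on) any violations.
CHECKS = [
    ("id is present", lambda r: r.get("id") is not None),
    ("amount >= 0", lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0),
]

def run_checks(records):
    """Return a list of (record_index, failed_rule) pairs; empty means clean."""
    failures = []
    for i, record in enumerate(records):
        for name, check in CHECKS:
            if not check(record):
                failures.append((i, name))
    return failures

batch = [{"id": 1, "amount": 10.0}, {"id": None, "amount": -3}]
print(run_checks(batch))
```

Tools like the pipeline-based solutions mentioned in the talk do the same thing at scale, with the important difference that the checks run automatically on every ingest rather than on demand.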
So there's a question in the chat from Eric: is there any solution in the roadmap for managing user identity uniformly, SSO style, across all these tools, like Superset, Trino, OpenShift, Pachyderm execution, et cetera? We do. Sorry, we've been designing some proposed architectures for global authentication and authorization among these components too. We're initially looking at ways to integrate all these components with OpenShift's own authentication, but we're also looking at other solutions, like using a service mesh. We're still in the design stage on it, so it will take some months until we have a definitive solution, but yes, we're moving toward a global authentication solution. Well, we still have six minutes for more questions; let's wait a bit, as people might still be typing their questions. Yeah, that sounds good. And by the way, I'm sorry about the background, but I'm still finishing my new apartment. At least there's something good to see: my piano. No, I mean, as a college student, I think your apartment looks fantastic. Yeah. Well, Ricardo is doing his master's, that's right. I think there are no more questions, but if you want to reach out, I'll send my email in the chat, so you can ask me anything about our efforts to enable data governance on Open Data Hub and how it grows. All right, thank you so much, Ricardo. Thank you. All right, bye, folks.