My name is Junping Du, and my topic today is breaking down the new data silos in generative AI. First, a bit about me: I'm the founder and CEO of Datastrato, a company founded by three Apache members. Two of us are Hadoop committers and the other is a Spark committer. Before founding the company, I spent more than 15 years working in the data and open source industry. I'm a committee chairperson at the LF AI & Data Foundation, I previously worked at Hortonworks and Cloudera as a Hadoop guy, I'm a committer and PMC member on many Apache projects, and I mentor a lot of data and open source projects.

Let me do a quick survey: does anyone know the LF AI & Data Foundation? Can you raise your hand? Okay, several of you. When the foundation was created, it was called LF Deep Learning. We then changed the name from Deep Learning to AI to be more generic, and in 2020 we changed it to AI & Data when we merged with ODPi, the open source data foundation. From that name change to "AI & Data," you can see that data and AI are bound together, two sides of the same coin. Just like Jim mentioned this morning, the model and the data go hand in hand; we share the same perspective.

Of course, everybody is talking about generative AI now. This is a very exciting time, with a lot of interesting innovation happening: on top of super powerful foundation models, an ecosystem is being built in the form of different kinds of software layers and tools. ChatGPT is definitely a disruptive application, but beyond ChatGPT we'll see a lot of new applications in the next few years. The interesting thing is that the key user experience of this wave of applications will not be what it was in the old days of the Internet or mobile computing. Back then, what mattered was the UI (how fancy it was, how much functionality it had) or the performance. If you look at ChatGPT, its functionality and UI don't differ much from other chatbots. There are tens of thousands of chatbots, but what makes ChatGPT unique is something deeper: the model and the data. Data is the part under the water; the quantity and quality of your data determine how efficient and how smart your model is.

Coming back to data: because data is so important, when we look at it, we see it increasing year over year, and it keeps increasing. One of the top drivers behind that growth is applications. Over the past 20 or 30 years we've had the Internet, social media, mobile computing, and the Internet of Things; now we have the metaverse and generative AI. All of these applications produce more and more data, and no matter how the volume or type of data differs, I think there are some first principles for data: data has gravity. It tends to stay where it's created, it's hard to discover once it gets buried, and it's costly to move back and forth. So it's easy for data to get siloed.

The data silo problem is not a new problem; it's over 30 years old. Over 30 years ago, we started to un-silo data. At that time there was only structured data, and only database applications.
Back then, we used ETL to merge all the OLTP databases into a data warehouse, providing data marts for applications to consume. That's how we broke data silos over 30 years ago. But things changed after the Internet of Things rose: a huge amount of data came in, and a lot of unstructured and semi-structured data made the data warehousing solution neither effective nor efficient. So we invented something new, the data lake. Data lake technologies like Hadoop, and object storage like S3, can store and process data very cheaply. The good news is that you have a way to keep all your data together cheaply. The bad news is that there's no single source of truth anymore: no matter which engine you choose, it only handles part of your data, not a single source of truth for everything.

Around 2020, something new was invented, the lakehouse, which tries to combine what we had in the data warehouse with what we have in the data lake. It does a good job of merging data together into a common, standard table format that can be processed by multiple engines. But there are still a lot of new silos today that the lakehouse doesn't help with.

Just as infrastructure previously moved from on-premises data centers to the cloud, today a lot of enterprises are choosing multi-cloud as their strategy. That choice is not only about vendor lock-in; it's not just a technical decision but also a business one, and there are many cases where a company simply has to use multiple clouds. If a company wants to expand internationally, it may have to choose a global cloud vendor instead of a local one; and in M&A cases, an acquired company may use a different cloud vendor, so the acquirer has to deal with that complexity. But here's the problem: if the applications live in multiple clouds, the data has to live in multiple clouds. As I said, data has gravity, and the data stacks in different clouds don't work well with each other, so the data gets siloed across clouds. Some third-party tools have the same problem; think of Snowflake and Databricks: you wouldn't expect to query data from Databricks inside Snowflake in an efficient, fast way. The same is true for hybrid cloud; a lot of these cases create data silos.

If we look at the picture a little more broadly, not just multi-cloud but multi-region, things get even worse and more complicated, because beyond physical distance and technical silos, we also have to think about compliance. Many regions have their own policies on how to manage, govern, and process their data. That's why the lakehouse is not enough to handle today's data silo challenge: for cost, performance, and regulatory reasons, it's nearly impossible to unify all the data in this situation into one place.
If we step back and think about what we want to do in this multi-cloud, multi-region situation, where distributed data lakes are inevitable, most companies actually need a centralized place to govern their data and to run their analytics workloads. No matter which cloud or region your data lives in, you just want to analyze it or train on it. So, stepping back, we are trying to build something really innovative that we call the data stratosphere: a cross-cloud data fabric layer that takes care of all your metadata. We've also built a federated query engine that lets a single query work across heterogeneous data warehouses, making analytics efficient and effective. For data security and global acceleration, we want to provide a simple solution that you apply once, in one place, and it applies globally.

As a next-generation data infrastructure, it needs to provide a single source of truth for metadata. It should also provide smart governance, so you don't have to handle regulation and security problems yourself, and it's better to have a natural language interface, so the data science team doesn't have to spend effort writing complicated SQL. It also deals with today's ETL workloads: so much effort and cost is spent on ETL for data preparation and data pipelines, because in the traditional data warehousing approach we try to merge all the data into a single place. If we release that constraint, as I mentioned, we no longer have to maintain costly ETL that moves data back and forth across clouds.

Let's go into more detail. First things first: a single source of truth for metadata. Metadata is so important because it's data about data. Instead of centralizing the data, which, as I mentioned, is very costly and in most cases simply impossible, we centralize the metadata. That solves how to govern data centrally, how to secure data in one place, and how to achieve a single source of truth. So we have a metadata lake: whether the source is a data warehouse, a SQL database, Kafka topics, or even AI models, we can have one unified metadata lake.

The important thing is that once the metadata is centralized, we can build intelligence on top of it, and that intelligence is super important. In previous years, people loved Ferraris and Lamborghinis because the engine was fast; they wanted to drive very fast. Nowadays, and in the future, people care more about autopilot: driving safely while saving time and effort. Data platforms are the same. People used to care about how long an ad hoc query or an ETL workload took; in the future, we should pay extra attention to how much engineering effort we can save, like the time spent building ETL pipelines and the data science team's work writing SQL and scripts. That's why we bring intelligence to the data. We'll show a quick demo; let me introduce my colleague Lisa, the product manager at Datastrato, to show it. Thank you.
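To make the metadata lake idea concrete, here is a minimal sketch of registering heterogeneous sources under a single metalake through a Gravitino-style REST interface. The server address, endpoint paths, provider ids, and payload fields are illustrative assumptions, not the exact Gravitino API; consult the project docs for the real interface.

```python
# Minimal sketch (not the official client): register heterogeneous
# sources under one metalake via a Gravitino-style REST interface.
# Endpoints, provider ids, and payload fields are assumptions.
import requests

GRAVITINO = "http://localhost:8090/api"  # assumed server address

# 1. Create the metalake: the single top-level namespace for all metadata.
requests.post(f"{GRAVITINO}/metalakes",
              json={"name": "demo_metalake",
                    "comment": "one metadata lake across clouds"})

# 2. Register a Snowflake catalog (the HR warehouse in the demo).
requests.post(f"{GRAVITINO}/metalakes/demo_metalake/catalogs",
              json={"name": "hr_sf",
                    "type": "RELATIONAL",
                    "provider": "jdbc-snowflake",      # assumed provider id
                    "properties": {"jdbc-url": "jdbc:snowflake://..."}})

# 3. Register a Postgres catalog (the sales database on AWS) the same way;
#    Kafka topics or model registries would be further catalog types.
requests.post(f"{GRAVITINO}/metalakes/demo_metalake/catalogs",
              json={"name": "sales_pg",
                    "type": "RELATIONAL",
                    "provider": "jdbc-postgresql",     # assumed provider id
                    "properties": {"jdbc-url": "jdbc:postgresql://..."}})
```

The point of the design is that governance and discovery then happen once, against the metalake, rather than per engine.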
My mic is on. There's some really exciting stuff going on here, and I'm very excited to present this demo. We've talked a lot about data silos in an organizational context; how can we actually see that play out, especially across different departments? We want to demonstrate a scenario where an HR and a sales department run different databases: how can we effectively query them without the huge technical debt of building out ETL beforehand, and then how can we platformize this so that analysts can easily query it through a natural language interface? Let's play the video.

[Demo video] Welcome to the Gravitino UI. Here we have a default metalake, and that metalake contains two catalogs: a Snowflake catalog containing an HR database, with the typical tables you would find in that sort of database, and a Postgres catalog containing a sales database, again with the tables you'd expect. On the left-hand side you can see the multi-cloud metadata stored in Gravitino: the Snowflake database holds HR information, that is, information about employees, and in AWS we have the Postgres database holding data about sales.

First, we make a query against the sales database, a simple query to find the top 10 employees with the highest sales. We describe the query in natural language, and the SQL is automatically generated for us; the system understands the structure of the data and can infer relationships between tables based on field names. As well as generating the SQL, it can generate an explanation of the SQL. We then just run the query and get the results back. The next query is a multi-cloud query: we look up each employee's average performance rating and their total sales. For this we need the employee information from the Snowflake database and the sales information from the Postgres database. Again we use natural language to generate the SQL and run it, and the results are returned. Along with the results we can see a physical execution plan, which shows where the data comes from in each cloud provider and where it is joined together. [End of video]

Awesome, thank you everybody for watching that demo. "Welcome to the Gravitino UI"... not again; once is enough for us. I want to get into the demo architecture and decipher what we just saw. We've talked a lot about the data plane and creating that layer, but what you saw was just an NLP prompt querying it. The back end was Datastrato's project Gravitino, which is open sourced as of today. With it we had two JDBC catalogs, one for Snowflake and one for Postgres. Underneath sat a federated query engine that unified everything together, which was then used by a technology called Waii as our front end to query on top of. With this we can use the metadata lake really effectively, use text-to-SQL translation to query it, and join efficiently despite being in a multi-cloud environment. In this case we're skipping so much in the way of SQL scripts and the laborious work of building an ETL pipeline, and we're actually able to enable data discoverability and exploration in a really fun way that's accessible across organizations.
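As a rough illustration of what sits under the demo: the generated SQL is handed to a federated engine such as Trino, which spans both catalogs in one statement. This is a hedged sketch; the host, catalog, schema, and table names mirror the demo but are assumptions, not the actual demo code.

```python
# Sketch of the demo's multi-cloud query: employee ratings from the
# Snowflake catalog joined with sales from the Postgres catalog,
# executed through a federated engine (Trino here). Names are assumed.
import trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="analyst")
cur = conn.cursor()

# One statement spans both clouds; the engine plans the cross-catalog join.
cur.execute("""
    SELECT e.name,
           AVG(e.performance_rating) AS avg_rating,
           SUM(s.amount)             AS total_sales
    FROM hr_sf.hr.employees    e   -- Snowflake (HR)
    JOIN sales_pg.public.sales s   -- Postgres on AWS (sales)
      ON s.employee_id = e.id
    GROUP BY e.name
    ORDER BY total_sales DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```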
A lot of the magic here is in the metadata itself. The way this works is that the metalake service helps build a data fabric that sits across your entire organization. We showed two departments within one organization, but thinking about multi-cloud and cross-region environments, there's a lot of scale this could build out to. Right now, with Gravitino, people with authorization can have a global view of what would otherwise be really fragmented data, and just like in the demo, you can leverage that to empower people across your organization in a really fun, exploratory way.

A big reason we decided to open source: I'm sure you saw JP's background, and open source really is our core. Something like this inherently needs a lot of co-invention with the community. We want to work with people while we're still building the product, instead of offering something super developed that maybe has no real-world use case; we want to build the technologies, features, and connectors that people are using right now, and be really collaborative in that process. We take the open source community extremely seriously, and we want people to have fun with us: to reach out with casual questions, or when they're having install issues, we want to be right there. We've gone to great lengths to have community managers around and to create discussion platforms for people to engage with. We also want to build an open standard, so people don't have to worry about interoperability. Again, the whole point of this technology is to un-silo the data and avoid the vendor lock-ins and silos that otherwise tend to build up. If you're interested in Gravitino, the back end all of this sits on, we have a QR code that takes you directly to our repo. Give it a star, watch our development, create issues, engage with us; we're more than happy to work with people and to make sure it stays open. That's really where we're coming from.

A little more about Gravitino and what a unified metadata lake even is. We want something agnostic across many different database technologies: whether you're working with data lakes or data warehouses, documents or files, Kafka topics or models, we want people to have critical features that would otherwise be a large amount of technical debt in an organization, such as a single source of truth, data and AI catalog services, a geo-distributed architecture that scales well to the enterprise level, and a combination of on-site catalogs with off-site data governance. We want people to be able to secure their data and have a very solid idea of what is secured. The software supply chain is pretty well understood these days; as we move into the discussion of what a data supply chain is, how can we step even past that and secure things all in one go, at the one step that everybody understands to be the common source of truth? I think that's where a lot of headaches lie within organizations.
So this is our architecture. There are four main components, but I really want people to look at the object model. What we've done is build these layers that we want to continually iterate on and extend, and I hope that in a year's time it's built out horizontally like crazy, with an endless number of things we support. We wanted a core object model generic enough to be compatible with many different types of catalogs, able to connect to data warehouses, to documents and files, to your message queues. On top of that sits the interface layer, which provides a RESTful, Thrift, or JDBC interface that other catalog services can use as well: something standardized, something people already have in their stack. And on top of that we add a layer of functional data governance, providing ACLs, optimization, lineage, and all the other possibilities that stem from having a solid, trusted metadata layer; the possibilities there are endless as well.

We went through a lot of design choices for Gravitino. With a project this ambitious there are so many architectures you can run through, and finding the right one was really hard. Again, we kept the goal of a single source of truth at the forefront, especially for metadata. First, the flexible object model: we wanted it suitable for all different types of data, because complexity has grown so much in this space in terms of what's possible and in terms of machine learning, and we wanted it pluggable with different types of catalogs. Second, instead of gathering all the metadata into one place, which would just mean more processing and more storage, we use a direct-management mode: we manage the underlying data sources, which gets rid of consistency issues. Third, we built a geo-distributed architecture, especially for cross-region joins, which can be really difficult; we want it easily deployed, so that across regions and clouds you feel fully supported, the whole organization has one single source of truth, each catalog request is routed to the location that fits best, and fragmentation disappears as a byproduct. Fourth, we support multiple engines: a lot of our developers have core backgrounds in Hadoop and Hive, and many of us have strong Spark backgrounds too, but we realized we should be engine-agnostic to create the ultimate flexibility. So Gravitino supports multiple engines such as Trino, Spark, and Postgres, and we hope to keep adding more, open source or commercial. Again, we're really after interoperability, and that's what makes this work at the end of the day. And last but not least, we wanted to enable security in a standardized, single-source way, where all of your data governance can be done in one single plane.
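One way to picture the layered object model described above, with the metalake at the top and catalogs, schemas, and tables beneath it, is the following hedged sketch. These are illustrative dataclasses, not Gravitino's actual classes; real catalogs would be pluggable connectors rather than in-memory lists.

```python
# Illustrative sketch of the metalake -> catalog -> schema -> table
# hierarchy; not Gravitino's real object model.
from dataclasses import dataclass, field

@dataclass
class Table:
    name: str
    columns: dict          # column name -> type, e.g. {"id": "bigint"}

@dataclass
class Schema:
    name: str
    tables: list = field(default_factory=list)

@dataclass
class Catalog:
    name: str
    provider: str          # e.g. "jdbc-postgresql", "hive", "kafka"
    schemas: list = field(default_factory=list)

@dataclass
class Metalake:
    name: str
    catalogs: list = field(default_factory=list)

    def find_table(self, path):
        """Resolve a 'catalog.schema.table' path to a Table, if present."""
        cat_name, sch_name, tbl_name = path.split(".")
        for cat in self.catalogs:
            if cat.name == cat_name:
                for sch in cat.schemas:
                    if sch.name == sch_name:
                        for tbl in sch.tables:
                            if tbl.name == tbl_name:
                                return tbl
        return None
```

The generic hierarchy is what lets one governance layer and one interface layer sit on top of very different sources.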
So we looked at the Gravitino UI in the demo, but we didn't look at it from the command line, and I'm sure a few of us are curious what usage looks like there. We create a metalake and a metadata catalog, and a schema and a table can be attached to them really easily. In a typical case, a company would have one metalake encompassing everything in its data, and would then create catalogs underneath it; usually different business groups get different catalogs, so we wanted to introduce that sort of flexibility, with different schemas under those as well. Once everything is set up, you can query your table with the catalog extension, bind it to the execution engine (let's say Spark in this case), and apply a job configuration to the different catalog services. One thing worth mentioning is that this is a RESTful catalog, so it's really convenient to access and to integrate into another service, compared to, say, a JDBC catalog. Going through these steps, you can see that we create the metalake, create the catalog under the metalake, create the table, and then use Spark to query it. It's nice and easy to do in an intuitive way, without having to learn too much ahead of time.
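A hedged sketch of that flow from code, assuming the catalogs registered earlier: bind Spark to the Gravitino-managed metalake and query a table through it. The plugin class and configuration keys vary by Gravitino release and are placeholders here, not verified settings.

```python
# Sketch: bind Spark to a Gravitino metalake and query through it.
# The plugin class name and config keys are placeholders; check the
# Gravitino docs for the exact settings in your version.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("gravitino-demo")
         # Assumed connector settings; names differ across releases.
         .config("spark.plugins",
                 "org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin")
         .config("spark.sql.gravitino.uri", "http://localhost:8090")
         .config("spark.sql.gravitino.metalake", "demo_metalake")
         .getOrCreate())

# The table was registered once in the metalake; any bound engine can use it.
spark.sql("SELECT * FROM hr_sf.hr.employees LIMIT 10").show()
```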
These are some of the milestones we're looking at going forward. December 2023 is today: we've released Gravitino 0.3, which supports multi-engine adoption and metadata operations, as well as Iceberg catalogs if you've used Iceberg before, and we've open sourced it. We want people to be really involved in this process while we're developing it. We're willing to do all the hard implementation work, but we want to know where people are at, and we want to support data engineers across different organizations, especially in the enterprise, as much as possible. In March 2024 you can expect our beta release, which includes the access control layer, security, and heterogeneous catalogs, and our general availability release is happening in June, which will be production ready, compliant in many ways, and will offer HMS drop-in replacement as well. We're really looking forward to seeing how this scales and grows over the next year; our team is working incredibly hard, really believes in this vision, and wants continuous feedback along the way.

So this is our vision for the future. As everybody moves into multi-cloud-native architectures (and that might not be everybody, but it may be a lot of folks who find themselves in messy data situations), we want to implement single sources of truth for metadata and have them leveraged across many different functions, whether that's DataOps, data engineering, or machine learning. Maybe we won't completely get rid of ETL, but we can simplify it so much that data becomes accessible across organizations and playful again, without having to deal with all this under-the-hood infrastructure just to get one query that maybe we realize we didn't need after all; something easier to work with and more flexible. Again, we want to pair with platform engineers and develop a platform where people can bring their insights to their data, have the data itself be intelligent, and have it be really easy to work with. You saw that with our demonstration with Waii. Hopefully everybody enjoys this cute little picture; I really love it. And we are hiring: if you're interested, you can email hr@datastrato.com or visit our website at datastrato.ai and go to the careers section. Thank you so much. I know the end is near, so thank you. Are there any questions?

[Q&A] Yes, there's a lot related to partitioning. We think partition information is actually not data, it's metadata, so we have metadata describing the partition details.

[Audience] I'm not referring to partitioning itself, but to how the data is stored (it could be one-to-many or one-to-one among the attributes) and how the query optimizer can produce a better query using that when the data is present.

The thing is, we believe a lot of query engine optimization today mixes the data, the metadata, and the engine too tightly. That means once you store data in one engine, it's very hard for another engine to consume it, because of all the shortcuts and tricks inside. We think this kind of optimization, along with the metadata, should move out of the engines so that it can be shared by multiple engines. We definitely want multiple engines, because in multi-cloud situations no single engine fits, so we want to move this kind of information out of the individual data engines.

[Audience] But then don't you think you'll have performance challenges? You're deferring it to a later stage: when a query comes in from the natural language demo you've shown, it may not be an optimal query, and you still need to figure out how to optimize it.

That's a good question. Firstly, there's SQL translation efficiency. Secondly, compared with ETL, where a lot of the work is done ahead of time, we're shifting to ad hoc queries, which will have lower performance, but that's only for today. In the future you might still maintain some ETL pipelines for high-frequency data requirements; but for ad hoc queries, say a business leader or CXO has an idea and wants to ask a few questions, that's an infrequent data access requirement, and that's where the power of the demo shows. There's also a lot of potential optimization we can do, both on the SQL translation side and in how we execute the physical plan. If we notice in the physical plan that we keep merging data from certain clouds and places, and that some tables are consumed a lot, we can automatically build a cache or an index. As we said, we have the data stratosphere layer across everything, so we can put in a view, even a materialized view, to accelerate it.
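To illustrate that last point, here is a hedged sketch of materializing a hot cross-cloud join near the engine, assuming an Iceberg-backed Trino catalog (called "lake" here) that supports materialized views; all names are illustrative.

```python
# Sketch of the acceleration idea: once a cross-cloud join is known to
# be hot, materialize it in a local, Iceberg-backed catalog ("lake" is
# an assumed name) so later queries avoid crossing clouds.
import trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="admin")
cur = conn.cursor()
cur.execute("""
    CREATE MATERIALIZED VIEW lake.analytics.employee_sales AS
    SELECT e.name, SUM(s.amount) AS total_sales
    FROM hr_sf.hr.employees    e
    JOIN sales_pg.public.sales s ON s.employee_id = e.id
    GROUP BY e.name
""")
# Subsequent reads hit the materialization instead of both source clouds.
```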
[Audience] Is the join happening on your side? Are you pulling the data in and then joining it on your side, which means you've brought the data over the wire to do the join?

Not exactly. Waii only takes care of the natural language input and generating the SQL. Once the SQL is pushed to us, we have a federated query engine, which is what the physical plan in the demo showed, that decides how to execute it.

[Audience] So you execute it, and you have to somehow combine all the answers to bring back a single answer?

Yes, that's true. But rather than pulling the data in from many places, we do operator pushdown, pushing work down to the different engines in the different clouds at the bottom. That's the most efficient and fastest way to do it.

[Audience] I had a question: for your metadata lake, can we use anything, like Oracle or SQL Server?

Yes, we can add many of these, like you said: Oracle, MySQL, and so on. From the beginning we added Postgres, and MySQL is well tested. Because we're an open source project, we and our community members will deliver the rest, and eventually we'll support most of them. Thank you.

All right, thank you all for the questions. Thanks.