To be very blunt, very honest, and no BS: if you're not using generative AI for optimizing your tasks at work and in your personal life, you're just staying behind. I mean, at work, you're a loser if you don't use this, because your competitors are going to use it and they're going to beat you.

Hi, this is your host, Abhinav Bhartiya, and welcome to a brand new episode of Let's Talk about AI. Today we have with us Juan Sequeda, Principal Scientist and Head of the AI Lab at data.world. Juan, it's good to have you on the show.

Thank you very much. I'm glad to be here.

I would love to learn a bit about the company and also your own background. What do you do, and what do you folks do?

So as you mentioned, I'm the Principal Scientist and the Head of the AI Lab at data.world. A bit about myself first: I wear two hats, my academic hat and my industry hat. I'm a scientist by training and by heart. I did my PhD in Computer Science at the University of Texas at Austin, and I continue to do research in the area of data and AI. That's what I've been doing for the last 15 years. And for the last decade, I've been in the data industry, really just moving our research into practice. That's how I ended up at data.world: I sold my previous company to data.world about four years ago.

So who are we? Data.world is an enterprise data catalog platform, and we're built on a knowledge graph architecture. What does a data catalog platform do? We help organizations and enterprises organize their data so they can be AI ready. Specifically, there are three applications that we focus on. The first one is search and discovery, so consumers can find the data that they're looking for. The second one is governance: really making sure that the data is well managed, that you know where there's PII data, who can have access to data, who can request access, and so forth. So that's more for data stewardship.
And then the third is data ops, or data lineage: being able to manage and understand the health of your data, to understand whether a pipeline is broken and why it broke, to notify users, and so forth. So the data catalog platform really provides the key context that an enterprise needs to build accurate and trustworthy AI systems. I will talk about AI later. So that's what data.world does.

How have you seen the evolution of data? And then we can talk about AI, which we can start calling traditional AI, because now the shiny object in this space is generative AI.

I'd like to say that if we look at the history of data, the principles have stayed the same. I consider there to be three main principles within data: you move data, you store and compute data, and you use data. Those principles have been the same, but we've seen evolutions of each. ETL, ELT, batch, streaming: it's all still the concept of moving data. Then you have storage of data: databases, data warehouses, OLAP, data lakes, lakehouses. All of that is essentially the same thing, right? Then you use data: BI reporting, dashboarding, traditional ML, and today generative AI. It's still just more data usage. So I think that has been the evolution of data, and these pillars have existed since kind of the beginning of data management, probably 30 years ago.

Now, a layer that has been in existence, but has been treated as what I call a second class citizen, is the metadata. It's the data about the data: the data dictionaries, the tables and columns. And because all this data moves across these different pillars, it's something you want to keep track of.
And that metadata layer, I would say up to 2018 or 2019, really was a second class citizen, an afterthought. With the data catalog, the metadata management space is starting to come to the forefront, and that's what we're seeing now with the data catalog category. Even metadata management itself has been partitioned into many different categories and, frankly, features are being turned into companies and products when they're really just a feature. But what is really exciting now, seeing the evolution to AI, is that metadata is the key context needed for AI applications, because that context is really the knowledge about your business. If I just give you pure data, if I give you a number, what does that number even mean? You need the context around it to give it a meaning. And then you know what to go do with that data.

If you look at this space, consider how intimidating it is for new organizations who do want to tap into their data, who do want to add value, who do want to create services. They look at this complexity, and at the same time there is a big gap between the demand for and the supply of data engineers, data scientists, folks who know about data. So talk about how you look at this situation, where you see data emerging into the front seat but there are all these challenges. Or do you feel that, yes, the challenges are there, but we are very well placed to help companies? That's what I'm trying to understand. Does that question make sense?

Yeah, it does. So let me share my perspective on this. First of all, I think one of the things that happens is that we say we're in a data-driven world, but I would actually argue that we live in a data-first world, which means: give me more data. I need more data. We need to extract more value from data. Data, data, data, data, data. And I'm like, okay, hold on.
So you're telling me that you can't solve that problem yet because you lack data, and if I give you more data, you'll solve the problem? I think we're hitting a wall there. A change needs to happen, and that change is what I'm calling the move from a data-first world to a knowledge-first world, where you need to start thinking not just about the data, but about all the context around it. A knowledge-first world is people first, context first, relationships first. And this is not just a technology change, meaning you need to treat metadata as a first class citizen. It's also a people and process change. So it's really a sociotechnical paradigm shift that we need to get into to make that change.

We've been solving all the issues like, can we do business reporting? Yeah, of course. People have been doing business reporting and all that stuff for 30-plus years. That's a solved problem. Can we make it more efficient? Definitely, we can always make those things more efficient. But to really tap more into the value of the data, we need to get into the knowledge.

Generative AI is the perfect scenario to showcase this. If you look at these large language models, they have knowledge about the general world. They have no knowledge about your organization. So this is the moment when we need to start organizing the knowledge about our organization. And guess what? The large language models themselves can actually help us figure out how to start cataloging all that knowledge, all those business processes within the organization. And we want to be able to combine that context, the knowledge of your business, with the large language models.
And I think that's why one of the big trends we're starting to see right now is that, with all the generative AI and large language models, you want to combine them with knowledge graphs, with all that context from the organization, because the knowledge graphs represent that key context, the knowledge of your organization. And that's what is needed to use large language models to provide trusted, accurate, and explainable answers.

Can you also take a few moments to elaborate and explain what a knowledge graph is? Because, once again, we use jargon, we use keywords, but folks sometimes don't actually understand what they mean. So what is a knowledge graph, and how does it actually help with some of the areas that you write and talk about?

A knowledge graph is a representation of real-world concepts and the relationships between them, and it happens to be in the form of a graph. Everybody interacts with knowledge graphs without even knowing it. You go to Google: Google search underneath is now run by a knowledge graph. If you search, for example, for Austin, on the side you'll get a panel that says, oh, Austin is a city, it's the capital of Texas, here are some images, here are some events going on. So you realize that a city can be the capital of a state, there are events occurring, there's weather. Those are the real-world concepts that you start connecting together. All the big tech companies use knowledge graphs under the hood to integrate all that data and knowledge at scale. This is how Netflix runs a lot of its recommendations. Amazon does the same thing: you have taxonomies, you're representing knowledge about all these different concepts and how they're related. So let's compare that to what people commonly know, like relational databases, right? Let's look at this.
So relational databases have been around for 30, 40-plus years. Their superpower is to deal with the known use cases of today: powering applications that have very specific inputs and very specific outputs, where you're structuring the same data over and over again, right? You have the same type of repeated analytics. That's what relational databases are great for. Their weaknesses are in flexibility and extensibility, in how easily you can add things. If I wanted to add more stuff into a relational database, you can do it, but it's just more work that needs to be done. You lack that agility. Knowledge graphs enable you to deal with those known use cases of today and also the unknown use cases of tomorrow, because they give you that flexibility. The graph itself, mathematically, is just a very flexible data model. You can just add more nodes and edges to the graph to extend it.

So think about the examples we see when we're bringing all our data and our metadata together. I want to say, hey, this dataset is part of this database. But later I also want to say that these people use this database. Oh, there is code in GitHub for this application that uses this data. This data is now being used in this decision, and these people also need to be part of this decision. So you start expanding this little by little. At the beginning, you were probably only tracking which data exists in which database, because you were struggling with the issue of "I don't even know what data I have." Later on, you're saying, oh, I want to figure out what application code I have in GitHub that uses that data. I want to know which people are using which data to drive what decisions. You start expanding that, and naturally you start thinking about this as a graph.
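
As a rough illustration of the "just add more nodes and edges" point above, here is a minimal sketch in plain Python. It is not data.world's implementation; the `KnowledgeGraph` class, the relation names (`part_of`, `uses`, `reads`, `informs`, `participates_in`), and all entity names are hypothetical, chosen only to mirror the example in the conversation.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Tiny in-memory graph of (subject, relation, object) edges (illustrative only)."""

    def __init__(self):
        # subject -> set of (relation, object) pairs
        self._out = defaultdict(set)

    def add(self, subject, relation, obj):
        """Add one edge; new node or relation types need no schema change."""
        self._out[subject].add((relation, obj))

    def objects(self, subject, relation):
        """Everything that `subject` points to via `relation`."""
        return sorted(o for r, o in self._out.get(subject, ()) if r == relation)

    def subjects(self, relation, obj):
        """Everything that points to `obj` via `relation`."""
        return sorted(s for s, edges in self._out.items() if (relation, obj) in edges)


kg = KnowledgeGraph()

# At the beginning: only track which dataset lives in which database.
kg.add("orders_table", "part_of", "sales_db")

# Later: people, application code, and decisions are added as new edges.
kg.add("alice", "uses", "sales_db")
kg.add("billing-service", "reads", "orders_table")
kg.add("orders_table", "informs", "q3_pricing_decision")
kg.add("alice", "participates_in", "q3_pricing_decision")

print(kg.subjects("uses", "sales_db"))        # who uses this database?
print(kg.objects("orders_table", "informs"))  # which decisions use this data?
```

The design choice worth noticing is that the second batch of `add` calls introduces entirely new kinds of things without touching what was already stored, which is the agility being contrasted with a fixed relational schema.
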
It's just kind of what you end up drawing on a whiteboard. You usually end up drawing graphs. And that's where knowledge graphs come in. That's why they're so powerful for integrating all the data and metadata that comes at scale within an organization.

What role do you folks play? As we were talking about earlier, the complexity could be overwhelming, and there is a skills gap. How do you help organizations to, once again, deal with the complexity and, as you said, the "data, data, data" mindset, to actually identify how they should deal with data, and help them move forward and focus on building business applications versus getting stuck in this data labyrinth?

Yeah, so for the applications: we consider three types of applications, kind of the core applications that our customers use. The first one is search and discovery, right? The personas here are the consumers of data: I need to search for data because I'm trying to solve a problem. So that is one of the tasks that we do. The second one is governance. It's making sure that we understand where there's proprietary information, where there's PII, right? Who are the responsible, accountable owners of this data? Who can have access to this data? It's all that governance and the workflows around it. The users of that are going to be the data stewards, who are making sure of all this. The analogy I like to use is that you want to have brakes in a car not just so you can slow down, but also so you can drive fast safely. That's where governance comes in. And then the third one is data ops, and this is for personas like the data engineers.
These are the folks who are down in the trenches creating the pipelines, and if something's going wrong with their data, they need to be able to figure out where it went wrong and why it went wrong, go fix it, and automatically notify users about it. So these are the main applications that the data catalog platform provides, and it's possible because of the knowledge graph architecture that we have underneath, which enables us to integrate any type of data.

And we call it a platform on purpose, because it's not just these three applications I described. You can start extending it; you can start using the platform to create new applications. I brought up GitHub and business processes in my previous example because we have customers leveraging the flexibility and extensibility of the knowledge graph to say, oh, I already have my data inside of data.world. I want to bring in my people. I want to bring in my org charts, because I want to do an analysis of how creative people are being: who's using what data to go do things, right? Because we may be a creative company. We work, for example, with WPP, one of the largest ad agencies. They're a creative company; they need to foster more creativity. So they're creating their own, let's say, creativity app to analyze how people are using data.

Another one that we're seeing is creating an app around operational excellence. If we truly are a data-driven company, we're using data to make our decisions. Well, what data is being used, by which people, to drive what decisions? And the output of one decision may be the input to another decision. See, I'm kind of drawing a graph over here. And at the end, that final decision is a business outcome that should be tied to one of the strategic objectives of the organization.
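
The decision chain described above (data and people feed a decision, one decision feeds the next, and the last one ties to a strategic objective) can be sketched as a simple upstream traversal. This is an illustrative assumption, not a description of the data.world product: the `inputs` mapping, the `upstream` helper, and every node name are hypothetical.

```python
# Hypothetical lineage edges: each key is produced from the listed inputs.
inputs = {
    "decision:raise_ad_budget": ["data:campaign_metrics", "person:maria"],
    "decision:hire_creatives": [
        "decision:raise_ad_budget",
        "data:headcount",
        "person:sam",
    ],
    "objective:grow_revenue": ["decision:hire_creatives"],
}


def upstream(node, graph):
    """Return every node that directly or indirectly feeds `node`."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in graph.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


lineage = upstream("objective:grow_revenue", inputs)

# Which data and which people ultimately contributed to this objective?
data_used = sorted(n for n in lineage if n.startswith("data:"))
people_involved = sorted(n for n in lineage if n.startswith("person:"))

print(data_used)
print(people_involved)
```

Starting from the objective and walking backwards answers the question posed in the conversation: what data is being used, by which people, to drive what decisions.
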
So we're not talking just about the technical lineage of data; we can now actually talk about the business lineage of data, how data actually drives value in a business. Those are the types of applications that we enable. And because of our knowledge graph architecture, in data.world you can do whatever you want, basically.

When I was listening to you earlier, you also made a point about teams and culture. As you help organizations, I also see a very important role for culture and people. The tools are there, the technologies are there, but you can bring a horse to the lake and you cannot make it drink, especially if it's a data lake. So talk about the role of culture and people. And sometimes the right tools and the right technologies become a catalyst for that cultural change within organizations as well.

Fantastic question. As I was mentioning earlier, I personally believe that we need to have a paradigm shift, a sociotechnical paradigm shift. And the social part of that paradigm shift is the people. I think one of the trends we're starting to see is bringing product management into data: treating data as a product, or data products. This implies that we need roles that deal with product management. Just like we have it in software, we're starting to see it inside of data, and that is something that did not exist before. A person in this role knows how to talk to the business users, to understand the problems they're trying to solve, and to push back and say, "Is that really the problem?" Really to understand: is it even worth it? Is there a market for this, right? Who are the future users of this? And then to translate this into requirements and create a roadmap.
These are the types of roles that we're not seeing much today in the data industry, but this is the new trend coming up. Part of that role, or another type of role, is what I'm calling the knowledge engineer 2.0. These are roles that existed back in the 90s. It's also called the knowledge scientist; I've also heard business engineer. These are the people who will actually go and talk to the end users and try to figure out exactly: what do you mean by order? What do you mean by customer? People always complain, oh, we have multiple definitions; if you ask different people, you'll get different answers. Okay, what are they? Let's go codify them, and let's treat that as a first class citizen, right? Because a lot of people do that work today. It's the typical "the data scientist spends 80% of their time cleaning the data." Well, that data cleaning is actually really critical context. It's the knowledge of the organization. So those are the social, people changes that I see. That's for your first question.

And the second one: how is data.world part of the catalyst here? We are obsessed with adoption. Actually, half of the bonus of everybody at data.world is based on customer adoption. And we've had the highest adoption rate of data catalogs in the entire industry. We see customers with thousands and thousands of their employees on our platform, which is not common in the data catalog industry. So this is what the catalyst is: when you start getting people involved in using the data, that is part of the change. But this is not just a technology, right? A lot of organizations we've worked with say, we want to drive adoption. How do we do it? We have a tool that's going to help us. But it's also about the culture. Do we have a community of data people? Do we have data champions and ambassadors who are going to do this?
Do we create hackathons around data? That is the cultural side of it. And then you have data.world, where people can go in and very quickly search for data. Remember, that was the first application I talked about: search and discovery of data.

How do you look at generative AI? I want to look at it from two different perspectives. One is generative AI helping organizations improve their tools and workflows. The second is generative AI as a workload.

This is another web moment. We are literally in the 1990 of the web, and it's growing much faster than the web did. So this is a change in humanity. And second, to be very blunt, very honest, and no BS: if you're not using generative AI for optimizing your tasks at work and in your personal life, you're just staying behind. I mean, at work, you're a loser if you don't use this, because your competitors are going to use it and they're going to beat you, period. The productivity gains from generative AI, just in the last six months, are already very well documented, and they're across the board. One of the latest studies I saw, from Harvard Business School, analyzed, I think, consultants at BCG, and you can see the productivity gain was just tremendous. So the productivity gain is already very clear. And I think there's another report, from McKinsey, showing that the economic benefits are going to be trillions of dollars, and data is going to be at the foundation. So this is another web moment. We're living in amazing times.

Now, having said that, it doesn't mean it's the panacea, a silver bullet, right? The way I like to think about it right now is that enterprises have three challenges. There are many more, but there are three that I'm focusing on. One is the big famous one, hallucinations, right?
The issue there is that these large language models don't know the facts of your organization, which means you get a lack of accuracy. The second one is the whole black box problem. These large language models can't give you explanations. I mean, you can tell one, "Give me an explanation," but the explanation it gives could be hallucinated too. You don't know for a fact. So that means you have a lack of trust. And the third one is that it's really uncontrollable. You need to be able to make sure that you know what is being exposed. Is there any confidential or private information that cannot be exposed, right? So you have a lack of governance.

As I said at the start of our conversation, I'm a scientist at heart. And one of the things that we're doing at our AI Lab at data.world is to really focus on that first one, accuracy, because this is where we need to separate the hype and the noise from the reality. One of the main applications that we've seen from generative AI is chat: chat with your data, chat with your documents, and so on. There's a lot of work on chat with your documents. But chat with your data means, oh, I have my data in my lakehouse or in my data lake and so forth. That means I'm going to translate all these natural language questions into SQL queries underneath and execute them. How do we know the accuracy of that? We always see examples with very easy questions on very easy data. But what happens when you increase the complexity of the questions and when you increase the complexity of the data? So one thing that we don't understand yet is the extent to which these large language models can actually answer natural language questions over SQL databases. And second, what I'm really interested in learning is how much knowledge graphs can actually increase the accuracy.
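
One common way to put a number on "what is the accuracy of that" is execution accuracy: run the generated SQL and a hand-written reference query against the same database and compare the results. The sketch below uses Python's standard `sqlite3` module. It is an assumption for illustration, not the methodology of the data.world benchmark; the schema, the questions, and the hard-coded `generated` queries (stand-ins for what a language model might return) are all hypothetical.

```python
import sqlite3

# Small in-memory database with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 25.0), (12, 2, 40.0);
    """
)

# Each case pairs a question with a reference ("gold") query and a
# hard-coded stand-in for the SQL a language model might generate.
cases = [
    {
        "question": "How many orders are there?",
        "gold": "SELECT COUNT(*) FROM orders",
        "generated": "SELECT COUNT(id) FROM orders",
    },
    {
        "question": "What is the total order value per customer?",
        "gold": (
            "SELECT c.name, SUM(o.total) FROM customer c "
            "JOIN orders o ON o.customer_id = c.id GROUP BY c.name"
        ),
        # Plausible-looking but wrong: the join condition is missing.
        "generated": "SELECT name, SUM(total) FROM customer, orders GROUP BY name",
    },
]


def run(sql):
    """Execute a query; return sorted rows, or None if the SQL is invalid."""
    try:
        return sorted(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(test_cases):
    """Fraction of cases whose generated SQL returns the same rows as the gold SQL."""
    correct = 0
    for case in test_cases:
        generated_rows = run(case["generated"])
        if generated_rows is not None and generated_rows == run(case["gold"]):
            correct += 1
    return correct / len(test_cases)


accuracy = execution_accuracy(cases)
print(f"execution accuracy: {accuracy:.2f}")
```

The second case shows why easy questions on easy data can be misleading: the generated query runs without error and looks reasonable, yet it returns the wrong totals as soon as a join is involved.
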
So what we're actually doing in our lab is creating a benchmark to address the hype issue, because we really want to have the evidence to understand to what extent large language models can actually do it, and also to what extent knowledge graphs can improve it. This is a lot of the work that we're doing. And we're really excited about the results: we're going to be releasing this benchmark by the end of the month, and we are finding strong evidence to support the claim that investing in knowledge graphs upfront provides higher accuracy when large language models answer these natural language questions. This is really important for enterprises to understand, because that is how they know where to invest, how to put the noise and the hype aside and really be focused. We want to give you the facts around that. And that's me putting my scientist hat on.

There's also a lot of fear. We can call it FUD, but there is some general fear around AI, especially generative AI or superintelligence. We're talking about it being regulated. There are also malicious intentions behind that, depending on who you talk to. But overall, what is your perception of where we are heading? Are you worried? Are you scared? Do you think that it should be regulated, or do you feel like this is just another technology that we have full control over, where there is always a button we can push and the AI will just shut down?

I'm an optimist. I'm an optimist. Having said that, for any technology, we should be thinking not just, oh, wow, we can use it for these amazing things. We should also think, oh, it could be used for all these very bad things. That's something that should just be in our DNA, in how we test things out. And from there, I mean, I'm not an expert in AI ethics or on the government side, but I think those discussions need to be had. Now, it's just the pendulum swing, right?
How far do we go to one side or the other? Let's think about the web. I come from the web background, right? This is kind of my work. I come from the semantic web, the web community, so this is very dear to my heart. Imagine: we had no idea. Tim Berners-Lee invented the web back in 1989, then it came out, it grew more and more, and then suddenly around '92 this thing started exploding. Nobody had any idea what this could become. And I would argue that I don't think people at that time could even have predicted how we use the web today, which we take for granted. Actually, one of my pet peeves is when people say the word Internet and they really mean the web. The web is one thing; it was created by Tim Berners-Lee at the start of the 90s. The Internet has been around since the 60s, right? They do different things.

We had no idea of what the web could become, and it did amazing things. It has also enabled very bad things too. We have the dark web, right? And we have all the issues with social media and so on, but nobody could ever have predicted that. Imagine we had started trying to predict the future and regulate for these things early on. I don't even know what could have happened. I'm just looking at history. So I'm making this connection with the web, and we've seen how there wasn't so much regulation at the beginning. Later on, after time passed, we learned, and then we improved on this stuff. And I think that's what we learned as a society. At the end, think about what I said before about brakes in a car. Do we want to think of them as slowing us down all the time, or as enabling us to drive fast safely? You need to figure out what those safe boundaries are, little by little. I mean, we didn't always have seat belts in cars. Later on, we did have seat belts, right? Things happened and started evolving.
So this is not just "we start regulating it now." There's no one regulation, and we don't even know what that looks like. So I think there's going to be so much evolution. But at the end of the day, I am an incredible optimist. Earlier this year I was at TED, and they had a whole session on AI. Two of my favorite talks were there. One of them was by Yejin Choi, who talked about knowledge graphs and large language models and how they're key together. And the second one was by Sal Khan of Khan Academy. He said, imagine if every single child in this world had their own personal tutor. Imagine if every single teacher in this world had their own personal teaching assistant. Wouldn't the world be a better place? And yes, I do believe that, and that's the world that I want. That's why I'm an optimist.

Right now, everybody's caught up in the hype of AI. At this moment, always focus on business value: understand what you're trying to do, and how this is going to help the organization make money, save money, or mitigate risk. Because of the hype, we're almost in an infancy phase right now with all this generative AI. And this is the time to make your data AI ready, right? By AI-ready data, I mean you have to make sure it's trustworthy, it's explainable, and it has context so it can be accurate. Our position is that to do that, you need to have a data catalog. The data catalog is the foundation for understanding what data you have. And it needs to be on a knowledge graph architecture to be able to connect and bring in all the semantics, all that meaning, all that context together. That's how you're going to be AI ready. That's where you need to start. A lot of people are excited about AI. If you want to have a successful AI future, you need to start by organizing your data with catalogs and knowledge graphs.
Juan, thank you so much for taking time out today. It was a really, really interesting discussion. I mean, any discussion around AI and generative AI becomes interesting; that's the nature of this topic. But thanks for sharing your insights. Thanks for being optimistic. And also thanks for pointing out that, yes, we need to regulate, but we also need to understand what we are regulating, because without knowing that we cannot go there. So we have to tread very carefully. Thanks for all the insights. I'd love to chat with you again to discuss more about not only the space, but also what data.world is doing.

Thank you.