Rob Strechay: Hello and welcome to this Analyst Angle, where we're going to continue to dig into the intelligent data platform. We used to call it the sixth data platform, but we're really looking at how intelligence is being brought to data platforms and how that's expanding. Today we're going to have a deep discussion on the role of metadata in data platforms, emphasizing its critical function in the transition from traditional platforms to truly intelligent data platforms and data applications, and how you build those data applications. We'll dig into the industry's gradual recognition over the last decade of the benefits of decoupling computational power from data storage, which has enabled centralized data consolidation, and into considerations around how that metadata is actually exposed. To leverage that centralized data effectively across intelligent applications, there's a need to embed within the metadata the intelligence previously confined to application logic and operational silos. Building that integrated metadata is a complex, ongoing process, with artificial intelligence serving as a key facilitator: AI helps harmonize, unify, and activate conventional data pipelines, and is envisioned to further assist in transferring intelligence that was formerly embedded solely in applications directly into the data repositories. That same intelligence can help data consumers, including analytics engineers, data scientists, and application developers.

I'm Rob Strechay, a managing director with theCUBE Research. Today I'm joined by Gaurav Pathak, VP of product management at Informatica, and George Gilbert, principal analyst with theCUBE Research. Welcome, both.

Gaurav Pathak: Thank you, Rob. Thanks for having me here.

Rob Strechay: It's great to have you on. From our Palo Alto studios here, we're continuing to dig into some really interesting work that I don't think people have heard much about lately. I know Informatica from back when I used to build data cubes and did a lot with your heritage product, but what we want to dig into and understand is the journey from there to where organizations are today: how do they bring all of this metadata together, and how are they digging into it? Why don't you give us a high-level view of where Informatica is at and where you see metadata going?
Gaurav Pathak: Absolutely. A lot of our users know Informatica for data integration. We did that very well in the 90s, but since then we've added so much more. It's all about doing data management holistically and comprehensively: being able to do data integration very well, but also data quality, data cataloging, data governance, and master data management; being able to reach modern users and let them do all the tasks they want to do with data. That's what Informatica has been all about. We made a big transition into the cloud in 2018 and 2019, and now all of our services are in the cloud. That's what we sell to data management users today. One of the key theses we have within Informatica is that having the entire data platform together, rather than individual pieces, is what matters, with all of it glued together by metadata, intelligence, and the AI we're going to be talking about today. That really is how we've changed in the last few years.

Rob Strechay: It's about how organizations are looking at this and trying to get their data estate under control. George, we were talking about this, and you had brought up some of the challenges you were looking at.

George Gilbert: The modern data stack took everyone by storm because it blindsided all the activity that was going on in Hadoop. There was big data, and then with the modern data stack we had all these new players that emerged to do data quality, to do connectors, to do entity resolution. But what you're doing, from what we've talked about before, is using AI and the integrated metadata platform to create this knowledge graph that adds the semantics, connecting all those islands of metadata. Maybe you can take us through some of the challenges in combining AI with a human in the loop and creating this higher-level artifact that gives meaning to all the different parts of the metadata.

Gaurav Pathak: Absolutely. The thing with stacks is that if you stack them high enough and they're not glued together very well, they can topple over badly. The way we think about it is in terms of data network effects. The thesis is that the more the products are used by end users, the more we understand about them and the patterns they use; we make the products better, and then there is more usage. It's a virtuous cycle that keeps getting better and better. Netflix is a great example: they understand all this information about their end users, they produce better content, more users come in, and the product keeps getting better. What does that mean for data management? It means we need to understand the patterns of integrating data. We speak to customers, and there's a good story here. Recently I was at the MDM and Data Governance Summit in New York that Informatica ran, and we met a superannuation customer out of APAC. Out of the 34 critical data sets they had, 31 were coming in externally. These were their know your customer (KYC) data sets, and how the usage of their product got managed sat outside the organization. Now, these KYC providers and other providers are not only providing these services to one customer; they're doing it across customers in the region.
By looking at their data, and especially the metadata, the structures, the usage patterns, we can understand how this data gets integrated into pipelines, what quality problems it can have, and then make that intelligence available to everybody. That's only possible if we understand the aggregate metadata at scale. And just like Netflix, we have these big hits: Salesforce and Workday, packaged applications that everybody in the world is using. Being able to use that metadata and make it available to everybody is really the key.

George Gilbert: Just a clarification: when you talk about understanding how these, let's say, know your customer data sets fit together and how people are using them, do those learnings get applied to other enterprises that have subscribed to those KYC data sets from data providers?

Gaurav Pathak: The way it works is that we understand the underlying patterns and metadata, and it goes down very deep, to the actual key data elements we're looking at. In the case of know your customer, that may be the demographics: first names, last names, addresses, and so on. We're not interested in the actual data that is specific to that organization. What we're interested in is how these key data elements move through the pipeline. How does an address get cleansed, and how does it move into analytics and AI? We collect all of this metadata, we look at how it moves through an organization, and then we apply those principles and that intelligence across every pipeline. That makes it really easy for new customers.

George Gilbert: Just to be clear, that means the process of refining the data that each organization does, say with know your customer, you can improve without having to know the particular customer instance. By understanding how the KYC data sets are used, does that make the refining process easier for future uses, for different customers who subscribe to that data?

Rob Strechay: Let me build off of that, because customer 360 is still all the rage; it has been around for years. When people do that outside of Informatica, a lot of the time they go build models with dbt or something like that, or maybe they start with a catalog like Atlan, or Databricks with Unity Catalog. How does that approach differ from the approach you've taken, and where do you see the advantages of what you're doing?

Gaurav Pathak: Absolutely. There are two approaches to unifying this metadata management and making it available to data management tools. The first is a well-thought-through approach: even before you create the products, you think about how this metadata gets used and whether there are going to be patterns and commonalities. For example, if a data catalog is able to look at certain data sets and say this looks like it contains PII, you don't want to keep that information just in the data catalog. You want to send it to the data integration tool, so that when a data engineer brings that data in, you can tell them: there's PII here, you'd better mask it before you send it out to another store or, God forbid, another geo. In those cases, one plus one with metadata becomes a lot more than two, and we have many more examples. So the two approaches are: well thought through from the start, or trying to stitch it together after the fact, which can become very, very expensive.
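To make that catalog-to-integration handoff concrete, here is a minimal sketch of the pattern in Python. Every name in it (MetadataStore, catalog_scan, the hash-based mask) is a hypothetical illustration, not Informatica's API: the catalog writes a PII tag into shared metadata, and the integration step reads the tag and masks the column before the data moves.

```python
# Hypothetical sketch of catalog-to-pipeline metadata sharing; all names
# here are illustrative, not a real product API.
import hashlib
import re

class MetadataStore:
    """Shared metadata layer that both tools read and write."""
    def __init__(self):
        self.column_tags = {}  # (dataset, column) -> set of tags

    def tag(self, dataset, column, tag):
        self.column_tags.setdefault((dataset, column), set()).add(tag)

    def tags(self, dataset, column):
        return self.column_tags.get((dataset, column), set())

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def catalog_scan(store, dataset, rows):
    """Catalog side: profile sample values and tag likely PII columns."""
    for column in rows[0]:
        sample = [r[column] for r in rows[:100] if r[column]]
        if sample and all(EMAIL_RE.match(str(v)) for v in sample):
            store.tag(dataset, column, "PII")

def integrate(store, dataset, rows, target):
    """Integration side: consult the tags and mask PII before moving data."""
    for row in rows:
        target.append({
            col: (hashlib.sha256(str(val).encode()).hexdigest()[:12]
                  if "PII" in store.tags(dataset, col) else val)
            for col, val in row.items()
        })

store = MetadataStore()
crm_rows = [{"name": "Ada", "email": "ada@example.com"}]
catalog_scan(store, "crm.contacts", crm_rows)      # catalog flags 'email' as PII
warehouse = []
integrate(store, "crm.contacts", crm_rows, warehouse)  # pipeline masks it
print(warehouse)
```

The point of the sketch is the shared store: because both tools write to and read from one metadata layer, the PII finding made in one product changes behavior in another without a human re-entering it.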
Rob Strechay: Do you look at it as ETL or ELT from that perspective?

Gaurav Pathak: ETL, ELT, real time: there are five different ways in which data can be integrated. We have been doing ELT for a very long time, except we didn't call it ELT. At Informatica in the 2000s we called it pushdown optimization: looking at a pipeline and seeing which parts of it can be sent down directly to Teradata, the data warehouse of choice at that time, or now any other data warehouse that users use. So ELT was baked into the product for a very long time, alongside ETL. Now a lot of people are also realizing that doing all the compute in a particular warehouse can become costly. There are certain kinds of compute you may want to do beforehand, and with Informatica that's possible too.
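A rough sketch of the pushdown idea, under simplified assumptions (invented step names and a toy SQL translator, nothing like a real optimizer): find the prefix of a logical pipeline the warehouse can express in SQL and run it where the data lives, leaving the rest to the integration engine.

```python
# Toy pushdown planner: split a logical pipeline into a SQL prefix that
# runs in the warehouse and a remainder that runs in the engine. The step
# vocabulary and translator are assumptions for illustration.

PUSHABLE = {"filter", "project", "join", "aggregate"}  # expressible in SQL

def split_pipeline(steps):
    """Return (pushdown_prefix, remaining_steps)."""
    prefix = []
    for i, step in enumerate(steps):
        if step["op"] not in PUSHABLE:
            return prefix, steps[i:]
        prefix.append(step)
    return prefix, []

def to_sql(table, steps):
    """Very small translator: filters and projections only, for the sketch."""
    cols, where = "*", []
    for step in steps:
        if step["op"] == "project":
            cols = ", ".join(step["columns"])
        elif step["op"] == "filter":
            where.append(step["predicate"])
    sql = f"SELECT {cols} FROM {table}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql

pipeline = [
    {"op": "filter", "predicate": "country = 'US'"},
    {"op": "project", "columns": ["id", "email"]},
    {"op": "custom_python", "fn": "address_cleanse"},  # can't be pushed down
]
pushed, local = split_pipeline(pipeline)
print(to_sql("crm.contacts", pushed))   # runs inside the warehouse (ELT)
print([s["op"] for s in local])         # runs in the integration engine (ETL)
```

The design point is the split itself: the same logical pipeline can execute as ETL, ELT, or a mix, depending on where each step is cheapest to run.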
Rob Strechay: That's where we come to this whole separation of storage from compute in the execution layer, and I think you really concentrate on that execution: how to get data into the form it needs to be in to then be used for AI or for a recommendation engine. And I'm talking big AI, not just gen AI, because that's all the rage, and if we don't talk about it we'll get kicked off the internet or something. Is that really where your customers are coming to you, saying: we're strapped for resources, we need a solution that helps us streamline the gathering of the metadata and the interpreting of the metadata, and we need AI to help us do that? Is that really why they come to you?

Gaurav Pathak: Absolutely. Over the last few years, people have been struggling with getting that data to humans in analytics form, and on top of it, if you're stacking things together, you have to write the glue code yourself. What we see is customers becoming experts at gluing things together and writing that code, and still not getting the outcomes they were looking for. But I'll say this, because you mentioned generative AI: there are new use cases customers are now realizing they can use data platforms for, and the data network effects are changing in two ways. Number one, the amount of data required to generate a great outcome has been reduced, which means generative AI algorithms can now learn new skills in a few shots rather than needing us to send 10,000 different examples. If the examples are similar, the amount of information needed to train the model is not very much. So the new thing to look at is information density: the marginal utility of adding new instances of something similar is not that great. But something more subtle is that we also don't want AIs to learn just one particular example. We need to present the examples in a way that the AI learns the why behind them, the explanation behind them. The reason Microsoft's Phi models are really, really good is that the training data for them was presented in a certain way. So: not only fewer examples, but examples given in a way AIs can understand. That's how data network effects are changing, and that's how metadata management is changing.

George Gilbert: So let's pick up on that. In the past, if you wanted to use AI, you had to train each model for each feature on a large number of labeled examples. Now that you've got gen AI, you're saying you can use few-shot examples. And now that you've got this metadata foundation, how can AI help? Let's take the data engineer persona: in the ingest process, in the harmonize process, in the unify process to build the customer 360, take us through some of the big productivity and quality enhancements.

Gaurav Pathak: Absolutely. If we take the customer 360 example, the first thing our end users need to do is create a target customer model. They don't need to start from scratch, because we have the learnings built from thousands of customer 360 projects to give them a template of what a customer looks like. Then, based on that target customer schema, we can already look at the metadata we've gathered about their data sources, whether it's coming from Salesforce, their customer success stores, or all the different places where they gather customer data, and say: this particular part of the customer model can be fulfilled by this particular source. Not only that, we automate the pipeline creation between those sources and targets, automatically cleansing those data sets from Salesforce as they come into the customer 360. And that's already available today.
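A hedged sketch of that source-to-target matching step, assuming a plain name-similarity score where the product would presumably lean on learned models and lineage metadata; the source and column names below are invented for illustration.

```python
# Sketch: propose mappings from gathered source metadata to a target
# customer model. difflib stands in for whatever learned matcher a real
# product would use; all dataset/column names are hypothetical.
from difflib import SequenceMatcher

target_model = ["first_name", "last_name", "email", "postal_code"]

source_metadata = {
    "salesforce.contact": ["FirstName", "LastName", "Email", "MailingPostalCode"],
    "support.tickets":    ["ticket_id", "opened_at", "requester_email"],
}

def normalize(name):
    return "".join(c for c in name.lower() if c.isalnum())

def propose_mappings(target_fields, sources, threshold=0.7):
    """For each target field, pick the best-scoring source column."""
    proposals = {}
    for field in target_fields:
        best = max(
            ((src, col,
              SequenceMatcher(None, normalize(field), normalize(col)).ratio())
             for src, cols in sources.items() for col in cols),
            key=lambda t: t[2],
        )
        if best[2] >= threshold:
            proposals[field] = best  # (source, column, score)
    return proposals

for field, (src, col, score) in propose_mappings(target_model, source_metadata).items():
    print(f"{field:12s} <- {src}.{col} ({score:.2f})")
```

Once the proposed mappings are accepted, the pipeline between each source and the target, including the cleansing steps, is what gets generated automatically.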
Gaurav Pathak: What we're thinking about next, using this metadata, is generative AI, which is such a fascinating field. If you've been following along, you've seen how context window sizes are increasing, to the point where for some things you don't even need to prepare the data: you feed it in raw and the model can answer. Where metadata helps is that it looks at all the structured and unstructured data an end user might possess and, based on the query, figures out which unstructured data sets are right for this query and feeds them in raw, given the prompt window size, so the user gets the answer right away. Metadata becomes your filter for what goes in to answer that particular query. You don't want PDF files with ads and spam in them; you need to remove all of those. Maybe you vectorize things and send them to RAG databases, but as these context sizes get so large, you may just feed the data in raw. So there are lots of uses of metadata; that's where our research budgets are going.
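As a sketch of metadata acting as that filter, under assumed document tags and an illustrative 128k-token budget: rank candidate documents by the topic overlap recorded in their metadata, drop anything flagged as spam, and pack what fits into the context window.

```python
# Sketch of metadata-driven context selection. The tag fields and the
# token budget are assumptions for illustration, not product behavior.

CONTEXT_BUDGET_TOKENS = 128_000

docs = [
    {"id": "q3_report.pdf", "topics": {"revenue", "emea"}, "spam": False, "tokens": 40_000},
    {"id": "vendor_ad.pdf", "topics": {"revenue"},         "spam": True,  "tokens": 5_000},
    {"id": "crm_notes.txt", "topics": {"churn", "emea"},   "spam": False, "tokens": 30_000},
]

def select_context(query_topics, documents, budget=CONTEXT_BUDGET_TOKENS):
    """Rank by topic overlap from metadata, skip spam, pack greedily."""
    candidates = [d for d in documents if not d["spam"]]
    candidates.sort(key=lambda d: len(query_topics & d["topics"]), reverse=True)
    picked, used = [], 0
    for doc in candidates:
        if query_topics & doc["topics"] and used + doc["tokens"] <= budget:
            picked.append(doc["id"])
            used += doc["tokens"]
    return picked

print(select_context({"revenue", "emea"}, docs))  # these get fed in raw
```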
George Gilbert: Just one thing you said suggests a pretty profound change. Today, when we do data pipelines, we ingest the data with some connectors, then there's a whole process of harmonizing to get to these normalized entities, and then beyond that to get to the denormalized data products. But what you said was very different: if you want to build a customer 360, you start with that as the objective, and then the AI uses the metadata to figure out how to invert that process, pull together what's relevant, and build the pipeline. Expand on that very different approach, from today's sort of digging a trench to tomorrow's just setting an objective, a data product.

Gaurav Pathak: That's been another fascinating area for us. Until now, it's always been that business users come up with requirements, and then IT teams and data engineers work out what the right data products are to answer those questions. Now, with AI, it becomes easier, I won't say for business users directly, but maybe for slightly technical business users, to frame their requirement and for us to understand what the right data product for it is. Customer 360 is probably the most valuable data product organizations will have, but you'll always need more, and these data products can be created on the fly, just based on the requirements you have. But more than that, let's go one level meta on this problem. The one level meta is: the business will have all kinds of requirements, and some of these are P0 requirements, like customer 360, where you need the whole host of data governance, data engineering, and data quality to make sure those data products are great. We identify those P0 requirements, and those data products are created by default; everything is set up for them. At the next level, we can decide whether a need has to be fulfilled in real time, or can be fulfilled in one hour, two hours, or a day, so you don't spend all the cost of doing it in real time, and then we create the right data products for those as well.

Rob Strechay: When you say "we create," is that the AI helping to create that model for them, or is it an interactive exercise with, say, the data engineer who's going in there and helping build it out? Because every customer 360 is different between one retailer and another, or versus a car manufacturer. How does that customization work, given you have the templates, and the AI, I assume, is helping build it out?

Gaurav Pathak: Yes. Eventually we may get to a place where AI fully builds the pipelines; today that's not the case. We definitely need humans in the loop for the planning part, making sure the right models are created, the right pipelines feed into them, and the right sources are there. One of the things we're working on with our new CLAIRE GPT line, which comes out in April, is being able to create ELT pipelines just through natural language. A human is still directing the process: this is my target model, these are my sources, AI, please help me create pipelines; and then, oh, this doesn't look right, can you correct that, and so on. But we learn, and I tell my development team that AI is like a two-year-old right now. You give it enough instructions to put its pants on, and it may get the right leg in the right hole in maybe 50% of cases. But eventually two-year-olds grow up to be five-year-olds and ten-year-olds, and I think that will accelerate.

Rob Strechay: Just to dig a little deeper on where you're going with building the pipelines: is it that they say, here's the API for Salesforce, or do you already have a connector built to Salesforce with the connection already made? How does that work? There are a lot of companies out there that, for instance, are consolidating pipelines, providing one API into all the pipelines, and doing some modeling in the background. When people have some of that in place already, how do you see that playing out?

Gaurav Pathak: I think that's where we complete the full circle and come back to metadata. It's very important for AI to be grounded in a particular organization's data estate, so that when somebody asks a question like "what was the revenue of product X in quarter Y," we can answer it well, because every single word in that sentence is rife with ambiguity. "Product X" doesn't mean anything to an AI that has been trained on internet data. "Quarter Y" is different for every organization, because some run their fiscal year January to December and some July to June. So being able to understand their business context and map it to the right data sets becomes very important, and the metadata foundation helps: a foundation with technical, business, operational, and usage metadata that can guide the AI toward the answer and the pipelines for it.
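A small illustration of why that grounding has to be organization-specific. The knowledge-graph structure below is an assumption for the sketch: the same "Q2" resolves to different months depending on the fiscal calendar the metadata supplies, and "Product X" only becomes a real identifier through the organization's glossary.

```python
# Sketch of grounding query terms in org-specific metadata. The graph
# shape and glossary contents are invented for illustration.

org_knowledge_graph = {
    "fiscal_year_start_month": 7,          # July-to-June fiscal year
    "glossary": {"product x": "sku_4812"}, # business term -> schema identifier
}

def resolve_quarter(quarter, kg):
    """Map 'Q2' to calendar months for this org's fiscal calendar."""
    start = kg["fiscal_year_start_month"]
    first_month = (start - 1 + (quarter - 1) * 3) % 12 + 1
    return [(first_month + i - 1) % 12 + 1 for i in range(3)]

def ground(question_terms, kg):
    grounded = {}
    for term in question_terms:
        if term.lower() in kg["glossary"]:
            grounded[term] = kg["glossary"][term.lower()]
        elif term.upper().startswith("Q") and term[1:].isdigit():
            grounded[term] = resolve_quarter(int(term[1:]), kg)
    return grounded

# "What was the revenue of Product X in Q2?" resolves to months 10, 11, 12
# here, but 4, 5, 6 for a January-start organization.
print(ground(["Product X", "Q2"], org_knowledge_graph))
```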
George Gilbert: Just to elaborate on that, Gaurav: who manages the interface for that business intelligence query where you're asking the data, and how do they interface with your metadata?

Gaurav Pathak: When we train CLAIRE GPT, we train it on all the data management concepts. We train it on how to create pipelines better. We train it on, if you have to work with SQL on Snowflake, for example, what the right Snowflake SQL dialect is, or the right Redshift SQL dialect, and so on. So it has the skills, but it still doesn't have the knowledge. That comes from organization-specific metadata knowledge graphs that we create for each organization, graphs that exist only for them. At inference time the AI looks at that metadata knowledge graph, and the interface for it is directly with the end user, whether that's a data analyst, data scientist, or data engineer doing their tasks, and the AI manages it for them.

Rob Strechay: So the knowledge graph is basically doing the fine-tuning at that level, so that it's very specific. Like you said, go back to that model: customer as defined by one organization is very different from how another defines it, and how you define revenue differs between organizations. So the knowledge graph on top of the AI, created through the metadata, is really how it grounds itself in context.

Gaurav Pathak: Grounding is the right word. It's being able to look at the business context, the taxonomies that our end customers' organizations already have, and map that to the schemas, the technical data sets that are available. We call it the semantic layer. Every time the AI starts up, the first thing it does is create an underlying semantic layer for that organization from the metadata knowledge graph, and all the queries then go through it to be answered.

George Gilbert: Just to clarify: is the end user experience this CLAIRE GPT, or is the knowledge graph, with the semantics for the organization, the schema their own BI tool uses to access the data? Because everyone has a rich visualization experience in their own BI tool, and it would take a long time for CLAIRE GPT to match that.

Gaurav Pathak: Oh, Informatica is not in the business intelligence game at all. We're all about data management and plumbing for data. Our goal is not to do great BI tools and visualizations; our goal is to make data available to those BI tools and AI tools in the right way. Now, for that, there have been approaches like having humans create these semantic layers. We think that having humans create them in a rigid manner that doesn't change won't scale for the large organizations we sell to: Fortune 10, Fortune 50, Fortune 200. For them, this semantic layer has to be dynamic, created and maintained by AI, and every time it sees something new, it changes accordingly.
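A minimal sketch of a dynamic semantic layer in that spirit, with an assumed graph shape (business-term edges pointing at physical columns): the layer is derived from the metadata knowledge graph and simply rebuilt when new metadata arrives, rather than hand-edited.

```python
# Sketch: derive a semantic layer from a metadata knowledge graph and
# regenerate it on metadata change. All structures are assumptions.

def build_semantic_layer(knowledge_graph):
    """Derive term -> physical binding from graph edges."""
    return {
        edge["business_term"]: edge["physical"]
        for edge in knowledge_graph["edges"]
        if edge["type"] == "represented_by"
    }

kg = {
    "edges": [
        {"type": "represented_by", "business_term": "revenue",
         "physical": "finance.bookings.amount_usd"},
        {"type": "represented_by", "business_term": "customer",
         "physical": "mdm.golden_customer.customer_id"},
    ]
}

semantic_layer = build_semantic_layer(kg)

def on_metadata_change(new_edge):
    """A new source is discovered: update the graph and rebuild the layer,
    with no human editing a mapping file."""
    kg["edges"].append(new_edge)
    return build_semantic_layer(kg)

semantic_layer = on_metadata_change(
    {"type": "represented_by", "business_term": "churn_date",
     "physical": "crm.accounts.closed_at"})
print(semantic_layer["revenue"], semantic_layer["churn_date"])
```

The contrast with a hand-maintained layer is that the mapping is a cheap projection of the graph: when the graph changes, the layer follows automatically.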
Rob Strechay: That was going to be my question, because you have people like Fivetran building pipelines, and my former company Snowplow with its data pipelines coming in, all this different data, and you're transforming it before or after, depending on how the data flows through their pipelines and into the data warehouse or data platform of choice. Do you look at that and say, OK, what helps normalize customer from, say, the Snowplow data versus the Salesforce data that's coming through Fivetran?

Gaurav Pathak: We have customers using all tools, and a lot of them use hand-coded Python scripts and the like to move data around. Our differentiation is this: creating a data system of record for any organization is very, very difficult. A customer 360 is a multi-week, multi-month, very thought-through effort. But our USP has always been to create the metadata system of record: an understanding of what the important data sources are, what the relationships between them are, and if X is moving to Y using Fivetran or any other tool, where it's moving. We collect metadata from all of these tools. Our approach for extracting it is called the catalog of catalogs: we connect to Unity Catalog from Databricks, or the Purview catalog from Azure, and we bring that metadata together to create an enterprise view of how data moves from X to Y, regardless of what's moving it, and then feed that into our CLAIRE AI layer to do all the things we've discussed.

George Gilbert: How much does a human need to be in the loop to start building that coherence across these islands of metadata?

Gaurav Pathak: A lot, right now. We need humans in the loop to make sure the AI is right, is guided properly, doesn't hallucinate, and isn't behaving like the two-year-old we were talking about earlier. But for some tasks, say entity matching in master data management, where we're looking at hundreds of millions of records depending on the company, we look at thresholds. If the AI is confident at a 90% level, the humans say, I trust your judgment, after a while, after looking at the predictive capabilities of the AI, and they put the threshold there. Anything below 90% confidence goes to the human in the loop, but above it, it becomes the AI's responsibility. We've automated entity matching, or deduplication, for master data management like that. The industry-standard rule-based deduplication, SSA-NAME3 a few years ago, sat at about 84%. We put an AI model on top of it and reached 92%. Past 84%, every single percent takes a long, long way to gain, but at a scale of 100 million records, each percent gives an organization months and months of productivity.

George Gilbert: How much of those learnings, going from 84% to 92%, are applied just to that customer, and how much can you take and leverage with other customers, because your models and your rules are now better?

Gaurav Pathak: We ship base models that carry our learnings across thousands of different projects. That's one of the key USPs we offer our customers: they're not starting from zero; they're standing on the shoulders of giants who have been doing this for a very long time. But to really get to high levels of accuracy, we need to train that AI on organization-specific metadata, or data. So in this deduplication case, we look at their data and pull out the contentious examples, the ones where the AI doesn't know whether a pair is a duplicate or not, and the humans swipe left or right: yes duplicate, no duplicate.
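The threshold routing Gaurav describes can be sketched in a few lines; the matching model is stubbed out with string similarity, and the 0.90 cutoff mirrors the 90% confidence level from the discussion.

```python
# Sketch of confidence-threshold routing for entity matching: pairs at or
# above the threshold are auto-merged, the rest go to human review (the
# "swipe" step). The scoring model is a stand-in, not a real MDM matcher.
from difflib import SequenceMatcher

THRESHOLD = 0.90

def match_score(a, b):
    """Stand-in for the trained matching model: name similarity only."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def route_pairs(pairs):
    auto_merged, review_queue = [], []
    for a, b in pairs:
        score = match_score(a, b)
        bucket = auto_merged if score >= THRESHOLD else review_queue
        bucket.append((a, b, score))
    return auto_merged, review_queue

pairs = [
    ({"name": "Acme Corp"}, {"name": "Acme Corp."}),
    ({"name": "Acme Corp"}, {"name": "Apex Corporation"}),
]
merged, queue = route_pairs(pairs)
print(f"auto-merged: {len(merged)}, sent to human review: {len(queue)}")

# The human verdicts ("yes duplicate / no duplicate") become labeled
# training data, so over time more pairs clear the threshold automatically.
```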
Rob Strechay: When you look at how people are doing this, it's super complex, because there's so much. We look at intelligent data platforms, and we see it having to come up to this level, because in the ETR data we have there's actually an overlap, depending on whether you go Databricks to Snowflake or Snowflake to Databricks, of somewhere between 40 and 50 percent of companies and organizations that are using both, for instance. And if you add Mongo and others in there, and Redshift, it becomes even crazier how many different data silos there are. Do you see organizations looking to Informatica to help them with that? Because certain data platforms seem to be good at certain types of data, and they're looking to optimize from a cost perspective.

Gaurav Pathak: Yes, and this complexity will only increase as new AI tools come into play. There are, at a minimum, tens of thousands of open-source large language models available on Hugging Face, each doing a specific skill really, really well, and other new AI tools will keep coming. Organizations look to us to manage this complexity, and in doing that we reduce the cost and the overhead that would otherwise have to be spent on it.

Rob Strechay: Let's give you the last word here. Where do you see this going, metadata in particular? How do you see this coming together in the AI space and the data space?

Gaurav Pathak: Trying to predict anything beyond six months is becoming really difficult; everything changes every week, and new capabilities come in. I believe AI models will become stronger and stronger, and that's where everything is marching. Second, every time these kinds of changes happen, we ask: what are the core assumptions that are changing? With ETL and ELT, the assumption before was that bringing everything into a data warehouse is costly, because of storage cost and compute cost, so ETL was really good for that. Then Snowflake came along and ELT became the main thing. So what are the core assumptions changing with generative AI? In my mind, one of them is that we've had to decide beforehand what kinds of questions we'll ask: we create all these pipelines, infrastructure, and dashboards, and we only answer those questions. I think that assumption will change, because as soon as a dashboard is created, humans have new questions, and that dashboard is old news. Having to maintain all that code will change too. AIs will become intelligent enough to understand the query patterns and create those data pipelines, optimizing for cost, performance, and accuracy, and that will be a really, really good world.

Rob Strechay: I agree. We're going to keep on this, and I know we're actually going to be at your event out in Vegas in a couple of months, so I'm really looking forward to continuing this discussion, because when organizations really look at it, metadata is the glue. It's the glue that brings together the intelligent data platform, and federated metadata is super important. So, Gaurav, thank you for coming on, and George, thank you for helping out, because I love how you think about these things and pull on the thread.
Gaurav Pathak: Thanks for having me. Rob and George, thank you.

Rob Strechay: And thank you for watching this episode of the Analyst Angle. Stay tuned for more on intelligent data platforms on theCUBE, the leader in high-tech enterprise analysis and coverage.