Hello and welcome to this episode of the Road to Intelligent Data Apps. I'm Shelly Kramer, Managing Director and Principal Analyst here at theCUBE Research, and today I'm joined by my colleague and fellow analyst George Gilbert. We're also joined by Jason Pohl, Senior Director of Data Management at Databricks. This is part of our continuing conversation about what we call the sixth data platform, an emerging data platform where the leading vendors are Databricks, Snowflake, AWS, Azure, and Google. What we're seeing in this evolution is that the data platform is actually becoming an application platform, thus the name of our series, the Road to Intelligent Data Apps. This is happening as all applications become driven by data, analytics, and AI. For our conversation today, we're going to take a closer look at Databricks, both the company and the customer activity, and we're thrilled to have Jason join us. Jason, hello and thank you.

Thank you so much, Shelly. It's a pleasure to be here today.

Absolutely, we're very much looking forward to this conversation. Before we dive in with Jason, I want to give you some historical context. Snowflake largely defined the modern data stack with the separation of compute from storage, which essentially allowed the management of virtually unlimited amounts of data in a data lakehouse architecture. The data lakehouse gave customers a simple solution for managing their end-to-end data and analytics estate. Then, with partners dbt Labs and Fivetran, they provided an end-to-end solution that also reimagined and radically simplified ETL into ELT, the next generation of ETL, if I can throw a few more acronyms in here. This is significant because in ETL, transformations took place before the data was loaded, and running transformations before that load phase results in a more complex data replication process. In ELT, by contrast, the process is reversed: extract, load, then transform.
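To make the ETL-versus-ELT distinction concrete, here is a minimal sketch of the ELT pattern in PySpark against Delta tables. This is an editorial illustration, not anything from the conversation itself; the table and path names are hypothetical, and it assumes a Spark environment with Delta Lake available.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# Extract + Load: land the raw data untransformed in cheap object storage
# (a "bronze" table), with no transformation blocking the load.
raw = spark.read.json("s3://landing-zone/orders/")  # hypothetical path
raw.write.format("delta").mode("append").saveAsTable("bronze.orders_raw")

# Transform: run the transformation afterward, on lakehouse compute,
# where it can be rerun or revised without re-extracting from the source.
spark.sql("""
    CREATE OR REPLACE TABLE silver.orders_clean AS
    SELECT order_id,
           CAST(amount AS DECIMAL(10, 2)) AS amount,
           order_ts
    FROM bronze.orders_raw
    WHERE amount IS NOT NULL
""")
```

The practical difference: because the raw data is already loaded, a broken or evolving transformation only requires rerunning the T step, not the whole replication pipeline.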
Here's where it gets interesting: we see Databricks as redefining Snowflake's core innovation of separating storage from compute. In the Databricks version of the data lakehouse, storage and compute can come from different vendors, which I think is very attractive to customers. End-to-end data management becomes end-to-end data governance in the Databricks Unity Catalog. And different compute engines, including Snowflake's, can run different workloads on the data, which gives customers a wider choice of price, performance, and capabilities. So there's lots to unpack here, and Jason, we have so many questions for you. Let's start with the basics: talk with us about what a data lakehouse is and how it's different from a more traditional data platform: multi-vendor compute versus single vendor, governance separate from the DBMS. Just jump in there.

Yeah. Well, fundamentally, you can't have a data lakehouse unless your data is stored in an open data format. And what does that mean? More and more, data warehouses are being built in the cloud, not on-prem. So you're going to put your data in the place that's cheapest and most durable, which is cloud object storage: in AWS, that's S3; in Azure, ADLS; in Google, GCS. That's the cheapest place to put your data.

Then, to process the data, you spin up compute, and usually you only need that compute for a finite amount of time. Once you're done processing, you can shut down the compute, the virtual machines. Because you separate these things, you're now optimizing for cost: in the cloud, you pay for storage based on how much data you have and how long you retain it, and you pay for compute based on the size of the machines and how long they're up and running.

Contrast that with the old days. Say you've got an Oracle database on-prem. That database has a finite amount of compute, which means you have to schedule things. If you want to run your ETL, you don't want to run it during the day while your business users are running their BI reports, so you run it at night, which then precludes you from doing things like real-time data analysis, because you can't be using that same compute during the day while the BI queries are running.

And so what's that? You've got to wait.

You've got to wait. With the separation of compute and storage, which I don't think was Snowflake's innovation, I think BigQuery might have been the first commercial vendor to do it, but everybody's doing it now, you can have your ETL running on one cluster and your BI queries being served on a different cluster. That really opened up a lot of possibilities.

Now, with the lakehouse, all your data is in an open format. At Databricks, we invented the open source format called Delta Lake. It's the most popular open source format for lakehouses, and it's supported by almost every open source tool out there, as well as commercial providers. So you can write your data to Delta Lake using either Databricks or regular old Apache Spark, and then you can read that data using other engines: the Databricks Photon engine, a Presto or Trino engine, or even the Snowflake engine if you want.

That matters because, over time, all these data processing engines go in and out of favor. Ten or fifteen years ago, Hadoop was coming onto the scene and replacing Teradata. Then Spark became popular, then Trino, and Snowflake has grown in popularity. I can't tell you what the most popular data processing engine will be in another ten years. I hope it's still Photon, but we don't know. And if you've got a large amount of data, and some customers have petabytes or even exabytes, you can't just move that around. You can't export it and import it into the next data platform. You need to keep the data where it is and bring the right engine, the right tool, for whatever you're trying to do with your use case.

Makes sense.
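To illustrate that write-once, read-with-any-engine idea, here is a small sketch using open source Spark with the Delta Lake package. The session configuration follows the delta-spark quickstart; the bucket path is hypothetical, and the point is simply that the files on storage are in an open format that any Delta-aware engine (Photon, Trino, Presto, Snowflake's reader, and so on) can consume.

```python
from pyspark.sql import SparkSession

# Plain Apache Spark configured with the open source Delta Lake extension.
spark = (
    SparkSession.builder.appName("delta-open-format")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Write once to cheap object storage in the open Delta format.
events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "action"])
events.write.format("delta").mode("overwrite").save("s3://my-bucket/events")

# Read it back with this engine; any other Delta-aware engine could do the
# same against the very same files, with no export/import step.
spark.read.format("delta").load("s3://my-bucket/events").show()
```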
So what about an intelligent data lakehouse?

Well, the lakehouse is something we pioneered, I think around 2020; we wrote a great blog article on it. And recently, last November, we came out with what we think is the next evolution. I think it's similar to what you're calling the sixth data platform, but we call it the Data Intelligence Platform.

The whole idea is that historically, you've had different silos, different tech stacks, for your data stack and your AI stack. Now these are coming together with the lakehouse, because whenever you want to train your models, you want to train them in a distributed fashion over open data sets, and the lakehouse has facilitated that. But really, what you want is to build intelligent applications on top of it. And to do that, you need some sort of intelligence engine that can take knowledge of all the metadata about your data, combine that with a large language model trained on top of your own data, and then leverage that to answer questions.

If you think about it, historically, if you were a user on Databricks, you had to be technical in some way. You had to be able to write at least SQL; most people write Python, and there's also Scala or R. If you couldn't use any of those languages, you would have a hard time with Databricks. Now we're enabling the rest of the users in an enterprise. If you can speak and type English, you can ask a question, and the Data Intelligence Platform will take that English question, translate it into queries that run against your data using the context of your data, and give you an answer in a visualized manner. We're calling these data rooms; the product name is Genie. We're already trialing this with customers, where they just ask questions and Databricks generates answers for them in real time. And those users don't even have to know how to write SQL.
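As a rough illustration of the pattern Jason describes (and emphatically not Databricks' actual Genie implementation), a text-to-SQL layer combines catalog metadata with the user's English question in a prompt, gets SQL back from a model, and executes it against governed tables. Both functions here are hypothetical stand-ins.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any LLM endpoint (hosted or self-served)."""
    raise NotImplementedError

def answer_question(question: str, table_metadata: str):
    # Give the model the catalog's context: table names, columns, descriptions.
    prompt = (
        "Translate the business question into SQL.\n"
        f"Available tables and columns:\n{table_metadata}\n"
        f"Question: {question}\n"
        "SQL:"
    )
    sql = call_llm(prompt)
    # Execution still goes through the governed engine, so permissions and
    # auditing apply exactly as they would for a hand-written query.
    return spark.sql(sql)
```

How good the answer is depends heavily on how rich the catalog metadata is, a theme that comes back later in this conversation.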
You know, this is just another step in democratization. Sometimes I get tired of that word, but that's really what it is, right? The democratization of access to, and the ability to use, this data, so that it doesn't just rest in the hands of coders and developers.

That's true. Yeah.

So I think we want to talk now a little bit about some items on your list, George. Take it away.

Okay. So Jason, let's talk about some discrete workloads and how, over time, those might transform into the building blocks of these intelligent applications that build on the data and the metadata. Let's go workload by workload: how you might have different capabilities, performance characteristics, and price points for data engineering, business intelligence, supervised or unsupervised machine learning, the old-fashioned kind or the newer gen AI, and then applications. How does Databricks define those today? Help us understand them as discrete capabilities, and then how they become, essentially, workloads within an application.

Yeah. So Databricks has been at it for almost 11 years now, I think. We started as basically a notebook service, where you could work with your data interactively in notebooks. Then we added the ability to run those notebooks on an automated basis, basically a job scheduling service. And we've since expanded into different product lines. We've got a separate product line for streaming. We've got a complete orchestration tool we call Databricks Workflows, which you can use to orchestrate anything across data and AI.

We also have Databricks SQL, our data warehousing engine, so that basically any BI or ad hoc-style query can run directly on your lakehouse without having to replicate the data out to another data warehouse. And because we've got this very flexible engine, the Photon engine, we can spin up different types of clusters that are optimized for these different types of workloads, and likewise we can price these workloads at different rates.

One of the things we learned early on at Databricks is that we were always competing against other open source Spark products hosted by the different cloud vendors, and they would price those much lower than an interactive session with a notebook. So one of our first price points was offering our automated jobs at half the price point of the interactive notebooks, and that allowed us to capture more of the market that we would otherwise have been locked out of. So with every product we release, we actively look at what price points customers are willing to pay, because if you have just one price point for all of your use cases, you're probably overpaying for some of those use cases.

Yeah, let me just get you to clarify something on that. Data engineering is typically more throughput-oriented, where business intelligence is more latency-oriented. Might you be using different engines, or might you configure one engine differently for the different workloads? And help give us a sense of the different prices for, say, data engineering versus interactive business intelligence.

Yeah. For Databricks, we have one engine, the Photon engine, which is the native vectorized engine we built, and we use it for all the different use cases. But the type of cluster we spin up for you might be configured a little differently. Even in streaming: if you're basically just doing ingestion, reading data from a source and writing it to another target, maybe with a little bit of map-style transformation, you can get away with a cluster that's more CPU-focused. But as soon as you start doing online aggregation within that stream, you're probably going to want a more memory-optimized cluster. So we can take those kinds of things into consideration, as well as how much memory we allocate for certain things.

But you could also use other open source engines. We have some customers using Flink; they'll read and write data to Delta tables with Flink because they've already got a lot of in-house expertise in it. And that's perfectly fine.

Okay.
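A small Structured Streaming sketch of the contrast Jason draws: pure ingestion keeps almost no state and tends to be CPU-bound, while a windowed aggregation keeps state in memory and favors memory-optimized clusters. The `rate` source and the paths are illustrative stand-ins for a real stream, and `spark` is assumed to be an existing session.

```python
from pyspark.sql import functions as F

# Pure ingestion: read from a stream, write to Delta, almost no state.
raw = spark.readStream.format("rate").load()  # demo source emitting rows
(raw.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/ckpt/ingest")
    .start("/tmp/bronze/events"))

# Online aggregation: windowed counts accumulate state across the stream,
# which is what pushes you toward memory-optimized hardware.
counts = raw.groupBy(F.window("timestamp", "1 minute")).count()
(counts.writeStream.format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/tmp/ckpt/agg")
    .start("/tmp/gold/minute_counts"))
```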
So, then, talk about how machine learning has transformed into gen AI, and the difference in that workload. With machine learning, there was a lot of feature engineering, because you had to sort of handcraft each model; you had to say what features to train on, and essentially you were creating a mini application for each use case. How does that change with gen AI?

Yeah, it's interesting. Gen AI came on the scene a little over a year ago, when ChatGPT came out. Before that, people didn't really talk about LLMs much; it was all about natural language processing. And then it just exploded overnight. I was talking to one of our large financial services customers a few weeks ago, and they've got an AI governance team that reviews every model that gets proposed or developed. About 30 to 40% of the models being proposed are now these gen AI-type models. I was really surprised by that, because they're an older, very large enterprise, but essentially overnight, 30 to 40% of what they're looking at is gen AI. And I think that's really important, because it's just going to continue to grow.

If you think about it, and this is a simplification, all a generative AI model is, is a transformation: you're getting inputs in, transforming them, and producing an output. I'm oversimplifying, but that's what it is. And I think this is going to cause a whole lot more scrutiny around AI governance and regulation. There's already a whole bunch of regulation being proposed in the US, Europe, and the UK, and if you look at those proposals, there are a few similarities. One: you're going to have to have data and model security and privacy protections at every stage. You're going to have to document a lot of what's going on across these models and how you're generating them, because you're going to get audited, and you'll have to produce those results. You have to be able to address bias and inaccuracy. And you have to put in guardrails as well.

There's one example where, I think it was the Chevrolet website, if you typed into the chat box, "What's the best truck out there?", it would say the Ford F-150. You shouldn't allow that; there should be guardrails you put in for things like that. And all that kind of governance is not being implemented as fast as the models themselves, so there's a lot of catching up to do there. I think Databricks is well positioned, because we have a single unified platform and a single governance layer, with Unity Catalog, to do these types of things.
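As a toy illustration of the guardrail idea in the Chevrolet anecdote: screen the model's draft output against a policy before it ever reaches the user. Real systems use trained moderation models and policy engines rather than phrase lists; everything here is hypothetical.

```python
# Hypothetical output guardrail for a brand chatbot. Production guardrails
# would use dedicated moderation models and policies, not a phrase list.
BLOCKED_PHRASES = ["ford f-150"]  # illustrative policy: no competitor endorsements

def guarded_answer(draft_answer: str) -> str:
    if any(phrase in draft_answer.lower() for phrase in BLOCKED_PHRASES):
        return "I can only help with questions about our own vehicles."
    return draft_answer

print(guarded_answer("Honestly, the best truck is the Ford F-150."))
```

The deeper point Jason makes is that the guardrail, the audit trail, and the model all need to live under one governance layer, rather than being bolted on per application.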
I do a weekly webcast on cybersecurity, and we actually covered this just this week: governance, guardrails, the danger of jailbreaking, all of that. But to touch on your comment about your financial services customer: this was not a surprise to hear, because the financial services sector has long led some of these early transformation initiatives. And I think it's actually a good thing, because financial services and healthcare are areas where governance, data privacy, and security are critically important. So as a use case, it's much better to step out with a customer like that. And I think that will perhaps speed some things along as it relates to the policies, the guardrails, and the governance that we need.

And if 30 to 40% is gen AI, that means 60 to 70% is the older, traditional ML. That traditional ML is really good at math, whereas the LLMs are not good at math; they're good at more creative things. So you're probably going to need both. And I think that's why there's been interest in what's being called compound AI systems, because you're going to have to leverage an ensemble of all these different models to come up with an entire application. It just gets more complicated.

Jason, maybe help educate our viewers and listeners on what some of those compound systems are doing, some of the use cases, like when you mix prediction with something that's generated, as part of something larger. What does that application look like? Also, in the supervised case, in traditional machine learning, you built pipelines for feature engineering, and then you built something where the model was repetitively doing inference on those features. Help explain what characteristics the gen AI pipeline shares with that, how it's different, and how they work together.

Yeah. Did you say something, Shelly?

No, it was just a very big question.

Well, it's a very big space. So it is.

So, feature stores and feature engineering. Think about a retail customer: you might have a feature like how many purchases this customer has made in the last 30 days. For my household, with Amazon, I think we buy something every day, if not multiple times a day, so that could be an interesting metric. Now, let's say you're Amazon and you need that feature. Left to their own devices, all these two-pizza teams out there would basically be recreating the same feature over and over again. So if you know a feature is going to be useful for lots of different teams and models, you want to centralize it: calculate it once, keep it up to date in a streaming fashion, and then use it for all your different models. That's why people create feature stores, and Databricks has a feature store. The nice thing is that every inference against those features gets tracked with lineage, and we can make those features available online in a very low-latency way. By doing that, you're governing the individual functions that create these features so they can be reused. Now, these are probably going to be used more in traditional machine learning, for statistical analysis, but there's no reason why, if there is context and knowledge of these functions or features, they couldn't be called by the LLMs themselves. So that's another thing.
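A sketch of that compute-once, reuse-everywhere pattern, assuming the Databricks Feature Store Python client (the `databricks-feature-store` package); the table, column, and catalog names are illustrative.

```python
from databricks.feature_store import FeatureStoreClient
from pyspark.sql import functions as F

fs = FeatureStoreClient()

# Compute the shared feature once: purchases per customer, trailing 30 days.
feature_df = (
    spark.table("sales.orders")
    .where(F.col("order_ts") >= F.date_sub(F.current_date(), 30))
    .groupBy("customer_id")
    .agg(F.count("*").alias("purchases_last_30d"))
)

# Register it centrally so every team and model reuses the same definition,
# with lineage tracked on each use.
fs.create_table(
    name="ml.features.customer_purchases",
    primary_keys=["customer_id"],
    df=feature_df,
    description="Purchases per customer over a trailing 30-day window",
)
```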
There's also a really good research paper put out by Berkeley's artificial intelligence research center, BAIR (Berkeley Artificial Intelligence Research), and it's all about compound AI systems. The idea is that if you think about an application, you're trying to achieve something, and it's probably not going to be one giant model that you call. You'll probably start by calling one model, and every call to a model is usually preceded by some sort of retrieval augmentation, where your question is used to search for documents. Even the searching and ranking of documents is a whole other area of expertise that people focus on. Those documents are then fed into an LLM to give it context, the question is asked, and it produces the output. And you're probably going to have a workflow, or a chain, of these models, where instead of one giant model that's expensive to train and expensive to host, you have much smaller models tuned for specific domains and specific use cases, and depending on the path you take, you call those models. The smaller the model, the cheaper it is to host for inference and the cheaper it is to train. So instead of one large language model, we're probably going to see a bunch of chained-up small language models that make up what we call these compound AI systems. And it's very nascent; this is brand-new stuff that everybody's still trying to figure out.
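A schematic sketch of the compound AI chain Jason outlines: retrieve context, feed it to a small domain-tuned model, then chain a second model over the result. The retriever and model calls are hypothetical stand-ins, not any specific Databricks or BAIR API.

```python
def search_documents(question: str, k: int = 5) -> list[str]:
    """Hypothetical retriever over a document/vector index (the RAG step)."""
    raise NotImplementedError

def call_model(model_name: str, prompt: str) -> str:
    """Hypothetical call to a small, domain-tuned model endpoint."""
    raise NotImplementedError

def compound_answer(question: str) -> str:
    # 1. Retrieval augmentation: search and rank documents for the question.
    context = "\n".join(search_documents(question))
    # 2. Route to a small domain model rather than one giant general model.
    draft = call_model(
        "support-domain-7b",  # hypothetical domain-tuned small model
        f"Context:\n{context}\n\nQuestion: {question}",
    )
    # 3. Chain a second, cheaper model to refine or validate the draft.
    return call_model("refiner-1b", f"Refine this answer:\n{draft}")
```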
Yeah. So let me ask you, then: how does Databricks define an application today? And how might those different pieces, the old traditional machine learning and the new gen AI, become workloads within a broader application?

I'll give you a couple of examples of how customers use Databricks. One is that you're just using it like any platform, doing your analysis on it. Another is that we have customers using Databricks as the back end for heavy data processing, training their models, and then actually serving them up, so it's kind of a back end for the customer's website or app. Then we have some customers who actually white-label Databricks: they make Databricks available to their own customers and might put their logo on it. And then we also have cases where customers are creating applications that talk to Databricks. And those applications, I think, are going to evolve a lot over time.

At our Data + AI Summit last year, we announced this concept of Lakehouse applications. We're going to have the ability for customers to create an application in whatever language they want, put it in a container, and have Databricks host that container for them, within the customer's environment. We have over 12,000 customers around the world now, so if you were to create one of these Lakehouse apps, any one of those 12,000 customers could conceivably discover it in the marketplace, download it, and make it available within their environment. And because Databricks has a dozen years of security behind it, it's a very secure environment to run in. We already have the certifications you would expect: ISO 27001 and 27002 compliance, FedRAMP, et cetera. So you don't have to worry about those security compliances; you can basically just run in someone's container environment.

Another thing we released last year was SDKs. We created SDKs for Python, Java, and Go, and these map to the REST APIs that have been in the platform since day one. The combination of these SDKs will allow customers to build pretty powerful applications. And I think these next-generation applications are going to require AI; if they're not using AI, it's just a regular application. So you're going to need the SDKs that map to the APIs to do things like serve up features or create new features, serve models and get inference results from them, and serve up the large language models as well. It's pretty early days, but I think we'll see some really good examples by the end of this year.
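For a feel of what those SDKs look like, here is a minimal sketch using the Python SDK (the `databricks-sdk` package), whose calls map onto the platform's REST APIs. The credentials setup and the exact fields printed are the only assumptions here.

```python
from databricks.sdk import WorkspaceClient

# Picks up credentials from environment variables or a Databricks config profile.
w = WorkspaceClient()

# Each method call corresponds to a REST API endpoint.
for cluster in w.clusters.list():
    print(cluster.cluster_name, cluster.state)

for endpoint in w.serving_endpoints.list():
    print(endpoint.name)  # model serving endpoints an application could call
```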
Shelly, this is maybe a chance for you to start exploring how that application platform evolves.

I think you just did! Okay. Well then, Jason, maybe help us by describing how the definition of an application might change over time. And I think some of this, Jason, based on what you're saying, is that these are early days, right? Some of this is that we don't yet know what we don't know. So let's acknowledge that we're hypothesizing in some ways, but tell us how you think the definition of apps will change over time.

Yeah. I think people will be able to host, essentially, chatbots, and those chatbots will be powered by an LLM that the company creating the Lakehouse application has trained itself on a set of data for a specific purpose. As part of that interaction, the chatbot will need to go out for more information within the customer's environment, and I can see it running some sort of query against the context we have in Unity Catalog to figure out what the next set of questions could be against that customer's data set. That way the data never leaves the customer's environment, but the LLM is aware enough of the context to ask questions about it. I could see one or two of these types of things becoming the de facto standard for each industry, because if the companies that create these applications are clever enough about how they secure data and make it exclusive to them, they'll have the best application for that particular vertical.

Well, this will no doubt change how developers build applications, right? In some ways.

Oh yeah, for sure. In some ways, it's easier: if you can basically just make a call to an LLM with a question and it gives you a result, that's a pretty simple input and output. But I think the hard part will be really optimizing and tuning that. At the end of the day, you're going to want to train these LLMs on your own data, and if you can train that model faster than somebody else, you can operate these models much more cheaply. With the technology we have with Mosaic AI, we can train the same model as you would with plain PyTorch, but roughly seven or eight times faster. And that equates to a lot of money in the cloud, especially given how hard it can be to get GPUs sometimes.

But Jason, let me just ask: when you're talking about training on your data, you could have different objectives. One is that you're using an LLM, whether the customer's or Databricks', to infer from all the data, so that you can essentially materialize the metadata that populates the Unity Catalog and build a representation of the business. That's actually one level beyond Unity: you're learning the physics of their business from their data. So Unity doesn't just tell you what tables there are, what pipelines you have, and what the lineage is; you start building a knowledge graph that says who the people, places, and things are, and what activities connect them in the business. That would be one direction. But another might be that that becomes the application platform, and within it you have models that say, to use our Uber example to make it concrete, now I want to match a driver with a rider, calculate a fare, calculate a route. So one is the model of your business, and the other is, at runtime, help me orchestrate my business. Do you see an emerging pathway among Databricks technologies to help customers in both of those directions?

Yeah. I feel like it's metadata versus record level. Today, when I talk about context, a lot of it is metadata. An example we gave at our Data + AI Summit last year: if you were to ask, "How well is our serverless product line doing in Europe last quarter?", you'd have to have the context that for serverless, we have a code name, Nephos, that we use internally; that "last quarter" is offset by a month because of how we do our fiscal calendar; and that "Europe" is actually Europe, the Middle East, and Africa, because that's how we define Europe. You'd have to have all that context just to run a query on the data. And we have enough of that information within the catalog that you can form that query, run it, and get the response.
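One concrete way that business context can live in the catalog is as table and column comments that a text-to-SQL layer injects into its prompts. This is a hypothetical sketch of that pattern using standard Databricks SQL comment syntax; the table name and wording are illustrative, not Databricks' internal setup.

```python
# Business context recorded where a query generator can find it.
spark.sql("""
    COMMENT ON TABLE finance.product_revenue IS
    'Product line "serverless" is code-named Nephos in this data;
     "Europe" means EMEA (Europe, Middle East, and Africa);
     fiscal quarters are offset one month from calendar quarters.'
""")

# A text-to-SQL layer would read this context back from the catalog and
# include it in the model prompt alongside the schema.
ctx = spark.sql("DESCRIBE TABLE EXTENDED finance.product_revenue")
```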
Now, knowledge graphs would be more at the record level. If you have a graph of, say, this customer is connected to this address, and they've had this many claims associated with them, if you're an insurance company, for example, then you'd have to have some representation of that and, more importantly, APIs to call it. That's something I can foresee coming into the mix at some point. I don't think it will be this year, but it's something further out, and I think it will definitely improve the accuracy of these applications. But we've got a lot of low-hanging fruit we can get to before then.

Just as one example, one of our customers is Chevron, and they've got all these people out in the field fixing oil rigs and everything else, and there's a mountain of instruction manuals and parts manuals on what to do. We've helped them create LLMs where they just load all that unstructured data, and now the people in the field can simply ask a question about what they need, and it's intelligent enough to give them more or less the right answer without their having to flip through all those manuals. That's an example of the low-hanging fruit we have today, which doesn't need a knowledge graph.

It's also an exciting glimpse of the future for those of us who hate instruction manuals and hate the whole process of, oh, my dishwasher is... you know what I'm saying? It is a glimpse into the future, because I think in a relatively short period of time, as these things go, we'll see that kind of universal capability.

Yep.

So let's talk about, keying off that example, where you start to integrate all the data, all the data types, into this one end-to-end data estate. It's not just what's in your DBMS; it's giving coherence and definition to data that could be managed by other systems, but where you want one view of it and one common definition of how to navigate it. Explain how Unity starts to change the focus from "what data am I managing?" to "how do I govern all this data, wherever it is?"

Yeah. Using that Chevron example, and I don't know their implementation details, but if I were to build it, I would use volumes within Unity Catalog, which are basically a way to store and govern unstructured data sets. So if these are a bunch of PDF files, you could have them sitting in a volume, which really just maps to an S3 bucket somewhere. Then you might create a vectorized index off of those, and that would also be governed within Unity Catalog. When you go to access that vector index for retrieval-augmented generation, you do so by going through Unity, so every request you make is audited and you can look at it later. Everything you do to transform or create that vector index is logged as lineage as well, so you can go back after the fact if you need to audit how the model was built. And then you can start to add security: maybe the whole Chevron field team shouldn't have access to all the manuals, for some reason, so you can specify which groups have access to which things. Then, when you go to do that lookup, if you don't have access to something, it's not going to be part of the context.
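A sketch of that governed-RAG setup: unstructured files live in a Unity Catalog volume, so access to them, and to anything derived from them such as a vector index, flows through one audited path. The catalog, schema, and volume names are illustrative, and `dbutils` assumes a Databricks notebook environment.

```python
# Govern the raw PDFs as a Unity Catalog volume (backed by object storage).
spark.sql("CREATE VOLUME IF NOT EXISTS field_ops.manuals.pdfs")

# Reads go through the governed /Volumes path, so they are access-controlled
# and audited like any other Unity Catalog object.
files = dbutils.fs.ls("/Volumes/field_ops/manuals/pdfs")

# A vector index built from these files would itself be registered and
# governed in the catalog, with its creation logged as lineage. The indexing
# API itself is omitted here rather than guessed at.
```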
Okay. So now help explain to us how Unity can create a sort of abstraction layer across data that's managed in different systems, one that makes it look coherent, as though it's all in one system.

Yeah. I'd like to start with Delta Lake: it's a unified storage layer across the clouds. It doesn't matter whether your data is in S3, ADLS, or GCS; Delta Lake works on all of them, and whenever you interact with Delta Lake, it's one common API. Unity Catalog is very similar. On all those different clouds, you've got different services for identity and access control: with AWS, you've got IAM roles; with Azure, managed identities or service principals; with Google Cloud, service accounts. Once you hook those into Unity Catalog, then from that point up, it's all one common API. And we've made the API pretty simple: we've focused on the ANSI SQL standard's data control language, DCL. You grant access to an object, and that object could be a table or a machine learning model, to a group, and that group is synchronized with your identity provider. So you've got one common layer, and if you need to move applications from one cloud to another, you don't have to change your code, because it's the same API underneath; you just hook up your catalog on the different clouds. That's one advantage of a common governance layer: it really allows you to leverage multiple clouds, which matters, because we're finding that some clouds are more able to give you GPUs than others.

So customers are transferring workloads across clouds because of this already. Would it be fair to say that you've abstracted a permissions model, which then maps to the different cloud permissions models? That's one simplifying abstraction. Would lineage be another? Tell us about some more simplifying abstractions Unity provides that work across systems.

Yeah. I talked about storage, but another one is what we call Lakehouse Federation. You can register different databases, Postgres, MySQL, Oracle, or even Snowflake, as what we call foreign catalogs within Unity. Once you've done that, you've registered their metadata, and Unity now has context about those data sets. When it goes to auto-generate queries for the data intelligence, it can do so against them. That's one. Two, you can federate queries out to them, and all of those queries are tracked from an auditing and lineage point of view.

Now, with lineage in general, there are three components: capturing the lineage; persisting it in a common format and indexing it; and visualizing it. Everything you do in Databricks, we automatically capture that lineage for you. If you use some other system as an engine, we're not going to capture that lineage; we don't have any knowledge of what Oracle or Teradata is doing. But if somebody were to write a collector that captures that lineage, they can push it into Unity and leverage all the visualizations we have. So we can accept third-party lineage, and we automatically collect what's done in Databricks, but we're not going to try to create collectors for 200 different systems.

So part of the platform is letting other providers write collectors, so that you can have a global view of lineage.

Yep.
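A short sketch of the two mechanisms Jason just described, based on documented Unity Catalog SQL: an ANSI-style DCL grant, and Lakehouse Federation registering an external Postgres database as a foreign catalog. All object, group, connection, and secret names are illustrative.

```python
# One DCL statement, regardless of which cloud's IAM sits underneath.
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `data-analysts`")

# Lakehouse Federation: register an external database as a foreign catalog.
spark.sql("""
    CREATE CONNECTION IF NOT EXISTS pg_conn TYPE postgresql
    OPTIONS (
      host 'db.example.com',
      port '5432',
      user secret('demo_scope', 'pg_user'),
      password secret('demo_scope', 'pg_password')
    )
""")
spark.sql("""
    CREATE FOREIGN CATALOG IF NOT EXISTS pg_catalog
    USING CONNECTION pg_conn OPTIONS (database 'appdb')
""")

# Queries against pg_catalog.* now federate out, with auditing and lineage.
```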
And when you map these external databases, lineage might be one thing you could collect from them if someone writes a provider. But what other context are you capturing that allows the intelligent lakehouse to query this external data as if it were native? What external context do you need to map, the same way you're mapping permissions?

Really, you just need the metadata, as well as the lineage. In that metadata, you've usually got some sort of descriptions, and one of the first applications of LLMs we shipped in the product was this: if you open Unity Catalog, you can automatically generate descriptions of tables and columns based on the metadata. Most customers never actually fill in the description fields on their metadata, but we can auto-generate that for them, and doing so makes searching the catalog that much more accurate.

I think it's awesome that you can do that for them. That's a step that's important.

And we trained our own small LLM to do it.

Love it. Well, George, I'm going to give you one opportunity before we wrap this show to ask another question, and then I'll move on to our last question.

Okay. Should we think of Unity as the application platform? And if so, how does that grow over time to make the developer's experience more intelligent, so that they're less down at the physical, technical-metadata level, the strings, and more thinking about the people, places, and things? I know that gives you a lot of room, but help us think about how that abstraction might evolve over time.

Yeah, I don't think of Unity as the application platform per se. I think of it more as a governance layer, and as part of that governance, it's capturing and persisting a whole lot of metadata. Think about it: if you've got two tables in your data lake, and one gets queried a thousand times a day while the other gets queried once a day, which one is more important? If you said the one that gets queried a thousand times a day, you might be incorrect, because unless you look at the lineage, you might not realize that the other table is actually feeding it. You need lineage to have the complete context of which tables matter most. That's just one example of how the metadata can be interesting. Another is which columns get joined on the most, which gives you some insight into how people are using the data.

In terms of applications and hosting, I think our model serving layer is a really good example. If you've trained a model and you want to make it available behind a REST endpoint for an application to hit, you need a way to host it so it's up and running all the time. You need a way to automatically scale it, so that if more requests suddenly come in, you can scale out the compute behind it and then scale it back. You also need to monitor it: for every request that comes in, what were the inputs and outputs? We save that to a Delta table so you can monitor it. Then, if the results of that model start to drift over time, you want to be able to query that and set up alerts, so that if it drifts too much, you can trigger actions to either retrain the model or have somebody investigate. That's what I think of as a really small application. That one is just a REST endpoint, there's no UI, but you can build out from there: eventually you add a UI, and you add calls to multiple model serving layers to build a more complete application.
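A sketch of that monitoring loop over an inference log stored as a Delta table: compute a simple daily drift signal and flag it when it crosses a threshold. The table name, columns, and the naive metric are all illustrative; a real setup would use proper drift statistics and an alerting service rather than a print statement.

```python
from pyspark.sql import functions as F

# Hypothetical inference log: one row per request, written by the serving layer.
log = spark.table("serving.inference_log")  # columns: ts, input, prediction

# Naive drift signal: compare the latest day's mean prediction to the baseline.
daily = (log.groupBy(F.to_date("ts").alias("day"))
            .agg(F.avg("prediction").alias("mean_pred")))

baseline = daily.agg(F.avg("mean_pred")).first()[0]
latest = daily.orderBy(F.desc("day")).first()

DRIFT_THRESHOLD = 0.15  # illustrative tolerance
if baseline and abs(latest["mean_pred"] - baseline) / abs(baseline) > DRIFT_THRESHOLD:
    print(f"Drift on {latest['day']}: investigate or retrain")  # stand-in for an alert
```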
Okay, that's helpful. Yeah.

So Jason, as we close in on the last part of our conversation, I want to talk for just a minute about something we touched on earlier, the intelligent lakehouse. Can you share, broadly, what Databricks' vision for the intelligent lakehouse is?

Yeah. I feel like it's about being able to address the rest of the users in an enterprise, the tens of thousands, if not hundreds of thousands, of users who aren't really working with data and AI directly; they're leveraging other teams that do that for them. Think about all the questions that never get asked, because if you have to ask your data science team to go figure out how a category of product is doing at a particular location, you're going to be thinking: do I want them to spend time answering that question, or these other ten questions?

And when do they even have time to talk to me? I've got to wait in line. That's an issue too.

Yeah. So if you just had an easy way to ask these questions and get good-enough results to guide you to your next decision, you'd ask a lot more questions. And that's a vision that isn't going to take a year to develop fully; it's going to take multiple years to really optimize. So I see the evolution of where things are going as democratizing, making the insights in your data available to a broader set of users without having to upskill them in programming languages or data science.

That makes perfect sense. So then I'm guessing that, as a natural part of this process, we'll see more and more semantics captured in the catalog. Is that correct?

I think so, yeah. And I think George is onto something with this knowledge graph idea. We've got a lot of stuff to build this year, but I could see, not just Databricks but the industry in general, moving toward these knowledge graphs if they can get you a more fine-grained, optimized answer.

I love it. George is always onto something; that's just who he is. So who do you think will author the data semantics? Will it be a combination of a data engineer and an LLM mining that metadata? What do you think?

I kind of feel like the answer is yes. If you look at the semantic layers that have been generated over the years, every BI tool has a semantic layer, and there's been more interest in semantic layers and metric layers in recent years, but for the most part, all of them are still manually curated.
So someone is manually going in and specifying what type of data this is, what currency it is, a description of it, et cetera. I think the evolution will be: how do we get to the same, if not a more accurate, result with less manual curation? How can we automatically infer a semantic layer rather than having somebody manually curate it for you? Because that's where the real efficiency is. And once you make things super easy, like one click, people don't want to go back to the manual stuff.

Absolutely. We've all lived that, right? All right. Well, Jason Pohl, thank you so much for joining us today. We have so appreciated you sharing some of the work you're doing at Databricks and some of the forward-looking vision. I don't have to tell you what an exciting space this is, and as we see the emergence of the sixth data platform and this change within the industry, these are really exciting times.

Thank you so much, Shelly, and thank you, George. It was a pleasure to have this conversation today, and I'm looking forward to doing more of these.

Well, I have a feeling, and I know this isn't our first conversation with you, that we'll be talking about this for some time to come, so we'll be tapping you to rejoin us again in no time. We look forward to that. George, Jason, that's a wrap. Thank you so much, thanks to our viewing and listening audience, and we'll see you here next time.