Hello and welcome to SuperCloud 6 and the continuing discussion around AI innovation. I'm Rob Strechay, managing director with theCUBE Research. Today I'm joined by Craig Wiley, senior director of product for AI/ML at Databricks, a company that has grown up at the intersection of data and AI since its founding. Welcome aboard, Craig. Thank you very much, excited to be here. Yeah, I am too, because I feel like we've lived similar lives, having both been at AWS and other places and really been in the data space for quite some time. You've definitely got the chops in the AI space, so I'm really excited to go through this with you, because I think that innovation is at the heart of why people do AI. They're trying to figure out the use cases, and we'll unpack that a little bit as we go. But I want to start with the fact that Databricks and the underlying Apache Spark have always been popular with the data science crowd, even to the point of supporting two interfaces for R, which is really where a lot of data scientists live: SparkR and sparklyr, if that's how it's pronounced. What are you seeing as you talk to customers about this, and about AI in particular, about the personas that are really pushing AI innovation at those customers? Yeah, you know, the last 18 months have really driven a change in this space. And the reason I say that is that if we were to rewind the clock 24 months, we'd find that the vast majority of machine learning and AI was done by data scientists, right? Classically trained, with PhDs in data science or something along those lines. And then, with the introduction of large language models and obviously the OpenAI moment, we all of a sudden saw a second group coming in and doing a lot of that work: developers.
And so today, we really think about this as having to serve both the classically trained specialist in data science, making the workflows easier for them, while at the same time making sure that we're building in such a way that folks who may not have as much depth in data science in particular, but who still have technical depth, get the tools to build and deploy these generative apps they're finding so useful. Yeah, that makes sense. I think, again, when you start to look at where Databricks has been, it's always been with Python and being able to do that type of data wrangling and data engineering. And both of us having the backgrounds that we do, we've seen a lot of the innovation over the last few years, to put it mildly. What are you seeing now that you're at Databricks? What is the innovation that Databricks is really focused on and currently shipping to its customers? Yeah, within the space of AI, this really comes down to, as you said, data plus AI. If you look at, for example, any of the hyperscaler AI platforms, I think what you find is that they're these distinct AI platforms, separate from all of the rest of the tooling, right? They can be connected to all the other tooling through clear APIs and what have you. But Databricks has really done this differently. Databricks has said, hey, what if data and AI were managed completely within a single stack, right? What if we reused as many of the classic data management capabilities as we could, wherever we could? And so, as an example, when you use our feature store to train a model, your model can actually just inherit the governance of the data in the feature store.
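That governance-inheritance idea can be sketched as a toy model. To be clear, this is an illustration of the shape of the concept, not the actual Databricks API; every class and name here is hypothetical:

```python
# Toy sketch: a model registry where a registered model inherits the
# access-control list of the feature table it was trained on.
# All names are hypothetical, not the Databricks Feature Store API.

class FeatureTable:
    def __init__(self, name, acl):
        self.name = name
        self.acl = set(acl)  # principals allowed to read this data

class ModelRegistry:
    def __init__(self):
        self.models = {}

    def register(self, model_name, feature_table):
        # The model starts out governed exactly like its training data:
        # only principals who could read the features may use the model.
        self.models[model_name] = set(feature_table.acl)

    def grant(self, model_name, principal):
        self.models[model_name].add(principal)

    def can_use(self, model_name, principal):
        return principal in self.models[model_name]

features = FeatureTable("churn_features", acl=["alice", "bob"])
registry = ModelRegistry()
registry.register("churn_model", features)

print(registry.can_use("churn_model", "alice"))  # True: she could read the data
print(registry.can_use("churn_model", "carol"))  # False until explicitly granted
registry.grant("churn_model", "carol")
print(registry.can_use("churn_model", "carol"))  # True
```

The point of the design is the default: nobody gets model access for free, and the set of people who can call the model starts out identical to the set who could read its training data.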
And then your model is fully governed. Now whoever had access to the data, they're the only ones who have access to that model, unless you grant others access. And this idea of unifying not only data but the assets that are created from data, right? Having one lineage graph and one governance capability across all of these data assets is really a massive opportunity in this space. If I look back on the development of ML and AI over the last couple of years, obviously OpenAI and the generative movement have had a massive impact. But if I were to go back before that, the introduction of MLOps was really the last time we saw an order-of-magnitude increase in developer velocity. And I think the next of these opportunities is really in connecting data plus AI. Now, as we look forward to a time when more and more of this AI is used through generative models, this governance and controllership of how the data interacts with the model, right? Are you fine-tuning? Are you running RAG? Are you maybe pre-training? How you choose to connect the models with the data becomes increasingly important, and it becomes increasingly important to ensure that this is all locked down and fully governed. And so that, to me, is one of the areas I'm most excited about: really capitalizing on this relationship and this integration between the data stack and the AI stack. Yeah, we talk about it a lot. We actually have a whole talk track, or really deep research, around what we're calling the sixth data platform.
And I think one of the things, and you briefly touched on it with the feature store and the feature engineering you can do within Unity Catalog, which is the central governance place in there, is that metadata connecting AI and the actual data together becomes key. People are talking about it in all kinds of different ways: hey, you have to have a vector store, or you have to have vector search, or what have you, to, like you said, do your RAG appropriately for generative AI. But we also see it being used for more than just generative AI. That is absolutely the hottest thing on the planet at the moment, but you also have traditional AI and ML. What are you seeing from customers about wanting to do more than just gen AI? Yeah, you know, it's funny. We still see a massive, call it traditional AI or traditional ML, business. And let's be honest, that's where the money in this space is being made by customers, right? When practitioners are building better forecasts or better classifiers or better recommenders, there's a clear path to revenue today for those models. We've known for years now how to build those models, how to get them successfully into production, and then how to take care of them in production. With gen AI, we have a new opportunity to learn how we drive accuracy, or how we drive fit between our model and the use case. But going back to your question on what we're seeing: this idea, as you said, of merging the data catalog and the feature store or the vector store, the ability for people to access the data, so that data scientists are no longer using these separate feature store systems where all the data has to be copied and pushed over.
And these separate feature store systems probably don't have meaningful governance, or if they do, it's probably not aligned with the data catalog governance, right? They probably have their own lineage system that's not aligned with the data lineage system. And so we're seeing more and more customers take advantage. For example, if you use Databricks Model Serving to serve your models and you've used our feature store, you don't have to build any data pipelines, because we know where all your data is. All you need to do is send us your primary key and we'll do all the lookups and deliver the data to the endpoint, right? Similarly, let's say you're using our model monitoring capability and you're watching your models for skew and drift. If your model starts drifting, traditionally we would turn to the data science team and say, hey, the model's misbehaving, go fix it. Now, with an AI system that's completely connected to the data system, we can say, hey, your model's drifting, and we think it's because feature number four is broken. And the reason we think feature number four is broken is that there are more nulls in that table than there ever have been before. And we actually know the load job that put those nulls in. And the really impressive part is that this lineage doesn't just read backwards. As soon as we say, hey, feature number four is the problem, you can immediately turn around and say, great, I know feature four is not only in this model, it's in these other two models, it's in this load job for another table, and it's in this dashboard my executive just looked at.
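That monitoring-plus-lineage story can be boiled down to two small pieces: spot the feature whose null rate has spiked, then walk a lineage graph backwards to the load job that caused it and forwards to everything that consumes it. This is a deliberately tiny stand-in, all names and data invented for illustration, not any Databricks API:

```python
# Toy sketch of drift diagnosis plus lineage: find the broken feature,
# then walk lineage both directions. Illustrative only.
from collections import defaultdict

def null_rate(column):
    return sum(1 for v in column if v is None) / len(column)

def broken_features(baseline, latest, tolerance=0.10):
    """Features whose null rate grew by more than `tolerance` vs. baseline."""
    return [name for name in baseline
            if null_rate(latest[name]) - null_rate(baseline[name]) > tolerance]

# Lineage edges: (upstream, downstream). All names are made up.
edges = [
    ("load_job_7", "orders_table"),
    ("orders_table", "feature_4"),
    ("feature_4", "churn_model"),
    ("feature_4", "ltv_model"),
    ("feature_4", "exec_dashboard"),
]
down, up = defaultdict(list), defaultdict(list)
for src, dst in edges:
    down[src].append(dst)
    up[dst].append(src)

def walk(graph, node):
    """All nodes reachable from `node` in the given direction."""
    seen, stack = set(), list(graph[node])
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(graph[n])
    return seen

baseline = {"feature_4": [1, 2, None, 4, 5, 6, 7, 8, 9, 10]}            # 10% nulls
latest = {"feature_4": [1, None, None, None, 5, None, 7, 8, None, 10]}  # 50% nulls

for feature in broken_features(baseline, latest):
    print(feature, "caused by:", walk(up, feature))   # back to the load job
    print(feature, "impacts:", walk(down, feature))   # every downstream consumer
```

Running this flags `feature_4`, traces it back to `load_job_7`, and lists the two models and the dashboard downstream, which is exactly the "reads both ways" property being described.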
And so the ability to see all of that, and to treat your model as part of your data system as opposed to a completely separate system, is really driving a massive increase in velocity for the customers who are adopting this point of view. Yeah, I think when we look at it, we look at it similarly: you have to have lineage, you have to have governance and security, you have to be able to deal with PII, so masking and mapping, and then, like you said, use it for more than just gen AI. Because I guarantee the retailers would agree with you that a recommendation engine definitely brings them more revenue than, say, a gen AI bot that helps them figure out the right product name or something like that. Yeah, absolutely. But what we do see, and I guess I'm interested in, what use cases are you seeing in the gen AI space right now? Because we see a lot in customer service and IT ops and things of that nature, where, hey, we need to be able to either do self-service or bring better data quicker. What are you seeing from customers? You know, I feel extraordinarily lucky to get to work with the Mosaic team. We acquired MosaicML, the creators of the MPT series of models, and I get to work side by side with them. And not only do I get to see the customers who are building RAG bots and things of that nature, I also get to see the customers who are fine-tuning very large models or even pre-training their own foundation models. And as far as use cases go, I think we see every use case under the sun. I'll say 90% of it today is RAG, right? What we know is that customers are desperate to get their data into these models. How can they make it so that these models are contextually aware of their space, their business, their problem, their issue? And so this is where we see RAG.
And whether it's in HR, you know, hey, what are my benefits? Or, to your point, a customer service or product information bot: how would I handle this problem, or why isn't this thing working, right? Usually what we're seeing today is that those are internally facing, but we're starting to see customers get confident enough in the accuracy of these models that they're willing to start turning them externally facing as well. And so we're seeing a lot of those QA bots. And I will say, I had a customer the other day who got a bit worked up and said, you know why everybody's doing RAG and building these RAG bots? Because we're not really creative enough to figure out what else we should do with this technology yet. And I was reminded of the internet back when all the pages were static HTML and you couldn't type anything, you couldn't interact other than by clicking links, right? My guess is that's the day one we're at. I'm starting to see more creative uses. I always get excited when customers come in and say, yeah, we've got a RAG bot going and we've also fine-tuned a model, but what we want to talk to you about today is this new, truly generative capability, this new creative capability we're building. So we're seeing use cases all over the place. But I think the biggest thing we're seeing is separate from the use case. It's really about: how do I gain the confidence that the model is going to be accurate, that it's not going to hallucinate, that it's not going to share information that I don't want shared, right? Private information. How do I gain confidence that this model is going to react and behave the way I want it to in all circumstances?
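The QA-bot pattern behind all of those examples, retrieve the relevant internal document, then ground the model's answer in it, fits in a few lines. Everything here is a deliberately tiny stand-in: the "retriever" is word overlap rather than real vector search, and the model call is mocked out; in practice it would hit an LLM endpoint:

```python
# Minimal RAG sketch: retrieve a document, stuff it into the prompt as
# context, call the (mocked) model. Handbook content is invented.

HANDBOOK = {
    "benefits": "Employees get 20 vacation days and full health coverage.",
    "expenses": "Submit expense reports within 30 days of purchase.",
}

def retrieve(question):
    # Toy retrieval: pick the document sharing the most words with the question.
    words = set(question.lower().split())
    return max(HANDBOOK.values(),
               key=lambda doc: len(words & set(doc.lower().split())))

def answer(question, llm):
    context = retrieve(question)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)

# Mock model: just echoes the context line back, standing in for generation.
mock_llm = lambda prompt: prompt.splitlines()[1]

print(answer("How many vacation days do employees get", mock_llm))
```

The value of the pattern is exactly what was said above: the model's answer is anchored to the company's own documents, so it is contextually aware of that business rather than answering from generic training data.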
And so really, we're working with customers, regardless of their use case, on how we can drill down and distill the problem to a place where we really can help them drive accuracy, whether it's in the data they're putting into their vector store, the data they're fine-tuning on, or just how they're structuring their prompts. All of these things can have massive impacts on the results you get back from these models. I think that is so true. And like you said, I always get excited when it's more than, I mean, don't get me wrong, being able to figure out how to set up direct deposit when you've just joined an intergalactically large company, using the gen AI bot to figure that out versus having to call somebody, fantastic, and it absolutely has an ROI, especially internally. But it's great when they go and do something more creative. I also see that a lot of these technologies really require a human in the loop at the same time, because you don't want a customer service bot to go out and recommend a stock, for instance. And part of that is the policy and governance making sure it's not making those recommendations. Where do you see innovation from Databricks going in the future to really help drive toward that kind of deployment? Yeah, it's going to be all about accuracy, right? Accuracy and safety, making sure that your model, to your point, is not sharing things you don't want shared. Many of us have probably seen the recent story with Air Canada, where Air Canada was held liable for a discount that their chatbot offered over an online chat. Somebody was asking about discounts, the model said, yeah, hey, we'll give it to you, and ultimately the airline didn't want to honor it. The user sued them and won the lawsuit.
And so making sure these models behave the way you want them to behave is going to be absolutely critical. And whether it's RAG, whether it's fine-tuning, our focus is really on, yes, the model is important, and we'll continue to drive work in the open source space and help companies train their own models. We expect to continue to be an active part of that space. But as we look forward, it's often the AI system: it's the vector DB and the prompt, and maybe you're calling the feature store for some structured data, you need to know when the person's last order was, or something like that, right? These chains, these multi-step AI systems, are where the really interesting use cases can be created, but they introduce more and more opportunity for error. So things like giving you full trace logs of every single step of your chain, so that you can go in and debug any sort of accuracy issue. These are the kinds of things we're spending a lot of time thinking about: how can we help customers not just improve their prompt or pull back a document, but optimize exactly which document was pulled back, exactly how big the chunks are, and which documents were included, in order to help them achieve their specific outcomes? Yeah, absolutely. We look at that as being absolutely required going forward, because customers have to have that type of confidence. I also think the guy from, you know, the Air Canada story, he should be hired by somebody as a prompt engineer or QA person, because he definitely did a good job figuring out how to get at them.
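Those full trace logs of every chain step come down to a simple idea: record each step's input and output as the chain runs, so an accuracy problem can be traced to the exact step that introduced it. A minimal sketch, with toy step functions standing in for a real retriever, prompt builder, and model:

```python
# Toy chain runner that records a trace entry per step for debugging.
# The steps themselves are trivial stand-ins, not a real AI system.

def run_chain(steps, payload):
    trace = []
    for name, fn in steps:
        result = fn(payload)
        trace.append({"step": name, "input": payload, "output": result})
        payload = result  # each step feeds the next
    return payload, trace

steps = [
    ("retrieve", lambda q: f"[doc about {q}]"),
    ("build_prompt", lambda ctx: f"Answer from: {ctx}"),
    ("generate", lambda prompt: "mock model answer"),
]

final, trace = run_chain(steps, "refund policy")
for entry in trace:
    print(entry["step"], "->", entry["output"])
```

With a trace like this, a bad final answer can be attributed to the step that went wrong, say, the wrong document coming back from retrieval, rather than being a black box.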
And that's the thing: you always want the prompt to come back with the same answer, and he sat there and tuned his prompts to get the answer he wanted to come back to him. And I think that's what some companies are very worried about, to say the least. So, last word: what innovation are you excited about right now? It can be Databricks, it can be the broader market. You know, the thing we see a lot of right now that gets me really excited is what I'll call model in the loop, right? We all talk about human in the loop, using humans to judge whether or not answers are good. And that's great, except there's a finite capacity for how much of that we can do and how quickly we humans can do it. And so what's really exciting is this idea of training models to judge how well the model solved a particular case, right? Hey, I'm supposed to answer this question, here are the rules I have around answering this question. Hey, model, can you help me see whether or not I did that right, and give me some confidence about whether or not I did that right? And obviously this can have hallucinations and things like that, but the idea is being able to do this at much higher scale, being able to label tens or hundreds or even thousands of these utterances at once to drive model accuracy. I think synthetic data is also playing a huge part in this. But I think we're just now starting to see what the tools and actions are that we're going to be using on gen AI over the next few years or decades to drive this technology toward our needs.
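The model-in-the-loop idea is a judge scoring each answer against the rules, so thousands of utterances can be graded without a human reading each one. In practice the judge would itself be an LLM prompted with the rules; here it is a trivial rule-checker standing in for one, purely to show the shape of the loop:

```python
# Toy "judge in the loop": score candidate answers against explicit rules.
# A real judge would be an LLM asked for a verdict and a rationale.

def toy_judge(answer, rules):
    # Pass only if the answer satisfies every rule.
    return all(rule(answer) for rule in rules)

rules = [
    lambda a: "guarantee" not in a.lower(),  # no promises we can't keep
    lambda a: len(a.split()) <= 50,          # keep answers concise
]

answers = [
    "Your refund should arrive within 5 business days.",
    "We guarantee you will double your money.",
]

verdicts = [toy_judge(a, rules) for a in answers]
print(verdicts)  # [True, False]
```

The scale argument is the whole point: the same judging loop runs over a thousand answers as cheaply as over two, which is exactly what a human-in-the-loop review cannot do.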
And, you know, the faster we can automate that, the better. I mean, here's the deal: we all know what we want, which is to be able to come in and say, hey, build me a model that does the following, right? We're eons away from that still, but we're starting to see some of the pieces come together that will allow it, whether it's model in the loop, whether it's synthetic data, whether it's aggressive fine-tuning, both continued pre-training and instruction fine-tuning. These are the kinds of capabilities that I think are really going to allow us to ask these models to do more and more sophisticated and exciting things, in just the way we want them to. No, we agree. And we're actually having somebody on from a synthetic data company as part of SuperCloud 6 as well. But Craig, hey, I want to thank you for coming on board. I really appreciate it, and thanks for giving us the insight from your place over at Databricks. Thanks for the time today. And thank you. Stay tuned for more SuperCloud 6.