Hello, welcome back to theCUBE's live coverage here in San Francisco for Databricks' Data Plus AI Summit. I'm John Furrier, host of theCUBE, with Rob Strechay. I've been here two days breaking down all the action. We had the analysts on, and now we're psyched to have the co-founder and chief architect, Matei Zaharia, here with us in the press room. Thanks for making time out of your super busy schedule. You've been on stage, giving demos, meetings back to back. Having fun?

Yeah, having fun, definitely hectic, but it's awesome to talk to you.

It's great to see the rise of Databricks and its continued growth. Obviously a big event here, 12,000 people, a younger crowd. I see a lot of open source developers out there and a lot of data people, engineers, and you have the theme going on around democratizing AI, bringing AI to the masses. Bold vision, great vision, we love it. What does that mean for Databricks? You've been very successful as a company, you continue to get more data. You've been doing the Lakehouse, you have a great open source presence with your code, data sharing's been booming, things are happening. What's the key ingredient to make AI work for you?

Yeah, great question. I think supporting generative AI is a very natural progression of what we've been doing. Since we started out, we set out to democratize working with very large data sets. That was our bread and butter. We integrated it into the traditional data stack with the data warehousing we've been doing, and we've always supported machine learning and data science. Generative AI is a very powerful form of machine learning that basically every company is trying to apply; it's extremely general in its applications. And we think we have some great existing foundations to support the whole lifecycle of it, from the data to the serving and so on.
And we reach the exact same users in each enterprise who want to build the generative AI stuff, so it's awesome to work with them.

So I want to ask your thoughts on it. We're going to get to the security piece in a minute. Obviously people want their IP to be protected; data is their intellectual property, and you took a very strong position on that in the keynote. We'll get to that in a second. But enterprise needs and cloud enablement are two factors we've been seeing a lot of conversations around. How do you see the LLM movement, large language models and generative AI specifically, taking advantage of this next generation cloud as well as the needs of the enterprise, which are different than, say, a straight-up developer's?

Yeah, the needs of the enterprise are very different, and this is exactly the thing we're specializing in, the thing where I think we can provide unique value even compared to the very generic language-model-as-a-service type of providers. A lot of those providers are focused on consumer apps that are trained with just data from the internet and can talk about public knowledge in various ways. They're amazing applications, like ChatGPT: you can ask it about anything, eventually it'll do things like search the web, and it's great. But in the enterprise, first of all, you need a level of precision and reliability that's quite a bit higher. That's hard to build; there's a lot of R&D in that. Then you also want it to understand the specific data you've got in your company, your specific jargon and business terms and so on. For example, you might have trade secrets. If you're TSMC, manufacturing processors better than anyone else in the world, there's a whole lot of stuff in there that you want your internal people to know about, but there's no way OpenAI knows about it. There may be other proprietary knowledge.
Like if you're a medical research firm and all the science papers say one thing about an enzyme, and you just discovered that it works a different way and it's going to lead to the next breakthrough, how do you get that research into your AI applications? This is the kind of thing we're enabling folks to do. It's very exciting to see the interest in it already.

Yeah, we've reported that AWS Bedrock has that same capability to keep everything in the VPCs of the clouds, so you can start to segment that data access. How does that change the role of authorization and access controls? How do you see that playing out? Is that part of governance or management? Who controls that piece?

Yeah, so we think enterprises will want to control it in very domain-specific ways, and we're building the governance tools for AI based on the rich governance tools we already have for data and for more classical data science and machine learning. We have something called Unity Catalog, which is basically the only data catalog in the industry that also spans AI and unstructured files and gives you very rich controls, lineage, and quality across them. And with generative AI, one of the new challenges is that you want to train on lots of data. Maybe you get it from the web or other places, but at the same time, some of it is wrong. Maybe some of the policies around copyrighted data and its use are changing over time. So you really want to trace exactly what data went into this and be able to fix that as you release your applications. And I think, increasingly, just through regulation, you'll be required to explain and document what went in there. So it is a new set of use cases where this matters.

So I want to ask you about the competition. We were just talking with our analyst team on our segment earlier today, and before we came on camera, about the other guys.
They want to manage the data, control the data, govern it, and then they want to let people build apps on top, see analytics. Your vision is to govern the data and the analytics, right? What's the vision of Databricks? What's the specific goal?

I mean, one of the big things we've always bet on is basically open interfaces. That means open storage formats, so you can use any computing engine and platform with them, and open APIs like Apache Spark and MLflow and so on. We think that will give customers a lot more choice and ultimately lead to a better architecture for their company that's going to last for decades as they build out these applications. So we're doing everything in that way, where if some new thing comes along that's better at ML training than we are, or better at SQL analytics or whatever, you can actually connect it to your data. You don't have to replatform your whole enterprise, maybe lose out on some capabilities you like from Databricks, in order to get this other thing. And you don't have to copy data back and forth and rack up zillions of dollars in data movement costs.

Yeah, great call-out there. I want to ask you about MosaicML, which you brought up. Training, big acquisition, big number: $1.3 billion. We've been tracking Naveen's company; obviously we've known him since before he sold his first company to Intel, and we've been reporting since he launched this one. Those guys have a lot of GPUs on hand and they were going to do more, so they were about to do a big CapEx build-out anyway to build up that training. Obviously the yin to training's yang is inference, but they're on the training side. Is there more work to do there on the CapEx side for the training piece with MosaicML? And how does that relate to what you're building, or have built, in the current Databricks?
Yeah, there definitely is work, and the way MosaicML handled capacity, a lot of it was a very easy-to-use SaaS model where you can submit a job and they'll run it; they have a big pool of GPUs that they can assign to different workloads, or even use internally for research when it's idle. We really like that model. It's basically a serverless model, which is also what we've been doing with data warehousing, and with model serving now, and eventually basically everything on our platform will work that way. So we think a provider like us, or like Mosaic, who can serve many customers, share a pool of resources, and optimize it, will provide inherent advantages in availability, TCO, and ease of experimentation.

How would you describe, for the folks watching who are learning about inference and training, which is harder? Can you even compare them, or are they apples to oranges? What's the difference between training and inference from an impact standpoint, from an architecture standpoint? How would you explain training and inference? Because they go hand in hand, they're yin and yang, but they're two different things.

They do, yeah. So training is when you build your model. In some cases it includes a lot of machine learning know-how: what exactly is my objective? How do I get really good examples of data? How do I even evaluate the quality? And it includes heavyweight computation on the biggest GPUs. Inference is actually serving predictions, and it can range from trivial, if you don't have a high rate of requests, say internal users in your company sending a couple of requests per minute, then it's not very expensive to serve.
It can range from that to extremely challenging if you have many automated systems, like every time a log record is written by a piece of software you want to analyze it, or if you care about super low latency, like you're doing ad bidding and you want to read the text of the web page and place an ad. So there's a huge amount of depth there. In general, inference tends to be easier to optimize: if you can train a model that does a certain thing, there are so many techniques you can use to make it faster, and these are some of the ones we package in our model serving solution and some of the ones where MosaicML has done research. It's a group of researchers who started the company.

Yeah, great description. Talk about LLMs, because I want to get into that. Everyone wants one now, and we know what they are: large language models, obviously. There are different sizes, there's proprietary, there's third party, and a long tail of open source developing nicely. When should someone build their own LLM, and when should they work with a managed service? Because there are trade-offs. You've got to get to know your data; that's the thing that came out of the conference for us, just from you guys talking to your customers: know your data and you can get value out of it. But how do I deal with LLMs? If I want one, what do I do? Do I just take my data and call it an LLM?

Yeah, great question. There is work required to build one. You could build something, but to build one that's really useful, you're going to need to do additional work. Otherwise, you'll get something that just spews back your own data in sort of a random format. But I think there are a few factors to think about. One of them is control of your data: its security, its geographic placement, stuff like that. For example, if there's personal information in there, you've got to keep it in specific regions. Or if there are trade secrets and things like that that you don't want to pass along.
So that's kind of an obvious one. A second one, which I think is even more interesting, is that you may want to build your own model if you want the absolute best performance in terms of quality and customizability. Basically, in any area where you want to be able to iterate and improve quality, you want to control all the pieces, and that's hard to do with these one-size-fits-all models. For example, something like ChatGPT gets a lot of value out of Reddit; because it's crawled Reddit, it knows random stuff people talked about. But if you're in a domain where you don't want to have that stuff, you can't really ask them to remove it. It's all baked in there.

So that's a public corpus, and they keep adding to it. That's your point about IP as well.

Yeah, exactly, not to mention the IP. And even the stuff on the web: we think all the world's knowledge is on the web, but the web is also full of junk, SEO spam, totally incorrect things, which you may not want the model to know about. And if you also care about computational cost or other interactions, like, what if I want the model to know how to talk to my internal database, see our user's latest orders, do something for them? All that stuff is easier to customize, to get to 99.9% reliability, if you own the whole stack. And then I think the third reason would be a strategic one. If you have unique data, you can probably very soon create unique AI products, and you may want a team that can do that. You could sell the ChatGPT of finance, or retail, whatever it is.

It could be video, that's the key.

Yeah, video, that's an awesome one. And you folks are doing some of these.
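Stepping back to the training-versus-inference distinction from a moment ago, the split can be sketched with a deliberately tiny model. This is purely illustrative, not anything Databricks-specific: training scans all the data once in a heavy, one-time step to produce a small set of learned parameters, while inference reuses those parameters cheaply on every request.

```python
# Toy illustration: training = one-time, data-heavy computation that produces
# a model; inference = cheap, per-request reuse of that model.

def train(examples):
    """Fit y = a*x + b by ordinary least squares over the whole dataset."""
    n = len(examples)
    sx = sum(x for x, _ in examples)
    sy = sum(y for _, y in examples)
    sxx = sum(x * x for x, _ in examples)
    sxy = sum(x * y for x, y in examples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b  # the "model": just two learned parameters

def infer(model, x):
    """Serve one prediction: a handful of arithmetic ops per request."""
    a, b = model
    return a * x + b

model = train([(1, 3), (2, 5), (3, 7)])  # scans all training data once
print(infer(model, 10))                  # each request only touches the model
```

A real LLM has billions of parameters instead of two, but the asymmetry is the same, which is why the serving side can be optimized so aggressively once the model exists.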
Well, we've got a lot to unpack there, and I want to ask you about the future, because I was hearing some people in the hallway; we were kind of riffing with some of the entrepreneurs that built on top of Databricks. You know, old-school techniques: Kumo has a recommendation engine background, and they essentially offer personalization with neural networks and graph databases as a service with you guys. So they're like a super-app on top of Databricks, and we're going to see more of those. We think we'll see more examples like that. But a lot of the old AI models were built on seed algorithms and setting up the ontologies, if you will, for language. Do you see more dynamic, ad hoc LLMs emerging in the future, where things will just grow dynamically with AI? Do you see AI going down that path?

So, adapting more to new modalities of data? I think it could happen, yeah. I think it is hard; even though LLMs are powerful, it's actually not super easy to bridge traditional machine learning and the signals there and all the metrics. So I think it's a somewhat open question, but definitely the idea of integrating LLMs with things like knowledge graphs and other structured tables makes a lot of sense. We're also seeing some really cool multimodal applications if you've got time series or images or videos. That's one of the things I'm really excited about with customizability as well: you can build models for the unique domains in your application. If you've got the best image sensors on the market, the best gene sequencers, or a jet plane with 10,000 sensors in it, you can build AIs that understand that modality.

Yeah, this is what we call the data developer that's emerging. We see a future where AI is going to be native in applications. Today, most people are thinking about AI coming into an existing application and adding value, obviously augmentation or bolt-on.
We think there's also going to be a direction where there are native AI applications that might be lighter weight and faster, but in the application, and then more horizontally scalable data products that will need to work in real time with these data sets. What's your reaction to that?

Yeah, absolutely. So you talked about the data developer. We've been seeing a lot of our top customers, major software vendors actually, who say things like, I have several thousand engineers, and I want all of my engineering teams that build a feature to also own their data, to do their own data engineering or their own ML engineering. For example, one of these vendors says, look at my user interface: everywhere you see a text field, everywhere you see a drop-down, it should be able to recommend stuff. It should tell you, for this ticket, assign it to this person, whatever it is. And of course, they want those feature teams to own it. So we're seeing that, and actually even internally, you know, we announced AI features in our SQL editors, notebooks, and search, so we're also embedding AI into those things.

I've got to say, this is like a masterclass here on theCUBE; really appreciate your time. Final couple of questions. As you look at the future, reasoning has been around for a while in AI circles, as you know; metadata got reasoning added to it. What's the big aha moment? What's the big inflection point now about reasoning that's different than before? Was it the LLMs? Is it the compute power? Is it the confluence of all these things coming together at this one time? Why is reasoning better now than it was?

Yeah, that's an interesting question. I would actually say there are different types of reasoning. I think with LLMs and the kind of stuff you can do with chain of thought, it's much easier to do fuzzy reasoning with fuzzy concepts.
So that's really the thing they've given us. Computers were already very good at precise stuff. On the other hand, there are other things like chess engines, or the optimizers that solve a giant scheduling problem. Those in some sense did reasoning, but in a very different way, and LLMs are actually not that good at it. I think you'll want to combine them as external tools. So for fuzzy stuff, where there's no precise description but there's common sense, LLMs can do it. But for stuff like solving math problems and so on, honestly, I think the existing tools are in many ways more useful, and they can already do things people can't. And maybe you'll combine them for something even better.

Yeah, that's a great point on the fuzzy versus the more specific. Let's talk about the vector database. It's been all the rage; everyone's announcing a vector database. I mean, who doesn't have a vector database these days? So why is it important to have a vector database versus using something off the shelf? I know these platforms are a big wave right now. Why a vector database? Why is it important?

Yeah, I think vector search is important: the vector index, the ability to search on these numerical embeddings that can do fuzzy matching. But I do think it's something that will probably be incorporated into many technologies, in the same way that most modern data processing engines have things like structs and arrays, basically these unstructured data types, built in. I think maybe soon they'll have vectors and vector search too. So we're definitely looking at it more holistically: in your tables, in your engine, can you just use these? But it does open up this powerful fuzzy matching. It's really more of a search concept than a database concept.

Yeah, that's my feeling.
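The fuzzy matching on numerical embeddings described above can be sketched in a few lines. This is a brute-force illustration only: the embeddings below are made up, a real system would compute them with an embedding model, and real vector indexes use approximate-nearest-neighbor structures rather than scanning every row.

```python
import math

# Minimal sketch of vector search: rank stored items by cosine similarity
# of their embeddings to a query embedding (brute force, for illustration).

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def search(index, query, k=1):
    """Return the k stored documents whose embeddings best match the query."""
    scored = sorted(index, key=lambda item: cosine(item[1], query), reverse=True)
    return [doc for doc, _ in scored[:k]]

# Hypothetical documents with invented 3-dimensional embeddings.
index = [
    ("refund policy", [0.9, 0.1, 0.0]),
    ("shipping times", [0.1, 0.9, 0.1]),
    ("return an item", [0.8, 0.2, 0.1]),
]
# A query embedding near the refund/return cluster ranks those docs highest,
# even though no exact keyword match is involved.
print(search(index, [0.85, 0.15, 0.05], k=2))
```

This is the "fuzzy matching" that keyword search can't do, and it is also why, as noted above, it slots naturally into an existing engine as just another data type plus an index.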
We'll see how it develops. It's a new area, for sure. Final question for you, and I appreciate your time. What do you have to do now at Databricks to bring this AI to life? What's net new capability, and how much are you leveraging previous work done internally?

Yeah, great question. I think we're building on a lot of the foundation we had in terms of data management, governance, data pipelines, and MLOps, including serving and monitoring models. The biggest difference with generative AI, language and image and so on, is that it's much harder to automatically evaluate the output. Before, when you were doing predictions, you could just see, hey, how many of the ads I showed did a person click on? It's just a number. With this, how can you tell whether the paragraph you created today is better than the one you created yesterday? So people are looking for new tools there. You can do some automated stuff: you can ask a model, you can look for keywords. But I think you bring in human evaluators a lot more and ask them, what do you think of this? That's, I think, the big open thing. So we have UIs now in our ML platform offering where you can compare, basically in a big spreadsheet of text snippets, and say which one you like. And we're working on various other ways to do this, including integration with partners in this space.

As chief architect, I've got to ask you a question, two parts to it. What's the coolest thing you've seen here at the event, and what's the coolest thing you're working on?

Yeah, great question. Honestly, the coolest thing I saw was the Rivian vehicle in the expo hall, and the discussion of the tech behind it. It's really amazing. It's not just, oh, it's electric, it's got a battery.
Basically, as it drives, every component of that system, which works in these extreme environments, they monitor it, they have software there, they can update it, and they learn how to make it better. So I think it's the future of a lot of the stuff you build industrially; it makes so much sense. And as for things we're working on, the thing I'm really excited about is basically opening data up to less technical users, to business users directly, with LakehouseIQ, which is this knowledge engine that learns how you query your data and how that maps to business questions. Lots of folks in the industry are working on this, but if it succeeds, I think it will open up data, AI, and so on to way more users.

Very cool to see that. One question keeps jumping into my head. Final, final question, I promise. When Ali was on stage, he said the format wars are over, and that unification message really hit home with a lot of people. Where did that come from? Is it just the philosophy of the data, you know, being open source? Because that eliminates a lot of this confusion. What does it mean when he says you're eliminating the format war?
Yeah, so this is about the different Lakehouse data formats. It's kind of an interesting evolution. For a while, everyone in the open source world was just using Apache Parquet for large data sets, but it was only a file format; it couldn't do transactions and stuff. So three different open source projects started that do that, and in all of them, 99.99% of the bytes are these Parquet files, and there are little files on the side with metadata that tell you, here's the latest version, here's the transaction log. We started Delta Lake, and there was Apache Iceberg and Apache Hudi. Then I think a lot of vendors got pressure from customers to support open data, but they were worried about there being a single format that wins, especially a competitor's format, so they would say, oh yeah, I do support it, but it's this other one. And that creates a lot of confusion for customers, right? Because these formats are so similar, we built a capability where, just in your Delta table, you can read and write it as all three formats; it simply creates that metadata three times. And you can just open it up: any tool that understands, say, Iceberg can connect through that catalog API and write to it. So, I mean, all our customers cheered when they saw this, because they had these headaches of, oh shoot, am I in the wrong ecosystem? And hopefully it gets people to focus on what you do with the data itself, and really lets customers actually move workloads between platforms and use the best one.

A win all around, congratulations. Thanks for coming on, and congratulations on the success. Tell Ali we passed along our regards; we missed him on this one. I know he was super busy as well. I appreciate you taking the time out of your super busy schedule. Thanks a lot.

Okay, it's theCUBE.
Here with the co-founder of Databricks, sharing all the data: a masterclass on what's going on with Databricks today and in the future, as Gen AI comes forward into a developer-centric, business-focused world. Thanks for coming on today. Appreciate your time.
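As a footnote to the format-war discussion above, the "create that metadata three times" idea can be sketched with plain files. The file names and metadata fields below are invented for illustration; real Delta Lake, Iceberg, and Hudi logs are far richer (transactions, snapshots, schemas), but the shape is the same: one copy of the Parquet data bytes, plus one small metadata log per format, all pointing at the same files.

```python
import json
import pathlib
import tempfile

# Toy sketch of "one copy of data, three sets of metadata". Names and fields
# are hypothetical, not the actual Delta/Iceberg/Hudi layouts.
root = pathlib.Path(tempfile.mkdtemp())

# The actual table bytes: written exactly once (stand-in for a Parquet file).
data_file = root / "part-0000.parquet"
data_file.write_bytes(b"...columnar data...")

# Tiny per-format metadata, each referencing the SAME data file.
for fmt in ("delta", "iceberg", "hudi"):
    meta = {"format": fmt, "version": 1, "data_files": [data_file.name]}
    (root / f"_{fmt}_log.json").write_text(json.dumps(meta))

# An engine that understands one format reads its own log, then the shared
# data file -- no copies, no migration between ecosystems.
logs = sorted(p.name for p in root.glob("_*_log.json"))
print(logs)
```

Since the metadata files are a tiny fraction of the table's bytes, writing them three times costs almost nothing compared to duplicating the data itself, which is what makes the approach practical.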