Welcome back everyone, live here in the Palo Alto CUBE Studios, I'm John Furrier with Dave Vellante. This is VAST Presents: Build Beyond, a live event launching their data platform. Jeff Denworth, the co-founder, is here on the stage with us. Great to see you, Jeff. Thanks for coming on. Hey, John. Congratulations, big launch. We had your co-founder on earlier, CEO Renen, talking about the vision, the business trade-offs. Big idea, a data platform perfectly positioned for this market as one half of the world is kind of retooling and refactoring, and the other half is powering up with AI, big time, a lot of action with data. Great time to have this platform. Yeah. Yeah, the way we look at it is, you know, VAST has the fortune of working with some of the largest AI projects in the world. And, you know, you watch the collection of technologies that people need to put together to build things like deep learning pipelines. And when we started to study the market, it was very clear that the concept, or the term of art, of a data platform has very specific meaning, right? You can put together an assemblage of, you know, a compute layer with a data layer with a storage layer, and then kind of put it all together and wrap it up in a bow so that it doesn't even really require infrastructure people. You were talking to Renen about how the target customer changes to more data developers. Well, then it just makes infrastructure really simple. But we realized there was a huge gap as people moved away from business reporting systems and data warehousing to actual deep learning. There's just nothing in the space. So we're here to make it much simpler than it's been in the past. In your keynote, you unpacked everything. The video is awesome. Explain the infrastructure impact here, because the storage industry is going to be disrupted by this. This isn't just storage. It's infrastructure. It's a layer. It's horizontal. 
It's got the specialism for the applications, for data developers. It's a very unique positioning. Explain the impact the way you see it on the infrastructure, when people start deploying the VAST Data Platform. I think we think about solving first-principles problems. And the way that we see it is there's a lot of companies out there that are trying to optimize bad architecture decisions that were made maybe 10 or 20 years ago. And we looked at it and said, well, what if you could solve the root of the problem? And so I just wrote a blog this morning. It's called the Grand Unification Theory of AI Infrastructure. And the thinking is, well, first of all, if you can make flash as cheap as disk, then you just need one tier of infrastructure. And oh, by the way, that infrastructure becomes exponentially faster, because now all your data is on a much larger pool of flash than you ever had. So access to data is unlocked. And then we looked over in the database space and we said, well, if you can build a system that's transactional and can handle streams, but also enables you to analyze data at any level of scale, squash that stack. So now I've got two stacks that have been squashed. I'm talking about the squashing of stacks. And then you take the next step and you say, okay, why do you have to have independent database services and unstructured data services? Put those together. And that almost goes back to projects that Microsoft was working on years and years and years ago, where they tried to build an unstructured database. Okay, well, now I've got this one data system in my environment. Let's squash all the data centers together, both edge and cloud and on-prem data centers, and just have one namespace and one compute engine that allows you to do everything in real time, with insights that go all the way to the archive. It's a big, big simplification that we're aiming at. 
So you're blowing away a lot of these conventions. I mean, the storage hierarchy I thought would never collapse. I mean, I thought it actually would with flash, and then I kind of gave up on it. Now you guys are making that happen. Maybe it comes back with genomic storage. Yeah, right, sure. But so this notion of DASE, disaggregated, shared-everything architecture. I want to understand why it's important, but specifically why it's important in the context of AI. Yeah, so if you think about what's come before us, pretty much every distributed system was built from a concept of data partitioning, right? Google laid the foundation for this idea 20 years ago when they invented the Google File System, and they said, okay, let's build large clusters of commodity servers and just build partitions out of all your nodes, right? And so that gave birth to the whole file system industry, the whole object storage industry, the whole NoSQL database industry, the whole data lake industry, the whole hyperconverged industry. Like, the impact of that paper that Google wrote is probably at least hundreds of billions if not trillions of dollars. And as we saw it, what we realized is that the price of partitioning is transactional consistency, because once I split my data across all these nodes, to keep them in concert with each other everything slows down to the lowest common denominator, right? That same idea is also true for distributed web-scale storage across multiple data centers. But if you can just build cores that are all stateless, that all have the same access to a global volume of SSDs as one big volume, well, then you can basically take a von Neumann style architecture and abstract it all the way up to web scale, data center scale. And so now you don't have machines that have to talk to each other in order to interact with the data store. 
That not only works really well for things like deep learning, where you're doing things like distributed checkpoint operations across all of your data, which require a lot of parallel writes, or distributed read operations where you're kind of reading in a common input directory that may have locks required on it. Like, all that stuff happens inside storage and slows everything down. And we said, let's get away from that by building something that's just truly parallel. And it pays dividends not just at the file layer; as we step up to the database layer, you've solved fundamental database scaling problems that have been around for decades by going to this disaggregated, shared-everything approach. So you're right, it's like a caravan where everything has to slow down for the slowest truck. Sure, right, okay. So that's what's happening with partitioning. But then how do you deal with, I think you've got this, locking at the edge or wherever. Okay, so that is the point, the single point of control is whoever has the lock, is that right? So there's an idea of a data owner, and that's somebody that takes a claim of ownership of that data, but ownership isn't authoritative, right? And so anybody that's in the network can say, I wanna own that data for some period of time, or borrow it from you. And in this case, what happens is the lock manager then just gets pushed out to the edges of the network where the transactions are actually happening. Now that really doesn't make sense if it's just one big clunky service that has to move from data center to data center. But if you can bring it all the way down to the smallest data element, which is either a file or an object or a table, then you can have all of these different sites operating on all different parts of the namespace, and very rarely will they actually conflict with each other or overlap with each other. But when they do, there's consistency management to ensure that it's strictly consistent. 
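The ownership idea Jeff describes, a data owner holding a non-authoritative, borrowable claim on a single file, object, or table, can be sketched as a toy in-memory lease model. Everything here (class names, lease duration, path strings) is illustrative, not VAST's actual API:

```python
import time

class LeaseManager:
    """Toy model of per-element ownership leases. Instead of one central
    lock service, the 'lock manager' logically travels with each data
    element, so disjoint parts of the namespace never contend."""

    def __init__(self, lease_seconds=30):
        self.lease_seconds = lease_seconds
        self.owners = {}  # element path -> (owning site, lease expiry)

    def acquire(self, element, site, now=None):
        """A site claims (or borrows) ownership of one data element.
        Ownership is not authoritative: the lease simply expires
        if the owner stops renewing it."""
        now = time.time() if now is None else now
        owner = self.owners.get(element)
        if owner is None or owner[1] <= now or owner[0] == site:
            self.owners[element] = (site, now + self.lease_seconds)
            return True
        return False  # conflict: another site holds a live lease

    def release(self, element, site):
        if self.owners.get(element, (None,))[0] == site:
            del self.owners[element]

# Sites working on different parts of the namespace rarely conflict:
mgr = LeaseManager()
assert mgr.acquire("/datasets/a/file1", "site-tokyo")
assert mgr.acquire("/datasets/b/file2", "site-london")      # disjoint element, no conflict
assert not mgr.acquire("/datasets/a/file1", "site-london")  # live lease held elsewhere
```

The consistency-management step in the real system is far richer, of course; the point of the sketch is only that pushing per-element leases to the edges makes cross-site conflicts the rare case rather than the common one.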
And okay, so how do you carry through that strict consistency, like when something goes wrong and you've gotta recover? Sure, so there's two layers to the VAST DataSpace. The first layer is access, right? And so we wanna be able to provide any site around the world with access to your data at any given time. And then we do things like buffering and prefetch and caching in order to make sure that that's fast. But availability is an entirely independent topic. So I could have thousands of sites that all access the same data. But if I wanna go and actually ensure that I can fail over across data centers, then replication schemes are at a layer underneath the access layer. Does that make sense? Yeah. Yeah, so customers can have N-to-one replication schemes, one-to-N, whatever they wanna do. And that is totally independent from their access policies, where they can just see their data and transact on their data anywhere. And they have the flexibility to determine how to deploy that? Sure, sure, it's all based on policy. You'll now start to be able to deploy this into cloud environments as well. We're walking before we run. So we'll start with relatively small cache nodes. Then you'll be able to build large stretch clusters by the end of the year in different public cloud platforms. And customers really shouldn't think about or care about where their data is upon access. Admins should only think about where it is in terms of availability policies. And the system accommodates both. It's just policy-based. You guys will support whatever those environments are. Obviously multiple clouds, saw that on there. Well, whatever is a gratuitous term. But AWS, Google, and Azure to start. We've been partnering with all of them to kind of build towards their best-practices architectures. We'll add more as customers need. Edge, on-premise, all there. Edge is already here. 
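The separation Jeff draws, access policy (who can see and transact on data, anywhere) versus availability policy (how data is replicated for failover), can be illustrated with a small sketch. The function, scheme names, and site names are all hypothetical; VAST's real policy model is not described in this interview beyond "N-to-one, one-to-N, whatever they wanna do":

```python
def plan_replicas(scheme, sites):
    """Return (source, targets) pairs for a toy replication scheme.
    'one-to-N': one primary fans out to every other site.
    'N-to-one': every site replicates into one consolidation site.
    Note what is absent: nothing about who may *access* the data --
    that lives in a separate, independent policy layer."""
    if scheme == "one-to-N":
        return [(sites[0], sites[1:])]
    if scheme == "N-to-one":
        return [(s, [sites[-1]]) for s in sites[:-1]]
    raise ValueError(f"unknown scheme: {scheme}")

sites = ["edge-a", "edge-b", "dc-virginia"]
# Both edge sites fan into the data center for failover; meanwhile the
# access policy could still let every site read and write everywhere.
plan = plan_replicas("N-to-one", sites)
```

The design point is that an admin tunes `plan_replicas`-style availability policy while applications never need to know where their data physically lives.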
So we have a few partners here. HPE is obviously a great partner. We did a lot of work with their GreenLake for Files launch. We work with another company called Mercury Computer. Actually, they're Boston-based. You probably know them. Yeah, I saw them on your slide. I know them. So Mercury is crazy, they're building standardized systems that go into battlefield deployments or peacekeeping missions and things like this. And at some point it'll go into space. We're not there yet. Jeff, you've been around the industry for a long time. You've seen the storage, the server, the networking, the classic infrastructure business. You guys have this platform. It's resonating. The AI wave is perfect for you right now, we're seeing that in some of the training models you mentioned. What do some of the skeptics say? Obviously the big customers at scale, you probably walk right in, they get it right away. But what about the customers where you roll in and they say, well, we have the other guys' storage stuff? What are some of the skeptics saying, what do you say to them, and what happens next? Well, I think there's two layers to it. The storage layer: if you think about the biggest systems that people are building for deep learning, large NVIDIA SuperPOD-style systems, everybody's kind of fought against standards in this space forever, because the perception was that NAS wasn't ready for this level of scale. Now, technologies like NFS are kind of villainized in this story, because historic implementations of NAS systems always had that partitioning and scaling problem that I talked about earlier. What people didn't realize is if you solve the architecture problem, then you can use NAS for pretty much anything. 
And so we kind of turn the conversation around and say, well, if your enterprise systems are now ready for SuperPOD levels of supercomputing performance, well, then you can bring AI to your data as opposed to having to bring data to your AI. And that actually is a pretty transformative kind of thing that's happening in the market. And what we're about to do over the next four weeks is basically announce a string of customer announcements, and you will know they are the people that are buying not just a few dozen GPUs, but thousands to tens of thousands of them to train on that infrastructure. So the storage layer was always, can I use enterprise infrastructure for that? The answer is yes. Now, the second consideration is, as you move to this mode where you're refining structure from unstructured data, the data engine comes into play. And to the point that you guys were making earlier, that's a developer discussion. And this is a huge step up that we need to make. So we need to build an entire developer relations community. And if Renen is still listening, the marketing budget will have to grow by about 4x. But yeah, we're ready for it. I mean, he talked a lot about the business-side trade-offs, and the architectural trade-offs, as you mentioned. That's something that we're seeing a lot of now with the next-gen cloud. We talk about it all the time: supercloud, super apps are here. AI is a big influence on the developer community. I mean, I think there's like 172 projects in open source right now working on these kinds of projects. How do you talk to that data developer from an architecture standpoint? And also the solution architect who's saying, okay, I'm redoing DevOps. I've got security shifting left. Now data is going to have to be rethought at the architectural level from an ops standpoint, yet enabling the developer. 
Well, I think the separation between application and storage, whether you're talking about unstructured data or database infrastructure, has always been an encumbrance to people that ultimately want to solve that first-principles problem, which is, I need pipelines, pipelines that are easy to program and easy to deploy. And so our thinking is, data has a ton of gravity, as we all talk about. But if you can essentially attach code to your data, then you have a more declarative environment that you can work with, which takes the complexity of all those different stack components out of the equation. And so that's a long-winded way of saying, we think that there's something simpler here in the synthesis of data and code. And we need to work a lot over the next couple of years to convince the market. Well, again, you're blowing it away. I mean, we've seen efforts to sort of optimize the data pipeline. If I can get 10% faster or 20%, take 20% of the cost out of the pipeline, that's lucrative for my customers. But you're rethinking that whole thing, that whole concept. And like you said, you're squashing that stack, operationalizing basically the data portion of my business, which has never been done before. I once had a meeting with an enterprise customer that said, unless you can talk in terms of 10x gains, you have no business being in the room. And so we try to build all products with that thinking in mind. And just on the topic of operational efficiency, think about, well, what if I could solve for anti-gravity, and what if I could move data to wherever available processor resources are, in my estate of not only the data centers that I'm managing, but also the elastic resources that you have in the cloud? What we realized by talking to a few of the largest cloud vendors is that if you can just move data to wherever available processors are, you're talking about massive, massive savings that can come into play. 
Not only just on the infrastructure side, but also on energy. Like, you've got all these idle machines just sitting around, and if you can move the data and then get them working, it's a big, big partnership. You're paying for these underutilized resources, right? 100%. So I want to ask you, we joke sometimes that AI was invented last November when ChatGPT came out. So when I was listening to you this morning, I wrote down a question, and I want to make sure I got it right. It looks like you're essentially building, you've built this architecture that can store and organize exabytes of data. And then you're scheduling these compute functions across this globally distributed set of AI supercomputers. I think scheduling is probably not the term I would use, because data itself creates the event, right? And so as long as you've programmed it appropriately, there is no need for job scheduling. There's just this continuous series of operations that happen: data rolls into the system, we realize, okay, it needs to be catalogued, it needs to be inferred upon. That inference creates some sort of metadata enrichment, and that may trigger more functions to happen. So we don't want to schedule, we just want the data to do its thing. Good clarification. So that's organic. So you're organically applying these functions across this distributed set of what I call AI supercomputers, is that fair? I think it is. And so explain how this is different from the generative AI that everybody sees today. Oh, well, we don't develop language models, we just build the infrastructure for people to build models on. And so I think in some respects, what we see is that a lot of the LLM work today is being done on text, right? If you think about even the earliest forms of GPT, these are not very large corpuses of data. GPT-3 was rumored to train on something like 40 terabytes, right? 
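The event-driven flow Jeff sketches, data arrives, gets catalogued, gets inferred upon, and the resulting metadata enrichment may itself trigger further functions, can be modeled as a tiny trigger system. This is a toy illustration of the pattern, not VAST's data engine; the class, decorators, and record fields are all invented for the example:

```python
class DataPipeline:
    """Toy event-driven model: there is no job scheduler. The arrival of
    data (or of freshly enriched metadata) is itself the event that
    fires functions."""

    def __init__(self):
        self.triggers = []  # list of (predicate, function) pairs
        self.done = []      # records that no trigger fires on anymore

    def on(self, predicate):
        """Register a function to fire whenever a matching record lands."""
        def register(fn):
            self.triggers.append((predicate, fn))
            return fn
        return register

    def ingest(self, record):
        fired = False
        for predicate, fn in self.triggers:
            if predicate(record):
                fired = True
                # Enrichment re-enters the system and may fire more functions.
                self.ingest(fn(record))
        if not fired:
            self.done.append(record)

pipe = DataPipeline()

@pipe.on(lambda r: "catalogued" not in r)
def catalogue(r):
    return {**r, "catalogued": True}

@pipe.on(lambda r: r.get("catalogued") and "label" not in r)
def infer(r):
    # Stand-in for an inference step that enriches metadata.
    return {**r, "label": "cat" if "whiskers" in r["blob"] else "other"}

pipe.ingest({"blob": "whiskers and fur"})
# pipe.done now holds the fully catalogued and inferred record.
```

Nothing here is scheduled: cataloguing fires because uncatalogued data arrived, and inference fires because the catalogue step's enrichment arrived, the cascade Jeff describes as "the data doing its thing."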
40 terabytes, if I go back to what George said about a 60-terabyte SSD, like that's not even one SSD, right? So it's not a ton of data. And so what you get from that is a system that doesn't necessarily understand what the physics are and can't reason about what's happening around the world; they're just kind of reciting back what they've learned. And so our observation is we're starting to see some people think about deploying not just, you know, exabyte-scale infrastructure, but a few tens or a few hundreds of exabytes. And if you wanna represent the natural world to an AI engine, just show it everything, show it everything and train it not only on all observations but also all learnings that have come from those observations. And here I think we don't have a word for this, but actually a few weeks ago, the DeepMind team announced a new sorting algorithm that was discovered by the Google AI effort, like the AI itself discovered this new kind of algorithmic approach. And that's the glimmer of something to come that we think we wanna enable even more, so that these machines can solve grand challenges. In terms of the generative AI piece, you mentioned earlier you're training models for folks, you're providing the infrastructure for people to train models. What are you seeing there for performance? The bankers call that picks and shovels, by the way. Yeah, so I mean, everyone's doing it right now. You see a lot of people storing their data for legal reasons, for compliance reasons. We were talking before you came on camera about some of the legal aspects of where AI's coming in. So gotta explain that. You got GDPR issues, we saw that in the video. Oh, sure, sure, sure. So what's happening is that you've got all these foundational models that are coming out, and people are then looking to refine them and tune them and parameterize them towards their business. 
If I take ChatGPT and I'm Exxon or Shell or something like that, it's not gonna help me go find oil. But if I can take these basic linguistics and understanding tools and then apply them against the corpus of data that these organizations are sitting on, you can start to get more enriched answers back. And so what we see is that there's this gold rush right now for people that are building the foundations, and then the enterprises are now waking up and saying, okay, now that we have that base tool, we need to go apply some really interesting stuff within our environments, which can be on-prem or in cloud, to build something that's suitable for our businesses. And so we think this gold rush will continue on for some time. All right, well, great stuff. It's starting. Great to have you on. That's right. Real quick final question. Data lakes have been all the rage. Data platforms now seem to be taking the data lake concept and kind of bringing it to life. You guys talk about that. How do people rationalize the difference between the data lake and the data platform? How would you articulate the differences? Cause a lot of the enterprise has been riding on that data lake wave for a while. Actually, in the announcement we're going to make next week, it's a very, very large AI environment, they call it an AI lake, which I thought was an interesting term. But a data lake is just a component, right? So if I use Hadoop as a point of reference, I think that's where the term was popularized. Actually, the first CUBE I was on was at Strata. Data lakes have a connotation of just being this unstructured place that you put your data, which ultimately then needs to get refined through some other process until you get it to a point where it looks like a data warehouse, and other companies come in and say, okay, well, that's a lakehouse, right? But there's still an added component, which is, what's the execution engine? 
What's the computing engine that sits on top of this? And that's, I think, the big distinction between what a data platform is and all of these lakes and lakehouses: you need both to come together. You just want a programming environment that customers can go into and refine their data with. Jeff, thanks for coming on Build Beyond. Congratulations on your launch, a lot of years, a lot of great customers. Thanks for coming on theCUBE. Thanks, John. Appreciate it. Okay, we'll be right back with more Build Beyond here in the Palo Alto Studios. I'm John Furrier with Dave Vellante. Stay with us. We'll be right back with NVIDIA up next.