From theCUBE Studios in Palo Alto and in Boston, bringing you data-driven insights from theCUBE and ETR. This is Breaking Analysis with Dave Vellante. We believe today's so-called modern data stack as currently envisioned will be challenged by emerging use cases and AI-infused apps that begin to represent the real world in real time at massive data scale. Now to support these new applications, a change in the underlying data and data center architectures we think will be necessary, particularly for exabyte-scale workloads. Today's generally accepted state of the art, that is, separating compute from storage, has to evolve in our view to separate compute from data so that compute can operate on a unified view of coherent data. Moreover, our opinion is that AI will be used to enrich metadata to turn strings, i.e. ASCII code, files, et cetera, into things that represent real-world elements of a business. Hello and welcome to this week's theCUBE Research Insights powered by ETR. In this Breaking Analysis, George Gilbert and I continue our quest to more deeply understand the emergence of a sixth data platform that can support intelligent applications in real time, and to do so, we're very pleased to welcome two founders of VAST Data, CEO Renen Hallak and Jeff Denworth. Gentlemen, thanks for taking the time. Welcome. Thanks for having us. Thank you. Hey, by the way, congratulations on the recent news. If you're in the audience and you haven't heard, VAST just closed a modest $118 million financing round that included Fidelity at a valuation of nine billion, which implies a very minor change to the cap table by my math, so well done you. Okay, let's start the conversation. We want to set a baseline on today's modern data platforms with some ETR data. Here we're showing data with net score, or spending momentum, on the vertical axis and presence in the data set, i.e. the N mentions in a survey of around 1,700 IT decision makers, on the horizontal axis.
Think of it as a proxy for market presence. We're plotting what we loosely consider the five prominent modern data platforms, including Snowflake, Databricks, Google BigQuery, AWS and Microsoft, and we also plot the database king Oracle as a reference point. That red line at 40%: anything above that is considered a highly elevated net score. It's important to point out that this is the database/data warehouse sector in the ETR taxonomy. So there's a lot of stuff in there that's not representative of a modern analytics and data platform, for instance, Microsoft SQL Server. That's a limitation of the taxonomy that you should be aware of, but it allows us to look at the relative momentum. And also we're not focusing on operational platforms like MongoDB at this point in time. So the more important point we want to share is shown in the bottom right corner of the chart, and that is a diagram of what looks like a shared-nothing architecture. Now in a shared-nothing architecture, each node in the system operates independently without sharing memory or storage with the other nodes. Scale flexibility is the benefit, but ensuring coherence and consistent performance across these nodes, of course, is challenging. The modern data stack was built on shared-nothing architectures. Oracle and SQL Server originally were not. George, can you just add in the salient points here on the attributes of the modern data platform and then we'll get into it. Okay, really quickly, the modern data platform assumed a scale-out, shared-nothing infrastructure, but it also came to stand for cloud-based software-as-a-service delivery. So it's managed, it's not Hadoop on-prem. But then the pioneering vendors separated compute from storage to take advantage of that data center architecture. But typically the compute and the storage are controlled by the same vendor.
And we can't really separate data from compute until we have a lot of metadata that really puts all the intelligence about the data associated with the operational or analytic data itself. And that's what we're going to start to get into today. Okay, thank you. And let's do that now. Okay, so that is background. We want to share a chart from VAST Data which is shown right here. Guys, please share your point of view on the limitations of today's leading shared-nothing platforms, the title of this slide: consistent writes are slow. Can you please add some color here? Sure, so the term of art in the market is shared nothing, talking about essentially a systems architecture that was first introduced to the world by Google in 2003. So it's about 20 years old. And that architecture, as you and George just articulated, has the challenge that, whereas the term is shared nothing, all of the nodes within a distributed system have to be kept in contact with each other. Typically for transactional IO, where transactions have to be strictly ordered, that becomes a real challenge when these systems are also faced with scale. And so very rarely do you find systems that can scale to, in today's terms, exabyte levels of capability, but at the same time can deliver consistent performance as the systems grow and grow and grow, just because you have, internally to these architectures, just a ton of east-west traffic. And so this is one of the major problems that we set out to solve in what we've been doing. Okay, so one of the challenges customers have with their data is they have objects, tables, files that use different formats like we're showing here, and they get different metadata, and generally this creates stovepipes. So this chart depicts a big global namespace, what you call in this chart a data space, and the infinity loop from edge to on-prem data centers to the cloud.
And the way you guys position it is your data platform integrates unstructured data management services with a structured data environment in a way that you can turn unstructured data into queryable and actionable information, which is really important. The reason this is important is because it allows the disparate elements shown on this chart, objects, tables, files, et cetera, to become those things that we talked about earlier. So the question, gentlemen, is why can't today's modern data platforms do this, and what's your point of view on the architecture needed to accomplish this? I think they weren't designed for it. When these modern data platforms were built, the biggest thing you could think of was, as you say, strings: it was numbers, it was rows, it was columns of a database. Today we want them to analyze not numerical pieces of information but analog pieces of information: pictures and video and sound and genomes and natural language. These are a lot larger data sets by nature, and we're requiring much faster access to these data sets because we're no longer talking about CPUs analyzing the data, it's GPUs and TPUs, and everything has changed in a way that requires us to break those fundamental trade-offs between price and performance and scale and resilience in a way that couldn't be done before and didn't need to be done before, but now it does. If you think about just the movement that's afoot now with generative AI, with technologies like deep learning, for the first time in IT history we now have tools to actually make sense of unstructured data.
And so this is driving the coexistence within the VAST data platform of structured and unstructured data stores, because the thinking is, once you have the tools in the form of GPUs and neural networks to go and actually understand this data, well, that has to be cataloged in a place where you can then go and inquire upon the data and interrogate it and, to your point Dave, turn that data into something that's actionable for a business. And so the unstructured data market is 20 times bigger than the structured data market that most BI tools have been built on for the last 20, 30 years. And we view AI as essentially this global opportunity in the technology space to see roughly a 20X increase in the size of the big data opportunity now that there's a new data type that's processable. So let me follow up on one thing, Dave. What does the pipeline look like for training large language models that's different from what we might have done with today's modern data stack, training deep learning models where you had sort of one model for each task that you had to train on? What does that look like in dataset size, and how you curate it, and then how it gets maintained over time? In other words, one thing is you're talking about scale that's different, and then the other thing seems to be the integration of this data and this constantly enriched metadata about it and trying to unify that. Can you elaborate on where today's stack falls down? I think if you think in terms of scale, for example, if you take the average Snowflake capacity, we did some analysis of the capacity that they manage globally and divided it by the number of customers that they have, and you're somewhere between, depending upon when they're announcing, 50 to 200 terabytes of data per customer.
Our customers on average manage over 10 petabytes of data, so you're talking about something that is at least 50 times greater in terms of the data payload size when you move from structured data to unstructured data. At the high end, we're working with some of the world's largest hyperscale internet companies and software providers who talk and think in terms of exabytes, and this is not all just databases that are being ETLed into a data lake. It's very, very rich data types that don't naturally conform to a data warehouse or some sort of BI tool. And so deep learning changes all of this, and everybody's conceptions of scale as well as real time really need to change. Now, if you go back to that earlier discussion we were having about shared-nothing systems, the fact that you can't scale and build transactional architectures has always led us to this state where you have databases that are designed for transactions, you have data warehouses that are designed for analytics, and you have data lakes that are designed for essentially cost savings. And that doesn't work in the modern era either, because people want to infer and understand data in real time. If the data sets are so much bigger, then you need scalable transactional infrastructure, and that's why we designed the architecture that we did. Yeah, very good. So extending the original premise that we put forth right up front, we're kind of going back to the future into this shared-everything, scale-up architecture, and I want to explore that a little bit more. This is another chart from VAST's presentation. It depicts many nodes with shared memory, which are those kind of little squares inside of the big squares at the bottom, with access from all these connections over a very high-speed network, and the cubes represent compute. So the compute has access to all the data in the pooled memory and storage tier, which is being continually enriched by some AI and metadata magic, which we're going to talk about.
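Editor's note: the scale comparison above is easy to sanity-check with back-of-envelope arithmetic. The figures below are the ones quoted in the conversation, which are estimates rather than audited numbers:

```python
# Back-of-envelope check of the data-payload comparison quoted above.
# Capacity figures are the ones cited in the conversation, not verified.

TB = 1              # work in terabytes
PB = 1000 * TB      # 1 petabyte = 1,000 TB (decimal convention)

snowflake_per_customer_high = 200 * TB  # high end of the 50-200 TB estimate
vast_per_customer = 10 * PB             # "over 10 petabytes" per customer

ratio = vast_per_customer / snowflake_per_customer_high
print(f"unstructured vs. structured payload per customer: {ratio:.0f}x")

# Even against the most generous end of the Snowflake estimate, the
# unstructured payload is 50 times larger, matching the "at least 50x" claim.
```

Against the low end of the 50 TB estimate, the same arithmetic gives a 200x gap, which is why "at least 50 times" is the conservative framing.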
But Renen and Jeff, explain your perspective on why we need scale-up and shared-everything to accommodate exascale data apps going forward. And if I can just add to that, why is it possible now? Yeah, yeah, how is it possible? Right, yeah. It's possible because of fast networks and because of new types of media that are both persistent and fast and accessible through those fast networks, and none of these things could have been done at this level before we started. And that's another reason why we didn't see them before. Why do we need them? It's because of the scale limitations that we're reaching. It's a short blanket in the shared-nothing space. You can do larger nodes, and then you risk longer recovery times when a node fails. They can take months, in which case you can't have another node fail without losing access to information. Or you can have smaller nodes and a lot more of them, but then from another direction, you're risking failure because you have more nodes; statistically, they're going to fail more often. From a performance perspective also, as you add more nodes into one of these shared-nothing clusters, you see a lot more chatter. Chatter grows quadratically, and so you start to exhibit diminishing returns in performance. All of that limits the scale and resilience and performance of these shared-nothing architectures. What we did when we disaggregated, we broke that trade-off. We broke it in a way that you can increase capacity and performance independently, but also such that you can increase performance and resilience up to infinity, again, so long as the underlying network supports it. Got it. So go ahead, do you want to add something, Jeff? The network is what allows for disaggregation, or the decoupling of persistent memory from CPUs.
But the second thing that had to happen is some invention, and what we had to build was a data structure that lived in that persistent memory that allowed all the CPUs to access common parts of the namespace at any given time without talking to each other. So it's basically a very metadata-rich transactional object that lives at the storage layer that all the CPUs can see and talk to in real time while also preserving atomic consistency. And so once you've done that, you can think of this architecture not as a kind of classic MPP system, which most of the shared-nothing market kind of uses as a term to describe what they're doing, but rather more as a data-center-scale SMP system, where you just have cores and one global volume that all the cores can talk to in a coherent manner. So it's basically like a data-center-scale computer that we've built. And that network, by my understanding, is either InfiniBand or Ethernet, and it's got your IP to make the magic on top, right? Yeah. Yeah. Most of our customers are just using standard commodity Ethernet. Interesting. Would it be fair to say, Dave, I just want to drill down on something. The first reaction one might have is, well, we've had tens, if not hundreds of billions in investment in the scale-out data center architecture. Is it that you're drafting off the essentially rapid replacement of big chunks of that data center infrastructure for the LLM build-out, this new super-fast network with a dense topology that you're running on, because it's being replaced, or installed, so rapidly at the hyperscalers and elsewhere? I think LLMs are the first piece of this new wave. It's going to span way beyond language models and text. But yes, there is a new data center being built, in these new clouds and in enterprises, in a way that didn't exist before.
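Editor's note: the "quadratic chatter" limit Renen described a moment ago is just combinatorics. A toy sketch, not a model of any particular vendor's system:

```python
# Why east-west "chatter" limits shared-nothing scale: if every node must
# coordinate with every other node, coordination paths grow quadratically.

def coordination_paths(n_nodes: int) -> int:
    """Pairwise links in an all-to-all cluster: n * (n - 1) / 2."""
    return n_nodes * (n_nodes - 1) // 2

for n in (10, 100, 1000):
    print(f"{n:>5} nodes -> {coordination_paths(n):>8,} pairwise paths")

# Doubling the node count roughly quadruples the coordination paths,
# which is the "diminishing returns" effect. In a disaggregated design
# where nodes share one transactional state store instead of talking to
# each other, paths grow linearly: each of n nodes talks only to the store.
```

This is why the shared transactional object Jeff describes matters: it replaces O(n^2) node-to-node coordination with O(n) node-to-store traffic.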
And this new data center, its architecture perfectly matches our software architecture in the way that it looks. You have GPU servers on one side. You have DPUs in them in order to enable the infrastructure layer to be very close to the application in a secure manner. And then you have a fast network and SSDs in enclosures on the other end. And that's the entire data center. You don't need anything beyond that in these modern locations. And we come in and provide that software infrastructure stack that leverages the most out of this new architecture. Okay, I want to come back to this funny phrase that we used earlier, turning strings into things. And Jeff, you kind of alluded earlier to, and you guys have been talking about this: we envision a future where AI ultimately becomes a system of agency and can take action. And this is why it's so important to speak about things and not strings. So in previous episodes, we've explored the idea of Uber for all. And we've had Uber on the program to explain how, back in 2015, they had to go through these somewhat unnatural acts to create essentially a semantic layer that brought together transaction and analytic data, structured and unstructured, and turn those things that databases understand, ASCII code, objects, files, et cetera, into things that are a digital representation of the real world, in Uber's case, riders, drivers, locations, destinations, prices, et cetera. So guys, in this example, we have the data up top. You've got NFS and S3 or structured SQL in the form of files, objects and tables, which we can discuss in more detail. But our inference is that the metadata at the bottom level gets enriched by triggers and functions that apply AI and other logic over time. And this is how you progressively turn strings into things. Is our understanding correct? And can you elaborate, please? I think it is. The way our system is built, it's all data-based. What do we mean by that?
Data flows into the system and then you run functions on that data. And those functions are very different than what used to be run on strings. These are inference functions on GPUs. They're training functions on GPUs. They enable you to understand what is within these things and then to ask intelligent questions of it, because it's not just metadata that gets accumulated right next to the raw information. It's also queryable. All of this is new, and it brings, again, computers closer to the natural world, rather than what was happening over the last two decades, where we needed to adapt to the strings. Now computers are starting to understand our universe in a way that they couldn't before. And as you say, it will be actionable. But action is a byproduct of being able to make intelligent decisions. Well, I should say, it should be a byproduct of that. And so this is why this architecture that we built is really important, because what it allows for is the strings to be ingested in real time. As I mentioned, we built a distributed transactional architecture, the likes of which you've typically never seen before because of the crosstalk problems associated with shared-nothing systems. But at the same time, if I can take that data and then query it and correlate it against all of the other data that I have in my system already, then what happens is I have real-time observability and real-time insight into flows that are running through the system in ways that you've never had before. Typically, you'd have like an event bus that's capturing things, and you'd have some other system that's used for analytics. And we're bringing this all together so that regardless of the data type, we can essentially start to process and correlate data in real time. So George, and guys, I think it's worth stopping for a second and taking an example that could be instructive. Take AWS. I mean, awesome, right? We're talking about a 90-plus-billion-dollar company that created the cloud. Its data stores.
It's probably got, I don't know, 11, 12, 13 different database data stores. But they're very granular and, by design, the piece parts. But when you think about the metadata, there's, at least today anyway, not a unified way to get your hands around all the metadata. You've got DataZone, which might have the business metadata. Glue might have technical metadata. They use different data stores. And so that's challenging for customers to basically create this new world where you've got intelligent data apps that are taking action, as we've just been discussing. George, you and I have talked about this. Anything you would add to that? Yeah, for one, there seems to be a need to unify all the metadata, or the intelligence about the data. But then maybe you can elaborate on your sort of roadmap for building the database functionality over time that's going to do what Jeff was talking about, which is observe changes streaming in and at the same time query historical context and then take action, what that might look like. And understanding that, as you said, you built out the storage capabilities over time and now you're building out the system-level and application-level functionality. Tell us what your vision for that looks like. Yeah, so in this new era, the database starts as metadata. In fact, you can think of the old data platforms as having only metadata, because they didn't have the unstructured piece. They didn't have those things in them. But the first phase is of course to be able to query that metadata using standard query language, which is what we came out with earlier this year. The big advantage, of course, of building a new database on top of this new architecture is that you inherit that breaking of trade-offs between performance and scale and resilience. You can have a global database that is also consistent. You can have it extremely fast without needing to manage it, without needing a DBA to set views and indexes.
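Editor's note: to make "querying metadata with standard SQL" concrete, here is a toy illustration of a catalog where filesystem attributes (strings) sit next to AI-enriched tags (things) in one queryable table. The schema, column names, and values are entirely hypothetical, invented for this sketch; it uses SQLite only because it ships with Python.

```python
# Toy catalog mixing raw file attributes with AI-derived enrichment,
# queried with plain SQL. Hypothetical schema for illustration only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE catalog (
        path TEXT,
        size_bytes INTEGER,
        detected_object TEXT,   -- tag produced by an inference function
        confidence REAL
    )
""")
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?, ?)",
    [("/scans/img001.jpg", 204800, "forklift", 0.97),
     ("/scans/img002.jpg", 198656, "pallet",   0.91),
     ("/scans/img003.jpg", 215040, "forklift", 0.64)],
)

# Ask a "things" question of what started life as unstructured files:
rows = conn.execute(
    "SELECT path FROM catalog "
    "WHERE detected_object = 'forklift' AND confidence > 0.9"
).fetchall()
print(rows)
```

The point of the sketch is the shape of the question: once enrichment lives beside the data and is queryable, "find the images containing a forklift" becomes an ordinary SQL predicate rather than a pipeline between separate systems.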
And so what you get is the advantages of the architecture. As we go up the stack and add more and more functionality, we will be able to displace not just data warehouses but also transactional databases. And in fact, we break the trade-off there between row-based and column-based traditional databases. You don't need that when the underlying media is 100% random access. You don't need ETL functions in between to maintain multiple copies of the same information just because you want to access it using different patterns. You can just, again, give different viewpoints or mount points into the same underlying data and allow a much simpler environment than was possible before. Interesting. So you've got a 20X increase in the big data opportunity. There's a TAM increase there, as well as potentially eating away at some of the traditional methods. I want to talk more about data management and how data management connects to new data center architectures. We're showing here on this chart: shared persistent memory eliminates the historical trade-offs. We touched a little bit upon that, but let's go deeper. So using shared persistent memory instead of slow storage for writes, in combination with a single tier of, let's say, all-flash storage, brings super low latency for transactional writes and really high throughput for analytical reads. So this ostensibly eliminates the many-decades-long trade-off between latency and throughput. So guys, I wonder if you could again comment and add some color to that. And maybe one other thing before you start, Jeff, which was something we missed when we were first looking at Snowflake, which was they are now claiming to add unstructured data, just as Oracle did many, many years ago, but there was a cost problem. And one other thing: you've talked about the scale issue, but I don't know if you guys have done any cost comparisons when you're at that 20X scale. I'm sure you have. Okay.
So to start, in terms of the data flow, as data flows into the system, it hits a very, very low-latency tier of persistent memory. And then as the, what we call the write buffer, fills up, we take that data and we migrate it down into an extremely small columnar object. If you're building a system in 2023, it doesn't make sense to accommodate spinning media. And if you have an all-flash data store, the next consideration is, well, why would you do things the way classic data lakes and data warehouses have been built, around streaming? With flash, you get high throughput with random access. And so we built a very, very small columnar object. It's about, I don't know, about 4,000 times smaller than, let's say, an Iceberg row group. And that's what gets laid down onto SSDs, and that is then designed for extremely fine-grained, selective query acceleration. And so you have both: about 2% of the system is that write buffer, and the remaining 98% of the system is that columnar-optimized random access media. And then when I put the two together, you can just run queries against it. You don't really care. So if some of the data that you're querying is still in the buffer, of course you're reading on a row basis, but it's the fastest possible persistent memory that you're reading from. And so in this regard, even though it's not a data layout that's optimized for how you would think queries naturally should happen, it turns out that we can still respond to those data requests extremely fast thanks to persistent memory. Now, to the point I think you were getting to, George, around cost: we started with the idea that we could basically bring an end to the hard drive era, and every decision that we make from an architecture perspective presumes that we have random access media as the basis for how we build infrastructure. But cost is a big problem, and flash still carries a premium over hard-drive-based systems.
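Editor's note: the ingest flow Jeff describes, a small low-latency write buffer that drains into compact columnar objects, is broadly LSM-like and can be sketched in a few lines. This is a toy model for intuition only, not VAST's actual on-media format; real systems add transactions, compression, indexing, and crash safety.

```python
# Toy model of a tiered ingest path: rows land in a small row-oriented
# write buffer; when it fills, they are flushed as a columnar chunk.
# Queries merge both tiers, so readers never care where data sits.

from collections import defaultdict

class TieredStore:
    def __init__(self, buffer_limit: int = 4):
        self.buffer_limit = buffer_limit
        self.write_buffer = []      # row-oriented, low-latency tier
        self.columnar_chunks = []   # read-optimized tier

    def insert(self, row: dict) -> None:
        self.write_buffer.append(row)
        if len(self.write_buffer) >= self.buffer_limit:
            self._flush()

    def _flush(self) -> None:
        # Pivot buffered rows into one column-oriented chunk.
        chunk = defaultdict(list)
        for row in self.write_buffer:
            for col, val in row.items():
                chunk[col].append(val)
        self.columnar_chunks.append(dict(chunk))
        self.write_buffer = []

    def scan_column(self, col: str) -> list:
        # Analytical read: columnar chunks first, then the hot buffer.
        values = []
        for chunk in self.columnar_chunks:
            values.extend(chunk.get(col, []))
        values.extend(row[col] for row in self.write_buffer if col in row)
        return values

store = TieredStore(buffer_limit=3)
for i in range(5):
    store.insert({"id": i, "price": i * 10})

print(store.scan_column("price"))   # [0, 10, 20, 30, 40]
```

Note how `scan_column` hides the tiering from the reader, which mirrors the "you can just run queries against it, you don't really care" point: some values come from columnar chunks, the newest from the row buffer.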
And that's why most data lakes today are still optimized for hard drives, but people weren't really thinking about all that could be done from an efficiency perspective to actually reconcile the cost of flash such that you can now use it for all of your data. And so there's a variety of different things that we do to bring a supernatural level of efficiency to flash infrastructure. One of those is basically a new approach to compression, which allows us to look globally across all the blocks in the system, but compress at a level of granularity that goes down to just two bytes. And so it's the best possible way to find patterns within data on a global basis, and it is designed to be insensitive to noise within data so that we always find the best possible reductions. Typically, our customers see about a four-to-one data reduction, which it turns out is greater than the delta between flash and hard drives in the marketplace today. So it's completely counterintuitive, but the only way to build a storage system today that is cheaper than a hard-drive-based data store is to use flash, because the way that you can manipulate data allows you to find a lot more patterns and reduce data down in ways that would just not be sensible to do with spinning media. Right. Okay, thank you for that explanation. And then, Ken, if you bring up the last slide, instead of reading the talking points, I want to just summarize and then give you guys the last word. We see VAST as a canary in the coal mine, and what we mean by that is this new approach to data center architectures shines a light on the limitations of today's data platforms. So of course the big question over time is, you know, is VAST a replacement for, or a complement to, the sixth data platform? And the second part of that question is, okay, how are the existing data platforms going to respond?
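Editor's note: Jeff's counterintuitive cost claim reduces to simple arithmetic. If the data-reduction ratio exceeds the per-terabyte price premium of flash over disk, the effective cost of the flash system is lower. The 4:1 reduction is the figure quoted in the conversation; the raw media prices below are assumed purely for illustration:

```python
# Effective $/TB after data reduction. The 4:1 reduction is the quoted
# figure; the raw media prices are assumed, not market quotes.

hdd_price_per_tb = 20.0   # assumed raw $/TB for disk
flash_premium = 3.0       # assumed: flash costs 3x disk per raw TB
flash_price_per_tb = hdd_price_per_tb * flash_premium

reduction = 4.0           # "about four to one" global data reduction
effective_flash = flash_price_per_tb / reduction

print(f"disk, raw:        ${hdd_price_per_tb:.2f}/TB")
print(f"flash, raw:       ${flash_price_per_tb:.2f}/TB")
print(f"flash, effective: ${effective_flash:.2f}/TB")

# 60 / 4 = 15 < 20: whenever reduction > premium, reduced flash
# undercuts raw disk per logical terabyte stored.
```

The crossover condition is just `reduction > flash_premium`, which is exactly the "greater than the delta between flash and hard drives" comparison made above.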
And we think the determining factor is going to be the degree to which companies like VAST can evolve their ecosystem of ISVs to build out the stack. Of course, the starting point, as we heard earlier, is really a full-blown, you know, analytic DBMS, but I'll turn it over to Renen, Jeff, George. You know, take your last shot. Maybe, Renen, since I'm always adding, you know, a question from the cheap seats before you get going. My question would be, you know, as you grow into a richer, you know, aspiring replacement for the full modern data stack, customers are still going to want to take their tools with them. You know, even though you bring data in without ETL, you've still got to take raw data and turn it into refined data products, and for those pipelines, the observability, the quality rules, they have tools they like. How do you expect or intend to accommodate that sort of migration over time? I think in the same way that we have done until today: we support standard protocols, we support standard formats, and we support multi-protocol and multi-format access. For example, if you look back to our storage days, you're able to write something via a file interface and then read it through an object interface. That allows you to continue working in the same way that you did, but to move into the new world of cloud-native applications. The same is true at the database layer. Us providing the standard formats and protocols allows you to start using the VAST platform without needing to change your application, but over time enables you to do things that you couldn't do before. Back to Dave's question about does it start with a DBMS? I think the last wave of data platforms did, because it was focused on structured data. For the next wave of data platforms, I like to use the analogy of the operating system.
It's no longer just a data analytics machine, it's a platform that provides that infrastructure layer on top of new hardware and makes it easy for enterprises and cloud providers without those specialized skills to take advantage of these new AI abilities. And that's what we see in the future, and that's what we're trying to drive in the future. Yeah, Jeff, we've got to hear from you. You've got to give us your last thoughts on all this discussion, please. Yeah, I think we generally think that solving old problems isn't that fun. And so if you think about what's happening now, people are now trying to bring AI to their data, and there's a very specific stack of companies that are starting to emerge. And we talked a little bit about our funding. We went and did some analysis of the companies of real size, so companies that are valued over $5 billion, that have increased their valuation by over 100% this year, and it's a very, very select number of organizations. At the compute layer, it's NVIDIA. At the cloud layer, it's a new emerging cloud player called CoreWeave, which is hosting a lot of the biggest large language model training and inference applications. It's us, it's Anthropic, and it's OpenAI. And that's it. Yeah. So something is definitely happening, and we're trying to make sure that we can not only satisfy all these new model builders, but also take these same capabilities to the enterprise so that they can all benefit. Yeah, I don't know if it was your chart or somebody else's chart, but I saw that recently. There are various bars, and most of them were under the zero line. So thank you for your last thoughts. George, I'll give you the final word. I think this was kind of enlightening for us, to see something that shines a light from a new perspective on the shared-nothing architecture, which we've grown over two decades to take for granted.
And once you have to revisit that assumption, when you can think about scale-up as being back instead of scale-out, then everything you build on top of that can be rethought. And so VAST is helping us question those assumptions, but I thought it was particularly interesting to see, you know, the idea that we need to unify files, objects and tables as sort of one unified namespace, but that there's no real difference between data and metadata. It all kind of blurs as you turn these strings into things. And the one thing, you know, that we keep harping on is that ultimately, when you turn strings into things, there is a semantic layer that has to represent these things richly, and the relationships between them. And that's what's going to allow us to separate compute from data, you know, where you can have composable applications talking to a shared infrastructure. And it sounds like you're building towards that vision. I remember last June, Jeff, watching you, I think it was you and John Furrier, talking at the Databricks event as part of our editorial coverage, and you were saying, you know, we're not really just a storage company. And I remember saying, hmm, what does that mean? So we're starting to see that come into focus. Guys, it's so helpful to have folks like you that are articulate, technically oriented, but also business oriented, to help our community understand what the future is going to look like. We really appreciate your time today. Thank you. Thank you. Thank you. And I want to thank Ken Schiffman, who's flying solo today. Alex Myerson is also on production and manages the podcast. Kristen Martin and Cheryl Knight help get the word out on social media and on our newsletters, and Rob Hof is our editor in chief over at SiliconANGLE.com. Remember, all these episodes are available as podcasts. Wherever you listen, just search Breaking Analysis podcast.
I publish each week on wikibon.com, which is being rebranded to thecuberesearch.com, and on siliconangle.com. And you can email me at david.vellante@siliconangle.com or DM me at dvellante if you want to get in touch, or comment on our LinkedIn posts, and do check out etr.ai. They've got great survey data on the enterprise tech business. This is Dave Vellante for George Gilbert and theCUBE Research Insights powered by ETR. Thanks for watching, everybody, and we'll see you next time on Breaking Analysis.