Hello everyone, welcome back to SuperCloud 4, live in our Palo Alto studios. My name is Dave Vellante. You hear a lot about the arms race for GPUs and talent, but not enough about the arms race for data. I'm here with Rob Strechay, and Andy Pernsteiner is also here. He's the CTO of VAST Data. Andy, good to see you. Thanks for coming in.

Hey, thanks for having me.

Yeah, so arms race for data, right? Quality data. When you talk to customers, they say they're spending a lot of time trying to get their data act together, and they have to do that before they can really take advantage of LLMs. Chris Mellor just wrote a really interesting article on this idea of model collapse, and you were prominently featured in it. What's the basic premise behind that article and some of your thinking at VAST?

Well, honestly, what we've been finding, at least with the customers we talk to, is that their main goal right now is obviously to capture as much of the data they're generating as possible and keep it as long as they can, but also to curate and cleanse it so that it's useful for training against. I had a hypothesis, and I'm not the only one with it. I've also done a little research on model collapse, and there are different camps that either believe in it or don't. Honestly, it's probably well beyond my realm of expertise to say whether it's mathematically likely or not, but the premise is basically that because so much of the data being generated now comes from generative AI technologies like GPT-4 and others, the data being proliferated into the world is data that isn't really useful for training on. And I think that even OpenAI and other organizations are having to purchase and acquire real data, because they don't want to be left in a place where they can only train against data that is synthetic.
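The dynamic behind model collapse can be illustrated with a toy simulation (a sketch for intuition only, not anything from VAST or the article): each "generation" of a model is trained only on the most probable outputs of the previous generation, mimicking how generative models over-sample high-probability regions. The tails of the distribution, the improbable events, disappear, and the spread of the data shrinks generation after generation.

```python
import numpy as np

# Toy model-collapse sketch: fit a Gaussian to the data, draw synthetic
# samples from the fit, then keep only the "highly probable" middle of
# the distribution before training the next generation on it.
rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=10_000)   # generation 0: "real" data
spread = [data.std()]

for _ in range(8):
    mu, sigma = data.mean(), data.std()
    synthetic = rng.normal(mu, sigma, size=10_000)
    # Over-rotate on the probable: truncate the tails at 1.5 sigma.
    data = synthetic[np.abs(synthetic - mu) < 1.5 * sigma]
    spread.append(data.std())

print([round(s, 2) for s in spread])  # spread shrinks every generation
```

With real data mixed back in at each step, the shrinkage slows or stops, which is one way to see why labs keep buying non-synthetic data.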
And that's why my assertion, which other people I've spoken with share, is that for any kind of organization, manufacturing, retail, finance, there's data being generated about their customers, their supply chain processes, all of their intellectual property. That's real data, and it can be used to build models that improve their processes. But ultimately, the way I see it, they're going to start creating data products. Not only data products in the sense of data they can sell, but also models they can sell and allow people to refine on their own. So that's where we see things headed, and the number of people investing very heavily in LLM training is kind of proof of that to me.

And the model collapse concept, is it that if there's too much synthetic data, training tends to over-rotate on the highly probable and undercount the improbable? Is that right?

Yes, and I think there are ways the people developing the models can work around it or fine-tune against it, but it's extra effort compared to training against real data. If everything you're getting is synthetic, it's challenging to know whether that's a real signal or a fake signal. It's very easy for bias to get into the models, and very easy for the models to become polluted over time. And I think that's one of the challenges in preventing model hallucinations especially: the input has to be quality. It can't be garbage.

Yeah, I think that's what we're seeing when we're talking to customers that are building data products.
They're building these data products around what a customer is and how the customer journey goes, and how they're going to model that journey. They look at it as, hey, here's what we think customers go and do, and they're starting to dig into that. So they're gathering signals from all over the place, from mobile, from laptops and what have you, trying to cleanse them and understand how it all works. I think what they're seeing is that it's the improbable that really matters, almost like gambling on baccarat, where the improbable is eventually going to happen. And to your point about model collapse, that seems to validate it, right? The improbable is what you're really looking for, because that's when people are exiting, say an abandoned shopping cart or something like that. VAST is really underpinning a lot of these large AI cloud service providers now. Are you seeing them get into that data cleansing business and leveraging the platform you guys have?

I think they are. The evidence we see is in where the effort goes. Imagine it this way: you're going to rent GPUs from a cloud service provider, whether it's someone like CoreWeave or G42 or Lambda or any of these new AI-focused cloud service providers. Your goal, as someone paying for something, is to pay only for what you're actually using, as much as possible. And the challenge is that getting data to a state where you can actually use it for training is very difficult. You don't want to waste those precious GPU hours you're renting on cleansing and prepping the data. So a lot of effort right now is focused on that area, and there are technologies that can actually leverage GPUs to do data prep and ETL.
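The kind of pre-training cleansing pass being described might look like the sketch below (the function and thresholds are illustrative assumptions, not a VAST or CSP API): deduplicate records and drop fragments too short to be useful, so that rented GPU hours go to training rather than to dirty data.

```python
import hashlib

def cleanse(records):
    """Hypothetical cleansing pass: strip, length-filter, and dedupe text records."""
    seen = set()
    clean = []
    for text in records:
        text = text.strip()
        if len(text) < 20:   # drop fragments unlikely to help training
            continue
        # Hash a normalized form so exact duplicates (ignoring case) are dropped.
        digest = hashlib.sha256(text.lower().encode()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        clean.append(text)
    return clean

raw = ["Customer abandoned cart after shipping-cost page.",
       "customer abandoned cart after shipping-cost page.",
       "ok", "Support call resolved billing dispute in 4 minutes."]
print(cleanse(raw))
```

In practice this stage runs at scale with GPU-accelerated ETL frameworks, but the shape of the work, filter, normalize, deduplicate, is the same.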
And so that's another area where we're seeing people refine their processes to get the most out of these GPUs, because not only can they be expensive at scale, they're also somewhat hard to come by. When something is hard to get and costs a lot of money, you'd better be ready for it when you have it.

So you guys have had some success with these specialized cloud service providers. You mentioned CoreWeave, and I think you just signed up some others recently. It hit the news, so I think that's public.

It is, otherwise I wouldn't be allowed to say it.

Okay, so explain why these emergent CSPs are more viable than the traditional big cloud players, who are so virtualization focused: one system, many customers. Can't they pivot to be as good as a specialist? Explain your thinking on that.

I think they can, to some extent, and I think what it comes down to is scale. I was recently giving a talk in London to an audience of data scientists, data engineers, and people who in theory are involved in AI, deep learning, and machine learning, and the track we were on was a generative AI track. I asked everybody if they knew what an H100 was, which is one of NVIDIA's most popular GPUs, and nobody knew what it was.

Really?

So there's this big disconnect between the people who are practicing and the infrastructure they're using. But I think what's happening is that the people at that conference, especially, are focused on smaller-scale projects. They're just getting started. And for those cases you can get started pretty much anywhere: on your laptop, or in one of the major cloud service providers, without having to go to one of these specialized ones.
But once you find that you have to scale the amount of data you're training on, the number of GPUs, and the orchestration required to keep all of that running in concert, that's where you need somebody who's more specialized. The networking requirements alone for deep learning training are well beyond what most of the general-purpose cloud service providers have built for. They built for a general-purpose world, for everybody running all types of applications at once. They didn't build for people running these very high-performance, very IO-intensive deep learning projects; it just wasn't in their thought process. That doesn't mean they can't build these things out, but they also have to protect their existing customer revenue and not bifurcate their position. So we see the new AI-focused cloud service providers as the ones who can bring scale in a way that lets enterprises and commercial entities deploy AI and machine learning without having to build the infrastructure themselves. Obviously we've been an infrastructure-focused company for several years, and we typically sell directly to customers, but we're very excited to work with these new cloud service providers, because it gives the customers trying to start and deploy large-scale AI projects the ability to do so without having to build it all themselves.

Something else you talked about with Chris Mellor was the need for turnkey solutions. This is something Rob, George, and I have been kicking around: turnkey, integrated solutions versus a more modular approach. When you see all the action around things like Iceberg and Delta and Hudi and these different formats, there's a modular stack emerging that's potentially quite competitive versus the integrated approach I think you're espousing.
Maybe you could talk about that a little bit and help us understand your point of view.

Well, the three technologies you mentioned are relatively focused on analytics. Delta, Iceberg, and Hudi are all effectively table formats on top of the Parquet file format, and traditionally speaking they're deployed on cloud storage offerings: AWS S3, Google Cloud Storage, and Azure ADLS. But I think there's a distinction. Those are very heavily focused on analytics, which means the data has to be structured to get into them in the first place, whereas a lot of the LLM-based deep learning projects, and of course all the multimodal projects, start with unstructured data. The platform we've developed has the ability to do both, and I think that as customers learn more and more about how to actually do deep learning on unstructured data, they're going to need a platform that can handle both. That's not to say you couldn't do it with what we might call a cloud data lake or data warehouse, but the problem is that you have to put everything into a rigid structure first. You can't put images into the same rigid structure you can put text into, right? For LLMs, maybe it's possible to leverage more of a data-warehouse-style implementation for training, but ultimately you're adding a step to the equation if you have to transform all of your data into something that can be inserted into what are effectively rows and columns in a database.

Yeah, but have things really changed that much? Deep learning is not new. These problems have been around in high-performance computing for quite a while, and the data problem there has been around for a while too. NASA has been at it forever; you can see all kinds of things they used to use right down the street here at Moffett Field.
How is it that things have really changed in how people are using data, and why has it just ballooned?

I think that in the world of HPC, most people, even people in my field, weren't really aware of what most of these HPC institutions were doing. We didn't know what TACC was up to. You knew to some extent what NASA was doing, just because of their name, but behind the scenes nobody really knew. I think the difference now is that the data people are training against, and the types of insights they're deriving, come from everyday, real-world data that everybody is generating. It's almost as if you went back a couple of decades, before the iPhone existed, and introduced technologies like the ones you're seeing now: they would be so foreign to people that they wouldn't make any sense. They would stay in the back world of HPC or some very specialized place. But now the human psyche is prepared for these revolutions in technology, and the data being trained against is data people understand. It's data that comes off the cars they drive, or the pictures they take, or the calls they're on. And of course, once OpenAI released ChatGPT late last year, they unveiled to the world that this is for everybody now, not just for people off in a corner somewhere.

Is that bringing HPC into the mainstream, or are they moving on to harder problems?

You know, it's interesting. I mentioned TACC a moment ago because they're a customer of ours, and one of the reasons they're choosing us is our integration with AI and deep learning applications, especially because we have accelerations for GPUDirect Storage and we've proven to be a very fast NAS platform for large-scale analytics and deep learning.
And they're finding that most of the research organizations using their platform are clamoring for GPUs for training and testing. So whether they want to or not, and I'm sure they want to, because they like to win big grants and build bigger and bigger systems, they're being pulled into this world. For the classical HPC engineers, oftentimes a GPU is just another type of processor, right? It has a different letter at the front. There are some considerations they're learning they have to make, though, because it's not always the same paradigm people have been using for HPC; they have to start thinking about things a little differently. But I think the engineers who have been doing HPC for a long time are embracing GPUs. If you go to any of the national labs, they're using them in a heavy way. Everybody's using them. And we're starting to see enterprise HPC organizations bring them into the fold as well. It's practically the first question we ask every customer now, whether they have projects that involve GPUs, and almost every single one says yes.

So going back to Big Data London and the personas that were there, you're seeing a disconnect between the folks actually building the models and algorithms and the infrastructure they're running on. Is model ops, or MLOps or what have you, really the thing that's plugging that gap for those people?

It might be. I think it might even be another layer down. And a lot of credit goes to the cloud for abstracting people away from the underlying infrastructure. I've been in the infrastructure world for some time, but I also realize that most people don't care. In many ways it's the plumbing of the IT world: no one wants to have to deal with it.
Everyone just assumes it's dealt with already. So I think there is a disconnect between the underlying infrastructure and a lot of the people who are coming up with the ideas, asking questions of the data, and building the pipelines. What we want to do is keep making the infrastructure invisible to people, while giving them something that brings much faster time to insight at a very cost-effective price, because I think that's what's going to propel us forward. I don't think having the fanciest, fastest everything is going to do it. It has to be cost-effective, it has to be deployable at scale, and it has to be easy to use.

And that was a great setup, because that's what SuperCloud is all about: hiding and abstracting the complexity of the underlying clouds and building value on top of them, and the data layer is obviously one of those opportunities. Andy, thanks so much for coming on.

Thank you so much.

You've been watching SuperCloud. Thanks a lot. All right, keep it right there. John Furrier, Rob Strechay, and I will be back. David Glick is up next; he's from Walmart. We've had Walmart guests on before, and they've got deep tech expertise, so we're going to find out how they're using large language models. Keep it right there. We'll be right back at SuperCloud 4, live from Palo Alto.