From around the globe, it's theCUBE, with digital coverage of AWS re:Invent 2020, sponsored by Intel, AWS, and our community partners.

Everyone, welcome back to theCUBE's coverage of AWS re:Invent 2020 virtual. This is theCUBE virtual. I'm John Furrier, your host. This year we're not in person; we're doing remote interviews because of the pandemic, and the whole event's virtual over three weeks. For this week, we're going to be having a lot of coverage in and out of what's going on with the news, all that stuff here happening on theCUBE. Our next guest is a featured segment with Ram Venkatesh, VP of Engineering at Cloudera. Welcome back to theCUBE. CUBE alumni, last time you were on was 2018, when we had physical events. Great to see you.

Likewise, good to be here. Thank you.

So, you know, Cloudera obviously modernized up with Hortonworks; that combination has been together for a while. Always pioneering this abstraction layer, originally with Hadoop, now with data. All those right calls were made. Data is hot, it's a big part of re:Invent. That's a big part of the theme, you know, machine learning, AI, AI, AI, edge, edge, edge, data lakes on steroids, higher-level services in the cloud. This is the focus of re:Invent, the big conversations. Give us an update on Cloudera's data platform. What's new?

Absolutely. You know, you're really speaking my language with the whole data lake architecture that you alluded to, right? So, Cloudera's mission has always been about, you know, we want to manage half the world's data. What this means for our customers is being able to aggregate data from lots of different sources into central places that we call data lakes, and then apply lots of different types of processing to it to derive business value.
With CDP, with the Cloudera Data Platform, what we have essentially done is take those same three core tenets around data lakes, multi-function analytics, and data stewardship and management, and add on a bunch of cloud-native capabilities. So fundamentally, I'm talking about things like disaggregated storage and compute, being able to now take advantage of not only HDFS, but also, at a pretty deep, fundamental level, cloud storage. This is the form factor that's really, really good for our customers to operate at from a TCO perspective, if you're going to manage hundreds of petabytes of data, like a lot of our larger customers do.

The second key piece that we've done with CDP has to do with us embracing containers and Kubernetes in a big way. On-prem, our heritage is around machines and clusters and things of that nature, but in the cloud context, especially in the context of managed Kubernetes services like Amazon's EKS, this lets us spin up our traditional workloads, SQL, Spark, machine learning and so on, in these containerized Kubernetes environments, which lets the customer spin these up in seconds as opposed to tens of minutes. And as their processing needs grow and shrink, they can actually scale much, much faster up and down, to make sure that they have the right cost-effective footprint for their compute. And the third key piece, go ahead, please.

Go ahead, third piece.

The third key piece of all of this, right, along with cloud-native orchestration and cloud-native storage, is that we've embraced this notion of making sure that you actually have a robust data discovery story around it, right? So increasingly, the data sets that we create on top of a platform like CDP themselves have value in other use cases, right?
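The scale-up-and-down point above can be sketched in a few lines. This is a toy illustration of the elasticity principle, not Cloudera's or EKS's actual autoscaling logic; the function name and the backlog-based policy are invented for the example.

```python
# Toy sketch: choose a compute pod count from the current query backlog,
# within cost-guardrail bounds. Illustrative only -- real platforms use far
# richer signals (CPU, memory, SLAs) than queue depth.

def desired_replicas(queued_queries: int, per_pod_capacity: int,
                     min_pods: int = 1, max_pods: int = 50) -> int:
    """Return how many compute pods to run for the current backlog."""
    if per_pod_capacity <= 0:
        raise ValueError("per_pod_capacity must be positive")
    # Ceiling division: enough pods to drain the queue in one round.
    needed = -(-queued_queries // per_pod_capacity)
    return max(min_pods, min(max_pods, needed))

print(desired_replicas(0, 10))     # idle: scale down to the floor of 1 pod
print(desired_replicas(95, 10))    # busy: 10 pods
print(desired_replicas(9999, 10))  # spike: capped at max_pods, 50
```

The cap and floor stand in for the "right cost-effective footprint" idea: capacity tracks demand, but never runs away from the budget.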
So we want to make sure that these data sets are properly replicated, properly secured, properly governed, so you can go and analyze where a data set came from. Capabilities of security and provenance are increasingly more important to our customers. So with CDP, we have a really good story around that data stewardship aspect, which is increasingly important as you get into the cloud and you have these sophisticated sharing scenarios there.

You know, Cloudera and Hortonworks, both companies, have always had strong technical chops. It's well-documented; certainly theCUBE's been to all the events and covered both companies since the inception of big data 10 years ago. But now we're in cloud: big data, fast data, little data, all data, this is what the cloud brings. So I want to get your thoughts on the number one focus of problem-solving around cloud. Do I migrate, or do I move to the cloud immediately and be born there? Now, we know the hyperscalers were born in the cloud; companies like the Dropboxes of the world were born in the cloud, and all the benefits and goodness came with that. Say I'm pivoting, I'm a company coming out of COVID with a growth strategy. Lift and shift, okay, that's over now, that's the low-hanging fruit, those use cases are kind of done, been there, done that. Is it migration or born in the cloud? Take us through your thoughts on, what does a company do right now?

I think it's a really good question. Think of where our customers are in their own data journey, right? A few years ago, I would say it was about operating infrastructure; that's where their head was at. Increasingly, I think for them, it's about deriving value from the data assets that they already have. And this typically means combining data from different sources, be it structured data, semi-structured data, transactional data, non-transactional data, event-oriented data, messaging data.
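The "analyze where a data set came from" capability can be pictured as a graph of derived-to-source relationships. Below is a minimal, hypothetical sketch of that lineage idea; the class and dataset names are invented, and real stewardship tooling in platforms like CDP tracks far more (schemas, owners, access policies).

```python
# Toy lineage tracker: record which source datasets each derived dataset
# was built from, then answer "what is everything upstream of X?"

class LineageGraph:
    def __init__(self):
        self._parents = {}  # dataset name -> list of source dataset names

    def record(self, output, inputs):
        """Note that `output` was derived from the given input datasets."""
        self._parents[output] = list(inputs)

    def provenance(self, dataset):
        """Return all upstream datasets, transitively."""
        seen = set()
        stack = list(self._parents.get(dataset, []))
        while stack:
            src = stack.pop()
            if src not in seen:
                seen.add(src)
                stack.extend(self._parents.get(src, []))
        return seen

g = LineageGraph()
g.record("clickstream_clean", ["clickstream_raw"])
g.record("revenue_report", ["clickstream_clean", "orders"])
print(sorted(g.provenance("revenue_report")))
# -> ['clickstream_clean', 'clickstream_raw', 'orders']
```

The transitive walk is the point: a governance question about a report reaches all the way back to the raw feeds it was built from.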
They want to bring all of that together and analyze it, to make sure that they can actually identify ways to monetize it, ways that they had not thought about when they originally stored the data, right? So I think it's this drive towards increasing monetization of data assets that's driving the new use cases on the platform. Traditionally, it used to be about SQL analysts, or, if you were a data scientist, using Apache Spark, so it was sort of this one function that you would focus on with the data. But increasingly, we are seeing collaborative use cases, where you want a little bit of SQL, a little bit of machine learning, a little bit of potentially real-time streaming, or even things like Apache Flink, that you're going to use to actually analyze the data.

So in this kind of an environment, we see that the data that's being generated on-prem is extremely relevant to the use case. But for the speed at which they want to deploy the use case, they really want to make sure that they can take advantage of the cloud's agility and infinite capacity. So really, the answer is, it's complicated. It's not so much about, I'm going to move my data platform that I used to run the old way from here to there; it's about, I've got this use case, and I've got to stand this up in six weeks, right, in the middle of the pandemic. How do I go do that? And the data for that has to come from my existing line-of-business systems. I'm not going to move those over, but I want to make sure that I can analyze the data from there in some cohesive way. Does that make sense?

Totally makes sense. And just to kind of bring that back for the folks watching, I remember when CDP was launched, and these data platforms really were there to replace the data warehouses, the old, antiquated way of doing things. But it was interesting, it wasn't just about competing in that old category. It was a new category.
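The "combine different shapes of data" idea above can be sketched with nothing but the standard library. In a real CDP pipeline this would be Spark SQL or Flink over a data lake; here, as an illustration with made-up data, a CSV of orders (structured) is joined with JSON click events (semi-structured, event-oriented) into one analyzable view.

```python
# Toy example: enrich structured order rows with counts derived from
# semi-structured event data, producing a single combined view.
import csv
import io
from collections import defaultdict

orders_csv = io.StringIO("order_id,user,amount\n1,alice,30\n2,bob,45\n")
events = [
    {"user": "alice", "event": "view"},
    {"user": "alice", "event": "click"},
    {"user": "bob", "event": "view"},
]

# Aggregate the event stream per user.
events_per_user = defaultdict(int)
for e in events:
    events_per_user[e["user"]] += 1

# Join the aggregate onto each structured order row.
combined = [
    {**row, "events": events_per_user[row["user"]]}
    for row in csv.DictReader(orders_csv)
]
print(combined)
```

The shape is the same whether the sources are two in-memory lists or petabyte-scale tables: aggregate one source, join it to the other, analyze the result in one place.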
So yeah, you had to have some tooling, some SQL to wrangle data, and have some prefabricated data fenced off somewhere in some warehouse. But the value was the new use cases of data, where you don't know where the value is going to come from until it comes, right? Because if you make it addressable, that was the idea of the data platform and data lakes, and then having higher-level services. So to me, that's, I think, one distinction: a new category, coexisting with and disrupting an old category, data warehousing. I always bought into that. You know, and there were some technical things Spark could do, all these elements and mechanisms underneath; that's just evolution. But in comes cloud, and I want to get your thoughts on this, because one of the things that's coming out of all my interviews is speed, speed, speed, deploying at large scale at very large speeds. This is the modern application thinking. Okay, to make that work, you've got to have the data fabric underneath. This has always been kind of the dream scenario, so it's kind of playing out. So one, do you believe in that? And two, what is the relationship between Cloudera and AWS? Because I think that kind of interestingly points to this one piece.

Absolutely. So I think that, yeah, from my perspective, this is what we call the shared data experience that's central to CDP. The idea is that, you know, data that is generated by the business in one use case is relevant and valid in another use case. That is central to how we see companies leveraging data for the second-order monetization that they're after, right? So I think this is where getting out of a traditional data warehouse-like data silo context, and being able to analyze all of the data that you have, is really, really important for many of our customers. For example, many of them increasingly hold what they call these data hackathons, right? Where they are looking at, can we answer this new question from all the data that we have?
That is a type of use case that's really, really hard to enable unless you have a very cohesive, very homogeneous view of all of your data. When it comes to the cloud partners, increasingly we see that the cloud-native services, especially the core storage, compute, and security services, are extremely robust. They give us, you know, scale that's truly unfathomable in terms of how much data we can address, how quickly we can actually get access to compute on demand when we need it. And we can do all of this with a very, very mature security and governance fabric that you can fit into, right? So we see that, you know, technologies like S3, for example, have come a long way. And along the journey with Amazon over the last seven, eight years, we've both learned how to operate our workloads. When you're running at petabyte scale, you really have to pay attention to matters like scale-out and consistency and parallelism; all of these things matter significantly, right? And there's a certain maturity curve that you have to go through to get there.

The last part of that is that the TCO is so optimized that the customer can operate this without any ops on their side, right? They can just start consuming data, even if it's a petabyte of data. And so this means that we have to have the smarts in the processing engines to think about things like caching, for example, very, very differently. Because the way you cache data that's in HDFS is very different from how you would do that in the context of S3. Or similarly, the way you think about consistency and metadata is very, very different at that layer. But we've made sure that we can abstract these differences out at the platform layer, so that as an application consumer, you really get the same experience, whether you're running these analytics on-prem or whether you're running them in the cloud.
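The "abstract the differences at the platform layer" point above can be sketched as dispatch on the storage URI scheme, with each backend free to handle caching its own way. This is a hand-wavy illustration of the principle only; the class names, schemes, and caching policy are invented, not Cloudera's actual filesystem layer.

```python
# Toy storage abstraction: application code asks for a URI; the platform
# routes it to a scheme-specific backend. The S3-style backend caches reads
# (object-store GETs are comparatively expensive); the HDFS-style one does
# not. The caller never sees the difference.
from urllib.parse import urlparse

class HdfsBackend:
    def read(self, path):
        return "hdfs-read:" + path  # stand-in for a direct cluster read

class S3Backend:
    def __init__(self):
        self._cache = {}  # cache object reads, unlike the HDFS path

    def read(self, path):
        if path not in self._cache:
            self._cache[path] = "s3-read:" + path  # stand-in for a GET
        return self._cache[path]

BACKENDS = {"hdfs": HdfsBackend(), "s3a": S3Backend()}

def open_dataset(uri):
    """Same call for the application, different machinery underneath."""
    return BACKENDS[urlparse(uri).scheme].read(uri)

print(open_dataset("hdfs://namenode/warehouse/sales"))
print(open_dataset("s3a://bucket/warehouse/sales"))  # cached on repeat reads
```

The consumer-facing contract (`open_dataset`) is identical on-prem and in the cloud; only the hidden backend changes, which is the portability argument being made above.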
And that's really central to how I see this space evolving: we want to meet the customer where they are, rather than forcing them to change the way they work because of the platform that they are sitting on top of.

So could you take a minute to explain some of the integrations with AWS and some customer examples? Because, first of all, cost is a big concern on everyone's mind. It's still lower cost and higher value with the cloud anyway, but it could get away from you, so you're constantly, petabytes at scale, there's a lot of data moving around. That's one thing. Two, integration with higher-level services. Can you explain how Cloudera integrates with Amazon? What's the relationship? The customer wants to know, hey, you guys are partnering; explain the partnership and what it means for me.

Absolutely. So the way we look at the partnership, it's really a four-layer cake, okay? The lowest layer is the core infrastructure services. We talked about storage and compute and security and IAM and so on and so forth. At that layer, it's a very robust integration that goes back a few years. The next layer up from that has to do with, increasingly, as our customers use analytic experiences from Cloudera, they want to combine that with data that's actually in the AWS compute experiences, like, say, Redshift, for example. That's the analytics layer, the Cloudera Data Warehouse offering and how it interacts with the other services in Amazon that could be relevant. This is where common, open source file formats really help us, to make sure that there's a very strong level of interop at the analytics layer. The third layer up from that has to do with consumption.
Like, say, if you're going to bring an analyst on board, you want to make sure that all of their SQL-like analyst experiences, notebooks, things of that nature, there's really strong interop there at the third layer. And the highest layer is really around data sharing. So as newer AWS services and technologies like that become more prevalent, customers want to make sure that the data estates they have in the different clouds can actually interoperate. So we provide ways for them to browse and search data, regardless of whether that data is on AWS or on-prem. And that's sort of the fourth layer in the stack. The vertical slice running through all of these is that we have a really strong business relationship with them, both on the commercial go-to-market side as well as in AWS Marketplace, right? By having CDP be a part of AWS Marketplace, this means that if you have an enterprise agreement with Amazon, you can actually pay for CDP through the credits that you've already purchased. And this is a very, very tight relationship that's designed, again, for these large-scale, speeds-and-feeds kinds of customers.

So just to get this right, I love the four-layer cake; the icing is the success of CDP. Love that, and the birthday candle goes on top if it's successful. But you're saying that you're going to market with Amazon two ways: the marketplace listing, and then also jointly with their enterprise field programs. Is that right? Because they have this program where you can bundle into the blanket POs or PO processes. Is that right? Can you explain that again?

Yeah, so if you think of it, right, the data estates that we're talking about are significant.
So we want to make sure that, you know, we are really aligned with them in terms of our cloud migration strategy, and in terms of how the customer actually executes through what is a fairly complex deployment; deploying a large, multi-function data estate takes time, right? So we want to make sure that we navigate this jointly with AWS, so that from a best-practices standpoint, for example, we are very well aligned, and from a cost standpoint, what we're telling the customer architecturally is very well aligned. That's where I think the real heart of the engineering relationship between the two companies is at.

So if you want Cloudera on Amazon, you just go in and click to buy, or if you've got a deal with Amazon in terms of a global marketplace deal, which they've been rolling out, I can buy there too, right?

Exactly.

All right. Well, Ram, thanks for the update and the insight. Love the four-layer cake. Love to see the modernization of the data platform from Cloudera, and congratulations on all the hard work you guys have been doing with AWS.

Thank you so much. I appreciate it.

Okay, good to see you. I'm John Furrier. You're watching theCUBE, theCUBE virtual, for AWS re:Invent 2020 virtual. Thanks for watching.