Welcome to theCUBE, I'm Dave Vellante. Today we're going to explore the ebb and flow of data as it travels into the cloud and the data lake. The concept of the data lake was alluring when it was first coined last decade by CTO James Dixon. Rather than being limited to the highly structured, curated data that lives in a relational database, in the form of an expensive and rigid data warehouse or data mart, a data lake is formed by flowing data from a variety of sources into a scalable repository, like, say, an S3 bucket, that anyone can access. They can dive in, extract water (aka data) from that lake, and analyze data that's much more fine-grained and far less expensive to store at scale. Now, the problem became that organizations started to dump everything into their data lakes with no schema on write, no metadata, no context, just shoving it into the data lake to figure out what's valuable at some point down the road. Kind of reminds you of your attic, right? Except this is an attic in the cloud, so it's too big to clean out over a weekend.

Well look, it's 2021, and we should be solving this problem by now. A lot of folks are working on it, but often the solutions add other complexities for technology pros. So to understand this better, we're going to enlist the help of Chaos Search CEO Ed Walsh and Thomas Hazel, the CTO and founder of Chaos Search. We're also going to speak with Kevin Miller, vice president and general manager of S3 at Amazon Web Services, and of course, they manage the largest and deepest data lakes on the planet. And we'll hear from a customer to get their perspective on this problem and how to go about solving it. But let's get started. Ed, Thomas, great to see you. Thanks for coming on theCUBE.

Likewise. Always, look, face to face, it's really good to be here.

It is nice, face to face. That's great. So Ed, let me start with you. We've been talking about data lakes in the cloud forever. Why is it still so difficult to extract value from those data lakes?

Good question. I mean, data analytics at scale has always been a challenge, right? So we're making some incremental changes, as you mentioned, but we need to see some step-function changes. In fact, that's the reason Chaos Search was founded. But if you look at it, it's the same challenge around data, whether it's a data warehouse or a data lake: it's not just flowing the data in, it's how to get insights out. It falls into a couple of areas, but the business side will always complain, and it's pretty uniform across everything in data lakes, everything in data. They'll say, hey, listen, I typically have to deal with a centralized team to do that data prep, because it's data scientists and DBAs. Most of the time they're a centralized group. Sometimes they're in business units, but most of the time they're pulled together because they're scarce resources. And then it takes a lot of time. It's arduous, it's complicated. It's a rigid process of dealing with that team. It's hard to add new data, and it's very hard to share data. And there's no way to do governance without locking it down. And of course, they want to be more self-service. So you hear that from the business side constantly. Now, underneath, there are some real technology issues: we haven't really changed the way we do data prep since the 2000s, right? So if you look at it, it falls into two big areas. One is how you do data prep. A request comes in from a business unit: I want to do XYZ with this data, I want to use this type of toolset to do the following. Someone has to be smart about how to put that data in the right schema, as you mentioned. You have to put it in the right format that the toolsets can analyze before you do anything. That's the first area, and I'll come back to it, because it's the biggest challenge. The second challenge is how these different data lakes and data warehouses persist data: the complexity of managing that data and the cost of computing on it. I'll go through that. But the biggest thing is actually getting value from raw data. The rigidness and complexity the business side runs into is that someone literally has to do this ETL process: extract, transform, load. A request comes in, I need this much data put together in this particular shape, and they're literally, physically duplicating data and putting it together in a schema. They're stitching together almost a data puddle for each of these different requests.
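To make that physical ETL step concrete, here's a minimal sketch, with hypothetical bucket, key, and field names, of what "physically duplicating data into a schema" looks like in practice: raw events are pulled out of the lake, forced into the one shape a downstream tool expects, and written back as a separate curated copy.

```python
# Hypothetical sketch of the physical ETL step described above.
# Bucket names, keys, and field names are illustrative, not from any real system.
import json
import boto3

s3 = boto3.client("s3")

# Extract: pull raw events out of the data lake.
raw = s3.get_object(Bucket="raw-data-lake", Key="events/2021-10-01.jsonl")
events = [json.loads(line) for line in raw["Body"].iter_lines()]

# Transform: force every record into the single schema the BI tool expects.
# Fields that don't fit are dropped, trading data away for rigid queryability.
curated = [
    {"ts": e["timestamp"], "user": e.get("user_id"), "action": e.get("action")}
    for e in events
]

# Load: write a duplicate, curated copy, the "data puddle" for one request.
s3.put_object(
    Bucket="curated-data-puddle",
    Key="bi-extract/2021-10-01.jsonl",
    Body="\n".join(json.dumps(r) for r in curated).encode("utf-8"),
)
```

Every new request repeats this copy-and-reshape cycle, which is why the skilled-resource bottleneck Ed describes next compounds so quickly.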
And what happens is, any time they have to do that, someone has to do it, and those are very skilled, scarce resources in the enterprise, right? It's the DBAs and data scientists. And then when the business wants new data, you give them a data set and they're always saying, well, can I add this data? Now that I've seen the reports, I want to add this data, fresher, and it's the same process all over again. This takes about 60 to 80% of the data scientists' and DBAs' time. It's well documented. And this is what actually brings the process to a halt. It has to be rigid, because there's a process around it. That's the biggest challenge in doing this, and in the enterprise it takes weeks or months. I always say three weeks or three months, and no one challenges me on that. It also ties up the same skill set of people you want driving digital transformation, data warehousing initiatives, modernization, being data driven: all those data scientists and DBAs you don't have enough of. So this is not only hurting you getting insights out of your data, like data warehousing. The resource constraint is also hurting you actually getting those initiatives done.

So the smallest atomic unit is that team, that super-specialized team, right?

Yeah.

Okay. So you guys talk about activating the data lake for analytics. What's unique about that? What problems are you solving? What's the magic sauce you guys created?

There are a lot of things, but the biggest one I'd highlight is how you do the data prep, and then how you persist and use the data. In the end, there are a lot of challenges in getting analytics at scale, and this is really what Thomas founded the team to go after. But I'll try to say it simply, and I'll compare what we do to what you'd do with, say, an Elasticsearch cluster or a BI cluster. What we do is simple: your data is in S3. Don't move it, don't transform it. In fact, we're against data movement. You point us at that data, we index it, and we make it available in a data representation where you can give virtual views to end users. Those virtual views are available immediately, over petabytes of data, and they're presented to the end user as an open API. So if you're an Elasticsearch user, you can use all your Elasticsearch tools on this view. If you're a SQL user, Tableau, Looker, all the different tools, same thing. Same with machine learning next year.
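As a hedged sketch of what that open API looks like from the consumer side (the endpoint, credentials, and view name below are hypothetical placeholders, not Chaos Search's documented interface): because a published view speaks the Elasticsearch API, a standard client can query it like any index.

```python
# Illustrative only: querying a published virtual view through a standard
# Elasticsearch client. Endpoint, api_key, and view name are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://search.example.com:9200", api_key="...")

resp = es.search(
    index="app-logs-view",  # a published virtual view, not a physical index
    query={"match": {"level": "ERROR"}},
    size=10,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```

The point of the design is that nothing in this snippet is vendor-specific: existing Elasticsearch tooling runs unchanged against the view.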
So we make it very simple. The data is already there in S3; just point us at it. We do the hard work of indexing it and making it available, then you publish the open API and your users can use exactly the tools they use today. That's dramatic, and I'll give you a before and after. Let's say you're doing Elasticsearch, doing log analytics at scale. Today, teams land their data in S3 and then they ETL it. They physically duplicate and move the data, typically deleting a lot of it, to get it into a format Elasticsearch can use. They persist it in a data layer called Lucene, physically sitting on memory, CPU, and SSDs, and it's not one node, it's a bunch of them. In the cloud, you have to stand those up and keep them running 24/7, because they're persisting data. Not a very cost-effective way to do cloud computing. What we do, in comparison, is literally point at the same S3. In fact, you can run us in complete parallel: while the data in S3 is being ETLed out to other systems, we're just one more use case, read-only. We take that data and build these virtual views. So we run in complete parallel, but we just give a virtual view to the end users. We don't need that persistence layer, that extra cost, time, and complexity. Look at what happens with Elastic: there's a trade-off between how much you want to keep and how much you can afford to keep, and it also becomes unstable at times, because the schema lives on servers. The more the schema scales out, guess what, you have to add more servers. Very expensive, up 24/7, and they become brittle: lose one node and the whole thing has to be rebuilt. We have none of that cost and complexity. You keep whatever you want on S3, a single, very cost-effective persistence layer, and we save 50 to 80% on cost. Why? We don't follow the old paradigm of spinning up servers for persistence and keeping them running 24/7. We ask our customers what they want to keep, bring up the right compute resources for the query, and release those resources when the query is done. So we can run queries at a scale they can't imagine, and we can run the exact same query at 50 to 80% savings. And customers don't have to do any of the toil of moving the data or managing that layer of persistence, which is not only expensive, it becomes brittle.
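A back-of-the-envelope illustration of where a figure like 50 to 80% can come from (every rate and hour count below is a hypothetical assumption, not a measured number): an always-on cluster bills around the clock regardless of load, while per-query compute bills only for the time queries actually run.

```python
# All numbers are hypothetical assumptions for illustration.
always_on_node_hourly = 1.00        # $/hour per hot node (assumed)
nodes = 10
always_on_monthly = always_on_node_hourly * nodes * 24 * 30        # $7,200

per_query_hourly = 2.00             # $/hour while a query runs (assumed)
aggregate_query_hours = 600         # total query time per month (assumed)
on_demand_monthly = per_query_hourly * aggregate_query_hours       # $1,200

savings = 1 - on_demand_monthly / always_on_monthly
print(f"savings: {savings:.0%}")    # about 83% under these assumptions
```

The actual savings obviously depend on query volume; a cluster that is genuinely busy around the clock would erode the advantage, which is why the argument targets bursty analytics workloads.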
And I'll be quick: once you go to BI, it's the same challenge. With BI systems, the requests are constantly coming from a business unit down to the centralized data team: give me this flavor of data, I want to use this analytics tool on that data set. So the team has to build all these pipelines. They're constantly saying, okay, I'll give you this data and this data, duplicating it, moving it, stitching it together. And the minute you want more data, they do the same process all over again. We completely eliminate that, and the request queues along with it.

Thomas, Ed had me at "you don't have to move the data." That's kind of the exciting piece here, isn't it?

Absolutely. I think the data lake philosophy has always been solid, right? The problem is we had that Hadoop hangover, where, let's say, we were using that platform in a few too many ways. I always believed in the data lake philosophy when James coined it. I thought, that's it. However, HDFS wasn't really a service. Cloud object storage is a service. The elasticity, the security, the durability, all those benefits are really why we founded on cloud object storage as a first move.

So Ed was talking, Thomas, about being able to shut off the compute so you don't have to keep paying for it. But there are other vendors out there. Snowflake does something similar, separating compute from storage; they're famous for that. And you have Databricks out there doing their lakehouse thing. Do you compete with those? How do you participate, and how do you differentiate?

Well, you've heard these terms: data lake, warehouse, now lakehouse. What everybody wants is simple in, easy in. But the problem with data lakes was the complexity of getting value out. So I asked, what if you could have the easy in and the value out? If you look at, say, Snowflake as a warehousing solution, you have to do all that prep and data movement to get into that system, and then it's rigid, it's static. Databricks and the lakehouse have the exact same issue. Sure, they have a data lake philosophy, but their data ingestion is not data lake philosophy. So I said, what if we had that simple in, with a unique architecture and index technology that makes the data virtually accessible and publishable, dynamically, at petabyte scale? Our service connects to the customer's cloud storage. They just stream the data in, set up what we call a live indexing stream, and then go to our data refinery and publish views that can be consumed through the Elasticsearch API, with Kibana or Grafana, or as SQL tables, with Looker or Tableau. So we're getting the benefits of both sides: schema-on-read write performance with schema-on-write read performance. And if you do that, that's the true promise of a data lake.
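On the SQL side of that same picture, here is an equally hedged sketch: the DSN and view name are placeholders, and the ODBC route is just one illustrative way a BI tool might connect, not Chaos Search's documented interface. The idea is that the published view is queried as an ordinary table.

```python
# Illustrative only: the same published view consumed as a SQL table.
# The DSN and view name are hypothetical placeholders.
import pyodbc

conn = pyodbc.connect("DSN=DataLakeViews")
cur = conn.cursor()
cur.execute("""
    SELECT level, COUNT(*) AS n
    FROM "app-logs-view"   -- the same virtual view the Elasticsearch API sees
    GROUP BY level
    ORDER BY n DESC
""")
for level, n in cur.fetchall():
    print(level, n)
```

One view, two protocols: that's the trade Thomas describes, since nothing was reshaped at ingest, yet the view still answers indexed queries.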
You know, again, nothing against Hadoop, but schema-on-read with all that complexity of software became, well, a bit of a data swamp.

Well, Hadoop got us started, okay, so we've got to give it its props. But everybody I talk to who's got a big bunch of Spark clusters is now saying, all right, this doesn't scale, we're stuck. And so, you know, I'm a big fan of Zhamak Dehghani and her concept of the data mesh, and it's early days. But if you fast-forward to the end of the decade, what do you see as the critical components of this notion? People call it data mesh, but you've got the whole analytics stack. You're a visionary, Thomas. How do you see this playing out over the next decade?

I love her thought leadership. To be honest, her core principles were our core principles five, six, seven years ago. This idea of decentralized data as a product, self-serve, federated computational governance: all of that was our core principle. The trick is, how do you enable that mesh philosophy? I'd say we're mesh-ready, meaning we can participate in a way that very few products can. If there are gates on data coming into your system, the ETLing, the schema management, you have a problem. My argument with the data mesh is that producers and consumers should have the same rights. I want the consumers, the people who choose how they want to consume the data, to have the same rights as the producer publishing it. I'd say our data refinery is that answer. And, shoot, I'd love to open up a standard, where we can really talk about producers and consumers and the rights each has. But I think she's right on in the philosophy. As products mature in the cloud and in data lake capabilities, the trick is those gates. If you have to structure up front, if you have to set up those pipelines, the chance of getting your data into a mesh is the weeks and months Ed was mentioning.

Well, I think the problem with data mesh today is the lack of standards. When you draw the conceptual diagrams, you get a lot of lollipops, which are APIs, but they're all unique primitives. So there aren't standards by which, to your point, the consumer can take the data the way he or she wants it and build their own data products, without having to tap people on the shoulder and ask, how can I use this? Where does the data live? And being able to add their own data. That's kind of the future.

You're exactly right. Say I'm an organization generating data. Wouldn't it be great to just stream it to a lake, and then have a service where the data is discoverable and configurable by the consumer? Let's say you want to go to the grocery store. I want to make a certain meal tonight; I want to pick and choose what I want, how I want it. Imagine if the data mesh truly let the producer publish information, all the things you can buy at a grocery store, and let the consumer choose what to make for dinner. If it's static, if you have to call up your producer to make a change, was it really a data mesh-enabled service? I would argue not.

Ed, bring us home.

Well, maybe one more thing on this.

Yeah, please, yeah.

Because some of this is us talking about 2031, but largely these principles are what we have in production today, right? Even the self-service, where you can actually put business context on top of a data lake: we do that today. We talked about getting rid of the physical ETL, which is 80% of the work, but the last 20% is done by this refinery, where you can create virtual views, apply the right RBAC, and do all the transformation needed to make the data available. And you can offer that as a role-based access service to your end users, aka your analysts. You don't have to be a data scientist or DBA. In the hands of a data scientist or DBA it's powerful, but the fact of the matter is, you don't have to be one. In fact, all of our employees, regardless of seniority, whether they're in finance or in sales, go through and learn how to do this. And they can come up with their own views, which is one of the points of data lakes: the business unit wants to do it themselves, but more importantly, they have the context of what they're trying to do. Instead of queuing up a very specific request that takes weeks, they're able to do it themselves. And without having to put the data in different data stores and ETL it, they can do things in real time or near real time. That's game-changing, and something we haven't been able to do before.
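To picture what one of those role-based virtual views might look like, here is a purely illustrative sketch, not Chaos Search's actual API: a view defined over objects that never leave S3, with the transformations and access rules attached to the view instead of baked into a physical ETL copy.

```python
# Purely illustrative: a virtual view definition with view-level RBAC.
# Every name, field, and role below is a hypothetical placeholder.
view = {
    "name": "sales-eu-view",
    "source": "s3://raw-data-lake/events/",   # the data itself never moves
    "transforms": [
        {"rename": {"usr": "user_id"}},        # virtual, not a rewrite
        {"filter": "region = 'EU'"},           # consumers see only EU rows
    ],
    "rbac": {
        "analyst": ["read"],                   # analysts can query the view
        "data_engineer": ["read", "modify"],   # engineers can reshape it
    },
}
```

Because the transformation is metadata rather than a copied data set, a business unit can stand up its own view, with its own context, without queuing a request to the central team.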
And then, to wrap it up: listen, eight years ago, Thomas and a group of founders came up with a concept. How do you actually get after analytics at scale and solve the real problems? And it's not one thing. It's not just getting to S3. It's all these different things. What we have in market today is the ability to literally, simply stream your data to S3. By the way, that's simple to do. What we do is automate the process of getting the data into a representation that you can now share and augment, and then we publish open APIs so people can use the tools they want. First use case: log analytics. It's easy to just stream your logs in, and we give you Elasticsearch-type services. Same thing now with SQL, and you'll see machine learning next year. So listen, I think we have data lake 3.0 now, and we're just stretching our legs. We're having a lot of fun.

Well, you say log analytics, but I really do believe in this concept of building data products and data services, because I want to sell them, I want to monetize them, and being able to do that quickly and easily, so that people can consume them, is the future. So guys, thanks so much for coming on the program. Really appreciate it.