Hello everyone, welcome to theCUBE's presentation of the AWS Startup Showcase: Data as Code. This is season two, episode two of the ongoing series covering the exciting startups from the AWS ecosystem. Here we're talking about operationalizing the data lake. I'm your host, John Furrier. My guest here is Mark Lyons, VP of product management at Dremio. Great to see you, Mark. Thanks for coming on. Hey, John, nice to see you again. Thanks for having me. Yeah, we were talking before we came on camera here on this showcase. We're going to spend the next 20 minutes talking about the new architectures of data lakes and how they expand and scale. But we were kind of reminiscing about the old big data days and how this has really changed. There are a lot of hangovers from that era: Hadoop kind of faltered, cloud took over. Now we're in a new era, and the theme here, data as code, really highlights that data is now in the developer cycles of operations. So infrastructure as code led the DevOps movement for cloud programmable infrastructure. Now you've got data as code, which is really accelerating DataOps, MLOps, database ops, and more developer focus. So this is a big part of it. You guys at Dremio have a cloud platform, a query engine, and data tier innovation. Take us through the positioning of Dremio right now. What's the current state of the offering? Yeah, sure, happy to. Thanks for the intro into the space that we're headed in. I think the world is changing and databases are changing. So today, Dremio is a full data lakehouse platform on the cloud. We're all about keeping your data in open formats in your cloud storage, but bringing the full functionality that you would want, to access the data as well as manage the data.
Like all the functionality folks would be used to: ANSI SQL compatibility, inserts, updates, deletes on that data, keeping that data in Parquet files and the Iceberg table format, another level of abstraction so that people can access the data in a very efficient way. And going even further than that, what we announced with Dremio Arctic, which is in public preview on our cloud platform, is a full Git-like experience for the data. So just like you said, data as code, right? We went through waves in source code and infrastructure as code, and now we can treat the data as code, which is amazing. You can have development branches, you can have staging branches, ETL branches, which are separate from production. Developers can do experiments, you can make changes, you can test those changes before you merge back to production and let the consumers see that data. So lots of innovation on the platform, super fast velocity of delivery, and lots of customers adopting it just in the first month here since we announced Dremio Cloud generally available; the adoption has been amazing. Yeah, and I think we're going to dig into a lot of the architecture, but I want to highlight the point you made about branching, taking a branch off, Git-style. This is what developers do, right? Developers use GitHub and Git, they make branches off code, they build on top of other code. That's open source. This is what's been around for generations. Now, for the first time, we're seeing data, data sets, being taken out of production to be worked on and coded and tested, even doing lookbacks or forward-looking analysis. This is data being programmed. This is data as code. You couldn't get any closer to data as code. Yeah, and it's all done through metadata, by the way.
So there's no actual copying of these data sets, because in these big data systems, right, cloud data lakes and such, these tables are billions of records, trillions of records, super wide, hundreds of columns wide, thousands of columns wide. You have to do this all through metadata operations. So you can control which version of the data an individual is working with and which version of the data the production systems are seeing, because these data sets are too big. You don't want to be moving them. You can't be moving them. You can't be copying them. So it's all metadata and manifest files and pointers to basically keep track of what's going on. I think this is the most important trend we've seen in a long time, because you think about what Agile did for developers: speed, DevOps, cloud scale. Now you've got agility on the data side, where you're basically breaking down the old proprietary ways of doing data warehousing without killing the functionality of what data warehouses did, just doing more volume. Data warehouses were proprietary, not open. They were different use cases; they were single-application. Developers would use the data warehouse for queries, not a lot of volume. But as you get volume, these things are inadequate. And now you've got the new open, Agile approach. I mean, is this Agile data engineering at play here? Yeah, I think it totally is. It's bringing it as far forward as possible. We're talking about making the data engineering process easier and more productive for the data engineer, which ultimately makes the consumers of that data much happier, and lets way more experiments happen. Way more use cases can be tried. If it's not a burden and it doesn't require building a whole new pipeline and defining a schema and adding columns and data types and all this stuff, you can do a lot more with your data much faster, right?
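The metadata-only branching Mark describes can be sketched in a few lines. This is an illustrative toy, not Dremio's or Nessie's actual implementation: a branch is just a named pointer to a commit, and a commit is just a list of data file paths, so branching a petabyte-scale table copies no data at all.

```python
# Toy sketch of Git-like branching over a table via metadata pointers.
# All names and paths are hypothetical; no real data files are touched.
class Catalog:
    def __init__(self):
        self.commits = {}    # commit_id -> list of data-file paths
        self.branches = {}   # branch name -> commit_id
        self._next = 0

    def _commit(self, files):
        cid = self._next
        self._next += 1
        self.commits[cid] = list(files)
        return cid

    def init_main(self, files):
        self.branches["main"] = self._commit(files)

    def create_branch(self, name, from_branch="main"):
        # A branch is just a new pointer to an existing commit: O(1),
        # regardless of how many billions of records the files hold.
        self.branches[name] = self.branches[from_branch]

    def add_files(self, branch, new_files):
        files = self.commits[self.branches[branch]] + list(new_files)
        self.branches[branch] = self._commit(files)

    def merge(self, src, dst="main"):
        # Publish the experiment: consumers on `dst` now see the new commit.
        self.branches[dst] = self.branches[src]

cat = Catalog()
cat.init_main(["s3://lake/orders/part-0.parquet"])
cat.create_branch("etl")
cat.add_files("etl", ["s3://lake/orders/part-1.parquet"])
# Production on main still sees only the original file...
assert cat.commits[cat.branches["main"]] == ["s3://lake/orders/part-0.parquet"]
cat.merge("etl")
# ...until the tested change is merged back.
assert len(cat.commits[cat.branches["main"]]) == 2
```

The whole branch-experiment-merge cycle here is a handful of pointer updates, which is why it works even when the files themselves are too big to move.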
So it's really going to be super impactful to all these businesses out there trying to be data driven. Especially when you're looking at data as code and branching a branch off, you can de-risk your changes. You're not worried about messing up the production system, messing up that data, having it seen by an end user. For some businesses, data is their business, so that data would be going all the way to a consumer, a third party, right? And then it's really scary; there's a lot of risk if you show the wrong credit score to a consumer or something like that. So it's really de-risking that. Even updating machine learning algorithms, right? If the data sets change, you can always be iterating on things like machine learning algorithms. This is kind of new. I mean, this is awesome, right? I think it's going to change the world, because this stuff was so painful to do. The data sets have gotten so much bigger, as you know, but we were still doing it in the old way, which was typically moving data around for everyone. It was copying data, downsampling data, moving data, and now we're basically saying, hey, don't do that anymore. Stop; we've got to stop moving the data. It doesn't make any sense. So I've got to ask you, Mark: data lakes are growing in popularity. I was originally down on data lakes. I called them data swamps. I didn't think they were going to be as popular, because at that time distributed file systems like Hadoop and object stores in the cloud were really cool. So what happened between that promise of distributed file systems and object stores and data lakes? What made data lakes popular? What made that work, in your opinion? Yeah, it really comes down to the metadata, which I already mentioned once. We went through these waves, right, John? We went from EDWs to data lakes and then to cloud data warehouses.
And I think we're at the start of a cycle back to the data lake. And it's because the data lakes this time around, with the Apache Iceberg table format, with Project Nessie, and with what Dremio's working on around metadata, aren't going to become data swamps anymore. They're actually going to be functional systems that do inserts, updates, and deletes. You can see all the commits. You can time travel them, and all the files are actually managed and optimized: you have to partition the data, you have to merge small files into larger files. Oh, by the way, this is stuff that all the warehouses have done behind the scenes, all the housekeeping they do, but people weren't really aware of it. And the data lakes the first time around didn't solve all these problems, so those files landing in a distributed file system do become a mess, right? If you just land JSON, Avro, Parquet, or CSV files into HDFS or S3-compatible object storage, it doesn't matter which, if you're just parking files and you're going to deal with it as schema-on-read instead of schema-on-write, you're going to have a mess. If you don't know which tool changed the files, which user deleted a file or updated a file, you will end up with a mess really quickly. So to take care of that, you have to put a table format on top. Everyone was looking at Apache Iceberg or the Databricks Delta format, which is an interesting conversation, similar to the Parquet versus ORC file format story that we saw play out. And then you track the metadata. So you have those manifest files; you know which files changed, when, by which engine, in which commit, and you can actually make a functional system that's not going to become a swamp. So another trend that's extending beyond the data lake is other data sources, right? You have a lot of other data, not just in data lakes, so you have to kind of work with that.
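The commits and time travel Mark mentions can be illustrated with a toy snapshot log, loosely in the spirit of the Iceberg table format (the real specification is far richer): each commit records only which files were added or removed, so any past table state can be reconstructed without copying data.

```python
# Toy snapshot log in the spirit of Iceberg-style table metadata.
# File names are illustrative; nothing here is the real Iceberg spec.
from dataclasses import dataclass

@dataclass
class Snapshot:
    snapshot_id: int
    added: list      # data files added in this commit
    removed: list    # data files logically deleted in this commit

class Table:
    def __init__(self):
        self.snapshots = []

    def commit(self, added=(), removed=()):
        sid = len(self.snapshots)
        self.snapshots.append(Snapshot(sid, list(added), list(removed)))
        return sid

    def files_at(self, snapshot_id):
        # "Time travel": replay the metadata log up to the requested commit.
        files = set()
        for snap in self.snapshots[: snapshot_id + 1]:
            files |= set(snap.added)
            files -= set(snap.removed)
        return files

t = Table()
s0 = t.commit(added=["a.parquet", "b.parquet"])
s1 = t.commit(added=["c.parquet"], removed=["a.parquet"])
assert t.files_at(s0) == {"a.parquet", "b.parquet"}   # the table as it was
assert t.files_at(s1) == {"b.parquet", "c.parquet"}   # the table as it is
```

Because deletes and updates are just new commits in the log, you get an audit trail and historical queries for free, which is exactly what the first-generation "park files in HDFS" lakes lacked.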
So how do you guys answer the question around some of the mission-critical BI dashboards out there on the latency side? A lot of people have been complaining that these mission-critical BI dashboards aren't getting the kind of performance they need as they add more data sources and try to do more. Yeah, that's a great question. And Dremio does a bunch of interesting things to bring the performance of these systems up, because at the end of the day, people want to access their data really quickly. They want the response times of these dashboards to be interactive; otherwise the data is not interesting, if it takes too long to get an answer to a question. So yeah, a couple of things. First of all, from a data sources side, Dremio is very proficient with Parquet files in an object store, like we just talked about, but it can also access data in other relational systems, whether that's a Postgres system, a Teradata system, or an Oracle system. That's really useful if you have dimensional data or customer data: not the largest data set in the world, not the fastest-moving data set in the world, but you don't want to move it. We can query it where it resides. So bringing in new sources is definitely key; we all know that joining sources together is key to getting better insights into your data. And then from a query speed standpoint, there's a lot going on here. Everything from the Apache Arrow project, which is an in-memory columnar format, so you're not serializing and deserializing the data back and forth, to what we call a reflection, which is basically a re-indexing or pre-computing of the data. We leave it in Parquet format, an open format, in the customer's account, so that aggregates and other things that are really popular in these dashboards are pre-computed.
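The reflection idea, pre-computing aggregates once so dashboard queries hit a small materialization instead of scanning raw rows, can be sketched like this. A plain dict stands in for the Parquet-backed materialization Dremio actually maintains, and the field names are made up:

```python
# Toy "reflection": maintain a pre-computed aggregate so dashboard
# queries are cheap lookups rather than full scans of the raw table.
raw_events = [
    {"region": "eu", "amount": 10},
    {"region": "us", "amount": 25},
    {"region": "eu", "amount": 5},
]

# Maintenance step: build the aggregate materialization once.
reflection = {}
for row in raw_events:
    reflection[row["region"]] = reflection.get(row["region"], 0) + row["amount"]

# Dashboard query path: O(1) lookup instead of scanning every event.
def total_sales(region):
    return reflection.get(region, 0)

assert total_sales("eu") == 15
assert total_sales("us") == 25
```

The trade-off is the classic one warehouses have always made behind the scenes: the materialization must be refreshed as raw data changes, in exchange for millisecond reads.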
So millisecond response, lightning fast: tricks that the warehouses have been doing forever, right? More data is coming in, so obviously the architecture, and we'll get into that now, has to handle the growth. And as your customers and practitioners see the volume and the variety and the velocity of the data coming in, how are they adjusting their data strategies to respond to this? Again, cloud is clearly the answer, not the data warehouse, but what are they doing? What's the strategy adjustment? It's interesting. When we start talking to folks, I think sometimes it's a really big shift in thinking about data architectures and data strategies when you look at the Dremio approach, and it's very different from what most people are doing today: ETL pipelines, then bringing stuff into a warehouse, and oh, the warehouse is too overloaded, so let's build some cubes and extracts into the next tier of tools to speed up those dashboards. Dremio has totally turned this on its head and said, no, no, let's not do all those things. That's time-consuming, it's brittle, it breaks, and actually your agility and the scope of what you can do with your data decrease. You go from all your data and all your data sources to a smaller and smaller subset. We actually call it the pyramid of doom, and a lot of people look at this and say, yeah, that kind of looks like how we're doing things today. So from a Dremio perspective, it's really about no copies: trying to keep as much data in one place, keeping it in one open format, and less data movement. And that's a very different approach for people. I think they don't realize how much you can accomplish that way, and your latency shrinks down too, right?
Your actual latency from data created to insight is much shorter, and it's not because of the query response time; that latency is mostly because of data movement and copies and all these things. So you really want to shrink your time to insight. It's not about shaving a query down from a few seconds; it's about changing the architecture. Yeah, the data drift, as they say. Interesting. I've got to ask you on the personnel side, the team side. You've got the technical side, you've got the non-technical consumers of the data, you've got data science, and data engineering is ramping up. We mentioned earlier that data engineering being agile is a key innovation here. As you blend the two personas of technical and non-technical people who are playing with data or coding with data, where are the bottlenecks in this process today? And how can data teams overcome these bottlenecks? Yeah, we see a lot of bottlenecks in the process today. A lot of data movement, a lot of change requests: update this dashboard; oh, well, that dashboard update requires an ETL pipeline update, which requires a column to be added to this warehouse. So then you've got these personas, like you said, more technical, less technical, all the data consumers, the data engineers. Well, the data engineers are getting totally overloaded with requests and work, and it's not even super value-add work for the business. It's not really driving big changes in their culture and insights and new use cases for data. It's churning through small changes, but it's taking too much time. It's taking days, if not weeks, for these organizations to manage small changes. And then the data consumers, the less technical folks, can't get the answers they want. So they're waiting and waiting, and they don't understand why things are so challenging, right? How things could take so much time.
So from a Dremio perspective, it's amazing to watch these organizations unleash their data, get the data engineers' productivity up, and stop dealing with some of that last-mile ETL and small changes to the data. Dremio actually says, hey, data consumers, here's a really nice GUI. You don't need to be a SQL expert. The tool writes the joins for you; you can click on a column and say, hey, I want to calculate a new field, and it calculates that field. And it's all done virtually, so it's not changing the physical data sets. The data engineering team doesn't even really need to care at that point, right? So you get happier data consumers at the end of the day; they're doing things more self-service, they're learning about the data, and the data engineering teams can go do value-add things. They can re-architect the platform for the future. They can do POCs to test out new technologies that could support new use cases and bring those into the organization. Things that really add value, instead of just churning through backlogs of, hey, can we get a column added? Oh, we changed something. Everyone's doing app development and A/B testing; those developers are king, and those pipelines streaming all this data down break when the JSON files change, right? You need agility, and if you don't have that agility, you just get this endless backlog. Yeah, this is data as code in action, right? You're committing data back into the main branch once it's been tested. That's what developers do. So this is really kind of the next step function. So I've got to put the customer hat on for a second and ask you the pessimist question. Okay, we've had data lakes. I've got data lakes; there have been data lakes around. I've got query engines here and there; they're all over the place. What's missing? What's been missing from the architecture to fully realize the potential of a data lakehouse? Yeah, I think that's a great question.
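The virtual calculated fields Mark describes above, views computed on read that leave the physical data untouched, can be sketched roughly like this; the column names are hypothetical:

```python
# Toy "virtual dataset": the calculated fields exist only in the view
# layer, like a SQL view, so the physical rows never change and the data
# engineering team has nothing to migrate.
customers = [
    {"first": "Ada", "last": "Lovelace", "spend": 120.0},
    {"first": "Alan", "last": "Turing", "spend": 80.0},
]

def virtual_dataset(rows):
    # Computed on read; nothing is written back to the base table.
    for row in rows:
        yield {
            **row,
            "full_name": f"{row['first']} {row['last']}",
            "tier": "gold" if row["spend"] >= 100 else "standard",
        }

view = list(virtual_dataset(customers))
assert view[0]["full_name"] == "Ada Lovelace"
assert view[1]["tier"] == "standard"
assert "full_name" not in customers[0]   # physical data untouched
```

Because the derived columns are defined over the base data rather than baked into it, a consumer can add or change them without filing a ticket against the pipeline.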
Customers say exactly that, John. They say, I've got 22 databases; you've got to be kidding me, you showed up with another database. Or, hey, let's talk about a cloud data lake, or a data lake again? I did the data lake thing, right? I had a data lake, and it wasn't everything I thought it would be. It was bad. It was a swamp. Yeah. So customers really think this way, and you say, well, what's different this time around? Well, in the original data lake world, and I'm just going to focus on data lakes, everything was still direct-attached storage. You had to scale your storage and compute out together, and we built these huge systems, thousands of HDFS nodes and such. The cloud brought separated compute and storage, but data lakes had never seen separated compute and storage until now. We went from the data lake with direct-attached storage to the cloud data warehouse with separated compute and storage. So the cloud architecture, getting compute and storage separated, is a huge shift in the data lake world, and there's that agility of, well, I'm only going to apply the compute that I need for this question, for this answer, right now, and not have 5,000 compute servers sitting around for some peak moment, or just because I have five petabytes or 50 petabytes of data that need to be stored on the disks attached to them, right?
So I think the cloud architecture and separating compute and storage is the first thing that's different this time around about data lakes. But more important than that is the metadata tier, the data tier: having sufficient metadata to deliver the functionality people need on the data lake, whether that's from a governance and compliance standpoint, to actually be able to do a delete on your data lake, or for productivity, treating that data as code like we're talking about today and being able to time travel it, version it, branch it. And now these data lakes, I mean, geez, the data lakes back in the original days were getting to 50 petabytes. Think about how big these cloud data lakes could be, even larger, and you can't move that data around. So we have to be really intelligent and really smart about the data operations: versioning all that data, knowing which engine touched the data, which person made the last commit, and being able to track all of that. That's ultimately what's going to make this successful, because if you don't have the governance in place these days with data, the projects are going to fail. Yeah, and I think separating the query layer, the SQL layer, and the data tier is another innovation that you guys have. Also, it's a managed cloud service, Dremio Cloud now, and you've got the open source angle too, which is also going to open up more standardization around some of these awesome features, like you mentioned the joins, and I think you guys built on top of Parquet and some other cool things, and you've got a community developing. So you get the cloud and community kind of coming together. The real world is coming to light, saying, hey, I need real-world applications, not the theory of old school. So what use cases do you see suited for this kind of new way: new architecture, new community, new programmability?
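The governance requirement Mark raises above, knowing which engine touched the data and who made the last commit, amounts to an auditable commit log. A minimal sketch, where the user and engine names are hypothetical:

```python
# Toy governance audit trail: every commit records who did what, with
# which engine, to which files, so deletes and changes can be traced.
import time

commit_log = []

def record_commit(user, engine, operation, files):
    commit_log.append({
        "user": user,
        "engine": engine,
        "operation": operation,
        "files": list(files),
        "ts": time.time(),
    })

record_commit("etl-bot", "spark", "append", ["part-7.parquet"])
record_commit("jane", "query-engine", "delete", ["part-3.parquet"])

# Compliance question: who last deleted data, and with which engine?
deletes = [c for c in commit_log if c["operation"] == "delete"]
assert deletes[-1]["user"] == "jane"
assert deletes[-1]["engine"] == "query-engine"
```

In a real system this log lives in the catalog's metadata tier alongside the commits themselves, which is what makes a delete on the lake both possible and provable.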
Yeah, I see people doing all sorts of interesting things, and I'm sure what we've introduced with Dremio Arctic and data as code is going to open up a whole new world of things that we don't even know about today. But generally speaking, we have customers doing very interesting, very data-application things, right? Building really high-performance, data-intensive use cases, whether that's a supply chain and manufacturing use case, a pharma or biotech use case, or a banking use case, and really unleashing that data right into an application. We also see a lot of traditional data analytics use cases, more in the traditional business intelligence or dashboarding vein. That stuff is totally achievable, no problems there, but I think the most interesting stuff is companies really figuring out, when we offer the flexibility and the agility that we're talking about, how to bring that data back into the apps, into the work streams, into the places where the business gets more value out of it, not in a dashboard that some person, or some set of people, might have access to. So, even in the Dremio Cloud announcement press release, there was a customer in Europe called Garvis AI, and they do AI for supply chains. It's an intelligent application, and it's showing customers transparently how they're getting to these predictions. And they stood this all up in a very short period of time, because it's a cloud product: they don't have to deal with provisioning, management, upgrades. I think they had their stuff going in like 30 minutes or something, super quick, which is amazing. The data was already there, right? And for a lot of organizations, their data is already in these cloud storages, and if that's the case... If they have data, they have a use case. I mean, this is agility.
I mean, this is agility coming to the data engineering field, making data programmable, enabling data applications, data ops for everybody. And for so many more use cases at these companies. These data engineering teams, these data platform teams, whether they're in marketing or ad tech or finserv or telco, they have a list, a roadmap of use cases that they're waiting to get to, right? And they're drowning underwater in the current tooling, barely keeping it alive. And oh, by the way, John, you can't go hire 30 new data engineers tomorrow and bring on the team to get capacity. You have to innovate at the architecture level to unlock more data use cases, because you're not going to go triple your team. That's not possible. It's going to unlock a tsunami of value, because everyone's clogged in the system and it's painful, right? You've got delays, you've got bottlenecks, you've got people complaining; it's hard, scar tissue. So now I think this brings ease of use and speed to the table. Yeah, that's what we're all about: making the data super easy for everyone. This should be fun and easy, not really painful and really hard and risky, you know? In a lot of these old ways of doing things, you have all this risk: you start changing your ETL pipeline, you add a column to the table, and all of a sudden you've got potential risk that things are going to break, and you don't even know what's going to break. Proprietary, not a lot of volume and usage, and on-premises, versus open, cloud, agile. Yeah. I mean, come on. Which path are you going to take? I mean, it's a no-brainer. Which way do you want to go? Yeah. Mark, thanks for coming on theCUBE; really appreciate you being part of the AWS Startup Showcase: Data as Code. Great conversation. Data as code is going to enable the next wave of innovation and impact the future of data analytics.
Thanks for coming on theCUBE. Yeah, thanks, John, and thanks to the AWS team; a great partnership between AWS and Dremio too. Talk to you soon. Keep it right there; more action here on theCUBE as part of the showcase. Stay with us. This is theCUBE, your leader in tech coverage. I'm John Furrier, your host. Thanks for watching.