Welcome back to theCUBE's coverage of AWS re:Invent 2021, the industry's most important hybrid event, and one of very few hybrid events, of course, in the last two years. TheCUBE is excited to be here. This is our ninth year covering AWS re:Invent, and it's the 10th re:Invent. We're here with Joel Minnick, vice president of product and partner marketing at the smoking-hot company Databricks, and Greg Rokita, executive director of technology at Edmunds. If you're buying a car or leasing a car, you've got to go to Edmunds. We're going to talk about busting data silos. Guys, great to see you again. Welcome.

Glad to be here.

All right, so Joel, what the heck is a lakehouse? It's all over the place; everybody's talking about the lakehouse. What is it?

Indeed. Well, in a nutshell, a lakehouse is the ability to have one unified platform to handle all of your traditional analytics workloads, so your BI and reporting, the workloads you would have for your data warehouse, on the same platform as the workloads you would have for data science and machine learning. If you think about the way most organizations have built their infrastructure in the cloud today, customers generally land all their data in a data lake. A data lake is fantastic because it's low cost, it's open, and it's able to handle lots of different kinds of data. But the challenge with data lakes is that they don't necessarily scale very well: it's very hard to govern data in a data lake, and it's very hard to manage that data. So what happens is that customers move the data out of the data lake into downstream systems, and what they tend to move it into are data warehouses, to handle those traditional reporting kinds of workloads, because data warehouses deliver really great scale and really great performance. The challenge, though, is that data warehouses really only work for structured data, and regardless of which data warehouse you adopt, all data warehousing platforms today are built on some kind of proprietary format. Once you've put your data into the data warehouse, that is what you're locked into.

The promise of the lakehouse is to say: what if we could strip away all of that complexity of moving data back and forth between all these different systems, keep the data exactly where it is today, which is in the data lake, and then apply a transaction layer on top of it? In Databricks' case, we do that through an open-source technology called Delta Lake. What Delta Lake allows us to do is, when you need it, apply the performance, reliability, quality, and scale you would expect from a data warehouse directly on your data lake. And if I can do that, then I'm able to operate from one single source of truth that handles all of my analytics workloads, both my traditional analytics workloads and my data science and machine learning workloads. Having all of those workloads on one common platform means that not only does my infrastructure get much, much simpler, so I can operate at much lower cost and get things into production much, much faster, but I'm also able to leverage open source in a much bigger way, because a lakehouse is inherently built on an open platform. So I'm no longer locked into any kind of data format.
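To make that transaction-layer idea concrete, here is a minimal sketch rather than anything shown in the interview: it assumes PySpark with the open-source delta-spark package installed (on Databricks the Delta session settings below are already configured), and the table path and column names are invented for illustration.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    # These two settings enable Delta with open-source delta-spark;
    # on Databricks they are already configured for you.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Land data in the lake as a Delta table: open Parquet files plus a transaction log.
listings = spark.createDataFrame(
    [("vin001", 2019, 23500.0), ("vin002", 2021, 31900.0)],
    ["vin", "model_year", "price"],
)
listings.write.format("delta").mode("overwrite").save("/lake/listings")

# The transaction log is what gives warehouse-style reliability on the lake:
# an ACID upsert (MERGE) applied directly to the files in place.
updates = spark.createDataFrame([("vin002", 2021, 30900.0)],
                                ["vin", "model_year", "price"])
(
    DeltaTable.forPath(spark, "/lake/listings").alias("t")
    .merge(updates.alias("u"), "t.vin = u.vin")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Any engine that understands Delta/Parquet reads the same single copy of the data.
spark.read.format("delta").load("/lake/listings").show()
```

The point of the sketch is that the upsert is an ordinary write to files sitting in the data lake; there is no second, proprietary system the data has to be loaded into first.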
And finally, probably one of the most lasting benefits of a lakehouse is that all the roles that have to touch my data, from my data engineers to my data analysts to my data scientists, are all working on the same data. That means the collaboration needed to answer really hard problems with data becomes much, much easier, because the silos that traditionally exist inside my environment no longer have to be there. So a lakehouse is the promise of one single source of truth, one unified platform, for all of my data.

Okay, great. Thank you for that very cogent description of what a lakehouse is. Now I want to hear from the customer and see whether what he just said holds up. So let me ask you this, Greg, because the other problem you didn't mention about the data lake is that with no schema on write, it gets messy. And Databricks, correct me if I'm wrong, has begun to solve that problem through a series of tooling and AI; that's what Delta Lake does. It's a managed service. Everybody thought you were going to be the Cloudera of Spark, and you made a brilliant move to create a managed service, and it's worked great. Now everybody has a managed service. So can you paint a picture of what you're doing at Edmunds? Maybe take us through your journey: the early days of Hadoop, the data lake, "oh, that sounds good, throw it in there." Paint a picture of how you guys are using data, and then tie it into what Joel just said.

As Joel said, Delta Lake simplifies the architecture quite a bit. In a modern enterprise, you have to deal with a variety of different data sources: structured, semi-structured, and unstructured in the form of images and videos. With Delta Lake, you can have one system that handles all those data sources. That removes the problem of multiple systems you have to administer, it lowers the cost, and it provides consistency. If you have multiple systems that deal with data, the question always arises of which data has to be loaded into which system, and then you have issues with consistency. Once you have issues with consistency, business users and analysts will stop trusting your data. So it was very critical for us to unify data handling in one place.

Additionally, you get massive scalability. I went to the talk by Dominique Brezinski from Apple, who said he can process two years' worth of data instead of just two days. At Edmunds, we have the use case of backfilling data: we often change the logic and need to reprocess massive amounts of data, and with the lakehouse we can reprocess months' worth of data in a matter of minutes or hours. Additionally, the lakehouse is based on open standards like Parquet, which allowed us to hook open-source and third-party tools on top of the Delta lakehouse, for example Amundsen, which we use for data discovery. And finally, the lakehouse approach allows people with different skill sets to work on the same source data. We have analysts, data engineers, statisticians, and data scientists using their own programming languages but working on the same core data sets, without worrying about duplicating data or consistency issues between the teams.
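That "different skill sets, one copy of the data" point reduces, in practice, to everyone addressing the same Delta table. A minimal sketch, assuming a hypothetical curated_transactions table with invented column names: the analyst's SQL, the engineer's DataFrame code, and the scientist's pandas sample all read the same underlying files.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Register the lake table once so SQL users can find it (path and names are invented).
spark.sql("""
    CREATE TABLE IF NOT EXISTS curated_transactions
    USING DELTA LOCATION '/lake/curated_transactions'
""")

# Analyst: plain SQL, the kind of query that sits behind a dashboard tile.
monthly_prices = spark.sql("""
    SELECT date_trunc('MONTH', sale_date) AS month, avg(sale_price) AS avg_price
    FROM curated_transactions
    GROUP BY 1
    ORDER BY 1
""")

# Data engineer: the same table through the DataFrame API.
price_by_model = (
    spark.table("curated_transactions")
    .groupBy("make", "model")
    .agg(F.avg("sale_price").alias("avg_price"))
)

# Data scientist: pull a sample into pandas for exploration; still the same files.
sample_pdf = spark.table("curated_transactions").sample(fraction=0.01).toPandas()
```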
So what are the primary use cases where you're using the lakehouse and Delta Lake?

We have several use cases, and one of the more interesting and important ones is vehicle pricing. You may have used Edmunds: you go to our website and use it to research vehicles. It turns out that pricing, and knowing whether you're getting a good or bad deal, is critical for our business. With the lakehouse, we were able to develop a data pipeline that ingests the transactions, curates and cleans them, and then feeds that curated feed into the machine learning model, which is also deployed on the lakehouse. So you have one system that handles this huge complexity, and as you know, it's very hard to find unicorns who know all of those technologies. But because we have the flexibility of using Scala, Java, Python, and SQL, we have different people working on different parts of that pipeline, on the same system and on the same data. Having the lakehouse really enabled us to be very agile; it allowed us to onboard new sources easily as they arrive and to fine-tune the model to decrease the error rates for the price prediction. That process is ongoing, and it's a very agile process that takes advantage of the different skill sets of different people on one system.

Yeah, because you guys democratized car buying, or at least the data around car buying, because as a consumer now I know what others are paying and I can go in armed with that. Of course, the dealers changed their algorithms as well; they got really smart, and then they got pushback from the manufacturers, so you had to get smarter. So it's a moving target, I guess.

Exactly. The pricing is actually very complex, and I don't have time to explain it all here. But especially in this crazy, inflationary market, where used car prices are something like 38% higher year over year and new car prices are around 10% higher, and they're changing rapidly, having a very responsive pricing model is extremely critical. I don't know if you're familiar with Zillow; they almost went out of business because they mispriced their houses.

So if you own their stock, you're probably on the short end of it. But it's true, because my lease came up in the middle of the pandemic, and I went to Edmunds and asked, what's this car worth? It was worth something like $7,000 more than the buyout cost, the residual value. I said, I'm taking it; can't pass up that deal. So you have to be flexible. And you're saying the premise is that open-source technology, Delta Lake, and the lakehouse enabled that flexibility.

Yes. We are able to ingest new transactions daily, recalculate our model in less than an hour, and deploy the new model with new pricing almost in real time. In this environment, it's very critical that you stay up to date, ingest the latest transactions as prices change, and recalculate the model that predicts future prices.

How do the business lines inside of Edmunds interact with the data teams? You mentioned data engineers, data scientists, analysts. How do the business people get access to their data?

Originally we only had a core team using the lakehouse, but because the usage was so powerful and easy, we were able to democratize it across other units. Other teams within software engineering picked it up, then analysts picked it up, and then even business users started using the dashboarding, seeing how prices change over time and seeing other metrics within the business.
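For a sense of what a daily retraining job of the kind Greg describes might look like, here is a heavily simplified sketch, not Edmunds' actual pipeline: the paths, feature columns, and the choice of a gradient-boosted tree regressor are assumptions made purely for illustration.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.getOrCreate()

# Curated transactions live in the lake as a Delta table; new rows land daily.
curated = (
    spark.read.format("delta").load("/lake/curated_transactions")
    .where(F.col("sale_price").isNotNull() & (F.col("sale_price") > 0))
    .withColumn("vehicle_age", F.lit(2021) - F.col("model_year"))
)

train, test = curated.randomSplit([0.8, 0.2], seed=42)

# Assemble (hypothetical) features and fit a regressor in one pipeline.
assembler = VectorAssembler(
    inputCols=["vehicle_age", "mileage", "trim_level_index"], outputCol="features"
)
model = Pipeline(stages=[
    assembler,
    GBTRegressor(labelCol="sale_price", featuresCol="features"),
]).fit(train)

# Track the error rate the team is trying to drive down with each retrain.
rmse = RegressionEvaluator(
    labelCol="sale_price", predictionCol="prediction", metricName="rmse"
).evaluate(model.transform(test))
print(f"price model RMSE: {rmse:,.0f}")

# Persist the new model so a downstream scoring job can pick it up.
model.write().overwrite().save("/models/price_model")
```

Curation, training, and evaluation all run against the same Delta table, which is the agility Greg is pointing at: changing the logic means rerunning this job, not re-loading data into another system.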
What did that do for data quality? Because I feel like if I'm a business person, I might have context for the data that an analyst might not have if they're part of a team servicing all these lines of business. Did you find that the collaboration affected data quality?

The biggest thing for us was the fact that we no longer have multiple systems, so you don't have to copy the data around. Whenever you have to load data from one system to another, there is always a lag, there is always a delay, there is always a problematic job that didn't do the copy correctly, and the quality is uncertain; you don't know which system is telling you the truth. Now we just have one layer of data. Whether you're running reports, doing data processing, or doing modeling, they all read the same data. The second thing is the dashboarding capability: people who are not very technical, who before could only use Tableau, and Tableau is not the easiest thing to use if you're not technical, can now use it. So anyone can see how our pricing data looks, whether you're an executive, an analyst, or a casual business user.

I have so many questions; you guys are going to have to come back, because we're running out of time. But you now allow a consumer to sell a car directly to you.

Yes.

So that's a new service you launched. I presume that required new data.

We give consumers offers, yes. And that offer...

You offer to buy out my lease?

Exactly. And that offer leverages the pricing that we develop on top of the lakehouse. The most important thing is accurately giving you a very good offer price, right? If we give you a price that's not so good, you're going to go somewhere else. If we give you a price that's too high, we're going to go bankrupt like Zillow did, right?

And to enable that, you're working off the same data set?

Yes.

Did you have to inject new data? Was there a new data source that was required?

Once we curate the data sources and clean them, we feed them directly to the model, and all of those components run on the lakehouse, whether you're curating the data, cleaning it, or running the model. The nice thing about the lakehouse is that machine learning is a first-class citizen. If you use something like Snowflake, and I'm not going to slam Snowflake here, you have to...

That's a different use case. No, but go ahead.

You have to load the data into a different system. So good luck doing machine learning on Snowflake, right? Whereas with Databricks, that's kind of your raison d'être.

Right.

I feel like I should be a salesman or something. I'm not saying it because I was told to; I'm saying it because that's our use case.

Well, it fit your use case.
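To illustrate that "machine learning is a first-class citizen" point, here is a hedged sketch of batch scoring directly on the lakehouse, continuing the hypothetical retraining job sketched earlier; the model path, inventory table, feature columns, and the margin applied to the prediction are all invented, not Edmunds' actual logic.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml import PipelineModel

spark = SparkSession.builder.getOrCreate()

# Load the pricing model saved by the (hypothetical) retraining job.
model = PipelineModel.load("/models/price_model")

# Score current inventory straight off the lake; the inventory table must carry
# the same feature columns the model was trained on. No export to a separate ML system.
inventory = (
    spark.read.format("delta").load("/lake/current_inventory")
    .withColumn("vehicle_age", F.lit(2021) - F.col("model_year"))
)
offers = (
    model.transform(inventory)
    .withColumn("offer_price", F.round(F.col("prediction") * 0.97, 0))  # illustrative margin only
    .select("vin", "offer_price")
)

# Publish the offers as just another Delta table that dashboards and services can read.
offers.write.format("delta").mode("overwrite").save("/lake/offers")
```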
So, a question for each of you: what business results did you see pre-lakehouse versus post-lakehouse? Are there any metrics you can share? And then, Joel, I wonder if you could share more broadly what you're seeing across your customer base. But Greg, what can you tell us?

Well, before the lakehouse we had two different systems: one for processing, which was still Databricks, and a second one for serving. We iterated over Netezza and Redshift, but we figured that maintaining two different systems and loading data from one to the other was a huge overhead: administration, security, cost, and the consistency issues. Having one system with centralized data solves all of those problems. You have one security mechanism and one administrative mechanism, you don't have to load the data from one system to the other, and you don't have to make compromises.

And scale is not a problem, because of the cloud?

Because you can spin up clusters at will for different use cases, your clusters are independent: you have processing clusters that don't affect your serving clusters. In the past, if you were serving on, say, Netezza or Redshift and you were doing heavy processing, your reports would be affected. Now all those clusters are separated.

So a consumer could take that data from the producer independently.

Using its own cluster.

Okay. I'll give you the final word, Joel. Like I said, you guys have got to come back; this has been a great interview. What have you seen broadly across the customer base?

I think Greg's point about scale is an interesting one. If you look across the entire Databricks platform, the platform is launching 9 million VMs every day, and in total we're processing over nine exabytes a month. So in terms of how much data the platform can flow through while still maintaining extremely high performance, it's bar none out there. And in terms of the macro environment, I think what's been most exciting to watch is what customers are experiencing on the traditional data warehousing kinds of workloads, because that's where the promise of the lakehouse really comes into its own: saying, yes, I can run these traditional data warehousing workloads that require high concurrency, high scale, and high performance directly on my data lake. Probably the two most salient data points there are, first, that just last month Databricks announced it set the world record for the 100-terabyte TPC-DS benchmark. That benchmark is built to measure data warehouse performance, and the lakehouse architecture beat data warehouses at their own game in terms of overall performance. And second, what that means from a price-performance standpoint: customers on Databricks right now are able to enjoy that level of performance at 12x better price-performance than what cloud data warehouses provide. So not only are we delivering extremely high scale and performance, but we're able to do it much, much more efficiently.

We're going to need a whole other segment to talk about benchmarking.

Indeed.

But guys, thanks so much. Really interesting session, and best of luck to both of you. Enjoy the show.

Thank you for having us.

You're very welcome. Okay, keep it right there, everybody. You're watching theCUBE, the leader in high-tech coverage, at AWS re:Invent 2021.