Hi, this is your host Swapnil Bhartiya, and welcome to another episode of TFiR: Let's Talk. Today we have with us Tony Baer, founder and industry analyst at dbInsight. Tony, it's good to have you on the show.

Oh, thanks for having me, Swapnil.

Talk a bit about your company, dbInsight. What is your focus? What do you folks do?

My focus is data, databases, and the cloud. A key focus of mine is looking at how the cloud has reinvented database architecture and how we manage data.

Since you talked about cloud: cloud native is also changing things. If you look at cloud native technology like Kubernetes, earlier it was stateless, not stateful. So a lot of things are changing. People talk about data warehouse versus data lake, so a lot is happening there. What kind of evolution have you seen in the data space? Then we'll talk about lakehouse, warehouse, and all those terms we use. We are creating so much data, but data itself has no value; you have to extract value from it. And when you have massive amounts of data, you cannot just move it from one place to another, so things are getting very complicated. I want to hear from you, holistically, how you're seeing this evolution.

Okay. We'll start from the top down. When you go from on-prem to the cloud, you're managing to a different goal. On-prem, you're managing to capacity. In the cloud, you're managing to resource.
So that's the first thing. The other is that since that resource is there, it's very tempting to use lots of it. That's the care you have to take, because even though resource is cheap in the cloud, a lot of cheap gets expensive. Now, in terms of how we manage data: because the cloud has essentially optimized many different parts of the infrastructure, we can now distribute data in a way we could never have thought about when we were managing on-prem. That opens up a lot of horizons for how we can manage data and deal with it. At the same time, of course, there are the usual security and governance concerns. Those haven't changed, except that in the cloud, since you're going to be working with data that is remote, you have to start worrying about, for instance, data sovereignty and data locality: data locality sometimes from the standpoint of performance, but also from the standpoint of whether this data can leave the country. We could go on for hours about how the cloud changes how you manage data, but that's it, I would say, at the hundred-thousand-foot level.

Let's talk about the evolution. We used to talk about the data warehouse, then we started talking about data lakes, and now we're talking about the data lakehouse. Talk about the difference. I'm not an expert in data, but maybe it's not one versus the other; sometimes it's about the right place for the right kind of data. So let's hear your opinion on that.

Right. Well, put it this way: back on day one, when we were still trapping furs and rubbing sticks together to make fire, I remember the birth of data warehouses. The idea at that time had to do with what happens when you're doing analytic queries:
The traffic and the workload are going to be very different from doing transactions. That's where the separation really started. So for a good 20 to 30 years, the idea of the data warehouse was that you're offloading those complex queries from the transaction system and working only with the data that's significant from an analytics standpoint. That was a very successful idea, obviously, and it became essentially a default practice. Then the cloud came in, and we started to have the ability to deal with not just relational data but multistructured, or variably structured, data. We started to hit the wall with data warehouses and their limitations. Also, when you deal with a lot of data, data warehouses can get expensive. So we reinvented them for the cloud, and that's fine. But even in the cloud, pretty much all of the data warehouse services, like Redshift and so on, were using block storage, a more expensive form of storage than where you would store data generally in the cloud, which is cloud object storage such as Amazon S3, Azure Blob Storage or ADLS, Google Cloud Storage, and so on. That has become the de facto inexpensive, economical, durable storage for the cloud. At that point we started building lots of data lakes, because in the cloud, with cloud storage, the limits on what we could do in terms of storing, accessing, and processing data started to lift.
For instance, when Spark really started to commercialize around 2015, all of a sudden we could use in-memory processing to speed up the batch processing that we had previously been doing with MapReduce. It was a huge revolution, and a revelation, from that standpoint. The problem then became that, as data lakes grew more popular, you started to see the shortcomings: how confident are we in this data? Is this the most current version of the data? In the data lake, you didn't have any mechanisms for answering that. What you needed was ACID transactions. ACID stands for four different properties, but from a transactional standpoint it means that you will not commit data from a transaction that has only executed partway. You either execute or you don't. That was where the whole idea of the data lakehouse came up: being able to have ACID transactions in the data lake, not to turn it into an OLTP (online transaction processing) system, but to gain confidence in the data.

As we were discussing earlier, it's not about one versus the other; the technologies complement each other. But when we look at the data lakehouse, is it going to replace data warehouses?
Or will it, once again, be a world of coexistence?

I'm going to start with a very wishy-washy answer, which is that it depends. In general, what I've concluded in my research is that it will co-opt, and you could say eventually replace, the classic multipurpose data warehouse. The reason I say that is that the multipurpose data warehouse has been evolving to take on data lake type properties, but unfortunately not with the economical data lake storage. When I say data lake properties, I mean the ability to analyze non-relational data and the ability to invoke Python routines so that you can keep your data scientists happy. The reason I think the lakehouse will supplant those types of data warehouses is that it's going to do far more economically what you're trying to do in that multipurpose data warehouse anyway.

Now I need to give you a very important caveat to all that. If you're working with, say, a Teradata or something like that, where you have a SQL query engine designed to do dozens if not hundreds of table joins, you're still going to need a very high-end data warehouse to do that. It will not replace your very high-end Teradatas. On the other hand, at the lower end, in the longer tail, where you may not have those requirements, I could see it coexisting alongside. In fact, Teradata has started its own lakehouse strategy. So that's the short of it. It also is not going to replace fit-for-purpose data marts, where you're working with walled-garden data and your queries are well defined.
If it's basically doing monthly reporting, customer affinity, or something like that, it's not going to replace that either. But the mixed-workload, general-purpose data warehouse, I think it will eventually replace.

Can you talk about how enterprise customers are approaching the data lakehouse? As you said, they will be using both. How are they approaching it to take advantage of both? And since you're an expert, how should they approach it?

Okay. Well, first off, this is a case where the vendor community is ahead of the market in terms of awareness right now. If you look at who is actually adopting lakehouses, it's your classic early adopters. I can tell you from personal experience, whether through my exchanges on LinkedIn or when I go to data meetups and meet practitioners, data professionals who should know this stuff, that there's still very low awareness of what data lakehouses are. I use what I call the LinkedIn rule of thumb, the Baer rule of thumb, which is to look at the number of responses I get when I put up a post on LinkedIn, based on the hashtag. If I hashtag data mesh, it's going to get 10x the number of responses I get for data lakehouse. So we're still very early in the awareness stage. The reason I hopped on this now is that in the past year I saw the vendor ecosystem really starting to crystallize. It's become the latest front in the proxy war between Databricks and Snowflake. That's why I'm saying: if there's smoke, there's fire.

If I ask you to define the data lakehouse in hashtag language, quick, short, and crisp, how would you define it?

A data lakehouse is a data lake that has ACID transactions.

Let's go back to the vendor ecosystem.
Talk a bit about how mature the data lakehouse landscape and ecosystem is. Can you also talk about the role that open source is playing there?

Right. First off, the ecosystem is still solidifying. To me, the watershed event was when Snowflake announced last year that it was going all in on Iceberg, and not just as a token gesture. Previously it supported external access, like federated query, and that's not a huge deal; most of the cloud data warehouses can do something like that. But last year they announced a full commitment to Iceberg. They were going to throw everything at it and make it essentially a first-class citizen. To me that was a huge announcement. Now, for Snowflake, they're not really giving up anything there, because where they make their money is on compute, not on where you store the data. So for them, it's really a no-brainer. But they felt that, in this case, Apache Iceberg was mature enough for them to put their stake in the sand.
That, I think, was hugely important. You've seen a few others: Cloudera has thrown in its lot with Iceberg. You're starting to see the hyperscalers saying that their various data lake analytics and query services now support read access to data lakehouses, but not yet write access. To me, the real watermark will be when you get read plus write access. So we're still in really early days, but in the last year the sides have started to crystallize. What's interesting is that most of the household names, the Oracles of the world, the IBMs, Teradatas, SAP, and so on, have not yet weighed in, so I think that's the next shoe we're waiting to drop. In terms of the vendor ecosystem, I'd say the next 12 to 18 months is when you're going to see those vendors making their choices. And I see this very much becoming an open source play. Yes, there are proprietary lakehouse table formats out there right now. Teradata has its own. AWS, with its Lake Formation governed tables, has a proprietary format, even though other AWS services are going open source. Even a very specialized vendor, Dynatrace, is doing its own proprietary lakehouse format. But in the long run I see open source winning out here, and the main reason is that the differentiation is not going to occur in the table format, which is what these lakehouses are: they're a table format. Where the differentiation is going to happen is in the control plane and the query engine. And the other part of that is, for a vendor, why should they spin their wheels there?
They're not going to differentiate there, so that's why I see open source winning out. Right now there are three major projects out there: Delta Lake, Apache Iceberg, and Apache Hudi. Ultimately, I believe the market is going to winnow down to two of those three.

Now I want to ask about what is happening right now in the industry. A lot of cost cutting is happening, layoffs are happening, and companies are also looking at cloud cost as a cost center. How much impact is that going to have on adoption of the data lakehouse? At the same time, some of these technologies actually make you more cost efficient. So what role can the lakehouse play in making companies more cost efficient?

In your question, you've almost given me my answer. In the short run, with companies entering a more cost-conscious phase, there will probably be more lead time on adopting new initiatives. But you basically answered the question in your last part: in the long run, I see that using cloud storage for a number of data warehousing functions will be seen as a cost-cutting strategy. So it's going to be a hockey stick, but not for a while. It's going to be a gradual ramp-up. I think about 12 months from now you will start to see awareness build that this new idea could save money, could scale, and could give good enough performance. Is it going to give as good a performance as proprietary tables on block storage?
No, but it will give good enough performance for the types of queries you're going to throw at a multi-workload data warehouse or data lake. At that point, I think the answer will be pretty clear.

Tony, thank you so much for taking time out today to talk about the data warehouse, the data lake, and the data lakehouse, their differences, and their benefits. As usual, I would love to have you back on the show to learn more from you and to see where the market is going. Thank you.

And thank you, Swapnil. It was a pleasure meeting with you this morning.