Welcome everyone to theCUBE's presentation of the AWS Startup Showcase. The theme is data as code, the future of enterprise data and analytics. This is season two, episode two of our ongoing series covering the exciting startups in the AWS ecosystem around data, analytics and cloud computing. I'm your host, John Furrier. Today we're joined by great guests: three guests. Wen Phan, who's the director of product management at Ahana; Satyam Krishnan, engineering manager at Blinkit; and Akshay Agarwal, senior engineer at Blinkit as well. We're going to get into the relationship there. Let's get into it. We're going to talk about how Blinkit's using the open data lakehouse with Presto on AWS. Gentlemen, thanks for joining us. Thanks for having us. So we're going to get into the deep dive on the open data lakehouse, but I want to just quickly get your thoughts on what it is for the folks out there. Set the table. What is the open data lakehouse? Why is it important? What's in it for customers? Why are we seeing adoption around this? Because this is a big story. Sure. Yeah, the open data lakehouse is really being able to run a gamut of analytics, whether it be BI, SQL, machine learning, data science, on top of the data lake, which is based on inexpensive, low-cost, scalable storage. And more importantly, it's also on top of open formats. To the end customer, this really offers a tremendous range of flexibility. They can run a bunch of use cases on the same storage with great price performance. You guys have any other thoughts? What's your reaction to the lakehouse? What is your experience with it? What's going on at Blinkit? No, I think for us also, it has been the primary driver of how, as a company, we have completely shifted our delivery model, from delivering in one day to delivering in 10 minutes, right?
And a lot of this was made possible by having this kind of architecture in place, which helps us be more open: the tools are open source, we have an open table format, which helps us be very modular in nature, meaning we can pick the solutions that work best for us, right? And that is the kind of architecture that we want to be in. Awesome. Last time we checked in with Ahana, we had a great conversation around Presto. The theme of this episode is data as code, which is interesting because all the conversations in this episode are around developers; administrators are turning into developers. There's a developer vibe with data, and open source is software. Now you've got data taking a similar trajectory to how software development went with code, but the people running data aren't developers; they're administrators, they're operators. Now they're turning into data ops. So it's a similar vibe going on: branches, taking stuff out, putting it back in and testing it, data sets becoming much more stable, iterating on machine learning algorithms. This is a movement. Before we get into the relationships here, what's your reaction to this data as code movement? Yeah, so I think the folks at Blinkit are doing a great job there. I mean, they have a pretty compact data engineering team and they have some pretty stringent SLAs in terms of time to value and reliability. And what that ultimately translates to for them is not only flexibility, but reliability. So they've done some fantastic work on a lot of automation, a lot of integration with code in their data pipelines. And I'm sure they can give the details on that.
Yeah, Satyam and Akshay, you guys are software engineers, but this is becoming the paradigm where the frontline coding and data engineering work implements the operations as well. It's kind of like DevOps for data. For sure, right? And I think whenever you're working, even as a software engineer, the understanding of the business is equally important. You cannot be working on something and be away from the business, right? And that's where, like I mentioned earlier, we realized that we had to completely move our stack and start delivering analytics at 10 minutes, right? Because when you're delivering in 10 minutes, your leaders want to take decisions in near real time. That means you need to move with them. You need to move with the business. And when you do that, the kind of flexibility these tools give is what enables the business at the end of the day. Awesome. This is really kind of like, is there going to be a book called Agile Data Warehouses? I don't think so. I think so. Agile cloud data. This is cool. So let's get into what you guys do. What is Blinkit up to? What do you guys do? Can you take a minute, explain the company and your product? Sure, I'll take that. So Blinkit is India's biggest 10-minute delivery platform. It pioneered the delivery model in the country, with over 10 million Indians shopping on our platform, ranging across everything: groceries, staples, vegetables, emergency services, electronics and much more, right? It currently delivers over 200,000 orders every day and is in a hurry to bring the future of commerce to everyone in India. What's the relationship between Ahana and Blinkit? Wen, what's the tie-in? Yeah, so Blinkit had a pretty well-formed stack. They needed a little bit more flexibility and control. They thought a managed service was the way to go. And here at Ahana, we provide a SaaS managed service for Presto. So they engaged us and they evaluated our offering.
And more importantly, we were able to partner. You know, as an early-stage startup, we really rely on very strong partners with great use cases that are willing to collaborate. And the folks at Blinkit have been really great in helping us push our product, develop our product. And we've been very happy about the value that we've been able to deliver to them as well. Okay, so let's unpack the open data lakehouse. What is it? What's under the covers? Let's get into it. Sure, so if I bring up a slide, like I said before, it's really a paradigm of being able to run a gamut of analytics on top of the open data lake. So what does that mean? How did it come about? On the left-hand side of the slide, we are coming out of a world where, for the last several decades, the primary workhorse for SQL-based processing, reporting and dashboarding use cases was really the data warehouse. And what we're seeing is a shift, due to the trends in inexpensive, scalable cloud storage, the proliferation of open formats to facilitate using this storage with certain amounts of reliability and performance, and the adoption of frameworks that can operate on top of this cloud data lake. So while here at Ahana we're primarily focused on SQL workloads and Presto, this architecture really allows for other types of frameworks, and you see that on the ML and AI side. And to Satyam's point earlier, it offers a great amount of flexibility and modularity for many use cases in the cloud. So that's really the lakehouse, and people like it for the performance, the openness and the price performance. How's the open source side of it playing out? Open source is kind of open formats. What is the open source angle on this? Because there's a lot of different approaches. I'm hearing open formats; data stores are a big part of it. You've got SQL, you mentioned SQL; there's a mishmash of opportunities. Is it all coexisting?
Is it one tool to rule the world, or is it interchangeable? What's the open source angle? There are multiple angles, and I'll definitely let Satyam add to what I'm saying. This was definitely a big piece for Blinkit. So on one hand, you have the open formats, and what the open formats really enable is multiple compute engines working on the same data. That's very huge, because it's open; you're not locked in. I think the other part of open that is important, and I think it was important to Blinkit, is the governance around that. In particular, Presto is governed by the Linux Foundation. And so as a customer of open source technology, they want some assurances on things like: how is it governed, is the license going to change? So that aspect of openness, I think, is very important. Yeah, what about Blinkit? What's the data strategy here with the lakehouse and you guys? Why are you adopting this type of architecture? Adding to what Wen said, right? When we're thinking in terms of all these open stacks, you've got these open table formats, everything deployed over cloud, the primary reason is modularity. It's as simple as that, right? You can plug in and replace so many different kinds of table formats, from one thing to another, based on the use case that you're trying to serve, so that you get the most value out of the data, right? I'll give you a very simple example. We don't even use one single table format; it's not that one thing solves for everything, right? We use both Hudi and Iceberg to solve for different use cases. One is good when you're working with a certain kind of data set. Iceberg works well when you're in a SQL kind of interface, right? Hudi is still trying to reach there. It's going to get there very soon.
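The plug-and-play idea Satyam describes can be sketched in miniature: write the query layer against a generic reader interface, and the table format behind a given table becomes a registration detail rather than an architectural commitment. This is a toy in-memory sketch, assuming nothing about the real Hudi or Iceberg APIs; every name here is illustrative.

```python
# Toy sketch of "plug and play" table formats: the query layer only knows
# a generic TableReader interface, so swapping Hudi for Iceberg (or adding
# Delta) is a registration change, not a rewrite. All names are illustrative,
# not the real Hudi/Iceberg APIs.
from typing import Callable, Dict, List

# Each "format" just knows how to produce rows for a table name.
TableReader = Callable[[str], List[dict]]

FORMATS: Dict[str, TableReader] = {}

def register_format(name: str, reader: TableReader) -> None:
    """Plug a new table format into the registry."""
    FORMATS[name] = reader

def scan(fmt: str, table: str) -> List[dict]:
    """Query layer: format-agnostic, picked per use case."""
    return FORMATS[fmt](table)

# Two stand-in formats backed by in-memory data.
_hudi_data = {"orders": [{"id": 1, "amount": 250}]}
_iceberg_data = {"orders": [{"id": 1, "amount": 250}, {"id": 2, "amount": 90}]}

register_format("hudi", lambda t: _hudi_data[t])
register_format("iceberg", lambda t: _iceberg_data[t])

# Same call site, different format per use case.
print(len(scan("hudi", "orders")))     # 1
print(len(scan("iceberg", "orders")))  # 2
```

The design point is the indirection itself: because the consumer never touches format-specific code, a team can route each use case to whichever format serves it best, as Blinkit does with Hudi and Iceberg.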
So having the ability to plug and play different formats based on the use case helps you grow faster, helps you take decisions faster, because now you're not stuck on one thing that you'll have to implement, right? So I think that's what's great about this kind of data lake strategy. Keeping yourself cost effective, yeah, please. So the enablement is basically use-case driven. You don't have to re-architect for use cases. You can simply plug and play based on what you need for the use case. Yeah, and again, you can focus on your business use case. You can figure out what your business users need and not worry about these things, because that's where Presto comes in: it helps you stitch that data together across multiple data formats, gives you the performance that you need, and it works out best there. And that's something you don't get with a traditional warehouse these days, right? The kind of thing that we need, you don't get that. I do want to add, just to riff on what Satyam said, I think it's pretty interesting. It really allowed them to take the best of breed of what's out in the community, right? In the case of table formats, you've got Delta, you've got Hudi, you've got Iceberg, and they've all got their own roadmaps, and it's kind of organic how these different communities want to evolve. And I think that's great. But you have these end consumers like Blinkit who have different, maybe overlapping, use cases, and they're not forced to pick one. When you have an open architecture, they can really put together best of breed. And as these projects evolve, they can continue to monitor them, make decisions, and remain agile based on the landscape and how it's evolving. So agility is the key point: flexibility and agility and time to value with your data. Yeah. All right, I've got to get into why Presto is important here. Where does that fit in?
Why is Presto important? Yeah, I mean, for me, it all comes down to the use cases and the needs, and reporting and dashboarding are not going to go away anytime soon. It's a very common use case. Many of our customers like Blinkit come to us for that use case. The difference now is that today, people want to do that particular use case on top of the modern data lake, on top of scalable, inexpensive, low-cost storage, right? In addition to that, there's a need for this kind of low-latency, interactive ability to engage with the data. This often arises when you need to do things on an ad hoc basis, or you're in the developmental phase of building things up. So that's what your need is; latency is important, and getting your arms around the problem is very important. You have a certain SLA: I need to deliver something. That puts some requirements on the technology. And Presto is ideal for that use case. It's distributed, it's scalable, it's in-memory, and so it's able to really provide that. I think the other benefit of Presto, and why we're betting on Presto, is that it works well on data lakes, but you have to think about how these organizations are maturing with this technology. It's not necessarily all or nothing. You have organizations that have maybe the data lake, and it's augmented with other analytical data stores like Snowflake or Redshift. So another core aspect of Presto is its ability to federate, or connect and query across different data sources. This can be a permanent thing; this could also be a transitionary thing. We have some customers that are slowly shifting their data portfolio from maybe all data warehousing to 80% data lake, but it gives that optionality, it gives that ability to transition over a timeframe. So for all those reasons, the latency, the scalability, the federation, that's why Presto for this particular use case. And you can connect with other databases.
It can be a purpose-built database, it can be whatever, right? Sure, yes, yes. Presto has a very pluggable architecture. Okay, here's a question for the Blinkit team. Why did you choose Presto, and what led you to Ahana? Oh, so I'll take this. What makes Presto sit well in the architecture is how it is designed. Basically, Presto decouples your storage from your compute. People can use any storage, and Presto just works as a query engine for them. It has a concept of connectors, where you can connect with real-time databases like Pinot or Druid, along with your warehouses like Redshift, along with your data lake that's based on Hudi or Iceberg. So it's a very wide landscape that you can use with Presto, and the consumers, like the analysts, don't need to learn different SQL dialects or different querying paradigms for different sources. They just need to learn a single one, and they get a single place to consume from, and a single destination to write to as well. So it's a homogeneous architecture that allows you to put central security in place, which Presto integrates with, and it's also based on an open architecture, Apache Ranger. And it has certain innovative features, like caching, which reduces a lot of the cost. And since you have decoupled your storage from compute, you can further reduce the cost, because the biggest part of a traditional warehouse is the storage, and the cost goes up massively with the amount of data that you add. Basically, each time you add more data you require more storage, and warehouses ask you to write the data in their own format. Over here, since we have decoupled that, the storage costs have gone down.
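The connector model Akshay describes, one SQL surface over Pinot, Druid, Redshift, and a Hudi or Iceberg lake, can be caricatured in a few lines. This is a toy in-memory sketch of what federation buys the consumer, under invented data; real Presto connectors are configured in catalog properties files, not written like this.

```python
# Toy illustration of federation: one access path over multiple "catalogs"
# (stand-ins for Presto connectors to a lake, Redshift, Pinot, etc.), so a
# single query can join across systems. Data and names are made up.
from typing import Dict, List

# Each catalog maps table names to rows, as a connector would expose them.
CATALOGS: Dict[str, Dict[str, List[dict]]] = {
    "lake":     {"orders": [{"user_id": 7, "amount": 120},
                            {"user_id": 8, "amount": 40}]},
    "redshift": {"users":  [{"user_id": 7, "city": "Delhi"}]},
}

def table(catalog: str, name: str) -> List[dict]:
    """Single consumption point: catalog.table, one 'dialect' for everything."""
    return CATALOGS[catalog][name]

# A cross-catalog join -- the kind of query federation enables with SQL like
# SELECT ... FROM lake.orders o JOIN redshift.users u ON o.user_id = u.user_id
joined = [
    {**o, **u}
    for o in table("lake", "orders")
    for u in table("redshift", "users")
    if o["user_id"] == u["user_id"]
]
print(joined)  # [{'user_id': 7, 'amount': 120, 'city': 'Delhi'}]
```

The point of the sketch is that the consumer addresses everything through one interface; which engine actually holds the rows is hidden behind the catalog name, which is also what makes gradual warehouse-to-lake migrations possible.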
It's literally just the cost of what you write, and you pay for the compute, and you can scale out and scale in based on your requirements. If you have higher traffic you scale out; if you have less traffic you scale in. So all of those. So huge cost savings. Yeah. Yeah, cost effective, that's for sure. Yeah, cost effectiveness, and you get very good price value out of it; for each query, you can estimate what the cost is for you, based on that tracking and all those things. I mean, if you think about the classic iceberg and what's under the water, you don't know; it's the hidden cost. You think about the tooling, right? And also the time it takes to do stuff. So you have flexibility of choice. When we were riffing on this last time we chatted, and you brought it up earlier, you can have the open formats and different use cases with different tools or different platforms to work on them. You can use Redshift here and use something over there. You don't have to get locked in. Lock-in is a huge problem. How do you guys see that? It sounds like here there's not a lot of lock-in. You've got the open formats and you've got choice. Yeah, so you get the best of both worlds. With Ahana, or with Presto, you get the best of both worlds. Since it's cloud native, you can deploy your clusters very easily; within like five minutes, your cluster is up. You can start working on it. You can deploy multiple clusters for multiple teams. You also get the flexibility of adding new connectors, since it's open. And it's also much more secure, since it's cloud native, so you can control your security endpoints very well. So all those things come together with this architecture.
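The cost mechanics being described, pay only for compute, estimate cost per query, and scale the cluster with traffic, reduce to simple arithmetic. A rough sketch, where the rates and the per-node capacity are invented for illustration and any real pricing would come from the cloud provider:

```python
# Back-of-envelope model of decoupled storage/compute economics: per-query
# cost is roughly runtime x nodes x node-hour price, and node count tracks
# traffic. All rates and capacities here are made-up illustration values.

def query_cost(runtime_hours: float, nodes: int, price_per_node_hour: float) -> float:
    """Estimate the compute cost of one query; storage is billed separately."""
    return runtime_hours * nodes * price_per_node_hour

def target_nodes(queued_queries: int, per_node_capacity: int = 10) -> int:
    """Scale out when traffic is high, scale in when it drops (min 1 node)."""
    return max(1, -(-queued_queries // per_node_capacity))  # ceiling division

# A half-hour query on 4 nodes at a hypothetical $0.40/node-hour:
print(round(query_cost(0.5, 4, 0.40), 2))  # 0.8
print(target_nodes(35))  # 4 -- scale out under load
print(target_nodes(5))   # 1 -- scale in when traffic drops
```

The asymmetry this captures is the one Akshay points to: storage grows with data but stays cheap, while compute is elastic, so the bill follows query traffic rather than data volume.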
So you can definitely go more toward the lakehouse architecture than warehousing when you want to deliver data value faster, and you get much higher value out of your data in a short time period. So Satyam, it sounds like the old warehousing was like the application person: not a lot of usage, a lot of latency, okay here and there. But now you've got more speed to deploy clusters, scale up, scale down. Application developers, everybody: it's not one person, it's not one group, it's whenever you want. So you've got speed, you've got more diversity in the data opportunities, and you're coding. Yeah, I think data warehouses are a way to start for every organization that is getting into data. Data warehousing is still a solution, and will be a solution, for a lot of teams that are just getting into data. But as soon as you start scaling, as you start seeing the cost go up, as you start seeing the number of use cases add up, having an open format definitely helps. So I would say that's where we are heading, and that's how our journey with Presto started as well, and why we even thought about Ahana, right? Like you mentioned, one of the things that happened was, as we were moving to the lakehouse and the open table format, I think Ahana was one of the first ones in the market to have Hudi as a first-class citizen, completely supported, with things that at the time weren't even present in open source Presto, right? So we see Ahana working behind the scenes, improving some of these things on top of the open source ecosystem, and that's where we get the most value out of Ahana as well. This is the convergence of open source magic and commercialization. Because data as code reminds me: I hear the data warehouse is not going to go away, but you've got cloud scale. It reminds me of the old, oh yeah, I have a data center. Well, here comes the cloud.
The cloud doesn't really kill the data center, although Amazon would say the data center is going to be eliminated. No, you just use it for whatever you need it for. You use it for specific use cases, but all the action goes to the cloud for scale. The same thing is happening with data, and look at the open source community; it's kind of coming together. Data as code is coming together. Yeah, absolutely. I do want to connect another dot in terms of cost. We've been talking a little bit about price performance, but there's an implicit cost, and I think this was also very important to Blinkit, and also why we're offering a managed service. So one piece of it really revolves around the people, right? Outside of the technology and the performance, one thing that Akshay brought up, and it's another very important piece that I should have highlighted a little bit more, is that Presto exposes the ability to interact with your data in a very widely adopted way, which is basically ANSI SQL. So the ability for your practitioners to use this technology is huge. That's just regular Presto. In terms of a managed service, the folks at Blinkit are a great, high-performing team, but they have to be very efficient with their time and what they manage. And what we're trying to do is provide leverage for them: take a lot of the heavy lifting away, but at the same time figure out the right things to expose so that they keep that same flexibility. That's been the balancing point we've been trying to strike at Ahana, but it goes back to cost. What's the total cost of ownership? And that doesn't include just the actual query processing time, but the ability for the organization to absorb the solution, and what it costs in terms of the people involved. Yeah, great conversation.
I mean, this brings up the question: back in the data-center-to-cloud days, you had the concept of an SRE, which is now popular, the site reliability engineer. One person does all the clusters and manages all the scale. Is the data engineer the new SRE for data? I mean, are we seeing a similar trajectory? Just want to get your reaction. What do you guys think? Yeah, so I would say it definitely depends on the teams and their size. We're a high-performing team. Each organization takes bets on the pieces of the architecture where it wants to invest, and it comes down to the value of the engineers' time: basically, how much they can invest, how much they need to configure the architecture, and how much time it will take to get to market. This is what I would also highlight as an engineer. With Presto in a cloud-native environment, I think there's no one in the market that scales in and scales out as seamlessly. And further, with a team of our size, like three to four engineers, managing a cluster day in and day out, configuring, tuning and all those things takes a lot of time. And Ahana came in and took that off our plate, with a solution that works out of the box. So that's where this comes in, and it's also based on the open source community. The engineers' time is so valuable. My take on it, really, in terms of data engineering being the new SRE: I think that can work. It kind of depends on the actual person, and we definitely try to make the process as easy as possible. In Blinkit's case, they're data platform owners, but they're definitely aware of the pipelines. So they have very intimate knowledge of what the data engineers do. But in their case, they're managing a ton of systems. So it's not just even Presto.
They have a ton of systems, and surfacing that interface so they can cater to all the data engineers across their data systems, I think, is the big need for them. I know you guys want to chime in. I mean, we've seen the architecture and things like that. I think you guys are doing an amazing job there. So adding to John's point, right? I genuinely think that what DevOps is to the tech team, the data engineers or the data teams are to the data organization, right? They play a very similar role: you have to act as a guardrail to ensure that everyone has access to the data, so the democratization and everything is there, but that has to also come with security, right? And when you do that, for us, there are a lot of points where someone can interact with the data we have. And again, there's a mixed batch of open source tools that work well, and there are some paid tools as well. For visualization, we use Redash for our ad hoc analysis, and we use Tableau as well whenever we want to give very concise reporting. We have Jupyter notebooks in place, and we have EMRs as well. So we always have a mixed batch of things where people can interact with data, and most of our time is spent acting as that guardrail, ensuring that everyone has access to data but it isn't exploited, right? And I think that's where we spend most of our time. Yeah, and I think the time is valuable, but to your point about the democratization aspect of it, there seems to be a bigger step-function value that you're enabling. We talked about the 10X engineer; it's more like 50X, right? If you get it done right, the enablement downstream, at the scale that we're seeing with this new trend, is significant. It's not just, oh yeah, visualization and getting some data quicker. There are real advantages, a multiple on that engineering. And we saw that with DevOps, right?
Like, you do this right, and then magic happens on the edges. So yeah, it's interesting. You guys, congratulations, great environment. Thanks for sharing the insight. Blinkit, Wen, great to see you, and Ahana with Presto. Congratulations, open source meets data engineering. Thanks so much. Thanks, John. Appreciate it. Okay. Thanks, John. Thanks for having me. This is season two, episode two of our ongoing series. This one is data as code. This is theCUBE, I'm John Furrier. Thanks for watching.