Hi, this is Swapnil Bhartiya, and welcome to another episode of TFiR's Topic of the Month. The topic of this month is data, and today we have with us once again Arjun Narayan, co-founder and CEO of Materialize. Arjun, it's great to have you back on the show.

Thank you so much for having me. It's great to be here.

Yeah. As we were saying, we have covered you folks earlier, but it's always a good idea to remind our viewers: what is Materialize all about?

So Materialize is a data-warehouse-like experience for building real-time applications, analytics, and experiences with just SQL. It looks and feels exactly like the cloud data warehouse you're used to: you write SQL on top of your data sets and you get answers. The unique thing about Materialize is that all of this happens as the data is changing, second by second or millisecond by millisecond, and your results stay up to date, rather than doing analytics at the end of the day on yesterday's or last week's data.

I want to talk about data in general and how it has evolved. Of course, things have changed so fast that we can't dwell too much on the past, but if you look at the traditional IT world (I won't use the word legacy) and compare it to the cloud-centric or Kubernetes-native world, how have you seen the way we generate data and consume data change, and how are organizations looking at data differently? So talk about this evolution, please.

Yeah, this is a great question. I think there's a lot to be said for the data stacks of the 90s. You used to have data that was generated by and large by transactional databases, which was then ETL'd out to a data warehouse on premises, on which you would run a bunch of retrospective analysis. What changed was that with internet scale, data sets got larger and larger. So more than a decade ago, maybe two decades ago, came the big data revolution. It was very real: these data sets were getting larger and larger, primarily because we were collecting more data about each and every transaction. Rather than just having transactional databases as the source, we also had clickstream analytics, all sorts of tracking analytics, and things like that. So on a per-transaction basis we were collecting more and more data, and analyzing it was getting more and more difficult. In the early 2000s you saw the first wave of trying to deal with this, which was the Hadoop wave. Because the traditional data warehouses were not able to scale to these larger data set sizes, folks were essentially unbundling the data warehouse. They were building larger, scalable analytics stacks to get some value out of these larger and larger data sets. The problem was this was really challenging. Building a Hadoop program, or an analytics stack on top of Hadoop, was 10x or 100x as complex as writing a SQL query on your data warehouse. You had to write a lot of Java code, you had to orchestrate it, you had schedulers, you had the Hadoop file system. It was a lot of complexity. And I don't want to talk down that complexity, because it at least let you do some analytics that you could not do before. What changed with the cloud was the emergence of cloud-native data warehouses, which gave you all of those benefits without having to roll your own infrastructure.
So we had the first wave of what we call the modern data stack, which is essentially another way of saying the cloud-native data stack. It gives you all the benefits of elastic scalability, but with the user experience of just typing SQL into a terminal. And the number of people in your organization who can type SQL into a terminal is much larger than the number of people who can write a set of complex Java services and orchestrate them in the cloud. So you saw this tremendous wave of productivity. With this increased analytics productivity, organizations were able to make better use of all the data they had lying around and become more data-driven. And that's roughly where I think we find ourselves today when it comes to batch analytics.

Can you also talk about the emergence of real-time data sets? It's not anything new, and it's not applicable to every use case; there are different industries and different use cases where real-time data makes sense, and a lot of the time it doesn't. Also, if you look at the old days of Kubernetes, things were stateless; now they're becoming stateful. Data is becoming big, and when we talk about data we're trying to narrow the focus with you, but there are so many things we could talk about: egress fees, data lakehouses, warehouses. It's a very broad topic, but I want to focus on real-time data sets: talk about their emergence and how they are solving some of the modern challenges. Of course, complexity is there, not just with the data but with the whole cloud data system, and that complexity is not going to go away; we have to learn how to deal with it. So talk about the emergence of these solutions to help with that complexity.

I think where we are in real time parallels, in many ways, where we were going from Hadoop to the modern data stack. And by the way, not every single query needs to be live and up to date, so the emergence of real-time data use cases is not going to completely displace batch analytics. In fact, a lot of data science, a lot of historical analytics, computing your revenue for the quarter close, that can take a few days and it's absolutely fine; you want to build a trusted pipeline that is repeatable and simple to operate, and a batch data warehouse is fantastic for that. But when you start to move towards using this data in your application, using this data in your day-to-day business processes, old data becomes a liability, because you want to be operating on the freshest data possible. Take advertising, segmentation, personalization, supply chain, logistics: all of these uses for data benefit from the freshest data possible. Whether it's the most up-to-date information you have about a customer or about a package sitting in a physical logistics center, it's much better for it to be current. And the old tools for doing batch analytics are not the right fit. Nevertheless, we end up reaching for them because they're easy to use, and that's because, until fairly recently, real-time data was not easy to work with. You had to build a lot of manual infrastructure, and this again parallels, in many ways, where we were with the Hadoop ecosystem.
If you want to build a real-time, sub-second analytics dashboard, you're going to have to build and operate some Kafka clusters, you're going to have to build and operate some microservices, you're going to have to put some of those things, the stateless ones, in Kubernetes, and for the stateful ones you're going to bring in your own third-party key-value store. This is a lot of infrastructure. And a lot of people are of course doing this, because the value is very much there; it's very beneficial for organizations to be operating on real-time data. But similar to how the modern data stack enabled every organization to be data-driven, we believe that a full SQL experience, where people can work with real-time data using just SQL, will enable every organization to work with real-time data, and not just those that can build, scale, and operate large distributed-systems clusters in the cloud.

What kind of evolution of tooling have you seen around real-time data sets?

I definitely think there's a de facto standard for moving data from point A to point B, which is Kafka. Most people who are moving data around in their organization tend to use Kafka, or a Kafka-compatible tool like Redpanda. It's very good for decoupling your services, allowing different teams to both produce and consume that data without a very rigid API that they must agree on. You can by and large dump some data into Kafka and hoover it up on the other side very well. Beyond that, it's a bit of a Wild West. In terms of building services that ingest from Kafka, manipulate it, and get some value out of it, that today tends to be mostly bespoke, custom-written microservices. Many of our customers and users have all sorts of different internal architectures they've standardized on, and for sure there are plenty of organizations getting a lot of value out of Kafka. What we want to bring is the ability for organizations that are moving to real time, building some of this infrastructure, or early in their journey to do so with the same skill sets they have built over decades of working with batch data, which is, namely, speaking SQL. Most organizations have a tremendous capability of writing a lot of SQL. They usually have a lot of existing business logic already defined, and many of our customers and users start with: look, we've done this analysis in batch on our historical data sets, and we know that if we could segment our users in real time based off exactly this query and these parameters, it would be incredibly valuable. Today, if they want to do that, they have to take this business logic and reimplement it in a completely different architecture, writing a program from scratch that has the same semantics, which is a really hard lift and also creates all sorts of maintainability and support burdens over time. You don't want these two definitions to drift; you want them to co-evolve. With Materialize, our users can start with that SQL query or that dbt project they've defined and just move it to real time in less than a day's worth of effort.
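To make that concrete, here is a minimal sketch of what moving an existing batch query to real time can look like in Materialize. The table and column names (orders, customers, segment, amount) are hypothetical, and the exact SQL surface may vary between Materialize versions:

    -- An existing batch-style query: revenue per customer segment.
    -- Declaring it as a materialized view asks Materialize to keep the
    -- result incrementally up to date as the underlying data changes.
    CREATE MATERIALIZED VIEW revenue_by_segment AS
    SELECT c.segment, sum(o.amount) AS revenue
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.segment;

    -- Reading it back is an ordinary SELECT, just as on a batch warehouse,
    -- except the answer reflects the data as of right now.
    SELECT * FROM revenue_by_segment;

The same view definition could come straight out of a dbt model, which is the migration path described here.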
When you look at Kafka or any other open-source technology, folks can get started with it easily, but as you scale, that's where the challenges start: support, additional features, and so on. That's where commercial players play a very big role. Can you talk about how Materialize is helping users of Kafka through commercial support?

So Materialize works very well with Kafka. If you have a Kafka cluster, be it on Confluent Cloud, on Amazon, or in a hyperscaler-managed Kafka deployment, Materialize can connect to it and let you interact with the data in that Kafka topic, or that set of Kafka topics, using just SQL queries. It's the fastest way to get some value out of the data that is in Kafka. It connects and pulls data from whatever Kafka source you have and allows you to be productive very quickly.

Let's look at some of the pain points of developers. How are developers building modern applications that scale from the start? Of course, there are a lot of greenfield and brownfield deployments out there. Talk about how they are doing it, and in general, what kind of capabilities are they looking for when it comes to, once again, real-time data sets?

Working with real-time data that is changing at high throughput, at large scales or volumes, is one of the most challenging problems. You have to build and scale microservices that can consume from many Kafka topics, from many partitions of Kafka. Kafka has this notion of a partition, which is a subset of the entire volume of data passing through a topic, in order to let you scale out to multiple workers. You may have so much data, say millions of messages per second, and the work you're doing on it may be pretty complex, so you may need a fleet of workers operating in parallel. Kafka lets you partition your topics so that you can say: partition this 50 ways, and have 50 workers consuming it in parallel. Of course, now you have to orchestrate, and perhaps scale up and down, and elastically support a changing volume with a changing number of workers. The work you're doing may be stateful, and if it's stateful, these workers either need to manage that state in Kubernetes natively or offload it to yet another service, typically a NoSQL data store, which may itself be scaling elastically up and down. These are some of the hardest problems in distributed systems. So managing this at scale is not where our recommendation is to start. It's to build something simple, something maintainable that you can understand, with semantics you are familiar with, and to use best-in-class tools that will scale in the cloud for you. I think this has been very successful in batch. One reason for the success of Snowflake is that nobody really talks about scaling and offloading state in Snowflake; they just say, size up my cluster, take my medium cluster and make it a large cluster. You may not know, or really want to know, the details of what's going on under the hood, because that is a distributed, auto-scaling, elastically changing set of infrastructure being controlled behind the scenes. But the beauty of it is you don't have to worry about any of those challenges. You're just typing a single command in SQL.
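As a rough illustration of that just-SQL experience on the streaming side, here is a sketch of connecting Materialize to an existing Kafka topic, as described earlier in this conversation. The broker address, topic name, and message format are placeholders, and the exact connection options depend on your Kafka setup and Materialize version:

    -- Describe where the Kafka cluster lives (authentication options omitted).
    CREATE CONNECTION kafka_conn TO KAFKA (BROKER 'broker-1.example.com:9092');

    -- Ingest a topic as a relation that Materialize keeps continuously updated.
    CREATE SOURCE orders_stream
      FROM KAFKA CONNECTION kafka_conn (TOPIC 'orders')
      FORMAT JSON;

    -- From here on it is ordinary SQL over live data.
    SELECT count(*) FROM orders_stream;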
Can you also talk about, as developers are building these modern applications, what some of the common stacks for real-time data are?

There are some databases that support low-latency aggregations and compacted views of fast-changing data. But one of the hardest problems in manipulating real-time data is joining across multiple streams. Joins are inherently a stateful computation, particularly if you're writing a join that could join from one topic back to another across the entire histories of those topics, and Materialize really shines at that. Writing SQL queries on highly normalized data, particularly when it's coming from traditional transactional databases like MySQL, Postgres, or Oracle, can be very challenging, because you might have to implement the state management yourself. One thing a lot of our customers are delighted by in Materialize is that you can write, say, an 8-way or a 10-way join across your highly normalized input data, and it just works out of the box. And speaking of transactional databases, to your earlier question about standard stacks, one thing I forgot to mention: change data capture coming from transactional databases and landing in Kafka tends to be a very common pattern. Using an open-source tool like Debezium, or commercial tools that move data from your transactional databases into your real-time stack, tends to be a large source of the real-time data in the first place.

Developers are building modern applications, irrespective of where they are in their journey. Talk a bit about the approach they should take so they can use the current capabilities of real-time infrastructure to future-proof their applications, and ensure that they're able to scale without having to go through some kind of rearchitecture at some point.

I think one of the most important things is to consider, from the start, the full lifecycle of managing your real-time application. That means it's not just about building something that works as a prototype, but also dealing with changes to the logic, and having ease of maintainability as you update that logic. This can be very complicated in real time, because there are multiple questions you have to solve for. I changed my business logic; I need to go back in time and rerun everything according to the new business logic because I corrected an error. I may not be able to change some of the events I emitted, but if I'm maintaining, say, an aggregation, I still want the corrected, up-to-date count. This requires a lot of capabilities. It requires the ability to replay history. It requires the ability to do a live migration from an old real-time service to a new real-time service. It may require handling schema changes in the upstream database. These are not new problems in software development; databases have supported schema changes, migrations, and things like that for a long time. I think taking a database-like, or database-oriented, approach will serve you excellently: ask what you would do if this were a standard transactional database, and look for tools that will give you that capability.
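As a sketch of the multi-way joins over normalized, change-data-captured tables described above, assume a handful of tables (the names here are hypothetical) have already been replicated into Materialize from Postgres via change data capture. The join is declared once, and Materialize maintains the stateful intermediate results itself:

    -- A five-way join across normalized tables, kept incrementally up to date.
    CREATE MATERIALIZED VIEW order_details AS
    SELECT o.id, c.name, p.title, s.status, a.city
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    JOIN products  p ON o.product_id  = p.id
    JOIN shipments s ON s.order_id    = o.id
    JOIN addresses a ON a.customer_id = c.id;

The equivalent batch query on a warehouse would look identical; the difference is that the results here track the upstream database as it changes.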
We talk about tech solutions and we talk about cultural changes. When it comes to data and leveraging it, as you rightly said, we live in a data-centric world. Do you feel that organizations also need some kind of cultural change? Because, first of all, it's hard to find data scientists, and a lot of organizations are going through cost-cutting, so it remains to be seen how much of an effect that has on data teams. But do you feel organizations need a culture where they don't treat data as a silo, and that, just the way we talk about shifting left on security, we should also talk about cultural changes around data?

I love this one, because I think the most important thing is to make data accessible to the entire organization. SQL is one of those standards that allows the largest set of your employees to get access to data. It's a lot easier to say: hey, here's the data, you can write any SQL query to get any answer you want. I think a large part of the success of the modern cloud data warehouse is the fact that a lot more people can get access to this data. The second thing the modern cloud data warehouses allow you to do is isolate performance and use cases, so anyone can spin up a virtual warehouse or a cluster to ask their own question without threatening or destabilizing the 14 other production tools that currently rely on non-interference. This was not the case on-prem, when you had a limited hardware footprint that was under contention: some analyst came in and asked a question, that question was very compute-intensive, and now you're threatening the regularly scheduled batch jobs competing for the same hardware resources. That is not the case in an elastically scalable cloud. So adopting modern tooling can help you build more democratic access, within your organization, to all the data that you do have.

Arjun, thank you so much for taking time out today to sit down with me and talk about not only the evolution of data but also how developers can look at it. I loved it, and I also loved the discussion around the cultural change that is needed. Thanks for all those insights, and I would love to chat with you again.

Thank you. Thank you so much for having me, Swapnil, and enjoy the rest of your day.