This is your host Swapnil Bhartiya, and welcome to another episode of TFiR Let's Talk. Today we have with us Sida Shen, product marketing at CelerData. Sida, it's great to have you on the show.

Thank you for having me. I'm Sida, a product marketing manager at CelerData.

We have covered CelerData before, so our audience does know about the company. But since you are here, I would like to hear from you: what is CelerData, and what are the problems that you folks are solving? What are the areas you are focusing on?

In a nutshell, we're an analytical database company, and we focus on making analytics simpler and letting data engineers build new analytics projects faster. Our product, CelerData Cloud, is an analytics service based on StarRocks, the open-source OLAP database and query engine. The one thing we focus on is providing class-leading query performance, and we leverage that to eliminate data pipelines for our users, so our users can go pipeline-free. It shortens the amount of time it takes our users to develop new analytics projects and go into production faster.

Can you explain, when you say pipeline-free, what does that actually mean?

The concept of pipeline-free data analytics is a shift in how we approach big data and data engineering. My observation is that for a lot of the existing challenges on the database side, the industry has relied heavily on data pipelines that users have to build as workarounds. Those pipelines may solve the original problems, but they often introduce new problems: cost, complexity, and governance issues. I think the real innovation is in addressing these problems at the database level, on the database side, rather than troubling our users. I can give two scenarios as examples.

First, let's talk about data lake analytics. Data lakes are scalable, they're cost-efficient, and they're a place for you to throw all of your data into. Recently, open data lake table formats like Apache Hudi and Apache Iceberg have given data lakes data warehouse features, including indexing and transactional properties. That makes the data lake look like an ideal choice for data warehouse scenarios: low-latency, interactive, customer-facing analytics. Users shouldn't have to copy their data from the data lake into another service purely for query acceleration. But that's not really the case yet, because most of the query engines available today still rely on outdated technology, or they're optimized for ETL workloads, so they're not really ideal for data warehouse-like low-latency queries. This forces users to transfer and replicate their data into a proprietary data warehouse anyway, purely for fast query processing. That approach does address the performance issue, to be frank, but it introduces unnecessary cost in maintaining separate systems and copying the data, it's bad for data governance, and you have to maintain the data ingestion pipeline.

Right. The other example I want to bring up is multi-table joins. A multi-table join is probably the most expensive thing you can do when writing SQL. Joins are expensive, and optimizing multi-table joins is a big challenge, especially in the field of real-time analytics, because most real-time OLAP databases struggle to perform joins at scale.
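To make that second scenario concrete, here is a minimal sketch of the kind of query at issue: a star-schema join resolved at query time. The tables, columns, host, and credentials are invented for illustration; the MySQL-protocol connection reflects the fact that StarRocks, like several OLAP engines, speaks the MySQL wire protocol.

```python
import pymysql  # StarRocks exposes a MySQL-compatible endpoint (FE query port, typically 9030)

# A hypothetical three-table join: one large fact table plus two
# dimension tables, aggregated at query time with no pre-joined copy.
JOIN_QUERY = """
SELECT c.region,
       p.category,
       SUM(o.amount) AS revenue
FROM orders o                                 -- fact table
JOIN customers c ON o.customer_id = c.id      -- dimension table
JOIN products  p ON o.product_id  = p.id      -- dimension table
WHERE o.order_date >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
GROUP BY c.region, p.category
"""

def weekly_revenue(host: str, port: int = 9030) -> list[tuple]:
    """Run the multi-table join at query execution time."""
    conn = pymysql.connect(host=host, port=port, user="analyst", password="...")
    try:
        with conn.cursor() as cur:
            cur.execute(JOIN_QUERY)
            return cur.fetchall()
    finally:
        conn.close()
```

It is exactly this query-time work that many real-time OLAP systems cannot sustain at scale, which leads to the workaround described next.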
So these databases actually force users to implement a denormalization pipeline, which is essentially pre-computation: pre-joining the multiple tables into a big flat table beforehand, so the database doesn't have to handle the join at query execution. This process is like setting money on fire. It's extremely inefficient in terms of storage and compute; users are building big flat tables that might never get used. It also adds complexity, because meeting the strict freshness demands of real-time analytics requires specific technologies like Flink and other stateful stream-processing tools. And it makes the system rigid: any business change upstream that causes a schema change on the original tables requires a complete reconfiguration of the denormalization pipeline, as well as backfilling all of the related data. That's my second example.

But there's a silver lining. In our experience with our clients and our open-source StarRocks users, both of these problems, and more, can be solved on the database side, so our users don't have to go through the painful process of developing unnecessary data pipelines to work around them. That's why we're building the new pipeline-free data analytics architecture.

Can you also talk a bit about where the market is? A lot of new use cases are emerging. When we look at CelerData, or at StarRocks, what kind of market evolution are you seeing today? There are traditional use cases, and then there are new use cases emerging because of new workloads.

There are definitely new workloads that are valuable, but there are also traditional workloads that can benefit greatly from newer technologies. You get better latency, and you get simpler architectures, because you can do more with one system now. Back in the day, you probably had to run one system for each kind of workload. So I think both.

Can you also talk about organizations? There are, once again, companies that have been around for a long time, and at the same time there are new companies. How challenging is it for them when the time comes to upgrade and implement new data analytics solutions?

A lot of analytics solutions are on the cloud now, most of them, and it's very different on the cloud versus on-prem. If you want to test a new system on-prem, you have to buy all the servers, hire a bunch of people, and develop the thing yourself before you can even start testing. By that time you've probably spent 90% of your budget already. But in the cloud, everything is elastic. With the pay-as-you-go cost structure of the public cloud, you can get started with just a few hundred bucks, one person, and two or three days, and begin testing a new solution. It's way cheaper that way. And also, like I just talked about, newer technologies can definitely simplify the data pipeline, so there's less stuff you have to build from your POC to your production. Say, for data lake analytics, the data lake can do a lot more now, paired with modern query engines such as StarRocks.
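As a sketch of what querying the lake in place can look like: with StarRocks-style external catalogs, an Iceberg table can be queried where it sits, with no ingestion step. The catalog property keys, names, and hosts below are assumptions written from memory of StarRocks-flavored syntax, not a verified configuration; check the documentation for your version.

```python
import pymysql  # same MySQL-protocol connection as before

# Illustrative one-time setup: register the lake as an external catalog.
# Property keys are assumptions modeled on StarRocks Iceberg catalogs.
DDL = """
CREATE EXTERNAL CATALOG iceberg_lake
PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "hive",
    "hive.metastore.uris" = "thrift://metastore-host:9083"
)
"""

# Then query the Iceberg table directly, no second copy of the data.
QUERY = """
SELECT order_date, COUNT(*) AS orders
FROM iceberg_lake.sales_db.orders
WHERE order_date >= '2024-01-01'
GROUP BY order_date
ORDER BY order_date
"""

conn = pymysql.connect(host="starrocks-fe", port=9030, user="analyst", password="...")
with conn.cursor() as cur:
    cur.execute(DDL)      # one-time registration of the lake
    cur.execute(QUERY)    # reads the open-format table in place
    for row in cur.fetchall():
        print(row)
conn.close()
```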
So you can actually run your very demanding workloads on the data lake without moving your data into another data warehouse, and that saves a lot of trouble. The same goes for joins and denormalization: modern query engines can handle joins at scale, so you don't have to build the denormalization pipelines you would have needed back in the day. So the answer is yes, it's a lot easier now than before to test a new analytics solution.

A few weeks ago on Twitter, Kelsey Hightower wrote that if you look at the world around us, it's all about data. You just put it in different boxes, but it is all about data. All the software we write exists to extract the value of data or to present it to users. Which brings me to a new kind of workload that is emerging: generative AI. I want to understand, from CelerData's perspective, how do you folks look at generative AI, either as a workload, or in terms of leveraging generative AI for CelerData's solutions?

The vector database has been the hot topic of the past year. People use vector databases as long-term memory for large language models. Not on the CelerData side, but in the community, we have a lot of contributions from our community users to make StarRocks work as a vector database. It's great to see those contributions and all the innovative projects being built around the StarRocks community. (A toy sketch of that long-term-memory retrieval pattern appears at the end of this exchange.)

I also want to talk a bit about the open-source aspect here, and the community. Talk about the importance of open source, and about the community around your technologies and solutions.

Open source is magic. Before I worked in product marketing, I was actually a product manager for the StarRocks core. Having a community with thousands of active users, we get first-hand insight faster than everybody else. We can develop something and have a user tell me that what we developed is wrong within four hours of releasing it. That is magic. And databases are complex, especially around query planning and distributed compute, so for our optimizer and for a lot of other features, actually all of our features, we test with our seed users before we GA them. That's only possible because we have an active community. So thank you, StarRocks community, for making that happen for us.

We can break this question into two parts. The first part: talk about the impact of people on companies. Do you feel that there are enough data scientists and data engineers for companies to keep up, or do you feel there is a shortage?

Good technology should not require a lot of labor; it's supposed to be easy to use. And good technology actually simplifies the pipeline, like what we do. So the answer is: if you're using the right technology, no, you should not lack labor. But if you're not, a shortage is certainly possible, and that's the time to think about your technology stack and where the labor is being spent. If your people are doing analytics, are they busy ingesting data into a data warehouse instead of querying the data on the data lake? Are they building unnecessary denormalization pipelines, or other kinds of pipelines? And is there any technology you can replace with newer ones, so that you don't have to build those data pipelines that are expensive and labor-intensive?
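Picking up the vector-database thread from above: a toy, in-memory sketch of the "long-term memory for an LLM" pattern, storing text with embeddings and retrieving the nearest neighbors to feed back as context. A real deployment would delegate storage and search to a vector-capable database (such as the community StarRocks work mentioned), and the embeddings here are assumed to come from some model elsewhere.

```python
import numpy as np

class VectorMemory:
    """Toy long-term memory: store texts with embeddings, retrieve by
    cosine similarity. Purely illustrative, not a production index."""

    def __init__(self, dim: int):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.texts: list[str] = []

    def add(self, text: str, embedding: np.ndarray) -> None:
        # Normalize once so a dot product later equals cosine similarity.
        v = (embedding / np.linalg.norm(embedding)).astype(np.float32)
        self.vectors = np.vstack([self.vectors, v])
        self.texts.append(text)

    def search(self, query: np.ndarray, k: int = 3) -> list[str]:
        q = query / np.linalg.norm(query)
        scores = self.vectors @ q                  # cosine similarities
        top = np.argsort(scores)[::-1][:k]         # best k matches
        return [self.texts[i] for i in top]

# Usage: memory.add("user prefers metric units", embed(text)), then
# memory.search(embed(question)) to pull relevant context into a prompt.
```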
Can you talk about the CelerData solutions that are targeted at this audience?

What we offer is CelerData Cloud, and it is built to handle low-latency, complex OLAP queries at scale. The main purpose of what we build is to deliver extreme query performance, to let our users get data warehouse performance on the data lake, so they can run high-concurrency, low-latency OLAP workloads, such as customer-facing analytics, directly on the data lake without moving the data. CelerData Cloud is also really good at running complex multi-table join queries at scale with low latency, allowing our users to ditch denormalization, especially in real-time analytics.

A good example is Airbnb's Minerva platform, a metrics platform that holds six petabytes of data. Before they adopted StarRocks, they were running Trino as the query engine for the data lake and copying data into Apache Druid for query acceleration. Apache Druid and Trino cannot really handle multi-table join queries at scale, so denormalization was a must for all of their data. Creating flat tables for all six petabytes was expensive, not only in storage but in compute, to build flat tables that might never get used. The denormalization pipeline also made the system rigid: a single schema change, which is pretty normal for a metrics platform, could take from hours up to days, because of the need to reconfigure the entire pipeline and backfill data at petabyte scale. Then they moved Minerva to StarRocks. With all of the data handled by StarRocks, there was no more data copying and no more data ingestion, so there was only one copy of the data, and that's great for data governance. And StarRocks performs join operations well enough that denormalization is only done on demand, for extreme cases, instead of by default for every single table. (A sketch of that on-demand pattern follows this answer.) That really decreased their cost overhead and made their analytics flexible to business changes.
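To illustrate the on-demand point: rather than a mandatory stream-processing pipeline in front of the database, a single hot query path can be pre-computed inside the engine, for example with an asynchronous materialized view. The statement below is StarRocks-flavored but written as a sketch, reusing the invented tables from the earlier join example; treat the exact refresh syntax as an assumption and verify it against the docs.

```python
# On-demand pre-computation for one extreme hot path, instead of a
# default denormalization pipeline for every table. StarRocks-flavored
# sketch: the REFRESH clause syntax is an assumption, check the docs.
ON_DEMAND_MV = """
CREATE MATERIALIZED VIEW hot_revenue_by_region
REFRESH ASYNC EVERY (INTERVAL 10 MINUTE)
AS
SELECT c.region, p.category, SUM(o.amount) AS revenue
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products  p ON o.product_id  = p.id
GROUP BY c.region, p.category
"""
# Executed over the same MySQL-protocol connection as the earlier
# examples. A schema change upstream now means editing one view
# definition, not reconfiguring a streaming pipeline and backfilling.
```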
One more question I have for you. I don't want to go into detail on what is better, a data warehouse or a data lake, because customers may choose to keep their data wherever they want. Irrespective of where the data is, how do you make it cost-effective and, once again, optimized for them, so they can get more value out of it? And at the same time, can you share some advice on how organizations should approach where they put their data, to extract the most value?

Data lakes are cost-efficient, they're scalable, and they're a place to dump all of your data. But back in the day, limited by the performance of the query engines, a lot of your workloads needed to be extracted from the data lake: you had to copy your data from the data lake into a high-performance, proprietary data warehouse, going from open formats to closed formats and duplicating your data along the way. That makes everything expensive. With newer technologies, modern query engines can now get data warehouse-like performance directly on the data lake, so you don't have to go through that very expensive data ingestion and copying process. My advice would be to keep your performance-demanding scenarios, like customer-facing analytics, on the data lake, and give newer technology a try. That's actually possible today.

Sida, thank you so much for taking time out today to talk about real-time analytics for data lakes, of course, but also a lot of broader questions. Thanks for all those insights, and I would love to talk to you folks again.

Thank you. Thank you so much.