Hi, this is Swapnil Bhartiya, and welcome to another episode of T3M, Topic of the Month. The topic of this month is data, and today we have with us Swaroop Jagadish, CEO and co-founder at Acryl Data. Swaroop, it's great to have you on the show.

Thanks for having me, I'm really glad to be here.

I would love to hear a bit about the company itself, because you are one of the co-founders. So talk a bit about when the company was created and what the whole idea behind it was.

Just as quick background, I used to lead data at Airbnb prior to this, where we built the platform and open-sourced many tools during Airbnb's hyper-scaling period. And my co-founder Shirshanka led the data team at LinkedIn as a tech lead for an extended period of time, and he went through various stages of evolution of that data platform. Much of the learnings and the core thesis for forming the company came from there, which is: for data to be effective at an organization, you need certain fundamental things. First, of course, data productivity. Everyone getting access to data, with the full context behind why something needs to be looked at, is the starting point. From there you have concerns like data quality. For example, when Airbnb was about to go IPO, core company metrics needed to be in a pristine state. Quickly following that is compliance; Shirshanka, my co-founder, was the overall lead for GDPR at LinkedIn. And then cloud cost efficiency, which I led at Airbnb from an engineering standpoint. All of these are highly connected use cases, and often they're tackled with bespoke solutions. So the core thesis is that you need a metadata-driven data management approach to tackle all of these use cases on the same underlying platform. Shirshanka started the DataHub project at LinkedIn to prove out this hypothesis; it was widely successful within LinkedIn, and then he open-sourced it. Since we started the company, DataHub has become the number one open-source metadata platform. But the core thesis is really to build that control plane for data, using metadata as the core substrate.

When we look at DataHub or other such open-source projects, some of them were created, as you mentioned, inside a company like LinkedIn while you were working there. So it's not a project created for the sake of a project; like Kubernetes, it was being used in production. So talk a bit about how widely DataHub is being used, and how you folks at Acryl are involved with it now that it's an open-source project. Then we'll talk about the company in other aspects, but I want to understand these two basic things first.

So Shirshanka started the project at LinkedIn and got it to a stage where it was widely used within LinkedIn, and there were also some external companies using it. But after we both identified the big need in the market, LinkedIn leadership was very supportive of forming a company behind it; LinkedIn is an investor in us. Since starting the company, Acryl has been the main driver of the open-source project. The project itself is thriving. There are a thousand-plus companies that have adopted DataHub now, including the likes of Pinterest, Stripe, Optum Health, Peloton, a whole bunch of companies. And our Slack community has about 7,800 people now, so it's really buzzing, and there are a lot of contributions from many companies across the world.
So all of that adoption and continuous feedback does a few things. First, we do product development at scale using community-led input. It's not just about the product; it's also about how you make it work in real-world environments at scale. There's a lot of feedback about how data contracts have to be developed, how data products have to be built; some of these emerging concepts really need the input of the best data practitioners in the world. That's number one. Number two is the deep contributions, especially in the integrations area, which give us a lot of breadth in connecting with a variety of sources beyond what we could imagine; the community has really pulled us forward in creating that integration depth. And finally, when some of these adopters become Acryl customers, they continue to contribute to the project, creating a virtuous cycle for Acryl as a company: people start out as adopters, become customers, and keep contributing to the project.

When I look at Acryl Data and DataHub, it looks like the same kind of symbiotic relationship we see with a lot of open-source projects, where a fully open-source project sits alongside a company, because open source can very easily solve the day-one problem. Especially with data, two other big challenges are scaling things and adding features. Some of the big companies have all the resources internally, but commercial players play a very big role in helping the community. So talk about the role Acryl Data is playing in helping the DataHub ecosystem.

Before I jump into that, I think it's important to understand how DataHub is fundamentally differentiated and how we gained the mind share of data practitioners within companies. There are a few reasons why DataHub is differentiated. First, we strongly believe in creating a 360-degree view of all types of metadata. Existing catalogs mostly focus on business metadata and on consumers, but forget about producers. Bringing operational metadata, technical metadata and business metadata into the same context gives a lot of benefits. For example, if a business user is relying on a dashboard, they should not only understand the business context, like ownership, documentation and the glossary terms that best describe that dashboard, but also understand the reliability of the datasets powering that dashboard, and the same for machine learning models and so on. So the metadata platform has to be able to deal with things like pipeline runs, row counts and data quality tests in addition to the classic business metadata. Bringing all of it into one place is crucial. Second, shift left: we strongly believe that developers should care about emitting the freshest metadata in the workflow they already have. So it's important to be integrated into the developer workflow, not to create yet another thing they have to worry about just because the company uses a metadata platform; that automatic integration into their toolkit is really important. And the last thing is that DataHub is event-oriented. It's not just meant for human users; it's built on a real-time platform, so it allows subscribing to events happening within DataHub and implementing business workflows based on those subscriptions.
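As a concrete illustration of that event orientation, here is a minimal sketch of subscribing to DataHub's change-log stream with a plain Kafka consumer. The broker address is an assumption, and since the payloads are Avro-encoded, a real deployment would use a schema-registry-aware deserializer (or the datahub-actions framework) rather than reading raw bytes:

```python
# Minimal sketch: subscribe to DataHub's metadata change-log stream.
# Assumes a local Kafka broker; production code would deserialize the
# Avro payloads via the schema registry or use datahub-actions instead.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumption: local broker
    "group.id": "metadata-workflow-demo",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["MetadataChangeLog_Versioned_v1"])  # DataHub's default change-log topic

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        # A real workflow would decode the event and react to it, e.g.
        # trigger a re-certification when a dataset's owner changes.
        print(f"metadata change event received: {len(msg.value())} bytes")
finally:
    consumer.close()
```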
These are fundamental capabilities. Now, what Acryl does on top, as you would expect with an enterprise SaaS product, focuses on two things: enterprise readiness and time-to-value reduction. Time-to-value reduction is about improving key governance KPIs through automation. When it comes to enterprise readiness, the table stakes are uptime and disaster-recovery SLAs. We have our largest customers onboarding millions of datasets onto the platform and pushing hundreds of QPS of API traffic; being able to support that kind of scale and the SLAs we need for uptime and disaster recovery, that's table stakes. The second piece is around automation capabilities. We have a monitoring framework called metadata tests, which allows for the automated creation of a governance layer on top of all the raw data assets by tiering them into concepts like gold, silver and bronze. The central team can set the standards for what qualifies as gold, what qualifies as silver and so on, all through automation. And the last piece is really around intelligence and workflows. We have search ranking built into the SaaS product, and intelligence features like automatically flagging duplicate datasets so you can reduce the costs associated with them. And we're working on some AI capabilities to further reduce the time to value for the last mile.

There are other projects that also provide a metadata platform. Of course, DataHub was created inside those companies, so there's already a massive use case that validates it. But if you compare the projects, what is the unique value that DataHub brings to the ecosystem, and how does it help players move forward and embrace modern technologies?

Just to repeat some of the things I mentioned about metadata 360, shift left and event-oriented metadata: those are the core capabilities of DataHub that allow us to gain strong mind share with data practitioners. So you get the classic bottom-up adoption of the platform, driven by the people actually trying to get the work done, as opposed to only what the CDO or the VP of data cares about. Being integrated into their toolkit, and being a platform that supports programmatic use cases and not just human ones, is a fundamental differentiator of Acryl.
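To make that programmatic, in-toolkit integration concrete, here is a minimal sketch using the open-source DataHub Python SDK (the acryl-datahub package). The server URL, dataset name and properties are illustrative assumptions:

```python
# Minimal sketch: emit metadata from a developer workflow (e.g. a CI job)
# instead of editing it by hand in a UI, using the acryl-datahub SDK.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # assumption: local DataHub

# Hypothetical dataset; in CI this would come from the code being shipped.
dataset_urn = make_dataset_urn(platform="snowflake", name="analytics.revenue_daily", env="PROD")

# Attach business context to the dataset; the same pattern works for
# ownership, glossary terms and tags.
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=DatasetPropertiesClass(
            description="Daily revenue rollup, produced by the finance pipeline.",
            customProperties={"team": "finance", "tier": "gold"},
        ),
    )
)
```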
Is data still a silo, where specific teams specialize in it? And the same question applies to shift left: developers have to get involved, but at the same time you have to lower the barrier to entry so they don't get intimidated by one more component in the whole cloud-native stack. Does that question make sense?

Yeah, it does. Let me talk about a few trends really quickly. If you go back 10 years, the data stack was more or less contained; there were a few platforms, but today there's been an explosion of categories. A typical enterprise deals with 10 or 15 stages of data moving through all these different phases, and that causes a massive loss of context. Key use cases like data discovery and data quality, which used to be much easier 10 or 15 years ago, have suddenly become much more complicated due to this extreme fragmentation. The second thing is that the composition of the data team itself has undergone a lot of change. If you go back several years, there used to be central data teams that were massive in size, but now there's a lot more decentralization. Data engineers and data scientists sit more in the business-facing teams, and the central team is trying to support a large number of stakeholders with a relatively small team. Essentially, this puts more emphasis on automation: defining standards that all the different business units can adopt, and giving data practitioners the right tools so they can get data right from the get-go without a massive investment in people. And finally, for business users, it's about providing unified context across all the transit points the data travels through, to finally give that reliability indicator: when you're looking at a revenue metric or a churn metric, here is all the context behind it, so can you actually trust it? If you're seeing different numbers, is it because of a difference in the business, or did something happen operationally five or six hops away? Conveying that last-mile context to the business user is really important. So to get back to the question of shift left: it's really important to standardize how change is made and how the right context is emitted directly where data is being produced and transformed. Otherwise you'll have a lot of difficulty, given the decentralization that has happened.

If you look at the larger ecosystem, do you see a lot of cultural movement happening within the industry, where teams or organizations are actually embracing the fact that we live in a data-driven world? So that when you look at them, you say, hey, they have all the right practices in place when it comes to leveraging data, versus, hey, they're not doing what we would want them to do to leverage their data?

Look, I think now, with the rise of AI and streaming, if you are not putting in the right practices from the get-go when it comes to automation and software engineering practices, it's the age-old adage of garbage in, garbage out. It doesn't matter how much you invest in your AI capabilities or streaming data, you will not realize the business impact. Companies that recognize this invest a lot in getting data clean from the get-go: when acquiring data from external sources or from your own in-house sources, enforcing the right validation checks and the right data contracts, and then making sure that as the data travels through, you have the right ways of automatically extracting context, like lineage. That focus on automation is important. The second piece is that the operational reliability of data is something companies should really, really care about. Gone are the days when you could be wild west about these things. Now, if data breaks, it has real business consequences. If there's a customer-facing dataset, let's say at an e-commerce company where you're predicting prices, and the pipeline that predicts the price breaks, it can have a real revenue impact. Those are use cases where you need to be at the top of your game when it comes to monitoring and alerting capabilities and having someone on call for data. These were not things that were emphasized as much before.
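As an illustration of the kind of operational checks being described here (this is not Acryl's product API; the thresholds and pipeline stats are hypothetical), a minimal sketch might look like this:

```python
# Illustrative sketch: a freshness check and a null-rate check that raise
# alerts instead of failing silently. All numbers are made up.
from datetime import datetime, timedelta, timezone

def check_freshness(last_loaded_at: datetime, max_age: timedelta) -> list[str]:
    """Alert if the dataset has not been refreshed within max_age."""
    age = datetime.now(timezone.utc) - last_loaded_at
    return [f"dataset is stale: last load {age} ago (limit {max_age})"] if age > max_age else []

def check_null_rate(null_count: int, row_count: int, max_null_pct: float) -> list[str]:
    """Alert if the share of nulls in a key column exceeds the threshold."""
    null_pct = 100.0 * null_count / max(row_count, 1)
    return [f"null rate {null_pct:.1f}% exceeds {max_null_pct}% threshold"] if null_pct > max_null_pct else []

# Example run with hypothetical stats for a price-prediction table.
alerts = (
    check_freshness(datetime.now(timezone.utc) - timedelta(hours=7), timedelta(hours=6))
    + check_null_rate(null_count=1_200, row_count=50_000, max_null_pct=1.0)
)
for alert in alerts:
    print("ALERT:", alert)  # in production this would page whoever is on call for data
```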
Companies that still take a human-process-heavy approach are going to be left behind in a world where data is used much more operationally. So the term data governance really has to be up-leveled and modernized to deal with the current reality.

What advice do you have for organizations, or what approach should they take, so that, as you were saying, they have the right governance model and the right strategy for data? At the same time, can you also add how Acryl Data can help them move forward on that journey?

The first thing is: pay serious attention to shift left. Bring your data developers into the fold and give them the right tools so that they use SDKs and APIs in the CI/CD context, so that they govern data products before they are shipped to production; you don't want to react after the fact. And once things are in production, have a lot more automation to make sure you're exposing only the tier-one data products to your business users, and keep them in a pristine state through these monitoring capabilities. Some of our customers, like Zendesk and Notion, have really shown the way by investing in shift-left governance. A lot of the business metadata and operational metadata is enriched in their pipelines and their CI/CD pipelines. Zendesk, for example, pushes a lot of business context into their protobuf schemas, so that when the CI/CD pipeline runs, that's when the business metadata is emitted into the data catalog; it's not done by humans through the UI. We have a very large fintech company that has onboarded millions of entities onto Acryl, and they integrate pretty much all of their internal services with us for data quality and classification, again from CI/CD pipelines, with a lot of focus on operational and programmatic use cases. Another large customer of ours, DPG Media, a large media company based in Belgium, was able to cut 25% of their Snowflake costs every month through our automation framework, which flags high-cost but low-value datasets. And how do you stay on top of this automatically? It's not a one-time thing; continuous monitoring, continuous data quality, that is what you need.

So if somebody wants to get started with Acryl, what is the right path?

Getting started with Acryl is super easy. You can finish integrating with your stack in a matter of hours if you have a cloud-native stack, for example Snowflake, Looker, dbt, these types of things. If you have a much more complex stack, with more internal data sources, even then the integration can happen within a matter of days. Our customer success team brings a lot of learnings from the open-source community (this is the advantage of having a very large open-source community) on how to rapidly onboard and set milestones for improving key governance KPIs, like ownership or the reliability of key business-facing data products.
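As an illustration of that quick start, here is a minimal programmatic ingestion sketch using the open-source DataHub SDK's Pipeline API. The account details are placeholders (field names can vary by connector version), and the same recipe is typically run as YAML via the `datahub ingest` CLI:

```python
# Minimal sketch: ingest Snowflake metadata into DataHub programmatically.
# Credentials and account names are placeholders.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "snowflake",
            "config": {
                "account_id": "my_account",    # placeholder
                "username": "datahub_reader",  # placeholder
                "password": "${SNOWFLAKE_PASS}",
                "warehouse": "COMPUTE_WH",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},  # assumption: local DataHub
        },
    }
)
pipeline.run()
pipeline.raise_from_status()  # fail loudly if ingestion had errors
```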
And as I was mentioning, our automation capabilities also allow you to rapidly get control over your cloud costs, so that any wasteful data assets can be quickly retired.

Now, if I may ask — there are certain things you can share at this point and some you cannot — what are the things that you folks are working on?

We are generally very open with our roadmap, given the nature of our open-source community. We've recently introduced data products and data contracts. In the Acryl product, we will be enhancing our monitoring framework with more operational capabilities, like freshness monitoring and monitoring the percentage of nulls in your datasets, things like that: more operational monitoring in addition to the governance monitoring that already exists. Then there's being able to propose data contracts based on what we have inferred from that monitoring, and more AI capabilities: automatically generating documentation and tags, things that really enrich metadata in a very short amount of time, and even answering complicated questions like, can you generate the SQL query I need to answer this complex business question? Generating that by integrating with LLMs — those things are on our roadmap, in addition to what I mentioned about more data quality monitoring.

Thank you so much for sitting down with me and talking about the company, the project and the larger ecosystem. Thank you for sharing all those insights, and I would love to chat with you again. Thank you.

Absolutely. Thank you, Swapnil, for having me.