Thanks for the great introduction, Melissa. We are here to talk about transforming data processing with Kubernetes and the journey Intuit is taking toward building a self-serve data mesh infrastructure. We will touch on the architecture, go over some of the core aspects of data mesh, and talk about how Intuit went about it, some of the components we have built and are still building, and the scale of it.

I am Rakesh, a senior staff software engineer in Intuit's A2D organization; A2D deals with AI, analytics, and data. I currently lead the batch processing segment within Intuit. With me is Janik, who will be co-presenting; Janik leads the data mesh implementation within data processing at Intuit. I'll flash the agenda just for a brief second; these are the topics we will cover in today's presentation.

Before we get into the presentation, I want to give a quick overview of Intuit. One of Intuit's core missions is building an AI-driven expert platform. Intuit's products are powered by five key platform areas, including the modern dev experience powered by our Modern SaaS infrastructure, which is built entirely on Kubernetes and enables us to build better AI infrastructure through our data, fintech, and identity platforms. These are some of the core pieces of infrastructure that power the Intuit product lineup you are likely familiar with: TurboTax, Credit Karma, QuickBooks, and Mailchimp. Intuit believes in open-source contribution and collaboration: Intuit received the CNCF End User Award in 2019 and 2022, we have created several open-source projects, and we are also users of a lot of cloud-native and other open-source technologies.

With that, we'll get into the first segment of this presentation, where we'll quickly cover what data mesh is before moving on to the other segments. Data mesh is an architectural pattern primarily suited for large data organizations dealing with high-volume data and looking for a structured way to manage and improve the value of that data. It centers on distributed, domain-driven architecture, self-serve platform design, and product thinking, which we will talk more about. A data mesh is a decentralized data architecture that organizes data by specific business domain. What does that mean? Data usually falls into different categories and segments, and data mesh asks organizations to think about data in terms of data domains and to build architecture that promotes this structure. For example, data belongs to marketing, sales, or customer service, and data mesh gives producers more ownership of that data and lets them be the data product owners. In the majority of data-driven organizations today, data is a byproduct of a process rather than being treated as a product. Data mesh promotes building solutions where every component treats data as a product, and that is one of the key themes we are going to talk about as well.

The reason Intuit is interested in building a data mesh for our infrastructure is that we believe it will provide smarter experiences using data, improve the value of data, reduce the time taken to discover and access data, and serve a variety of data personas.
Through this, we believe it will power our AI experiences and our AI-driven expert platform mission. It is already enabling our generative AI capabilities such as Intuit Assist, which we announced very recently.

The four core principles of data mesh are: domain-driven ownership, which we briefly touched upon, meaning you build your architecture so that data is owned and organized under a domain; federated data governance and data access, which basically means building least-access policies so that data access is governed consistently across the organization; self-serve infrastructure, to promote easy deployment of data infrastructure and of the data mesh itself; and treating data as a product, which is the product-thinking aspect.

The next segment is the data lake and the data mesh, and how we think these merge together, with an example of a data product we have built. For this example we will talk about small business and QuickBooks. QuickBooks is one of the Intuit products providing accounting and invoicing solutions for small businesses. In this example, a small business owner has unpaid invoices. The owner logs into QuickBooks and realizes that a few customers have not paid their invoices. The system provides two options: it reminds the owner about the unpaid invoices, and it offers an additional feature to automatically send invoice notifications to their customers.

Let's take that as the business problem we want to solve and break it down into technical requirements. Given that invoices are data: get unpaid invoices by business; get unpaid invoices for each customer, grouped by business, to identify which customers have not paid their invoices for that business; and build notification capabilities to track unpaid invoices and remind business owners and their customers.

Given that use case, let's take a step back and look at a very generic version of a data lake architecture. This architecture represents how a lot of data-driven organizations structure things today. It usually consists of an operational data plane, where your microservices are hosted, your databases and in-product experiences live, and your app deployments happen. From the operational data plane, data pipelines move your data to an analytical data plane. The analytical data plane is usually your data lake, and the data lake is managed by a separate domain, a separate part of the organization. From the data lake, data scientists interact with the data for ML training, data flows on to a data warehouse, and a lot of data use cases are driven from the data that arrives in the lake.

If you look at how invoice data is organized within Intuit, it follows a very similar pattern. There is a QBO application that has its own relational database. The QBO application produces invoice data that is written to the data lake, the sink, using data pipelines, and from there further transformations happen to enable invoice-related dashboards, provide customer-related attributes, and power more data analytics experiences.
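To make the earlier requirement concrete, here is a minimal PySpark sketch of the unpaid-invoice queries such a transformation might run against invoice data that has landed in the lake. The table and column names (data_lake.invoices, business_id, customer_id, status, amount_due) are hypothetical placeholders, not Intuit's actual schema.

```python
# Minimal sketch of the "unpaid invoices" requirement; table and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("unpaid-invoices").getOrCreate()

# Invoice data that has landed in the data lake via the ingestion pipelines.
invoices = spark.read.table("data_lake.invoices")
unpaid = invoices.filter(F.col("status") == "UNPAID")

# Unpaid invoices rolled up per business, to remind the business owner.
unpaid_by_business = unpaid.groupBy("business_id").agg(
    F.count("*").alias("unpaid_count"),
    F.sum("amount_due").alias("unpaid_total"),
)

# Unpaid invoices per customer within each business, to drive customer reminder notifications.
unpaid_by_customer = unpaid.groupBy("business_id", "customer_id").agg(
    F.count("*").alias("unpaid_count"),
    F.sum("amount_due").alias("unpaid_total"),
)

unpaid_by_customer.write.mode("overwrite").saveAsTable("data_lake.unpaid_invoices_by_customer")
```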
One of the core aspects we talked about is taking this invoice as our data product. How does treating invoice as a data product improve the management, usability, and discoverability of invoice data? In this case there is the operational user, the producer, which is the application, and the analytical users, the data users who work with the data that arrives in the data lake. The producer publishes data to an event bus, which is our Kafka infrastructure, and it gets streamed to the data lake; the data that arrives in the lake is what we consider the data product. From that data we can run queries such as getting invoices by business, or invoices by user and business. The producing application can enable in-product experiences, and the data in the lake can enable our GenAI experiences, ML training, and other uses of this data.

Let's ask some questions about this invoice data. As an app developer or a data persona tasked with solving some of these invoice problems for the business owner, you want to understand: How do I find invoice data for my use case? Who is the domain expert for this invoice data? How do I find its schema? Where is the data located for consumption? How can I get access? And is there any derived data from invoice? These are some of the questions we want to build self-serve solutions around, and that's what we'll focus on for the remainder of the presentation. If you look at the way the data is structured, we call this the data map, and the solution we have built is what Janik is going to talk about in the next segment. Thank you.

Thank you, Rakesh. What we have seen so far is the problem statement that typical data producers are tasked with, and we want to understand how data mesh makes it easier and faster for data producers to tackle the use case and produce data that adheres to certain standards. That's where we introduce the data mesh concepts at Intuit: how we have taken the traditional data mesh architecture, figured out which aspects of Intuit really need to be enabled by this architecture, and defined concepts that answer the questions Rakesh laid out.

Let's take the questions one by one. The first is: how do I find invoice data for my use case? At a company like Intuit, if you simply explore the data lake looking for an invoice table, it could have been created by multiple domains. For example, Intuit could be charging its own customers and creating invoices, while an Intuit customer, say a QBO company, is charging its own customers and creating invoices. How do we distinguish these two kinds of invoice data? That is where the data map comes into play. The data map is what we call the organization of data using domain, subdomain, and bounded context. In this example, small business is the primary domain; QBO, one of the products in the small business category, is a subdomain; under that are the commerce, invoicing, and sales subdomains; and finally, invoice workflow is the bounded context.
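To make the data map concrete, here is a small illustrative sketch of that hierarchy as a nested structure. The DataMapNode class, the node names, and the helper function are purely illustrative assumptions, not Intuit's actual metadata model.

```python
# Illustrative sketch of the data map hierarchy (domain -> subdomain -> bounded context).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DataMapNode:
    name: str
    kind: str                                   # "domain", "subdomain", or "bounded_context"
    children: List["DataMapNode"] = field(default_factory=list)

data_map = DataMapNode("small-business", "domain", [
    DataMapNode("qbo", "subdomain", [
        DataMapNode("invoicing", "subdomain", [
            DataMapNode("invoice-workflow", "bounded_context"),
        ]),
    ]),
])

def qualified_path(node: DataMapNode, target: str, prefix: str = "") -> Optional[str]:
    """Return the fully qualified path of a node; the path is what disambiguates
    QBO-customer invoices from, say, invoices Intuit issues to its own customers."""
    path = f"{prefix}/{node.name}".lstrip("/")
    if node.name == target:
        return path
    for child in node.children:
        found = qualified_path(child, target, path)
        if found:
            return found
    return None

print(qualified_path(data_map, "invoice-workflow"))
# -> small-business/qbo/invoicing/invoice-workflow
```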
This is typical domain-driven design applied to the organization of data. A user who finds invoice data under this hierarchy can be confident that these invoices were created for QBO customers, by companies using QBO for their own customers, rather than invoices created by Intuit for its own customers, which would sit under a different domain's commerce, invoicing, sales, and invoice-workflow hierarchy.

The next element in the data map is the data product, which is the foundational unit of data mesh. Under each bounded context we plan to have data products that are related to each other, and a data product is mainly geared toward consolidating the essential information that a data consumer needs. So what does a data product contain? It consolidates essential information such as the bounded context or domain the data belongs to, the data model, SLOs, and, most importantly, who the domain expert for that invoice data is and how somebody finds them. That is where we have defined a role called the data steward: someone who is an expert in the data domain, comes from the organization or system that actually produces the data, and knows all aspects of that data, including the use cases it can serve and how it is generated. The steward is also responsible for the contract of that data.

That brings us to the next question: what is the schema of the invoice data? The contract consists of two aspects. The first is the semantic model, which holds the modeling and schema information that lets data consumers understand what the data actually is: field names, data types, and semantic data types. For example, a field may be typed as a string but actually represent a currency or an address, so the semantic model maps the traditional data types of a table or schema to semantic meaning. The second is SLAs, such as data quality and data freshness, which consumers care about: if I consume this data, does it meet a minimum bar? As the data steward, the steward is responsible for ensuring those SLAs are met, and on the system side we are implementing mechanisms so that stewards can declare SLAs, the system can measure them, and their status is visible to consumers, so consumers can be confident in any data they consume.

The next question is: where is the invoice data located for consumption? We talked about how, in the architecture slide Rakesh showed, it could be a data lake table or a Kafka topic. That is where data ports come into play. A data product can have more than one data port; we call them data assets. A data asset holds information about what type of asset it is, whether a Hive table, a Kafka topic, a Redshift table, or any other asset type we support, and the location from which to consume it. At the data asset level the data steward can also add tags for better discovery. For example, a data mart could carry tags naming the business domain it comes from or the analytical domain it is intended to support, which makes discovery much easier for data consumers.
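Pulling those pieces together, here is a hedged sketch of the kind of information a data product consolidates: bounded context, data steward, semantic model, SLAs, and data assets. The field names, SLA values, and locations are hypothetical placeholders, not Intuit's actual registration format.

```python
# Illustrative data product descriptor; all names, values, and locations are hypothetical.
invoice_data_product = {
    "name": "invoice",
    "bounded_context": "small-business/qbo/invoicing/invoice-workflow",
    "data_steward": "invoice-domain-team@example.com",
    "semantic_model": {
        "invoice_id":  {"type": "string", "semantic_type": "identifier"},
        "business_id": {"type": "string", "semantic_type": "identifier"},
        "customer_id": {"type": "string", "semantic_type": "identifier"},
        "amount_due":  {"type": "string", "semantic_type": "currency"},
        "status":      {"type": "string", "semantic_type": "enum(PAID, UNPAID)"},
    },
    "slas": {
        "freshness": "lands within 1 hour of the source event",   # example value only
        "quality":   "null rate on invoice_id below 0.1%",        # example value only
    },
    "data_assets": [
        {"kind": "kafka_topic", "location": "kafka://example-broker/invoice-events",
         "tags": ["real-time", "in-product"]},
        {"kind": "hive_table", "location": "s3://example-bucket/data_lake/invoices",
         "tags": ["analytical", "batch"]},
    ],
}
```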
And finally, access and governance: I have found the data I need and I know what it looks like, so how do I get access to it? That is where we track access for each and every data asset, at the data asset level, and access is typically approved by the data steward. It is also self-managed and automated: a user finds a data product and its data assets through discovery and can request access right there; the request goes to the data steward of the data product, who looks at the purpose the user is requesting access for and decides whether it makes sense to grant access automatically or not.

These data mesh concepts enable a logical structure like this: a team deploys a data-producing system that produces a data asset, such as a table or a stream like invoice; a data steward owns this data asset, creates a data product, and encapsulates the asset along with other information like the data contract and the logical model. Now a different team, with a use case to consume this data and produce derived data, discovers it, gets access to it, wants to create a data-consuming system, and creates new data products of its own. This is where the power of the self-serve data mesh architecture comes into play: we have made all the tools available so that data producers can easily make their data discoverable, and a data consumer can consume it and in turn stand up a data-producing system of their own.

That brings us to the next section, the self-serve data processing platform. First, the scope of data processing at Intuit. We have a variety of users: data engineers who write complex ETL jobs or derive enriched data, data scientists and ML engineers who write pipelines to generate features or train AI models, and data analysts who build pipelines to power business dashboards. In terms of scale, we have more than 2,000 data producers and consumers at Intuit who fall into one of these roles, and more than 100K pipelines that process data from various domains through one of these flows.

So what does self-serve data processing mean for us, given this variety of use cases? Users typically perform three high-level operations when they create a data processing pipeline. First, author and define: they write code, get access to data, define the schedule of the job, and so on. Second, provision and deploy: they provision the infrastructure, deploy the code, and, where applicable, register the data they are producing in the data mesh, creating the data product and data assets, providing data contracts, and recording end-to-end lineage. Third, once pipelines are deployed, they want to be alerted on any issues, debug them, fix the problem, and rerun the pipeline. To solve these use cases and increase the velocity with which users can act, we provide different levels of architectural abstraction to these personas. At Intuit we have built two platforms, a batch processing platform and a stream processing platform. These are the front-facing applications the data organization offers its customers for creating data pipelines, and they are inherently plugged into the data mesh architecture, so that when users create data processing pipelines, the data assets are registered and the data products are created; the platforms provide the interface for users to do all of that.
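As a rough picture of what that self-serve flow could look like, here is a hedged sketch of a declarative pipeline definition covering the three operations above, with data mesh registration attached. The format, field names, and URIs are assumptions for illustration, not the batch platform's actual spec.

```python
# Hypothetical declarative pipeline definition; every field name and value is illustrative.
pipeline_spec = {
    "name": "unpaid-invoices-daily",
    "owner": "smb-analytics-team",
    # Author & define: code location and schedule.
    "code": {"repo": "git@github.example.com:smb/unpaid-invoices.git", "entrypoint": "job.py"},
    "schedule": "0 6 * * *",
    # Provision & deploy: runtime, inputs, and the data product registered in the mesh.
    "runtime": {"engine": "spark", "cluster": "batch-processing"},
    "inputs": ["data-product://small-business/qbo/invoicing/invoice"],
    "outputs": [{
        "data_product": "unpaid-invoices-by-customer",
        "data_asset": {"kind": "hive_table",
                       "location": "s3://example-bucket/derived/unpaid_invoices"},
        "contract": {"freshness": "daily by 08:00 UTC"},
    }],
    # Monitor & debug: alerting on failures.
    "alerts": {"on_failure": ["pagerduty:smb-analytics"]},
}
```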
Let me briefly talk about the batch processing architecture; I have put it at a very high level here. Going from the bottom up, at the runtime layer the majority of batch jobs use Apache Spark and run on Kubernetes, and the infrastructure provides out-of-the-box integration with logging and metrics. Then we have the deployment orchestrator and the job dependency orchestrator, which are responsible for deploying the pipelines and ensuring dependencies are set and met when the jobs run. The services layer enables pipeline definition, infrastructure provisioning, setting job dependencies, and so on, and on top we have the user experiences for the various personas, such as a UI or GitOps, to create and manage the pipelines.

Similarly, for the stream processing platform, the runtime is again natively on Kubernetes; the majority of streaming jobs run on Flink, and a few use cases run on Apache Spark. As with batch, we have a deployment orchestrator, and the additional components for streaming are the checkpoint manager and the DR orchestrator. The checkpoint manager ensures checkpoints are taken during redeploys, and the DR orchestrator serves the few use cases that require disaster recovery and need their streaming pipeline to be available all the time: it detects a disaster and deploys the pipeline to another region if needed. The API layer is similar to batch, except that streaming has no need for scheduling. One aspect specific to streaming is the processor registry and processor templates. We noticed that, when teams get started with streaming at Intuit, the time to write a streaming pipeline is usually higher because they are new to the framework or to the concepts of streaming. So we provide processor templates, which give them boilerplate code for a streaming pipeline, and for common patterns we have a processor registry: if somebody writes a processor for a specific, typically reusable streaming use case like filtering or aggregation, other users can discover and deploy it as well.

As you may have noticed from the batch and stream processing platforms, Kubernetes powers data processing at Intuit. Intuit has a Modern SaaS layer, the Intuit Kubernetes Service, which manages a fleet of Kubernetes clusters and namespaces where our core infrastructure runs; even the control plane and APIs are deployed on Kubernetes, and we use Argo Workflows and Argo Events for scheduling, orchestration, and deployment workflows.
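To illustrate the processor registry and templates mentioned a moment ago, here is a hedged sketch of what a reusable processor interface and registry might look like. The Processor class, the registry, and the example processors are assumptions for illustration, not the stream processing platform's actual SDK.

```python
# Hypothetical processor template / registry sketch; not the platform's real SDK.
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Type

class Processor(ABC):
    """Boilerplate a template could hand to a new streaming user: implement
    process() and the platform wires the processor into the streaming job."""
    @abstractmethod
    def process(self, events: Iterable[dict]) -> Iterable[dict]:
        ...

PROCESSOR_REGISTRY: Dict[str, Type[Processor]] = {}

def register(name: str):
    """Publish a reusable processor so other teams can discover and deploy it."""
    def wrap(cls: Type[Processor]) -> Type[Processor]:
        PROCESSOR_REGISTRY[name] = cls
        return cls
    return wrap

@register("filter-unpaid-invoices")
class FilterUnpaidInvoices(Processor):
    # A simple, reusable filtering processor.
    def process(self, events):
        return (e for e in events if e.get("status") == "UNPAID")

@register("count-by-business")
class CountByBusiness(Processor):
    # A simple, reusable aggregation processor.
    def process(self, events):
        counts: Dict[str, int] = {}
        for e in events:
            counts[e["business_id"]] = counts.get(e["business_id"], 0) + 1
        return ({"business_id": b, "unpaid_count": c} for b, c in counts.items())
```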
Given the breadth of this topic, we were only able to touch the surface of many of these concepts, but we hope this gives you an idea of how we are thinking about data mesh and what problems we are trying to solve with data mesh and data processing. If we have captured your interest, these are some resources on data mesh as well as data processing; our architect has laid out Intuit's data mesh strategy and concepts in a very detailed description of how we are thinking about implementing data mesh, and we are currently just getting started there. So if you are on a similar journey, we would love to connect and share our learnings. Any questions?

Awesome talk. A question on how you surface the data products to your end users. Who are your end users, data scientists? Is there a UI or some kind of interface where you expose the schemas to users?

We have a data discovery and exploration portal. All the data products registered with the data mesh go to a central metadata registry, and the discovery portal exposes that metadata registry to users. It also has a pretty extensive search over some of the aspects we talked about, such as tags, so users can search. It is also integrated with Intuit Assist, a GenAI-powered service, so users can simply write that they are interested in certain data for a certain use case, and it will give them options: these are the data products that exist. As they browse those data products, they can see all the details we talked about, like the schema, the SLAs, the quality score of the data product, and so on. We built it internally.

Thank you for the talk. I have a question about, based on your experience, where the data normally resides. You mentioned the model and schema, but where does the actual data reside? Is it in databases, or in S3?

Typically we had a data lake architecture where the data resided in S3; the majority of Intuit's data architecture is on AWS, so our data lake is based on S3. But the concepts we described are very generic, so we are expanding the boundary of where data can reside. Currently we are targeting the data lake and the event bus architecture, the Kafka topics, to start with, but we expect to extend to all the other types of hosting locations at Intuit that host data. We have a customer data cloud that hosts certain customer-related data, and a feature management platform that hosts all the feature sets. As we expand this journey we will incorporate more and more hosting locations where data can reside, and the concepts are defined generically so they are applicable to all of those use cases.

It's interesting, because I graduated many years ago in data warehousing, and we work with Postgres; we are actually trying to get to the very large database kind of use case on Kubernetes. I think there is a lot of interaction between what you are mentioning and the possibility of using real databases, for example Postgres, to perform these queries, especially now that we are addressing the very large database scenario and not only OLTP. So it could be interesting to speak afterwards, show what we've done, and see how databases can also fit into this data lake scenario.

The data mesh provides an abstraction on top of all these infrastructures. We showed an example of an operational data plane and an analytical data plane, and applications that have these relational databases can also produce data assets, while the analytical copy that goes to the data lake belongs to the same data domain and bounded context. You might be producing the invoice data we used as an example, stored in your database, and it is also streamed to a data lake, but the data shares the same schema. From an end-user point of view, you want to show them the lineage: the data starts from a specific point, the service layer, and is also streamed to a data lake. The producers of the data, in this example the application team or the stewards, provide the schema, provide the documentation, and decide how the data is accessed; they are the product owners. In a typical service architecture product owners exist, but in data architecture product owners are not a common theme,
so that is one of the main points we have been working on over the last few months.

Thank you for this presentation, it was excellent. You gave the example of the invoice. I am curious, two questions: how many data products already exist, and how many of those are analytical versus operational? Could you have a data product writing onto the Kafka bus, for anomaly detection or something like that?

That's a good question, how many data products exist. Right now we are on a journey to implement this architecture across all data at Intuit. To test this architecture, its feasibility, and its benefits, we started a year or two ago by onboarding some of the most widely used and most requested data assets using data mesh concepts. As of today, we have around 900 or so data products in production, and over roughly the next year we want to get to 100% of Intuit's consumable data, meaning data that teams outside the producing team want to consume, represented as data products.

On the second question: as I showed, a data product can have multiple data assets. The way we are thinking about it is that if an operational data asset, say a table, is streamed toward the analytical space, ideally both of those should live under the same data product. The consumer then has a choice: if the operational data store is actually shareable, the consumer can decide which asset is more applicable to their use case and consume from that. Typically, operational data stores are not shared outside the team or outside the microservice. So we have an outbox architecture where the data the operational data store produces is streamed in real time to a Kafka topic and then materialized in the data lake. The stream is used for feature generation and in-product experiences, while the analytical data replicated in the data lake is used for batch processing and analytical use cases. But all those data assets would ideally reside in one data product, and the user would be able to see that this data is available to consume in real time in this format, and to consume in an analytical fashion in that location and format, et cetera.

Feel free to ask questions. We're a very inclusive community.

Hi, just before this question you talked about building an abstraction on top of your back ends, and that's what the data mesh is doing with your structured metadata. Does that imply that you're actually building adapters to a common interface, so you can hit Postgres or MySQL or whatever through your data mesh service, or is the data maintainer providing instruction and guidance?

It's not really adapters specifically; it is more about schema and transforms. Typically, if the same data resides in an analytical table, a topic, and a data lake table, its semantic definition is the same across all of them, but its structure may differ. In an analytical database it may live in denormalized tables, whereas when it goes to Kafka it may be a normalized JSON event, and when it is replicated it may again be a normalized Hive table. The semantic schema remains the same, and that is what the metadata provides to the consumer. The consumer then decides whether they want to consume from one or the other, but there is no adapter, at least as we are thinking about it now.
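To illustrate that "schema and transforms, not adapters" answer, here is a hedged sketch of one semantic model mapped onto two physical shapes of the same invoice data. The field names, the event layout, and the flattening transform are hypothetical examples, not Intuit's actual formats.

```python
# Illustrative: same semantic schema, different physical structure per data asset.

# One semantic definition, shared by every data asset of the invoice data product.
SEMANTIC_MODEL = {
    "invoice_id":  "identifier",
    "business_id": "identifier",
    "customer_id": "identifier",
    "amount_due":  "currency",
    "status":      "invoice_status",
}

# Physical shape 1: a normalized JSON event on the Kafka topic (outbox stream).
kafka_event = {
    "invoice":  {"invoice_id": "inv-123", "amount_due": "250.00", "status": "UNPAID"},
    "business": {"business_id": "biz-9"},
    "customer": {"customer_id": "cust-42"},
}

# Physical shape 2: a flattened row as it might be materialized in the data lake table.
lake_row = {
    "invoice_id": "inv-123",
    "business_id": "biz-9",
    "customer_id": "cust-42",
    "amount_due": "250.00",
    "status": "UNPAID",
}

def flatten_event(event: dict) -> dict:
    """One possible transform from the event shape to the analytical shape;
    the semantic fields stay the same, only the structure changes."""
    return {**event["invoice"], **event["business"], **event["customer"]}

assert flatten_event(kafka_event) == lake_row
assert set(lake_row) == set(SEMANTIC_MODEL)
```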
And the way we are enabling this across Intuit is through the paved-path infrastructure. We have built paved paths where service deployments, pipeline deployments, and pipeline management all go through unified platform infrastructure. Teams do not go and deploy their own Kubernetes service layers, and they do not deploy their own data or batch pipelines; everything goes through the paved-path infrastructure. These paved-path teams are centralized teams that work with the data organization to build the data mesh architecture that we promote and want services to adopt. It is much more streamlined for teams to integrate this way than providing them, for example, libraries or adapters independently, because it's a huge organization and it would take a lot of effort to drive it that way.