Yeah, good afternoon, everyone. It's really exciting to be here today. Today we are going to talk about building a batch processing platform on top of multi-cluster Kubernetes and Argo Workflows at Intuit scale. We have around 40 to 50k pipelines at the metadata level itself, which translates into roughly 100k to 200k concurrent executions. So let's talk a little bit about it.

Here is a quick introduction about ourselves. My name is Arumalekel. I'm a group manager at Intuit, responsible for building the next-generation batch processing capabilities, along with scheduling and orchestration. And with me today is Rakesh Suresh, who is the lead engineer building this platform's capabilities.

Here is a quick agenda. I will talk a little bit about Intuit's background and give an overview of the platform. Rakesh will then do a deep dive into the platform architecture and some of the operational excellence and observability challenges we faced, and how we actually solved them. And a quick demo, if we have time.

Before we deep dive into the platform, a quick shout-out to Intuit. We are a global technology platform that helps you achieve financial confidence. We are a purpose-driven, values-driven company. Our mission is to power prosperity around the world with our flagship products, which are TurboTax, QuickBooks Online, Mailchimp, and Credit Karma.

Next is the batch processing platform overview. Here are some of the personas and use cases we are solving for. As an ML engineer, I want to build and iterate on more models and features faster. As a data analyst, I want to provide insights to the business as quickly as possible. As a data engineer or a software engineer, I want to produce clean data and manage and run reliable pipelines. And there are also indirect customers whose use cases we are solving.

Now let's discuss the core platform capabilities. There are four main platform capabilities we provide. One is the runtime, scheduling, and orchestration capability. The second one is DevOps tooling. The third one is governance. And the fourth one is service level management. I'll walk us through all four capabilities in detail.

For runtime, scheduling, and orchestration, we provide four different runtimes as part of this platform. One is Spark on Kubernetes. The second one is EMR. The third one is Databricks. And the fourth one is the Docker container. What is more interesting is that all these computes are fully managed: developers don't need to worry about touching the hardware or managing this infrastructure at all. And we provide the capability to switch between any of these runtimes based on their requirements, whether for performance reasons or cost reasons; they have the option to switch between the runtimes with a few clicks on their pipeline. Along with all these Spark and Docker runtimes, we also provide advanced, intuitive scheduling and orchestration capabilities. That means as a developer, you don't need to learn any scheduling or orchestration tool. It's a low-code solution; you can go and build your own data pipeline with a few clicks.
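To make the runtime-switching idea concrete, here is a minimal, hypothetical sketch of what a pipeline definition with a swappable runtime could look like. The class and field names are illustrative assumptions, not the platform's actual API.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical pipeline definition; names mimic the behavior described in the
# talk and are NOT the platform's actual API.
SUPPORTED_RUNTIMES = {"spark-on-k8s", "emr", "databricks", "docker"}

@dataclass
class PipelineConfig:
    name: str
    runtime: str = "spark-on-k8s"      # the managed compute backing this pipeline
    schedule: Optional[str] = None     # cron expression, or None for trigger-based
    spark_conf: dict = field(default_factory=dict)

    def switch_runtime(self, new_runtime: str) -> None:
        """Re-point the pipeline at a different fully managed runtime."""
        if new_runtime not in SUPPORTED_RUNTIMES:
            raise ValueError(f"unsupported runtime: {new_runtime}")
        self.runtime = new_runtime

# Usage: move a pipeline from Spark on Kubernetes to EMR, e.g. for cost reasons.
pipeline = PipelineConfig(name="daily-enrichment", schedule="0 2 * * *")
pipeline.switch_runtime("emr")
```

The point is simply that the runtime is a single declarative field, so moving a pipeline from one managed compute to another doesn't touch the processor code.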
The next one is DevOps tooling. All these pipelines are CI/CD enabled, which is built on top of Argo CD, and we provide managed infrastructure. And we are a fully cloud-enabled company, so developers need to work with a lot of cloud resources; if they wanted to do policy configuration themselves, they would need to learn about many different kinds of cloud resources. Instead, we provide an intuitive way of configuring their policies, plus automated provisioning, because the platform does the provisioning for them. They don't need to worry about it. And alerting and monitoring are built-in capabilities of the platform: as an ML engineer or a data engineer, you don't need to worry about infrastructure or about setting up your own monitoring or alerting. The next one is asset lifecycle management and developer velocity. We also measure developer velocity for these data pipelines, so teams don't need to set up their own way of measuring it.

The next one is governance. Intuit is one of the leading fintech companies, so we are subject to a lot of compliance and security requirements; compliance is very important for us. As part of this capability, we integrate with Intuit's authentication and authorization schemes. We also have custom approval workflows for SOX and other compliance pipelines. We provide change management capabilities, and we provide versioning for all the data pipelines. If a developer finds that a pipeline that was running fine yesterday is now failing, they have the option to roll back to the previous version of the pipeline.
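The talk doesn't describe the rollback mechanism itself, so here is a purely hypothetical sketch of what a version rollback call against the platform's API could look like; the endpoint, payload fields, and URL are all invented for illustration.

```python
import requests  # the endpoint, payload fields, and URL below are all hypothetical

BPP_API = "https://bpp.example.com/api/v1"  # placeholder, not a real Intuit endpoint

def rollback_pipeline(pipeline_id: str, target_version: int, token: str) -> dict:
    """Roll a data pipeline back to a previously deployed version."""
    resp = requests.post(
        f"{BPP_API}/pipelines/{pipeline_id}/rollback",
        json={"targetVersion": target_version},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # e.g. the pipeline record with its now-active version
```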
Now I want to talk a little bit about service level management. This is a highly available platform: we provide four nines of availability, and we run some tight-SLA use cases; some of the use cases on this platform have two-minute SLAs. We also provide cost insights across the different Spark runtimes, which lets users choose the runtime that suits their use case rather than a one-size-fits-all solution. And we provide a paved path. What this means is that it's a low-code solution that enables a developer to create a data pipeline with a few clicks. Earlier, setting up a production pipeline took a few days, or maybe one or two weeks. Now a newly joined engineer at Intuit can productionize a pipeline in probably less than 15 minutes, right? They will have a production pipeline running with monitoring and alerting and all the scheduling and orchestration capabilities, and they can even set up a compliance data pipeline as well.

The next one is shared processors. This is a core capability, and I believe Rakesh will talk a little more about it. What shared processors mean is that we onboard a lot of platform and platform-on-platform use cases, where, for example, an ML platform can onboard to this platform to trigger a workflow on the ML platform, right? So a Kubeflow workflow can be triggered through the shared processing capability on the ML infrastructure.

Here is a quick glance at all the capabilities, a 30,000-foot view. But this is just a very high-level view; there are a lot of additional features we provide. I want to touch a little bit on one specific runtime, because this is a Kubernetes conference: Spark on Kubernetes, which is built around the Spark operator, a Google open-source project. This is one of the supported runtimes we provide as part of the platform's capabilities. The reason we provide it is that most of our services are on Kubernetes, so we wanted to offer the Spark runtime on Kubernetes infrastructure as well. It's simpler and more homogeneous, because everything is colocated, and it's very cost-effective, right? There is no vendor lock-in, there is no surcharge, and the resources and the OpEx are shared. Our developers love using it, and it's very declarative infrastructure.
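For a feel of how declarative this runtime is, here is a minimal sketch of submitting a SparkApplication custom resource to the open-source Spark operator using the Kubernetes Python client. The namespace, image, and jar path are placeholders, not Intuit's actual values.

```python
from kubernetes import client, config

# Minimal sketch: submit a SparkApplication CR to the open-source Spark operator.
# Namespace, image, and jar path are placeholders, not Intuit's actual values.
config.load_kube_config()  # or config.load_incluster_config() inside a pod

spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "daily-enrichment", "namespace": "data-pipelines"},
    "spec": {
        "type": "Scala",
        "mode": "cluster",
        "image": "example.registry/spark:3.4.0",
        "mainClass": "com.example.EnrichmentJob",
        "mainApplicationFile": "local:///opt/jobs/enrichment.jar",
        "sparkVersion": "3.4.0",
        "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark"},
        "executor": {"instances": 4, "cores": 2, "memory": "4g"},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="data-pipelines",
    plural="sparkapplications",
    body=spark_app,
)
```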
With that, I will hand it over to Rakesh for a platform architecture deep dive.

Right. Okay. Thank you, Ro. So next we'll talk about the platform architecture and go over, at a high level, how we have built the platform. Before we get there, let's talk a little bit about Intuit's data processing flow from a very high level. On the left, we can see a high-level overview of a simple data processing use case. The top layer has some Intuit customer products which are emitting data. This could be clickstream data, or data that we would like to post-process and do data enrichment on. Before the data arrives in our data lake, stream transformations are done through our stream pipelines, and then there is a set of post-processing layers where we do post-processing and analysis. From this layer, we are able to derive real-time recommendations, data enrichment, feature generation, model training, data curation, fraud detection, and a bunch of other capabilities. We will look a little more into how we provide the capability to build the DAG in the coming slides.

In this slide, we'll talk about the architecture. As we went over earlier, the batch processing platform's aim is to simplify these use cases for a lot of personas, be it ML engineers, data analysts, or data engineers. We do that by providing simplified development, deployment, management, orchestration, and monitoring as one holistic solution, both to engineering teams and to platforms that integrate with us. The high-level architecture broadly has two sections: at the top there is a control plane layer, and at the bottom there is a compute plane layer. In the control plane, we have the service layer, which helps with the creation of the users' code repositories, their build environments, their Artifactory integration, and also hookups to observability. This is a fully automated, self-serve process that we get through our other paved-road ecosystem called Modern SaaS, which runs fully on Kubernetes. We have a job orchestration service that helps with the scheduling and orchestration of your pipeline (we'll talk about the orchestration types in the next slide), a namespace service for multi-cluster Kubernetes namespace management and for Argo Workflows deployments, and a notification service that provides alerting capabilities on the data pipeline.

Let's talk about what we perceive as a data pipeline and what a data processor is, to set the context for the next part of the slides. On the right, we have a data pipeline representation. In the previous slide, we saw a stream data pipeline and a batch data pipeline, but the abstractions are very similar between the two. The data pipeline is a compute and orchestration layer that represents one end-to-end data processing job, and processors are the code artifacts that run in the data pipeline to do data transformations. That can be data enrichment, curation, training your model, feature generation, and a lot more; there is a lot of data processing that happens before the data is usable, and all of that can be abstracted. And these processors are shareable. For example, data usually needs enrichment or curation, and those steps can be shared between multiple different data jobs. That's one of the reasons we abstract the processor into a separate unit by itself and make it a shareable entity that can be added to any of your compute infrastructure, meaning the pipeline. There are varied use cases running today on the batch processing platform: platform use cases like the feature management platform generating features on a schedule using the Spark on Kubernetes layer, the ML platform using us for their workflow orchestration, and data scientists and other data personas directly interacting with the UI to run their SQL queries and generate reports.

Next, we'll talk about how we provide the capability to build that DAG and how we provide dependency management. The main problem for a lot of data engineering teams and platforms is how to keep data pipelines decoupled while still guaranteeing data availability from one data pipeline to another. In an organization, you could have pipelines and jobs that are not under your control but whose data you rely on. Through abstractions such as the data pipeline, we make sure that dependencies can be built on upstream data sources. Each data pipeline in BPP translates into an Argo Workflow, and each pipeline can have two types of dependencies: cron-based dependencies through Argo CronWorkflows, and trigger-based dependencies through Argo Events for complicated DAG mechanics. In this example, the root nodes are pipelines that have time-based dependencies, starting on a specific calendar schedule defined through the CronWorkflows, and all the downstream dependencies waiting on the root node wait using Argo event sources and Argo sensors. Each of these workflows runs in the orchestration compute plane that we went over earlier, and the workflows are decoupled from our runtime compute layer. Like we went over earlier, we support multiple different runtime computes based on the use case, be it Spark on Kubernetes, or Databricks or EMR, based on your data processing needs; we run the workload on the runtime of your choice.
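Here is a minimal sketch of those two dependency types as Argo custom resources, written as Python dicts that mirror the CRDs (they could be applied with the CustomObjectsApi as in the earlier sketch). The names, schedule, images, and event wiring are illustrative assumptions, not BPP's actual manifests.

```python
# Minimal sketch of the two dependency types, expressed as Python dicts that
# mirror the Argo CRDs. Names, schedule, and event wiring are assumptions.

# 1) Cron-based dependency: the root pipeline starts on a calendar schedule.
cron_workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "CronWorkflow",
    "metadata": {"name": "root-pipeline"},
    "spec": {
        "schedule": "0 2 * * *",  # run daily at 02:00
        "workflowSpec": {
            "entrypoint": "main",
            "templates": [{
                "name": "main",
                "container": {"image": "example.registry/processor:latest"},
            }],
        },
    },
}

# 2) Trigger-based dependency: a downstream pipeline waits on an
#    upstream-completion event via an Argo Events sensor, then submits
#    its own workflow.
sensor = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Sensor",
    "metadata": {"name": "downstream-pipeline-sensor"},
    "spec": {
        "dependencies": [{
            "name": "upstream-done",
            "eventSourceName": "pipeline-events",    # assumed event source
            "eventName": "root-pipeline-completed",  # assumed event name
        }],
        "triggers": [{
            "template": {
                "name": "submit-downstream",
                "argoWorkflow": {
                    "operation": "submit",
                    "source": {"resource": {  # downstream Workflow spec
                        "apiVersion": "argoproj.io/v1alpha1",
                        "kind": "Workflow",
                        "metadata": {"generateName": "downstream-"},
                        "spec": {"entrypoint": "main", "templates": [{
                            "name": "main",
                            "container": {"image": "example.registry/processor:latest"},
                        }]},
                    }},
                },
            },
        }],
    },
}
```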
In the next slides, we'll talk about the operational excellence and observability capabilities, and the journey we went through over the past year to make sure we can deliver top-of-the-line availability and SLA numbers. With the help of the paved-road ecosystems within Intuit, we get a lot of observability features from our paved-path Kubernetes platform, Modern SaaS, which provides advanced GitOps, a Kubernetes control plane, and observability. Deploying a new cluster, managing the cluster, and cluster-level observability are handled by the dedicated paved-path teams within Intuit that we interact with and built the platform on top of.

As a platform providing distributed compute infrastructure as a service to Intuit, there are several operational challenges when it comes to managing multiple Kubernetes clusters. Right now in production we run four Kubernetes clusters: two for our orchestration and two for our runtime. We tackled operational excellence by optimizing in two areas: one is platform operational excellence, and the other is developer and customer operational excellence. For platform operational excellence, we took several steps to improve our confidence in, and visibility into, the health and reliability of the platform and the infrastructure: tightening the on-call process within the team to build cross-service expertise and confidence, unified operational monitoring, continuous improvement by tracking MTTD and MTTR quarter over quarter, and constantly treating stability and scalability as a P0 for the platform. Through this journey, we were able to scale to thousands of concurrent workflow runs, provide four nines of availability, and offer a less-than-one-minute job deployment guarantee.

So let's talk about the one-minute job deployment guarantee. One minute is the time we guarantee for provisioning and deploying your entire data pipeline infrastructure and getting it to the running stage. Once the provisioning happens, your data pipeline can run for hours, but the time between you firing the request and the platform finishing provisioning is one minute. We achieve this by decoupling our compute plane into an orchestration compute plane and fine-tuning the Kubernetes clusters for the specific job they're given. In our orchestration compute clusters, we have fine-tuned the warm pool of IPs in the clusters to have a high IP limit, so that pod provisioning times are very low and we can support a large number of concurrent workflow deployments. We have also built a solution for multi-cluster workflow deployment, and we will be coming out with a better version of it together with the Argo team in Argo open source.
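The talk doesn't spell out the exact mechanism, but on EKS the pod-IP warm pool is typically tuned through environment variables on the AWS VPC CNI (aws-node) DaemonSet. Here is a hedged sketch of that kind of tuning with the Kubernetes Python client; the target values are illustrative, not Intuit's settings.

```python
from kubernetes import client, config

# Hedged sketch: on EKS, pod IP warm pools are typically tuned via env vars on
# the AWS VPC CNI (aws-node) DaemonSet. Target values are illustrative only.
config.load_kube_config()

patch = {
    "spec": {"template": {"spec": {"containers": [{
        "name": "aws-node",
        "env": [
            {"name": "WARM_IP_TARGET", "value": "20"},     # keep 20 IPs warm per node
            {"name": "MINIMUM_IP_TARGET", "value": "40"},  # floor of pre-allocated IPs
        ],
    }]}}}
}

client.AppsV1Api().patch_namespaced_daemon_set(
    name="aws-node", namespace="kube-system", body=patch
)
```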
The next aspect is developer operational excellence. As a platform, we have to monitor our own SLAs and availability, but we also provide operational excellence capabilities to the users interacting with the platform: pipeline health checks to track the status and health of your pipeline, notification integrations for monitoring your data pipeline's SLA, failure alerting, auto-scaling, and pipeline retries for more fault tolerance.

So next we'll take a look at the platform demo. There is a more extensive demo coming up, but to start with, we wanted to show the capability we have built internally for a fully automated processor and pipeline deployment experience. Here we can see the Intuit developer portal experience, where users can create a data processor by providing their runtime (in this example, Spark), the language of their choice, and a bunch of other default Spark configurations, because this is a Spark data processing application. Right out of the box, we spin up the GitHub repository, the build environment, Argo CD for orchestrating your deployments, and hookups to observability, so developers can focus on the code and not on the process of getting things up and running. This has dramatically accelerated how quickly anyone can get to production.
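To give a flavor of the GitOps wiring that this kind of self-serve provisioning might generate, here is a hedged sketch of an Argo CD Application manifest, again as a Python dict. The repo URL, path, and destination are placeholders, not the platform's actual output.

```python
# Hedged sketch of the kind of Argo CD Application the self-serve provisioning
# could generate for a new processor. Repo URL, path, and destination are
# placeholders, not the platform's actual output.
argocd_application = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Application",
    "metadata": {"name": "daily-enrichment-processor", "namespace": "argocd"},
    "spec": {
        "project": "data-pipelines",
        "source": {
            "repoURL": "https://github.example.com/data/daily-enrichment.git",
            "targetRevision": "main",
            "path": "deploy/",  # rendered manifests for the processor
        },
        "destination": {
            "server": "https://kubernetes.default.svc",
            "namespace": "data-pipelines",
        },
        "syncPolicy": {
            "automated": {"prune": True, "selfHeal": True},  # GitOps auto-sync
        },
    },
}
```

With automated sync enabled, pushing to the processor's repository is essentially all a developer needs to do to deploy.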
And then, like we discussed earlier, the processor is the code artifact that can be shared between multiple computes. You can now create your data pipeline, which is the compute and orchestration layer, where you pick the compute of your choice (for example, Spark on Kubernetes), pick the Kubernetes cluster it gets deployed to, and auto-provision the namespace with the required IAM roles. We also have a Kubernetes control plane built by Intuit, so users can configure their policies on access to the data. And we provide the ability to add your processors: you can stack many processors into a single pipeline and create your orchestration. Next, using the processor and the pipeline, you create your scheduling. Like we discussed, you can either do cron scheduling, if you want your processor to run on a defined schedule, or you can create a dependency-based schedule where your processor depends on an upstream data pipeline; on completion of the upstream data pipeline, your processor triggers automatically. This provides the decoupling between data layers for improved efficiency in mapping data availability between data organizations.

On the next slide, we have a prerecorded demo of an execution, which shows how everything we talked about comes together. In this platform demo, we are going to look at a data pipeline running on Spark on Kubernetes and look at all the standardized capabilities provided out of the box by the platform for a developer in the ecosystem. What we are seeing here is Intuit's developer portal experience, where data engineers can configure their data pipelines and data processors. This Spark on Kubernetes data pipeline has been configured to run on the given Kubernetes namespace and cluster, which were configured during the initial pipeline provisioning phase. As can be seen, the data pipeline has a processor, a schedule, and some additional configurations. Let's take a look at the processor. A processor is a code artifact with a standardized GitOps process, provisioned by BPP and Modern SaaS during the fully self-serve creation experience. The developer gets a code repository and CI/CD out of the box, with integration to Artifactory. Besides the processor, the user also gets access to Intuit Kubernetes Service Manager, or IKSM, a control plane for managing Kubernetes clusters and namespaces. Here we are looking at the developer's configured namespace, where they have the ability to manage the namespace resources and ACLs. Developers also get access to Wavefront metrics and dashboards for monitoring their pipeline; this is a sample dashboard that comes out of the box and is customizable by the developer based on their monitoring needs. Splunk dashboards are also provided out of the box for monitoring pipeline runs, and execution history is available for better visibility. Developers and teams also get a quick costing view and additional costing dashboards for managing cost on the pipeline. Additionally, a Spark history server is provided where the Spark runs can be examined.

Let's now execute the pipeline and see how that works. Here I'm going to execute this pipeline twice. An execution of the pipeline can be done in an ad hoc fashion, like we just did, or can be based on a schedule or an upstream dependency. Each execution creates an Argo Workflow, which we'll take a look at, and which orchestrates the steps required for running this data pipeline. Here there are two running workflows; let's take a look at one of them. Each workflow creates the Spark application. Here I'm pulling up all the running Spark applications, and as can be seen, there are two running Spark apps being monitored by the workflows. Besides firing the Spark application on the customer's namespace, the workflow also monitors the Spark app and then emits an event to Kafka, which is read by the orchestration service for updating the pipeline state.
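To make that state-update path concrete, here is a minimal hedged sketch of the kind of completion event the workflow's monitor step might publish, using kafka-python. The brokers, topic name, and event schema are assumptions for illustration.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Hedged sketch of the event the workflow's monitor step might emit on Spark
# app completion. Brokers, topic, and event schema are assumptions.
producer = KafkaProducer(
    bootstrap_servers=["kafka-broker:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "pipelineId": "daily-enrichment",
    "executionId": "run-42",
    "sparkApplication": "daily-enrichment-run-42",
    "state": "SUCCEEDED",  # orchestration service uses this to update pipeline state
}
producer.send("pipeline-state-events", value=event)
producer.flush()
```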
If we go back to the pipeline UI, we can see that the state of the pipeline is currently running and the execution has been updated. In the execution, you can take a look at the configuration used for that execution, and the execution logs are readily available for users to look at for running executions. That's all for the presentation.

Brilliant, thank you so much. What a great talk, and good to see a demo at the end as well. Questions from the audience? Again, raise your hands and I'll come around. Don't all jump at once. There we go, yep.

What if your clusters don't have enough resources to run all the jobs your users have? Do you have a strategy for that?

So we have cluster monitoring based on our capacity, and we are able to load balance both our Spark applications and our workflows across multiple Kubernetes clusters. Usually, when we are getting close to capacity, we provision new clusters. The main aspect of this is having the ability to monitor and manage multiple clusters and load balance the load, both on the orchestration side and for the Spark jobs, which we have. So we have the ability to add more and more clusters as we scale. Right now we are running on four production Kubernetes clusters, but we'll scale up as more pipelines get onboarded. Other questions?

How extensible is the scheduling capability on the platform side? For the Spark jobs, could you use external schedulers like YuniKorn, or is it using the standard Kubernetes scheduler?

Scheduling of the Spark jobs? Yeah. So for scheduling, we fully use Argo. Argo provides Argo Events and Argo Workflows, which we use for both the scheduling and the orchestration, and that's how the jobs are deployed. So the scheduling is outside of the Spark jobs.

I see, I see. Okay. Also, would it be able to do things like gang scheduling?

There's a lot of workload which is still running on EMR, so we want to do the dependency management between those two kinds of jobs, so it needs to be outside of Spark. Yeah. That is the reason we built the scheduling and orchestration layer outside the Spark runtime layer.

I see, gotcha. Thank you. I'm kind of curious: with a one-minute job SLA, and assuming you run on AWS (you mentioned IAM and some other things), do you pre-provision nodes or have some pre-scheduling so that pods can actually go onto nodes, or are pods created and the cluster autoscaler kicks in?

So the pods are not pre-provisioned. The nodes are dedicated: we simply have dedicated clusters for our provisioning and scheduling layer, and those clusters are highly available. On top of that, we pre-provision the IPs required for bringing up a pod. That was the throttling point we noticed when we were doing our load testing: when we tried to provision thousands of pods, pods were taking six to twelve minutes to come up. So the optimization for us was giving each node in the Kubernetes cluster more pre-warmed IPs so that it can provision quickly. Resources were never the issue; it was always the IP limitations enforced, in this case, by the cloud provider, AWS, on the EKS clusters we run on.

Great, any final questions? Okay, well, yeah, thanks again, Arup and Rakesh, and let's give them a round of applause. Thank you.