So here we're going to be talking about scaling Argo Events for the enterprise, from Intuit. So go ahead, guys, thank you. Thank you. Hello everyone. My name is Antonio, and here is my colleague Prema. Today we will talk about how we at Intuit scaled Argo Events for our enterprise scheduling system. Last year at ArgoCon, our colleagues presented how Intuit's batch processing platform utilizes Argo Events to orchestrate the interdependencies among different pipelines. As an example, in this diagram we have a DAG of relationships among different pipelines. When a pipeline completes, it sends an event to Argo Events, and Argo Events notifies the downstream pipelines to begin execution. In today's talk we will go over the challenges we hit when we tried to scale this platform up to more than 10,000 pipelines, and we will briefly describe our work in addressing those challenges to support the scalability of our platform.

Here is the agenda for today's talk. I will give a little more context about how our platform utilizes Argo Events, and then I will go through the different problems, followed by a brief description of our solutions. Then I'll pass over to Prema to talk about all the different optimizations that we did, followed by a brief demo, and she will conclude with the impact of our work.

In our platform, which we call BPP, the Batch Processing Platform, we again utilize Argo Events to orchestrate the different dependencies among the pipelines. As you know, Argo Events consists of the event source and the sensor, and the two communicate through the event bus. In our platform, when a customer or a user defines a pipeline, we provide an SDK, and the SDK translates the definition of the pipeline into the definition of a sensor. The sensor usually runs in HA mode, so each sensor spins up at least two sensor pods.

As we scaled our platform to more than 10,000 pipelines, we started to see a few challenges. First, back then the version of Argo Events we were using was still on NATS Streaming, and that version did not have a persistent store for sensor state. So whenever there was a need to change the definition of a sensor, it would always result in a restart of the pod, and when the pod restarted we experienced data loss, or state loss. Because of that state loss, we were not able to trigger the downstream pipelines. Also, the sensor definition and the sensor runtime specification were coupled into one sensor spec, so when there was a change in a pipeline's requirements, we had to modify the sensor definition, and because of that coupling, modifying the sensor definition triggered a restart of the sensor pod. Again, when the sensor pod restarted, we experienced state loss, which prevented the triggering of the downstream pipelines. Another challenge we were facing was that in our organization there is a limit on the number of pods available within a Kubernetes cluster. Because of that limitation, and because every pipeline corresponds to one sensor spec, we were only able to support about 1,800 pipelines within a cluster. This also resulted in inefficient use of pod resources, because some pipelines run more frequently than others, which means some pods are much busier while others sit idle most of the time.
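As a rough sketch of that original setup (names and the HTTP trigger endpoint are illustrative, not our actual specs), a per-pipeline sensor coupled the dependency and trigger definitions together with runtime settings such as replicas in one object, which is why any definition change rolled the pods:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: pipeline-a                 # one Sensor object per pipeline
spec:
  replicas: 2                      # runtime setting: HA mode, at least two sensor pods
  dependencies:                    # definition: the upstream events this pipeline waits on
    - name: p1
      eventSourceName: pipeline-a-events
      eventName: server-1
  triggers:                        # definition: what to fire once the condition is met
    - template:
        name: start-pipeline-a
        conditions: p1             # fire when dependency p1 is satisfied
        http:
          url: http://pipeline-runner.internal/start   # hypothetical trigger endpoint
          method: POST
```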
So to address these challenges, we extended Argo Events in several ways. First, we decoupled the sensor spec into two portions: one that we call the sensor definition, or sensor metadata, which specifies the dependencies and the triggers of the original sensor; and the sensor runtime, which contains the runtime specification. We also have a centralized deployment of the sensor, which is a centralized pool of pods that handles all the events. This improves resource utilization, because all the pods are listening for events and processing events all the time. We added an external persistent store to hold the intermediate state of the sensor and trigger definitions, so even in the case where we have to restart a sensor pod, the state is kept in the external store and we prevent loss of state. And finally, we added a Kafka implementation of the Argo Events event bus, which can support much higher TPS.

This is a very high-level view of our refined architecture. Again, we have an event source and we still have a sensor pod, and the event source and the sensor pod communicate through an event bus. But within the sensor pod, we define two components: one we call the condition handler, and the other we call the trigger handler. These two components also communicate asynchronously through another topic on the Kafka event bus. The condition handler is responsible for evaluating the conditions of the different triggers. When the condition for a trigger is matched, it passes the information to the trigger handler, and the trigger handler takes over and is responsible for firing the actions that run the triggers. The sensor pod communicates with an external persistent store, which holds the current intermediate state of every trigger evaluation.

This slide highlights the refined changes and shows what the refined runtime looks like. Again, we have multiple pipelines, and we still provide an SDK to translate the pipeline definitions for Argo Events. But instead of creating a sensor, the new SDK creates sensor metadata, with just the specification of the trigger definitions. And then we have a centralized pool of sensor pods, which is responsible for handling all the processing of the events.

So next, I'm going to pass over to Prema. She will go over all the different optimizations that we did.

Thank you, Antonio. I hope we now have a good understanding of our design. I'll now discuss two significant enhancements that we made to achieve this high performance. Our objective was to reach a scale of 15,000 pipelines and approximately 25,000 dependencies. That was our goal, and we were working towards it.

So let me talk about the first optimization that we did. Initially, we had one event source per pipeline. Let's say we have event source A and event source B. Event source A has two eventing servers, server one and server two, and event source B has eventing server one, which filters the events for pipeline B. We had the filter at the event source level. So let's assume we receive an event for pipeline A: the pipeline B event source is going to ignore that event, because it is not intended for pipeline B.
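As a rough sketch of that per-pipeline starting point (the broker address and filter expression are illustrative, and expression-based event source filters depend on the Argo Events version), pipeline A's event source looked something like this, with the filtering done at the event source itself:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: pipeline-a-events            # one EventSource per pipeline
spec:
  kafka:
    server-1:                        # eventing server one: pipeline A's own completion events
      url: kafka.internal:9092       # hypothetical broker address
      topic: pipeline-completions
      jsonBody: true
      filter:
        expression: 'body.pipelineName == "pipeline-a"'   # ignore events meant for other pipelines
    server-2:                        # eventing server two: watching an upstream, e.g. pipeline C
      url: kafka.internal:9092
      topic: pipeline-completions
      jsonBody: true
      filter:
        expression: 'body.pipelineName == "pipeline-c"'
```

Multiplied across thousands of pipelines, this meant thousands of event source pods all consuming the same topic and discarding most of the events they saw.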
And eventing server two of pipeline A is going to ignore it as well, because it is looking for pipeline C. So the pipeline A event will pass through pipeline A's event source via eventing server one. In the meantime, let's also take a look at how the sensor spec looks. This is how the sensor spec currently looks: we have the trigger definition over here, and then we have the condition as P1, which is the dependency it depends on. The dependency definition looks like this: the name of the dependency, the event source name, and the event name.

Going back to our presentation: whenever a sensor is created with that definition, the sensor creates a mapping keyed by the event source name and the eventing server name, and the value holds the entire dependency. So let's walk through how a pipeline A event would work. It passes through the event source and eventing server one, which adds extra information to the event: the name of the event source and the name of the eventing server. So when the sensor receives that event, it knows which event source and which eventing server it came through, looks it up in this map, gets back the dependency, and processes it.

So it all looks good so far. But let's say we are going to have 15,000 pipelines. Then we don't want to have 15,000 event source pods processing all of that. Moreover, we were using the Kafka event source. So what we thought is, maybe we can have a single event source that processes all of the events, and move the filter to the sensor. Now, let's assume all of the events come through this one event source. Then every event has the same event source name and the same event name, because all of them are coming through the same event source. In the sensor, the mapping key looks like the event source name, underscore underscore, the eventing server name, so all of the dependencies now have the same key. Let's assume we are receiving 150 transactions per second on this event source. Then for a single event, we need to evaluate 25,000 dependencies, and at 150 events per second, we need to evaluate 150 times 25,000 dependencies every second. This was hurting our performance.

Let me explain this with an example. Assume we have four different doors, D1 through D4, and we have the corresponding buildings, B1 to B4, and at the reception we have a mapping between the door and the building. Whoever enters through door D1 gets identified as D1 and is mapped to the corresponding building. Now we replace all those doors with a single door, and all the people pass through that one door. Now the mapping says that everyone coming through D1 is mapped to all the buildings, so they have to go to each building to find out whether they have any business there. That is such a tiring process. So what we thought is, instead of relying on a unique feature at the door level, maybe we should pull some unique identity from the person themselves. So at the door, whenever a person enters, we can take their name and give them a badge, and at the reception we also keep a mapping between the name and the building. Now when a person enters through the door, they have a badge, and they can be mapped to the right building.
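To put the one-door problem back in Argo Events terms (keys and names here are illustrative), the sensor's internal lookup, keyed by event source name plus eventing server name, roughly degenerates from one entry per pipeline into a single entry that every incoming event has to be checked against:

```yaml
# Before: one event source per pipeline, so each key maps to a single pipeline's dependency
pipeline-a-events__server-1: [pipeline-a/p1]
pipeline-b-events__server-1: [pipeline-b/p1]

# After switching to one shared event source: every dependency shares the same key,
# so each of the ~150 events per second is evaluated against all ~25,000 dependencies
pipeline-events__completions:
  - pipeline-1/p1
  - pipeline-2/p1
  # ... roughly 25,000 entries
```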
In a similar fashion, what we thought is, maybe we can introduce a new identity on the dependency. So now our dependency looks like this: in the dependency definition, we added a new field called identifier value. In our example, this holds the name of the pipeline. So whenever we define the sensor metadata, we now add the unique identity that we are interested in, and the mapping key becomes the event source name, the eventing server name, plus the identity. We also added an identifier path to the event source, so all the events coming through the event source carry an identifier path that helps us pull the identity value out of the event itself. When the sensor receives such an event, it knows exactly which map entry to look at, pulls the right dependency, and processes just that dependency. This gave us a performance improvement of about 85 percent.

The other performance optimization we did was this. Let's say our BPP users are doing some CRUD operations on their pipelines. Those come to our BPP platform, and BPP, in the underlying layer, creates the sensor metadata in Kubernetes. In the existing architecture, all the metadata is tightly coupled with the sensor, so whenever there is a change in the metadata, the sensor pod gets restarted and always has the most recent value. But in our case, since we decoupled the sensor metadata from the runtime, we need to know at runtime what the actual metadata is. So we thought maybe our sensor could pull that sensor metadata from Kubernetes, but at 150 TPS and with 15,000 sensor metadata objects, we were facing more than 10 seconds of delay fetching the metadata from the Kubernetes API. That was a huge processing delay. To address it, we repurposed our sensor metadata controller: it watches for events on the sensor metadata, and whenever metadata is created, updated, or deleted, we persist the same in our RDS. At runtime, our sensor pulls the latest metadata from RDS, which takes less than five milliseconds to return the updated metadata.

Let's go for a quick demo. I have already prerecorded it. This shows that we have 15,000 pipelines, which correspond to 15,000 sensor metadata definitions. When I query the database, it shows we have 15,000 runtime instances in the pending state. And if we check the count of completed instances, it is zero right now. Our aim is to have 15,000 completed and zero pending, because after this performance test we want all of these pipelines to have been processed. For this performance test, we took just 20 sensor pods. And this is the way we simulate the events: we have a topic, the number of events is 25,000, and the rate is 150 TPS, so this program simulates 25,000 events at 150 transactions per second. I'm going to fast forward. You can see we sent 25,000 events in roughly 160 seconds. Immediately after sending all of these events, if we go back and query our database, we notice that 14,955 pipelines have been processed, and when we query again, all 15,000 pipelines have been completed. All the events were processed within just a few seconds, and we processed all 15,000 pipelines. And we did this with just 20 sensor pods.
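Before we get to the impact numbers, here is the first optimization pulled together in spec terms. The identifier fields below are the extension we described on top of Argo Events (field names as used in this talk, not upstream fields), and the resulting lookup key becomes unique per pipeline again:

```yaml
# Shared event source: an identifier path tells it where the pipeline identity
# lives inside each event payload, so it can be stamped onto every outgoing event
kind: EventSource
metadata:
  name: pipeline-events
spec:
  kafka:
    completions:
      topic: pipeline-completions
      jsonBody: true
      identifierPath: body.pipelineName      # extension field described in this talk
---
# Sensor metadata dependency: declares which identity value this pipeline cares about
dependencies:
  - name: p1
    eventSourceName: pipeline-events
    eventName: completions
    identifierValue: pipeline-a              # extension field: this pipeline's unique identity

# Resulting lookup key, unique per pipeline again:
#   pipeline-events__completions__pipeline-a -> [pipeline-a/p1]
```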
So that demo result is huge. Let me get back to my presentation. When we talk about the impact: right now in BPP we have 35,000 pipelines running in our production environment. If we hadn't gone through this scaling work, we would have ended up with 70,000 pods in production, which is huge for us to manage. We also noticed that at low TPS the resources were not being used efficiently, and we would also have ended up managing the event bus cluster ourselves. With this new architecture, with just 12 pods in our centralized cluster, we were able to support those 35,000 pipelines. We also replaced the event bus Kafka cluster with our Intuit Kafka cluster, because we already have a well-established and reliable Kafka cluster there, so we provided an option to connect to our Intuit cluster. And as BPP, the batch processing platform, we were able to give our users isolation for doing CRUD operations on their pipelines: even though we have a centralized cluster, it is completely abstracted from the users, and they still have the isolation to control their own pipelines. This is the impact we had after this change. And soon, in our production environment, we are expecting more than 50,000 pipelines, which will end up with more than 60,000 dependencies. So this is our journey of scaling Argo Events. Thank you so much.

So we have some time. Oh, sorry. So we have some time for some questions. If anyone has a question, there's a mic up there; you can go ahead and queue up over there and ask.

Hi. So given the amount of workflows that you seem to be processing, did you ever run into situations where etcd was throttling or backing off because it couldn't handle the load? And if so, how did you handle those situations?

Yeah, exactly. So when we wanted to pull the dependencies, I mean, when we wanted to pull the sensor metadata directly from Kubernetes, we noticed that it tries to pull from etcd and it was pretty slow. Because the rate of CRUD operations on our pipelines was pretty low, we didn't face any issues while a pipeline was being created, updated, or deleted. But when we wanted to fetch the pipeline metadata at runtime, that is where we were facing issues. To address that, whenever a pipeline is created, updated, or deleted, we persist it in our RDS, and we pull it from RDS to overcome that.

Any issues in particular when adding so many pods at one point to the cluster?

Do you want to answer that? So when we have too many pods in the cluster: our clusters follow a /18 CIDR allocation, which limits the number of IP addresses per cluster, and our Intuit Kubernetes platform also has some wrappers that use up some of those IPs. So we were limited in the number of pods we could run because of the IP constraints. That is one of the issues that we faced. Thank you. Thanks.

Hi, quick question on the sensors. You said the sensors are centralized; are they decoupled and running on an independent cluster from the actual workload, or is it on the same clusters?

So they are decoupled from the actual workflow. Before, the sensor was the definitions of the dependencies, the conditions for the triggers, and the actions for the triggers. Now the sensor is mainly just the runtime specification, how many pods and so on, and the definitions of the dependencies and the triggers are decoupled into the sensor metadata. Got it.
So are they running on independent infrastructure, like a different cluster from the actual sensor or event source deployment itself?

The controller is running in its own control namespace, and the sensor pods are running in a separate namespace, but it's within the same cluster. All right, thank you so much.

Cool, any other questions? No? Once, twice? All right, thank you very much. All right, thank you everyone. Thank you.