All right, let's get started. Welcome to my session, or rather my session with Vikram. This session is about key takeaways from scaling Adobe's CI/CD solution to support more than 50,000 apps. My name is Andrew Lee. I'm a senior architect with AWS. There's also another name up here, Vikram Sethi. He's a principal scientist at Adobe. He actually couldn't make it today, so I'm going to try to convey the message that we both wanted to tell. Let's get started. Starting with the agenda, we're going to go over a Flex introduction; Flex is the name for Adobe's CI/CD solution. We'll go through the initial architecture that they developed for Flex. And much like the phases of a marriage, Adobe went through these phases when adopting this new architecture: they went through a honeymoon period, then a reality check, then recovery, and now they're working on the foundations for the future. So what is Flex? Flex is Adobe's GitOps-based CI/CD foundation. In a GitOps context, we have a source on the left, which is Git, and a destination on the right, which is what you see there. In between is Adobe's GitOps foundation. It consists of various Argo projects, Argo CD, Argo Workflows, Argo Events, and also some custom components that they've built themselves. On top of this is a CI/CD experience, a developer portal, which allows their developers to see across all the different products, projects, and components they're using. Some aspects of their CI/CD foundation: they allow for bootstrappable golden templates, a paved path that is still flexible enough to allow fully customizable CI pipelines, and just-in-time provisioning during deploys. There are some objects that need to be created when services onboard with this solution, and they built a custom component to provision these objects, things like Kubernetes namespaces and Argo CD Applications. 
They support advanced deployment strategies through Argo Rollouts. And as mentioned, for the CI/CD experience, the developer portal, they have a single pane of glass across all the different projects, and they call that the Flex experience. Another thing to mention: if you look at the destinations, they only support Kubernetes clusters today, but they eventually want to support multiple different destinations, things like serverless, CDN, or static websites, and you can see the list there. Moving on to the initial Flex architecture, you'll notice that this architecture diagram is a drill-down from the previous diagram, where you have a source on the left, which is GitHub, destinations on the right, which are the target remote clusters, and in between is their GitOps solution, which we're going to call flex box number one. Inside flex box number one you'll notice the various Argo projects, Argo CD, Argo Workflows, Argo Events, and also this provisioner component that they built. What the provisioner does is provision various objects, things like Artifactory repositories, secrets, DNS endpoints, Kubernetes namespaces, and Argo CD apps, at the time of onboarding and deployment. Now moving on to Adobe's Flex journey. It all started two years ago, around May 2022, when they were able to contact a few lighthouse clients to go live with them. They had the Flex go-live in October 2022, and after that it was just smooth sailing. There was quick adoption, organic growth, lots of teams signing up for the new service. They like to call this period their honeymoon period, because everyone was happy. The Flex admins were happy, the service owners were happy, everything just worked. Eventually they ran into what they call a reality check. 
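The provisioner's just-in-time object creation boils down to stamping out resources like namespaces and Argo CD Applications per service. As a purely illustrative sketch (all names, repos, and paths here are hypothetical, not Adobe's actual setup), such an Application resource might look like:

```yaml
# Hypothetical Argo CD Application a provisioner could create at onboarding.
# Every name and URL below is made up for illustration.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/example-service.git
    targetRevision: main
    path: deploy/overlays/prod
  destination:
    # One of the target remote clusters on the right of the diagram.
    server: https://target-cluster.example.com
    namespace: example-service
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

A provisioner would typically template resources like this from the golden templates and apply them alongside the matching namespace on the target cluster.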
During this time there was an event they called March Madness, because they had several outages. From these outages, they learned that they didn't have a lot of Argo expertise, and they were also running into Kubernetes control plane challenges. This all happened around the 2,200 Argo CD app level. Fast-forwarding to today, they have 1,600 services onboarded. 460 of those are prod services. They're performing 16,500 deployments per month, and they're managing around 11,000 Argo CD apps at this time. Drilling down into some of the challenges they had: one was that they're running Argo CD and Argo Workflows on the same cluster. We did some testing where we were syncing around 4,000 apps, and as we increased the number of workflows per minute from 30 to 60, all the way to 90, we noticed that the time to sync all 4,000 apps increased. You can see that it goes from 15 minutes all the way up to 25 minutes. Why is that? The reason is that Argo CD and Argo Workflows both depend on API server performance. If we look at the API server from test one through test three, we can see an increase in CPU utilization; we're actually at the limit of this particular control plane. And from the Argo CD application read and write latency, from test one through test three, you can see increased read latency and also write latency. This is why the syncs took longer. Another issue that Adobe ran into was etcd. etcd is the backend storage for Kubernetes, and they found out that it's actually possible to fill it up. The default storage quota for etcd is around 2 gigabytes, and the maximum size the etcd project recommends is 8 gigabytes; Adobe is at that maximum. When the etcd backend storage fills up, the first thing you notice is that the Kubernetes control plane becomes unresponsive. 
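For context, that storage quota is a flag on etcd itself. Here's a hedged sketch of raising it toward the 8 GiB ceiling on a self-managed control plane; the exact manifest layout varies by distribution, and managed control planes (EKS and the like) don't expose these flags at all:

```yaml
# Sketch: etcd static-pod flags on a self-managed cluster (layout varies).
# --quota-backend-bytes raises the backend DB size limit; ~8 GiB is the
# commonly cited upper bound. When the quota is exceeded, etcd raises a
# NOSPACE alarm and writes are rejected until the alarm is cleared.
spec:
  containers:
    - command:
        - etcd
        - --quota-backend-bytes=8589934592   # 8 GiB
        # Periodic compaction keeps revision history from consuming the quota.
        - --auto-compaction-mode=periodic
        - --auto-compaction-retention=5m
```

Compaction reclaims logical space but not disk space; that's what the periodic defragmentation events discussed next are for.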
And if the Kubernetes control plane is unresponsive, Argo CD and Argo Workflows malfunction. Specific to Argo CD, in this graph you can see from the work queue depth that the reconcile operations never finish, and if you look at reconciliation performance, you can see that the reconciliation operations are just timing out. Another issue with etcd is defragmentation events. Defrag events happen periodically on etcd, and when they occur, they can also cause API server latency to spike. You can see in this graph that we did a normal sync on the left, and then a sync where we triggered a defrag event right in the middle. The count of out-of-sync apps drops to zero, but that's not really what's happening; it's just that Argo CD can't determine the status of these applications during the defrag, which is why the graph shows it that way. And if we drill down into the API server latency, you can see that at the defrag event, latency spikes to one second. Moving on in the Flex journey, starting from where they were at the March Madness event, they realized a few things: their Argo components were not tuned properly, going back to their lack of Argo expertise; they were using a shared Kubernetes control plane for everything; and they were still growing, so more and more services were trying to onboard, with more Argo CD apps to manage. They started by forming a stability tiger team, engineers dedicated to getting them out of the situation. They also had a partnership with Akuity, and Akuity gave them a lot of great information about the knobs and controls you can use to get out of Argo CD and Argo Workflows scalability and stability issues. 
And it's at this point that they started to make improvements to stability and scalability, particularly in flex box number one. Now, they still had their share of problems. They had some outages in September, somewhat related to process, but also related to technical issues. And it's at this point that they decided to invest in a new architecture, which they call the flex-in-a-box architecture; we'll go into more detail in later slides. Getting back to some of the key stability improvements they made: in general, they switched from using Kubernetes APIs to using Argo CD APIs for some of their custom components. This allowed the use of an informer cache in the Argo CD APIs, which decreased the load being placed on the Kubernetes control plane. More specific to them, their default VPC was running out of IPs, so they increased the IP space for Argo CD. They increased application controller replicas from 8 to 12; this is sharding. They increased the reconciliation timeout from 3 to 6 to 9 minutes. They removed some custom health status checks, which helped with Application CRD and etcd churn. On the repo server, they increased the number of replicas. They increased the QPS and burst limits for the Kubernetes client on the Argo CD application controller. And they tuned their resource inclusions and exclusions to decrease controller churn; this is a big one, actually. In terms of the stability improvements they made on Argo Workflows, they reduced etcd churn by turning off workflow node events, and they started to archive workflows and reduce TTLs. There's actually an appendix slide at the end of this presentation that goes into more of the improvements they made; you can download the presentation from, I think, the Sched website. Moving on, we want to drill down on some of these stability improvements, do some testing, and show some data. One of them is the reconciliation timeout. 
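To make a few of those knobs concrete, here's a hedged sketch of where they live. The reconciliation timeout matches the 9 minutes mentioned above; the exclusion list is illustrative, and exact keys can differ across Argo CD and Argo Workflows versions, so treat this as a starting point rather than Adobe's actual configuration:

```yaml
# argocd-cm: reconciliation timeout and resource exclusions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Default is 180s; raising it (here to 9 minutes) gives a full
  # reconciliation time to finish before the next one is queued.
  timeout.reconciliation: 540s
  # Excluding high-churn resources the controller doesn't need to watch
  # reduces controller and API server load. Example exclusion only.
  resource.exclusions: |
    - apiGroups: ["coordination.k8s.io"]
      kinds: ["Lease"]
      clusters: ["*"]
---
# workflow-controller-configmap: reduce etcd churn from workflow status.
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  # Turn off the per-node Kubernetes events that workflows emit.
  nodeEvents: |
    enabled: false
```

Sharding and client throttling are set elsewhere: the application controller's replica count on its StatefulSet, and (in recent versions) environment variables such as `ARGOCD_K8S_CLIENT_QPS` and `ARGOCD_K8S_CLIENT_BURST` for the Kubernetes client limits.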
With the default reconciliation timeout of three minutes, you'll notice from the work queue depth that the reconcile operations never finish. They are finishing, but new ones keep piling on because the reconcile period is so short. If you change the reconciliation timeout from three minutes to six minutes, or in their case nine minutes, you'll notice that Argo CD now has time to reconcile all the apps it's managing before the next reconciliation period kicks in. I also want to note that this particular graph is with 10,000 applications under management. There's also a decrease in API server load; there's going to be a theme around API server load and control plane scalability. When you increase the reconciliation timeout, you'll see a corresponding decrease in read and write requests per second, that is, API server load. Now, a common way to get more performance out of Argo CD is sharding. As you'll notice, when syncing 10,000 apps with one replica, it took over 35 minutes to do a sync. When you switch to 10 replicas, that goes down to six minutes. It also really depends on your particular architecture, whether you're managing multiple remote clusters, because sharding in Argo CD is by cluster. Now, nothing is free. If you improve Argo CD performance, what happens? You also increase API server load. Notice that moving from one replica to 10 replicas could potentially increase your API server load by 10 times. In the previous slide, we went over how to improve Argo CD performance at the expense of API server load. Why don't we now try to decrease API server load at the expense of Argo CD performance? What we did is change the QPS and burst limits of the application controller's Kubernetes client, from the defaults of 50 and 100 down to 25 and 50. And you can see that it dropped the API server load. 
But we also saw a corresponding drop in Argo CD sync performance. It took six minutes before; this new graph shows that it takes 13 minutes. So going back to Adobe's flex-in-a-box architecture: the goal of this new architecture is to provide a foundation to predictably and reliably scale Flex for their future needs. Some of the product leads and product managers on their side were saying they needed to be able to support more than 50,000 Argo CD applications, and they're hoping this new architecture can do that. There are three aspects. The first is recreate: having automation to recreate flex boxes. We've been talking about flex box one, but if they wanted a flex box two, three, or four, they would have automation to create those flex boxes automatically. The second is redirect. Currently all the applications are being directed to flex box number one, but say they also want to direct some to flex box two; this is something like a load balancer, the ability to redirect. The last aspect is relocate. If they had apps running on flex box number one, they wanted the ability to move them to flex box two, maybe because flex box one is getting too busy or there's something wrong with it. Now, moving on to the architecture for flex-in-a-box. What we have here, if you look at the bottom, is that there's now a flex box number two, and this fulfills the recreate aspect. Then there's a new component in front of these flex boxes, which they call the redirect service. What the redirect service does, through a mapping of services to flex boxes, is determine which services go to which flex box, and these rules are set by the Flex admins. And then they also want the ability to relocate. 
So they would be able to relocate applications from flex box number one to flex box number two with very little downtime to the deployment pipelines. We've been talking about how the API server might be a potential bottleneck in the original architecture. Now that they're able to scale horizontally across multiple flex boxes, there are multiple clusters and API servers to work with. So we did some testing where we took 50,000 apps and 300 workflows per minute and split that across five different clusters, and we ran a sync test on all of them in parallel. I think the longest sync, on cluster four, took 24 minutes. So potentially you're able to sync 50,000 apps, while running 300 workflows per minute, in around 24 minutes of Argo CD sync time. One thing I want to note is that when we tried to increase the rate to 90 workflows per minute, the control plane crashed. So there's still some work to be done to vertically scale a single flex box, but now they have a path forward for horizontally scaling across multiple flex boxes. Some of the future explorations they want to do: they want to separate the control planes for Argo CD and Argo Workflows. One of the ideas they had was using vclusters to host Argo CD while running Argo Workflows on the host cluster. And if you're able to run Argo CD inside a vcluster, then it's also possible to have multiple vclusters in a single flex box, so you can run multiple Argo CD instances. They also want to look into supporting GitHub Actions for CI alongside Argo Workflows. We also did some testing on vclusters. We ran the same test as before, syncing 4,000 apps. Remember that before, at 90 workflows per minute, the sync took 25 minutes. 
Now, even when we increased the rate to 150 workflows per minute, the sync time didn't increase by much: it went from 13 minutes to 16 minutes. That three minutes is still pretty significant, and probably some more experimentation is needed to determine if vclusters are the right path forward, but this is something Adobe is thinking about. So, some of the key takeaways from Adobe's journey. Know what metrics to watch. If you know what metrics to watch, you'll be able to find the limits of your system, and you'll know whether the knobs and controls you're changing are actually making a difference. Know the limits of the components in your system. If you know the limits, you'll be able to set alarms that alert you to potential problems before your users do. Know the knobs and controls: if you get into a situation where you're having scalability or stability issues, knowing the right knobs and controls will help you get out of it. Know when to scale vertically versus horizontally. This is where Adobe struggled: they first tried to scale vertically, and when they eventually decided they couldn't vertically scale a single flex box any further, they invested in a new architecture that allows them to scale horizontally. Experiment: always ensure the knobs and controls you're changing are actually making a difference, and of course, experiment in a staging environment instead of a production environment. And sometimes processes are just as important as architecture. One of the things Adobe learned is that process issues contributed to their September outages, and they're hoping that with a good process in place, if they run into issues again, they'll be able to get out of them in a timely manner. 
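"Know what metrics to watch" maps to a handful of concrete signals from this talk. As a sketch, here are a few PromQL queries for them; the metric and queue names are as exposed by recent Argo CD, Kubernetes, and etcd versions, so treat them as assumptions to verify against your own deployment:

```promql
# Argo CD app controller work queue depth: sustained growth means
# reconciliations aren't keeping up (the March Madness symptom).
workqueue_depth{name="app_reconciliation_queue"}

# 95th-percentile app reconciliation latency.
histogram_quantile(0.95, sum(rate(argocd_app_reconcile_bucket[5m])) by (le))

# API server request latency: watch for spikes, e.g. around etcd defrags.
histogram_quantile(0.99,
  sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb))

# etcd backend DB size versus the quota (default ~2 GiB, max ~8 GiB).
etcd_mvcc_db_total_size_in_bytes
```

Alerting on thresholds derived from your own load tests, rather than generic defaults, is what turns these from dashboards into the early warning the takeaways call for.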
So with that, I'll open it up to Q&A and also there's a QR code to tell us how we did. Thank you.