My name is Andy Anderson. I work for IBM, and today I'm talking to you a little bit about an Argo CD study we performed to show some of the scalability issues we hit with the platform. I'm also a maintainer of the kcp-edge.io open source project.

So, does anybody know what this is? This has been piquing my curiosity for the past two days. Anybody? Hands? Does anybody know what that blue thing is? I just found out that it's a raindrop. Isn't that great? Vancouver's nickname is Raincouver. I didn't know that. Did you know that? All right, Carlos didn't know that either.

OK, so this talk is largely motivated by our open source project, kcp-edge.io, which focuses primarily on how to make the best use of, and scale out, the Kubernetes API machinery. What we noticed right out of the gate is that we needed to test that scalability. So how do we do that? Well, why don't we plug Argo CD in on top and see what breaks?

What we believe is that most automation solutions that encounter large numbers of edge locations are going to rely on customization of resources, and that, in turn, is going to cause more problems. Any automation solution doing GitOps, like Argo CD or Flux CD, should take scalability into consideration. So our goal is to help identify and solve some bottlenecks in Argo CD. We're part of the Argo SIG Scalability; you may have heard people talk about that this week, and Nick More had a talk this morning. And we work with the community to find and fix bottlenecks in Kubernetes as well.

The research question here is: how many Argo CD ApplicationSets or Applications can be supported with reasonable performance? And what do we consider performance? Reducing the sync delay: how much time passes between updating something in your Git repo and that change actually taking effect on the target cluster?

What we found very quickly is that CPU became a bottleneck, and of course that impinges further on the delay in synchronizing those resources. So what do you do in this situation? If you're managing edge locations and sending out lots of ApplicationSets, you've got to somehow reduce the computation required to, let's say, customize those resources before they're delivered, because you don't want to go hand-modifying each one of these YAML files. It's terrible. It's agonizing. So we can provision more compute power, or we can choose to accept longer synchronization delays. And we managed to get 10,000 Argo CD Applications to synchronize in less than 40 minutes.

This has stirred up quite a bit of interest in the Argo CD community, and by popular demand we just published it today. If you want to go ahead and scan this, you can follow along with us. This is exactly how to recreate what I'm showing you here, so you can follow along as I give this presentation if you prefer.

So what does the design of the experiment look like? Roughly speaking: on the left, your Git repos, the mandatory GitOps piece. We've got Grafana and Prometheus, which give us the metrics. Argo CD with a plugin called RAP4KISS; the naming isn't really relevant here, but suffice it to say that what it does is make the name of each generated resource unique. And then, of course, our clusters, or in this case a single cluster, on the right, plus something called ClusterLoader2, which lets us load in, in tranches, as many applications as we need.
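To make the uniqueness idea concrete, here is a minimal sketch of the kind of Kustomize overlay such a wrapper can emit per application. The base path and prefix are hypothetical, and this is not the actual plugin's output:

```yaml
# kustomization.yaml -- illustrative sketch only, not the real plugin output.
# One overlay like this per generated application gives every resource a
# unique name, so thousands of copies can land without clobbering each other.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - base/workload.yaml   # hypothetical shared base manifest
namePrefix: app-0001-    # unique per application: app-0002-, app-0003-, ...
```

Rendering one of these overlays for every application is exactly the per-application CPU cost that shows up in the first hurdle below.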
So first up, the first hurdle, is the CPU limitation. How did we arrive at this? If you look at the graph on the left-hand side, you get the quantitative information, and the visual information correlates with it: you can see that CPU usage starts to spike, and this is with only 2,000 applications being sent into the API at that moment. And the Argo CD server here is running as just a single replica, OK? So let's see, maybe we can get more performance if we increase the number of replicas.

Here we increased it to three. And what we saw is that even with three replicas we still get the spikes. Why is that? Look at the graph: we thought we could get an advantage by increasing the number of replicas, that the time to synchronize should drop to roughly a third. We would expect that kind of reaction, but that's not what we got. We actually hit the bottleneck in the CPU itself; this is a t2.medium on AWS, and that constraint just wasn't going to help us. And of course, most of the CPU load comes from the customization of each of these applications as they're sent out to preserve uniqueness.

The second hurdle we hit was the default resync period. When you install Argo, the default is three minutes. That means that from the time you've sent an ApplicationSet in, to the time it synchronizes with the system and eventually lands on the target, could take as long as three minutes. Now, you may hit somewhere in the middle of that window and get lucky, maybe two minutes or one minute or 30 seconds, but typically people don't look at it that closely, so three minutes is the maximum time you'd be waiting.

We also did experiments with 2,000 and 4,000 applications, but those were somewhat redundant, so let's look at 6,000 applications being sent in: you start to see the work queue giving off a sine wave, with CPU usage following in step. Then we get up to 8K and things get a little wonky in the work queue; again, we're measuring this with Grafana and Prometheus metrics. And finally we arrive at 10K. The workloads themselves don't always happen to be the same size, either, and you can see we start to get overlap in the work queue. This shows the difference between the items being processed and those sitting in the reconciliation queue, and they start to clobber each other. Now you're getting interference, and the CPU is affected even more erratically.

So why don't we increase the resync period here? This is that trade-off we were talking about at the beginning of the presentation. Now you're at six minutes, and you can see the work queue has plenty of time to drain, and the CPU falls back into a more regular pattern.
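The resync period itself is a one-line change in the argocd-cm ConfigMap. A minimal sketch, assuming the stock argocd namespace; the 180-second default matches the three-minute window described above:

```yaml
# argocd-cm sketch: doubling the application resync period.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # How often every application is re-reconciled even when no change is
  # detected. 180s is the default; 360s is the six-minute trade-off above.
  timeout.reconciliation: 360s
```

The application controller typically needs a restart to pick this up; scaling the replica count from the first hurdle is a separate change to the Argo CD workload manifests.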
So our final result: the interesting graph at the top, I think, is the number of synced applications versus the number out of sync. If you watch this graph as it converges and crosses, you'll see how much time it takes; it lines up at about 40 minutes for 10,000 of these Argo CD Applications to sync. So that's the other trade-off: the first trade-off was CPU, the second is the time it takes to sync.

As I mentioned earlier, one of the main reasons we did this is the KCP Edge community and the work we're doing there. A shameless shout-out to Lori and the come-join-us meme; I thought that was cute. If you scan this QR code down here, you can take a look at all the other work we're doing in this space. KCP Edge, just as a promotional ad here, is trying to increase scale while preserving compatibility with the rest of the community that's out there making all these GitOps solutions available to you. We found that folks like OCM and Carvel are all changing resource types and bundling manifests in ways we feel are not native enough, and that starts to split or break apart the standards we're starting to arrive at at the higher level. And of course, the introduction of logical clusters, grouping, customization, and status summarization are also important to this project, because if you think about trying to manage a million edge locations simultaneously, you can easily run out of fingers and hands and heads to process all that information. So this is our project, and this is what spurred us to do the Argo CD scalability experiment. And at this time, I'd like to welcome any questions you might have. I think I got there pretty close.

Yes, Gerald?

Gerald: I think that adds a lot of overhead to the processing; you can see it in the CPU utilization.

Yes, yes, you saw that, right. Right here is where we switched over.

Gerald: I didn't remember you mentioning it. Just curious how much of an impact Kustomize had, and did you test with Helm at all as well?

No, we didn't test with Helm. In terms of Kustomize, you saw that wrapper in the experiment; I'll put it back up here. That plugin in the center is where the customization takes place. When ClusterLoader2 says, OK, give me 10,000 applications to put into Argo CD and send down to the target cluster, what you wind up with is object names that will clobber or overlap each other if you don't do some degree of customization. So it was required in this case.

Gerald: What I was curious about was performance: feeding Argo raw YAML, like we saw earlier today with Intuit pre-rendering everything, versus running it with Kustomize versus running it with Helm.

Yeah, that's the whole bit about the parameter replacement: we don't feel that at that scale, with that number of applications, you can get away without some degree of customization. That transformation has to take place somewhere, otherwise you're doing everything by hand. Good question. Thanks.
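For reference, here is a minimal sketch of what that pre-rendered approach looks like on the Argo CD side, assuming a hypothetical repo of committed, already-rendered manifests. We didn't test this variant, but it's the shape the question describes: the templating cost is paid once in CI instead of on every sync.

```yaml
# Illustrative sketch: an Application pointing at pre-rendered plain YAML.
# The repo URL, paths, and names are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: edge-app-0001
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/rendered-manifests.git
    targetRevision: main
    path: rendered/edge-app-0001   # plain YAML; no plugin invoked at sync time
  destination:
    server: https://kubernetes.default.svc
    namespace: edge-app-0001
  syncPolicy:
    automated: {}
```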
Thank you. I'm going to mess it up this time. From the New York Times, please, your question.

Audience: I'm finding my way through all the metrics that come out of Argo, and which numbers are the best to watch. It feels like, coming out of your research, one of those metrics is work queue depth, and that's absolutely something you want to see come back down to zero at some point. Is that true? And likewise, are there other metrics you found that are the ones to watch for, like making sure work queue depth always cycles down, or anything else like that?

Yeah. So work queue depth is more of an indicator for us, right? When I was replicating this experiment myself, I actually took it to 20, 30, and 40,000, and you can just imagine the delay you see after a while of watching that. It's not so much that the queue is both draining and adding at the same time; it's when the pattern starts to change and gets erratic that you realize, OK, something is being paused for a minute: some process is stopping to give CPU cycles to another. When it becomes less of a square wave and starts to become a curve, or starts to get jagged, that's when you start to see issues and know that other processes are probably impinging on its resource allocation. Again, Grafana and Prometheus give you the ability to eyeball that and then theorize, make a hypothesis, about why it happened. In this case this is more for demonstration purposes, to show you what you can look at, and I urge you to go out there, try to replicate the experiment, and see if you can push its limits as well. But thank you for your question; I hope that helps.
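As a sketch of that "does the queue drain?" check, here is what it can look like as a Prometheus alerting rule. This assumes the application controller's workqueue_depth metric with the app_reconciliation_queue queue label; exact metric and label names may differ across Argo CD versions, so treat this as illustrative:

```yaml
# Prometheus rule sketch: warn when the reconciliation queue never drains.
groups:
  - name: argocd-work-queue
    rules:
      - alert: ArgoCDWorkQueueNotDraining
        # Fires if the queue depth never touched zero over the window,
        # i.e. items arrive faster than the controller can drain them.
        expr: min_over_time(workqueue_depth{name="app_reconciliation_queue"}[15m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Argo CD reconciliation queue is not draining
```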
Any other questions? Carlos? I'll repeat it back: the first question from Carlos was at what point does the GUI stop responding, and the second was whether we did experiments with multiple clusters.

To answer the second question first: no, we did not do it with multi-cluster. But there have been changes since; I think there was an AWS experiment that did this with multi-cluster, or at least with sharding implemented. There are some advantages to doing it that way, so I encourage other people to look at that as well.

The first question, though, was when does the UI stop responding. Well, I've got to tell you, I ramped this up to about 40K, and there is a lag between finding out how many were added versus how many reconciled, but over the course of time it really didn't stop. For me it was responsive enough to look at as a snapshot; I wouldn't be monitoring in real time, that's for certain. There was a lag, but I would eventually get the results.

Carlos: Interesting. I got different results. At 8,000 or 6,000, the UI doesn't work for me.

It was amazing, honestly. We did it on a Friday afternoon, and I figured, to hell with it, let me try it over the weekend and see if I can do more. So I pumped 30K more into this and it still held up.

Carlos: Yeah, I was doing 100 clusters, so that's something. But yeah, SIG Scalability, that's where I'm going to discuss those things.

Yeah, Carlos brings up a good point: Argo SIG Scalability. Look it up in Slack and join us for the conversation. We're really looking forward to fixing any bottlenecks we see, because as soon as this bottleneck gets resolved, the next hurdle is, well, Kubernetes itself: how do we get etcd to store the right number of objects? You really have to start thinking much further down the workflow about what the next hurdle is going to be. Thank you, Carlos, for your questions. Anybody else? Great, thank you all for joining me today. I really appreciate your time. Have a great rest of your afternoon.