I'm an Argo project maintainer, though probably the least of the maintainers, and I'm also a co-creator of the OpenGitOps project and the GitOps Working Group under the CNCF. You can follow me on Twitter at @todaywasawesome; I would love to see you on there. If you didn't like this talk, please rage at me on Twitter, and if you liked it, be sure to tag me.

Yeah, and I'm Brandon Phillips. I'm a principal technologist and also the head of our technical product marketing at Codefresh, as well as the host of Merging to Main, which is a CI/CD webinar and podcast series, so feel free to check us out. You can @ me on Twitter too. Just a little heads up about Codefresh itself: we are an enterprise Argo solution. We've actually been working with Argo and customers for more than two years now, and it's just been an amazing experience.

Yeah, thank you, Brandon. So we're going to be talking about Argo. Simple, right? How hard could it be? You have your desired state and you have your actual state; following all the GitOps principles, we want those to sync up, utilizing declarative configuration. How hard can it really be, Dan? It's pretty simple, right? Just comparing two things.

Now my question is, let's just check with the audience here: who's using Argo CD in production today? Everyone? Like, everyone in the room. That is amazing, by the way. Good job. So Argo itself is obviously an amazing project, and it can be nearly bulletproof. But what happens when you start adding lots and lots of Kubernetes clusters? We know that the world is not that simple, right?
You're not going to get away with just having a few clusters linked into it, especially as you scale over time; you're going to add more and more clusters. And what happens if we start to add even more Git repos? As we know, you're not going to have applications without source code repos, and you're not going to have clusters without applications to deploy to them. So you start to grow and grow and grow. At what point could this potentially become a problem? Yeah, and obviously, ten thousand apps with different sync policies and different-sized repos: what could possibly go wrong?

Yeah, so then the question is, is one Argo instance enough? Can it handle everything that your organization needs? Maybe some of you are already separating out your Argo instances for other reasons. But if you're sticking with one Argo instance, maybe you're set up hub-and-spoke, and you have a lot of clusters tied into a lot of applications, with a lot of developers committing changes. When is one Argo instance not enough? That's the question.

Yeah, and this is where people can sometimes get into trouble, because all of you have deployed Argo already. You already have a bunch of users, it's working, you're feeling good, everybody's deploying apps, you're grooving, you're in sync. Yeah, thanks. So you're feeling great, but as you keep adding stuff, you're going to start to notice different performance things. Now, Argo CD has a ton of different knobs and things you can tweak to really extend the performance. So what we're going to talk about today is some of the issues you might run into. It's not really focused on performance tuning; it's more about figuring out how you can preview those things before you hit them.

Yeah, so the question is: let's say I'm hitting, or we think we're hitting, the limits of what a single Argo CD instance can actually do. What kinds of things might you actually experience when you hit that limit?
What is the behavior that you might see? We've actually had a lot of people reporting different behavior to us over the last couple of days at this conference, and we appreciate that. Come by the booth and we can talk about whatever performance issues you're having; maybe we can get them straightened out. But typically, the things that become problematic are slow reconciliation times. They're not necessarily an issue in terms of eventual consistency, but if you're trying to sync an application right now, "eventual" might be too far away. So you want to think about your sync times and your sync queue depth. If it takes an hour to go through all of your applications, obviously that's going to impact the end-user experience. That's going to be a challenge.

Other kinds of issues that create problems are things like client-side UI responsiveness. I've noticed that when I get up over around 7,500 apps in an Argo CD instance, my browser starts to struggle. So at that point you're asking, hey, can we issue more RAM to the machines or the users? Because it's really a browser issue. But there are a lot of different things you can run into, and the key here is that we want to give you some tools today so that you can find these things before they happen to your users. Wouldn't it be nice to be the hero who fixes it before anybody finds out, so they don't even know that you're the hero?

Yeah, absolutely. And that's always the goal, right? We want to keep development teams and DevOps teams from being blocked by situations like this, so having tools to actually validate that can help a lot. So let's just take a quick poll. For those of you that are running Argo CD in production, which seems like everybody in the audience, which is amazing: how many of you have more than a thousand apps in one Argo instance? Okay, more than a thousand; there's a handful.
All right, a thousand going once, going twice. All right, how about three thousand applications? We've got one over here, one over here. All right, there's one over here. Oh, another one. Yeah, how about five thousand? One hand stayed up. How many do you have? What did he say? Five thousand four hundred. Five thousand four hundred, ladies and gentlemen. That's the KubeCon talk you need to go to, because he might have it figured out. I actually talked to somebody yesterday who deployed 15,000 in a production instance. Yeah, that was just yesterday. I was talking with somebody; they should go hire him.

So we're not going to go quite to 10,000 today, but there is a wonderful tool that we have for you. It's called gen-resources. Like I said, it's secret, hidden, undocumented, so it's a chance for you to learn how to use this tool, and you could actually document it and get that first commit into the open-source project. Yeah, so Dan, where can I actually find it? Yeah, it's been hiding in plain sight this whole time: in the Argo CD repo. Wait, wait, just in the main repo? Yeah, in the main repo, under the hack folder. Now, hack folders are famous for all the secret goodies. I recommend that when you go to an open-source project, the first thing you do is look in the hack folder, because that's where all kinds of shenanigans are happening.

So this tool that we have here is called gen-resources, and it has a simple job. It basically acts as a sort of agent of chaos for your Argo instance, and it will generate clusters. Here we leverage vcluster, and it's very simple.
It's going to use Helm to deploy a whole bunch of vclusters, as many vclusters as you want. Now, there are some really interesting nuances about how Argo CD works with clusters, because you can replicate out the repo server, which allows you to shard jobs across all these different clusters. We're not going to do that today, because it would make it less likely to break, and we want to break it; I think that's rather clear.

It also generates applications. Usually with Argo CD there's a kind of confusion where you think, oh, I have a lot of apps. For example, this gentleman over here has over 5,000 apps on an instance, and you might be saying, hey, I'm struggling with 500; what's the deal? Well, it's probably because of the number of objects that are under management. Objects are actually the bigger deal, along with how those objects are distributed across different Kubernetes APIs, because if you have one Kubernetes API handling all of those objects, it's going to be a bigger challenge. So there are situations where we see that Argo CD is running fine, but the issue is actually that the Kubernetes API is too slow. In that case, even splitting up a cluster with vcluster can actually help you scale what Argo is doing, because you're helping scale what each Kubernetes API has to handle.

Yeah, absolutely. And I think a key thing here as well: this is maybe not always going to give you the smoking gun, but it's going to help you eliminate variables when you're scaling, around where you might have performance blockers.

Yeah, so let's get into this demo here. Hopefully this is big enough that you can see it. I'm inside of the secret hack folder with gen-resources, and there's a command in there. To build it, you just do a go build, and you can name the output whatever you want. I named it argocd-generator, because I didn't like consistency; this is a chaos talk, okay? So that's what we're doing. Building it is pretty quick.
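For anyone following along at home, the build step looks roughly like this. Treat the paths as assumptions: hack/gen-resources is where the tool lived at the time of this talk, and the cmd subfolder is a guess, so double-check the layout in whatever branch you clone.

```shell
# Clone Argo CD and build the undocumented load-generation tool out of the
# hack folder. The exact subdirectory layout may differ between versions.
git clone https://github.com/argoproj/argo-cd.git
cd argo-cd/hack/gen-resources

# -o names the binary whatever you like; the speaker called his
# "argocd-generator". The ./cmd package path is an assumption.
go build -o argocd-generator ./cmd
./argocd-generator --help
```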
It only takes about 10 seconds, and then we have a file in here. Yeah, I'll show you here. Look at that, colorful; too fancy for this talk. So this file is basically an argument that you can pass to gen-resources, and it allows you to specify how many applications you want to make. Then you have a strategy for the sources of those applications and their destinations. Currently there's one strategy available, and it is random. You could be the person to implement the second strategy. That could be you.

And then you give it a number of clusters, and then it's going to take a values file that gets passed to the vcluster Helm chart. And then you can specify whether you want these operations to happen in parallel. I've found that if I'm spinning up like a thousand of these, it takes a long time, so having parallelism happening is very nice. Yeah. It'll throw errors and retry if the Kubernetes API hasn't loaded yet, so it's pretty robust in that way. And then it will generate repositories and projects for you. So where do those repositories come from? Yeah, all of these applications; this is another agent-of-chaos element for you. This was a great idea from Alexander M., who, for those of you that aren't aware, is one of the other maintainers. He said, where can we find a whole bunch of repos that have files we can deploy?
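As an aside, the options file just described looks something like the following sketch. Every field name here is a guess reconstructed from the narration, not the tool's real schema, so check the gen-resources source before relying on them.

```shell
# Write an illustrative gen-resources options file. All keys below are
# assumptions based on the talk (app count, the "random" strategy, cluster
# count, vcluster values file, parallelism, repos, projects); the real
# schema lives in the tool's source.
cat > generate-options.yaml <<'EOF'
apps:
  samples: 2000          # how many Application CRs to create
  strategy: random       # currently the only source/destination strategy
clusters:
  samples: 1             # how many vclusters to create via the Helm chart
  valuesFilePath: vcluster-values.yaml
  parallel: true         # spin them up concurrently; serial is slow at ~1000
repositories:
  samples: 50
projects:
  samples: 50
EOF
```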
And he said, sure: a hell of a lot of people have forked the Argo CD repo, and those forks have demo files in them. So this basically pulls all of the forks of Argo CD and then deploys them onto your cluster. So beware; there be dragons, you know. Yeah, I mean, if deploying random stuff from the internet sounds like a bad idea to you, this might not be the talk for you. But if you would like to help out, you can go and fork the Argo CD repo right now, and it'll give us more applications for other people to deploy. If you wanted to take extra steps and have, say, a Kyverno policy to make sure that only the images you wanted were deployed, that wouldn't be a bad idea. But you might not want to deploy this at all. I wouldn't deploy this into a production cluster, and I would be careful deploying it into a cluster that had networking access to other stuff. Yeah, definitely sandboxed. Yeah, sandbox this one for sure. Double-bag it, folks; we're in Amsterdam. All right.

So the next thing that we're going to do is modify this and actually start spinning up some resources, and while we're doing that, I'll show you what the demo environment looks like. So, applications here: let's generate 2,000 applications. And I'm only going to generate one cluster, because clusters take a little while to generate. Each one is just a few seconds, but if we were doing a hundred, we'd be here for a couple of minutes, and that might be fun for you, it might be fun for me, but we have a schedule. Repositories, we'll stick with 50. We'll do 50 projects. That looks pretty good. Save this, and now we're going to go ahead and run this file, and you're going to see it starts by generating all my projects. We actually use this tool ourselves; as for the reason this tool was created:
It was really created by Alexander M. and Pasha from the Codefresh team, and I made a few small code tweaks and then claimed credit for the whole thing. But they created it because, when we're working on Argo CD, we need to be able to profile performance over time. So we use this tool to run against Argo CD and figure out whether there are any big gotchas in the performance that are changing over time, before we ship it to you folks.

So while this is happening (it's firing off the vcluster, and then you'll be able to see the applications fire off in a second), I'll introduce you to our demo environment. You can see I've got an Argo CD instance where we already have 850 applications out of sync and 100 synced. I bootstrapped this with Argo CD Autopilot. Yeah, do you want to tell people what Argo CD Autopilot is? Yeah, absolutely. Argo CD Autopilot is one of the Argoproj Labs projects. You can utilize it to easily spin up the Argo runtime itself, but also to bootstrap repos and bootstrap the cluster setup. It makes it very easy to kick this process off, so I'd recommend you check out that project. I was chucking random stuff onto this cluster before we started, and you said, what happens if it falls over? And I said, well, I can just do a repo bootstrap with Argo CD Autopilot and it'll come back. Though maybe after the talk. Yeah.

Anyway, so now we've got these running. You can also see I've got a Grafana dashboard. This is not a special one; this is the one that's available in the community. You can see I've currently got 69 clusters deployed. That wasn't on purpose.
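That repo bootstrap is a single documented Autopilot command. As a sketch, with a placeholder repo URL and token:

```shell
# Bootstrap (or re-bootstrap) an Argo CD installation from a GitOps repo
# using Argo CD Autopilot. Both values below are placeholders.
export GIT_TOKEN="your-git-token"                           # token with repo permissions
export GIT_REPO="https://github.com/example-org/gitops-repo" # your GitOps repo

argocd-autopilot repo bootstrap --repo "$GIT_REPO"

# If the cluster falls over, re-running the same bootstrap against the
# existing repo restores Argo CD and the apps recorded in Git.
```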
I promise. You can see we're over 100 now. You can see I'm currently at 952 applications and 42 repositories, and there are a couple of things we can look at on here that are pretty important. My dashboard is going to start getting a little funky as we show this, but reconciliation performance shows you how long each app is taking on average. So when you're having issues with rendering manifests and applying them, and maybe you need to scale up the repo server or change how much memory you're allocating to it, you'll see this one start to tick up in its time. But the bigger issue is actually the ones that take infinite time; those ones obviously failed. So as that queue depth backs up, it'll start to create a problem.

Looks like it's still getting that vcluster running; like I said, it takes a few minutes. You can also see reconciliation activity. Right now it's reconciling around 470 apps per 10-second window, or something like that, and most of our applications are not currently synced. Sometimes this slows down for me. Okay. So now we can see that it's actually setting up random applications, and as it's doing this, we'll actually see our cluster count start to rise in Prometheus. The way that Argo CD works with clusters is that when you add them, they'll show up with no status, because nothing is syncing to them. Once you assign an application to a cluster, it'll actually start counting. So there are actually more clusters here than are currently being shown; they just haven't had any applications assigned to them.
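An aside for anyone rebuilding these panels: the numbers being read off come from Argo CD's Prometheus metrics. The two metric names below do exist on the application controller's metrics endpoint, but label sets vary by version, so treat the queries as sketches rather than copy-paste dashboard definitions.

```shell
# Ask Prometheus for a couple of the signals on the community dashboard.
# PROM is a placeholder for wherever your Prometheus API is reachable.
PROM="http://localhost:9090"

# Reconciliation activity: reconciliations completed over the last minute.
curl -sG "$PROM/api/v1/query" \
  --data-urlencode 'query=sum(increase(argocd_app_reconcile_count[1m]))'

# Sync-status breakdown: Synced / OutOfSync / Unknown application counts.
curl -sG "$PROM/api/v1/query" \
  --data-urlencode 'query=count(argocd_app_info) by (sync_status)'
```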
Yeah, they'll start to pop in; it just takes a little bit of time. Yeah. Now, as I refresh this again, we should see our application count starting to grow a little bit, and our cluster count starting to grow a little bit. This application depth panel also shows you how the queue rises. If I make this a little bit bigger, you can see my out-of-sync count is currently 842. It should start ticking up as it starts processing these applications, actually finding them and reconciling them, because what gen-resources actually does is just create the Application custom resource in Kubernetes. So you have to wait a few seconds for Argo CD to discover it, start syncing it, start throwing it into the cache, and all of that stuff.

Yeah, and I think generally, for your developers out there, your teams that are delivering code, the reconciliation time in your production environments, or even your lower environments, is really the first thing that you're probably going to notice. Yeah, I think that's right. If your UI is not working, that may or may not be a big issue for you, because if you're operating entirely from Git, maybe you're only using the UI for debugging and such, and that's important, but what matters is deploying; it's syncing. Maybe there are some Argo CD Core users in here; you don't even use a UI, you're hardcore, so you wouldn't have those kinds of issues. Definitely, monitoring this from a user perspective, against what you're trying to accomplish, is a pretty important task.

So if you just spin up a cluster and you start throwing this thing at it, it'll give you the experience of going through and starting to tweak those knobs, like changing how many repo servers there are, or changing the memory on the processors. You can even change how frequently the Kubernetes API is hit,
which is a pretty important one, especially depending on your cluster setup. In this case I'm using a lot of clusters, a lot of vclusters, so each Kubernetes API isn't getting hit very hard. If you plan to have one cluster with 10,000 applications, I wouldn't spin up a bunch of clusters; I would just throw the applications at it. But keep in mind the object count, because it's really about the objects a lot more than it is about the application number itself. Application count is a good proxy for objects, but in this case there aren't tons of objects for each of these applications.

All right, so you can see this is starting to count up pretty well. If I go ahead and refresh my Prometheus, you can see we've just jumped up to 1,700 applications now. Right now it's still showing 101 clusters; there are probably closer to 180 clusters on here, but because it hasn't gotten through the reconciliation depth for those clusters yet, it's not recognizing them. So this shows you the process; you can see it happening in real time. For example, you can see over here, under the sync status, that the unknowns have jumped. This white line represents all of the applications that Argo CD hasn't gotten to yet, to figure out what the repo is even doing or where it's supposed to be deployed. As we work through the queue (maybe I shouldn't have done 1,500, now that I come to think about it), this will start to come down, and you'll see all of the sync statuses start to show up.

We have another tool in the hack folder that I didn't want to present today, because it's more rough around the edges, but it's called simulator, and basically what it does is deploy fake developers onto your cluster, and they just go around and sync stuff and delete apps and break things. So it's sort of like a toddler, basically. Who's a developer here in the room? Okay, you're the danger.
They're the danger. Yeah. Okay, so you can see my application count is now up to over 2,000, but my unknown queue is sitting stuck, because I'm getting bottlenecked, basically, by what the repo server is able to handle. So let's cut it off. Well, we're almost done to the 1,500; I guess we can let it finish. Brandon said we should make it break, and I think people will like to see that. Mostly, what that looks like is that it just stops loading. It's like, hey, have you ever seen a failed-to-load page? That's amazing. But that's kind of what it looks like. So you can see we're still missing several thousand applications that the UI hasn't caught up with. Yeah, they're just queued.

While we're letting that load for a second, keep in mind that this is going to cost a lot of resources. This is one of the only talks I've ever done where I spun up all of the nodes beforehand, and then I thought, this is going to be an expensive 25 minutes. Also, the most I've done is somewhere over 10,000 applications, and I documented the whole journey; there's a whole talk on that you can follow. There's also a lot of work being done by some other folks that I'll talk about in a second, actually.

Conservatively, in this case, I'm using Argo CD HA. Most of you are probably not. Who's using Argo CD HA? Keep your hand down if you don't know what I'm talking about. Okay. So all of these folks deployed the version of Argo CD that allows you to spin up your Redis instances and increase the replicas of Argo CD very effectively. So in this case I'm actually using Argo CD HA, which means the numbers I'm showing you will not apply to most of you, because you're not using the HA version. It's really easy to switch, and it's very low resource cost if you're not scaling it.
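A sketch of that switch, plus the tuning knobs mentioned over the last few minutes (repo server replicas, controller workers, reconciliation frequency). The install URL pattern is the documented one, but pin a real version tag instead of stable in practice, and treat the numeric values as illustrative, not recommendations.

```shell
# Install (or switch to) the HA flavor of Argo CD: Redis runs in HA mode
# and the other components are built to scale out.
kubectl create namespace argocd
kubectl apply -n argocd \
  -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/ha/install.yaml

# Scale the repo server; manifest rendering is usually the first bottleneck.
kubectl -n argocd scale deployment argocd-repo-server --replicas=5

# Give the application controller more workers (keys in argocd-cmd-params-cm).
kubectl -n argocd patch configmap argocd-cmd-params-cm --type merge -p \
  '{"data":{"controller.status.processors":"50","controller.operation.processors":"25"}}'

# Change how often Argo CD re-polls Git for changes (key in argocd-cm).
kubectl -n argocd patch configmap argocd-cm --type merge -p \
  '{"data":{"timeout.reconciliation":"300s"}}'

# The controller reads these settings at startup, so restart it.
kubectl -n argocd rollout restart statefulset argocd-application-controller
```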
So I would actually probably recommend it. But you can expect at least 1,500 apps, 14,000 objects, 50 clusters, and 200 developers before stuff starts to maybe need to be tweaked. Yeah, and these are pretty rough estimates, right? Yeah, and they vary a lot, because, like, 50 clusters: well, you can see I've actually got around 200 clusters right here, and I'm still using it, and it's working okay. But there's only like one application on each. Yeah, for sure. And so it could vary based on your environment, the size of your clusters, and what your devs are actually committing as well. So yeah, it's quite expensive. I'm sorry, I mean, you can get pretty efficient with this, and you can really tweak it. So Argo CD is very, very scalable. There are a lot of reasons why you'd want to have more instances, though, and this goes into some of the scalability content we've put out.

Yeah, absolutely. And it's one area where, you know, we've worked with a lot of folks and talked to a lot of people who are learning how to scale out Argo and learning how to make it more resilient in their environments. So we wanted to start documenting some of that change that we're experiencing out there and some of the growth around Argo as well. There are two blog posts that I would highly recommend checking out. One is about scaling Argo securely in 2023, and it also talks about the architecture models and some of the ways that you can spin it up. Then we have another great one that does a deep dive into the Argo CD architecture and talks about how you would utilize it in your environment and what the best way to grow it is. And then, I think, just in general, we talk about HA, right? HA models, hub-and-spoke models, standalone versus using a control plane; there are probably six different architectures documented on there. And both of these are evergreen.
We keep these updated very regularly. Yeah, and so they'll continue to get changes, and we're sure that there will be architectural improvements as well over the next year or couple of years. Before we check back in on how the demo is going, is there anywhere else people can learn from? Oh, I don't know; maybe you should check out our GitOps certification. So, who here has done our GitOps certification? Okay, cool, that's like quite a few hands. That's great. So we do a GitOps certification with two different levels. The primary level covers a lot of the basics, but you do get into Rollouts and doing canary and blue-green deployments. In the second level you do a deep dive into things like the pull request generator and the other generators, and we start talking about how to organize your repositories with Argo CD; you can't get to this kind of scale if your repositories don't make sense. Yeah, so that's really helpful. We have two codes for you. Yep: "me love chaos", for those that are quick; that one gives you 100 off, so the certification will be free. For those that didn't catch it, "me slow chaos" gets you 50 off. And come by the Codefresh booth; I'm sure they'll give you some codes.

Let's check in on how this is going. So we'll refresh the UI here and see how quickly it responds. We should have decent performance here. You can see we finished generating all of our applications. Oh, we generated 2,000. Okay, well, it's good to keep track. And let's pull up Grafana and see if we're able to refresh that. So yeah, we're just peaking around 3,000 applications. It looks like our sync status is still sitting very high on the unknowns. We've got a little over 1,100 that have been figured out by Argo CD at this point, and there are 1,700 still working through the queue. And there is a worker queue depth.
Oh yeah, you can see the worker queue depth has actually started to stall out, because it's been sputtering. This is just during the spin-up period, but you can also see, in this scenario, that once this is all operating it'll be working; the question is also, how long does it take me to get all the way round-robin back to my applications being fully functional again? Yeah, I mean, how long does it take to get to consistency, right? That's the real question. You can also see my memory usage has started to stall, so I think some pods have been killed and it's restarting them. So that's happening. And you can see my UI is starting to struggle a little bit.

While that's loading for a second: so now the secret is out. There is a project within Argo called SIG Scalability. This is something that was started by Codefresh, AWS, IBM, Akuity, Red Hat, and Adobe, and there's really great work being done. AWS, last week, presented a lot of findings from their usage of these kinds of tools to figure out the different areas that will start to go wrong when you're using EKS specifically, and there are several tickets that we opened off of that. I think that's a presentation you can find online, and if you join the SIG Scalability channel, all the slides are in there, and they have some really interesting tweaks that you can do to improve your performance. We'd love for you to join us in that Slack channel, and if you're interested in scalability, we'd love your help. I already gave you several easy contributions you could make today.

And then finally, I just wanted to thank Midjourney for chaotically generating all of these bizarre Argonauts for the presentation. So with that... oh look, it's loaded. One last thing, and then maybe we can take a few questions, unless somebody tackles us from the stage. Brandon, should I sync apps? Yeah, let's do it; let's see what happens. Should I sync one app?
No, I don't think so. You want me to sync all? I think you should sync all apps. Okay, let's find out what happens; let's sync all apps. Okay. So remember, I'm just deploying random apps from the internet here, to the wide-open internet. I don't want to give anybody the idea to deploy, like, Bitcoin miners into these forked repos; it'll only run for like two minutes before I shut it down, so it wouldn't be worth it. But yeah, we're going to let this sync. And that's when you say, what is it actually syncing? Yeah, okay, I was definitely going to say that. So with that, I think we can take a couple of questions, if anybody has any. Actually, I have a question for the audience first. Has someone out there already applied a lot of custom tweaks to their Argo instance? Is anyone heavily customizing their Argo instances out there? Scaled repo server? Yeah, that guy with 5,000 apps raised his hand. Okay, just curious. All right. I mean, to me that proves that Argo CD is immensely scalable already, but there's always more progress to be made, right?

Yeah. And it's fun to push it like this, but generally I recommend to people that they have multiple Argo CD instances, for organizational separation of responsibilities as well. Yeah, and blast radius. In this case, if this were a production cluster and some knucklehead connected to it, like one of you knuckleheads connecting this to your production cluster after this talk and starting to throw this at it, your admin is suddenly going to be freaking out, and you're going to be affecting production apps. So, you know, don't do that. But someone could do it, so it's good to be aware of. And so it's good to separate concerns, so your blast radius is relatively low. So, this is going to be a while.
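For anyone who wants to reproduce that sync-everything moment from the CLI instead of the UI, a sketch (it assumes you're already logged in with the argocd CLI, and that a sandbox cluster, not production, is on the other end):

```shell
# Queue a sync for every application this Argo CD instance knows about.
# This is exactly the thundering herd the talk is demonstrating, so aim it
# at a sandbox.
for app in $(argocd app list -o name); do
  argocd app sync "$app" --async || echo "failed to queue sync for $app"
done
```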
So this isn't going to be a fast process or anything. We'll do one more refresh on the Grafana here, and then open it up to questions. Okay, so what do we top out at? Just under 3,000 applications. Well, can we ask you to give us a round of applause? Thank you. Any questions? I think we're pretty close on time, but raise your hand or shout it out. Really? I'm surprised; totally clear to everybody. We did a great job; we crushed it. All right, so you're all going to go out there and run it right away, right? You know, start testing boundaries. Okay, if you have any other questions, feel free to hit up Brandon and me. We're going to be at the Codefresh booth for the next two days, or you can find me potentially at the Argo booth; I'll be at one of the two for the rest of the conference. Thanks, everybody. Have a great KubeCon. Enjoy.