Welcome, everybody. Have a seat; we're going to get started. Thank you for joining this GitOpsCon session. There's plenty of seating up here in the front and over there. Make room if you need to; it looks like people are still coming in.

Today's talk is, well, I can't say the title of today's talk. I guess you have to read it, because it's bleeped out. But: how the heck do I deploy that many apps to that many clusters? This is a story that comes out of Adobe, which has done this very effectively.

To introduce myself: my name is Dan Garfield. I am the chief open source officer and co-founder of a company called Codefresh. We are the first and still only company to fully commercialize on the Argo project: Argo Workflows, Argo Events, Argo CD, and Argo Rollouts. We launched our enterprise Argo offering about three years ago, and we've seen really great adoption and pickup on that. I'm also an Argo project maintainer, and I helped create the GitOps standard, so GitOpsCon is something that came out of that, which is very cool for me. You can find me on Twitter at @todaywasawesome.

And this is my co-speaker, Mike Tougeron, who is a lead cloud engineer at Adobe. You'll notice that he is not on the stage with me. Unfortunately, he was in a car accident last week, and he said he needs to be home. He said he's fine. He says, "Tell everybody I'm fine, but also just give the talk without me there." So I said, no problem, I'm going to present all your work as if it's my own. So, Mike, I appreciate your help on that.

So how did this start?
Well, it actually started at KubeCon EU last year. I happened to meet the folks at Adobe, and they were looking for advice on how they could architect their Argo rollout. They were in the midst of a huge modernization project: they were using old-school tooling, and they wanted to start taking advantage of Argo tooling to fix everything they were doing and to do it the right way. So we started talking about their approaches and their architecture, and this led to a pretty fun collaboration for me, helping them get up and running and figure out the right things to do.

This is the goal that they had: provide infrastructure for Adobe's new developer platform. Their developer platform has about 285 clusters across 28 regions, 20,000 nodes, and about 2,000 GPUs, and you need to install about a hundred Helm charts per cluster. This is stuff like Prometheus, Grafana, and all the other tooling that the infrastructure team needs to be present on each cluster that people are going to be using. And this is a baseline; this was the very start of the project. In fact, it's grown quite a bit even since then, and it's already getting close to a million CPUs. So it's a fairly robust project. There are a lot of developers involved, so there's a chance to have a lot of impact.

So, what we did: terraform apply. That's it. Thanks.
Cool. No, in fact, this is not the solution.

So, where they were coming from: they had a big Python monolith. They had a lot of Azure ARM templates and AWS CloudFormation templates (Kostis and I were just talking before the talk about CloudFormation templates and just shaking our heads), a lot of manifest templating that was very custom, and then of course a lot of imperative operations. The whole thrust of GitOps is: let's get away from imperative, let's start doing declarative. So we wanted to move to a one hundred percent declarative structure, taking advantage of Cluster API for cluster provisioning and Argo CD, using the entire suite, using Helm charts, and then using some operators where we needed to. This is where a lot of the power of Argo CD is really going to come in and shine.

You can see I've got my AI-generated logo for Cluster API there. It's a stack of turtles. If you're not familiar with Cluster API, it's basically a way you can provision clusters declaratively: you create a custom resource inside of a Kubernetes cluster, and that custom resource is then responsible for spinning up a new Kubernetes cluster.

So how does Argo CD work? Now, how many of you are more than casual Argo CD users, just by a raise of hands? Okay, so a good chunk of the audience, but some of you are still casual, maybe new to it. Even if you've been doing Argo CD for a long time, this concept might feel a little bit new: if you think about it, an application in Argo is really a policy. You have your desired state, and yes, it's defined in Git.
It's probably a bunch of manifests; maybe it's some Kustomize or some Helm charts or whatever. Then you have your actual state that's living on your cluster. In between, you have your application, and that application is the source and definition of that policy.

So, for example, if you look at this application here, it's called Hello, Vancouver, because: hello, Vancouver! Thanks. Appreciate that enthusiasm, bringing it back to me.

We have this ability in Argo CD to do something called ignore differences. Now, how does this work? Well, in Argo, when we take a source of truth, we render out Kubernetes manifests, then we compare that against the actual state, and then we can apply our policy. We can say: I actually don't want you to track the replicas. In fact, I have a controller that does that, so ignore any changes to the replica count; that's not an issue. Or maybe I'm using Kyverno and it adds a lot of annotations on things, so I'm going to ignore those differences, because Kyverno is in charge of Kyverno stuff and I'm just going to be in charge of my stuff. The ability to do ignore differences, to actually compare state and get a view of what's different between your source of truth and what's happening in the actual state, is something that's very unique to Argo CD. That's a really important part of this, and it's going to feed into what we're talking about.

When we talk about these kinds of policies, they also include things like self-healing. If something has changed in production, does that mean that I should overwrite that change? Should I mark it as out of sync, or should I leave it?
You can set different policies for different specific fields and things like that, so that's really powerful for Argo CD.

Many of you have probably already heard of the app of apps pattern. If an application is Kubernetes manifests with a policy (and I just showed you what one looks like; it's just a custom resource that I have created inside my Kubernetes cluster), well, that means that I can have an application point at a folder with just a bunch of applications in it. This is called app of apps. App of apps is really great because it's just YAML, and it's really fantastic for end-user space. At Codefresh, the way we actually refer to app of apps internally, and how we mark it in our application, is we just call it a Git source app, because that's how we think about it: this is an application that represents a Git source that has a bunch of other applications in it.

And then the next kind of meta layer up beyond that is application sets. Application sets allow you to programmatically generate applications, right?
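To make the "application as policy" idea and the app of apps pattern concrete, here is a minimal sketch that builds two Argo CD `Application` manifests as Python dicts and prints them as JSON (which is also valid YAML). The repository URL, folder paths, namespaces, and the lowercase name `hello-vancouver` are assumptions for illustration, not taken from the slides; the field names (`ignoreDifferences`, `syncPolicy.automated.selfHeal`) are the ones the Application resource uses.

```python
import json

# Hypothetical repository URL, used only for illustration.
REPO = "https://example.com/platform/gitops.git"


def application(name, path, namespace, ignore_differences=None):
    """Build an Argo CD Application manifest as a plain dict."""
    spec = {
        "project": "default",
        # Desired state: where the manifests live in Git.
        "source": {"repoURL": REPO, "targetRevision": "HEAD", "path": path},
        # Actual state: which cluster and namespace get reconciled.
        "destination": {
            "server": "https://kubernetes.default.svc",
            "namespace": namespace,
        },
        # Policy: sync automatically and revert manual drift.
        "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
    }
    if ignore_differences:
        spec["ignoreDifferences"] = ignore_differences
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": spec,
    }


# An application that leaves the replica count to another controller.
hello = application(
    "hello-vancouver",
    "apps/hello-vancouver",
    "hello",
    ignore_differences=[
        {"group": "apps", "kind": "Deployment", "jsonPointers": ["/spec/replicas"]}
    ],
)

# App of apps: an application whose source folder contains Application manifests.
git_source = application("git-source", "applications", "argocd")

for manifest in (hello, git_source):
    print(json.dumps(manifest, indent=2))
```

The only difference between the two is what lives in the source folder: workload manifests for the first, other `Application` resources for the second.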
So we've got our application policy, we can have an app of apps, and now we can have an application set. It's basically two components. It's generators, and I'll tell you what those can be in a second, and then it's templates. Templates just represent: if we're going to generate an application, this is what it's going to look like, and you're going to fill in the details.

We've got a ton of generators within Argo CD. We have a list generator, so you can do a list of items. Cluster: for every cluster added to Argo CD, you can generate applications. You can do a Git generator. You can do a matrix, where you combine multiple generators together. Merge. Source code management providers. Pull request is an especially interesting one, because you can say: any time somebody opens a pull request, generate an application. And of course we also have the cluster decision resource generator.

There is a common misconception, and some of the people in the room are like, "We know, get on with it." The reason I bring this up is, first, there are some people who don't necessarily know. But the other reason is that there is a really big, common misconception in the community that application sets are app of apps 2.0. I'll talk to people sometimes and they'll say, "Oh, I don't use app of apps. I use application sets." Well, good. That's great. I'm glad you have a use case for it. But almost every single Argo CD instance that I deploy has both. In fact, I use application sets to generate off of Git app of apps, so I'm actually using them together. This is a great pattern.
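The generator-plus-template idea can be illustrated with a toy simulation. This is not Argo CD's implementation; it only mimics the concept: generators produce sets of parameters, a matrix pairs them up, and each parameter set fills in the `{{placeholder}}` fields of a template. The cluster names, URLs, and chart names are made up for the example.

```python
from itertools import product


def matrix(first, second):
    """Combine two parameter lists into every pairing, like a matrix generator."""
    return [{**a, **b} for a, b in product(first, second)]


def render(template, params):
    """Recursively replace {{key}} placeholders in a nested template."""
    if isinstance(template, dict):
        return {key: render(value, params) for key, value in template.items()}
    if isinstance(template, list):
        return [render(item, params) for item in template]
    if isinstance(template, str):
        for key, value in params.items():
            template = template.replace("{{" + key + "}}", value)
    return template


# Two "list generators": one over clusters, one over charts (hypothetical values).
clusters = [
    {"cluster": "staging", "url": "https://staging.example.com"},
    {"cluster": "prod", "url": "https://prod.example.com"},
]
charts = [{"chart": "prometheus"}, {"chart": "grafana"}]

# The template: what every generated application looks like.
template = {
    "metadata": {"name": "{{cluster}}-{{chart}}"},
    "spec": {
        "source": {"chart": "{{chart}}"},
        "destination": {"server": "{{url}}"},
    },
}

apps = [render(template, params) for params in matrix(clusters, charts)]
for app in apps:
    print(app["metadata"]["name"], "->", app["spec"]["destination"]["server"])
```

Two clusters times two charts yields four generated applications from a single template.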
They're more tools in your toolbox.

Okay, so the one that we're going to leverage for Adobe here is the cluster generator. If you add a cluster to Argo CD, that's as simple as creating a secret. So it's just declarative, right? I can just create a secret on the cluster where Argo CD lives. There are a million different authentication schemes I can pass in there, and I can leverage a secrets manager, I can use External Secrets, and whatever. Then, for each cluster that I add, I can generate my applications. I can say: every time I add a cluster, I want Prometheus added to it, and I want Grafana added to it, and I want these things, and I want them to have this kind of policy. That's very possible to do, and it's going to be really important for the strategy.

So here's the plan. We've got about 75 apps that need to be installed (sometimes it's more like a hundred apps as a baseline), each of those has at least 10 resources, and then we have 275 clusters. So now we have 200,000 resources for Argo CD to manage. Right, sounds great.

Argo CD is very scalable, and there's some really great content that, well, I was going to say there's some really great content that I put together. There's some really good content out there, and I did put some of it together. You can search for "Scaling Argo CD securely in 2023" and you'll find a blog post. Argo CD has a mode called HA. Most people don't use HA. How many people are using Argo CD HA today? Okay, a smaller number than the people who said they weren't just casual users. Argo CD HA just adds high availability, it allows you to have more scalability, and there are a ton of knobs you can tweak. Now, the focus of this talk is not necessarily how to scale Argo CD, but we will talk about some strategies.
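As a rough sketch of the cluster generator approach described above: a cluster is registered by creating a labeled `Secret`, and an `ApplicationSet` with a `clusters` generator stamps out one application per registered cluster. The manifests are built as Python dicts here. The cluster name and server URL are hypothetical, the chart version is a placeholder, credentials are deliberately left out of the secret's `config`, and the resource estimate at the end simply multiplies the figures quoted in the talk.

```python
import json


def cluster_secret(name, server, labels=None):
    """Declarative cluster registration: a Secret that Argo CD treats as a cluster."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": f"cluster-{name}",
            "namespace": "argocd",
            # This label is what marks the Secret as a cluster definition.
            "labels": {"argocd.argoproj.io/secret-type": "cluster", **(labels or {})},
        },
        "type": "Opaque",
        "stringData": {
            "name": name,
            "server": server,
            # Real setups put credentials here (token, cloud auth, and so on).
            "config": json.dumps({"tlsClientConfig": {"insecure": False}}),
        },
    }


def baseline_application_set(app, repo_url, chart, version, namespace):
    """One ApplicationSet that generates an application for every cluster."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "ApplicationSet",
        "metadata": {"name": app, "namespace": "argocd"},
        "spec": {
            "generators": [{"clusters": {}}],  # every cluster known to Argo CD
            "template": {
                "metadata": {"name": "{{name}}-" + app},
                "spec": {
                    "project": "default",
                    "source": {
                        "repoURL": repo_url,
                        "chart": chart,
                        "targetRevision": version,
                    },
                    "destination": {"server": "{{server}}", "namespace": namespace},
                    "syncPolicy": {"automated": {"selfHeal": True}},
                },
            },
        },
    }


secret = cluster_secret("us-west-01", "https://us-west-01.example.com")
prometheus = baseline_application_set(
    "prometheus",
    "https://prometheus-community.github.io/helm-charts",
    "kube-prometheus-stack",
    "45.7.1",  # placeholder version, pin whatever you actually run
    "monitoring",
)
print(json.dumps(secret, indent=2))
print(json.dumps(prometheus, indent=2))

# Back-of-the-envelope scale from the talk: apps x resources per app x clusters.
estimated_resources = 75 * 10 * 275
print(estimated_resources)
```

With 75 apps the product is 206,250, which is the "200,000 resources" figure; at a hundred apps per cluster it would be 275,000.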
There's a talk that I gave at KubeCon two weeks ago that covers some performance data, and there's another talk that I gave at ArgoCon last year that talks about that.

So this is actually going to be pretty intense, and in our case it's going to be too much for a single Argo CD instance to handle. No problem; we're just going to layer it on.

There are a couple of different Argo architectural patterns. This is a blog post you can find: "A comprehensive overview of Argo CD architectures in 2023." You can Google that, or search "Codefresh Argo architectures," and you're going to find it. It's a comprehensive guide and it goes pretty in-depth, but there are basically four models.

There's hub and spoke. That's where you have an Argo CD instance and you connect all your clusters into it. The cluster generator is perfect for that use case, because I can create a single application set, and that Argo CD instance can then just generate applications for all the connected Kubernetes clusters.

The next is split instance. This is basically where you have parts of Argo CD living in different clusters. There are some open-source solutions for that, and there are some specific use cases where it makes sense. For instance, you want to talk to something behind the firewall, or you want to distribute the load of what repo server is doing, which is essentially reconciling with clusters.

There are standalone instances. This is very popular for edge clusters. Let's say every Starbucks has a cluster. I don't know about Tim Hortons.
I didn't check; I'll have to go behind the counter after the conference is over and see if they have a Kubernetes cluster back there. But this is very popular for edge, where you basically have an Argo CD instance in each location. That's really great because it's resilient, and they can each just reconcile themselves. And I'll tell you that I'm aware of some use cases where those locations aren't stationary: they're mobile, and you might have them on vehicles or transport of some kind. So that's interesting.

And then finally there's the control plane model. In the control plane model, you essentially introduce an overarching control plane that manages all of your instances. This is something that Codefresh does. You can use it for free; you can go check it out at codefresh.io and try it out.

In this case, we're going to be getting part of the way there with a kind of modification of the control plane pattern, which is doing an Argo of Argos. This is where you set up an Argo CD instance, and it connects to X clusters. You then install Argo on those clusters using Argo CD, and then those Argo CD instances can connect to additional clusters and deploy and manage from there. Of course, you can also scale up Argo CD components individually; repo server is the most common one. You can tweak how many Kubernetes API requests it makes.
You can do all kinds of those things.

This setup for Adobe does offer us a lot of resilience. One of the reasons that going with one giant Argo CD instance may not be your cup of tea is that if it goes down, you can't update anything. So having multiple Argo CD instances is very valuable, because something can be going wrong over here and it's not going to affect everything else. That's a much better situation to be in, and it's one reason you might want to split it up.

These are all fully managed in Git, so nobody's clicking through and setting these up. They are all bootstrapped and automated, so that every instance is fully self-managing once it's bootstrapped. It also allows us to do testing and to progress apps between clusters and instances: we can start changes in one instance of Argo CD and move them on to, like, a staging and then a production. We can also manage regions, so we can have different areas that are managed by different clusters, and then we can manage the rollout between these different areas.

Codefresh does something very similar to this. I mentioned Codefresh has a control plane. We also offer fully hosted Argo CD environments, and the way we do that is we leverage Codefresh, which is the enterprise version of Argo CD, to maintain all of these community versions of Argo CD in their own vclusters. So we basically have one instance that spins up and manages thousands of instances. There's not that much work that the parent instance of Argo CD actually has to do to keep them all running.

There's a great talk on that which Kostis gave at KubeCon EU two weeks ago. It's called "How We Securely Scale Multi-Tenancy with VCluster, Crossplane, and Argo CD." Great YouTube video, and you should go watch it. Skip the next session and watch it, and then you can ask Kostis about it when he gives his talk this afternoon.
He's giving another talk.

So, from a Git perspective, a couple of strategies. I mentioned earlier that you can mix and match application sets and app of apps. Like I mentioned, at Codefresh we have an Argo CD instance that's managing every customer Argo CD instance. What we do is use application sets to generate those, and then we bootstrap an app of apps into each one. The user can show up, they have a Git repo, and they can just throw their apps at it. The client Argo CD instance will pick those up and deploy them, and they can operate it without even looking at the UI. They don't have to; they can operate it entirely from Git if they want to. That's a really nice pattern that helps a lot.

With Adobe's flow, we're going to handle this a little bit differently. We're going to be leveraging Argo Workflows, Argo CD, and Argo Events, and basically the way it works is this. When a new cluster is created, that's a Cluster API definition, right? So we can use Argo Events to listen for the creation of clusters. Whenever a cluster is created, it doesn't matter if it's created manually, with Git, or whatever: it's going to automatically pick up that event and trigger an Argo workflow. That Argo workflow will collect the details from the config map on that cluster, and then it will find the proper Argo CD instance and cluster to deploy to based on the metadata. Then it will upsert those cluster details into that Argo instance, which will allow the application set in Argo CD to automatically bootstrap and manage all of the applications on the cluster.

Does that make sense? Everybody follow me? Does that get confusing?
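The flow just described (cluster created, workflow reads the config map, picks an Argo CD instance, upserts the cluster details) can be sketched in a few lines of plain Python. This is an in-memory illustration of the logic only; it does not call Argo Events, Argo Workflows, or the Kubernetes API. The config map keys (`region`, `arm`, `gpu`, `runtime`, `argocdInstance`), the region-to-instance mapping, and all names are assumptions for the example; in particular, routing by region is a guess at what "based on the metadata" might mean.

```python
def labels_from_config(config):
    """Turn config map entries (strings) into labels for the cluster entry."""
    return {
        "arch": "arm64" if config.get("arm") == "true" else "amd64",
        "gpu": config.get("gpu", "false"),
        "runtime": config.get("runtime", "containerd"),
    }


def pick_instance(config, instance_by_region):
    """An explicit instance wins; otherwise route by region (assumed rule)."""
    return config.get("argocdInstance") or instance_by_region[config["region"]]


def on_cluster_created(config, instance_by_region, registry):
    """Upsert: writing the same cluster twice leaves the registry unchanged."""
    instance = pick_instance(config, instance_by_region)
    registry.setdefault(instance, {})[config["name"]] = {
        "server": config["server"],
        "labels": labels_from_config(config),
    }
    return instance


# Hypothetical Argo CD instances, keyed by region.
instances = {"us-west-2": "argocd-us", "eu-west-1": "argocd-eu"}
registry = {}

event = {
    "name": "gpu-cluster-01",
    "server": "https://gpu-cluster-01.example.com",
    "region": "us-west-2",
    "gpu": "true",
    "runtime": "kata",
}

on_cluster_created(event, instances, registry)
on_cluster_created(event, instances, registry)  # replayed event, no duplicate
print(registry)
```

Because the write is keyed by cluster name, replaying or re-triggering the event is harmless, which is the property that makes an imperative workflow safe to use in a declarative setup.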
It's actually a pretty simple workflow if you think about it. A cluster is created, which triggers a workflow that grabs the details and pushes them to Argo CD, and then Argo CD uses application sets to dynamically generate all the applications for that cluster, so that we have everything we need from a system space, from a security standpoint, from a monitoring standpoint. Then the users can just start deploying to the cluster.

So what does that config map look like? The config map that is created for Cluster API shows things like: does this cluster include ARM? Does it include GPUs? Is it Kata Containers and not just containerd? Is there a specific Argo CD instance that you want to go to (that's optional)? All that metadata that we need just lives in a config map. It's short enough that we can consume it and then push it on to the cluster details.

Now, there is another feature that can help us here, called application set progressive rollouts. Application set progressive rollouts is a pretty new feature in Argo CD. I think it debuted in 2.6, and we're currently on 2.6; the release candidate for 2.7 just came out a few weeks ago. This is considered an alpha feature, and I want to call that out. But this is what you can do with it.

In Adobe's case, we have all these clusters, right? We're using application sets now to manage all of the system space, the security, and whatnot, and we have a nice mechanism so that whenever a cluster is created, it automatically gets bootstrapped with all the components. It's all managed very nicely. What progressive rollouts allows us to do is say: these clusters actually represent regions, and I don't want to deploy to all these regions at once. I want you to deploy to one region at a time, run some sort of health check, and then progress on to the next one. That way, if there's an issue, my blast radius is small. This is an issue that you
have to think about with GitOps, because when you have the power to make a commit that triggers all of your deployments or all of your changes, that means you have the power to make a commit that destroys all of them. It's a two-edged sword: if you can bootstrap everything in one command, you can destroy everything in one command. So it's very powerful, and the ability to do progressive rollouts is an excellent feature.

This is a community contribution. It was really cool that somebody came to the Argo project and said, "Hey, we want to do progressive rollouts; here's how we think it would work," and as maintainers we worked with them and helped them get this in. It's very cool, because you can basically say: these are prod clusters, these are staging clusters, or these are the regions, and this is the correct progression, and this is how I want to go.

So this is what it looks like. You can see that we're looking at an application set here, and that we have different cluster generators that we're using to specify different environments. The staging environment is getting certain versions, and the prod environment is getting different versions. Then, within our progressive sync, we can set update criteria for how these are going to roll out and how they're going to progress up across the instances.

This allows you to make an update and not accidentally destroy 285 clusters. Instead, you can do an update and destroy a small percentage of those clusters, and not destroy the rest of them, and then you can fix those ones and you'll be back up and running. So progressive rollouts is a very cool feature, and I definitely would encourage you to check it out. Again, it is in alpha.
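A sketch of what such an application set might look like, again built as a Python dict rather than copied from the slide. It uses the `RollingSync` strategy with ordered `steps`; each step selects applications by label, so the template copies the environment into an application label via the generator's `values`. The repository URL, paths, revisions, label key `env`, and the 10% batch size are assumptions for illustration, and as an alpha feature it has to be explicitly enabled on the ApplicationSet controller.

```python
import json

# Hypothetical repository holding the baseline configuration.
REPO = "https://example.com/platform/baseline.git"


def env_generator(env, revision):
    """Cluster generator for one environment, carrying extra values for the template."""
    return {
        "clusters": {
            "selector": {"matchLabels": {"env": env}},
            "values": {"env": env, "revision": revision},
        }
    }


def step(env, max_update=None):
    """One rollout stage: applications whose 'env' label matches, optionally batched."""
    stage = {"matchExpressions": [{"key": "env", "operator": "In", "values": [env]}]}
    if max_update is not None:
        stage["maxUpdate"] = max_update
    return stage


app_set = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "ApplicationSet",
    "metadata": {"name": "baseline", "namespace": "argocd"},
    "spec": {
        # Different environments can pin different versions.
        "generators": [
            env_generator("staging", "main"),
            env_generator("prod", "v1.4.0"),
        ],
        # Staging syncs first; prod follows in batches of 10% of its applications.
        "strategy": {
            "type": "RollingSync",
            "rollingSync": {"steps": [step("staging"), step("prod", "10%")]},
        },
        "template": {
            "metadata": {
                "name": "{{name}}-baseline",
                # Steps match on application labels, so the label is set here.
                "labels": {"env": "{{values.env}}"},
            },
            "spec": {
                "project": "default",
                "source": {
                    "repoURL": REPO,
                    "path": "baseline",
                    "targetRevision": "{{values.revision}}",
                },
                "destination": {"server": "{{server}}", "namespace": "platform"},
            },
        },
    },
}

print(json.dumps(app_set, indent=2))
```

Keeping the stages as an ordered list is what limits the blast radius: a bad commit stops at the first stage whose applications fail to become healthy.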
So there are a couple of gotchas.

First of all, it does not respect sync windows. If you're not familiar with sync windows in Argo CD: I mentioned applications are essentially a policy that says this is the source of truth, this is where it goes, here are the reconciliation rules. Sync windows basically dictate when updates can happen, when the application can be synchronized, and this is a really useful feature. I was talking to somebody the other day and they said, "I want to be able to hit a button so that nobody can deploy anything, because there's an issue and I don't want anyone to touch anything." It's like, sync windows are for you, buddy. Turn on those sync windows and you can basically say: don't allow any synchronization to happen. Well, unless you're using progressive rollouts, because unfortunately application set progressive rollouts do not obey those yet.

Sometimes it does get stuck. So it is still alpha in that way.

Also, selectors are based on application template labels, not cluster labels. That means you can't, for example, leverage the labels that are just on the cluster object in Argo CD. You actually have to push them up into the application object itself. So that's something to be aware of.

And also, all clusters must be healthy, or it could affect larger rollouts.
If a cluster is unhealthy for any reason, it may just prevent the rollout from continuing. If your cluster is having issues for something totally unrelated, it will still stop progressive sync from happening, just because one cluster in that pool happens to not have enough memory to spin up some pods or something.

And then this is probably the worst one: if an application is stuck in pending for too long, it will then be treated by default as healthy. So you could actually have a situation where stuff isn't able to deploy, and it's misunderstood as being healthy because it's been that way long enough. That's not a great gotcha.

In the case of Adobe, we initially rolled out progressive sync, and then with this many gotchas we said, you know what, we're actually going to stop doing progressive sync for now, and as these issues get resolved, we're going to bring it back. So if you're looking for an opportunity to contribute to the Argo project, I've got one for you here. I just laid out the issues. Now you're very familiar with it, and you know why they're important. We appreciate your code and are looking forward to your PR.

Okay, so to wrap up. The goal, again, was to provide infrastructure for Adobe's new developer platform. We leveraged Cluster API to generate all of our clusters. We used Argo Workflows and Argo Events to then trigger those clusters to be added to Argo CD. And then we used application sets to generate all the applications across all of our instances. Like I mentioned, we also have the opportunity to leverage things like app of apps still. They're not a replacement; they're not fighting each other. This way, the infrastructure team is able to manage the baseline for all these clusters, but the end users are still able to deploy everything they need to their clusters, and everybody's doing it in a GitOps way. They're all doing it in a declarative way.
They all have self-healing. They can all take advantage of all the wonderful GitOps tooling that exists.

So what's next? Next up is hopefully getting progressive sync, progressive delivery for application sets, back in there once it's ready, and then obviously looking forward to more opportunities to streamline this and find more efficiencies. But just the way application sets work is already very powerful and has put Adobe on a much better footing for managing this.

Now, if this sounded like an interesting project, this is something that I put out there a little while ago. I said: if you ever want an extra set of eyes on your Argo CD, Workflows, or other Argo setups, my DMs are open. I've worked with quite a few companies in the last two years (well, this was like a year ago, so I should say three years), and I've seen a lot of what works and what doesn't. No strings attached, no money, just a friendly architecture review.

I have loved doing this. As an Argo maintainer, it has given me a chance to talk to a ton of teams about what's working for them, what's not working for them, and how they think about the project. I've been surprised to meet people who are like, "Hey, we really wanted to get GitOps right, so we're going to be moving everything to Argo Workflows." And I'll say, okay, well, what do you mean by that? And it turns out that they heard Argo was really good for deploying software, and when they went to the web page, they saw Argo Workflows and thought it kind of looked like CI/CD.
So they decided to use that. And I'm like, oh, well, I wouldn't build your strategy around that, because actually this is what Argo CD is for. So sometimes it's as simple as that, and sometimes it's more involved.

So please feel free to hit me up on Twitter. I would love to chat with you if you've got an interesting problem, or if you just want somebody to take a look at what you're doing or a second set of eyes. It's been great for me, and at some point I'm going to compile this into a blog post. It also helps me figure out which things we should prioritize development on within the Argo project.

I also have a free giveaway for you. We created (and Kostis is the primary author here, the main visionary) a GitOps certification with Argo, and we're offering a free code. You can get a hundred percent off if you use the code Dan Vancouver. There are not enough codes for all the people in this room, so whoever claims them first gets them, and then they'll go away. You'll get a full environment to run Argo CD. It'll teach you how to build your Git repos, how to do progressive delivery, how to do canary releases, and how to take advantage of the different health checks, and it goes much deeper than just, "Oh, I'm in Argo CD, I can create an application." Okay, well, how's it stored? How do you manage it? How do you change it? So that's a freebie for you.

And that's it. That's my talk. Thanks for coming. I think we have a couple of minutes for questions, if anybody has any.

Yes, Nick? So the question was what event specifically is firing via Argo Workflows that allows that. It's actually just the creation.
We're looking specifically for the creation of the config map that comes with Cluster API. With Argo Events, you can actually say: any time a specific object is created, any kind of object, fire off an event that will trigger a pipeline. So literally, if somebody goes to bootstrap a cluster, it'll just automatically pick it up and run. That is nice, because there are some teams that still do this manually even though they're told not to. But even if they did it manually, it would still trigger all of the downstream effects, in a declarative way, which would then trigger the application sets and everything. Good question.

Back here? So the question is how much of this flow is declarative, and whether there are any imperative operations happening. It's interesting: when you get into writing controllers, you write imperative operations. If you look at Argo CD, the code is imperative, but it's imperative so that it can consume declarative formats. In this case, basing it off of the event is pretty reliable. But what would happen if that event didn't fire, or if it misfired? Well, then we would actually have to have some kind of roll-up that would say, okay, what's happening, what are all the resources that exist, and update them here. So we would just trigger it by updating the resource, and that would re-trigger the workflow to update. It's an upsert operation, so you can write over it again and it's going to be idempotent; it's not going to cause issues in that way. Argo Workflows is obviously an imperative format, but as long as it's done in such a way that it consumes declaratively and has idempotency, we're good to go. Good question.

Over here? Yeah, okay.
So you've got your application sets and you're like, okay, I want to create a cluster for every developer in my company, and there are 10,000 developers in my company. Is that a problem? Well, as I mentioned, yeah, it probably kind of is, because you need to have some way of splitting things up. Now, Argo CD does have scalability tooling in it. You can increase the number of repo servers, and you can basically make one repo server for each cluster. I've never tried to do 10,000. I think the most I've done is like 2,000 clusters, and repo server will scale up, but you're probably going to run into other scalability issues. For example, if you have that many clusters and each of them has applications on it, your UI in Argo is probably just not going to load very well. At that point you probably want to split up and have an additional instance, or have something like a control plane. Within Codefresh, for example, all my instances roll up into a single view, so I can just go into the control plane and filter the applications; I don't care what instances they're on. That's like an extra tool for scalability. So that's something to be aware of, but there are definitely some other ones. There are some good blog posts on it, and I'm happy to chat with you afterwards.

Back here? Clusters? Oh, awesome, so that's an Argo CD talk. Excellent. So be back here at 2:20, 2:25, and he's going to show you 20,000 clusters on a single Argo CD instance. Oh, start with 10,000. Okay, awesome. I'm excited to see that. Great, thank you.

Any other questions? Yeah? Well, server-side apply. Argo CD does support server-side apply, as of Argo 2.5, maybe 2.4, and server-side apply is going to allow things like controllers to run so they can do some modification and stuff.
However, you actually bring up a very interesting use case. In the case of Argo CD, we don't do server-side apply for diffing, and the reason is that if we did, we would have to hit the Kubernetes API infinitely more, and from a scalability perspective we found that was probably a bad idea. So (most people don't know this) we actually re-implemented, and by "we" I mean not me but people within the Argo project, the Kubernetes API to create a fake server-side apply dry run for diffing. When it comes to the diffing engine, it's a matter of efficiency. If you actually did it off of a server-side apply dry run, it would allow the controllers to run, and so you would get a more accurate perspective, but it would be a challenge from a scalability standpoint. So the ignore differences stuff actually works really well, because you can basically delegate fields and their responsibility to other controllers, and that ends up being a simpler way to do it and a lot more scalable. But I think you make a good point about just syntactically how it would work. I mean, you could do it that way, but for the scale at which most people are deploying, Argo CD would be the bottleneck. Great question.

I think we have time for one more. They say no more. Okay, so thanks, everybody, for coming. I appreciate it. Definitely check out Kostis's talk, which is going to be later today in this same room, and this gentleman's talk with 10,000 Kubernetes clusters; that'll be excellent. I'll be hanging out, and you can find me on Twitter at @todaywasawesome. Thank you so much.