I'm happy to be giving my first talk in my own home country of Canada, though I am from the East Coast, so I still get stuck staring at the mountains outside. I don't know if that's just me, but I'm in awe. So today's talk is on Argo CD multi-cluster architectures. Throughout the talk, I'm going to ask some show-of-hands questions, and I appreciate any participation, because it helps me guide the talk toward the people who have chosen to attend it. Thank you.

To start off, maybe I should explain why you should listen to anything I have to say. First and foremost, before my current role, I was a platform engineer at a company in Canada called Ratehub. Anybody heard of Ratehub.ca? Hey, okay, cool. We got a couple. Nice. Normally I get nothing, especially when I do these talks in the States. As a platform engineer there running Argo CD, I got to implement it into their Kubernetes environment, which was a dev, stage, and prod cluster, so fairly basic, early in their Kubernetes journey. But I feel the pain: I have had to teach developers how to use Argo CD and also learn how to implement it effectively. In my current role, I'm a developer advocate at Akuity, the first logo you'll see on the left there. Akuity is the company created by the founders of the Argo project. Akuity actively maintains the Argo project and provides an Argo CD SaaS offering with some architecture improvements and innovation beyond the open source project, but I'm not here to talk about that. Today it's all open source multi-cluster architectures. I've spent the last seven months or so working with the Argo project community, talking to end users such as yourselves and learning about your challenges, the same challenges I faced when I tried to implement Argo CD. I'm a member of the Argo project and a contributor to Argo CD, and I'm also the coordinator for the newly founded SIG Scalability for the Argo project. And if anybody actually cares about certs anymore, I have my CKA, but nobody's ever asked about that, so I don't know why I include it.

So we already know the topic for today: which Argo CD architecture is best? That's the question I'm going to answer today, and here it is: it depends. It highly depends on the organization that's implementing Argo CD. Each architecture is going to have its pros and its cons, and how severe a pro or con it is will, once again, depend on your organization. So we'll do one of those show-of-hands questions I mentioned earlier: who here is currently running Argo CD in some capacity? Okay, that's going to help, because I don't really go over a lot of the Argo CD basics. This talk is focused on the cluster administrator's perspective of implementing it, so if you're missing some pieces, come talk to me after and I'm happy to explain anything that's missing.

But again, the answer is: it depends. It really depends on the perspective you're looking at Argo CD from. If you're, say, non-technical business leadership, Argo CD is just software that helps your organization achieve faster time to market by automating the deployment of application features and updates. Pretty high level, pretty basic, but focused on business outcomes. But say you're an application developer, which I typically view as the end user of Argo CD.
Argo CD is a tool that automates the deploying, updating, and managing of applications using, say, a command-line tool, an elegant web UI, or, probably more aptly, configurations in Git, I would hope, given where we are right now. And then finally, from the perspective of a cluster administrator, which, again, is how I'm framing this talk, since we are talking about implementing Argo CD into multi-cluster environments: Argo CD is a control plane for automating Kubernetes cluster management using continuous delivery and GitOps practices. So, getting into the specific details: you're already aware of continuous delivery, already aware of GitOps practices, and now you're interested in implementing Argo CD into your multi-cluster environment. Simply put, I describe Argo CD as an extension of Kubernetes. All of Argo CD's workloads and configurations are represented as resources in Kubernetes. And from a broader perspective, Argo CD is part of the Argo project, which is a suite of tools consisting of Argo Workflows, Argo CD, Argo Events, and Argo Rollouts.

So here are the two core models that you're probably already aware of to some extent, and I tend to consider them opposing ends of a spectrum. You have the management cluster model, which I've often heard referred to as the hub-and-spoke model. I'm not old enough to have lived through the pain of the hub-and-spoke networking model (frankly, IPv6 is older than I am), so I go with the management cluster model. In this architecture, Argo CD runs in a central cluster, one that is outside of your normal application or service clusters. On the other opposing end is the per-cluster model, where for every cluster that you have, you have a standalone instance of Argo CD running in it.

So let's talk about the advantages of the management cluster model. Primarily, you have a single view of all of your deployment activity across all of your clusters. You essentially have this single control plane that simplifies the installation and maintenance of Argo CD, because you're only setting up things like RBAC, SSO, repo credentials, and AppProjects in one location. It also means you have a single server for easy API and CLI integration, because if you've got multiple instances, you have to specify which Argo CD instance you're running a given Argo CD command against. And for your teams, it means your end users, your application developers, have only one place they need to go, and when they log in, they can see everything they need to see to do their work.

But of course, there are disadvantages to this. The primary disadvantage of the management cluster model is that you're going to have to scale the individual components of Argo CD as you add more and more clusters. You have one instance of Argo CD, and it consists of three components: the API server, the repo server, and the application controller. The challenge is that as you add more and more clusters, you're adding more and more pressure onto those components. The API server is fairly easy to scale: it's stateless, it's frankly just an interface between you and Kubernetes, and you basically just add more replicas. Typically you don't have to worry too much about it. The repo server is also fairly easy: you add more replicas, but more importantly, you're going to have to add more resources, because the repo server is what's doing the manifest generation. As you add more clusters, you're probably adding more applications, which means there are more manifests to render: more Helm processes, more Kustomize processes, and more kubectl processes that need to run to produce the underlying manifests that get applied to the cluster. So again, not too challenging to scale at reasonable scales; you just add more replicas and more resources.
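To make those scaling knobs concrete, here's a minimal sketch, assuming you install Argo CD with the community argo-cd Helm chart; the replica counts and resource figures are illustrative, not recommendations:

```yaml
# values.yaml sketch for the community argo-cd Helm chart
# (numbers are illustrative; tune them to your own workload).
server:
  # The API server is stateless, so scaling out is just more replicas.
  replicas: 3
repoServer:
  # More replicas for parallel manifest generation...
  replicas: 4
  # ...and more CPU/memory, since Helm/Kustomize/kubectl rendering is
  # the expensive part as the number of applications grows.
  resources:
    requests:
      cpu: "1"
      memory: 1Gi
    limits:
      memory: 4Gi
```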
The challenge is the application controller. Frankly, it sucks to scale. You basically have the option to shard it per cluster. You can add more resources, more memory and CPU, but that only gets you so far; in reality, you're going to have to add more instances of the application controller. The challenge with that is that the application controller has to have a one-to-one or one-to-many relationship with clusters. You can't have multiple application controllers managing one cluster without, frankly, setting up multiple instances of Argo CD at a namespace level. It also runs as a StatefulSet, which means you have to roll out a new instance of it when you add more clusters if you're going with a one-to-one relationship. And you also have to update an environment variable that tells the application controller how many replicas of itself it's running. Frankly, it's a challenge to automate that as you add more clusters to keep scaling the application controller.
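To ground that, here's a sketch of the relevant fields only, trimmed rather than complete manifests: the StatefulSet replica count and the replicas environment variable have to be kept in sync, and a cluster secret can optionally pin a cluster to a specific shard.

```yaml
# Sketch of the fields involved in sharding the application controller.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            # Must match spec.replicas so each shard knows the total count.
            - name: ARGOCD_CONTROLLER_REPLICAS
              value: "3"
---
# A cluster Secret can optionally pin a cluster to a specific shard.
apiVersion: v1
kind: Secret
metadata:
  name: cluster-prod  # hypothetical
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: prod
  server: https://prod.example.com:6443  # hypothetical API server URL
  shard: "2"  # assign this cluster to controller shard 2
```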
So that's the scaling side of the disadvantages. The other side is that you now have a single point of failure for your deployments. If Argo CD goes away in the management cluster model, you can't deploy to anything, not only dev but prod. You have to understand as an organization: is that something you can tolerate, and what are your processes for recovering that management cluster and that instance of Argo CD? And that's a single point of failure from a reliability perspective. It's also a single point of failure from a security perspective, because the management cluster requires direct access to the API server of all of the downstream clusters, and it needs cluster-admin credentials. So if that management cluster gets compromised, so do all of your other clusters, with full access to them from the management cluster. And speaking of the management cluster, you have to maintain another cluster. Depending on how mature your organization is in its Kubernetes journey, that can be a large burden if you were really only expecting to manage dev, stage, and prod and you're now adding a fourth cluster. I've been through this discussion; it can be a lot to justify the management, the cost, the resources: okay, so what are the processes for managing this fourth cluster? It doesn't fall into what we're used to. All things to take into consideration. And the last point here is that there can be significant network traffic introduced between the management cluster running Argo CD and the downstream clusters it's managing, because the application controller has to watch all of the events from the Kube API server; that's what powers the application controller, right? When a deployment gets deleted on a downstream cluster, the management cluster needs to be aware of that, so that the application controller can react to it and, say, run the self-heal policy to redeploy that Deployment resource. Streaming all of those API server events from your application clusters to your management cluster can cost, by some end-user testimonies, thousands of dollars a month just in network traffic.

So let's go to the other side, which is the per-cluster model. Instead of having a dedicated cluster that you run Argo CD in, you're now saying: let's just put an instance of Argo CD into every one of our clusters. The advantage here, innately, is that you're distributing the load across your clusters, or to put it better, you have a copy of the Argo CD components per cluster, so that as you add more and more clusters, you're not adding more load to a central management cluster instance. This is great because you add more clusters, you get more Argo CD, and it works. You can even tune Argo CD in each cluster to match it. So if you've got one really big cluster, say your development environment, because you're running a namespace for every developer or something, and then you've got prod, which is simpler, it's only one copy of the application, you can tune Argo CD per cluster to match that.

The other advantage is that you no longer need external access to the cluster API server from a management cluster. Each of the clusters with Argo CD running in it can be on a private network, entirely self-contained within the cluster network, only requiring access out to, say, Git. And you can even solve that by running self-hosted Git. By doing that, you save on the network traffic costs, since you're no longer streaming events up to a management cluster, and you also eliminate the security concern of exposing the API server. Now, there's a bit of a debate around that: exposing the API server isn't inherently a security risk, because there are credentials and certificates required to access the API server and actually do anything with it. But it is an attack vector, and not having it open can help in a scenario where, say, a kubeconfig gets leaked; an attacker would still need access to your internal network to reach your cluster's API server.

The other advantage is that an outage in one cluster is not going to affect the rest of the clusters. You don't have to worry as much about dev going down, or one cluster going away, blocking production deployments or cascading to the rest of your clusters and preventing you from doing deployments. There's no longer that single point of failure if Argo CD goes away for whatever reason. And frankly, as platform engineers, this can be an important point, because your application developers are your end users. Even the dev cluster should really be treated as production in terms of reliability, in an internal sense, because if the dev cluster breaks for whatever reason, that can prevent them from doing the work that's important to the business, from being productive. It's important to take that into account, and with the per-cluster model you can, say, add a fourth cluster called the lab environment and break Argo CD in there without affecting your end users. But frankly, that's a bit of a tangent, so I'll move on.

And finally, the credentials Argo CD is using are scoped to that cluster. The Argo CD instance exists in the cluster, uses a service account, and uses the local cluster network to get to the API server, so you no longer have a central place where you're storing all of your cluster-admin credentials, just waiting to be compromised.
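As a sketch of what that looks like in practice: in the per-cluster model, an Application targets the special in-cluster destination, so the controller uses its own ServiceAccount over the local network instead of a stored kubeconfig. Names and the repo URL here are hypothetical.

```yaml
# Sketch: an Application in the per-cluster model. The in-cluster
# destination means no external credentials are stored anywhere.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app  # hypothetical
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/team/app-manifests.git  # hypothetical
    targetRevision: main
    path: overlays/prod
  destination:
    server: https://kubernetes.default.svc  # the local cluster, via ServiceAccount
    namespace: my-app
```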
But of course, there are disadvantages to this model as well. And don't let the slide fool you: there are only two points here, but I should have put that first one in bold, because it can really be cumbersome to run multiple instances of Argo CD, depending (remember, it depends) on how many clusters you're running and how you manage them. For every cluster, you're duplicating configuration like the SSO connectivity, the RBAC rules, and the repo credentials, and for every cluster you have to maintain that. It also means, and I mentioned this as kind of a pro earlier, that you get to tune an Argo CD instance to match the cluster it's running in, but you also have to keep track of that. You have to be aware that this cluster has something different from that other cluster, and you're starting to introduce inconsistencies between these environments. If you've got Argo CD tuned in a specific way for production, but it's not like that in stage, and your deployment works in stage but breaks in prod, that's a result of the differences between the environments creating that inconsistency. The goal is to have environments that are as production-like as possible, which you can do like this, but you have to be aware, when you have multiple instances: are you going to choose to make them different, or is the priority to keep them the same to maintain that consistency?

And speaking of consistency, with multiple clusters and an instance in each one, your end users, your application developers, and your API and CLI integrations need to know where to go to reach these Argo CD instances. If they're deploying apps to multiple clusters and they want to check how a deployment went by going to the Argo CD instance, they need an easy way to know which instance to go to. You want to reduce the cognitive burden of "hey, I want to look at my deployment" as much as possible, because nobody's paying them to figure out where Argo CD is. Their job is to produce software for the business, so every minute they waste searching for Argo CD is time that, frankly, they shouldn't have to spend.
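Going back to the duplication point, one common way to soften it is to keep the shared configuration in Git and have every instance sync the same manifests, so the config is defined once even though it runs everywhere. A minimal sketch, with hypothetical role and group names, of an argocd-rbac-cm you'd sync to each instance:

```yaml
# Sketch: shared RBAC configuration kept in Git and synced to every
# Argo CD instance, so the duplication lives in one repo instead of
# N hand-edited clusters. Role and group names are hypothetical.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  policy.default: role:readonly
  policy.csv: |
    p, role:deployers, applications, sync, */*, allow
    g, platform-team, role:admin
```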
So those are the two core models. Now we can get into a slightly more advanced section: hybrid models. You take the two ends of the spectrum and you blur them. Two, I would say, common hybrid models are the instance per logical group and Argo CD managing more Argo CDs. Each of these is a combination of the previous models, meaning they come with all the pros and cons I've already described, to varying degrees.

So in this case, we've got the instance per logical group. Frankly, the advantages and disadvantages here are all the ones from the previous slides, but now it depends on which group you're in. For every group, you're getting the advantages and disadvantages of the management cluster model, because you're running Argo CD in a management cluster and it's managing some subsection of your downstream clusters. The logical group in this context is however your organization structures itself: a business unit could be a logical group, a team could be a logical group, any of that. Ultimately, you end up with load distributed per group, an outage in one group won't affect the others, and credentials are scoped per group. Configuration duplication is, in theory, reduced compared to the per-cluster model. But it's still duplicated, right? One of the disadvantages is still that you're maintaining multiple instances of Argo CD. And you're still running at least one management cluster; if you choose to run one management cluster and it goes away, all of your Argo CD still goes away, so you still have that disadvantage of the management cluster model. So do you then choose to run a management cluster per logical group? Well, it depends. Is your organization willing to take on those extra costs? Is your organization comfortable running more and more Kubernetes clusters? How mature are you in your ability to stand those up and maintain them? So that's the instance per logical group: basically the management cluster model with a bit of per-cluster sprinkled in.

This is kind of the other end of that: the management cluster model with an instance per cluster. You've got your control plane cluster running Argo CD, and it's connected to all of the downstream clusters, but as part of that connection, it deploys a new Argo CD instance into each one. The advantage here is that if your organization is structured in a way where, say, every team gets a Kubernetes cluster, that becomes their Argo CD instance. And if you're at a point where you want to give them ownership of their Argo CD configuration to some extent, say they can create all the Applications they want within the AppProject restrictions you put in place, this is a good model for that, because everyone gets their own Argo CD instance. But then, it depends. Are you comfortable running multiple Argo CD instances? Are you prepared for the burden of "we want to upgrade Argo CD from 2.6 to 2.7 because it's been three months and there's a new release"? What processes do you have in place to manage that? This is great for, say, a cluster per team, or maybe edge clusters, where you want to do that initial bootstrapping from a central control plane, but in the end you want these clusters to be mostly independent, and sometimes you might not even have direct access to them from a central control plane after that initial bootstrapping.

Now, the question you might ask is: cool, so I run an instance per team; is that great for multi-tenancy? Well, it can be. If you need strong isolation, so that every team has a totally separate instance of Argo CD and they're not sharing a repo server, which could potentially leak secrets to anyone with access to that namespace because it's doing all the manifest generation, then it's really good. But if all you need is rules in place to say these teams can do this in this cluster, that's what AppProjects are for. So is it worth taking on the additional burden of running an instance per cluster, or should you just use AppProjects, the functionality built in to do that multi-tenancy, when you're not super concerned about serious isolation between the components?
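For the lighter-weight option, here's a minimal AppProject sketch that fences a tenant in without a dedicated instance; the team, repo, and cluster names are hypothetical.

```yaml
# Sketch: an AppProject as the built-in multi-tenancy mechanism.
# It restricts which repos a team can deploy from and which
# clusters/namespaces they can deploy to.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: team-payments  # hypothetical
  namespace: argocd
spec:
  description: Payments team applications only
  sourceRepos:
    - https://git.example.com/payments/*   # hypothetical repo glob
  destinations:
    - server: https://prod.example.com:6443  # hypothetical cluster
      namespace: payments-*
  clusterResourceWhitelist: []  # no cluster-scoped resources allowed
```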
And frankly, this hybrid architecture, a central control plane that everybody can go to with access to all of the clusters, combined with the scalability of an instance per cluster, is what the Akuity platform is attempting to solve. That's where the architecture improvements of the Akuity platform come in.

So, as part of this, I'm going to leave you with some questions to ask yourself, like: what is the meaning of life? Or: how much Argo CD downtime can you tolerate? Is it okay if Argo CD is gone for three hours while your platform team is attempting to bring the management cluster back up, or is that going to cost millions of dollars because you can't do a new deployment, and you've missed some deadline in a contract you have with another provider, and that breaks an SLO? How disposable are your clusters? If you want to get rid of a cluster and spin up a new one, can you do that? Are they ephemeral pieces of infrastructure yet, or are they still pets? Do you have dev, stage, and prod, and you treat them nicely and they never go away? How sensitive are you to drift between your environments? If one Argo CD instance is slightly different from another, is that going to have a substantial impact on your deployments? Are you doing deep integrations with the PreSync and PostSync hooks of Argo CD, and if those are slightly different between instances, how severe an impact is that going to have? Will your number of clusters grow over time? Are you prepared to keep tuning the individual components of Argo CD as you add more and more clusters, to support the additional load you're putting on a central control plane, or would you rather take on the burden of managing an Argo CD instance in each cluster? And frankly, how crucial is isolation between your teams and tenants? Are your end users different customers, because you're a professional services firm consulting and building infrastructure for them, and maybe hosting some of it? Or are these internal teams, where really the only thing that matters is RBAC, because you're comfortable running a central repo server for all of your end users? And how many applications will you manage with Argo CD? Is that a static number, or will it grow significantly in the future? It's important to ask these questions, because they all affect how Argo CD is going to run and how much of a burden it is to manage.

So I'm going to leave off with some final thoughts here. If you're interested in more Argo CD content, I recommend checking out the Akuity blog, where we write about things like Argo CD architectures, the inspiration for this talk; or how to manage Kubernetes secrets with GitOps, our most popular post; or maybe something fun like best practices, using Kyverno and Argo CD to enforce them for your end users. Speaking of SIG Scalability, which I mentioned in the beginning, we are looking for end users of Argo CD. Whether you're running at a small scale and struggling to find the right information on how Argo CD will scale in the future, or you're running 10,000 ApplicationSets like one of our members and want to know how Argo CD is going to handle that, or you're actively facing scalability challenges, I recommend coming and joining SIG Scalability, where you can meet other end users like AWS and Red Hat and IBM and Intuit, hang out, and tell us your stories. Join the Argo SIG Scalability channel on the CNCF Slack. It's public; post interesting things you find, bugs you're running into, or an issue related to scalability that you want to bring to the maintainers' attention.
We host bi-weekly meetings on the second and fourth Wednesday of each month, and you can request to become a member of the SIG by creating a pull request at that URL. In the SIG, we do have people running 10,000 ApplicationSets in production, with one Application per ApplicationSet. It's these weird, niche use cases that are super interesting to hear about, and that people are actually using to solve problems they have. Finally, if you've got teammates or colleagues who didn't get a chance to come to GitOpsCon and you want them to get up to speed with continuous delivery, GitOps, and Argo CD, I recommend checking out our free course, where you can get certified by the founders of the Argo project on this knowledge. Here are some things people have said about it. That's my talk. I have, I believe, five minutes for some questions.

Yes, sir. Yeah. So the question is how the app-of-apps pattern plays into managing multi-cluster architectures, with this distribution of Argo CD in a one-to-one or many-to-one relationship with clusters. Right. Frankly, in my experience, you want app of apps every time. Every Argo CD instance you have is going to have an app of apps, because unless you're doing some edge-case anti-patterns, you're going to want to manage your Application resources and all of the other Argo CD configuration declaratively. So you're going to have an app of apps. The pattern is typically: install Argo CD, install the app of apps, and then Argo CD does everything else. So no matter which architecture you choose, you're probably implementing that. And, let's see if I can go back to it: in the management cluster model, you're probably going to have maybe a top-level app of apps, or, I think in modern Argo CD, an app of apps that deploys an ApplicationSet that then deploys an app of apps for every cluster. And in the per-cluster model, you kind of get rid of that top-level abstraction and just have an app of apps for each instance. Okay, thank you. You're welcome. Any other questions I can answer? It doesn't even have to be talk-related; frankly, any Argo CD question, I'm happy to answer. If not, you can always catch me out in the hall, and I'm always happy to talk Argo. Look for the red shirt. Okay, cool. Well, thank you, everybody, for joining. I really appreciate it. Yeah.
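For reference, here's a minimal sketch of the app-of-apps bootstrap described in that answer: one Application points at a Git directory containing the rest of the Application manifests, so everything else is managed declaratively. The repo URL and path are hypothetical.

```yaml
# Sketch: the bootstrap "app of apps". Install Argo CD, apply this one
# Application, and Argo CD syncs the rest of the Applications from Git.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-of-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/argocd-apps.git  # hypothetical
    targetRevision: main
    path: apps  # directory of Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```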