All right, hello, everyone. Welcome to ArgoCon. I hope you're all enjoying yourselves so far and learning lots. My name is Michael Goodness. I'm a principal DevOps engineer on the cloud platform team at Major League Baseball. We're a six-person team, half of us working out of the MLB headquarters in New York; the other three of us work remote. I'm actually based out of my home, just up the road in Milwaukee, Wisconsin. Our primary focus is working with dev teams to establish best practices around cloud infrastructure, CI/CD, and Kubernetes in particular. We assist those teams in adopting those best practices and then support them all the way through production. We also manage many of the shared services, such as GitHub Enterprise, Terraform Enterprise, Artifactory, et cetera. And we work very closely with our systems infrastructure team, site reliability engineers, and our InfoSec folks. So, let's play ball.

We've been running Argo CD self-hosted since 2019. As you may have gathered, it was the very first commit I made to our configuration repository, just one month after I joined MLB. We had just begun a migration to GCP as part of a sponsorship deal and were rolling out Kubernetes on a wide scale across the organization. I was in a new environment and felt like I needed something familiar, a safety blanket. I had worked with Argo CD at my previous job, and it really helped me get my bearings in the new organization. Within a month, we had already determined a need to roll out multiple applications to multiple clusters, including a cluster in every one of the Major League ballparks. So pretty quickly we found a use for Argo CD, because that's exactly what it does.

Any diehard baseball fan will tell you it's all about the stats, and this is no exception. So, by the numbers: we run 210 clusters. 119 of those are running in Google Cloud on their managed GKE platform. A portion of our ticketing infrastructure runs out of Oracle Cloud, so we use their managed OKE service. As I said, we have a Kubernetes cluster running in every one of the Major League ballparks, using Google's Anthos bare metal software. And in the Triple-A minor league ballparks and a number of partner leagues, we run K3s clusters. Those are generally pretty small, but we do a lot of on-site stats processing, we do some ML on player positions, and we're working on the automated ball-strike system that, again, if you're a baseball fan, you've probably heard of.

All told, we have 6,200 applications under Argo CD, so when you load it up, you're scrolling through 6,200 applications. Now, only 900 of those are unique; that's why I led with the big number. But because we're deploying applications to every ballpark, that adds up, and we're deploying multiple versions of software to non-prod and prod clusters, so that's how we go from 900 unique to 6,200 total. 28 of those unique applications are what we call cluster services. These are components we consider essential to providing Kubernetes as a platform. Our team deploys these applications, manages them, and makes sure they get upgraded on a regular basis: things like GitHub Actions runners, cert-manager, Datadog, external-dns, ingress-nginx, and, what, 23 more. All in, we have 59,000 resources being managed by Argo CD across all of these applications.
We have anywhere from 200 to 400 deployments happening through Argo CD every day. That's not always a brand new version of something; it might just be a small configuration change. But across 30-plus developer teams, it works really well.

As for our configuration: we run 10 application controllers using the native sharding. That's kind of an arbitrary number. Until recently, sharding was pretty hit or miss; you couldn't, and I guess still can't, really dictate which clusters and applications end up in which shard. So we experimented and landed on 10 replicas as our sweet spot for that StatefulSet. Because it is a StatefulSet, you have to size every replica for the biggest shard, and for us that was 24 virtual CPUs and eight gigs of memory. Needless to say, there's some waste there, because some of our application controllers aren't using anywhere near that much.

We only run one repo server. The main reason is that we've had some struggles keeping our GitHub Enterprise server happy. When you look at the number of fetches Argo CD does over the course of a day, combined with developers doing fetches, clones, and pushes from their desktops, combined with GitHub Actions cloning repositories, all of that traffic adds up, and we had a really hard time keeping it in check; it would run away from us and knocked our GitHub server down several times. One of the things we did to address that was running just one repo server.

We don't run Redis in the cluster. We instead use GCP Memorystore for all of the caching. That's a managed service, so we don't have to worry about our cache disappearing.

We run the Dex component with the GitHub connector for authentication and authorization. The way we have it set up, we associate each of our Argo CD app projects with some number of organizations and teams in GitHub. So when a new team member joins an organization and needs access to sync or modify an app project in Argo CD, all they have to do is be added to the correct organization in GitHub, and that becomes our single source of truth.

We maintain a monorepo for all of our Argo CD manifests across all of our dev teams, so all of the application manifests, app projects, and application sets live in a single repository. When a developer needs to make a change to one of their applications, whether that's an addition, a deletion, or just changing something about the application itself, they do it through the normal GitHub pull request flow. That used to kick off a GitHub Actions workflow that would take the changes from that push and apply only those to Argo CD, so it was a push model. We've since moved to the more popular and widely used app-of-apps model, so it's become more of a pull: somebody pushes a change to the monorepo, which fires the Argo CD webhook, which refreshes the app of apps, and any changes get applied through that.

However, we disabled prune on that app of apps so that it forces a two-step deletion process. If somebody wants to delete an application from Argo CD, from a cluster, they first delete the manifest from the monorepo, and then they have to go into the Argo CD UI and deliberately click delete, with a cascading delete, to remove the application. That way we eliminate the risk of somebody accidentally deleting somebody else's manifest from the repo and the application being wiped off the face of the earth.
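Sketched below is roughly what that looks like: an app-of-apps Application with automated sync but prune left off. This is a minimal sketch, not the exact manifest; the repo URL, path, and project names are hypothetical stand-ins.

```yaml
# App-of-apps with prune disabled: a minimal sketch, assuming hypothetical names.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: team-apps            # hypothetical name
  namespace: argocd
spec:
  project: platform          # hypothetical AppProject
  source:
    repoURL: https://github.example.com/platform/argocd-monorepo.git
    targetRevision: main
    path: apps               # the child Application manifests live here
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: false           # removing a manifest from Git marks the child app
                             # out of sync, but never deletes it automatically
```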
We make very heavy use of application sets. They've been a godsend, and that's no exaggeration: when you're deploying this many applications to this many clusters, it's really nice to only have to write one YAML file to make it happen. We rely primarily on the cluster generator. Each of our clusters has a set of labels, and the application set uses those labels to template out each of the individual applications. That's primarily what we use for our cluster services, which go to all, or at least some subset, of the clusters. We've started using the Git generator as well, and we're really excited about the new-ish pull request generator for spinning up ephemeral preview deployments. These are deployments that are triggered by the creation of a pull request and live only as long as the pull request; as soon as the pull request is closed, the application set tears down the deployment it created. If you're interested in that, and you probably should be, there's a talk right after this one, I'm not sure which room, that goes into an actual case study on the pull request generator.

One of the problems we had to solve was secrets. GitOps says you keep as much configuration in Git as possible, but common sense says you don't commit plain text secrets to a Git repository. There are, as with most things, many ways to solve this problem. Maybe you deploy HashiCorp's Vault and your application code pulls secrets as it needs them. You might also deploy something like the External Secrets Operator, which can pull secrets from your secret store into Kubernetes, and then you can use the Secrets API to mount those secrets into your applications as needed.

In our case, we use a command line tool called SOPS, which as of a few months ago is actually a CNCF Sandbox project; Mozilla donated it to the CNCF. It's a command line utility for encrypting and decrypting plain text files using public key encryption. The workflow: a developer has the SOPS CLI on their machine and a YAML file, maybe a Helm values file, with some sort of secrets in it. The developer runs SOPS against that YAML file, which encrypts it in place using the key the dev has provided. That file is now safe to commit to Git. When it's time to consume that values file, Argo CD also has access to the key that was used to encrypt it. So it pulls the values file, decrypts the values, and then uses them, in our example in the Helm flow, to template out the manifests and apply them.

There are a couple of plugins that we use. One is KSOPS, which is for Kustomize, and the other is called Helm SOPS. Under normal circumstances, you might think Helm SOPS would have to be installed as a plugin, that you'd have to create a configuration management plugin in Argo CD to be able to use it. Fortunately, you don't. Helm SOPS is actually a wrapper: when you have Helm SOPS installed, you call it as Helm, and it in turn calls the actual Helm. Helm SOPS does the decryption and then calls Helm to do the templating and everything else Helm does. So what does that look like for Argo CD?
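Roughly like the following. This is a hedged reconstruction of the kind of repo-server patch being described, not the exact manifest: the installer image name and binary paths are assumptions, but the rename trick is the heart of it.

```yaml
# Sketch of a repo-server patch for the Helm SOPS wrapper. The helper image
# is hypothetical; the mechanics follow the talk: the real helm becomes
# _helm, and helm-sops is installed in its place as helm.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
spec:
  template:
    spec:
      volumes:
        - name: custom-tools
          emptyDir: {}
      initContainers:
        - name: install-helm-sops
          image: registry.example.com/helm-sops:latest  # hypothetical image with both binaries
          command: ["sh", "-c"]
          args:
            - cp /usr/local/bin/helm /custom-tools/_helm &&
              cp /usr/local/bin/helm-sops /custom-tools/helm
          volumeMounts:
            - name: custom-tools
              mountPath: /custom-tools
      containers:
        - name: argocd-repo-server
          volumeMounts:
            # helm-sops takes the place of helm; it decrypts, then calls the
            # real binary, now named _helm, to do the rendering.
            - name: custom-tools
              mountPath: /usr/local/bin/helm
              subPath: helm
            - name: custom-tools
              mountPath: /usr/local/bin/_helm
              subPath: _helm
# Note: the repo server also needs access to the decryption key,
# for example via a KMS binding or a mounted key, not shown here.
```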
And I swore I would only have one page of YAML, and I'm happy to report I was successful. What we do is use the custom tools facility in Argo CD. For those of you who aren't aware, you can use an init container to copy additional tools into a volume that's mounted into the repo server. So that's what we do here. Following along with the init containers: we have a container image that contains both the Helm executable and the Helm SOPS wrapper. We copy both into the custom tools volume and mount them into the repo server, but we do a rename. We rename the Helm executable to underscore Helm, and we rename Helm SOPS to Helm. That enables the flow I described: when Argo CD goes to render using Helm, it actually calls Helm SOPS, which does the decryption, and then Helm SOPS calls the real Helm to do the rendering. It's a little funky to get set up, but it works really well. We've had absolutely no problems with it.

Another thing we do a little differently is something called the MLB app chart, which a good 80, in fact at this point probably closer to 85 or 90, percent of our teams use so they don't have to write their own Helm charts. It's very generic: you plug in the values you need and you get the functionality out the other side. Because we're aiming to fit the 80 percent use case, it's got a lot of functionality built into it. It supports Deployments and Argo Rollouts; I'm sorry to report that nobody is actually using the Argo Rollouts feature yet, and I'm really hoping we can light a fire under folks around that. It supports pod disruption budgets and full ingress support, so if you want TLS in front of your service, we can use cert-manager, which, remember, is a cluster service in each of the clusters, to go out to Let's Encrypt, provision the cert, and attach it; and we use external-dns to create the DNS records for that service. We support both the Kubernetes Event-driven Autoscaler (KEDA) and the native horizontal pod autoscaler. If teams want their metrics scraped by Datadog, it's just a few values away. Of course we support SOPS-based secrets and a number of different volume options. And if teams want to, say, run a database migration before a new version of the software is rolled out, we've enabled them to run a PreSync hook job, which again is just an Argo CD feature.

The downside of the MLB app chart is that it's 1,058 lines of YAML. We do publish a values schema, which has been available as a Helm feature for several years, but I don't know that I've run into more than a handful of Helm charts in the wild that actually publish one, and the reason is probably that ours is more than 2,000 lines. The pros of this approach: we manage it. We're very strict about semver; we won't make a breaking change unless it's part of a major version. And if there's a Kubernetes API deprecation on the horizon, we can make that update ahead of time in the Helm chart, and any team using it just gets that for free. If they're following the latest minor version of the MLB app chart, their application just gets it, and they don't have to worry about API deprecations. The downside is, again, it's super complex, because we're trying to solve everybody's problems, so when we add a new feature we have to make sure we're not breaking existing functionality.
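To give a flavor of what a consuming team actually writes, here's a sketch of the kind of values file a chart like that takes. Every key name here is hypothetical, invented for illustration; it is not the real MLB app chart schema, just a mapping to the features listed above.

```yaml
# Hypothetical values for a generic "app chart" like the one described.
image:
  repository: registry.example.com/team/scoreboard-api
  tag: 1.4.2

workload: deployment        # or "rollout" for Argo Rollouts

podDisruptionBudget:
  minAvailable: 1

ingress:
  enabled: true
  host: scoreboard.example.com
  tls: true                 # cert-manager provisions the Let's Encrypt cert;
                            # external-dns creates the DNS record

autoscaling:
  type: hpa                 # or "keda"
  minReplicas: 2
  maxReplicas: 10

datadog:
  scrapeMetrics: true

sopsSecrets:
  - secrets.enc.yaml        # decrypted at render time via the SOPS flow

preSyncJob:                 # e.g. a database migration before rollout
  enabled: true
  command: ["./migrate", "up"]
```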
All of that flexibility means there's a lot of logic in there, a lot of if statements, and debugging becomes pretty difficult.

Our CI is pretty simple. We use GitHub Actions for our workflows. If you were in here for the last couple of talks, it's the standard CI workflow: you build the image, you push the image, you update a YAML file with the new tag, and Argo CD does the rest. We don't get any fancier than that. Our standard workflow supports repositories in which the code lives alongside the config, but we also support, and actually prefer, separate config and code repositories. And if teams have adopted protected branches, so that even that image tag bump requires a PR, that can be done through GitHub Actions as well: it creates a PR, and somebody has to approve it before it gets merged to their deployment branch. Which then, of course, calls the Argo CD webhook, so we don't have to rely on polling.

So, a few of the challenges we've faced over the years: the application controller sharding I mentioned before and general resource waste. 6,200 applications results in a pretty slow UI. Caching issues. The Git traffic issues. And monitoring: Argo CD exposes a ton of metrics, but we're not Argo CD experts, and we weren't really sure how to use those metrics to optimize our setup. I'd like to say that we solved all of these, that we put our noses down, put some serious engineering effort in, and figured it all out. But the truth is, we just made it somebody else's problem. In mid-July, we partnered with Akuity to become customers of their managed Argo CD platform.

Our current setup is one Akuity agent and one application controller in each of the clusters. The advantage is that each application controller only needs the resources required to manage that one cluster, so it's much more efficient. Yes, you're running more application controllers, but each one is very well tuned for its cluster. We still run a centralized repo server, well, three replicas of it, because we didn't want to worry about managing repository permissions in 200 different clusters. And thanks to quite a bit of engineering work from Alex and Jesse and the team, they've solved our problems with GitHub Enterprise. Same story with the application set controller. One feature I know they want me to call out is their AI assistant. Every application has a tab that will look at a failed deployment, the events, and the log lines, and do its best to tell you in plain English what went wrong. As the team that's the first line of support when something goes wrong, we absolutely love it. And, I can't say it enough, we get a direct line to the creators of Argo CD, and that's already paid off in spades.

So what's next? Well, we need to finish the Akuity migration. In the first two months of our adoption we were able to migrate 5,800 applications, which is a pretty good statement about the seamlessness of the Akuity platform, so we still have another 800 to migrate over. Our MLB app chart supports KEDA, but we haven't actually rolled it out, so that's a thing we need to do. We're also working on an internal developer portal, which will encompass some of the Argo CD functionality, or at least make it more readily available.
We have off-season upgrades: as you might imagine, it's not a good idea to upgrade ingress controllers in every one of the ballparks in the middle of a baseball season, so that's work we've got to do. And then we're looking at general maturity around some of the Argo CD features. Multi-source applications is in beta; we're really looking forward to that hitting GA. Application set maturity: being able to see application sets in the UI and not having to dig into the logs when a render fails. Argo Rollouts, which I'm really pushing. And of course, as Jesse mentioned in his keynote a little bit ago, Kargo. We're super keen to see what Kargo provides us.

That's all I've got. I just want to say thank you, obviously, well, maybe not obviously, to the creators, maintainers, and contributors of the Argo CD project. Anybody who's approved a PR that I put in: thank you, and thank you especially for not looking too closely. And thank you to all of you, because we're all trying to do the same work, and Argo CD would not be half the project it is without the community. I think we do have a few minutes for questions if anybody has any. Otherwise, I'll be floating around, happy to have a table convo at any point.

Q: Hello, thanks for the talk. Did you ever consider not using Helm, but generating the manifests, storing them somewhere, and maybe using external secrets or something for the sensitive information?

A: Sure, sure, very good question. So the question was: have we ever considered not using Helm, or only using Helm to template things out in advance and then committing the result to the repo? That's certainly a valid use case. I think it adds one step to the CI process, and since Argo CD can use Helm, we just decided to leverage the functionality it already provides.

Q: I guess I have two questions. First, for your application sets, how are you managing the rollout of custom resource definitions and custom resources? And the simple part-two question: are you using the Git file generator or the Git directory generator?

A: I'm going to answer the easy one first: I believe we're using the Git directory generator. And the first one was how we manage CRDs through application sets?

Q: Yes, specifically the ordering. Say I wanted to have a cluster issuer and then create a certificate that uses the cluster issuer. Are you just relying on the reconcile loop to eventually converge on what you want?

A: Sure, sure. So when we deploy cert-manager, we deploy a ClusterIssuer alongside it, and we actually use sync waves for that. We deploy cert-manager, which gets everything up and running, and then in the next wave it creates the ClusterIssuer.

Q: Okay, and you've found sync waves to be okay? I know it's in an alpha state, but it's been there for a couple of years.

A: Yeah, I won't lie, there are occasional hiccups. Mostly it's because the previous wave didn't complete successfully and so the second wave hasn't fired, but the waves themselves have been pretty reliable.

Q: Yeah, thanks.
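For reference, sync waves are just an annotation on the resource. Here's a minimal sketch of the ordering described above, with placeholder names and contact details: cert-manager's own resources sync in the default wave 0, and the ClusterIssuer follows in wave 1.

```yaml
# The ClusterIssuer lands in wave 1, after the wave-0 cert-manager
# resources have synced and reported healthy.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform@example.com      # hypothetical contact address
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
```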
Q: Are you doing any post-deployment validations for your components? Are you just doing synthetic testing, or what's your process there?

A: You mean validating that the application is working after you've deployed it? Yeah, so that's a point of weakness. We leave that to each of the dev teams to determine how they measure the health of an application after deployment.

Q: So you're just using health checks?

A: Yeah, sorry to say. That's one reason I'm really looking forward to getting somebody on board with Argo Rollouts: it automates that.

Q: You don't use Helm tests or anything, right?

A: No.

Q: Cool, just checking. Thanks.

A: Okay, I guess that's all the time we have; we've got to get ready for the next speaker. Let me collect my stuff, and I'm happy to talk one on one. Thanks, everyone.