Well, let's do it. Hi, everyone. Thank you for coming. I'm Mike Tougeron. I'm a lead cloud engineer at Adobe, and this is where you can find me on the interwebs. I just have one favor to ask of you: leave me a good review. It is my birthday. I've been wanting to give this talk for a while, so I'm a little nervous, but it is my birthday, so I figured, hey, you've got two reasons to celebrate today: a birthday and a talk, right?

So, a little bit about Adobe. We have a decent-sized installation: quite a bit of compute, memory, CPU, and workloads. The interesting thing is that we have around 75 to 100 different infrastructure components that get installed per cluster. Those do vary depending upon the cluster itself. And by the number of clusters you can see there, we have close to 300 of them right now, and that is growing. At KubeCon North America last year, I think we were around 235 clusters. We are somewhat newer users of the Argo ecosystem, newer GitOps users on the infrastructure side. Our application teams have been using these tools for longer than the infrastructure side has. And we are migrating all of our stuff over to Cluster API, used along with the GitOps system.

So why does GitOps matter for infrastructure teams? I apologize, this does look like a marketing slide. It came from a marketing slide. But it does resonate with me. Declarative infrastructure code makes things simpler to understand. It means that everyone on the engineering team who works on the infrastructure can understand what's going to happen. They can understand the inputs, they can understand the outputs. What you see is what you get. Everybody wants that. It's been shown to work on application code, and it has been shown to work on infrastructure code as well. It lowers the risk. And GitOps, well, it's ArgoCon. You know the benefits of GitOps; I don't need to explain that. But it does reduce complexity, and it reduces the drift that happens with infrastructure code before GitOps. Somebody has made some change manually, or some cloud system has changed and manipulated what's there, and all of a sudden you go to do something and something gets blown away because you didn't know what happened. GitOps keeps all of that in sync.

So in the way-back before times, last year, we had this homegrown Python monolith. We had some Azure ARM templates, CloudFormation templates, all this mess. Now we're moving everything to declarative infrastructure: Cluster API, CAPA for Amazon, CAPZ for Azure, the Argo ecosystem, Helm charts for all of our stuff, and custom operators. Custom operators are really huge for a declarative infrastructure system. It's back to that "what you see is what you get." If you don't use a custom operator, you're back to writing that code that interrogates the system and looks at what's there. It sounds counterintuitive, because that's what the operator does, right? It still has to go out and talk to the cloud providers. It still has to talk to the different systems to figure out what the end state should look like. But that code is written by the subject matter experts. The individual infrastructure engineers are still writing their code in a declarative fashion, so their risk is lower and the understanding they need is much more minimal. That's why I put this on the slide: moving to custom operators is a large part of this journey for us.
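To make that concrete, here's a minimal sketch of the kind of custom resource an infrastructure engineer might write against one of those operators. The group, kind, and fields (infra.example.com, ManagedDnsZone, and so on) are hypothetical, not our actual CRDs; the point is that the engineer only declares the desired end state, and the operator written by the subject matter experts does the imperative cloud API calls.

```yaml
# Hypothetical custom resource: the engineer only declares the desired state.
apiVersion: infra.example.com/v1alpha1
kind: ManagedDnsZone
metadata:
  name: my-team-zone
spec:
  zoneName: my-team.example.com   # desired end state
  provider: aws                   # which cloud the operator reconciles against
  recordTTL: 300
# The operator's controller loop compares this spec with what actually exists in the
# cloud provider and makes whatever API calls are needed to converge, so the
# infrastructure engineer never writes that imperative code themselves.
```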
So, I've mentioned Cluster API. Building clusters for us has now gone from that monolithic Python code, Azure ARM templates, and CloudFormation to looking something very simplistic, in a way, like this. On the left, we have our cluster definition. It references a control plane and references some infrastructure. On the right is kind of what that looks like: hey, we've got a managed control plane, here's its name, we've got public and private endpoints, we have an allow list, we have some roles associated with it, and a Kubernetes version. So for a developer on the infrastructure team who wants to deploy a Kubernetes cluster, an EKS cluster in this case, this is what their template ends up looking like. Now, this is Helm, so they don't even have to write all of this. They put in a couple of values, a cluster name and a Kubernetes version, and boom, they're done.

Now, you do have to do a little bit more when you get into the compute tiers. There's a little more to it. You have to define, okay, what's the machine going to look like? What are your failure domains or availability zones? Sizes, encryption, instance profile, stuff like that. So it does start to get a little more complicated. But for those of you that have had to deal with ARM templates and CloudFormation, at least from my perspective, this is a lot easier for a human to grok: go through it and say, oh, I just want this size root volume, boom, I'm done. You throw it into Argo CD, the output is a rendered Helm chart, Argo CD applies it through Cluster API, and the next thing you know, you have a Kubernetes cluster built.

Performing upgrades with Cluster API really helps us manage this large fleet of clusters. For the 300-cluster fleet, if we had to go through in the old world, it would take us months upon months to do upgrades. We'd have to run through the code, do it, and keep going. Now, as we go through these Helm templates, we update the Helm template, change the Kubernetes version, and Cluster API takes that Kubernetes version change and starts going through its processing. We do this through two commits. The first commit updates the control plane, because we want to make sure that our compute nodes are not running a newer version than the control plane. After the control plane has been updated, so the AKS or EKS cluster has been upgraded, we go through and do that second commit that starts updating the compute nodes. At that point, Cluster API starts going through its drain process. We don't follow that exact Cluster API process, though it's a very good one. We run a tool called k8s-shredder that we open sourced. This allows us to handle things that Cluster API can't do. I'm sure all of you have run large installations of workloads. I've dealt with pods in CrashLoopBackOff, or people writing bad PDBs where you just can't evict workloads, and you're left asking: do I just boot the workload and risk having an outage, or am I stuck not being able to replace my worker nodes? k8s-shredder has some business logic in there that tries to be safer, I guess, with doing those things and booting the workloads. It still tries to respect the PDBs, but it is a little more flexible and a little more aggressive with it. This process then runs over a period of about seven days and evicts the workloads over time. It notifies the tenants of their workloads and labels and annotates the pods and nodes as things go.
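As a rough illustration of how little the engineer has to specify, the values for a cluster definition might look something like the sketch below. This is a hypothetical values.yaml for a wrapper chart like the one described; the field names (clusterName, computeTiers, and so on) are made up for the example and are not the exact ones from our chart.

```yaml
# Hypothetical values.yaml for a cluster-definition Helm chart (field names are illustrative).
clusterName: my-eks-cluster
kubernetesVersion: "1.27"        # bumping this is the "first commit" that upgrades the control plane

computeTiers:                    # the "second commit" bumps the versions/templates for the nodes
  - name: general
    instanceType: m5.2xlarge
    failureDomains: [us-east-1a, us-east-1b, us-east-1c]
    rootVolumeSizeGiB: 100
    encrypted: true
    iamInstanceProfile: nodes.my-eks-cluster
```

The chart renders this into the Cluster API objects, Argo CD syncs the rendered manifests, and the Cluster API controllers build or upgrade the cluster from there.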
So at this point, all the developers had to do is update the Kubernetes version in their manifest, and the clusters are upgrading. The power of the declarative infrastructure has made it, excuse me, easier for someone to understand what's happening as they go along.

One of the cool things with this is that, using the Argo ecosystem, we also have events that get triggered. When these applies happen through Argo CD, an Argo event is created. That event then goes through the sensors and all of that, a workflow gets triggered, and that's what adds the labels that things like k8s-shredder can use, and other things, so that tenants can be notified that there are upgrades happening on the clusters. That allows us to do things like say: hey, for tenants that don't want their pods to be evicted, who have a pod disruption budget of zero, we can push a notification to them through the event cycle, and over that seven-day period they can do their own evictions and their own migration of their workloads. So we're a little bit forgiving with them and give them that time. Eventually, over this time, all of our nodes are replaced, but from an infrastructure engineer's perspective, all they've done is that commit. There's no real human action happening there.

A few things to watch out for with large installations, though. If you don't change your reconciliation concurrency rate, you can end up in a situation where it doesn't reconcile all of your clusters or all of your resources. There's a good GitHub issue on that to check out; somebody's done some really good performance analysis there. It's fairly new, a lot of good stuff. If you've ever run a cluster with a lot of secrets, config maps, and watchers, where a lot of things are watching resources, you need to run a larger control plane. Watches are very expensive on etcd; they use a lot of memory and they're expensive operations. Do use a management cluster per cloud provider. That way you can take advantage of native authentication: IRSA for Amazon, workload identity for Azure, things like that. And watch out with CAPA's AWS managed machine pool. I know I have that explicitly mentioned there, but I really want to call it out. It uses ASG instance refresh, which means that your workloads are not drained first; it just uses Amazon's replace-the-node behavior. That could be really bad. It almost bit us very badly; thankfully we noticed before it happened.

Let's see here. So with that many clusters, we need to dynamically register them into Argo CD, and we have an automated process that, again, uses the Argo ecosystem to do that. We've added an event source that triggers another Argo workflow. It watches for a config map that gets deployed by the application set generator. You'll notice as I talk here that application sets are really big for us. What this does is get the different labels and annotations that we're going to need for Argo CD from this config map. It authenticates to the appropriate Argo CD cluster and registers the cluster there, or updates the cluster there, based upon the information in that config map. We want to make sure we do updates, because if we add labels or annotations, we don't want the cluster to stay at whatever it was first registered with; we want to keep whatever is current. The config map looks something like this, populated, again, via Helm. It pulls from the chart that defined the Cluster API definition for the cluster.
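Roughly, the registration pieces look like the sketch below: a per-cluster ConfigMap rendered by Helm with the attributes we care about, and the Argo Events event source that watches for it and kicks off the registration workflow. The names, namespaces, labels, and data keys here are illustrative rather than our exact manifests.

```yaml
# Hypothetical registration ConfigMap rendered by the cluster's Helm chart.
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-eks-cluster-registration
  namespace: cluster-registration
  labels:
    purpose: argocd-registration
data:
  clusterName: my-eks-cluster
  environment: stage
  region: us-east-1
  cloudProvider: aws
  gpu: "false"
  kata: "false"
  argocdServer: ""          # optional override; otherwise calculated from the attributes above
---
# Argo Events resource event source that fires on ADD/UPDATE of these ConfigMaps and
# triggers the workflow that registers (or updates) the cluster in Argo CD.
apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: cluster-registration
  namespace: argo-events
spec:
  resource:
    registration-configmaps:
      namespace: cluster-registration
      group: ""               # core API group
      version: v1
      resource: configmaps
      eventTypes:
        - ADD
        - UPDATE
      filter:
        labels:
          - key: purpose
            operation: "=="
            value: argocd-registration
```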
What are some of the attributes for a cluster? Is it GPU? Is it Kata? What's its environment or cloud provider, for filtering? And then, optionally, an Argo CD server, or it calculates which Argo CD server it should be registering this cluster into. We'll talk a little later about why the Argo CD server is a field you can select; well, I'll talk about that in a moment. And then the event source is a straightforward Argo Events event source that just watches the config map and triggers that workflow, which kicks off the Argo CD cluster add.

So we talked about 285 clusters and 75 to 100 different components. That's around 200,000 resources at its basic set. That's a lot. It's actually more than that; last I checked, we're running around 20-ish resources on average per application. It's too much for a single Argo CD to run in an optimal manner. Argo CD does do really, really well out of the box. There are some stats that Dan Garfield and Joe Sandoval, Dan's at Codefresh and Joe's another Adobe employee, presented at the last KubeCon. I highly recommend checking out their talk; there are some docs on that. It does really well, and in most situations, out of the box with some basic HA setup, you're good to go. However, for our expected growth this year and into next year, that's not going to be sufficient.

So what we've done is set up an Argo-of-Argos sort of pattern. We have basically a tier-zero Argo CD that manages a bunch of sub Argo CDs. As you commit to one, it creates the sub Argo CDs based upon the config files that we've set up. We commit it into Helm, and it creates the sub ones. We technically have two tier zeros that manage each other, so that we have that kind of redundancy, a non-prod and a prod, and it trees down into the different environments. That registration process does the calculation based upon the cluster's environment, region, cloud provider, and so on, and determines which Argo CD server to go into. That can be overridden on an individual basis if we decide to, or, as we update this Argo-of-Argos chart, that auto-registration process will deregister a cluster from one Argo CD and move it to another. That way, when we find that one is overwhelmed and we want to say, okay, we had five sub Argo CDs, we want to have six now, it'll automatically reallocate them all to where they need to go, as we determine from the monitoring, wrong choice of words, not monitoring, just the metrics dashboarding: seeing how the performance is going, whether things are going slowly or appropriately. Just be careful if you do that; you don't want to accidentally unregister a cluster and have it uninstall your applications. v0.1 did do that. Not what you want to have happen.

So with this many applications and clusters, testing of these Helm charts is really important to us. Our testing needs to go against full-featured clusters, due to dependencies, and we use Prow to trigger our CI. We use the head SHA of the PR and the application name as we build these clusters, either with our legacy system during the transition period or with Cluster API. So we build up the cluster with that, and then each of the applications we're deploying has, again, an application set that installs either that PR code or head, and then we run a bunch of metric checks to see whether or not things are successful.
What this looks like is this: we use a merge generator with a selector for any clusters that are in the CI environment, where we say that the version is blank. And then we select the pull request from this repository for my-app, and then we have another selector, as another generator: all the clusters where the application name is not my-app. The reason we do these two different ones is that when it hits that merge generator, it'll create a cluster that installs with that PR, and when a cluster is not using my-app's PR, it'll install head of all the other applications that are out there. With 75 applications trying to get onto that cluster, we want to use the latest stable code for all of them, but we want to use what's in the PR for the one under test. I see a few dazed faces, so maybe I'm not explaining this properly. Let me see if I can word this slightly differently. When there is no PR associated with an application, it uses whatever the latest released version was; that's what the selection on the right does. The selection on the left merges the CI cluster with the pull request to say, okay, in this case, I'm using the version of the application associated with this PR to install onto this cluster. That way we can do the test, and that's where the conditional on the bottom right tells it which version to install. This is the same application set that ends up being used for the production deployment, and we'll expand upon this as we go on over the next couple of minutes.

Our Prometheus rules for testing are also in the same realm: they're the same ones we use in production for our alerting and our monitoring. That way, it's back to that simplicity concept, where when a developer looks at something, they have a full understanding of the stack. I don't have to look here for what I'm doing in CI testing, look there when I'm doing unit testing, and look somewhere else when I'm doing production monitoring. It's all right here. So: my application is up, sum less than two, that goes to PagerDuty. There's a check for CI only; we have an integration-test label, and on the right you can see the kind of query that we use within that. This unified rule is really cool. The best thing about it is that we can do dashboarding at any time that shows the state of the CI tests. A lot of times, historically, you only get to know the state of your CI tests when you do a deployment, because you did your deployment and you ran CI, so you know the state. Well, what if something happened along the way on the live cluster? It's not something you alert to PagerDuty on, but it does affect the state of the application, let's say. In that case, if you do have an alert that triggered and you want to know the health of the application, you've got to do a bunch of research. With this, all of your CI testing metrics are automatically available in that dashboard at any time, all the time. That's been a really powerful debugging feature for us as we've had to do research.
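A unified rule in that spirit might look roughly like the sketch below; the metric names, label keys, and threshold are illustrative, not our exact rules. The same rule pages in production and doubles as the CI check via the integration-test label.

```yaml
# Hypothetical unified alerting rule evaluated both in CI and in production.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  labels:
    integration-test: "true"     # also treated as a CI check (label name is illustrative)
spec:
  groups:
    - name: my-app
      rules:
        - alert: MyAppReplicasLow
          expr: sum(up{job="my-app"}) < 2
          for: 5m
          labels:
            severity: page        # routed to PagerDuty in production (routing is illustrative)
          annotations:
            summary: "Fewer than two my-app replicas are up"
```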
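And going back to the CI application set for a moment, a rough sketch of that generator setup is below. We actually use a merge generator, and its exact merge keys aren't shown here, so this sketch combines the clusters and pull request generators with a matrix purely to illustrate the shape; the org, repo, and label names are made up.

```yaml
# Illustrative only: one Application per (CI cluster, open PR) pair, installing the PR's head SHA.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-app-ci
spec:
  generators:
    # A second, clusters-based generator (omitted here) installs head of the chart on
    # every cluster where my-app is not the application under test.
    - matrix:
        generators:
          - clusters:
              selector:
                matchLabels:
                  environment: ci
          - pullRequest:
              github:
                owner: my-org
                repo: my-app
              requeueAfterSeconds: 300
  template:
    metadata:
      name: 'my-app-{{name}}-pr{{number}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/my-org/my-app.git
        targetRevision: '{{head_sha}}'   # the version associated with the PR
        path: charts/my-app
      destination:
        server: '{{server}}'
        namespace: my-app
      syncPolicy:
        automated: {}
```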
So as we've gone along, coming out of the mists in the Argo world, application set progressive syncs have evolved. This is really cool. For those of you that have used application sets before, when you've got a large number of clusters you want to roll out to, you may be aware that if you want to go to some clusters first, then to others, you have to use different generators to say, okay, I'm selecting this group first, then this group, then this group, and kind of conditionalize how you roll out. Every time you want to move to the next set, you have to do another PR, change the version, and progressively push it out through PR commits to get it to all your different environments. Now, an application set with progressive sync will do that for you automatically. Very cool. However, keep in mind it is still alpha and there are some limitations, and we'll talk about those as well.

This builds upon the application set we talked about before. We started with that merge generator there: we've got a selection for CI, and now we have a selection where the environment is not in CI, stage, or prod, and we install head; that's for our dev clusters. Then we have an environment selector for stage, where we install version 1.0.13, and for prod, version 1.0.8. We still want to keep prod and stage on different versions, perhaps, but we have 280-something clusters; there are a lot of stage clusters and a lot of prod clusters, so we still want to be careful with how we roll out to those environments. Again, it's still the same application set.

Here's where it gets interesting: we have the rolling sync. For the environments not in stage or prod, we do a max update of 100%: do all of those clusters at the same time and just do them. For the stage environment, canary maintenance group, we do a max update of 25% at a time. So when it hits that maintenance group, it'll do 25%, then 25%, then 25%, assuming the previous 25% went fine. Then for the stage clusters that are not in the canary maintenance group, it'll do 10% at a time, and then progressively on for prod, and so on. What's really nice about this is we get that progressive sync and, again, a human doesn't have to keep making those updates each time we want to move on to the next group. If the sync has gone successfully, it keeps going. And, sorry, if it hasn't gone successfully, it will stop progressing to the next group, things will stop, you'll get the notification, and you can take whatever action you need to take. Really cool.

Something to watch out for with this, which isn't quite clear in the documentation, and I need to open a PR for it, I banged my head against this in the beginning and keep forgetting to get back to it, is that the match expressions are on the application template, not on the cluster labels. So when you see the match expressions here with key environment and key maintenance-group, those are matching the environment and maintenance-group labels of the generated application, not the ones coming from the cluster. It's a little bit confusing. It's not selecting clusters that match those labels; it's selecting the applications that were generated from those cluster labels. Just a little bit of a gotcha there.
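Concretely, the rolling sync strategy looks roughly like the sketch below. The label keys and percentages mirror what I just described, but the generator, versions, and repo details are placeholders, so treat it as illustrative rather than our exact application set. Note again that the match expressions select on the generated application's labels, which is why the template copies them from the cluster.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-app
spec:
  generators:
    - clusters: {}               # all registered clusters; version selection elided
  template:
    metadata:
      name: 'my-app-{{name}}'
      labels:
        # The rolling-sync matchExpressions below select on THESE application labels,
        # not directly on the cluster labels, so copy them onto the application here.
        environment: '{{metadata.labels.environment}}'
        maintenance-group: '{{metadata.labels.maintenance-group}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/my-org/my-app.git
        targetRevision: HEAD     # per-environment version pinning elided
        path: charts/my-app
      destination:
        server: '{{server}}'
        namespace: my-app
  # Progressive syncs are still alpha and must be enabled on the ApplicationSet controller.
  strategy:
    type: RollingSync
    rollingSync:
      steps:
        - matchExpressions:
            - key: environment
              operator: NotIn
              values: [stage, prod]
          maxUpdate: 100%        # dev clusters: all at once
        - matchExpressions:
            - key: environment
              operator: In
              values: [stage]
            - key: maintenance-group
              operator: In
              values: [canary]
          maxUpdate: 25%         # stage canary group: 25% at a time
        - matchExpressions:
            - key: environment
              operator: In
              values: [stage]
            - key: maintenance-group
              operator: NotIn
              values: [canary]
          maxUpdate: 10%         # rest of stage: 10% at a time
        - matchExpressions:
            - key: environment
              operator: In
              values: [prod]
          maxUpdate: 10%         # prod: progressively, 10% at a time
```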
There are some really big gotchas with this which, unfortunately, are preventing us from moving forward with progressive syncs: it does not respect sync windows, and it does sometimes get stuck, ending up in a state where it constantly thinks it needs to reconcile and doesn't actually progressively roll out. These bugs are being worked on, so they're getting really close, but we are really excited about this functionality. Don't get too tricky with your rolling sync selectors; that can get really complicated, and you can end up in a situation where you've excluded a set of applications you didn't mean to. So assign those labels ahead of time on the cluster, and based upon that, assign the label to the application, so that it's much easier to do that selection. Also, broken clusters: if you've ever had a cluster that's just gone down, that will prevent the rollout from progressing through the entire set. So that can be really, really bad.

So quickly here, what does this look like? Let's take a look now. Here's basically the application set that we saw before. It's got two different SHAs. This is just the basic guestbook. I've got a whole bunch of clusters here; they're all synced. You'll see that they're on that SHA, and I'm going to go ahead and swap the SHA and apply it. Apply it; maybe I could have sped this up a little, sorry. What we'll see here in a moment is everything's going to kick over to out of sync, and you'll see everything is out of sync, including the devs at the top. You'll see some of the prods on the bottom right there, and now you'll see that all the devs are syncing, but on the bottom right some of those prods are not syncing; it's only syncing the ones it's supposed to. So it's doing it, like we talked about, progressively. I'm going to speed this up because of time. So after the first set, it moved on to stage, and now it's doing those in smaller sets. Where is it here? You can describe your application set and see where it's gotten with each one, and in the status it'll show you where it's at: does it have pending changes, is it in a waiting state to get applied, is it healthy and applied, is it pending and in a progressing state, and so on and so forth.

I am just about out of time here, so that is it for the demo. Here's where you can reach me on the interwebs. Questions for the minute or so I have left? Do I have a minute or so left, Dan? Yes, great, sorry, I heard you.

I mostly have one question. We have a nearly identical architecture to yours, but much smaller, and we are using Karpenter to manage our AWS worker nodes instead of the ASG, because the ASG is very slow and it does some dumb stuff that we think it shouldn't be doing. But of course it's not made for Kubernetes, so, oh well, what do you do? But these rollouts, for example with k8s-shredder, has it been tested to work with something like Karpenter?

So it has not been tested with Karpenter. We are working on that, actually; somebody's got a PR that I was code reviewing this morning. But the way that k8s-shredder works is that it actually relies on the cluster autoscaler, or Karpenter, or Cluster API, or something else to do the final removal of the node. It just handles the eviction of the workloads. It has the business logic to either do a rolling-restart eviction or a forced delete when the time has expired.

We had one more question over here. The gentleman in the front, you got it. Oh, same question. Oh, great, yeah. And I'll be happy to talk with you more about it if you'd like to know more details. Well, give Mike a big round of applause. Thank you, Mike. All right.