Thanks, everyone. It's actually hundreds of clusters sitting in a tree with A-R-G-O-C-D, just like the poem goes. Real quick, who am I? I'm Mike Tudoran, a lead cloud engineer at Adobe on the Ethos platform. That's Adobe's bespoke Kubernetes install; it's very similar to a kubeadm install, just with Adobe's business logic on top. We're a fairly good-sized installation: around 230 clusters, 22 different regions, AWS, Azure, and private data centers, and 18,000 compute nodes. One of the interesting stats I came across was 1.8 petabytes of memory; that one caught me off guard. You can read the rest on the slide, I don't need to repeat it all.

So I thought it'd be good to start off with some of the problems we're facing. Traditional dev clusters can be expensive, and if you deal in the cloud, cost controls are difficult to manage. I don't know if you've ever tried to set up quotas in AWS, but they're very easy for people to expand. Sometimes they're auto-expanded, and sometimes when you pass a quota they just let you go past it, or they bill you for it and retroactively tell you that you went over. So it gets very expensive very quickly, and sometimes the quotas are close to worthless.

Other times, testing environments need to be isolated. Testing custom operators is very difficult. We're in a multi-tenant environment, so we have multiple tenants on the same clusters; we have some dedicated clusters, but most of our environment is multi-tenant. If you're working on a custom operator, say Prometheus, and you're updating the CRD, and for some reason it's not backwards compatible, you could break other developers on the same cluster, so those changes need to be isolated. Clean environments for integration testing are also very difficult in a microservice environment. Perhaps you need to spin up services that live in multiple namespaces and test them all together, which is very hard if you only have access to a single namespace inside a cluster. And like I said, multi-tenancy makes that a whole lot harder. So these are some of the problems we're facing, and why we started looking at this whole idea of multiple clusters per pull request, so that kind of testing can be done.

There are three tools we're going to discuss today that help us address these problems, and we'll go through them now. The first is a tool called vCluster. It allows you to build virtual clusters inside of a single namespace. These virtual clusters are pretty cool: they give you full API access within the confines and scope of that virtual cluster. If you haven't looked at virtual clusters before: you can create namespaces, you can set up RBAC controls however you want, you can basically do whatever you want within those confines. As it creates pods, those pods are synced to the host cluster so that they run on the host cluster's nodes. That means security controls like network policies, admission webhooks, and OPA still apply, and if you try to create a pod with a host mount or elevated permissions, it can still be blocked.
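As a quick illustration of that model, here's a minimal sketch of spinning up and using a virtual cluster with the vcluster CLI. This isn't from the talk's setup; the names are illustrative, and it assumes you have the vcluster CLI and access to a namespace on a host cluster.

```sh
# Create a virtual cluster inside an existing host-cluster namespace.
# The namespace's quotas, network policies, and admission controls still apply
# to every pod this virtual cluster schedules.
vcluster create dev-vc --namespace team-dev

# Inside the virtual cluster you have full API access, so creating namespaces
# or cluster-wide RBAC only affects this vcluster, not the host.
vcluster connect dev-vc --namespace team-dev -- kubectl create namespace my-feature
vcluster connect dev-vc --namespace team-dev -- kubectl get namespaces
```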
That means a junior developer who you don't want administering your host cluster can still be the admin of that virtual cluster, while keeping restricted access on the development cluster. And where we were talking about cost before, quotas applied on the host cluster carry through to the virtual cluster. As virtual clusters are created, their workloads are still pods on the host cluster, so quotas can be applied: how many pods are you creating, how many virtual clusters are you creating. Those pods count against the quotas of the host namespace, so your cost controls stay in place, but developers can still create as many virtual clusters as they want within the limits you set there. Also, because they're just pods, they're cheaper to run than full VM infrastructure. If your API server only needs two gigs of memory and one CPU, that's a heck of a lot cheaper than launching a full VM, a VPC, a load balancer, and all the other infrastructure you may need in a cloud provider. It also makes it a lot easier and faster to provision than waiting for those cloud resources to come up: you get it inside a minute or two instead of 10 or 15 minutes. I don't know who likes to wait 10 or 15 minutes for their PR to start its CI testing; I know I don't.

The next tool, which leverages this, is Cluster API from the Kubernetes SIG Cluster Lifecycle group. This is one you probably haven't heard of, but it's up and coming. There's been a lot of work on it, and you're going to hear a lot more about it this week at KubeCon. It allows you to do declarative provisioning and management of Kubernetes clusters. It's essentially a CRD operator: you define your cluster definitions via YAML manifests, the operators inside the management cluster take that YAML (it's a series of operators), and they build your Kubernetes clusters and the surrounding infrastructure based on it, like we were talking about earlier with Crossplane during the keynotes. All of that surrounding stuff is built in a GitOps-style fashion. It works with multiple cloud providers, which is really nice. And most importantly, it's extensible via custom providers, so it's not limited to what the Cluster API working group builds: if a third party wants to create providers for it, or you want to create your own, you can. Most importantly for what we're talking about today, there's a vCluster provider, and you're going to see in a little bit how that helps us create a cluster per PR. But the TLDR is that this is a really cool tool. I started using it three or four months ago, found a lot of really useful things with it, and have become really involved in that community. I cannot recommend it highly enough, both from a GitOps perspective and from a core infrastructure perspective, so I really recommend checking it out.

And lastly, the glue is Argo CD. You've probably heard of this tool, along with the others that are very similar. I'm not going to dive too much into it, but the key call-outs are the pre- and post-sync hooks. Those are really powerful features, and I've noticed some people who are new to GitOps may not be aware of them. It's also got a powerful ecosystem around it, and we're going to be using Argo Workflows and Argo Events in the demo that's coming up in a few minutes.
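Since the pre- and post-sync hooks get called out here, a minimal sketch of one may help. This is not from the talk's repo; it's just an illustrative PostSync Job using Argo CD's standard hook annotations, with a made-up service to check.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: smoke-test-                      # hypothetical post-deploy check
  annotations:
    argocd.argoproj.io/hook: PostSync            # run after the application syncs
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: smoke-test
          image: curlimages/curl:8.4.0
          args: ["-sf", "http://my-service.my-namespace.svc/healthz"]
```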
The ApplicationSets are really powerful, especially the pull request generator, which lets us create an application per PR opened in GitHub or any other Git or SCM provider. There are also generators based on the cluster lists already registered in your Argo CD servers, and a bunch of other generators as well. So this is going to glue everything we're doing together.

So why would I want a cluster per PR? We've already talked about a lot of these things, but I want to highlight them specifically, because they're the answers to some of the problems we talked about earlier. I'm not going to deep dive into them, but the microservice developers who need multiple namespaces deployed for their applications: that's solved by this cluster, because they have cluster admin access. That CRD change: if they're cluster admin on the cluster, they can install it and not impact other developers. One really cool thing I didn't mention earlier is Kubernetes upgrades. Let's say you're currently running Kubernetes 1.24, and there's a deprecation that happens in Kubernetes 1.25. You want to test that your code works with 1.25, so you launch a virtual cluster as part of your PR that runs Kubernetes 1.25 and test your application code against that. You can't test new features that the older version doesn't support, but you can test the deprecations to make sure your application code will work with the new version. That's really cool. We also have some applications that hammer the heck out of the API: tens of thousands of secrets, tens of thousands of config maps, watches on all of them, and they can frequently bring down the API server or etcd in the dev environments, especially as multiple developers run those applications in the same dev cluster. This also lets us isolate them and create basically a federation of API servers. Each PR gets its own API server and etcd to run its code against, so if there's a bug in the application, they bring down just themselves, not the entire set of potentially hundreds or thousands of developers at the size of Adobe. So there are some really powerful reasons to want a cluster per PR.

What we're talking about here is a flow that looks something like this. A PR is opened. Argo CD detects it and creates an application, because we're running an ApplicationSet with the pull request generator. That application is a Helm chart that builds a cluster, because the Helm chart uses Cluster API to define Kubernetes clusters. That gets applied to Kubernetes and the cluster gets built. Argo Events then triggers a workflow. That workflow, again inside Argo, registers the cluster back into Argo CD once it's ready. And at that point, since it's now registered as a cluster inside Argo CD, Argo can sync any new Helm charts, Kustomize, whatever you happen to be using, onto this new virtual cluster. Now you have a new environment that either has your application code or is bootstrapped with whatever operators you want on there.

So let's see what this actually looks like; that's more interesting than me talking. What I have here is this hundreds-of-clusters demo. I have this repo, and let's create a branch, minor edit of the readme.
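Before following the demo, here's a trimmed sketch of what a pull-request-generator ApplicationSet driving this flow can look like. It's not the exact manifest from the talk's repo: the owner, repo, chart path, and naming are illustrative, a private repo would also need a token reference, and in production you'd add a Git webhook rather than rely on polling.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: cluster-per-pr
  namespace: argocd
spec:
  generators:
    - pullRequest:
        github:
          owner: example-org                     # illustrative
          repo: hundreds-of-clusters-demo        # illustrative
        requeueAfterSeconds: 30                  # polling interval; use a webhook in production
  template:
    metadata:
      name: 'demopr{{number}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/example-org/hundreds-of-clusters-demo.git
        targetRevision: '{{head_sha}}'
        path: charts/pr-vcluster                 # Helm chart that renders the Cluster API objects
      destination:
        server: https://kubernetes.default.svc
        namespace: 'demopr{{number}}'
      syncPolicy:
        automated:
          prune: true
        syncOptions:
          - CreateNamespace=true
      # Cluster API fills in the control plane endpoint after creation,
      # so tell Argo CD not to keep trying to sync that field.
      ignoreDifferences:
        - group: infrastructure.cluster.x-k8s.io
          kind: VCluster
          jsonPointers:
            - /spec/controlPlaneEndpoint
```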
Okay, I tried that earlier; I guess I didn't do it enough. Is it not going any bigger? Okay. I'm going to vim README.md, and you know I'm going to be one of those people: I want this in caps, because people don't read warnings unless they're in caps, right? That's how it works. So git commit -am "put warning in caps". Before I push this, I want to show you what's currently running in Argo. Right now the only thing up here is this cluster-workflows application. Don't worry about what's in there right now; that's where the meat of the logic is happening, and we're going to dive through each part of it in a moment, so ignore it for now. So git push, if I can type; I really can type, it's just that you're all watching me. "Minor edit of readme." Now that I've created this pull request, what we're going to see here within about 30 seconds, there we go, is a new application being synced. What you see here is it syncing a vCluster and the Cluster from Cluster API. Now, in a real-world situation you'd use a webhook from Git to trigger that sync rather than polling; I'm only doing polling because, as you'll see, this is running on localhost.

So let's do kubectl get pods -n argo, pipe that to grep add, and find the workflow that's running: it's this one right here. We'll do kubectl logs on that (whoops, forgot the -f). What this is doing is watching for that cluster to be provisioned. This is actually taking a little longer than a vCluster would normally take, because it's provisioning a cloud load balancer. If you just do a vcluster create, it takes only a few seconds from the time the pod comes up; if you're waiting for cloud resources like a load balancer, it takes longer. Cloud resources, 45 to 90 seconds; vClusters, around 30 seconds. And we wait a moment. Come on, demo gods, demo gods, fingers crossed. Mike, why did you do a cloud load balancer service instead of a local connection? I did a cloud load balancer because it was faster and easier for the demo, being here and not on my corporate network, with the conference connectivity and a local laptop, that kind of thing. The question was, why did I use a cloud load balancer instead of just using the local connection? I didn't want to add that complexity.

So there you go. It got to the Provisioned state and ran the Argo CD cluster add. We check back in Argo itself, and we can see the PR, here's the cluster that it built, and there are also two more tiles: it synced in a Prometheus stack, so that's in there now, and an Argo Rollouts stack, so that's in there now too. All of a sudden we have this environment that's been bootstrapped for the cluster. If we run kubectl get cluster (I put this in a vcluster environment), we can see we have a cluster named demopr11 that's been provisioned. So I can run clusterctl get kubeconfig and redirect that to a YAML file, export KUBECONFIG pointing at that file, and kubectl get pods -A, and here are all the pods running inside of that virtual cluster. So it's been bootstrapped. I didn't have actual application code running here, but you get the concept that the PR has application code.
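For reference, here's the demo roughly reconstructed as shell commands. The cluster name comes from the talk; the branch name, pod name, and file names are illustrative.

```sh
# Open the PR that kicks everything off.
git checkout -b minor-edit-of-readme
vim README.md                                    # put the warning in caps
git commit -am "put warning in caps"
git push -u origin minor-edit-of-readme          # then open the pull request in GitHub

# Watch the Argo Workflow that registers the new cluster.
kubectl get pods -n argo | grep add
kubectl logs -n argo -f <registration-workflow-pod>

# Once Cluster API reports the cluster as Provisioned, inspect it.
kubectl get cluster                              # shows demopr11
clusterctl get kubeconfig demopr11 > demopr11-kubeconfig.yaml
export KUBECONFIG=demopr11-kubeconfig.yaml
kubectl get pods -A                              # the pods inside the virtual cluster
```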
If the PR has application code, it does the same sort of thing: applies it, moves on, does its thing. But the cluster has also been bootstrapped with the generic resources that the application may need, because every application is going to need some sort of CRD, and it's going to need monitoring that it isn't managing itself.

So let's go back to the slides, and shoot, I lost my timer, so where am I on time? Okay. Everybody's got pencil and paper, right? You can write down how all of this works, I hope. I have a friend who does it that way; that's how I know he's paying attention. But don't worry, all this code is in that repo. So if you want to see how this works, this is the slide to take a screenshot or a picture of, and I'll give you a few seconds to do that. It's at the end as well. The repo has building a cluster with Cluster API, building virtual clusters with it, the PR generator, the whole shebang. So this is the one you want a picture of if you get a picture of anything in this talk. And cameras down. All right.

So, the pull-request-generator ApplicationSet. The top bit of code here (you can see my mouse around the pullRequest area) tells Argo: go to GitHub, look at this repo, and for every pull request there, generate an application tile. Name it with this name, then look to this location for a Helm chart, or for whatever it is; in this case it happens to be a Helm chart, but it could be Kustomize, whatever you want. With vCluster, we ignore this controlPlaneEndpoint field: we don't want Argo to constantly re-sync it, because Cluster API will automatically populate the control plane endpoint as the cluster is created. So there's not much to this one, but it's important to point out how the PR ApplicationSet works.

We then have the Cluster API spec for creating a vCluster. There's a lot more we could set, but these are the key parts. The controlPlaneEndpoint is dynamically populated if not set. This is also where you control the service: I have it set here to use a load balancer. If you don't specify a load balancer, you can use port forwarding to get to it, or a local ingress, or there are a couple of other options for how you connect to this cluster. Security is important, and I like calling this out on slides even though it's not on this one: restrict who can connect to this cluster. People scan ports everywhere, and you don't want people connecting into this virtual cluster with admin access and mining cryptocurrency on your clusters. Please don't let that happen. You can define what Kubernetes version you want to run, and there's a whole bunch of other Helm values you can set, because a vCluster applies a Helm chart as part of its bootstrapping, and that's where all those values come from. You can see the apiVersion up at the top is the Cluster API one.

So how does it actually get into Argo CD? We saw it happen, but what's the flow in a bit more depth? You have the Argo Events event source, which watches for a Cluster API resource to be added. The Argo Events sensor picks up that event and triggers an Argo Workflow. That workflow sits there and polls for the vCluster to reach the Provisioned state we saw. Once it's there, it gets the kubeconfig.
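Here's a trimmed sketch of what the Cluster API objects for one of these virtual clusters can look like, based on the vcluster infrastructure provider's published examples. Field names can differ between provider versions, and the names, version, and Helm values below are illustrative rather than copied from the talk's repo.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: demopr11
  namespace: demopr11
spec:
  controlPlaneRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: VCluster
    name: demopr11
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: VCluster
    name: demopr11
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: VCluster
metadata:
  name: demopr11
  namespace: demopr11
spec:
  # Leave controlPlaneEndpoint unset; the provider fills it in once the
  # service is ready (which is why the ApplicationSet ignores diffs on it).
  kubernetesVersion: "1.25.0"        # e.g. test 1.25 deprecations before the host upgrades
  helmRelease:
    values: |
      service:
        type: LoadBalancer           # demo convenience; lock this down or use port-forwarding instead
```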
Once it has the kubeconfig, it's able to connect to that cluster and to the Argo CD cluster, sorry, the Argo CD server, and add the cluster to Argo CD. It's then able to sync those resources like we saw, and now that cluster has been bootstrapped. When you close the PR, the reverse happens. So let's take a look at that real quick. We go back to that PR and merge it. What we're going to see happen here first, within a few seconds, is that tile going away. There, and then all of a sudden, boom, everything else goes away, because an Argo workflow triggered again and went through and cleaned it up. You can't see that happen live because PowerPoint doesn't switch windows while it's presenting, but you'll see the end result that there's nothing there. Yay! And you can see that the PR is closed. Darn it, I forgot that little trick with PowerPoint; apologies. But basically what happens is, again, an Argo event fires and triggers a workflow. The workflow does the reverse: it connects to Argo CD, runs the Argo CD cluster delete, and at that point the cluster is removed. It does try to remove the permissions from the vCluster, but the vCluster has already gone away, so at that point that step is basically an "|| true" so the command doesn't actually fail and the workflow completes successfully.

So, coming back to this, how are we doing on time? 10:35, okay. Actually, does anybody have a time check, because I've lost mine with the PowerPoint switch? 10:58, all right. So we talked about the event source; here's what it looks like. On the right you can see the resources: it watches the Cluster resources for the ADD operation and names the event bus it publishes to. So when the ADD happens for a Cluster API cluster, we get that event. This next one is really hard to see because there's a lot there, so I recommend checking it out in the actual code in the repo, but essentially what happens is, when that event gets triggered, the sensor creates a workflow. That workflow takes a parameter called cluster and runs a script, and that script is mounted from a ConfigMap. I see some squinting, so I'm assuming you can't really see it. You can see it pretty well? Okay, I just can't see it well from up front. What it does is pass in the parameter coming from the resource that was found to have been created in Kubernetes: it grabs the name of that resource and passes it as the first parameter to the script, so the script can take in the cluster name that was created and knows what to act on. And at that point it's just running a workflow, so it's whatever you want to do in that workflow, inside your container, inside the script that the container runs, however you want that workflow to execute. In this case it's a very simple script, on purpose, so it doesn't do much. There's no magic in this container image either; all it has is kubectl, clusterctl, and the Argo CD CLI, nothing special. The script itself, as I mentioned, is simple in the sense that it doesn't do very much: it does an argocd login, then loops looking for the cluster to reach the Provisioned state right here. Once it has that, clusterctl gets the kubeconfig, it sets up the kubeconfig and the right context, and then argocd cluster add, boom, the magic happens.
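As a rough sketch of that registration script (not the exact one from the repo), assuming a container image with kubectl, clusterctl, and the argocd CLI, and with the server address, credentials, and paths being illustrative:

```sh
#!/bin/sh
# First argument: the cluster name passed in by the Argo Events sensor.
CLUSTER_NAME="$1"

# Log in to Argo CD (credentials would come from a secret in a real setup).
argocd login argocd-server.argocd.svc --username admin --password "$ARGOCD_PASSWORD" --insecure --plaintext

# Wait for Cluster API to report the cluster as Provisioned.
until kubectl get cluster "$CLUSTER_NAME" -n "$CLUSTER_NAME" -o jsonpath='{.status.phase}' | grep -q Provisioned; do
  echo "waiting for $CLUSTER_NAME to be provisioned..."
  sleep 5
done

# Grab the virtual cluster's kubeconfig and register it with Argo CD.
clusterctl get kubeconfig "$CLUSTER_NAME" -n "$CLUSTER_NAME" > /tmp/kubeconfig
export KUBECONFIG=/tmp/kubeconfig
argocd cluster add "$(kubectl config current-context)" --kubeconfig /tmp/kubeconfig --yes

# On PR close, the teardown workflow roughly does the reverse:
# an Argo CD cluster delete, with an "|| true" for the already-gone vcluster cleanup.
```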
And that's how it gets in there, bootstraps the cluster, and applies things into it. All of a sudden, every PR has a dynamic, full-featured API server to go along with it, and developers can do whatever they want against it. They have the full power of a Kubernetes cluster without having to go through an ops or DevOps person or deal with some other team to update a feature, and feature development becomes really easy in an isolated environment. So thank you very much for your time. All the code and everything can be found at that repo; if you didn't get a picture earlier, now's your opportunity. Here are the links to the tools we used, and there's all my contact information. If you don't have questions now but some come up later as you look at the code, don't hesitate to reach out. I will be taking questions in a moment. Some other talks that we have are up there: on Friday, Dan from CodeFresh and another colleague of mine, Joseph Sandoval, will be diving into some performance testing and scaling work that we're doing as we move along this journey. If you want to learn a little bit more about the PR generator and doing application environments per PR, Brandon from CodeFresh will be doing a talk later this afternoon as well. So there's a lot more that can be done with this sort of process, this philosophy of doing things on a per-PR basis. Any questions? I saw lots of hands; I saw yours first.

When you're doing virtual clusters, do you run DaemonSets? So the question is, when using a virtual cluster, do you run DaemonSets, or do you scope that down like you would in the host cluster? That's really up to you. The way the vCluster works is that you can sync in all of the host cluster's nodes, or you can sync in just the ones that have that vCluster's pods running on them. In our environment, because we're multi-tenant, we only sync in the nodes that have our pods running on them. Once a pod is launched, it's assigned to a node, and at that point that particular node gets synced into the virtual cluster. It does it as a fake node, so it doesn't pull in all the other running pods and doesn't break that multi-tenancy boundary. You could run a DaemonSet directly and it would show up on just those nodes instead of everywhere, but I haven't had a use case to run a DaemonSet inside the virtual cluster like you're describing.

Can you talk a little bit about the process of coming up with the idea of doing this? The idea? And also, two minutes left, okay. The idea behind this came from talking about ephemeral clusters. Developers wanted a fresh environment to test their application code. At a previous business unit I was in before I was on the Ethos team, we spent a lot of time coming up with fresh environments for testing. We did hashes on namespaces and that kind of stuff, and it was a huge pain in the arse; it was expensive and a hassle to build up full clusters for that kind of CRD testing. Yes, you could do it with kind locally, but a lot of times you need to interact with cloud provider credentials, that kind of stuff. So there were a lot of different factors that came into this. A lot of it was about Cluster API building real clusters, because we're going to be doing this process of adding and syncing with real clusters on real cloud infrastructure, not just virtual clusters. Got about a minute left. Yeah, I think we can go one more minute.
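On that node-syncing answer, here's a small sketch of the kind of vcluster Helm values that control it. The key names come from the 0.x vcluster chart and may differ in other chart versions; treat it as illustrative of the options described above.

```yaml
# Values passed to the vcluster Helm chart (illustrative).
sync:
  nodes:
    enabled: true          # sync node objects into the virtual cluster
    syncAllNodes: false    # false: only nodes that run this vcluster's pods show up
```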
Yeah, so, are you testing only Kubernetes operators with this? Like, we have a bunch of microservices, and we would like to be able to deploy some portion of our stack in a virtual cluster for a PR, but then how do you define how much of your whole application gets deployed, or which parts of it you need? That's the part we're still working out. That's the approach we're going for; that's what we want to do. With this whole ephemeral cluster concept, we have some teams that want to deploy everything, some teams that want to deploy just a little, and some teams still saying, no, I want brand-new everything, a new VPC, new VMs, and we're like, do you really need all of that? So we haven't quite come to a consensus on what people really need versus what they say they want. But yes, I do want to be able to do what you're describing, and to find that balance I want to leave it up to the developers and their spec: to be able to say this project inside Argo will sync these applications and that project will sync those, or give them an option in a config inside their repo to say, run this Argo workflow, which will register the cluster with these labels, and those labels will then determine what gets synced. So I will be here all day: hit me up in the hallway, hit me up over lunch, hit me up on Twitter or my email. I'm on the CNCF and Kubernetes Slack, and I'm open for questions. I saw lots of other hands, so don't hesitate to hit me up if you have questions. Thank you.