Good evening everyone. I'm Alejandro Martinez, and with me is Steven Smith from the platform and engineering productivity team at Elastic. We're here to talk about the adventure we had while building on not only Argo CD, but a lot of the tools in the Argo ecosystem, to build our entirely new platform and stack.

This all began with the design and launch of Elastic Cloud Serverless, a new platform that I think is coming up on GA in a few weeks. It provides users with hosted Elasticsearch clusters and a bunch of other solutions, and it is fully Kubernetes-powered, in contrast to our other offerings. It was designed from scratch to be operated by Kubernetes, and it's very controller-centric: a lot of the operations it performs really come down to scaling customer pods up on demand, so everything was designed around Kubernetes clusters. You can hear more about this architecture in another talk on Wednesday at 2:30.

Anyhow, the thing about this platform is that it requires us to deploy a lot of new workloads to a lot of new clusters. Whenever there is customer demand, we not only spin up new pods or new nodes, we also spin up new clusters to accommodate that demand, across almost every single CSP, since we often need to colocate with customer workloads in many different CSP regions. On top of that, we have a lot of new services: around 15 teams internally developing towards this new platform, which we had to build from scratch.

So we came to a problem: how do we build the missing pieces of a platform that accommodates multi-cluster deployments where the set of clusters keeps changing, being updated, and growing over time, ideally dynamically and automatically, using Crossplane to build the underlying Kubernetes clusters? And how do we, on top of that, provide a joyful internal developer experience, so a developer who wants to onboard to the platform has a paved path to get up and running as flawlessly and reliably as possible? And since we're at ArgoCon, obviously the solution we came up with has to do with Argo CD.

Your typical Argo CD configuration is something like this. To answer the question of where to deploy, you typically just have an Argo CD cluster secret that says: I want to deploy to this specific Kubernetes cluster. Sometimes it's not even a separate cluster, it's the same one. But things start becoming problematic the moment you start growing in clusters, and growing to the degree of dynamic growth that we have. We can get to a situation where we have 100 clusters and maybe 10 more coming in on short notice that need to be deployed to as soon as possible to accommodate customers' capacity needs.

To fill in this missing gap, we came up with a custom controller that we call the cluster registration controller. We've got an infrastructure team bringing up those clusters, pretty much autoscaling the number of Kubernetes clusters we have, built using Crossplane across all CSPs. And then we have a controller that polls that inventory and manages the creation of Argo CD cluster secrets.
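To make the output of such a controller concrete, here is a minimal sketch of the kind of Argo CD cluster secret it would create. The cluster name, labels, and credentials below are illustrative, not Elastic's actual inventory schema; only the `argocd.argoproj.io/secret-type: cluster` label and the `name`/`server`/`config` fields are Argo CD's documented format.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: gke-europe-west1-prod-01
  labels:
    # This label is what makes Argo CD treat the Secret as a cluster.
    argocd.argoproj.io/secret-type: cluster
    # Illustrative labels a registration controller could derive from
    # the cluster inventory, used later for targeting by generators.
    provider: gcp
    environment: production
type: Opaque
stringData:
  name: gke-europe-west1-prod-01
  server: https://34.0.0.1
  config: |
    {
      "bearerToken": "<redacted>",
      "tlsClientConfig": { "caData": "<redacted>" }
    }
```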
So once we've got the cluster secrets in place, it's actually Argo CD and the ApplicationSet that do the rest of the magic for us. We get to a point like this: the inventory publishes the available clusters on the right side, the cluster registration controller puts them into Argo CD so it knows where to deploy, and the ApplicationSet controller combines this with the different applications that we have to create specific instances of the applications to be deployed to each cluster. So if I'm a developer coming with a hello-world application, Argo CD automatically multiplies it using the cluster generator, deploying the same application to every single workload cluster.

This solves the first half, growing in number of clusters, but the problem has more axes to it. When we have a single application like in this demo, it's simple: the application is exactly the same everywhere and just installs to every single cluster. But it's not actually that simple, because the way our controllers and workloads operate, they are very dependent on CSP-specific resources. For instance, maybe they need to store some data in S3, but S3 is not S3 on GCP; there it's going to be Google Cloud Storage. So we need to be more granular than that and be able to say: I have this specific application that I only want installed on GKE clusters, or this other application that only needs to be installed on Azure. A sketch of that targeting layer follows below.

When it's only individual applications, configured the same way everywhere but deployed to a specific subset of clusters, OK, that's one layer of complexity. But on top of that we need to add another one: what about an application that needs to be configured in a specific way based on the provider? Take, for example, external-dns. external-dns is configured quite differently with GCP's DNS than with AWS's DNS. Or a storage controller that needs a specific variable in its configuration to use a separate set of buckets for staging or production, for instance.

All of this gets more complicated, and makes it even harder for a developer who just wants their application to get to production, who maybe doesn't care or need to know about any of this complexity, while we still want an experience that extends across all of it.

So to onboard new developers and make them fit into all of this complexity, we came up with an application golden path, a starter kit, which is a template served through Backstage, our IDP. Someone who wants to onboard a new application can just come to Backstage, click a few buttons, and they will have automatically generated for them a GitHub repository with a Go application, a Helm chart, and some CI jobs already configured to push the artifacts for that application and the Helm chart to our internal OCI registry. A critical part of this experience, because our applications have a lot of CRDs and a lot of little tweaks that need to be rolled out at the Kubernetes level and not only the application level, is that the Helm chart is the artifact we promote, and it also references the Docker image.
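To make the provider-targeting idea mentioned above concrete, here is a hedged sketch of an ApplicationSet whose cluster generator selects only clusters carrying a given label. The `provider` label, registry URL, and chart name are invented for illustration; the generator and template syntax is standard Argo CD.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: hello-world
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            provider: gcp   # deploy only to clusters labeled as GKE
  template:
    metadata:
      name: 'hello-world-{{name}}'
    spec:
      project: default
      source:
        # Assumes a Helm repository configured in Argo CD with OCI
        # support enabled (enableOCI on the repo credentials).
        repoURL: internal-registry.example/helm-charts
        chart: hello-world
        targetRevision: 1.2.3
      destination:
        server: '{{server}}'
        namespace: hello-world
```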
Promoting the chart this way allows us to pretty much combine infrastructure rollouts with application code rollouts in a single pipeline. We also provide them with a draft pull request adding the application to our GitOps repository, and we came up with an opinionated mechanism to solve the whole configuration management problem.

So when a developer looks at that pull request, they see something like this: a YAML file. It's a custom abstraction that lets them define: this is my service, this is where you can find it, this is the team that owns it, this is the Helm chart for it that needs to be pulled from Elastic's internal OCI registry, this is where I want it deployed, and these are the versions I want deployed to each specific environment.

You may think this does not look like an ApplicationSet, and that's because it is not an ApplicationSet. It's a custom type that we built to make all of the pieces fit together, using CUE. I could spend a long time talking about CUE, but to say it quickly, it's a strongly typed, constraint-based language that lets us define types and easily transform them into different outputs. So in this case we take an input, the service configuration I showed in the previous slide, and validate that the YAML adheres to a specific type. From that structure we are able to generate all the other data structures easily. The good thing about CUE is that it also lets us import Go types, so we can produce not just the ApplicationSets, but ApplicationSets that are strongly typed against the actual CRD definition of an ApplicationSet. CUE is one of the key pieces making everything fit together.

If you remember, the previous slide was really shallow and simple: I have an application, and it needs to be deployed to these environments. We feed that into CUE, and some CUE code outputs the actual Argo CD ApplicationSet, which ends up looking something like this. This is an actual Argo CD ApplicationSet, where we have extracted into CUE all the logic of how an environment maps to actual cluster labels, for example. The developer didn't have to know how the staging clusters were tagged; we do the mapping for them, so we can change the mapping for all applications at once without them needing to be involved, while they just own the application that targets that environment.

And here's where we also got a little opinionated on configuration management, to let a Helm chart be configured differently per environment. We use multi-source applications for this. We combine two sources: one is the Helm chart I talked about earlier, from the OCI registry, at the revision specified in the GitOps repository, together with some extra parameters and extra values. The values you can see here actually refer, in the ApplicationSet's cluster generator, to cluster labels. That's how we put the pieces together: we end up with an ApplicationSet that looks like this.
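To give a flavor of that CUE layer, here is a minimal sketch of how a schema for the service file and an environment-to-label mapping could be expressed. Every field name, label, and value here is invented for illustration; this is not Elastic's actual CUE code, just the pattern of validating an input and owning the environment mapping centrally.

```cue
package platform

// Schema the service YAML is validated against (field names invented).
#Service: {
	name:  string
	owner: string
	chart: {
		registry: string // internal OCI registry URL
		name:     string
	}
	// Per-environment Helm chart versions to promote.
	environments: [string]: version: string
}

// Mapping from abstract environment names to cluster labels, owned by
// the platform team so it can change without touching any service.
#envLabels: {
	staging: {environment: "staging"}
	production: {environment: "production"}
}

// Example instance; CUE evaluation fails if it violates #Service.
helloWorld: #Service & {
	name:  "hello-world"
	owner: "team-platform"
	chart: {registry: "internal-registry.example/helm-charts", name: "hello-world"}
	environments: {
		staging: version:    "1.3.0"
		production: version: "1.2.3"
	}
}
```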
So the end result is that this ApplicationSet, for every specific cluster, generates something like what you see on the right. Developers don't need to know all the pieces that go into it; the end result for them is: I am developing a Helm chart for this platform, my chart will have these Helm values set, so I can template based on them for any environment-specific configuration, and the value files from the GitOps repository source will always refer to the cluster the application is being deployed to. This allowed us to base every configuration parameter in the Helm chart on cluster values, allowing different per-environment configuration while keeping a single path for promoting versions, because the version really lives in the Helm chart. Which brings me to the next question: how do we then update the Helm chart? That's the other piece of the Argo story, which Steven is going to fill in.

Okay, so a lot of different talks have brought up how you handle promotion across disparate environments. Argo CD is a great tool for deploying things into a cluster, but depending on how you carve up your environments and clusters, things can get difficult in terms of moving changes along from one target environment to another. And when I say target environment, I mean a subset of clusters that you've defined to make up that environment.

We had a couple of different goals in mind for supporting promotion. We did not want developers going into the GitOps repo, manually updating a target revision, and then having to repeat that across different environments. Along the same lines, most of the time you're hopefully not just updating a target revision when you promote across environments; you're going to be running some suite of tests, system tests, integration tests, and so on, so you want an automated way to run these external pieces in between environments as you promote. Lastly, you probably want to do some validation. You want to verify your environment is ready: there's not a deploy actively happening, Argo CD is in a consistent state across however many applications you have for that environment, your quality gates pass, and you can prevent multiple deploys from happening at the same time.

What we tried to do with our solution is use as much of the existing Argo ecosystem as possible. This picture is not ours; it's ripped off directly from Argo Events. But to give you a picture of what we're doing: we essentially have an event source that is a webhook, which serves as the API our CI systems call. That kicks off what we call a promotion event, and behind the scenes we have a sensor wrapping an Argo Workflow, which wraps our internal tool. And we're using the NATS streaming event bus.

So what is this tool? We call it gpctl, because why not: GitOps promotion control. The fancy marketing term would be a tool that orchestrates service promotion across environments in an automated, extendable, and safe fashion. Really, that's to say: if I'm a developer, I want to get from QA to production, touching nothing, as safely as possible. And this provides that functionality. This is the declarative service configuration, as I call it, that gets set up.
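As a rough sketch of that eventing entry point, an Argo Events webhook EventSource exposing an endpoint CI can call might look like the following. The resource name, event name, path, and port are all invented for illustration; the `webhook` event source type and its fields are standard Argo Events.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: promotion-webhook
spec:
  service:
    ports:
      - port: 12000
        targetPort: 12000
  webhook:
    # Named event that a Sensor can subscribe to.
    promote:
      endpoint: /promote
      method: POST
      port: "12000"
```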
You might be wondering whether this service configuration is yet another layer that developers have to concern themselves with. However, the service.yaml that Alejandro showed earlier actually feeds into this file, and it gets generated automatically in our GitOps repo. It's then used when the service gets promoted. So developers need to know very little about it; they obviously can dig in if they'd like to, and we help them with it if necessary, but the idea is that it's mostly hands-off.

To go a little deeper into what's actually happening, from the highest level: if I'm a developer working in a service repository, an application repo, I merge something into main, CI hopefully passes and builds, and that triggers the event webhook I mentioned earlier in Argo Events. That kicks off the sensor, which kicks off the Argo Workflow that runs gpctl.

gpctl does a few different things. We validate that the commit actually came from the application repository being promoted. We check that we're not deploying an older commit than what's already out there. And we have, like I said, some level of locking to make sure we're not deploying more than one thing to a given target environment at the same time. After all that, we go update the GitOps repo. gpctl then talks to the Argo CD API to make sure everything is synced and healthy. If it's not, it fires a notification over Slack letting the owning team know. If it is, we run quality gates, in the form of whatever the team defines. One of the abstractions we have is that these gates can run inside an Argo Workflow itself, so we have something that can run build-type pipelines; if you want to run them inside an Argo Workflow, you can. Two of the primary things we run are end-to-end tests and SLO checks. If you haven't checked it out, Elastic has a great SLO product; you can take a look at it and maybe use it for something like this. Assuming those things pass, we kick off another promotion to the next environment and start the whole loop again.

If you're familiar with Argo Workflows, this is the sensor view, just to give you a high-level picture of what's going on. We separate the CI piece from what I'll call the promotion piece, so that only our CI system has access to the CI endpoint, and the promotion endpoint is restricted to the promotion loop itself, essentially, and to the CI piece as well.

There's been a lot of talk about supporting CD, which is a bit of a loaded term: continuous delivery, continuous deployment. We want both. Some teams want to batch up their changes; they don't want to deploy every single commit. They also want manual judgments, which they can do through the quality gates; they can add a manual judgment through Buildkite, or through Argo Workflows if they want. So we need a way to facilitate locking, like I said, for teams that want continuous deployment, because we don't want more than one thing happening at the same time. And at the very end you'll see this concept of an eviction lock.
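Tying the webhook, sensor, and workflow pieces together, here is a minimal, illustrative Sensor that submits a Workflow running a promotion tool when the webhook event sketched earlier arrives. The container image, arguments, and names are hypothetical; the `argoWorkflow` trigger with `operation: submit` is a standard Argo Events mechanism.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: promotion
spec:
  dependencies:
    - name: ci-event
      eventSourceName: promotion-webhook
      eventName: promote
  triggers:
    - template:
        name: run-promotion
        argoWorkflow:
          operation: submit
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: promote-
              spec:
                entrypoint: promote
                templates:
                  - name: promote
                    container:
                      # Hypothetical image wrapping the promotion tool.
                      image: internal-registry.example/gpctl:latest
                      args: ["promote", "--service", "hello-world"]
```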
The eviction lock exists because rollbacks are manual at the moment: if a revert has to happen, we don't want new deployments coming in over the top of what was rolled back, basically redeploying the pre-existing bug that's being worked on. Teams can unlock their environment when they feel it's safe to do so and then let the next deployment come through. It's also very convenient for things like maintenance windows or holiday freezes, when we don't want any deployments coming in. And that's it. Are there questions? Thank you. We also have the QR code for feedback, and we'd appreciate it if you don't mind.

Hello. Great presentation, by the way. I have a question about the CUE abstraction. You want to keep developers as hands-off as possible, so let's say as a baseline you started with an abstraction using Helm, and then developers one day say: we want to use Kustomize now. Are you responsible for maintaining those CUE abstractions? Basically, does the abstraction take away some functionality of the ApplicationSet?

So you mean a developer wanting to use something other than Helm, for example?

Yeah, I'm just wondering what your abstraction approach is.

Yes. For now the abstraction provides an opinionated default, and if they come up with a better solution, or one that fits their needs, CUE allows us to extend it, with CMPs for example, to say: maybe the default is Helm, but we can also allow overriding specific things if they want to deploy a Kustomize application, for instance. Actually, we migrated a lot of this from Kustomize applications, so there are some remnants of supporting Kustomize. But the idea around using CUE is to provide a default that is sensible and works, and if we need to extend it or take a different approach, it opens up ways to override it.

Okay, all right, thank you.

Okay, I'm just wondering: is this tool, gpctl, open source?

It is not open source. We could consider doing so; it would require a bit of sponsorship internally and a little bit of effort, but if there's enough interest, we could consider it.

Yeah, it looks useful. We are basically having the same problems.

We've found there are a lot of folks running into the same problems and building their own internal things after all. This started more or less at the same time that Kargo did; actually, this probably started before Kargo was even a twinkle in anyone's eye, so a lot of this stuff could maybe be fed back into that, and/or into the new work that Michael was talking about this morning. I'd be interested in talking with those folks about how we could work together on some of these problems.

Okay, thanks.

Hello. I looked at the ApplicationSet you're using, the generator to land on the different clusters and environments, and it's very, very similar to what was presented at the previous ArgoCon in the GitOps Bridge project. It looks almost exactly the same, and this is very good, because it means the community is converging on a solution that works. So my question is: did you get inspiration from that pattern, or is there a will in the community to come up with a standard repo structure for this kind of project, so that we could all improve the same structure?

We actually came up with this from the needs that we had; it's how we naturally structured things when we started building the platform.
So it probably ended up looking similar because that's how you'd naturally structure an application like this. If we reached the same solution, it means we had the same problem, and I think it validates that yes, we probably need a standard to pull all of these things together. We honestly hadn't looked at it.

Yeah, it wasn't from that, but I think in general there's a strong symbiotic relationship between your promotion tool, whatever it may be, and your GitOps repository. Those things need to work in concert, so having a standard around the GitOps piece would be really nice for the community to converge on.

Thank you.

You have a custom generator that generates ApplicationSets, which is quite complicated and produces multiple applications. So how do you validate the manifests of a new version, like a diff against the current desired state? Do you do that in your CI, or do you just deploy with Argo CD and see?

We have some CI checks that run first, before anything hits the clusters, validating that everything is in place. So we run that CI first, and then we trigger the promotion in CI before we launch everything else.

I think we're out of time, so if you have any more questions, we'll be around.