Thank you so much for coming to our session; we really appreciate your time. Today we'll be talking about our current solution for environment promotion with Flux. I'm Adelina Simion, a technology evangelist at Form3, and I'm joined on stage by Sam Tavakoli, a technical architect, also at Form3. We'll be talking about the k8s-promoter project, which we have recently open sourced, so you can find it on GitHub. It has been the work of a lot of folks on the tooling team, so it's a huge honor to share it with you today. It's built using Flux and written in Go, just to set your expectations. First, I'm going to tell you a bit about the background, because a lot of that background has influenced our decisions. Then we'll walk you through the release flows of k8s-promoter, do a demo if the demo gods are with us, and finally conclude and share some next steps for the project. So let's look at the decision factors that have shaped our solution. First off, we were familiar with Flux v1, and we liked the toolkit approach of v2, so our tooling team immediately wanted to build with Flux. Our platform has a requirement to deploy on multiple clouds, so our teams are building multi-cloud stacks, and we needed something that would work for the multi-cloud project. Furthermore, this was something we had to deliver quickly to unblock the teams, so we wanted to build a simple solution and then iterate on it. We had requirements around managing cluster state in Git, enforcing release order, and easily diffing between environments. And finally, because it's such an important piece of software, we wanted end-to-end testing of our promotion flow as well.
All right, so let me set out a few main concepts before we show you our environments. First, we have jurisdictions, which are fully separate accounts for data isolation and data location. Then we have stacks, which are ring-fenced, isolated copies of our entire platform; they all contain the same services, but they are isolated from one another. A workload is a unit of work represented by all of its resources, meaning, for example, a service and all of its related infrastructure. And then we have the tenant repository, which contains manifests for all of the services in all of the environments. Our environment setup is as you see it here: we have stacks deployed across multiple clouds, on AWS, GCP, and Azure, and a promotion goes in order from dev to test to prod. There is a one-to-one relationship between an environment and a stack, so the same stack cannot exist in multiple environments, and we can have as many jurisdictions as we need. So we are talking about managing many multi-cloud stacks. But this talk is really about what happens after the CI flow. Our CI flow is standard: our engineers raise PRs, we have code review, CI triggers the build and tests it, and at the end a new image is pushed to the registry. But how do we move all of these changes through all of the environments we're talking about? The team considered three main solutions, which you see here. We could use Kustomize overlays, we could maintain separate branches for each cluster, or we could go for a directory structure. Maintaining separate branches for many environments is painful, as some of you might know. And the problem with Kustomize overlays was that a change to the base would trigger across all of the environments at once, breaking our requirement to deploy in order: dev, then test, then production.
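The dev-to-test-to-prod ordering rule described here can be sketched in Go. This is an illustrative sketch, not the tool's actual API; the `NextTarget` name and the hard-coded environment list are assumptions for the example.

```go
package main

import "fmt"

// Sketch of the release-order rule: promotions must flow
// dev -> test -> prod, one environment at a time.
var order = []string{"dev", "test", "prod"}

// NextTarget returns the environment a change may be promoted to next,
// given the highest environment it has already reached.
func NextTarget(current string) (string, bool) {
	for i, env := range order {
		if env == current && i+1 < len(order) {
			return order[i+1], true
		}
	}
	return "", false // already in prod, or an unknown environment
}

func main() {
	next, ok := NextTarget("dev")
	fmt.Println(next, ok) // test true
	_, ok = NextTarget("prod")
	fmt.Println(ok) // false
}
```

Encoding the ordering as data rather than branching logic is what makes it easy to validate in tests, which matters given the end-to-end testing requirement mentioned earlier.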
So the solution we decided on, based on our experience and our circumstances, was a directory structure used together with manifests. Manifests give us an immutable artifact that we can promote across environments. We didn't consider templating because we wanted to keep a low barrier to adoption; as we said, we wanted to unblock the teams and then iterate on the full solution. This is the hierarchical directory that represents our environments. The manifests folder is the source of truth, and changes to the manifests folder trigger the promotion. Changes here can be made either by PR or by Flux's image update automation. Each cluster has its own directory from which it reconciles, and each stack has its workloads and state in this directory. Sam will show you more about the directory structure and how it works. We use YAML to specify our workloads. Another one of our requirements was to be able to do exclusions, and as you can see here on lines 8 to 10, we can exclude a workload for a particular cloud, or whatever we need. The cluster specification is also done in YAML, with labels saying which cloud and which environment the cluster belongs to. Furthermore, it has two paths: one for the manifests folder, which, as I said, holds the specification for the workloads, and one for the configuration folder, which contains environment-specific configuration for that particular workload. But moving manifests via PR requires raising a lot of PRs, so we needed a way to automate this. The tooling team kicked off the k8s-promoter project in March 2021 to automate it for us. I'll now hand over to Sam, who will tell you more about it. OK. Right. So as Adelina said, we wanted to unblock the internal multi-cloud efforts very early on and build on top of them.
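The slides themselves aren't reproduced in this transcript, but a workload exclusion and a cluster specification along the lines described might look something like the following. Every field name here is an illustrative guess at the shape being described, not the project's actual schema.

```yaml
# Hypothetical workload spec with a cloud exclusion (fields are illustrative)
name: demo-service
exclusions:
  - cloud: azure        # do not promote this workload to Azure clusters
---
# Hypothetical cluster spec (fields are illustrative)
name: dev-aws-1
labels:
  cloud: aws
  environment: dev
paths:
  manifests: manifests/demo-service
  configuration: config/dev/demo-service
```

The two paths mirror the split mentioned in the talk: promoted workload manifests in one place, environment-specific configuration in the other.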
We didn't want to invest in changing the whole CI flow, or in replacing it with anything new. So we kept the developer experience of our old approach, the PR approach, which was quite familiar to us. We wanted to be able to gatekeep using our existing CAB process; CAB is the change advisory board, which is quite specific to our industry. We already had quite a lot of bots and integrations to do those sorts of checks, and infosec checks as well. We also wanted to codify our release conventions, as in the ordering of which environment gets targeted, and to be sure of the changes we make, since the template for how we do releases wasn't very well specified in the early days. We wanted to be able to move quickly and change how we worked by being able to test our promotion. So as part of this tool, we also wrote a wrapper around go-git that implements the git HTTP backend, so we can assert on git state over the HTTP wire in our own tests. The tool works with single- and multi-tenant setups, and we will focus mostly on the single-tenant approach. So I'll go through the flow we have with k8s-promoter at the moment, and then give a little overview of how the tool works. As I mentioned, the starting point is an author raising a pull request against the Flux manifests, which we consider the source of truth for our workloads. We also have the Flux image automation controller sitting here, scanning our OCI registries and bumping the versions in the Flux manifests. That doesn't go through a PR process; it merges straight to master. But it still triggers a CI flow, just as the review process does after a change has been merged. The review process also includes the CAB aspect, as I mentioned earlier. And we have many, many tenant repositories.
So once the PR has been merged in the tenant repo, it kicks off the CI job, which essentially has three tasks: promote to development, promote to test, and promote to production. Each task kicks off k8s-promoter with the target environment you want to promote to as a parameter. When the tool runs, the first thing it does is check out master on another repo, the infrastructure cluster config, which is where we keep the cluster definitions across our fleet. In the beginning we had the cluster state as part of each tenant repository, which meant duplicate information across the place, so we improved the UX by putting it in one place. So when the binary runs for promotion to dev, it groups all of the changes for the dev clusters in one go and raises a single PR, which the author approves. The test task will observe that there hasn't yet been any promotion to development, and hence it won't do any promotion; it just stops executing there. The same happens for the production task. Then, once that PR raised against development has been merged, it triggers another CI job on merge, which again runs the three k8s-promoter tasks. The dev task observes that the change has already been made in development, so it does nothing. However, for the test clusters, because we have a requirement to conduct changes one by one in that environment, it raises individual PRs, which the author has to merge one at a time before moving on towards production. And this essentially repeats for production once all of those changes have been rolled out to test.
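The grouping behaviour just described can be sketched in Go: development changes are batched into one PR, while test and production get one PR per cluster so changes roll out one at a time. The `Cluster` and `PullRequest` types and the `GroupPRs` function are illustrative, not the tool's real API.

```go
package main

import "fmt"

// Cluster and PullRequest are illustrative types for this sketch.
type Cluster struct {
	Name string
	Env  string
}

type PullRequest struct {
	Title    string
	Clusters []string
}

// GroupPRs batches all dev clusters into a single PR, but raises
// one PR per cluster for test and production environments.
func GroupPRs(env string, clusters []Cluster) []PullRequest {
	var names []string
	for _, c := range clusters {
		if c.Env == env {
			names = append(names, c.Name)
		}
	}
	if env == "dev" {
		// One grouped PR for all development clusters.
		return []PullRequest{{Title: "Promote to dev", Clusters: names}}
	}
	// One PR per cluster, merged one by one by the author.
	prs := make([]PullRequest, 0, len(names))
	for _, n := range names {
		prs = append(prs, PullRequest{Title: "Promote " + n, Clusters: []string{n}})
	}
	return prs
}

func main() {
	clusters := []Cluster{
		{"dev-aws-1", "dev"}, {"dev-gcp-1", "dev"},
		{"test-aws-1", "test"}, {"test-gcp-1", "test"},
	}
	fmt.Println(len(GroupPRs("dev", clusters)))  // 1
	fmt.Println(len(GroupPRs("test", clusters))) // 2
}
```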
So if, for instance, only one test cluster's PR had been merged, it wouldn't raise the production PR until it had observed the changes across the whole fleet, as defined in the infrastructure cluster config. Now for a brief overview of the tool itself. The two most important inputs are the target environment and the commit range. It goes through a clone step, which gets the state of the tenant repository, and then detects what changes have happened to each workload. Then there's a filter step, which applies the workload exclusions and also checks the previous environments, and then it promotes. The detection step uses go-git: we do a diff tree over the commit range to get the git state of what has actually changed. From that we infer the workload changes, to figure out that, given certain changes to manifests inside these workload directories, the following actions need to happen. We take those workload operations and pass them on to the filter step, which decides, given these changes to the workload and the workload exclusion rules, whether each one should be allowed to be promoted upwards or not, and we also check and validate the previous environment. As part of the promotion itself, the tool groups by cluster, checks out a new branch, applies the changes, commits and pushes, and raises the PR for us. And that's about it. So, on to the demo; let's see if this works. I pray to the demo gods. I've raised a PR here, and it's quite simple. Essentially I'm raising a change to the Flux manifests. Can you see this? Is it readable? I should enlarge it. OK, cool. I'm introducing a very simple version change to the Flux manifests of the demo services, and I'm going to merge it. This will then kick off the jobs that I mentioned earlier.
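The detect-and-filter steps can be sketched as follows: from a list of changed file paths (the kind of thing a diff tree over the commit range reports), infer which workloads changed, then drop any excluded for the target cloud. The `manifests/<workload>/...` layout and the exclusion map are illustrative assumptions, not the project's actual schema.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// ChangedWorkloads infers which workloads were touched from the
// changed file paths, assuming a manifests/<workload>/... layout.
func ChangedWorkloads(paths []string) []string {
	seen := map[string]bool{}
	for _, p := range paths {
		parts := strings.Split(p, "/")
		if len(parts) >= 2 && parts[0] == "manifests" {
			seen[parts[1]] = true
		}
	}
	out := make([]string, 0, len(seen))
	for w := range seen {
		out = append(out, w)
	}
	sort.Strings(out)
	return out
}

// Filter removes workloads excluded for the given cloud.
func Filter(workloads []string, excluded map[string][]string, cloud string) []string {
	var out []string
	for _, w := range workloads {
		skip := false
		for _, c := range excluded[w] {
			if c == cloud {
				skip = true
			}
		}
		if !skip {
			out = append(out, w)
		}
	}
	return out
}

func main() {
	changed := ChangedWorkloads([]string{
		"manifests/demo-service/deployment.yaml",
		"manifests/demo-service/service.yaml",
		"manifests/payments/deployment.yaml",
		"README.md",
	})
	fmt.Println(changed) // [demo-service payments]

	excl := map[string][]string{"payments": {"azure"}}
	fmt.Println(Filter(changed, excl, "azure")) // [demo-service]
}
```

In the real tool the input to this stage would come from go-git's diff of the two trees at the ends of the commit range, rather than a hard-coded slice.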
And hopefully GitHub will be responsive. We can see, for instance, the promote-to-development job setting up. Come on. Let's give it two more seconds. I think the other ones have actually gone through, yeah. For promote-to-development, these are the options going into it: it begins the promotion, promotes the four dev clusters in one PR, and raises it on GitHub. For test, it observes the change but drops it, saying that the source is not development as expected, because we're promoting to test, and therefore it does nothing. The same goes for production. If we now go to the pull requests, we see an overview page which links to the actual change that caused the promotion to kick in, with an overview of which clusters have been affected and which workload it is. If we had changes in multiple workloads, they would be enumerated as new columns here. We also ping the author, so they actually see that something has been raised. If we check the diff, it's essentially the same version change being promoted to all of the clusters. And this is driven by the cluster definitions here; this is what the tool checks out to observe the state of the clusters. So if I merge this PR, it will kick off the next job, and we will then be able to observe the promotions to test. Hopefully this will go a little bit quicker. Maybe not, but we'll see; if it doesn't, I'll just show a previous run. Oh, there we go. So here we can see that it is beginning the promotion. It has started to promote to the internal AWS test cluster, raising a specific PR for that, and then it goes on and raises the other PRs.
And as you can see, it is not the fastest here, but this is something we introduced deliberately to cope with rate limits on the GitHub APIs, and I'll explain more about that. Essentially you will see four pull requests in the end, the test promotions, and you would then merge them one by one. I won't do that now, since we're limited on time, but you can see here that this one targets a specific test GCP cluster. And that's essentially what I wanted to show. So, as you can see, there are going to be a lot of PRs if you have a lot of environments and a lot of clusters. Even though the automation solved the problem of getting those changes in, we still have a lot of PRs to review. One piece of feedback we got is that even though you don't have to do the promotion yourself, because the binary is doing it for you, if you leave a release and come back to it, it's not always easy to understand what particular state it's in. That's one aspect we have problems with. There are also the GitHub rate limits, as I mentioned: a number of workloads could be promoting concurrently, and we couldn't control that, so it caused us a bit of pain, and we had to artificially regulate ourselves based on the responses from GitHub. And there's the maintenance effort of bumping the k8s-promoter version across the many tenants, which isn't too hard, but it's still labor, and we don't like doing labor. So what the tooling team has started looking at now (I'm not part of tooling anymore; I'm in a market team) is experimenting with pipelines: keeping the initial change gated by PR and the CAB process, making sure it's properly reviewed, but past that point it's governed by the pipeline itself.
So you don't need to review the state of each PR to work out which clusters a change has gone to. In this example we produced a pipeline using GitHub Actions: the development clusters have been released, and somebody has approved deploying to a specific integration environment. And that's pretty much it. Come chat to us at our stand, we're here, and check out the tool. If you like it, great; if you don't, let us know, but don't be nasty. Any questions?

[Audience] Do you have any kind of policies that you can define? For example, can you say that when you deploy something to test, run this particular test suite, and if that passes, automatically move to production, or move between environments? Or is this a manual process?

[Sam] That's not what k8s-promoter does today, but we have had the thought that the next iteration would be the pipelines, and then building progressive delivery on top of that. So no, it's not doing what you're describing right now. But nothing really stops us: k8s-promoter has evolved quite a lot, and for a single binary the steps it does are quite cramped. So we're now separating it, letting the pipeline generation, which understands the cluster state, be one thing, and the actual steps of promoting, moving the workload changes from the manifests to the different environments, be a separate step that could run as part of engineering CLI tooling, as part of the pipeline. And nothing stops us from plugging in something like Flagger on top of that, which I, at least, would be very keen to experiment with, but we haven't yet. Hopefully we'll get investment time, so hopefully I'll do that. Any other questions? All right, we are at break now until 14:55 CEST, so five till, come back. Thank you.
We'll be talking again. Thank you all so much.