Hello, everyone. My name is Stefan Prodan. I'm a software engineer at Weaveworks, and I'm very happy today to talk to you about Flagger and progressive delivery. We started Flagger two years ago as a way to decouple deployment from release, enable automated rollbacks, and make the release process safer. Let's see what transformations you need to apply to your CI/CD pipelines to move from a classical way of deploying things to something that enables progressive delivery, and how we can apply these techniques.

I drew a diagram here of how CI/CD looks in general. You have your app repo where your source code is. In there you also have a Dockerfile, Kubernetes manifests, and scripts or pipelines that define how your CI/CD process runs. So you push a change to your app repo, that creates a container image, then your script places the new image tag inside the Kubernetes manifests and applies those manifests on the cluster. This is how you release a new version of the app.

There are a couple of challenges with this approach. If you have multiple apps that share some infrastructure, like an ingress controller or the same namespace, where do you place those shared items? If you place them in all your app repos and you modify something in one repo, then when you run these pipelines in parallel they'll be fighting each other, undoing each other's changes. Another challenge is around rollback. If you want to restore your app to a previous version, you have to rerun the whole pipeline. That means building a new container image and deploying it to production. That's not actually a rollback, because you create a new artifact and whatever is in there could be different.

I've listed here more problems with CI/CD as a monolith. One aspect is configuration drift. If your manifests are scattered across all your app repos, how do you ensure that your production system can be versioned? How do you detect when your production system has changed somehow, say, when someone added something directly on the cluster? How do you make sure those changes are ported back to Git? What happens to them when you run a new pipeline? Some of them could be overwritten, and so on.

To make production releases more stable and more traceable, we can break CI from CD. CI becomes the thing that runs your end-to-end tests, maybe with a Kubernetes kind cluster that you can run inside your CI system. If everything goes okay, you can also validate your Kubernetes manifests with something like OPA Conftest. At the end, you create an immutable container image, push it to the registry, and that's where your CI platform's role ends. It doesn't know about your production clusters; it doesn't connect to Kubernetes at all.

Now, for the continuous delivery part, the proposal is to use GitOps. What GitOps means is that you have a repository where you define your whole fleet state. The continuous delivery controllers will not be running outside the cluster; they run on each cluster. This is how the clusters reconcile their own state: they connect to your fleet repository, they take, let's say, a Kustomize overlay made for that particular cluster or that particular group of clusters (say you have a staging group and a production group), and they continuously reconcile their own state with what's defined in Git.
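To make that concrete, here is a minimal sketch of what such a per-cluster definition could look like with Flux v2; the repository URL, overlay path, and API versions are illustrative, not taken from the demo:

```yaml
# Source: the fleet repository every cluster watches
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: fleet
  namespace: flux-system
spec:
  interval: 1m                          # how often to poll Git
  url: https://github.com/example/fleet # placeholder URL
  ref:
    branch: main
---
# Reconciliation: apply this cluster's Kustomize overlay
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: cluster
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: fleet
  path: ./clusters/production           # per-cluster overlay
  prune: true                           # remove objects deleted from Git
```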
That means you can version your infrastructure along with your app deployments, and if you want to roll back to a particular point in time, you have matching definitions for all these things.

What challenges remain with this GitOps-based continuous delivery approach? One issue is: what are you going to do if an app misbehaves after you deploy it? If the app crashes during deployment, the Kubernetes rolling update will halt; you can add health checks to your CD system and you'll know about it. But what if your app rolls out nicely, and then, when production traffic lands on the new version, it starts to error out with 500s? Or maybe the code changes add a lot of latency, so people get timeouts and the app becomes very hard to interact with. There are also questions like: can you run multiple versions at the same time and run tests between them?

To make such things easier to describe, we can break deployment from the release process. We use the continuous delivery tool to create deployments inside the cluster, but instead of letting Kubernetes roll out that deployment to everyone, we have a new component that sits at the end of the pipeline and drives the release process differently from what Kubernetes usually does. This is where Flagger comes into play. You have your cluster repo, you have your deployments there, and you also have a Canary object: a custom resource that Flagger understands, where you define the policy for how the release process should happen inside the cluster. So when you change something in your cluster state, say you bump the version of an application, Flux (or another GitOps operator; there are many out there) will apply that change. But instead of letting that change go straight to your load balancer so your users land on the new version, Flagger takes over from there: it routes a small portion of your traffic towards the new version and keeps increasing that traffic weight. It measures metrics from Prometheus, Datadog, CloudWatch, and others, and based on those metrics it decides whether the new version is fit to serve production traffic or not.

So, as I said, Flagger is a Kubernetes operator. You deploy it on your cluster. It has a declarative model through a custom resource definition, so you can keep policies inside your Git repo describing how you want the release to happen, and Flagger uses a traffic management solution to route traffic between versions. Flagger works with a couple of service mesh implementations: Istio, Linkerd, AWS App Mesh. And because Flagger works nicely with the Service Mesh Interface, things like Open Service Mesh or HashiCorp Consul Connect could work with Flagger in the future. Now, maybe you are not ready to use a service mesh, or you just want to do progressive delivery with ingress controllers. For that, Flagger works with Contour, with Gloo, with NGINX and Skipper. And you can combine them: for example, Linkerd doesn't come with an ingress solution, so you can pair one of those ingress controllers with Linkerd and do canary releases for both your front-end apps and your back-end apps.

Now, in terms of deployment strategies, Flagger implements a couple of different things. The first one, canary release with progressive traffic shifting, works great for apps that expose HTTP or gRPC APIs: stateless apps, microservices, and so on.
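As a sketch of such a Canary policy, using Flagger's v1beta1 schema: the app name, namespace, and port are illustrative, while the interval, thresholds, and weights mirror the values used later in the demo.

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: backend
spec:
  targetRef:                       # the deployment Flagger takes over
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  service:
    port: 9898
  analysis:
    interval: 10s                  # how often to run the checks
    threshold: 3                   # failed checks before rollback
    stepWeight: 5                  # traffic increase per step
    maxWeight: 50                  # stop shifting here, then promote
    metrics:
      - name: request-success-rate # built-in metric, percentage
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration     # built-in metric, milliseconds
        thresholdRange:
          max: 500
        interval: 1m
```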
Now, for front-end apps that have, let's say, an API but also static content (JavaScript, CSS, HTML, and so on), when you do a canary release you want to pin users to a particular version. You can do that with session affinity: Flagger lets you segment your users based on HTTP headers or cookies. So you can say: only the users that have this particular cookie will be used to test the new version.

Another strategy is Blue/Green with traffic mirroring; this works with Istio. This kind of strategy works great with idempotent APIs. If your API makes any kind of change, writes to a database or mutates some state, traffic mirroring is not the way to go, because you would be duplicating all those actions. But if you have, say, machine learning workloads, a caching system, things that build reports, or any kind of GET query, traffic mirroring works great. And finally there is the classical Blue/Green, where Flagger runs your end-to-end tests and load tests, looks at the metrics, and based on that result does the switch from one version to the other in a single go.

I have here a graphical representation of how everything works. The idea is that Flagger monitors the deployment that gets applied on the cluster. When it detects a change, it scales up the canary deployment and starts to route traffic towards it while checking metrics, and at the end, if everything goes okay, it lets Kubernetes do the full rollout inside the cluster. For A/B testing, instead of gradually shifting traffic, it routes a user segment selected by headers or cookies. And for Blue/Green, it runs tests, then does the final switch.

Okay, demo time. I'm going to use Flux version 2 to set up my cluster, Flagger for progressive delivery, Contour for doing A/B testing for front-end apps, and Linkerd for doing canary releases inside the cluster for back-end apps.

So, I have an empty cluster here, and I have a Git repo on GitHub where I'm defining the state of my cluster. We have the infrastructure items, like Linkerd, Flagger, and Contour, and I'm also defining two workloads: a front-end and a back-end app. First I'm going to install Flux. Flux version 2 is composed of several controllers. We have a controller that deals with sources, like Git repositories, Helm repositories, or S3 buckets. And we also have specialized reconcilers, like the Helm controller, which knows how to install a Helm release, run the tests, upgrade, roll back, and so on, and the Kustomize controller, which applies Kustomize overlays on your cluster.

Okay, so I have Flux installed. Now I want to add this repo to my cluster. What I've told Flux is: connect to the GitHub repository, pull all the manifests from there, and make them available inside the cluster. Next I'm going to tell Flux how to reconcile the infrastructure items. First, I'm defining the Linkerd Kustomization. Inside the infrastructure/linkerd directory there is a Helm repository and a Helm release: the Helm repository points to the official Linkerd repo where the charts are, and the Helm release configures how I want Linkerd to be installed on my cluster. Here I'm also telling Flux: after you apply everything, make sure that the Linkerd proxy injector is up and running. I'm going to use this information to define how the cluster reconciliation should work, because Linkerd, as a service mesh, needs to inject a proxy sidecar into each pod.
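In Flux terms, that readiness gate can be expressed as a health check on the Linkerd Kustomization itself; a sketch, with illustrative paths and names rather than the exact contents of the demo repo:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: linkerd
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/linkerd
  prune: true
  sourceRef:
    kind: GitRepository
    name: fleet
  healthChecks:                    # hold "Ready" until the injector is up
    - apiVersion: apps/v1
      kind: Deployment
      name: linkerd-proxy-injector
      namespace: linkerd
```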
So I need to make sure that when I'm reconciling anything else inside my cluster, the Linkerd injector is up and running, so that I get a valid state for my workloads. Okay, Linkerd is installed. Now I'm going to configure Flagger, and I'm telling Flux: hey, Flagger depends on Linkerd. And finally I'm going to tell Flux to reconcile Contour as well.

While Contour is reconciling, let me show you here in the infrastructure directory how I've configured it. I want Contour to be part of my service mesh, and to do that I define a Kustomize patch for Contour. I'm pulling the Contour manifests from the Contour repo, but then I'm telling Flux: hey, take all this configuration and apply this Kustomize patch to it. Contour comes with a DaemonSet for the Envoy reverse proxy, and I'm adding an annotation telling Linkerd: from the projectcontour namespace, inject your sidecar only into the Envoy pods.

Okay, so I have everything running now. If I look at the Flagger logs, I see that Flagger has started and has connected to the Linkerd Prometheus. What I'm going to do now is tell Flux to deploy the workloads I have defined in my repo. I'm applying the workloads, and Flagger will start monitoring the deployments, because in my workload definitions I also have Canary custom resources. Based on those custom resources, Flagger will bootstrap both applications for us.

If we look at what's happening in the frontend namespace, I see there are things happening. This is my deployment that I have defined in my repo, podinfo, in the frontend namespace. But I also have a Canary definition for podinfo that tells Flagger how to configure the whole thing. What Flagger does is say: okay, I have this deployment running; I'm going to take over the reconciliation for it, and I'm going to create a clone of that deployment, named with a -primary suffix. After the clone is up and running, Flagger scales to zero the deployment that Flux applied. All your production traffic from this moment on goes to podinfo-primary. So if I look again at the pods, whatever was in the initial deployment is gone, scaled to zero, and the primary pods are running. Flagger also creates Kubernetes services and Contour ingress objects. It creates three Kubernetes services: the apex, the canary, and the primary, and these services are used for traffic switching. So in your Git repo you only define your deployment and a horizontal pod autoscaler, and Flagger creates all the other objects for you.

Now I'm going to port-forward to the Envoy ingress controller and see if my app is running. Yeah, I have version 5.0.0 deployed in my cluster. Now let's say I want to do a release. To do that, I go into the repo and bump the version number. Under workloads/frontend I have the definition of where the podinfo deployment and horizontal pod autoscaler come from: the app repo itself. And I've told Flux to pick a particular tag. I could tell Flux to monitor the container registry, do the patching on its own, and bump the version every time I push something to the registry, or I can do it manually. So I'm going in here and saying: hey, I want to deploy 5.0.1.
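That edit happens in the workload's source definition, which might look roughly like this, assuming a Flux GitRepository pinned to a tag (podinfo's upstream repo is used as the URL; the exact layout of the demo repo may differ):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: podinfo
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/stefanprodan/podinfo
  ref:
    tag: "5.0.1"   # bumped from 5.0.0; swapped for a semver range later in the demo
```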
I commit this change, and if I run watch on flux get kustomizations, what happens is that the source controller, part of Flux, detects there is a new change in the repo and pulls it inside the cluster; then the specialized controllers, like the Kustomize controller, say: hey, I have a new revision, I'm going to apply it. Okay, the new revision has been detected, and now Flux applies that change in the order I specified. It takes the dependencies into account: first it applies Linkerd, then it applies my workloads. The workload has been applied; it has moved to the new revision. And if we look at what Flagger is doing, Flagger is saying: hey, I've detected a new version, so I need to test it out according to the policy I have in my Git repo.

And if we look here, we see that we end up on 5.0.1. This is Firefox. Let's see what happens if I visit the same URL using Chrome. On Chrome, I'm on 5.0.0. Why is that? In my release definition, here, I have the Canary definition for Flagger, where I told Flagger: hey, test the new version only on users that have a user-agent header containing Firefox. So I'm segmenting my users based on something from the user-agent header, and I'm using those users to test my new version. So here I'm still on 5.0.0, and this is where the test is running right now. What Flagger does, every 10 seconds, is read the metrics and check that my SLOs, my conditions, are fine. I've defined conditions like: 99% of all requests must succeed; the latency of my new app has to be under 500 milliseconds; and so on. So it checks all these SLOs that I've defined, and once the iterations are over (I've set up 10 iterations), it fully promotes the new version to all users. How it does that: the moment the analysis is over, it tells Kubernetes, hey, now do the rolling upgrade of the primary deployment with what's declared in the canary. And it's doing that right now. Flagger also waits for the horizontal pod autoscaler to scale the workload up or down, so it pauses the analysis while the HPA is acting, waits for the old pods to terminate, and then finalizes the release. So if I go back here to Chrome, I see that both user segments are now on the same version. And that's how I've done an A/B test.

Now let's see how progressive traffic shifting happens. I have here, in workloads, the backend definition. It's still pointing at the same app, but in the Canary definition I've changed things a little. There is no longer a header matching condition; now I'm configuring Flagger to shift traffic weight from one version to another. I'm telling Flagger to start with 5%, go up to 50% while measuring metrics, and so on, and if everything goes according to plan, then do the final rollout. So what I'm going to do now is tell Flux to update the backend app. Instead of specifying a fixed version, a fixed Git tag, I'm going to tell Flux to take a semver expression into account, find the latest release matching that expression, and deploy it in my cluster. So I'm setting the semver expression here to greater than 5.0. I commit this change. Now, if I look at my backend, I see that Flagger has detected the new version and will start shifting traffic towards it. Let's port-forward to the Linkerd dashboard; we don't need this one anymore. Okay, let's look at Linkerd. Linkerd can show how traffic splitting is happening inside the cluster.
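Under the hood, with Linkerd, Flagger drives this split through an SMI TrafficSplit object that it keeps rewriting during the analysis. A sketch (the apiVersion and weight notation depend on the mesh version, so treat this as illustrative):

```yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: podinfo
  namespace: backend
spec:
  service: podinfo               # apex service clients talk to
  backends:
    - service: podinfo-primary   # stable version
      weight: 95
    - service: podinfo-canary    # version under test
      weight: 5
```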
So if I look here at podinfo, I see that Linkerd is reporting the primary at 95% and the canary at 5%. Now let's say something goes wrong with the canary. How can I simulate errors? I'm going to exec into a pod and generate 500 errors for my canary workload. What Flagger is doing meanwhile: it keeps increasing the weight and measures latency and error rate. Now, while I'm generating errors, Flagger has detected that the success rate, which should be at least 99%, has dropped to 97%, and it keeps dropping. What is going to happen: I've set a threshold for Flagger; if the checks fail more than three times, roll it back. And that's what Flagger is doing right now. It has determined that the release conditions are not met, it routes all the traffic back to the primary deployment, and it scales down the experiment, the canary. So if we look back here, we see that the primary pods are still running, the same ones as before, and my experiment has failed and is gone. If I do a get pod on this one and look at the image, I see it's still on 5.0.0. So this is how you can set service level objectives with Flagger, based on the metrics that service meshes and ingress controllers offer.

Now, maybe you want to do more than that. I mean, it's fine to look at errors and latency, but in your release process you may want to include custom metrics: how many connections are open to a database, how many people are clicking a button in an A/B test, and so on. How can you do that? You can instrument your apps with, let's say, Prometheus, or push the metrics to Datadog, CloudWatch, New Relic, and so on, and Flagger lets you declare service level objectives targeting these metric providers. So Flagger will run queries against Datadog, say, take the metrics from there, and you can set a threshold for each one (there's a sketch of such a declaration at the end of this transcript).

That was the demo; going back to the presentation. If you are interested in how Flagger works, there is a docs website with details on how you can configure it, how it works with each service mesh and ingress controller, how you can define metric providers, and how you can configure Flagger to emit Kubernetes events for everything that's happening, or post to Slack, Microsoft Teams, Rocket.Chat, and other chat platforms, so you know when Flagger starts a rollout, when it did a rollback, and so on. Flagger also has capabilities like manual gating. For example, you can configure Flagger to ask permission before it starts a canary release: it will call a webhook and say, hey, I want to start this canary release, am I allowed to do that? And you can also have manual gates for the final promotion, so it does the analysis but doesn't do the final release until someone tells it to, and so on.

Please check out the repos here. There is a demo repo with Linkerd and Contour, what I demoed today, and there is also one for Istio that uses both Flagger and Flux to drive the whole thing from a GitOps perspective. Now, everything I've shown is GitOps using Flux, but you can use Flagger with any kind of continuous delivery tool, even if you drive the whole thing from CI, because Flagger can be configured and controlled through custom resources. If you apply those from Jenkins or whatever you are using, it will do the same thing. Thank you very much, and please try out Flagger. Let me know how it goes. Have a nice day.
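For reference, the custom-metric SLO mentioned above is declared through Flagger's MetricTemplate resource. A sketch, with a purely illustrative Datadog query, secret name, and namespace:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: db-connections
  namespace: flagger
spec:
  provider:
    type: datadog
    address: https://api.datadoghq.com
    secretRef:
      name: datadog                       # holds the Datadog API/application keys
  query: avg:postgresql.connections{*}    # illustrative query
```

Inside a Canary's analysis, such a template is then referenced by name with templateRef, together with a thresholdRange, alongside the built-in request-success-rate and request-duration metrics.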