All right, so, everyone, I'm Sanskar Jaiswal. I work at Weaveworks. I'm a Flagger maintainer, and I dabble around with a few cloud-native technologies. I just graduated a few months ago, and this is my fourth KubeCon, so I'm really excited to be here and talk about SLO validation.

And this is mostly Sanskar's talk; I'm here to help. I'm Kingdon Barrett. I'm an open source support engineer at Weaveworks, and also a Flux maintainer. And this is us.

Okay. So how many of you have implemented progressive delivery in your clusters, or, you know, played around with progressive delivery? Can I get a show of hands? Okay, not as many people as I was expecting, actually. So for those of you who don't know what progressive delivery is, let me try to convince you that you should be practicing progressive delivery in your clusters, because it's a really, really good thing to do.

Progressive delivery is the art of introducing new software into your clusters in a safe and iterative manner. Basically, when you want to introduce new software into your cluster and expose it to your end users, you try to get very fine-grained control over the rollout, so that if the new version is not working as expected, if there are bugs or it breaks features, the number of users that are impacted is minimized.

So what do you need to implement progressive delivery in your cluster? Well, the first thing you need is a CI pipeline. The next thing you need is a continuous delivery system, something like Flux or Argo or what have you, something which can take the artifacts that have been published by your CI pipeline and deploy them onto your cluster. Then you need a service mesh or ingress for traffic to come through and hit your applications. And this is important because you can't just use a plain old ClusterIP service to expose your applications; you need some kind of ingress or load-balancing solution, which in our case is a service mesh or something like ingress-nginx. And, most importantly for this talk, you need a good observability stack, something you can use to get metrics, validate your SLOs, and see whether KPIs are being hit. Since it's Prometheus Day, we'll use Prometheus, but you can use anything you want, really.

So there are three main ways you can do progressive delivery, at least that I know of. The first is a canary release. A canary release works by routing traffic to different services. You have two services, your new service and your old service, and you shift traffic from your old service to your new service in a gradual, weight-based manner. This is the main topic of our talk; I'll just briefly describe the other two as well. The next is A/B testing. I'm guessing this is more popular amongst you; it's a pretty old way of testing. You have some beta users, and the way you differentiate those beta users is that the requests they generate carry headers, and you use those headers to route traffic for your beta users.
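To make that concrete, here's a rough sketch of what header-based A/B routing looks like in a Flagger analysis section; the header name and value are hypothetical, and this form assumes a mesh provider like Istio:

```yaml
# Sketch: A/B testing in a Flagger analysis, routing on a request header.
# The x-canary header and its value are hypothetical examples; with A/B
# testing, iterations replaces weight-based stepping, because all matched
# users hit the new version at once.
analysis:
  interval: 1m
  threshold: 5
  iterations: 10
  match:
    - headers:
        x-canary:
          exact: "beta-tester"
```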
And the last is blue-green deployments. Blue-green is a bit different from the others because it doesn't actually expose the new version directly to your users. What you do is, while your old version is up and running, you stand up your new version as well, and then you run load tests against it, artificially generated traffic, right? Using those tests, you verify whether the new version is working correctly or not, and that's how you do blue-green deployments.

Let me explain this diagram very quickly. We have a version one and a version two. So, we're going to go from version one to version two, and we're going to do it progressively. In panel one, just version one is running, with two replicas, and Flagger is monitoring. In panel two, we've added version two: we've added a pod for version two, and we've started routing some traffic to it. Meanwhile, Flagger is monitoring the metrics to see that everything's going okay. And as it sees that everything is going okay, it incrementally adds traffic to the canary, version two; we go from 10% up to 50%. Once we get to 50%, and our metrics all still look fine, we're reasonably confident at this point that the release is good, so we begin the promotion phase. In panel four, you see we've begun a rolling update of the version one deployment. We actually have two separate deployments here, if that's not clear: the version one deployment and the version two deployment, and Flagger is taking care of that for us. In panel five, the rolling update is complete, and we've shifted all the traffic to the upgraded original deployment. I'm going to call it the primary, because that's what it's called in Flagger. And in panel six, we tear down our canary pods, and now all of the traffic is being served by the primary, as it has been since panel five. So, yeah.

Okay. So how many of you have heard about Flagger as a project, or, you know, installed it into your clusters? Okay, cool. So, Flagger is a CNCF incubating project. It falls under the Flux family of projects; Flux is the GitOps tool. The entire flow that Kingdon explained in the last slide, Flagger is a Kubernetes operator which automates that entire thing, right from shifting traffic incrementally to monitoring metrics; all of that is automated for you by Flagger. It enables safer releases. And one of the things which Flagger does really well is that it handles disaster recovery. So, if, unfortunately, your version two is not working as expected, if it has bugs or something like that, you want users to be shifted back to the stable version ASAP, right? And that's what Flagger takes care of. So you can deploy on a Friday and be satisfied that if there's something wrong with the new version, your end users won't be impacted as much. Extensive validation, right? You can validate your SLOs using a bunch of tools; we'll use Prometheus for the demo, but you can see on the slide some of the providers we support. And it has an extensive webhook mechanism which you can use to run load tests and various acceptance tests throughout every phase of the canary, so if you want to run certain tests at 20% of the weight, and different tests when the new version is being promoted, Flagger allows for that.
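As a feel for that webhook mechanism, here's a sketch of what the webhooks section of a canary might look like; it assumes the flagger-loadtester that ships with Flagger is deployed in a test namespace, and the URLs and the hey command are illustrative:

```yaml
# Sketch: Flagger webhooks running an acceptance test before the rollout
# starts and generating load during the analysis. Addresses, namespaces,
# and commands are illustrative assumptions.
webhooks:
  - name: acceptance-test
    type: pre-rollout            # pre-flight validation, runs once
    url: http://flagger-loadtester.test/
    timeout: 30s
    metadata:
      type: bash
      cmd: "curl -s http://podinfo-canary.test:9898/healthz"
  - name: load-test              # runs during every analysis interval
    url: http://flagger-loadtester.test/
    timeout: 5s
    metadata:
      cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.test:9898/"
```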
And this is one of my favorite things about Flagger. It's not a feature, it's more the list of integrations we have. We support pretty much every ingress and service mesh out there; you can see some of the logos. I'm not going to go through all of them, but, yeah, it's a pretty extensive list. And we also have support for Gateway API, which is supposed to be the new standard for networking and load balancing in Kubernetes. And as more and more projects start adopting Gateway API, we're confident that we'll be working with all of them.

Canary definition. This is the main unit of currency of Flagger. So, we're going to define a Canary, and we're going to start with some targets. We have a target deployment, and that's the deployment that I mentioned before, the one Flagger is going to copy for us and make a primary out of. We also have an autoscaler ref, and a service, which Flagger will create. And we have an analysis section, which is really the interesting part. So, as you see, we have a schedule: there's an interval, and there's a threshold, which is kind of unfortunately named; it's the number of failed checks that are tolerated on the way to an eventually successful release. The max weight is the point at which we should stop and start promotion. Step weight is how much we add at each step, here 5%. And the metrics are really the interesting part today, because we can add custom metrics here, any kind of metric; there are predefined metrics, and a couple of them are listed here. And then there's this webhook section, which is also very interesting; you can use it for things like inline testing and pre-flight validation, as well as load testing.

So, I'm going to try to tie it all together and explain to you, in a brief manner, how Flagger actually works. You apply the Canary object, and Flagger is basically reconciling that Canary object. But Flagger is a bit different from other operators in that it is not continuously reconciling state. A typical Kubernetes operator would continuously reconcile the state of the cluster to match your desired state, but in the case of Flagger, there are certain intervals where you want to run the canary release, specifically when you introduce a new version or change something in the deployment. The rest of the time, Flagger is sitting there idle. The way it works is, we wanted it to be as automated as possible, so that you as users don't have to do much. You give Flagger a deployment, it creates an exact copy of that deployment for you and names it the primary deployment, and then you have a canary deployment. The primary deployment is the stable deployment; it's the one you want your users to hit in steady state. Flagger also creates services for your canary deployment and your primary deployment, so all you need to create is the original deployment, and Flagger will create and handle all of these other objects. Then, when Flagger sees the canary deployment has a new version, it says: oh, I have a canary deployment, and it's a new version; I'm going to scale it up, and I'm going to use an HTTP route. HTTP route is a generic term that I've used here. It can be anything: an Istio virtual service, an SMI TrafficSplit, an NGINX canary ingress, anything which can distribute traffic to different services by weight can be considered an HTTP route. So Flagger orchestrates this HTTP route to route traffic, based on weights, to these two different deployments, and as the metric validation succeeds, it will, you know, drive the analysis forward. So this, in a nutshell, is how Flagger works. We'll understand more when we see the demo.
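Putting those pieces together, here's a minimal sketch of a Canary object along the lines of the one on the slide; the names, port, and threshold values are placeholders, not the exact slide content:

```yaml
# Sketch: a minimal Canary definition. Names, port, and numbers are
# placeholder values.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  targetRef:                    # the deployment Flagger copies into a primary
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:                      # Flagger creates the services for us
    port: 9898
  analysis:
    interval: 1m                # how often the checks run
    threshold: 5                # failed checks tolerated before rollback
    maxWeight: 50               # stop here and begin promotion
    stepWeight: 5               # add 5% traffic per step
    metrics:
      - name: request-success-rate   # one of the predefined metrics
        thresholdRange:
          min: 99
        interval: 1m
    webhooks:
      - name: load-test
        url: http://flagger-loadtester.test/
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.test:9898/"
```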
Okay, I'm going to talk about this section. These are metrics, in PromQL; you've probably seen something like this before if you're here at Prometheus Day. Just to point out, on the left, this is a metric for NGINX, and it has the namespace and the target interval templated in. We're looking for a latency under a particular number, so it has this "le" bucket label, which means less than or equal. And on the right, we have another metric. This one is for error rate, and it looks a little bit different. I'm not going to go into too much detail about it because, like I said, you've probably seen things like this before, but this one is out of 100%, so we're multiplying by 100 after we divide the numerator by the denominator. The numerator is successful requests, the ones that are not in the 500 range, and the denominator is the total number of requests.

That one's mine too. So, this is a metric template, and we're going to take advantage of one of the definitions that we've just written. Here we're plugging in our not-found percentage; actually, we'd like to redefine the error code. We'd like to say we're really interested in 404s: we don't want there to be 404s in our release. So we're just going to go ahead and make this not-found-percentage metric template, and we can plug it into Flagger. This might be useful if you have, say, a JavaScript application that is actually serverless, where maybe a 404 is the only error signal you get from it. That's one of the reasons why you'd want to do something like this. But that's the main event here. If you could go ahead. Sure.

This object is basically the crux of how Flagger validates metrics. So, if you see here, you have a type and an address in the provider object, and this is how you can have multiple different observability stacks running in parallel; you can use all of them and build a fairly complex mechanism for how you want to promote new versions in your cluster. And this is how you attach a metric template to a Canary object: you have a threshold range, where you can say, in our case, that we don't want the not-found percentage to be more than 5%. Every one minute, Flagger is going to check whether that 5% threshold is being respected. And if latency is a big concern in your application, you can have a latency template as well, which would say something like: I don't want the latency to be more than 500 milliseconds. It's very extensible.
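A sketch of what such a template and its attachment might look like; the Prometheus address and the metric and label names follow the common pattern from Flagger's documentation rather than the exact slide content, and `{{ namespace }}`, `{{ target }}`, and `{{ interval }}` are Flagger's template variables:

```yaml
# Sketch: a MetricTemplate measuring the percentage of 404 responses.
# The Prometheus address and label names are assumptions.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: not-found-percentage
  namespace: test
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
  query: |
    100 - (
      sum(rate(http_requests_total{
        namespace="{{ namespace }}",
        deployment="{{ target }}",
        status!="404"
      }[{{ interval }}]))
      /
      sum(rate(http_requests_total{
        namespace="{{ namespace }}",
        deployment="{{ target }}"
      }[{{ interval }}]))
      * 100
    )
---
# ...and a fragment of the Canary's analysis section referencing it:
analysis:
  metrics:
    - name: "404s percentage"
      templateRef:
        name: not-found-percentage
      thresholdRange:
        max: 5        # fail the check if more than 5% of requests are 404s
      interval: 1m
```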
Now I'm going to talk about a rather new piece of software called KEDA. How many of you have heard about KEDA? Just a show of hands. Okay. So, if I were to give KEDA a tagline, I would say it's HPA on steroids. It's a more advanced version of the HPA: it can get events from various sources, it can run queries against things like Prometheus, and based on those metrics, or whatever events it gets, it can scale deployments up and down; it can also work for jobs. So this works well with Prometheus: if you have an instrumented deployment, Prometheus can expose its metrics, and based on those metrics, KEDA will scale the deployment. This is a more powerful way of doing scaling than plain HPAs.

We just wanted to highlight that because, when you're doing canary analysis, load testing can be a very, very important factor, and if you only have one replica running for your canary, you're more likely to run into problems like out-of-memory errors, because it won't reflect real-world conditions; in real-world conditions we have auto-scaling enabled. And if you have a GitOps tool set up, like Flux or Argo, and you specify the replicas field in a deployment, that's just a recipe for disaster. You should never do that; in case you're doing that, don't, because the GitOps tool will keep reverting whatever the autoscaler does, and that can lead to all sorts of problems. You should always have an autoscaler set up instead.

To go back quickly to the diagram: it's pretty much the same diagram, but there's a new component, which is KEDA. Notice that Prometheus is really driving everything here. If you remove Prometheus, Flagger will not be able to determine whether your version two is performing as expected, and it won't be able to increase the traffic weight; and KEDA won't be able to scale the deployment. If Prometheus is not working, none of this works. Prometheus, or any other observability tool, is very, very essential to how you implement progressive delivery in a cluster.
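For reference, here's a sketch of the kind of KEDA ScaledObject we'll use in the demo; the Prometheus address, query, and threshold are illustrative values:

```yaml
# Sketch: a KEDA ScaledObject driven by a Prometheus query.
# Server address, query, and threshold are illustrative assumptions.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: podinfo
  namespace: test
spec:
  scaleTargetRef:
    name: podinfo           # the deployment to scale
  pollingInterval: 10       # check the query every 10 seconds
  minReplicaCount: 1        # keep at least one replica at all times
  maxReplicaCount: 3
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.istio-system:9090
        query: sum(rate(istio_requests_total{destination_workload="podinfo"}[30s]))
        threshold: "5"      # scale up when the rate exceeds this
```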
Okay, so it's demo time. We got permission to show this off, because it shows very well how Flagger works. I need to do a quick port forward; oh, it's already forwarded, that's good. I need to switch this to HTTPS and start my port forward again, because I made a mistake. So, we're going to go to the Delivery tab. This is the piece that you won't get in the free, open source version of Weave GitOps. We can see we have a canary here, and we can see that it's in a failed state, which means it rolled back from a previous attempt to upgrade. We have a list of objects that the canary is aware of, and we can see things like the primary and canary deployments; so here's our primary, and here's our canary, and when we do the upgrade, we'll see this one kick into action. We have a list of events, and we can see that we have defined a custom metric here as well as one of the built-in ones. All right, and there's the YAML view.

So, we're going to try an upgrade now. What I'm doing here is just changing the image out, so we're going to deploy podinfo again with a new image. We'll see podinfo is running here, at podinfo.demo.test. And we're just going to kick Flux here; we're going to tell it to reconcile immediately, so we don't have to wait for that part. This is normally where I would go into the terminal and pull up the canary, but since we've received permission to show the Flagger UI in Weave GitOps, we can see what it's doing right now: it's updating the ingresses. We have two ingresses; it's basically configuring NGINX to split traffic between the canary and the primary, and right now it's progressing: 5% of the traffic is going to the canary, and 95% of the traffic to the primary. If this were a real production condition, only 5% of your users would be experiencing the new version, and all this while Flagger is trying to check whether the metrics are being met. What we haven't shown you here is that there's a load tester running, trying to simulate real traffic, and that's how we're showing that Flagger will automatically validate all these metrics for you. So right now it's validating those metrics and shifting traffic for you, and if it rolls back because of our misconfiguration, the blast radius is limited: only 5% of the traffic, 5% of your users, would have seen the bad version, which is a good thing, as opposed to exposing everyone at once. Let me know what's visible and what's not visible, by the way.

So here we have a metric template which defines latency based on Istio workloads; I have an Istio virtual service up and running, so this basically defines how to calculate the latency for every request that comes in. And this is our scaled object; this is what KEDA uses to scale your deployments. It's pretty simple: scale the podinfo deployment, check every 10 seconds what's happening, we want at least one replica running at all times, and we want a maximum of three. And this is how you configure it to listen to Prometheus: you say, I want to listen to a Prometheus server, this is where it's located, this is the query I want to run, and this is the threshold at which I want to start scaling up my deployment. Pretty simple, right?

So let's go through the canary. Here we have a canary; it's pretty simple, as we explained earlier, so I'll just go through it quickly. We say that we want to run the checks every 15 seconds; we want a maximum of 15 failed checks before we decide that this is a bad version and it's time to roll back; and we want to take 10% steps, which basically means that we increase the traffic to the canary by 10% every 15 seconds, until we reach a 50% traffic split to the canary, at which point we can say that our new version is working perfectly, right? So, here's the canary version and here's the primary version; both of them are at 6.0.1, and the canary is at zero replicas right now, it's not serving anything. So what I'm going to do is change the version to 6.0.2. This immediately registers as a change; we're introducing new software into our cluster. Okay, so what's happening is that we have a virtual service here, and this virtual service is actually shifting traffic based on what Flagger is doing. Let me do a quick watch here. So right now, 100% of traffic is going to the primary and 0% of traffic is going to the canary. If we look at what Flagger is doing: our podinfo deployment has been scaled up from zero replicas to one replica, and now Flagger is going to start sending some traffic to the canary version. So 10% of traffic is going to the canary service and 90% of traffic is going to the primary service, and in the background, Flagger is checking the metrics against the Prometheus server and seeing whether everything is working fine or not.

We can see that as well. So we have something called RED metrics and USE metrics here; I'm guessing these are PromQL. Most of you are familiar with what RED and USE metrics are; in case you're not, RED covers request rate, error rate, and duration, that is, latency, and USE covers CPU usage, memory usage, bandwidth, and so on. As you can see here, some traffic is going to the primary and some traffic is going to the canary, 90% versus 10%, and these metrics are being validated by Flagger in the background. As these metrics validate successfully, Flagger will keep shifting more traffic to the canary, and that's going to keep happening until we reach the point where 50% of traffic is going to the canary, at which point we can be satisfied with our new version and say that it works fine.
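Mid-analysis, the virtual service that Flagger is managing looks roughly like this; a sketch, assuming the Istio provider and the podinfo service names, with weights that Flagger itself keeps updating:

```yaml
# Sketch: the weighted routing Flagger maintains during the analysis,
# assuming the Istio provider. Flagger rewrites these weights each step.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: podinfo
  namespace: test
spec:
  hosts:
    - podinfo
  http:
    - route:
        - destination:
            host: podinfo-primary
          weight: 90          # the stable version keeps most of the traffic
        - destination:
            host: podinfo-canary
          weight: 10          # the canary share grows 10% per interval
```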
Now I'm going to exec into a pod and curl the service. Here you can see that we hit the canary deployment, and if we try again, we hit the primary deployment; it depends on the weights I configured, so if I keep doing it, sometimes I'm hitting the primary deployment, and sometimes I get the canary deployment, the new version. So this is what your users are going to be experiencing. If you want to enable some kind of sticky sessions, that is actually being worked on; there's a PR open right now, which I'm working on, to add sticky session support, so that if you have a use case with some kind of legacy application, where once a user lands on the new application you don't want them to go back to the legacy application, that will also be covered. Using A/B testing, you can just set headers for your beta users, but then you don't get this progressive, weight-based step; it will be all your beta users in one go.

So I'm going to get out of here, do a quick get of the canary, and it says it's promoting right now. So 50% of traffic went to the canary and we're all good: everything is working fine, Prometheus said we're okay, this version works great. So what Flagger is going to do is take the new version of the canary deployment, 6.0.2, and tell the primary to be at 6.0.2, which was at 6.0.1 previously. By now we should be done; we are almost done. We'll just do a quick watch on the deployments. And one thing which I couldn't really show you during the demo: if you see, podinfo-primary is at three replicas right now, but when we started, it was at one replica, and that is due to the scaled object that we defined. KEDA noticed that traffic was being generated to the podinfo deployments, and based on that, the replicas got scaled up to three. And if you see here, both podinfo and podinfo-primary are at 6.0.2, but podinfo is scaled down to zero replicas, so all the traffic right now is going to the primary deployment. We can validate that by looking at the virtual service as well: podinfo-primary has 100% of the weight, and podinfo-canary has 0% of the weight. So this demo was meant to show you how you can promote a new version safely in your production clusters by shifting traffic gradually, and now we can say that 100% of our users are experiencing the primary version, and we can be confident about that. Yeah, so that's it for the demo. Thanks for coming, guys.

Thanks very much for your attention. Excellent job, Sanskar. Do we... I thought we were over time; I was running very fast because I thought I was over time. Thanks, everyone, and sorry about the technical difficulties. So, yeah, in two minutes we'll talk about runbook automation systems for Prometheus and Kubernetes. If there are any questions, we're still here, and I'm happy to answer them afterwards. Thank you, and sorry again.