Hey, everyone. Thank you for tuning in. Welcome to GitOpsCon, and welcome to my talk on progressive delivery with the Kubernetes Gateway API. I am Sanskar. I'm a Flux and Flagger maintainer, I'm a software engineer at Weaveworks, and I'm also slightly involved with the Gateway API project as well. So let's start by discussing progressive delivery. What is progressive delivery? Progressive delivery is the practice of gradually rolling out new software to your users. You make sure that if the new changes are faulty or buggy, the potential impact is very limited. It's basically about limiting the blast radius for when things explode. It's the next step in your CI/CD pipeline. Your CI/CD pipeline merges the code, generates the artifact, and makes sure all your Kubernetes clusters are running the latest artifact. Progressive delivery makes sure that the latest artifact doesn't actually blow your cluster up. How does it ensure that? You select a subset of users that you want to test the changes on, and based on their usage and the performance of the new application, you make a decision. So let's take a look at some of the strategies that come with progressive delivery. First is A/B testing. A/B testing is header- or cookie-based routing. You have a set of beta users that you identify using a particular header or cookie embedded in the request, and you expose all your new changes to your beta users first. Based on their usage and the performance of the new changes in your application, you decide whether or not to expose them to the rest of your users. Next is blue-green deployments. You could say blue-green deployments are load-testing deployments, in a sense, because the new application never actually gets exposed to your users.
You have your current application running, and then you have a new application running alongside it. You generate some artificial traffic, and based on the performance of serving those artificial requests, you decide whether or not to take down the old application and replace it with the new one. And lastly, we have canary releases. Canary releases are, I think, what most people have in mind when they think of progressive delivery. So let's take a much deeper look into how canary releases work. The problem is basically that you have version one of your application, which is up and running and performing fine, and now you have version two that you want to deploy, but you want to do it in a progressive delivery fashion. You don't want to jump straight from step one to step six, because if version two is not as good as we hoped it would be, that can lead to all sorts of problems. So instead of taking such a long jump, let's take a smaller one and go to step two. In step two, we can see that we have left the version one pods up and running, but we have another deployment, the green circle, which is running version two. We can also see that 5% of our traffic is now going to version two of our application, and the remaining 95% is going to version one. So approximately 5% of your users are now on version two of your application, which means that based on their usage, you can evaluate the performance of version two and decide whether or not it's behaving as you intend it to. You can do that by checking metrics. You can go to your Prometheus or Grafana dashboard, where you have metrics that correlate to your SLOs and KPIs, and see whether version two is hitting all of those SLOs and KPIs. If it is indeed hitting those SLOs, we can take step three and increase traffic.
Now we're going to go to 10%, and then eventually we can keep going: route some more traffic, go to 15%, check metrics again, then do the same thing again, go to 25%, check metrics again, and keep going up to 50% or any defined threshold. 50% is just an example, but let's say 50% is the threshold for us. What does this threshold signify? It signifies something called promotion. When version two of your application is comfortably handling 50% of your traffic, that means it's a good application version; it seems to be performing fine. So we can go ahead and take down version one of our application and replace it with version two. And that's exactly what we do in step four; this is the start of promotion. What we're doing is taking out the version one pods and replacing them with version two pods. And once we have completed that entire orchestration, you have performed a canary release. Now let me introduce Flagger to you all. Flagger is a Kubernetes progressive delivery operator. What does that mean? It means it's a Kubernetes operator that automates the entire process I just explained in the previous slide, for you, for free. It hooks into your networking stack and your observability stack, and then it automates each and every step of the diagram I explained on the previous slide. It has a lot of support in the form of integrations: Istio, Linkerd, NGINX, and a bunch of other networking tools, service meshes and ingresses. But all of that is somewhat less relevant now that Gateway API is out and the entire Kubernetes networking ecosystem is moving towards it, and Flagger is fully compatible with Gateway API. Flagger also supports several metric providers, like Prometheus, InfluxDB, and CloudWatch, and it has a really powerful metric template system which you can use to write queries that fit your SLOs and KPIs.
Now that we have a fair idea of what progressive delivery is and how Flagger works, let's take a look at Gateway API. Gateway API is the next generation of Kubernetes routing and load balancing APIs. It's basically the successor to Ingress, because Ingress really needed a successor, and it learns from all the pitfalls and mistakes that Ingress had. It's designed to be expressive and extensible: you can add your own stuff to it, and it has a very standardized API for doing that. More importantly, it has a role-oriented resource model, which means that each resource that Gateway API provides fulfills a specific role. There exists a resource for each role you can think of when you're dealing with Kubernetes networking. And, I think most importantly, it offers something that Ingress failed to deliver: portability across vendors. When you use Gateway API, it's very easy to switch between Gateway API implementations, because the spec is so well defined and standardized that there is no ambiguity. With Ingress, each provider had countless custom annotations of their own, which made it really hard to switch between Ingresses. Gateway API also went GA recently, a couple of months back, so several of its APIs are now v1, which means they are considered mature and production-ready. So let's take a look at some of the resources that Gateway API provides. First, we have GatewayClass. A GatewayClass is basically a representation of a Gateway API implementation. There are several Gateway API implementations, like Istio, Linkerd, and Contour. The way it works is that when you're provisioning clusters, for example, your infrastructure provider will probably install the Gateway API implementation for you, and a GatewayClass object is just a record saying that this is the Gateway API implementation that's been installed.
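As a rough sketch, a GatewayClass looks something like this (the names here are illustrative, not from the talk):

```yaml
# A hypothetical GatewayClass, typically installed by your infrastructure
# provider. The controllerName identifies which Gateway API implementation
# is responsible for Gateways of this class.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: example-gateway-class          # illustrative name
spec:
  controllerName: example.net/gateway-controller   # implementation-specific value
```

Gateways then reference this class by name, which is what makes switching implementations a matter of pointing at a different GatewayClass.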
A Gateway is basically a load balancer. It's like a layer on top of a load balancer, where you can define all your load balancer configuration: listeners, what kind of traffic you want to listen for, DNS configuration, TLS configuration. This is something your cluster admin would normally deal with, because a load balancer is not something application developers are going to manage; it's someone with cluster admin access, someone who can actually expose applications to the outside world, who would do that. So cluster admins map very directly to Gateways. And then we have HTTPRoutes, which map directly to application developers. Application developers, at the end of the day, are the ones pushing code changes, and it's their code that's getting deployed and exposed by Services. HTTPRoutes are the layer that splits up the Ingress API: instead of having all of the load balancer configuration and the routing configuration in one place, we have a Gateway with the load balancer configuration, and an HTTPRoute with only the routing configuration, where we determine which request should go to which Service based on path, headers, and so on. So now that we have a fair idea of what Gateway API is, let's take a look at how Flagger and Gateway API work together. I have a very simple YAML file here. It's of kind HTTPRoute and it's named simple-split. It has two backends, foo-v1 and foo-v2. And the most important thing, really the only thing Flagger is concerned with, is the weight field, because this is the field that configures how much traffic each Service gets. Flagger just continuously updates these weight fields, and based on these weight fields, you get traffic split between your new version and your old version.
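The slide itself isn't reproduced here, but a weighted HTTPRoute of the shape described would look roughly like this (the Gateway name, ports, and exact weights are illustrative):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: simple-split
spec:
  parentRefs:
    - name: example-gateway    # the Gateway this route attaches to (illustrative)
  rules:
    - backendRefs:
        - name: foo-v1
          port: 8080
          weight: 95           # 95% of traffic goes to the old version
        - name: foo-v2
          port: 8080
          weight: 5            # 5% of traffic goes to the new version
```

During a canary release, Flagger keeps rewriting these weight fields to shift traffic gradually from foo-v1 to foo-v2.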
And now it's time for a good old-fashioned demo, where we'll see Flagger and Gateway API working together to do a canary release. Here I have a Canary object. This is the main API that Flagger deals with; it defines all of your canary release configuration. As you can see, we have a targetRef, which defines which Deployment to perform the canary release for, some Service configuration, and the Gateway that the generated route should attach itself to. And then we have the analysis configuration, which is really the backbone of how canary releases work. Here we say that we want to take a step forward every 15 seconds. The threshold is the number of failed checks we're going to tolerate before we say, okay, this application is bad, let's roll back. maxWeight is the threshold at which we decide that this application is good, so we should promote it. And stepWeight is the amount by which to increase the weight at every step. So you start at 10%, then 15 seconds later you take the next step and increase the weight to 20%, then another 15 seconds pass and you increase the weight by another 10%, to 30%. You keep doing that until you reach 50%, and at 50%, if everything seems fine, that's when you go ahead with the promotion. But if everything is not fine, and along the way you accumulate 10 failed metric checks that did not satisfy your criteria, then we do something called a rollback, where we do not promote the canary; we just kill it. Metrics is where you define all your metric queries that correlate to your SLOs. Here we have a metric named error-rate, which references another object that actually contains the query. So let's take a look at that. error-rate is something called a MetricTemplate. MetricTemplates are basically a way to let you define custom queries that you want to run.
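A sketch of a Canary object along the lines the demo describes; the target, port, and Gateway names are assumptions rather than the exact demo manifest:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
spec:
  targetRef:                    # the Deployment to perform the canary release for
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  service:
    port: 9898                  # illustrative service port
    gatewayRefs:
      - name: example-gateway   # the Gateway the generated HTTPRoute attaches to
        namespace: default
  analysis:
    interval: 15s               # take a step forward every 15 seconds
    threshold: 10               # tolerate 10 failed checks before rolling back
    maxWeight: 50               # promote once the canary comfortably handles 50%
    stepWeight: 10              # increase the canary weight by 10% per step
    metrics:
      - name: error-rate
        templateRef:
          name: error-rate      # references a MetricTemplate object
        thresholdRange:
          max: 1                # error rate must stay below 1%
        interval: 1m
```

The analysis block is what drives the 10% → 20% → 30% → 40% → 50% progression described above.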
So here we define the type, prometheus, because there can be other providers as well, and the Prometheus server address. And this is the actual query that Flagger will run. Flagger has a powerful templating mechanism where it can inject information like namespaces and target names, or even custom information defined in the Canary itself. So this is the error-rate MetricTemplate, which calculates the rate at which we are serving HTTP 500 status codes. And then we have another MetricTemplate here, latency, which is essentially measuring the request latency. We reference both of those in the metrics section of the Canary. At each step, before Flagger tries to increase the weight, it will run those queries and check whether the results actually satisfy the thresholds. Here we have defined a thresholdRange with max: 1, which basically means the error rate should not be higher than 1%. If it's higher than 1%, that's bad, and it counts as a failed check; if it's lower than 1%, it counts as a successful check. Similarly, latency should not be more than 0.5 seconds; if it's more than 0.5 seconds, that's bad. And these results are ANDed together, not considered independently: if one of them fails, the entire metric check fails, and that increments the failed-check count against the threshold. That's very important to consider: all criteria must be met before we actually promote the canary. Webhooks are interesting. Flagger has a very powerful webhook mechanism where it can run webhooks at different phases, different lifecycle steps, of the canary release. For example, you can run a webhook request before the rollout actually starts: before Flagger starts shifting weight to the new deployment, it's going to wait for this webhook to return a 200 status code, and as long as it doesn't, Flagger is going to keep waiting.
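The exact queries from the demo aren't reproduced here, but an error-rate MetricTemplate of the kind described might look like this. The Prometheus address and the metric and label names are assumptions; the `{{ namespace }}`, `{{ target }}`, and `{{ interval }}` placeholders are Flagger's template variables:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-rate
spec:
  provider:
    type: prometheus
    address: http://prometheus:9090     # illustrative Prometheus address
  query: |
    # Percentage of requests to the target that returned a 5xx status
    100 * sum(
      rate(http_requests_total{
        namespace="{{ namespace }}",
        app="{{ target }}",
        status=~"5.."
      }[{{ interval }}])
    )
    /
    sum(
      rate(http_requests_total{
        namespace="{{ namespace }}",
        app="{{ target }}"
      }[{{ interval }}])
    )
```

A latency template would follow the same shape, typically using a histogram quantile over request duration, with a thresholdRange of max: 0.5 in the Canary to enforce the 0.5-second bound.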
Flagger is going to keep waiting, and it's not going to divert any traffic to the new deployment, even though the new deployment has already been deployed by your CD pipeline. Similarly, you can do the same with promotion: Flagger is not going to promote the new application unless it gets a 200 status back from the webhook request. And you can run a bunch of load tests during the lifecycle of the canary release. So it's really a very powerful way to extend your canary release. Now let's take a look at all of this in action together. I have already installed these MetricTemplates and this Canary object. As you can see, this is the status; I have a watch here, refreshing the status of the Canary object every 0.5 seconds. I have two Deployments, podinfo and podinfo-primary. podinfo-primary is our primary deployment and podinfo is our canary deployment. Both of them are at 6.0.2 right now. I'm going to go ahead and trigger a canary release, and the way you do that is by changing anything in the pod template spec; I'm just going to deploy a new version. So here you go. Now, as you can see, my canary deployment, my new deployment, is at 6.0.3. That means Flagger should now do a canary release. As you can see, we have a Progressing phase here, which means that Flagger has detected that there has been a change and that it needs to perform a canary release. So we're going to wait for that to happen. Let's go back to our browser. As you can see, I'm on 6.0.2 right now. Okay, that's fine. Let's wait for the canary release to kick off. Now our podinfo pod is up and healthy; it's been scaled back up by Flagger. Now it's going to increase the weight by 10%. As you can see, the canary weight is 10% here, and you can indeed check the HTTPRoute as well to confirm that Flagger is actually manipulating the weights.
As you can see here, podinfo-primary is getting 90% of the weight and the podinfo canary is getting 10% of the weight. That means 10% of all requests are right now going to the canary deployment, which is running 6.0.3. And we can check that here: we can keep reloading this until we hit 6.0.3. Yeah, there you go. Now we see 6.0.3, which means some of our users are now being served by 6.0.3. Flagger is going to keep increasing the weight again and again, right now it's at 40%, until we reach 50%. Once we reach 50%, it's going to start something called promotion, where it starts scaling down the canary deployment and scaling up the primary deployment, but the primary deployment is going to be at the new version, not the old one. And that's it, that's the demo; we can wait for it to finish. As you can see, Flagger is promoting the canary release right now, and we should soon see podinfo-primary at 6.0.3, which means that our primary deployment is now running the new version of the application, which is what we initially wanted. I just want to spend a couple more minutes on some newly released features we have in Flagger, in particular with Gateway API. We recently added support for canary releases with session affinity, which means that your canary releases won't switch users back and forth between versions randomly: users don't get randomly routed to different versions of your application.
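As a sketch of what enabling this might look like, session affinity is configured in the Canary's analysis section; the cookie name and max age below are illustrative, so check the Flagger documentation for the exact fields:

```yaml
# Fragment of a Canary spec: pinning users to the version they
# were first routed to, via a cookie set on canary responses.
analysis:
  interval: 15s
  maxWeight: 50
  stepWeight: 10
  sessionAffinity:
    cookieName: flagger-cookie   # illustrative cookie name
    maxAge: 21600                # cookie lifetime in seconds (6 hours)
```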
So once they are on the new version of the application, they remain on the new version. We think this is a pretty exciting feature, and I recommend all of you try it out. It doesn't take much to enable it: you can go to the documentation and check it out, it's just a couple of lines of configuration, and you get session affinity for free. We also added support for blue-green traffic mirroring, which means that you can mirror all the requests to both deployments but only have responses returned from one. If your application is completely stateless and not performing any write operations, you can mirror all the requests to both versions of your application, for example if you want to compare metrics between them. Thank you. Check out Flagger at these links, and there are also a couple of resources where you can learn more about the kind of stuff I talked about. I hope to see you all at the next GitOpsCon. Have a good night. Bye-bye.