Hi, everyone. My name's Hannah, and I work at New Relic on the Pixie team, contributing to Pixie, which is an incubating CNCF project. I'm here, like you, because my goal is to make deploying code to production less stressful. No matter how much you test your code, there's always a risk that exposing it to real customer traffic will surface an issue you hadn't found before. So today I'm going to show you how to do a completely open source, cloud native canary deployment with application metric analysis, and we're going to use Pixie to provide the application metric analysis.

A canary deployment strategy reduces risk by diverting a small amount of traffic to your new version. Metrics from the canary release inform the decision to increase traffic to the canary version or roll back to the stable version. This diagram is simplified, but you can have multiple analysis steps that each collect and analyze metrics.

Canary analysis is more of an art than a science. A canary release doesn't guarantee that you'll identify every issue with your new application version, but a carefully designed one maximizes your chance of finding them. You should make sure your canary pods get enough traffic that you're going to surface issues, but not so much traffic that if you do discover an issue, you've exposed a lot of your users to it. The rule of thumb is to give your canary at least 5% to 10% of your traffic and to try to overlap with your peak traffic periods. The trade-off is that the longer you run the analysis, the better your data will potentially be, but the more it reduces your development velocity. There are a lot of strategies around this; the one we use is a short first step that fails fast, followed by longer subsequent steps. You should also make sure your most critical services get the most analysis.

For API-based services, the common metrics to analyze are latency, error rate, and throughput, but it's totally up to your service. For example, at New Relic we use Kafka as our messaging system, and we use Argo Rollouts with canary analysis to monitor Kafka lag, because consumer/producer lag can definitely cause incidents.

You can actually perform a canary deployment manually using native Kubernetes, but the benefit of Argo Rollouts is that the controller manages all of this for you. Argo Rollouts is the controller plus two CRDs: the Rollout resource, a drop-in replacement for the native Kubernetes Deployment that contains the recipe for splitting the traffic and performing analysis on the canary version, and the AnalysisTemplate, which contains the instructions for the metric analysis and defines its success criteria. These two CRDs provide tons of configurability; we're just going to do one example today.

Here's our Rollout resource on the left, and at the bottom you can see the steps: we initially direct 33% of our traffic to the canary, which is way too much traffic, but this is for example purposes. Then we pause 60 seconds to give our analysis some time to run, increase the traffic to the canary to 66%, and pause again to run more analysis. On the right, we have a background analysis running: every 30 seconds, after an initial delay of 30 seconds, we check the error rate of our canary pods, and if that error rate is greater than 5%, we fail fast. Both resources are sketched below.
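Here's roughly what those two resources might look like, as a minimal YAML sketch rather than the exact manifests from the demo. The step weights, pauses, 30-second interval and delay, and the 5% threshold come straight from the slides; the `web` provider and the `pixie-metrics` endpoint are assumptions standing in for however you wire the Pixie query into the analysis in your own setup.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rollouts-demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: rollouts-demo
  template:
    metadata:
      labels:
        app: rollouts-demo
    spec:
      containers:
        - name: rollouts-demo
          image: argoproj/rollouts-demo:blue
  strategy:
    canary:
      # Background analysis that runs alongside the steps below.
      analysis:
        templates:
          - templateName: canary-error-rate
      steps:
        - setWeight: 33            # deliberately high, for demo purposes
        - pause: {duration: 60s}   # give the analysis time to run
        - setWeight: 66
        - pause: {duration: 60s}   # more analysis before full promotion
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-error-rate
spec:
  metrics:
    - name: http-error-rate
      initialDelay: 30s                # wait 30s before the first check
      interval: 30s                    # then re-check every 30s
      failureCondition: result > 0.05  # fail fast above a 5% error rate
      # failureLimit defaults to 0, so a single failed measurement
      # aborts the rollout.
      provider:
        web:
          # Hypothetical service that runs the PxL query shown later
          # against the Pixie API and returns {"errorRate": <float>}.
          url: http://pixie-metrics.default.svc/canary-error-rate
          jsonPath: "{$.errorRate}"
```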
So how are we going to get the error rate for our canary pods? This is where the open source application metrics come in. Pixie is an open source observability tool for Kubernetes applications, and it's a CNCF incubating project. One of Pixie's main features is that it automatically traces protocol messages. For example, this demo makes HTTP requests, and instead of instrumenting the application, we can just deploy it into an environment that already has Pixie deployed. Pixie traces the Linux syscalls related to networking, so anytime the application makes a network request, we trace it at the syscall level. I'm going to show you this live rather than spend time explaining it. Can we go to the demo?

OK. Here's our demo app, the Argo Rollouts canonical demo application, on the right, and I haven't changed it at all. Every bubble is a request our browser is making to the back end, and the back end responds with the color indicating the version of the application we've deployed. There's a bar chart at the bottom, and right now you can see that all of our requests are being handled by the stable blue version.

We have Pixie on the left, and this is just a script that shows us all of the HTTP requests. Again, we didn't change the Argo Rollouts demo at all; we're just looking at the requests here. We can filter down to the requests to the canary pods. Oops. So now we're looking at all the canary demo requests, and if we expand one, we can see the endpoint it's being sent to, the request headers, the request body, the response status code, and the response body. Pixie is very scriptable, so we can, for example, group all of these requests by response, and we can see that we're only getting responses from the blue version, and we're only getting the 200 status code.

Over in the Argo Rollouts demo, we can increase the error rate, which is shown with the little red square. If we rerun the Pixie script, which is just looking at the last five or ten seconds, we can see we're starting to get 500 status codes. We can also increase the latency, rerun the script, and see that our latency has dramatically increased. So we're going to use Pixie for our metric analysis, and it does all of this using eBPF. The error-rate query itself could look something like the sketch below.
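As a minimal PxL sketch, here's what that kind of error-rate query could look like, assuming Pixie's http_events table and, for simplicity, that the canary pods carry "canary" in their names; in a real setup you'd more likely scope the filter by the canary ReplicaSet's pod-template-hash.

```python
# PxL sketch: HTTP error rate per canary pod over the last 30 seconds.
import px

df = px.DataFrame(table='http_events', start_time='-30s')
df.pod = df.ctx['pod']                     # resolve the pod name for each event
df = df[px.contains(df.pod, 'canary')]     # demo simplification: filter by name
df.error = px.select(df.resp_status >= 400, 1.0, 0.0)

stats = df.groupby('pod').agg(
    requests=('resp_status', px.count),    # total HTTP requests seen
    error_rate=('error', px.mean),         # fraction of 4xx/5xx responses
)
px.display(stats)
```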
OK, so now I'm going to refresh this page and do the actual canary rollout, which, again, is the demo in the Argo repo. Basically, I'm going to demo a rollout of a new application version: we're on the blue one, so we'll roll out the yellow one. Up above here, you can see the kubectl plugin for Argo Rollouts. Our canary is this yellow version, and the blue version is our stable version. We've spun up a canary ReplicaSet here, and it's running, and we're going to do our analysis every 30 seconds after an initial 30-second delay. Let me refresh this over here. This green check means that we've done one analysis run and it succeeded. We should be seeing yellow over here. All right... oh, there we go.

Now we can see that some of the traffic is being directed to our new yellow application version, and analysis will just keep running in the background according to our AnalysisTemplate. But we can inject some errors and see what happens. According to our AnalysisTemplate, if we have more than 5% HTTP error status codes, we should fail. The errors are represented by these red boxes here, and we've got to wait 30 seconds for another analysis run to happen. Fingers crossed, this works. Oh, there we go, great.

You can see we had two metric analysis runs that succeeded and one that failed, we're tearing down our canary ReplicaSet, and our rollout has been degraded. Up above, you can see the rollout is now aborted, and it tells you why. If you want to see the details, you can get the AnalysisRun and look at the actual metric values: our error rate for the first two runs was very low, but for the next run it was at 60%.

Can you switch back to the slides? So that's how you can do a completely open source, cloud native canary rollout. Pixie can provide a lot of other metrics, too: for all of the protocols listed, you can get latency, error rate, and throughput, and you can also get those metrics by request path. So take a chance on Pixie and try it out today. It's another CNCF project, and we'd love to hear back from you.