First, I'd like to thank everybody for joining us today. Welcome to today's webinar, Delivering Progressive Delivery with Service Mesh. I'm Taylor Wagner, the Operations Analyst for CNCF, and I'll be moderating today's webinar. We'd like to welcome our presenters today, Andrew Jenkins, CTO, and Zach Jory, Head of Marketing, for Aspen Mesh. Before we get started, there are a few housekeeping items we'd like to go over. During the webinar, you are not able to talk as an attendee. There is a Q&A box at the bottom of your Zoom screen; please feel free to drop in your questions and we'll get to as many as we can at the end. With that, I'll hand it over to Andrew and Zach to kick off today's presentation.

Great. Thanks, Taylor. Zach Jory here, Head of Marketing. I'm looking forward to spending a few minutes up front in the presentation talking about progressive delivery, some of the concepts behind it, why we think it matters, and why we think a service mesh helps with it. Then we'll get to the fun bit after that, where Andrew will do a couple of demos showing how you can actually use a service mesh to do progressive delivery.

So, progressive delivery is an idea that, to give credit where it's due, James Governor of RedMonk introduced a year and a half or so ago. There were a couple of companies doing progressive delivery at the time, Weaveworks, LaunchDarkly, the engineering team at Target, and they were using a stack of tools to do CI/CD around Kubernetes in a new way. Things have changed in the way we deliver applications. Microservices have become all the rage, and the way you build, deliver, and deploy Kubernetes-based applications requires a somewhat different way of thinking. That's where progressive delivery comes in. These companies I mentioned, LaunchDarkly, Weaveworks, were using tools like Kubernetes, GitHub, service mesh, and feature flags to roll out software deploys to small sets of users, see how those went, and use the learnings from those deploys to either roll back if things didn't go well or make changes based on what they were learning.

When you combine this set of tools and do software delivery this way, it's really cool, because progressive delivery means software deployment does not necessarily equal software activation. Just because your dev team is doing deploys doesn't mean you have to roll out those features to all your end users at once. This is very similar to ideas you know from CI; I like to think of this as CI 2.0, kind of the new requirements for how you deliver software in the microservices world. CI was really great for the business because it gave the business a way to limit the blast radius around software deploys, so deploys weren't as scary. I think progressive delivery takes that a step further by not only allowing you to turn hand grenades into firecrackers, but also letting you put those firecrackers out very quickly if there are problems, and even put a Band-Aid on if one of those firecrackers causes a wound.

One of the really neat things about progressive delivery is that the business actually cares about it. This controlled-empowerment idea is about the sense that the business can say when it's acceptable for new code to hit the end user. So dev teams are not constrained by the business; they can still do rapid, iterative development. But the business can say we're going to package that all up, and we can wait for 6, 10, 12 deploys until we activate that for the end user.
This is really cool because it gives the business psychological safety around deploys, and I think rapid software development is not as scary for the business side of the house anymore. Deployment is not the same as release. Service activation is not the same as deployment: devs can deploy a service and it can ship, but that doesn't mean it hits all your end users. And this is a great way for the business side of the house, the application owner, to avoid putting any handcuffs on the dev team. The dev team can keep moving at the pace they want to move at, and the business doesn't have to be quite as terrified of that as it has been in the past.

So there are really a couple of different pieces here. One is the psychological safety for the business I talked about on the last slide. The other one that's really neat is the fact that this provides a much better developer experience. Decoupling releases from rolling out new features to customers is a really good way for the business to be satisfied without forcing dev teams to comply with business controls and concerns. That means app owners can say when new features hit users, which means the business can not only feel better about the way things are getting rolled out, but also meet a lot of security and compliance concerns more easily by putting some guardrails along the way your developers roll out new deployments.

So there are, as I said, a few pieces to progressive delivery. The part we're going to focus on in the second half of this webcast is how a service mesh enables progressive delivery. Feature flags and rolling deploys are other pieces; LaunchDarkly, which I mentioned earlier, is a great tool for using feature flags for progressive delivery. One of the things that we have been using with our customers is a service mesh, and it allows us to do progressive delivery things like canary deployments and traffic mirroring, which are really cool because you can roll out an experimental service, see how that service is performing with a set of users, and use those metrics to decide whether to roll it out further or roll it back. And with that, I'll turn it over to Andrew to dive a little deeper into how you can use canaries and traffic mirroring, and do a couple of demos.

Yeah, that sounds great. Thanks, Zach. So let's go ahead and take over from where Zach was. As Zach mentioned, you've got a bunch of components to progressive delivery. It's all about decoupling the deployment of software from the way that users encounter it. That way the business can make trade-offs, and those can evolve independently, so you can decouple and let software teams go as fast as they can. However, you still need some machinery to manage how you're going to actually activate it for new users and how you're going to deploy it. Part of that story is these new deployment techniques for deciding how you're going to shift traffic and expose it to users. What we're going to demo today is canary deployments and traffic mirroring, which is kind of a new thing that a service mesh lets you do, with some interesting opportunities to improve on the way you might have used or heard about canaries before. What we're going to demo here is a progressive delivery operator called Flagger. This is the Flagger architecture, and Flagger's job is to deploy a new canary and then shift traffic to that canary based on different criteria.
For this demo, we're going to step through ramping a random subset of traffic; there are other options as well. Flagger observes what happens to that new canary service, compares the performance of the canary to the primary service, and continues with its progressive deployment and activation of the canary service as long as it's healthy. If it becomes unhealthy, if the canary is performing out of spec relative to the primary or relative to absolute thresholds, then you can say, okay, wait, hold on, this activation is failing. So we're going to roll back to the primary, cancel this activation, let somebody figure out what's wrong, and wait for a new candidate version that we can try activating along those same lines. The service mesh tie-in here is that for actually choosing what traffic goes to the primary and what traffic goes to the canary, you can use a service mesh to do that kind of traffic shifting. Later we'll see how you can also do something called traffic mirroring that lets you do a little bit of both before you activate the canary at all, and we'll talk about why that's good or bad.

So this is an example of a canary deployment through a timeline. What we're looking at here is: we've got an old version of the service deployed and fully activated, so 100% of traffic is going to the old version of the service. We've got a new version. It's passed our continuous integration tests; we're ready to go, ready to actually use it. It's deployed, it's up and running. So now we're going to do this progressive activation piece. We're going to start by shifting some traffic to the new service. Then we're going to pause for a while, measure, and confirm that the new service is performing successfully. If that's true, then we'll keep going and shift a little bit more. We'll proceed on and on through the progressive deployment until we decide, okay, for our activation thresholds, we've shifted traffic over to this new service, it's performing healthily, great, and we can go ahead and complete the activation for all users and shift 100% of traffic to the new service. So it's this kind of shift-measure, shift-measure loop, and the service mesh is involved in shifting traffic to the new service as well as collecting the measurements in a way that we can compare.

So at this point, I'm going to switch over to the demo and we'll see this in action. Let's see. Actually, I'm going to get out of here. So I'm going to go through this demo. I'll copy this URL here and slap it down in the chat, so you can play along at home or try this out yourself with the examples, if you follow that link. So let's go ahead and look at this. I've got to make sure Prometheus is up and going. All right. I'm going to shrink this so that we can see just the current window of time here. Right now I've got my old service running; it's taking all the requests currently and nothing is shifting over to the new service. Up top, we'll watch the Flagger operator, and it's going to tell us what it's doing and why as we step through here. And then I've got what is called a Canary resource.
This is how you tell Flagger what traffic to shift where, and how to work through this progressive delivery scenario. This Canary is going to say: hey, every 30 seconds we're going to move one more step through our progressive delivery, our canary deployment. And what we're going to do is compare the request success rate and the request duration between the two services, the primary service and the new canary service. These are handles for Prometheus queries inside Flagger that go and look at the service mesh underneath, like Aspen Mesh, which is collecting the metrics from the canary deployment. And then here we're saying every step is going to add a 10% traffic shift. In this version we're not going to do the mirroring stuff that we'll talk about later, so we've got mirror set to its default.

So what we've got right now is a basic primary service running version 2.1.1. We've got our Canary configured, and Flagger is sitting there ready to go. What we'll do is try to update this image to a new version, 2.1.2. I'm going to do this manually and directly in the demo, but in production usage this is driven by whatever your deployment scenario is. If you're using GitOps, then it's your GitOps operator configuring this; maybe it's a CD system that's driving this for you, whatever it is. But for the purposes of this demo, we're just going to play around with updating this image directly. So I'm taking the place of your GitOps operator, and I've just said, okay, great, we're now saying that we want to try this candidate image, podinfo 2.1.2. And I've got a few canned watchers so we can see what's going on here.

So Flagger has noticed the new revision detected, and it's deploying some 2.1.2 pods. It's going to wait for those to get started and show up, but right now it's still saying, hey, all the traffic is going to go to this primary version of the service. So we'll wait for just a bit, and now the 2.1.2 pod is actually available. Pretty soon Flagger notices and starts shifting some traffic, 90/10, between the primary service and the canary service. We can take a look over here, and Prometheus can lag just a little bit. We'll go ahead and load this, and we can see that there's a little blip of traffic starting to go over to the canary version of the service. Keep on going. You can see, okay, now we're stepping it up. And if we flip over here, we can see Flagger is saying, okay, yeah, we're getting some success, so it's starting to shift 80/20. It'll go to 70/30 over time, and then we'll end up with, in this case, the canary version of the service hopefully happy, and we'll say, okay, great, this 2.1.2 version is our new primary version, and complete the deployment. Go back over here, and a little bit more of the traffic continues to step up on the canary version of the service and starts marching down on the primary version of the service. Keep shifting traffic. For the purposes of the demo, I've got a pretty quick ramp here, only 30 seconds on each stage. That's to make the demo go fast enough that it's interesting to watch. In production, you're probably looking at letting this soak for quite a bit longer, depending on how rapidly you do these deployments.
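For readers following along at home, here is a minimal sketch of the kind of Canary resource Andrew is describing, assuming the stock Flagger-on-Istio setup from the linked demo. The app name, namespace, port, and exact thresholds are illustrative, and field names vary a little between Flagger versions, so treat this as an outline rather than the exact file from the demo repo:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo          # hypothetical target app name
  namespace: test        # hypothetical namespace
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo        # the plain Kubernetes Deployment Flagger will manage
  service:
    port: 9898           # assumed service port
  analysis:
    interval: 30s        # step every 30 seconds, as in the demo
    threshold: 5         # how many failed checks before rolling back
    maxWeight: 50        # stop ramping at a 50/50 split, then promote
    stepWeight: 10       # add 10% of traffic per step
    metrics:
    - name: request-success-rate   # built-in handle to a mesh Prometheus query
      thresholdRange:
        min: 99                    # halt if success rate drops below 99%
      interval: 1m
    - name: request-duration       # built-in handle for request latency
      thresholdRange:
        max: 500                   # halt if latency exceeds 500 ms
      interval: 1m
```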
All right, so let's watch it continue to march more and more traffic over to the canary version of the service, and less and less to the primary version of the service. And we're finally at 50/50. So this means Flagger has now decided that, hey, everything's good, and we're ready to go ahead and make 2.1.2 the new version. This is a successful canary deployment from Flagger's point of view, so it's going to update the primary to be our new 2.1.2 version. It says it's finalizing this canary, getting everything else done. And here it goes, shifting all traffic over to the primary, and then the canary is ready to be used again for the next canary deployment. We'll see here that it takes a little while for the Prometheus metrics to flow through, but then we go back to the stage where all the traffic is going to the primary. The primary is now new; it's the 2.1.2 version of the service. There you go. So now it's all shifted over; traffic is totally shifted to the primary. Okay. So that's a successful canary deployment with traffic shifting.

The next one I can show you is what happens on failure. What happens on failure that's pretty cool is that we can shift, we can measure, and we can say, oh wait, this traffic is not as healthy as the old traffic, so we should actually roll back. We can step through that scenario now. The one thing that I know I have to watch out for here is that I always put the wrong deployment in there, so let me make sure. There we go. What I'm going to do is try deploying an image that I've configured to always give back an error response. So imagine this is a busted build that somehow slipped through CI, and it's got some bug where it responds with failures to some subset of requests. You can now detect this with a canary, and we'll see how, by doing it with traffic shifting or some other approach, you can limit, just as Zach was saying, the impact of letting one of these broken deployments get out. It doesn't have to be activated for all users, and you can detect it quickly and roll back automatically.

So I'm going to go ahead and try to deploy that one. Let me flip back to my watch loop here, and we can see, okay, I'm pretending to be the CI or GitOps operator again, and I've said we're going to try deploying this new image. Once Flagger notices it, it's going to go ahead and start scaling that up. So there you go. Flagger notices and says, okay, cool, I'm going to execute this canary now. Let's go ahead and scale up that container, get it to be available, and then start shifting some traffic to it. Okay. So now that that new candidate build is available, Flagger has noticed and is shifting the traffic 90/10 here. And here we should see something a little bit different: this top plot shows only successful traffic that didn't have 500s, and down here I'll show you traffic that does have 500s. There's some amount of traffic shifting over from the primary, so here it goes, you see a little dip there. You don't see this count up in the top view because these are all errors, but what you'll see down here is that there's now a high rate of errors coming out of this service. If we go over to Flagger, we'll see: oops, halt podinfo.test advancement, success rate is less than 99%.
So this is not happy, so we're not going to progress. Flagger will sit there, try a couple more times, and then decide that this is all busted, and it will just shift traffic back: 100% to the primary, 0% to the canary. And it will mark this canary as failed. If you're an operator, you can go look at why, what happened to this build, why there were all these errors. And then you just continue: the next time you have a new build that hopefully has this bug fixed, you deploy that build and Flagger will repeat the process, but be happy with the canary that time, and then do a successful delivery. So now it said, oh yeah, this is no good, routed 100/0 back, and left us a note that the canary has failed. Now we can go take action, but we've automatically rolled back, so our production deployment, all of our users, are seeing responses from the healthy 2.1.2 primary deployment. And now we can pause; we don't have to panic, we can take our time to figure out exactly what went wrong. So that's neat. It did that, and you've got traffic all back on the primary.

Now we want to talk about something where we thought the service mesh gave us an opportunity to do something pretty cool, so we did an experiment to add this to Flagger. This is not committed to upstream Flagger yet because it's not ready; we haven't figured out enough about how other service meshes do this sort of thing, and there are some caveats that we'll go through. But this is the traffic mirroring case. The source code is available; if you go to that demo, you can see the source code. It's just that this mirroring part is probably not ready for production usage yet, and we'd like to get it upstream into Flagger so that it's available for everybody.

Traffic mirroring is cool because it's a little bit different. In this mirroring pre-stage, we're going to actually duplicate the request. One copy is going to go to our primary and one copy is going to go to the canary. The user sees only the response from the primary. The service mesh is going to take care of copying these requests, and it's going to take care of taking the response back from the primary and sending it on to the user. For the duplicate that it sent to the canary, all it's going to do is collect the response, drop the response on the floor, but still collect the metrics and statistics about how long it took and whether it was a success or a failure. It's not going to send that actual response on to anybody like the end user; it's just going to drop it on the floor. What this enables, which is pretty cool, is that if the canary is a bad release, you can detect it without actually sending any bad responses back to the end users. In the traffic-shifting case, we said, okay, we're going to shift 10% of traffic to the canary build, right? And that meant that while we had that deployment going on and we were in that 90/10 split, 10% of our users were getting those bad responses. That's a lot better than 100%, right? This is part of the progressive delivery story, where you can limit the activation scope. But, you know, 10% is still more than zero.
So if you can improve on that even more by saying, hey, we're going to do some pre-testing where we try to filter out bad builds in advance, then we think mirroring is a cool opportunity to do that kind of thing. The caveat here is that the service mesh is going to mirror the traffic by duplicating the requests: it's going to send one request to the canary and one request to the primary. That means there will actually be two copies of each request that you choose to mirror this way, and that means those requests must be what's called idempotent. That means your system, if it receives a copy of the request, will still be okay. Idempotent requests cover GETs, since all read operations are naturally idempotent, and also special kinds of writes. If you have requests that are writes, like PUTs or POSTs, that are semantically kind of like this thing on the left, hey, deduct $100 from my account, then those are not idempotent, because when we duplicate the request, you'll get two requests that say deduct $100 from Andrew's account. That would mean I've been deducted $200, and I'm not happy about that. So that traffic is not idempotent, and it's not compatible with this mirroring. The kinds of requests that are idempotent are all GETs, like, hey, get my account balance; you duplicate that, it's fine, it's a read. Also, here's an example of an idempotent write: set my time zone to Mountain Daylight Time. If you do another copy of that set, that's fine. Idempotency is a good thing to have in your system anyway, because since we're dealing with unreliable networks, we have a hard time guaranteeing that there's only one copy of a request and that it's reliably delivered. This is pretty interesting to a lot of people, so we're publishing a blog this week that goes into a lot of detail about what idempotency is and why you want it in general. And then that unlocks the door for this mirroring kind of operation.

All right. So mirroring, as I talked about, is this pre-stage. This is going to look kind of like the other graph, where we've got old traffic up here and new traffic down at the bottom. When we run the canary, the first thing we're going to do is a mirror pre-stage. Here we're going to copy traffic, that's the little blue dotted line there, and send it to the mirror. We'll do this pre-stage, let it soak for a while, and as long as that canary is happy according to the service mesh statistics, we'll measure it here and say, okay, good, we're ready to proceed to the next stage. Then we'll start with the shift-traffic, measure, shift-traffic loop, going through the whole rest of the process. And then the canary proceeds as normal, we hit success, and we say, great, we're good.

All right. So let's see this in action. I'm going to go over here and edit the Canary for podinfo, and I'm just going to turn on this mirror property in the Canary. So we're telling Flagger, hey, we want you to do a mirror stage before you do all the rest of the normal canary deployment stuff. And I'm going to deploy podinfo 2... oops. That's what I always do, I get the wrong one. Okay, cool. So I'm going to deploy 2.1.2, and we will see.
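As an aside for readers following along: the mirror property Andrew just toggled lives in the Canary's analysis section. The demo uses Aspen Mesh's experimental Flagger build, so the exact field names there may differ, but later upstream Flagger releases added a similar knob for Istio-based meshes, roughly like this sketch:

```yaml
  # inside the Canary spec shown earlier (field names are illustrative)
  analysis:
    interval: 30s
    stepWeight: 10
    maxWeight: 50
    mirror: true        # run a mirroring pre-stage before any traffic shifting
    # mirrorWeight: 100 # optionally, what fraction of requests to duplicate
```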
Hopefully I didn't talk so long that stern got disconnected. All right. So update that image, and we will go ahead and wait for Flagger to notice. Oh, whoops, I did something wrong, which is that I tried to deploy the same image again, so Flagger is going to ignore that. It says you're not really doing a very interesting deployment if you're deploying the same thing. So I'll go ahead and step forward one more version here. Now we've got a canary deployment that we're proposing: hey, Flagger, we want you to do this, and we want you to do this traffic mirroring thing first. So we'll go ahead and wait for Flagger to notice. Out of here, we'll go back to Prometheus. That loaded up. There we go. So Flagger noticed and is waiting for our new version to be available to do the traffic mirroring. We're going to see, down here, the command to the service mesh for how to split traffic. It's going to have a mirror pre-stage that runs for a bit and does the mirroring stuff we talked about.

Cool. So now you can see it ran. This now looks different: it's saying, hey, do some mirroring right now. So we're in the mirror stage, and we're not actually sending any production traffic to the canary aside from the special traffic that we're mirroring. We're dropping the responses on that mirrored traffic on the floor so that we're not sending them back to users. And over here, we can see, it's a little bit hard to see, but the mirrored traffic is going to be 100% of the primary traffic; it's the same for the pre-stage, you mirror all the requests in this pre-stage. With mirrored requests, when you use a service mesh like Aspen Mesh or Istio that's built on Envoy, the mirrored traffic has its authority or host header modified so that it says -shadow at the end. That's how, if you have other tracing systems or logging systems, you can differentiate: that shadow suffix is the clue that, hey, this was actually mirrored traffic, so you can compare.

Cool. So the pre-stage just went successfully, the mirrored traffic completed, and Flagger decided, hopefully, that it's okay. So now it's going to start shifting traffic over to that canary. Flip back and see, yep, the canary says it's happy. It's progressing through; it went 90, and then 80, and it's all the way up to 70; it's got a 70/30 split. We're now into that normal canary traffic-split scenario. The service mesh is continuing to split traffic according to the commands that Flagger gives it. And then once we're all happy here, it'll go 60/40, 50/50, and then we're good. So we're up to 60/40 now. We'll wait just a few more seconds and it should go all the way to 50/50. We can see, so here's the mirroring pre-stage, and then here's the canary ramp-up, and then we're about to hit 50/50. And if that's healthy, we'll go ahead and declare the canary a success. There we go, 50/50. So now Flagger has shifted; we're deploying 2.1.3 as the new primary. Once that's finished, Flagger's next step is to go ahead and shift traffic 100% to the new deployment and be ready for the future, for other deployments that we want to do. Routing all traffic to the primary, 100/0. Flagger's finalizing. And I think at this point, we don't need to watch the tail end. So that's Flagger progressive delivery in a nutshell.
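For context, the "command to the service mesh" Andrew scrolled past is an Istio-style VirtualService that Flagger writes and owns on your behalf. A rough sketch, with hypothetical host names, of what the mirror pre-stage looks like at the mesh layer is below; when mirroring is active, Envoy duplicates each matched request to the mirror destination and appends -shadow to the Host/authority header on the copy:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: podinfo                 # hypothetical; Flagger generates this object
spec:
  hosts:
  - podinfo
  http:
  - route:
    - destination:
        host: podinfo-primary   # users' requests and responses still go here
      weight: 100
    - destination:
        host: podinfo-canary
      weight: 0                 # no real traffic shifted during the pre-stage
    mirror:
      host: podinfo-canary      # duplicate each request to the canary; its
                                # response is dropped, only metrics are kept
```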
I pasted the links into the chat if you want to try this at home; all the commands and YAMLs to reproduce the same demo are right there. Great. So at this point, I think my demo is finished, and I'll go ahead and open it up for questions.

Great. Thanks for all the great questions in the Q&A. Taylor, I see quite a few there. Should we just start going down the list and giving answers? Yes, let's do that.

All right. So the first question, from an anonymous attendee, is: can this be deployed in OpenShift? Yeah. I'm pretty sure this can be deployed in OpenShift; OpenShift has a lot of Kubernetes inside, right? And the way that Flagger works is that it takes normal Kubernetes workload descriptions, like Deployments and Services, and then commands one or more service meshes underneath to do the traffic shifting. You should go check Flagger's page for the details around compatibility, but basically those are the two pieces that you need: you need Kubernetes deployments and you need a service mesh. We're Aspen Mesh, we love Aspen Mesh for that service mesh, but you need a service mesh that Flagger knows how to command underneath.

Great. Thanks. The next question is: what CNI is supported? That's a good question. This layer is independent of the CNI. The trick is that, again, your service mesh is going to be the thing doing the traffic shifting, so your service mesh has to be compatible with whatever CNI you have underneath. In this demo, I happened to use Flannel, which is VXLAN underneath. But most service meshes are compatible with most CNIs. We operate a layer above that, Aspen Mesh operates a layer above that, so we don't really care about the CNI.

Okay, great. The next question, from Jeff: curious if canary analysis can incorporate metrics from other sources besides Prometheus, such as an APM tool that provides a REST API for getting performance metrics for the canary. Yeah, that's a great question. Let me show you where the metrics came from; we talked about it just a little bit, but it's worth going into here. Here's where I specified the metrics, request success rate and request duration. These are handles in Flagger's code that say: hey, if the user is running on top of Istio or an Istio-based service mesh like Aspen Mesh, then query Prometheus this way so that you can get the corresponding service mesh metrics back out. This is an easy way so that users don't have to bother with all the details of what query to run against Prometheus, and where, and all that sort of stuff; it's designed to make that pretty easy. But you can do that yourself. You can check the Flagger GitHub repository and documentation for how you would make your own metrics here. You could point them at Prometheus; you could point them, as you say, at some other APM tool, whatever you want. You can list more and more metrics here. This is basically a handle to a template that Flagger will fill in with the names of the deployments and other things it's controlling. So yeah, you can definitely go wild and pull in other data sources.
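To make that answer concrete: newer Flagger releases expose custom metrics through a MetricTemplate resource that a Canary can reference by name. A rough sketch, with a hypothetical name, provider address, and query (check the Flagger docs for the exact fields in your version), might look like this:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: custom-error-rate         # hypothetical template name
  namespace: istio-system
spec:
  provider:
    type: prometheus              # could also point at another metrics backend
    address: http://prometheus.istio-system:9090   # assumed Prometheus address
  query: |
    100 - sum(rate(istio_requests_total{
              destination_workload_namespace="{{ namespace }}",
              destination_workload="{{ target }}",
              response_code!~"5.*"}[{{ interval }}]))
          / sum(rate(istio_requests_total{
              destination_workload_namespace="{{ namespace }}",
              destination_workload="{{ target }}"}[{{ interval }}])) * 100
```

The Canary's analysis section would then reference it via a templateRef (name and namespace) alongside a threshold, in place of or in addition to the built-in request-success-rate and request-duration handles.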
Okay, great. The next question is from Jim. He actually asked two quick questions right in a row, so I'll give you both. The first is: which metrics are you using to make the canary decision? How are you determining if the app is good and ready? And right after that he asked: seems like failure rate based on response codes?

Yep, so there are a couple of parts here, and I'll go through all the details. The first is that when I said, hey, deploy this new image, here's our new build that we want you to run, that was adjusting a plain old Kubernetes Deployment. That's the glue between whatever CD system you have and Flagger. And that plain old Kubernetes Deployment, in production usage, should have all the normal things you should do for a Kubernetes Deployment: it should have liveness and readiness checks associated with it. Once you do that, Flagger controls how many of those canaries to spin up. It'll spin up one and let autoscalers take over if you need more; it just says, hey, I want this to be available, and lets the plain old Kubernetes machinery take care of deploying it. And just like all other Kubernetes deployments, the pods are not available to be part of a service until liveness and readiness checks have passed. So that first stage, before Flagger can even start mirroring traffic or shifting traffic, is a plain old Kubernetes Deployment. That's great, because it means you can use all the stuff you've already got for managing rolling deployments. The things I really recommend are health checks: you want liveness checks and readiness checks before you even start shifting traffic. So that's the first part, just getting the deployment up and running.

Then the next part is how you start deciding that it's ready to take traffic, and that's where service mesh, Flagger, progressive delivery, and canary deployments start coming in. The first scenario I showed is that, okay, we've got one that's deployed, it's passed readiness checks, and now we're going to start shifting some traffic to it and see if it's healthy or not. And then the extension we showed was, hey, maybe you want this pre-stage where you mirror some traffic in advance, and that gives you extra confidence that it's good and ready to even start taking some production traffic. So this is really about a couple of layers of pre-testing, so you can make sure that before it goes out to some subset of your users, and then eventually 100% of your users, it's good and ready, as you put it.

And then you asked about response codes. Yeah, that's the analysis in the measurement part, in the shift-and-measure loop I showed in the slides. In the demo, that measurement is based on response codes and latency. We showed the example where failing response codes mean rollback, and just as I described in the other question, you could choose other metrics for your application.
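A minimal sketch of the kind of plain Kubernetes Deployment being described here, with the liveness and readiness probes Andrew recommends wired up. The image tag, port, and probe paths are illustrative, not taken from the demo repo, so adapt them to your own application:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podinfo                  # the Deployment referenced by the Canary's targetRef
spec:
  replicas: 1                    # Flagger and any autoscaler decide the real count
  selector:
    matchLabels:
      app: podinfo
  template:
    metadata:
      labels:
        app: podinfo
    spec:
      containers:
      - name: podinfo
        image: stefanprodan/podinfo:2.1.2   # candidate build under test (illustrative tag)
        ports:
        - containerPort: 9898
        livenessProbe:           # restart the pod if it stops responding
          httpGet:
            path: /healthz
            port: 9898
          periodSeconds: 10
        readinessProbe:          # keep the pod out of the service until it's ready
          httpGet:
            path: /readyz
            port: 9898
          initialDelaySeconds: 5
```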
Okay, we have more questions about canary deployments. This one's from Samet: will the canary deployment work for TCP traffic as well? So for TCP traffic, the trick is that you have to have some way to decide whether it is healthy or not. In theory, this would work as long as there's some Prometheus metric we can grab somewhere; it may be a different metric that your application layer computes and reports, where the app can be trusted to say how successful or failing it is. That would mean you'd have to write your own custom metric to go find one of those. So it works pretty well out of the box, as we demoed, for HTTP. If you want to do TCP or your own custom stuff, the problem is that you have to be able both to collect the metrics and to shift traffic. At the TCP layer, the service mesh can't tell the difference between one request and the next, so the best it can do is shifting based on connections. That's kind of coarse-grained, but you can still do it. The mirroring approach does not work for TCP traffic; as far as I know, Envoy cannot mirror raw TCP traffic, and since it doesn't understand anything about the semantics of what's going on inside, it wouldn't be very helpful. Mirroring is best suited for layer 7 protocols that your service mesh understands. That's a good question.

So this one might be similar; you can let me know. The traffic which the canary probes is HTTP; in case this is database traffic or any other TCP traffic, will the canary still work? Yeah, I think that one sounds the same as the previous one. Okay, it was the same person asking, so we'll just keep going.

Is the canary analysis only comparing against static thresholds by default, or is it more intelligent? So this one is configured to compare against static thresholds. Flagger might have more smarts inside of it, but I'm not aware of that.

Okay. Will the service mesh work across multiple Kubernetes clusters? That's a great question. It depends on the service mesh. Aspen Mesh, which is built on top of Istio, and Istio and other Istio-based service meshes do work across multiple clusters. It just depends on the service mesh implementation that you chose.

Okay. The next question is: do you have some telemetry features? Telemetry features. Yeah, I see another one. Yeah. Aspen Mesh has a service graph; Aspen Mesh has telemetry collection, things like that. We didn't show that off in this demo; we focused just on progressive delivery. You can head over to our website and get a whole bunch more detail on that, and we'd be happy to show it off to you. Set up a call with us or check out our demo or open beta. That just wasn't the focus of this webinar.

Okay. This one is from an anonymous attendee: I think the Git code state would be different from the cluster deployment state when a rollback occurs on failure. What do you think of that? That's a really great question. Let me talk about what's going on inside there. So the question, I think, is about what happens if you have a failed deployment. It might take a minute, but I'm going to try to set this up, and then we'll talk about it while it's going, so you can kind of see what's happening. Oh yeah, let me share my screen again. Thank you. I'll just talk through it, because it might take a few minutes for the demo to catch up. So if you have a bad deployment, a bad canary deployment, then what happens is you roll back, and then the Git code state might be different from what's running in production.
So there are a couple of pieces here. One is the GitOps half of it. That's where what you're trying to do is say: I want to manage the state of my cluster, all the applications running in it, the way I manage source code in Git, so that I can code review it, merge it, review history, all that sort of stuff. That is still supported, because the key is that when I did my deployment of the broken image here, I touched this podinfo Deployment. What happens when you set up Flagger is that the first thing it does is go and create a second Deployment for you, one that Flagger owns, and that is not reflected in your GitOps repository or declarative view of the cluster. No human has configured it; the Flagger operator took care of it. This is the special podinfo-primary Deployment. The contract is that Flagger gets to own what podinfo-primary is running, and that is always the most recent version of podinfo that has passed the canary deployment scenario. So you do have to get used to the fact that what you have in version control for podinfo may not reflect what you're actually running for podinfo-primary. That's why you have to let Flagger manage this podinfo-primary Deployment and leave it out of your view of your declarative config, out of what you're putting into your GitOps operator. You have to decouple those, and you do have to account for that. It means you can still treat your code, your image, and all that sort of stuff as you normally do, where you're just continuing to roll forward and create new deployments, but you do have to keep in your head: hey, the primary may not be the most recent one we've published that passed CI. It's actually the most recent one that passed CI and also passed this progressive delivery scenario. So if you notice that a Flagger canary has failed, that should make you uncomfortable. You go make code changes and fixes and push a new build through that moves things forward, so that podinfo and podinfo-primary are both running the most recent version of your code, and you've reconciled that difference. So I think it's totally correct that it should be a little bit uncomfortable to have a canary build that failed. The only other option you have is that you could revert: you could have a robot go and revert the commit that caused the build to fail. But that can be kind of hazardous, because there may be other commits pending that are changing things, and now you have a robot that's actually merging changes. I'm not aware of a lot of people doing that. What I usually see is people say: okay, we understand what's happened, this canary has failed, let's move forward and come up with a new build that we think will fix that problem and make progress.

Okay, the last question is a little bit of a long one, from Alexander. One of his services is using WebRTC connections between client and service; terminating a connection would cause a bad user experience. The session typically goes on for one or two hours.
Is it possible to gradually prevent new connections from being established on the old version and shift new connections to the new service, to allow him to remove the old version when there are no more WebRTC sessions open on the old version? Okay, that's a good question. Again, that kind of depends on your service mesh. For Aspen Mesh, Istio, and Envoy, Envoy will let old connections to old services stay open for a configurable amount of time. So you can gradually shift that over, but you do face this tension: if you're not going to shift over anything existing, and you've got connections that last one or two hours, then the soak time needed to get enough new traffic onto your new connections might be pretty darn long. So you might be looking at, hey, how long of a window do I need to get enough new traffic? I think most service meshes support this sort of tailing off without disrupting existing connections; I know Aspen Mesh does, and for the others you just have to check.

If anybody else has a question to ask, please type it into the Q&A box now. We can wait another minute in case anybody has any questions. All right, well, it looks like we're done with questions. Thanks, Andrew and Zach, for a great presentation, and thank you for joining us today. The webinar recording and slides will be online later today on the CNCF website and at the link I posted earlier in the chat. We're looking forward to seeing you all at a future CNCF webinar. Have a great day. Thanks so much, Zach and Andrew. Thanks, everyone. Thanks for joining. Have a good day.