I'm Chris Voss. That's a picture of me if you can't see me from the back. I'm a senior software development engineer at Microsoft. I joined both Microsoft and Xbox Cloud Gaming around August of 2018. Originally, I was on the team that manages our Xbox servers, keeping them healthy, making sure they're up to date, all that fun stuff. More recently, the last couple of years, I've been on our infrastructure team, Game Streaming Shared Services. Our team manages and operates the infrastructure for Xbox Cloud Gaming services: Azure resources such as Cosmos DB, storage accounts, and key vaults; our Kubernetes clusters, obviously; and some services that are shared across all of our other teams. If you'd like to contact me, these certainly work. I'm sure there are others I forgot, but if you want to reach out to me, these are definitely good places.

So, a little bit about Xbox Cloud Gaming. It comes for free if you have Xbox Game Pass Ultimate, and it enables you to play games wherever you want, on whatever device you want. Our mission is to let people play the games they want, with the people they want, on the devices they already own. It's important to us to enable anyone who would like to play games to play them. It's a way of connecting people who might be separated by great distances and giving them shared experiences. That's a big motivation behind our team and our product.

A little bit about what our footprint looks like. We have roughly 26 AKS (Azure Kubernetes Service) clusters across several Azure regions, 50 or so microservices -- I think it's a little more than that now, but 50-plus is close -- and about 700 to 1,000 pods per cluster. Spread across those 26-ish clusters, we have about 22,000 pods around the world. And a little bit about the servers: when you're streaming a game, what you're streaming from is actually Xbox Series X hardware -- a custom-designed modification of the Xbox Series X -- deployed in data centers around the globe. So when you're playing, you're really playing on an Xbox; we send you the video and audio signals and relay your input back to that console.

A little bit about our tech stack. Obviously, we're part of Microsoft, so we use Microsoft products, but we're also huge adopters of as much open source as we can be. These are some CNCF projects, along with some of the other things we use. The majority of our services are written in .NET. We run inside Azure Kubernetes Service, we use Docker to containerize all of our services, and we use Helm to deploy them. We use Linkerd along with Flagger for canary deployments -- I'll talk more about that later. Fluent Bit is what we use for our logging pipeline, and Prometheus, as you might imagine, is for metrics.

To level set -- I'm not sure how many folks are already completely familiar with service meshes -- let's make sure we're all on the same page: what is a service mesh? There are several different service meshes out there, all with unique functionality, but there are some things that unite them as well.
Really, the way I think about it is that a service mesh provides controls around the traffic inside your clusters -- sometimes out of them, but for the most part inside the cluster. And there is a spec for this: the Service Mesh Interface (SMI). It has specs for traffic access control, traffic metrics, traffic specs -- which are essentially the CRDs for routing rules and the like -- and traffic split, which is what we use to do our canary releases. If you want to check out the full spec, you can go to that link.

All right, so now we understand, at least somewhat, what a service mesh is. Why would you want one? Why did we want one? For us, a huge motivator was to simplify our service-to-service TLS -- mutual TLS, or mTLS for short. Previously, we managed our own solution. We had to get our certificates created, deployed into the right clusters and the right namespaces, loaded into our services -- and then we got the fun option of troubleshooting it when there were problems. That caused a lot of pain early on, so service meshes came onto our radar pretty quickly, because our in-house approach created a lot of friction every time we spun up a new service.

We also really, really wanted code flighting and progressive deployments. We wanted to be able to build confidence in a deployment before we just shoved it out there for everyone. There have been several times where that saved us from major outages because our auto rollback worked. Observability: just like the metrics piece of the SMI spec, being able to understand what's going on inside your system, your cluster, your service mesh is incredibly important, as you probably know, and most service meshes provide that for you. And code-free instrumentation -- that was huge. We did not want to pull a ton of different libraries into our services, hope they existed for .NET, and essentially bloat our containers with code we weren't maintaining.

Our service mesh search began around 2020. What happened was that a few of my colleagues went to KubeCon + CloudNativeCon 2019 in San Diego, and my former manager saw a Linkerd booth where there was a challenge to set up mTLS in a cluster in five minutes. He was able to do it, and that immediately put Linkerd on our list of service meshes to check out. Obviously, it's cool that one was easy to use, but you still want to make sure you're choosing the right one. So we evaluated several, which are listed here, and we wanted to make sure each one met our requirements. It had to implement the Service Mesh Interface, because we wanted to use canary deployments. We also wanted efficient resource utilization, especially CPU, because most of our services are CPU bound, and any service mesh we add on top of them is just going to add more CPU pressure if its own utilization is high. Observability -- I already talked about that -- we wanted to make sure we could make sense of what was going on inside the service mesh. And setup and maintenance ease: we don't want a team that's dedicated to our service mesh; we want a team that's dedicated to all of our infrastructure.
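To make the traffic-split piece a little more concrete, here's a minimal sketch of an SMI TrafficSplit resource of the kind Flagger manages for you during a canary. The service names, namespace, and weights are made up for illustration, and the exact apiVersion depends on which SMI and Linkerd versions you're running:

```yaml
# Hypothetical example: split traffic between a stable and a canary backend.
# Flagger normally creates and updates this resource for you; you rarely write it by hand.
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: my-service          # made-up service name
  namespace: my-namespace   # made-up namespace
spec:
  service: my-service            # the apex service that clients call
  backends:
    - service: my-service-primary
      weight: 90                 # 90% of requests stay on the current version
    - service: my-service-canary
      weight: 10                 # 10% go to the new version under evaluation
```

As the canary analysis succeeds, the weights shift further toward the canary backend; on failure, they snap back to the primary.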
So if the maintenance or setup is incredibly difficult and painful, we'll likely have to dedicate one or two people just to maintaining it over time, and that was just not scalable for us.

So we sat down, looked at all the features, and said, OK, these are the few we really want to investigate. Someone on our team put together prototypes and we evaluated all of them. And as you might imagine, Linkerd won -- "winning" is maybe not the right term; it's more that it fit our needs exactly. Efficient resource utilization: it was very good there. Traffic splitting: using the Service Mesh Interface, we were able to get that functionality. Observability: it comes with a ton of metrics out of the box. And low latency: whenever you're adding something to your call stack, if it adds a significant amount of latency and you have multiple layers of calls, that can get pretty nasty pretty quickly, so that was a big deal for us as well.

So, some of the ways we're using Linkerd to scale our business. Now that we've chosen Linkerd, we're using mTLS with cert rotation -- it's in the title of the talk, so that's probably not super surprising. We're also running it in high availability mode. This is something we missed when we were initially doing our deployments: we saw that if a bunch of services were getting deployed at the same time, sidecar injection into the pods would have issues, essentially, and that was because there was only one instance of the Linkerd control plane running -- we were not running in high availability mode. It's a pretty easy thing to fix; it just wasn't something we recognized right away. It's very important if you're going to production with Linkerd.

Prometheus metrics. To be completely honest, we did not use Prometheus before we started using Linkerd. We had a way of getting some of the internal metrics out into our own systems, and we used that, but it was costly, and again, it required a lot of maintenance from us. We obviously use the Service Mesh Interface extension for canary releases, because that's very important to us, and Linkerd supports it. Also -- and this is probably something that can be said about most open source communities, but it was one of my first experiences with it -- we've had a really good experience working with the Linkerd community. This link here is from a colleague of mine, Abraham. He was the one who did some of the prototyping work, opened several issues on the Linkerd GitHub, and worked with the maintainers to drive them to conclusion so we could move forward. I'll go into all of those in a little more detail.

So, mTLS. Zero-config mTLS is simple, and honestly, I considered having that be the only bullet point on this slide, because that's how little most of us have to think about it now that it's all set up. For instance, we've been moving to a new architecture for our clusters, and I was tasked with getting all of our Kubernetes add-ons working, Linkerd being one of them. I was incredibly nervous -- I'm not a security expert, I'm not a certificate expert, anything like that. And when I put down what I thought were the right settings and executed our Terraform code, I was convinced I had messed it up, because it just worked. To be completely honest, I had to ask other people to help me figure out whether it was actually working, because I really, really didn't believe it. But it honestly was that simple -- sometimes people say things like that, but it genuinely was.
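To give a rough idea of what the cert rotation piece can look like: one common pattern -- the one described in the Linkerd docs -- is to let cert-manager issue and rotate the identity issuer certificate, so nobody has to hand-rotate anything. This is only a sketch under that assumption; our own setup is driven through Terraform, and the issuer name and durations here are illustrative:

```yaml
# Sketch: have cert-manager keep Linkerd's identity issuer certificate rotated.
# Assumes a cert-manager Issuer named "linkerd-trust-anchor" already exists in the
# linkerd namespace, backed by your trust anchor; names and durations are illustrative.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: linkerd-identity-issuer
  namespace: linkerd
spec:
  secretName: linkerd-identity-issuer
  duration: 48h                 # short-lived issuer certificate
  renewBefore: 25h              # rotate well before expiry
  issuerRef:
    name: linkerd-trust-anchor
    kind: Issuer
  commonName: identity.linkerd.cluster.local
  isCA: true
  privateKey:
    algorithm: ECDSA
  usages:
    - cert sign
    - crl sign
    - server auth
    - client auth
```

With something like this in place, the proxies keep getting freshly issued identity certificates without anyone touching a cert by hand.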
Like I said, we've now secured over 50 microservices and 22,000 pods around the globe this way.

So, observability -- this was huge for us. Code-free visibility: we didn't need to instrument a bunch of code to get these metrics out. Interestingly, we also use some of those metrics to drive our canary deployments, which I'll talk about a little more later, so I won't jump the gun here. Some of the things we use it for: HTTP response codes, latency monitoring, canary deployment status, and request volume. And it comes with its own Prometheus, which for us -- again, we weren't using Prometheus before -- really let us see the power of what we were missing, and pushed us to adopt it.

So let's dig in on canary deployments a little bit. For folks who are not familiar, canary deployments are similar in idea to blue-green deployments, but slightly different. Blue-green is where you stand up essentially a duplicate of your service, cut traffic over to it, and then tear down the previous one -- an essentially seamless cutover. For canary deployments, though, we have this little visualization -- ah, here we go, laser pointer. These controllers represent people who want to play a game. In stage one, when I'm deploying a service, we're trying to be cautious, so maybe we'll give one out of six gamers access to the new deployment. If we find that's working well, we eventually move on to our second stage of two players out of six, and the more confidence we build, the further along the deployment goes. Or we detect some sort of issue -- maybe it's a scale problem that only shows up once the new version is taking 75% of the traffic -- and we roll it back.

We have integrated our canary deployments into our Azure DevOps release pipelines to enable auto rollback, so if we detect a problem, we roll back automatically. In that instance, the visualization goes back to all blue controllers. In the case of success, which is hopefully the more likely scenario, you end up with all green controllers -- everything is executing the new code. And the canary evaluation is flexible.
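To make that a bit more concrete, here's a rough sketch of what a Flagger Canary definition for this kind of rollout can look like. The target name, port, intervals, weights, and thresholds are made up for illustration -- they're not our actual values -- and the request-success-rate and request-duration checks are Flagger's built-in metrics over the Prometheus data I mentioned:

```yaml
# Sketch of a Flagger canary for a hypothetical deployment called "my-service".
# Weights, intervals, and thresholds are illustrative, not production values.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-service
  namespace: my-namespace
spec:
  provider: linkerd            # use Linkerd: SMI traffic splits plus its Prometheus metrics
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  service:
    port: 8080
  analysis:
    interval: 1m               # how often to evaluate the canary
    threshold: 5               # failed checks allowed before rolling back
    maxWeight: 50              # stop shifting traffic at 50%, then promote if healthy
    stepWeight: 10             # shift traffic in 10% increments
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99              # percentage of successful requests required
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500             # latency ceiling in milliseconds
        interval: 1m
```

Flagger adjusts the traffic-split weights on each interval and rolls everything back to the primary if the checks fail more often than the threshold allows.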
So, some of our key learnings from Flagger -- things we didn't think about when we first set out to use it. The biggest one was how to handle canary deployments for different types of workloads: non-HTTP traffic such as UDP, asynchronous call patterns like message queues, or cron-job-style workloads. We have several services with periodic background jobs that spin up, do some work inside the service -- maybe fill a cache -- and then go away. Detecting whether those things are healthy in a canary scenario isn't just a matter of looking at HTTP response codes or latency; the service needs to emit some sort of signal that gives your canary evaluator an idea of what's going on.

Also, not all services have constant or high volumes. Even if your service is entirely driven by HTTP calls, if you're in a dip in your daily cycle, you may get false positives or under-reported problems, and you're not able to make a great evaluation. And this one was really sneaky for us: traffic to our liveness and readiness endpoints was skewing our canary evaluation. During those dips, we'd be deploying, see what looked like a fairly steady, healthy call pattern, and think it looked good -- but really, all of it was our health endpoints. Those absolutely matter and hopefully indicate some level of health, but part of the point of canary deployments is to detect the things you aren't already checking for. So that's another thing to keep in the back of your mind if you go down this path.

All right, engineering and cost savings. This is huge, and not something we necessarily expected -- hoped for, but didn't necessarily expect. Engineers were freed from supporting in-house mTLS. Not only is that a time and engineering savings, it's also a happiness improvement, because not everybody wants to think about how to get their cert into their service and their cluster, how to get it generated, and who to go to to rotate it. So: happiness, time -- awesome. We also reduced spend on alternative observability solutions. I mentioned briefly that we were sending some of our Kubernetes metrics out into our own system and handling them there; coalescing on Prometheus for all of our Kubernetes metrics has really helped us reduce spend. These two things alone have resulted in thousands of dollars of savings per month.

All right, a little bit about where we're going with Linkerd. These are things we haven't fully fleshed out -- they're ideas we've got, scenarios on our backlog. Service-to-service auth: sure, we've got mTLS, we've got secure communication, but if a bad actor gets access to one of our pods and starts making calls to other pods to look for vulnerabilities, we want to reduce that risk, and Linkerd supports that. Also multi-cluster communication and failover. In our previous architecture, we had no need for multi-cluster communication: all of our clusters were essentially the same, each with its own scope, and they didn't really care what was going on in the others. That was convenient in some situations, but not great in others, especially for reliability. Now that we've got a better architecture for it, we're absolutely going to be working on multi-cluster communication and definitely failover -- looking forward to that. And fault injection and chaos testing: being able to essentially simulate, what happens if all of a sudden my pod stops being able to talk to any other pods? I have very little experience in that space, so I'm really looking forward to learning more about it.
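For the service-to-service auth piece, the kind of thing we're looking at is Linkerd's policy resources, which have been available since Linkerd 2.11. This is just a sketch of the general shape, not something we run today; the workload names, labels, ports, and service accounts are all hypothetical:

```yaml
# Sketch: only allow the "frontend" service account to call my-service on its HTTP port.
# Names, labels, and ports are hypothetical; this shows the general shape of Linkerd policy.
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: my-service-http
  namespace: my-namespace
spec:
  podSelector:
    matchLabels:
      app: my-service
  port: http                   # the named container port to protect
  proxyProtocol: HTTP/1
---
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  name: my-service-allow-frontend
  namespace: my-namespace
spec:
  server:
    name: my-service-http
  client:
    meshTLS:
      serviceAccounts:
        - name: frontend       # only meshed clients running as this service account
          namespace: my-namespace
```

The idea is that even if someone compromises a pod, its mTLS identity only lets it call the things it was already authorized to call.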
Before I wrap up, I wanted to introduce everyone to my colleague Abraham Wodeji. He was unable to make it to Valencia, but he's an awesome dude, and he did an incredible amount of work on our path to adopting and running Linkerd in production. I wanted to make sure his work wasn't overlooked just because I'm the one presenting, because he's absolutely an integral and very important part of the equation.

And thanks for joining. If you have any questions, I'm happy to answer them. And if you're interested in trying Xbox Cloud Gaming, you can go to xbox.com -- you can play Fortnite for free, or if you're an Xbox Game Pass Ultimate subscriber, you can play hundreds more games. Thanks a lot.