Hi, everybody. My name's Mark Mandel. I do developer advocacy for cloud for games. I'm here talking to you about Agones: scaling multiplayer game servers with Kubernetes, which is a fantastically fun title. There's a picture of me, and you can follow me on the Twitters, so I can have more self-worth based on my Twitter account. That would be good, yeah. When I hit 5,000 followers, I thought I was going to feel good, but I would prefer 6,000. That would actually make me feel good. I'm sure it won't change next time it rolls around. So before we get started, I do want to ask all of you about you all, just so I can tailor this a fair bit, considering the audience. I usually give this talk, as I was saying before, at video game conferences, so there's different assumed knowledge, which is fun. So anyone here work in the games industry? That is not surprising. OK, cool. So who here is not familiar with Kubernetes? Also, perfect. OK, so you in the back. That's OK. That's fine. That's cool. So what I'm going to do is I'm going to give you probably some more background information about how certain types of multiplayer games work so that you can understand the work that we're doing. And I'm probably also going to talk, well, I am actually going to talk, more about why we extended Kubernetes to do the things that we needed to do for the workloads that we have for this kind of gameplay, that kind of stuff. So does that kind of sound cool? Sweet, awesome. So what I want to talk to you about today is multiplayer games, and very particular types of multiplayer games. Who here plays video games? Excellent, this explains a lot about why you're here. Awesome. So there are a wide variety of different types of multiplayer games out there. The ones I want to talk about are your very fast-paced online competitive multiplayer games. 
So we're talking like Fortnite, Overwatch, Rocket League, those types of games where there's a lot of stuff happening, a lot of interactive physics, a lot of things going on between different players, that kind of interaction. And the example I'm actually going to use today is an open source first-person shooter called Xonotic that we're going to use as a demo. And what these games do is use what's called a dedicated game server. I'll show you what that means. So a dedicated game server is where all your clients connect to this single simulation server, somewhere on the internet usually. And that simulation server is your single source of truth for everything that happens inside the game. It keeps track of everything in memory. It does all your physics simulation. It takes the player input coming in from the clients. They send stuff down to the dedicated game server and are like, hey, can I move? Hey, I want to run forward. Hey, I'm firing a rocket. And it's up to that dedicated game server to track all the inputs and all the information that's coming from the clients and be able to say, okay, yes, this thing definitely happened. Yes, I sniped that person from across the room. Yes, I fired my rockets from my car and scored a goal. And it sends that information back out to those clients. There's a variety of reasons that this is pretty much the prevalent way of doing large scale multiplayer games, especially in the triple-A, large-studio space. I'm not going to go too deep into it, but I'll say there's two major reasons for this. One is the authoritative nature. It gives us a lot of control over what's happening inside the game. It helps us prevent cheating, that kind of stuff. The other thing is latency control. When we control where the dedicated server is gonna be, we can look at how much time it's going to take for information to go from the client to the server and back again. 
We know how long that's gonna be. We can track that. That's usually pretty easy. You can send a ping and return it. That's okay. And that means that if I'm playing a game with, say, someone in New York, I know as the game author, hey, I'm playing from here on this coast, and if I put the server somewhere in the middle, maybe in Central America, we're gonna have a pretty similar ping time, the round trip latency. It's gonna be about the same. We're gonna have a similar play experience, and I'll know within my game what sort of latency requirements I have. Usually somewhere between 50 and 100 milliseconds is not too bad. Below 50 is usually ideal. But I can track that. So that gives me a lot of control, which is super, super nice. And so what we're talking about here is really, yeah, multiplayer games that are driven by dedicated authoritative simulation servers. Which a lot of them are. So, sweet. How does this look traditionally? Like, what does this usually look like? Often, large scale studios build this themselves. This is a wonderful case of a lot of people reinventing the same wheel over and over again. So it would look something like this. So you have a couple of players. They're like, awesome, I wanna play a game. That sounds like a really good idea. So they usually connect to some sort of matchmaker. Usually it's a pretty simple service. Well, it's not simple; it's just simple in the sense that it's either a REST endpoint or something like that. And the job of the matchmaker is to look through the entire player pool and see if it can find other people of similar skill levels, maybe look through people's social graphs, usually a whole variety of different aspects, trying to find a good group of people to put in a game together. That could be a whole talk unto itself. In fact, it could be several. 
But once we have players who have been matchmade, what we need to do is find a dedicated game server process for them to play on. So usually the matchmaker talks to some sort of game server manager that looks at, and this probably sounds kind of familiar, a whole slew of virtual machines, usually distributed around the world. This is starting to sound like Kubernetes. You can see where we're going. But yeah, then it's its job to basically do standard scheduling type things. I know my simulation server needs about a CPU to run the whole simulation. I know how much memory it needs. So I start looking through machines and seeing which one has available CPU and memory, and then I place the process. But the big difference here as well is that once I've spun up this dedicated game server process and sent that information back to the players, the players receive basically an IP and port they connect to. They're making a direct connection to that dedicated game server. And that's because it's all in-memory state. It's all shared. There's no way you can spin that out to a database fast enough. It's way too quick. We want to do this in nanoseconds. So they make a direct connection to that dedicated game server. There's no load balancers, nothing like that. We don't want to introduce latency. Latency bad. That makes bad player experiences, and people complain on Reddit. And nobody wants to hear complaints on Reddit. So that's usually how we set it up. So the part I want to talk about today is this bit. How do we orchestrate, coordinate, and scale game servers across lots and lots and lots of machines? And how do we do this in a nice open source way that can become a standard for lots of people who want to do similar things? So that's what Agones is. It's an open source dedicated game server hosting and scaling project. 
So in the same way as you might want to host and run a website using, say, Deployments and Services on top of standard Kubernetes, you can use Agones to host and scale these dedicated game servers for multiplayer games. And we'll look at how that looks. So Kubernetes, yay, all right, we're all familiar with Kubernetes, awesome. So Agones itself is actually a native extension of Kubernetes. We use custom resource definitions as well as API extensions so that we can natively extend Kubernetes to basically make it understand how game servers work, which is awesome. The fact that this is a capability in Kubernetes is just absolutely phenomenal, and if that didn't exist, it would have taken us so much longer to make this project even vaguely a reality. Kubernetes is actually also really important for this kind of workload for a few reasons, which are really nice. One of which is the fact that we can run Kubernetes anywhere. The reality of games is that players show up in all sorts of weird places. Sometimes you just have a really big player base in Brazil, and I don't know if you know this, but cloud coverage in Brazil is sometimes not so good. So if you can run a Kubernetes cluster, whether it be on-prem or on three different cloud providers or something like that, it gives you a lot of power to reach players where they are. And that's a huge thing, because if you can't hit the latency requirements for your game, your players won't play and you won't make any money. It's that simple. It's also a nice simplification. Traditionally, the infrastructure that runs dedicated game servers is usually quite different from, say, where you would run your traditional back-end services for games, like your account management or your virtual goods or your matchmakers, for example. 
So this gives you a single platform to run both your back-end services for all your other game stuff, which you will have, as well as your dedicated game server workloads, which means less ops knowledge and less complexity, which is also super nice. Awesome, so let's look at this in reality. So unsurprisingly, we run game servers in containers. I don't think anyone's surprised by that, which is super nice; it means you can run whatever you like inside your container. We do have an integrated SDK. This is actually quite important for us. Game server life cycles don't really fit the same sort of workflow as, say, stateless services or even databases. Game servers run this really interesting gamut wherein they go from stateless to stateful over their lifecycle, which is a fun thing to manage, and part of the complexity and part of the interesting parts of Agones. So game servers are great in that, when you start one up and you have it just sitting there doing nothing, it's essentially stateless. No players are playing on it. Nobody cares if you delete it. That's totally fine. But as soon as you have players playing on it, they become stateful. You have a whole bunch of in-memory state that you need to manage and keep a hold of, and you cannot delete them. Otherwise, players get really, really mad when you interrupt their games. And again, then you get complaints on Reddit and all that kind of stuff, and that's awful. So the SDK functionality is really about allowing game servers to be able to control their own life cycles. So they know when they're ready. They know when they're able to accept connections. They know when they can shut themselves down because the gameplay is finished. Health tracking inside game servers is a little bit different, so we have some inbuilt health SDK functionality, as well as some other utility methods that are super nice. We basically use more open source stuff. 
So we use gRPC to basically build out the SDK integration. Hence, we can support a wide variety of languages, including REST endpoints, and we support the two largest commercial engines as well, Unreal and Unity. But we have people pumping out more SDKs as we go, which is super nice. And we can do this pretty easily thanks to gRPC and REST, which is super nice as well. So as I said before, Agones is an extension of Kubernetes. So anyone here not know what a custom resource definition is? Awesome, fantastic. That makes life really easy. That's great. So when you install Agones, now suddenly Kubernetes understands what a game server is, which is super nice. I can say to Kubernetes, hey, can you create me a game server? I can do all the usual standard stuff, like give it a name; we'll call it xonotic. One of the things that Agones does is set up that direct connection for you. We do the port policy management for you. Kubernetes won't, on its own, do "give me an open port in a range between 7000 and 8000" and connect that all up. There is some stuff that'll do that direct connection, but we need to handle the port management to make sure there are no collisions, things like that. So here we can say, right, here's a port I want, and it's gonna be dynamically allocated, and the container's on 26000. So it'll basically set up a host port connection for you automatically so that you can go through the IP of the node and straight into the machine. That works really nicely. And then here, this is basically a full pod spec, but you're specifying the container that has your game server in it. And that's pretty cool, and that's quite nice. And so that can spin up. It's integrated with the SDK. You can tell when the game server is ready, like, I'm ready to go, so we can put players on it, that kind of stuff. But the next thing that's kind of fun is that game servers are slow to start up. 
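The GameServer spec walked through above looks something like this. This is a sketch based on the public Agones Xonotic example; the exact apiVersion and image tag are assumptions that depend on which Agones release you're running:

```yaml
apiVersion: "agones.dev/v1"
kind: GameServer
metadata:
  # The usual standard stuff: give it a name.
  name: xonotic
spec:
  ports:
  - name: default
    # Dynamic means Agones picks an unused host port from its
    # configured range and wires it up to the container port,
    # so there are no port collisions on the node.
    portPolicy: Dynamic
    containerPort: 26000
  # A full pod template spec, specifying the container that
  # holds your dedicated game server binary.
  template:
    spec:
      containers:
      - name: xonotic
        image: gcr.io/agones-images/xonotic-example:0.8
```

Once the server process marks itself Ready through the SDK, the game server's status carries the node address and assigned port that clients connect to directly.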
Usually you're looking at a combination of binary and assets that's anywhere from a gig to three gigs, easily. You've got a lot of map data. You're probably pulling in things from a bunch of different places, maybe some configuration information. You're probably doing A/B tests inside the game, all that kind of stuff. So what you actually want is a warm set of spun-up game servers that are just sitting there waiting for players. So what we have is what we call a fleet. It looks a lot like a Deployment. That's because it kind of is. Really what we have here is a fleet that says how many of these game servers do you want. It's really that simple. And its job is to make sure you have a certain number of warm game servers up and running. Here's my spec for it. And you can scale it up and down, all that kind of stuff. I'm not going to talk about it today, but we also have some very particular types of autoscaling in here that are specific to how games work. But we also integrate directly with the standard Kubernetes node autoscaler. So you can autoscale your fleets and then use the standard Kubernetes autoscaler to adjust the cluster size based on how many things you have in your fleet, which is super nice. It actually works really well. They did a good job on that. So once you have this huge set of game servers, maybe you have like 5,000 of them, 10,000, we need a way to be able to be like, hey, I would like one from that set of game servers that are currently running. I need one of those so I can put a player on it. And I need you to atomically give it to me in a safe way. And I want to mark it so that I know that this one has players on it, so if you scale down the fleet, or you want to roll out a new image to the fleet, you don't delete it, because players complain when you shut things down. 
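A Fleet spec like the one described wraps that same GameServer template in a replica count, much as a Deployment wraps a pod template. Again, this is a sketch; the apiVersion and image shown are assumptions tied to the Agones release:

```yaml
apiVersion: "agones.dev/v1"
kind: Fleet
metadata:
  name: xonotic
spec:
  # How many warm, Ready game servers to keep up and running.
  replicas: 2
  template:
    # The same GameServer spec as before.
    spec:
      ports:
      - portPolicy: Dynamic
        containerPort: 26000
      template:
        spec:
          containers:
          - name: xonotic
            image: gcr.io/agones-images/xonotic-example:0.8
```

Scaling it works like any other Kubernetes resource, for example with something like `kubectl scale fleet xonotic --replicas=200`.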
So we kind of break out of the traditional Kubernetes mold here, where we have this thing called a game server allocation that works like an imperative command. Under the hood, it's actually an API extension, if you're interested; we don't use a CRD for this. You would usually create a game server allocation directly through the Kubernetes API, but I'm just gonna show it in YAML because that's easier. And you can provide a selector for the game servers that you want, to be like, hey, you have a thousand game servers; please give me one of those and mark it as Allocated in an atomically safe way. And then I know that it's allocated, and awesome, I know I have players on it and it's safe. It's in its sort of very special mode, so I know that nothing's gonna happen to it, and I can have players play on it, and it's great. And I'm gonna keep doing that. And eventually I'll run out, but maybe I'll autoscale up and that kind of stuff. But this is one of the sort of special things that Agones does for this kind of workload, which is fun. So more excitingly, why don't I show it in action, and hopefully it won't break, which would be good. All right, so I have an empty Kubernetes cluster, sweet. So to install Agones, I usually use Helm because it's the easiest. So I'm just gonna do that: agones/agones. It's not particularly complicated. We have a chart; it does all the things it needs to do. There we go. It spits out a whole bunch of gobbledygook, which is fine too. And if I have a look, which I like to do, there are all the components for Agones that are running there. We have a variety of things in there. So we have a controller, right, if you're familiar with the operator pattern. So we have a bunch of CRDs, and we have the controller that manages everything behind the scenes. So this means now we can do, like, kubectl get gameservers, which will say we don't have any, but we couldn't do that before. 
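The game server allocation described a moment ago can be sketched in YAML like this. The apiVersion is an assumption (older, pre-1.0 releases used different group names), and the fleet label shown is the one Agones applies to the game servers a fleet owns:

```yaml
apiVersion: "allocation.agones.dev/v1"
kind: GameServerAllocation
spec:
  # Pick one Ready game server from the xonotic fleet and atomically
  # flip it to Allocated, so fleet scale-downs and rolling updates
  # leave it alone while players are on it.
  required:
    matchLabels:
      agones.dev/fleet: xonotic
```

Creating this returns the chosen game server's name, address, and port, which a matchmaker would hand straight back to the players.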
And I mean, I love the extension stuff inside Kubernetes. It's kind of amazing that, you know, all this tooling comes for free, plus you get the API extensions and the API all at the same time. That's kind of magic. So let's create a fleet: xonotic-fleet.yaml. So we're gonna create that Xonotic fleet we looked at previously. If we look at our game servers, we have two of them up and running, which is great. They've already marked themselves as ready, which is super nice. We can see we have IPs and ports for both of them. We can actually see what nodes they're on. They're on the same node as well, which is also ideal; we pack them in tightly for scaling reasons. And so they're ready to go. And we could connect to those. We could play games on them. But two game servers is boring. We can do really standard Kubernetes-type stuff. We'll say 200. 200 is more like a better number. That seems more fun. So now we can treat that like a regular Kubernetes resource, and we can look at our fleet and see things like, awesome, we have 200 of them coming up already, nice, and 24 of them are ready. We'll see that number start to come up a little bit quicker. There we go. So that's starting to grow, which is super nice. Now, we use another open source project called OpenCensus, which I'm sure some of you are familiar with, which will soon become OpenTelemetry and all that kind of fun stuff, basically to do a whole bunch of inbuilt metrics inside this project. So metrics are super nice in this, right? Day two, day three operations: how many game servers do I have? What rate are they going up and down? This is one of many dashboards we have, where we can see things like, okay, cool. There we go. That'll start to change. It's usually about 30 seconds behind. And we can start seeing how many game servers we have ready and all that kind of stuff. So this is all just Grafana. What just happened there? I did something. Last five minutes. 
That's better. And we can see the rate they come up, all that kind of stuff. And we have a wide variety of metrics. So again, open source standards, yay. We can just provide this foundational stuff for game companies who want to do this kind of workload, which is really nice. And because we use OpenCensus, which is also lovely, we can push this to Prometheus and Grafana, we can push it to Stackdriver, basically anything that OpenCensus supports, which is super handy. There we go. So this actually says that's 200 there. That's tiny. You probably can't read it. They came up pretty fast. We can usually spin up 10,000 servers and get them to ready, depending on the speed of the game server, in like a couple of minutes. It's not too bad. Comes up pretty nicely. Sweet. Let's do an allocation, which is what we were talking about previously. So as you saw, we can scale this up and down. They're all in a ready state, which means there's no state in them. So no players are playing on them. So they're basically free to be deleted if we so desire, or as operations management decides. So let's do an allocation. And the allocation, again, puts it in that special state which says, hey, we have players on here. We're just gonna do it through here, with output YAML, rather than directly through the API, which is nice. And it'll give us back a whole bunch of results. So this has actually gone through those 200 game servers, picked an appropriate one, and given us back all the details about it. So here we can see we have an address, we have a port. We can actually see the game server name if we wanted to go look up more information about it. But let's do the fun thing, which is actually play a game. And the nice thing about this is it's done all the hard work for us. You know, it's found the appropriate machine where it needs to run. Let's see if I can type multiple numbers. 
There we go, 307307. Let's see if I got that right. Yes, black screen is good. I don't know why they made the game that way. So this is connecting up to the Kubernetes cluster that I've got running, a GKE cluster. It is on this coast, actually. There's a bunch of bots sitting here playing. I'm playing on a trackpad. That's why I'm so bad at this. I just want to be very clear about that. It's got nothing to do with my real skill in games. Where's the series of bots gone? Where are you? Oh, there's somebody. Yes, let's take it down. And as you can see, I'm playing a game like I would do normally. Everything's fine. What is this gun? But yeah, this is running on Agones. This is just running the dedicated game server. So we're all connected. The bots, in this case, are sitting in here. There we are. Excellent. So we can see it all works. And it's all pretty straightforward. Let's play with this a little bit more so you can see what it is that I was talking about previously. So we're going to do a couple more allocations. So say my matchmaker has found some more players who want to play. They have some games they want to play. We have a look at our fleet right now. We can see we have like three game servers allocated out of the 200 that are there. And so I just want to show you what this very special state means and why this works a little bit differently. So say, for example, you were like, okay, we've got these people playing a game. We need to wait for them to shut down. But we were like, oh, actually, you know what? We rolled out a wrong version of this, or maybe no one's playing it. So we want to scale it down a whole lot. But we don't want to interrupt the players that are currently playing. So we're like, actually, you know what? Let's just drop it down to zero. That's fine. Let's kill it. And so when we do this, all these game servers are going to start to scale down. 
But if we have a look at our fleet all over again, we can see those three game servers stay there, until those game servers turn around and, through the SDK, say, okay, yep, the gameplay is done. We finished our game session. Everything's cool. This person won. So let's shut ourselves down. They're going to stay there. They're not going to go away. I can specifically request that they get deleted, which is fine. That'll actually shut them down, which we can do. But they'll never go away until I do that. I can roll out a whole new version of the fleet. I can edit my fleet, much like a deployment, for example, and it'll roll out new versions. I can do the same thing inside here. It'll manage the fleet rollout and, again, make sure that no play is interrupted for those players that are on allocated game servers. That's kind of the magic power of Agones. Cool. I'm sure we have some time for some questions at the end. So we saw that architecture we were talking about before. This means now that your custom matchmaker, and we actually have another open source project called Open Match, a matchmaking framework, if you want to do open source matchmaking as well, so maybe it might be that, can talk directly to the Kubernetes API and basically interact with it that way and say, hey, I've got players that want to play a game. Let me talk to this fleet. Give me a game server so that these players can connect directly to game servers. There's a lot less bespoke stuff that needs to be built to manage your games at scale, which is super nice. We have a bunch of other stuff as well that, obviously, for time constraints, we can't cover: a bunch of autoscaling stuff, local development tools, a whole bunch of metrics we didn't show. There's something I was thinking of, actually. It's gone. I don't know what it was. There's some other cool stuff that we have that's super nice. 
We have a bunch of latency testing tools as well and all kinds of stuff that you just need. There's all sorts of fun stuff in there, but we didn't get to play with that. But there's goodies in there. We're planning on going 1.0. Actually, it'll be the next release. We release every six weeks, so it'll be September. We're going 1.0. We've been working on this for about two years. And it's worth noting we've been doing it in collaboration with a variety of studios. So we first created this in collaboration with Ubisoft. Since then, I mean, Ubisoft is still involved, but we've had a wide variety of game studios get involved, which has been really nice. We've got a whole bunch more metrics and stuff we want to do, more performance improvements, a whole bunch of functionality improvements as well. And we're also looking at how we can support multi-cluster better in some really nice ways. And once Windows hosting kind of penetrates all the cloud providers, we'll look at doing Windows hosting as well, which would be nice, because a lot of game developers use Windows. In fact, most of them do. Excellent. If you want to learn more: agones.dev, and @AgonesDev is the Twitter handle, et cetera, et cetera. We are always actively looking for contributors. So if you want to play with a fun project on top of Kubernetes, please come join our Slack channel. I will also make a note that down here, I've got my business cards. If you think you have questions and you're like, oh, maybe I'll think of them in three months' time, please feel free to grab some. And I've also got stickers down there too. There are a couple of Agones stickers still left down there that are holographic. So if that's your thing, that'd be cool. So we've got like nine minutes left for questions, which is really nice. If anyone has any questions about how game servers work or how we extended Kubernetes or any of the stuff that I've shown, I'd be more than happy to answer them. Sweet. 
So, how do we handle hardware requirements? Really, honestly, we push that down to whatever hardware you run on. Whatever hardware you put your Kubernetes cluster on, that's your hardware. I mean, it's up to the engine and that kind of stuff. Different engines often have different hardware requirements. I've seen Unreal do some very interesting things with how it does cheat checking inside the engine that is different from what I've seen in other engines. And that's where you get into fun things about assumptions that engines make about what sort of hardware they expect to be run on, whether it's actually bare metal or whether it's cloud. And as soon as you step into the cloud, you have a virtualization layer, and so certain things happen that way. A lot of the cloud providers, ourselves included, have versions of machines where essentially you have the full socket and basically a direct connection to the hardware, rather than, and I'll speak about Google Cloud because I know that one, our standard VMs, for example, where you have a virtualized CPU. And that can cause some latency issues. So to explain that, actually, let me take a step back. When you're running dedicated game servers, what you want is a very, what's the word I'm looking for, consistent tick rate; like, the same tick rate constantly. So usually a lot of game servers run at either 30, 60, some 120 Hertz, right? Knowing that that Hertz rate stays constant is exactly what you want, because then you know how much time you have to do your physics simulation, how much time you have to get your information out to your players, and that's super important. So if your underlying CPU is varying in how fast or how slow it processes things, maybe because of a noisy neighbor, where you have other things on the machine, that becomes a real problem. 
And so there are solutions for that. One is to just own the machine that you run on. But look at what we run: we have compute-optimized VMs on GCE. Amazon has something very similar whose name I've forgotten; I'm gonna assure you it exists as well. There, we're like, nope, you have the whole socket, and so you control how much of that CPU gets taken up. Makes sense? Cool. I can talk about game stuff for ages. Go right ahead. Yeah. So are there things that, and then a subsequent question, have they looked at that? No, like, if you build a game server today, it's gonna be a gig and a half. Yeah, there's so many assets in there and stuff. So traditionally, I mean, it depends on how you build your game server. If you're building on something like Unreal or Unity, what actually happens when you're authoring it is you're basically taking the client, and you have splits inside the code where this is client-only and this is server-only, because there's a lot of code sharing in between. And usually speaking, you're doing the same physics simulation; you want the players to have the same sort of things happening. So you can strip a lot of it out, but you still have things like map data. Like, what are all the maps that I'm playing on if I want to switch between maps? You probably have a whole variety of data that gets pulled into memory. So they're just big. They're just big. It's just the reality of the situation. I've been having some really interesting conversations with a variety of game companies about how you could build more distributed dedicated game server type stuff. That has other interesting problems, because as soon as you start to pull that stuff apart, as we all know, distributed systems are non-deterministic; different things happen in different orders. When you're talking about gameplay, non-determinism is a real problem. 
Because maybe one player gets a slightly different experience than the other, which sometimes is fine and sometimes isn't. But also for doing replays: how do I know there was a bug if it isn't deterministic? How do I replicate it? That becomes a real issue. So there's all these pros and cons for how you can do that. Yeah, it's fun. You get into your battle royale type games, right? You have 100 players, so that means you have a really big machine so that you can load that up. If you wanna start getting into really big numbers, then you have to start distributing that workload. And that has some extra complexity as well. And yeah, it's a whole thing. That's fun. Sweet. That's a really good question. I think, so, in my experience of running this, most studios are just like, yeah, we're moving to Linux, that's fine. It's less licensing costs, more than anything else. If you've got existing games that have been running on Windows for a really long time, they don't wanna port those, I think for obvious reasons. The big fun thing with that, and one of the reasons Windows support is actually really important, is that the vast majority of game developers develop on Windows. So we need to support Windows for QA builds more than anything else, just to get that iteration time down, right? So they're just gonna be like, I have a build, I need to be able to get it up and running within this system, and I just need to be able to drop in the Windows version; I'm not gonna recompile it and wait another 15 minutes or however long it takes. That's just way too long for the iteration time. They can do that once they're, maybe, doing a daily build or something like that, but to get that iteration loop down, we need to have Windows support. That's super important for them. It's three minutes and 17 seconds. Going once, going twice, going three times. Cool, awesome. Well, thank you so much for spending time with me and talking about game stuff. I'm around for the rest of the day if you have other questions. 
Otherwise, yeah, thank you so much for joining me, and have a great lunch.