Hi there, I'm Dom Green, and today I'm going to be talking about how Kubernetes is changing the games industry. I'm head of platform at a small game studio in London, England called Netspeak Games, and we've built and launched a mobile MMO in just six months. Being head of platform sounds impressive, but it basically means I'm a team of one: the only platform, or infrastructure, engineer. The rest of the studio is mainly made up of artists, both 2D and 3D, a small number of games engineers, and a commercial team. And as I said, in just six months we took an idea from conception all the way through to launch. Today I'm going to talk about how and why we used Kubernetes, along with open source technologies, for our underlying infrastructure, to launch a game in such a short period of time. We're going to talk about why speed matters, especially for free-to-play online games. We're going to look at the infrastructure of games and the component parts that make it up. Then, when we looked to use Kubernetes, what were the challenges? Thankfully there was a solution, an open source project called Agones. And we'll also see how we can use open source and Kubernetes to empower others within the studio, to help them build the game faster.

So how long does it take to build a game? Here's a game you've probably all heard of: World of Warcraft, released by Blizzard. It had around 13 million subscribers, each paying around $9 a month, so it had good revenue. It was also originally released as a boxed game: you paid an initial fee to buy it. It took around four to five years to build and get to launch, and even after that, they've constantly updated it with patches and expansions; I think the latest expansion took about two years to build in itself. Then we start to look at something like this.
This is Brawl Stars by Supercell, a free-to-play three-versus-three battle game. It's actually fairly complex: it's got different roles, different ways of playing, and a lot of different content. However, Supercell took around six months to build this game, and again, after launch they tweaked it and made improvements after seeing what their players did. How about a free-to-play game built on top of Snapchat? This is Bitmoji Party, and it's quite a simple game: you play, as I said, against other people you know in Snapchat, as a filter. This probably took a really small team about three months to build; the infrastructure is all there as part of the Snapchat platform, so it should be quite easy and quick to build on.

With this in mind, why does speed matter? Well, we want this kind of reaction: we want to make sure the game is fun to play for our players, especially in the free-to-play market. The quicker we get it into players' hands, the quicker we find out if the game is fun and if people will continue to play. So what takes up the majority of the time? Actually, it's building content; that's why we have so many artists in the studio, and 60% of the time is actually spent building content. Here you can see Rowan jumping for joy; he's one of the NPCs in our game. Then there's a small amount of time for design, production, and the general operations of the company, and 25% of the time is actually spent on tech. This includes the gameplay, all the gameplay systems, and the things that build the game. Included within this 25% is the hosting, the infrastructure, and the platform; this is where I fit in. So with six months, a game that needs to be built, and a new platform, what were my ideas?
Well, firstly, we didn't want to build from scratch; the main idea was to invest in any open source technologies that we could, standing on the shoulders of giants. We also wanted to make it easy for the game team to use this platform: we didn't want any friction for them when they were building out the game. After this, let's look at the requirements. We've heard that we've got a single engineer and six months; those are the two constraints. What do we need to be able to support? Thousands of concurrently connected users doing thousands of requests per second, resulting in hundreds of gigabytes of data that need to be stored about the players and what they're doing. On top of this, we needed to scale with demand: games are often busier during the evenings and weekends and much, much quieter during the day, so we had to have this elastic scalability. We wanted to be in multiple locations: the closer our clusters are to our players, the better latency they'll get, which means the more enjoyable the game will probably be for them. We needed to make sure the platform was transparent to the team, and that it could evolve over time; we didn't want to lock ourselves into any choices early on. And we had this idea that we shouldn't build it if we don't have to.

So let's have a quick look behind the scenes at the infrastructure and the various components that make up an online game. At the core of it is this idea of a dedicated game server. This is a single process to which tens to hundreds of players will connect; it stores the information about the world and constantly updates all those connected players about the world around them. It's basically running a simulation. For example, say my character was running around the world dancing: my client would send this up to the dedicated game server, which then replicates it back to all the other connected players. Behind the game server are a number of data services, which
connect into data stores. Early on we made sure that we abstracted the data store away from the game server, allowing the gameplay programmers to focus on just calling into APIs rather than the underlying storage. This also means that in the future, if we needed to, we could switch out the storage for other systems.

Okay, so how do you get into an online game? The first step is authentication. You go through a number of steps when you call into the auth service: it verifies who you are and where you're logging in from, and sends you back an authentication token; we actually use JWTs for this. From there, you call into a matchmaker, passing that token so we can identify who you are. The matchmaker will often talk to a lot of the data services in the back end to get information about the player, their experience and what they're doing at that time, and eventually pass back the host IP and port number of the dedicated game server that player is going to connect to. When we actually connect into a game server, the client uses UDP, transferring data backwards and forwards 30 times a second; that's quite a lot of data going over the network between client and server. And the dedicated server doesn't just support that single player: in that same process, as we said a minute ago, it supports many players. One interesting thing to note here is that at launch, if an online game goes down, it's often not the dedicated game server that's under too much stress; often we haven't scale-tested either the auth system or the matchmaking service, and both of these see a spike in traffic they're not used to.

So with all this in mind, why Kubernetes? Going back to the requirements: we need to support thousands of concurrently connected users, and we have thousands of requests per second.
We also need to scale up and down based on the demand we're seeing. We've seen that Kubernetes can easily deal with this, scaling up to thousands of nodes, especially in line-of-business applications; hopefully we could take that and apply it to a game. Kubernetes also runs anywhere: we can deploy it into any cloud provider, or we can even use bare metal. It also allows us to standardize the technologies we're using, both for the dedicated game server and the supporting services in the background; if all of these are running in Kubernetes, we have a standard way of doing the majority of things. Along with this, Kubernetes has a huge community of thousands of engineers who, if we run into any problems, will more than likely be happy to help out.

So was it just that easy? Did we just take our game server, put it into a container, run it in Kubernetes, job done, Dom takes a holiday for the remaining three months? Unfortunately not. The first problem we ran into was containerizing Unreal. Unreal was first built in '95 as an engine for a first-person shooter; the version we're using, Unreal 4, was actually rebuilt in 2005. Being so old, it's not exactly a cloud native application: logs are in a human-readable format, there's no concept of things like metrics or distributed tracing, and there's no documentation on how best to run it inside a container. Thankfully, it can be cross-compiled from Windows to Linux, and so we were actually able to get it into a container, though try it yourself and you're in for some serious pain. So from here, we tried to run our game servers as a Deployment.
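That first attempt can be sketched roughly like this (a hypothetical manifest: the image name and labels are illustrative, not our actual configuration):

```yaml
# Naive sketch: a plain Deployment, with the container's UDP port
# mapped straight onto the host so clients can connect directly.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: game-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: game-server
  template:
    metadata:
      labels:
        app: game-server
    spec:
      containers:
      - name: unreal-server
        image: registry.example.com/game-server:latest
        ports:
        - containerPort: 7777
          hostPort: 7777   # every replica claims the same host port
          protocol: UDP
```

This only works as long as the scheduler never puts two of these pods on the same node.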
This was the first thing we looked at: Deployments allow us to scale up and down easily. The other thing we needed to do, because we're using UDP and wanted clients to connect straight to the host machine, was to map the host machine's port to the container's port. That seemed easy enough, so we did it, and just avoided using Services. We could scale up and have multiple game servers on different nodes, each mapped to port 7777 on its host machine. However, the promise of Kubernetes was that we can run many, many containers on the same node, and this starts to become a problem when two containers on the same node try to map to the same host port: we end up with collisions. So it's not that easy with multiple containers mapping to the same host port; we needed another way of doing it, or some kind of clever infrastructure in between, to make sure containers were mapped to different host ports exposed to the internet.

Next, what about scaling? Scaling up is quite easy: just add more pods, which isn't going to affect anyone connected to a dedicated game server. However, when you come to scale down, how do you guarantee you're not scaling down a pod that's got players connected? This is definitely one of the bigger issues. It's also an issue if we wanted to upgrade our game server: if we deployed a new image version, all the containers would be restarted onto that new version. Thankfully, open source came to our rescue.
There's a project out there called Agones; you can go to agones.dev to find out more about it. It allows us to host, run, and scale dedicated game servers on top of Kubernetes. The project was originally built by Ubisoft and Google to help Ubisoft scale their own game servers on Kubernetes. So what does Agones give us? Well, it gives us a number of custom resource definitions, which make game servers a first-class citizen of Kubernetes, accessible via kubectl. Here you can see the YAML; I know everyone loves YAML, so it wouldn't be a Kubernetes presentation without a bit. All we've done is specify the image we want our game server to be, which will be pulled down from the container registry, and at the top we've said that we want to map the container port. Here we have the default, and container port 7777 is dynamically mapped to the host machine. As we'll see in a minute, this lets Agones translate the port we have in the container to an arbitrary port on the host machine that is then exposed to the internet. Along with this, Agones gives us a sidecar, which allows the dedicated server to talk to Kubernetes and tell it about various things happening within the game; you communicate out from your game to the Agones sidecar via an SDK. When we started using Agones, we realized the SDK wasn't quite where it needed to be, so we rebuilt our own version internally, and quickly decided to give it back to the community. We're very strong believers in open source: not just using open source technologies, but contributing back, so that collectively everyone can improve together. So what else does Agones give us?
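Quite a lot. But before that, here's roughly what that GameServer YAML looks like, reconstructed from the Agones documentation (the image name is illustrative, not our actual manifest):

```yaml
# An Agones GameServer: the container keeps listening on 7777,
# and Agones dynamically picks and maps a free host port for it.
apiVersion: agones.dev/v1
kind: GameServer
metadata:
  name: my-game-server
spec:
  ports:
  - name: default
    portPolicy: Dynamic   # arbitrary host port, exposed to the internet
    containerPort: 7777
    protocol: UDP
  template:
    spec:
      containers:
      - name: unreal-server
        image: registry.example.com/game-server:latest
```

That single GameServer resource is just the start, though.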
It gives us the concept of fleets. This is another custom resource definition, basically a collection of game servers. It helps with allocation of users to dedicated game servers, as we'll see in a moment, and it's got different ways of scaling: we can scale game servers over a number of nodes using either a packed distribution method, whereby game servers are launched on a single node first, or a distributed method, which you can see at the bottom here, whereby you scale out to different nodes in an almost round-robin fashion. Along with fleets there is a fleet autoscaler. This allows us to say that we want a maximum of, say, ten game servers in this fleet, and that we always want to have, say, two game servers that are warm and ready to accept players; as game servers move through their different stages, this allows us to spin up more and keep warm servers for the players coming in afterwards. So fleets are a collection of game servers, and they help us scale up and, as we've seen, distribute game servers over nodes in different ways. However, they also help us when scaling down, making sure our players aren't disconnected from their dedicated game servers. To see this, we have to dig a little more into the life cycle of our dedicated game servers. So here we are: we start up, and as the pod is created, the sidecar allocates a port from the host machine, so the container's port 7777 isn't what's exposed.
We have it mapped to 7112. At this point the container is running, and it goes into a Scheduled state. From here, it's over to the game server itself. The game server is responsible for telling Agones when it's ready to start accepting players, and the reason for this is that you could be loading a number of assets in the background, hydrating some state from a database, or waiting for some long computation before you want to accept players. When it finally says it's in the Ready state, we can start directing players to it. As we said earlier, players come in through a matchmaker: the first thing the client does is go to the matchmaker and say, "I want to find a server to connect to." At this point our matchmaker calls into a number of back-end services, gets information about the player, and then asks the Agones APIs internally to allocate it a game server. Agones talks to the fleet to determine which game server to allocate, puts that game server into an Allocated state, and returns the IP address of the host and the port number where the player should connect. This Allocated state is one of the most important parts for scaling down, and we'll see why in a minute: it's effectively saying that a player either is about to connect or has connected. So here we can see the client connects with UDP to that host machine and host port, into the container; the container is allocated, and we have a happy user playing the game. Finally, after the player disconnects, the game server is no longer needed.
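That allocation request can be expressed as an Agones GameServerAllocation resource, roughly like this (a sketch: the fleet name is illustrative, and our matchmaker calls the Agones APIs programmatically rather than applying YAML):

```yaml
# Ask Agones for a Ready game server from the given fleet;
# the chosen server is moved into the Allocated state.
apiVersion: allocation.agones.dev/v1
kind: GameServerAllocation
spec:
  selectors:
  - matchLabels:
      agones.dev/fleet: game-fleet
```

The returned status carries the host address and port that the matchmaker hands back to the client. And back in the life cycle: the player has now disconnected, and that allocated server isn't needed any more.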
We need to scale it down, and this is done by the game server itself saying, "I've got no more players left, I've cleaned up and saved all the data I needed to, I can shut down." In doing so, it calls out via the SDK again, signalling to Agones to clean it up, and the pod is replaced by another one, which goes into the Ready state to accept more players. From here, let's have a look at what happens if we scale up the fleet and set the number of replicas to six. You can see a number of different pods running; these are actually game servers, and you can see which port they're mapped to on the host machine and how long they've been running. Great, we've scaled up. What about if we try to scale down, all the way to zero? Well, this happens. We've told kubectl to scale the fleet from six down to zero replicas; however, it's taken all of them down apart from the game server that was allocated. This is very important, and this is why Agones helps with scaling game servers up and down: it knows that players are connected to this server, so it won't destroy it or move it from the machine it's on. Doing so would obviously end the session for the player and result in an unhappy experience. We also don't have to run only a single fleet within our cluster. What we actually do is allow devs to run separate fleets: all our devs inside Netspeak can have a fleet of their own, and at any point in time they can deploy into it. They run against the same infrastructure as our players, talking to the same data services, effectively testing a new version of the game in production alongside other running customers. So what does Agones give us? Well, we've seen the ability to map container ports to host ports, in a dynamic way.
It's protected our players whilst we're scaling down or upgrading the version of the game server. It's given us multiple fleets, so that our developers can run in the same infrastructure as our consumers. And it gives us the ability to autoscale up and down, keeping a number of warm servers ready. On top of this, we have an SDK through which we can talk to the platform and signal various states.

Finally, I want to talk about how this platform and Agones, along with some other open source technologies, empower our team. In building this platform, we wanted to make sure it was easy for the team to run on top of the infrastructure. We have gameplay programmers and artists who aren't familiar with infrastructure, and especially not familiar with Kubernetes, so we wanted to make the experience as transparent as possible for them, to keep up this quick cadence. The first step in this journey was making sure they could automatically build out a client and a server whenever they wanted. We're actually using GitHub Actions for this: they can signify which fleet they want the client to connect to, signal which branch it comes from, and do the same with the server. Once they do this for the client,
it's uploaded, they get a notification on the device, and they can then automatically download that client and connect into a game server. However, at that point we still haven't deployed the game server into Kubernetes, and deploying into Kubernetes was one of the areas that was a real head-scratcher for me for a while. Then I thought about the technologies out there, and GitOps was the answer for us: effectively allowing us to take advantage of Git, but more specifically to take advantage of the fact that GitHub and GitLab have a user-friendly UI. An artist can easily go into the UI, change the version number, and not have to worry about what happens with the infrastructure in the background. In doing so, they create a pull request, which is accepted by one of the other team members, and within minutes they've got a new version of the server running inside a fleet, on the same production hardware where customers are. This allows them to really speed up and iterate fast on building out the game. So we've got GitOps for deployment, and we've got an automated build system that pushes out clients. We also have a supporting cast of other tools that help the game engineers and artists ensure the game is being built in the correct way. We use things like cAdvisor to make sure we have the correct metrics on CPU and memory utilization, which go into Prometheus and are eventually displayed to them in Grafana. Again, this is just another website they go to; we've hidden all the magic away so they don't have to concern themselves with it. And similarly with logs:
I mentioned earlier that Unreal has human-readable logs. We use a tool called Vector to parse these logs and ship them into Elasticsearch, to eventually be displayed to the user. This means the team don't have to worry about tailing logs in Kubernetes, or about how to use different command-line tools; keeping it in a web interface makes it really easy and simple for people to use.

So that's what we've been able to develop within six months. Kubernetes has given us the ability to build out something that can be deployed anywhere: as I said, in any location, any cloud provider, or even on bare metal. With the addition of Agones, we can now scale up and down without affecting the players connected to the game. This has also given us the ability to support thousands of connected users on multiple instances of Unreal. And it's allowed us to do it in a way that's low friction for the game team: they have a self-service way of deploying and monitoring their game. Hopefully this actually speeds them up and doesn't get in the way of what they're trying to achieve.

So what have we learned? Kubernetes gives us standardization: especially with containers, we run and build our dedicated game server in the same way we do our matchmaker and all our data services, so there's one standard approach to doing infrastructure. We've utilized open source to make sure we can move fast; we didn't build anything we didn't have to. And the open source community adds to our team:
They've really allowed us to move fast, and they can help whenever problems occur, rather than us relying on something homegrown and custom. Also, empowering the people using the platform inside the studio is really important: if Kubernetes gets in the way of us building the game, we won't be able to release it in the time frame we need.

So, what we looked at today: we talked through why speed matters in the games industry; it's about getting the game into the hands of users and finding out what the fun experience is. We've looked at the infrastructure of games and some of its key parts: the authentication service, the matchmaker, and specifically the dedicated game servers. We hit some issues when running them inside Kubernetes, especially with just vanilla Deployments; however, we utilized open source, especially Agones, which helped us overcome those obstacles, map host ports to container ports, and scale up and down with ease. And finally, we looked at how using open source, and specifically GitOps, can really empower people on the team. Thank you very much for listening today, and if you have any questions, I'd be happy to answer them.