All right, so we're not going to start the program officially for another minute out of respect for people getting here, but I will start with some of my preamble. So here's my first question: who in the room plays Minecraft? OK, OK, maybe not quite half. Whose kids play Minecraft? OK, also about half. So what you all probably want to do, especially those of you who have kids that play it, who think that what you do for a living is kind of dorky or they don't understand it, is pull your phone out, because you're going to be able to take some photos of the slides, and you're going to be able to come home and say, see, I do cool stuff for work. And I'm going to say right away, because I will forget as soon as we get into the swing of things, that we have my wonderful colleague, Marcella, to thank for a lot of the Minecraft imagery that you're going to see throughout the slides. So thank you, Marcella. I think we can get started. I think we only have a few seconds before it hits the magic 11:15 mark. The first thing I want to do is thank you all for coming, and I want to thank Justin and Marcella for the title to this talk, because it's not every day that you can put Minecraft in the middle of a title that shows up here at KubeCon. So I'm sure that that's what drew many of you. I will tell you that I've had the pleasure of working with Justin over the course of the last month or two, preparing for this talk and really understanding what he's implemented, and it is a great story. So you're in for a treat. And with that, we'll go ahead and get started. I'll let my colleague here, Justin, introduce himself, and then I'll introduce myself as well. Hey, everybody. Justin Head, VP of DevOps at Super League. Yeah, I get to work on the cool stuff with Kubernetes and help my team build out neat and interesting stuff. Yeah, and you've been at Super League for about four years? Almost four years, yes.
Almost four years, and Justin was just telling me that he's been doing Kubernetes for about seven years. So you have a fair bit of experience with this, and you'll see as we go into the details that he really had to tap into that to be able to address the particular use case here. My name is Cornelia. I work for Spectro Cloud, which is a Kubernetes management platform. We're not gonna talk a whole lot about Spectro Cloud today; I'll maybe sprinkle in a little bit here and there to tell you about it. In terms of my background, I have been doing cloud native for more than a decade. I spent seven or eight years at Pivotal working on Cloud Foundry. Anybody do any Cloud Foundry in the past? Yep, so some Cloud Foundry in here. And then towards the end of my Pivotal tenure, I started working with Kubernetes; we brought Pivotal Container Service to market. I also wrote a book called Cloud Native Patterns. It's a Manning book targeted at the application developer and the application architect, to help you understand how to write software for the cloud. So with that, we'll start the story. All right, this talk could have also been called There and Back Again: A Hobbit's Tale, or, I mean, you know, a Minecraft tale, basically. And it's about a journey to bare metal. It's there and back again because, as you'll see as we get into it, we started on bare metal. Even I actually started my career in Chicago, and right across the street, I'm not sure which direction I'm facing, is the Equinix data center. I used to spend a lot of long, cold nights in there, just racking and stacking and crash-carting and fixing things that were broken, back when you still had to drive into places to do that. So that's kind of what this talk is gonna be about. Here's the general agenda of what we'll go through.
We'll talk about, you know, going to the cloud, laying a foundation for bare metal, going through how you do lifecycle management in a bare metal environment, which is a little bit different, especially for a Minecraft workload. And we'll talk about the results, how it happened, and, you know, did it actually work in the end? I should probably say a little bit about Super League and what Super League is. Essentially, Super League is a brand activation agency, helping to bring brands like Mattel into Roblox, or other brands like Nickelodeon into places like Minehut or Minecraft worlds. So they're building out experiences for people to play, getting people excited about brands and the interactions within those digital worlds that a lot of people are spending their time in. And correct me if I'm wrong, but I think you've told me that part of what you did also was help people come together as groups to game, right? Yeah, Minehut itself is the largest Java-based Minecraft server community in the world. And so it's a little different than other Minecraft hosting places, or just having a server by yourself. We'll talk a little bit more about it, but there's a community and a lobby, a kind of gathering place where everyone comes in when they're joining, before going off to their own servers, or to promote their own servers, and so on and so forth. So it's all about interactivity and community, and it's also a family-friendly place for kids to play Minecraft. Very cool. All right, we'll talk about what started all this off: going into survival mode. First of all, we moved Minehut to the cloud. It started in a provider called OVH, where we were doing dedicated servers. Everything was mostly done manually, so on and so forth. And we migrated over to AWS ECS.
So we were looking for a more professional, stable platform that we could get help with if we needed it. OVH is primarily a European provider, and so the support hours were not conducive to that. We wanted on-demand ability, to API everything, and to get a little bit more modern with a container orchestration platform. So after AWS, they decided to move to GCP. This was prior to me being there, but the AWS infrastructure got compromised, and they brought in a consultant to ask what they should do and where they should go in the future. The consultant built out a solution using Kubernetes, and at that time really the only place to get managed Kubernetes was GKE on Google. And so that's where it moved. Now, I started at Super League one month before the pandemic hit. I actually started at Super League to build out a live, in-person esports platform globally. And that did not happen. But, you know, everyone had a little bit more time on their hands when the pandemic started and everything shut down. People started playing Minecraft on Minehut a lot more, and the cost was getting out of control. I'm curious, I know many of you said you play Minecraft; is anybody playing Minecraft on Minehut? You're doing it mostly on your own local machines? Yeah, I'm seeing a lot of nodding heads, interesting. Okay, so just for a little bit of background, what Minehut is, is an environment where, instead of having to run a Minecraft server locally, you can just run it up in the cloud. And that's really important context for us to have, because as we go through this story, you'll see that what that means is that because the game is actually running in the cloud, it's stateful. Those of you who play Minecraft know even better than I do, because I'm afraid I don't play, that it's stateful.
You care about remembering where you've just come from and what the environment looks like where you currently are in the game. So that's what we're talking about here: a whole bunch of different instances of Minecraft that are running on these servers. Before the pandemic, they had a certain amount of traffic, and then the pandemic hit and people were running more Minecraft on their infrastructure, that very infrastructure that he just talked about being in GCP. So we were faced with kind of what to do. Minehut essentially is a free service. People, especially at that time, could come and just build a server out and we would run it for them. And that was unsustainable with the numbers that we were getting up to. And so we wondered, how are we gonna keep this running and keep the lights on while still not going bankrupt on this product? We had to get creative. In GCP, at the height of all this, we were spending about 200K a month on Minehut. It was mostly compute, but a big part of it, as with all games, was egress cost. Those are all very expensive items in your top public clouds. And so we took a look around. We looked at a number of different options: going back to dedicated servers, building things out ourselves, even taking down colo space and buying servers. But ultimately, we looked at bare metal, and it promised a pretty good cost savings. On the machine side, it looked like it would be at least 50%. We wouldn't have EBS storage costs, because there would be storage inside the machines; it came with the pricing. And the network was a big one, with at least 90% savings. But okay, so Justin and his organization decided to move to bare metal. But if you remember, when he talked about moving into the cloud, they were originally on bare metal. When they moved over to the cloud, there were a couple of reasons for doing so. It was to have that agility. It was to be able to use APIs to provision things.
It was that ability to have some level of elasticity in the amount of resources that you're using. And you remember that they moved from bare metal into ECS, so they were moving into a containerized environment. So that's the challenge that we're gonna talk about here, which is: hey, when we're moving back to bare metal, we're not moving back to the bare metal that we came from initially. We want to bring those benefits that came from the cloud back with us into the bare metal environment. And so we're gonna talk about each of those three things. So Justin, maybe first talk about your containerized architecture. Yeah, this is a simplified view of what Minehut looks like on top of Kubernetes, especially from the player traffic perspective. At the entry point, obviously, the player is connecting in. They go through Cloudflare Spectrum, which we use to basically clean up the traffic and prevent major volumetric DDoSes. And from there, it hits what's called a Velocity proxy. Velocity is an open source tool that Minecraft people can use to proxy their Minecraft traffic. That sits in basically a BGP ECMP setup that has sticky sessions, or consistent hashing, so that players aren't gonna be dropped if something happens with a node. From there, in general, people will get dumped into the game lobby. And that's where that community thing happens: people can see top servers that are being played right now, they can interact with things like the Nickelodeon event, they can do parkour and all these other things in Minecraft in there, and get rewards and so on and so forth for doing that kind of stuff. From there, they can then jump over to their own game server or someone else's, and play Minecraft as a community, with just one person, or with tens of people in there.
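The consistent-hashing behavior Justin describes, where a player keeps landing on the same Velocity proxy even as proxies come and go, can be sketched with a minimal hash ring. This is an illustration of the general technique, not Minehut's actual implementation; the class and proxy names are hypothetical:

```python
import hashlib
from bisect import bisect


class HashRing:
    """Minimal consistent-hash ring: each node gets many virtual points,
    and a key maps to the first node point at or after the key's hash."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First virtual point clockwise from the key's hash,
        # wrapping around to the start of the ring if needed.
        idx = bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[idx][1]


ring = HashRing(["proxy-a", "proxy-b", "proxy-c"])
print(ring.node_for("player-steve"))  # the same player always hits the same proxy
```

The payoff is that removing one proxy only remaps the players who were on it; everyone else stays put, which is exactly the sticky behavior you want in front of stateful game sessions.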
I'll talk a little bit later about the other things that you see on there, but that's the gist of what it looks like from the player traffic perspective. And most of those components that you talked about are implemented as relatively standard web services, web UIs, those types of things, and all of that's running in Kubernetes. That's correct. Yeah, okay, so it's that containerized architecture. All right, so the second thing that we talked about was that we didn't wanna lose that containerization. So Kubernetes is the solution there, but hang on a second: Kubernetes on bare metal. Well, I wanted to maintain that level of API support that I had, being able to get a cluster very easily, being able to have some level of elasticity, but again, bare metal. The way this solution works is it leverages something called MAAS. MAAS is a project that comes out of Canonical, and it is basically a cloud-like API for hardware. What you see on the screen here is, on the top, a bunch of different operating systems that are available, and on the bottom, a whole bunch of different hardware elements that are available. And if you remember going to EC2 or GCP and spinning up a virtual machine, that's basically what you're doing: you're saying, I want one of these machine types, and I want this operating system on it. So that's what MAAS gives you. MAAS gives you an API that makes bare metal look like a cloud. Now, there's a second element to APIs, and that is: okay, now I have an API for bare metal machines, but how do I turn those bare metal machines into a Kubernetes cluster? And that's where Cluster API comes in. So I just want to check: my timer here is showing 20 minutes and you just showed me five minutes, so I think we're good. So I'm gonna keep going based on this timer.
Somebody do let me know if there is a difference, because we have a lot more than five minutes of content left. Apologies for that. So Cluster API is a CNCF project. How many people are doing things with Cluster API today? Maybe a quarter of you, maybe a little bit less. What Cluster API is, is an API that allows you to interact with clusters exactly the same way that you interact with workloads on your clusters today. With workloads, what do you do? You create a deployment. You just apply that deployment into the cluster, and then all the magic happens: the deployment gets converted into replica sets, the replica sets get converted into pods, and all of that is kept up to date for you. Well, Cluster API does that for you for clusters. And Cluster API has a number of different providers; here's just a handful of them. These are the providers that we support in our Spectro Cloud product, so we can attach to any one of these environments, including the bare metal environments that you see on here. We have built an open source project called the Cluster API provider for MAAS. We've done that at Spectro Cloud, so you can find it in our GitHub repository. It leverages the Cluster API specification, which is totally open source, and our implementation for MAAS is also open source. It allows you to stand up those clusters using an API, just like you would in the cloud. So you see how we're working our way towards bare metal, and these are the ingredients that you need to maintain all of the goodness that you got when you moved up into the cloud. So how does elasticity work with bare metal? One thing that you'll notice right away, if you haven't worked with bare metal before or haven't worked with it in a while, is deploy times. When you're spinning up full physical machines, including powering them back on, it will take about 15 minutes each.
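To make the "clusters as Kubernetes resources" idea concrete, here is a rough sketch of what a Cluster API cluster definition looks like. The names are hypothetical and the MAAS-provider field details are abbreviated and illustrative, not taken from Minehut's actual manifests:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: minehut                # hypothetical cluster name
spec:
  infrastructureRef:            # points at the provider-specific resource,
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: MaasCluster           # e.g. from the Cluster API provider for MAAS
    name: minehut
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment         # a worker pool: a Deployment, but of machines
metadata:
  name: game-free               # hypothetical node-pool name
spec:
  clusterName: minehut
  replicas: 20                  # scale machines the way you scale pods
  paused: false                 # set true to hold a pool back from rolling updates
```

Bumping the Kubernetes version in the spec then rolls machines the same way a Deployment rolls pods, and that paused flag is what makes the stateful-pool upgrade dance described later in the talk possible.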
So as far as Minehut goes, in our setup we have different node pools that we use for different workloads inside of the cluster. We've got, obviously, the control plane, a default pool, and proxy node pools. Those are for the most part not elastic; they're dedicated, with a fixed number of machines that we run. And that's due to some of the statefulness that we'll talk about in Minehut. On the game side, though, we do autoscale. That's obviously our main workload, the one that goes up and down with peaks and valleys. And we have separate node pools for paid servers. People can pay on Minehut to have extra CPU or RAM or disk space, your basic things, for their server to run, to host more plugins and more people. So we have a paid pool and a free pool, and we use different schedulers for each of them. On the paid pool, we'll spread the pods evenly across the nodes. The paid pool stays fairly static; it doesn't move up and down too much as the day goes on. Most of those servers stay running 24/7. On the free pool, though, we use a bin-packing method on the scheduler to pack them in as tight as possible. When new ones come up, they'll get packed onto the most utilized node, and that keeps things light on the tail end so that the Kubernetes cluster autoscaler can remove the nodes that are not in use as the day goes up and down. One thing to note on how we do free plans: they have a four-hour time limit before they'll get shut down for the day, and that's what allows us to do the bin packing. Otherwise, people would keep their Minecraft servers up for months at a time, which would not allow us to do autoscaling at all for the most part. Oh, interesting. So what you're doing there is leveraging an SLA to allow you to have different operational capabilities as well. Very cool. So, what about all the other stuff?
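The talk doesn't spell out how the bin-packing scheduler is configured, but one standard upstream way to get this behavior is a scheduler profile that scores nodes by MostAllocated, so new pods pile onto the fullest nodes instead of spreading out. A sketch, with a hypothetical profile name:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: binpack-scheduler   # hypothetical; pods opt in via spec.schedulerName
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated        # prefer the fullest feasible node
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```

Under a setup like this, free-tier game pods would set spec.schedulerName to the bin-packing profile while the paid pool keeps the default spreading behavior, and nodes that empty out become candidates for the cluster autoscaler to remove.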
Yes, there's the game that is running, but obviously a game service doesn't run by itself. There are other services we run inside K8s, like databases, messaging, observability. We utilize operators for that, or other open source products like NATS for messaging. We try to basically run everything we can inside a Kubernetes workload and not be dependent on having things outside of it. When it comes to storage, which is probably something you'll wonder about, we try to keep it really, really simple. We use a product called TopoLVM, which, if you're familiar with Linux LVM, is basically an interface into that. It utilizes those local disks in a way that lets you provision persistent volumes from the Kubernetes API and have it all flow through. We have one thing that we do run in the bare metal environment outside of Kubernetes, and that's MinIO, which is for object storage. When you think of the game world in Minehut, it's a database for the most part. It is the state of the world: whatever has ever been created in there, whatever's going on. We back that up as an object into MinIO, and we keep hot data sitting there in the data center for roughly the last two or three weeks of servers that have started. After that, it gets flushed out. We also utilize Backblaze in the background to keep a full archive of every server that's ever been created at Minehut. But we don't just use bare metal. There are a lot of things that we wanna utilize that are just simpler in different places. Take, for instance, AWS IoT: we actually farm that back out to AWS, and we also utilize some SaaS offerings on the observability side. Those handle things that our small team isn't gonna be able to manage and continually take care of in a bare metal environment. So our philosophy, at least, is that we only specialize per cloud when we need to.
Because Super League does work across all the different clouds, with different workloads. But we select tooling that works across the clouds, and we try to go deep on that tooling instead of attempting to use whatever is very specific to one cloud. Yeah, so the net-net of that is: it's hybrid. When you're going to bare metal, it doesn't mean you have to run every single thing on bare metal. It means that you can choose to create a varied topology. And just to drive home that point that Justin was making: even when you are running things on your Kubernetes cluster, whether it's bare metal or anything else, there's a whole host of services out there packaged as Helm charts or with other packaging mechanisms. These are just some of the things that we at Spectro Cloud, for example, offer out of the box, that you can either install on your cluster or use to configure your cluster. So there are lots of Kubernetes-native services there as well. So we're now at bare metal. We've landed on bare metal. We're running on bare metal. So what happens with day two? We've managed to maintain some of these APIs, but how do we do the lifecycle management? Where things get really interesting here, as Justin's hinted at a few times, is that the workloads running on some of these node pools, hint, hint, are stateful. And so you can't just do a normal Kubernetes upgrade to cycle those nodes, because, remember he talked about sticky sessions on the ingress there, if somebody's in the middle of a game and you cycle that node, that's a pretty bad experience. So that's what Justin wants to tell you about now: how they've actually implemented the lifecycle management of the Kubernetes cluster when you have stateful workloads running. Sure, so the statefulness of Minehut is really in three different areas: the Velocity proxies, the game worlds, and the game lobbies.
And it behaves differently depending on which one you were to, for instance, kill. Taking a step back, though, the main thing that we're trying to prevent is player disruption. If you play video games, you don't like to just drop off the server and lose your progress or whatever's going on. So that's a big thing we take into account. For the game lobbies, it's not so much of a big deal. It's not your world, where you're building things, and you don't actually get fully disconnected from the service; you'll get switched to another lobby if one were to break down. We intercept and redirect you. On the game world side, obviously you don't want to be disconnected, but if you are, you will get switched back and put into the lobby, where, whatever the reason your game went down, you can restart it back up. The only real sticking point is the Velocity proxies. Minecraft does not behave like a web service or your browser, where you're just hitting some service; it's not gonna retry for you. You're gonna just get dropped and disconnected, and you're gonna have to load back in from the client if that proxy goes down. Which is unfortunate, but it is how it works, and that's something that we have to deal with. So how do we do upgrades when we have all these kind of stateful databases of the Minecraft world running? Well, we utilize some special flags inside CAPI, Cluster API, to tell it to pause all the game pools that are out there, as well as the proxy pools. We can do these steps weeks in advance of the actual downtime. So we'll pause them, we'll update the cluster spec to say, go to the next Kubernetes version, and we'll let the control plane, the default pools, even the pools running databases, just run through, and they cycle in and out, no problem. And how long does that usually take? It can take a while.
So remember, this is bare metal. When you're cycling a machine out, it's at least 15, 20, maybe even 30 minutes for a new machine to come up and then drop one out. So you were saying that you could do it weeks in advance; you are sometimes starting an upgrade process several days before you hit a maintenance window. Yes, for sure. Ah, interesting, okay. And then we'll take a full downtime maintenance. This is where we tell the community we're gonna shut down for a few hours. We'll safely drain all of their game worlds off, save them back over to MinIO and also back up into Backblaze. And then we will do what is basically a mass delete of all the machines for the game and the proxy pools; we'll delete them inside CAPI, basically a CAPI delete of every machine in those pools. And then we'll tell CAPI to unpause, and CAPI will just bring them all back up at once. This is the best way to minimize downtime for us, and it's the current way we do things. We take a two-hour maintenance window, but usually we can get done fairly quickly this way, within an hour. That is super cool. So what I wanna do is show you a little bit more about CAPI, because most of you said that you aren't familiar with Cluster API. As I mentioned, each one of those boxes you can see here is a resource, just like a deployment is a resource, just like a pod is a resource, just like a replica set is a resource. These are the resources that exist in the Cluster API. So you've got a cluster, and that cluster is broken down into, on the top side, the control plane nodes and the machines that make up those control plane nodes, and on the lower side, the worker pools. And that's where Justin was talking about there being many different worker pools. Some of those worker pools are hosting things that can be cycled before the maintenance window, no problem.
And some of those worker pools need to be held until the maintenance window happens. So you can see down here, under machine deployment, there are a couple of properties; those are the properties that Justin is leveraging. He goes in and changes a value in the machine deployment to say paused, only for those pools that have the game worlds running on them, and the proxies. Then he's also using CAPI to change the counts, so he's deleting all of the nodes, and then he unpauses, so he changes that value back to not paused and lets CAPI do the rest of the work. He's literally just changing attributes on resources the same way that you do with your workloads, and that's pretty damn cool. Cluster API is pretty awesome. What about emergency maintenance? There are OS security issues, and occasionally Kubernetes issues, that you want to get out there right away. This is bare metal, and it functions a little differently than your VMs in AWS; there are occasionally things just wrong with the machines, a little bit more so than you get running on VMs. So we use a couple of open source projects. One is node-problem-detector, to look at those real low-level items that could be not working quite the way that they should. And we also use another project called Kured, the Kubernetes reboot daemon. We run Ubuntu underneath, and Kured sits there looking for the /var/run/reboot-required file, which means that unattended-upgrades found a security issue that it was not able to patch without rebooting the server. Obviously, that wouldn't work for the stateful workloads; we wouldn't want them to go down unless it was a serious enough security issue that we wanted to do that.
So we allow Kured and NPD to run on the default and other node pools, and allow them to reboot as needed, making sure that only one is rebooting at a time. And we'll do a special release if we need the game nodes and everything else to do that. We do have an emergency way for the game nodes and pods to come down. The way we do that is we extend terminationGracePeriodSeconds to make it quite long, and we intercept the TERM signal and do a game save, which saves out to the MinIO cluster. So it is fairly safe for us to do. It does cause player disruption, so we don't do it unless required, but that is how we basically do emergency maintenance. Excellent. All right, so we've only got a couple of minutes left, so why don't you tell us a little bit about the results that you got, and maybe real quick what you're planning for the future. Sure. We got a 65% cost reduction. Whoa. So that is pretty, pretty awesome. The machines were, you know, 55 to 66%, depending on the config that we're using. Network is 92 to 100%. You're probably thinking, 100%? That doesn't sound right. But some of these bare metal providers will give you the network for free in the machine price, so you're not even paying for any egress out of their systems. Another bonus: we got 15% better performance over the VMs. And we didn't have to add any new people to the team to manage all of this. The only downside is that some of these services aren't stressed as much, so you'll hit some interesting things that you'll need to work with the providers on, and the boot times mean infrastructure iteration takes longer. Cool. And what's next? We recently went to Cox Edge; we love bare metal so much. We were previously in Equinix Metal. And we're working on doing hybrid multi-region game worlds, splitting up the traffic globally, and making the automation for K8s a little bit better. So with that, we have about one minute left for questions.
The other thing that I'll tell you is that the QR code there will take you to where the slides are, so if you didn't get a chance to snap all of those, that's taking you to the schedule. And then Spectro Cloud has a booth on the show floor; we're at booth 014. So please stop by. Justin's gonna be there part of the time, I'll be there part of the time, and lots of my colleagues are experts on CAPI, so if you wanna come learn a little bit more about CAPI, we can answer those questions as well. So we literally have just a few seconds left for questions. Is there a question? I see one right here in the front, and then I'll go to you, and I think we can take those two questions. Yeah, and there's a mic there; that way we can have it recorded. So for any of your sort of disruptive operations, whether you're doing upgrades or some of your bin packing, do you ever do any sort of simulation or data analysis to make sure that things are gonna go okay before you do the real thing? No, the answer is no. And we have one more question here. If you can say that into the mic, that would be helpful. I mean, it's great to see the savings numbers that you mentioned in terms of machines and the network, but when it comes to the Kubernetes scheduler and autoscaler, how did you find that Goldilocks zone to make those savings, based on your workloads, right? How did you figure that out? I didn't catch that. What was it? The bare metal infrastructure for the workloads. Yeah, maybe it's kind of hard to hear, so maybe if you can just come up, we can take that offline. Thanks, everyone, for coming. Thank you. Really appreciate it. Thanks.