OK, I've got a lot to cover, so we're going to get started. Hi, my name is Diego. I work at CoreOS, and I'm here to talk to you about self-hosted Kubernetes. At CoreOS we use self-hosted Kubernetes in our enterprise Kubernetes distribution, Tectonic. We've been doing this in production for quite a while now, with lots of customers, and we think it's the best way to run Kubernetes. In this talk I'm going to try to prove that to all of you.

So who is this talk for? A few different groups. First, cluster operators: if you're running or maintaining one or more clusters and upgrading them, self-hosted Kubernetes is an approach we think will make your life easier along a lot of dimensions I'll get into. Second, Kubernetes contributors, or anyone interested in Kubernetes: self-hosting touches a lot of interesting parts of the Kubernetes API surface, and we hit a lot of corner cases we had to solve, so it's cool from a Kubernetes point of view just to see Kubernetes running itself. And lastly, people who enjoy clever hacks. We had to come up with a couple of neat tricks to make self-hosted Kubernetes work, and I'll get into them during this talk; I think they're kind of cool and hopefully you will too.

First, to get everyone on the same page: what is self-hosted Kubernetes? It's somewhat self-evident: it's Kubernetes running on Kubernetes. Specifically, all of the Kubernetes control plane components run as native Kubernetes objects, namely deployments and daemon sets, using things like secrets. That's about it in terms of what it actually is, but of course the devil is in the details.

This talk comes in three parts. First, I'm going to talk about why you'd want to use self-hosted Kubernetes and make the case for why we think it's the best way to run Kubernetes. Then I'm going to get into how it works; I'll actually launch a self-hosted cluster and show you some of the ins and outs and some of those tricks I mentioned. And at the end, I want to get into what's next for self-hosting. Self-hosting works great today, but it also unlocks a lot of possibilities for doing pretty sophisticated things once you have a self-hosted cluster, like managing your cluster and scaling it up, and I'll touch on that at the end.

So first, why would you want to self-host a Kubernetes cluster? Well, first of all, you want to leverage Kubernetes' strengths. Kubernetes is great at running highly available, resilient software; wouldn't it be great if we could run Kubernetes itself taking advantage of all of that useful machinery that's encoded in Kubernetes? Second, it really simplifies your node management story: once you're using self-hosted Kubernetes, your node requirements are very simple, and I'll describe that in a bit. Lastly, it makes cluster lifecycle management a lot easier, because you're managing the control plane with the same tools you use to manage any application on Kubernetes: kubectl and friends.
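To make "running as native Kubernetes objects" a bit more concrete, here is a rough sketch of what a self-hosted scheduler could look like as an ordinary Deployment. This is only an illustration under my own assumptions: the image, flags, labels, replica count, and node selector are placeholders, not the manifests bootkube actually renders.

```sh
# Illustrative only: a self-hosted scheduler expressed as a plain Deployment.
# Image, flags, labels, and replica count are placeholders, not bootkube's real output.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1beta2        # API group available as of Kubernetes 1.8
kind: Deployment
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels: {k8s-app: kube-scheduler}
  template:
    metadata:
      labels: {k8s-app: kube-scheduler}
    spec:
      nodeSelector:
        node-role.kubernetes.io/master: ""    # only run on nodes labeled as masters
      tolerations:
      - key: node-role.kubernetes.io/master   # tolerate the master taint
        effect: NoSchedule
      containers:
      - name: kube-scheduler
        image: gcr.io/google_containers/hyperkube:v1.8.4   # placeholder image/tag
        command: ["./hyperkube", "scheduler", "--leader-elect=true"]
EOF
```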
So if you're running a Kubernetes control plane, what properties would you like that control plane to have? Well, for one thing, you might want to scale it up and down automatically. Say your cluster is growing, you're getting more users, more people are hitting the Kubernetes API; you might want to scale up the number of API servers or controller managers you're running. You also want to handle node failures gracefully: if one of your masters goes down, you want to quickly bring up another master to run the components that were on the failed node, so that you have no downtime and your workloads keep running. You also want to safely roll out new versions of your software in a resilient manner: update your API servers one at a time, say, wait for them to roll out, with no downtime. And what if something goes wrong? What if you set a flag that turns out to break some part of your control plane? You want to be able to roll back to the last known good state.

While we're making our wish list: what about advanced networking? What if you want network policies that constrain how your control plane components talk to each other and to other parts of your cluster? What if you want role-based access control and auditing, to control and keep track of what your control plane talks to and what it's allowed to do? What about health checking and monitoring? If you're deploying Prometheus in your cluster and relying on liveness and health checks, you want your control plane to have those too, so you know very quickly if there's a problem. Lastly, what about resource allocation and accounting? You don't want to waste compute on snowflake master nodes that aren't using their resources optimally, and you might also want to know how much your control plane costs to run and how many resources it's actually using. So if you think about it, what's really good at doing all of this? The obvious answer is Kubernetes itself, right? Wouldn't it be great if we could just run our control plane this way?

Something else that self-hosting unlocks is simplified node management. Forget about masters for a second: what do you really need to run a Kubernetes worker? You need the kubelet, which is what talks to the API server and figures out which pods should be running. You need a container runtime, Docker, containerd, rkt, et cetera, that actually runs your pods. And you need some credentials to talk to the API server, say a kubeconfig. And that's really it; most workers run just that. It's a very minimal set of things compared to a master, where you might be running dedicated API server and controller manager systemd units, if you will, or whatever else. In a self-hosted world, this is all you need, because everything runs in Kubernetes. Every node is really a worker; there's no distinction between a master and a worker from a compute point of view. So in that case, how do you select a master node? It's as simple as applying a label. The only difference between master and non-master nodes is that the masters happen to have labels on them, as sketched below. Again, using Kubernetes' built-in primitives is simpler, more uniform, and more flexible, and I'll get into that a little later as well.
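Here is a minimal sketch of what "selecting a master is just a label" looks like in practice. The exact label and taint keys are an assumption on my part; they depend on what node selectors and tolerations your control plane manifests use, so treat them as placeholders.

```sh
# Turn an ordinary node into a "master" by labeling it so the control-plane
# pods' nodeSelectors match it (label key is a placeholder).
kubectl label node node-1 node-role.kubernetes.io/master=""

# Optionally taint it so regular workloads stay off; the control-plane pods
# carry a matching toleration.
kubectl taint node node-1 node-role.kubernetes.io/master:NoSchedule
```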
Lastly, what about lifecycle management? What about upgrading your cluster, changing flags, or, say, rotating the certs for your API server? Well, you just kubectl apply or kubectl edit: you can go in and change anything, and Kubernetes will roll out those changes in the way we all know and love. Now, realistically, your control plane is the most important part of your Kubernetes cluster, so you don't really want to be driving it with kubectl by hand in production, I would assume. Ideally you'd automate this: write some software that uses client-go to talk to the API server, and encode a lot of logic and expertise into a program, maybe we call it an operator, that manages your control plane, upgrades it, self-heals it, and so on. I'll get into that a little later as well.

So that hopefully makes the case for why you might want to self-host and what the benefits are. Now let's get into how it works. When you want to create a self-hosted cluster, there are three main areas you need to address to have something production-ready, something you could actually use to run real workloads. The first is bootstrapping: how do you actually create a cluster? Once you start to think about it, it's not that easy. Then upgrades: I touched on them a little, but how do they work in practice? How do you upgrade your whole control plane? And then disaster recovery: inevitably, things can and will go wrong, so how do you make sure a self-hosted cluster is recoverable? I'll go through each of these one by one, and I'll jump over to the terminal, make a cluster, and then break it for all of you so we can see how this works.

The first thing we need to figure out is bootstrapping: how do you create a cluster from scratch? The control plane, as I illustrated earlier, runs as daemon sets and deployments, but you need a control plane to create daemon sets and deployments. How can you kubectl create if there's no API server to talk to? So here we have clever hack number one: we create a temporary static control plane to bootstrap the self-hosted cluster. In essence, we stand up a special ephemeral control plane, point it at the etcd instance that will be used by our production cluster, create our assets, and then tear that temporary control plane down, and we're off to the races.

How do you do this in practice? There are a few projects doing this; the one I'm going to talk about today is called bootkube. It's a Kubernetes incubator project that I work on, along with a few other people here. The way it works is you give it temporary control plane manifests. These are manifests that describe plain pods, because they need to run on the kubelet directly: there's no control plane yet, and the kubelet only really knows how to run pods. They describe your API server, controller manager, and scheduler. You also give it your self-hosted control plane manifests, which describe your permanent self-hosted control plane once it gets running. And lastly, you need an initial master node. I put "master" in parentheses because, as I said, there's nothing really special about the master node; it's just the node we happen to choose to bootstrap on.
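To make those inputs concrete: bootkube ships a render command that generates a baseline set of all of these assets for you to customize. The flag names and directory layout below are from memory and may differ between bootkube versions, so treat them as approximate and check bootkube render --help.

```sh
# Generate baseline assets to customize (flags vary by bootkube version; see --help).
# You also point render at your API server address and etcd endpoints.
bootkube render --asset-dir=assets

# Roughly what you end up with on disk:
#   assets/
#     kubeconfig             credentials for talking to the API server
#     bootstrap-manifests/   temporary static pods: apiserver, controller-manager, scheduler
#     manifests/             permanent self-hosted control plane, kube-dns, flannel, RBAC, ...
#     tls/                   PKI material for a secure cluster
```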
So bootkube takes these three inputs, which are just files on disk, plus your node, stands up a temporary control plane, and then pivots it into the self-hosted control plane. Let's actually launch a cluster and take a look at that. Can everyone see this okay? Cool. I've got a few nodes sitting around here on my laptop. Let's look at the actual manifests we're going to use for a second. Bootkube provides a nice little render tool that creates a good baseline set of manifests for running a self-hosted cluster; if you want to try this yourself, you can just download bootkube and play with it. In practice, you really want to customize these assets to match the specific configuration you want.

So if we look at what we've got: we have a kubeconfig up here, and we have these bootstrap manifests I was talking about, an API server, controller manager, and scheduler, and those are just pods. Then we've got a bunch of other manifests, and this is the control plane that's actually going to run: API server, controller manager, scheduler. We've also got kube-dns, which is an add-on we like, and kube-flannel, so we're deploying our network overlay as part of this self-hosted cluster, plus a few other things like RBAC rules and so on. And then we've got all the PKI material we're going to need to have a secure cluster.

If I hop over to my master node, you can see I've copied over the assets already, and I have bootkube. So I'm just going to say sudo bootkube start and tell it to use the assets in this directory. If I open another terminal here, you can see what it's done: it has copied those bootstrap manifests over to the kubelet's manifest directory, so it's starting them up as static pods. Once that's done, it's going to use that temporary control plane to create the self-hosted assets.

Since watching text scroll by isn't actually that enlightening, I'm going to illustrate this with pictures, thanks to my co-worker Aaron, who originally made these really good slides. This just reflects what's going on in my terminal right now. We've got the kubelet, we've got bootkube, and we've got etcd. The first thing bootkube does is create the static pods, as I showed, simply by copying those static manifests into the kubelet directory. So we have an API server, scheduler, and controller manager; once they get up and running and talk to etcd, you effectively have a functioning control plane. The kubelet is configured to look at this API server address, so it actually joins the cluster. So we have this temporary control plane and we have the kubelet: a one-node cluster. Bootkube detects that this cluster is up and running, and then it creates the self-hosted components, effectively by calling kubectl create. The kubelet, since it's talking to the API server, sees that it should be running these pods, so it starts up the actual self-hosted API server, scheduler, and controller manager; really it's just doing what a kubelet normally does, running a daemon set, a deployment, et cetera. Once the self-hosted control plane is running, bootkube tears down the static pods; they're not needed anymore. It sees that the self-hosted cluster is running, and then bootkube itself exits. Bootkube is done, and in an ideal world we never use bootkube again.
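For reference, here is roughly the sequence from the demo, with the flag name and paths as I remember them; the kubelet's static-pod directory in particular depends on how your kubelet is configured, so treat these as assumptions.

```sh
# On the initial node, with the rendered assets copied over:
sudo bootkube start --asset-dir=/home/core/assets

# What happens underneath, roughly:
#  1. bootstrap-manifests/* are copied into the kubelet's static-pod directory
#     (commonly /etc/kubernetes/manifests), so a temporary control plane starts.
#  2. bootkube waits for that control plane, then effectively `kubectl create`s
#     everything under manifests/ -- the permanent, self-hosted control plane.
#  3. The static pods are torn down and bootkube exits; ideally it never runs again.
```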
At that point the self-hosted control plane is talking to the same etcd instance, and that's the trick that let us pivot from the temporary control plane to the final one. The kubelet now talks to this new API server, and we have a one-node cluster. From there, we can start joining other nodes and expand the cluster accordingly.

So let's go back and look at our cluster. Okay, cool, it's all done: "all self-hosted control plane components successfully started", so bootkube exited and we're done with it. Let's actually look at the cluster. You can see we've got two nodes: our master, plus a worker I had sitting around that was just waiting for an API server to appear. And then we have our control plane. If we get the deployments in kube-system, there are our controller manager, scheduler, et cetera; among the daemon sets we've got kube-apiserver, flannel, kube-proxy, and so on.

All right, so now we've got a cluster, but as we know, in Kubernetes land things move very fast, so we want to upgrade it, right? How do we upgrade it? Well, this is probably the most boring slide of the talk: we just go and change the image in the daemon set. It really is that easy. So I'm going to say kubectl edit daemonset kube-apiserver. And 1.8.5 came out, what, yesterday, two days ago? Let's upgrade to that; we want to be on the bleeding edge. Cool, now let's get the daemon sets with -o wide. It's kind of hard to see blown up on screen, but you can see that kube-apiserver has been upgraded to 1.8.5. And thanks to the daemon set's rolling update strategy, it does this in a safe manner. As I mentioned, you probably don't want to be using kubectl edit on your production clusters, but in a pinch or an emergency you can go in and fix something this way.
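In the demo I used kubectl edit; an equivalent one-liner is kubectl set image. This is just a sketch of the idea of bumping the daemon set's image: the container name and the image repository and tag here are assumptions on my part, not the exact values from the demo cluster.

```sh
# Bump the API server image; the DaemonSet's rolling-update strategy replaces
# pods one at a time. Container name and image repo/tag are placeholders.
kubectl -n kube-system set image daemonset/kube-apiserver \
    kube-apiserver=gcr.io/google_containers/hyperkube:v1.8.5

# Watch it roll out.
kubectl -n kube-system get daemonsets -o wide
kubectl -n kube-system get pods -w
```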
All right, the last piece I want to cover here is disaster recovery. Things can and will go wrong with your cluster, whether due to bugs, operator error, what have you. So what kinds of failure modes might you see in a self-hosted cluster? First, partial control plane loss. Let's say someone editing their kube-scheduler deployment accidentally scales it down to zero. Now you've got a problem, because you can scale it back up, but there's no scheduler left to decide which nodes the new pods should run on. How do you deal with that? In that case, you're going to have to recover the scheduler itself. What if you lose the whole control plane? Let's say I have a bare metal cluster in my basement, I accidentally trip the fuse, and I start it back up; we need to restart things, so we'll have to recover the entire control plane. And lastly, you might lose the cluster completely, say I accidentally delete my auto scaling group in AWS; hopefully you have backups and you can recover from a backup.

I'm going to make a brief interlude here to talk about the pod checkpointer, another thing we use in self-hosting. I know I went kind of fast, but a keen observer might have noticed a trick during the upgrade demo. I said upgrading was easy; it is not actually that easy. How do you upgrade an API server? How do you handle master node reboots? I have a single-master cluster here, which is not what you want in the real world, but it's all my little laptop can handle. So when I updated the API server, think about what the daemon set does: it terminates the old pod and, only once that's fully terminated, starts up a new one. Okay, we terminated the API server. Then what? There's no API server; nothing is telling the kubelet what to do next. So how did this work?

Welcome to clever hack number two. We run a checkpointer daemon on all the masters. It's a pretty simple daemon: for certain critical pods, it creates local checkpoints by copying down a static manifest, and in the case of a control plane outage it deploys that manifest, waits for the cluster to recover, and then we move on. During an upgrade, we're really creating a mini outage on purpose, and the checkpointer is what lets us do that upgrade, by temporarily running a local API server, waiting for the new API server to come back up, and then decommissioning the checkpoint.

I can illustrate this with pictures too. Say we have our kubelet and our API server; the checkpointer watches both and basically tries to reconcile them. The API server says pods one and two should be running, and the kubelet says pods one and two are running, so we're good; this is our steady state. The checkpointer creates inactive checkpoints that just sit on standby. Now say pod two disappears from the kubelet, and say it's the API server, like the upgrade example I was just talking about. The kubelet can't start a new one, because there's no API server: you have this chicken-and-egg problem. The checkpointer sees this and activates the static manifest checkpoint. That pod starts, and it's an API server, so now the kubelet can talk to it and decide, oh, I should be running an API server, and it runs the real API server pod. The checkpointer sees that the API server and kubelet match again and retires the static pod, and we're back to our steady state. Pretty simple, but not obvious, and it helps us both during upgrades and in reboot situations.

The last piece I want to talk about here is recovery. The checkpointer helps us in some cases, but not for every outage. For example, in the case I mentioned of scaling your scheduler down to zero, you no longer have a functional control plane, so you can't fix what's broken from inside; you need a bigger tool. If only there were a way to jump-start a cluster. Jump-start sounds kind of like bootstrap, so here comes clever hack number three: we use bootkube to extract manifests from the cluster and then run another temporary control plane. Just like when we started the cluster the first time, we can do the same trick to create a control plane pointed at the same etcd, fix whatever's broken, and then carry on.

So I'm going to do a quick demo of this: we're going to purposely break our cluster. Let's go into kube-system and scale our kube-scheduler deployment down to zero replicas, because why not? If we get pods, we see two schedulers terminating; that's bad news, so let's wait for them to go away. Okay, they're gone. Well, oops, I didn't really mean to scale it to zero; maybe I meant ten or something. So let's scale it back up. Okay, that worked, right? The API server said, sure, if that's what you want to do. But then it needs the scheduler to schedule those new pods, and uh-oh, we don't have a scheduler. All we have is pending pods, and they're never going to get scheduled.
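The break-it-on-purpose part of the demo is just two scale commands, something like this; the replica counts are illustrative.

```sh
# Scale the self-hosted scheduler away entirely...
kubectl -n kube-system scale deployment kube-scheduler --replicas=0

# ...then "fix" it. The API server happily accepts the change, but the new
# scheduler pods stay Pending forever: there is no scheduler left to place them.
kubectl -n kube-system scale deployment kube-scheduler --replicas=2
kubectl -n kube-system get pods
```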
So what do we do? We can use bootkube recover. Let's hop back over to our master really quick. We say bootkube recover, with a recovery dir of /home/core/recover and a kubeconfig, because we need to talk to the API server; in this case we still have an API server, so the kubeconfig is /etc/kubernetes/kubeconfig. It talks to the API server and recreates bootstrap manifests, because if you think about the ones we used to start the cluster, those were a one-time thing: this could be a year later and the cluster could look very different, so we can't reuse those assets. Recover generates fresh, bootkube-start-friendly assets. So then we run bootkube start again, and this time the asset dir is /home/core/recover. It's the same process as before: it creates a temporary control plane, and as soon as that happens we have a scheduler again, it can schedule the missing scheduler pods, and we're good. That'll just take a second for the kubelet to take care of.

So, while that's recovering, and I'll check on it in a second, I want to talk about what's next for self-hosting. Self-hosting is in a good place. As I mentioned, Tectonic is CoreOS's distribution, and we're using self-hosting in production today across lots of clusters; it's stable and great and has all these nice properties I've talked about. But it also unlocks a lot of interesting things we can do to take our cluster management to the next level.

One is automated operations. There's a lot you can do now that these are Kubernetes objects you can manage through the Kubernetes API. One example is cluster upgrades. As I said before, you don't really want to be driving kubectl apply by hand; you want to use client-go and encode some logic, and specifically you want fine-grained control over rollout ordering. It turns out you really want to upgrade your API servers first, then your scheduler and controller manager, and lastly your kubelets, once everything else is okay. You also might want to handle pre- and post-upgrade operations, like turning on a new flag or updating some credentials; you can encode this in an operator. Another is kubelet upgrades. How do you actually upgrade a kubelet? That's the one thing still running statically on each machine, but why not just deploy a daemon set that runs on every machine and makes the necessary changes? It could do an RPM or DEB install. And lastly, configuration management. As we know, things change over time, and we might want to change our cluster without tearing it down or making a new one. One thing we can do is have our operator go in, change flags, and do the right coordination. It might be a nuanced process, but it's something you can encode in software, and it's much less error-prone that way. We can even do things like deploying a new network overlay or changing your network settings; that's something we actually do at CoreOS. A lot of these automated operations are things Tectonic already does, and there are others we're still adding every day.

Similarly, we talked about how node management gets simpler, but having these uniform nodes unlocks a lot of power as well. One is self-healing. I just ran bootkube recover manually, and that's also not ideal.
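Done by hand, that recovery sequence looks roughly like the following. The paths are the ones from the demo, and the flag names are as I remember them for bootkube at the time, so double-check them against your version.

```sh
# Rebuild bootstrap-style assets by reading the self-hosted manifests back out
# of the cluster (here we still have a working API server to talk to).
bootkube recover --recovery-dir=/home/core/recover \
    --kubeconfig=/etc/kubernetes/kubeconfig

# Then run the usual bootstrap pivot against the recovered assets:
sudo bootkube start --asset-dir=/home/core/recover
```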
I had to SSH to the machine and run some commands. What if there were node agents that, when the operator senses a control plane problem, could remotely invoke bootkube recover or something equivalent to heal your control plane? Or what about auto-scaling? You could increase the number of masters if you're experiencing higher load or your requirements grow over time. Or, since all nodes are the same, any time your auto-scaler adds a new node, it could talk to the operator and say: hey, I'm a new node, what should I be? Should I be a master? A worker? Some other specialized role? Since all nodes are the same, they can be provisioned as anything. And lastly, node identity. In recent Kubernetes versions, something called TLS bootstrapping was added, which allows every node to have a unique identity by performing a challenge and response with the API server when it joins the cluster. Since our nodes are simple and have very little state, they can use this very nicely, and we're very close to merging it into bootkube; it should land, hopefully, next week.

So let's check back on our cluster for a second. Yep, it's back. If we look at our pods here, yes, our scheduler is back; we're happy. Cool.

So bootkube is a Kubernetes incubator project and Tectonic is CoreOS's specific distribution of Kubernetes, but we're also trying to get this support upstream so that more Kubernetes users can access it. kubeadm, the upstream tool for deploying clusters, is adding self-hosted Kubernetes support. It's almost done; I think you can actually launch clusters now, and upgrading is coming next. The checkpointer, instead of being a standalone pod, is being added to the kubelet itself; that's already done for pods, and we're adding support for secrets, which is necessary for the API servers the way we run them. But as always, we need help. If you're interested in this kind of stuff, or in coming up with more clever hacks, check out SIG Cluster Lifecycle: you can find us on Slack or the mailing list and come to our weekly meetings. So that's all I've got. We have a few minutes for questions if anyone has anything. Thank you so much for coming.

Does this work? Yeah. Yes.

In the beginning, where was your etcd cluster? And how do you feel about self-hosting etcd itself?

Ah, good question. My etcd cluster is running on a separate node, just natively. Self-hosting etcd is something we've worked on. We've actually uncovered a lot of bugs in the API server as a result, around connection handling: when self-hosted etcd runs on the pod network and the members move around a lot, it turns out the API server is not actually resilient at reconfiguring those connections. That was something we'd never exposed before. So we're working on fixing HA for self-hosted etcd, and then we'll re-evaluate whether it's the right thing to do. I think it makes a lot of people nervous because etcd is the baseline building block. On the other hand, you can launch self-hosted etcd clusters for your own use, not for your control plane, and that works really nicely. So it's experimental; you can actually enable it in bootkube, but there are some known issues we're trying to fix upstream in the API server. Sorry, can you pass the mic to someone?

So what if we want to upgrade Kubernetes versions? Do we need to bring down the whole cluster?
Say I want to update from 1.8 to 1.9.

You don't have to do that. With Tectonic, we've been able to upgrade from 1.6 to 1.7 and from 1.7 to 1.8 with no downtime, just a rolling update exactly as I showed.

What are the bare requirements on the node? You might have covered it and I missed it. In order to get this working, you need to run the kubelet directly on the node, right?

Right, and that's not obvious: you can't self-host the kubelet in the kubelet. We tried, but that one was a little too tricky.

And when you're adding new nodes, I guess it doesn't matter where the control plane runs, since it's a daemon set. But I'm also confused about the proxy, because I thought that has to run on every node, and flannel too.

Yes, I kind of elided this for time. We run the control plane on nodes that are labeled as master nodes: we label the nodes, we set node selectors on our control plane, and we set taints on those nodes so that other workloads don't run on them. And for the API server and kube-proxy, yes, those need host network connectivity, so they run as daemon sets using host networking. But the controller manager and the scheduler can run as deployments, since they don't have that requirement.

But does flannel have to have access to the host devices?

Sure, yeah. Well, it uses CNI, right? So it just talks to the kubelet and is able to deploy the plugins, and that's it.

And the proxy works the same way, I guess?

Yep.

Very cool. So the only binary that you need to run directly on the node is the kubelet itself.

That's right. I think we might be out of time. One more question. Anyone? OK.

A clarification on the checkpointer: is it necessary in case of a breakdown of communication with the control plane? Or for when there's a breakdown, the node reboots, and then the kubelet doesn't know what to start?

So the checkpointer runs conservatively. It's basically trying to reconcile the state between the last time it talked to the API server and what it sees on the kubelet, which it can always see because it's running on the kubelet itself. So if the API server is down, or there's a network partition or something, it does the conservative thing. But in Kubernetes land, that's what you want, right? If you end up temporarily running an extra pod, that should theoretically be OK; you don't want to run the kinds of workloads where you can only ever have exactly one pod.

I mean, if I just lose my control plane, like connectivity to the API server, will I instantly lose my pods because the kubelet goes, oh, I don't know what to do?

No, that's not the way the kubelet works. The kubelet will keep running what it's running. It matters more if the node reboots or something else happens.

Thanks.

Cool. All right, let's go.

I want to ask about mixing a permanent control plane and a temporary one.

Well, what do you mean by mixing them? Maybe you can hand over the mic really quick.

Having a permanent control plane along with what we're seeing right now, the best of both worlds, correct?

Well, the temporary control plane is, as the name implies, temporary: it's only used when you're starting your cluster or recovering it.

Right, but having a permanent control plane alongside a self-hosted control plane, how about that?

Well, if you're maintaining a permanent control plane, then you have double the overhead, and you don't get the benefits of the self-hosted control plane in terms of management and everything else. If you have to maintain both, it's almost, what's the point, right? OK.
Because part of the idea is that this simplifies the management of the control plane.

Right, thank you.

Thanks very much.