Hi, I'm Clem, my pronouns are he/him, and I work at Slack as an engineer. Thanks for joining this session today, where I'm going to talk about how, at Slack, we support long-lived pods using a simple Kubernetes webhook and a few other bits and pieces. After quickly introducing my team, I'm going to describe what so-called long-lived pods are and why we need to support them, and then we're going to dive into what the solution we came up with looks like and the different parts it's made of. Finally, we'll wrap up with a few of the limitations and possible future improvements.

I work in the Cloud Compute team at Slack, and I'm based in Melbourne. We have about half the team in Australia and the other half in the United States. Let's look at some numbers to give you some context about what we're dealing with. On a good day at Slack, we get about 16 million concurrent users. That load is spread across some 45,000 EC2 instances, so we're on AWS. We manage about 162 Kubernetes clusters onto which some 316 services are deployed, and we have just over 1,000 engineers. We also manage 235 Chef roles, and although my talk today is focused on our Kubernetes compute platform, we still have important applications running on our original compute stack, which is Chef on EC2.

So let's look at what the problem is. The problem is that some pods want a long lifespan, but the nodes they're running on are getting killed. First, let's look at why some pods want a long lifespan, and then we'll look at what is actually killing our nodes. In an ideal world, applications are 12-factor: they are stateless, they boot instantly, they scale out infinitely without any strain on infrastructure, and they're fun to work with. But unfortunately, we don't live in an ideal world. Instead, we live in a growth-obsessed capitalist society, and we have to deal with applications that are stateful, that can be slow to warm up, that can be hard on infrastructure when they do scale out, that maybe don't like to get terminated early, or that hold onto long-lived WebSockets, sticky sessions, or other things we don't want to lose. While at Slack we do have some ideal applications, those were the first ones to migrate from our original Chef compute platform to our Kubernetes platform, and so today we're left with a long tail of unruly dogs: applications that are hard to move from one platform to another. This is Peach, Apricot and Plum, my dogs, and they are very cute friends.

Okay, back to work. Let's look at some examples of applications that fit in that less-than-ideal category. First, I've got batch jobs. While today you could design a batch job that gracefully survives a restart without losing any of its work, for example by checkpointing its work to external storage, at Slack we have some batch jobs that are not as clever as that and that can take a long time to run, we're talking more than a day. If those batch jobs get killed, we have to restart from scratch, and people are waiting for those jobs to finish, so that's not something we want. The second example I've got is distributed caches, think Redis or Memcached. When you want new replicas, they need to pull existing data from the existing nodes to warm up and get populated, so they can be slow to warm up, and that creates a lot of internal network traffic.
If you lost half of the replicas on your ring, you might be heading into a disaster, so you need to tread carefully. The third example I've got here is a Jenkins controller. Given that it's a singleton, you can only run one of those at a time as a pod, and if that pod dies, engineers lose access to Jenkins, so you don't want it to die during the day when people are doing their work.

Okay, so some applications don't want to get killed, but why are they getting killed? Well, they're getting killed because the pods are running on nodes, and the nodes are getting killed, and here are the systems at Slack that are killing the nodes. I've separated those into two categories: on the left side, the things we control, and on the right side, the things we don't control. The first thing is autoscaling groups scaling in: outside of peak hours, we reduce the size of our clusters to save money and the planet, and that kills nodes. The second thing we do is chaos engineering: we kill some nodes to make sure that everything keeps working, and nodes go away when we do that. The third thing is that we have a process that kills every node that gets to two weeks old, because that helps us roll out patches and updates. Talking about patches and updates: if we need to roll out an emergency security patch, sometimes we just don't have the luxury of waiting two weeks, so we have to go and rotate all the nodes, and that's something we can't do anything about; it has to happen. And the last one I've got here is AWS terminations. AWS is usually pretty polite and sends you a notification that some of your instances are going to get terminated, so you can act on that, but you could also lose a node out of nowhere if you're pretty unlucky.

Okay, so let's look at the solution now. Let's look at the whole picture first, and then we'll dive into the different components that make up this solution. We're using taints and tolerations to match pods to nodes, and then we're protecting the instances from getting killed. Taints and tolerations are a built-in Kubernetes feature that helps you match pods to nodes. Today it's more likely that you would use something like pod affinities and anti-affinities, which give you more granular control, but for our feature, taints and tolerations were simple, and that's all we needed. We decided that nodes live for two weeks, 14 days, and that's what we base our calculations on. In this example, we've got a pod that wants to live for seven days. What we do is apply a toleration on this pod; the toleration is called lifespan-remaining, and it has the values 7, 8, 9, all the way up to 14, which means the pod is happy to get scheduled on any node that has at least seven days of life remaining. The nodes are what we apply taints on, and the taint represents how many days of life the node has left, based on its uptime. In this example, the node is four days old, so it has 10 days left to go. And then we need to go and teach the systems that kill nodes not to kill nodes when they're not allowed to, and we're going to talk about that in a few slides as well.

OK, so let's zoom into how we get that taint on the nodes. That's pretty simple: we've got a small Go service that we run in each of our clusters, and it loops through all the nodes and keeps their taints updated with the right value.
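As an aside, here is a minimal sketch of what a tainting loop like that could look like with client-go. The taint key (lifespan-remaining), the 14-day constant, the NoSchedule effect and the loop interval are assumptions for illustration; this is not Slack's actual service.

```go
// Minimal sketch of a node-tainting loop, assuming in-cluster credentials and
// a hypothetical "lifespan-remaining" taint key. Illustrative only.
package main

import (
	"context"
	"log"
	"strconv"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const (
	taintKey    = "lifespan-remaining" // assumed taint key
	maxLifespan = 14                   // nodes are rotated after 14 days
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	for {
		if err := taintNodes(context.Background(), client); err != nil {
			log.Printf("tainting pass failed: %v", err)
		}
		time.Sleep(10 * time.Minute) // assumed interval
	}
}

// taintNodes refreshes the lifespan-remaining taint on every node based on its age.
func taintNodes(ctx context.Context, client kubernetes.Interface) error {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for i := range nodes.Items {
		node := &nodes.Items[i]

		ageDays := int(time.Since(node.CreationTimestamp.Time).Hours() / 24)
		remaining := maxLifespan - ageDays
		if remaining < 0 {
			remaining = 0
		}

		// Drop any previous lifespan-remaining taint, keep everything else.
		var taints []corev1.Taint
		for _, t := range node.Spec.Taints {
			if t.Key != taintKey {
				taints = append(taints, t)
			}
		}
		taints = append(taints, corev1.Taint{
			Key:    taintKey,
			Value:  strconv.Itoa(remaining),
			Effect: corev1.TaintEffectNoSchedule,
		})
		node.Spec.Taints = taints

		if _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```

NoSchedule is assumed here because a NoExecute taint would evict pods once a node's remaining days dropped below what they tolerate, which is the opposite of what the feature wants.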
In the diagram, the node at the top has got seven days left because it's seven days old. The one on the right-hand side has got zero days left to live because it's been up for 14 days. And the one at the bottom is a new node, so it has a whole 14 days to go.

Right, so we applied taints on the nodes, but that doesn't do anything by itself. Now we need to tell the systems killing our nodes to understand this concept of taint and toleration that we're using, and to prevent them from killing the nodes. In the next two slides, I'll talk in detail about ASG scaling and chaos engineering. As for the third one I mentioned, where we just kill nodes after two weeks: we realized we could simply retire it because, thanks to the first two, nodes don't even make it to two weeks anymore. But then we're left with the things we don't control. If we need to roll out a security patch or if AWS kills an instance, that's just too bad. In a way, this is part of the contract we have with the users of this feature: the feature isn't perfect, and sometimes something like that might happen.

OK, so let's talk about autoscaling. Put Kubernetes aside for a second. The classic way to scale instances in an autoscaling group in AWS is by using autoscaling rules. Autoscaling rules look at the resource consumption on your EC2 instances, so how much CPU and how much memory is being used, and once you reach a given threshold, they add or remove nodes. But with Kubernetes, that doesn't work, because workloads in Kubernetes, the applications running there, request resources: they request CPU, they request memory, and that doesn't mean they are using what they're requesting. So you can get into this unfortunate situation where all of the resources on your cluster are requested, are claimed, and you can't schedule any new pods, while the resource utilization on your instances is still pretty low. This is where cluster autoscaler comes in. Cluster autoscaler has insider knowledge of the resources that are requested in a cluster, and it scales your cluster based on that, not based on the actual utilization of your instances. Cluster autoscaler is a Kubernetes project that you can find on GitHub, and here I've got an extract of the deployment manifest we use to deploy it. There's a bunch of flags, which I'm not going to go over, but I will talk about the last two, because those are the flags we had to add to get this feature working.

The first one is the ignore-taint flag for lifespan-remaining. When cluster autoscaler wants to create new nodes, it bases them on an existing node; if it reused that node's old lifespan-remaining taint, new pods wouldn't be able to get scheduled on the new nodes. So we just say: ignore that taint, we're tainting the nodes ourselves, don't worry about it. However, any node that has a taint that is being ignored is seen as unready by cluster autoscaler, because taints are typically used to cordon nodes or mark them ready for deletion, things like that. And when too many nodes are seen as unready, cluster autoscaler freaks out a bit and goes: hey, something looks dodgy here, I'm not going to touch anything. So we have to say: look, it's OK for 100% of the nodes to be unready, and then it keeps working.

So, I mentioned that we're doing some chaos engineering.
We have a service called kubetest cluster, and it loops over every cluster we have. For each cluster, it looks at all the workloads and resources deployed there, how many there are and what state they're in, then finds a node that is eligible to get killed, kills the node, waits for a new node to come back up, and waits for the resources and workloads to stabilize and look the same as before, so that we know we can scale out, schedule pods, and do all the things we want to do. But how do we know if a node is eligible to get killed? The simple way to go about it would be to just wait for a node to be two weeks old; then we know we're not breaking anything and we can kill it. Instead, what we do is run a function over our nodes. This function looks at the pods running on the given node, and unless there is a pod with a lifespan request that hasn't been reached yet, we're allowed to kill the node. I've got a diagram on the side here to illustrate. Say we have just one node and one pod, and when the node is two days old, we schedule a pod that has a lifespan request of four days. For the next four days, that function will return false: we're not allowed to kill the node, because there's a pod there that still needs to live for a few days. But after four days, so when the node is six days old, the function will return true and we're able to kill the node without breaking any contract on the pod. And so we can kill the node before it gets to 14 days old.

Okay, so we saw how the nodes get tainted and how we protect them from being killed, but now let's look at how we get the toleration onto the pods. At Slack, users typically don't deploy Kubernetes manifests directly themselves. Instead, they write a Bedrock YAML config file that gets parsed by our tooling chain, which creates the Kubernetes resources. In this example, there's a service called Grumpy Service that has one stage to be deployed in dev, asking for one replica and for a minimum lifespan of seven days. That YAML config file gets parsed by our Bedrock CLI, and Bedrock CLI creates a Kubernetes deployment on the targeted cluster. It takes that minimum lifespan field in the Bedrock YAML file and turns it into a label, called lifespan-requested, on the deployment's pod template. So we've got a deployment, and the deployment controller goes and tries to create a pod. The Kube API server sends an AdmissionReview request to our admission webhook. The admission webhook looks at the pod spec, mutates it by injecting all the tolerations we need based on that label, and returns the AdmissionReview response to the Kube API server with a JSON patch. And so a pod gets created with the tolerations all the way from 7 up to 14. Users never have to type all that by hand; they just fill in the fields in the initial YAML file and the magic happens.

So to do that, we need an admission webhook. I went online and tried to find out what the best way is today to get an admission webhook off the ground and deployed in Kubernetes, and I found Kubebuilder, which I'm sure you're familiar with. This is the first line in their README: Kubebuilder is a framework for building Kubernetes APIs using custom resource definitions.
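Before getting into the webhook itself: here's a rough sketch of what the chaos-engineering eligibility check described above could look like, where a node can be killed unless some pod on it still has time left on its requested lifespan. The lifespan-requested label key and the function name are illustrative assumptions, not Slack's actual code.

```go
// Rough sketch of a node-eligibility check for chaos engineering, assuming
// pods carry a hypothetical "lifespan-requested" label holding a number of days.
package chaos

import (
	"context"
	"strconv"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const lifespanLabel = "lifespan-requested" // assumed label key

// nodeEligibleForTermination returns false if any pod on the node still has
// time left on its requested lifespan, true otherwise.
func nodeEligibleForTermination(ctx context.Context, client kubernetes.Interface, nodeName string) (bool, error) {
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return false, err
	}
	for _, pod := range pods.Items {
		days, ok := pod.Labels[lifespanLabel]
		if !ok {
			continue // this pod doesn't use the lifespan feature
		}
		requested, err := strconv.Atoi(days)
		if err != nil {
			continue // malformed value, ignored in this sketch
		}
		if time.Since(pod.CreationTimestamp.Time) < time.Duration(requested)*24*time.Hour {
			// The pod's requested lifespan hasn't been reached yet.
			return false, nil
		}
	}
	return true, nil
}
```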
Reading that first line of the Kubebuilder README, you should already start thinking that we might not be on the right track, because we're not doing any CRDs, we're not controlling any resources; what we're doing is stateless. We just want to mutate requests on the fly. On the right-hand side, I've got the architecture concept diagram from the Kubebuilder documentation. It's probably too small to read, but I just wanted to put it up to show all the capabilities Kubebuilder has. We actually don't need most of that; all the controller stuff we don't need, we don't need to manage anything. All we need is the box on the right-hand side called Webhook.

So instead, we went and wrote a simple admission webhook in Go. That worked pretty well for us, so we open-sourced it, and you can find it in the slackhq organization on GitHub. And look, even if it's not something you need or want in production, this repo comes with a Makefile that uses Docker and Kubernetes in Docker, also known as kind. With a couple of commands you can get a Kubernetes cluster running on your laptop, deploy the admission webhook, deploy pods, look at the logs, or change the code, change the validations and mutations, redeploy the webhook, and see what's going on. What I mean is, I'm hoping this might have some educational value: if you are not super familiar with how admission webhooks work in Kubernetes, maybe that's something you want to play with. It also comes with a blog post that goes into a bit more detail on how the different parts of the code interact. If you don't know the Slack engineering blog, I suggest you go have a look; there's some really good stuff there, especially some really solid technical articles after we have serious incidents.

Okay, so that's all the bits. We've got pods getting matched to nodes, and the nodes are protected from being killed too early because we taught our systems how not to do that. This lifespan feature at Slack is used today by about half a dozen services. The maximum value we allow is seven days, and everyone sets it to seven. Initially, when we talked to different teams, some people said: look, all we need is one day or two days, just give me the guarantee that my pod is not going to be killed every few hours. But at the end of the day, they still set seven days in the config file. So we could have just made that field a Boolean in the config file and the end result would have been the same; a bit simpler, maybe.

All right, I've got a few more things to talk about before wrapping up. The minimum lifespan guarantees that your pod will live for a given amount of time, but there's no guarantee that your pod will actually get killed once it reaches the end of that lifespan. Some of our users actually want control over the whole lifespan dynamics of their applications. If we think back to our Jenkins controller, it's not really that we want the Jenkins controller to live for a day or seven or whatever; it's that we really don't want it to get killed during the day. With what we've implemented so far, we can't express that, so we have our users kill their application at night to reset the minimum lifespan counter. Hopefully that's a feature we can add later on, but it's one of the limitations we have today.
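To give a feel for the mutation itself, here is a heavily stripped-down sketch of a handler that reads a lifespan-requested label off the pod and patches in one toleration per acceptable lifespan-remaining value. The label key, taint key, port and certificate paths are assumptions for illustration; the webhook we open-sourced on the slackhq GitHub organization is the place to look for the real structure.

```go
// Stripped-down sketch of a mutating admission webhook handler. Illustrative
// only; label key, taint key and TLS paths are assumptions.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"strconv"

	admissionv1 "k8s.io/api/admission/v1"
	corev1 "k8s.io/api/core/v1"
)

const (
	lifespanLabel = "lifespan-requested" // assumed pod label set by the tooling
	taintKey      = "lifespan-remaining" // must match the node taint key
	maxLifespan   = 14
)

func mutate(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "could not decode admission review", http.StatusBadRequest)
		return
	}

	var pod corev1.Pod
	if err := json.Unmarshal(review.Request.Object.Raw, &pod); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	resp := &admissionv1.AdmissionResponse{UID: review.Request.UID, Allowed: true}

	// If the pod asks for a minimum lifespan of N days, tolerate every
	// lifespan-remaining value from N up to 14.
	if days, err := strconv.Atoi(pod.Labels[lifespanLabel]); err == nil {
		tolerations := pod.Spec.Tolerations
		for v := days; v <= maxLifespan; v++ {
			tolerations = append(tolerations, corev1.Toleration{
				Key:      taintKey,
				Operator: corev1.TolerationOpEqual,
				Value:    strconv.Itoa(v),
				Effect:   corev1.TaintEffectNoSchedule,
			})
		}
		patch, _ := json.Marshal([]map[string]interface{}{
			{"op": "add", "path": "/spec/tolerations", "value": tolerations},
		})
		patchType := admissionv1.PatchTypeJSONPatch
		resp.Patch = patch
		resp.PatchType = &patchType
	}

	review.Response = resp
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(review); err != nil {
		log.Printf("failed to write admission response: %v", err)
	}
}

func main() {
	http.HandleFunc("/mutate", mutate)
	// The API server only talks to webhooks over TLS; cert paths are placeholders.
	log.Fatal(http.ListenAndServeTLS(":8443", "/etc/webhook/tls.crt", "/etc/webhook/tls.key", nil))
}
```

The JSON patch uses an "add" operation on /spec/tolerations so it works whether or not the pod already had tolerations set.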
Another thing I didn't mention is that we already had an old admission webhook, one that was built using Kubebuilder. But it's a few years old, it hasn't been upgraded in a while, and upgrading it today to a more recent version of Kubebuilder would be a lot of rework, rewriting, refactoring. And in the same way, the validations and mutations that this old admission webhook does are stateless too: it doesn't do CRDs either, it doesn't do any controlling. So it has the same requirements as the simple webhook I presented to you. That was one of the reasons we went and created the simple admission webhook: we thought, this is great, we can just migrate the features from the old one to the new one and retire that old piece of work. But as you might know, retiring old systems is hard to prioritize. While some of the features have been migrated to the new one (for example, validating that our Docker images come from our trusted registry), we still have a lot left. Everything that is sidecar injection is still on the old webhook: we inject sidecars to do pod-to-pod encryption, to deal with secrets, and for service discovery, and all of that is still living in the old one.

OK, and I didn't build all this by myself. There were three of us working on the feature: me and Sean, who I would like to thank for the things he taught me about software engineering, and Trisha, who has been a great source of inspiration. I also want to thank Javier for his ongoing support and for how excited he was to have me here today doing this talk. And those are the images I used. Thank you.

Perfect, great session. So now, time for the Q&A. There was one question online about the slides, but they're going to be added to the event platform, I think; the link to the slides, that was the question. Yeah, also the slides are on Sched, so you can just download them. I did a few updates today, so I'll re-upload a new version after this. Perfect. Any questions here? OK, there.

Thank you. What's the reasoning behind having a two-week lifespan for all the nodes? And a related question: can you guarantee that every node actually gets a 14-day span, given security rollouts? And if, let's say, eight days pass and after that someone asks for a seven-day lifespan, you wouldn't be able to schedule it because there is no node available. How do you deal with that?

Yeah, right. So the two weeks is historical to start with; it's been the cadence for how often we rotate nodes and how often we roll out patches. And this is why the maximum we allow is seven days: it means we still have seven days to kill the nodes and get everything done. If we let people set two weeks, then at the end we'd have to rotate all the nodes at once, and that's not a position we want to be in. But most of our applications don't use this feature, so from seven days left down to zero, most of everything we have will still run on those nodes. It's just the few picky ones with a lifespan request that won't be able to run on the nodes that are between seven and 14 days old.

Great. Any other questions? Oh, there we have one, really quick hand.

Thanks for the presentation. Just a question about services: do they know internally that they have these many days to live? For example, if I start my service at Slack, do I know that my service has seven days?
And then do your engineers do something with that, like exit gracefully at seven days so they're not consuming resources, and you end up with empty nodes? How do you deal with that?

So you mean services that don't use this feature? If my service is configured to request 10 days of minimum lifespan, what could I do with that? Could I exit after 10 days and stop the pod from running? Or do you leverage that in some way?

So if your service is requesting 10 days, then after the 10 days nothing will happen right away, but it'll likely get killed somewhere between 10 and 14 days, because that's how fast we rotate nodes. Am I answering your question? OK, so developers don't do anything with that 10-day lifespan? Oh, no, no, no. It's just an extra guarantee they have on the stability of their workloads in our infrastructure; they don't have to do anything. Thanks.

Great. Was there any more? Yes, this one. Quick question. I like the taints and tolerations approach, for making it very easy to see what the state of the nodes is. Has there been a discussion, or an idea, of using custom scheduling rules or writing a custom scheduler to get the same behavior you have right now?

Not seriously. That could be an option; we should talk about it after this. It would be interesting to know how you might go about using a custom resource definition or a custom scheduler to do that. We had a bit of time pressure to release this, and we tried to find the easiest way to get the feature delivered, and that's what we came up with, but a custom scheduler would definitely work.

OK, great. Any more? No immediate hands. Oh, there's one, sorry, I missed you.

That was a really good talk, thank you, first of all. You've got a huge number of nodes to manage. (I can't hear you at all.) So the one thing I noticed from your talk is that you've got a huge number of nodes to manage, and with the number of long-running pods you're trying to prevent being killed prematurely, I'm wondering how you measure the success of that. Do you know how many long-running pods are still getting killed prematurely?

Yeah, we don't know.

So you have no monitoring tool to see how successful your approach has been?

At first we had debug logging enabled on the admission webhook, so we were getting a lot of analytical data from the logs, and we saw how many applications used the feature and how old the pods were when they were getting killed. From that little investigation, that window of data we had, I don't have a number to give you, but it looked pretty good. But the logs are not at debug level anymore in production, so we've actually lost track of this, and, yeah, our telemetry story could be improved.

Great. Any final last ones? Nope. Thank you for a really great session. Thank you.