Welcome, welcome to my talk, "How to Blow Up a Kubernetes Cluster." Sorry about the clickbait, but judging by the crowd in the room, I guess it kind of worked. I hope some of you have read the talk's description, because if you haven't, you'll be disappointed to learn that this is actually about resource management. And it's about resource management from the point of view of an application developer, because that's who I am: I'm an application developer. My name is Felix. I'm a software engineer at iteratec; we are a consultancy slash agency based in Germany.

The reason I'm giving this talk is that last year I gained access to a Kubernetes cluster, and I didn't really have a clue how Kubernetes works. I was told there were some pods that needed limits and requests set. I read about it in the documentation and I thought I was doing good things, but it turned out the entire thing blew up. And this is why I'm giving this talk. Since this is the 101 track, I'll start from the top.

So let's look at what the documentation has to say about resources. When we talk about resources, obviously we're talking about compute resources: there's CPU and there's memory. These are the obvious ones. There are tons more, like ephemeral storage, PID limiting, and many others, but for the purpose of this talk I'll concentrate on CPU and memory.

The documentation tells us how we can set requests and limits. It defines requests as the amount of memory and CPU that is guaranteed for our containers, and limits as the amount of CPU or memory that our containers cannot exceed. There are different units we can use for these settings. For CPU, there's the CPU unit, which is one core, either physical or virtual, and we can request fractions of a core. We define these fractions using millicores (or milliCPU), where 1000 millicores equal one CPU unit. For memory, the base unit is the byte, and we can use everything from kilobytes up to exabytes, and of course the binary equivalents, from kibibytes up to exbibytes.

Now, when we define those requests and limits in a pod definition file, they look like this. For this pod, or rather for this container (they're always set per container), we have defined a memory request of 64 MiB and a CPU request of 250 millicores, and the limits are two times our requests: 128 MiB and 500 millicores.
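To make that concrete, here's a minimal sketch of such a pod manifest. The name and image are placeholders I made up; only the numbers are from the slide:

```yaml
# A minimal pod manifest with the requests and limits from above.
# The name and image are placeholders, not from the actual slides.
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0  # placeholder image
    resources:
      requests:
        memory: "64Mi"   # 64 MiB guaranteed
        cpu: "250m"      # a quarter of a core guaranteed
      limits:
        memory: "128Mi"  # beyond this, the container is OOM-killed
        cpu: "500m"      # beyond this, the container is throttled
```

Note that requests and limits always live on the container; for scheduling, Kubernetes sums up the requests of all containers in a pod.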
Now, what happens to our pod when we exceed our memory limit? That's kind of easy: the pod just gets terminated. We run into an out-of-memory kill. And that's simply because memory is an incompressible resource. We cannot make more of it; once it's gone, it's gone, and we cannot share it. For CPU, it looks a bit different, because CPU is a compressible resource. There is the possibility to throttle, so Kubernetes doesn't need to terminate these pods.

Now, let's take a quick look at how pods are scheduled in Kubernetes. For this scenario, we have two nodes. Both nodes have five units of memory and five units of CPU; let's just say one memory unit is one gigabyte and one CPU unit is one core. Now we want to schedule a pod, and the pod is requesting three gigabytes and two CPUs. The scheduling here is done in a round-robin kind of fashion, so the first pod is probably going to be placed on the first node.

Before Kubernetes places the pod on the first node, it checks: do we have enough resources to fulfill this request? Do we have three gigabytes and two CPUs on this node? The node is completely empty, so we can just place the pod there. And this goes on for the next pod. The next pod is probably going to be scheduled on the second node, because Kubernetes again checks whether there are enough resources for it, and of course there are; that node is also empty. But for the third pod, it looks a bit different. First, Kubernetes tries to schedule it on the first node again, because it's a simple round-robin algorithm. This pod is requesting three gigabytes and two CPUs. We have the CPUs, but we don't have the memory, so that doesn't work. But we can place it on the other node, and that actually kind of works.

So now we have placed three pods on our two nodes; we have used all the memory of the second node and some of the memory of the first node. But if you've paid attention to the limits we've chosen here (and limits don't matter at all when it comes to scheduling), you might have seen that these limits are quite a bit higher. For the first node, we have a combined limit of five gigabytes of memory and two CPUs, and that memory is basically all this node has. So if that pod starts to use memory all the way up to its limit, the node's entire memory is in use. That's not a problem right now, because there's nothing else on this node, but we still have space to schedule something here, and then it could become a problem. And it might already be a problem on the second node, because there the limits are much higher than the capacity. Once these pods start to go over their requests, up toward their limits, we may run out of memory. We also have another problem on the second node: we have allocated all our memory, but we're only using three out of the five cores. Those other two cores are just stranded, because there's nothing more we can schedule on this node; memory-wise it's completely packed.

Okay, now we've discussed what happens when a pod exceeds its limit, but what actually happens in this case, when a node runs out of memory? Well, then Kubernetes also has to terminate some pods, but this time it's not the pods that exceed their limits, because there probably are none. This time, Kubernetes terminates pods that exceed their memory requests. So limits don't matter in this case either. And this is what brings me to the title of the talk, how to blow up a Kubernetes cluster.

The situation we had, all the ingredients to blow up a cluster: we were running a couple of microservices, these microservices were communicating using Kafka, and we had barely enough memory in our cluster. Some words about Kafka, if you're not familiar with it: it's a platform for distributed event streaming, and we used it for asynchronous communication. What's important here is that it uses a ton of memory, and it keeps its entire state in memory. Our Kafka pods were usually using about 2.8 gigabytes of memory, but we also saw some spikes, so we went and set the request to 3 gigabytes and the limit to 8 gigabytes.

The incident looked kind of like this. We had three Kafka pods in total, running on three different nodes, and on all of these nodes there were also other services running. And the utilization was incredibly high.
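For reference, the resource section of our Kafka pods looked roughly like this. This is a reconstruction; only the numbers come from what I just described:

```yaml
# Roughly the resource section of our Kafka brokers (a reconstruction;
# only the numbers come from the talk, the rest is illustrative).
resources:
  requests:
    memory: "3Gi"  # typical usage was around 2.8 GiB
  limits:
    memory: "8Gi"  # plenty of headroom for the spikes we saw
```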
We had a memory utilization of 90-something percent. That went well for a while, but at some point we were simply trying to use more memory than we actually had, and then Kubernetes needed to terminate some pods. The first pod that got terminated was one of our Kafka pods. So that one was gone. Not a problem by itself: Kafka can handle it, and we still had two more pods running. But we were still trying to process the same number of messages that we had previously processed with three pods, and now we only had two. That led to an increase in memory usage for the remaining Kafka pods. And, well, then we needed something to terminate again, and we were left with one Kafka pod.

Now the applications that were using Kafka to communicate ran into more and more problems, because they couldn't send out their messages anymore. They had to keep the messages themselves, which also led to increased memory usage, so some of these services started using more memory as well. And of course these Kafka pods weren't gone for good: when Kubernetes terminates a pod, it gets rescheduled right away. But we didn't have many options for rescheduling, because our services that couldn't send their messages were using more memory. On node 1 we couldn't place a Kafka pod anymore; there wasn't even enough memory to schedule it. There was one we could schedule on node 2, but that didn't really help us much, because that was the very pod that had just been terminated on that same node, with the same setup, so it probably wasn't going to last long. And we had another Kafka pod stuck in Pending, because node 1 didn't have any memory left to schedule it.

With these Kafka pods constantly crashing, being rescheduled, and being terminated again, the applications trying to communicate over Kafka ran into more and more problems. And not all of these applications were very good at failure handling, so some of them just crashed, and we had even more applications stuck in a CrashLoopBackOff. I think you get the gist of it: we were in a vicious cycle of pods being rescheduled, pods being terminated, and pods needing more memory than usual, all because of our originally high utilization.

And we learned some things. We learned that overcommitting on memory is not a great idea: in most cases it's probably good to just set the memory request equal to the memory limit. And we also learned that clusters need some room to operate. Utilizing 90-something percent is not a great idea either, simply because memory is an incompressible resource.

Now, I initially said I'd be talking about both memory and CPU, and so far this was all about memory. So let's look at CPU. CPU, unlike memory, is a compressible resource. We can throttle CPU, and so CPU resource management is completely different from memory resource management. The Kubernetes documentation gives us a fair warning about setting limits, or rather about not setting limits: it tells us that if we don't set limits, there might be a container that uses all the resources the node has to offer. And that is true; that can happen. But I came to the conclusion that this is actually not always a bad thing. Let's look at a setup where we have set a CPU request of 500 millicores and a limit of 500 millicores.
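In a manifest, that setup is simply this (a sketch; only the CPU settings matter for this example):

```yaml
# CPU request equal to CPU limit: half a core, no more, no less.
resources:
  requests:
    cpu: "500m"  # guaranteed half a core
  limits:
    cpu: "500m"  # never more than half a core, even if the node is idle
```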
So we always get 500 millicores: in the worst case, but also in the best case. With a single-threaded application that can only use one core, it looks like this: we use 500 millicores of the one core, and the other 500 millicores are either idling or being used by some other application. With a multi-threaded application it looks a bit different, because we're spreading across two cores. We're still only getting 500 millicores, but now it's not just half of one CPU idling half of the time; it might be three quarters of the CPU idling three quarters of the time, if there's just no demand for that extra CPU.

Now, if we remove the limit, we're still left with the same worst case: we're always guaranteed to get our request of 500 millicores. So in the single-threaded case, we get half a CPU. But if there is more CPU available, we might get the entire core; we can just claim more resources. And in a multi-threaded application, we could claim all the CPU. This is exactly what the documentation warns about. But is it really a problem? I think it's not, as long as we are setting requests, because requests actually determine our CPU share.

In this example, we have a request in place of 250 millicores for the yellow pod and 500 millicores for the red pod. So the red pod requests twice as much CPU time as the yellow pod. In the worst case, each of them gets exactly its request and nothing more. But once they start competing for resources and each tries to claim as much as it can get, the share each one gets is determined by the ratio of the requests. The red one is asking for twice as much as the yellow one, so it also gets twice as much when they both try to claim all the resources: in this case, two thirds of the CPU for the red pod and one third for the yellow pod. That's more than they requested, and there's no real issue with one of them being a noisy neighbor.

And removing CPU limits can have drastic effects in favor of response times. Thomas here posted this graph on Twitter, where they removed limits and saw their response times drop from 150 milliseconds to 90 milliseconds, at the 75th percentile. That is a drastic effect.

Now you may say: okay, but when I remove my limits, I'm not in the best quality of service class anymore. So let's look at the concept of quality of service for a second. There are three quality of service classes. If you specify neither a request nor a limit, you're in the BestEffort class. You don't really want to be there; it's not that great. Once you specify a request, you get into the Burstable class, where you're always guaranteed your request, but once you exceed it, you're subject to being throttled or terminated. And there's the Guaranteed class, for when your request is equal to your limit: you're guaranteed exactly that. Now, "guaranteed" sounds great; I want to be in that Guaranteed class. But do I really? The request is already guaranteed in the Burstable class, and if we take the same request in both cases, I'm getting more in the Burstable class. That sounds better to me. So in the case of CPU, I would advise not setting CPU limits, but always setting requests, so you're not suffering from noisy neighbors.
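Putting the memory advice and the CPU advice together, the resources block I would reach for today looks roughly like this; a sketch with placeholder numbers, not a one-size-fits-all recipe:

```yaml
resources:
  requests:
    memory: "512Mi"  # equal to the limit, so no overcommitting on memory
    cpu: "250m"      # always set: this determines our share under contention
  limits:
    memory: "512Mi"  # request == limit for the incompressible resource
    # no CPU limit on purpose: the container may burst into idle CPU
```

Note that this shape puts the pod in the Burstable class rather than Guaranteed, and that's fine: the request is still guaranteed, and on memory, where it matters, request and limit are equal.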
There are two more pitfalls that I want to address. The first: be aware of your actual resources. If you're planning to fairly distribute all of your resources among your apps, and you've ordered, say, a 10 GB node with four CPU cores, you might not get everything you've ordered, because the system daemons use a portion of the available resources. So you should check the node's status.allocatable field (kubectl describe node shows it under "Allocatable") to figure out what resources you actually have available.

The second thing to watch out for are namespace limits. Even if you do not set a limit yourself, there might be a namespace default, set via a LimitRange object. In this example, there is a default namespace CPU limit of 500 millicores, and we have a request in place for this pod of 700 millicores, which is higher than that limit. That doesn't make sense, and this pod is never going to be scheduled. But if you're not aware of the namespace limit, you might not realize why. So that's something to be aware of and watch out for.

Summary: clusters need room to operate; utilizing 90-something or even 80-something percent is usually not a great idea. For the most part, I would advise setting the memory request equal to the memory limit, and I would always think twice about setting CPU limits. Of course, there are exceptions; there are always exceptions to these rules. Here are two I thought of, and there are probably more. Setting CPU limits is a fine thing if you prefer consistent workloads over performant workloads: if you just want that Guaranteed quality of service class and always the same performance, nothing more, then go ahead and set limits. And you may want to overcommit on memory when you don't care about the termination of your pods. If you have workloads that can be picked up again at any time, that can be interrupted at any time, then sure, go ahead and overcommit, because that's obviously cheaper than reserving all the memory.

So, I initially said that I'm just an application developer, and I get it: you may not want to take this advice from me. It sounds a bit weird; it's not really what's in the documentation. But the good thing is, you don't have to. Here's someone you can trust more who is giving the same advice. Tim, one of the creators of Kubernetes, posted this as a response to the tweet we saw earlier, Thomas's tweet, where we saw the response times drop from 150 milliseconds down to 90 milliseconds. That's it. Thank you.