So, hello everyone. I'm Itamar Holder, a senior software engineer at Red Hat and a KubeVirt developer. Today I'm going to talk about the journey of supporting VMs with dedicated CPUs on Kubernetes. The reason I put the word "journey" in the title is that this was a true journey for me: while trying to solve this problem, I stumbled upon many cool technologies and cool facts about how Kubernetes is implemented under the hood. This talk is divided into two parts. The first part is an introduction to many of these technologies and cool facts about them; the second part is the actual problem I'm trying to solve. So we're going to talk about the CPU manager, Kubernetes resource allocation, cgroups, dedicated CPUs, pod isolation, and many other interesting things. So, let's begin.

First of all, an introduction to KubeVirt. Let's say that you need to run both containerized and VM workloads, and I'll talk about use cases in a second. On the right here, you have your Kubernetes stack, which is designed to run containers. But as you probably know, containers and VMs are designed completely differently, and you can't just run VMs on plain Kubernetes. So, traditionally, you would have a different platform to run your VMs on. That's a huge disadvantage, because now you have to gain knowledge in two platforms, maintain them both, implement logging and monitoring for both, and make them communicate with one another, which is a huge burden. We would much prefer to have only one stack to run all of our workloads on. That's basically what KubeVirt is: an extension to Kubernetes that lets you run VM workloads alongside container workloads on Kubernetes in a cloud-native way. And when I say a cloud-native way, I mean that these VM objects behave just like any other Kubernetes objects.

But what are the use cases for running VMs today? There are three main ones. The first is legacy workloads. Let's say you have a big business with many VM-based solutions, and you want to transition to making all your workloads containerized. This transition might take a lot of time, and for big companies that are making it, we want to support these legacy workloads and run everything on the same platform. Another use case is VM-bound workloads: some workloads need to run their own kernel or need their own emulated hardware, so they simply don't fit the containerized model. Another interesting use case, which is pretty new, is a project called CAPK, which stands for Cluster API Provider KubeVirt. It basically lets you bring up KubeVirt VMs, install Kubernetes on top of them, and then you have Kubernetes inside Kubernetes. So that's another cool use case.

I'm not going to dive into the architecture of KubeVirt too much, but the basic idea you need to understand is that the trick is that we're running a VM inside a container. So, inside the container we have a hypervisor running the guest, and this is all wrapped inside a container.

So, why do we care about VMs with dedicated CPUs? It's crucial for certain use cases like real-time VMs or VMs that depend on low latency. And the key point that we need to understand here is that we need to avoid context-switching the guest.
If you depend on very low latency, you don't want your guest to be context-switched out, because when something happens and you need to respond really, really fast, there's the overhead of context-switching the guest back in, and we want to avoid that. Another thing is that dedicated CPUs are widely supported by most regular hypervisors, and we want to bring that into Kubernetes as well.

So, a question: does anybody recognize this or know what this means? Okay, great. And how many of you actually know exactly how it's implemented behind the scenes? Right. Okay, so this is relevant.

First of all, let's talk about what containers are. A container is basically an idea, an abstract concept, which can be implemented in many ways. If you go to the kernel and ask it, "do you know what containers are?", it would say "I don't know", because from the kernel's perspective there is no such thing as a container. There are building blocks, and if you use those building blocks, you can implement a container with them. The three main building blocks for containers are cgroups, SELinux, and namespaces. Very briefly, cgroups are responsible for resource allocation, SELinux for security, and namespaces for isolation.

So, let's dive into cgroups for a second. Basically, cgroups let you split the resources of the node between groups of processes, and the architecture is a tree of resources. In this example, let's say that on our system we have 100 CPUs. We can split them between children, and eventually every process is attached to one cgroup. So, for example, this group of processes would be limited to 10 CPUs. Another thing to know is that in cgroups there are subsystems. This is the CPU subsystem, for example, and we also have subsystems for memory, IO, huge pages, and a bunch of others. And in the Kubernetes model, each container gets one cgroup.

Just a word about cgroups v1 and v2. Cgroup v2 is the new version; it was introduced on March 14th, which is marked just because it happens to be my birthday, but never mind. There is no backward compatibility whatsoever with v1. The basic idea is that v1 was too generic: it didn't restrict you at all, and therefore it was really error-prone. It was very easy to misconfigure, and very hard to debug if you did so. The idea in v2 is that you have many more restrictions; it's less generic, but it's less error-prone. Another thing is that we have a unified hierarchy. Remember when I talked about the subsystems? In v1 there is basically a different hierarchy for every subsystem, and that can also cause a lot of trouble, because the different subsystems are not aware of each other. In v2 we have a unified hierarchy, and in every cgroup we define all of the different subsystems, so they're more aware of each other and, again, less error-prone. Currently both versions are supported, and v2 went GA in Kubernetes 1.25, so it's still relatively new.

Let's talk specifically about the threads model in v2. In v1 there are no restrictions whatsoever on threads; you can do whatever you want with them. And it was pretty nasty, and "nasty" is not my word, it's from the actual official kernel documentation. So, I want to talk about two limitations when it comes to threads in v2. First of all, threads must live under their process's subtree. We can't just take two threads of a process and split them between unrelated cgroups; they have to stay under the process's subtree. The other thing is that if your cgroup is threaded, you can only use threaded controllers, which means you cannot use a lot of the subsystems, and that's a huge limitation.
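Just to make the threaded model concrete, here is a minimal sketch of what moving a single thread into a child cgroup looks like under cgroup v2. This is not code from the talk or from KubeVirt; it assumes the unified hierarchy is mounted at /sys/fs/cgroup, and the path and TID are made up for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Hypothetical paths: the cgroup of some container and a child for its threads.
	parent := "/sys/fs/cgroup/mycontainer"         // a "domain" cgroup owning the process
	child := filepath.Join(parent, "vcpu-threads") // will hold individual threads
	tid := 12345                                   // a thread (TID) of a process living in `parent`

	if err := os.MkdirAll(child, 0o755); err != nil {
		panic(err)
	}

	// In cgroup v2, a child that holds individual threads (not whole processes)
	// must be marked as threaded; its parent becomes the threaded domain root.
	if err := os.WriteFile(filepath.Join(child, "cgroup.type"), []byte("threaded"), 0o644); err != nil {
		panic(err)
	}

	// Only threaded controllers (cpu, cpuset, pids, perf_event) may be enabled
	// in a threaded subtree; enabling something like the memory controller fails.

	// Move a single thread. This only works because `child` sits under the
	// subtree of the thread's own process; moving a thread of a different
	// process (for example one living in a sibling container's cgroup) is rejected.
	if err := os.WriteFile(filepath.Join(child, "cgroup.threads"), []byte(fmt.Sprint(tid)), 0o644); err != nil {
		panic(err)
	}
}
```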
Okay. So, in Kubernetes, all of the values are always absolute. For example, when you're defining a container, you define it with 100 mCPU, which equals 0.1 CPUs, or 1.3 CPUs, whatever, but these are all absolute values. In cgroups, on the other hand, we have relative shares called CPU shares, and as opposed to Kubernetes, they're entirely relative. Let's say we have only two containers, or two processes, running on a node, one with one CPU share and one with two CPU shares. The one with two CPU shares is going to get twice the CPU time. It doesn't matter how many CPUs we have in our system; it's completely relative.

So, how does Kubernetes convert between the absolute values and the relative CPU shares? It says that one CPU is 1,024 shares, just because that's the default. So, if someone requests 200 mCPU, which is basically one-fifth of a CPU, all we need to do is divide 1,024 by 5, and we get approximately 205 shares. But remember that shares are still relative, so there's a nice side effect here: if all of our pods together request only 50% of the CPUs on the system, the other 50% is going to be split proportionally to the containers' requests. So, basically, the request is the minimum amount that's allocated.

Let's talk about Kubernetes QoS for a second. We have three quality-of-service levels in Kubernetes. The first one is Best Effort, which basically means that you don't provide any requests or limits, not for CPU and not for memory. The Guaranteed QoS is kind of the complete opposite: you specify both requests and limits, for both memory and CPU, and they have to be equal to one another. And anything that's not Best Effort or Guaranteed is Burstable. Burstable means you specify only requests, or only limits, or you specify both but they're not equal; every other case. Basically, the idea here is that there's a trade-off between predictability and stability. Kubernetes tells you: if you're going to be predictable with your resource usage, you're going to get more stability. And of course, you have to keep your promises. So, for example, if you limit yourself to a certain amount of memory and you cross that limit, the container is going to die.

So, let's talk about dedicated CPUs in Kubernetes. We have the CPU manager, which is basically responsible for allocating dedicated CPUs in Kubernetes, and there are two requirements for that: the pod has to be of Guaranteed QoS, and the CPU request, which equals the limit, has to be an integer. Another fact is that not all the containers in the pod need to be allocated dedicated CPUs, but the pod as a whole still needs to be Guaranteed.
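Just to make that concrete, here's a minimal sketch of such a container spec using the standard Kubernetes Go API. The names and values are made up, and note that the kubelet also has to run with the static CPU manager policy for the CPUs to actually become exclusive.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// guaranteedContainer returns a container spec that meets both CPU manager
// requirements: Guaranteed QoS (requests equal to limits for CPU and memory)
// and an integer CPU count. The kubelet must also be configured with
// cpuManagerPolicy: static for these CPUs to be handed out exclusively.
func guaranteedContainer() corev1.Container {
	res := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("4"), // integer, so eligible for dedicated CPUs
		corev1.ResourceMemory: resource.MustParse("8Gi"),
	}
	return corev1.Container{
		Name:  "compute", // hypothetical name
		Image: "example/guest:latest",
		Resources: corev1.ResourceRequirements{
			Requests: res,
			Limits:   res, // equal to requests, which makes the pod Guaranteed
		},
	}
}

func main() {
	c := guaranteedContainer()
	fmt.Printf("container %q requests %s CPUs\n", c.Name, c.Resources.Requests.Cpu())
}
```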
Okay. So, let's now talk about namespaces for a second. Namespaces are the part responsible for isolating different groups of processes from one another. So, for example, we have the PID namespace, a mount namespace, a cgroup namespace, and a lot more. So, for example, if from within a container you execute a command like ps to see all the processes, you're going to see only the processes within your namespace. And a cool fact that I didn't know before I dove into this is that we can actually share the PID namespace between the different containers of a pod; that's supported by Kubernetes. And that's pretty cool, because we can have processes from different containers communicating with one another. And as a side effect, the file systems are also reachable if you use that: the trick is going into /proc/<PID>/root, and then you get to the root file system of a process that lives in another container, which is pretty cool and sometimes useful.

Now a word about KVM. Basically, there are two kinds of hypervisors: type 1, which is also called a bare-metal hypervisor, and type 2. With a bare-metal hypervisor, we install the hypervisor straight on the bare metal. With a type 2 hypervisor, we install an OS on top of the bare metal and then the hypervisor on top of the OS. Now, type 1 hypervisors are much, much faster. And KVM is great because it's a kernel module that basically turns Linux into a type 1 hypervisor. So, with KubeVirt and KVM, we can reach near-native performance. Another thing is that KVM is basically responsible for CPU virtualization, which is the performance-critical part; for other stuff, like IO and similar, we're using QEMU. And in the KVM model, each guest CPU is implemented as a thread. So, for example, if you're creating a VM with four CPUs, then from the host's perspective these are four vCPU threads that basically run your guest workload.

So, going back to KubeVirt, to talk a little bit more about the architecture: when I said that we run a VM inside a container, we basically have the virt-launcher pod, which actually runs the guest. And inside, we have different containers, but the main container is called the compute container.

Now, let's talk about the attempts to support dedicated CPUs. The first attempt was pretty simple: we can simply allocate dedicated CPUs to that container. It's possible with the CPU manager, as we talked about before; it needs to be Guaranteed, and the requests and limits need to be equal and an integer. And that's it, we're done, right? Well, not really. Let's dive into that a bit. Inside the compute container, we basically have three layers. The first layer is the KubeVirt management layer: these are the KubeVirt processes that start all of the other processes, monitor them, and also act as a bridge to Kubernetes, for example for Kubernetes logs and things like that, and for communication with other KubeVirt components. The second layer is the virtualization management layer, which basically consists of libvirt, meaning libvirtd, virtlogd, and so on. And the third layer is the emulation layer, which is basically QEMU and the vCPUs, which are the guest itself. So, these are some of our processes and all of their threads. You don't have to understand everything that's going on here; what's important is that here are the vCPU threads I was mentioning before. We have two vCPUs here, and they're just regular threads, as I said. But the problem is that, as you remember from one of the first slides, I said the key point is avoiding preemption, avoiding context switches. And here we have tons of threads. So, we basically took a container, allocated it dedicated CPUs, and now all of these different threads are running on those CPUs.
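Just to illustrate how many threads end up sharing that cpuset, here's a small sketch that lists every thread of a process by its command name, the way you could inspect the QEMU process from inside the compute container. The PID is obviously hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// listThreads prints the TID and command name of every thread of a process.
// Running it against the QEMU PID inside a virt-launcher pod shows the vCPU
// threads next to the IO threads, the VNC worker and the other helper threads,
// all of them competing for the same cpuset in this naive first attempt.
func listThreads(pid int) error {
	taskDir := fmt.Sprintf("/proc/%d/task", pid)
	entries, err := os.ReadDir(taskDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		comm, err := os.ReadFile(filepath.Join(taskDir, e.Name(), "comm"))
		if err != nil {
			continue // the thread may have exited in the meantime
		}
		fmt.Printf("TID %s -> %s\n", e.Name(), strings.TrimSpace(string(comm)))
	}
	return nil
}

func main() {
	// Replace os.Getpid() with the QEMU PID when inspecting a real virt-launcher pod.
	if err := listThreads(os.Getpid()); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```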
So, what happens is that we would have to preempt, or context-switch, the vCPU threads in order to run the other threads. And this becomes even more complicated, because all of these threads have very different priorities, and even sibling threads of the same process have different priorities. If we look at the QEMU process, for example, there are the vCPU threads that I mentioned, which of course have the highest priority. But the VNC worker, for example, doesn't need to run on dedicated CPUs, while the IO threads are a bit more important. So, we have many different priorities between all of these threads. What we did here is basically lie to the guest: these aren't really dedicated CPUs, right? Because we're going to context-switch out the guest all the time.

So, second attempt. A field was introduced in the VMI object, the VirtualMachineInstance object, called isolateEmulatorThread. The basic idea is that if the user asks for X CPUs, we're going to allocate X plus one, so one extra dedicated core to run all of the non-vCPU threads. And basically what we're doing here is using libvirt configuration to pin all of the non-vCPU threads of the QEMU process to this dedicated core. So, again, if you look over here, we have the first vCPU running on the first dedicated CPU, the second one the same, and all of the other QEMU threads are running on a third dedicated CPU. There are two problems here. One is that we waste one dedicated core to achieve this. The other is: what about all of the other threads here? There are a lot of threads that we didn't do anything with at all, so this doesn't really solve the problem. Again, these threads are going to be context-switched onto the dedicated CPUs.

So, the third attempt is the housekeeping approach. The idea is that we create a child cgroup for lower-priority threads, called the housekeeping cgroup. Again, just like before, we allocate one extra core: if the user asks for X CPUs, we allocate X plus one. Then we move all of the non-vCPU threads into this housekeeping cgroup, and the vCPUs run on the dedicated CPUs. So, this is basically how it looks: we have the virt-launcher pod, and inside the compute container, with X plus one dedicated CPUs, we create a child cgroup, the housekeeping cgroup, with one dedicated core, which runs all of the threads except the vCPU threads. And the vCPU threads themselves run on the X dedicated CPUs.

While this approach is a huge step forward, because this is the first time we actually support dedicated CPUs, there are still a lot of problems with it. One problem is that we still waste one dedicated core. Ideally, we would have told Kubernetes: we need X dedicated cores plus, say, 0.2 shared CPUs, because we don't really want all of the low-priority threads to run on a dedicated CPU. But as we said, this is impossible in Kubernetes: if you write something like 3.2 CPUs, you don't fulfill the requirements for dedicated CPUs. Kubernetes goes for an all-or-nothing approach: either all of your CPUs are dedicated, or none of them are. The following two problems are more related to design than to performance, because we're doing something twisted here: we say that we need the vCPUs to run on dedicated cores, and then we go and configure every other thread. It should be reversed; we should configure only what we care about, only the vCPUs, and leave everything else as is. And another problem is that, as I said, this housekeeping cgroup runs threads, so it's a threaded cgroup, and threaded cgroups are exposed to many limitations. Basically, now we can't use many subsystems on almost all of our threads, all of the threads except for the vCPU threads. So, that's another problem.
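Just for reference, this is roughly what carving out that housekeeping child looks like at the cgroup level. It's a sketch, not KubeVirt's actual code, and it assumes a cgroup v1 cpuset hierarchy, where individual threads can still be moved freely via the per-cgroup tasks file. The paths, CPU number and thread IDs are made up.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// makeHousekeeping creates a "housekeeping" child under the container's cpuset
// cgroup, pins it to the one extra core, and moves the given TIDs (everything
// except the vCPU threads) into it. cgroup v1 allows this because any thread
// can be attached anywhere through the per-cgroup "tasks" file.
func makeHousekeeping(containerCgroup string, extraCPU int, nonVCPUTids []int) error {
	hk := filepath.Join(containerCgroup, "housekeeping")
	if err := os.MkdirAll(hk, 0o755); err != nil {
		return err
	}
	// A v1 cpuset child must have cpus and mems set before tasks can join it.
	if err := os.WriteFile(filepath.Join(hk, "cpuset.cpus"), []byte(fmt.Sprint(extraCPU)), 0o644); err != nil {
		return err
	}
	if err := os.WriteFile(filepath.Join(hk, "cpuset.mems"), []byte("0"), 0o644); err != nil {
		return err
	}
	for _, tid := range nonVCPUTids {
		if err := os.WriteFile(filepath.Join(hk, "tasks"), []byte(fmt.Sprint(tid)), 0o644); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Hypothetical container cgroup path, extra core and thread IDs.
	_ = makeHousekeeping("/sys/fs/cgroup/cpuset/kubepods/pod123/container456", 7, []int{101, 102, 103})
}
```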
So, the fourth attempt is the emulator container approach. The idea is that the compute container stays as usual, and when I say as usual, I mean it is not allocated dedicated CPUs at all. It still needs to be of Guaranteed QoS, but no dedicated cores. Instead, we create another blank container with X dedicated CPUs. This creates a new cgroup for us, because in Kubernetes, every container gets a new cgroup. And when I say a blank container, by the way, I mean something like one process that sleeps forever. Then, we can move only the vCPU threads into this cgroup. So, let's see how it looks. We have the virt-launcher pod, we have the compute container with Y shared CPUs, and they're shared CPUs and not dedicated, and we have the emulator container with X dedicated CPUs. And what we can do is simply move the vCPU threads into the emulator container. Now, only the vCPU threads are running on X dedicated cores; everything else is running on Y shared CPUs.

There are great advantages to this approach. Basically, we solve all of the problems from before. Only the relevant threads are being configured. The compute container and all of the threads inside it stay exactly the same. The housekeeping tasks are now running on shared CPUs. We avoid allocating one extra dedicated core, which is an expensive resource. And we keep things open for extension in the future, in the sense that we don't have the threaded-cgroup limitations for all of our threads, only for the vCPU threads.

But, first question: how can we even move threads into another container? That sounds completely weird, right? Well, we didn't really move them into another container; we only moved them into another cgroup, which is different. And remember what I said: from the kernel's perspective, there isn't such a thing as a container. We really only changed cgroups. And should we share the PID namespace? That's what I thought originally, but the answer is no, because we didn't change the namespaces at all. So, the vCPU threads that are now running in a different container still share the PID namespace with all of the threads and processes of the compute container.

But this doesn't work with cgroups v2. With cgroups v1, it's entirely possible. With v2, it's a problem because of the threaded model that I was mentioning: all of the threads need to live under their process's subtree. But the QEMU process is in the compute container, and we're moving some of its threads into a sibling container, and that's illegal with cgroup v2. So, that forces us to move the whole QEMU process, with all of its threads, into another cgroup. But this is not such a huge problem, and maybe it's even an opportunity.
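Concretely, the difference comes down to which cgroup interface file you write to. Here is a small sketch with made-up paths and IDs, assuming cgroup v2 is mounted at /sys/fs/cgroup; it isn't production code, it just shows the operation that v2 refuses and the one it accepts.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	emulatorCgroup := "/sys/fs/cgroup/kubepods/pod123/emulator" // sibling container's cgroup (hypothetical)
	vcpuTID := 2001 // a vCPU thread of the QEMU process, which lives in the compute container's cgroup
	qemuPID := 2000 // the QEMU process itself

	// The fourth attempt's operation: move a single vCPU thread into the sibling
	// container's cgroup. Under cgroup v1 this works (via the tasks file); under
	// v2 this write is rejected, because a thread may only move within its own
	// process's threaded subtree, not into a sibling cgroup.
	err := os.WriteFile(filepath.Join(emulatorCgroup, "cgroup.threads"),
		[]byte(fmt.Sprint(vcpuTID)), 0o644)
	fmt.Println("moving a single vCPU thread:", err) // fails on cgroup v2

	// What v2 does allow: migrating the whole QEMU process, with all of its
	// threads, by writing its PID to cgroup.procs of the target cgroup.
	err = os.WriteFile(filepath.Join(emulatorCgroup, "cgroup.procs"),
		[]byte(fmt.Sprint(qemuPID)), 0o644)
	fmt.Println("moving the whole QEMU process:", err)
}
```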
So, what we can do is the following. Just as before, we would have another container. Now, it turns out that the cgroup of the pod itself is owned by Kubernetes; we cannot change it at all. But we can mess around with the cgroups of our containers; that's allowed in Kubernetes. So, what we can do is the following trick. We edit the cpuset of this container to contain both the X dedicated CPUs and the Y shared CPUs. Again, you cannot ask for that in Kubernetes, but it is possible in cgroups. And then what we do is move the QEMU process with all of its threads, so that only the vCPU threads run on the X dedicated CPUs, and all of the other threads run on the Y shared CPUs. Now, in this container cgroup itself nothing really runs, because all of the threads are split between one of the two children. So, it's basically just a cgroup hierarchy. And this is how it looks.

But let's look at it again, because, as I said earlier, there are two layers of management and a third layer of emulation. Now, in essence, the management layers are really different from the emulation layer, and since they're different, we could treat them differently: different permissions, different resources, different definitions for the management layers and for the emulation layer. So, if we look at it again, what we did basically reflects our model a lot better, because now we have the compute container for the management layers and the emulator container for the emulation layer, and we can configure the two containers differently according to our needs. And this also opens the door for further extensions in the future, because now we don't have those limitations hanging over all of our threads, and we can extend this hierarchy even more. Let's say, for example, that we want to limit IO for the vCPUs in certain scenarios, or limit memory for the management layer; we can extend this cgroup hierarchy in the future. So, that leaves the door open for many extensions.
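Putting it together, the final hierarchy can be built with a handful of cgroup v2 writes, roughly like the sketch below. This is an illustration under stated assumptions, not the actual KubeVirt implementation; the paths, CPU lists and thread IDs are invented, and it assumes the whole QEMU process has already been migrated into this container's cgroup.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// splitEmulatorCgroup sketches the final hierarchy: the container's cgroup
// spans both the dedicated and the shared CPUs, and two threaded children
// split the QEMU threads, vCPUs onto the dedicated CPUs and everything else
// onto the shared ones, so nothing is left running in the parent itself.
func splitEmulatorCgroup(containerCgroup, dedicated, shared string, vcpuTIDs, otherTIDs []int) error {
	write := func(rel, val string) error {
		return os.WriteFile(filepath.Join(containerCgroup, rel), []byte(val), 0o644)
	}

	// The container cgroup gets the union of both CPU sets; cgroups allow this
	// even though Kubernetes itself would never hand out "X.2 dedicated CPUs".
	if err := write("cpuset.cpus", shared+","+dedicated); err != nil {
		return err
	}

	// Create the two children and mark them threaded so they may hold individual
	// threads of the QEMU process (the container cgroup becomes the threaded
	// domain root). The children are made threaded before enabling controllers.
	for _, child := range []string{"vcpu", "housekeeping"} {
		if err := os.MkdirAll(filepath.Join(containerCgroup, child), 0o755); err != nil {
			return err
		}
		if err := write(filepath.Join(child, "cgroup.type"), "threaded"); err != nil {
			return err
		}
	}
	if err := write("cgroup.subtree_control", "+cpuset"); err != nil {
		return err
	}
	if err := write("vcpu/cpuset.cpus", dedicated); err != nil {
		return err
	}
	if err := write("housekeeping/cpuset.cpus", shared); err != nil {
		return err
	}

	// Split the threads: vCPU threads onto the dedicated CPUs, every other QEMU
	// thread onto the shared ones.
	for _, tid := range vcpuTIDs {
		if err := write("vcpu/cgroup.threads", fmt.Sprint(tid)); err != nil {
			return err
		}
	}
	for _, tid := range otherTIDs {
		if err := write("housekeeping/cgroup.threads", fmt.Sprint(tid)); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	_ = splitEmulatorCgroup("/sys/fs/cgroup/kubepods/pod123/emulator",
		"4-7", "0-1", []int{2001, 2002, 2003, 2004}, []int{2000, 2010})
}
```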
So, summary and takeaways. There were a lot of introductions here, and we've seen a lot of cool technologies: CPU allocation, cgroups, dedicated CPUs, namespaces, KVM, KubeVirt. And again, my hope is that beyond the specific problem and solution that I was trying to present, you'll take some of these cool facts and technologies and use them on your own journeys, in whatever you're interested in. Yeah, and that's it. Thank you very much. So, are there any questions? Yes. Yeah.

So, the question was, how much of this is a hack? If I understand you correctly, what you want to ask is: how do we know that Kubernetes won't be surprised by these changes, right? So, to be honest, this is still work in progress, and we still have to test it in a bunch of scenarios to be sure. But from whatever I've tested, it was completely fine. And from what I understand, Kubernetes owns the pod cgroup, but it doesn't care about anything beneath that; it only configures the container cgroups while the containers are being created, and after they've been created, it doesn't touch them at all. So, I don't see a reason right now why Kubernetes would be surprised. But again, this is work in progress, and I might be wrong here. Yeah.

Sorry, I didn't hear you. CPU pinning with different sizes of cores? You're referring to NUMA, right? Okay. So, the question is, how does NUMA fit into all of this? And yeah, I'll be honest that this is still under investigation. Again, we don't see a reason why it wouldn't be supported, and KubeVirt already supports NUMA. So, I don't think there's a problem in this area, but again, it's work in progress, and I don't want to guarantee anything. Any other questions? Okay. Thank you very much.