Hello, everyone. I'm Swati Sehgal, a Principal Software Engineer working for Red Hat. And I'm Marlow Weston, a cloud native architect working for Intel. We're presenting This Is The Way, a crash course in the intricacies of managing CPUs in Kubernetes. For scope, we will cover CPU management requirements only, but we'll also reference other projects, because nothing exists in a vacuum and your CPUs are just part of the whole picture. In the beginning, systems were simple. Here's an idea of why they were simple: nodes had only single sockets. There was a NIC, there was some memory, there was a switch, and that was the world, because you were really only running microservices at the beginning with Kubernetes. The kubelet was designed for simple at first as well. Here's what the early kubelet looked like: it has none of the managers that we have today, and it's just a simple passthrough between your pod and the resources on the node. So I want to take you back in time to early 2017, pre-Kubernetes 1.8, and show you what the landscape looked like from a resource point of view. At this point, CPU and memory were the only two native resources supported. Requests and limits, which you see highlighted here, are the mechanisms Kubernetes uses to control these resources. Requests are what a container asks for and is guaranteed to get; the kube-scheduler uses them to identify a suitable node that can fulfill the resource requirement. Limits, on the other hand, make sure a container never goes above a certain value; the kubelet and the container runtime use them to enforce limits on resource consumption. CFS quota is what is used to enforce CPU limits on containers. Particularly from a CPU perspective, all the pods and containers running on a compute node can execute on any of the available cores on that node. This level of resource control is not sufficient when pods are running CPU-intensive workloads and contending for the CPU resources available on that particular node.
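As a concrete sketch of the requests/limits mechanism being described (all names, images, and values here are made up for illustration, not taken from the talk's slides), a PodSpec of that era looked much like it does today:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod                          # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    resources:
      requests:          # guaranteed; used by kube-scheduler for placement
        cpu: "500m"
        memory: "256Mi"
      limits:            # enforced by kubelet/runtime; CPU via CFS quota
        cpu: "1"
        memory: "512Mi"
```

With requests below limits like this, the pod falls into the Burstable QoS class, and its containers can run on any available core on the node.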
There can be scenarios where workloads get moved to different CPUs, which can impact performance if the workload is sensitive to context switches. One interesting thing to note here is that the PodSpec looks very similar to what it looks like today, but there are some nuances that you need to know to be able to tap into some of the newer, advanced features that we'll cover in the subsequent slides. So around this time, there was a lot of discussion going on in the community about enhancing Kubernetes to support diverse and increasingly complex classes of applications: discussions about how to give Kubernetes the ability to run workloads that require specialized hardware, or that perform better when hardware topology is taken into consideration. Because of the momentum in the community in this area, the Resource Management Working Group was formed, which included representatives from Red Hat, Google, Intel, NVIDIA, Microsoft, and many more. The community identified wants in the compute resource management space, which was followed by kubelet extension strategies to enable performance-sensitive workloads and improve resource isolation. Support for huge pages, exclusive allocation of CPUs, specialized hardware support, and NUMA alignment of resources became focal points of the work that followed. As a result, in Kubernetes 1.8, huge pages became a native resource, the CPU manager was introduced to support container-level CPU affinity, and the device plugin API was introduced as a vendor-independent framework to expose and consume devices. So this is how the kubelet looked from a resource management perspective after these features were introduced in the 1.8 release. "If you give a mouse a cookie, he's going to ask for a glass of milk" — Laura Numeroff. This is a reference to a children's book, but Elana Hashman used it in a SIG Node meeting in reference to users. So we've covered the very basic, very simple use case; now let's start looking at more complex ones.
So let's talk about high performance use cases, where you have performance-sensitive workloads. Here, at a very high level, is a high performance AI/ML cluster. You can tell it's AI/ML because it has the XPUs, but you could also use this for high performance compute. You no longer have a single socket: you now have memory on both sides, you have XPUs on both sides, you have dual NICs, you have leaf switches, you have spine switches. You care about the location of your pods now: you generally want them on the same leaf switch, and you care about whether or not your cores are pinned. More specifically, CPU manager is a beta feature that is useful for CPU-intensive workloads that are sensitive to CPU throttling and context switches, and that require hyperthreads from the same physical core. CPU manager gives us the ability to allocate CPUs exclusively, but in order to achieve this, you have to make sure the pod belongs to the Guaranteed quality of service class and has a positive integral CPU request. Internally, CPU manager maintains logical sets of CPUs, also referred to as pools. We have exclusive, shared, reserved, and assignable pools; the two most important ones are exclusive and shared. The exclusive pool corresponds to the CPU sets assigned exclusively to containers. Shared is the one where Burstable, BestEffort, and non-integral Guaranteed containers run. A kubelet flag can be used to specify the CPU manager policy. The default policy is none, which basically keeps the existing default CPU affinity scheme, providing no affinity beyond what the OS scheduler automatically gives us. Even though CFS quota is used to enforce pod CPU limits, the workload can move between different CPU cores depending on the load on the pod and the available capacity on the worker node. When enabled with the static policy, CPU manager allows us to allocate exclusive CPUs, and no other containers can be scheduled on those CPUs.
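To illustrate how the static policy gets turned on, here is a minimal kubelet configuration sketch. The fields are real KubeletConfiguration fields, but the reserved CPU list is illustrative — which CPUs you reserve depends on your node:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: "static"    # default is "none"
# CPUs kept out of the exclusive pool for system and Kubernetes
# daemons; some reservation (here, or via systemReserved /
# kubeReserved) is required when the static policy is enabled.
reservedSystemCPUs: "0,1"
```

Containers in the shared pool (Burstable, BestEffort, and non-integral Guaranteed) run on whatever CPUs are left over after exclusive assignments are carved out.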
Please note that this takes place at the container level; there is no way for CPUs to be allocated at the pod level. In Kubernetes 1.22, an additional configuration option called CPU manager policy options was introduced, which allows us to fine-tune the behavior of the static policy. This feature graduated to beta in Kubernetes 1.23, but certain feature gates might have to be enabled based on the maturity level of the policy option itself. The first option is full-pcpus-only. This is a beta option and is visible by default. It allows us to make the behavior of latency-sensitive applications more predictable when running on systems where hyperthreading is enabled. We might want to guarantee that no physical core is shared among different containers, to meet an application's latency and performance goals. With this option enabled, the pod will be admitted by the kubelet only if the CPU request of all its containers can be fulfilled by allocating full physical cores. If the pod does not pass this admission check, it is rejected with an SMT alignment error. The second policy option is distribute-cpus-across-numa. This is an alpha option and it's hidden by default. CPU manager by default was designed to pack CPUs onto one NUMA node until it is filled, with any remaining CPUs simply spilling over to the next NUMA node. In parallel code, this can lead to undesired behavior, as the slowest worker can potentially act as a bottleneck. With this option enabled, application developers can ensure that the workers run equally well across all the NUMA nodes, improving the overall performance of these kinds of applications. So now let's look at the pod spec again. We have a Guaranteed pod here, which is evident from the fact that the resource request is equal to the resource limit and the CPU request is integral. When the node is configured with the CPU manager static policy, we have integral CPUs and hence the container is allocated exclusive CPUs.
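A Guaranteed pod of the kind being described might look like this sketch (names and image are placeholders, not from the talk). Requests equal limits and the CPU count is integral, so under the static policy the container gets exclusive CPUs:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-pod                              # hypothetical name
spec:
  containers:
  - name: worker
    image: registry.example.com/worker:latest   # placeholder image
    resources:
      requests:
        cpu: "2"            # integral CPU count -> eligible for pinning
        memory: "1Gi"
      limits:               # equal to requests -> Guaranteed QoS class
        cpu: "2"
        memory: "1Gi"
```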
If, in addition to that, we have the CPU manager policy option configured as full-pcpus-only, the pod would be rejected with an SMT alignment error if its request would result in partial allocation of a physical core. So now we'll go into the NUMA zone — not for the weak of heart. Remember, we talked about memory, CPU, XPU, and NIC locations. If you have the case where the CPU is in a different location from the memory and the XPU, the UPI bus basically acts as a troll and collects a toll every time you try to cross it. This is not ideal. Instead, you want the case where you're not having to pay the UPI bus toll, with your memory, your CPU, and your XPU all located in the same place. So around the 2019 timeframe, in the 1.16 release, the topology manager was introduced. This is the kubelet component that coordinates the topology of resource allocation for CPUs and devices and helps extract the best performance out of the underlying hardware. It aligns resources for pods of all quality of service classes, and we have the ability to configure policies and scope for different resource assignment requirements. Scope can be defined at a pod level or a container level. In addition to that, we have node-level policies, which can be configured as none, best-effort, restricted, or single-numa-node, depending on how strict you want that alignment of resources to be. In scenarios where topology manager is unable to align the topology of requested resources based on the configured policy, the pod is rejected with a topology affinity error. With topology manager, we were able to align CPUs and devices, but guarantees around memory did not exist for a while, and that was a major gap. This was addressed in Kubernetes 1.21 with the introduction of the memory manager, which became a beta feature in 1.22.
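The policies and scopes just mentioned are also set through kubelet configuration. A minimal sketch, using real KubeletConfiguration field names with illustrative values:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: "static"
# How strictly resources must be NUMA-aligned:
# none | best-effort | restricted | single-numa-node
topologyManagerPolicy: "single-numa-node"
# Whether alignment is computed per container or for the whole pod:
topologyManagerScope: "pod"     # "container" is the default
```

With restricted or single-numa-node, a pod whose resources cannot be aligned fails admission with a topology affinity error rather than running misaligned.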
One of the known issues topology manager has is that, despite being able to align the resources, the Kubernetes default scheduler lacks knowledge of node hardware topology, because of which it can schedule a pod onto a node where the pod then fails with a topology affinity error. If the pod is part of a Deployment or ReplicaSet, this results in runaway pod creation, because the subsequent pods that are created end up with the same error as well. So topology manager comes into the picture when the pod is being admitted on the node. It works with the other resource managers, which are also called hint providers, as they provide hints. These hints are used by topology manager, along with the configured policy, to align resources and to reject the pod if it cannot be admitted. So in this diagram we have topology manager doing its magic at admission time, where it asks the hint providers for hints. Once it gets those hints, it runs its alignment logic and determines a suitable NUMA node on which the resources can be allocated. Once that decision has been made, it sends an allocate call to the hint providers to perform the resource allocation, following which the containers are created. So we've covered topology-aware scheduling and pinning for the pods, but there are still quite a few gaps for high performance compute optimization. Cores are only pinned per container; they cannot be pinned per pod right now. This causes issues because if you have a pod and you want to share cores between your containers, you can't. Affinity is node level only; you cannot choose affinity of resources. So, for instance, if you have a stack of XPUs and you want to use just those, and you're not worried about the CPU affinity, you can't do that either. You can't mix pinned with shared cores: you can only do fully pinned cores, and if you add shared ones, they have to be pinned right now. And there are limits with larger numbers of NUMA zones: right now you have an eight-zone limit for the NUMA zones.
And you cannot schedule one pod or container per NUMA zone, so I can't do a nice, pretty split across all of the zones to make sure that, from a scheduling standpoint, everything sits in one zone each, with workloads not interfering with each other. Heterogeneous clusters: fun for the whole family. So this is our very generic set of use cases, but it's real. In many people's data centers, there are different types of nodes, because you upgrade, you buy things for particular purposes, et cetera. So in this case, we have a multi-CPU node. We have another one with weird CPUs, which are different from the other CPUs; in some cases, some architectures actually have different types of cores on the same CPU. You have multiple NICs in some, a single NIC in another. And on this last one, you have an accelerator fabric with XPUs combined on top of that. So here's a workload view of how you might use a data center like this: you have a set of microservices on the first one, you have maybe mixed workloads on that middle one, depending on what your CPUs are, and then you have AI workloads on that last one. The current gaps for heterogeneous clusters include all of the ones that we have for high performance compute clusters, plus at least two more. You can't choose what type of CPUs: in the case where you have — remember that one that said weird CPUs in the middle and the regular CPUs at the beginning — you can't choose which CPU you want. And you can't ask for more CPUs on the fly. There's actually a proposal being discussed in the community, spanning across SIG Node and SIG Scheduling, that aims at allowing pod resource requests and limits to be updated in place, without the need to restart the pod or its containers. This feature is being targeted to be merged in Kubernetes 1.25, and once that particular feature is merged, this gap should be addressed.
So yeah, we have a long way to go, and we welcome feedback and all the help we can get from the community. We now have a bunch of requirements and gaps when it comes to native resource management capabilities. Is there a middle-ground model? Do we maybe have options out there that can be used in the interim? Currently there are three different options that I know of — well, three and then some. There's CRI-RM, the CRI Resource Manager. This is put out by Intel, but it is open source and they do take contributions. It's a pluggable add-on for controlling resource assignments to containers. It plugs in between the kubelet and the container runtime, keeps track of the state of all the containers on a node, and intercepts the CRI protocol requests from the kubelet. However, do note that all of these solutions require turning off resource management in the kubelet so they don't fight. There's CPU-Pooler, a project put out by Nokia. This is a solution for Kubernetes that manages predefined, distinct CPU pools on Kubernetes nodes. It physically separates the CPU resources of the containers connecting to the various pools, and it has a device plugin that exposes the CPU cores as consumable devices. And then there's also CMK, the CPU Manager for Kubernetes. It was used in the past and is currently deprecated, partially because isolcpus is getting deprecated in the kernel. It accomplished core isolation by controlling the logical CPUs each container may use for execution, and it wrapped this in a cmk command-line program for managing CPU pools and constraining those workloads. So, like the projects that Marlow just mentioned, to some extent you can take resource management capabilities into your own hands. You could go essentially all the way in and come up with a completely customized solution that perfectly suits your needs and makes things do whatever you would like. But this comes with a massive disclaimer: do it at your own risk.
If you are someone who's already doing something like this, as part of the community we would love to hear what prompted you to take that path, so that it can be taken into consideration in the native solution. So, how can you get involved? If you're new to all this, feeling a bit overwhelmed, and not sure where to get started or how to get involved, we have you covered. There's a lot of community discussion on how to address all the gaps, because there are a lot of gaps right now. We managed to figure out what those gaps are, and what is unaddressed, with the CPU management kubelet use case doc. You're welcome to go there, add to it, comment, and engage with the authors. We began this last December, and it really did expose a lot of the issues that we currently have. There's also the kubelet resource plugin RFC. This is a suggestion to change the kubelet resource model by splitting the kubelet into a control plane — which is what you do on the node, all the resource management on the node — and a data plane, which advertises to the scheduler. We haven't really discussed this extensively here, but only half of the story happens after the pod gets to the node; the first part is: where do you schedule your workload, on which nodes? So you're welcome to go to that doc and add in commentary. We seem to have quite a bit of community support, so I'm pretty excited about that. There's also NRI, the Node Resource Interface, which is a CNI-style interface for applying resource policies to containers on a node. This is coming out of TAG Runtime. One thing that's pretty exciting about this is that it would work well with the kubelet resource plugin RFC, though currently they're still trying to find ways to get all the resources they need into NRI. And then there's dynamic resource allocation, which is an alternative to the device plugin API.
The primary idea there is that resource allocations can be ephemeral or persistent, and it allows users to specify resources with specific parameters. So, Kubernetes has various operational areas that are organized as SIGs and working groups. We would highly recommend that you keep an eye on SIG Node and SIG Scheduling. SIG Node is responsible for the lifecycle of pods that are scheduled to a node, and its scope includes the kubelet, the pod and node APIs, container runtimes, the various resource managers, and hardware discovery. This is in no way a comprehensive list of what SIG Node does, but it should give you a good idea of what to expect. SIG Scheduling, on the other hand, is responsible for the components that make pod placement decisions. So the Kubernetes scheduler and scheduling features, and frameworks such as the scheduler framework that allows you to create custom plugins, all fall in the realm of SIG Scheduling. Working groups are groups formed to solve specific problems, and they typically span multiple SIGs. Topology-aware scheduling is a project that spans across SIG Node and SIG Scheduling. It is not an official Kubernetes working group, but a bunch of people from different organizations who have come together to try to solve the problem of the scheduler being NUMA-unaware. The GitHub organization linked here will provide you all the documentation, artifacts, and components you need to get started on this. The next one is CNCF TAG Runtime, and the goal of this group is to enable the widespread and successful execution of a wide range of workloads, such as latency-sensitive workloads and batch workloads. Within TAG Runtime itself, they have a Container Orchestrated Device working group, which is focused on container-runtime-related problems, where device vendors and container runtime maintainers are coming together to solve support for devices in the cloud native space.
And lastly, we have the Batch Working Group. This is a very recently formed working group, and they're focusing on enhancing support for batch workloads, such as HPC and AI/ML, natively in Kubernetes. All the SIGs and working groups that I mentioned usually have regular meetings, and we have linked the GitHub pages and Slack channels for you to get all the information that you need. This essentially concludes the presentation. Thank you very much for attending. We are happy to answer any questions after the talk, or if you'd like to reach out to us on Slack or email, we'd be happy to answer there as well. Thank you very much. Thank you.