Hello everyone. Thanks for taking the time to attend this talk. My name is Swati Sehgal and I'm a senior software engineer on the Telco 5G Compute team at Red Hat. My team and I have been working on enhancing Kubernetes and OpenShift to deliver leading-edge solutions and innovative enhancements across the stack. Our goal is to enable customers and partners to run high-throughput, latency-sensitive cloud-native networking functions on OpenShift. I've been working with engineers and stakeholders from Red Hat, Huawei, Nokia, Samsung and Intel with the goal of enabling topology-aware scheduling in Kubernetes, and today I'm going to talk about the work we've done on this project so far.

Today's agenda includes hardware topology: I'll explain the term NUMA and why topology alignment is needed, and how topology alignment can be achieved in Kubernetes. We'll discuss the topology unawareness of the Kubernetes default scheduler, try to understand what leads to it, and look at how the default scheduler works. I'll also explain our proposal for enabling topology-aware scheduling, the key components we proposed, and the end-to-end working solution. I'll then cover the current status and the use cases, and wrap up with a few pointers for future reference.

So let's look into the first item, hardware topology. What is NUMA and why is NUMA alignment important? NUMA stands for Non-Uniform Memory Access. It is a technology available on multi-CPU systems that allows different CPUs to access different parts of memory at different speeds. Any memory directly connected to a CPU is considered local to that CPU and can be accessed very fast, as opposed to memory which is not directly connected to a CPU, which is considered non-local. Now, on modern systems, the idea of having local versus non-local memory can be extended to peripheral devices such as NICs or GPUs.
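To make the local-versus-non-local idea concrete, here is a toy model of a two-node NUMA system. The CPU-to-node mapping and the relative cost numbers are invented purely for illustration; real access latencies depend on the hardware.

```python
# Toy model of a two-node NUMA system: each CPU has a "home" NUMA node,
# and accessing memory on a remote node costs more than local access.
# Node IDs and costs are illustrative, not taken from real hardware.

CPU_TO_NUMA = {0: 0, 1: 0, 2: 0, 3: 0,   # cores 0-3 -> NUMA node 0
               4: 1, 5: 1, 6: 1, 7: 1}   # cores 4-7 -> NUMA node 1

LOCAL_COST, REMOTE_COST = 10, 20  # arbitrary relative latencies

def access_cost(cpu, memory_numa_node):
    """Return the relative cost for `cpu` to access memory on a NUMA node."""
    return LOCAL_COST if CPU_TO_NUMA[cpu] == memory_numa_node else REMOTE_COST

print(access_cost(0, 0))  # core 0 touching node-0 memory: local, cheap
print(access_cost(0, 1))  # core 0 touching node-1 memory: remote, pricier
```

The same mapping idea extends to devices: a NIC hanging off a NUMA node's PCI bus is "local" to that node's CPUs in exactly the same sense.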
All memory on a NUMA system is divided into a set of NUMA nodes, with each NUMA node representing the local memory for a set of CPUs or devices. For example, in the figure here, CPU cores 0 to 3 and devices connected to PCI bus 0 would be part of NUMA node 0, whereas CPU cores 4 to 7 and devices connected to PCI bus 1 are part of NUMA node 1. In this example we show a one-to-one mapping of NUMA node to socket. This is not necessarily the case: there can be multiple sockets on a single NUMA node, or individual CPUs or devices of a single socket may be connected to different NUMA nodes.

Now, let's move on to why NUMA alignment is important. For performance-sensitive applications in the fields of Telco 5G, machine learning, AI and data analytics, CPUs and devices should be allocated such that they have access to the same local memory. Another example is DPDK-based networking applications, which require resources from the same NUMA node for optimum performance.

So the next question is what NUMA alignment means in the Kubernetes context, and how we achieve NUMA alignment in Kubernetes. To illustrate aligned and non-aligned resource allocation, let's consider a simple scenario. Here we have a system with two NUMA cells and a workload requesting one SR-IOV virtual function and two CPU cores, with the resources aligned differently in the two scenarios. The first diagram shows the case where all the resources are allocated from the same NUMA node, which leads to optimum performance, whereas the second scenario shows resources not allocated from the same NUMA node, which can lead to underperformance.

At a node level, Topology Manager, which is a Kubelet component, coordinates the topology of the resources that are allocated. This includes the resource allocation of CPUs and devices. Topology Manager has flexible policies, and you can define the scope for resource alignment.
It orchestrates CPU Manager, Device Manager and the upcoming Memory Manager, and it allows workloads to run in an environment which is optimized for low latency. It has a set of node-level policies, for example best-effort, restricted and single-numa-node, and a scope to define whether you want resource alignment at a pod level or a container level. It orchestrates the resource managers I mentioned, like CPU Manager and Device Manager, by gathering hints from them and using those along with the policies to align resources.

So, now that we know that Topology Manager takes care of resource alignment, you might ask: what's the gap here? The topology unawareness of the Kubernetes default scheduler. The challenge is that the default scheduler is topology unaware. Despite the introduction of Topology Manager enabling topology alignment of requested resources, the scheduler's lack of knowledge of resource topology can lead to unpredictable application performance: in general underperformance, and in the worst case a complete mismatch between resource requests and Kubelet policies, basically scheduling a pod where it's destined to fail, potentially entering a topology affinity error failure loop.

Let's try to understand this with the help of an example. We have two worker nodes here, worker A and worker B, with 40 CPUs and eight devices split equally across NUMA nodes, meaning that there are 20 CPUs and four devices on each NUMA node of each worker node. In this case, both worker nodes have been configured with the single-numa-node Topology Manager policy, which essentially means that all the resources should be allocated from the same NUMA node.
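Before going through the example, here is a rough sketch of the hint-gathering and policy logic just described. It's a toy version: the real Kubelet works with bitmasks and "preferred" flags, and the sets of NUMA node IDs below are invented for illustration.

```python
# Hedged sketch of Topology Manager-style admission. Each resource
# manager (CPU Manager, Device Manager, ...) reports a hint: the set of
# NUMA nodes from which it could satisfy the request. Topology Manager
# merges the hints and applies its configured policy.

def merge_hints(hints):
    """Intersect per-manager NUMA-node hints into one combined hint."""
    merged = hints[0]
    for h in hints[1:]:
        merged = merged & h
    return merged

def admit(hints, policy):
    merged = merge_hints(hints)
    if policy == "single-numa-node":
        # Admit only if everything can come from exactly one NUMA node.
        return len(merged) == 1
    if policy == "restricted":
        # Admit only if there is some common set of NUMA nodes.
        return len(merged) > 0
    return True  # best-effort: prefer alignment but always admit

cpu_hint = {0, 1}   # CPU Manager: enough free cores on node 0 or node 1
device_hint = {1}   # Device Manager: a free device only on node 1
print(admit([cpu_hint, device_hint], "single-numa-node"))  # everything fits on node 1
```

The key point for what follows is that this admission check runs on the Kubelet, after the scheduler has already picked a node.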
Here we show a scenario where workloads are running on the nodes and the resources consumed by them are distributed differently, but the accumulated resource consumption, and hence the allocatable resources on each node, are the same. So when an application requesting four devices and four CPUs needs to be placed, the scheduler finds worker B a perfectly fit candidate for scheduling the pod. However, as you'll see, Topology Manager, which has been configured with the single-numa-node policy, will not be able to align the resources on a single NUMA node, and the pod will end up in a topology affinity error.

So let's talk about how the default scheduler works, with the help of this diagram. On the left we have a pod which is requesting resources. The resource requests, as part of the pod spec, go to the API server. The scheduler gets the node objects corresponding to the nodes that are part of the cluster. As the scheduler is essentially a controller, it looks for pods that haven't been assigned to a node. It then runs its filtering and scoring algorithms to identify a suitable node where the pod should be placed. Once a suitable node has been identified, it updates the pod object and captures the node name where it should run. The Kubelet of the chosen node then starts provisioning resources. For example, if node one has been selected, the Kubelet on that node kicks in and starts looking into resource allocation. Topology Manager, which as I mentioned is a Kubelet component, starts its hint calculation to identify which resources it can allocate such that all the resources are aligned. There are two possibilities. In the first case, Topology Manager is able to admit the pod and the pod is up and running.
In the other case, Topology Manager might be unable to admit the pod and rejects it, which results in a topology affinity error. If the pod is part of a Deployment or ReplicaSet, this results in runaway pod creation, because the subsequent pods that are created again end up in a topology affinity error. So, in order to optimize the cluster-wide performance of workloads and resource utilization, and enhance the overall performance of the system as a whole, the default scheduler needs to be enhanced to increase the likelihood of a pod landing on a node where it will fit.

So let's talk about the proposed solution: how do we enable topology-aware scheduling capability in Kubernetes? The key components of our proposal are the Pod Resources API, Node Feature Discovery, the topology-aware scheduler plugin and the NodeResourceTopology API. Let's deep-dive into each of these items to understand them better.

First, the Pod Resources API. The Pod Resources API is a Kubelet endpoint for pod resource assignment. It was enhanced to add support for CPU and device topology, and an additional endpoint was introduced to enable watch support and to obtain allocatable resource information.

The second part is Node Feature Discovery. We started working on a component called Resource Topology Exporter with the goal of exposing resource topology information through CRDs. The Kubernetes SIG recommended that we try to consolidate this work into Node Feature Discovery. Node Feature Discovery was, and still is, a popular project. It is a node agent which exposes hardware capabilities in the form of node labels, annotations and extended resources, so exposing hardware topology information as CRDs was just a natural next step. NFD runs as a DaemonSet. It collects the resources allocated to running pods along with the associated topology information.
It identifies which NUMA nodes those resources correspond to, in order to expose CRDs that give us information about the resources available at a NUMA node level.

Next, the topology-aware scheduler plugin. This is the scheduler plugin that uses the per-node CRD instance to make a NUMA-aware placement decision. And finally the NodeResourceTopology API: this is the CRD API which is used by both NFD and the scheduler plugin, so it essentially acts as the glue between these two components.

Now, moving back to the example I showed previously: we have NFD, which uses the Pod Resources API to gather information about the allocated resources and the NUMA nodes those resources were allocated from. Once we determine that, we are able to expose per-NUMA allocatable resources as part of a CRD. So now, when the pod comes, it goes to the API server. The topology-aware scheduler plugin uses the CRD to make a more topology-aware scheduling decision. It runs a simplified version of the Topology Manager alignment logic to determine a node which is suitable for placing the pod. An important thing to note here is that Topology Manager still runs its alignment logic at the node level: after the pod has been placed on the node, the resource allocation for that pod still needs to happen, and Topology Manager still runs its alignment algorithm.

So, circling back to the example I had shown previously, let's see what happens when we have topology-aware scheduling enabled in our cluster. The topology-aware scheduler now has a more granular view of the resources on a per-NUMA-node basis.
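To make that more granular view concrete, here is a minimal sketch contrasting a scheduler that only sees node-level totals with one that filters on per-NUMA data like that exposed in the CRD. The node names, resource names and free-resource figures are all made up for illustration; the real plugin works against NodeResourceTopology objects, not plain dicts.

```python
# Hypothetical per-NUMA free resources for two workers, as a topology-aware
# scheduler could derive them from per-node NodeResourceTopology CRD
# instances. Both workers have identical node-level totals free
# (6 CPUs, 4 devices), but they are distributed differently across NUMA nodes.
cluster = {
    "worker-A": {"numa0": {"cpu": 2, "device": 0},
                 "numa1": {"cpu": 4, "device": 4}},
    "worker-B": {"numa0": {"cpu": 3, "device": 2},
                 "numa1": {"cpu": 3, "device": 2}},
}
request = {"cpu": 4, "device": 4}

def fits_node_totals(zones, req):
    """What the default scheduler sees: node-level totals only."""
    return all(sum(z[r] for z in zones.values()) >= q for r, q in req.items())

def filter_topology_aware(cluster, req):
    """Keep only nodes where a single NUMA zone can satisfy the whole
    request, loosely mirroring the single-numa-node admission check."""
    return [node for node, zones in cluster.items()
            if any(all(z[r] >= q for r, q in req.items())
                   for z in zones.values())]

print([n for n, z in cluster.items() if fits_node_totals(z, request)])
print(filter_topology_aware(cluster, request))
```

With node-level totals, both workers look feasible; with the per-NUMA view, only the worker with an entire NUMA node's worth of CPUs and devices free survives the filter, so the pod lands where the Kubelet's own alignment check will later succeed.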
So, in this diagram you can clearly see that NUMA node 1 of worker A is the one which is empty, as represented by the data here, and that is clearly visible to the topology-aware scheduler, allowing it to make the right decision, as opposed to previously, where worker B was a valid enough candidate for the workload to be placed on. So when the same workload comes, a pod requesting four CPUs and four devices, the scheduler knows that worker A will fulfill this request and hence places the pod onto that node.

So, let's talk about the current status of this project. In terms of the Pod Resources API changes, the Kubernetes enhancement proposals have been merged. We've introduced device topology and CPU ID information as part of the Pod Resources API, and the pull request corresponding to that has been merged. We're targeting GetAllocatableResources for the Kubernetes 1.21 release. In terms of the Node Feature Discovery work, the enablement of the Resource Topology Exporter: the Kubernetes enhancement proposal is still being reviewed, but the code is all up and ready. We've had initial discussions with the NFD maintainers and stakeholders, and we have a proposal doc to capture all the design discussion on how to proceed with exposing hardware topology through CRDs in NFD itself. Development work is currently in progress, and the initial demo can be seen here. In terms of the topology-aware scheduler plugin, the code has been done and the KEP is currently still being reviewed, and the NodeResourceTopology API work is still being discussed in the community. So there's some work that we need to do on the value proposition, to prove to the Kubernetes community how this feature would be useful with different use cases as well as at a larger scale.

Now let's talk about the use cases.
We've been extremely fortunate to have the opportunity to work with stakeholders and contributors from Intel, Huawei, Nokia and Samsung, who have contributed to and supported this work. Here are some of the use cases that we have. We have a vRAN user-plane use case from Nokia, where packets need to be processed with extremely high bandwidth. Pods handling the user traffic require SR-IOV VFs, huge pages and CPU resources from the same NUMA node, and due to failover requirements the scheduling of these pods needs to be extremely reliable. The use case from Samsung is about performance-intensive, high-throughput network functions for containerized 5G deployments and MEC. Then there is the cloud-native network function cluster-level NUMA alignment use case from Intel, where we've come across an interesting scenario: so far we've mostly been talking about full alignment of resources, but Intel raised an interesting point about partial alignment. What if you just want to align CPUs and huge pages, and in certain scenarios don't care about SR-IOV, for example? The fourth use case is the GPU direct scheduling use case by Visat, which requires direct GPU-to-NIC transfer over PCI instead of going through the CPUs. For detailed information on these use cases and more, please refer to the use case doc linked here. In that document you'll also find NVIDIA's way of preventing topology affinity errors; it's a very interesting read and I highly recommend it.

Finally, we have references to the documents and demos that we've worked on so far. We would really love to have more contributors, so please get involved and get in touch with us on the topology-aware-scheduling Kubernetes Slack channel; you can also email me or find me on Slack. Thank you very much.