Hi all, I'm Swati Sehgal, working as a principal software engineer in the Ecosystem Engineering group at Red Hat. Our focus has been to enable virtualized networking functions on Kubernetes and OpenShift for our customers and partners. NFV is an approach widely adopted by the telecommunications industry for the deployment, management and scaling of networking functions. The idea is that rather than procuring and deploying applications on custom proprietary hardware, there's a shift to deploying these virtualized networking functions, also referred to as VNFs, as software applications running on virtual machines or in containers on commercial off-the-shelf servers. Leveraging virtualization and cloud technologies leads to significant simplification of the deployment and management of networking infrastructure, agile service delivery, cost reduction, efficient scalability, and a wider choice of suppliers.

Such applications have stringent throughput and latency requirements, and it is extremely challenging to achieve performance comparable to custom-built hardware. Because the processing of communication protocols places a significant burden on the CPUs, hardware acceleration plays a key role. It not only leads to efficiency, but also improves performance, as the CPUs can perform more useful tasks while I/O transfers are taking place, allowing the VNFs to scale efficiently in high-bandwidth networks. In addition to hardware acceleration, memory placement matters: in multi-NUMA systems, where NUMA stands for non-uniform memory access, system memory is divided into zones called NUMA nodes, which are allocated to particular CPUs or sockets. Access to memory that is local to a CPU is faster than access to memory connected to remote CPUs on the system. Normally, each socket on a NUMA system has a local memory node whose contents can be accessed faster than the memory in the node local to another CPU or the memory on a bus shared by all CPUs.
Similarly, devices such as physical NICs are placed in PCI slots on the compute node hardware. These slots connect to specific CPU sockets that are associated with a particular NUMA node. For optimal performance, CPUs, memory and PCI devices should all be allocated such that they have access to the same local memory, as the performance impact of NUMA misses is significant.

Kubernetes has recently turned heads in the telco industry: it has become the de facto standard for container orchestration, and it's attracting performance-sensitive workloads that are very demanding and need speed and raw performance, as they're running directly on bare metal. In today's presentation, we'll focus on NUMA alignment of resources, the current gaps in the Kubernetes scheduler, and the work we've been doing to solve them. Without further ado, let's dive straight into the problem we're trying to solve in Kubernetes.

Here's an example of a Kubernetes cluster. We have two workers, worker 1 and worker 2, each running Kubelet, which is a node agent that communicates with the control plane and makes sure that the containers are running in a pod. For simplicity, we are only considering CPUs and devices in this cluster. The most important thing to note here is that from the Kubernetes scheduler's point of view, resource-wise, both nodes are completely identical: both have four instances of device A, four instances of device B and eight CPUs each. So if a workload requests a device A and a device B, both nodes should be able to run the pod, but that is not what happens. When the pod gets scheduled on worker 1, the pod is rejected with a topology affinity error. Here is how this looks on the command line. When the same pod is scheduled on worker 2, it runs without any problem; in the command line, we see the pod running here. So what has changed? Why does the pod run fine on one node, whereas on the other it is rejected with a topology affinity error?
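To make the mismatch concrete, here is a small illustrative sketch (the per-NUMA numbers mirror the example above; the code itself is hypothetical, not actual scheduler code): once resources are aggregated per node, which is all the default scheduler sees, the two workers become indistinguishable.

```python
# Per-NUMA-node free resources on the two workers, as in the example.
worker1 = {0: {"deviceA": 4, "deviceB": 0, "cpu": 4},
           1: {"deviceA": 0, "deviceB": 4, "cpu": 4}}
worker2 = {0: {"deviceA": 2, "deviceB": 2, "cpu": 4},
           1: {"deviceA": 2, "deviceB": 2, "cpu": 4}}

def scheduler_view(node):
    """Aggregate totals per resource -- the only view the default scheduler has."""
    totals = {}
    for numa in node.values():
        for res, qty in numa.items():
            totals[res] = totals.get(res, 0) + qty
    return totals

# Both nodes look identical once NUMA boundaries are erased.
assert scheduler_view(worker1) == scheduler_view(worker2)
```

With NUMA boundaries erased, the scheduler has no basis to prefer worker 2 over worker 1, even though only worker 2 can actually satisfy an aligned request for one device A plus one device B.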
To understand this, we need to look into how resource management works in Kubelet, how Kubelet is configured, and how the resources are distributed across NUMA nodes. In Kubelet, there are four resource managers that are responsible for handling resources as well as the alignment of those resources.

The first one is the device manager, which became a beta feature in Kubernetes 1.10. The device manager was introduced as part of the device plugin framework, a vendor-independent solution that allows vendors to advertise their resources to Kubelet and monitor them without writing custom Kubernetes code.

The second one is the CPU manager, which became a beta feature in Kubernetes 1.12. A node in a Kubernetes cluster runs multiple pods, and some of these pods can be running CPU-intensive workloads. Latency-sensitive applications can be affected by noisy-neighbor problems, so for performance- and latency-sensitive applications, exclusive allocation of CPUs is important to minimize CPU throttling effects, context switches, and processor cache misses. The static policy of the CPU manager in Kubelet allows us to assign exclusive CPUs to each container according to its CPU request.

The third one is the topology manager, which became a beta feature in Kubernetes 1.18. The topology manager is a component in Kubelet that coordinates the topology of resource allocation, allowing a workload to run in an environment optimized for low latency. Prior to this feature, the CPU and device managers in Kubernetes made resource allocation decisions independently of each other. This led to undesirable allocations on multi-socket systems, where CPUs and devices could be allocated from different NUMA nodes, incurring additional latency.

The most recent addition to the resource managers is the memory manager, which became beta in Kubernetes 1.22. It enables guaranteed memory and hugepage allocation for pods in the Guaranteed QoS class.
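As a rough illustration of what the static policy gives us (a minimal sketch, not Kubelet's actual implementation), exclusive CPUs for a Guaranteed container with an integer CPU request are carved out of a shared pool, so no other container runs on them:

```python
# Assumed 8-CPU node; every non-exclusive container runs on the shared pool.
shared_pool = {0, 1, 2, 3, 4, 5, 6, 7}

def allocate_exclusive(request):
    """Carve `request` exclusive CPUs out of the shared pool (illustrative)."""
    if request > len(shared_pool):
        raise RuntimeError("not enough free CPUs")
    cpus = set(sorted(shared_pool)[:request])
    shared_pool.difference_update(cpus)  # other containers can no longer use them
    return cpus

container_cpus = allocate_exclusive(2)
# The container is pinned to its exclusive CPUs; the shared pool shrinks,
# avoiding context switches and cache interference on the pinned CPUs.
```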
Prior to this feature, the topology manager in Kubelet aligned CPUs and PCI devices, but there was no way to align memory. Now that we have some background on how resource management is handled at the node level, let's get back to the example I showed initially. Let's zoom in on how the nodes are configured and look at the resource distribution across NUMA nodes to understand why we were getting the unexpected scheduling behavior.

So here we are. Kubelet is configured such that the topology manager policy is single-numa-node, and the CPU manager is configured with the static policy. Given that the topology manager is configured with the single-numa-node policy, all the resources must be strictly aligned on the same NUMA node. In case it is not possible to align all the resources, the pod is rejected with a topology affinity error. If the pod is part of a Deployment or a ReplicaSet, this can result in runaway pod creation, because the subsequent pods could land on nodes where resource alignment is not possible.

Now, if you look carefully at the resource distribution of the devices across NUMA nodes on worker 1, all the instances of device A are on NUMA node 0 and all the instances of device B are on NUMA node 1. There's basically no way a pod requesting an instance of device A and an instance of device B can be satisfied from a single NUMA node on worker 1. But if the pod is scheduled on worker 2, the request can be fulfilled, as the resource distribution across NUMA nodes there allows the workload's resources to be allocated from the same NUMA node. This clearly shows that even though we have the topology manager taking care of resource alignment based on hardware topology, the scheduler is topology-unaware, leading to suboptimal scheduling decisions.
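The single-numa-node admission check can be sketched like this (a simplification of the topology manager's hint-based algorithm; the per-NUMA device availability mirrors the example above): a pod is admitted only if its whole request fits on one NUMA node.

```python
# Per-NUMA device availability, as in the example.
worker1 = {0: {"deviceA": 4, "deviceB": 0},
           1: {"deviceA": 0, "deviceB": 4}}
worker2 = {0: {"deviceA": 2, "deviceB": 2},
           1: {"deviceA": 2, "deviceB": 2}}

request = {"deviceA": 1, "deviceB": 1}

def fits_single_numa(node, request):
    """True if some single NUMA node can satisfy the entire request."""
    return any(all(numa.get(res, 0) >= qty for res, qty in request.items())
               for numa in node.values())

assert not fits_single_numa(worker1, request)  # rejected: topology affinity error
assert fits_single_numa(worker2, request)      # admitted and runs
```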
So in order to optimize the cluster-wide performance of workloads and resource utilization, and to enhance the overall system, the default scheduler should consider resource availability along with the underlying resource topology, to increase the likelihood of the pod landing on a node where it will fit.

So what are we doing to solve this problem? We had to create a few components and make enhancements to some of the existing ones. We needed a component that exposes resource topology information at NUMA-node granularity, and we had to enhance the scheduling process itself with the help of a scheduler plugin that takes that information into consideration.

For exposing resource topology information, we added a software component called NFD Topology Updater to Node Feature Discovery, which is often referred to as NFD. This project is part of the Kubernetes SIGs organization. NFD is a node agent that exposes hardware capabilities in the form of node labels, annotations and extended resources. The NFD Topology Updater component, as I mentioned, collects information about the resources allocated to the running pods, along with the associated topology information, using the Pod Resources API. With that, we are able to determine the available resources at NUMA-node granularity, and we expose that information as a CR instance per node. The Pod Resources API is a Kubelet endpoint for pod resource assignment, and it was enhanced to add support for CPU IDs, device topology, and an additional endpoint to obtain information about allocatable resources.

The second part is the topology-aware scheduler plugin, as you can see over here. The Kubernetes SIGs organization also houses a repository for out-of-tree scheduler plugins based on the scheduling framework, and we contributed the NodeResourceTopology scheduler plugin there. It uses the CRs created by NFD to make a NUMA-aware placement decision.
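A hedged sketch of the filtering step (the field names below are a simplification of the actual NodeResourceTopology CRD, not its real schema): the plugin reads per-NUMA-zone available resources from each node's CR and filters out nodes where the request cannot be satisfied from one zone under the single-numa-node policy.

```python
# Simplified stand-ins for the per-node NodeResourceTopology CRs.
crs = {
    "worker1": {"policy": "single-numa-node",
                "zones": [{"deviceA": 4, "deviceB": 0},
                          {"deviceA": 0, "deviceB": 4}]},
    "worker2": {"policy": "single-numa-node",
                "zones": [{"deviceA": 2, "deviceB": 2},
                          {"deviceA": 2, "deviceB": 2}]},
}

def filter_nodes(request, crs):
    """Keep only nodes where Kubelet's topology manager would admit the pod."""
    feasible = []
    for name, cr in crs.items():
        if cr["policy"] != "single-numa-node":
            feasible.append(name)  # no strict alignment required on this node
            continue
        if any(all(zone.get(res, 0) >= qty for res, qty in request.items())
               for zone in cr["zones"]):
            feasible.append(name)
        # otherwise the pod would be rejected at admission; skip the node
    return feasible

# Only worker 2 survives filtering for one device A plus one device B.
assert filter_nodes({"deviceA": 1, "deviceB": 1}, crs) == ["worker2"]
```

Filtering in the scheduler this way prevents the pod from ever being bound to worker 1, so the runaway pod-creation scenario described earlier does not arise.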
Essentially, the scheduler plugin runs a simplified version of the topology manager's alignment algorithm to determine whether a node is suitable for the pod. The NodeResourceTopology API is the CRD-based API which acts as the glue between NFD and the scheduler plugin. The important thing to note here is that the topology manager will still run its alignment algorithm at the node level for resource allocation. Here are all the relevant links for the components that I just mentioned. If you'd like to learn more or have questions, please contact us on Slack. Thank you very much.