Hello, my name is Yannir Quinn and I'm going to give this session together with Francesco Romani. We're both engineers at Red Hat, and today's session is about how to enable low-latency workloads in Kubernetes clusters. So let's start.

Now that Kubernetes has conquered the cloud, it's time for it to move on to other domains. First and foremost, conquering cloud-native network functions, the cornerstone of 5G deployments. To make that happen, we need to address an important limitation: ensuring low latency. Running cloud-native network functions, such as workloads designed for telcos, poses interesting challenges to overcome. These workloads are very demanding, requiring fine-tuning of node behavior to ensure low-latency processing of network packets, and special hardware and software are needed. A Kubernetes container platform can be augmented and tuned to become a great platform for this demanding class of CNF workloads, and this talk will demonstrate how. We will describe the features available in upstream Kubernetes to provide real-time guarantees and to handle low-latency, high-performance workloads.

Let's have a look at a typical use case taken from the telco world. A cluster admin handling a deployment that should provide low latency, real-time behavior, or DPDK support will want to configure it without a deep understanding of all the interactions between the kernel and the operating system components that are in the mix. They should be able to do so in an easy, declarative manner that encompasses all the moving parts needed to achieve this goal, such as isolating workloads from interruptions, to gain better latency results for telco companies as well as on the web-scale and edge fronts. So how do we do it, and how can we guarantee performance and low latency? Let's have a look.

The ecosystem that provides a way to achieve this is built from existing and further-developed components. We will explore the kubelet settings, the topology and CPU managers, the kernel settings on the node applied by TuneD, and how they interact to deliver the workload guarantees. Kubernetes tunings can be achieved by harnessing and manipulating kubelet settings, together with the CPU and device managers, as well as by applying topology policies. Alongside that, we will be using operators to apply machine configurations to the operating system itself, including installation of the real-time kernel and other configurations that sit between the kernel and the kubelet, and we will be setting node-level tunings relayed by TuneD, which is also done by an operator. At the top we have the Performance Add-on Operator, which we will see later on; it orchestrates everything to achieve clusters optimized for applications sensitive to CPU and network latency.

If you're asking yourself what an operator is: an operator is a Kubernetes controller specialized in operating an application. It's a Kubernetes pattern that extends the Kubernetes control plane with a custom controller and a custom resource definition that add operational knowledge of an application. We will see some of these operators in action, as I mentioned, combined to tailor a tuning suite for our cluster.

To optimize Kubernetes clusters for latency-sensitive applications, installing a real-time kernel and configuring the operating system to run as a real-time system is one of the key aspects.
Linux can be considered the best choice for high-performance computing thanks to years of kernel optimizations focused on delivering high average throughput for a vast number of different workloads. A real-time kernel is not necessarily superior to a standard kernel; instead, it meets different business or system requirements. It is an optimized kernel designed to maintain low latency, consistent response times and determinism. Of all the resources managed by the Linux kernel, CPU time is the principal one. Tuning a CPU towards better latency relies on scheduling algorithms that use priorities and real-time policies to control execution, favoring high-priority tasks, while preemption is enabled over a much wider range of kernel code than on a traditional kernel.

Let's see how we reach the nodes in our cluster and make them real-time, using operators. We start with the Machine Config Operator, one of the operators we mentioned earlier and which we'll see in the following slides. The Machine Config Operator manages the operating system and keeps the cluster up to date and configured. Using this operator, platform administrators can configure and update systemd, CRI-O, the kubelet, the kernel, NetworkManager and more on the nodes. To do so, the Machine Config Operator creates a statically rendered machine config, which includes the machine configs for each node, and then applies that configuration to each node, with the machine config server and the machine config controller relaying the configuration.

In the following example we see how a machine config can set the kernel type to real-time on specific nodes in a declarative way: we have a configuration of kind MachineConfig, we set the real-time kernel in its spec, and the operator applies it on the nodes. We'll see it in a micro demo in a moment. Before the micro demo, here is another example of what machine configuration can do: a kubelet configuration that sets values such as the CPU, memory and topology manager policies on the node. The configuration goes through the operator and finally lands in the kubelet conf files of the matching nodes.

And now we get to the demo, where we apply the real-time kernel and kubelet arguments on nodes using the operator. First, we change the kernel on the nodes to the real-time kernel. We label the node with a worker-rt label, we create the MachineConfig we saw earlier to set the node to the real-time kernel, and we create a MachineConfigPool, a pool to which the machine configs we create will be applied. When we apply the pool, all the machine configs associated with it are applied on the nodes matching the label, which in our case is worker-rt. We get the nodes, we see the label, and digging into the node we can see that the kernel has changed to real-time. That's one example. I'm using oc here, but you can consider it a CLI very similar to kubectl with additional capabilities.

In the next example we change the kubelet configuration on the desired nodes. We create a kubelet configuration whose settings will be synced into the kubelet conf files on the node: for example the CPU, system-reserved and kube-reserved values, the CPU manager policy and the reconcile period.
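To make the two demo steps more concrete, here is a rough sketch of what such manifests could look like. The names, labels and numeric values are illustrative rather than the exact ones used in the demo, and the schema is the OpenShift machineconfiguration API the talk relies on:

```yaml
# Sketch only: switch nodes in a dedicated "worker-rt" pool to the real-time kernel.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-rt-realtime-kernel
  labels:
    machineconfiguration.openshift.io/role: worker-rt
spec:
  kernelType: realtime
---
# The pool that groups the labeled nodes and the machine configs applied to them.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-rt
  labels:
    custom-kubelet: worker-rt        # referenced by the KubeletConfig below
spec:
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: [worker, worker-rt]
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-rt: ""
---
# Kubelet settings (reserved resources, CPU/topology manager policies, reconcile period)
# rendered by the operator into the kubelet configuration of the matching nodes.
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-rt-kubelet
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: worker-rt
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
    topologyManagerPolicy: single-numa-node
    systemReserved:
      cpu: 500m
      memory: 512Mi
    kubeReserved:
      cpu: 500m
      memory: 512Mi
```

The pool is what ties everything together: it selects the labeled nodes and decides which machine configs and kubelet settings land on them.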
We create the configuration; the pool already existed, so the changes are applied. Then we go into the specific node we labeled and, looking at the kubelet conf file, we can see the values have actually changed according to the configuration we added.

After seeing how we use machine configuration, let's talk about TuneD. TuneD is an open source system tuning service for Linux. It monitors connected devices using the udev device manager and tunes system settings according to a selected profile. It supports various types of configuration, such as sysctl, sysfs or kernel boot command-line parameters, which are integrated through a plugin architecture. Many of the plugins are supported by another operator, the Node Tuning Operator, and include the plugins we see below: sysctl, CPU and bootloader, for example. TuneD also supports hot-plugging of devices and stores all of its configuration cleanly in one place. TuneD profiles, which express the tunings we desire, can be defined hierarchically, as we can see in the diagram, which reduces duplication and simplifies maintenance. TuneD also supports full rollback, so the system can easily be returned to the state it was in before the profile was applied, and it includes a number of predefined profiles for common use cases: for example, presets for high throughput, low latency or power saving are distributed with it. Ultimately we will want to ship a profile that best suits our tuning requirements. The profile can be composed from one or more presets to reach the desired settings, but more parts are needed to complete the picture, as we will see soon.

Here we meet yet another operator, after the Machine Config Operator and TuneD: the Node Tuning Operator. It is designed for node tunings and gives cluster administrators a centralized way to customize node-level tuning. It abstracts OS-version-dependent tuning details away, enables modularity and TuneD profile inheritance, provides sane defaults for control plane and worker nodes, and also provides dynamic tuning with rollback, without the need for node reboots. The Node Tuning Operator manages TuneD as a containerized Kubernetes daemon set. It ensures the custom tuning specification is passed to all the containerized TuneD daemons running in the cluster, in the format the daemons understand. The daemons run on all nodes in the cluster, one per node. Custom tunings are deployed by creating a Tuned custom resource, which includes the profile data, the recommendation logic, and the labels and selectors used to apply these tunings to the desired nodes.

So let's see a short demo of how the Node Tuning Operator applies our TuneD profiles. First we label the node we want the changes to apply to, as we did before; in our case we see the label here. We create the pool associated with the nodes, as mentioned earlier. Here we see a declarative Tuned custom resource, handled by the Node Tuning Operator, where we set several values such as the kernel arguments for isolated cores. In our example it also includes the realtime and openshift-node parent profiles. Looking at the parent profile, it sets several values as well, for example the sysctl kernel.hung_task_timeout_secs value of 600. We then go to the nodes again and debug our specific node: we can see the kernel arguments have changed to isolcpus=1, and checking the sysctl value for kernel.hung_task_timeout_secs, it is indeed 600.
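As a sketch of what a Tuned custom resource like the one in this demo could look like (the names, label and priority here are illustrative), combining the openshift-node and realtime parent profiles with the isolcpus argument and the sysctl value we just checked:

```yaml
# Sketch only: a Tuned CR for the Node Tuning Operator targeting the labeled nodes.
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: worker-rt-tuning
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
    - name: worker-rt-tuning
      data: |
        [main]
        summary=Low-latency tuning for real-time worker nodes
        include=openshift-node,realtime
        [sysctl]
        kernel.hung_task_timeout_secs=600
        [bootloader]
        cmdline=isolcpus=1
  recommend:
    - profile: worker-rt-tuning
      priority: 20
      match:
        - label: node-role.kubernetes.io/worker-rt
```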
Let's now look at the kubelet and see how we can tune it, and in turn the worker nodes, for telco workloads. Let's start with the most basic compute resource: the CPU cores. To meet the latency requirements we need to mitigate CPU contention and context switches. To do so we can use the CPU manager, which is built into the kubelet. The CPU manager supports different policies. The static policy allows exclusive allocation of CPUs to the workload. Exclusive allocation allows us to avoid context switches and to reduce cache misses, because no other workload can run on the exclusively allocated CPUs. However, system activities can still run on those isolated CPUs. For full isolation we need other settings, for example keeping IRQ processing on the housekeeping CPUs. The housekeeping CPUs can be selected using other kubelet options, for example the reserved-cpus command-line option.

Telco workloads very often require access to SR-IOV virtual functions, or to hardware devices in general. Kubernetes provides a device plugin framework, which works together with the device manager, another component built into the kubelet, and this system can be used to expose additional resources through the kubelet. Hardware supported by device plugins includes the aforementioned SR-IOV virtual functions, but also other devices like GPUs, or hardware accelerators in general. The device plugins report topology information for the devices they manage, including the NUMA affinity.

The topology manager is another component built into the kubelet. It orchestrates the other resource managers, for example the CPU and device managers, and allows the resources the workload requires to be aligned using the topology information provided by those managers; for example, we can do NUMA cell alignment. The topology manager is configurable using policies. With the most restrictive policy enabled, called single-numa-node, all the resources a workload requires must be allocated with the same NUMA affinity. If this requirement is not met under this strict policy, the workload won't be admitted. Other policies allow less strict behavior: the topology manager can, for example, do best-effort allocation, where the workload is still admitted even if it is not possible to allocate the required resources with the same NUMA affinity.

Let's see a few examples of NUMA-aligned and NUMA-unaligned resource allocation, in order to see how the topology manager works. In general we should not expect the NUMA cells to be symmetric, or to all have CPU cores, memory and PCI buses; NUMA cells may have different combinations of those resources. For simplicity's sake, however, we will consider a system with two symmetric NUMA cells, each with CPU cores, local memory and a PCI bus attached. This is the simplest scenario and we use it for illustration purposes. Our first example is a NUMA-unaware resource allocation. The workload is allocated with no regard for NUMA cell alignment of the required resources, so the resources required by the workload may span different NUMA cells. This still allows the workload to run, but the performance is not optimal because of, for example, the traffic required between the NUMA cells.
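Before walking through the concrete allocations, here is a sketch of the kind of guaranteed-QoS workload we have in mind for these examples. The image and the SR-IOV resource name (openshift.io/sriov_vf) are placeholders; the actual resource name depends on how the device plugin exposes the virtual functions in a given cluster:

```yaml
# Sketch only: a guaranteed-QoS pod (requests == limits, integer CPUs) asking for
# two exclusive cores and one SR-IOV virtual function from a device plugin.
apiVersion: v1
kind: Pod
metadata:
  name: latency-sensitive-workload
spec:
  containers:
    - name: app
      image: registry.example.com/telco/dpdk-app:latest   # placeholder image
      resources:
        requests:
          cpu: "2"
          memory: 1Gi
          openshift.io/sriov_vf: "1"   # placeholder device plugin resource name
        limits:
          cpu: "2"
          memory: 1Gi
          openshift.io/sriov_vf: "1"
```

With the static CPU manager policy and the single-numa-node topology manager policy in place, a pod like this is only admitted if the two exclusive cores and the virtual function can be aligned on the same NUMA cell.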
For example, if our workload gets core number three and the SR-IOV virtual function from NUMA cell zero, on the left, and core number six and the memory from NUMA cell number one, pictured on the right, this is a legal allocation: a NUMA-unaware allocation does not prevent the workload from running. But if, for example, core number six on the right needs to access the SR-IOV virtual function on NUMA cell zero on the left, it must go across the cross-NUMA link, and this causes a performance penalty. Our second example is a NUMA-aware resource allocation; we can obtain this allocation, for example, using the single-numa-node topology manager policy. In this case all the resources the workload requires are aligned on the same NUMA cell, here NUMA cell zero on the left. The performance is optimal because no cross-NUMA link traffic is ever required.

Let's see a demo of how the topology manager works on a real Kubernetes cluster. We are now in the demo, so let's run a workload which requires aligned resources. From the workload perspective, the only requirement is to be in the guaranteed quality-of-service class. This is the example workload we are going to run, which is very similar to the workload we saw previously in the example slide: two cores are required, and one SR-IOV virtual function. From the user perspective, with the topology manager enabled, there is no difference in the flow: we just run the pod, and it is running. What is more interesting is how the resources look from inside the pod. To inspect the resources from inside the pod we can run a debug workload, but first let's see how the debug workload definition actually looks: again, we require the same resources and we just run a different command, because we want to run our debug tool. We create the pod as usual, we wait for it to be running, and here it is. Now we can jump inside the pod and check the resource allocation. Two CPU cores were required, and this workload got two of them, both bound to NUMA cell one. We required one SR-IOV virtual function, which, as a PCI device, is again bound to NUMA cell one. So everything is aligned on NUMA cell one.

We have seen so far how many moving parts, like operator components and kubelet settings, we need to configure to ensure optimal performance for low-latency workloads. Let's see how we can bring everything together. Since this is also quite a complex task, let's see how it can be automated. To automate it we can use, for example, another operator: the performance add-on operator. To configure a cluster we can use this operator, which simplifies the process by exposing a high-level profile that allows us to configure the performance behavior of the cluster, so we use another, higher level of abstraction. The performance operator will orchestrate the configuration of all the components, like the operators we have seen and the kubelet settings, and it will drive them instead of the user. So the performance operator acts as an orchestrator for all the cluster components. In this slide we see an example of how the performance operator drives all the other components; some settings are still done almost directly by the operator itself, like, for now, the huge pages configuration. This is how the performance profile could look: for example, we can request huge pages, we can select the reserved housekeeping CPUs, and we can add kernel command-line arguments.
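As a sketch of what such a performance profile could look like (the CPU sets, huge page counts, kernel arguments and node selector are illustrative, and the exact API version depends on the operator release):

```yaml
# Sketch only: a performance profile handing the high-level intent to the operator,
# which then drives the machine configs, kubelet settings and TuneD profiles for us.
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: telco-low-latency
spec:
  cpu:
    reserved: "0-1"        # housekeeping CPUs for the OS and the kubelet
    isolated: "2-15"       # CPUs dedicated to latency-sensitive workloads
  hugepages:
    defaultHugepagesSize: 1G
    pages:
      - size: 1G
        count: 4
  realTimeKernel:
    enabled: true
  additionalKernelArgs:
    - nmi_watchdog=0
  numa:
    topologyPolicy: single-numa-node
  nodeSelector:
    node-role.kubernetes.io/worker-rt: ""
```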
Let's now see how the performance operator works in practice in a cluster. We have a cluster with a few worker nodes, and we selected the role to make sure the performance operator can change the configuration of the worker nodes. We see the performance operator is running, and here is how the spec looks: again some kernel arguments, some huge pages for illustration purposes, the single-numa-node topology manager policy, and in this case we also require the real-time kernel, and for that we need the cluster capability to support the real-time kernel. We see the huge pages again, and now we see how the performance operator works with all the other cluster components. For example, we are now checking the TuneD node tuning settings we saw previously in this session, and we can see how the TuneD profile is translated in practice. Next we check the huge pages; this is how they should look. Now let's jump onto a node and check how all these settings actually translated into node tunings. With proper cluster support we see the RT kernel, we also see the kernel arguments we required, and, in just a moment, the huge pages we requested, and here they are: huge pages configured. So this is how the performance operator drives the configuration of the whole cluster. If you want to try the performance operator, you can check the GitHub page of the project, where you will find the source code (everything is open source), the documentation and the installation instructions. We have a pre-built container image, and very soon the operator will be available on OperatorHub.io for easier consumption. So this concludes our session. Thank you very much for attending.