Good afternoon everybody. My name is Martin Sivak and I work in the cloud native infrastructure team, focused on telco use cases. Today I would like to show you how OpenShift can be used as a platform for running low latency and real-time applications.

Before I begin, let me go quickly over the content of my presentation. First, I would like to introduce you to the general concepts and the use cases that are important for low latency. Then I'll give you an example of a latency budget and show the latency sources we are dealing with. The next section will guide you through configuring OpenShift and the application, and will explain what happens behind the scenes. At the end, I'll show you how the tuning can be validated and the latency measured. I will also describe the known limitations, give you some tips about how to change the tuning manually, and show how to debug the latency tuning.

I am going to move to the low latency concepts now to not waste any more time. First, a note about low latency versus real-time terminology. Those two terms are very close in meaning, but not identical, and I don't want to get into the details during this presentation, so I will be talking about low latency only from now on.

Having said that, what is low latency and why is it useful? Low latency systems are about guarantees with regard to response or processing time. This might be needed to keep a communication protocol happy, to make a device safe by reacting fast to an error condition, or just to make sure a system is not lagging behind when receiving a lot of data. More specific examples are on my slide. In telecommunications and the radio industry, processing needs to be low latency to synchronize radio transmissions and to give guarantees. The new 5G standard defines the ultra-reliable low latency communication class that guarantees one millisecond one-way latency for the purposes of the Internet of Things and Industry 4.0. Industry 4.0 is the move to remotely controlled machines and vehicles that require the same guarantees for safety and reliability. The same applies to the field of autonomous vehicles, where the reaction time to sensor inputs must be instant. The two use cases I mentioned are just the tip of the iceberg, and I'm sure there are many more, like stock trading platforms and the banking industry in general. I will focus on the telco example in the rest of my presentation.

I described the telco use case and mentioned that low latency is needed for synchronization of radio transmissions. On this slide I want to show you a more detailed view of the time constraints we are dealing with. On the last slide I said that ultra-reliable low latency communication guarantees a one-way latency of just one millisecond. Half of that is needed for the high-level infrastructure. That leaves 500 microseconds for all the radio transmissions, cabling, operating system and communication protocol processing. That is not much time. Half of this time budget is consumed just by the radio transmission, and another 1% by every kilometer of cabling. The total time remaining for processing in the distributed unit is 100 microseconds, and that is shared between the hardware, the operating system and the cloud-native network functions that implement the 5G protocols and functionality. When all the requirements are summed up, the time left for the infrastructure and operating system latency is just 20 microseconds. Now that we know the latency goal for the platform, 20 microseconds, let's see if it's achievable and how.
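To make that budget easier to follow, here is the same arithmetic condensed into a few lines of Python. The guarantee, the 50% radio share, the 1% per kilometer and the resulting 100 and 20 microsecond figures come from the slides; the cable length and the share consumed by hardware plus 5G protocol processing are my assumptions, chosen so the example adds up to the quoted numbers.

```python
# Back-of-the-envelope version of the latency budget from the slides.
# CABLE_KM and CNF_AND_HW_US are assumptions picked to make it add up.

TOTAL_ONE_WAY_US = 1000           # URLLC guarantee: 1 ms one-way
budget_us = TOTAL_ONE_WAY_US / 2  # half is kept for high-level infrastructure

radio_us = budget_us / 2          # half of the rest goes to radio transmission
CABLE_KM = 30                     # assumed cable run; ~1% of the budget per km
cable_us = CABLE_KM * budget_us * 0.01

du_us = budget_us - radio_us - cable_us
print(f"left for the distributed unit: {du_us:.0f} us")   # 100 us

CNF_AND_HW_US = 80                # assumed hardware + 5G protocol processing
os_us = du_us - CNF_AND_HW_US
print(f"left for OS and infrastructure: {os_us:.0f} us")  # 20 us
```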
Imagine a process that is triggered by a timer. Ideally it would do its job immediately once the timer tells it so. However, it's not quite as simple in real life. There are many different sources of delays and noise in a typical operating system.

The first step the operating system kernel has to do is to actually wake up the process. That itself can take a lot of time, even if the sleep is not too deep. The CPU frequency might need to be adjusted to the performance mode. Then the kernel needs to decide who is supposed to run first. Is there maybe a higher priority task in the queue? This decision can take microseconds, eating into our time allocation significantly. Loading the process into memory is not free either. What if the memory bus or the CPU caches are being used by someone else? Now, finally, the process runs. But that state of happiness usually does not last long. It can suddenly be stopped and replaced by another high priority task, either an interrupt handler or another process. Would you say that I surely covered all possible cases? Not at all. The firmware of the computer itself can take over, stop the operating system and perform some hardware housekeeping via system management interrupts.

As you can see, it is impossible to guarantee any processing time under normal circumstances. Luckily, there is a way to mitigate most of these interruptions, and I tried to add a hint in blue next to each latency source about how to do that. We call the procedure low latency tuning. It requires updating and configuring the machine firmware (you might know it as BIOS), configuring the operating system kernel, making sure processes run only on specific processors, and so on.

Now I am finally getting to OpenShift's role in all of this. I will start by showing you the administrator-visible components. There are many places that need to be configured, but my team prepared a single easy-to-use operator (an OpenShift term for a supervisor component) that takes a high-level description of which CPUs to use for the operating system, which to tune for low latency, where the real-time kernel is needed, and which group of workers (compute servers) to apply the tuning to. OpenShift processes this performance profile and applies all the tuning necessary. This usually means a couple of automated reboots, but that's it.

We also provide a tool that will help with validating the tuning. It can figure out the cluster layout, run a bunch of tests to make sure everything is set correctly, and allows running the oslat latency benchmarking tool if requested. We use the same tool for our continuous integration test suite to be sure we did not break anything. As a cluster admin, you do not really need to do anything else, but maybe you are interested in how it all works inside.

All the magic is done by orchestrating a couple of components coming from the Kubernetes and OpenShift projects. I will go from the bottom, the layers closest to the operating system. Kubernetes runs an agent on every node, the kubelet, that manages the lifecycle of all containers on that node. It contains two components that are relevant for the low latency tuning: the CPU manager and the topology manager. The CPU manager is responsible for allocating exclusive CPUs for latency-sensitive containers. The topology manager handles NUMA affinity between containers and devices like network cards. The Node Tuning Operator runs a containerized version of the TuneD service and adapts it to the distributed cluster environment.
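Since I keep referring to this performance profile, here is a minimal sketch of what one could contain. The real object is a YAML custom resource; I am showing it as a Python dict purely for illustration, and the exact field names and API version should be verified against the PerformanceProfile documentation for your cluster.

```python
# Illustrative only: a performance profile rendered as a Python dict.
# In a real cluster this is a YAML custom resource; verify field names
# against the PerformanceProfile API version shipped with your cluster.
performance_profile = {
    "apiVersion": "performance.openshift.io/v2",  # assumption; check yours
    "kind": "PerformanceProfile",
    "metadata": {"name": "example-profile"},
    "spec": {
        # CPUs reserved for noisy operating system processes, and CPUs
        # tuned and isolated for the low latency workloads
        "cpu": {"reserved": "0-1", "isolated": "2-7"},
        # install and boot the real-time kernel on the selected workers
        "realTimeKernel": {"enabled": True},
        # which group of workers (compute servers) gets the tuning
        "nodeSelector": {"node-role.kubernetes.io/worker-cnf": ""},
    },
}
```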
To get back to the components: the TuneD service is then responsible for applying all the low-level tuning on every worker node. The Machine Config Operator allows installing the real-time kernel and making other changes to the operating system configuration. We use it to start some helper services and to configure systemd to apply a default CPU affinity mask to every infrastructure process. And the highest-level component is the one I already mentioned, the Performance Addon Operator. This is the component that computes the tuning rules based on the high-level description you saw previously and instructs the other components to do their part.

Here you can see what I just described to you in a nicer graphical form. The flow is generally left to right. The performance profile is processed by the Performance Addon Operator, which spits out more granular objects that are then processed by the other components I mentioned. That results in tuning the system. On the right side of the diagram you see color-coded processors, and in the middle you can find examples of processes and workloads color-coded in the same way, to give you an idea about where everything is running.

The example performance profile displayed on the setup slide defines two groups of CPUs, reserved and isolated. Reserved, here marked by the red color, is the group of CPUs that are to be used for noisy operating system processes. Isolated, marked by the green border, is the group of CPUs that is tuned for low latency purposes. There is, however, one OpenShift specialty here: the isolated CPUs are not dedicated to latency-sensitive containers until one such container actually claims them. That fact is signified by the green fill in the diagram. You can see that CPUs 3 and 4 match the green color of the DPDK and low latency real-time worker pods; only those CPUs are exclusively dedicated to those containers in this example. All the gray ones can still be used by any pod in the system.

Here is a summary of where each kind of process can run in the tuned OpenShift cluster. It matches what I just described on the diagram slide, but describes the placements explicitly. Operating system and kernel tasks are typically restricted to the reserved CPU pool only. The reserved pool needs to be large enough to be able to pull the weight of the operating system. The latency-sensitive processes are pinned to their exclusive CPUs to avoid any interference from the OS or neighboring processes. But notice that the infrastructure and general workload pods can actually run even on the reserved CPUs. Their CPU list is not restricted in any way, and OpenShift relies on the Linux kernel CPU scheduler. The important fact to mention, though, is that those pods are still accounted towards the isolated pool capacity quota. That is the current behavior of the stock Kubernetes CPU manager.

Now that I covered the system configuration, I still need to describe how to define a workload pod (that's the OpenShift name for a group of containers) and how to mark it as a candidate for the latency-sensitive treatment. The first and most important piece is making sure the resources requested by every container inside the pod match that container's resource limits. This way Kubernetes places the pod into the resource class called Guaranteed. Only pods that belong to this class are processed by the CPU placement logic in the CPU manager. When this is satisfied, the CPU manager allocates exclusive CPUs to all containers in the pod that request a whole number of CPUs.
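As a concrete illustration of that rule, here is a minimal pod sketch, again written as a Python dict rather than the usual YAML. The names and the image are placeholders; the important parts are that requests equal limits and that the CPU count is a whole number.

```python
# A pod that lands in the Guaranteed resource class: requests equal
# limits for every container, and the CPU count is an integer, so the
# CPU manager can pin the container to exclusive CPUs.
guaranteed_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "latency-sensitive-app"},
    "spec": {
        "containers": [
            {
                "name": "worker",
                "image": "example.com/cnf-worker:latest",  # placeholder
                "resources": {
                    # requests and limits must match, and "2" is a whole
                    # number of CPUs, so both conditions are satisfied
                    "requests": {"cpu": "2", "memory": "1Gi"},
                    "limits": {"cpu": "2", "memory": "1Gi"},
                },
            }
        ],
    },
}
```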
When your application is single-threaded, or knows how to handle thread placement by itself, you can also decide to disable the kernel CPU balancing logic for the allocated CPUs. This results in a further reduction of the OS latency and is typically used for DPDK-like applications. The Performance Addon Operator installs a new runtime class into the cluster that allows controlling this functionality per pod. There will also be a feature to enable or disable interrupt processing per pod in the upcoming OpenShift version. Even though interrupts are typically not desirable on the dedicated CPUs, the application might need them if it uses the kernel networking stack for communication.

I have covered all the setup and workload slides, so I will now briefly describe the latency measurement tools to be used. There are many tools that can be used for measuring the latency of the system. Each has its own specific use, and typically all of them are used as the validation progresses. First, a measurement by hwlatdetect is needed to establish the baseline that is achievable by the bare hardware. There is no point in continuing to the software-based tuning if this number is not good enough; rather focus on firmware and hardware. After hwlatdetect gives us good results, we usually proceed with the tuning and then use cyclictest. That tool allows us to verify the timer latency. It schedules a repeated timer and measures the difference between when it should have triggered and when it actually did. This can uncover basic issues with the tuning caused by interrupts, process priorities, and so on. The last two tools, oslat being one of them, are useful for the final verification. They behave in the same way a CPU-intensive DPDK application would and measure all the interruptions and disruptions to the processing.

The hardest part about measuring latency is patience. Some latency sources only disrupt the measurements once per hour or once per day. Proper latency tuning validation takes a long time because of this. Our flow is to first run a short test to detect the easy tuning mistakes, but once those are corrected there is nothing left to do but leave the test running for longer and longer. Runs of 12 or 24 hours are not an exception. I will show a rough sketch of how such a round can be scripted in a moment.

You can see a shortened example of a typical latency measurement result on the slide. This is an actual result from one of our debugging sessions. There is a spike of 35 microseconds visible on the max latencies line. But the number alone does not give the full picture. In this case there were only three occurrences of higher than 20 microsecond latency in the roughly 100 million samples cyclictest collected over 15 hours. That might be good enough for your use case. Unfortunately, it was not for ours.

I am getting close to the end of my presentation and I would like to mention some common issues that you should know about. I already mentioned that the infrastructure pods count towards the isolated pool capacity. This is how upstream Kubernetes decided to implement the CPU manager, but it can be slightly confusing, so I am mentioning it again. You need to make sure your reserved CPU pool is large enough to pull the weight of the operating system. The kernel can grind to a halt if you don't do that.
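Before I continue with the known issues, here is the promised sketch of what one of those long validation rounds could look like when scripted, built around cyclictest. The CPU list, priority, interval and duration are assumptions for illustration; check the rt-tests documentation for the exact flags in your version.

```python
# A rough sketch of a validation round: run cyclictest pinned to the
# isolated CPUs and print the per-thread maximum latencies at the end.
# Needs root and the rt-tests package; treat the flags as illustrative.
import subprocess

ISOLATED_CPUS = "2-7"  # assumption: must match the profile's isolated set
DURATION = "12h"       # long runs are needed to catch the rare spikes

result = subprocess.run(
    [
        "cyclictest",
        "-m",                 # lock memory to avoid page faults
        "-q",                 # quiet run, summary printed only on exit
        "-p", "95",           # SCHED_FIFO priority 95
        "-i", "200",          # 200 microsecond timer interval
        "-a", ISOLATED_CPUS,  # pin measurement threads to isolated CPUs
        "-t", "6",            # one thread per isolated CPU in this example
        "-D", DURATION,
    ],
    capture_output=True,
    text=True,
    check=True,
)

# Each measurement thread reports a summary line with Min/Avg/Max values.
for line in result.stdout.splitlines():
    if "Max:" in line:
        print(line)
```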
Back to the known issues: a Guaranteed-class container will only have its dedicated CPUs to itself about 5 to 10 seconds after it starts. That time is configurable, but making it too low puts a higher load on the node.

Older tuning guides might tell you to use isolcpus. Please do not do that anymore. isolcpus on OpenShift causes issues, as all CPU process balancing is disabled on all the isolated cores. But the non-low-latency and infrastructure pods run there as well and will be affected by that, and that includes the OpenShift infrastructure itself. And the last remark is about kernel vs. DPDK networking. I already mentioned that one when speaking about the workload definition, interrupt processing and the runtime class.

There are two tips I would like to give you with regards to the process of tuning. Debugging latency issues is not trivial. There is a procedure you can use when you are truly lost. First disable all power saving modes in the kernel and then use the procedure described by Clark Williams from the Red Hat Real-Time Kernel team. This procedure will let you watch what exactly interrupted your process and where it came from. Once you figure out what needs to change in the tuning, our tools give you an option to add your own overrides. The description of how to do that is too detailed for this presentation, but you can find a step-by-step how-to in the link I added to this slide.

Please do not hesitate to ask any questions you might have about low latency and OpenShift. In case you would like to participate or just follow our work, the code is all public on GitHub. Thank you very much for listening, and if you have any immediate questions, please find me in the chat. Thank you.