Hello and welcome to Do Not Disturb Mode, that is, fine-tuning resources for latency-sensitive workloads. My name is Antti Kervinen. I work for Intel, and I've been working on resource management, especially on the CRI and OCI runtimes. Hello, my name is Peter Hunt. I'm a software engineer at Red Hat, working primarily on CRI-O, but also on things in SIG Node and other container-related technologies. We also want to send a special shout-out: this presentation and work was partially brought to you by Markus Lehtonen, who also works at Intel. He did the integration and design of the RDT work that we're going to describe a little bit.

To start off, we're going to talk a little bit about QoS, or quality of service, in Kubernetes. Since the beginning, Kubernetes has provided knobs to configure how different workloads are prioritized and how they're configured to run on a node. The QoS classes are derived from memory and CPU requests and limits, and they determine the OOM-killing behavior for different pods. This behavior can be further customized by CPU management policies to configure which CPUs workloads end up on, which allows users to configure CPU affinity so that there's no noisy neighbor problem among different workloads.

That still leaves a lot of other resources that exist on the node but aren't as easy to configure. This causes workloads to be disturbed when it would be better if they were given more access to those resources and were less latent. This work has recently been proposed in containerd as well, but in CRI-O 1.22 we have the capability of configuring two new resources. The first one is Intel's Resource Director Technology, or RDT. It is a way to configure cache and memory bandwidth, so you can think of it as finer-grained control over how the memory bandwidth and the CPU cache are allocated. Next, we have the capability of configuring block I/O classes. In a similar way to the QoS class, this is class-based configuration of the cgroup-granularity block I/O controller. Given a cgroup, you can configure how the scheduler is going to prioritize it: you can give it extra weight, or you can throttle that workload so that it can only use a certain amount of the block I/O.

So let's describe the situation before any of this QoS stuff. We have three workloads running on the same cluster. It's a very realistic scenario: 911, our emergency telephone service; a flower delivery service, which is an important site that needs to stay up because people want to be able to order their flowers at all times, but if it goes down or lags a little bit, no one's going to get hurt; and finally a file system scanner, which will look all over the file system and see what's out there. The scanner is something we want to run fairly regularly, but it's not mission critical; no apps or services will go down if it doesn't run, so it is our lowest priority.

By default, if we just run these on a Kubernetes cluster with no configuration of limits or memory or anything like that, this is what it's going to look like: the three workloads end up spread across the two CPU cores, and they argue with each other in the cache, causing things to get evicted back and forth. They also contend for memory bandwidth as well as for storage block I/O.
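As a concrete reminder of those built-in knobs: a pod's QoS class falls out of its CPU and memory requests and limits. Here is a minimal sketch; the pod name and image are illustrative, not from the talk:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: emergency-911                        # illustrative name
spec:
  containers:
  - name: app
    image: registry.example.com/911:latest   # placeholder image
    resources:
      requests:
        cpu: "2"        # requests == limits for every container, so the
        memory: 2Gi     # pod gets the Guaranteed QoS class; requests below
      limits:           # limits would make it Burstable, and no requests
        cpu: "2"        # or limits at all would make it BestEffort
        memory: 2Gi
```

With the kubelet's static CPU manager policy enabled, a Guaranteed pod requesting integer CPUs like this one also gets exclusive cores, which is exactly the mechanism used in the next step.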
So this is not great if we want our 911 process to be undisturbed. The first thing we can do, and this has been available since, like, 1.12 as beta, so it's fairly well tested now, is setting the static CPU manager policy. This allows the user to configure the CPU affinity each of the processes is scheduled onto. The 911 process gets its own dedicated CPU, and no other process gets in its way, allowing it to be less latent. The other two processes can go on the other core, where they can share and contend with each other, and it's not as big a deal. But they are still contending for all of those other resources.

Something we can do now with CRI-O 1.22 is configure the CPU cache behavior, so that the 911 process is given a specific cache configuration and can run a little nicer, with less contention around it. Further, we can configure the memory bandwidth of our file system scanner process, which is a lot less important: we can throttle its memory bandwidth so that it does not take up bandwidth that would contend with the 911 or flower delivery processes, which are much more important.

Next up, we have the capability of giving higher block I/O priority. By default, the block I/O weight is 100. If we give the 911 process a weight of 400, it has higher priority: it's given access to the block I/O scheduler more frequently and is allowed to do its things faster, which means it runs less disturbed. Finally, we can also throttle the block I/O, which allows us to specify a specific read and write rate that a certain cgroup is given access to. In this case, our file system scanner process is only given access to a certain rate, and this also frees up room for the flower delivery and 911 services. In other words, with all of these capabilities combined, we're giving 911 access to all of these resources in a way that leaves it relatively undisturbed, much like the cute puppy on this slide. 911 can do all of its normal activities and not worry about resource contention with other processes.

So next up, we're going to do a bit of a demo of the block I/O throttling.

Thanks, Peter. I'll start sharing my console, and now that we know what can be done on slides, let's see how it works on a virtual machine. In my virtual machine setup, I have a configuration file added for CRI-O which says that there is a separate block I/O config file where block I/O classes are defined; the file is under /etc/containers. If we check the contents of that file, there is a single class definition: a slow-reader class is introduced in this file, and the definition says that for all block devices that match this regular expression (this wildcard, actually), reading is throttled down to five megabytes per second. Now we are going to assign our file system scanner to this class. Let's see what this file system scanner pod's YAML actually looks like. In the annotations, you can find the blockio.resources.beta.kubernetes.io annotation, and I'm giving it the scope of pod, meaning that I want this annotation to apply to all containers inside this pod. I could also give a container name here to restrict this block I/O class to only one container. Okay.
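To make the demo pieces concrete, here is a rough sketch of the two files involved. The class file follows the goresctrl block I/O format that CRI-O reads, but the field names and the device wildcard here are approximate and from memory, so check the documentation for your version before copying:

```yaml
# /etc/containers/blockio.yaml, referenced from crio.conf via the
# blockio_config_file option; field names approximate
classes:
  slow-reader:
    - devices:
        - /dev/vd*          # wildcard matching the VM's block devices
      throttlereadbps: 5M   # throttle reads to ~5 MB/s
---
# The scanner pod opts into the class with a pod-scoped annotation;
# a container-scoped variant (.../container.<name>) also exists.
apiVersion: v1
kind: Pod
metadata:
  name: fs-scanner          # illustrative name
  annotations:
    blockio.resources.beta.kubernetes.io/pod: slow-reader
spec:
  containers:
  - name: busybox
    image: busybox          # runs the find/md5sum scanning loop
```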
And the block I/O class name is slow-reader. As for how this file system scanner works: there's one container running busybox, and inside the busybox it is running a while loop. Inside the while loop it looks for files under the scan directory, passes those files as command line arguments to md5sum, then sorts the result and saves it to a file, and the file is diffed against the previous md5sums. That way we can track what has changed in the file system and what the checksums are for all these files. The most important thing here is actually to know who is reading the block devices: it is the md5sum process that reads the files. Maybe for completeness I'll quickly show what is under the scan directory: I have mounted /usr/bin and /usr/lib from the host file system as read-only mounts. Okay.

I think we are ready to launch this file system scanner. There it is. If we now check what the cgroups look like (I have a helper script digging out cgroup details), I can see that in the default namespace there is a file-system-scanner pod, inside it there is a busybox container, and for that container's cgroup there is a read bps device file defined for the block I/O controller. It has found two devices on this virtual machine, so two devices actually match this wildcard, and reading is now limited down to five megabytes per second for both of those devices.

Now the md5sum process really should be running, and if I check what's in the io file under /proc for the pid of md5sum, there is something really happening: the numbers are changing. If I take this nice little polling loop here, compare the numbers all the time to the previous ones we read, and make the output human readable, we can see that it's always five megabytes per second at most that the md5sum process is reading. Okay, so now we have seen that this actually works.

We can now continue, take a step back, and think about what we have actually introduced here on a bigger scale. I'll start sharing that tab again. Let's see. We have been defining the first steps for this kind of class-based QoS: we have enabled workloads to be annotated with QoS-related class names. There are block I/O classes and there are RDT classes, both for high-priority workloads and also limiting classes for low-priority workloads, which we want to make sure won't interfere too much with other, higher-priority workloads. The name of the class in the workload configuration is node independent: at the point where workloads are annotated, we don't have to care too much about the node hardware details, so we do not give any absolute numbers here. Those are hidden inside the per-node configuration files that are read by the container runtime, CRI-O in this example.

Why are we keeping these node configurations separate? If you think of block I/O and compare a spinning hard drive to an SSD, or to non-volatile memory like Optane, the performance characteristics of those block devices are completely different. The values for how you would like to throttle this kind of file system scanner on a physical hard drive versus non-volatile memory could be totally different. But you might still want this kind of heterogeneous cluster where you really have different kinds of nodes, and it would be very nice to be able to schedule this kind of configuration, the scan-file-system workload, on any node, so that it won't disturb the higher-priority jobs on those nodes. The node-specific configuration that depends on the hardware characteristics enables exactly that: keep it simple in the workload, as in the sketch below.
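To illustrate that node independence, here is a sketch of how the same slow-reader class name could map to different absolute values on two different node types. The device paths, numbers, and field names are invented for illustration, and the exact schema should again be checked against the goresctrl/CRI-O documentation:

```yaml
# /etc/containers/blockio.yaml on a node with spinning disks (hypothetical)
classes:
  slow-reader:
    - devices:
        - /dev/sd*
      throttlereadbps: 5M     # a hard drive has little bandwidth to spare
```

```yaml
# /etc/containers/blockio.yaml on a node with NVMe or Optane (hypothetical)
classes:
  slow-reader:
    - devices:
        - /dev/nvme*
      throttlereadbps: 200M   # a much looser limit on a fast device
```

The workload annotation stays exactly the same on both nodes, so the scanner can be scheduled anywhere without knowing the hardware details.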
Then a couple of words about QoS in the future. This is what we have gotten up and running so far, and what we are doing is using the annotations, as you saw. This is good, but only good for getting this working now; obviously it is not a long-term solution, so there needs to be some cleaner way to do class-based QoS in the future. On the other hand, in Kubernetes there are the resource numbers for CPU and memory, and they currently imply the whole QoS class for the container and for the pod. That QoS class implies things like how to configure the out-of-memory killer and whether or not the pod should have exclusive CPUs. If you think about where the hardware is going, there are going to be different memory types: you might have high-bandwidth memory, you might have low-latency memory, and there might be different CPU core speeds, so you would like to somehow state for your workloads that this workload would prefer, say, a slow low-power CPU, while another workload would really need high-performance cores. These kinds of things, I believe, can no longer be encoded into these two numbers of what is requested and limited for CPU and memory.

So, just as food for thought about the future of QoS, I'd like to sketch how this could be made more explicit; see the sketch after this wrap-up. Of course we could keep these numbers for limits and requests as we have them now, and keep the default behavior unchanged so that everything would be backwards compatible. But in order to make this more future proof, one option could be to introduce a kind of QoS section under container resources, where we could give the RDT and block I/O classes that say how we want this workload to be run. And why not, at the same time, enable overriding the implicit assumptions that we have had for the out-of-memory killer or CPU exclusiveness, and maybe at some point for memory speed and latency; these kinds of things could then be stated explicitly. And if you have this kind of QoS section also at the pod level, not only at the container level, then we could also affect the assumptions that we have for pod-level QoS, like eviction: when the kubelet notices that node resources are running low, it starts to evict pods that are not high priority. Why not make that explicit as well, so that users can configure it as they want, and not only imply it through the numbers for CPU and memory?

Anyway, that was just a thought for the future; let's get back and wrap up what we have actually been doing. We have been tuning workloads. We have seen that there is already a good way to say that a workload needs exclusive CPU cores, but now, with these new extensions, we can also give workloads exclusive CPU cache, and we can throttle memory bandwidth if we do not want other workloads to take it away from high-priority workloads. Similarly, we can configure block I/O priority for high-priority jobs, and again we can throttle the block I/O bandwidth of low-priority jobs. As a result, we can see our most important workloads relax a bit and not be disturbed by other workloads. So thank you very much. Here are the links to the CRI-O project and to the RDT and block I/O work, and now I think it's time for questions.
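For reference, here is one way the hypothetical QoS section sketched above might look in a pod spec. To be clear, this is not an existing Kubernetes API; every field under qos below is invented purely to illustrate the idea:

```yaml
# Purely hypothetical: none of the "qos" fields below exist in Kubernetes.
apiVersion: v1
kind: Pod
metadata:
  name: emergency-911
spec:
  containers:
  - name: app
    image: registry.example.com/911:latest   # placeholder image
    resources:
      requests:            # existing numbers stay, for backwards compatibility
        cpu: "2"
        memory: 2Gi
      limits:
        cpu: "2"
        memory: 2Gi
      qos:                 # hypothetical explicit per-container QoS section
        rdt: high-priority          # RDT class stated directly
        blockio: high-priority      # block I/O class stated directly
        oomKill: avoid              # override the implicit OOM-killer setup
        exclusiveCpus: true         # override implicit CPU exclusiveness
  qos:                     # hypothetical pod-level QoS section
    eviction: last         # make eviction priority explicit, too
```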