So welcome everyone, good to see you here. I'm Markus Lehtonen from Intel, working as a cloud orchestration software engineer there, mostly on the resource management area in Kubernetes and the container runtimes. Hello everyone, thanks for joining. My name is Peter Hunt. I'm a senior software engineer at Red Hat working primarily on CRI-O, but sometimes on the kubelet, SIG Node things, Podman, and other container-runtime-related technologies. Today we're going to be talking about class resources: Kubernetes' fastest way of shushing noisy neighbors. I don't think we really need to argue about this too much, but in Kubernetes we don't expect all workloads to be treated equally, and there's a native construct in Kubernetes that represents this: QoS, quality of service. QoS classes are currently derived from the CPU and memory limits and requests a pod specifies. Another mechanism that's been in Kubernetes since v1.19, which does QoS in spirit even though it's not called that, is CPU management, which lets you pin certain containers to certain CPUs. That allows you to further customize quality of service, even though these aren't formally called QoS classes either. But ultimately there are more resources on a node whose quality of service needs guaranteeing, and that's the issue we're looking at solving. So the overall mission is to improve the quality of service of applications by enabling controls that don't fit into the current Kubernetes resource model. The three we're originally targeting, and that we'll be talking about today, are cache, memory bandwidth, and disk I/O. The ultimate plan is to add a fundamental resource type to Kubernetes that lets us express those as well as future resource types.
So, the properties of QoS-class resources that we're going to describe today, and that we'll eventually be trying to add to Kubernetes: first, a request is a class identifier instead of an amount of capacity. Currently, Kubernetes QoS is derived from the amount of CPU and memory a pod wants, but we want this new resource type to be opaque to Kubernetes and instead be specified in the container runtime. So for a container you would say: I want QoS resource X for this pod to be class A. In addition, we expect multiple containers to be able to go into the same class; in this example, containers one, two, and three are all in class A. For container four we want some different treatment, so we give it a separate class. And finally, we want an enumerable set of classes, so you can have any number of classes representing the resource you're trying to express. As I mentioned earlier, we're going to go through three different resources that we're looking to express with QoS-class resources to start. The first one is cache allocation. In Linux, you can use the resctrl filesystem interface, which lives under sysfs, to specify cache allocation, and this is already inherently class based: there's a name for each of the different cache allocations you'd assign to different processes. Ideally, what we would get out of this is the ability to hide some hardware details from the user. Caches are a good example of what we're trying to represent with classes, because there are M classes, that is, groups of cache allocations, but N applications or pods or containers that actually fit into those classes. Another example is block I/O. Currently, block I/O is going to be pretty hardware dependent, even though it's specified through cgroups.
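To make the class-based nature concrete, here is a sketch of what a runtime-side cache-allocation configuration can look like, loosely modeled on the goresctrl-style RDT config consumed by CRI-O and containerd. The class names and percentages are invented for illustration, and the exact schema should be checked against the runtime documentation:

```yaml
# Illustrative RDT (cache allocation) class configuration for the
# container runtime; schema loosely follows the goresctrl RDT config.
partitions:
  shared:
    # One partition covering the whole L3 cache.
    l3Allocation: "100%"
    classes:
      gold:
        l3Allocation: "100%"   # gold may use all cache ways of the partition
      silver:
        l3Allocation: "60%"    # silver is capped at 60% of the cache ways
      bronze:
        l3Allocation: "30%"    # bronze is capped at 30%
```

Pods then refer to these classes purely by name; the percentages and the underlying resctrl details stay hidden on the node.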
Different nodes are going to have totally different hardware, and representing that in the Kube API might be complex or challenging. That is an appealing aspect of the opaque nature of these class resources: only the container runtime has to be aware of the differing hardware, and since the runtime configuration is node specific, the different hardware of different nodes can each be accurately represented. Block I/O specifies throttling parameters per device: here's the OCI runtime spec fragment that would describe block I/O, and on the left is what actually gets written into the cgroup hierarchy. Now I'm going to walk through an example of what we imagine the value of this feature could be. Imagine a very realistic scenario in which we have an emergency alarm system that will go off if there's a natural disaster; we want that to be very fast and reactive so we can get people safe quicker. And then we have a rock band website that handles the tour dates and all the tickets for a popular rock band nearby. If you didn't try very hard to separate those two, it would look something like this: both of them share most of the resources, memory and CPU. They may get differing amounts, but they'll end up on the same CPUs, and if something is happening on the rock band website, the emergency alarm might get interrupted. That's not very good; we want the emergency alarm to be isolated. Now, in Kubernetes today, we can represent this with the static CPU manager policy and have them separated onto different CPU cores. That's a little bit better.
And we can have the QoS classes be represented in the limits and requests of the pods. But that still doesn't give us all that we want: there are all these other resources we've mentioned, plus others, that need to be divided up, and thrashing on the rock band website could still cause interruptions for the emergency alarm. So for example, with the class resource feature we could give the emergency alarm an exclusive cache with RDT, which would allow it to be more isolated from the rock band website. We could also throttle the memory bandwidth of the rock band website, so even if a tour has just been announced, the spike in traffic won't take memory bandwidth away from our emergency alarm system. We can also give the emergency alarm system block I/O priority, giving it a higher weight so it's able to get the block I/O resources it needs, and do the opposite for the rock band website, throttling it so it's not able to use too many resources. Ultimately, what we get with this kind of configuration is a situation in which our emergency alarm system is finally able to get some peace in its multi-tenant Kubernetes cluster and get the resources it needs, so it can help the people that need helping and not be interfered with by those pesky rock bands. Now I'm going to describe a little of what we currently have available for this, and actually in Kubernetes itself it's not really there. We have some support in the container runtimes: both CRI-O and containerd support using resctrl to control the cache and memory bandwidth, as well as block I/O using the blockio cgroup controller. But it's only in the runtime, so you basically use a pod annotation to specify it.
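The blockio classes mentioned here are likewise defined in a runtime-side config file. Here is a sketch in the style of the goresctrl blockio configuration; the device path, weights, and throttling limits are made-up examples, so check the runtime documentation for the exact schema:

```yaml
# Illustrative blockio class configuration for the container runtime;
# schema loosely follows the goresctrl blockio config.
classes:
  high-priority:
    - weight: 400              # high relative I/O weight
  normal:
    - weight: 100
  low-priority:
    - weight: 50
      devices:
        - /dev/sda             # example device; throttle applies here only
      throttleReadBps: 50M     # cap reads at ~50 MB/s on that device
      throttleWriteBps: 10M    # cap writes at ~10 MB/s
```

The runtime translates these named classes into the per-device cgroup blockio parameters, so the hardware-specific numbers never have to appear in the Kube API.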
You use a config file in the container runtime that defines the classes for that specific resource, and then you use an annotation to specify the class for the pod; the container runtime interprets that and acts on it. This works, it functions, and we can get this kind of isolation, but it's not the best user experience. For one, it's not at all related to Kubernetes, so there's no documentation there; it's really only for early adopters, people who are intimately familiar with the feature. It's also a pretty poor user experience in itself: we're using an annotation to specify something that's applied three or four levels down the stack, depending on how you look at it. There's no visibility into what resources are actually available on which nodes, and no support in the scheduler to decide which pods should go to which nodes depending on the classes they want; you just have to know what there is and where to put it. So it's not user friendly at all. Now I'm going to switch over to Markus to describe the future we would like to get to. So yes, let's get to the deeper technical design of our proposed enhancement in Kubernetes. First, I'll describe the control flow we've envisioned for QoS-class resources in the complete solution. Here we have a simplified view of a Kubernetes cluster, with the API server and scheduler from the control plane, one node depicted in this figure, the kubelet and the container runtime running on it, and "system" representing the operating system and its services. Basically, everything starts with the container runtime initializing or discovering the QoS-class resources that are available on the system.
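Before going deeper, here is roughly what the annotation-based UI just described looks like in practice. The annotation keys below follow the pattern used by the CRI-O and containerd support, but they are paraphrased and should be double-checked against your runtime's documentation and version:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: emergency-alarm
  annotations:
    # Runtime-interpreted class annotations (keys are approximate;
    # verify the exact names against the runtime documentation).
    rdt.resources.beta.kubernetes.io/pod: gold
    blockio.resources.beta.kubernetes.io/pod: high-priority
spec:
  containers:
    - name: alarm
      image: registry.example.com/alarm:latest   # placeholder image
```

Kubernetes itself ignores these annotations entirely; only the container runtime on the node reads them, which is exactly the visibility and scheduling gap described above.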
It might be something the container runtime actually configures itself, or it might in some cases be pre-configured by the node or cluster admin, in which case the runtime only discovers what is available on the system. Either way, the runtime gets the information about the available QoS-class resources from the system. Then it hands that information over to the kubelet, which in turn updates the node status object on the API server. So we get the information there, in the node capacity, that these three QoS-class resources, A, B, and C, are available, along with the specific classes of each resource type. Then a pod is created on the API server with some specific requests for QoS-class resources; in this case, the pod requests class gold of resource A and class high-priority of resource C. The scheduler picks up that pod from the API server and does its normal node filtering and fitting, trying to find a suitable node that can satisfy the QoS-class resource requests in the PodSpec. It finds that node X can actually satisfy them: class gold of resource A and class high-priority of resource C are available on node X, so it schedules the pod to that node. The kubelet picks up the pod, reads the QoS-class resource requests from the PodSpec, and hands that information over the CRI API to the runtime. Lastly, the runtime enforces the QoS-class resource assignments on the container processes on the system. So that's how we envision it working. One key idea in our proposal is to make the life of Kubernetes as easy as possible by making QoS-class resources as opaque to Kubernetes as possible. The configuration and management of QoS-class resources is handled by the container runtime, so Kubernetes doesn't need to know anything about the implementation details of each of these QoS-class resources.
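To illustrate that flow, the node status published by the kubelet could carry something like the sketch below. The field names are hypothetical, since the exact API shape is still being worked out in the KEP:

```yaml
# Hypothetical node status fragment after the kubelet has published the
# classes discovered by the runtime; field names are illustrative only.
status:
  qosResources:
    - name: resource-a
      classes: [gold, silver, bronze]
    - name: resource-c
      classes: [high-priority, normal]
```

The scheduler would only need this list of opaque names to decide whether a pod requesting, say, class gold of resource A fits on the node.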
Kubernetes basically knows what types of QoS-class resources exist and which classes of each resource type are available on which nodes, but that's it. It doesn't need to understand any more than that, such as what a specific resource type or class name actually means; it knows resource type names and class names, but not much more. This would allow easy implementation of new types of QoS-class resources without any changes in the Kubernetes API or Kubernetes components. One use case we have in mind, for example, is for a vendor to be able to implement specific QoS-class resources for their needs: a cloud service provider could write controls for, say, disk or storage I/O, or network priority. All in all, we're trying to come up with a generalized mechanism that allows simple addition of new QoS controls into Kubernetes in a future-proof way. The scope of the KEP is currently two-fold. The first part is the CRI API between the kubelet and the runtime: we want the kubelet to be able to communicate the user's QoS-class resource assignments to the runtime, and, in the other direction, the runtime to be able to tell the kubelet what is actually available on the node. We also want to support in-place updates of the QoS-class resource assignments of running containers. And currently we envision having an initial user interface using pod annotations before the PodSpec changes land in the Kubernetes mainline, a similar annotation-based UI to what is currently available in the runtime-only approach. The second part is then the Kubernetes API. We want to extend the PodSpec to have dedicated fields for these QoS-class resource requests, and we want to update the node status as well to show what is available on the nodes. This serves two purposes: first, visibility to users, so they can see what is available on which nodes.
Second, it's an enabler for the kube-scheduler to do the right thing, that is, to find a node that can satisfy the requests of the pod. The third piece of the Kubernetes API changes we have at the moment is permission control over the available QoS-class resources, by extending the resource quota mechanism that already exists in Kubernetes. We think it would be good to split the implementation into multiple phases. First of all, everything below the runtime, between the runtime and the system, is implementation detail and out of scope of the KEP. The first implementation phase would be just the CRI API between the kubelet and the runtime, plus interpretation of the special pod annotations in the kubelet as the initial user interface. Everything in the Kubernetes API and the control plane components would then be implemented in future phases. With the KEP fully implemented, the user interface and experience would be a lot better than with the runtime-only approach we currently have. Everything starts similarly from the runtime configuration, in this case an example of cache management with resctrl: the same sort of thing, three classes there. But with the runtime and kubelet support we're able to update the node status and see what is available, and the PodSpec looks a lot cleaner as well; we can do proper input validation of the fields, for example. So next we'll have a short demo of a proof-of-concept implementation that we currently have. This demo will demonstrate the full, complete solution, with the scheduler support, resource quota and everything. It's prerecorded; I don't trust the corporate VPN and Wi-Fi and all that. But it's almost live: it was recorded yesterday. In this demo we have a simple single-node cluster with our proof-of-concept code, based on fairly recent versions of Kubernetes and the CRI-O container runtime.
I'll quickly show the container-runtime-side configuration for the cache allocation and blockio, already familiar from the previous examples. We configure three classes for RDT: gold, silver, and bronze, and then similarly three classes for blockio: high-priority, normal, and low-priority. First I'll show the annotation-based UI, which is not much, but anyway. In this case we have one pod with two containers, and we use these special annotations to set, for container one, RDT class gold and blockio class high-priority, and for the second container, which could be our rock band site, RDT class bronze and blockio class low-priority. We create the pod, see that it's running, and then we can verify from the resctrl pseudo-filesystem under sysfs that some PIDs of our containers were actually assigned to the classes we requested. The silver class is empty because nothing was assigned there. We could do the same for blockio, but it's a bit of a hassle to find the correct cgroup filesystem paths, so we just trust that the blockio parameters were also applied correctly. Next, we delete the pod and see that nothing is left in the tasks file, so our container was removed from there. Next, we'll see what it looks like in the complete solution. Looking at the node status, we can see that we have these types of QoS-class resources available on the node: for blockio, classes high-priority, low-priority and normal, as we configured; the same for RDT: bronze, gold, and silver. In this demo we also have one stub QoS-class resource that does nothing at all, just to demonstrate a third possible resource. Then we can take a look at how the PodSpec would look. Here we have dedicated fields for the QoS-class resources, an example corresponding to the one with the annotations, but now we just use dedicated fields in the PodSpec for it.
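As a sketch, those dedicated fields could look roughly like the following. The field names are invented for illustration and are not the final KEP API:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
spec:
  containers:
    - name: cnt-one
      image: registry.example.com/app:latest        # placeholder image
      resources:
        # Hypothetical dedicated fields for QoS-class resource requests;
        # illustrative only, not the final KEP API.
        qosResources:
          rdt: gold
          blockio: high-priority
    - name: cnt-two
      image: registry.example.com/band-site:latest  # placeholder image
      resources:
        qosResources:
          rdt: bronze
          blockio: low-priority
```

Compared with annotations, fields like these can be validated on admission and are visible to the scheduler and other control plane components.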
We get that running, see that it's up, and then again we can take a quick look at the resctrl filesystem: we actually got new PIDs in the tasks file, so the cache allocation was applied as we requested. The next part demonstrates the scheduler support. Here we have a PodSpec with a non-existent RDT class, bar, and also a resource that doesn't exist on the node, called dummy3. We apply that, and if we look at the pod status we can see that dummy3 doesn't exist and RDT class bar is unavailable on any node, so this pod cannot be run. If a node with this dummy3 resource and an RDT class bar appeared in the cluster, then the pod would get scheduled, of course. Lastly, we'll demonstrate the resource quota and how that looks, or would look. Here we have a ResourceQuota spec extended with QoS-class resources: for RDT we only allow class bronze; for blockio we allow classes normal and low-priority; and for this dummy1 QoS-class resource we allow the usage of classes B, C, and D. We apply the quota and then look at the status of that quota object, and we can see that the limits we put in place are now actually enforced: for RDT only bronze is allowed, and for blockio, normal and low-priority, as we wanted. Now, if we try to create a pod with a disallowed QoS-class resource, in this case RDT class gold, which wasn't allowed in the ResourceQuota spec, the creation fails because RDT class gold was not allowed. Then we modify the PodSpec a bit, change the RDT class from gold to bronze, and the pod can be scheduled without any problems. So everything works fine. That was the demo; let's continue with the slides. Okay, yeah, handing it over to Peter. Thank you, Markus. So I'm going to talk a little bit about where we're at and where we're going.
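For completeness, the extended ResourceQuota from the demo could be sketched like this; the syntax is invented for illustration and is not the final KEP API:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: qos-class-quota
  namespace: default
spec:
  # Hypothetical extension restricting which QoS-resource classes this
  # namespace may request; illustrative syntax only.
  qosResources:
    - name: rdt
      classes: [bronze]
    - name: blockio
      classes: [normal, low-priority]
    - name: dummy1
      classes: [B, C, D]
```

Note that, unlike normal quota, this restricts which class names may be used rather than counting amounts, matching the unaccountable nature of QoS-class resources.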
So, the current status of the KEP: when we originally submitted this talk we had hoped the KEP would have been merged, but it's still under review. We're working through some details, specifically trying to figure out which parts go into which phases. We're targeting 1.27 for this work. You can try it out now with just the container runtime annotation version if you'd like. Some open concerns we have are the use of annotations in phase one: whether the kubelet is going to become annotation aware and pass the proper CRI object down to the container runtime, or whether we go for full pod API support in phase one, in which case the API server would pass the resource down to the kubelet directly and the kubelet down to the CRI; and then some small API details here and there. As for future work past that: we can imagine a world in which maybe we'd want to make explicit the pod QoS class that exists today, which is currently inferred from the resources, the CPU and memory limits and requests. Maybe one day we'd want it to be explicit, actually say "QoS class is burstable", and give the container runtime that information. We also want to possibly implement new types: we've only talked about the three types that are in scope now, but eventually you can imagine maybe network bandwidth, or swap, or maybe there's some special fancy hardware with some special fancy resource that would want to be divided up between containers and pods, so we have that in mind as well. If you're interested in getting involved, through reviewing or testing or even contributing, you can check out KEP-3008 and help out over there. Other than that, I'd like to thank everyone for joining us and ask if there are any questions. Do we want to pass around? Hey, thanks for the talk.
I'm curious whether there's any consideration of doing something out of the box that allows for basically fair sharing of these resources among the pods running on a given node, rather than configuring quality-of-service classes. Let's say all the pods running on a node have the same quality-of-service class: is there something in place that would throttle the pods? Could you elaborate on that a little more, on setting limits and how that throttling would happen rather than just prioritization? So I think it depends on the resource, right? Different resources handle having multiple processes within them differently. The focus of this is designating priority, but theoretically, if there were multiple processes within, say, one block I/O weight class, they would have equal block I/O weight relative to each other, while anything with a higher weight outside that class would be treated differently. I have a question. One thing I wasn't sure about is who is responsible for managing allocation of the resource, including recycling: for example, when the pod is gone, the resource needs to be counted back. Traditionally, I think the current situation is that the kubelet is the one allocating resources on the node; in your KEP, is that now the runtime's responsibility? Yeah, basically the runtime handles that. In that case, normally the runtime, for example containerd, doesn't really have this view of the node, right? So does it now need to have a persisted view of the resources on the node? I don't know, I feel like it might actually be the admin who has the responsibility of balancing out the resources, because they're the one that passes down the configuration. Yeah, because there's no automatic reconciliation the container runtime can do, to, say, pull some other pod into a class when another pod within that class disappears.
So the balancing, I think, has to be done by the admin. Does that answer your question? Yeah, but who is the admin? The person writing the pod spec, or the person running the Kubernetes cluster? The cluster admin would have to make sure there are enough classes for the pods underneath to be designated into, but I think it's up to the pod author choosing the resource to make sure things are balanced. Am I misinterpreting the question? I might be. I guess I'm a bit confused about the scheduling example, right? In that case, the scheduler needs to be aware of how many pods, for example of a given class, can fit into a node, so Kubernetes actually needs to report the status relatively accurately. Yes. Yeah, okay, now I've got the question. So currently that kind of accounting of these classes is actually out of scope of the KEP; maybe that's a possible future improvement in this area. But it's left out of the scope of the KEP currently to have any kind of accounting of how many pods are actually assigned to a certain class, for example. Okay. One reason is also to keep this simple and not confuse people, because we are talking about unaccountable resources, only class identifiers. But yeah, that's a good question, probably a future improvement in this area. Okay, thank you. Yeah, I have a question related to using Intel RDT to partition the cache and bandwidth. After partitioning the cache, is it possible that even when the CPU is idle, a process can still only use its limited share of resources, so that it sees some performance regression? So yeah, basically, at the moment the resctrl filesystem works so that if you limit the cache available for some class, then even if the CPU is idle, processes in that class are only able to use the slice of cache that was actually allocated. Idle cache isn't handed out.
Okay, so currently, is there any solution from Intel that can burst the usage from, say, 30% to 100% if the CPU is idle? Is there technology to support that today? From the point of view of this KEP, that's an implementation detail of the underlying technology, so I'm not aware of what is currently happening in that space. But yeah, regarding the scope of this KEP, it really is an implementation detail of that technology. Thanks. Yeah, a good question as well. So I think there's a similar problem to one we face today: the autoscaler may want to spin up a new node, and it may not know what your requirements are, so it spins up the wrong node that doesn't have the class you need, you cannot schedule the pod, and then it's like a deadlock. Do you think there's a common solution coming that can say the pod needs this, and the node that gets spun up should be of the particular type that has the classes you need? Yeah, that's not stated in the KEP at the moment, so I guess that's also part of a future implementation phase, to really figure out how the autoscaler would work correctly. And also the upcoming in-place pod vertical scaling is something that isn't stated here or in the KEP currently, but it's also something we want to be able to support. Thank you. Alright, I think we're totally out of time, but thank you both. Thank you everyone. Thank you.