Today Dixie and I, Antti, get to talk about CPU affinity: deciding which set of CPUs should run which set of containers, what it buys you, and how it can be done. When our presentation proposal was accepted, I asked Dixie whether she had a workload we could use to demonstrate CPU affinity, and she sent me her Dockerfile for TensorFlow AI model training. We ran that workload on a server with eight NUMA nodes, so eight sets of CPUs, each with its own local memory, and that is exactly the kind of machine where it matters which CPUs a workload is allowed to run on.

What we measured was AI model training throughput as we ran more and more training containers in parallel on that node. We compared three cases: one where we do no CPU pinning at all, so every container sees all the CPU nodes; one where each container is pinned to four CPU nodes; and one where each container is pinned to only two CPU nodes. With just one container running there is not much difference between the cases. As we add more containers, first eight and then fifteen in parallel, the differences start to show, and finally at the limit, with twenty containers running in parallel, the case with no CPU pinning, where every container can see all the CPU nodes, is clearly the worst. There is also a visible difference between the pinned cases: pinning each container to two CPU nodes gives somewhat better throughput than pinning it to four. That difference is significant, not as small as you might expect, so it is worth asking where it comes from.

So, first: data locality. We are allocating CPUs so that we bind these containers close to some memory node. All the memory accesses then go to the most local memory, and that already improves performance. Another point: cache hits. At each cache level there are fewer users, fewer processes that use the cache, that pollute the cache, in other words. They also avoid far-away cache invalidation, where a CPU over here would be reading memory while a CPU over there writes to the same memory, and that write invalidates the cache even on a different socket. Yet another reason is CPU frequency. Now that we are packing the workload onto a smaller set of CPUs, it is more likely that some CPUs are actually idle, and those idle CPUs can donate their power budget to the CPUs that are busy. That means the busy CPUs will automatically run at a higher CPU frequency; you don't have to do anything for that to happen.
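The talk does not show the exact commands behind this trial, but as a rough illustration of the kind of pinning being compared, here is how one training container could be bound to the CPUs and memory of two NUMA nodes using plain Docker cpuset flags. The image name tf-train:latest is a placeholder, and the CPU ranges depend on what your machine reports:

```sh
# Inspect the topology: on the machine described in the talk this would
# report 8 NUMA nodes, each with its own CPUs and local memory.
numactl --hardware

# Unpinned case: the container sees every CPU and every memory node.
docker run --rm tf-train:latest

# Pinned case: restrict the container to the CPUs and memory of NUMA
# nodes 0 and 1 (the CPU range 0-31 is machine-specific).
docker run --rm --cpuset-cpus=0-31 --cpuset-mems=0-1 tf-train:latest
```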
Finally, workloads can behave smarter when they don't see the whole system and don't think that they actually own the whole system. Some frameworks and runtimes may be overwhelmed: oh, we have 128 CPUs, that's a lot, let's start a new thread for each CPU and take advantage of all of them. Imagine that you are running 20 workloads like that on your system, and you have 20 competing threads for every CPU core. When we split the CPUs into smaller sets, this doesn't happen.

So now, taking a step back from this single performance trial and trying to form a bigger picture of what we are actually talking about: we are dealing with the Kubernetes worker node scope. We want to do CPU affinity for workloads of all QoS classes: guaranteed, burstable, and best-effort. Our goals are both performance and isolation, and we are doing this by introducing a highly configurable resource policy for assigning CPUs to containers. This talk addresses well-known problems like data locality, noisy neighbors, device locality, and CPU attributes. We are also going to allow CPU frequency configuration for containers. And there is a lot of room, and actually a lot of requests, for all sorts of application-specific tweaking. To mention one: there was someone who wanted to run virtual machines in their containers, each container having four CPUs, but in such a way that they could run more virtual machines by taking those four-CPU sets and actually running two containers on each set, so each set of four CPUs is shared by two containers. This is beyond the default, so now let's start from what the default is and how we could go beyond it.

Thank you, Antti. I'm going to talk about the different things we use to go beyond the default, and how it compares with the default. If you have a pod spec YAML today and you apply it using kubectl, the control plane determines that there is a need to create some containers, and the scheduler finds appropriate nodes that have the capacity to run your workload. After that the kubelet is in charge of making sure the containers are created. The kubelet takes the resource requests and limits from your pod spec and communicates with the container runtime using the gRPC protocol specified in the Container Runtime Interface. From there the resources are translated into the OCI spec, the OCI Linux resources spec, and forwarded to runc, which is responsible for eventually creating the containers, writing the resource spec, and mapping it to the various cgroup files. So that is the default path for the pod spec we used for our use case.

Now, what we wanted was more granular control over the CPUs that run our workloads. Kubernetes does have a CPU manager today, but the policies that would let us manage CPU set allocation for our workloads at a more granular level are kind of work in progress right now. So we used NRI plugins to get something beyond the default. NRI plugins sit alongside the container runtime layer: they intercept container lifecycle events, make adjustments according to the resource policy, and enable you to specify a custom way to allocate CPU resources. This is what the balloons policy spec looks like for our use case: as you can see, there is an option that says prefer spread on physical cores (see the sketch below).
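For reference, here is a minimal sketch of the two artifacts just described. Neither is the speakers' exact file: the pod spec is a generic example of the requests and limits that flow down the default path, and the policy is a plausible custom resource based on the config.nri/v1alpha1 API published by the containers/nri-plugins project; field names such as preferSpreadOnPhysicalCores, preferNewBalloons, and namespaces follow that project's documentation and may differ between versions:

```yaml
# A plain pod spec: requests/limits travel kubelet -> CRI -> OCI -> runc -> cgroups.
apiVersion: v1
kind: Pod
metadata:
  name: train
spec:
  containers:
  - name: train
    image: tf-train:latest     # placeholder image
    resources:
      requests:
        cpu: "4"
        memory: 8Gi
      limits:
        cpu: "4"
        memory: 8Gi
---
# A balloons policy in the spirit of the one shown in the talk.
apiVersion: config.nri/v1alpha1
kind: BalloonsPolicy
metadata:
  name: default
  namespace: kube-system
spec:
  reservedResources:
    cpu: "1"                           # CPUs kept aside for system workloads
  balloonTypes:
  - name: training
    preferSpreadOnPhysicalCores: true  # avoid sibling hyperthreads
    preferNewBalloons: true            # new balloon per workload, for isolation
    namespaces:
    - "*"                              # all workloads comply with this policy
```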
Today the CPU manager in Kubernetes doesn't have this option; it's being worked on and will be available in 1.31. But since it's not available as of now, we used the balloons policy in NRI plugins, where we have this option to spread the CPU set allocation across different physical cores rather than on the same cores. We also specified prefer new balloons. This helps us achieve some level of isolation, since this configuration creates new balloons instead of placing our workloads on the existing CPUs. Then we have namespaces star, which means that all the workloads running on the node comply with this balloons policy. There is also an option to specify annotations at the pod spec level, in case you want that for your use case, but for our use case at least, all the workloads we run comply with this balloons policy.

Let's walk through how we install the NRI plugin; the exact commands are sketched a bit further down. The first command adds the NRI plugins repository. The second one is the helm install: it installs the balloons policy daemon set and the CRD, and takes care of whatever is required for the balloons policy plugin. With patch runtime config set to true, it also enables the NRI functionality at the container runtime level. This functionality is enabled by default on containerd 2.0 and later; it is available in 1.7, but only enabled by default in 2.0. And the default balloons policy plugin can actually be live-tuned and edited per your needs; we used kubectl edit to adjust the balloons policy for our needs.

So what we are trying to address is the need to specify the CPU sets for workloads at a granular level. Different workloads have different use cases, and a user might want to run one set of workloads on particular CPUs and another set of workloads on different CPUs, without any interference between them. We do not have a lot of different policies in the CPU manager today, and NRI plugins solve that problem. Now I'll walk through the details of the balloons policy plugin and dive deeper into it.

Here is what happens once you have installed the NRI balloons policy. The policy daemon set, of course, starts a pod on every node in your cluster, and each of these pods registers with the container runtime; CRI-O and containerd both have an NRI server for this. During registration the plugin tells the runtime which pod and container lifecycle events it is interested in, and the container runtime responds by telling it which containers are already running. So if there are containers on that node already, the plugin can start adjusting them without restarting them. Then, when new containers are scheduled and started on that node, the kubelet tells the runtime that we are now creating a new container, the runtime forwards that event to the NRI plugin, and the plugin does the assignment. In this case, when the first container comes in, NRI balloons can create a new balloon, which is a new set of CPUs, and assign the new container to that balloon, so that the balloon contains enough CPUs to satisfy the CPU requests of the container coming in.
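Picking up the installation steps described a moment ago, they might look roughly like this. This is a sketch, not the speakers' exact commands: the repository URL, the chart name nri-resource-policy-balloons, the patchRuntimeConfig value, and the balloonspolicies resource name follow the containers/nri-plugins project's documentation and may differ between versions:

```sh
# Add the NRI plugins Helm repository.
helm repo add nri-plugins https://containers.github.io/nri-plugins
helm repo update

# Install the balloons resource policy: deploys the daemon set and the
# BalloonsPolicy CRD. patchRuntimeConfig=true patches NRI support into the
# container runtime's configuration (needed on containerd 1.7; NRI is
# enabled by default from containerd 2.0 on).
helm install nri-balloons nri-plugins/nri-resource-policy-balloons \
  --namespace kube-system \
  --set patchRuntimeConfig=true

# Live-tune the default policy later, as the speakers did.
kubectl edit -n kube-system balloonspolicies default
```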
And when again a new container is created in addition to the previous one, the balloons policy can inflate the balloon by adding one more CPU, so that it satisfies the CPU requests of both containers. Here I want to highlight that we are actually modifying the CPU set of a container that may already be running, so it gets more CPUs to use when we inflate the balloon. But not all workloads are happy with that: some workloads want to stick with the CPUs they first see when they are started, and in that case we can't increase and decrease the number of CPUs live. To address those, the balloons policy also has options for fixed balloons, which have a static set of CPUs, and there can even be balloons that are created without any containers being there yet. This is important for cases where you know that a node is going to run some workload; some edge cases, for instance, have situations with special hardware, where you want to ensure that you have CPUs reserved for your coming workload so that it can communicate very efficiently with that special hardware.

Some more options: balloons also let you have dedicated CPUs in balloons while sharing some idle CPUs, so that CPUs that do not belong to any balloon are shared with the workloads in these balloons. And you can specify the topology level at which these idle CPUs are shared: you can share idle CPUs in the same NUMA node, on the same die, on the same socket, or even across the whole system if you like. This gives you data locality when you specify the different topology levels there.

Okay, so let's compare the default and the beyond-default cases. The kubelet managers currently manage guaranteed workloads only, while our goal with balloons is that you can run containers of any QoS class in balloons. Another point: the managers currently give an exclusive set of CPUs per container, or you can have a node-level switch saying you want an exclusive set of CPUs per pod. In balloons we have both exclusive and shared CPUs, and you can define which containers should have which option. So for instance you can have database pods which contain two containers, one database container and one logger container, and give an exclusive set of CPUs to all the database containers while putting all the logger containers from every pod onto the same set of CPUs. Again, the managers provide static CPU sets which do not change during runtime, which is safe, of course. Balloons offers both static and dynamic CPU sets, so if you know your workloads are fine with being adjusted while running, go with it and take advantage of it. On topology levels, the managers support NUMA nodes, up to eight NUMA nodes; beyond that there are actually some algorithmic problems which prevent using many more. The balloons policy detects several topology layers: we know which cores are hyperthread siblings, we know NUMA nodes, and we know dies, which are basically something that could almost run like a separate CPU socket, except that you can pack several dies in the same package.
So a die is just a sub-package, I would say, and then there are full packages, of course. And as I mentioned, you can share idle CPUs on any level that you define in the configuration. In addition to that we have CPU frequency controls, and we are implementing some CPU power controls as well, so that you can define for the CPUs in some balloon how the frequency and power should be handled. We can do cache-level sharing at different levels, and we can do live tuning; let's jump to another example with the live tuning.

Here I'm editing the balloons policy named default, which is the default policy used on every node in the cluster, but you can actually also define policies for separate nodes, so that they have slightly different balloons configurations. I have here a balloon type named compress; this is my synthetic benchmark, with which I just want to demonstrate how you can do the configuring live. I'm saying that the maximum number of CPUs per balloon is three and the minimum number of CPUs per balloon is three, which means that whenever such a balloon is created, it will have exactly three CPUs, no more, no less. There is also the option Dixie already mentioned, prefer spread on physical cores: when we allocate these three CPUs for a balloon, we pick them from different physical cores, so that there are no sibling hyperthreads among them. That is the option I'm changing here in this live-tuning run.

So here's the baseline. I'm just reading from the random device, doing some base64 encoding, compressing, and doing it again, the synthetic benchmark I mentioned, and the baseline throughput looks like this: without the balloons policy, without any CPU affinity, with no limits on resources. When I apply the balloons policy with min CPUs one, it runs on only a single CPU and we get a very flat line, so very predictable performance, but the throughput is clearly below the baseline. When I add another CPU to this balloon, changing min CPUs to two, I immediately see an increase in throughput, and again we get a flat line, very predictable throughput, which on average is already above the baseline. Adding one more CPU, we start to see some variation. At that point I was wondering whether we get this variation because of frequency changes, so I actually set the minimum frequency for the CPUs in this balloon to the maximum available, but it didn't change the situation; the same fluctuation was still there. Which means the frequency was already topped out on the CPU cores being used. So I tried the other option, changing prefer spread on physical cores, and then I could see that the variation at least got a bit smaller, so it's more predictable in that case. This node didn't have much of any other workload to run, so this is an example of one of the cases I mentioned about where the extra performance comes from: it comes from the fact that you get higher CPU frequency when you have a few very busy CPU cores (the configuration used here is sketched below). So, would you like to conclude?
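As a sketch of what such a live-tuned policy could look like: the balloon-type fields (minCPUs, maxCPUs, preferSpreadOnPhysicalCores, shareIdleCPUsInSame, cpuClass) and the CPU controller's frequency classes are taken from the containers/nri-plugins documentation, but their exact names, placement, and units (frequencies in kHz here) are assumptions that may vary by version:

```yaml
apiVersion: config.nri/v1alpha1
kind: BalloonsPolicy
metadata:
  name: default
  namespace: kube-system
spec:
  reservedResources:
    cpu: "1"
  balloonTypes:
  - name: compress
    minCPUs: 3                        # every compress balloon gets exactly
    maxCPUs: 3                        # three CPUs: no more, no less
    preferSpreadOnPhysicalCores: true # pick CPUs from distinct physical cores
    shareIdleCPUsInSame: numa         # lend idle CPUs within the same NUMA node
    cpuClass: turbo                   # apply the frequency class defined below
  control:
    cpu:
      classes:
        turbo:
          minFreq: 3500000            # kHz; pin the minimum frequency high,
          maxFreq: 3500000            # as tried in the live-tuning demo
```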
Sure. So again, what we are trying to say is that there can be different use cases where a user would like their workloads to be placed, say, on disjoint CPU sets: there can be a workload that the user wants to place on one CPU set while another workload should not go onto that CPU set, for better performance, and there can be a workload whose CPU allocation should be close to the network device or the block device. Today we do not have this level of configuration options in Kubernetes. Some of them are being worked on, but we used NRI plugins, which are able to provide these options; we achieved some level of isolation as well and mitigated some of the problems we had around noisy neighbors. NRI is available in containerd 1.7, it's enabled by default in 2.0, and in CRI-O it's available since 1.26, so if you would like to try it out, feel free to. Next slide, please. We have also attached the YAML files that we used, for you to try at home if you like. And most importantly, if you would like to contribute to NRI, the different resource policies, or Kubernetes SIG Node, we have added the links here. The first two are for NRI; I have also added a couple more links which I will upload later. There are a bunch of open issues in Kubernetes: one is about adding a new CPU manager static policy option that helps you place your workloads across different physical CPU cores. There is another one that tries to address adding CPU policies that let you configure different cgroup options, so that different requirements for CPU allocation can be satisfied. And if you would like to learn more about what we are doing in the SIG, please attend the maintainer track session on Friday. Here is the QR code for feedback, and we are open for questions.

Thanks for a great presentation. Imagine that you have a lot of different deployments, thousands of deployments, and you have hundreds of Kubernetes nodes. Now you have to decide how to split them into balloons, how to choose the optimal sizes of the balloons, and how to understand all that in such a dynamic environment where you can have a lot of different deployments.

That's a very good question. I think that eventually it boils down to measuring: if you want to know how to really optimize, how to squeeze everything out of your node, you need to try it out. I can't tell you a magic solution for that. There are good defaults, though, and we can actually provide you with different policies. We are presenting only the NRI balloons policy here, but there is also an NRI topology-aware policy available, which is basically a zero-configuration policy. Here in the balloons policy you define different balloon types and the CPU settings for each, so there are a lot of configuration options which really help you squeeze out every single bit of performance. But with the topology-aware policy you just deploy the policy and enjoy the topology awareness: you get NUMA node alignment and lots of things for free, without bothering with the configuration. Thanks.

Hi, very good presentation; this seems very flexible. When you mention topology awareness, does it take device NUMA topology into account, if I want to schedule onto the same NUMA node that my NIC is attached to, for example?

These policies do not really affect the Kubernetes scheduler, so there is a gap there, just like with the default policies currently as well.
So it might be that you are not getting exactly what you wish for, and then your pod might fail to run. If you say that my pod should run in this and this balloon, but that balloon can't be created anymore on the node, then the scheduling will succeed but running it will fail.

Okay, but in principle the topology manager will do something, and then the NRI plugin will either perform it or fail? Are you talking about the kubelet? Yeah, I mean if the kubelet's topology manager enforces a NUMA node. Okay, sorry. I would say: if you are using NRI resource policies, topology-aware or balloons, you should not use the kubelet's topology manager, CPU manager, or memory manager at all. Just switch those off, because it would otherwise just be a waste of time: their suggestions are completely thrown away by these NRI policies. Okay, I see.

One short other question: can you exclude some CPUs from scheduling? You can annotate your pod saying: do not touch the CPU sets or memory that this pod is using. This is again a great question. There are cases where your workload might need access to every single CPU core available in the system, for instance when you are taking measurements and accessing CPU counters, hardware counters, on each CPU. You can't do that without having access to all of them, and for these kinds of special cases you can say that this workload is special: don't put it into any balloon, just let it run everywhere.

I would like to add a bit more here. We were having a discussion about what would happen if a pod is already scheduled on a node, and then the balloons policy plugin tries to inflate or deflate CPUs and those CPUs are not available on the node. Again, I would say the balloons policy plugin would fail here. So there is an ongoing discussion about whether these things should be at the kubelet level, so that the scheduler is aware of what resources are attached to the node and makes the decision there, or whether it should be at the container runtime level. I have a different opinion about this; we were just having that discussion and I wanted to bring it up. Yeah, let's work together.

Hello, I have a couple of questions. The first one: are there any workloads, define workloads as you will, for which you've seen no benefit, or even degradation, with your plugin?

I haven't seen that so far. We have done experiments with synthetic workloads, some of which are very CPU-intensive, some of which are memory-bound, and we have done benchmarks on in-memory databases, and all of them actually benefit so far. So I haven't seen such a case, though if something comes up I'd be very glad to know about it. Thank you.

My last question: in your demo, you're a data scientist running your own cluster, so you can say I want this workload on these cores and those on others. My use case is that I've got a cluster and 100 teams, and I have no real control over that, so I can't create static policies per se. Are there any plans to, for example, just create lots of balloons and then have something that schedules, like: this balloon has less load, so let's move some containers over?

Actually, yeah, we have discussed this, and we have such plans for a sort of migration of workloads between balloons, and also the other way around.
So, exchanging CPUs: if some balloon is getting very loaded on CPUs and another is quite idle, then why not move CPUs between balloons. These kinds of considerations have been made; they are not implemented yet, though. Okay, do you have any estimate? Yeah. Okay, thank you.

Hello. Intel has a new processor with three core types. How is that going to work? You will get options for selecting these different core types. Not there yet, but coming. Thank you.

Hey, thanks. I have two quick questions; well, one is not quick, one is more quick. How is the discoverability of which balloon a pod is attached to? For example, is it something you can see in a status field, or can you see what kind of pods are assigned to a given balloon?

Great question. For these sort of debugging purposes we currently provide a metrics interface. It can be enabled in the balloons configuration, so you get a Prometheus metrics interface where you can see which containers run in which balloons, which CPUs are assigned to which balloons, and which extra idle CPUs the containers in a balloon may be allowed to use, if this kind of idle CPU sharing is enabled. Currently we read it with curl on the node that is running the policy. So if you have good ideas on how this could be exposed, I would be very glad to get them, even as GitHub issues or something like that.

Yeah, definitely. One thing could be a status field on either the balloon or, I mean, the balloons policy would be easier. But we can talk about it later. The next question is more general: how does this coexist with DRA? Because it feels somehow a little bit similar, right? We have some dynamic resource allocation, in this case of CPUs.

Sorry, how this cooperates with what, what did you mean? DRA. You mentioned you also have some gaps in the scheduler and everything, and I think those are some of the areas DRA is trying to solve.

Yeah, I think these are quite different topics. You look like you want to comment on this, so maybe I'll give the mic to you.

Yes. The simple answer is that it's not in any way connected with DRA. A bit more complex answer is that DRA handles the allocation of a device, for example a GPU: DRA says use this particular GPU one and GPU two. NRI starts when the container is created, and at that stage the device is already selected. So what we can do at the NRI level is see what kind of devices are used, and then find the right set of CPU cores or memory regions that is close to that device. It happens practically behind the scenes. And that's actually an answer to the previous question about the topology manager, too: the topology manager tries to dictate which CPUs to use, which memory region to use, and so on. Here it is the opposite: you first select what kind of device you are trying to use, and we find the memory and CPU regions that best accommodate that accelerator device. That's the connection between those two things at the moment.

Okay, now it looks like we are running out of time, so it's time to thank you for your attention. Great questions, thank you.