So we are here to talk about how to get more performance out of the hardware under your Kubernetes clusters. I am Antti Kerovinen. And I am Alexander Kanievski. And I am Jänne Voskaisena. [The opening introductions are garbled in the recording.]

You have heard about the NUMA problem several times today. You have data dispersed, you have containers dispersed across your whole machine, and things don't work properly. Everything we've seen so far today says: let's do NUMA scheduling, NUMA affinity. That's the usual scenario. Pin the containers, allocate the data close to the CPUs, and you get something better. True, but it's just the tip of the iceberg. Right now in Kubernetes that works for Guaranteed containers; for Burstable and Best-Effort it doesn't work. If we want to improve it, we need a better solution.

The same goes for containers that communicate with each other. Right now the topology manager and the rest of the kubelet's resource managers focus on one single pod; we can handle several containers within it. But what happens if you have a multi-pod application, say a front end plus your memcached, communicating with each other? You need affinity. You need some way to tell Kubernetes that those two will be working together. That lets the hardware do a bit more efficient scheduling and data transfers: you get common L3 caches, you get optimizations on the memory channels, and so on. Kubernetes doesn't have this right now. We have it at the cluster level on the scheduler side; we don't have it on the runtime side, where things are actually running.

Another thing is how we, as users, can help the hardware and the OS give us a bit more performance. Most of the time, when you run an application, it creates a lot of threads and consumes all available CPU cores. What does that mean in the hardware? It means those cores are not sleeping; they run for a certain period of time, then sleep. If we can somehow instruct the kernel that those containers don't actually need the whole set of cores, that you don't need to run everything on every single available CPU core but can group work onto a smaller set, then the kernel and the hardware can bump the turbo frequencies. Your workload will most probably finish earlier thanks to the higher frequencies, and as a consequence you get better power savings, because not all the cores are running.

All of this is just the tip of the iceberg once you start thinking about how these pieces of hardware can be exposed to the end user. As we saw from Francesca and Swaggy, the usual suspects are Kubernetes (how to implement all of this), the scheduler (how to get all of this information), and, last but not least, the user experience: how does the end user express what he actually needs for his workload? And there are a bunch of problems with that. It's like in a restaurant: you don't tell the chef which knife to use to cut the meat. You just order your steak and say you want it rare or medium. Same here.
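To make that restaurant analogy concrete, here is a minimal sketch of the one "order" Kubernetes understands well today: a Guaranteed QoS pod (requests equal to limits, whole CPUs) that the kubelet's CPU and topology managers will pin, plus a scheduler-level pod affinity toward a memcached pod. The pod names, labels, and image are made up for illustration; note that the affinity is "ignored during execution", so the runtime side never sees it.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: frontend                     # hypothetical pod
spec:
  affinity:
    podAffinity:
      # Cluster-level affinity only: the scheduler co-locates this pod
      # with memcached on the same node, but the container runtime gets
      # no hint to place them on the same NUMA node or L3 cache.
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: memcached
        topologyKey: kubernetes.io/hostname
  containers:
  - name: frontend
    image: example/frontend:latest   # placeholder image
    resources:
      # requests == limits with whole CPUs => Guaranteed QoS class;
      # only then does the static CPU manager grant exclusive cores.
      # Burstable and Best-Effort pods get no such pinning.
      requests:
        cpu: "4"
        memory: 8Gi
      limits:
        cpu: "4"
        memory: 8Gi
```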
We need a way to express what is important for you; the problem is how we implement that. If we go to the Kubernetes side, we cannot implement support for every possible piece of hardware. The pictures you have seen describe hardware that was current five, six, ten years ago. Current hardware is not like that, and upcoming hardware will be even more different in terms of what we consider hardware resource zones. The kubelet should stay simple and should work everywhere, and the same goes for the scheduler. Yes, we can try to expose that information to the scheduler, but we need to understand the cost: the more algorithms we expose, the bigger the delays on the scheduler side. With heterogeneous nodes, there are more algorithms the scheduler needs to be aware of. And if you want to optimize not for the hardware but, for example, for groups or pools of your workloads, or for batch jobs as we've seen in all kinds of presentations today, you need yet more policies. All of this is simplified in current Kubernetes, and we are trying to find a way to do it in an extensible and more user-friendly way. Thanks.

So, as Sasha emphasized, there are problems, and what we really need is the flexibility to manage the jobs, to add that kind of flexibility to the whole framework. There are a couple of options. If we do it at the kubelet level, there is this herd of managers, the topology manager, CPU manager, memory manager, and those would need to become pluggable so that resource management algorithms could make their way into Kubernetes itself. But there is another option. If we go one layer below, to the container runtime level, we can do something smarter: maybe we don't need to go through that door, maybe we can jump out of the box some other way. That is what the CRI Resource Manager is actually about.

Regarding its features: CRI Resource Manager provides zero-configuration CPU and memory pinning, meaning that if you just add it to the nodes in your cluster, you don't need to do anything else. It starts managing your jobs, placing them on CPUs and memories in a way that really gives you performance benefits, and on the slides we have links to material reporting the performance benefits we have observed. On the other hand, as you saw, there are the affinities needed to solve these problems: if two containers are tightly bound and communicate with each other a lot, it makes sense to run them very close to each other, even inside the node. Earlier we saw nodes communicating with each other; now we are talking at the processor level: put them on the same NUMA node, so they use the same memory DIMMs that sit very close to those processors. For that purpose, there is a way for users to add this kind of container affinity information.

In addition, CRI Resource Manager allows defining quite custom resource management policies; there are pretty easy APIs for creating your own resource alignment and management algorithms. Just to mention a couple: there is a pod pools policy that lets you say, I want sets of three pods to run on four CPUs and share those CPUs among each other. This was something one of our customers needed, and it is implementable in a matter of hours or days using CRI Resource Manager, without having to tell Kubernetes anything about it.
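As a sketch of what that container affinity information looks like: CRI Resource Manager reads it from pod annotations, roughly as below. The exact annotation key and match schema should be verified against the CRI-RM documentation; the container names and weight here are invented for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: frontend
  annotations:
    # Illustrative only: asks CRI-RM to place the "frontend" container
    # close to the "memcached" container (same NUMA node / shared cache).
    # Check the key and field names against the CRI-RM docs.
    cri-resource-manager.intel.com/affinity: |
      frontend:
      - match:
          key: name
          operator: Equals
          values:
          - memcached
        weight: 10
spec:
  containers:
  - name: frontend
    image: example/frontend:latest
  - name: memcached
    image: memcached:latest
```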
Another example is the balloons resource management policy, where we pin containers to CPU pools that inflate and deflate depending on the resource needs of those pods. This is dynamic, and you can also dynamically adjust the frequencies of the CPUs inside a balloon.

So how does this all work, where does it really go? Today, in a typical Kubernetes node, there is a kubelet, and it communicates with a container runtime, often containerd, but it can also be CRI-O. If you want to add CRI Resource Manager to your node today, the kubelet starts talking to it instead of the container runtime. CRI Resource Manager does the CPU pinning and memory pinning, runs these algorithms, and then tells the underlying container runtime, which is still your favourite container runtime, how to really pin those CPUs and manage the resources. That is where we are today. Where we would really like to be, hopefully still this year, is that the kubelet again talks directly to the container runtime, and the container runtimes provide NRI, the Node Resource Interface, through which you can plug in CRI Resource Manager. The benefit over today's setup is that you can deploy CRI Resource Manager to your Kubernetes nodes directly; you don't have to go to each node and reconfigure the kubelet and container runtime stack.

Just as I mentioned, once we start looking at the hardware problems and how to expose them, this is only the tip of the iceberg, and CRI-RM is just our project. It started as a demo vehicle, but by now it is in a shape where it can be used in production. What we wanted was to showcase how different policies and different kinds of resources can be exposed to applications in a user-friendly manner. Besides that, we have in CRI-RM implementations of last-level cache control, memory bandwidth control, block I/O control, and so on. What we are trying to do is move all of those small pieces to the appropriate places upstream. For block I/O, cache control, and memory bandwidth control, those patches recently got merged into CRI-O and containerd, and already today you can start using that functionality through annotations. We have a KEP about class-based resources: cache and block I/O are examples of class-based resources, and another example can be different memory types, so a workload can say, I want high-bandwidth memory, or I want a combination of DRAM plus a slower memory tier, which is currently not possible but hopefully will be at some point.

We also have ways to use accelerators more properly. Our team and NVIDIA maintain practically the only two publicly available device plugins for Kubernetes, so we know the problems of exposing accelerators to Kubernetes. Together with NVIDIA we implemented CDI, the Container Device Interface, for exposing devices at the runtime level. And now in Kubernetes we have the dynamic resource allocation KEP, which is about fully controlling your accelerator devices. You are no longer just requesting one GPU; you can say, I want a GPU of a particular class, I want a GPU with a particular interconnect, I want a GPU with a particular amount of memory.
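On CDI: a device is described to the runtime by a spec file, typically placed under /etc/cdi, and containers then request it by its qualified name (here that would be example.com/gpu=gpu0). A minimal sketch, with a made-up vendor, device node, and environment variable:

```yaml
# Minimal CDI spec sketch; "example.com/gpu", /dev/xgpu0 and the env
# variable are invented names. The runtime resolves the qualified
# device name and applies the containerEdits to the container.
cdiVersion: "0.5.0"
kind: example.com/gpu
devices:
- name: gpu0
  containerEdits:
    deviceNodes:
    - path: /dev/xgpu0
    env:
    - EXAMPLE_VISIBLE_DEVICES=gpu0
```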
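And for the dynamic resource allocation KEP, the claim idea borrowed from storage looks roughly like this. The API was alpha and its group, version, and field names have been changing, so treat this purely as a sketch; the driver name and parameters object are invented:

```yaml
# Sketch based on the early alpha DRA API (resource.k8s.io); the exact
# shapes may have changed since. "gpu.example.com" is a made-up driver.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: big-gpu
spec:
  resourceClassName: gpu.example.com
  parametersRef:              # vendor-specific object carrying class,
    apiGroup: gpu.example.com #   interconnect, memory size, ...
    kind: GpuClaimParameters
    name: 40gb-nvlink
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  resourceClaims:             # the pod references the claim object...
  - name: gpu
    source:
      resourceClaimName: big-gpu
  containers:
  - name: trainer
    image: example/trainer:latest
    resources:
      claims:                 # ...and the container declares it uses it
      - name: gpu
```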
And so on and so forth. These are iterative changes to existing Kubernetes. Further down the line, we will most probably need to think about bigger changes to the way Kubernetes works. As we mentioned, the kubelet is already complex, hard to maintain, and not really universal. To make it simpler, while keeping a good separation of concerns between the what and the how, we will definitely need to evolve the CRI protocol. We will need to figure out how information about containers is communicated to runtimes, and we need to understand how resources are discovered and reported for the nodes. We think that what was presented as a separate topology exporter can become part of a future CRI protocol. And things like pod admission errors, rescheduling, and avoiding scheduling errors are something we still need to investigate, how they will fit in the future. Well, that's what we had for today. Thank you.

I think you are right on time. Questions? I don't think we have time, right, and we have a coffee break, so maybe in one minute you can continue your conversations outside.

Question: Do you think the pod spec is good enough to express these things? Do we have any problems at that level? Do we need anything extra in the pod spec to express, say, "I want these two GPUs to be the same", or do you think that is not where the problem is?

In the pod spec we have several problems that are not solved, and some of those KEPs are actually trying to solve them. If we say "I want two GPUs with particular parameters", the current resource model, where GPUs are represented as extended resources, will not fit completely. So we learned from storage to have a claim for those GPUs: you get a separate object which describes "allocate me this amount of GPUs with these properties", and the pod spec references that object, saying "I will be using that". As for classes, we don't have anything right now. What we are proposing is to have, at the pod level and at the container level, a resource classes field with key-value pairs. The current proof-of-concept implementation covers practically the whole chain, discovery, runtimes, scheduler, resource quotas, and so on, but the linked KEP is only the simple first part: what goes in the pod spec and what goes in the CRI protocol.

Regarding affinities: right now in the pod spec we have affinity constraints like "required during scheduling, ignored during execution" and "preferred during scheduling". We need the same, but for the runtime side. I don't know whether we will start by exposing that on the scheduler side, but on the runtime side we need similar fields: required at runtime, preferred at runtime. And it is a bit more complicated than what we have with node affinities. Node affinity constraints usually work at the scope of a pod: we say, I want this pod to have affinity, or anti-affinity, to that pod. Here it's a bit more complicated. You want to say, I want my front end together with my memcached, but you might also have more complex affinities. Say I have my database, and I have my logger or backup container; I want that low-priority container to run in some other area, in some other CPU pool. So you also need a way to express this inter-container affinity and anti-affinity, and the syntax will most probably be a bit more complex. Thank you so much.