I just started it. Thank you. All right. Now the floor is yours.

Okay, so let's see how this goes. I am the cloud native architect for resource management — not for all of Intel. I'm not even sure what my org is currently, because we just re-orged, but Sasha is my counterpart in SATG. Just to make sure we're all talking about the same subject: when you're talking about resource management, there are two parts. I know most of you probably already know this, but I'll run through it anyway so we're all on the same page about what we're discussing.

So there are two parts of resource management in clusters. The first is scheduling on the clusters: where you schedule your workload matters. You don't want to be scheduling your simple workloads to your high-performance, AI-specific machines with GPUs on them, and you don't want to be scheduling your high-performance workloads to underpowered cores, right? Here's the general picture, which is basically a spread of a heterogeneous system and what you might want scheduled where. We'll come back to this with the kubelet resource plugin, but currently you ask for a certain number of cores and it does its best in scheduling. Kubernetes is not great at handling heterogeneous clusters at this time, right?

The second part is: once you get to the node, what do you do with your resources? How you assign your resources determines both the performance of your workload and how long it runs. This is also getting added attention in the sustainability forums that are starting up all over the place — there's one in the CNCF. Ideally, from a sustainability standpoint, you want to power down all the resources not currently being consumed, and at the same time you still want optimal responsiveness and to get the resources aligned when it's time to use them. This is just a generic example — I know it's NUMA-specific, but you can do this with any sort of resource. If you have your memory, your CPU, and your XPU in different zones, you pay the UPI bus toll. I call it the toll — there's a troll living under the bridge, the UPI bus, stealing all your time. Ideally you want these aligned. That gives you more responsiveness: it's faster, it doesn't run as long, it doesn't take as much energy.

So my team has a few different projects. We have Telemetry Aware Scheduling over in SATG, and I'll also run over GPU Aware Scheduling, which is part of that. There's power management, and then there's the kubelet piece that we would like to get done. That is a really big piece that I would like to understand community needs for.

So, Telemetry Aware Scheduling — why? We want to avoid scheduling on unhealthy nodes. We want to migrate pods away from unhealthy nodes. When scheduling, we want to consider node properties such as temperature, current load, power, and others. And we want to give support to external components like the Kubernetes descheduler and GPU-aware schedulers. I think I already did this slide.

So the Telemetry Aware Scheduler is currently an extender. We take telemetry data to aid scheduling and descheduling decisions in Kubernetes. We use policies to enable rule-based decisions on pod placement, and these are powered by metrics collected from the nodes. You can use Prometheus; you can also use other metrics collectors. It plugs into filtering and scoring, utilizes node affinity rules, and the policies support multi-metric rules.
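To make that concrete, here is a sketch of roughly what a TASPolicy looks like, going from memory of the intel/platform-aware-scheduling repo — treat the exact field names and values as assumptions and check the repo docs:

```yaml
# Sketch of a Telemetry Aware Scheduling policy (field names recalled
# from the intel/platform-aware-scheduling docs -- verify there).
apiVersion: telemetry.intel.com/v1alpha1
kind: TASPolicy
metadata:
  name: demo-policy
  namespace: default
spec:
  strategies:
    dontschedule:              # filter out nodes violating the rule
      rules:
      - metricname: node_health_metric
        operator: Equals
        target: 1
    scheduleonmetric:          # score remaining nodes by the metric
      rules:
      - metricname: node_temperature
        operator: LessThan
    deschedule:                # flag running pods for the descheduler
      rules:
      - metricname: node_temperature
        operator: GreaterThan
        target: 80
```

Pods opt in to a policy by naming it in a label (a `telemetry-policy` label, if I recall the TAS convention correctly), so the extender knows which rules apply to which pods.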
So this is new this year: we can take anyOf and allOf, so you can combine metrics. This is the general layout of how this works: there's Prometheus, there's the Prometheus adapter, there's the custom metrics API. Basically, if it goes into the custom metrics API, we can pull it. Then there's the Telemetry Aware Scheduler working with the scheduler, and then you have TAS policies.

So we'll go through these one by one. There's dontschedule: a pod with this strategy will not be scheduled on a node breaking these metric rules. So if the metric with this name equals one, don't schedule there. Then there's scheduleonmetric: it prioritizes nodes based on a comparator and an up-to-date metric value. So if the temperature is low, in this case, you schedule there. Deschedule: if a pod with this policy is running on a node that violates the rule, it can be descheduled with the Kubernetes descheduler. So basically, if your temperature is too high or your amount of RAM is less than some amount, it will be descheduled. And we also allow labeling. This is less about scheduling directly, but may feed your particular scheduler: in this case we basically create labels based on these rules, so you have card0 equals true and card1 equals true. This is used partially with GPU Aware Scheduling.

So this is just info on this. You can submit PRs for changes; this is open source. We do have future work for TAS before we want to release this and try to put it into the community, which is specifically to move from an extender — we're currently a Kubernetes extender — to a Kubernetes scheduler plugin, which plays a little better with the current scheduling decisions. We're currently on that work, but once that's done, we do plan to try to push upstream to the community. And these are more links — I can send these after — there are white papers on this, we have a power-specific example, and there is a recent KubeCon talk and demo done by Denicio and Madalina on my team. Do we have any questions on this before I move to GPU Aware Scheduling? None for me.

OK. So the caveat to this is GPU Aware Scheduling. I'll go over the use case. The node has two GPUs, and each has a certain amount of memory, and you want to make a replica set of three, where each pod needs five gigabytes of memory. You can end up with one of those pods being split across GPUs: you can put one in each, but then you've only got three gigabytes free on each, and the third pod would land across the two. So this is basically keeping you from scheduling across GPUs. There are other ways to do this, and there are other pieces to this, but with this we are using the Intel i915 — that's an Intel-specific GPU resource. We can choose a number of millicores here, and we can choose an amount of memory for each — see the resource-request sketch after this exchange. So this tells you how many GPUs you want, and then what the spread of memory is for each particular one. This covers cases where you want to divide up a GPU into slices and schedule it across many pods, or you want some specific number of GPUs on your pod. So this is that particular project. There's NFD with the GPU plugin, because you really do need NFD to do the node feature discovery, so you know what's running. Are there any questions on that one?

OK, I think one question that I had was on the GPU. So is this for any kind of GPU? Yeah, as long as you can schedule multiple things to it. How you do the underlying pieces, the device plugins, is your own piece.
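Picking up the i915 example from a moment ago, a pod requesting a slice of an Intel GPU looks roughly like this — the gpu.intel.com resource names are as I recall them from the Intel GPU device plugin and GPU Aware Scheduling docs, so verify them there:

```yaml
# Sketch of a pod requesting a GPU slice via the Intel GPU plugin
# resources (names recalled from the GAS docs -- verify there).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
  - name: worker
    image: example/worker:latest      # placeholder image
    resources:
      limits:
        gpu.intel.com/i915: 1         # how many GPUs the pod wants
        gpu.intel.com/millicores: 500 # share of the GPU's compute
        gpu.intel.com/memory.max: 5G  # memory needed on that one GPU
```

The scheduler extension then ensures the memory and millicore requests land together on a single GPU instead of being notionally split across two.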
But this will help you do the scheduling. OK, so basically the device plugin would take care of making sure the drivers are up on the node, and then this takes care of the pod scheduling, and the pods can share it. Right, this is the scheduling half. So earlier we had the scheduling half and the resource management half — this is the scheduling half. To be a bit more specific, the device plugin needs to be specifically crafted to utilize the real functionality, so our team who does the support for the GPU is involved in this scheduling part as well. They're two sides of the same piece. But I guess the question I asked was: what if I would be using an NVIDIA GPU, for example, with the NVIDIA operator orchestrating the driver setup? Then they need a driver to handle it. Yeah, and the NVIDIA plugin would need to be adjusted to utilize the same concepts. All right, OK, thanks. Yeah, so this is scheduling only.

And then we have the last project that's Intel-specific — well, GPU Aware Scheduling and Telemetry Aware Scheduling aren't Intel-specific either, but we are using them for Intel — which is the power manager. It provides limited control over the pod. In Kubernetes currently, you don't really have any power over the configuration of the CPUs assigned to a pod. If you want to lower the CPU frequencies or raise the frequencies — and you do want that all the time in performance or sustainability environments — you can't do that. So the Intel Kubernetes Power Manager is designed to expose and utilize the Intel-specific power management technologies. Currently, we have granular control over the configuration of cores: we can change the frequency of all the cores in the shared pool, and we can lower power consumption by controlling the frequencies of the shared-pool cores. The particular features we have are SST-BF, SST-CP, and frequency tuning. And currently — I'll advise everyone to wait about a month, maybe less, until we release the new version of the power manager — we're changing from the library we were using, which was a Python library, to a Golang library that we've also built and are supporting. It's easier to deploy, and it's faster for us to get new functionality in. And it's open source, so those links are also shared here; I can share those as well. We also have a white paper on how to use these. We like it.

It's currently, and will remain, an operator. In the future, I'd like us to add a gRPC interface to it so you can control cores from outside the power manager — through the power manager, basically — so it doesn't have to be through the pods. If you have something monitoring your system where you want cores spun up or down, you can do that, and you don't have different power managers and governors fighting for control of the cores. Does that make sense? Okay. And I'm happy to work with anyone who wants to work with us. I've been asked to get this out the door first, but we're ready to go.

And then this last piece — and this is a community project more than an Intel project — is that the current state of the kubelet has some restrictions we can't handle today. You can't mix pinned cores with shared cores. You can't choose which NUMA zones — the topology manager does a compact packing into NUMA zones. Different types of frequencies: if you have different frequency capabilities across cores, you can't choose cores according to frequency.
The cores are allocated by container, not by pod — I can't assign them to a pod and then let the containers move from one to the other. It can't handle affinity for anything below node level. The code doesn't support CPU-less or memory-less NUMA nodes. And there's still a max-eight-NUMA-zone limitation, which, as you start looking at the tiling that a lot of chips are doing these days, plus the fact that there are multiple sockets, is a pretty big limitation at this point. And part of the challenge is that the kubelet has a set of resource managers that have to be addressed every time we add a new feature to the kubelet. The current solutions, including CRI-RM and CPU Pooler, work by turning off the kubelet functionality entirely, which can have unintended consequences for anything assuming it's working. And we still cannot schedule cores by pod, or across specific NUMA zones or affinities. We want out of this model.

So I was speaking to Derek Carr, and his suggestion was to split the kubelet into a control and data plane and make the control plane pluggable. Working from that, we would like to do this work so we can release resource managers custom to our hardware to handle our specific use cases — including plugging into solutions we've developed, such as the CPU manager control plane... sorry, that's wrong, the CPU Pooler or CRI-RM — while still being native to the ecosystem. And if you want info on CRI-RM, Sasha is here, so he can give it; he probably has a presentation in his back pocket he can go through. The future is: we want to finish getting this RFC reviewed and developed, create a KEP, start work, and get the kubelet remodeled following the specifications, which I have links to. And we want to plug our resource managers into the new model, which will be more native. Historically, Kubernetes has kept things in pluggable interfaces — CNI, CSI, device plugins — but we haven't done that with our CPUs and memory, and for our devices.

And if we have time — we're at the half hour — we can go over the RFC and the CPU landscape exploration doc, where we went through all of the different issues and customer requests, et cetera, and put them all in one document so that we knew what we were missing. So that list we have up there mostly comes from customers or from those issues. Great. Do we want to go through that RFC? Do we want to talk about CRI-RM? I'm putting you on the spot, Sasha, sorry about that. No problem. From my point of view, I'm happy to see you continue a little if you feel like we've got time. Okay, we can go through the RFC. Sasha, do you want to go through CRI-RM at the end, or do you want to do that first? Up to you. I'm fine with that.

Okay, let's go through the RFC. It's fairly short. Let me pull that up and share it. I have to find which window I was on. I have to stop sharing and then re-share. Okay. So we'll go through this. At the beginning, this basically goes through the fact that Kubernetes was initially written with a simple model of node resources and how they would be configured. This has worked well for generic cases, but now we have a wider range of use cases — which is why a lot of you are in this group: you have your own specialty use cases for HPC or AI or any other specialty case, and telco has its own cases as well. So we wish to move to a kubelet resource plugin model, similar to how specialty resources are handled within the device plugin model. Here's what the kubelet looks like currently.
So you have the kubelet, you have the topology manager, and then you have the hint providers. The hint providers are your CPU manager, your device manager, your memory manager. And everything has to go through the topology manager. So anytime you make changes, you have to make sure all of these places work correctly.

The exploration doc is listed here. There is some commentary on it; I would like more commentary, just to make sure that when we start the KEP we know what we're doing as far as who's working on what, so we make sure we're handling everything. The cases I have in there are already in there. One other: you can't take advantage of non-uniform L3 cache access configured for CPU cores — we have Intel RDT, but there are others. That was not listed in the doc; I should probably add it. Solutions to any one of these challenges require a related solution for memory. If we change where the cores are, now we have to check the memory. So if you touch the CPU manager, you now have to check the memory manager, and you still have to touch the topology manager, no matter what you do.

So our design proposal is to make a pluggable resource hub, basically, instead of retrofitting functionality into the existing model continually — to pull all the resource managers that are currently in there (topology manager, CPU manager, device manager, memory manager) out into a plugin, and then work backwards from there. We're still going to need to handle the runtimes, too. Whatever plugin mechanism we build has to cover both, because while we do want to roll those managers out, probably into a gRPC piece, we also want to make sure you can route them through the runtimes, because there are projects, including CRI-RM, that route the resources through the runtimes.

So this particular slide goes through the goals. We want to be able to plug resource managers into the kubelet to allow customization of how resource requirements are handled. We want to be able to export resources, to expose them to the scheduler — that piece may be more complicated, because now we have added annotations, right? We want to make it simple to expand to resources not currently envisioned — when you're talking about memory-less nodes or CPU-less nodes, there are other components there. And we want to make it simple to expose attributes about resources. Non-goals: we don't want to break any existing use cases, so whatever solution we add, there should be fully supported default behavior, and we don't want to change default behavior. We don't want to create any more latency than there is today for scheduling — and that may be something we look at after we get this done, because there is still quite a bit of latency in scheduling. We want to be able to support current pod specs — there may be additional extensibility, but we still need to support the current pod specs. And the plugins should not be writing to the API themselves, so there needs to be some sort of interface.

So the next step is basically to start the KEP, but we're still looking for feedback here on what the goals and non-goals are, to make sure we have everything encapsulated. Thoughts? No, I mean, it's very, very good, very detailed. On the RFC side of things, is this something you've had to do before, or is this the first time you've done this? This is the first time I've done this. Sasha's done this before, I'm sure.
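For context on the default behavior the proposal has to preserve: the knobs that exist today are node-wide kubelet settings, not per-pod and not pluggable. A minimal sketch of the relevant part of a standard KubeletConfiguration, using the stock upstream policies:

```yaml
# Today's resource-manager knobs live in the kubelet config and apply
# to the whole node -- this is the model the RFC wants to open up.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static                 # exclusive cores for Guaranteed pods
memoryManagerPolicy: Static              # NUMA-aware memory pinning
topologyManagerPolicy: single-numa-node  # compact packing into NUMA zones
topologyManagerScope: container          # alignment is per container, not per pod
reservedSystemCPUs: "0,1"                # cores held out of the shared pool
```

Every policy here is a fixed, in-tree choice, which is exactly the restriction list above: you pick one static behavior per node rather than plugging in a manager that understands your hardware.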
I had some questions, actually, almost right back at the beginning. When you're writing a custom scheduler like this, how do you go about testing it? So that's part of the question. To start, we're going to be using the default managers, right? So whatever functionality and testing they used when they put in the initial KEPs is the testing that we'll pull out to make sure it works now. All of the current ways we already validated the topology manager, memory manager, CPU manager — all of that needs to work through the plugin model, because if it doesn't work through the plugin model, we're missing something, right? We do have some resources to work on this, but we would like community help on the problem, you know? Does that answer your question as far as testing? Yeah. Basically, we're currently looking at pulling out all of the KEPs that were already put in for the CPU manager and device manager — all of the managers — and looking at those particular tests. And if someone wants to test a custom resource manager, you're right, we probably need a test harness, so that's a good point. Should I add that to the goals? Yeah, yeah, can't hurt. Are we into general questions now, or have you got anything else you want to cover? I think I'm done. Sasha, do you want to go through CRI-RM, or do you want to just go into questions?

Well, I can say a few words, because there are a few additional things besides the CPU management activities. If you can stop sharing, so I can share. There we go, thank you. My screen — you should be able to see it, right? Yeah, that's good.

All right, so I'm just going to reuse a couple of slides from the presentation that I and one of the members of my team, Antti, did on the HPC and Batch Working Day at the last KubeCon. So, Marlow already mentioned our project, CRI-RM. The history of that project is that we tried to create kubelet resource plugins about three or four years ago, and at that time the community was not ready. Now it looks like the community is a bit more receptive, but meanwhile, to validate all the ideas and to see that what we are proposing actually works, we needed some solution. And we came up with an intermediate step, and this intermediate step is the CRI Resource Manager. It works as a normal container runtime, so the kubelet sees it as containerd, CRI-O, or whatever. It's absolutely transparent to the kubelet. It doesn't reinvent the wheel: on the backend, it still uses containerd, CRI-O, or whatever you prefer to use. But what it does is allow you to have a dedicated set of policies for how you manage the resources.

So we have policies related to hardware — all the scenarios Marlow just mentioned, like the limit on NUMA nodes or memory tiering, different setups — all of these things we have tried, and we know how to work with them. We have tests with huge machines, like 32 sockets and so on. We've had scenarios with different memory tiers — HBM, PMEM, CXL memory, which is upcoming hardware — and so on. We also tried to work from the perspective of not only the hardware but also the application. For example, if you have a set of containers that need to work together — say, your application plus a service mesh container — then when data is passed between those two containers, you don't really want them to cross L3 cache domain zones, or even worse, memory domain zones, and so forth.
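CRI-RM expresses that kind of co-location hint through pod annotations read by its topology-aware policy. Purely as an illustrative sketch — the annotation key and field names below are recalled/assumed rather than verbatim, so check the cri-resource-manager docs for the real schema:

```yaml
# Illustrative only: field names approximate the cri-resource-manager
# affinity annotation; verify against the project's documentation.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar
  annotations:
    # Hint the policy to place "app" near "mesh-proxy" so traffic
    # between them stays within one L3 cache / memory domain.
    cri-resource-manager.intel.com/affinity: |
      app:
      - match:
          key: name
          operator: Equals
          values:
          - mesh-proxy
        weight: 10
spec:
  containers:
  - name: app
    image: example/app:latest      # placeholder
  - name: mesh-proxy
    image: example/proxy:latest    # placeholder
```

An anti-affinity annotation of the same shape covers the opposite case, such as keeping a noisy batch container away from a latency-sensitive one.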
We support container affinity and anti-affinity — so, for example, your database should not be affected by the CPU consumption of a backup container, or something similar. So we provide the user a set of knobs for getting the most out of the node. And that's actually the main difference between the team where Marlow is working and my particular team: my team focuses specifically on what happens within the node — all the details, all the deep knowledge of the hardware, all the combinations of how it's going to work.

As I mentioned, right now it's implemented as a kind of proxy between the kubelet and the actual container runtime. But we are working together with the containerd project and the CRI-O project to implement the thing called NRI, the Node Resource Interface. It's also a plugin interface, similar to what the kubelet community is now thinking of implementing, but it's a bit more detailed. The reason for it is that we need to understand the boundary at which each layer of the Kubernetes stack contains enough information to make the proper decision. Right now the communication between the kubelet and the runtime is kind of imperative: the kubelet dictates how certain things need to be implemented, like the CPU set and a bunch of other parameters, and so on. The problem is that this was okay five years ago, when we had only runc as a runtime. The current set of runtimes includes VM-based runtimes like Kata, user-space-based runtimes, and others, so the assumptions the kubelet has about how the container is run are not necessarily appropriate, or not necessarily true — some information is available only in the runtime. So the thing my team is trying to do is make sure that certain information is properly passed between the kubelet and the runtimes, however it's done.

Besides the CPU, we have several other activities. There's NRI, which I already mentioned, and then we have a few things related to class-based resources, or quality-of-service kinds of resources: cache, memory bandwidth, block I/O — and to some degree memory tiering can be represented as a quality-of-service scenario — and so on.

Another thing is the device manager. What Marlow just mentioned about GPU scheduling is good; it's a way to utilize GPUs based on the current device plugin API. But the problem is that the current device plugin API contains a lot of — I wouldn't say design mistakes, but inefficiencies, if you look at it from the current hardware point of view. It worked well if you had single, exclusive use of the hardware, with no knowledge of internal resources, no shared usage, and so on. But as soon as we start thinking, okay, let's have one physical GPU or some other accelerator device be shared, let's think about the memory on it, let's think about the internal topology of those accelerators, and so on and so forth — it simply doesn't work. There are different workarounds: what we implemented with the GPU device plugin for Intel accelerators is a set of workarounds, NVIDIA has their own for NVIDIA GPUs, we have ours, but none of it is really extendable. So we are working together with NVIDIA, and recently we also have people from other projects joining — one notable one is Akri, the IoT kinds of devices, network-attached devices. We have two initiatives. One is CDI, the Container Device Interface.
It's again at the runtime level: how we attach the container to the device — no, sorry, how we attach the device to the container — all the nifty details of how it should be done at a low level. And then the upper part, the kubelet part, is Dynamic Resource Allocation. This is revisiting how the user can request the device. Going from the previous model — use extended resources, try to have all kinds of combinations of those resources, and then GPU scheduler extensions and so on — we are coming to an interface similar to persistent volume claims. You request a device of a particular class, and you specify a set of parameters for that device. Then the device driver — the vendor-specific code — will be able to understand what those parameters are about, how to get it properly allocated, and then hint to the scheduler where it can be allocated. So there's full flexibility for vendors to provide vendor-specific logic for a particular device, or actually even a chain of devices. So this is the set of puzzle pieces for Intel overall, working across multiple teams in the resource management domain: from the scheduler, to existing kubelet things, to low-level runtimes, and combinations of them. I think I will stop with that. If there are additional questions, I can pull out some of the slides or some of the details. Cool, thank you.

I don't know if anyone else has any questions. So, I'm curious how the folks in the traditional resource management scheduler community would interface with something like this. If you had somebody from the Altair or SchedMD community that wanted to pass information back to the scheduler about what components of the node, what resources on the node, it should have — is there something in what you're proposing here that they would be using, or is that an entirely separate scope?

So there are several ways we can tackle that. Right now the kubelet discovers resources and makes its own assumptions about how those resources are present on the node. That doesn't necessarily match what the runtime actually has and can use — but that's one side of the story; I will come back to it. The second part is extended resources, and here we have two variants for how they can be announced to the scheduler. One is device plugins: the device plugin says, I have this number of instances of a particular resource type. The second variant is that you can patch the node object and say this node has this amount of this resource allocatable, and the kubelet will do simple accounting of how many pods are using that particular resource. To help with that, we have NFD. So our GPU device plugin works together with NFD to automate announcing those resources — for example, the milli-GPU parts of a GPU, or the GPU memory, are handled by the NFD plugin. But as I mentioned, most of it is workarounds for the current design of the device plugin API and the current design of extended resources.

The thing I mentioned, Dynamic Resource Allocation, has a setup similar to the storage drivers: you have a cluster-level component which works together with the scheduler, and you have a node component which is responsible for actually attaching the device to the container, working together with the runtimes to do it.
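The DRA API was still being shaped at the time, but the claim-based model described here follows the alpha resource.k8s.io shape. A sketch under that assumption — the driver name, class, and parameter kinds below are placeholders, not a real vendor's API:

```yaml
# Sketch of the claim-based model (alpha resource.k8s.io API shape;
# "gpu.example.com" and the parameter kind are placeholders).
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaim
metadata:
  name: shared-gpu
spec:
  resourceClassName: gpu.example.com  # class maps to a vendor driver
  parametersRef:                      # vendor-specific request params
    apiGroup: gpu.example.com
    kind: GpuClaimParameters
    name: five-gig-slice
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-consumer
spec:
  resourceClaims:
  - name: gpu
    source:
      resourceClaimName: shared-gpu
  containers:
  - name: worker
    image: example/worker:latest      # placeholder
    resources:
      claims:                         # the container consumes the claim
      - name: gpu
```

The analogy to persistent volume claims is direct: the claim names a class, the vendor driver interprets the parameters and allocates, and the scheduler gets hints about where the allocation can land.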
So it will be the job of that cluster-level component to talk with the scheduler, to make sure the resources are available and can be consumed by the pod, and to provide topology information about which nodes the resources will be available on. Regarding the kubelet-runtime part, it's a long way to actually get there, but what we eventually need to do is remove the discrepancy between the kubelet's knowledge and the runtime's knowledge. That means at some point we need to revisit the protocol for how the kubelet and the runtime talk about resources. Right now, in the class-based QoS resources work, we have CRI messages where the runtime tells the kubelet which quality-of-service classes are available, what types of QoS classes there are, and what the potential values are. So the kubelet can report it into the node status, and then the scheduler can consume it to make a scheduling decision. It's a similar model to what we have right now for native resources; the difference is that the kubelet doesn't discover it — the kubelet gets it from the runtime. So if we are talking about resource management plugins, regardless of which level they sit at — the kubelet or the runtime — sooner or later we will need to have exactly the same interface: a plugin should be able to tell the runtime, or the kubelet, what kinds of additional resources might be available, so they can be used in scheduling decisions and, obviously, in node status.

And part of the work is figuring out how to have dynamic resources as part of this, right? They do need to be advertised to the scheduler, but some of that work is figuring out how. Yeah, the biggest problem with all of those hows is that we have a lot of legacy code and a lot of users who rely on that legacy, so to change something we need to be very careful not to break things. Previously, Docker was a huge roadblock, because you still needed to take care of the Docker API, which was quite simple. Now that Docker is removed, we have a bit more freedom in how we can evolve the CRI protocol. Awesome, thank you.

There's time for one more quick question if there is one. Otherwise, I think that's probably us for time, actually — five to. So thank you very much, Marlow and Sasha, for coming along. I think a couple of people have dropped, so sorry about that, but we've got it all recorded and we'll share it. Thank you very much for your time — really, really good stuff. Yeah, and I can send out the slides after. Sasha, if you want to add to mine before I send, I'll send them to you. I just sent in the chat a link to the session we had on HPC day, so the slides I showed are attached there, and I think the recording should also be available to everyone. Yeah, that's great. Any slides, just chuck them into the Slack channel. All right, that sounds great. Thank you. Brilliant.

Okay, thanks very much. So yeah, that's probably it for today. Our next session is going to be on the 6th of July. In the agenda previously it was down as the 29th of June, but it won't be then, because we've already done two sessions in June. So yeah, the 6th of July will be the next time, and we're going to have a session, I think, on Cilium and eBPF. So thank you again, Sasha, Marlow, and see you all next time. Thank you. Thank you. Bye bye.