Good morning, Amsterdam. Good morning, KubeCon. I hope you are energetic today after the keynote sessions. We are one of the first sessions today, and we are starting as a panel discussion. As you might assume, a panel is something that should be interactive, and we as panelists expect the audience to ask questions. So as soon as you have questions, if you want to ask something, please raise your hand, and our helper in the room will bring you a microphone. Feel free to ask anything in this area. Today we've got panelists from different companies, working on different areas, and there have been a lot of new topics and a lot of new changes in this area in the past year. So let me first introduce myself and all my colleagues here on the stage. I am Alexander Kanevskiy. I work for Intel, as part of a team working mostly on resource management topics: runtimes, the kubelet, and related projects like Node Feature Discovery.

Hi, everyone. I am Swati Sehgal. I am working as a principal software engineer in the ecosystem engineering group at Red Hat. I've been involved in the resource management space for a while as well, just like Sasha. My focus has been NUMA awareness and resource management capabilities, especially the kubelet resource managers. Looking forward to your questions.

Hi, good morning. I'm Evan Lezar. I'm with NVIDIA, part of the cloud native group there. I just try to get GPUs and other funky devices to work in containerized environments. So that's why I'm here.

Hello, everyone. Thanks for coming. My name is David Porter. I'm from the Google Kubernetes Engine team. I work on node, and I also work in the upstream SIG Node community. I'm a maintainer of cAdvisor, which does resource monitoring, and I focus a lot on the resource management space. So I'm really excited for this panel.

Hey, folks. I'm Sascha. I consider myself an upstream contributor to Kubernetes. I'm working in SIG Node and also SIG Release, and I'm one of the maintainers of CRI-O, the container runtime; you may know it. I really hope that you enjoy this KubeCon. It's great to see all of you.

All right, to start the discussion with the big picture: the features in this area have been present in Kubernetes for quite some years. We all know the distribution of functionality between the control plane, where the scheduler picks a node and keeps track of resources on the nodes, and then the kubelet with its set of managers that make the decisions at the node level. Plus we have runtimes, in the different flavors everyone knows: the containerd project, the CRI-O project, and low-level components like runc, crun, Kata, and so on. So this is something that has been known and existing for years, and recently there have been some good improvements there. I would ask Swati to cover those.

Sure. So as Sasha just mentioned, we have some of these components. In the kubelet, we have a bunch of resource managers: the CPU manager, device manager, memory manager, and topology manager. These are components that have existed for a while, essentially to allocate CPUs exclusively, allocate memory exclusively, handle device assignment, and align resources. Recently it was mentioned in the community that we should avoid perma-beta features, so we started an effort to graduate some of these capabilities to GA.
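To ground what these managers act on: the static CPU manager policy hands out exclusive cores only to pods in the Guaranteed QoS class that request integer CPUs. A minimal sketch in Go using the upstream API types; the pod and image names are made up for illustration.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exclusiveCPUPod builds a pod eligible for exclusive cores under the
// kubelet's static CPU manager policy: requests equal limits (Guaranteed
// QoS) and the CPU quantity is an integer.
func exclusiveCPUPod() *corev1.Pod {
	res := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("2"), // integer CPUs
		corev1.ResourceMemory: resource.MustParse("4Gi"),
	}
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "pinned-workload"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "registry.example.com/app:latest", // hypothetical image
				Resources: corev1.ResourceRequirements{
					Requests: res,
					Limits:   res,
				},
			}},
		},
	}
}

func main() {
	fmt.Println(exclusiveCPUPod().Name)
}
```

With the static policy enabled on the kubelet, the two CPUs for this container are taken out of the shared pool and pinned to the container.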
And in the past couple of releases, we graduated the CPU manager, the device manager, as well as the topology manager to GA. If you want more information on these components, I'd recommend you check out some of our blog posts: we have two blog posts on the CPU manager and device manager, as well as one on the topology manager that talks about its internal details. Next slide.

So as Sasha just mentioned, the world has had some of these components for a while, but recently there have been a bunch of additions. We've seen DRA, dynamic resource allocation, which was introduced in Kubernetes 1.26. And then we have topology-aware scheduling. This feature was introduced because there's a disconnect between how the kubelet handles resource allocation and how the default scheduler views the node's resources. In order to close that gap, we introduced an API called the NRT API, and that allows us to enable topology-aware scheduling. In addition to that, as we interacted with the community, we realized that there are other use cases, and we came across NRI plugins that are using the NRT API that was originally introduced for topology-aware scheduling. And then we made additional changes to NFD as well to enable topology-aware scheduling. So the world is changing, and we are constantly evolving based on the feedback that we're getting from the community, as well as the use cases that we have from our customers and partners. Next slide. I'll pass it on to Evan.

Yeah, so as I said, my thing is getting devices, funky devices, to work in containerized environments. One of the features that we're quite excited about is dynamic resource allocation. That goes from a model where you have just integer-countable resources, like NVIDIA GPUs in this case, trying to plug our products a little bit, to something that's more expressive. Obviously, it's a lot more verbose. But we introduced this concept of a resource claim, which is associated with a resource class that's defined by a cluster admin or whoever. And it allows users and third-party developers to actually expose resources and define an API that suits those types of resources. My colleague from NVIDIA, Kevin Klues, together with Alexey, will be presenting a deeper dive on DRA in Room G104 just after this session. So if anyone's interested, please go check that out. I think it's about a five-minute dash, if you want. Next slide there.

So as we said, it's more verbose. But why would we do that? Isn't the integer use case enough? It's not, right? These devices have developed and become more complex, and how a user expects to interact with them has also become more complex. So the fact that we're now exposing a richer API allows us to do a lot more interesting things with them. On the slide, you see that currently you're able to assign a device to one container, and there's essentially no sharing. I know people in the community, and ourselves, have weird workarounds and hacks in place. But with DRA, you're able to actually share GPUs or devices across different containers in the same pod, and across different pods as well. And you're also able to handle devices that require some kind of parameters to set up. So you're able to dynamically partition devices and share them, and mix and match as you choose.
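To make the claim model a bit more tangible, here is a rough sketch of a ResourceClaim against the alpha resource.k8s.io API roughly as it looked around Kubernetes 1.27 (v1alpha2). This API has kept evolving, so treat the exact type and field names as illustrative; the class name is invented.

```go
package main

import (
	"fmt"

	resourcev1 "k8s.io/api/resource/v1alpha2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// gpuClaim requests a device from a cluster-admin-defined resource
// class; "example.com-gpu" is a hypothetical class name.
func gpuClaim() *resourcev1.ResourceClaim {
	return &resourcev1.ResourceClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "shared-gpu"},
		Spec: resourcev1.ResourceClaimSpec{
			ResourceClassName: "example.com-gpu",
			// Delay allocation until the first pod referencing the
			// claim is scheduled, so the scheduler can pick a node
			// where the device is actually available.
			AllocationMode: resourcev1.AllocationModeWaitForFirstConsumer,
		},
	}
}

func main() {
	fmt.Println(gpuClaim().Name)
}
```

A pod then references the claim through spec.resourceClaims and per-container resources.claims entries, and several containers or pods can point at the same claim, which is what enables the sharing Evan described.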
So yeah, and what this actually uses behind the scenes is something called the Container Device Interface, or CDI. It's a common spec, a CNCF-backed project, that defines how devices or resources actually look, from the DRA node plugin's perspective all the way down to the runtimes such as containerd and CRI-O. And I think I'll hand over to Sascha since he's, oh, OK, he wants to stand up.

Sitting makes me nervous, too. Is this microphone on? Can you hear me? Yeah. All right, so giving this a bit more of a dynamic approach. I would like to jump over this slide, if that's possible, and head directly to the next one. So if we look at container features in Kubernetes, we mostly see them from the perspective of the node or the kubelet. But we can also drive features in Kubernetes from the container runtime perspective. For example, take our recent enhancements to OpenTelemetry tracing, which enable us to collect the logs together with metrics, correlate them, and find out what is wrong on a node and what the resource usage actually is. Having this feature only on the container runtime side, like in CRI-O, wouldn't bring us much at all. I mean, we are working on bringing more enhanced OpenTelemetry tracing there, but we also need it on the kubelet side, for example. And for that, we have a dedicated enhancement in upstream Kubernetes that we are working on. In 1.27, we graduated OpenTelemetry kubelet tracing to beta. We enhanced the spans, so we now have more information about the pod lifecycle and about the garbage collectors in the kubelet. All of this information, bundled together on a single node, helps us understand how resource usage gets applied on the node.

The same applies to features like evented PLEG. The pod lifecycle event generator is usually the source of truth when it comes to synchronizing state between workloads and their actual containers. And this feature doesn't bring us anything at all if we don't take the container runtime into account. So we had to add dedicated support for evented PLEG in the container runtime, and we also had to extend the container runtime interface for that. This is the end-to-end delivery we need in the whole cluster setup, for each node, to let the feature actually work as intended. I would also like to mention some somewhat related features: user namespaces, for example, and also swap support. We plan to continue our work on swap in upstream Kubernetes so that we can have dedicated swap support in the kubelet as well. And there are also some features which are more dedicated to the container runtime. For example, we have support for NRI in CRI-O 1.27, and we have some experimental seccomp notifier support. Syscalls have influence on resources on the node, as we already know, and we can try out the seccomp notifier to get informed about negative impacts of these calls on a single node. And we now also feature vertical pod autoscaling. But that's just CRI-O; we also have containerd, which drives features in a different direction as well. David, can you tell us a bit more about the new stuff in containerd?

Yeah, sure. So containerd 1.7 was just released, and there are a lot of new features in it. There's a new sandbox API to better manage containers as a unit, and there are some additional features around the transfer service.
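Since CDI came up a moment ago: a CDI spec is just a small JSON or YAML document that the runtimes read from disk. A rough sketch using the Go types from the CDI project (the import path is roughly what it was around this time and has since moved; all vendor, device, and path names here are invented):

```go
package main

import (
	"encoding/json"
	"fmt"

	specs "github.com/container-orchestrated-devices/container-device-interface/specs-go"
)

func main() {
	spec := specs.Spec{
		Version: "0.5.0",
		Kind:    "example.com/gpu", // vendor.com/class
		Devices: []specs.Device{{
			Name: "gpu0",
			// What the runtime injects into a container that
			// requests the device example.com/gpu=gpu0.
			ContainerEdits: specs.ContainerEdits{
				DeviceNodes: []*specs.DeviceNode{{Path: "/dev/gpu0"}},
				Env:         []string{"GPU_VISIBLE_DEVICES=0"},
			},
		}},
	}
	out, _ := json.MarshalIndent(spec, "", "  ")
	fmt.Println(string(out))
}
```

Runtimes such as containerd and CRI-O pick up files like this from /etc/cdi or /var/run/cdi and apply the listed edits to any container that requests the device.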
And one of the big new things as well, across the container ecosystem, is NRI, which we'll talk about in a second. Yeah.

So NRI, one of the new acronyms, one of the three-letter ones. It stands for Node Resource Interface. It's a new common plugin interface for the runtimes, implemented in both containerd 1.7 and CRI-O 1.26. It started as a containerd project, but over the last year it evolved into a cross-runtime interface to plug your custom logic into your cluster. The key thing about NRI is that the plugins have all the internal knowledge of your containers from the runtime, and some of those properties you can adjust on the fly: resource limits, CPU and memory pinning, swappiness, and a couple of other flags. This provides you a lot more flexibility with out-of-tree components compared to what you can do nowadays with the kubelet or with the current in-tree functionality. If you want a bit more detail about NRI, I recommend watching the presentation my colleague gave in Detroit last autumn. The protocol description is in the containerd NRI repository, and example plugins implementing several resource policies were published a couple of weeks ago in the containers tools project, alongside Podman, conmon, and a few other things. (There's a minimal sketch of what such a plugin looks like a bit further down.)

So this thing is one of the building blocks that can integrate between multiple features. As Swati mentioned, we have the scheduler extension, we have the runtime, we have the kubelet, we have dynamic resource allocation. All of these are components where you can plug in your custom logic, either as a hardware vendor, or as a cloud service provider, or as a user who wants to optimize some things. And obviously, there is a need to share information between those components. The piece of data for sharing it is another three-letter acronym that Swati already mentioned: the Node Resource Topology API.

So we have the Node Resource Topology API. This is a CRD-based API, and it was introduced to enable topology awareness in Kubernetes. To enable this solution, we had to build it as an out-of-tree component. We have two components as part of this: one that exposes the hardware topology, and another that utilizes it, the scheduler plugin, which uses this information and makes topology-aware scheduling decisions. The API itself was designed to carry more granular information about the resources. For topology-aware scheduling, we can have it distributed across NUMA nodes, with information on how much of each resource is available and allocatable, and what the capacity is. As we started engaging more in the community, we realized that topology-aware scheduling is not the only use case, and NRI plugins can use it as well. That's when Sasha and other folks in the community interacted with us and provided feedback on additional things we can do, such as introducing top-level attributes, as opposed to a specific field for topology manager policies, which can be used for other capabilities as well. I have the link to the API here on the slides, and I've also plugged a talk that I'm giving later today on topology-aware scheduling, if you're interested to learn more about it.

So one more thing, which was also happening in the last year and continues this year with quite an interesting and active discussion, is support for cgroups version 2. It was graduated to GA recently, and it enables a lot of new functionality.
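Coming back to NRI for a second before we move on, here is that minimal plugin sketch, written against the Go stub library in the containerd NRI repository. Method signatures have shifted a bit across NRI versions, and the pinning policy here is purely illustrative.

```go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/containerd/nri/pkg/api"
	"github.com/containerd/nri/pkg/stub"
)

type plugin struct{}

// CreateContainer is invoked by the runtime before a container is
// created; the returned adjustment can tweak its resources on the fly.
func (p *plugin) CreateContainer(pod *api.PodSandbox, ctr *api.Container) (*api.ContainerAdjustment, []*api.ContainerUpdate, error) {
	adjust := &api.ContainerAdjustment{}
	// Pin every container to CPUs 0-3; a real policy would decide
	// this from pod/container properties.
	adjust.SetLinuxCPUSetCPUs("0-3")
	return adjust, nil, nil
}

func main() {
	s, err := stub.New(&plugin{},
		stub.WithPluginName("demo"),
		stub.WithPluginIdx("10"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Connect to the runtime's NRI socket and serve lifecycle events.
	if err := s.Run(context.Background()); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

The runtime calls the plugin at container lifecycle events, and the returned adjustment is applied before the container starts; that is the hook that lets out-of-tree policies re-pin or re-limit containers transparently.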
And I would like to ask David to cover cgroup v2.

Yeah. So to step back for a second: whether you're using devices or not, every single container out there is probably using CPU and memory, which are the core Kubernetes resources. And underpinning pod and container CPU and memory resources are cgroups. Cgroups are the Linux kernel feature that provides a couple of things. One is being able to group a set of processes together and limit their resource usage, so limit how much CPU and memory those processes are able to use. Cgroup v1 was the first iteration of the cgroup API and what everybody was using. Very recently in the ecosystem, cgroup v2 has been rolling out, and we're really happy to announce it went GA in Kubernetes 1.25. Cgroup v2 is a new kind of platform, and it has many new resource management capabilities that we're hoping to take advantage of in Kubernetes to provide more enhanced resource management.

Some examples I want to highlight are a few of the projects we're working on in the SIG Node community that are built on top of the cgroup v2 platform. One of them is memory QoS, which is alpha in Kubernetes 1.27. This basically provides memory guarantees for pods. Before, you didn't actually have guarantees around the minimum amount of memory your application could have; with memory QoS, we provide minimum guarantees, so your pod will always have that amount of memory, and we can also help prevent OOMs and other kinds of out-of-memory situations. Another thing we're looking into is PSI metrics. PSI is a new cgroup v2 feature that provides metrics around resource shortages, so you can see things like, hey, there's CPU pressure or memory pressure, and the kernel will provide us metrics about that. We're hoping to take advantage of these metrics in the kubelet and be able to make more informed, smart decisions around, for example, which pods to evict, or how to prioritize different pods when they access different resources.

And then on the longer-term roadmap: we have pretty good support today for CPU and memory, but those aren't the only resources that pods and containers use. One of the other things we're thinking about is I/O isolation. We want to make sure that if multiple pods are competing for the shared I/O resources on a node, we can provide some guarantees around that. As Sascha mentioned, swap is something that we're looking into. Swap is already alpha, and there's more work to be done there, but many applications, for example, need to allocate large amounts of memory that can't all fit in main RAM, so we need swap for that. And then there are other resources that we haven't really explored too much around isolation, but there are more and more use cases around, for example, the network: how do we guarantee that different pods will be able to access the network, and provide QoS guarantees there? So in general, we're really excited to hear from you what type of resource management challenges you're having, what features you're looking forward to, and what issues you have. I actually did a talk at the last KubeCon, in Detroit, "Cgroup v2 Coming Soon to a Cluster Near You", that provides some more details around the cgroup v2 platform, so if you're curious, that's a good resource to check out. So with that, I think we want to open it up for questions.
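Before the questions, one concrete note on memory QoS: under cgroup v2 it is built on the memory.min interface file, which reserves a floor of memory for a cgroup. A tiny sketch, with an illustrative path; on a real node the kubelet and runtime manage these files for you.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// setMemoryFloor writes a minimum-memory reservation (in bytes) to a
// cgroup v2 directory's memory.min file.
func setMemoryFloor(cgroupPath string, bytes int64) error {
	f := filepath.Join(cgroupPath, "memory.min")
	return os.WriteFile(f, []byte(fmt.Sprintf("%d", bytes)), 0o644)
}

func main() {
	// Reserve 512 MiB for a hypothetical cgroup.
	if err := setMemoryFloor("/sys/fs/cgroup/demo", 512<<20); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

The memory.max and memory.high files in the same directory work the same way for hard ceilings and throttling thresholds.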
We'd love to understand what resource management issues you have, what you think about these features, and what use cases you have. Is there a question here? Can I switch on the microphone? Hello. Yes. Excellent.

So my first question is about this dynamic allocation of resources. Does this also allow me to dynamically allocate a GPU on a node, on the fly? Because from what I understand, right now, if I want a container to have access to a GPU, I need to allocate a node type that has a GPU attached all the time. I mean, it might not even make sense to do, but I'm not that familiar with all these things.

So you're saying that, well, in order to access a GPU, a GPU has to be present, right? So I suppose it depends on your infrastructure there. DRA in general allows you, if there is a way to do it, to define the API so that users can access it. Kevin over here might have more context, and he might be able to give you some more information there.

Yeah, I think what you might be asking is that today, if you want to access a GPU on a node, and you want to access a specific type of GPU, you can only put one type of GPU on an individual node in your cluster, and you have to use node selectors to direct your pod to a node that has that GPU type on it. With DRA, we'll be able to have a mix and match of GPUs on a specific node, because the API that we are now defining to access GPUs using DRA will allow you to select on a specific GPU type, on a certain amount of memory you might want to have access to, independent of the GPU type, and so on. So what DRA really gives you is the flexibility to define whatever API you want for the resources that your DRA driver is advertising for you to be able to allocate. And we are defining that API for our GPUs so that you have a lot more flexibility in how you can request access to them. The way I've been talking about it, at least internally at NVIDIA, is that with the capabilities we're adding to our DRA resource driver for GPUs, you're basically going to have a similar user experience getting access to a GPU and running a workload on it inside Kubernetes as you would have if you were running on a bare-metal machine and just wanted to launch a CUDA job. Yes, no? Any other questions?

Will the PSI metrics include something for identifying CPU contention from the pod side? Because we're trying to pack more workloads onto the nodes, and we found that it's very hard to know when there's CPU contention, especially for short bursts that wouldn't show up in CPU usage at the node level.

Yeah, I think that's a great question. More and more folks are cost-conscious and trying to pack their nodes as much as possible, so being able to monitor those resources is really critical. We have a few things in the pipeline. One of them is what I mentioned earlier, the PSI metrics; we hope those will be exposed and will provide more fine-grained, up-to-date metrics around CPU contention. I also want to plug a session later today that I'm doing with my colleague Peter, where we're going to talk specifically about pod and container workload metrics and some of the work we're doing in this space. We hope to cover some of those areas there. Thank you.

Hey, what's the name of the talk? It's right there. Thank you. Thank you.
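For anyone who wants to poke at the PSI data David mentioned: recent kernels expose it at /proc/pressure/{cpu,memory,io}, and per-cgroup as cpu.pressure and friends under cgroup v2. A small sketch that just dumps the raw values:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	for _, res := range []string{"cpu", "memory", "io"} {
		data, err := os.ReadFile("/proc/pressure/" + res)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		// Lines look like:
		//   some avg10=0.12 avg60=0.08 avg300=0.05 total=123456
		// avg10 is the percentage of the last 10 seconds during
		// which at least some tasks were stalled on this resource.
		fmt.Printf("%s:\n%s", res, data)
	}
}
```

The avg10 figure is where short contention bursts show up even when node-level CPU usage looks fine, which is exactly the case the question was about.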
I would also like to add to that question. Detection of contention or resource jamming is one piece of the puzzle. But the second piece is that you need to somehow change or affect the running containers: squeeze them, regroup them, or pin them to some other places. And this is where a piece like NRI can help you. You can plug in your dynamic policy which, based on detection, tweaks the settings of existing running containers, in a transparent mode.

And one other thing I wanted to add there, one other feature that we're really excited about, which I mentioned, in 1.27, is dynamically updating resources on containers. This is a new feature that's alpha today: you can update how much CPU and memory your application requests without actually recreating your pod. So we hope, longer term, we can integrate some of this new monitoring intelligence around detecting resource shortages with actually being able to update the resource requirements.

And in addition to that, there's the broader topic of how to make the whole resource management story in Kubernetes more dynamic. Right now, resources are detected once the kubelet starts, and that's considered to be static. But the world has changed. Devices can be hot-plugged; something goes offline, something comes online; new resources might pop up. So we have a lot of areas in the stack where we can improve. It touches nearly everywhere: the kubelet, the cAdvisor part, discovery, dynamic status updates of the node, and the CRI protocol, how things are communicated. One of the common examples is the misconfiguration of the cgroup driver between the kubelet and the runtime. Now that dockershim is gone, we have a lot of freedom in how to improve the communication between the kubelet and the runtimes and how to make it more flexible, more dynamic, more advanced. Network stuff, pod-level events, a lot of other things.

And even internally in the kubelet, we are noticing a desire for flexibility. The CPU manager, for example, has two policies, none and static. There was an initiative where we wanted to modify the existing static policy, and rather than introducing a new policy, we introduced a construct called CPU manager policy options to add options that change the behavior of the static policy. So we have the full-pcpus-only option, we have the align-by-socket option, and many others. This clearly shows that there's a desire for flexibility. In a similar way, we have topology manager policy options, where we want to modify how alignment happens: in scenarios where it's not possible to align CPUs, or resources in general, from a single NUMA node, we want to minimize the number of NUMA nodes from which they are allocated. And just like Sasha mentioned, there's a desire to move in a direction where we have additional flexibility in resource management capabilities with external components. So, similar to device plugins and DRA, we can have a mechanism to introduce fancier ways of representing resources, be it CPUs, devices, or memory in general. Any other questions?

Hi. My question is mostly related to running real-time applications on Kubernetes. Can you speak a bit closer to the microphone? Yeah. So my questions are mostly related to running real-time applications on Kubernetes.
We have two specific use cases. First, we would want to partition the cache for real-time pods. And second, Linux has the concept of isolated and non-isolated CPUs. So can we use NRI to solve these two cases?

Short answer: yes. The capability already exists in the upstream components. For the cache, and actually an initial implementation of block I/O separation, the feature was implemented in the container runtimes, I think in containerd 1.7, and for CRI-O I think it was in 1.25. It's controlled by annotations, so you can specify which workload belongs to which partition of the cache, or to which class of block I/O priorities. There is currently a KEP under discussion called QoS-class resources. This will provide you first-class-citizen functionality in the pod spec, so for each container you can define the class it will be in. This KEP also includes dynamic discovery of these class-based resources, so something like the cache, or other classifiable resources, can be discovered on the cluster. It also has functionality around quotas, so you can define which namespaces or which containers are able to use a particular class; for example, low-priority applications not being able to use the higher class range, and so on. So work is going on. Some of these features you can use already now; some are coming.

The follow-up question is, would it be possible to even create dynamic QoS classes?

Yes. The KEP I mentioned, QoS-class resources, is built on top of the idea that classes are dynamically discovered and dynamically provisioned in your cluster. They aren't hard-coded values, not like the current native resources, so not like CPU, memory, or huge pages. You will get flexibility: your cluster, your rules. It will be a mechanism to plug in, to discover your classes, and you will have control over what those classes mean, whether that's cache, block I/O, or any other thing you might fit into this philosophy of resources. OK.

And regarding the second question: a CPU could be isolated or non-isolated. So are there some policies with which we can specifically select whether I want an isolated or a non-isolated CPU?

With the current upstream topology manager plus CPU manager, you can get exclusive CPU cores for guaranteed-QoS pod classes. But with NRI plugins nowadays, you can get something more flexible for isolated CPUs, based on other properties of the workloads.

OK. But the kubelet doesn't look at the kernel settings. It doesn't identify whether a CPU is isolated or non-isolated. So how can I specify that I need an isolated CPU on the system and not a non-isolated one?

So, based on the quality of service of your pod, you can determine if the pod is allocated exclusive CPUs. So if your pod is guaranteed, meaning requests equal limits? Yeah, I understand that part. Guaranteed and best effort are already available. But at the kernel level, you have a property saying the CPU is isolated, which means nothing else runs on those CPUs. At this point in time, there's no support for isolated CPUs in Kubernetes. You'd have to manually manage it yourself to separate those CPUs from that particular pool, and then the pod can be allocated on your system. But how do we get that visible in the Kubernetes resources? You don't need to make it visible in Kubernetes.
So the kubelet is not going to support isolated CPUs, but the different plugin mechanisms we mentioned provide you the flexibility to do it. One of the links I was showing on the slides, the plugin, has support for detecting isolated CPUs and giving them to a container when they're requested. Thank you. If you like, come to me after the session, and I will show more direct examples. Sure, thanks.

I think we have one minute, so we can take one more question. Sorry. So, about the tracing integration: what kind of configuration is needed for integrating those additional traces from the nodes with OpenTelemetry, for example? And is it possible to associate them with anything the application is doing, with traces from the applications? Thanks.

Right now, the focus was more on getting more information from the kubelet and the container runtime and correlating them together, so passing the trace spans over the container runtime interface, and also integrating more traces into libraries we already use in Kubernetes, like klog and stuff like that. But later on, we would like to get more information on the actual workload. That's in the future plans. Thanks.

All right, we've run out of time, so thank you very much for coming to our session. We will be available either in the conference halls or at our companies' booths. So if you have questions, please visit our talks or catch us; we will be happy to talk about all of these areas. Thank you.