Hello, KubeCon. Welcome to our session. I'm Alexander Kanevskiy, Cloud Software Architect at Intel. Today, together with my colleague, we are going to talk about something interesting in the runtime space. Hello, KubeCon. My name is Krisztián Litkey. I work for Intel Finland as a Linux and Cloud Software Engineer. Our topic is maximizing workload performance with smarter runtimes. We will take a look at performance optimization and how Kubernetes models hardware, make a detour into hardware domains, and then talk about smarter runtimes.

Let me first start with the theory. Performance optimization comes down to one very old problem: noisy neighbors. We all have dozens or hundreds of workloads running on our nodes, and all of those workloads have different characteristics, different life cycles, and so on. One thing I want to say at the very beginning: the silver bullet doesn't exist. Even if you have a good algorithm that optimizes one set of workloads, it will most probably behave differently for another set. The reason is that there are differences in the CPUs, caches, memory, and other hardware resources, and the usage patterns of those resources also differ. What is important is that you need to understand how those resources are used, how you can measure them, and how you can react to events in the life cycle of those workloads. Today our scope is just the container runtimes, CRI-O and containerd, and the interfaces connecting those pieces together. Even though I'm talking here about the theory of things, we have a project called CRI Resource Manager where we validated most of our theories, and we have practical examples of what can be achieved by extending the runtime.

But let's talk about how resources are organized and how we can look at them. In the Kubernetes world, with a pristine, vanilla installation of Kubernetes, you have a very simplified model of resources. Everything is one big shared pool: every CPU is equal, and all memory regions form one big shared set of resources. Obviously, in this setup you are not able to provide much of a contract. You have some priorities, but that's it. To fully embrace a contract, we need to start dividing those resources into smaller chunks; then we can measure and control them. And when people think about dividing standard compute resources, they immediately come to the idea of NUMA. With the current NUMA support in Kubernetes, you still have one big shared set of CPUs and one big shared set of memory regions. It does allow some exclusive allocations, but it's still limited. Even if you look at the newer developments in that area, you have some improvements, you can select multiple memory regions, but those memory regions will still be a resource shared between the exclusive and shared allocations of your workloads.
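To make that limitation a bit more concrete, here is a small, hedged Go sketch of the rule today's kubelet static CPU manager uses to decide who leaves the shared pool: only containers in Guaranteed QoS pods that request a whole number of CPUs get exclusive cores. It is a simplification for illustration, not the actual kubelet code.

```go
// exclusive.go: a hedged sketch of the kubelet static CPU manager rule:
// a container gets exclusive CPUs only if its pod is Guaranteed QoS and it
// requests an integer number of CPUs; everything else shares the big pool.
// This is a simplification for illustration, not the real kubelet code.
package main

import "fmt"

type container struct {
	name          string
	cpuRequestMil int64 // CPU request in millicores
	cpuLimitMil   int64 // CPU limit in millicores
	memRequest    int64 // memory request in bytes
	memLimit      int64 // memory limit in bytes
}

// guaranteedQoS: requests must equal limits for both CPU and memory.
func guaranteedQoS(c container) bool {
	return c.cpuRequestMil == c.cpuLimitMil && c.memRequest == c.memLimit &&
		c.cpuLimitMil > 0 && c.memLimit > 0
}

// getsExclusiveCPUs: Guaranteed QoS plus a whole number of CPUs.
func getsExclusiveCPUs(c container) bool {
	return guaranteedQoS(c) && c.cpuRequestMil%1000 == 0
}

func main() {
	pinned := container{"db", 4000, 4000, 8 << 30, 8 << 30}  // 4 CPUs: exclusive
	shared := container{"web", 1500, 1500, 1 << 30, 1 << 30} // 1.5 CPUs: shared pool
	fmt.Println(pinned.name, "exclusive:", getsExclusiveCPUs(pinned))
	fmt.Println(shared.name, "exclusive:", getsExclusiveCPUs(shared))
}
```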
When we are talking about NUMA, it might not be enough as the boundary for dividing the resources of a system. To see why, we need to look a bit deeper into the history of NUMA. Two decades ago, we had systems with multiple CPUs, a single memory controller, and a shared memory bus, where all of the CPUs accessed all of the memory uniformly. That was called uniform memory access, or symmetric multiprocessing. However, this setup was not really scalable. The architecture then evolved to NUMA, non-uniform memory access. What that means is that you have groups of cores, independent memory controllers, independent sets of memory, and an interconnect between those groups. Each of these groups becomes a NUMA domain. You can have multiple of those domains and multiple different links between them, not necessarily a full-mesh interconnect, but different widths, different connections, and that is really model and system dependent. That's where the idea of non-uniformity comes from. When you're optimizing, you need to understand how access across memory regions actually affects your performance.

But even if we double-click on the memory, nowadays memory has itself become a heterogeneous resource. We all know about standard DRAM DIMMs. But we already have other types of memory, for example Intel Optane persistent memory. It is still connected to the same memory slots, it's still the DIMM interface, and even though it is seen as normal system memory, it has somewhat different performance characteristics. To fully utilize it, applications would do well to understand those differences. And that's not the end of the story. Very soon we will have another bus, called Compute Express Link, and one of the profiles for this bus is memory expansion. That means we are going to have the possibility to attach additional memory controllers to the nodes, and it's not going to be a uniform type of memory: you can have different types of memory with different performance characteristics, this memory can be hot-plugged, and so on. All of these different memory regions will be seen in Linux as separate memory nodes.

If we double-click on the CPU box, we can see that modern CPUs have multiple physical cores, which might differ by frequency. But there can be other differences as well, for example hyper-threads. You can have multiple hyper-threads per physical core, and those hyper-threads share the same low-level resources, including the L1 and L2 caches. So one of the division boundaries can also be a physical core versus a single hyper-thread. Then we have the L3 cache, or last-level cache, the last stop before you reach main memory, and that is already a resource shared between multiple cores within the same socket or die. And the architecture is not something that is static. CPU architectures are evolving, and in new generations you might have an absolutely different split of resources: what a core is, what the L3 cache domain is, and so on. It goes even deeper than that: domains can be split physically or logically. For example, you can have multiple dies per physical package, or you might have logical sub-NUMA clustering within one physical package.

Based on all that, we need to understand what the actual hardware looks like to Linux in terms of hierarchy. You have physical components like the sockets, or packages in Linux terms, you have dies for multi-die CPUs, and inside those you have the CPU cores. Those CPU cores can practically be grouped into a pool, or zone, of resources, which contains CPUs, caches, memory, I/O, and so on. Inside such a small group you can have a shared set of resources and exclusively allocated resources, and you can then bind your workloads to these small groups.
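To show where this hierarchy is actually visible on a node, here is a minimal Go sketch that reads the Linux sysfs topology: the NUMA memory nodes and, per CPU, the package, die, core, and hyper-thread sibling information. A real resource manager such as CRI Resource Manager builds a much richer model on top of this; the sketch only shows where the raw data comes from.

```go
// topology.go: a minimal sketch of discovering the Linux view of the
// hardware hierarchy from sysfs. The die_id file only exists on newer
// kernels; missing files simply show up as empty strings here.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// readTrimmed returns the contents of a sysfs file with whitespace trimmed.
func readTrimmed(path string) string {
	data, err := os.ReadFile(path)
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(data))
}

func main() {
	// NUMA nodes: each nodeN directory is one memory node, which may be
	// DRAM, persistent memory or, in the future, CXL-attached memory.
	nodes, _ := filepath.Glob("/sys/devices/system/node/node[0-9]*")
	for _, node := range nodes {
		fmt.Printf("%s: cpus=%s\n",
			filepath.Base(node), readTrimmed(filepath.Join(node, "cpulist")))
	}

	// CPU topology: package (socket), die and core IDs, plus the
	// hyper-thread siblings that share L1/L2 caches with each CPU.
	cpus, _ := filepath.Glob("/sys/devices/system/cpu/cpu[0-9]*/topology")
	for _, topo := range cpus {
		cpu := filepath.Base(filepath.Dir(topo))
		fmt.Printf("%s: package=%s die=%s core=%s siblings=%s\n",
			cpu,
			readTrimmed(filepath.Join(topo, "physical_package_id")),
			readTrimmed(filepath.Join(topo, "die_id")),
			readTrimmed(filepath.Join(topo, "core_id")),
			readTrimmed(filepath.Join(topo, "thread_siblings_list")))
	}
}
```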
The next question will of course be: what if I have a workload which is bigger than such a small group? Well, you still have the ability to group at a higher level. You can have parent nodes, or parent virtual groups, which span multiple dies or a whole socket, and then you can do proper accounting for all of those, which is practically the sum of the resources of all the child nodes. So, to reiterate: when we are talking about resource zones where workloads are bound, it's good to have a solid understanding of what the physical layout of the system looks like, and that can differ a lot between generations, between vendors, and so on. In different generations there are differences in how memory is organized, how caches are divided, how I/O buses are connected. You can have different types of memory which, even though they are connected to the same memory controller, will be visible to Linux as separate NUMA nodes. What we actually want to understand is how those separate NUMA nodes, as reported by the Linux kernel, can be grouped into one logical, usable resource zone. At the moment we have the possibility to say which zones can be used, but what we don't have, and this is something I hope will be improved in future versions of Linux, is the ability to control how much memory, or how much of other resources, an application can use per zone. An even harder task to understand and fix is how to learn the low-level details of a workload, for example its working set size, so that we can tell apart the hot set of memory used by an application from the cold set of pages used by the application.

Going further, we have caches. We can already partition them, and we can assign workloads to cache classes. However, there are problems. First of all, the number of cache classes is limited, so you cannot assign each of your containers to a separate class. Second, the configuration of the caches is something that is really hard to get right. We need to come up with a good abstraction, and we think classes are one of the good abstractions that we can use in Kubernetes, or in the CNCF world in general. Meaning, we have classes, for example Gold, Silver, and Bronze, and then we have a node-level mapping from those classes to the actual hardware parameters. So: a simplified interface for containers, deep knowledge on the node level. And even though this was done for the caches, there is another resource which can benefit from the same approach, and I'm talking here about block I/O. This is a virtual resource known to the Linux kernel, where you can define the priorities and limits of block storage. But, again, different nodes in a heterogeneous cluster might have different configurations of storage devices on each node. You don't want to expose to the end user how to control that. What you want instead is, again, a class-based abstraction for consuming those resources, and that allows you to properly split and control those resources on a per-node basis.

The reason I talked about all of this is that we have the possibility to divide the resources and the possibility to measure them. But now that we understand how to divide and how to measure, we also need a way to control it.
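To make the class idea a bit more concrete before we move on, here is a rough Go sketch of such a node-level mapping. The class names, resctrl group paths, and blkio weights are purely illustrative assumptions; the only real kernel interfaces touched are the resctrl tasks file for cache classes and the cgroup v1 blkio.weight file for block I/O priority.

```go
// classmap.go: a rough, hypothetical sketch of a node-level mapping from
// workload classes (gold/silver/bronze) to concrete hardware parameters.
// The group names and weights are made-up examples; a real implementation
// would read such a mapping from node-specific configuration.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// classConfig is the per-node, hardware-specific side of the mapping.
type classConfig struct {
	resctrlGroup string // pre-created cache class group under /sys/fs/resctrl
	blkioWeight  int    // cgroup v1 blkio.weight, valid range 10-1000
}

// classes maps simple cluster-wide class names to node-level details.
var classes = map[string]classConfig{
	"gold":   {resctrlGroup: "/sys/fs/resctrl/gold", blkioWeight: 800},
	"silver": {resctrlGroup: "/sys/fs/resctrl/silver", blkioWeight: 500},
	"bronze": {resctrlGroup: "/sys/fs/resctrl/bronze", blkioWeight: 200},
}

// assignToClass moves a process into the class's resctrl group (cache
// partitioning) and sets the block I/O weight of its cgroup.
func assignToClass(class string, pid int, blkioCgroup string) error {
	cfg, ok := classes[class]
	if !ok {
		return fmt.Errorf("unknown class %q", class)
	}
	// Cache allocation: writing a PID to the resctrl tasks file assigns
	// that task to the group's class of service.
	if err := os.WriteFile(filepath.Join(cfg.resctrlGroup, "tasks"),
		[]byte(strconv.Itoa(pid)), 0644); err != nil {
		return err
	}
	// Block I/O priority: proportional weight via the blkio cgroup.
	return os.WriteFile(filepath.Join(blkioCgroup, "blkio.weight"),
		[]byte(strconv.Itoa(cfg.blkioWeight)), 0644)
}

func main() {
	if err := assignToClass("silver", os.Getpid(), "/sys/fs/cgroup/blkio/example"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```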
So: smarter runtimes. Ever since its introduction, Kubernetes has enjoyed increasingly wide-scale adoption across public and private clouds. As a result, both the spectrum of workloads and the diversity of cluster hardware are rapidly growing. We have AI, databases, content streaming, network function virtualization, automotive, edge, running anywhere from a virtualized generic public cloud, something like Google Cloud, to a bare-metal custom cloud with accelerators, for instance a 5G edge deployment with FPGAs. Kubernetes is literally taking over the world. While this is a positive problem, it is also causing some headaches for us. Kubernetes node resource management is implemented in the kubelet, the node agent. The kubelet has a simplistic view and a limited understanding of the hardware. Moreover, it comes with a single, one-size-fits-all resource allocation algorithm. Now, while this algorithm is good for many common cases, critical workloads on bare metal require more optimal, hardware-aware resource assignment for acceptable performance. Similarly, workloads in domain-specific clusters often require custom resource allocation logic. The one-size-fits-all algorithm cannot scale to satisfy all these requirements. We need multiple resource allocation and assignment algorithms, and support for custom logic.

That sounds a bit like we need support for pluggable algorithms, but where do we plug them in? We have very few candidates: either the kubelet or the runtime itself. The kubelet already hosts one policy, so should we extend it with pluggable policies? Well, that would be problematic for many reasons. First of all, we would like to keep the orchestration layer hardware-agnostic. And as it is critical infrastructure, we would like to keep it as simple as possible. The kubelet is already complex, and adding pluggable policies would definitely increase its complexity. Luckily, another option suggests itself: the runtime. Let's plug custom algorithms into the runtime and make it smarter.

The kubelet's APIs are predominantly declarative. The user describes what the workload needs, and orchestration decides how this will be achieved. This is true for the resource API too: container resources are requested in abstract terms, for instance 3,000 millicores, and not cores 1, 3, and 5. But the API between the kubelet and the runtime, the Container Runtime Interface, is fully imperative: it carries concrete resources and related parameters. To bridge this gap, the kubelet turns abstract resource requirements into concrete resource assignments using its allocation algorithm. What would we need instead? Let's separate the what from the how and leave resource requirements unresolved in the kubelet. Let's update the CRI API to allow passing abstract resource requirements to the runtime itself. Let's extend the runtime with pluggable custom resource allocation and assignment algorithms, and an API that these algorithms can use to resolve pod spec requirements into concrete resources. In other words, let's make the runtime smarter.
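As a concrete illustration of the gap between the what and the how: this small Go sketch shows roughly how an abstract millicore request is resolved into cgroup CPU shares and CFS quota today. The formulas mirror the well-known kubelet behavior, but treat this as an illustration, not the exact Kubernetes code; note that nothing in it says which CPUs the container will run on.

```go
// millicpu.go: a sketch of today's abstract-to-concrete translation:
// "3000 millicores" becomes relative cpu.shares and an absolute CFS quota,
// never a set of specific cores. Illustration only, not the kubelet source.
package main

import "fmt"

const (
	sharesPerCPU = 1024   // cgroup v1 cpu.shares corresponding to one full CPU
	cfsPeriodUs  = 100000 // default CFS scheduling period, 100ms in microseconds
	milliPerCore = 1000
)

// milliCPUToShares converts a millicore request into relative cpu.shares.
func milliCPUToShares(milliCPU int64) int64 {
	return milliCPU * sharesPerCPU / milliPerCore
}

// milliCPUToQuota converts a millicore limit into an absolute CFS quota.
func milliCPUToQuota(milliCPU int64) int64 {
	return milliCPU * cfsPeriodUs / milliPerCore
}

func main() {
	req := int64(3000) // "3000m" from the pod spec
	fmt.Printf("request=%dm -> cpu.shares=%d, cfs_quota_us=%d (period=%dus)\n",
		req, milliCPUToShares(req), milliCPUToQuota(req), cfsPeriodUs)
	// Nothing here decides *which* CPUs the container runs on; that is
	// exactly the "how" a smarter runtime plugin could decide instead.
}
```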
There already exists an extension point for plugging resource allocation algorithms into runtimes. This is NRI, the Node Resource Interface, a containerd sub-project in draft status. It is described as a new interface for managing resources on a node for pods and containers. It was largely inspired by CNI, the Container Network Interface, which is used by runtimes to handle multiple implementations of the container network stack. We share a common interest with NRI. It aims at improving node resource management with a structured API and a plugin design for containerd; we would like to do pretty much the same. It wants additional interfaces to customize a container's runtime environment; we would also like the same. NRI's current focus, however, is on better control and more flexibility for injecting devices into containers. This we would like to extend a bit. Ideally, we would like to make NRI the primary integration point for extending the runtime and allow any vendor to plug in their own resource allocation and optimization algorithms.

How do we want to extend NRI? NRI taps into container lifecycle events and invokes corresponding plugin API functions. The plugin receives information about the event, selects and configures resources, and responds back with data about how to set up the container. The current NRI events are the creation and removal of pods and containers. To enable more generic support for resource optimization algorithms, NRI needs to tap into a few more events. To enable smart algorithms, NRI needs to pass all the necessary information to the plugins. And to allow proper container setup, NRI plugins need to be able to pass back enough data for correct container configuration. Finally, let's look at the plugin invocation mechanism. The current implementation is a hook-like, one-shot API with a separate exec for every request. For stateful plugins, we need another mechanism with minimal overhead, probably a protobuf-based API over gRPC or ttRPC.

Eventually we would like NRI to become the resource integration point for all OCI-compatible runtimes. Therefore, the NRI core and data structures need to be runtime-agnostic; adaptation to any runtime-specific bits should happen in the runtime itself. containerd is already integrated with NRI, but we need to extend that a bit. We have to hook into a few more lifecycle events and make sure that all the necessary data is passed to the plugins. Also, the current code ignores plugin replies altogether; that has to be updated so that containers are modified according to the plugin decisions. Once we have ironed out all these details with containerd, we can take CRI-O and integrate it with NRI as well.

CRI needs to evolve as well, to provide better support for smarter runtimes. Smart algorithms need declarative resource requirements and the separation of the what in the kubelet from the how in the runtime. To do this, CRI needs to pass through pod spec resource requirements verbatim. Smart algorithms also need a few group operations for better efficiency. VM-based runtimes need full information about all container resource requirements at sandbox creation, and to optimize the co-location of related containers, it is better to get a single request for creating all the related containers at once. We also need the ability to update multiple containers with a single request. And finally, we would like a dedicated API for container state monitoring: we want to allow clients to subscribe to container state change events, and then trigger and deliver notifications from the runtime when such an event occurs.
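To give a feel for what such a runtime-side plugin hook could look like, here is a purely hypothetical Go sketch. None of these type or method names come from the actual NRI code; they are assumptions meant only to illustrate the event-driven, stateful plugin model described above.

```go
// plugin.go: a purely hypothetical sketch of an event-driven, stateful
// resource plugin interface for a container runtime. This is NOT the real
// NRI API; every name here is an illustrative assumption.
package plugin

import "context"

// ContainerSpec carries declarative requirements passed through from the
// pod spec, still unresolved by the kubelet.
type ContainerSpec struct {
	PodID       string
	ContainerID string
	CPURequest  string // abstract request, e.g. "3000m"
	MemoryLimit string // abstract limit, e.g. "4Gi"
	Class       string // e.g. "gold", "silver", "bronze"
}

// ContainerAdjustment is what a plugin passes back so the runtime can set
// up the container with concrete resources.
type ContainerAdjustment struct {
	ContainerID string
	CpusetCPUs  string // e.g. "1,3,5"
	CpusetMems  string // e.g. "0"
	BlkioWeight int
}

// ResourcePlugin is the interface a plugin would implement; the runtime
// would invoke it on container lifecycle events.
type ResourcePlugin interface {
	// CreateContainer resolves abstract requirements to concrete resources.
	CreateContainer(ctx context.Context, spec *ContainerSpec) (*ContainerAdjustment, error)
	// UpdateContainers adjusts several related containers in one request.
	UpdateContainers(ctx context.Context, specs []*ContainerSpec) ([]*ContainerAdjustment, error)
	// RemoveContainer releases whatever the plugin assigned to the container.
	RemoveContainer(ctx context.Context, containerID string) error
}
```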
So, to wrap up, there are three things we would like you to take home from this presentation. The first one is about the hardware. The hardware is evolving and changing, and all the assumptions about how hardware works might change the next day. That means there is no single, simple algorithm which will satisfy all the needs. The second one is about Kubernetes. The kubelet holds the information about what should be run and what the priority of things is, but we need to split the what from the how. The third one is about the runtimes. Runtimes already contain some knowledge about the how, for example Linux containers versus Windows containers. So we want the runtimes to be in full control of how containers run, and we want plugins and interfaces there, so that custom logic, vendor-specific logic, or logic for special installations can be added. And with that, we have reached the end of our presentation. Krisztián and I are available on GitHub and in the Kubernetes Slack, and we are happy to talk in more detail about everything we presented today and what we are doing for hardware performance utilization. We will also be available after the session today if you are attending. Thank you.