Welcome everybody. This is our panel about the state of the art for enabling performance-sensitive workloads and what we need to do in the future. This panel is about how different companies and participants are working together in the Kubernetes ecosystem to enable running performance-sensitive workloads on Kubernetes. I would like to ask our panelists to introduce themselves: who they are, which companies they come from, and what their motivation is to participate in this work. Swati?

Hi everyone. I'm Swati and I'm based in Ireland. I work as a principal software engineer in the ecosystem engineering group at Red Hat. My team and I have been focusing on enhancing Kubernetes and OpenShift to enable our customers and partners to run performance-sensitive, next-generation workloads on Kubernetes. The primary use case that we've been looking at is how to make the Kubernetes scheduler topology-aware. Performance-sensitive, critical workloads in industries like telco, 5G, machine learning, artificial intelligence, and high-performance computing require resources such as CPUs, memory, and devices to be allocated such that they have access to the same local memory; essentially, that leads to optimal performance. The topology manager, which is a kubelet component, was introduced for topology alignment of requested resources at the node level. But the scheduler's lack of knowledge of the underlying topology can lead to suboptimal placement of workloads, and we are trying to solve this problem. Thank you. Sasha?

I'm Alexander Kanevskiy; some people call me Sasha. I'm a cloud software architect. I work for Intel, and I'm part of a team looking at resource management topics in Kubernetes and in CNCF infrastructure. As we are a hardware company, we obviously have good knowledge of how to optimize a workload to get the best out of the available hardware. Our team is working on enabling different pieces: some in Kubernetes, some in the runtimes, some in add-on projects. Our goal is to make it possible to utilize the hardware 100% for all your needs and demands. Thank you very much. Cliff?

Hi, my name is Cliff Burdick. I work at NVIDIA as a DevTech engineer. Currently I do optimization of GPU code, working on the input and output of GPUs from a network perspective; I'm trying to optimize that and get the throughput as high and the latency as low as possible. Previously, I worked at a satellite communications company where we built out something similar, a high-throughput GPU and NIC solution, where we solved some of the problems that we're going to talk about today using an alternative method. Thank you very much. Alexey?

Hi, I'm Alexey Perevalov. I work in the Advanced Software Technology Lab of the Huawei Russia Research Institute, Moscow. I have been working on supporting performance-critical applications in orchestrators like Kubernetes, and on OpenStack in the past. I mostly focus on bare-metal deployments, and my scope covers deployment, security hardening, and other use cases for multi-tenant scenarios. Thank you very much.

My name is Gergely Csatári, and I'm working in the Open Source Program Office of Nokia. As we are a telecom vendor, we are interested in running our workloads in the Kubernetes ecosystem. Of course, some of our workloads are performance-sensitive, and this is why we are interested in this effort.
We are supporting the effort with requirements and, let's say, technical consultancy. Let's jump to the technical details. I would like to ask you to explain a bit the solution space of this problem.

Sure. Kubernetes has become the de facto standard for container orchestration, and it's attracting performance-sensitive workloads that are very demanding and need the speed and raw performance of running directly on bare metal. In the diagram here, you can see an example of a Kubernetes cluster where we have a master node on which the control plane components, like the API server and the scheduler, are running. Then we have two worker nodes. Each node runs the kubelet, which is the node agent; it communicates with the control plane and makes sure that the containers are running in a pod. To create a pod, the kubelet needs a container runtime, which is responsible for running containers. Kubernetes supports several container runtimes, like Docker, containerd, CRI-O, and any implementation of the Kubernetes CRI, which stands for the Container Runtime Interface. Then we have, of course, the underlying hardware. Our aim is to enable in Kubernetes the support for resource management for next-generation workloads on the underlying heterogeneous hardware. In order to achieve this goal, we need to tackle this problem at various layers: we need to solve problems at the cluster level, the node level, the runtime level, and the hardware level. There are a few key questions that need to be addressed. How do we make sure that, at the cluster level, the scheduler is able to make placement decisions not merely by looking at the amount of requested and available resources on a node, but also by taking into consideration the underlying topology of those resources? At the node level, how do we ensure that resources such as CPUs, memory, and PCI devices are aligned for optimal performance? And then there are optimizations at the runtime and hardware levels that we need to be able to leverage for low-latency and high-throughput applications. The other aspect of this is that at some layers we have the ability to create custom plugins to alter the default behavior, while at other layers we cannot do that. Therefore, to solve some of these problems we need to create plugins, whereas for the others we need to enhance the core of Kubernetes itself.
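To make that cluster-level question concrete, here is a minimal sketch (not from the talk) of what the default scheduler actually sees per node, using client-go. Note that Allocatable is a flat, per-node quantity; nothing in it says which NUMA node the free CPUs or memory live on, which is exactly the visibility gap being discussed. The kubeconfig path is an assumption.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		// Flat totals only: cpu, memory, hugepages, extended resources.
		// There is no per-NUMA-node breakdown in this view.
		fmt.Printf("%s: cpu=%s memory=%s\n",
			node.Name,
			node.Status.Allocatable.Cpu().String(),
			node.Status.Allocatable.Memory().String())
	}
}
```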
Thank you very much. I think this is a very interesting, multi-component problem. It's an interesting question how the members of this group collaborate, and why this whole problem space is not solved within a single Kubernetes SIG or CNCF TAG. Alex, can you tell us a few words about that?

Yeah, sure. Kubernetes SIGs such as SIG Node are responsible for everything related to worker nodes, whereas other SIGs in the Kubernetes community are responsible for other areas, for example everything related to the kube-scheduler. The CNCF TAGs mostly cover runtimes, such as the CRI, the Container Runtime Interface, or the OCI, the Open Container Initiative specification. But before proposing something in those SIGs, the group of involved people gathers and tries to work out a common solution and find common requirements. The communication happens in Slack, sometimes in video calls or in Google Groups, where we gather, discuss issues, or brainstorm. A SIG Node meeting is not the best place for brainstorming; it's much better to meet beforehand, prepare slides, and then go to SIG Node once we are sure we have found a solution that satisfies our general requirements.

Thank you very much. Now we keep talking about the requirements of next-generation performance-sensitive workloads, but what are these requirements exactly? Sasha, can you explain that?

Well, first of all, I need to say that those requirements are unique to each workload; everybody in this space who is using Kubernetes probably means something different by them. If you have a compute workload with high demand on CPU time, most probably what you really need is an exclusive core, or set of cores, allocated to you, and exclusive usage of a cache to reduce the disturbance from other processes. If you have a memory-intensive application, you need good alignment not only on the CPUs but also on the memory controllers, and you probably need to optimize how memory is allocated. If you have devices, let's take the 5G network as an example, where you have strict requirements on latency, on bandwidth, on the ability to process packets in a predictable time, then you need an even more complex setup, where the CPUs, the caches, the I/O lanes, and the PCI devices like network cards are all arranged as an exact pipeline which delivers optimal performance for your particular setup. As an example, Cliff actually mentioned that in his previous company he worked on something similar, and I would like to pass the mic to him. Cliff, can you explain what kind of complex setup you had in the past?

Sure, thanks. A couple of years back, at my previous company, we were working on an NFV-style application where we had many 100-gig NICs (network interface cards) in a node, many GPUs in a node, and a dual-socket system. It's a fairly common type of system when you're looking at NFV or accelerator applications, and we had it containerized already. The way we were deploying it in Kubernetes before was that we would manually specify which resources we needed in the pod spec, and the pod itself would go consume those resources. This was very difficult to maintain, because every time you wanted to deploy something, you had to specify exactly to the pod what it should claim when it started up. It obviously wasn't scalable, and it wasn't packing as many pods onto the nodes as possible. So we looked at our node architecture, which is similar to the one shown here. At the very bottom, we have the CPUs; in this case it was one NUMA node per CPU, and it was a dual-socket system, so you had two NUMA nodes. The purple and teal colors indicate the hyperthreaded cores: you have a core and a sibling core that are paired together and affect each other's performance to some extent. Above that, you have the PCI topology, which is where all of your devices are connected, in this case GPUs and NICs. In this particular diagram, I'm showing a GPU and a NIC connected to a particular PCI switch, and then there's a tree of those going down to the actual CPUs themselves. At the very top, there is the backplane for the GPUs. In this case I'm drawing it one GPU to one GPU, but it could be all-to-all; there are many different possible configurations. Annotated in the text is the rough throughput of each of these links: down at the bottom, you have about 30 gigabytes per second for the inter-processor link, and on the way up you get 20 to 150 gigabytes per second.
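As a concrete illustration of the requirements Sasha listed (exclusive cores, aligned memory, hugepages, devices), here is a hedged sketch of a pod that is eligible for that treatment, written with the Kubernetes Go types: Guaranteed QoS (requests equal to limits) with an integer CPU count is the precondition for CPU pinning. The image and the device resource name are invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func guaranteedPod() *corev1.Pod {
	resources := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("4"), // integer count: eligible for exclusive cores
		corev1.ResourceMemory: resource.MustParse("8Gi"),
		"hugepages-2Mi":       resource.MustParse("1Gi"),
		// Illustrative device resource name; real names come from a
		// device plugin, e.g. an SR-IOV VF pool.
		"example.com/sriov-vf": resource.MustParse("1"),
	}
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "dpdk-worker"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "worker",
				Image: "example.com/dpdk-app:latest", // hypothetical image
				// Requests equal to limits puts the pod in the Guaranteed
				// QoS class, which the CPU manager's static policy requires.
				Resources: corev1.ResourceRequirements{
					Requests: resources,
					Limits:   resources,
				},
			}},
		},
	}
}

func main() {
	out, _ := json.MarshalIndent(guaranteedPod(), "", "  ")
	fmt.Println(string(out))
}
```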
Now, if you were to schedule a workload naively and place it on any GPU and any NIC, there's a high probability that the traffic between the GPU and the NIC would go across that inter-processor link, which you want to avoid, because it would be a big bottleneck compared to aggregating all those 20-gigabyte-per-second links at the top. And really, what we wanted was a NIC feeding 100 gigabits of data into a GPU, which streams and processes that data and then sends it back out. So you really didn't want the data going through that inter-processor link, but you also didn't want it going through the second PCI switch in that tree, because that would create a bottleneck with the other two GPUs and NICs that also go through that switch. So what we ended up doing was looking at the Kubernetes landscape, and at the time, the main way we could solve this was that Kubernetes had a way to plug in a scheduler: if you wanted to run your own scheduler, then when you launched your pod, Kubernetes would hand off the entire scheduling process to your scheduler. So we built a custom scheduler that we called NHD that handles all these scenarios. It's hyperthreading-aware; it's PCI-switch-aware; it handles huge pages, which are not shown here but are the memory aspect of it; and it handles NUMA awareness, the part that Sasha touched on a little bit. So it handles our extreme case, where we needed to pair up these GPUs and NICs. Now, the big downside is that since it was a custom scheduler, and back at that time there were no easy ways to plug into the scheduler, we had to take on the entire scheduling process ourselves. That was a big pain, because we didn't get any of the benefits the default scheduler already had, such as knowing how much disk space is available on a node; we had to do all that ourselves, and we did, to some extent, reinvent the wheel. But since then, that has changed quite a bit, and we'll talk about it later in some of the upcoming slides. This project is open source; if anyone wants to look at the code or use it, the URL is right there, and feel free to ask questions at the end as well.
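The change Cliff refers to is the scheduler framework: today you can extend the default scheduler with plugins instead of replacing it wholesale. The skeleton below is a hedged sketch of what a topology-aware Filter plugin looks like under that framework; it is not the NHD code, nor the node resource topology plugin discussed later, and the alignment check itself is left as a stub.

```go
package topologyfilter

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// TopologyFilter rejects nodes where the pod's resources could not be
// aligned. The real decision logic is deliberately stubbed out.
type TopologyFilter struct{}

var _ framework.FilterPlugin = &TopologyFilter{}

func (tf *TopologyFilter) Name() string { return "TopologyFilter" }

func (tf *TopologyFilter) Filter(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	if !fitsOnSingleNUMANode(pod, nodeInfo) {
		return framework.NewStatus(framework.Unschedulable,
			"cannot align requested resources on a single NUMA node")
	}
	return nil // a nil status means the node passes the filter
}

// fitsOnSingleNUMANode is a placeholder: a real plugin would consult
// per-NUMA resource data for the node, e.g. from a CR, and run an
// alignment check similar to the topology manager's.
func fitsOnSingleNUMANode(pod *v1.Pod, nodeInfo *framework.NodeInfo) bool {
	return true
}

// New is the factory function registered with the scheduler framework.
func New(_ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
	return &TopologyFilter{}, nil
}
```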
Thank you very much. So now we've seen how this problem can be solved with a custom scheduler. But the question is how we can solve it in the different layers of the Kubernetes ecosystem, where we have the runtime and hardware level, the node level, and the scheduler level, and all of these levels need to work together. All of these layers have different kinds of gaps that we need to close. Swati, can you tell us something about the gaps in the scheduler layer?

Sure. I touched on this in my intro. While making a scheduling decision, the Kubernetes scheduler looks at the amount of requested resources and determines the nodes that can fulfill that resource requirement, but it doesn't consider the topology manager policy on the node, or whether those resources can fit on the same NUMA node. Essentially, the scheduler lacks visibility into the resources available on a per-NUMA-node basis, which can lead to unpredictable application performance. At the node level, the topology manager coordinates the topology-aligned allocation of CPUs, memory, and devices, and helps to extract the best performance out of the underlying hardware. However, in scenarios where the topology manager is unable to align the topology of the requested resources according to the configured policy, the pod is rejected with a topology affinity error. And if the pod is part of a deployment or replica set, this results in runaway pod creation, because each subsequent pod that is created ends up with a topology affinity error as well. So, in order to optimize the cluster-wide performance of workloads and resource utilization, and to enhance the performance of the system as a whole, the default scheduler should consider resource availability along with the underlying resource topology, to increase the likelihood of a pod landing on a node where it can fit.

Essentially, to solve this problem we are introducing a few components, and in addition we are making enhancements to some existing ones. There are two pieces to this puzzle. One is a component that exposes resource information with NUMA-node granularity. The other is the scheduling piece, where we enhance the scheduling process to take that information into consideration and make a proper placement decision. As I mentioned, the first piece of the puzzle is node feature discovery, often referred to as NFD. It is a project that is part of the Kubernetes SIGs organization. NFD is a node agent which exposes hardware capabilities in the form of node labels, annotations, and extended resources. We are adding a software component called the NFD topology updater, as you can see in the diagram, that collects information about the resources allocated to running pods, along with the associated topology information, using the pod resources API, to determine the available resources with NUMA-node granularity. We then expose this information as CR instances, one per node. Let's talk a bit about the pod resources API, because not everyone may be familiar with it. It's a kubelet endpoint for pod resource assignment, and we enhanced it to add support for exposing CPU IDs and device topology information. The other thing we added as part of this work was an additional endpoint to obtain information about allocatable resources. The second piece, as I mentioned, is the topology-aware scheduler plugin. Again, the Kubernetes SIGs organization houses a repository for out-of-tree scheduler plugins based on the scheduler framework, and we contributed the node resource topology scheduler plugin there. It uses the CRs created by NFD to make a NUMA-aware placement decision. Essentially, it runs a simplified version of the topology manager's alignment algorithm to determine whether a node is suitable for a particular pod. Then there's the glue between these two components, which is the node resource topology API. That is a CRD API, used by both NFD and the scheduler plugin. An important thing to note here is that the topology manager still runs its alignment logic at the node level for resource allocation; the scheduler plugin essentially ensures that the scheduling process assigns the pod to the right node.
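For illustration, here is a hedged sketch of a client for the pod resources API that the NFD topology updater relies on, using the published v1 Go bindings. The socket path is the common default; treat it, and the exact response fields, as assumptions to verify against your kubelet version.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

const socket = "unix:///var/lib/kubelet/pod-resources/kubelet.sock"

func main() {
	conn, err := grpc.Dial(socket, grpc.WithInsecure())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := podresourcesv1.NewPodResourcesListerClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// List returns, per running container, the exclusively allocated CPU
	// IDs and the assigned devices together with their NUMA topology.
	resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}
	for _, pod := range resp.PodResources {
		for _, c := range pod.Containers {
			fmt.Printf("%s/%s cpus=%v\n", pod.Name, c.Name, c.CpuIds)
			for _, dev := range c.Devices {
				fmt.Printf("  device %s topology=%v\n", dev.ResourceName, dev.Topology)
			}
		}
	}

	// GetAllocatableResources is the additional endpoint mentioned above.
	alloc, err := client.GetAllocatableResources(ctx,
		&podresourcesv1.AllocatableResourcesRequest{})
	if err == nil {
		fmt.Printf("allocatable cpus: %v\n", alloc.CpuIds)
	}
}
```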
Now that we've understood the challenges at the cluster and scheduling level: Alexey, could you please share with us what resource management looks like at the node level, and what gaps you see at the kubelet and node level?

Yes, sure. We need to list the components in the kubelet which are important for performance-critical workloads. These components, the topology manager, the CPU manager, the device manager, and the memory manager, each have their own responsibility. The topology manager is responsible for aligning resources to a NUMA node. The CPU manager is responsible for exclusive CPU allocation. As was already said, the device manager registers the set of device plugins (you can see it in the picture) and advertises their resources, each under its own extended resource name, at the cluster level. Also, if a device plugin reports the NUMA locality of a device in its resource responses, the device manager can help the topology manager align the device to the same NUMA node where CPUs were exclusively allocated by the CPU manager for a particular container. The latest component is the memory manager. It also helps to guarantee NUMA alignment, but this time the alignment of memory. Before the memory manager was introduced, we had to rely on the Linux kernel or organize hugetlbfs mount points ourselves; in other words, we did it manually.

Let's describe the topology manager in detail. It works with hint providers: the CPU manager, the device manager, and the memory manager implement the hint provider interface, and each provides a bitmask of the possible NUMA allocations. The topology manager implements different policies, and the most important policy for performance-critical workloads is the single-numa-node policy. This policy guarantees that all resources come from the same NUMA node, or raises a topology affinity error if that is not possible. Last year, the scope parameter was added to the topology manager. As we know, the components distribute resources per container, and the topology manager's policies were also applied per container. In the case of the single-numa-node policy, for example, there could be a situation where two containers from one pod are each placed correctly, but on different NUMA nodes. That can affect application performance, for example if those containers interact through memory, like a DB application. (We have to clarify that until the memory manager was introduced, this guarantee was not 100% true anyway.) The idea was to introduce behavior that applies the policies per pod, not per container, and for this purpose the topology manager scope option was introduced. The pod value for this option means the policies are applied to the pod's resources as a whole, while the container value keeps the default behavior as it was before. With the pod scope, all resources of all containers of a particular pod will be on the same NUMA node, if, of course, the single-numa-node policy is enabled on the node. To have exclusively allocated CPUs, the static policy of the CPU manager must be selected through the kubelet configuration or a kubelet command-line option. Hyperthreading awareness of the allocated CPUs, and chips with clusters of last-level caches, are out of scope for now, although just recently, this year, a CPU manager policy option was introduced which aims to allocate all CPUs from the same physical cores; many other use cases are not supported yet.
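To tie these knobs together, here is a hedged sketch of the corresponding kubelet settings expressed with the KubeletConfiguration Go types; in practice this configuration is usually written as YAML and passed to the kubelet. The values mirror the policies just described, but treat the exact option names as assumptions to check against your Kubernetes version.

```go
package config

import (
	kubeletconfig "k8s.io/kubelet/config/v1beta1"
)

// resourceManagerConfig sketches the node-level settings discussed above.
func resourceManagerConfig() *kubeletconfig.KubeletConfiguration {
	return &kubeletconfig.KubeletConfiguration{
		// Exclusive CPU allocation for Guaranteed pods with integer requests.
		CPUManagerPolicy: "static",
		// The recent option mentioned above: only hand out whole physical
		// cores, so hyperthread siblings are not split across containers.
		CPUManagerPolicyOptions: map[string]string{
			"full-pcpus-only": "true",
		},
		// Reject pods whose resources cannot be aligned on one NUMA node...
		TopologyManagerPolicy: "single-numa-node",
		// ...and apply that alignment per pod rather than per container.
		TopologyManagerScope: "pod",
		// NUMA-aware allocation of memory and hugepages.
		MemoryManagerPolicy: "Static",
	}
}
```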
Let's move on to device plugins. As you can see, device plugins run outside of the kubelet. Nowadays there are a lot of device plugins; the most popular are the SR-IOV device plugins, from Intel and from the network plumbing working group, and the NVIDIA GPU device plugin is also widely used. Before the device manager, this problem was solved by opaque integer resources; at that time, for example, node feature discovery advertised opaque integer resources. For SR-IOV, for example, opaque integer resources were later replaced by extended resources; now it's history. Briefly, on the memory manager: it calculates topology hints for each container of a Guaranteed quality-of-service pod, for conventional memory and for huge pages of all sizes. A topology hint represents a possible set of NUMA nodes that has enough capacity to satisfy the container's memory demand for all memory types. Before the memory manager, as I said, we had to rely on the Linux kernel or organize hugetlbfs mount points manually.
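The hint mechanics can be shown with a toy example. The sketch below is not the kubelet's actual code, just a simplified illustration of the idea: each hint provider proposes NUMA-node bitmasks, the topology manager intersects them, and under the single-numa-node policy anything other than exactly one surviving node becomes a topology affinity error.

```go
package main

import "fmt"

// hint is a bitmask over NUMA nodes; bit i set means NUMA node i would work.
type hint uint64

// merge intersects the preferences of all hint providers.
func merge(hints []hint) hint {
	merged := ^hint(0) // start from "any NUMA node"
	for _, h := range hints {
		merged &= h
	}
	return merged
}

// singleNUMANodeOK reports whether exactly one bit is set, i.e. whether all
// resources can be aligned on one NUMA node.
func singleNUMANodeOK(h hint) bool {
	return h != 0 && h&(h-1) == 0
}

func main() {
	cpuHint := hint(0b01)    // CPU manager: node 0 only
	deviceHint := hint(0b11) // device manager: either node
	memHint := hint(0b01)    // memory manager: node 0 only

	m := merge([]hint{cpuHint, deviceHint, memHint})
	fmt.Printf("merged=%02b aligned=%v\n", m, singleNUMANodeOK(m)) // merged=01 aligned=true
}
```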
Gergely, your next question. Yes, thank you very much. Sasha, can you tell us something about the lower layers, about how the runtime and hardware layer is organized?

Well, thanks for asking. I would like to complicate the picture a bit more compared to what Swati and Alexey explained. Yes, right now in upstream Kubernetes we have the set of CPU, memory, and topology managers. All of these try to solve the most common setup, the most simplified architecture of a system: you have one socket, which equals one NUMA node, which consists of some CPUs and some memory, and the assumption is also that this socket has only a single I/O bus. Reality is a bit more complex. Even with hardware released in the past several years, we have scenarios where the cores might differ: they can have different performance or power settings, and their caches might differ. A socket might have several NUMA nodes, with memory controllers working in different modes. You can have multiple PCI buses, so some cores are closer to a given PCI bus and some are further away. You can have different types of memory, where the different types mean different performance, bandwidth, or latency when accessing that memory. And none of those hardware details is actually visible to the kubelet, because the kubelet, and Kubernetes overall, was built on the promise of being hardware-agnostic. What we have today tries to simplify the task for the most common problems on the most common hardware, but it lacks the flexibility needed for custom solutions. This is where we need to look a bit deeper at what we can do in the other layers.

Of course, there are some solutions that try to hide this complexity in a hypervisor. For example, VMware Tanzu tries to optimize the hardware placement of a virtual kubelet node transparently. But if we're talking about other deployments, either cloud or bare metal, there are not that many places where we can plug in. Between the actual hardware and the Linux kernel we have two boxes. One is the CRI runtimes, and luckily there are practically two most active projects, CRI-O and containerd. Then we have the actual OCI runtimes, and those know how to operate with the kernel, with the actual hardware knobs that the kernel provides. But because we have these several layers of abstraction, CRI, OCI, and so on, the lower levels could modify the parameters more actively, yet not all the knobs are exposed upward. Similar losses happen between the CRI and the kubelet. The kubelet evolved from the old times when we had the Docker shim, so the kubelet's CRI interface is only partially declarative: it describes what needs to be run, but it is also partially imperative, since the creation of cgroups and some of the cgroup settings are dictated by the kubelet directly to the runtime. While we had only runc, that was okay, but nowadays we have VM-based runtimes like Kata Containers, gVisor, and so on. As soon as we start to use those micro-VMs, some of the assumptions, for example about what a NUMA node actually means, are not exactly true anymore.

So we looked at different ways to extend this. Obviously, you can have a custom OCI runtime, but you are limited in what you can do there. Then we looked at the CRI runtimes, CRI-O and containerd. About a year ago, one of the maintainers of containerd introduced a project called NRI, the node resource interface. At the time of its introduction, this project was very simplistic: it borrowed the idea of CNI plugins, something that can be executed at the start of a container and can alter some properties of the newly created container. But if you really want to utilize all the flexibility that the hardware provides, and if you really want plugins that know the details of your hardware, you need something more flexible. Right now our team, plus a few other people from the community, is working on improving this NRI interface: we want to be able to hook into the whole lifecycle of containers, it should be easy to deploy those plugins, and we would like to make it interoperable between the two major container runtimes. This brings the flexibility up to the level where you can have a custom installation with custom resource policies tailored to your hardware needs. These are the extension points for all the controllers for resources.
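As a rough illustration of the CNI-like model NRI started from, here is a conceptual sketch: a standalone binary invoked at a container lifecycle event that reads a request on stdin and writes an adjustment on stdout. The request and adjustment types here are invented for the example; the real, still-evolving interface lives in the containerd/nri repository.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Request and Adjustment are illustrative shapes only, not the real NRI API.
type Request struct {
	Event       string            `json:"event"` // e.g. "CreateContainer"
	ContainerID string            `json:"containerID"`
	Labels      map[string]string `json:"labels"`
}

type Adjustment struct {
	CPUSet string `json:"cpuset,omitempty"` // CPUs to pin the container to
	MemSet string `json:"memset,omitempty"` // NUMA memory nodes to allow
}

func main() {
	var req Request
	if err := json.NewDecoder(os.Stdin).Decode(&req); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	var adj Adjustment
	if req.Event == "CreateContainer" {
		// A hardware-aware policy would inspect the machine topology here
		// (e.g. via sysfs) and pick CPUs and memory close to the container's
		// assigned devices. Hard-coded values keep the sketch short.
		adj.CPUSet, adj.MemSet = "0-3", "0"
	}
	if err := json.NewEncoder(os.Stdout).Encode(adj); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```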
Thank you very much. Let me ask Alexey for a minute: is there any ongoing work on the built-in resource management at the moment?

Yes, right. We have just heard that there is external resource management; Sasha told us about it. But the work on the internal, built-in components is also still in progress. There is an enhancement proposal regarding last-level caches, but it's still in review, so you can contribute: you can participate in the review and in the implementation. An enhancement proposal for changing the CPU distribution among sockets and cores is also in review. The pod resources API also keeps evolving: additional features are being added, and changes are coming for a new version of the pod resources interface.

Yes, thank you very much. And I would like to throw the ball back to you and ask how the requirements from internal company projects and the requirements from the open source projects are synchronized in your case, because all of us are working for a company, and the company has a set of targets, while the open source projects have different targets. How does this work for you, for example, Alexey?

Yes, all of us are working on different products, for different purposes, at different companies, but we are trying to find a way to implement components in a common way, so that they are useful for as many use cases as possible. Sometimes it's not possible; as Alexander mentioned, we have a lot of use cases and a lot of hardware architectures and vendors. We are also trying to achieve maintainability of our products inside the company alongside the community code. Let's imagine we have rewritten the whole CPU manager policy, but we are still calling the relevant functions in the kubelet, and our components are still being invoked by it. Every time we move from one version of Kubernetes to another, we have to adapt our code, and that's a huge maintainability burden, since function prototypes and interfaces change continuously in the upstream version. From my point of view, the way to keep the balance is to develop sustainable interfaces. Then different companies might implement different components, as was done, for example, for device plugins or for the different runtime implementations, and all the components will communicate with each other without a problem.

I just wanted to jump in really quickly on that question. I think one of the important things here is that as long as the building blocks are out there and documented, it makes it easier for companies that are looking to build their own solution around something like this to come in and figure out how they can or can't use what's out there to do it. A good example of that was NFD, the node feature discovery that Swati mentioned. It was used heavily in our scheduler, because we needed to leverage a whole bunch of the features that NFD had and we didn't want to build our own. In addition, NFD allowed extensions: if you needed to add custom user data, maybe something that wasn't built into NFD but other pieces of topology awareness you needed, you could add it, and then in turn you could upstream it if it seemed appropriate for other companies to use as well. So I think it just takes time, and there are a lot of different requirements and use cases coming in, but as more and more come in, we're seeing all these pieces come together in a way that is, I think, general enough that you can build pretty much anything off of this in the future.

Thank you very much. We've now seen that a lot has already happened in these components, and a lot of changes are still happening. But what do you think will happen next? What are the next steps, Sasha?

Well, my simple answer is that 'what's next' means a lot of work. As you mentioned, all of those components are still in development. There are a lot of things that can be done on the scheduler side and for enhancing the internal resource model inside the kubelet. Obviously, there is a lot of work in the runtime space, and a lot of work in understanding how future generations of hardware will affect your workloads. And the workloads themselves evolve: if a few years back we were just running applications with databases, today we have applications plus sidecar containers, service meshes, and so on and so forth. All of this has implications for how resources need to be managed. From our perspective, even though in the kubelet we are trying to solve the most common problems, and through the different plugin mechanisms we are giving you flexibility, we as a community still want to understand what kind of workloads you have, what kind of problems you are trying to solve, what kind of requirements you have, and what kind of internal solutions you may already have. All of this information helps us to develop the things we have already planned, to plan new things for those components, and to define the APIs better. And even just by kicking the tires on what we have already created: any kind of feedback on our work is really appreciated, and if you are going to contribute, that is appreciated even more.

Yes, and with that in mind, let me list all the forums where you can connect with us, the members of these groups.
First of all, there is SIG Node, which is a Kubernetes SIG. You can join the SIG Node discussions in the SIG Node channel on Slack, in the regular meetings of the SIG, or in the SIG's Google Groups discussion forum. There is also a topology-aware-scheduling channel in the Kubernetes Slack for this particular problem, where you can discuss with us and help us. And at the CNCF level, there is a TAG for runtimes, under the tag-runtime channel of the CNCF Slack. So if anybody is interested in joining this work, please join us, because there is a lot of work to do, and we need all the help you can provide. And with this, I would like to give you the opportunity to ask questions on the meeting platform; we are doing our best to be there and answer your questions. So thank you so much for joining our session. Thank you very much.