Hello, everyone. Welcome to our talk, "Resource Orchestration for HPC on Kubernetes: Where We Are Now and the Journey Ahead." I'm Swati Sehgal, a Principal Software Engineer working for Red Hat. And I'm Francesco Romani, also a Principal Software Engineer at Red Hat. We've been contributing to Kubernetes and OpenShift, accelerating and evangelizing telco features and requirements in the upstream community. Our team has been focused on resource management capabilities like NUMA awareness, hyperthreading, and Kubernetes resource management, and we aim to enable our customers and partners to run performance-sensitive, next-generation workloads.

In terms of the current landscape, even though we've been focused on telco, some of the problems that we have encountered and are trying to solve are relevant not only to telco but also to a wide range of other workloads. We believe that HPC workloads can benefit from some of the work that we've been doing, and that's why we're here. For the talk today, we will primarily be focusing on workloads that care about NUMA alignment of resources. We will try to demystify some of the resource management concepts in Kubernetes, and our aim is to point you in the right direction, give you the right tools, and essentially allow you to get your hands dirty and enable NUMA-aware scheduling in your cluster.

Performance-critical workloads typically have very strict resource requirements: resources such as CPUs, memory, and devices must be allocated from the same NUMA node for optimal performance. From the resource management perspective within the kubelet, we have the CPU manager, device manager, and memory manager, which are responsible for allocating CPUs, devices, and memory or hugepages respectively. The topology manager gathers hints from these resource managers and, based on the configured policy, aligns them on the same NUMA node.

Now, even though we have the ability to align resources at the node level, the scheduler is not NUMA-aware, and we have a few challenges and pain points that need to be addressed. The first one is that the scheduler lacks visibility into resource availability on a per-NUMA-node basis. The second one is that the kubelet rejection loop can cause delays in the pod lifecycle; especially for low-latency workloads this can be detrimental, in the sense that it can impact the SLA and the performance of the workload itself. The third one is the unbounded number of errored pods that need to be cleaned up.

Let's double-click on each of these pain points to understand them better. The dynamic between the topology manager and the scheduler is very interesting here, because in a way we now have two schedulers: the kube-scheduler, which is the main one, and the topology manager, which operates at the node level and is responsible for identifying suitable resources and allocating them, essentially acting like a mini scheduler within the kubelet. Since it is the responsibility of the scheduler to make sure that pods land on a node, and of the topology manager to allocate the resources, the scheduler has very little control over workload placement within the node and the subsequent resource allocation. In addition to that, the scheduler doesn't consider the topology manager policy configured on the node, or whether or not the requested resources can fit on the same NUMA node. So, essentially, it can place a pod on a node where the topology manager will simply reject it with a topology affinity error.
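For reference, the node-level alignment behavior we've been describing is driven by the kubelet configuration. Below is a minimal sketch of a kubelet configuration fragment that enables it; the reserved CPU and memory values are purely illustrative and would need tuning per node.

```yaml
# KubeletConfiguration fragment (illustrative values): static CPU assignment
# plus single-NUMA-node alignment enforced by the topology manager.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static              # exclusive CPUs for Guaranteed pods
memoryManagerPolicy: Static           # NUMA-aware memory/hugepage allocation
topologyManagerPolicy: single-numa-node
topologyManagerScope: container       # align per container (or "pod")
reservedSystemCPUs: "0,1"             # example: keep CPUs 0-1 for system daemons
reservedMemory:                       # required by the static memory manager
  - numaNode: 0
    limits:
      memory: 1Gi
```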
Unfortunately, that is where the current design stands today, but we can do much better. The second pain point is related to the kubelet rejection loop. If a pod is part of a Deployment or ReplicaSet, and we have the topology manager configured with the single-numa-node policy, we can end up with runaway pod creation. The reason is that, since nothing has changed on the node from a resource perspective and the pod has failed at admission time, the subsequent pods created by the ReplicaSet controller end up being placed on the same node with the same fate, and we end up with runaway pod creation. The third pain point is very closely related to the second one. Essentially, because of the runaway pod creation, we have a series of pods that need to be cleaned up: even if eventually there is a pod that is up and running, the pods that failed continue to exist on the cluster.

You might say that this is clearly a consequence of a split-brain problem, so why don't we get rid of the topology manager, the so-called mini scheduler? The answer is that we theoretically could very well do that, but it's not that simple. In the current architecture, it is an intentional decision to keep the hardware details and information local to the node, and the resource managers within the kubelet take care of resource allocation; that is why the topology manager was placed there. In addition to that, the topology manager has been with us since Kubernetes 1.16, and it graduated to a beta feature in 1.18, so preserving the existing behavior and maintaining backward compatibility is not going to be trivial. With the NUMA-aware scheduling design, we've been trying to make sure that we take incremental steps and have an evolutionary approach to come up with a solution that takes care of all these concerns.

So let's dive into NUMA-aware scheduling now. Before we dive into the architecture, it's important to understand the use cases, the "why" behind all the work that we've been doing. The first use case is workloads that require specialized hardware, and that's where HPC comes into the picture. This includes HPC workloads that require FPGAs or GPUs and want NUMA alignment of those resources. There's another interesting use case where you might want alignment of multiple accelerators: GPUDirect-style transfers, which basically require multiple accelerators, like a GPU and a NIC, for direct GPU-to-NIC transfers over PCIe instead of going through the CPU. Here it is important that the GPU, the NIC, and the CPU all be allocated from the same NUMA node. The third use case is performance-sensitive workloads. We've all heard about high-throughput networking applications for containerized 5G deployments, like containerized network functions. When running the user plane, where packets need to be processed with extremely high bandwidth, SR-IOV virtual functions, hugepages, and CPU resources all need to be aligned. The fourth use case is applications that are sensitive to TLB misses: applications that have a large memory working set or sensitivity to memory access latency. Examples include database management systems like MySQL and Oracle, and packet processing frameworks like DPDK.
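To make these workload requirements concrete, here is a minimal sketch of a Guaranteed-QoS pod of the kind described above. The image and the SR-IOV resource name (example.com/sriov-vf) are illustrative placeholders; the point is that requests equal limits and the CPU request is an integer, which is what lets the kubelet resource managers hand out exclusive, NUMA-alignable resources.

```yaml
# Illustrative Guaranteed pod: integral exclusive CPUs, hugepages, and an
# SR-IOV VF, all candidates for single-NUMA-node alignment.
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-user-plane
spec:
  containers:
  - name: packet-processor
    image: example.com/dpdk-app:latest   # placeholder image
    resources:
      requests:
        cpu: "4"
        memory: 2Gi
        hugepages-1Gi: 4Gi
        example.com/sriov-vf: "1"        # device-plugin resource (assumed name)
      limits:
        cpu: "4"
        memory: 2Gi
        hugepages-1Gi: 4Gi
        example.com/sriov-vf: "1"
    volumeMounts:
    - name: hugepages
      mountPath: /dev/hugepages
  volumes:
  - name: hugepages
    emptyDir:
      medium: HugePages
```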
Next, we'll dive into the NUMA-aware scheduling solution and how it can be enabled in Kubernetes. It's very important to expose resource information at NUMA granularity to the scheduler, and to impart to the scheduler the intelligence to make use of that information when making a NUMA-aware scheduling decision. We have enabled this in an out-of-tree solution in Kubernetes, and there are three key components. The first one is the new NodeResourceTopology object, a CRD-based API to capture the resource hardware topology of a node; each NodeResourceTopology CR corresponds to a node in the cluster. The second one is the node resource topology updater agent. We needed an exporter component that runs as a daemon on the node and exposes resource information at NUMA granularity. We introduced a component called nfd-topology-updater in Node Feature Discovery to do that, as NFD is a well-known node feature detection and labelling agent, and another component called resource topology exporter, which was custom-built for NUMA-aware scheduling. In addition to these, you could create your own custom-built exporter, but you have to make sure that it conforms to the API. The third piece is the scheduler plugin. We leverage the scheduler framework to create a scheduler plugin that implements the filter and score extension points to enhance the scheduling process and make the scheduler more intelligent.

Now, in terms of the end-to-end solution: we have the NodeResourceTopology API, and we have the topology updater agents, which use the pod resources API to gather information about the allocated resources and the NUMA nodes those allocated resources come from; this is used to determine the per-NUMA availability of resources. This information is exposed as CRs and is available via the Kubernetes API. When a pod comes to the Kubernetes API server, the topology-aware scheduler plugin comes into the picture: it uses the CRs available via the API and makes a topology-aware scheduling decision by running a simplified version of the topology manager alignment algorithm. It's important to understand that the topology manager still runs its alignment logic at the node level, and it still performs the admission check for the pod; what we are doing is being proactive, imparting intelligence to the scheduler and empowering it to make the right scheduling decision up front. Now I'll hand over to Francesco to cover the next part of the presentation, where we double-click on each of these components and cover them in more detail. Over to you, Francesco.

Thank you. So, let's cover the components we just talked about in a bit more detail, let's see what comes next in terms of roadmap, and, last but not least, let's cover how to get involved in this initiative. The NodeResourceTopology object is an external object, being a custom resource definition, which, as mentioned, corresponds to one node in the cluster, so you have a one-to-one relationship between Node objects and NodeResourceTopology objects. What we find in this object are the counters, for each NUMA zone and for each resource known to Kubernetes, of the capacity, allocatable, and available units of that resource; we'll see in a little while what they actually mean. Let's walk through a real-world example of a NodeResourceTopology object (a sketch follows below). First of all, we want to highlight the name of the object, which matches the name of the node this object refers to.
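Here is a sketch of what such a CR might look like; field names follow the upstream v1alpha1 CRD, while node names, zone layout, and all quantities are illustrative.

```yaml
# Sketch of a NodeResourceTopology CR; one CR per worker node.
apiVersion: topology.node.k8s.io/v1alpha1
kind: NodeResourceTopology
metadata:
  name: worker-01                  # matches the node name
topologyPolicies:
  - SingleNUMANodeContainerLevel   # topology manager policy + scope
zones:
  - name: node-0                   # NUMA zone 0
    type: Node
    costs:                         # NUMA distances as reported by the kernel
      - name: node-0
        value: 10
      - name: node-1
        value: 21
    resources:
      - name: cpu
        capacity: "16"             # physically present on this zone
        allocatable: "14"          # capacity minus reserved
        available: "10"            # allocatable minus already assigned
      - name: example.com/sriov-vf # assumed device-plugin resource name
        capacity: "8"
        allocatable: "8"
        available: "6"
  - name: node-1                   # NUMA zone 1
    type: Node
    resources:
      - name: cpu
        capacity: "16"
        allocatable: "16"
        available: "16"
```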
So it's very easy to cross-correlate between Node objects, or nodes in general, and NodeResourceTopology objects. Then we have the policy, because this allows the plugins to apply further logic based on what's actually configured at the node level; in this case the scope of the topology manager is also important, and it is encoded in the policy as well. Then we have the zones, which represent the NUMA zones. Each NUMA zone has a name, which is unique on the node, so zones on different nodes can actually have the same name. And then we have the resources: for each resource known to Kubernetes, we have the resource name and the counters associated with that resource. We have the capacity, which represents how many of those resources are physically present on this zone. Then you have allocatable, because you may well want to reserve some of those resources for other tasks, or in general make them not available to the kubelet; examples are CPU and memory reserved for system daemons so that the node works properly. So allocatable is a subset of capacity. And we have available, which is how many of the allocatable units are actually ready to be assigned, versus already taken by workloads running on this NUMA zone of this node. Worth mentioning that CPUs are counted as whole units, because one single CPU is the minimum amount that can be exclusively allocated by the CPU manager. And these are the costs, which are the NUMA distances as reported by the Linux kernel.

Could we have just extended the Node object? Well, it's not that straightforward, for a bunch of reasons. The most important ones: first of all, this is very specific information, which we may want to keep separate from the basic Node object, which is widespread and part of the core Kubernetes API. This also restricts access to the information: only the clients that want this information access these objects, versus requiring them to fetch the bulkier Node objects. And Node objects are bulky indeed, so adding even more data is not straightforward and, arguably, not the obviously better approach. So here we are exploring options, and we are keeping the conversation open about the best way to expose this information in general and to make it part of the core Kubernetes API. This conversation is still in progress.

We mentioned previously the topology updater agent, which is the component running on each node for which you are interested in exposing the per-NUMA resource allocation and availability. We have one topology updater agent, probably running as a DaemonSet, on each worker node. This agent needs authoritative information about the resource availability and the current allocation, and the authoritative source is the kubelet, because the kubelet is the orchestrator and is the one which allocates resources on a per-node basis, thanks to the topology manager and the resource managers. So the topology updater agent talks to the kubelet: it queries the kubelet using the pod resources API, which gained support over the last few releases for exposing the NUMA locality of the resources assigned to pods. The pod resources API is node-local, so it's very fast, and the topology updater agent queries it periodically, arranges this information on a per-resource, per-NUMA basis, and updates the NodeResourceTopology objects.
We can also introduce a notification mechanism to make the topology updater agent react in a quicker way. One example of such a notification mechanism we are exploring and trying out is plugging into CRI-O: when a workload, a container requiring exclusive resources, starts up, the topology updater agent can be notified and in turn query the kubelet again to get up-to-date state.

So why do we need a new component for this task? For two main reasons. First of all, at this point in time the NodeResourceTopology is a separate object, so it makes even more sense to have a separate component taking care of this object specifically; that way it's very easy to enable or disable. Secondly, it's not enough to just expose the initial state of the node, because there is a tight coupling, and a strong need for reconciliation, between the NodeResourceTopology object and the scheduler plugin. There is an uncertainty factor, which is unfortunately unavoidable, about where the allocation takes place, because of the split-brain issue we just described. In other words, the topology manager in the kubelet guarantees that the workload either gets all its resources aligned on a NUMA zone (but it is not known beforehand which NUMA zone) or is rejected. You cannot predict on which NUMA zone the workload is going to land, so you cannot really do accurate accounting; you can only learn after the fact. This is caused by the fact that the topology manager, thus the kubelet, has the ultimate authority on the placement. Hence we need to reconcile the scheduler state, with its per-NUMA counters, with the node state represented by the NodeResourceTopology objects, and to do so frequently. Of course this is a performance and scalability concern, which we are actively exploring and improving, and we will cover our plans about that in a little while. Last but not least, we cannot just override the topology manager: it has to have the ultimate word about the placement, because it has the most authoritative picture of the exact node allocation, so the scheduler cannot, and probably should not even if it could, drive the topology manager's decision. So we will have this uncertainty factor to deal with, and thus a reconciliation need.

Now that we have the data about the per-NUMA resource availability and capacity, and we have an agent which exports this information and keeps it fresh, we can build the scheduler plugins which consume it. We have right now two plugins, which already got merged into the scheduler-plugins main repository (PR #296). The first is the filter plugin, which filters out from the later processing stages the nodes which we are sure cannot provide aligned resources to the workload requiring them, because of how the resources are already occupied at the NUMA level on that node. So if the filter plugin rules out a node, we are sure that node was not suitable. We also have the score plugin, which provides the well-known scoring functions and policies, like least-allocated and most-allocated, but with NUMA guarantees.
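To give a feel for how these plugins get wired in, here is a sketch of a KubeSchedulerConfiguration enabling them. The profile name is made up, and the API version and the scoring-strategy argument schema should be checked against the scheduler-plugins release you actually deploy.

```yaml
# Sketch: enabling the NodeResourceTopologyMatch filter and score plugins
# in a dedicated scheduler profile.
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: topology-aware-scheduler   # pods opt in via spec.schedulerName
    plugins:
      filter:
        enabled:
          - name: NodeResourceTopologyMatch
      score:
        enabled:
          - name: NodeResourceTopologyMatch
    pluginConfig:
      - name: NodeResourceTopologyMatch
        args:
          scoringStrategy:
            type: LeastAllocated   # prefer zones with more free resources
```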
How does this improve the pain points we mentioned previously? With the filter plugin alone, ruling out the nodes which cannot accommodate the workload, given the requirements the workload has and the alignment the workload needs, we can minimize the topology affinity errors. Unfortunately we cannot say it removes them completely, because of the uncertainty factor, because of the fact that we need to reconcile with the ultimate decider, which is the kubelet. However, we can greatly improve this scenario and get much more predictable behavior. It is worth mentioning that some workload types, for example containerized network functions and 5G telco workloads, more or less expect this kind of NUMA-aware scheduling, so this is a way for us to close that gap.

How can you try out all the components we talked about? We have a GitHub organization (the links are in the slides) which contains all the repositories for the components we have developed. For example, we have the NodeResourceTopology CRD definition and the auto-generated Go client. We mentioned that the scheduler plugins have already been merged. We have the node resource topology updater agents: Node Feature Discovery got support for the topology updater in v0.10, and we have the resource topology exporter, in which we experiment with things; when we are confident something is good and ready, we propose the changes to Node Feature Discovery. We have prebuilt container images in the quay.io repository linked in the slides. In the same organization we also have a documentation repository, and we have manifests which encode not just how to actually deploy the components, but which component goes with which other components, so you get the set we tested and tried: for example, version A of the API goes with version B of the scheduler, you know these versions go well together, and this information is encoded in the manifests linked in the slides.

We built a set of Go packages to work programmatically with these components, to install and manage them, and we built a tool called "deployer" (lacking a bit of creativity here) to actually exercise those packages and deploy those components on any Kubernetes cluster. This command-line tool is available as binary releases, again linked in the slides. This tool is not just a demo but something we actually use: the manifests we mentioned previously are generated out of it. Last but not least, this tool, and the Go packages toolkit behind it, can also validate the cluster, meaning check the configuration and ensure it is compatible with the NUMA-aware scheduling settings: is the topology manager well configured, is the CPU manager well configured, and things like this.

A subset of us also built an operator on top of the generic, reusable Go packages toolkit. This operator aims to be distribution-agnostic, so there is nothing specific to any distribution; however, we depend on the Machine Config Operator, because it's much more convenient to manage nodes with it, and that's actually the main and only requirement. The operator is more opinionated about how the NUMA-aware scheduling setup should look, while the deployer toolkit we mentioned is more generic. This operator is also available on OpenShift. Now we are heading towards the conclusion, so let's wrap up.
We want to investigate, invest in, and explore how we can reduce the need to reconcile the scheduler plugin state and the state of the nodes; this takes the form of the reserve plugin we are experimenting with, which we aim to bring forward in the next months. We also want to keep enhancing the NFD component that updates the topology information, and we will be sending changes and updating the code base in the next months. And we will keep the conversation open about better integrating these data and components into core Kubernetes, possibly with native objects.

Should you want to participate in this conversation, the best place is the batch working group, which gathers experts and participants from all the related SIGs, like apps, node, and scheduling; the batch working group's Slack channel is the best way to get in touch and talk about this work. We also hang out on the SIG-specific Slack channels and communities, like node and scheduling. There is also a dedicated Slack channel, very low traffic, for this initiative, so if you want to ask questions, propose changes, or discuss very focused topics about this initiative, that is also a good choice. This is the link to our GitHub organization, which holds all the code and documentation repos for everything we mentioned. You are very welcome to file issues and send PRs. With that, we are done. We have covered our work, we thank you for attending our session, and we are very happy to answer any questions you may have. Thank you.