So, hello everyone. I'm Michele Gazzetti, a software engineer at IBM Research Europe. My work mainly revolves around the management of bare metal infrastructure for Kubernetes clusters and HPC environments, and a nice part of my job is that sometimes I get to assess new and emerging technologies, try them out, and see how they work. Today I would like to share with you my experience working with composable systems or, more in general, composable disaggregated infrastructure and Kubernetes.

The agenda is as follows. I'm going to introduce the concept of composable disaggregated infrastructure, or CDI; this is not to be confused with the Container Device Interface, so for the rest of the talk CDI means composable disaggregated infrastructure. I'm going to describe some of the challenges we are facing integrating CDI into Kubernetes. I will briefly talk about Sunfish, an open framework that aims at managing some of these challenges. I will introduce the composable resource operator, a proof of concept we built to test composable GPUs in Kubernetes. We are going to take a look at the operator in action, and I'm going to explain some of the steps that happen under the hood to manage the resources. Finally, I will recap what we learned and what we think would be interesting to tackle in the future.

Composable disaggregated infrastructure, or CDI, is a set of disaggregated resources: physical resources such as compute, memory, network, and accelerators. These are generally organized in a resource pool and connected to a computer system via a fabric, PCIe for example. In terms of scale, today we are talking about rack-level composability. As part of CDI, we also have the software provided to provision and manage these disaggregated resources.

But why would you want CDI in your infrastructure? I can try to answer this question with an example. On the left you see a simplified representation of a bare metal Kubernetes cluster. We have three physical worker nodes, each with a predefined set of compute, memory, and GPU resources, and in this scenario all our Kubernetes components are running on these worker nodes. We skip the masters because they are generally not schedulable. Now we start to schedule pods on this cluster. For instance, we get a pod that requests half of the CPU resources on a node, half of the memory, and one accelerator. This can be easily scheduled because it requests less than the overall amount we can find on a single node. But if we receive a request to schedule a pod with an amount of resources that goes beyond what we have on a single server, that pod will remain pending. Here we see that the physical server chassis sets constraints on how we can allocate resources to our worker nodes.
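To make that second, unschedulable request concrete, here is a minimal sketch of a pod that over-requests relative to the example nodes. The node sizes implied by the comments and the exact figures are illustrative assumptions; only the nvidia.com/gpu resource name follows the standard NVIDIA device plugin convention:

```yaml
# Illustrative sketch: if every worker has, say, 64 CPUs, 128Gi of memory,
# and 2 GPUs, this pod asks for more GPUs than any single chassis holds,
# so the scheduler leaves it Pending.
apiVersion: v1
kind: Pod
metadata:
  name: oversized-workload
spec:
  containers:
  - name: main
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: "32"
        memory: 64Gi
      limits:
        nvidia.com/gpu: 4   # more GPUs than any one node has installed
```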
We have a couple of options to solve this at the physical level, but they are generally painful in a traditional data center. We could move resources: cordon two nodes, open them up, take the resources, for instance the accelerators, move them to another node, and bring the nodes back into the cluster. This is a very disruptive operation, because you might be running workloads on those worker nodes. Or you can provision a shiny new node with all the resources you want; but in this case you just needed GPUs, and you had enough GPUs in your cluster. They just were not configured the way you wanted. So if you want to add specific resources, you cannot just add what you want: you also have to provision memory and CPUs. This increases the cost in your data center and also increases the chances of overprovisioning resources.

With CDI, we still have server nodes, but some of the resources are not statically assigned to a node; they are organized in a resource pool. This relaxes the boundaries on what I can allocate to each node based on the incoming workloads. For instance, in this example the node starts with two compute units and one memory unit. The moment we receive a request for accelerators, we can dynamically attach them to the node and schedule our workload. From an operational point of view, we gain some benefits: we can compose high-value resources, such as GPUs, into the server, avoiding wasteful overprovisioning; we can add resources without also having to add, for instance, memory and CPUs via traditional servers; and we avoid intrusive reconfiguration of the existing physical machines.

But there is a price to pay, and the price is increased management complexity. If before I was managing the single worker node, now I have to manage the single resource, a much smaller granularity. And if we push this complexity of managing resources onto the end user, we create more obstacles than benefits. The user needs to learn a new API to drive attachment and detachment of resources on the disaggregated infrastructure, and they need to decide what to connect to a specific node at a certain point in time. So we increase the risk of misconfigurations, and we decrease the level of security, because we give direct access to our infrastructure to a user who might not even want to deal with this kind of problem.

We can try to remove some of these pain points by handling this complexity ourselves and offering an integration of CDI into Kubernetes. For instance, we could hide the underlying complexity of the infrastructure by providing a simple interface to request resources; in Kubernetes, we could use a CRD. We could take care of attaching and detaching the resources safely, because just attaching a resource to a node does not mean it is visible in Kubernetes straight away; there is some management to be done. What you will see in the rest of the presentation is basically our journey trying to tackle some of these challenges, focusing mainly on GPUs.

Talking about complexity, as if the situation was not complex enough, your data center might look like this. You might have multiple CDI solutions in your data center, coming from different vendors, and each vendor brings its own API and its own mechanism to describe and manage resources. In the picture, for instance, you see three different solutions with three different APIs, represented by the different shapes. And while the APIs are different, you might see the same device present in multiple of these resource pools: if you want a specific device, for instance a GPU, you might find it in the first pool and in the second pool. So now you have to learn multiple mechanisms and multiple APIs to request resources, which brings even more complexity. This is when Sunfish sparked our interest. Sunfish is an open framework under the OpenFabrics Alliance.
It aims at tackling and managing complex composable systems. It fits this scenario because it abstracts the vendor-specific APIs and resource representations using well-known standards, like DMTF's Redfish and SNIA's Swordfish. If you are familiar with the management of physical infrastructure, you might have seen these standards before: many BMCs offer this kind of API for managing infrastructure in software. With Redfish, for instance, you can power a node off and on, set BIOS configurations, and so on.

Sunfish also creates a unified view of all the available resources in your data center, so you don't have to query different solutions and different endpoints to get one. And it defines components for the dynamic selection and reconfiguration of infrastructure resources. This is useful because the user no longer needs to take the decision: there is a layer that takes care of selecting which resource should be attached based on the current state of the fabric. From a stack point of view, Sunfish sits in front of the vendor-specific APIs and provides a single endpoint to inspect and compose resources based on user-defined policies.

But we are still missing the window inside Kubernetes that would allow the user to drive composition and request composable resources. For this reason we built the Composable Resource Operator, a proof of concept to integrate composable resources into Kubernetes. It allows the user to request resources via a CRD called ComposabilityRequest, and it handles the attachment and the detachment of the resources. Given that this is a POC, we had to start with a specific type of device to manage, and we decided to select GPUs. This might not come as a surprise given what we have heard this year at KubeCon. For us, GPUs are a good fit because many users tend to request GPUs, and they are opinionated in the way they request them: how many, and what type. From a stack point of view, the operator runs as part of Kubernetes, and finally we have a way for the user to request composable resources without interacting with the lower levels of the stack.

This is an example of a ComposabilityRequest. The user can describe a target node, which is where they want their resources to be allocated, and there is a section describing the resources that are necessary. We might have different types of resources, not only GPUs, which in this case are described as scalar resources; in the future we might have non-scalar resources, like memory, which might have to be handled differently. We also had to pick a bare minimum of information for the user to provide to drive composability: how many GPUs they need and the model of the device they want. Here we use the same label that you would see on your node object in Kubernetes when, for instance, the NVIDIA GPU Operator sets up the driver and all the services. Now, this is just what we picked for our POC; you might already have opinions on how to improve this definition, and I will make some considerations on this at the end of the talk.
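To give a rough idea of the shape of such a request, here is a hedged sketch of a ComposabilityRequest along the lines just described. The API group, version, and field names are assumptions for illustration, not the operator's exact schema; the model string follows the nvidia.com/gpu.product label convention mentioned above:

```yaml
# Hedged sketch of a ComposabilityRequest; the API group and field names
# are illustrative assumptions, not the operator's exact schema.
apiVersion: cro.example.com/v1alpha1
kind: ComposabilityRequest
metadata:
  name: gpu-request
spec:
  targetNode: composable04            # where the resources should be attached
  resources:
    type: gpu                         # scalar resource type
    count: 4                          # how many devices to compose
    model: NVIDIA-A100-SXM4-40GB      # same value as the nvidia.com/gpu.product node label
```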
So now we can look at the demo. In the demo I am going to create a pod requesting GPUs on a cluster that has no GPUs; as a result, the pod will remain pending. But this cluster has a node that supports composability, so we can create a composability request specifying an amount of resources that meets the requirements described in the pod, for instance requesting A100 GPUs. Once these GPUs are attached, the pod will run, and we are going to see its status change to Completed.

The demo is recorded, for time constraints, and it is shown via a terminal with three panes. On the left we execute our commands to create pods and composability requests. On the top right we watch the allocatable resources on our composable node and see how they change as we interact with the system. On the bottom right we are connected via SSH to the composable node, watching for devices on the host via the lspci command. The two panels on the right are useful because we can see how the state on the host and the state in Kubernetes change over time.

Let's first list our nodes. We see that we have six nodes: three are masters and three are workers. This is actually an OpenShift cluster, and one of the workers, composable04, is our composable node. Now we can start watching the allocatable resources: we have 96 physical cores and 125 gigabytes of memory, but no GPUs. We can also look at the physical devices with lspci, grepping for NVIDIA, and none are available.

Let's look at the pod we are going to create. It's called nvidia-smi-pod; it tries to list GPUs as part of its command, and it requests NVIDIA GPUs. We create the pod and inspect the current state: the status is Pending, which was expected, since we have no GPUs in our cluster. Let's see why it was not scheduled. We can describe the pod and look at the events: three nodes have an untolerated taint because they are masters, and three have an insufficient amount of nvidia.com/gpu devices. This fits what we said before.

As a next step, we look at the composability request we are going to create. It has a target node, one worker node in our cluster, and the definition of the GPUs that I need: four NVIDIA A100s. We create the composability request and, looking at the bottom right, we see devices start to appear on the host. These are the devices that were attached because of our request for composable resources. It's interesting to see that the state of the host and the state of the Kubernetes node object are not in sync at this point; we need a couple of minutes for Kubernetes to actually see the GPUs. For the time being, we can look at the state of the composability request and see that it is pending, and it will stay pending until the two views, the one on the host and the one in Kubernetes, are in sync.

At this point we can skip ahead, and now we have four GPUs available and visible as allocatable resources. We look at the composability request again and see that it is online. But what happened to our pod? It was pending before. We can inspect its state and see that it is now Completed, and it ran on composable04. Let's see what the pod saw during its execution: four GPUs, the same four GPUs that we previously attached dynamically.
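For reference, the demo pod probably looked something like this hedged reconstruction; the image, the exact command, and the GPU count are assumptions based on what the recording shows:

```yaml
# Hedged reconstruction of the demo's nvidia-smi pod; image, command,
# and GPU count are assumptions based on what the recording shows.
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-pod
spec:
  restartPolicy: Never                            # run once, then report Completed
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.4.1-base-ubuntu22.04    # illustrative CUDA base image
    command: ["nvidia-smi", "-L"]                 # list the GPUs visible to the pod
    resources:
      limits:
        nvidia.com/gpu: 4                         # satisfiable only after composition
```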
As a next step, we can bring the system back to its initial state by deleting the composability request. We then see the state at the host level and in the Kubernetes node object change: no devices are visible anymore, and the number of allocatable GPUs is zero again.

So what is happening under the hood? When we create a composability request, the operator has a client implementation for the lower level of the stack, in this case Sunfish, and it forwards the request for composition. Sunfish inspects and selects the resources to compose and forwards the request to the vendor-specific CDI management layer that actually governs our resource pool. The resources are then connected to the computer system, where they are seen as if they were local, as we saw before with lspci.

It was interesting to see that the mechanisms we have today to manage the setup of GPUs are actually suitable for this scenario. Node feature discovery, the daemon we run on the node to recognize devices and label the node based on what it sees on the physical host, can see the GPUs, and we get a new label informing the rest of the services in Kubernetes that a GPU is now present on this node. At this point, the controller that is generally in charge of setting up the device driver and all the services needed to access the GPU can be triggered and install everything; in this case we are talking about the NVIDIA GPU Operator. I will not go through the details, but there have been many interesting talks describing how the NVIDIA GPU Operator works and how the device driver is installed, so if you want to know more, please go check them out. Once the process is done, the devices are listed as allocatable resources, and from the operator's point of view I just have to wait for the resources on the host and in Kubernetes to synchronize. Once the resources are available, the scheduler can start to schedule pods on the node.

Now, we've seen that for the attachment there was not much to be done; most of the heavy lifting was done by node feature discovery. You might think that the detachment would be just as easy, but that is not the case. If the operator simply forwards the request to undo the composition to the lower level of the stack, without preparing the node appropriately, the node reaches an unstable state and we see many faults coming from the kernel and the driver. The most common one is that the GPU "has fallen off the bus," and here we are talking about the PCIe bus. So we have a node in an unstable state: it is still running, and I can schedule traditional workloads that use memory or CPUs, but I have lost any ability to compose resources. Rebooting the node is in some cases not enough; I have to restart the whole fabric. So we see how a fault in one node can impact multiple elements of my physical infrastructure.

In this case the operator needs to get more involved. As a first thing, it sets the node as unschedulable, so that no new workloads end up on the node. It then evicts GPU workloads, and here we are talking about pods accessing the device, but also all the services that run on the node, for instance to export metrics; an example is the DCGM exporter. Finally, it forces the driver removal, and this is done using certain labels provided by the NVIDIA GPU Operator: for instance, there is a label indicating that the driver should be deployed, which is true when the driver is installed, and by setting it to false we can drive the eviction of the driver, and the same goes for the other services.
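To sketch the label mechanics: discovery advertises the device once it is attached, and the GPU Operator's per-component deploy labels can be flipped to tear the stack down before detaching. The exact label keys depend on the GPU Operator version, so treat the ones below as indicative rather than exhaustive:

```yaml
# Indicative node label state; exact keys depend on the GPU Operator version.
apiVersion: v1
kind: Node
metadata:
  name: composable04
  labels:
    # Set by discovery after the GPUs are attached (values illustrative):
    nvidia.com/gpu.present: "true"
    nvidia.com/gpu.product: "NVIDIA-A100-SXM4-40GB"
    # Flipped to "false" before detaching, so the GPU Operator DaemonSets
    # (driver, device plugin, DCGM exporter) are evicted from this node:
    nvidia.com/gpu.deploy.driver: "false"
    nvidia.com/gpu.deploy.device-plugin: "false"
    nvidia.com/gpu.deploy.dcgm-exporter: "false"
```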
Once the teardown is done, we can undo the composition: we forward the request to Sunfish, we reach the CDI management layer that governs the resource pool, and the resources are detached. Then we can set the node as schedulable again, and this leaves the node in a state where future requests for composition can be carried out successfully.

There are various aspects we realized are important while working on this POC. From a Kubernetes perspective, we tend to assume that resources are statically allocated. With CDI they might come and go, and we need tools in place to perform a correct attachment and, most importantly, a correct detachment of the resource, in a safe way, without impacting workloads that are still running on the node. Something else we realized is that the whole stack should strive for a coherent method to name and export metadata about the devices. This matters mostly for the CDI management layer, where I have to match the resources requested by the Kubernetes user against the physical infrastructure I am managing. If we don't have a way to match these two views, we have to fall back on device IDs or vendor IDs, and these are not very human-friendly. Last but not least, adding and removing resources is not enough. We have seen from the CRD that the user still needs to provide a target node and still needs to define the resource, so there is still a gap: a better way to drive composability would be to react directly to workload requirements, because that is what the user actually sets at the end of the day.

This brings me to the first potential next step: embrace the approaches already defined by the community, namely dynamic resource allocation, or DRA. With DRA the user can specify resource requirements via, for instance, a resource claim, and we could react to these resource claim requests to drive the composition. It was pleasant to see that this conversation has already started: yesterday, here at KubeCon, Naoki Oguchi and Gene Haase from Fujitsu presented a first proof of concept of composability using DRA, during the birds-of-a-feather session "Challenges of Kubernetes for Composable Disaggregated Computing." If you get the chance and you want to know more, go check it out.
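As a hedged illustration of that direction, a DRA resource claim for GPUs looks roughly like this. The DRA API has been evolving across Kubernetes releases, so the group/version and the device class name below are assumptions:

```yaml
# Rough sketch of a DRA ResourceClaim; the API version and deviceClassName
# are assumptions, as the DRA API is still evolving across releases.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: a100-claim
spec:
  devices:
    requests:
    - name: gpus
      deviceClassName: gpu.nvidia.com   # hypothetical device class for NVIDIA GPUs
      allocationMode: ExactCount
      count: 4                          # a composition layer could react to this
```

A composable backend could watch claims like this and drive the attachment before the scheduler binds the consuming pod.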
As a next point, Sunfish is still in development, and we plan to keep compatibility with Sunfish as it works towards an official release. We would also like to explore more clients, more vendors, and different types of resources, to see whether there are commonalities in how we handle composable resources or whether we need specialized mechanisms depending on the resource we want to connect. From a research point of view, it would be interesting to see if we can improve the sustainability of the data center: now that we have fewer constraints on how we allocate resources to workloads, can we take smart decisions and, for instance, make the data center smaller, or decrease the amount of resources that are allocated but not in use? And there are many more topics, for instance scheduling and topology: we take decisions at the scheduler level, but we have constraints at the fabric level, so how can we merge these two views and make decisions so that the two layers are not fighting each other?

I leave a couple of links to Sunfish and the composable resource operator. They are both open but still in development, so if you are interested in this topic and would like to contribute or collaborate, please reach out.

The appropriate way to manage and integrate composable resources should, I believe, be defined by the community; this is a problem that should be tackled together. So this concludes the talk. Thank you for listening, and I am happy to answer questions if we have time.

Hello, thank you very much for the talk. My name is Dimitri, I work on Metal3. My biggest question would be: how much actual hardware supporting Sunfish do you envision in data centers in the near future? Or is it more lab and research hardware at this point? Are we going to see it in data centers next year?

I think the whole concept is still emerging. We see vendors that are already selling solutions, but I think more work is needed to define a reliable way to manage these resources and to avoid big faults in the data center that might push away even the early adopters of this technology.