Good afternoon, everybody. My name is Gaurav Singh. I'm a product manager on OpenShift, and I have Sunyanan-san from IBM with me. Do you want to introduce yourself? See if your mic is working. Okay, it's working.

Yeah, I'm Sunyanan from IBM Research in Tokyo. I gave a session this morning.

What I really want to bring in front of the community is what I'm hearing from my customers, so that you get the perspective of the use cases we are dealing with. We had the session with Darren where we talked about scaling and cloud bursting. You saw a gentleman from a national lab who's working with IBM and Red Hat on the Flux scheduler. These are all based on the use cases we are seeing in the field, where we are helping customers.

For cloud and HPC/AI, what we are seeing is customers looking for cost-saving and elasticity kinds of use cases. GPU-based EC2 instances are expensive. You don't want to grab one and hold it for a job: when the job finishes, you don't want the instance to stay associated with it. You want it freed, so that it either goes back to the pool and you don't get billed for the usage, or it gets used for another job. Similarly, elasticity: if there are more jobs in the queue, you want to scale the same cluster by adding nodes, because creating a cluster from scratch and installing all the operators on top of it takes time, five to fifteen minutes.

So: cost saving, elasticity, reliability. Reliability in terms of the job. Say you are running a training job; a cloud instance is, by its nature, sometimes unreliable. What if you are a day into your training and your cloud instance gets deprovisioned or disconnected? You need to provide reliability for that running job, maybe by taking a snapshot, or by dividing the job into sub-jobs and running them across zones. And flexibility, which I talked about this morning: all my customers are looking to take advantage of the cloud in a holistic way, using the hardware capabilities it provides, whether GPUs, fast networks, or fast storage.

Now, the challenges we are seeing for HPC. First, communication between GPUs needs to be fast; there has to be a high-bandwidth pipe going from one GPU to another, and Sunyanan-san will spend the whole presentation on how we can use multi-NIC bandwidth. Second, fast loading of images. These AI/ML images tend to be very heavy. Think about downloading an image from an image registry: it's gigabytes of data. What if you want to cache it? Do you cache it on one node, or on every node, given that east-west traffic is faster than north-south? Rather than going to the registry all the time, you have a cache somewhere. We are seeing a prevalent use case around that. Third, queuing. We have talked about queuing in a couple of ways; we talk about it as a project, like MCAD that the IBM team is working on. But really, the problem we're trying to solve is that when jobs come in, they need a place to wait until resources are available for them to run. That is also the concept of gang scheduling: if a job needs five nodes and only some of them are free, the job doesn't get scheduled until all the resources are available, as the sketch below illustrates.
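To make the gang-scheduling idea concrete, here is a minimal sketch assuming a simple FIFO queue; the job names, numbers, and policy are illustrative, not taken from MCAD or any particular scheduler mentioned here.

```python
# Gang scheduling in miniature: a job is admitted only when ALL the
# nodes it needs are free at once; otherwise it waits and holds nothing.
from collections import deque

def gang_schedule(free_nodes: int, queue: deque) -> list:
    """Admit queued jobs in FIFO order; each entry is (name, nodes_needed)."""
    admitted = []
    while queue:
        name, needed = queue[0]
        if needed > free_nodes:
            break  # head job can't get its full gang: nothing partial runs
        queue.popleft()
        free_nodes -= needed          # reserve every node at the same time
        admitted.append(name)
    return admitted

jobs = deque([("train-a", 5), ("train-b", 4)])
print(gang_schedule(free_nodes=7, queue=jobs))  # ['train-a'] -- train-b waits
```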
Going forward we'll talk only about multi-NIC bandwidth, but I'll be happy to talk to you offline about the three other buckets we are working on. I'll hand it over to Sunyanan-san.

Thank you, Gaurav. For my part, I would like to start from this slide. As Gaurav already mentioned on his first slide, the cloud brings so many benefits to HPC and AI workloads, especially with containerization and orchestration systems. That is why we are working so hard to make the full cycle of AI and HPC workloads run on OpenShift, on top of worker nodes running on GPU servers in the cloud. In this talk, I would like to focus on the first challenge Gaurav mentioned on the previous slide: making multi-NIC bandwidth available to HPC workloads on top of OpenShift and on top of cloud instances.

This is the situation we start from on our virtual servers; VSI stands for virtual server instance on the cloud. We usually have a single primary NIC making the connection on a common VPC, or virtual private cloud. To make the network more supportive of HPC and AI workloads, we get machines with SR-IOV support, so we can bypass the hypervisor layers and enable almost the full bandwidth on the virtual instance. Some of our machines have two of these NICs, and we got about 200 gigabits per second of throughput and 20 microseconds of latency. We also use RDMA adapter technology to reduce latency further. And some instances combine those technologies, and there we finally get about 400 gigabits per second of throughput and 10 microseconds of latency at the virtual server instance level.

The next problem is: how can we bring this multi-NIC network solution from the cloud VSI down to the pod level? This slide shows two parts. The first is what we have in the default networking that Kubernetes systems provide right now. As you may already know, we usually have one primary interface on the pod, which makes the connection to the control plane and so on. With that one, we usually have to pass packets through some address translation or encapsulation to make the pod IP, the pod packets, routable to another instance, and that further reduces throughput and increases latency.

What we need for AI and HPC is this: we can leave the first, primary interface for the control plane, but we need to bring the secondary interfaces, the ones attached to the high-speed networks, directly to the pod, so that we get the same throughput and latency that we have at the instance level. There are two keys to achieving that. One is that we have to expose the pod's packets straight to the instance's interfaces, so that we don't have the overhead of encapsulation or address translation. The other thing we have to work on is that, after we make the direct route to the host, we have to make sure the pod packets are routable to the other instances; a sketch of that routing step follows.
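To illustrate the second key, here is a hedged sketch of what "making pod packets routable on the other instances" can look like at the host level: each host installs a plain L3 route for the pod CIDR block that lives behind each peer's secondary NIC. The addresses, block sizes, and the use of the `ip route` command are illustrative assumptions, not the operator's actual mechanism.

```python
# Make peer pods reachable without encapsulation or NAT: route each
# peer host's pod CIDR block straight out the secondary NIC.
import subprocess

# hypothetical discovery result: peer secondary-NIC IP -> its pod CIDR block
peer_pod_blocks = {
    "10.241.0.5": "192.168.0.64/26",   # pods on host A, behind its eth1
    "10.241.0.6": "192.168.0.128/26",  # pods on host B, behind its eth1
}

def install_routes(dev: str = "eth1") -> None:
    for gateway, pod_cidr in peer_pod_blocks.items():
        # plain L3 route: pod traffic goes directly via the peer's NIC
        cmd = ["ip", "route", "replace", pod_cidr, "via", gateway, "dev", dev]
        print(" ".join(cmd))                # dry run: print the commands
        # subprocess.run(cmd, check=True)   # uncomment to apply for real

install_routes()
```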
In Kubernetes, we have a project called Multus, already mentioned in a previous talk, which does an excellent job of allowing us to attach multiple network interfaces to a pod. However, to make that happen, there are multiple steps that still have to be done manually, and people suffer going through them. These are the steps. First, the administrator has to log into the instance and find out which interfaces are available: the interface names, the network addresses, and so on. Then they have to do the configuration that makes the pod addresses routable. Then they have to define an attachment, one by one for each interface, and annotate it to the pod. With some providers there is infrastructure support that handles the pod IP addresses, to make sure that even after you expose the pod to the host, it stays routable to the other instances. But the other steps are left to the administrator and the users to handle.

In this talk, I would like to present our recent work on the Multi-NIC CNI operator. This operator handles the configuration, discovers the interfaces, and handles everything. What is left to the administrator is to define just a single network definition, in a very simple way, and annotate it to the pod.

The benefit of Multi-NIC CNI is not only the adoptability and usability you saw on the previous slide, where the admin and the user don't have to think much about the network and can still enable multiple NICs in the pod. We also consider the dynamicity of the cluster. Machines can be scaled out and scaled in at any time, depending on workload demand. Interfaces can be added or removed, or become unavailable, at any time; nodes can be disconnected or reset at any time. That dynamicity is something we have to handle. And we consider scale as well: we expect nodes to scale to 100 or more, and we have to handle that, together with a large number of pod IP allocations, simultaneously.

So here is the key to how Multi-NIC CNI achieves this dynamicity and scale. First, isolation: we isolate the route configuration for each network, so we can handle its dynamicity separately from whatever else the host is doing. Second, synchronization: we synchronize from time to time to keep the latest state of the host interfaces up to date, so we can handle failures and do recovery. Third, we minimize communication with the API server. I suppose some of you have faced the same problem: when everything depends on custom resources, you have to contact the API server many times, and you run into rate limits and similar issues. Because most of the custom resources we create are automatically created and controlled by the Multi-NIC CNI, we can cache them and minimize API server communication. The fourth piece is distribution: we do the IP management in a distributed way, independently on each host, to make sure it can scale; a sketch of that idea follows.
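As a concrete illustration of the distribution piece, here is a hedged sketch of host-local IP allocation: each host owns a pre-assigned block per interface and hands out pod IPs from it without touching the API server on the hot path. The block sizes and interface names are assumptions, not the project's actual scheme.

```python
# Distributed IPAM in miniature: each host allocates pod IPs locally
# from its own per-interface CIDR block, with no central coordination.
import ipaddress

class HostBlockIPAM:
    def __init__(self, block: str):
        self.candidates = list(ipaddress.ip_network(block).hosts())
        self.in_use = set()

    def allocate(self) -> str:
        for ip in self.candidates:
            if ip not in self.in_use:
                self.in_use.add(ip)
                return str(ip)
        raise RuntimeError("pod IP block exhausted on this host")

    def release(self, ip: str) -> None:
        self.in_use.discard(ipaddress.ip_address(ip))

# this host's blocks, one per secondary interface (hypothetical layout)
ipam = {"eth1": HostBlockIPAM("192.168.0.64/26"),
        "eth2": HostBlockIPAM("192.168.64.64/26")}
pod_ips = [ipam[dev].allocate() for dev in ("eth1", "eth2")]
print(pod_ips)  # one IP per attached interface, e.g. two for two NICs
```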
If you want to try Multi-NIC CNI, it is already available on OperatorHub, not only the OpenShift OperatorHub but also the community one. To get started with Multi-NIC CNI, there are just three simple steps. First, install the operator. Second, define the network: the Multi-NIC network definition is only a single one, and no matter how many interfaces you have, they can be attached as an interface pool. Third, annotate it to the pod. This sits on top of Multus, so people who are used to Multus will easily understand and apply our Multi-NIC CNI as well; a schematic example of the last two steps follows below.

From this slide and the next, I will go into detail about the Multi-NIC CNI: how it works and what it is composed of. I will go very quickly, because I don't think you are interested in every detail, but of course, if you are, you can reach out to me for more.

Basically, the CNI is composed of three components: the controller, which is the main part, plus the daemons and the CNI itself, which are very similar to other CNI plugins. There are four custom resources to manage. The main one is the Multi-NIC network, which you still have to define. The rest, three of them, are auto-created and managed by the Multi-NIC controller, so you don't need to concern yourself with them.

There are three operating flows that can trigger the CNI. The basic flow starts with the user deploying the operator, then deploying Multi-NIC networks, then creating and deleting pods; these are the simple triggers. In addition, if users scale the nodes, increasing the number of nodes or changing the interfaces on a node, that triggers a synchronization, and if a node restarts or fails, that triggers another synchronization.

Starting from the basic operating flow: when you deploy the operator, it automatically creates a daemon deployed on each host along with the CNI, and it automatically discovers the host interfaces. It finds the secondary interfaces, their device IDs and so on, and lists them in the host-interface custom resource. Then, when you deploy a Multi-NIC network, it automatically creates multiple network attachment definitions based on the host-interface definitions; it also automatically computes the pod CIDRs and the IP pools for management, and then configures the routes on the host to make sure the computed pod IPs are routable. Then, when you annotate a pod with that network, the request is automatically delegated to Multus, Multus calls the Multi-NIC CNI, and the Multi-NIC CNI automatically selects the interfaces suitable for the situation (there can be policies for the selection) and automatically allocates the IPs for the pod; for example, if you define two additional interfaces, it allocates two IP addresses. Likewise, when a pod is deleted, the request again goes through Multus to the Multi-NIC CNI, which updates the IP pools and does that synchronization. If you change the nodes, the periodic synchronization detects it, and then it updates all the custom resources and does the reconfiguration.
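To picture the "define one network, annotate the pod" workflow, here is a schematic sketch of the two objects involved. The resource layout, field names, and the multinic.fms.io/v1 group are assumptions about the project's schema, so check the Multi-NIC CNI documentation for the real one; the k8s.v1.cni.cncf.io/networks annotation key is the standard Multus one.

```python
# Schematic shapes of step 2 (one network definition) and step 3
# (the pod annotation); the field names below are illustrative guesses.
import json

multi_nic_network = {
    "apiVersion": "multinic.fms.io/v1",   # assumed API group/version
    "kind": "MultiNicNetwork",
    "metadata": {"name": "multi-nic-sample"},
    "spec": {
        "subnet": "192.168.0.0/16",       # global block the controller carves
        "plugin": {"cniVersion": "0.3.0", "type": "ipvlan"},
    },
}

pod_metadata = {
    # single annotation, Multus-style: attach the one named network
    "annotations": {"k8s.v1.cni.cncf.io/networks": "multi-nic-sample"}
}

print(json.dumps(multi_nic_network, indent=2))
print(json.dumps(pod_metadata, indent=2))
```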
Similarly, if a node is restarted, this can be detected because the routes are no longer available on the host, so it can reconfigure and make the routes available again.

This is now open source under the foundation-model-stack organization, so you can check it out, and any contribution is very welcome.

What can we achieve with the Multi-NIC CNI? We tried attaching two additional interfaces with SR-IOV, and we got about half the latency, as expected, down to around 3.5 and 2.7, and with two NICs the network bandwidth can be increased. Furthermore, not only in microbenchmarks of latency and throughput: we also tested with a real workload, a time-series model with a large number of learning parameters, and we got about 89% parallel efficiency on that AI deep-learning workload in our system.

So, in summary, we have the Multi-NIC CNI, which can make Kubernetes networking ready for AI and HPC workloads, with not only adoptability and usability but also attention to dynamicity and scale, through the four key pieces: isolation, synchronization, minimization, and distribution. That is all of my talk; any questions are welcome. Thank you for joining this talk.

Thank you. Questions?

I had two questions that I guess are related. If you have a pod with multiple IP addresses associated with it, how do you do service discovery for it? Say I want to talk to that pod, and I've got four different addresses to use to take advantage of all that bandwidth: how do I address them? And, alternatively, is there a way to present those four individual physical NICs as a bond to the pod, so that I don't have to worry about any of that and it just appears to the pod as a single interface?

Okay, I'm not sure I understand your question completely, but what the Multi-NIC CNI discovers is what is visible at the host level. We have daemons, the multi-nic daemons, that read the interface configs with the netlink library; from that we can get the addresses, figure out whether an address is primary or secondary, and get most of the information we need.

Sure. I guess my question was more: if pod A wants to talk to pod B, how does it know where to go? Normally you would do this by something like pod.namespace.pods.name, right? How does that work when you've got lots of NICs in the pod?

It's still based on the IP address. In Multi-NIC CNI we do a host-interface-local IPAM. With the CIDR resources we have, we keep the information that, if your pod is on this host with this interface, it will be in this range, and we put that range in the route tables, so the host knows that if this IP address is used, it belongs to this pod, on this host, on this interface.
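A hedged sketch of that host-interface-local mapping: the CIDR records tie each IP range to a (host, interface) pair, so any pod IP can be located without a central lookup. The ranges and names below are hypothetical, not the operator's actual records.

```python
# Map a pod IP back to the (host, interface) that owns its CIDR range.
import ipaddress

cidr_records = [  # hypothetical view of the auto-managed CIDR resource
    {"host": "worker-0", "iface": "eth1", "range": "192.168.0.64/26"},
    {"host": "worker-0", "iface": "eth2", "range": "192.168.64.64/26"},
    {"host": "worker-1", "iface": "eth1", "range": "192.168.0.128/26"},
]

def locate(pod_ip: str):
    """Return the (host, interface) whose range contains this pod IP."""
    addr = ipaddress.ip_address(pod_ip)
    for rec in cidr_records:
        if addr in ipaddress.ip_network(rec["range"]):
            return rec["host"], rec["iface"]
    return None  # not a multi-NIC pod address

print(locate("192.168.0.130"))  # ('worker-1', 'eth1')
```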
Sure, yeah, okay, I understood that; I'll follow up with you. I guess the question is more about DNS resolution: what does the DNS keep track of? Does the pod have multiple names, one name for each IP? Is that how it works? Yes. So I guess that answers your question. Oh, I'm sorry, could you repeat the question again?

So the host has its three or four additional NICs, and you've exposed those directly into the pod's network namespace, and you've set it all up so the routing works; I understand that. My question is: can we make it so that those four individual physical interfaces are exposed into the pod's network namespace as a bond, rather than as individual interfaces, using LAG or something, so that you don't have to deal with them separately?

Let me come back to you on that afterwards.

Okay, all right.

And there's a GitHub page for this; you can always ask a question over there, and there are a lot of people who can help you. But the question is, why would you want to represent them as one? The idea is that you want different interfaces, with each one serving a different purpose. Let's have a follow-up discussion after this; it seems like an interesting topic. Any other questions?

Hi, thank you for your talk. I was triggered by the "minimize API server communication" point. I don't have a background in AI/ML; I just run an OpenShift platform for a large bank. But from a security perspective we had a very similar discussion: could we basically cut off all the traffic from the workloads to the API server and only allow the hosts to connect to the API server? Maybe you could work with the same pattern.

So what you're saying is that, for security reasons, you want all the kubelets to connect to the API server over their own network. What Multus offers is basically connecting multiple networks and separating them out. I guess what he's suggesting is that any communication between the orchestrator components happens on its own separate network, and the applications, for security reasons, have their own network; you can utilize multi-NIC to do that. But that's multi-NIC at the node level, not necessarily at the pod level, right? Yes, exactly; that is one of the purposes of multi-NIC as well, to use it for different purposes. Maybe I'll connect with you afterwards. Okay, that makes sense. Any other questions?

Do you deal with setups where you have multiple NICs at the node level and you want to assign one NIC to each pod? So it's not multi-NIC at the pod level: for example, you have a NUMA architecture, or a NIC connected to a GPU, and you want to assign a pod to that GPU and give it that NIC. Does your plugin handle that setup?

We are thinking about that, but it's not there yet. We are also thinking about topology.

All right, thank you so much. There's one more slide where we show the whole team. There's a group of people who have been helping us throughout this effort; it's just the two of us you see up here, but there's a full team behind us.

Maybe building on the question from Abdullah about multi-NIC: if you have a setup with both InfiniBand and Ethernet, is that something you would handle with the multi-NIC controller?

No, that's something that's already available on the virtual servers. What the multi-NIC operator does is bring whatever interfaces are there to the pod. We don't control the lower levels.

Okay, so both would be exposed. Yes.

All right, thank you. Very interesting, thank you so much. Thank you.