Next presentation, we have Liang Yan, a senior software engineer from DigitalOcean.

Hello everyone. Today I'm going to talk about how to optimize your scheduler and autoscaler on a Kubernetes cloud, from a heterogeneous-task perspective. Before that, a little bit about myself. I'm a software engineer at DigitalOcean, on the infrastructure fleet team. I'm a software engineer, but I'm very interested in hardware, so I focus on hardware virtualization for GPUs and networking, and I'm also interested in heterogeneous architecture acceleration and optimization for distributed machine learning.

Today we're going to look at this optimization work. First we'll see where the story begins and what problem we're facing, then have a quick look at Ray and KubeRay, and finally some of the work we're doing now and planning for the future.

So, where the story begins. You've probably noticed that when you look at an HPC cluster today, the whole system kind of assumes the hardware in the cluster is identical, or at least doesn't care about it: the controller just distributes all the workload to the workers and then waits for the results to come back for aggregation. However, what we observed is a different story. Looking at operator performance, and by operator I mean a PyTorch operator, not a Kubernetes operator, we noticed that performance is not really tied to the input data but to the GPU type. We also noticed that even if we choose better hardware, we may not get a better result as expected. So this told us that maybe we should take a look at the heterogeneous-GPU situation.

We started from FlexFlow, the first distributed machine learning framework that could provide automatic parallelization strategies. It's from CMU; it's task-based, with a logical resource mapping and a static task graph. Another reason we chose it is that it's built on Legion and Realm, an HPC runtime system, which supports a simulator, so it gave us the flexibility to verify things. We also tried the machine learning training operators in Kubeflow, but there were just too many restrictions there, as we'll see later.

During this work, one day one of my best friends introduced me to Ray. I had a look at it and was fascinated. As a quick introduction, Ray is a popular distributed machine learning framework today, the one behind ChatGPT and OpenAI. I'm particularly interested in three of its main features.

The first one is the actor model. Ray has two different kinds of work units. A task is a function and is stateless; you can put it anywhere as soon as a worker is ready. The other is an actor, which is a class and is stateful; you can't just assign its work anywhere, it has to go to the node where that actor already lives. Ray also has remote functions, where you can declare the resources you're going to use, and it has wait/get primitives to coordinate your dependencies during setup.

The second one is the resource model. Similar to FlexFlow, which I just talked about, Ray also has this mapping concept: it uses logical resources. When we declare resources on a remote function, as mentioned earlier, they are not pure physical resources, and even if we set a resource quota there, say two CPUs and five GPUs, it may not really use exactly that. This gives us the flexibility to try different machine types and device types later.
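To make the task/actor and logical-resource ideas concrete, here is a minimal sketch using the standard Ray Python API, assuming only a local Ray installation; the function and class names are made up for illustration, and the CPU/GPU numbers are logical declarations for Ray's scheduler, not physical pinning.

```python
import ray

ray.init()  # start or connect to a local Ray runtime

# A task: a stateless remote function. Ray can place it on any worker
# that has the declared logical resources free.
@ray.remote(num_cpus=1)
def preprocess(batch):
    return [x * 2 for x in batch]

# An actor: a stateful remote class. Its methods always run in the worker
# process where the actor was created, because it holds state.
@ray.remote(num_cpus=1)  # on a GPU node we could declare num_gpus=1 instead
class Trainer:
    def __init__(self):
        self.steps = 0

    def train_step(self, batch):
        self.steps += 1
        return sum(batch)

# Launch tasks, then use ray.wait / ray.get to coordinate dependencies.
futures = [preprocess.remote([1, 2, 3]) for _ in range(4)]
ready, pending = ray.wait(futures, num_returns=2)
trainer = Trainer.remote()
print(ray.get(trainer.train_step.remote(ray.get(ready[0]))))
```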
The last and also most important one is the dynamic task graph. FlexFlow, by comparison, uses a static task graph, and that one is easier; a dynamic graph is much more complicated, but of course it also gives more flexibility. It needs low-latency scheduling, and when you schedule work across two different worker nodes you need to make sure all the data is there too. For that, Ray has an in-memory object store based on Apache Arrow; I think that's what it's for.

We can also run Ray on Kubernetes, similar to all the other training operators in Kubeflow. Here it's kind of interesting: with the KubeRay operator, every time you launch a job it actually launches a cluster, and that cluster holds all the resources; you can set up CPUs and GPUs. This gave us the idea to try different resource setups. It also has an autoscaler, but similar to the other Kubernetes schedulers, or even the Kubeflow scheduler, it's quite straightforward and quite limited in functionality. There's not so much it lets us do, but we definitely want to improve it.

For our dev environment we're short on hardware, so we're only using one server with eight A30 GPUs, and we set the whole Kubernetes environment up on VMs. We have three Kubernetes nodes, and we set up a GPU pool and a CPU pool; you can see we deliberately used different flavors for that setup.

So far we have only focused on deep learning. We've implemented ResNet, and we're also thinking about implementing a transformer model later. We also added task-level batching: the idea is to wrap a series of tasks and send them to a worker together, and it actually works pretty well. I think PyTorch has something similar called operator fusion; it's the same kind of idea. Based on that, we found that sometimes, instead of giving a job more similar nodes, it's better to give it a bigger node, like from the vertical pod autoscaler, say one, two, or four GPUs. I think that also makes sense because with a dynamic task graph the tasks need to exchange data, and if we put them on one node we actually get better performance.

We also tried to take device topology into consideration. So far we've only added PCIe locality and different GPU types; in the future we may take more information into account, and we'll need new hardware for that. For example, we could also add NVLink or RDMA for the Apache Arrow object store: even though the in-memory store is already very fast, it's still not quite enough in some specific situations, so we're thinking we could put NVLink or even RDMA into consideration.

So yes, that's all for this lightning talk. Thank you so much. Like I said, this is a new project transitioning from FlexFlow to Ray, so feel free to reach out to me if you're interested, with ideas, questions, or collaboration. Thank you very much.
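As a rough illustration of the custom-resource idea mentioned above (telling the scheduler about GPU type and PCIe locality), here is a minimal sketch using Ray's custom resources; the labels gpu_type_a30 and pcie_group_0 are hypothetical names invented for this example, not anything the project or Ray itself defines.

```python
import ray

# On a single dev box we can fake heterogeneous nodes by declaring custom
# resources at startup; on a real cluster each node would advertise its own
# labels (e.g. `ray start --resources='{"gpu_type_a30": 1, "pcie_group_0": 1}'`).
ray.init(resources={"gpu_type_a30": 2, "pcie_group_0": 2})

# A task that must land on a node advertising a specific GPU type and PCIe
# locality group. The label names are hypothetical, not Ray built-ins.
@ray.remote(resources={"gpu_type_a30": 1, "pcie_group_0": 1})
def train_shard(shard_id):
    return f"shard {shard_id} placed with A30 / PCIe-group-0 affinity"

print(ray.get([train_shard.remote(i) for i in range(2)]))
```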