Good afternoon, everyone. My name is William Wang, and I am a maintainer of the Volcano community. This is my partner today, Hongcai, from Huawei. Today we will share our thinking about maximizing GPU utilization over multi-cluster for cloud-native AI.

So first, let's have a look at why GPU utilization is so important to us. First, the high price of GPUs leads to high cost. Second, as AI models become larger and larger, the demand for GPUs explodes, and GPUs are in shortage from time to time. Third, it takes a long time, from placing an order to getting a GPU cluster ready for production, maybe several weeks or even more.

There are several challenges in improving GPU resource utilization. The first challenge is that GPU devices are scattered across on-premises environments, IDCs, and cloud vendors, which makes it difficult to manage them uniformly. Second, different teams share the GPU resources, so it is hard to share them fully. Finally, the different GPU models and GPU generations make it difficult to unify them. So how do we build a platform that consolidates these distributed GPU resources, with different models and different versions, and makes full use of them?

From our perspective, an ideal platform should have the following capabilities. First, the platform can uniformly manage GPU resources coming from different IDCs, different regions, and even different cloud providers. At the same time, the platform can normalize the computing power of GPUs from different generations. Second, the ideal platform should have an intelligent scheduler that can schedule a task to the best-matching resources based on a global resource view, with a rich set of scheduling policies, such as lowest cost first, highest performance first, and so on.

So we propose to leverage two CNCF projects, Karmada and Volcano, to provide a multi-cluster AI solution. Karmada is responsible for multi-cluster access and unified resource management; as mentioned earlier, these resources are distributed across different regions. Karmada also covers scenarios like high availability, fault tolerance, and job isolation. The Volcano Global component is responsible for lifecycle management of the AI jobs and also covers scheduling. The global scheduler coordinates with the in-cluster scheduler, and a set of scheduling strategies helps achieve better performance and utilization.

Regarding AI workloads, we think it is extremely important to have a job abstraction that supports multiple kinds of training frameworks, like Ray, TensorFlow, and PyTorch. So what are the common characteristics of AI jobs? First, there are pods with different roles in a job, such as PS (parameter server) pods, worker pods, and evaluator pods inside a TensorFlow job, and a master pod and worker pods in a PyTorch job. Second, AI job resource allocation needs to support all-or-nothing semantics. In addition, there is communication between the pods inside a job, so there is also some topology inside the job, such as data parallelism, tensor parallelism, pipeline parallelism, and so on. We think the Volcano Job CRD currently provides a good abstraction for this; a minimal sketch of such a job is shown below.
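To make that job abstraction a bit more concrete, here is a minimal sketch of a Volcano Job with two roles (a master and workers), written as a Python dict that mirrors the manifest you would apply. The job name, image, GPU counts, and replica numbers are illustrative assumptions, not values from the talk.

```python
# Minimal sketch of a Volcano Job with two task roles (master/worker).
# The job name, image, and resource sizes are illustrative assumptions.
import yaml  # pip install pyyaml

volcano_job = {
    "apiVersion": "batch.volcano.sh/v1alpha1",
    "kind": "Job",
    "metadata": {"name": "pytorch-demo"},
    "spec": {
        "schedulerName": "volcano",
        # All-or-nothing: the job only starts when 3 pods can be placed together.
        "minAvailable": 3,
        "queue": "default",
        "tasks": [
            {
                "name": "master",
                "replicas": 1,
                "template": {"spec": {
                    "restartPolicy": "OnFailure",
                    "containers": [{
                        "name": "master",
                        "image": "pytorch/pytorch:latest",
                        "resources": {"limits": {"nvidia.com/gpu": 1}},
                    }],
                }},
            },
            {
                "name": "worker",
                "replicas": 2,
                "template": {"spec": {
                    "restartPolicy": "OnFailure",
                    "containers": [{
                        "name": "worker",
                        "image": "pytorch/pytorch:latest",
                        "resources": {"limits": {"nvidia.com/gpu": 1}},
                    }],
                }},
            },
        ],
    },
}

print(yaml.safe_dump(volcano_job, sort_keys=False))
```

The two task lists give the role structure (master/worker), and minAvailable carries the all-or-nothing semantics mentioned above.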
The next topic is a basic and very important feature, either in a single cluster or across multiple clusters: gang scheduling. Let's assume that a job has 10 replicas. First, we calculate the maximum number of replicas that can be deployed in each cluster based on the available resources on its nodes. Then the scheduler filters the clusters: any cluster that can host fewer than 10 replicas is filtered out, which gives us gang scheduling at the multi-cluster level (a simplified sketch of this filtering step is shown at the end of this part). In the single cluster, gang scheduling is also needed, because there is a race condition between the global scheduling decision and the in-cluster HPA.

After solving the basic gang scheduling problem, let's look at how to improve GPU utilization at the multi-cluster level. Karmada builds a resource view of all member clusters, and then the Volcano scheduler can schedule jobs based on that resource view and the user's preferences. One of the most basic strategies to improve utilization is bin-packing, where the scheduler places pods on the clusters that are already more heavily allocated, so that idle nodes can be scaled down and released. For inference workloads, instead of bin-packing, a spread policy may be better.

To improve GPU utilization, the single-cluster strategy is important as well. We recommend configuring resource-sharing policies according to your scenarios. Volcano supports queues for resource sharing between multiple tenants, and there are two kinds of resource-sharing mechanisms. The first one is the proportion strategy: users configure a weight for each queue, and the scheduler allocates resources fairly based on these weights automatically. Idle resources can be shared with jobs in other queues, and when there is resource competition, the scheduler reclaims resources by weight automatically. So even as the cluster scales up and down, the scheduler always keeps the ratio between different users. This mechanism is quite flexible and users don't need to configure too many things (a small sketch of this weight-based sharing is also shown below). The second mechanism is capacity scheduling, which allows users to configure their deserved resources along with a maximum capacity, the capability. The deserved resources are the amount of resources a queue can reclaim back from other queues. These two ways of resource sharing help users make full use of their GPU resources inside a cluster.

Another way to improve utilization is to deploy multiple kinds of workloads inside one cluster, such as big data workloads together with microservices, or training workloads together with inference workloads. However, there is a prerequisite: ensuring the QoS of high-priority tasks such as online services. These are typically advertising, recommendation, or search workloads, which are extremely latency-sensitive, so we need to guarantee their QoS. So we have added a new component named the Volcano agent. This component works with the underlying operating system to provide kernel-level acceleration to ensure the QoS of high-priority tasks. At the same time, the component also does oversubscription to increase the container density on each node.
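Going back to the multi-cluster gang scheduling described at the start of this part, here is a simplified Python sketch of the filtering step: compute how many replicas each cluster could still host, then drop every cluster that cannot hold the whole job. The cluster names, node layouts, and GPU counts are assumptions for illustration; the real Volcano and Karmada logic considers much more, such as topology, fragmentation, and queues.

```python
from typing import Dict, List

# A cluster is described here simply by the free GPUs on each of its nodes.
# Cluster names and numbers are illustrative assumptions, not a real API.
clusters: Dict[str, List[int]] = {
    "idc-beijing":  [8, 8, 4],      # free GPUs per node
    "idc-shanghai": [2, 2, 2],
    "cloud-east":   [8, 8, 8, 8],
}

def max_replicas(free_gpus_per_node: List[int], gpus_per_replica: int) -> int:
    """Upper bound on replicas this cluster can host, counted node by node."""
    return sum(free // gpus_per_replica for free in free_gpus_per_node)

def gang_filter(job_replicas: int, gpus_per_replica: int) -> List[str]:
    """Keep only clusters that can place the *whole* job (all-or-nothing)."""
    return [
        name for name, nodes in clusters.items()
        if max_replicas(nodes, gpus_per_replica) >= job_replicas
    ]

# A job with 10 replicas of 1 GPU each: idc-shanghai (6 replicas max) is dropped.
print(gang_filter(job_replicas=10, gpus_per_replica=1))
# ['idc-beijing', 'cloud-east']
```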
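And here is a small sketch of the weight-based (proportion) sharing mentioned above: each queue's deserved share follows its weight, idle capacity can be borrowed by other queues, and on contention an over-using queue is reclaimed back toward its deserved share. This is a toy model of the idea, not Volcano's actual plugin code; the queue names, weights, and GPU totals are made up.

```python
# Toy model of weight-based queue sharing: the deserved share follows the weight,
# and borrowed resources are reclaimed back to the deserved share on contention.
TOTAL_GPUS = 100
queues = {"team-a": {"weight": 3}, "team-b": {"weight": 1}}

def deserved(total: int, queues: dict) -> dict:
    """Each queue's fair share, proportional to its configured weight."""
    total_weight = sum(q["weight"] for q in queues.values())
    return {name: total * q["weight"] // total_weight for name, q in queues.items()}

def reclaim(usage: dict, total: int, queues: dict) -> dict:
    """How many GPUs each over-using queue should give back under contention."""
    share = deserved(total, queues)
    return {name: max(0, used - share[name]) for name, used in usage.items()}

print(deserved(TOTAL_GPUS, queues))                      # {'team-a': 75, 'team-b': 25}
# team-b was idle, so team-a borrowed up to 90 GPUs; now team-b needs resources:
print(reclaim({"team-a": 90, "team-b": 10}, TOTAL_GPUS, queues))
# {'team-a': 15, 'team-b': 0}
```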
For the next part, let me hand over to Hongcai to introduce more about how to use Karmada to manage multiple clusters more efficiently. Okay, thank you. We have received a lot of feedback that people want to build a unified infrastructure on top of multiple clusters, and that is why Karmada comes in.

So Karmada is a Kubernetes management system that enables you to run your applications across multiple clusters, even across multiple clouds. If you are familiar with Kubernetes, it will be easy to get started, because the components in the Karmada control plane are very similar to Kubernetes: there is a Karmada API server, a controller manager, and a scheduler. The way Karmada manages a member cluster is that the cluster needs to be registered to Karmada. Some clusters might have limited network access; for these, Karmada provides a pull mode to let the cluster register itself, and a component named karmada-agent is used for that. Karmada also provides a lot of features, such as cross-cluster application failover, multi-cluster service discovery, and so on.

Before learning more about Karmada, let's first understand its core concepts. The first is the resource template. Karmada can take the Kubernetes native API: after the YAML is created in the Karmada API server, Karmada takes this configuration as the raw configuration, which we call the resource template. With another API, the PropagationPolicy, you can describe how and when to distribute your application to multiple clusters. With a third API, the OverridePolicy, you can configure different settings for the application running in different clusters; for example, applications running in different clusters may need different image registries, different labels, annotations, and so on, and you can do that by configuring an OverridePolicy. Another two APIs are the ResourceBinding and the Work; they are internal APIs. Basically, a resource template created in the Karmada API server is driven by a PropagationPolicy, combined into a ResourceBinding, and then distributed into the different execution spaces as Work objects. After that, the manifest and the configuration are delivered to the target clusters.

The picture on the left shows the usual way that people manage multiple clusters: the administrator has to repeat a lot of operations on each of these clusters. With Karmada, you can let Karmada manage these clusters, and all the operations can be done against the Karmada API server. Over the past few years, I have talked with many infrastructure administrators, and I've noticed that the way they manage multiple clusters is quite primitive. They usually maintain a list of clusters along with the kubeconfig files, and when they want to operate on a particular cluster, they need to switch the kubeconfig again and again, which is annoying.

The way Karmada distributes your application among multiple clusters is based on the PropagationPolicy. The scheduler selects target clusters based on the rules defined in the PropagationPolicy, and the resulting manifests may vary across member clusters: before Karmada delivers the configuration to a target cluster, it can override any field in it. So you can use a PropagationPolicy to replicate an application to multiple clusters, or to split the application's replicas across multiple clusters; simplified examples of both policies are shown below.
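To illustrate the resource template plus PropagationPolicy flow just described, here is a simplified example, again written as a Python dict that mirrors the YAML. The application name, cluster names, and weights are assumptions for illustration; please treat the field layout as a sketch and check the Karmada documentation for the authoritative API.

```python
import yaml  # pip install pyyaml

# Simplified PropagationPolicy: take the 'demo-training' Deployment (the resource
# template) and divide its replicas across two member clusters by static weight.
# Names, clusters, and weights are illustrative assumptions.
propagation_policy = {
    "apiVersion": "policy.karmada.io/v1alpha1",
    "kind": "PropagationPolicy",
    "metadata": {"name": "demo-training-pp"},
    "spec": {
        "resourceSelectors": [
            {"apiVersion": "apps/v1", "kind": "Deployment", "name": "demo-training"}
        ],
        "placement": {
            "clusterAffinity": {"clusterNames": ["member1", "member2"]},
            "replicaScheduling": {
                # 'Duplicated' would replicate the full app to every cluster;
                # 'Divided' splits the replicas between them.
                "replicaSchedulingType": "Divided",
                "replicaDivisionPreference": "Weighted",
                "weightPreference": {
                    "staticWeightList": [
                        {"targetCluster": {"clusterNames": ["member1"]}, "weight": 2},
                        {"targetCluster": {"clusterNames": ["member2"]}, "weight": 1},
                    ]
                },
            },
        },
    },
}

print(yaml.safe_dump(propagation_policy, sort_keys=False))
```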
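And here is a matching sketch of an OverridePolicy that rewrites the image registry and adds a label for one particular cluster, which is the per-cluster customization mentioned above. The registry address, label, and cluster name are hypothetical, and the exact overrider fields should be verified against the Karmada docs.

```python
import yaml  # pip install pyyaml

# Simplified OverridePolicy: for 'member1' only, replace the image registry and
# add a label before the manifest is delivered. All values are illustrative.
override_policy = {
    "apiVersion": "policy.karmada.io/v1alpha1",
    "kind": "OverridePolicy",
    "metadata": {"name": "demo-training-op"},
    "spec": {
        "resourceSelectors": [
            {"apiVersion": "apps/v1", "kind": "Deployment", "name": "demo-training"}
        ],
        "overrideRules": [
            {
                "targetCluster": {"clusterNames": ["member1"]},
                "overriders": {
                    "imageOverrider": [
                        {"component": "Registry", "operator": "replace",
                         "value": "registry.member1.example.com"}
                    ],
                    "plaintext": [
                        {"path": "/metadata/labels/region", "operator": "add",
                         "value": "idc-beijing"}
                    ],
                },
            }
        ],
    },
}

print(yaml.safe_dump(override_policy, sort_keys=False))
```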
Another advanced feature comes from feedback that people don't have enough resources in their own data centers, so they may want to buy some managed clusters from the public cloud, like Google Cloud or AWS. With Karmada, users can build a unified resource pool across multiple clouds. For example, you may have some clusters in your local data center, and the resources there might be insufficient sometimes. You can scale your workloads out to the public cloud, but during scaling the local IDC resources are always preferred, which is useful for cost saving. I will give an example of how to define cluster groups with the PropagationPolicy. The PropagationPolicy is the most important API in Karmada: in the policy you specify which application you want to propagate, and you also declare where the application should be delivered to. In the placement, you can declare cluster affinities. With multiple cluster affinities declared in the same PropagationPolicy, the scheduler will first try the first cluster group. In this example, the scheduler will try to find a cluster for your application in your own data center; if there are not sufficient resources, it will go on to the next group, which could be the public cloud (a sketch of such ordered groups is shown at the end of this part).

Another feature is failover. Karmada manages multiple member clusters, and when a cluster goes down, Karmada can gracefully migrate the applications from the faulty cluster to another available cluster, ensuring that the service remains available and uninterrupted. Additionally, once a cluster is identified as unavailable, it is isolated, and subsequent jobs are not scheduled on that cluster by default until it recovers. Besides cluster failures, an application itself can potentially fail; in such cases, Karmada can automatically migrate the faulty application to another cluster. More importantly, the user can fully control the migration process by defining when Karmada should take action and what the migration behavior is. Karmada also leaves room for the Kubernetes cluster itself to retry the failed job.

In Karmada, there are two layers of scheduling, and there is a balance to strike: we want to provide accurate scheduling in the first layer without consuming too many resources. When Karmada selects a cluster for your application, it provides three ways to improve scheduling accuracy. The first is that Karmada does the scheduling based on the resource summary. The second is the resource model, which builds a model of your cluster. The third is the most accurate method, where Karmada does accurate scheduling with a dedicated estimator; the accurate estimator consumes more resources.

This is what the resource summary looks like: Karmada counts all the available resources by aggregating the resources in the Node objects, and the scheduler sees how much is available in each cluster. This is the resource model for the user: the user can declare a resource model to Karmada and tell Karmada how to build the model for the cluster. There are different grades in which you can describe your cluster, and Karmada will analyze your cluster and build the resource model; the scheduler then uses this model to schedule your workload. The next one is the most accurate estimator in Karmada. When Karmada schedules an application and wants to find a cluster, how does it know exactly what resources are there? Karmada can call another component named the scheduler estimator. The scheduler estimator holds all the necessary information, like all the pods and all the nodes, so it can make an accurate estimate and return the result to the Karmada scheduler. A simplified sketch of the grade-based estimation also follows below.
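Going back to the ordered cluster groups from the beginning of this part, the placement below declares two affinity groups: the scheduler tries the local IDC clusters first and falls back to the public-cloud group only when the local resources are insufficient. The group and cluster names are made up for illustration, and the dict simply mirrors the YAML you would apply.

```python
import yaml  # pip install pyyaml

# Ordered cluster groups in a PropagationPolicy placement: local IDC first,
# public cloud as the fallback. Cluster and group names are assumptions.
prefer_local_idc = {
    "apiVersion": "policy.karmada.io/v1alpha1",
    "kind": "PropagationPolicy",
    "metadata": {"name": "prefer-local-idc"},
    "spec": {
        "resourceSelectors": [
            {"apiVersion": "apps/v1", "kind": "Deployment", "name": "demo-training"}
        ],
        "placement": {
            "clusterAffinities": [
                {"affinityName": "local-idc",
                 "clusterNames": ["idc-beijing", "idc-shanghai"]},
                {"affinityName": "public-cloud",
                 "clusterNames": ["cloud-east", "cloud-west"]},
            ]
        },
    },
}

print(yaml.safe_dump(prefer_local_idc, sort_keys=False))
```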
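To give a feel for the grade-based estimation just described, here is a toy Python sketch: the cluster is summarized as node counts per resource grade, and the estimate counts the nodes whose grade is guaranteed to fit one replica. This is a deliberately simplified illustration of the idea under assumed grades and requests, not Karmada's actual estimation algorithm.

```python
# Toy sketch of grade-based (resource model) estimation: the cluster is summarized
# as "how many nodes fall into each CPU/memory grade", and we count the nodes whose
# grade is certain to fit one replica. Grades and requests are assumptions.
from dataclasses import dataclass

@dataclass
class Grade:
    min_cpu: float       # cores, lower bound of the grade
    min_mem_gib: float    # GiB, lower bound of the grade
    node_count: int       # how many nodes the cluster reports in this grade

cluster_model = [
    Grade(min_cpu=4,  min_mem_gib=16,  node_count=10),
    Grade(min_cpu=16, min_mem_gib=64,  node_count=5),
    Grade(min_cpu=64, min_mem_gib=256, node_count=2),
]

def estimate_replicas(model: list[Grade], req_cpu: float, req_mem_gib: float) -> int:
    """Conservative estimate: one replica per node whose grade surely covers the request."""
    return sum(g.node_count for g in model
               if g.min_cpu >= req_cpu and g.min_mem_gib >= req_mem_gib)

# A replica asking for 8 cores / 32 GiB only surely fits the two larger grades.
print(estimate_replicas(cluster_model, req_cpu=8, req_mem_gib=32))  # 7
```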
But what if the first layer of scheduling does not work out, and an application's replicas fail to be scheduled in the member cluster? That may happen because some nodes might be gone. In that case, Karmada can detect the application and find out how many unscheduled replicas it has in the target cluster. It will then evict them from that cluster, and the Karmada scheduler will schedule the application again to find another available cluster.

You are welcome to talk with us about any questions or any features you want; you can find us on GitHub or on the CNCF Slack. We have some time left, so if you want to ask any questions, feel free. No questions? One, two, three. Thank you.