Hello, everyone. He Cao couldn't be here today. I'm Joshua, a developer advocate at ByteDance, so today's talk from He Cao will be prerecorded. It's on improving GPU utilization and accelerating model training with the scheduling framework and NRI. He Cao is a senior engineer at ByteDance, and he works closely with the KubeWharf organization on projects like Katalyst and KubeBrain. So without further ado, I'll play the video. If you have any questions at the end of the video, feel free to ask, and we can relay them to He Cao. You can also join us on Discord and on GitHub. Thank you.

Hello, everyone. My name is He Cao. Today I'm going to share a topic titled "Improving GPU Utilization and Accelerating Model Training with Scheduling Framework and NRI." I'm a software engineer at ByteDance and also a maintainer of the Katalyst project.

First, I will provide some background on cloud-native AI and give an overall introduction to the Katalyst project. Next, I will delve into ByteDance's practical experience in efficiently managing a large number of GPU resources with Katalyst, including GPU sharing scheduling, topology-aware scheduling, and tidal co-location. Afterward, I will discuss the practical outcomes of these features. Finally, I will discuss the future roadmap of Katalyst in the cloud-native AI domain and introduce the Katalyst community.

From the picture on the upper left, we can see the overall development of cloud native. With the thriving growth of the cloud-native ecosystem, Kubernetes has not only found extensive application in microservice scenarios, but also increasingly supports various other types of workloads, including storage services, search and recommendation services, as well as machine learning and big data jobs. With the rise of ChatGPT at the end of last year, large AI models have brought about new opportunities. On one hand, large models have significantly improved productivity across various industries, bringing new growth opportunities for businesses. On the other hand, the training and inference of large models require large-scale resources, which has also accelerated the development of cloud-native infrastructure.

In addition to the opportunities brought by large models, we also encounter some challenges in the field of cloud-native AI. Let's start with the challenges in AI inference scenarios. Because an NVIDIA GPU can only run a single CUDA context at a time, one GPU can typically only be allocated to a single process. As shown in the picture on the upper right, in inference scenarios a process typically doesn't fully utilize a GPU, resulting in significant waste of resources. Therefore, in AI inference scenarios, a key objective is to improve GPU utilization through fine-grained resource management.

Next, let's discuss the challenges in AI training scenarios. As illustrated in this table, different models have diverse computing and network resource requirements for their training jobs. Large models like GPT-3 require over a thousand A100 GPUs training for a month or more. Therefore, in AI training scenarios, another crucial objective is to accelerate model training through fine-grained resource management.

To achieve the objectives mentioned above, we have incubated a resource management system called Katalyst. The name is derived from the word "catalyst" in chemical reactions,
and it symbolizes the system's ability to provide enhanced automation of resource management for all workloads running in the Kubernetes ecosystem. It has four layers: the API layer, the master layer, the node layer, and the kernel layer.

In the second part, I will present ByteDance's practices in improving AI resource efficiency with Katalyst. Let's start with GPU sharing scheduling. First, let me analyze the limitations of the device plugin mechanism. From this diagram, you can see that with it, Kubernetes only supports full-GPU requests, which causes huge GPU waste in AI inference scenarios. Also, the device plugin interfaces lack certain important parameters, such as the metadata of the pod and the container; as shown in the picture, in the Allocate interface the only parameter is the device IDs. While the industry has various GPU virtualization solutions, such as time slicing, MPS, MIG, and CUDA API hijacking, they all have issues with the isolation of GPU memory and computing power, with flexibility, or with intrusiveness. Hence, we have opted for a kernel-level virtualization solution called mGPU to implement GPU sharing scheduling.

Our GPU sharing scheduling solution enables scheduling at a granularity of one percent of computing power and one megabyte of GPU memory. At the Kubernetes level, it consists of two components: the scheduler and the device plugin. The scheduler is responsible for scheduling a pod to a suitable node and scheduling each container to a suitable GPU combination, and the device plugin is responsible for advertising mGPU resources and injecting the corresponding environment variables into a container based on the scheduler's allocation results. For kubelet, computing power and GPU memory are treated as two separate extended resources. If GPU-level scheduling were performed by kubelet, a container's computing power and GPU memory might be allocated on separate GPU cards, which is obviously abnormal. Therefore, we have implemented two-level scheduling, covering both node and GPU, through a scheduler plugin.

Here is the architecture of the scheduler plugin. We have implemented several extension points, such as the PreFilter, Reserve, PreBind, and Unreserve extension points. We also introduced an extended cache to store the state of the GPU cards. As for the scheduling strategy, we have implemented a two-level bin-packing strategy for node and GPU; this strategy helps reduce fragmentation and improve GPU allocation and utilization rates. We have also implemented a two-level spread strategy for node and GPU, which can enhance isolation from node and GPU failures. As for the scheduling algorithm, we abstract the entire scheduling process as an optimization problem. We choose variance as the objective function to reflect the degree of bin packing or spread, and we use a DFS algorithm to search the feasible GPU combinations and identify the optimal one.

As for node resource management: as mentioned earlier, the device plugin interfaces lack certain important parameters, such as the metadata of the pod and the container, so the device plugin isn't aware of which pod and container it is processing. Therefore, we have the device plugin use kubelet's Pod Resources API to locate the pod and the container, and then use the container runtime hook to inject the NVIDIA_VISIBLE_DEVICES environment variable, which is consumed by the NVIDIA container runtime.
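To make the scheduling algorithm more concrete, here is a minimal sketch of the idea just described: enumerate feasible GPU combinations with DFS and score each candidate by the variance of per-card utilization after placement. All of the type and function names (gpu, request, bestCombination) are hypothetical illustrations under the talk's stated granularity of one percent of compute and one MiB of memory, not Katalyst's actual code.

```go
// Hypothetical sketch of the two-level GPU scheduling algorithm:
// DFS over feasible GPU combinations, scored by the variance of
// per-card compute utilization after the placement.
package main

import (
	"fmt"
	"math"
)

// gpu tracks the free share of one card, matching the talk's
// 1%-compute / 1-MiB-memory scheduling granularity.
type gpu struct {
	freeCompute int // free computing power, in percent (0-100)
	freeMemory  int // free GPU memory, in MiB
	capCompute  int // total computing power, in percent
}

// request asks for numGPUs cards, each with the given compute share
// and memory.
type request struct {
	numGPUs int
	compute int // percent per GPU
	memory  int // MiB per GPU
}

// utilizationVariance returns the variance of compute utilization
// across all cards, assuming the request lands on the cards in chosen.
func utilizationVariance(gpus []gpu, chosen []int, req request) float64 {
	used := make([]float64, len(gpus))
	for i, g := range gpus {
		used[i] = float64(g.capCompute - g.freeCompute)
	}
	for _, i := range chosen {
		used[i] += float64(req.compute)
	}
	var mean float64
	for _, u := range used {
		mean += u
	}
	mean /= float64(len(used))
	var v float64
	for _, u := range used {
		v += (u - mean) * (u - mean)
	}
	return v / float64(len(used))
}

// bestCombination enumerates feasible combinations with DFS and keeps
// the one with the highest variance, i.e. the bin-packing strategy;
// minimizing the variance instead would give the spread strategy.
func bestCombination(gpus []gpu, req request) []int {
	var best []int
	bestScore := math.Inf(-1)
	var dfs func(start int, chosen []int)
	dfs = func(start int, chosen []int) {
		if len(chosen) == req.numGPUs {
			if s := utilizationVariance(gpus, chosen, req); s > bestScore {
				bestScore = s
				best = append([]int(nil), chosen...)
			}
			return
		}
		for i := start; i < len(gpus); i++ {
			if gpus[i].freeCompute >= req.compute && gpus[i].freeMemory >= req.memory {
				dfs(i+1, append(chosen, i))
			}
		}
	}
	dfs(0, nil)
	return best
}

func main() {
	gpus := []gpu{
		{freeCompute: 60, freeMemory: 40960, capCompute: 100},
		{freeCompute: 100, freeMemory: 81920, capCompute: 100},
	}
	// Prints [0]: packing onto the partially used card maximizes variance.
	fmt.Println(bestCombination(gpus, request{numGPUs: 1, compute: 30, memory: 8192}))
}
```

Using one objective function with opposite optimization directions is what lets a single search cover both strategies: maximize variance to pack and reduce fragmentation, minimize it to spread and isolate failures.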
The second scenario is topology-aware scheduling. First, let me analyze the limitations of Kubernetes' native topology management. For one thing, the Kubernetes scheduler is not aware of the micro topology on nodes, which can lead to a high number of unexpected admission failures. For another, as shown in the picture, the Topology Manager's affinity strategies only take NUMA topology into account, which cannot meet the performance requirements of distributed model training jobs. To address these issues, we have implemented topology-aware scheduling, which includes pluggable node resource allocation strategies, implemented by the QoS Resource Manager and its resource plugins, together with an extended scheduler plugin that filters and scores nodes based on their micro topology. In addition, we have implemented some new topology affinity policies.

In distributed training scenarios, RDMA devices are used to accelerate network communication. If a GPU and an RDMA NIC are allocated under the same PCIe root complex, GPUDirect RDMA can be used to accelerate communication. As shown in the picture, the lower the shared PCIe switch sits in the hierarchy between a GPU and an RDMA NIC, the greater the bandwidth. So we have customized a strategy for GPU and RDMA affinity at the PCIe switch level (see the sketch after this section). Furthermore, RDMA devices are connected to switches at different levels, and the closer the switch connecting two RDMA devices is, the greater the bandwidth for communication between pods. Therefore, for pods belonging to the same job, we prefer to allocate RDMA devices under the same S0 switch, then the same S1 switch, and then the same S2 switch.

For training jobs using the PS-worker framework, parameter servers require high performance and cannot have resources allocated across NUMA nodes. Workers, in contrast, usually consume a significant amount of memory bandwidth. So we have implemented an inter-pod anti-affinity policy at the NUMA level, so that parameter servers and workers are allocated to different NUMA nodes to avoid interference.
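Here is a minimal sketch of the PCIe-level affinity scoring idea described above: model the node's PCIe hierarchy as a tree and score a GPU/NIC pair by the depth of their lowest common ancestor, so a pair under the same switch outranks a pair that only shares the root complex. The tree layout and all names are made-up illustrations; a real implementation would discover the topology from the node (e.g. via sysfs) rather than hard-coding it.

```go
// Hypothetical sketch of PCIe device-affinity scoring for
// topology-aware scheduling.
package main

import "fmt"

// node is one vertex in a simplified PCIe tree: the root complex,
// switches, and leaf devices (GPUs and RDMA NICs).
type node struct {
	name   string
	parent *node // nil for the root complex
}

// depth returns a node's distance from the root complex.
func depth(n *node) int {
	d := 0
	for n.parent != nil {
		n = n.parent
		d++
	}
	return d
}

// lca returns the lowest common ancestor of two devices. Devices on
// the same node always share at least the root complex.
func lca(a, b *node) *node {
	ancestors := map[*node]bool{}
	for n := a; n != nil; n = n.parent {
		ancestors[n] = true
	}
	for n := b; n != nil; n = n.parent {
		if ancestors[n] {
			return n
		}
	}
	return nil
}

// affinityScore: the deeper the shared switch, the higher the expected
// GPUDirect RDMA bandwidth, so the higher the score.
func affinityScore(gpu, nic *node) int {
	return depth(lca(gpu, nic))
}

func main() {
	root := &node{name: "root-complex"}
	sw0 := &node{name: "pcie-switch-0", parent: root}
	sw1 := &node{name: "pcie-switch-1", parent: root}
	gpu0 := &node{name: "gpu0", parent: sw0}
	nic0 := &node{name: "rdma-nic-0", parent: sw0}
	gpu1 := &node{name: "gpu1", parent: sw1}

	fmt.Println(affinityScore(gpu0, nic0)) // 1: same PCIe switch, GPUDirect-friendly
	fmt.Println(affinityScore(gpu1, nic0)) // 0: only the root complex is shared
}
```

The same lowest-common-ancestor score extends naturally to RDMA-to-RDMA placement across the S0/S1/S2 switch tiers mentioned above: preferring the deepest common switch is exactly preferring S0 over S1 over S2.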
The third scenario is tidal co-location. It arose from the challenges we encountered during capacity planning. As shown in this image, the resource utilization of online inference services exhibits a tidal pattern, with very low utilization during the night, which leads to a huge waste of GPU resources. Because online inference services and offline training jobs have complementary business cycles and heterogeneous resource usage patterns, we chose to enhance GPU utilization through tidal co-location, also known as time-sharing reuse.

The tidal co-location solution consists of two parts: instance management and node management. The HPA controller or the training operator scales up online services and scales down offline jobs during the daytime, and scales down online services and scales up offline jobs during the night. Meanwhile, the tidal controller dedicates nodes to online services during the daytime, and at night performs bin packing for the online services and frees nodes up for offline jobs.

The tidal co-location solution has two primary application scenarios. One is elastic training. In scenarios like computer vision or natural language processing, it's common to fully load the required model parameters into a GPU's memory, and the all-reduce pattern can make more efficient use of bandwidth. As for the framework characteristics, workers support fault tolerance and elastic scaling. Elastic scaling brings a significant increase in training speed, which is positively correlated with the number of workers. In contrast, if the launcher crashes, the training job will fail. Therefore, we allocate stable resources for the launcher and for the workers from zero up to the minimum number of replicas, to ensure the stability of the training job. In addition, we allocate elastic resources for the workers from the minimum up to the maximum number of replicas, to accelerate training.

Another application scenario is tidal resource reuse. CVR or CTR models have an extremely large number of parameters that cannot be stored on a single node, so the PS-worker pattern is adopted. If the parameter servers crash, the training job will fail, and for the workers, elasticity doesn't bring a significant increase in training speed. So we use tidal resource reuse instead of elastic training to enhance resource utilization. During the off-peak hours of online services, we lend the idle resources of the online services to the workers of training jobs; that is to say, stable parameter servers and elastic workers collaborate to support the training jobs. During the peak hours of online services, we stop the elastic workers to release resources, so that the online services can utilize more resources and avoid degradation.

In the third part, I will discuss the practical outcomes of these features. Through GPU sharing scheduling, we have boosted GPU utilization by more than 50%. Through topology-aware scheduling, we have increased the communication bandwidth between training tasks by more than two times. Through tidal co-location, we have reached a peak of tens of thousands of GPU cards being reused.

Finally, I will discuss future work and introduce the Katalyst community. Our core objective is to improve GPU utilization and enhance model training efficiency. To achieve these goals, we plan to support some new features and evolve the architecture. For example, we plan to enrich the scenarios for topology-aware scheduling and to support regular GPU co-location. As for the architecture, we plan to integrate Katalyst with the dynamic resource allocation feature of Kubernetes.

Here are the Katalyst community and my contact information. You can join the bi-weekly community meeting through this link and join the community group through the QR code. We welcome everyone to participate in community contributions. If you have any questions, please feel free to reach out to me for discussion. Thank you.

Thank you, everyone. If you have any questions, please contact He Cao via these QR codes. On top of this, ByteDance open source has a bunch of other open source projects, and you can join us on Discord as well if you'd like to learn more about KubeWharf, Katalyst, and some others, and also learn about our innovators program. Thank you all so much.