Hi, everyone. Welcome to our session today. This is T-Gel from Baidu and William Wang from Huawei. We will introduce how to optimize knowledge distillation training with Volcano. Thanks. Next slide.

Okay, let's have a quick overview of the project's background. PaddlePaddle is China's fully open-sourced deep learning platform, and it is a framework for the industrial development of deep neural networks. The PaddlePaddle deep learning framework supports both declarative and imperative programming, preserving both development flexibility and high runtime performance. PaddlePaddle also supports ultra-large-scale training of deep neural networks: it launched the world's first large-scale open-source training platform that supports training deep networks with 100 billion features and trillions of parameters, using data sources distributed over hundreds of nodes. PaddlePaddle includes and maintains more than 100 mainstream models that have been polished through long-term industrial practice, and some of these models have won major prizes in well-known industrial competitions. In the meanwhile, PaddlePaddle provides more than 200 pre-trained models to facilitate the rapid development of industrial applications. Next slide.

For large-scale training, PaddlePaddle enables collective training on multiple GPUs and also supports asynchronous parameter-server training on GPUs and CPUs. PaddlePaddle uses the Fleet API for highly scalable distributed training, and most of Baidu's intelligent services are now powered by the PaddlePaddle framework. When we run large-scale distributed training inside Baidu, we suffer from some problems. One is that if a single pod in a large distributed job fails, the whole job fails, and restarting the whole job wastes a lot of resources. Another is the low utilization of the inference-card clusters such as K40, while the training-card clusters such as V100 are always out of resources. So we needed to figure out a way to resolve those two main problems when training in a large Kubernetes cluster. Next slide.

Okay, we created the EDL project, the Elastic Deep Learning project, as a middle layer between Kubernetes and the PaddlePaddle framework to handle the elastic-training-related work. Currently, EDL uses Kubernetes as its foundation and supports user scenarios including training methods such as knowledge distillation, reinforcement learning, and hyperparameter search through Kubernetes CRDs. Now, with three major releases, EDL enables snapshot-based fault tolerance and job pod autoscaling, and it supports knowledge distillation natively. EDL is also highly integrated with Volcano for advanced scheduling features to accelerate training, and EDL is now an LF AI & Data Foundation incubating project. With EDL enabled in Baidu's internal clusters, cluster-level resource utilization is above 70%, job submission queuing time is less than 44 minutes, and the job failure rate is now less than 5%. Next slide.

So what is knowledge distillation, and what are its benefits? For those who are not familiar with knowledge distillation, let's have a quick overview. Nowadays, deep learning models are getting bigger and bigger, and the networks are getting deeper and deeper. In many scenarios, the larger the model and the more layers it has, the better the model's accuracy. But limited by inference-speed requirements and GPU memory, large models usually cannot be deployed directly, so the models need to be compressed.
The current mainstream compression methods include pruning, quantization, and knowledge distillation. Among them, knowledge distillation is a state-of-the-art technique proposed in the paper "Distilling the Knowledge in a Neural Network" published by Hinton in 2015. It is a very classic model-compression technique that migrates knowledge from a complex model to another, lightweight model to achieve model compression. In fact, the so-called knowledge transfer can be understood as a training process in which the teacher model is used to train the student model; this training method is distillation training. After a good student model has been trained, it can be used for actual deployment. As shown in the figure above, training can be divided into two steps: first train a teacher model, and then use the knowledge of the teacher model to train the student model. Okay, next slide.

Originally, there were two common ways to do distillation training. Based on its user scenarios, EDL invented a third way. Let's introduce these three types.

The first one is called pure distillation training. Pure distillation training is much like the teacher recording the lecture as a video and giving it to the students for self-study; the students then learn by themselves from the course video. So in pure distillation training, the teacher model first runs inference and saves the results to disk, and then the student model uses the samples saved on disk together with the teacher's inference results as the dataset for training. In this mode, the training of the student model is the same as regular training, and the method is simple. However, this method requires the teacher's results, including those for any data augmentation, to be generated in advance, which takes up huge disk space, so its application environment is subject to certain restrictions.

The second one is same-network distillation training. Same-network distillation training puts the teacher model and the student model into the same network: the teacher model's parameters are fixed and only run forward, while the student model does normal back-propagation training. This is also the current mainstream distillation training method, and it is very similar to teaching in real life: the teacher and the student are in the same classroom, the teacher says a sentence and the student listens to it. However, this method not only takes up a lot of GPU memory for the teacher model, but also, because the teacher and the student have a one-to-one binding relationship, the training of the student model completely relies on the teacher model. The student model has to wait for the teacher model to output the inference results for a batch before it can train on it, and the teacher model has to wait for the student to train on a batch before starting inference on the next batch, which has a certain impact on the overall training speed.

And EDL has the third one, which is called EDL service distillation training. Compared with the previous two methods, EDL service distillation training decouples the teacher model and the student model: the teacher model is deployed as an online inference service, and the student model uses a client to send samples to the teacher model in real time over the network and obtain the inference results for training. All three approaches ultimately optimize the student against the teacher's soft targets; a minimal sketch of that loss is shown below.
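To make the teacher-student idea concrete, here is a minimal NumPy sketch of the classic soft-target distillation loss from the Hinton paper. The temperature, the weighting factor, and the toy logits are illustrative choices, not values used by EDL or PaddlePaddle.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature > 1 softens the distribution, exposing the teacher's
    # "dark knowledge" about how similar the classes are to each other.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    """Weighted sum of the soft-target loss (student vs. teacher) and the
    hard-label cross-entropy (student vs. ground truth)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Cross-entropy between teacher and student soft distributions,
    # scaled by T^2 as suggested in the original paper.
    soft_loss = -(p_teacher * np.log(p_student + 1e-12)).sum(axis=-1).mean()
    soft_loss *= temperature ** 2
    # Standard cross-entropy against the ground-truth labels.
    p_hard = softmax(student_logits)
    hard_loss = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy batch: 2 samples, 3 classes.
teacher = np.array([[4.0, 1.0, 0.2], [0.1, 3.5, 0.4]])
student = np.array([[2.5, 0.8, 0.3], [0.2, 2.0, 0.6]])
labels = np.array([0, 1])
print(distillation_loss(student, teacher, labels))
```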
Service distillation is like letting the model take lessons online, and we use DistillReader for the communication. Next slide.

Okay. There are several advantages of EDL service knowledge distillation. The first one is saving GPU memory. Because the student model and the teacher model are decoupled, service distillation training can use heterogeneous resources, that is, deploy the student model and the teacher model on different devices. Distillation networks that were previously limited by GPU memory size and were difficult to fit on a single GPU card can be spread across different cards in this way. Users can also flexibly set the ratio of teachers to students according to the throughput of the teacher and the student, which means that multiple teachers can teach multiple students, instead of keeping a one-to-one tutoring model, to maximize training output.

The second one is improving training speed. Thanks to the saving of GPU memory, the student model can be trained with a larger batch size. At the same time, because the student model and the teacher model are in different pipelines, the student model does not need to wait for the teacher model to finish inference before training. Combining the two reasons, this can greatly improve the training speed.

The third one is improving the utilization of training resources. In practical applications, we can deploy the teacher model to an online elastic inference-card cluster and use the computing power of the online inference cards to increase the throughput of the teacher model in the distillation task. At the same time, because the teacher model can be scheduled flexibly, there is no need to worry about task failures caused by the preemption of online instances during peak hours. It is equivalent to transferring the teacher's demand for training cards to the online GPU cards: when offline training resources such as V100 are limited, the online cards are used to accelerate training and save valuable training resources. In addition, on offline clusters, combined with scheduling strategies, the teacher model can also be deployed on fragmented resources or resources with a low usage rate, such as K40, to make full use of the cluster's idle and fragmented resources. The right picture is the flow chart of a service distillation training job. In this figure, you can see that the student model sends samples to the teacher model and obtains the inference results, while the serving side of the teacher model can be added and removed at will and adjusted flexibly. Next slide.

Okay. Now let's see how EDL leverages Kubernetes and Volcano to optimize knowledge distillation training. EDL supports elastic training with inference-style services during training: it deploys the teacher model as an online inference service through Paddle Serving. In addition to the teacher and student training pods, a service registry and discovery module is provided by EDL. The online inference services are elastic and are registered to the EDL service registry module for service auto-discovery and fault tolerance. So EDL enables dynamic adjustment of the teacher model's online instances to maximize the student's training throughput and the resource utilization. With the K40 inference cards as the serving cluster and the V100 cards as the training cluster, EDL uses Volcano for multi-cluster scheduling. And since, inside Baidu, we are always short of V100 resources for training, we use gang scheduling for knowledge distillation jobs to avoid deadlock on training resources. The sketch below illustrates the decoupled student-side loop.
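The following is a rough, hypothetical sketch of what the decoupled student-side loop looks like: samples go to a remote teacher service over HTTP, the logits come back, and only the student does back-propagation locally. The endpoint URL, the JSON payload shape, and the helper names are assumptions for illustration; EDL's actual DistillReader API differs.

```python
import numpy as np
import requests  # plain HTTP stands in for the real transport here

# Hypothetical address of the teacher inference service (e.g. behind
# the EDL service registry); not a real EDL endpoint.
TEACHER_ENDPOINT = "http://teacher-service:9292/predict"

def query_teacher(batch):
    """Send a batch of samples to the online teacher service and return
    its logits. In EDL this role is played by DistillReader; here we just
    sketch the idea with a single HTTP call."""
    resp = requests.post(TEACHER_ENDPOINT, json={"inputs": batch.tolist()})
    resp.raise_for_status()
    return np.array(resp.json()["logits"])

def train_student(data_loader, student_step):
    """Decoupled loop: the teacher runs remotely, the student trains locally.
    `data_loader` yields (batch, labels); `student_step` performs one local
    forward/backward pass using the teacher's soft targets."""
    for batch, labels in data_loader:
        teacher_logits = query_teacher(batch)        # remote inference only
        student_step(batch, labels, teacher_logits)  # local back-propagation
```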
We also utilize the IO awareness in Volcano to maximize RDMA usage in the training cluster. Next slide.

Okay. In order to verify the effect of EDL service distillation training, we used pure training, same-network distillation training, and EDL service distillation training on the ImageNet dataset to train the ResNet50-vd model. The first concern is accuracy. In terms of accuracy, compared with pure training, distillation training improves the accuracy of the ResNet50 model by nearly 2%, and service distillation training reaches the same accuracy as same-network distillation training. In terms of training speed, compared with pure training, same-network distillation training gives a large part of the computing power to the teacher model, so its training speed is only 35.9% of pure training with the same training resources. EDL service distillation training uses additional online P4 elastic resources and transfers the teacher's demand for training cards to the elastic cards, so compared with pure training it still maintains a training efficiency of 82.8%, and its speed is 2.3 times that of same-network distillation training. If teacher resources are increased further, theoretically the speed of EDL service distillation training can match that of pure training. Of course, same-network distillation training can also continue to accelerate if resources are increased, but that takes up more of the valuable V100 training resources.

Okay, that's all about distillation training on EDL. So now let's welcome William Wang for a deep dive into the Volcano project, to tell us more details about the features and implementation of Volcano and how it is integrated with other AI and data systems. Thank you.

Hey guys, I'm William Wang from Huawei, a Volcano community maintainer and tech lead. It's my pleasure to share this topic with you. Okay, let's get started.

With the development of industries, more and more domain frameworks are invented and applied to support business development. These frameworks play an irreplaceable role in their respective domains, such as Spark, TensorFlow, and Flink. On the other side, business models are becoming more and more complex nowadays; it's difficult to handle complex business scenarios with just a single domain framework, so multiple domain frameworks are widely used together to achieve business objectives. The domain-framework clusters are becoming bigger and bigger, and these clusters are independent of each other, so resources cannot be shared. This leads to a huge waste of resources. Therefore, more and more users want a unified scheduling system to resolve the resource-sharing problem, and Kubernetes is the best choice for many users because of its excellent extensibility and ecosystem.

As you know, Kubernetes was designed for microservice orchestration in its early days. When we tried to migrate batch workloads to Kubernetes several years ago, we found there were still a lot of challenges for Kubernetes. The first one is scheduling policies for high-performance workloads: for example, gang scheduling, since batch workloads need all-or-nothing scheduling to avoid deadlock; fair-share scheduling for multiple tenants; job priority scheduling for urgent workloads; and topology-aware scheduling to accelerate training. The second challenge is job lifecycle management. Different types of workloads have different expectations. For example, with TensorFlow, if a PS or worker pod fails, we have to restart the whole job.
However, for Spark, if an executor pod fails, only restarting the executor pod is enough. All this error handling should be resolved in job lifecycle management. The third challenge is heterogeneous hardware support. High-performance workloads have higher performance requirements, and many providers produce different kinds of hardware to accelerate computing. This requires the scheduler to schedule these resources uniformly and provide the best resource allocation. The last one is performance tuning, for example scalability, throughput, network, and container runtime; this is not only about the scheduler.

Volcano is a Kubernetes-native batch system designed to address these challenges. As you can see, Volcano implements a batch scheduler to provide rich scheduling policies. A new controller is added to do the lifecycle management and provide a unified interface for different kinds of workloads. We also provide some command-line tools for HPC users to help them submit workloads from the command line. Volcano is a CNCF sandbox project right now. There are more than 1,600 GitHub stars and more than 100 contributors from different companies and organizations. Several releases have already been published, and there are more than 14 public adopters.

Here is the overall Volcano architecture. Volcano supports scheduling multiple types of workloads in one cluster, and the resources can be shared among these workloads to get better utilization. These workloads might be online microservices, offline data-analysis tasks, or AI workloads. Volcano provides queues for users to plan their cluster resources; it is easy to map a company's organization to Volcano queues, and different departments can share resources with each other when resources are idle. Rich scheduling policies are supported in Volcano, such as priority scheduling, topology scheduling, preemption, reclaim, time-division multiplexing, and so on. Volcano also supports monitoring of resources for fine-grained scheduling. The next slides show the benefits of Volcano in different scenarios, backed by data.

The first scenario shows gang scheduling for TensorFlow training. As you know, all-or-nothing scheduling is required for TensorFlow or MPI workloads to avoid deadlock. In the test, there were not enough resources for two jobs to run concurrently in the cluster, and then we submitted jobs to the cluster. As you can see, when we submitted five jobs with two PS and four workers each, only two of the five jobs finished, because of deadlock: the remaining three jobs each occupied part of the resources and waited for the other jobs to release theirs.

The second scenario is also TensorFlow training. In our performance testing, we found that the placement of PS and worker pods affects the training result, especially for GPU training of some models. The reason is that the host network is faster than the network across hosts: if PS and worker pods can be scheduled onto one host, they can exchange data over the host network. There are three nodes in the test cluster, and we submitted a training job with two PS and four workers. We got three different placements, and the result is random with the default scheduler. As you know, group C is the best placement, the one we want. The task-topology scheduling in Volcano targets exactly this kind of issue: you can define the topology of the tasks in a job, and the scheduler will give the best placement based on that input. Note that the complexity of this feature does not increase with the cluster scale. Going back to the first scenario, the sketch below shows the kind of gang-scheduling PodGroup that enforces all-or-nothing placement.
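To make the gang-scheduling scenario concrete, here is a sketch of how a Volcano PodGroup with an all-or-nothing threshold could be created with the Kubernetes Python client. The object name, namespace, and resource numbers are placeholders; the exact PodGroup fields should be checked against the Volcano version in use.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in a pod
api = client.CustomObjectsApi()

# A PodGroup asking the Volcano scheduler to place the whole group
# (2 PS + 4 workers = 6 pods) atomically, or not at all.
pod_group = {
    "apiVersion": "scheduling.volcano.sh/v1beta1",
    "kind": "PodGroup",
    "metadata": {"name": "tf-training-pg", "namespace": "default"},
    "spec": {
        "minMember": 6,          # all-or-nothing threshold
        "minResources": {        # placeholder numbers
            "cpu": "12",
            "memory": "24Gi",
        },
    },
}

api.create_namespaced_custom_object(
    group="scheduling.volcano.sh",
    version="v1beta1",
    namespace="default",
    plural="podgroups",
    body=pod_group,
)
```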
Returning to placement: as far as I know, some users use the Kubernetes affinity and anti-affinity features to achieve this goal; however, their complexity increases with the cluster scale. We have also done some research on IO-aware scheduling using the task topology and IO information: by minimizing the maximum data-transfer latency we can get even better performance. The figure shows the VGG-16 model training results with the default scheduler, Volcano task topology, and IO-aware scheduling. IO-aware scheduling gets a 30% performance increase compared with the default scheduler; the result depends on the data exchange pattern and the model.

This page shows Spark on Kubernetes with Volcano. Several years ago, we started to help users migrate Spark workloads from Hadoop to Kubernetes. We used TPC-DS to do the performance test, and found that deadlock happened when the job concurrency was high. The main reason is that all the resources were allocated to Spark driver pods; the executor pods started later than the driver pods and had no resources available, so all Spark applications were stuck there. We had to prepare dedicated nodes for executor and driver pods to resolve this problem. However, this kind of node division increases resource fragmentation, as you never know the best proportion between driver nodes and executor nodes. A new feature called minimal resources was introduced in Volcano for this situation. Volcano creates a pod group for each Spark application, and the minimal resources is a property of the pod group. The scheduler reserves the minimal resources for each Spark application, which resolves the driver-pod over-commit issue. It is job-level scheduling: users no longer need to prepare dedicated nodes, and the fragmentation issue is resolved as well. As you can see, the performance improved by more than 30%.

Next, more and more MPI users are trying to submit workloads to Kubernetes with the Volcano Job, and Volcano provides some features to help them. The left part shows a Volcano Job for running an MPI workload. Users can define minAvailable for gang scheduling, and they can define job policies as well. For example, on the event PodEvicted the action is RestartJob, which means that whenever a pod is evicted, the MPI job will be restarted automatically. Users can also define the MPI master and MPI worker replicas, resource requests, and task policies respectively. In addition, Volcano provides built-in job plugins to simplify the configuration. For example, the ssh plugin provides SSH authentication without a password, and the svc plugin prepares a headless service for communication among the pods. This is convenient for MPI users; a sketch of such a job appears at the end of this part.

Another scenario is GPU sharing. GPU resources are expensive, and GPU utilization is not good enough in some situations, such as development environments and inference, so GPU sharing is expected by many users. Volcano provides the GPU sharing ability: users can specify the amount of GPU memory they need. Soft isolation is supported so far, and hard isolation will be supported in the future.

Cromwell is a popular pipeline software widely used in genomics computing. Volcano has been integrated with Cromwell, so users can use WDL to orchestrate Volcano jobs. This ability is already supported in our genomics computing service. For HPC users, the command line is important: they always use the command line to submit workloads. Volcano provides a set of command-line tools, including commands to submit, cancel, and suspend jobs, to help users migrate HPC workloads to Kubernetes.
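Besides the command line, jobs can also be created programmatically. The sketch below creates the MPI-style Volcano Job described above through the Kubernetes Python client; the image, commands, and replica counts are placeholders rather than a tested configuration.

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# A Volcano Job running an MPI workload: one master plus two workers,
# gang-scheduled via minAvailable, restarted as a whole if any pod is evicted.
mpi_job = {
    "apiVersion": "batch.volcano.sh/v1alpha1",
    "kind": "Job",
    "metadata": {"name": "mpi-demo", "namespace": "default"},
    "spec": {
        "minAvailable": 3,                  # master + 2 workers must start together
        "schedulerName": "volcano",
        "plugins": {"ssh": [], "svc": []},  # passwordless SSH + headless service
        "policies": [{"event": "PodEvicted", "action": "RestartJob"}],
        "tasks": [
            {
                "name": "mpimaster",
                "replicas": 1,
                "template": {"spec": {
                    "restartPolicy": "OnFailure",
                    "containers": [{
                        "name": "master",
                        "image": "example/mpi:latest",  # placeholder image
                        "command": ["/bin/sh", "-c", "mpirun -n 2 ./your_app"],
                    }],
                }},
            },
            {
                "name": "mpiworker",
                "replicas": 2,
                "template": {"spec": {
                    "restartPolicy": "OnFailure",
                    "containers": [{
                        "name": "worker",
                        "image": "example/mpi:latest",  # placeholder image
                        "command": ["/bin/sh", "-c", "/usr/sbin/sshd -D"],
                    }],
                }},
            },
        ],
    },
}

api.create_namespaced_custom_object(
    group="batch.volcano.sh", version="v1alpha1",
    namespace="default", plural="jobs", body=mpi_job,
)
```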
Various SDK languages are supported as well. Nowadays clusters keep getting bigger and bigger; as far as I know, there are more than 2,000 nodes in a single Kubernetes cluster in some users' production environments. KubeSim is a simulator of Kubernetes for batch and offline workloads. It is based on Kubemark, and we can use it to simulate a super-scale cluster to do scheduler performance testing. There are two problems in Kubemark: the first is that users cannot configure the resources of hollow nodes, and the second is that the status of pods running on hollow nodes cannot be updated. These problems are resolved in KubeSim. In the future, we may add more enhancements to KubeSim, for example more job templates to simulate batch job submission, et cetera.

You can join the Volcano and EDL communities to learn more about how to apply and optimize distillation on Kubernetes. We have Slack channels for open communication, and you can also submit PRs and issues on GitHub, on the Volcano, PaddlePaddle, and EDL repositories. We will respond to you as soon as possible. Thank you for listening.