Hello, everyone. Welcome to KubeCon + CloudNativeCon Europe 2022. This is Klaus. I'm the founder of Volcano and kube-batch, and I used to be a co-lead of SIG Scheduling and a technical lead of the CNCF TAG Runtime. Volcano is a cloud native batch system for compute-intensive workloads such as HPC, AI, and big data. It was promoted to a CNCF incubating project earlier this year. So it's my pleasure to give an introduction and a deep dive on this project.

Here is the background on why we wanted the Volcano project. At the beginning, we built batch systems for traditional HPC workloads. Then we built big data platforms to handle data. After that, we leveraged cloud native technology for AI platforms. The problem is that all these platforms use different technology stacks and build different ecosystems. This makes it hard to share resources between the different workloads, and resource utilization is really low. So more and more organizations are leveraging cloud native technology to build a unified platform for all workloads. But there are still some gaps in the cloud native ecosystem. We built kube-batch in 2017 to handle the scheduling gap, and then we built Volcano on top of kube-batch to handle all the other related gaps.

Here are the major gaps in the cloud native ecosystem for batch workloads. The first one is job management. It is a common requirement to have different pod templates and fine-grained lifecycle management within a job, and it is complex and hard to maintain to build a separate CRD for each type of job. The second gap is scheduling, such as priority, fair share, gang scheduling, capacity, reservation, backfill, and so on. The third gap is supporting different workloads with a common job CRD to reduce complexity and maintenance effort. The next one is dynamic resource sharing between different tenants, for example through queues. The last gap is performance. Most of these workloads require high throughput. For example, we got a requirement to dispatch 8,000 pods per second for Spark jobs, while, as we know, the throughput of the default scheduler is about 100 pods per second.

Here is the overview of the Volcano project. Volcano includes several components. For the multi-cluster scenario, we have a federation subproject in the pipeline to balance resources between different clusters. In each individual cluster, we have introduced several CRDs, such as the Job for common batch workloads and the Queue for resource sharing, and the controller manages the lifecycle of these CRDs. The Volcano scheduler is built on top of kube-batch, but we introduced more scheduling algorithms and also made performance enhancements over kube-batch. Volcano was open sourced in 2019, donated to the CNCF in 2020, and promoted to an incubating project this year. Currently, there are more than 300 contributors in the community, and more than 50 enterprise users have adopted Volcano in their production environments. We make a release every three months, and the latest release is 1.5. Here are some examples of users: we already use Volcano in Huawei Cloud in our production environment, and we also worked with AWS and some other teams to build Volcano into their products.

As we know, in batch systems there are several concepts which are important to the high-level design. The first concept is the job. A batch system usually has a common job specification for all kinds of workloads, such as life science, AI, big data, and so on.
So in the cloud native ecosystem, it is necessary to introduce multiple pod templates and fine-grained error handling. The second concept is the tenant. Traditional batch systems use the user as the tenant, while Kubernetes uses the namespace as the tenant, based on my understanding. To avoid system overload, resource limits are required in a batch system. For example, Slurm supports job queues and quality of service, and Kubernetes supports quotas to control how many resources can be created in the system. The last one is the queue. The queue is a common concept in batch systems and is widely used in lots of systems such as Slurm, YARN, and LSF. It helps the administrator manage the resources in the cluster and share resources between different tenants. In addition, some systems also support a different scheduling configuration for each queue, which is a useful feature for users.

In the following slides, we are going to introduce the details of Volcano. The first section is about job management. In Volcano, we introduce a Job CRD with multiple pod templates as a common specification to support all kinds of workloads. Currently, we have already verified several types of workloads in production environments, such as MPI, TensorFlow, PyTorch, Horovod, Spark, Flink, Cromwell, and lots of other frameworks. Here is the MPI example: the mpirun launcher and the workers use different pod templates with different command lines. Volcano also supports fine-grained lifecycle management, for example restarting the whole job when one of its tasks fails. Volcano also provides job plugins, which help users make customized enhancements to the job. For example, the ssh plugin configures SSH automatically for MPI jobs, so we don't need to add more configuration to the MPI pods, and the svc plugin creates a headless service for communication between the pods in each job.

As we know, the queue is a common concept in batch systems, used to share resources between different tenants. In Volcano, we follow this practice and make queues cluster level. Queues are used to share resources between tenants, and quota is considered a resource limit for a tenant. Currently, Volcano supports FIFO, priority, and proportion algorithms for queues, and the configuration is global for all queues; we are also going to support per-queue configuration for users.

Here is an overview of how resources are shared between queues. Consider a cluster with six CPUs and two queues, Q1 and Q2, which map to two teams, with weights of two and one respectively. At the beginning, there are six pods in Q1 and Q2 is empty. According to the proportion algorithm, the pods in Q1 can borrow the resources of Q2 to get all the pods running. Then a job is submitted to Q2. The scheduler will reclaim two CPUs from Q1 based on the queues' weights, so the new job in Q2 gets two CPUs to run. In addition, we support a guarantee in the queue to reserve resources for high-priority jobs, and a capability in the queue to avoid system overload.
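To make the job-management and queue discussion above concrete, here is a minimal sketch of a weighted queue plus an MPI-style Volcano job that uses it. The field names follow the scheduling.volcano.sh/v1beta1 and batch.volcano.sh/v1alpha1 APIs; the image names and container commands are placeholders rather than the exact example shown in the talk.

```yaml
# A weighted queue and an MPI-style job that runs in it (illustrative sketch).
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: q1
spec:
  weight: 2              # proportional share relative to other queues
  capability:            # optional hard cap to avoid overloading the cluster
    cpu: "4"
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mpi-example
spec:
  schedulerName: volcano
  queue: q1
  minAvailable: 3              # gang scheduling: launcher + 2 workers start together
  plugins:
    ssh: []                    # sets up passwordless SSH between the job's pods
    svc: []                    # creates a headless service for pod-to-pod communication
  policies:
    - event: PodFailed
      action: RestartJob       # fine-grained lifecycle: restart the whole job if a task fails
  tasks:
    - name: mpimaster
      replicas: 1
      policies:
        - event: TaskCompleted
          action: CompleteJob  # the job succeeds once the launcher task finishes
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mpimaster
              image: example.com/mpi-demo:latest          # placeholder image
              command: ["mpirun", "-np", "2", "/opt/demo"]  # placeholder launcher command
    - name: mpiworker
      replicas: 2
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mpiworker
              image: example.com/mpi-demo:latest          # placeholder image
              command: ["/usr/sbin/sshd", "-D"]           # keep sshd running for mpirun
```

The point of the sketch is that the two pod templates and the lifecycle policy live in one job object, while the ssh and svc plugins do the per-pod wiring, so the MPI templates themselves stay small.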
The next one is about fair share. Fair share is a common requirement for elastic and streaming jobs like Spark. However, in Kubernetes, the more pods you submit, the more likely you are to get more resources, which is not fair sharing. Volcano provides fair share between jobs and between namespaces, as we can see from the graph: user 1 and user 2 submit a small job and a big job to the same queue, and the small job may starve without fair share scheduling. With the DRF algorithm, we can make sure that big jobs and small jobs get resources fairly. But job-level fair share alone is not enough. Suppose one namespace submits far more jobs to the queue than another namespace; it could occupy most of the queue's resources. So we added namespace fair share in Volcano, as the graphs show: although namespace 3 submits more jobs than namespace 1, they eventually get the same resources. This is important for multi-tenant environments.

Another topic is the hierarchical queue. Hierarchical queues are very useful in lots of scenarios. For example, in this figure there is a cluster with 10,000 GPUs. The first-level organizations, such as the support, engineering, and marketing departments, have static quotas, and resources are not allowed to be shared across them, to ensure isolation. But the second-level organizations, such as the dev and QA teams, are allowed to share their parent's resources in order to get higher utilization. A flat queue cannot easily meet this kind of need. Several engineers from Baidu are contributing to this feature in Volcano. With this feature, it will be easy to map the company's organizations and teams onto queues.

As we mentioned in the first section, there are not enough scheduling policies in Kubernetes for batch workloads, and we have spent lots of time filling this gap. Here is part of the scheduling policies and algorithms provided in Volcano, and we are going to introduce some of them.

This one is about elastic training. Today, machine learning workloads have a higher demand for GPUs compared to traditional workloads, and GPUs are expensive resources, so improving GPU utilization is a hot and important topic. Elastic training can dynamically adjust the number of instances during training, which greatly improves the utilization of GPU resources. Especially on the public cloud, it can work together with spot instances to lower cost and improve training efficiency. Firstly, let's see what an elastic job looks like. The left picture shows a Volcano job: the minAvailable of the job is five pods at least, and the replicas of the job are ten pods at most. The job starts running with five pods allocated, and it will get more pods if there are enough GPU resources for them to run.

Here is another scenario. The inference service usually has lower GPU utilization compared to the training workload, so people tend to co-locate the inference service with the elastic training workload to improve utilization. In the right picture, the inference job (job 2) with higher priority preempts elastic pods from the training job (job 1) to start the service.
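To tie the elastic training example above to the API: the elasticity comes from the gap between the job's minAvailable and the task's replicas. A minimal sketch might look like the following, where the image name and resource requests are placeholders.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: elastic-training
spec:
  schedulerName: volcano
  minAvailable: 5          # the job can start as soon as 5 pods are schedulable
  tasks:
    - name: worker
      replicas: 10         # grows up to 10 pods when spare GPUs exist
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: trainer
              image: example.com/elastic-trainer:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1
```

Because the five extra pods are above minAvailable, a higher-priority inference job can preempt them without killing the whole training job, which is exactly the co-location scenario described above.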
Another feature is about task topology and IO-aware scheduling. This slide shows how task topology and IO-aware scheduling speed up distributed training. For some GPU training cases, the data exchange between the tasks takes a lot of time and becomes the bottleneck of training; if the data exchange time can be reduced, the training performance can be improved. Task topology scheduling places the PS and worker pods on the same node to reduce the data exchange latency, so the PS and worker pods can communicate within the host, which is better than going across hosts over the network. We also ran a test against the default scheduler: there are three nodes in the testing cluster and a training job with two PS and four worker pods. We got three different placements, and the results are random with the default scheduler. As we know, placement C is the one we want. With task topology scheduling, we are able to get a stable result with placement C. As far as I know, some users are using the affinity and anti-affinity features to achieve this goal; however, that increases complexity, and the cluster scalability and performance are not good enough. We also did some research on IO-aware scheduling: with the task topology information and the IO information, we can minimize the maximum data transfer latency and get even better performance. The picture shows the training results with default scheduling, task topology, and IO-aware scheduling. IO-aware scheduling gets a 30% performance improvement compared to the default scheduler; the results will depend on the data exchange patterns and the models.

The next one is about SLA scheduling. In a real production cluster, users often submit multiple kinds of workloads, such as small jobs and big jobs, and avoiding starvation of both big and small jobs is very important to meet the SLA of the job. In the left picture, at time T1, users submit a big job (job 1, with gang scheduling) and a small job (job 2). The small job gets allocated, and the big job keeps pending because there are not enough resources. At time T2, a new small job (job 3) is submitted and gets allocated, and the big job keeps pending because there are still not enough resources. As time goes on, the big job keeps starving, because whenever resources are released they are taken by newly submitted small jobs. The SLA plugin lets users configure the job so that it completes on time and reduces the risk of missing the deadline. The SLA waiting time is the maximum time that a job should stay pending. When the SLA waiting time is reached, the SLA plugin moves the pending job to the inqueue state and starts to reserve resources for the job until its requirement is satisfied.

The next important feature is NUMA awareness. For compute-intensive, performance-demanding jobs, users may want to use exclusive CPUs, which reduces the loss caused by context switching. For jobs running on NUMA-based processors, it is also better to have the CPUs assigned to a pod on the same NUMA node, which reduces communication loss and improves training performance. Currently, the topology manager and CPU manager provide topology alignment on the node, but the scheduler cannot perform topology-aware scheduling because it is not aware of the topology and the CPU allocation on the nodes. Therefore, a container may be refused by the kubelet because the node's resources do not meet the topology manager's requirement, and eventually the scheduling fails. Another problem is that the topology manager works in the kubelet on a single node, so it cannot choose the best node for the job. To address those limitations, Volcano introduces a NUMA-aware plugin together with a resource exporter component, and offers pod-level topology policies so that users can adopt different topology policies for their own cases. The resource exporter reports the node's NUMA information through a CRD, and based on the NUMA information and the pod's request, Volcano conducts the scheduling.
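Several of the policies in this section (fair share, task topology, SLA, NUMA awareness) are turned on through the Volcano scheduler configuration rather than per job. The sketch below is illustrative only: the plugin names and the sla-waiting-time argument follow the Volcano documentation as I recall it, so please verify them against the release you are running.

```yaml
# volcano-scheduler.conf (held in the volcano-scheduler ConfigMap) -- illustrative sketch
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang             # all-or-nothing scheduling for a job's pods
  - name: conformance
- plugins:
  - name: drf              # job-level fair share
  - name: predicates
  - name: proportion       # queue-level sharing by weight
  - name: nodeorder
  - name: task-topology    # pack PS/worker pods together to cut data-exchange latency
  - name: sla              # bound how long a job may stay pending
    arguments:
      sla-waiting-time: 1h
  - name: numa-aware       # score nodes using NUMA info reported by the resource exporter
```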
The next one is about Spark. Spark started to support Kubernetes natively in version 2.3, and the Spark operator later provided another way of running Spark on Kubernetes as well. However, for a long time, Spark on Kubernetes lacked batch scheduling features. Over the last year, we worked with the Spark contributors to support customized batch scheduling for Spark. Spark with Volcano provides batch scheduling capabilities like job priority, queue fair share, resource reservation, and so on. It will be released in Spark 3.3.

Another important feature is colocation. Deploying multiple kinds of services together has become common in recent years. Online web services and offline analysis services have different characteristics. Online services require few resources but are very demanding on latency, such as recommendation and similar services, and they see traffic peaks every day. Offline analysis services are compute intensive and require more resources, but their traffic is stable and they don't demand fast responses like online services. If online services and offline workloads can be deployed in a hybrid way, resources can be used much more fully. Another pain point is over-provisioning of resources: users want to ensure stability, so they tend to request exclusive resources for their pods. However, Kubernetes scheduling relies on the stated pod requests, which leads to high resource allocation rates but low resource usage; typically, the average cluster CPU usage can be less than 15%, which is a huge waste of resources. To resolve this pain point, Volcano has done some research and investigation. For example, based on monitored usage, Volcano can perform dynamic, usage-based scheduling. Additionally, on Huawei's operating system, the CPU and memory resources are isolated between online and offline services. Moreover, features like network isolation, interference detection, and application profiling are in the pipeline.

We also made several enhancements to throughput. As the amount of data continues to grow and the complexity of the business increases, users require Kubernetes clusters to support a larger scale. We have spent lots of effort to support 10,000 nodes and up to a million pods in a cluster by optimizing the container network, scheduling, the container engine, and so on. The right part lists some of the enhancements in our project. Take the scheduler as an example: we improved the scheduler throughput to 1.5 thousand pods per second by adopting an equivalence cache, batch binding, and other optimizations, which also depend on some enhancements to Kubernetes, in particular the kube-apiserver.

Here are some user cases for Volcano. Xiaohongshu is one of the top social media and e-commerce companies in China; many people use their application on mobile, and they have 100 million active users per month. Their main workloads provide recommendations for end users: they need to refresh the model every few minutes, and they also have online services that react when users refresh their feeds. The challenge is that they have a large cluster with nearly 1,000 nodes, the model has 100 billion parameters, one job has hundreds of PS and worker pods, and they want to improve resource utilization. The user adopted task topology scheduling and got a 20% performance improvement, and they also use Volcano's SLA-based scheduling to prevent large jobs from starvation.

Another case is the company Ruitian. They initially used YARN to schedule batch jobs. As the company grew, their deployment requirements also changed.
Different research teams require containerized deployment to avoid environment conflicts and dependency issues. However, the Kubernetes default scheduler lacks fair-share scheduling between different tenants. They also have the requirement of using different frameworks such as TensorFlow, PyTorch, and MPI, which would require installing different kinds of operators in Kubernetes and bring lots of maintenance effort. The user looked for a solution and found that Volcano satisfies their requirements and offers diverse scheduling capabilities, and they use the Volcano Job as a common job specification for all kinds of AI training workloads. Eventually, they decided to migrate from YARN to Kubernetes with Volcano. Currently, there are about 300,000 pods in this user's production environment.

As for public adoption, we get lots of user adoption, especially from people running AI, big data, life science, and transcoding workloads on Kubernetes. Here is part of the list of adopters using Volcano in production environments. From the recent contribution statistics, we can see good diversity in community development.

Here is the release journey of Volcano. We have released more than 16 major versions since the project started. At the early stage, we developed a set of scheduling policies to support batch workloads, then integrated with the ecosystem, such as the Kubeflow and Spark operators, Argo, and so on. Later, we found that there were lots of gaps in job management, so we spent quite a lot of time enhancing job management and pushing the work upstream. In the future, we are going to support several scenarios such as multi-cluster scheduling, performance enhancements, and intelligent colocation for better resource utilization.

Here are some useful links for the Volcano community. You're welcome to join the community and give us feedback. Okay, that's all from me. Thank you very much, and welcome to join our community.