Hello, everyone. I'm William from Huawei Cloud. It's a great honor for me to participate in this event. Today, my topic is Volcano, a cloud native batch computing system for AI, Big Data, and HPC. A brief introduction about me: I'm a tech lead of the Volcano community, and my work focuses on batch scheduling, performance, and AI and Big Data acceleration.

Here is the agenda. Today, I will first introduce the trend of batch computing and the challenges we are facing, and then I'd like to talk about the cloud native project Volcano. Later, I'm going to show you how Volcano addresses these challenges in Kubernetes. Then I'll show several use cases of how users build their platforms with Volcano. Finally, I will introduce the community.

First, let me start with the batch computing development journey. There are three different lines showing three different domains. Initially, we started with the HPC domain. From the figure we can see that in 2018, HPC and AI started to converge: some AI applications can be used for HPC, and HPC technology can also support AI. At KubeCon 2020, people were also talking about how to better deploy HPC applications on Kubernetes. In the Big Data domain, as you know, Spark and Flink already officially support Kubernetes. In the AI domain, most of the AI frameworks, such as TensorFlow and Caffe, are running on top of Kubernetes right now.

More and more users would like to migrate their batch workloads to Kubernetes to unify the platform, and this brings two major benefits. First, multiple resource managers can be unified; users do not need to maintain several separate systems anymore, so the cost of learning and maintenance is reduced. Second, they can aggregate their resource pools together, which leads to higher resource utilization.

At the early stage, Kubernetes prioritized microservice support, and there are several major challenges we observed in the early days when people were trying to bring batch workloads to Kubernetes. The first is the lack of job management. Kubernetes confines itself to pod-level scheduling, with no awareness of upper-level applications. While it does support job management, the function is weak: a Job in Kubernetes can only describe one kind of pod, but an AI job involves multiple types of pods. The second is that Kubernetes has limited scheduling policies, especially compared to systems like Yarn, Slurm, or SGE. For example, job priority, preemption, fair share, reservation, and topology awareness were not supported in Kubernetes. Third, the computing frameworks in different domains were each trying to run better on Kubernetes, so they developed their own operators, such as the TF operator, the Spark operator, and so on. A cluster needs to deploy different kinds of operators to support different computing frameworks, and it is complex to maintain all of them. People are actually looking for a deeper integration and better support for these computing frameworks. Fourth, Kubernetes can isolate resources by namespace, but it lacks dynamic resource sharing: Kubernetes has no concept of a queue, so resources cannot be used in an efficient way. The next challenge is heterogeneous computing. Many companies have their own accelerator hardware, such as GPU, TPU, and NPU. How to schedule these resources uniformly and provide the best resource allocation in a coordinated way is something that needs to be addressed. The last thing is performance: batch computing workloads always have a higher demand for the throughput and scalability of the system.
Kubernetes was not able to satisfy this demand without performance tuning. So that's why we started working on the Volcano project.

Here is the overall Volcano architecture. We are trying to provide richer, more advanced scheduling policies, especially to help batch workloads run better on top of Kubernetes. As the figure shows, Volcano is not just a scheduler. It has a Job resource and a job controller to support enhanced job lifecycle management. It has the Queue to offer resource planning and sharing for multiple tenants. For accelerator hardware, we are working on supporting heterogeneous devices to better manage GPUs, NPUs, and so on. And we are working with upstream Kubernetes to improve performance for batch workloads. From the graph, we can see Volcano engages deeply with the upstream computing frameworks. So far, we have supported almost all of the popular mainstream frameworks.

Volcano was open sourced in June 2019 and donated to CNCF in April 2020. As a sandbox project, Volcano released a feature version every three months, and currently there are more than 350 contributors in the community. Volcano moved to the incubator level in April 2022, and more than 50 enterprise users have adopted Volcano in their production environments.

Before we get into the details of Volcano, let's look at some very important concepts. The first is the job. What does the multi-pod template mean? Let's take an MPI job as an example. People expect to use different images for the master pod and the worker pods in each MPI job. The job needs to offer this kind of ability, allowing different configurations for the different roles of pods in each job. The second is the namespace and the resource quota. In Kubernetes, a namespace is used to isolate resources; it is often mapped to, or regarded as, a user. The resource quota refers to the upper limit of resources that a user is able to use at most. The queue, by contrast, is widely used for sharing resources between different users or resource pools; it is more like the queues in systems such as Yarn or Slurm. The purpose of a quota is to limit resources; the purpose of a queue is to share resources between multiple users. We can see there's a difference between them.

Let's move to job management in Volcano and see how it addresses the challenges we mentioned previously. The multi-pod template is supported in the Volcano job. With this feature, the Volcano job is able to uniformly support most mainstream computing frameworks like MPI, TensorFlow, Horovod, PyTorch, et cetera. The Volcano job also provides an extensible job plugin mechanism that allows users to define custom behavior, and it provides several built-in plugins by default. Also, the Volcano job controller and scheduler can coordinate to do more optimization for batch workloads, for example job dependency scheduling and delay scheduling, to improve job performance.

Next, let's go to resource management. We added the Queue in Volcano for resource planning and sharing. The queue is cluster-scoped and decoupled from the namespace. The queue is mainly used to share resources between multiple tenants and resource pools. For example, one queue can map to a group in a company, and you can also match a specific kind of resource, like GPUs, to this queue in the cluster. Users are also allowed to configure different policies for different queues. It's very useful.
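To make the multi-pod template concrete, here is a minimal sketch of what a Volcano job with two pod roles might look like, expressed as Python dictionaries and submitted with the official Kubernetes Python client. The image name is a placeholder I made up for illustration, and the field layout follows the Volcano Job CRD as I understand it, so treat this as a sketch rather than a definitive manifest:

```python
from kubernetes import client, config

# A Volcano Job with two pod templates, one master and two workers, as in
# the MPI example. minAvailable=3 asks for gang scheduling of all three
# pods. "mpi-demo-image" is a hypothetical image used only for illustration.
mpi_job = {
    "apiVersion": "batch.volcano.sh/v1alpha1",
    "kind": "Job",
    "metadata": {"name": "mpi-demo", "namespace": "default"},
    "spec": {
        "schedulerName": "volcano",
        "minAvailable": 3,
        "queue": "default",
        "plugins": {"ssh": [], "svc": []},  # built-in job plugins
        "tasks": [
            {
                "name": "master",
                "replicas": 1,
                "template": {"spec": {
                    "restartPolicy": "OnFailure",
                    "containers": [{"name": "master", "image": "mpi-demo-image"}],
                }},
            },
            {
                "name": "worker",
                "replicas": 2,
                "template": {"spec": {
                    "restartPolicy": "OnFailure",
                    "containers": [{"name": "worker", "image": "mpi-demo-image"}],
                }},
            },
        ],
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="batch.volcano.sh",
    version="v1alpha1",
    namespace="default",
    plural="jobs",
    body=mpi_job,
)
```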
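And since the scenarios that follow are all built on queues, here is a similar sketch of two weighted queues like the ones in the first scenario below. Queues are cluster-scoped, so they are created with the cluster-level API; the queue names here are just examples:

```python
from kubernetes import client, config

# Two cluster-scoped Volcano queues with a 2:1 weight ratio, matching the
# scenario described next. "reclaimable" allows pods that borrowed idle
# resources to be evicted when the lending queue needs them back.
def make_queue(name: str, weight: int) -> dict:
    return {
        "apiVersion": "scheduling.volcano.sh/v1beta1",
        "kind": "Queue",
        "metadata": {"name": name},
        "spec": {"weight": weight, "reclaimable": True},
    }

config.load_kube_config()
api = client.CustomObjectsApi()
for queue in (make_queue("queue1", 2), make_queue("queue2", 1)):
    api.create_cluster_custom_object(
        group="scheduling.volcano.sh",
        version="v1beta1",
        plural="queues",
        body=queue,
    )
```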
Now that we know the queue concept in Volcano, let's talk about some scenarios. The first scenario is: if you have two teams, how do you share resources between them with queues? Here is an example. There are two queues, queue1 and queue2, which map to the two teams, and their weight ratio is 2:1. There are six CPUs in the cluster, and there are six pods in queue1 while queue2 is empty. So the pods in queue1 can borrow resources from queue2, and all the pods get running. Then a new job is submitted to queue2. The scheduler will reclaim two CPUs from queue1 to keep the ratio at 2:1, and the new job in queue2 gets two CPUs and gets running. The second scenario is: some users have routine jobs or urgent jobs and want to reserve an amount of resources for them. For this case, they can configure the guaranteed resources of the queue to make a reservation.

Next, fair share is a common requirement for elastic or streaming jobs like Spark. However, within one queue, the more pods you submit, the more resources you are likely to get; there's no fair share. Volcano provides fair share between jobs and namespaces to address this. From the graph, you can see user1 and user2 submit a small job and a big job to the same queue. The small job might starve without fair-share scheduling; Volcano ensures the big job and the small job get resources fairly with the DRF algorithm. We also added namespace-level fair share in Volcano. As the graph shows, although namespace3 submits more jobs, say job4 and job5, than namespace2, they get the same resources eventually.

As we mentioned in the first section, there are not enough scheduling policies in Kubernetes for batch workloads, and we have been spending a lot of effort to fill these gaps. Here are some of the scheduling policies provided in Volcano; I will present a few of them.

Nowadays, machine learning workloads have a higher demand for GPUs compared to traditional workloads. The GPU is a precious resource, so how to improve GPU utilization is a hot topic with great value. Elastic training can dynamically adjust the number of instances involved in the training, greatly improving the utilization of GPU resources, especially on the public cloud, where it can work with spot instances to lower cost and improve training efficiency. First, let's see what an elastic job looks like. The left figure shows a Volcano job whose minAvailable is 5, meaning the job has five pods at least, and whose replicas is 10, meaning the job has ten pods at most. The job gets running once five pods are allocated, and then the job will extend to more pods if there are free GPU resources.

Here is another scenario. As you know, inference services always have lower GPU utilization compared to training workloads, so people tend to co-locate inference services with elastic training workloads to improve utilization. The right figure shows an example: inference job 2, with higher priority, preempts elastic pods from training job 1 to ensure its SLA. Whenever there are free resources available, job 1 will extend to more pods to accelerate the training.
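As a rough illustration of that elastic job, here is what such a spec might look like, again as a Python dictionary following the Volcano Job CRD. The image, queue name, and GPU request are hypothetical, and the exact elastic behavior depends on your Volcano version, so take this as a sketch:

```python
# Sketch of an elastic training job: minAvailable=5 lets the job start once
# five pods are allocated, while replicas=10 lets it grow to ten pods when
# free GPUs are available (and shrink back under pressure or preemption).
# "elastic-train-image" and the "training" queue are hypothetical names.
elastic_job = {
    "apiVersion": "batch.volcano.sh/v1alpha1",
    "kind": "Job",
    "metadata": {"name": "elastic-training", "namespace": "default"},
    "spec": {
        "schedulerName": "volcano",
        "minAvailable": 5,   # at least five pods: the gang-scheduled core
        "queue": "training",
        "tasks": [
            {
                "name": "worker",
                "replicas": 10,  # at most ten pods: the elastic upper bound
                "template": {"spec": {
                    "restartPolicy": "OnFailure",
                    "containers": [{
                        "name": "worker",
                        "image": "elastic-train-image",
                        "resources": {"limits": {"nvidia.com/gpu": 1}},
                    }],
                }},
            }
        ],
    },
}
```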
This page shows how task topology and IO-aware scheduling help distributed training. For some GPU training cases, the data exchange between tasks costs a lot of time and becomes the bottleneck of the training. If the time spent on data exchange can be reduced, the training performance can be improved. Task topology scheduling places the PS and the workers on the same node to reduce the data exchange latency: the PS and workers can use the host's internal network, which is better than the network across hosts. We also ran a test against the default scheduler. There are three nodes in the testing cluster, and we submitted a training job with two PS and four workers. We got three different placements in the end, and the results are random. As you know, group C is the best placement, the one we want. With task topology scheduling, we are able to get a stable result like group C. As far as I know, some users use the affinity and anti-affinity features to achieve this goal; however, the complexity increases with the cluster scale, and the performance is not so good. We also did some research on IO-aware scheduling, which can minimize the maximum data transfer latency and get even better performance. The figure shows the VGG16 model training results with the default scheduling, task topology, and IO-aware scheduling. In our test, IO awareness got a 30% performance increase compared with the default.

In a real production cluster, users often submit multiple kinds of jobs, small jobs and big jobs, and how to avoid the big jobs or small jobs starving is very important. The left figure shows an example. At moment T1, a user submits a big job 1 and a small job 2. The small job gets allocated, and the big job keeps pending due to insufficient resources. At moment T2, a new small job 3 is submitted and gets allocated, while big job 1 keeps pending, still due to insufficient resources. As time goes on, the big job will starve if the released resources can never satisfy it while users keep submitting small jobs. SLA scheduling allows you to configure jobs so that they complete on time, reducing the risk of missed deadlines. The SLA waiting time is the maximum time that a job should stay pending. When that time is reached, the SLA plugin moves the pending job ahead and starts to reserve resources for it until the job's request is satisfied.

Spark started to provide support for Kubernetes in version 2.3 in 2018, and later the Spark operator provided another way to help run Spark on top of Kubernetes as well. However, for a long time, Spark on Kubernetes lacked batch scheduling abilities. Late last year, we started to work with Spark contributors to support custom batch scheduling for Spark. Spark with Volcano provides batch scheduling abilities like job priority, queues, fair share, resource reservation, et cetera. It will be released in Spark version 3.3.

As the amount of data continues to grow and the complexity of business increases, users require Kubernetes to support larger clusters. We also spent a lot of effort to support 10,000 nodes and 1 million containers in one cluster by optimizing the container network, scheduling, the container engine, etcd, and the API server. The red part lists some specific approaches. Take the scheduler, for example: we improved the scheduling throughput to 1,500 pods per second by adopting the equivalence cache, batch binding, asynchronous binding, et cetera.
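Coming back to the SLA scheduling for a moment: in Volcano, behavior like this is driven by the scheduler's configuration. Below is a sketch of what enabling an SLA waiting time might look like, patching the scheduler's ConfigMap with the Kubernetes Python client. The plugin and argument names follow the Volcano documentation as I remember it, and the ConfigMap name and namespace match a default install, so please verify them against your own deployment:

```python
from kubernetes import client, config

# Scheduler configuration sketch: the second tier keeps the drf/proportion
# plugins that implement the fair share discussed earlier, and the sla
# plugin declares a cluster-wide maximum pending time. Treat the exact
# plugin/argument names as assumptions to check against your Volcano version.
SCHEDULER_CONF = """\
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: sla
    arguments:
      sla-waiting-time: 1h   # jobs pending longer start reserving resources
- plugins:
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
"""

config.load_kube_config()
client.CoreV1Api().patch_namespaced_config_map(
    name="volcano-scheduler-configmap",   # default install name (assumption)
    namespace="volcano-system",
    body={"data": {"volcano-scheduler.conf": SCHEDULER_CONF}},
)
```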
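And for the Spark integration, here is roughly how a PySpark application could opt into Volcano once Spark 3.3 is available. The configuration keys come from the Spark on Kubernetes documentation for the 3.3 release as I understand it; the master URL, image, and template path are placeholders I invented for illustration:

```python
from pyspark.sql import SparkSession

# Sketch of a Spark 3.3 application scheduled by Volcano instead of the
# default scheduler. The API server URL, container image, and pod group
# template path below are hypothetical placeholders.
spark = (
    SparkSession.builder
    .master("k8s://https://example-apiserver:6443")
    .appName("volcano-demo")
    .config("spark.kubernetes.container.image", "spark-image")
    # Tell Spark to set schedulerName=volcano on driver/executor pods.
    .config("spark.kubernetes.scheduler.name", "volcano")
    .config("spark.kubernetes.driver.pod.featureSteps",
            "org.apache.spark.deploy.k8s.features.VolcanoFeatureStep")
    .config("spark.kubernetes.executor.pod.featureSteps",
            "org.apache.spark.deploy.k8s.features.VolcanoFeatureStep")
    # Queue and other PodGroup fields are supplied via a template file,
    # if I remember the 3.3 interface correctly.
    .config("spark.kubernetes.scheduler.volcano.podGroupTemplateFile",
            "/path/to/podgroup-template.yaml")
    .getOrCreate()
)
```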
Now let's take a look at how users use Volcano to build their AI platforms. Xiaohongshu is one of the top social media and e-commerce companies in China. Many people use their app on mobile phones; they have 100 million monthly active users. The main workload they run is providing recommendations for users: they need to refresh models every few minutes, and they have online services that react immediately when users refresh their feeds. The challenge is that they have large clusters with nearly a thousand nodes, their models have 100 billion parameters, and one job has hundreds of PS and workers, so they want the best resource allocation. The user adopted task topology scheduling and gained a 20% performance increase. They also use Volcano's SLA-based scheduling to prevent large jobs from starving.

This is another case, from a financial sector user. They initially used Yarn to schedule batch jobs. As their business kept growing, their environment requirements also changed: different research teams required containerized deployment to avoid environment conflicts and dependency issues. However, Kubernetes lacked fair share scheduling between multiple teams. They also had the requirement of using different frameworks such as TensorFlow, PyTorch, and MPI, which would require installing different kinds of operators in Kubernetes, leading to high maintenance and learning costs. The user looked for a solution and found that Volcano could satisfy their requirements and offer diverse scheduling abilities. What's more, they could use the Volcano job to unify the TensorFlow, PyTorch, and MPI operators. Eventually, they decided to migrate from Yarn to Kubernetes and Volcano. Now Volcano supports 300,000 pods created per day in their production environment.

As for public adoption, we do get a lot of user adoption, especially from people running AI, Big Data, genomics, and transcoding workloads on Kubernetes. Here are some of the adopters using Volcano in production environments. Looking at the code contributions over the past year, you can see that we have good diversity in community development: more than 50% of contributions are from community members.

Here is the release journey of Volcano. At the early stage, we developed a set of scheduling policies to support batch workloads and then integrated with ecosystems such as TensorFlow, Spark, Kubeflow, Flink, Argo, et cetera. Later, we found there were a lot of gaps in job management, so we spent quite a lot of time enhancing job management to deeply support the upstream computing frameworks. In the future, we are going to support several scenarios, like multi-cluster scheduling for batch workloads, performance enhancements for large-scale clusters, intelligent co-location for better utilization, et cetera.

Here are some resources for Volcano. You are welcome to join our community and give us your feedback. Thank you for attending this talk. That's all from me. Thank you.