Good morning everyone, and welcome to this session. I'm William Wang, a maintainer of the Volcano community, from Huawei Cloud. This is my partner. My name is Li Mengxuan, and I come from 4Paradigm. Today our topic is cloud native batch computing with Volcano. We have some updates and a future roadmap to share. There are four parts: the first part is an introduction to Volcano, then I will share the updates and improvements in the Volcano community, the third part is a deep dive into GPU sharing and isolation from my partner, and finally I will share the community status and the roadmap.

Here is the overall Volcano architecture. From the picture you can see that Volcano has a strong relationship with upstream computing frameworks like TensorFlow, PyTorch, Spark, Flink, and so on. You can also see that Volcano is not just a scheduler: it has PodGroup and Job controllers to provide enhanced job management, and it has the Queue CRD to help users share their resources more efficiently. We have also spent a lot of effort working with the underlying hardware to support different kinds of heterogeneous devices such as GPU, Ascend NPU, TPU, and so on, and we work with the Kubernetes community to improve performance, especially for AI and Spark workloads. This year we will add two more components to the architecture. The first one is the Volcano agent, which is designed to provide QoS management for workloads. We will also add a new component named the rescheduler, which works with the Prometheus monitoring system to balance resources more efficiently across the Kubernetes cluster.

Here is the evolution of Volcano. At the beginning we designed a set of batch APIs to support AI workloads running on Kubernetes. Then we added the PodGroup and Queue CRDs and provided a set of scheduling policies for the big data area, like fair-share scheduling and resource reservation to prevent resource competition between the Spark driver and the executors. Then we received a lot of feedback from community users: there are so many kinds of training operators that it is hard for them to maintain all of them, so we enhanced job management with the Volcano Job API to unify the scheduling of all kinds of training frameworks. Today we are doing a lot of work to support large language model training and inference, so I will share some new features in this area.

The first feature is JobFlow. As we all know, workflows are a very common requirement for traditional batch workloads and AI pipelines, and there are already many projects like Argo, Airflow, and Kubeflow. Still, some users come to the community and say they need a lightweight workflow scheduling engine, and they also need richer workflow semantics, just like this: in the picture you can see that in their use case they need the workflow to support if-else or switch-case semantics. So we designed this new project as lightweight workflow management. Here is the example; there are two CRDs for this feature, and a sketch follows below. The first one is the JobTemplate: users define their jobs using JobTemplates. The second one is the JobFlow: users define the dependencies between the jobs in the JobFlow.
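A minimal sketch of the two CRDs, assuming the flow.volcano.sh/v1alpha1 API group and the field names from the JobFlow design; check your Volcano release for the exact schema. The JobTemplate body mirrors a Volcano Job spec, and a second JobTemplate named step-b is assumed to exist:

apiVersion: flow.volcano.sh/v1alpha1
kind: JobTemplate
metadata:
  name: step-a
spec:
  # same schema as a Volcano Job spec
  minAvailable: 1
  schedulerName: volcano
  tasks:
    - name: main
      replicas: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              image: busybox
              command: ["sh", "-c", "echo step-a && sleep 10"]
---
apiVersion: flow.volcano.sh/v1alpha1
kind: JobFlow
metadata:
  name: demo-flow
spec:
  jobRetainPolicy: delete     # or retain, to keep finished jobs
  flows:
    - name: step-a
    - name: step-b            # step-b starts only after step-a completes
      dependsOn:
        targets: ["step-a"]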
The next new feature is load-aware scheduling and rescheduling. We all know that in a cluster the utilization of each node can be very different, while we expect the usage on every node to be balanced. The basic reason is that the scheduler currently schedules against the requested and allocatable resources of each node rather than the real usage, and pods keep being deleted and created over time, so it is hard to keep resource usage balanced across nodes. With this feature, the Volcano scheduler communicates with a monitoring system like Prometheus or Elasticsearch, so it is aware of the real usage of each node and schedules according to that usage.

The next one is about the capacity scheduler. We have supported proportion scheduling for several years. Proportion scheduling is very flexible: it shares resources among different queues based on their weights, the scheduler can reclaim resources between queues, and when users scale more nodes into the cluster the scheduler keeps the ratio all the time. That is very helpful and flexible. But we also have use cases where users want to configure and share resources by explicit amounts rather than weights, so we are supporting the capacity scheduling policy in Volcano this year. As you can see, the policy lets users configure the maximum (capability) resources of a queue, the deserved resources of a queue, and the guaranteed (reserved) resources of a queue; a sketch of such a queue configuration follows after the next use case. These three kinds of definitions support fine-grained sharing and reclaiming in the scheduler.

Here is a use case from LinkedIn. They have super large GPU clusters and use Volcano queues to share resources. In their production environment they have multiple organizations in one cluster, and each organization has different kinds of GPU resources such as V100, A100, and T4. How to use the scheduling and resource-sharing mechanisms to share these resources and achieve good utilization is a great challenge. They tried several approaches. At the beginning, in solution one, they used resource quotas for isolation, but resources could not be reused efficiently. Then they used the proportion policy, with a fixed weight for each queue mapped to each organization. Resource utilization went up, but one problem remained: the V100 and A100 resources could not be fully shared among the different organizations, because a single queue weight cannot express sharing per GPU type. So we supported configuring the shared resources for each GPU dimension, like the red part in the slide. After enabling this feature, the utilization of their production environment went up and they achieved a good result.
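A minimal sketch of a queue using the capacity policy, assuming the scheduling.volcano.sh/v1beta1 Queue API with capability, deserved, and guarantee fields; the amounts and the nvidia.com/gpu dimension are illustrative:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: org-a
spec:
  reclaimable: true
  capability:               # hard upper limit for this queue
    cpu: "64"
    memory: 256Gi
    nvidia.com/gpu: "8"
  deserved:                 # what the queue deserves; idle deserved resources can be lent out and reclaimed back
    cpu: "32"
    memory: 128Gi
    nvidia.com/gpu: "4"
  guarantee:                # always reserved for this queue
    resource:
      cpu: "8"
      memory: 32Gi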
For the other part, this year and late last year we made a lot of other improvements. The first one is that we enhanced the scheduler to support microservices, so users can now migrate from the default scheduler to Volcano in their cluster very smoothly. We also enhanced some other features, such as making the scheduler work with the cluster autoscaler more efficiently, and we added some small features to prevent jobs from being preempted under certain conditions.

The next one is about colocation. More and more users deploy different kinds of workloads into one cluster, such as co-locating Spark applications with long-running services, or co-locating training workloads with inference workloads, to achieve better resource utilization. So we are developing this feature in Volcano. It supports QoS management and resource overcommitment to make full use of the resources while preventing interference between different containers. This feature will hopefully be released in Q2.

Next, there is a big feature about GPU sharing. My partner will introduce this feature, show how to use it, and share some use cases.

Thanks, William, and thanks to the Volcano community for inviting me here. I am a contributor to this vGPU feature and I'm glad to give you a deep dive into it. The background of this feature is the growing requirement for computing power. As you can see in these figures, the demand for computing power is growing at an accelerating rate; especially with the emergence of large language models, shown by the red line, it can grow by as much as 375x per year. In the meantime, GPU manufacturers have been releasing GPUs with more computing power more and more rapidly to match this trend, and of course at a higher price. Another challenge is that device utilization in Volcano and Kubernetes clusters is quite low, and large amounts of computing resources are wasted. The two figures on the left show the core and memory utilization of a single A100 GPU in a Kubernetes cluster. As you can see, the core utilization is near zero most of the time, and the memory utilization is quite low as well. The major reason behind this is that GPU resources can only be used in an exclusive manner.

Our solution to these challenges is a device sharing mechanism called vGPU. Unlike other device sharing mechanisms, Volcano vGPU provides device memory isolation among containers; in other words, we can enforce an upper limit on the device memory each container can use. This is done by a component called HAMi-core, which comes from another open source GPU sharing project. As you can see in this figure, at the scheduling level the Volcano scheduler is responsible for tracking the usage of devices and assigning tasks to an appropriate node. Then the device plugin mounts HAMi-core and the proper devices into the container, and HAMi-core is responsible for the in-container resource control. And how do we achieve that, I mean, how do we limit the resources used inside containers? We achieve that with a component that hijacks the API calls between the CUDA runtime and the CUDA driver: it records every memory allocation and returns an OOM error if the allocation exceeds the limit. As you can see in this figure, if we allocate 3G of device memory to this container, it can only see 3G when queried with nvidia-smi, just as the figure shows. Besides memory isolation, it has the following other features as well, like core utilization limitation; it also guarantees fault isolation and is transparent to GPU tasks.

Volcano vGPU is quite easy to use. Simply specify the number of GPUs you wish to mount into the container using the volcano.sh/vgpu-number resource, and specify the device memory available for each mounted GPU using volcano.sh/vgpu-memory. In this example we create a container with two GPUs, each with an upper limit of 10G of device memory; a sketch is shown below. The result is shown in this figure.
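A minimal pod sketch for the example above; the resource names volcano.sh/vgpu-number and volcano.sh/vgpu-memory are the ones mentioned in the talk, and the assumption here is that vgpu-memory is expressed in MiB per GPU:

apiVersion: v1
kind: Pod
metadata:
  name: vgpu-demo
spec:
  schedulerName: volcano
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["sleep", "infinity"]
      resources:
        limits:
          volcano.sh/vgpu-number: 2       # mount two shared GPUs
          volcano.sh/vgpu-memory: 10240   # per-GPU device memory cap (assumed MiB, i.e. 10G each)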
Volcano vGPU can improve GPU utilization greatly in many scenarios, and A/B testing is one of them. Consider the following scenario: this is a typical A/B testing serving platform. The whole system consists of a base production model and several experimental variant models. Most of the input is processed by the production model and only a small part flows into the variant models. Without virtualization technology, each variant model requires an exclusive GPU, which is a serious waste of computing power because the traffic flowing into the variant models is quite low. With the help of Volcano vGPU, however, the two variant models can share a single GPU, leaving one GPU completely idle for other tasks to use. There are many concerns regarding the performance of Volcano vGPU, and we can safely assure you that the overhead introduced by Volcano vGPU is below 1%. On the other hand, the throughput can be increased by 10 to 90% depending on the kind of task; it performs very well in inference scenarios. Volcano vGPU is officially supported in Volcano 1.8. However, there is still much work to be done. We plan to add a monitoring system in May 2024, and we plan to support multiple scheduling policies this summer. We also have a plan for next year, which is to be compatible with the GPU Operator and with other heterogeneous devices like Intel GPUs and AMD GPUs. We look forward to you trying this feature and giving us feedback. Thank you.

It's a great feature in Volcano; you are welcome to visit Volcano and give it a try. The next part is about the community. Here is the overall picture: we have a lot of user adoption, especially for AI, big data, and batch computing workloads, and here is part of the adopter list. For code diversity in the last 12 months, you can see that we have good diversity in community development; more than 50% of the contributors are independent developers.

Here are the major features we are going to work on this year. In addition to the algorithms, in terms of AI engineering the biggest challenges for AI training and inference are scale, performance, and cost, and this year Volcano will focus on these aspects. The first one is topology-aware scheduling. As we all know, in large model training scenarios the network topology and the task topology are very important, because the bottleneck has shifted from computing to the network. So the scheduler will be aware of the task topology, like data parallelism, tensor parallelism, and pipeline parallelism, and it also needs to be aware of the network topology, such as RDMA, NVSwitch, NVLink, and so on. How the scheduler places a set of pods with high network requirements is very important for improving training performance. Also, for inference we are going to add more support for resource sharing: besides GPU sharing we will support Ascend NPU device sharing and more heterogeneous devices. The second one is AI training across multiple clusters. As we all know, models are becoming bigger and bigger, so training across multiple clusters is a great challenge and we are exploring this area. The third one is Ascend NPU support. The Ascend NPU is becoming more popular in some areas besides the GPU, so we will support NPU sharing, NPU topology-aware scheduling, and so on. The last one is colocation: more and more users are trying to deploy multiple workloads in one cluster, as we just said, so we will keep putting a lot of effort into this area. Finally, the community will provide broader support and optimization for other heterogeneous devices like AMD and Intel. There are still a lot of requirements from the community.
Feel free to get involved in the community. Here are some resources. Feel free to come to the community to raise issues, share your requirements, and make contributions together with us. We also have a Slack channel where we can communicate. So that's all from my side. Thank you. Any questions?

Okay, thank you for the talk. I was wondering if you could explain a little bit about the colocation. You were talking about how you can have a mix of training and inference, or a mix of Spark and, let's say, notebooks. How are you actually doing that colocation scheduling, and how is it different from your traditional batch or fair scheduling?

Pardon? Are you asking about the colocation?

Yeah, between long-running services and batch, ephemeral jobs. You're saying that you have a new strategy around colocation scheduling. Can you walk through what that is? If you go back to the slide, I think it's the one on the colocation strategy that you're introducing. Yeah, this one. I was wondering if you could walk through that yellow section and how that enables both long-running and ephemeral jobs to exist on the same nodes or in the same cluster.

So for the yellow part, the scheduler will be aware of the job priority, like the online service priority and the offline job priority, schedule them with different priorities, and preempt resources when resources are insufficient. The scheduler also needs to be aware of the oversubscribed resources reported by the Volcano agent. The scheduler schedules the oversubscribed resources to the offline workloads, because the overcommitted resources are not stable resources, so we allocate that kind of resource to jobs with low priority.

So do I as a user need to provide higher priority for my online jobs and lower priority for batch, and that's my contract with you?

Yes.

So all through just priorities, right? That's the contract that we have. Like we say priority class, and priority like zero.

Yeah, yeah, different priority classes. We will divide jobs into about five different kinds of priority. Users can define their jobs accordingly, for example latency-sensitive workloads with the highest priority, then normal batch, then free best-effort workloads; there are about five classes, as sketched below.

Okay. Okay. Thank you.
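A minimal sketch of the priority contract described in the answer above, using standard Kubernetes PriorityClass objects; the class names and values are hypothetical, and Volcano's colocation feature may define its own recommended classes:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: online-latency-sensitive     # hypothetical class for online services
value: 100000
description: Highest priority, for latency-sensitive online services.
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: offline-batch                # hypothetical class for preemptible batch jobs
value: 1000
description: Low priority, for offline batch jobs that may run on overcommitted resources.

Pods or Volcano job task templates would then reference a class through the standard spec.priorityClassName field, for example priorityClassName: offline-batch.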