OK. Good afternoon, everyone. My name is Dalian, and this is my partner, Ho Jia. We come from ByteDance. Today we are delighted to have this opportunity to share our batch processing with you. Since batch processing is a huge topic, we will focus on scheduling, especially on performance and capabilities.

Let me show you the agenda. First, I will give a brief introduction and overview of our batch processing. Then we will look at the problems with the native Kubernetes scheduler, and why we developed the ByteDance unified scheduler, which we call the Gödel scheduler, to support all of our workloads. We will go into depth on its performance and capabilities, including an evaluation. Finally, we will give a brief introduction to our future work and open source plans.

First of all, I'd like to show you the ecosystem. This is our batch scheduling ecosystem, which I would like to divide into four layers. The first is the hardware layer. We use Kubernetes to manage many different types of resources, including many types of GPU and a large number of CPUs, and we also use ARM machines to reduce our costs. We focus on network bandwidth scheduling as well, since our training jobs consume a lot of network bandwidth. Currently, Kubernetes manages all of these resources and provides different abstractions to the workloads. We replaced the kube-scheduler with our unified scheduler, the Gödel scheduler. Gödel is famous in mathematics for proposing the incompleteness theorems, a reminder that building a truly good scheduler involves a lot of challenges.

At the framework layer, we have many different frameworks running on Kubernetes, supported by operators. We have Spark for big data processing, Flink for machine learning and big data stream processing, and we also launch a lot of TensorFlow and PyTorch jobs. We have also developed several Python machine learning frameworks of our own to achieve better accuracy. On top of this, we support many businesses, including recommendation, search, advertisement, natural language processing, and computer vision.

With this slide, I want to stress the scale. Today we have more than one million jobs running every day, and that is just batch jobs. Batch jobs, especially Spark SQL, create a lot of pods, so we have more than 130 million pods created, scheduled, and deleted every day, which is a big challenge for scheduling performance. Currently, our batch jobs consume up to 60 million CPU cores. Our biggest cluster manages more than 20,000 nodes, since we avoid running many small clusters, which would increase resource fragmentation; most of our clusters have more than 5,000 nodes.

This picture gives a brief evolution of our batch processing, which I'd like to present in three stages. The first stage began around 2016. In that stage, we built our online platform on Kubernetes and our offline platform on Hadoop. The two resource management systems shared no resources, and each had a dedicated control plane and dedicated clusters. This had some problems. The first was resource fragmentation.
The online services also requested more resources than their workloads actually needed, and a lot of resources were wasted during off-peak traffic hours, especially at night.

So we came to stage two. About three years later, we began to share nodes between Hadoop and Kubernetes. To resolve the conflicts between the two systems, we developed a coordinator. We deploy the agents of both Kubernetes and Hadoop on the same node, and the coordinator divides the node's resources between the Kubernetes agent and Hadoop. During the day, the online workloads consume a lot of CPU, so the coordinator allocates more CPU to Kubernetes. During the night, the coordinator reclaims most of that CPU and tells the Hadoop NodeManager that plenty of resources are available, so more batch workloads are placed there. At this stage we increased our CPU utilization a lot, but some problems remained. The first was high operating cost, since we still had to manage two scheduling and resource management systems. Resource elasticity was also still unsatisfactory, since there were still two separate control planes.

So, the year before last, we came to stage three. We now use Kubernetes to manage all of our resources, with no more Hadoop. Since the default scheduler did not meet our requirements, we developed the ByteDance unified scheduler. We also kept the coordinator as a colocation agent; I will come back to its open source version at the end. In this stage, we achieved high resource elasticity and utilization, and we also improved our operational efficiency.

In this section, I want to address some problems with the native scheduler. The first is scalability. We have more than 130 million pods created, scheduled, and deleted every day, so we care about scalability a lot, but the default scheduler processes pods serially. As benchmarks show, in most cases its throughput is no more than 200 pods per second.

The second problem is capabilities. Kubernetes was designed for microservices first, so it lacks some important capabilities for batch processing. The most important is high-performance gang scheduling. We also care about several special features, such as job-level affinity, which means placing all pods of the same training job under the same network switch to achieve better network performance, and micro-topology scheduling to improve scheduling accuracy. Kubernetes also lacks capabilities needed for colocated workloads, such as queue sorting based on dominant resource fairness and priority, and policies like fair share; a small sketch of the fairness idea follows below.

The last problem I want to address is preemption. Especially when all of our workloads are colocated together, preemption becomes very, very complicated.
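As a concrete example of the queue-sorting gap just mentioned (and of the policy Ho Jia will describe in the dispatcher later), here is a minimal sketch of dominant resource fairness (DRF) ordering. The types and the resource-vector representation are illustrative assumptions, not Gödel's actual API.

```go
package queue

// Resources is a resource vector, e.g. {"cpu": 64000, "memory": 256 << 30}.
type Resources map[string]int64

// dominantShare is the maximum, over all resource types, of the fraction
// of cluster capacity a tenant has allocated. DRF schedules next from the
// tenant (or queue) with the smallest dominant share.
func dominantShare(allocated, capacity Resources) float64 {
	share := 0.0
	for r, total := range capacity {
		if total == 0 {
			continue
		}
		if s := float64(allocated[r]) / float64(total); s > share {
			share = s
		}
	}
	return share
}

// Less orders two queues for the dispatcher: the one with the lower
// dominant share goes first, so no single tenant can starve the others.
func Less(allocA, allocB, capacity Resources) bool {
	return dominantShare(allocA, capacity) < dominantShare(allocB, capacity)
}
```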
For the next section, let me invite Ho Jia to share. Thank you.

OK, thanks, Dalian. Now I'll talk a bit more about our scheduler. We are set to open source this scheduler soon, so this is something of a preview of its features and capabilities, and hopefully sharing some of the considerations we had while building a scheduler for this scale will provide useful insights for everyone.

Before I dive into the details, let me talk about some of the goals we had. We wanted to create a high-performance scheduler that supports both online and offline scheduling capabilities; this addresses the scale that Dalian shared earlier. We also hoped to use the scheduler to increase resource utilization across our clusters, and to unify our online and offline resource pools to achieve complete resource elasticity. We believe the last two goals are the key to reducing costs within our company and achieving higher efficiency.

OK, now I'll share some interesting highlights of our architecture. The Gödel scheduler is designed as a distributed scheduling system. The kube-scheduler is non-distributed and runs as a single component, but Gödel consists of three separate components: the dispatcher, the scheduler instances, and the binder. This design allows us to scale the number of scheduler instances horizontally to achieve high throughput. But naturally, having multiple scheduler instances running at the same time can lead to conflicts. For example, without any shared state, two scheduler instances might choose to assign the last CPU on a node to two different pods. This is where the binder comes in: it is in charge of conflict resolution before the results are actually written back to the API server.

Now let me talk about our scheduling framework. We tried to follow Kubernetes' scheduling framework as closely as possible. The difference is that we added a few stages, and different stages are carried out by different components. In Gödel, scheduling starts at the dispatcher, which watches pods on the API server and groups them into units we call scheduling units. The dispatcher then sorts these units by priority and assigns them to dedicated scheduler instances. The scheduler is in charge of assigning the best node to each pod within the scheduling unit, and it does this by executing filter and score stages, similar to what happens in the native kube-scheduler. Additionally, it has a dedicated preemption stage where it suggests candidates for preemption. Last but not least, we have the binder. As mentioned, it is the key to conflict resolution, giving us a form of optimistic concurrency control. For scheduling decisions that pass conflict resolution, it executes the reserve and bind stages and finally writes the results back to the API server. It is also in charge of actually executing any preemptions, if needed.

Now I'll highlight some of the performance features we built to increase scheduling throughput. As mentioned, one of our goals for Gödel was higher throughput to meet our requirements. Gödel supports concurrent scheduling: multiple scheduler instances run at the same time, and the binder resolves the conflicts between their decisions, as the sketch below illustrates. However, more scheduler instances also means a higher chance of conflicts, so a balance is needed when scaling the number of instances. In addition, we support node partitioning to reduce the chance of conflicts.
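Before looking at node partitioning, here is a minimal sketch of the binder's optimistic concurrency control. The types and the CPU-only resource model are assumptions for illustration; Gödel's real binder tracks full resource vectors and then runs the reserve and bind stages.

```go
package binder

// Proposal is one scheduler instance's suggested placement for a pod.
type Proposal struct {
	Pod      string
	Node     string
	MilliCPU int64
}

// Binder serializes proposals from all scheduler instances against a
// single authoritative view of free capacity, so races between instances
// are caught here rather than on the nodes.
type Binder struct {
	freeMilliCPU map[string]int64 // node name -> CPU still free in the binder's view
}

func NewBinder(freeByNode map[string]int64) *Binder {
	return &Binder{freeMilliCPU: freeByNode}
}

// Admit commits a proposal only if the node still has room. On a conflict
// (another instance already claimed the capacity) the pod is rejected and
// goes back through scheduling; on success, the real system would then run
// the reserve and bind stages and write the result to the API server.
func (b *Binder) Admit(p Proposal) bool {
	if b.freeMilliCPU[p.Node] < p.MilliCPU {
		return false // lost the race: the optimistic assumption failed
	}
	b.freeMilliCPU[p.Node] -= p.MilliCPU
	return true
}
```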
Node partitioning is where each scheduler instance is given a dedicated set of nodes to schedule onto. While this reduces conflicts, there is a trade-off, because the best node might not be found within a single node partition.

We have a few other optimizations. One of them is result caching. This is based on the fact that most pods within a workload share the same template and the same requirements, so a feasible node that fits one pod of a workload is likely to be suitable for the others in the same workload. During the filter stage, we cache the feasible nodes, and this speeds up the filter stage as a whole by avoiding repeated computation.

Now I'll move on to some of the scheduling capabilities. We will not go through all of them, in the interest of time. These capabilities are designed to help us run our offline workloads on Kubernetes.

The first feature, and a very common requirement for batch processing schedulers, is gang scheduling. In the context of Gödel, gang scheduling means that all pods within a single gang are scheduled together, with all-or-nothing semantics. This is important for scenarios like distributed training, where the absence of the pod running the parameter server would block the progress of the whole job. Gödel supports this by scheduling pods not individually but in the scheduling units I mentioned previously. This fills the semantic gap needed to support gang scheduling in Kubernetes, and the binder takes the necessary actions to ensure that all the pods get bound together.

Next, we also support more complex queueing semantics. When you have a lot of offline jobs waiting in a queue to be run, it is important to ensure fairness and the absence of starvation. While the native kube-scheduler supports only basic priority-based sorting, we implemented more sophisticated policies in our dispatcher, such as dominant resource fairness and fair share. Again, this helps us run our offline workloads on Kubernetes.

Next, we also have micro-topology scheduling. Batch and ML jobs are characterized by high I/O and frequent memory access. Micro-topology scheduling means allocating CPU and memory from the same NUMA node to a single pod, which can improve the performance of our offline jobs. This requires cooperation between the scheduler and the kubelet, or a node agent. While the kubelet now supports NUMA binding through the topology manager, the native kube-scheduler still lacks support for micro-topology scheduling. So we use a custom node agent that writes topology information into node annotations, and the Gödel scheduler processes this information when making its scheduling decisions; a small sketch follows below.

We implemented a few other capabilities which I will not go through, in the interest of time; I think these were also mentioned by Dalian earlier.
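Here is a small sketch of that micro-topology flow: the node agent publishes per-NUMA-node free resources in a node annotation, and the scheduler filters on whether a single NUMA node can fit the pod. The annotation key and JSON schema are assumptions for illustration, not Gödel's actual format.

```go
package numa

import "encoding/json"

// TopologyAnnotation is an assumed annotation key, used here only for
// illustration of the agent/scheduler contract.
const TopologyAnnotation = "example.bytedance.com/numa-topology"

// NUMANode describes one NUMA node's free resources as reported by the
// node agent in the annotation value (illustrative schema).
type NUMANode struct {
	ID              int   `json:"id"`
	MilliCPUFree    int64 `json:"milliCPUFree"`
	MemoryFreeBytes int64 `json:"memoryFreeBytes"`
}

// parseTopology decodes the agent-written annotation value.
func parseTopology(annotationValue string) ([]NUMANode, error) {
	var nodes []NUMANode
	err := json.Unmarshal([]byte(annotationValue), &nodes)
	return nodes, err
}

// fitsSingleNUMA reports whether any single NUMA node can satisfy the
// pod's request. This is the extra filter that micro-topology scheduling
// adds, ensuring CPU and memory come from the same NUMA node.
func fitsSingleNUMA(nodes []NUMANode, milliCPU, memBytes int64) bool {
	for _, n := range nodes {
		if n.MilliCPUFree >= milliCPU && n.MemoryFreeBytes >= memBytes {
			return true
		}
	}
	return false
}
```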
Now I'll share a bit about the results we achieved with the Gödel scheduler. We ran benchmarks comparing the native Kubernetes scheduler with Gödel running a single scheduler instance, and with the optimizations I described, we achieved a much higher scheduling throughput. Taking it a step further and scaling the number of scheduler instances allows us to hit almost 5,000 pods per second, as you can see here, in benchmarks we ran on a 10,000-node cluster.

We have been running Gödel in production for some time now. We have managed to increase the resource utilization in our clusters to up to 60%, to sustain a peak scheduling throughput of 5,000 pods per second, and to unify resource pools consisting of millions of cores.

Now we're coming to the end of the presentation, and I'll share a bit about the future work we have in store for the Gödel scheduler. Currently, the transient state used by the different components is persisted in the API server, in the form of pod annotations and so on. We are planning to use an in-memory cache or in-memory database to store this state, and we expect this to further increase Gödel's performance. We are also working on a Gödel rescheduler, whose job is to take proactive measures: watching running pods and carrying out actions to reduce resource fragmentation and contention, achieving higher quality of service for critical workloads. We are also developing an explainer and a simulator to improve the explainability and visibility of our scheduling decisions.

We are planning to open source this soon. At ByteDance, we have been working with Kubernetes and scaling it for quite some time, and we have begun to open source many of our projects and solutions under an organization called KubeWharf. We are planning to open source the Gödel scheduler under this organization in the fourth quarter of this year, so do stay tuned for updates.

So we've come to the end of the sharing. To summarize, we talked about batch processing at ByteDance, why we decided to move it to Kubernetes, some of the problems we faced, and why we wrote our own scheduler. Thank you so much.

Audience: What were you using to gauge the performance between the two?

OK, yep. We ran the benchmarks in test clusters, where we tried to saturate the schedulers by continuously creating pods in the cluster. Once we reached the saturation point, we measured the scheduling throughput and compared the two. Hope that answers your question.

Audience: I have a question about the architecture. I saw that the schedulers run in parallel and share a binder. Are scheduler 0 through scheduler N running as different goroutines, or are they separate components that communicate with each other over RPC or something similar?

Oh no, actually they write their transient state to the API server, in the form of annotations on pods and so on. There is no direct communication between the components.

Audience: How does your scheduler compare with the Volcano scheduler? It seems like they achieve similar goals.

Honestly speaking, before we developed the Gödel scheduler, about two years ago, we tested the kube-scheduler, including the kube-scheduler framework, and we also tested Volcano. The most important factor for us was performance. If you scale your clusters from 5,000 to 20,000 nodes, you can run your own benchmark of Volcano.

Audience: Sorry, I could not hear. Is your scheduler a Kubernetes scheduler plugin, or a separate scheduler running on Kubernetes?

OK, good question.
It's a separate scheduler, which replaces the kube-scheduler. We use it to schedule all of our workloads.

Audience: Will you release the code in a Git repo that we can check?

OK. This is our open source organization, and you can find several interesting open source projects there. Please allow me to give a brief introduction. To extend Kubernetes from 5,000 to 20,000 nodes, the first thing we did was replace etcd with KubeBrain, the first KubeWharf project: a consistent key-value store that removes the bottleneck of the storage system. You can also see Katalyst. This is a very important node agent and our most important colocation open source project; it resolves the conflicts between online and offline workloads. It also reports many different resources and quality-of-service levels to Kubernetes, and the scheduler cooperates with it to achieve better scheduling quality.

We'll close here for the next session. Thank you.

Thank you for the presentation.