Hello everyone, today I'm going to share our work on batch job colocation on Kubernetes. Our topic is Koordinator, a QoS-oriented scheduling system for workload colocation on Kubernetes. This project has already been open-sourced on GitHub. First, let me introduce ourselves. My name is Zuowei Zhang, and my colleague Li Tao and I both work at Alibaba Cloud. We are also maintainers of the open-source project called Koordinator, which is the project we are going to share today. I will present the first three parts of the slides, and my colleague will present the following parts.

Speaking of colocation, let's go to the background first. Over the past few years, new technologies such as AI, big data, and 5G have been developing really fast, which makes the digital economy more and more strategic. People spend more and more time online, both for living and for working. According to a recent report, colocation data center market revenue will triple by 2028 compared with last year. At the same time, energy consumption is becoming a more serious issue: some researchers estimate that in 2020 the data center industry consumed around 196 to 400 terawatt-hours. That is quite a lot. So improving resource efficiency is really important: it saves TCO, hardware costs, and maintenance, and it also reduces carbon emissions. But utilization in a lot of data centers is pretty low, which means resources are being wasted. So we need to improve the resource efficiency of data centers.

Applications in data centers can be split into two kinds: online services and batch jobs. Most online service workloads have strict SLAs on performance, for example the 99th-percentile response time, which means they require really high-quality resources. And if we take a detailed look at their resource usage, we can see that the average utilization is quite low and the usage fluctuates over time. This puts data center resource allocation in an awkward position: nodes carry high requests but a low usage rate. Batch jobs, on the other hand, are mostly computation-intensive and built for high throughput, and they don't need very high resource quality, because most of them are just big data or AI training computations. With the rapid development of these years, they require resources at large scale. And according to a report last year, 77% of respondents are embracing Kubernetes, and many want to improve resource utilization by moving big data jobs to Kubernetes.

So here comes a good idea: maybe we can put the online services and batch jobs on the same Kubernetes cluster, or even on the same node, to improve cluster utilization. But there are some challenges in Kubernetes. First, take a look at the resource model. The resource requirements of a pod are expressed as requests and limits. The request is the amount the container is guaranteed to get, and the limit ensures the resource usage never goes above that threshold. The relation between the request and limit values determines the QoS class of the pod: Guaranteed, Burstable, or BestEffort. So we might consider setting the online services as Guaranteed or Burstable so that they get high-quality resources, and setting the batch jobs as BestEffort or Burstable to improve cluster utilization, as in the sketch below.
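To make the native resource model concrete, here is a minimal sketch of how requests and limits map to the Kubernetes QoS classes; the pod name and image are just illustrative.

```yaml
# Minimal sketch: Kubernetes derives the QoS class from requests and limits.
apiVersion: v1
kind: Pod
metadata:
  name: online-service   # hypothetical online service pod
spec:
  containers:
  - name: app
    image: nginx:1.25    # illustrative image
    resources:
      requests:
        cpu: "2"         # amount the container is guaranteed to get
        memory: 4Gi
      limits:
        cpu: "4"         # hard cap; requests < limits => Burstable
        memory: 8Gi
# requests == limits for every container => Guaranteed
# no requests and no limits at all       => BestEffort
```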
However, if we really do this, we will find that no one is really happy with this combination. The online services keep getting interfered with because of resource contention, and the batch jobs keep getting bad performance because there is no reference for how much BestEffort capacity a node has. Besides, there is no fairness between the BestEffort pods. There is a classic phrase that sums up this problem: noisy neighbors.

So why do noisy neighbors happen? There are a lot of reasons. First, most resources on a node are shared by different containers, not only CPU and memory but also CPU cache, network, block I/O, and so on, while Kubernetes only provides isolation on CPU and memory for containers. Besides, the hardware topology is getting more and more complex: not only NUMA, but also CPU cache, memory bandwidth, and even the hyper-threading of CPUs. And colocation makes things even worse, because most batch jobs are really resource-hungry, and the simple resource restriction for BestEffort is really not enough.

So we need an end-to-end automatic solution. If the load of a node goes high, it should automatically limit the resource consumption of the BestEffort jobs on that node, and on the control plane it should schedule BestEffort pods away to avoid more interference happening on the node. And if the load of the node goes low, the restriction on BestEffort jobs should be lifted automatically to improve utilization, and the control plane should allow more pods to be scheduled onto the node, so that everyone is happy.

Next, let's take a look at the design and architecture of Koordinator. There are some key features in Koordinator. First, it designs the priority and QoS classes for batch colocation scenarios, and it provides an application profiling mechanism for resource overcommitment. On the node level, Koordinator provides fine-grained resource orchestration and isolation mechanisms to avoid the noisy-neighbor problem. It also provides flexible job scheduling mechanisms for diverse workloads, so that workloads can adopt it easily. The purpose of Koordinator is to achieve high resource utilization with QoS guarantees.

The architecture of Koordinator can be split into two parts. First, of course, is the control plane. Koordinator provides an independent component called koord-manager, with three modules inside. The first is the SLO controller, which is responsible for resource overcommitment and QoS strategy management. There is also a module called the recommender, which is used for resource profiling and auto-scaling. koord-manager also has a webhook for the ClusterColocationProfile, which is used for pod auto-injection according to user-defined rules; for example, we can set the batch jobs to a low priority to avoid interference on the node. I will show a small sketch of such a profile right after this overview. Besides, Koordinator also provides scheduling plugins inside koord-scheduler (which is based on kube-scheduler) and koord-descheduler, to improve the performance of batch jobs and also of online services.
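Here is that sketch: a minimal ClusterColocationProfile, following the example shape in the Koordinator docs; the opt-in label keys and values are illustrative.

```yaml
# Sketch of a ClusterColocationProfile: pods in matching namespaces are
# mutated into low-priority, best-effort colocation pods at admission time.
apiVersion: config.koordinator.sh/v1alpha1
kind: ClusterColocationProfile
metadata:
  name: colocation-profile-example
spec:
  namespaceSelector:
    matchLabels:
      koordinator.sh/enable-colocation: "true"   # opt-in namespaces
  selector:
    matchLabels:
      koordinator.sh/enable-colocation: "true"   # opt-in pods
  qosClass: BE                    # injected Koordinator QoS class
  priorityClassName: koord-batch  # injected priority class
  schedulerName: koord-scheduler  # hand matching pods to the Koordinator scheduler
```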
On the node level, there are two important components. The first is koord-runtime-proxy. It acts as a proxy between the kubelet and the container runtime: during a container's lifecycle, it receives the requests from the kubelet and calls external plugins to inject more fine-grained resource parameters, such as cgroup settings and other resource controls. And the koordlet is the most important component on the node level. It collects the metrics for resource profiling and interference detection, it tunes the resource parameters of containers for QoS enhancement, and it also provides some external plugins for koord-runtime-proxy in order to achieve the resource injection.

There are two key designs in Koordinator. The first is resource priority. Koordinator provides a resource overcommitment strategy, which means the requested-but-unused resources can be resold at a lower priority level, so that the node can provide several levels of resources: prod, mid, batch, and free, expressed as extended resources in the node status. The more aggressive the overcommitment strategy, the more resources can be overcommitted, but the worse the quality will be. We can look at the graph as an example. The red line is the usage of the prod level, the dark green part is the capacity of the batch resources, and the light green part is the capacity of the mid level. We can see that although the mid level has better stability, its capacity is quite low compared with the batch level, while the batch level has more resources but is not very stable. If a pod wants to specify its resource priority, it just needs to fill the Kubernetes priorityClassName field with koord-prod, koord-mid, koord-batch, or koord-free, meaning that it requests that priority of resource. We also provide a sub-priority on the pod for queueing and preemption during scheduling. And since the resources are overcommitted, pods at a lower level will be throttled or evicted first when resource contention happens on a node.

The other key design is QoS. It is an extended definition on the pod level, with the SYSTEM, latency-sensitive (LS), and best-effort (BE) classes, and latency-sensitive can be divided into more detailed QoS classes according to the performance requirements of applications. Different QoS classes mean different resource isolation parameters on the containers, and when resource contention happens on a node, pods with a higher QoS class will be satisfied first. Speaking of QoS and priority, they are two independent properties and can be mixed; of course, some combinations are restricted due to the actual scenarios. The most representative combinations are prod + LS and batch + BE, for online service applications and for batch jobs respectively. And since there are more and more types of applications, we also provide some enhancements, such as prod + LSE or LSR for the more sensitive service applications, and mid + BE for those that need more stable oversold resources, such as AI training jobs, because they need the resources to be more stable.

Okay, next please welcome my colleague Li Tao to introduce some technical details about Koordinator.

Hello everyone, I am Li Tao. Next I will show you how to use Koordinator's colocation capabilities, as well as some technical details inside Koordinator. Users can flexibly use Koordinator to support multiple workloads. Here we take Spark as an example to introduce how to run Spark on Kubernetes with Koordinator. First, let me briefly introduce how Spark runs on Kubernetes. Many users currently use the Spark on Kubernetes operator to manage Spark applications. The Spark operator defines a CRD named SparkApplication that describes the Spark application specification, and the operator manages the lifecycle of Spark applications on a Kubernetes cluster. The Spark operator also provides a command-line tool named sparkctl for working with the operator.
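As a quick illustration, a minimal SparkApplication for the operator might look like the following; the image, versions, and jar path follow the operator's well-known spark-pi example and are illustrative.

```yaml
# Sketch of a SparkApplication managed by the Spark on Kubernetes operator.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: gcr.io/spark-operator/spark:v3.1.1   # illustrative image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
  sparkVersion: "3.1.1"
  driver:                 # the operator creates one driver pod...
    cores: 1
    memory: 512m
    serviceAccount: spark
  executor:               # ...and the driver creates the executor pods
    cores: 1
    instances: 2
    memory: 512m
```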
Spark applications have two types of pods: driver pods and executor pods. A Spark application first creates the driver pod, and then the driver pod creates a batch of executor pods, which complete the tasks. We deploy the Spark applications and the latency-sensitive applications on the same set of nodes: the latency-sensitive applications use Kubernetes native resources, and the Spark applications use Koordinator batch resources. This can improve the resource utilization of the cluster and reduce the cost.

The YAML on the left describes a standard Spark driver pod. When the Spark driver pod is created, it is updated by the ClusterColocationProfile webhook of the koord-manager component: the Koordinator QoS class is added to mark the driver pod as a best-effort pod, and the resource types are modified from the Kubernetes native CPU and memory to the Koordinator batch-cpu and batch-memory. The pod is eventually scheduled by koord-scheduler. As shown on the far right, in the utilization charts before and after using Koordinator, utilization increases after Koordinator is used.

Okay, next I will go into some technical details inside Koordinator: how does Koordinator overcommit resources, and how does it ensure the runtime quality of the pods? When users use Kubernetes, it is difficult to accurately evaluate the resource requirements of applications, and they don't know how to best set the requests and limits of pods. Therefore, in order to ensure the stability of the applications, oversized resource specifications are often set. In actual production environments, the actual CPU utilization of most latency-sensitive applications is relatively low most of the time, maybe only around 10% or 20%, which leaves a lot of allocated but unused resources. Koordinator makes use of this allocated-but-unused part through the resource overcommitment mechanism. Koordinator evaluates, according to the metrics, how many of the pods' resources can be reclaimed; this reclaimed part is the resource that can be oversold. These reclaimed resources can be allocated to low-priority workloads such as batch jobs and tasks. In order for these low-priority workloads to easily use these resources, Koordinator updates them into the node status as extended resources. On the other hand, when a pod has a burst of requests to process and requires more resources, Koordinator helps the pod get these resources back through the QoS enhancement mechanism, to ensure the service quality.

At the beginning of the design of Koordinator, we considered it necessary to reduce the difficulty of adoption, so that everyone can gain benefits by using Koordinator simply and quickly. So Koordinator provides a CRD named ClusterColocationProfile and a corresponding webhook. The CRD describes which namespaces and which pods need to enable colocation, and the webhook automatically injects the Koordinator priority class, QoS class, batch resources, and other colocation protocols into newly created pods, according to the rules described in the CRD.
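Putting that together, here is roughly what the driver pod from the earlier sketch ends up looking like after the webhook mutates it; only the relevant fields are shown, and the exact values are illustrative.

```yaml
# Sketch: a Spark driver pod after mutation by the ClusterColocationProfile webhook.
apiVersion: v1
kind: Pod
metadata:
  name: spark-pi-driver
  labels:
    koordinator.sh/qosClass: BE            # injected: treat as best-effort
spec:
  schedulerName: koord-scheduler           # injected: scheduled by koord-scheduler
  priorityClassName: koord-batch           # injected: batch resource priority
  containers:
  - name: spark-kubernetes-driver
    resources:
      requests:
        kubernetes.io/batch-cpu: "1000"    # was cpu: 1 (batch-cpu is in milli-cores)
        kubernetes.io/batch-memory: 512Mi  # was memory: 512m
      limits:
        kubernetes.io/batch-cpu: "1000"
        kubernetes.io/batch-memory: 512Mi
```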
Overcommitted resources can effectively improve the utilization of the cluster, but they are also accompanied by uneven utilization among the nodes in the cluster, and load hotspots may appear. Load hotspots can degrade workload performance: the latency-sensitive applications and batch workloads conflict more on the nodes with high utilization, which affects the runtime quality.

To optimize this problem, koord-scheduler provides a load-aware scheduling plugin to balance utilization. The plugin mainly depends on the node metrics reported by the koordlet. During scheduling, nodes with a load higher than the threshold are filtered out. This prevents pods from being unable to obtain resources on such high-load nodes, and on the other hand it prevents nodes that already have a high load from deteriorating further. Nodes with lower utilization are preferred during the scoring phase. Based on a time window and an estimation mechanism, the plugin also avoids overheating cold nodes by scheduling too many pods onto them concurrently.

When utilization increases to a relatively high level, the resource conflict between latency-sensitive applications and best-effort workloads becomes more serious, and more refined resource management is required to better guarantee the runtime quality and further unlock the potential. So at the beginning of the design of Koordinator, we designed QoS classes for colocation scenarios based on the Kubernetes native QoS classes, restricting the way each QoS class uses the CPU. LSE and LSR represent latency-sensitive applications with high performance requirements, using the cpuset model: CPUs allocated to LSE will not be shared by workloads of any other class, while CPUs allocated to LSR can be shared by BE-level QoS pods but not by the other QoS classes. LS adopts the shared-pool model: a group of common latency-sensitive pods shares a common pool of CPUs, and similar to LSR, these CPUs can also be shared by BE-level pods. BE uses all CPUs except those allocated to LSE. koord-scheduler and the koordlet cooperate to support this model, providing core-binding strategies and exclusive strategies, and staying compatible with the kubelet CPU manager policies.

In order to ensure the runtime quality of various workloads in colocation scenarios, Koordinator provides a set of QoS enhancement capabilities on the node side. Because the time is limited, I will pick a few representative capabilities. First, let me introduce CPU Suppress. As mentioned before, applications often have allocated but unused resources. These resources can not only be used by newly created best-effort pods through the overcommitment mechanism; the idle CPU resources can also be served, as much as possible, to best-effort pods that are already running. As shown in the figure, when the koordlet finds, according to the metrics, that the resource utilization of the latency-sensitive applications is low, and the CPU used by the best-effort pods has not exceeded the safety threshold, the idle CPUs within the safety threshold can be served to the best-effort pods, so that the best-effort pods can execute faster. Later, when the load of the latency-sensitive applications increases, Koordinator returns the CPUs that were lent to the best-effort pods back to the latency-sensitive applications.
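As a rough sketch of how this behavior is configured, the node-side strategies are driven by a ConfigMap; the key names below follow the Koordinator docs, and the threshold value is illustrative.

```yaml
# Sketch: enabling colocation and tuning the CPU suppress threshold through
# the slo-controller-config ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  # turn on reclamation of allocated-but-unused resources as batch resources
  colocation-config: |
    {
      "enable": true
    }
  # suppress running BE pods when node CPU usage climbs past the threshold
  resource-threshold-config: |
    {
      "clusterStrategy": {
        "enable": true,
        "cpuSuppressThresholdPercent": 65
      }
    }
```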
Although CPU Suppress can help best-effort pods get more resources and execute faster, there are also some disadvantages. When the latency-sensitive applications are under high load, best-effort pods may fail to get resources. And if a suppressed best-effort pod holds special resources, such as kernel locks, the suppression may cause priority inversion and affect the performance of the latency-sensitive pods.

In order to solve these problems of CPU Suppress, Koordinator provides an eviction mechanism based on resource satisfaction, which gives the best-effort workloads the opportunity to be rescheduled and spread to nodes with low load. CPU satisfaction is the ratio of the total amount of CPU actually allocated to the total amount of CPU expected to be allocated. When the CPU satisfaction of the best-effort workloads is lower than the threshold and the CPU utilization of the best-effort workloads exceeds 90%, Koordinator evicts some low-priority best-effort pods and releases resources for use by the higher-priority best-effort pods. Through this mechanism, the resource satisfaction of best-effort workloads can be improved.

Koordinator also provides a variety of adaptive kernel-level QoS management mechanisms. For example, CPU Burst allows CPU-throttled containers to burst their CPU utilization for higher performance and lower latency. This feature can ensure the quality of containers without reducing the deployment density of containers. Koordinator adapts both to Anolis OS and to the community version of the Linux kernel. Because of time constraints, I will not give a detailed explanation; if you are interested, you can go to our official website for more information.

Also, koord-scheduler provides a variety of scheduling features, such as enhanced gang scheduling: it is compatible with the coscheduling plugin and the PodGroup CRD, and it supports multiple pod groups completing gang scheduling together as one group. Elastic quota scheduling is compatible with capacity scheduling and the ElasticQuota CRD; it ensures fairness according to the shared weight and the multi-level quota structure, and it allows users to set whether a quota can lend its unused part to other quota consumers. koord-scheduler also implements GPU share scheduling; because of time constraints, I will not give more details.

Koordinator has released seven versions, implementing most of the capabilities required for colocation scenarios, including resource overcommitment, CPU Suppress, fine-grained CPU management, descheduling, and elastic quota scheduling. We have provided a lot of documents on the official website and GitHub, including Helm charts and best practices. If you are interested, you can try Koordinator. You are warmly welcome to hack on Koordinator: welcome to report issues, improve documentation, fix bugs, and add features. We encourage all contributors to become members. We aim to grow an active, healthy community of contributors, reviewers, and code owners. Thanks for watching our talk on Koordinator. Thanks, everyone. Bye.