Hello everyone. I'm QingChan Wang from Alibaba Cloud; you can also call me Alex Wang. I'm interested in large-scale cluster management and scheduling, and I like to participate in the open source community. Thanks for your attention.

Hello everyone. My name is Kenyubo, also from Alibaba Cloud. Today, my colleague QingChan and I will talk about our practice of cgroup resource scheduling in the Kubernetes environment. In our business scenarios, such as database services, many online instances need to dynamically increase their resource limits during operation and cannot accept the impact of a pod restart, so we need a service to configure the cgroups for the workload. Also, more and more applications run on bare metal instances in Alibaba Cloud, and they hit multi-NUMA performance problems. These applications may require NUMA awareness and CPU core binding, to reduce data copies between CPU caches and to split up data processing tasks. We have developed a combined scheduling system based on the Kubernetes scheduler framework and a cgroup controller. The scheduler perceives cgroup-level resources, such as NUMA topology, CPU cores, and memory limits, and schedules pods to specific nodes while allowing certain pods to be bound to specific CPU cores. The cgroup controller can also dynamically adjust the pod resource limits without causing the pod to restart.

Okay, this is the agenda of today's topic. In the first part, I will introduce our practice of cgroup resource management. Then my colleague QingChan will talk about CPU scheduling based on the scheduling framework, and also about some other related work.

As we know, cgroups are the mechanism that limits resources for processes, and they are the basic capability underneath Docker containers. With cgroups, we can set limits on a number of resources that a process group can use, we can set priority control for the process group, and we can record the amount of resources used by the process group. In our production Kubernetes environment, database instances are deployed in pods, so we use the container-level resource limit configuration, although it is not enough. First, we can only configure CPU or memory in the container spec; limits for storage I/O are not supported. Secondly, the limits are set when the container starts and cannot be changed while the container is running. Also, the limits are only configured for containers, not for pods. So we need to develop a service that can dynamically update cgroup limits both for the container and for the pod without a workload restart, and, with a smart controller, the resource limits can be configured automatically by some cgroup policy. The cgroup controller is developed just for this purpose.

From the perspective of function, with the cgroup controller we can dynamically configure cgroup limits by creating a Kubernetes resource, and we do not need to care about the pod's location or its running environment. From the perspective of implementation, the cgroup controller is an implementation of two CRDs: Cgroups and CgroupPolicy. With the Cgroups CRD, the controller can dynamically configure cgroup limits for the target resource and workload, and the CgroupPolicy defines the policy to automatically adjust cgroup resource limits. With the smart controller, Kubernetes can increase resource usage through smarter resource allocation. The cgroup controller also supports configuring different kinds of workloads, including Deployments.
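To make the cgroup mechanism concrete, here is a minimal Go sketch (my own illustration, not code from the talk) that updates a memory limit by writing the cgroup v1 file directly; the cgroup path is a hypothetical example, and the point is simply that these limits are plain files that can be rewritten while the process keeps running.

```go
// Minimal sketch: a cgroup v1 limit is just a file under /sys/fs/cgroup,
// so it can be rewritten at runtime without restarting the process.
// The concrete cgroup path below is a made-up example.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// setMemoryLimit rewrites memory.limit_in_bytes for the given cgroup.
func setMemoryLimit(cgroupPath string, limitBytes int64) error {
	file := filepath.Join(cgroupPath, "memory.limit_in_bytes")
	// The file already exists; we only overwrite its value.
	return os.WriteFile(file, []byte(fmt.Sprintf("%d", limitBytes)), 0644)
}

func main() {
	// Hypothetical pod cgroup created by the kubelet (cgroupfs driver).
	cg := "/sys/fs/cgroup/memory/kubepods/pod1234"
	if err := setMemoryLimit(cg, 4<<30); err != nil { // raise the limit to 4 GiB
		fmt.Fprintln(os.Stderr, "update failed:", err)
	}
}
```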
As I said, if the limit is set on a Deployment, all of its pods will be configured. With the cgroup controller, we can configure limits for different types of resources, including CPU, cpuset, memory, and blkio. Kubernetes can only set limits at the container level; the cgroup controller supports setting limits both for pods and for containers. With a pod-level cgroup limit, all containers in the pod share the resource and share the limit. For some sharing-oriented business, a shared limit at the pod level can improve resource usage. blkio can also be supported by the cgroup controller, but block devices are more complex because there can be more than one device on a node, so how to find the block device is important. The cgroup controller supports identifying the device in multiple modes, such as the rootfs, a volume, or a device ID such as vdx under the /dev path.

Now let's talk about the implementation of the cgroup controller. First, the definition of the CRD: it defines the resource details, that is, which node the resource belongs to, the type of the resource, and the resource ID. For example, we can define one resource as the blkio limit for the rootfs of node 1. The CRD also defines the target workload, including Deployment, StatefulSet, DaemonSet, and Pod, and it also needs to point out the target pod and target container name. The CRD also records the resource limit status, so the controller can check whether the limit setting has completed or not.

Next, let's walk through the main process. The cgroup controller watches the Cgroups CRs from the API server. When a new Cgroups object is created, the controller gets it and parses the parameters. The controller then checks the workload, for example a Deployment or a Pod, and gets the target node ID and container information. In the next step, the controller creates a job through the executor, which will run on the target node. The created job is configured with the target workload and the limit configuration. The job pod finds the cgroup path of the target container or pod, and then it writes the limit value into the file at the target path. The job pod also records the limit-setting status back into the CR, which indicates whether the cgroup limit was set successfully for the target pod.

Next, let's look at some examples. The first one sets CPU and memory for one pod. We have to configure the name of the container belonging to the pod, and we can configure the CPU and memory limits at the same time. The second and third ones set the cpuset for a Deployment or for one specific pod; we have to configure the target container name for that. The fourth template sets blkio for a block device. We can set the device read or write BPS, or the device read or write IOPS, for one device. As mentioned before, the target device can be identified as the rootfs, a volume, or a device ID on the target node, and in this example we set it to the rootfs.

So far we have covered the cgroup controller design and implementation. We can see that the Cgroups CRD only configures cgroup limits for different workloads; it is automatic, but not smart. So we add a new CRD named CgroupPolicy to define the cgroup allocation policy for a specific resource. The CgroupPolicy supports a per-resource allocation policy that rebalances the resource limits among the workloads that run on that resource.
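As a rough illustration of the CRD fields just described, here is a hypothetical Go type sketch of what a Cgroups spec could look like; the field and type names are invented for illustration and are not the actual Alibaba Cloud API.

```go
// Hypothetical Go types sketching the Cgroups CRD spec described above.
// Field and type names are illustrative only, not the real API.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// BlkIOLimit identifies a block device (rootfs, volume, or device ID)
// and its throughput limits.
type BlkIOLimit struct {
	Device    string `json:"device"` // e.g. "rootfs" or "/dev/vdb"
	ReadIOPS  int64  `json:"readIOPS,omitempty"`
	WriteIOPS int64  `json:"writeIOPS,omitempty"`
	ReadBPS   int64  `json:"readBPS,omitempty"`
	WriteBPS  int64  `json:"writeBPS,omitempty"`
}

// CgroupsSpec points at a target workload and the limits to apply.
type CgroupsSpec struct {
	TargetKind      string `json:"targetKind"` // Deployment, StatefulSet, DaemonSet or Pod
	TargetName      string `json:"targetName"`
	TargetNamespace string `json:"targetNamespace"`
	// An empty ContainerName means the limit applies at pod level,
	// so all containers in the pod share it.
	ContainerName string `json:"containerName,omitempty"`

	CPULimit    string      `json:"cpuLimit,omitempty"`    // e.g. "2000m"
	MemoryLimit string      `json:"memoryLimit,omitempty"` // e.g. "4Gi"
	CPUSet      string      `json:"cpuSet,omitempty"`      // e.g. "0-3"
	BlkIO       *BlkIOLimit `json:"blkio,omitempty"`
}

// Cgroups is the CR object; Status records whether the limit was applied,
// which is what the job pod writes back after it updates the cgroup files.
type Cgroups struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec   CgroupsSpec `json:"spec"`
	Status string      `json:"status,omitempty"` // e.g. "Succeeded" / "Failed"
}
```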
The CRD can define the target resource, including the target node, the target resource type, and the target resource ID, which uniquely identifies the resource in the cluster. The CRD can also define the total resource limit threshold, for example the total IOPS for a block device resource, or the percentage of CPU or memory for the specified node. Also, just like the Cgroups CRD, the CgroupPolicy records the workload limits on the target resource in the status field, which shows you the limit details on one resource.

Next, let's talk about the main process. The CgroupPolicy controller implements the CRD logic for the cgroup policy. The controller watches the pods, the nodes, and the CgroupPolicy objects from the API server. When a pod covered by a CgroupPolicy is added or removed, the controller creates a Cgroups CR to adjust the resource limits for the different workloads. For example, if the set of pods related to one CgroupPolicy grows from two to three, the resource limits are rebalanced across all three pods, and a Cgroups object is created to configure the new limits for the three pods. In step three, the controller watches the status of the created Cgroups object and updates the status and the list of deleted pods in the CgroupPolicy object.

Let's look at an example of the CgroupPolicy. In the CgroupPolicy template below, we define the target node information with a node selector, which can select one node or more nodes in the cluster. We also define the resource type and resource ID, so we can identify the unique resource in the cluster. When the resource type is blkio, the resource IOPS limit can be set to the total IOPS limit of the target device. If the resource type is CPU or memory, the CPU percentage or memory percentage can be configured. The policy field defines the resource limit policy; for example, an even-spread policy means the same limit is configured for all the different workloads. In the status field, the status shows the state of the limit-balancing actions on that resource: the Processing value means one balance job is in progress and a new balance job cannot be started immediately. The payloads field shows the resource limit settings, where you can find the pods that run on the resource and the details of their limits.

If you want to enable the resource limit policy for a specific workload, the cgroup policy reference should be added to the workload's annotations, just like in the pod example: an annotation under the alibabacloud.com domain points to one policy object, and then the cgroup controller picks up the configuration and does the resource limit balancing. The graph below is an example of the pods in a workload configured with a blkio IOPS type cgroup policy; you can see that the pods' IOPS change over time.

Okay, that is the cgroup resource management topic, and next my colleague QingChan will present CPU scheduling based on the scheduling framework. Let's welcome.

Next, I will introduce CPU scheduling based on the scheduling framework. First of all, this figure shows the CPU topology of the multi-core machines that we use most frequently at present. The topology is divided into three layers: socket, NUMA node, and core. Each core has a dedicated level 1 and level 2 cache; this is referred to as private cache, and no other core can use it. The level 3 cache is shared between the cores.
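To illustrate the rebalancing step, here is a small Go sketch (my own illustration, not code from the talk) of an even-spread policy: a fixed total IOPS budget for one device is divided across whatever pods currently run on it, and a new per-pod limit is produced whenever the pod set changes.

```go
// Sketch of an even-spread rebalance: divide one device's total IOPS
// budget across the pods currently using it. Names are illustrative.
package main

import "fmt"

// PodLimit is the per-pod limit the policy controller would write
// into a new Cgroups CR.
type PodLimit struct {
	Pod  string
	IOPS int64
}

// rebalance splits totalIOPS evenly across pods; the first pods absorb
// the remainder so the device budget is never exceeded.
func rebalance(totalIOPS int64, pods []string) []PodLimit {
	n := int64(len(pods))
	if n == 0 {
		return nil
	}
	share, rest := totalIOPS/n, totalIOPS%n
	limits := make([]PodLimit, 0, n)
	for i, p := range pods {
		l := share
		if int64(i) < rest {
			l++
		}
		limits = append(limits, PodLimit{Pod: p, IOPS: l})
	}
	return limits
}

func main() {
	// Pod set grows from two to three: each pod's limit shrinks from
	// 5000 to about 3333 IOPS, keeping the device total at 10000.
	fmt.Println(rebalance(10000, []string{"db-0", "db-1"}))
	fmt.Println(rebalance(10000, []string{"db-0", "db-1", "db-2"}))
}
```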
CPUs belonging to different NUMA nodes access different parts of memory at different speeds. Based on these characteristics, we think it is necessary to do fine-grained CPU scheduling. Our goals are to reduce the performance loss caused by frequent switching between multiple cores, and to reduce the loss caused by frequent switching between different NUMA nodes; in other words, to select the best CPU scheduling result across these dimensions.

This is our complete architecture. We have three extension components. First, the CPU scheduling plugins based on the scheduling framework. Second, the cgroup controller, which is responsible for the cgroup operations; this was covered in the previous part. Third, the cgroup agent, which is responsible for reporting the CPU topology and the real load of the CPUs; this is used in the scheduling decisions. The process works like this. Step 1: when we add a new node to the Kubernetes cluster, the cgroup agent reports the topology to the API server. Step 2: the Kubernetes scheduler watches the pod creation event. Step 3: the scheduling decision is made according to the algorithm. Step 4: the Kubernetes scheduler creates a Cgroups CR for the cpuset. Step 5: the cgroup controller watches the Cgroups CR creation event. Step 6: the cgroup controller creates a job for the cgroup cpuset operation. Step 7: the job completes the CPU binding operation.

Our CPU scheduling is based on the scheduling framework. The scheduling framework was proposed in release 1.15 and is expected to graduate in release 1.20. We have implemented the following four extension points. Filter: Filter filters out the nodes that cannot meet the pod's CPU requirements. Score: Score selects the optimal CPU scheduling result by the algorithm. Reserve: Reserve first reserves the optimal CPU scheduling result to prevent it from being reallocated to other pods, and it cleans up the CPU scheduling result if a failure occurs in the binding cycle. PostBind: PostBind creates the pod's Cgroups CR to store the CPU scheduling result.

Our scheduling algorithm is mainly divided into two parts. The first part is NUMA node selection. The top priority is that the CPU cores should be assigned within a single NUMA node, to prevent the loss caused by frequent switching between NUMA nodes; this is the first dimension to pay attention to. As shown in the picture, we choose NUMA node 1 on machine 1 and NUMA node 4 on machine 2 instead of the other NUMA nodes. If multiple NUMA nodes meet condition 1, they are selected according to the real CPU load, which is reported by the cgroup agent. If a single NUMA node cannot satisfy the request, the cores need to be within a single socket. The second part is core selection. Here we have two policies. The first one is to try to make sibling logical processors belong to the same physical core, so that they share the level 1 and level 2 cache; as shown in the picture, we choose NUMA node 1 or 3 instead of 2. The second one is to try to choose the relatively idle NUMA nodes; as shown in the picture, NUMA node 3 is better than 1.

Next I will introduce our results under different workloads. For a CPU-sensitive Java computing workload, the running time is reduced by 20%. For CPU-sensitive Java or Go web applications, throughput increases by 20% to 30% on average; the following picture shows the trend. For the MySQL workload, according to our actual usage data, insert TPS increased by 13% and select TPS increased by 10%. For the video transcoding workload, we chose to convert MPEG-4 to H.264 with different sizes and concurrency.
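For readers who have not used the scheduling framework, the sketch below shows roughly what such a plugin skeleton looks like in Go, with only the Filter and Score extension points shown. The plugin name, the annotation key, and the scoring logic are invented for illustration, and the exact framework import path depends on the Kubernetes version.

```go
// Rough skeleton of a scheduling framework plugin with Filter and Score.
// Plugin name, annotation key, and scoring logic are illustrative only.
package numaaware

import (
	"context"
	"strconv"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const Name = "NUMAAware"

type NUMAAware struct{}

func (pl *NUMAAware) Name() string { return Name }

// Filter rejects nodes whose largest NUMA node cannot hold the requested cores.
func (pl *NUMAAware) Filter(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	want, _ := strconv.Atoi(pod.Annotations["example.com/cpuset-cores"]) // hypothetical annotation
	if want > freeCoresOnBiggestNUMANode(nodeInfo.Node()) {
		return framework.NewStatus(framework.Unschedulable, "no NUMA node with enough free cores")
	}
	return nil
}

// Score prefers nodes whose NUMA nodes are relatively idle.
func (pl *NUMAAware) Score(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	return int64(float64(framework.MaxNodeScore) * idleRatio(nodeName)), nil
}

func (pl *NUMAAware) ScoreExtensions() framework.ScoreExtensions { return nil }

// Placeholders for the topology/load lookups that would read the data
// reported by the cgroup agent.
func freeCoresOnBiggestNUMANode(node *v1.Node) int { return 8 }
func idleRatio(nodeName string) float64            { return 0.5 }
```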
The test results are shown in the picture below: about a 10% efficiency improvement. Because the scheduling result is only optimal at the moment of pod creation, we also handle re-adjustment. As shown in the first picture, due to limited resources, some pods get CPUs allocated across NUMA nodes. After a period of time, the CPUs allocated to pod A are released, and pod B can then be allocated from a single NUMA node. The scheduler will create a new Cgroups CR rather than modifying the original one; the purpose is to prevent data conflicts from direct modification. The scheduling framework watches the event to update the scheduling cache, and in case of data inconsistency, the original data of the scheduler shall prevail. Finally, the cgroup controller adjusts pod B onto the single NUMA node. In this way, we dynamically adjust the cpuset to achieve better results.

Finally, I will introduce other related work that we are working on. In order to better manage scheduling-related plugins, a new project, scheduler-plugins, has been created to make it easy for users to manage different plugins, and users can define their own plugins directly based on this project. We have implemented the following plugins: coscheduling, which we can also call gang scheduling, capacity scheduling, node resource allocatable, and real-load-aware scheduling. The link is below; if you are interested, you can learn more about it. Our future work is mainly divided into two parts. The first is to make Spark run better on Kubernetes; we will enhance the scheduling to meet the needs of Spark, and in the meantime we will also carry out acceleration work. The second is heterogeneous resource scheduling, like GPU topology scheduling.

Okay, that's all of our talk. Thank you for listening. If you are interested, contact us by email or Slack. Thank you so much.