That's my topic, so let me share what we've done on the Linux scheduling algorithm and its optimization.

First, the background. Tencent runs servers at the million level, so they are a heavy asset. How can we make full use of them, how can we use them better, make sure they're not idle, make sure they're fully functioning? Say the CPU usage ratio is only 20%; we are trying to boost it to 70%. Internally at Tencent we keep trying to boost the usage ratio of CPU and GPU, and to boost efficiency, because that saves real cost. With Docker, containers, and other means we try to boost the usage ratio: we have our internal KVM, and we can also run containers directly, both to improve CPU usage. We have different definitions of CPU usage, and I won't go into great detail here.

The spare capacity cannot be used for the latency-sensitive part; we can only use it for workloads with lots of offline threads and batch processing, so it is mainly used for offline scheduling. We need to optimize and improve the kernel internally, and some of the solutions run the containers on bare-metal machines.

I'd like to emphasize the distinction again. Online services are basically latency-sensitive businesses that must be highly reliable. Offline businesses have low latency sensitivity; most of them are pure computing, so the impact of a one-time latency spike on that part of the business is pretty minimal. For an online business, take a real-time game: it is latency sensitive, and players have very low tolerance for delay; if you are fighting another player and the latency is very high, it's unbearable. For offline services, think of back-end virus scanning or sorting photos; these are businesses with low sensitivity to latency, the offline part, which is different. Video games are the opposite story, highly sensitive to latency. And the CPU load of the online services alone is not very high, which is why there is room to mix in offline work.

Mixed deployment, however, leads to a classical issue. In the diagram, the blue part is the online service and we have N CPUs. Say task B needs to acquire a resource from A, and C depends on a lock whose holder sits on CPU 0, which is occupied by offline task Y. That makes the whole sequence chaotic: CPUs 1 through 4 cannot be unlocked and cannot make progress. Tasks like Y and Z hold different locks, so the dependency chain becomes a very complicated landscape.

Going back to the original question: how do we solve this? Currently, different categories of business across the company use a shared platform to allocate their CPU resources, and with a mixed deployment of online and offline business the interference is uncontrollable. CFS is a very fair scheduling algorithm and it will not show preference to any thread or process, yet we have to guarantee that all the main online threads can be scheduled in a timely manner.
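To make the lock-dependency problem concrete, here is a minimal user-space sketch of lock-holder preemption. It is my own illustration of the scenario described above, not Tencent's code, and the task roles are invented:

```c
/* Sketch of the lock-holder-preemption problem: an "offline" thread
 * holds a lock shared with an "online" thread. If the holder is
 * starved of CPU while it owns the lock (a fair scheduler gives it
 * no preference), the latency-sensitive thread stalls behind it,
 * no matter how urgent it is. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

static void *offline_task(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&shared_lock);
    sleep(2);                  /* stand-in for being descheduled for a
                                  long time while holding the lock */
    pthread_mutex_unlock(&shared_lock);
    return NULL;
}

static void *online_task(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&shared_lock);  /* inherits the offline delay */
    puts("online task finally got the lock");
    pthread_mutex_unlock(&shared_lock);
    return NULL;
}

int main(void)
{
    pthread_t off, on;
    pthread_create(&off, NULL, offline_task, NULL);
    usleep(100 * 1000);        /* let the offline task grab the lock */
    pthread_create(&on, NULL, online_task, NULL);
    pthread_join(off, NULL);
    pthread_join(on, NULL);
    return 0;
}
```

On a loaded machine the `sleep(2)` would really be the offline holder waiting for CPU time, which is exactly the chain across CPUs described above.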
So this shared platform is weighted in some way, and that cannot solve our problem. As for the quota mechanism, if you're familiar with it, cgroups provide mechanisms for setting different quotas on different task groups, but a high-frequency timer has to traverse the whole cgroup tree, which is a heavy operation, and the quota has to be re-checked constantly. If we set a smaller period and keep the number of offline groups small, we can avoid some of the wasted CPU overhead; otherwise it is bad for the system. That is the situation with the pure quota approach.

Other issues we've encountered include load balancing. Take a scenario with 100 online threads and 1,000 offline threads: we have to separate online from offline, because if we mix them, meaning we add another 1,000 offline threads to those 100 online threads, load balancing becomes chaotic. An online thread gets migrated from CPU X to CPU Y, which has a major impact on the terminal latency, since context switches are required, and in heavy scenarios there are around 200 such entries. Migrating threads obviously means a lot of overhead, the stock load balancer does not account for these costs, and there is no switch to control the process.

To solve this load-balancing issue and strike a balance in mixed deployment, we introduced a new category of scheduling, a new scheduling class. We call it BT; we don't have a final name for it, we're still discussing that. Its priority is set lower than CFS, so we can guarantee that the online service keeps running and is not stripped of resources by the offline part. We differentiate threads, separating online from offline, and the system has to guarantee that these two parts do not affect each other. This is why a dedicated offline scheduling class is meaningful.

Why add a BT scheduling class alongside CFS and RT rather than modify CFS itself? Because that would mean changing a lot of code, which would probably undermine the stability of the whole system, so we need to solve the problem at a controllable cost. The first principle is to prioritize online computing: the whole design guarantees that online services preempt offline ones, with additional support on our side. We also added a lightweight load-balancing mechanism; the logic is that offline has its own load balancing and online keeps its own load balancing as well.

We added two new interfaces. One is a generic interface that controls the overall ratio of the offline service, capping it, say, at 20%. Why this limit? Some teams want to do their own configuration per CPU; through this setting they can control the offline load on each CPU, and if you don't want to do that, you can hand those CPUs back to the load-balancing mechanism. So we have three categories of control: load balancing, interrupt limitation, and offline priority. For the bandwidth limitation we use lightweight statistics and accounting; we are not doing heavy accounting like the quota mechanism, since that is not necessary by our standard. The cap does not have to be precise enough to distinguish 20% from 22%, for example.
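The kernel consults its scheduling classes in strict priority order (stop, deadline, rt, fair, idle), and a class like BT would slot in below fair so that CFS tasks always win. As a minimal sketch of that core idea, here is a self-contained toy model; the run queues, task names, and the simplified `pick_next_task` are mine, not the actual TK implementation:

```c
/* Toy model of the BT principle: the scheduler always picks a
 * runnable online task first; offline tasks only run in the gaps.
 * Online and offline live on separate queues, mirroring "online has
 * its own load balancing and offline has its own". */
#include <stdio.h>

struct task { const char *name; int runnable; };

static struct task online_rq[]  = { {"game-server", 1}, {"web-frontend", 0} };
static struct task offline_rq[] = { {"video-transcode", 1}, {"log-compress", 1} };

#define LEN(a) (sizeof(a) / sizeof((a)[0]))

static struct task *pick_from(struct task *rq, int n)
{
    for (int i = 0; i < n; i++)
        if (rq[i].runnable)
            return &rq[i];
    return NULL;
}

/* Consult classes in strict priority order, the way the kernel walks
 * stop -> deadline -> rt -> fair -> (here, hypothetically) bt. */
static struct task *pick_next_task(void)
{
    struct task *t = pick_from(online_rq, LEN(online_rq));
    if (t)
        return t;                                  /* online always wins */
    return pick_from(offline_rq, LEN(offline_rq)); /* BT fills the gaps */
}

int main(void)
{
    printf("picked: %s\n", pick_next_task()->name); /* game-server */
    online_rq[0].runnable = 0;   /* all online tasks block ... */
    printf("picked: %s\n", pick_next_task()->name); /* video-transcode */
    return 0;
}
```

Because the offline class is only reached when the online queue is empty, online preemption falls out of the class ordering itself rather than from quota bookkeeping.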
If it lands around 20%, that's acceptable; precision is not very important here, since it's offline. We also need to make sure we don't add new interrupts, and we made it compatible with nohz, so that offline scheduling still behaves correctly when the tick is stopped.

For load balancing, it's similar to what you've heard of, but we use wait time: when scheduling offline tasks we calculate the wait time as the basis for load balancing, instead of the whole life cycle, because the offline class is sometimes throttled and sometimes lies idle. For online load balancing we don't consider the offline tasks at all, but offline load balancing does have to look at the online side: we must make sure not to balance offline tasks onto CPUs with a high level of online load. That is one of our tricks, and based on it our test results proved to be very good.

Here are two of those results. The first is an online service A where we focus on the per-minute failure count. Not mixed with offline it is about 200; mixed with offline on a standard kernel it increases to about 5,000, which users cannot accept. With our scheduler the failure count only rises from 200 to 400, the success rate remains the same, and we can still add some lightweight load balancing on top.

The second scenario is about average latency. The average latency is about 150 milliseconds, and with offline tasks added the failure rate increases. Why can't users accept this? Because each module adds about 20 milliseconds, which is quite a long time; along the whole chain that accumulates to about 120 milliseconds in total, and as you can see on this page, the end-to-end delay becomes very long. That's why a 20-millisecond delay has a huge impact: the increase compounds along the whole call chain. With our scheduler the failure rate does not increase, which is acceptable, and reliability is also ensured.

This chart shows the load-balancing effect. The yellow curve is the unclassified case: it does not affect the online tasks, but it is unpredictable, meaning offline CPU utilization never reaches an acceptable level. The blue curve is the standard kernel: the fluctuation is not that sharp, but it affects the online tasks, and the offline side is not smooth either. The gray curve is ours: after testing, it does not affect online tasks, it runs very smoothly itself, and it has fewer sharp spikes and dips; you can see that online and offline stay in step. A typical example is offline utilization rising to 65% from just 15% before, a sharp increase. Offline task scheduling follows the characteristics of the online tasks: if the online side runs at a high rate, the CPU left over is limited and offline utilization won't reach a high level, so the offline curve tracks the online one.

To conclude: the Tencent kernel team is focused on researching how to improve the overall utilization rate of our servers, because the fleet is becoming larger and larger and the business more and more competitive, so we have more requirements for scheduling. The network is also very important, and last year we optimized a lot of servers as well.
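The wait-time heuristic and the "never land on a busy online CPU" rule can be sketched together. This is my own reconstruction of the idea as described in the talk, not Tencent's code; the threshold, the per-CPU numbers, and the function names are invented:

```c
/* Toy sketch of offline load balancing driven by wait time: offline
 * tasks are pulled from the CPU where they have waited longest, and
 * never pushed onto a CPU whose online load is high. */
#include <stdio.h>

#define NR_CPUS 4
#define ONLINE_BUSY_THRESHOLD 70   /* % online load above which a CPU
                                      is off-limits for offline tasks */

struct cpu_stat {
    int  online_load_pct;   /* load generated by online (CFS) tasks   */
    long offline_wait_ns;   /* accumulated wait time of offline tasks,
                               used instead of full runtime history   */
};

static struct cpu_stat cpus[NR_CPUS] = {
    { 90, 5000 }, { 30, 80000 }, { 20, 10000 }, { 95, 120000 },
};

/* Source CPU: the one whose offline tasks have waited longest. */
static int pick_busiest_offline(void)
{
    int busiest = 0;
    for (int i = 1; i < NR_CPUS; i++)
        if (cpus[i].offline_wait_ns > cpus[busiest].offline_wait_ns)
            busiest = i;
    return busiest;
}

/* Target CPU: least offline waiting, skipping CPUs hot with online work. */
static int pick_idlest_for_offline(void)
{
    int best = -1;
    for (int i = 0; i < NR_CPUS; i++) {
        if (cpus[i].online_load_pct >= ONLINE_BUSY_THRESHOLD)
            continue;   /* never migrate offline work onto a hot CPU */
        if (best < 0 || cpus[i].offline_wait_ns < cpus[best].offline_wait_ns)
            best = i;
    }
    return best;
}

int main(void)
{
    printf("pull offline tasks from CPU %d to CPU %d\n",
           pick_busiest_offline(), pick_idlest_for_offline());
    /* prints: pull offline tasks from CPU 3 to CPU 2 */
    return 0;
}
```

Using accumulated wait time rather than runtime matters because a throttled offline class runs in bursts: its runtime understates how starved it is, while its wait time measures that directly.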
We still need to focus on many areas. Why can't offline utilization be raised to an acceptable level? Because new workload models keep emerging and bring new problems. So what is the BT scheduling algorithm about? We classify processes, dividing them into online and offline, and make sure the system can juggle the different kinds of tasks at the same time; that is how we increase CPU utilization. This is the conclusion from our research. I'm not sure about the upstream patch, but I don't think there will be big changes to this: either CFS is changed later, or our idea of classifying tasks is adopted; otherwise the utilization rate cannot be improved. This is our development plan, still a draft plan, and we have finalized the whole report with detailed information. On T-Linux: Tencent has released TK, T-Linux, and you can get it on Tencent Cloud. This is our cloud business, including the open-source programs from the T-Linux team, covering hybrid, public, and private cloud, and we also contribute to the domestic hardware localization effort. That's the end of my presentation. I may be running a bit fast because I didn't deep dive into the details, which would get into kernel scheduling internals; if you're interested, please look at my slides. In general, this provides a new approach for offline tasks.

[The interpreter cannot hear the question because the questioner has not turned on the microphone.]

Audience: I'm from Alibaba. We have been working on this kind of hybrid deployment for a long time, and it's mostly about CPU scheduling, ensuring that the high-priority tasks keep running. The problem we've encountered is more about memory. After the mix, memory usage increases, and it's a big increase, because CPU utilization goes up. So there's a problem with memory management: offline tasks apply for a lot of memory, destroy it, then apply again, so a lot of page cache builds up and needs reclaiming, which puts pressure on the host OS and causes backlogs and pressure for the online tasks.

Speaker: I think we have not fully dealt with that yet; this talk is about scheduling optimization. But we have also done a lot of work on memory, like page reclaim: for each cgroup we do page reclaim for the container, and we set the restrictions well ahead, because if a container has touched millions of files and we only reclaim when the limit is hit, it is already too late. Still, this presentation is about CPU utilization.

[The interpreter cannot hear the question because the audience member is not using a microphone.]

Audience: For the online tasks, suppose you have 128 of them, and you make 100 of them online and 20 of them offline; do you focus on the consolidation rate being too low?

Speaker: Yes, we need to focus on the utilization rate of the machine. Sorry, I cannot hear you.

Audience: I have a question about the kernel, about the scheduling algorithm. With these changes to the kernel, how do you differentiate between online and offline? How can you tell whether a task is offline or online?

Speaker: We mark it. There is a higher-level scheduler; it works with this offline scheduler, marks the tasks, and provides an interface.
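The marking interface itself was not shown in the talk. Purely as a hypothetical illustration of what "mark it and provide an interface" could look like from user space, here is a cgroup-style sketch; the path and file below are invented for this example and are not the real TK interface:

```c
/* Hypothetical illustration of marking a process as "offline" by
 * writing its PID into a cgroup-style file. The path is made up for
 * this sketch; the real TK/T-Linux interface was not shown. An
 * upper-level container scheduler would do this when launching a
 * batch job. */
#include <stdio.h>
#include <unistd.h>

static int mark_offline(pid_t pid)
{
    FILE *f = fopen("/sys/fs/cgroup/cpu/offline/tasks", "w"); /* hypothetical */
    if (!f) {
        perror("open offline cgroup");
        return -1;
    }
    fprintf(f, "%d\n", (int)pid);  /* from here on, the kernel would run
                                      this task in the offline class */
    fclose(f);
    return 0;
}

int main(void)
{
    return mark_offline(getpid());
}
```

This matches the answer that follows: the classification decision lives in the upper layer, and the kernel only exposes the knob.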
You open the kernel interface and let it tell you. Yes, the kernel cooperates with the upper-level tasks: the offline scheduling is conducted together with the upper layer, and the upper level gives you the information about how to do this scheduling. Thank you very much.