at this open source summit. I'm not a native English speaker, so feel free to ask questions if anything is not clear. You could also send me an email with your ideas after this talk, and I will reply as soon as possible. I am an open source developer currently working at PingCAP on the development of TiDB, an open source NewSQL database. Here are some of my personal open source projects. One of them is a C coroutine library; another one is about the Paxos algorithm. Today our topic is how the CFS algorithm could be used in goroutine scheduling.

Let's start with some benchmarks. We're programmers and we all know benchmarks, right? At the very start, please allow me to introduce two terms. One is the event intensive task: most of its life cycle is spent waiting for events, with just a minute amount of on-CPU time, and only a very small part of its time is spent doing CPU computation. The other is the CPU intensive task: most, or even all, of its life cycle is spent doing CPU computation, with just a minute amount of time off CPU, and only a very small part of its time, or none at all, is spent waiting for events.

Here is our first benchmark: the fdatasync latency benchmark. First we open a log file, then loop about N times: each iteration writes four kilobytes of data to the file and calls fdatasync. Our test environment is an eight-core Linux virtual machine with an exclusive NVMe SSD attached. Here is our first benchmark result. One is the C version, the other is the Go version. We can see the difference is very small, so the result is not bad, right? Actually, the C version should be a little faster than the Go version, because the Go version carries scheduling overhead in the Go runtime.

Here is our second benchmark: fdatasync with CPU intensive tasks, benchmarked in Go. At first, we loop about N times, each time creating a goroutine running a CPU intensive task, and then we do exactly the same as before. The CPU intensive task is an infinite loop which computes a checksum of a one-megabyte byte slice.
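The first benchmark loop described above might be sketched in Go roughly like this. This is a minimal sketch, not the speaker's actual code: the file name, iteration count, and function names are assumptions, and `syscall.Fdatasync` is Linux-only.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"time"
)

// benchFdatasync writes a 4 KB payload n times, calling fdatasync after
// each write, and returns the worst single-iteration latency observed.
func benchFdatasync(path string, n int) (time.Duration, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o644)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	buf := make([]byte, 4096) // 4 KB per write, as in the talk
	var worst time.Duration
	for i := 0; i < n; i++ {
		start := time.Now()
		if _, err := f.Write(buf); err != nil {
			return 0, err
		}
		// fdatasync flushes the written data (but not all metadata)
		// down to the storage device before returning.
		if err := syscall.Fdatasync(int(f.Fd())); err != nil {
			return 0, err
		}
		if d := time.Since(start); d > worst {
			worst = d
		}
	}
	return worst, nil
}

func main() {
	worst, err := benchFdatasync("bench.log", 100)
	if err != nil {
		panic(err)
	}
	fmt.Println("worst fdatasync latency:", worst)
}
```

The second benchmark would add goroutines running an infinite checksum loop before calling the same function.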
Here is our second benchmark result. When the number of CPU intensive tasks is zero, the result is the same as in our first benchmark. But as the number of CPU intensive tasks grows, the fdatasync latency becomes worse and worse. This is not a very good result, right? For many Go applications, CPU intensive tasks are simply unavoidable, so this benchmark tells us that for those applications, such a latency penalty may have existed for a very long time.

So what's going on here? Basically, the current Go runtime scheduler uses a round-robin scheduling strategy. Here is a simple demo. The demo below is not meant to work exactly like the Go runtime scheduler, and Go may evolve in the future; the purpose of this demo is to show that in some cases, round-robin is not a good choice. On the left of the demo figure is the goroutine run queue; on the right is the CPU. A is the fdatasync goroutine, and B, C, and D are CPU intensive goroutines. We assume the time slice is 20 milliseconds. At time T0, A is scheduled to run, and because A is event intensive, it consumes only very little CPU time, so we just neglect it in this demo. Then B is scheduled to run. While B is running, A finishes its event waiting and becomes runnable again. Then, at time T0 + 20 ms, B is suspended and C starts to run, because our time slice is 20 milliseconds. At time T0 + 40 ms, it's D's turn to run. And finally, A gets to resume at T0 + 60 ms. That's why we get much worse fdatasync latency when it runs together with CPU intensive tasks.

So how do we solve this problem? We know that the scheduling strategy of the current Go implementation is basically round-robin. But in practical application scenarios, priority relationships may need to exist between different types of tasks, or even between tasks of the same type.
For example, if you want to use Go to implement a storage engine, in general IO tasks should have higher priority than other CPU intensive tasks, because in this case the system bottleneck is more likely to be in disk IO, not CPU, right? For another example, we may use Go to implement a network service which is required to guarantee the latency of certain lightweight tasks, even while other goroutines are quite CPU intensive. However, in the current implementation of the Go runtime, if there are a certain number of CPU intensive tasks that need to run for a very long time, they will inevitably impact the scheduling latency.

Goroutine scheduling is quite like thread scheduling, so let's test this scenario on the Linux kernel and see what happens. We run this test on a one-core Linux virtual machine. As we can see, the current CPU load of this system is very low, and we can assume that all existing processes are event intensive. On the left, the ping command is running, whose destination address is the router of our local network. The sshd process serving this SSH session is event intensive too. Now we create a CPU intensive task, simply using a bash script. htop on the right tells us that the system has reached its maximum CPU load. After waiting a while, we can see that the ping latency on the left is still very stable, with no big difference from before, and our SSH session still feels very smooth. As we saw, the Linux kernel scheduler performed very well.

Maybe we could compare the thread model provided by the kernel with the goroutine model provided by Go. Besides the inherent low-cost and high-concurrency advantages of goroutines over threads, some very useful mechanisms in the kernel thread model cannot be found in the goroutine model of Go, at least not yet. For example, the dynamic modification of scheduling policy and priority, including adaptive priority adjustment.
The scheduler of the initial version of the Linux kernel was quite primitive too, but with the continuous development of the kernel, more and more applications had higher and higher requirements for the kernel scheduler, and it has kept evolving into today's CFS scheduler. That should also apply to Go: a more powerful scheduler would make Go even greater.

So let's introduce cpuworker, a customized goroutine scheduler built on top of the Go runtime. Basically, the idea of CFS scheduling is to give every thread a logical clock which records the thread's on-CPU time. Different priority settings of threads mean different ticking paces of their logical clocks, and the CFS scheduler prefers to choose the thread whose logical clock is the furthest behind, because the scheduler thinks it is quite unfair to make such a thread fall even further behind. After all, the scheduler's name stands for Completely Fair Scheduler. So if a thread is event intensive, it will have a higher effective priority; and if a thread is CPU intensive, it will have a lower effective priority. We could call that adaptive priority adjustment, which is a quite important feature: it ensures that the scheduling latency of event intensive threads will not be hurt badly, even when the system is under high CPU load due to the existence of some CPU intensive threads.

Assume we are running on an eight-core Linux box and GOMAXPROCS has been set to eight. Then we split these eight Ps into two parts: one has two Ps, and the other has the remaining six Ps. The two-P part is reserved for event intensive tasks, and the six-P part is roughly reserved for CPU intensive tasks. To achieve that, we only need to ensure that the concurrency of the CPU intensive tasks is always no more than six. And here are four propositions that make the idea of customizing the goroutine scheduler feasible.
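The logical-clock idea can be sketched as follows. This is a minimal model of CFS's virtual runtime for illustration, not the kernel's implementation, and the weight values are arbitrary assumptions.

```go
package main

import "fmt"

// task carries a logical clock (CFS calls this the virtual runtime):
// it advances while the task is on CPU, scaled down by the task's weight,
// so higher-weight (higher-priority) tasks fall behind more slowly.
type task struct {
	name   string
	weight float64 // higher = higher priority
	vclock float64 // logical on-CPU clock
}

// onCPU advances the logical clock by dtMs milliseconds of real CPU time.
func (t *task) onCPU(dtMs float64) { t.vclock += dtMs / t.weight }

// pickNext chooses the task whose logical clock is furthest behind,
// because letting it fall even further behind would be "unfair".
func pickNext(ts []*task) *task {
	next := ts[0]
	for _, t := range ts[1:] {
		if t.vclock < next.vclock {
			next = t
		}
	}
	return next
}

func main() {
	a := &task{name: "event-intensive", weight: 1}
	b := &task{name: "cpu-intensive", weight: 1}
	b.onCPU(40) // b has burned 40 ms of CPU time
	a.onCPU(1)  // a barely touched the CPU while waiting for events
	fmt.Println("next to run:", pickNext([]*task{a, b}).name)
}
```

Because an event intensive task accumulates almost no on-CPU time, its clock stays behind and it gets picked first, which is exactly the adaptive priority adjustment described above.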
First, event intensive tasks have very little on-CPU time, and there is always an available P for event intensive tasks. Second, cpuworker, the scheduler itself, is event intensive. Third, CPU intensive tasks compete for the limited P resources, which are managed by our customized scheduler. Fourth, the Go runtime scheduler supports work stealing. At this point, we have the ability to customize our own goroutine scheduler on top of the Go runtime. And because it sits on top of the Go runtime, we could implement any scheduling algorithm we want, not only CFS.

There are three patterns of cpuworker's submit. Pattern one is a special case of pattern two. Pattern one is not friendly at all, because the task will occupy one P during its whole life cycle, so the best practice is that this pattern should only be used by very short CPU intensive tasks. Pattern two gets a checkpoint function (checkpointFp) in its input parameters, and the CPU intensive task should call this checkpoint function from time to time; this is the point where the cpuworker scheduler suspends or resumes the CPU intensive task. Pattern two is a special case of pattern three. Pattern three we name the hybrid task, which is a mix of CPU intensive and event intensive work. In theory, pattern two is probably enough, but in practice there may be some tasks that are not so easy to classify as CPU intensive or event intensive, especially in already existing Go projects. Here is a demo of pattern three. Basically, we try to extract all the event intensive parts of the hybrid task, so that all the remaining parts are CPU intensive. The input of the event function should be an event intensive task, or nil, which means a plain checkpoint, the same as the checkpoint function in pattern two.

Here is the state transition of one task from the perspective of the cpuworker scheduler. Step one: the newly created task has just been submitted to cpuworker, and cpuworker schedules it to run.
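The three submit patterns might look roughly like the following. These signatures and names are hypothetical, for illustration only; check the cpuworker repository for the real API, and note that a real implementation would gate tasks on the limited P pool rather than launching goroutines directly.

```go
package main

import "fmt"

// Pattern 1: a plain CPU task. It occupies one P for its whole life
// cycle, so only very short CPU intensive tasks should use it.
func submitPlain(task func()) { go task() }

// Pattern 2: the task receives a checkpoint function and must call it
// from time to time; at those calls the scheduler can suspend or resume
// the task. In this sketch the hook is a no-op.
func submitWithCheckpoint(task func(checkpoint func())) {
	go task(func() { /* scheduler's suspend/resume hook would go here */ })
}

// Pattern 3: a hybrid task. eventCall runs an event intensive part off
// the CPU-bound Ps; passing nil degenerates into a plain checkpoint.
func submitHybrid(task func(eventCall func(ev func()))) {
	go task(func(ev func()) {
		if ev != nil {
			ev() // run the event intensive part
		}
	})
}

func main() {
	done := make(chan string, 1)
	submitWithCheckpoint(func(checkpoint func()) {
		checkpoint() // a real task calls this inside its hot loop
		done <- "cpu task finished"
	})
	fmt.Println(<-done)
}
```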
Step two: when performing the checkpointFp check, the task finds that it has received a suspend instruction from the scheduler, so it puts itself into the runnable-task priority queue and waits for the resume signal. Step three: the suspended task receives the resume signal from the scheduler and continues to run. Step four: the task returns its P to the cpuworker scheduler and then executes the event intensive part. Step five: the execution of the event intensive part completes, then the task puts itself into the runnable-task priority queue and waits for the resume signal. Step six: the end of this task's life cycle. As a proof of concept, cpuworker currently only implements a part of CFS, and could also be regarded as a variant of CFS.

Here are some more benchmarks. First, recall the second benchmark we did at the very start: fdatasync with CPU intensive tasks, benchmarked in Go; here is the CPU intensive task definition. And here is our third benchmark, which is the fdatasync with CPU intensive tasks benchmark, with cpuworker. Basically, it submits some number of CPU intensive tasks to cpuworker at first, and then does the actual fdatasync benchmark, which means the CPU intensive tasks will be under the control of our own scheduler. Here is the definition of the CPU intensive task with checkpoint, which has an extra input parameter named checkpointFp; inside its infinite for loop it calls this function from time to time. Here is our benchmark result. We can see that the vanilla version's fdatasync latency becomes worse and worse as the number of CPU intensive tasks grows, while our cpuworker version's fdatasync latency stays very stable, around one to two milliseconds, as the number of CPU intensive tasks grows. This is a very good result.

Here is our last benchmark. The first function is the CPU intensive task, and the second is the CPU intensive task with checkpoint.
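A CPU intensive task with checkpoint, as described, might look like this sketch. The payload size follows the talk's one-megabyte checksum; the function names are assumptions, and the loop is bounded here so the sketch terminates instead of running forever.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// cpuIntensiveTaskWithCheckpoint repeatedly checksums a 1 MB slice and
// calls checkpointFp between rounds, giving the scheduler a chance to
// suspend and later resume the task at that point.
func cpuIntensiveTaskWithCheckpoint(rounds int, checkpointFp func()) uint32 {
	buf := make([]byte, 1<<20) // 1 MB payload, as in the talk
	var sum uint32
	for i := 0; i < rounds; i++ {
		sum = crc32.ChecksumIEEE(buf)
		checkpointFp() // cooperative scheduling point
	}
	return sum
}

func main() {
	calls := 0
	sum := cpuIntensiveTaskWithCheckpoint(3, func() { calls++ })
	fmt.Printf("checksum=%08x, checkpoint called %d times\n", sum, calls)
}
```

The vanilla version of the task is the same loop without the checkpointFp call, which is why the scheduler has no chance to intervene.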
Then we've got an HTTP handler named handleChecksumWithoutCpuWorker, and another HTTP handler named handleChecksumWithCpuWorkerAndHasCheckpoint. The last HTTP handler is named handleDelay1ms; it is an event intensive task. Here we do the HTTP handler setup, and here are some loads we're going to test on our demo server. Command one generates an event intensive load. Command two generates the CPU intensive load without cpuworker, i.e. the vanilla version. Command three generates the CPU intensive load under the control of our cpuworker.

Here is our last benchmark result. On the left is the throughput, and on the right is the average latency. The first row in the table is delay1ms alone, which means delay1ms is the only load in this round of the test; we can see that the throughput and average latency are very good. The second row, delay1ms plus CPU intensive (vanilla), means the load is a mix of delay1ms and the CPU intensive tasks without cpuworker; we can see the performance drops dramatically. The last row, delay1ms plus CPU intensive (cpuworker), means the load is a mix of delay1ms and the CPU intensive tasks with cpuworker; we can see the performance becomes nice again. For more information you can refer to this link to cpuworker's git repository. Thank you, that's all.