Hello everyone, it's our pleasure to talk about what we learned from reducing potential latencies in PREEMPT_RT kernels. In this talk, we review the development of kernel task isolation through the years, and then the latest task isolation implementation, which requires minimal changes. The preliminary results show that it is capable of eliminating unintended latency in PREEMPT_RT kernels. In addition, we would like to address potential kernel integration considerations. The nohz_full mode in Linux allows partial task isolation by decreasing the number of interrupts that a CPU receives. For example, the clock tick interrupt is disabled for nearly all CPUs. However, nohz_full does not guarantee that there will be no interrupts: the running task can still be interrupted by page faults or deferred workqueues. Full task isolation is arguably an attempt to finish the job by removing all interrupts. A process that enters isolation mode is able to run in user space without interference from the kernel or other processes. With the help of the prctl system call, a task can activate isolation. Meanwhile, we want to identify the sources of noise, which is crucial for PREEMPT_RT kernels. Here, task isolation means escaping from the noise by introducing several isolation mechanisms. In this talk, we will go through the problem of kernel task isolation and the feasibility of achieving full task isolation. We also review the evolution of full task isolation and show some patch sets proposed by several developers. However, we will not discuss Jailhouse or other hypervisor-based solutions, and we will not propose yet another patch set; what we want to do is to consolidate the PREEMPT_RT patch sets. CPU interference is defined as any case where a kernel management function takes CPU time away from a pure CPU-bound user task that makes no system calls and is pinned to an isolated CPU with no other contending user task.
Precisely speaking, any kernel activity that occurs as a result of that specific task's own activity is not considered interference. However, kernel activity on the specific CPU that happens as a result of a workqueue item or another task of the general kernel infrastructure, unrelated to the specific task, is considered interference. For example, I/O is an important source of noise: in DPDK, we must be aware of blocking while receiving data from a socket. Inside the Linux kernel, there are some housekeeping mechanisms. There are unbounded works, for example RCU callbacks and timers, and there are bounded works, for example the vmstat update routine. Before we go into the details of task isolation, we would like to address its definition: it should be able to provide a bare-metal-like environment for computationally intensive or real-time applications to run on. Let's go back to the implementation of the current infrastructure for task isolation. Twenty years ago, the sched_setaffinity system call was introduced to specify a set of CPUs on which a thread can run. In 2004, isolcpus was introduced to allow CPUs to be removed from the scheduling domains and load balancing, which improves real-time behavior. In 2012, RCU callback offloading was introduced. The next year, nohz_full was introduced; its motivation is to reduce the tick to 1 Hz. In 2018, nohz_full was improved. Let's go through each kernel feature associated with task isolation. The first one is the sched_setaffinity system call. It is the first mechanism for isolating tasks in the kernel. It controls the CPU affinity mask of each task, which indicates which CPUs the task can run on. We need to handle each mask to achieve task isolation. The next mechanism to achieve task isolation is isolcpus. It removes the specified CPUs from the scheduling domains and isolates processes from the selected CPUs by default, which means that processes will not migrate to the isolated CPUs during load balancing.
It is a crucial feature for PREEMPT_RT to be more practical. The third mechanism is nohz_full. It is the most important feature for task isolation: it can stop the timer tick when the system does not need to do scheduling. However, the timer tick might not be easily disabled. In fact, there are some dependencies, such as POSIX timers, perf events, the clocksource, the scheduler, and RCU callbacks. For the scheduler, it needs to perform preemption, which means the timer tick cannot be disabled at that time. For RCU callbacks, we have to think about further issues. RCU NOCB, or RCU no-callback, is an RCU feature that offloads RCU callbacks, including their life-cycle handling and execution, out of the enqueuer's CPU to dedicated kernel threads instead. This pulls some kernel noise off CPUs that may run crucial code. It is usually associated with nohz_full. So far, we can make a list of the problems we face in the kernel infrastructure from the perspective of full task isolation. It is suitable for isolating from unbounded works by setting affinity masks or passing the isolcpus and nohz_full kernel parameters. However, it is not possible to prevent bounded works from interrupting task-isolated CPUs. For example, the vmstat update worker is queued to the per-CPU workqueue every second by default. This means the unintended latency still exists in PREEMPT_RT kernels. As I mentioned earlier, some developers have already sent task isolation improvement patches. In 2015, Chris Metcalf proposed an interesting task isolation patch set with some notable features. First, it provides configuration through the prctl system call, and it evaluates the possibility of disabling the tick at the beginning of task isolation. In Chris's patch set, the vmstat update worker is cancelled. Going back to nohz_full: nohz_full is not able to eliminate the tick entirely, forcing the user to tolerate potential timer interrupts.
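The kernel parameters discussed above are typically combined on the boot command line. A minimal sketch (the CPU list 2-3 is an arbitrary example, not a value from the talk):

```
# Example kernel boot parameters (appended in the bootloader configuration):
#   isolcpus  - remove CPUs 2-3 from the scheduling domains / load balancing
#   nohz_full - stop the periodic tick on CPUs 2-3 when a single task runs
#   rcu_nocbs - offload RCU callbacks from CPUs 2-3 to kthreads
isolcpus=2-3 nohz_full=2-3 rcu_nocbs=2-3
```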
As for lru_add_drain, it ensures that the CPU will not be asked to drain its CPU-local pages while it is running in user space. However, with Chris's patch set, the kernel may busy-wait until there is no more pending timer to run. Later, in 2019, Alex Belits proposed another patch set based on Chris's implementation. He added an isolation feature which caused several changes in various kernel subsystems. The patch set prevents IPIs, which stands for inter-processor interrupts, from being sent to isolated cores, and he added code to handle isolation at the syscall, IRQ, and IPI entry paths. However, there are some notable problems in Alex's patch set. First, it broke some semantics of kernel APIs. For example, kick_all_cpus_sync() will not synchronize on isolated CPUs, which means the behavior changes in full task isolation mode. Also, since kick_all_cpus_sync() was modified to avoid scheduling interrupts on CPUs with isolated tasks, it implies that there may be several race conditions when changing the isolation mask. In Alex's patch set, the amount of modification was huge: the modifications cross several paths, including the syscall entry, IRQ handling, chip drivers, and so on. Most importantly, Alex's patch set was dedicated to the ARM64 architecture only, which means that other developers have difficulty reproducing Alex's full task isolation work. Recently, Marcelo Tosatti, who is a maintainer of KVM, tried to improve KVM performance by reworking Alex's patch sets. His motivation was to improve KVM performance. Marcelo reworked what Alex did before by providing fine-grained configuration, and Marcelo believes in having the flexibility to decide which interrupts are acceptable to the system. Only the vmstat update worker is cancelled, which brings less impact to the kernel, since the frequency of the updates can be modified through sysctl; and the cost of updating vmstat is more expensive under KVM. That is the motivation of Marcelo's patch set.
However, with Marcelo's patch set, the TIF, which stands for thread information flag, must be updated if the isolated task is preempted, through preempt notifiers, and we have to enable the KVM configuration in order to make use of Marcelo's patch sets. Here is a general overview of the API usage based on Marcelo's patch. You can configure and specify the features you want to use. However, at the moment, only the vmstat update worker can be cancelled, so you can only specify that flag, but it is possible to extend. Then we can activate the specified features. The usage is like this: here we use the prctl system call to specify the features, and this part activates task isolation by means of prctl. oslat is part of the rt-tests package, and in Marcelo's patch sets we use prctl to mark the beginning and the end of the section which is sensitive to latency. The code shown on this screen is the main loop of oslat; you can put your own latency-sensitive routines here. Here is the activation, and this part deactivates task isolation. In order to measure the effectiveness and the benefits of the task isolation patch, we have to measure along with a scenario. We will use some benchmarking tools. The first one is oslat, which is part of rt-tests. It polls the timer value, which can simulate some usage, for example the DPDK use case, which is a user-space network driver. We will use the function tracer of ftrace, which records the behavior of the system, including kernel functions and events. We will also use the osnoise tracer, which was introduced in recent kernels. It has similar behavior to oslat, but it can record more information, such as the actual execution time and the type of noise. Besides benchmarking tools, we must prepare the proper workload. Here we use two tools. The first one is tuned, which is developed by Red Hat. It is very useful in several scenarios: it helps us to configure and reproduce in a much more straightforward way.
We will also use stress-ng, which generates various kinds of workloads, such as virtual memory pressure and timer interrupts. The benchmarking scenarios we propose are listed as follows. The benchmarking idea is to test the behavior and the effectiveness of the task isolation patches. We focus on scenarios that have intensive accesses to memory, which force the vmstat update to synchronize the data counts frequently. Based on this idea, we designed three kinds of workload. The first is frequent page faults. The second is frequent out-of-memory events. The third one is a mixture of the above workloads, which means page faults and out-of-memory events take place frequently in such a workload. We chose two platforms to validate the task isolation patches. The first is the Raspberry Pi 4, which is an ARM64 platform. The other is an Intel Xeon-class machine running KVM. For both platforms, the kernels were configured with skew_tick=1 in the kernel parameters. We use tuned to generate the configuration; precisely speaking, we set the realtime-virtual-host profile to isolate a single core. Here are the steps to benchmark. Initially, we choose the tracer and the events we want to record. Then we perform the warm-up by starting the workload on the non-isolated cores and waiting for a while. Eventually, we run the tracer and record the possible noises associated with the events. You can use the scripts we prepared to reproduce this. The kernel we use is based on Linux version 5.15, along with the PREEMPT_RT patch set and Marcelo's task isolation patch. Most of the results were measured by oslat, which is part of the rt-tests package, to catch all possible interference. We test on two kinds of platform: one is the Raspberry Pi, the other is an Intel Xeon server. We run three kinds of workloads generated by stress-ng. The first is page faults, including major and minor page faults. The second is virtual memory and mmap, which is called out-of-memory here.
The third one is a mixture of page faults, virtual memory, and mmap. Check this part: it was measured on the Raspberry Pi, which is an ARM64 microarchitecture. The blue line is the PREEMPT_RT kernel with nohz_full; the orange one is the configuration with task isolation. If you check the page fault workload, task isolation behaves a bit better. If you check the virtual memory workload, it is still better. If you mix the workloads of page faults and virtual memory with out-of-memory here, you can see from the diagram that task isolation behaves better. All test cases have lower latency on average. On the Raspberry Pi, since the system is clean and does not run other applications, the task isolation feature brings an improvement of about 2 microseconds of latency at least; in certain workloads, it can be much better. If you check the Intel Xeon server, it brings 10 microseconds of latency reduction. This shows that the isolation from the vmstat update is still useful, with a caveat: the maximum latency is still high. If you check the ARM 64-bit architecture, you can see 200 microseconds of maximum latency, which means there is still some other interference that should be isolated. According to the results, we can conclude that there is no silver bullet, which means we cannot rely on a single task isolation patch set to eliminate all latency. In other words, no general solution exists for task isolation. We have to identify the sources of noise and figure out the corresponding solutions one by one. Based on the latest patch set proposed by Marcelo, extra efforts are still needed for task isolation. Again, we have to take the sources of noise seriously. With the latest task isolation patch set proposed by Marcelo, we are able to cancel the vmstat update events, but we still have to think of interrupt handlers such as IRQs or IPIs, and the scheduling tick is an important source of noise. So we have to think about the usage and the addressed scenarios.
However, the results of the latest task isolation patch show that it is quite feasible to extend. At the moment, we can only cancel the vmstat updates, but it is possible to add more features, such as disabling IPIs for ARM64 platforms. I would like to emphasize the importance of the task isolation patch; meanwhile, there is room for improvement. Thanks for listening. Bye.