Hello everybody. We are going to talk about an approach to measuring and reducing kernel latency from the perspective of real-time system development. In this talk, latency means the time from when an event arrives until the task that handles it actually gets to run. We will discuss in detail where the latency comes from, and we will introduce microscopic measurement as an efficient way to visualize system latency. The main target of our real-time work is PREEMPT_RT, but I think the ideas can be reused by other real-time extensions such as Xenomai. Finally, we will analyze and reduce the latency using Cortex-A9 multicore platforms as examples.

PREEMPT_RT itself is an approach to minimize Linux's internal latency from external event to response. That latency consists of several parts. First, before the interrupt can even be taken, the kernel may be executing a critical section that disables interrupts, and you have to wait for the hardware to deliver the interrupt. Then, in the design of Linux interrupt handling, there is a top half and a bottom half, and this path is critical. The time to wake up the target task usually involves locks, RCU, and so on. Then the bottom half has to run, which means softirqs and other deferred work, plus potential synchronization. Finally, the scheduler has to run and switch to the task.

So what does a preemptive kernel mean? It means controlling latency by allowing the kernel to be preempted almost anywhere. Kernel preemption increases responsiveness, but on the other side it decreases throughput. We define preemption as the ability to interrupt a task at preemption points. You can compare the two diagrams: the first one has longer latency, but if you insert more preemption points, you get shorter latency. The longer a non-preemptible region runs, the longer the waiting time and the more uncertainty you get.

The idea behind PREEMPT_RT is to make kernel code preemptible as well, even while executing in kernel context. If you check the diagram, you can see that if you turn off kernel preemption, the kernel only preempts in user space; in other words, preemption is not allowed in kernel mode when your kernel is configured with preemption disabled. You can also see from the diagram that interrupt handling itself is always the highest-priority activity, and that softirqs and other kernel threads come second.

Now, the preemption options in the kernel configuration. The first level inserts explicit preemption points in the kernel via might_sleep(), so the kernel can be preempted only at those preemption points. If you configure your kernel with CONFIG_PREEMPT, it inserts more preemption points and performs implicit preemption: preemption can happen whenever the preemption count drops to zero. Finally, the PREEMPT_RT_FULL configuration reduces the non-preemptible cases in the kernel; here we care about spinlocks and interrupt context. You can see from the diagram that almost every path becomes preemptible in this kernel configuration. And if you compare the original one, I mean with kernel preemption turned off, against PREEMPT_RT, you can see the softirq context itself is gone: softirq processing is removed in the PREEMPT_RT configuration, where ksoftirqd is implemented as a normal kernel thread which handles all softirqs, so softirqs run from thread context. There are also more system-management threads, such as the RCU threads, the watchdog, ksoftirqd, and the posix-cpu-timers thread.
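To make the preemption-point idea concrete, here is a minimal kernel-style sketch, not taken from the talk's slides: cond_resched() gives the scheduler a well-defined chance to run a higher-priority task inside a long loop, which is exactly the kind of explicit preemption point the voluntary-preemption model relies on. struct my_item and process_one_item() are made-up placeholders.

```c
/* Sketch: an explicit preemption point in a long-running kernel loop.
 * Illustrative module-style code; struct my_item and
 * process_one_item() are hypothetical placeholders.
 */
#include <linux/list.h>
#include <linux/sched.h>

struct my_item {
	struct list_head node;
	/* ... payload ... */
};

static void process_one_item(struct my_item *it)
{
	/* placeholder for real per-item work */
}

static void process_many_items(struct list_head *items)
{
	struct my_item *it;

	list_for_each_entry(it, items, node) {
		process_one_item(it);

		/* Explicit preemption point: yield the CPU here if a
		 * higher-priority task needs it, instead of holding it
		 * for the whole loop.  might_sleep()-annotated calls
		 * act as similar points under voluntary preemption. */
		cond_resched();
	}
}
```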
You can check the excellent talk by Steven Rostedt last year. If you look at the implementation of PREEMPT_RT, you can see that spin_lock_irqsave() is mapped to spin_lock(), and it is quite interesting that PREEMPT_RT has already changed the implementation underneath: the real implementation is a new RT spinlock, which is sleepable.

If we talk about how to measure the latency, we have to consider several scenarios. The first scenario is wakeup. As I mentioned before, the latency consists of several parts, from hardware through IRQ handling and scheduling until the task finally gets to respond. In Linux, interrupt handling is quite involved. The interrupt controller sends a hardware signal; the processor switches mode from the original mode, for example user mode, banks registers, and disables IRQs. The generic interrupt vector is entered, which saves the context. The last part is to identify which interrupt happened and call the relevant IRQ service routine. In this scenario, you can see that the execution of task 2 gets delayed because of the external event.

Another scenario is wakeup on an idle CPU. It takes time to wake up the CPU by means of an IPI, an inter-processor interrupt, and moving a CPU from idle to running takes time; that is a source of potential latency. The scheduler still has to place the woken task on a CPU, and that placement itself adds latency, so we have to think about how to reduce latency in this condition. The first option is the arrangement of process priorities: lower-priority tasks wait while higher-priority tasks are given the CPU. You can also change the scheduling policy; for example, SCHED_FIFO and SCHED_RR (round robin) always schedule ahead of SCHED_OTHER or SCHED_BATCH.

So let's get back to the measurement. We want to introduce an efficient way to perform microscopic measurement, and we considered several items. The first is the clocksource and the high-resolution timer, which we usually call the hrtimer. The hrtimer is critical because it involves both hardware and software interrupts for its accuracy; timers and intervals are not delivered precisely when the system is loaded, and that adds latency inside the kernel. The next item is task-switching cost. As you know, a process switch is always heavier than a thread switch, because process switching needs to flush the TLB; if your real-time application consists of lots of process switches, measuring this cost is necessary. We also considered page faults: initial memory accesses cause page faults, which cost extra latency, and pages that were swapped out to the swap area cause page faults too. You can use the mlockall() system call, and you can replace the original memory allocator with a customized one; as you might know, there are real-time-friendly allocators like TLSF. We also considered the multicore issue: a task can migrate from the local core to a remote core, and this migration costs additional latency. You can avoid the problem by using cpusets or by pinning tasks to specific CPUs. The final item is locks. As I mentioned before, spinlocks are replaced by rt_mutex, which can sleep; sleeping locks must not be used in atomic paths, that is, where local IRQs are disabled. rt_mutex uses priority inheritance: it is not just a mutex, it implements the priority-inheritance protocol to prevent priority inversion, and it is no longer just a futex, I mean a fast mutex. As a result, the cost of the new locks gets higher in general.

So before we do the real measurement, we have to prepare workloads in order to simulate the system and expose the problems.
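Several of the tuning items above (page faults, migration, scheduling policy) translate directly into code. Here is a small, self-contained userspace sketch, our own illustration rather than code from the talk, that applies three of them at process start: lock all memory to avoid page faults, switch to SCHED_FIFO, and pin the task to one CPU to avoid migration.

```c
/* Sketch: typical real-time process setup combining mlockall(),
 * SCHED_FIFO priority, and CPU affinity.  Error handling is minimal.
 * Build with: gcc -o rtsetup rtsetup.c   (run as root)
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	/* 1. Lock current and future pages to avoid page-fault latency. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE))
		perror("mlockall");

	/* 2. Real-time policy: SCHED_FIFO runs ahead of SCHED_OTHER. */
	struct sched_param sp = { .sched_priority = 80 };
	if (sched_setscheduler(0, SCHED_FIFO, &sp))
		perror("sched_setscheduler");

	/* 3. Pin to CPU 1 so the task never migrates between cores. */
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(1, &set);
	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");

	puts("real-time setup done");
	/* ... time-critical work would run here ... */
	return 0;
}
```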
There are many tools you can use; we just picked a few. The first, famous one is hackbench. Hackbench is just a single-source-file tool that stresses the scheduler and Unix-socket performance by spawning processes and threads. The next one is stress. Stress is a tool you can use to generate a controlled load on the system, which gives you an overview of the system-wide impact of each kernel configuration; this is handy because you can compare various configurations against each other. The last one is our in-house test, which evaluates a robot-control algorithm; the product it comes from is new and already in use in some factories in China. mctest is our customized program, and we use it to simulate that workload; the algorithm can run in both user mode and kernel mode, which we will talk about later.

Then there is cyclictest, the well-known program. How does it work? cyclictest measures the delta between when a wakeup is scheduled and when the task actually gets woken up. The motivation of cyclictest is to capture this timing data; it can use the high-resolution timer, and the collected data lets the user see the distribution of latency. If you check the diagram, you can see an interesting part: a long tail of latency, which shows that some paths in the kernel take a while to be preempted, critical sections during which the kernel cannot be interrupted. cyclictest is a good tool, but the disadvantage we want to mention is the loss of timing information about the latency events themselves: once the run is over, there is no way to reconstruct when they happened. So we were thinking about an efficient way to measure, such that we can eventually reduce the latency.

OK. Talking about profiling tools for Linux scheduling: traditionally they focus on the overall performance of the system. We use perf to understand which programs utilize CPU resources; this is achieved by periodically sampling the CPU's PMU counters and resolving the samples against the code to understand what is happening. That method works well for long-running CPU tasks with high-throughput workloads, but for real-time tasks we want to get down to the microsecond level and understand, at every single point, what the CPU is doing. In the paper "The Linux Scheduler: A Decade of Wasted Cores", presented at EuroSys last year, the authors developed a technique which logs Linux scheduler behavior. This allows us to trace scheduling decisions at microsecond granularity. It is accomplished by inserting profiling points, or profiling hooks, into the Linux scheduler, which trigger every time a point of interest is reached; typically that is a point where the scheduler state changes, like a run-queue increase or decrease, a context switch, a change of the currently executing process ID, and so on. The profiling data is stored as a text file and can be analyzed later on. The original authors provide a plotter which visualizes the run-queue size of each core in the form of a heatmap. For example, on the screen is a heatmap of a four-core Intel i5 processor executing hackbench. Profiling starts at the top left and goes through to the right; every pixel means 10 microseconds, and each row wraps every 10 milliseconds. By default, the profiling is triggered on run-queue increases and decreases, load-balancing events, and scheduling events in the scheduler. Of course, the hooks can be put at arbitrary points in the scheduler, so we can actually capture any kind of scheduling event.
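The cyclictest principle described above fits in a few lines. Here is a stripped-down sketch of the same idea, our illustration rather than the real cyclictest source: sleep until an absolute deadline with clock_nanosleep(TIMER_ABSTIME), then compare the actual wakeup time against the intended one; the difference is the wakeup latency that cyclictest histograms.

```c
/* Sketch of the cyclictest measurement principle: wakeup latency =
 * (actual wakeup time) - (programmed wakeup time).  Simplified; the
 * real cyclictest adds priorities, histograms, per-CPU threads, etc.
 * Build with: gcc -o wakejit wakejit.c
 */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL
#define INTERVAL_NS  1000000LL		/* 1 ms period */

static int64_t ts_ns(const struct timespec *t)
{
	return (int64_t)t->tv_sec * NSEC_PER_SEC + t->tv_nsec;
}

int main(void)
{
	struct timespec next, now;
	int64_t lat, max = 0;

	clock_gettime(CLOCK_MONOTONIC, &next);
	for (int i = 0; i < 10000; i++) {
		/* Program the next absolute deadline. */
		next.tv_nsec += INTERVAL_NS;
		while (next.tv_nsec >= NSEC_PER_SEC) {
			next.tv_nsec -= NSEC_PER_SEC;
			next.tv_sec++;
		}
		clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);

		/* How late did we actually wake up? */
		clock_gettime(CLOCK_MONOTONIC, &now);
		lat = ts_ns(&now) - ts_ns(&next);
		if (lat > max)
			max = lat;
	}
	printf("max wakeup latency: %lld ns\n", (long long)max);
	return 0;
}
```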
For example, we modified it to profile the context switches in the scheduler. On the screen is a sample of a Cortex-A9 running hackbench, and each of the light yellow stripes is where the CPU does a context switch. We also recorded the switched-to PID, so we can use other tools to analyze, microsecond by microsecond, what the CPU was executing. This is a sample of the context-switch points of a Cortex-A9 running stress and cyclictest together. We zoomed in to a larger scale: the blue box with red stripes is where cyclictest enters the run queue, and the red box with yellow stripes is where cyclictest has been context-switched in and is executing. That yellow stripe is on the order of 10 microseconds, and with other tools we developed we can analyze the time between the cyclictest real-time thread entering the run queue and being switched in, plotted as a histogram, and also the context-switch time, plotted as a histogram. So we know exactly how much time it takes, for each interval, from the moment the cyclictest thread enters the run queue until it is scheduled: that is the scheduling duration, the time between the task being enqueued and being put into execution. The scheduler profiling hooks let us understand how a task is treated by the scheduler at runtime under various load conditions.

OK, so let's get back to latency. How can we reduce the latency after we measure it? Here are some assorted tips on reducing latency on PREEMPT_RT. Preemption is disabled after acquiring a raw_spinlock, which is the real spinning-lock implementation in PREEMPT_RT. On a non-preemptive kernel, I mean the vanilla kernel, a spinlock spins until you get the lock, but on PREEMPT_RT all spinlocks were replaced by sleeping spinlocks, so you have to use a raw_spinlock when you really need the spinning behavior. Keeping preemption off for a long time is a problem, because it implies a higher-priority task cannot run, and PREEMPT_RT tries to make such critical sections preemptible. If you look at the details, even when interrupts are disabled you need to check whether a higher-priority task needs the CPU, so you can break out of the preemption-off section; disabled preemption effectively locks the CPU away from other tasks. And deep in the internals of PREEMPT_RT, the Linux mutex utilizes the OSQ lock, which will still spin under some conditions; OSQ means optimistic spin queue, optimistic spinning for sleeping locks.

If we think about the detail of IRQs again: IRQ threads. Every IRQ context in PREEMPT_RT is actually a kernel thread, a normal kernel thread, and IRQ threads are scheduled as SCHED_FIFO tasks with priority 50. The priority can be changed so that other real-time tasks, other more important tasks, can have higher priority. You should also avoid unnecessary spin_lock_irqsave(). For example, there is an implementation in the kernel where you can see the coalescing: the sequence is spin_lock_irqsave(), then raw_spin_lock(); then you do some operation; finally raw_spin_unlock() and spin_unlock_irqrestore(). In PREEMPT_RT, the semantics of the spinlock have changed: spin_lock_irqsave() does not disable interrupts under PREEMPT_RT_FULL. Consequently, you should restructure such code so that IRQs are disabled only around the raw spinlock where it is really needed, while the outer spinlock remains a sleeping lock; avoiding unnecessary raw spinning with IRQs off shortens the latency caused by locks.
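The lock-coalescing pattern is easier to see in code. Below is a schematic kernel-style sketch, our illustration rather than actual kernel source: my_lock, my_raw_lock, and do_update() are made-up names. The comments note how PREEMPT_RT changes the meaning of each step.

```c
/* Sketch of the coalesced locking pattern discussed above.
 * Hypothetical example: my_lock, my_raw_lock, and do_update()
 * are placeholders, not real kernel symbols.
 */
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(my_lock);	   /* sleeping rt_mutex under RT */
static DEFINE_RAW_SPINLOCK(my_raw_lock);   /* always truly spins        */

static void do_update(void)
{
	/* the tiny critical operation protected by the raw lock */
}

static void update_shared_state(void)
{
	unsigned long flags;

	/* On a vanilla kernel this disables local IRQs; under
	 * PREEMPT_RT_FULL, spin_lock_irqsave() does NOT disable
	 * interrupts, it just acquires the sleeping lock. */
	spin_lock_irqsave(&my_lock, flags);

	/* Only the raw lock creates a truly non-preemptible window,
	 * so keep it as short as possible to bound the latency. */
	raw_spin_lock(&my_raw_lock);
	do_update();
	raw_spin_unlock(&my_raw_lock);

	spin_unlock_irqrestore(&my_lock, flags);
}
```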
We also have to think about the overhead of system calls. There is a piece of research from OSDI about six years ago; the observation is that system calls leave room for improvement because of their synchronous exception mechanism. FlexSC, as the researchers named it, proposes a new implementation: exception-less system calls in the Linux kernel, plus a pthread-compatible package at user level which translates legacy synchronous system calls into exception-less ones. Eventually FlexSC improves performance considerably; you can see from the diagram that Apache, MySQL, BIND, and so on all improve with this approach. However, FlexSC hurts compatibility, so we use another way. Almost 20 years ago there was an interesting hack called Kernel Mode Linux, which enables the execution of user processes in kernel mode. The benefit of running a user program in kernel mode is that the program can access kernel address space directly, so you do not need to go through the system-call gate in order to invoke a system call: you can invoke the function directly, since everything is available in the kernel. This means a user program can invoke system calls very fast, because it is unnecessary to switch between kernel mode and user mode via costly exceptions or context switches. Unlike a kernel module, under Kernel Mode Linux the user program is executed as a normal process, apart from the privilege level, so scheduling and paging are performed as usual. You might think it is too dangerous to do such a thing, but in fact it is acceptable, because the process still gets isolated as a process under Kernel Mode Linux. We will show how we combine techniques like PREEMPT_RT with Kernel Mode Linux in the case study.

So let's go into the details. Talking about the case study, we are using these platforms: the first is the Altera Cyclone V SoC development kit, and the other is the NXP i.MX6Q SABRE SDB. Both use the Cortex-A9, one dual-core and the other quad-core, and both have 1 GB of DDR3 memory. The experimental configuration is a Buildroot-based system; the kernel version is 4.4, and we patched it with the PREEMPT_RT patches. Additional patches we apply are the Wasted Cores patches and the Kernel Mode Linux patches, depending on the experiment we are going to do; I will explain that when we come to it. The benchmark suite, as mentioned, is cyclictest and mctest; for cyclictest the arguments and parameters are given above, and both test benches run inside Kernel Mode Linux when it is enabled in our kernel. mctest measures the jitter of code execution time; it was developed to simulate a robot motion-control algorithm which, as mentioned earlier, can run either as a user-space program or in kernel space, and it outputs the execution time of each run. Here is an example: on the left is cyclictest, and by comparison mctest outputs the execution time of every cycle. From cyclictest we learn the jitter of the wakeups, and from mctest we learn the jitter of the execution times.

OK, so these are our experiments and measurements. The first is how Kernel Mode Linux benefits real-time performance. Then we measure the SMP scalability of Linux, to see whether there is anything we need to care about when using Linux as a real-time SMP system. Then we will talk about short inter-arrival times, and finally the scheduling duration.
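Since mctest itself is in-house, here is a minimal stand-in for its measurement idea, our own sketch: time a fixed compute kernel on every cycle with CLOCK_MONOTONIC and report the spread. busy_work() below is a placeholder, not the real motion-control algorithm.

```c
/* Sketch of the mctest idea: run the same deterministic workload each
 * cycle and record the execution-time jitter.  busy_work() is a
 * placeholder for the real robot-control algorithm.
 * Build with: gcc -O2 -o jitter jitter.c
 */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static volatile double sink;

static void busy_work(void)
{
	double acc = 0.0;
	for (int i = 1; i <= 50000; i++)	/* fixed amount of math */
		acc += 1.0 / i;
	sink = acc;				/* defeat optimization  */
}

int main(void)
{
	struct timespec a, b;
	int64_t t, min = INT64_MAX, max = 0;

	for (int cycle = 0; cycle < 1000; cycle++) {
		clock_gettime(CLOCK_MONOTONIC, &a);
		busy_work();
		clock_gettime(CLOCK_MONOTONIC, &b);

		t = (b.tv_sec - a.tv_sec) * 1000000000LL
		  + (b.tv_nsec - a.tv_nsec);
		if (t < min) min = t;
		if (t > max) max = t;
	}
	/* jitter = spread between fastest and slowest identical run */
	printf("min %lld ns, max %lld ns, jitter %lld ns\n",
	       (long long)min, (long long)max, (long long)(max - min));
	return 0;
}
```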
So first, the Kernel Mode Linux impact on real-time performance. We ran this experiment on the i.MX6. We set up CPU1 as isolated and tickless, which means there are no timer interrupts on CPU1, and we used the cache-lockdown technique, which locks level-2 cache ways to CPU1. The load is a combination of hackbench and netperf, and the test bench we use is mctest. For the user-space mctest, which is not running in Kernel Mode Linux, we can see a lot of jitter and spikes, while the kernel-space mctest is much more stable. The reason is the impact of system calls: the cost of each system call is very high, so the user-space mctest generates a lot of spikes during its execution. If we move the user-space mctest into Kernel Mode Linux, we see that the spikes have gone away, and the overall execution time is very stable and comparable to the kernel-space mctest. So we can see how Kernel Mode Linux removes the penalty of system-call cost.

Now I am going to talk about the SMP scalability of the PREEMPT_RT-patched Linux kernel. We ran this on the Cyclone V SoC, and the load is stress with three threads. You can see that each heatmap row wraps at 10 milliseconds, and by default the Linux scheduler rebalances the load roughly every 10 milliseconds, so short bursts of tasks and context switches do not get rebalanced. That leaves a lot of gaps, shown in green, where only one task is running on one CPU while the other CPU is running over five tasks. The Wasted Cores patch does not help with this, because it does not alter the scheduler's rebalancing time scope; it still rebalances every 10 milliseconds. This is considered okay for long-term, high-throughput workloads, because after some time the imbalance amortizes and throughput is still good. But for real-time tasks like cyclictest, we found that cyclictest was scheduled and woken up every time on the overloaded core even though the other core had much more free resources, and this can bring unwanted overhead to a real-time system. Later we also measured the scalability impact of Kernel Mode Linux: since KML targets reducing system-call overhead, there is actually not much difference between the patched and unpatched kernels here.

Next, short inter-arrival times. Since we are able to catch scheduler events at the microsecond level, we captured events like this one: two tasks entering the run queue with a very, very short inter-arrival time. The two tasks are the timer IRQ thread of the system and the cyclictest main thread, and you can see that at certain times the two match almost exactly and enter the run queue together, costing the scheduler some overhead. Another showcase is the main thread of cyclictest and the real-time thread of cyclictest: sometimes they come very close together, within about 10 to 20 microseconds, and almost enter the run queue together. The benefit of visualizing task inter-arrival times is that short inter-arrival can cause IRQ bottom halves or tasks to be delayed; we would like to build a system with constant and rather large inter-arrival times, which lowers the scheduler overhead and lowers the system's uncertainty.

Finally, I would like to talk about the scheduling duration, which is the time the scheduler takes from enqueueing a task until it is context-switched in for execution.
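Our profiler's on-target format is not shown here, but the same scheduling-duration metric can be approximated offline from a standard ftrace capture of the sched events. Below is a rough post-processing sketch, our illustration: it assumes the default ftrace text layout for sched_wakeup/sched_switch lines, which varies across kernel versions, and pairs a task's wakeup with its next switch-in to print the delta.

```c
/* Sketch: approximate "scheduling duration" (wakeup -> switch-in) for
 * one PID from an ftrace text capture of sched_wakeup/sched_switch.
 * Assumes the default trace line layout, e.g.:
 *   <comm>-<pid> [001] d..3 123.456789: sched_wakeup: comm=... pid=970 ...
 *   <comm>-<pid> [001] d..3 123.456801: sched_switch: ... next_pid=970 ...
 * Usage: ./schedur trace.txt 970
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
	FILE *f;
	int target;
	char line[512], *p;
	double wake_ts = -1.0, ts;

	if (argc < 3) {
		fprintf(stderr, "usage: %s <trace file> <pid>\n", argv[0]);
		return 1;
	}
	f = fopen(argv[1], "r");
	if (!f) { perror("fopen"); return 1; }
	target = atoi(argv[2]);

	while (fgets(line, sizeof(line), f)) {
		/* 4th whitespace-separated field is "<timestamp>:" */
		if (sscanf(line, "%*s %*s %*s %lf:", &ts) != 1)
			continue;

		if ((p = strstr(line, "sched_wakeup:")) &&
		    (p = strstr(p, " pid=")) && atoi(p + 5) == target) {
			wake_ts = ts;		/* task entered run queue */
		} else if (wake_ts >= 0 &&
			   (p = strstr(line, "sched_switch:")) &&
			   (p = strstr(p, "next_pid=")) &&
			   atoi(p + 9) == target) {
			/* time spent runnable before being switched in */
			printf("%.1f us\n", (ts - wake_ts) * 1e6);
			wake_ts = -1.0;
		}
	}
	fclose(f);
	return 0;
}
```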
The graph at the top is cyclictest running with a one-millisecond interval; it is a bit hard to see, but it context-switches at every yellow stripe, and the system is running with no load. The graph below shows the scheduling duration for each cycle of cyclictest: the orange bars mean that cyclictest was the only task in the run queue, and the red ones mean there were two or more tasks in the run queue, so the scheduler had to pick among them. We find that the scheduling duration is pretty low, usually below 10 microseconds, and from the histogram we find that the mode is mostly below 10 microseconds and the maximum is about 20 microseconds.

For the next one, we run with a mild load, which is stress with three CPU threads, again with cyclictest at a one-millisecond interval. We find that the scheduling duration is still pretty low, as low as about five microseconds, with some spikes. The bars are all red because cyclictest is always sharing a core with the stress tasks. From the histogram view we can also see that the maximum scheduling duration is 16 microseconds.

It gets interesting when we run it again with a heavy load, which is stress with eight CPU threads, and cyclictest at a one-millisecond interval. We find a lot of context switches, and the scheduling duration is not as stable as in the previous results: it varies from 10 microseconds to 30 microseconds and is not a constant value. From the histogram view we find that it creates another peak at about 27 microseconds, and the maximum has gone all the way up to 34 microseconds.

We also found that not only heavy load will cause increased scheduling duration, but short-burst tasks will too. In this example, CPU0 is executing the kernel-mode mctest and CPU1 is executing the cyclictest real-time thread and main thread together. When the two cyclictest threads come to the scheduler with a rather short inter-arrival time, the scheduling duration of the real-time thread rises to about double the time compared with running solo on the CPU, and in the histogram view the maximum has gone all the way up to the 33-microsecond scale.

So we have observed that when the scheduler enqueues a high-priority task into the run queue, a period of scheduling duration is required before the task is switched in for execution. The duration between entering the run queue and the context switch was at most about 35 microseconds, depending on the load. The scheduler the Linux kernel uses guarantees O(1) selection of the next task to execute, but after identifying the next task, the scheduler still requires some extra time, which varies with the load on the run queue. And the shorter the inter-arrival time, the wider the scheduling-duration distribution spreads.

To summarize: we evaluated the real-time behavior of Linux by profiling the scheduler and measuring the latency under various loads. Variance and intensive interrupts can cause long latencies due to the design of the interrupt-processing mechanism in Linux. We propose new tools to visualize task scheduling at a fine-grained, microsecond scale; this enables us to focus not only on interrupt latency but also on scheduling duration and locks. Going forward, it is highly desirable to combine the existing techniques, for example Kernel Mode Linux, isolated CPUs, and the tickless kernel, to improve task responsiveness according to the target application's characteristics
on top of PREEMPT_RT.

Here are our references. If you check the references page, you can see I collected some related talks. The first one, by Steven Rostedt, covers what a real-time system is and how PREEMPT_RT is actually implemented. The next one is quite interesting: an engineer from Hitachi presented a rigorous study of the real-time properties of vanilla Linux, about three years ago. The third one is a talk about real-time versus throughput; the talk is about five years old, but it is still essential from a design point of view. Of the last two, the first is the EuroSys paper, and the last one is the OSDI paper I mentioned earlier about how to reduce the overhead of system calls. The tool we used to measure the latency, which we mentioned before, will be open source; you can check out the GitHub link after the slides get updated, because the ones you download from the conference website are a preliminary version. We will upload a new version with the GitHub link and the reference setup, so you can prepare your own profiling data and visualize it using the tool. So, do you have any questions?

OK, the question is which kernel command-line parameters and build configurations we use, right? Yes, OK. We use PREEMPT_RT configured with full preemption, we disable some unnecessary debug drivers, we choose between tickless kernel or not, and we isolate CPUs; you can use cpusets to specify the IRQ affinity. Also, we use a customized memory allocator for the higher-level applications, I mean mctest, our robot-control simulation. Other things are as usual. As you might know, the Kernel Mode Linux author also has a website which publishes these patches, along with a paper discussing the performance and how it is implemented on x86 chips.

OK, there are two questions. The last one is how we can measure the latency impact of the top-half and bottom-half IRQ processing, right? OK, let me answer this one first. Since we are looking from the microscopic perspective, everything gets very small, so we have to use an efficient tool like Ftrace to figure out the impact of the top half. We have to determine the execution time spent specifically in the top half, and the potential locks; otherwise we could not measure how the softirqs and other bottom-half work are scheduled by ksoftirqd.

What was your first question? A periodic test, not just cyclictest? OK, the question is about how we can generate a workload reflecting the actual system, right? This is a good question, because we always face the problem that cyclictest is too simple and too naive, and that is the reason why we wanted to implement our own test bench. You can see from the diagram that this is a product made by Delta Electronics in Taiwan, a SCARA robot-arm controller, and the motion algorithm it implements is shared between the real product and mctest. So we use a real algorithm as the simulation target, as the workload. I think it is not always a good mechanism to use this approach, but we think it is a real case.
You might think about more stressful, more reflective ways, but this is exactly the reason we wanted to give this talk: we are facing the same problem.

Excuse me, can you repeat that again? FlexSC. So you are asking whether anybody is discussing the FlexSC paper I mentioned earlier? I think there are only a few discussions about that OSDI paper, but you can check Hacker News; I checked last night, and there were a few discussions there. It is in fact an early prototype, so it is not stable, and that is the reason why we use Kernel Mode Linux instead: Kernel Mode Linux has been around and verified for more than 10 years, and its impact on the kernel is smaller, although its performance gain is not as large as that of the OSDI research paper.

OK, I think time's up. Thanks again for your attention. Thank you.