Hello everybody. Welcome to my little talk about tracing. My name is Andreas Glinger. I'm working independently as an embedded software engineer, but originally I'm an electronics engineer, and I'm emphasizing this because this is the way electronics engineers do measurements, in particular time measurements. That's exactly what I did in this investigation: I measured the overhead of the tracing framework, and I did it the way we do it in electronics, with a scope. That's why I'm emphasizing it.

So I want to talk about how I did the measurement, but I also want to show you how it is done with the scope, how it is set up, because you can use the same measurement principle for many other things, not just for measuring the cost of tracing. You can use it, for example, to measure how long a protocol stack takes. Just by setting GPIOs and measuring them with the oscilloscope, you can measure the boot time, how much time is spent in the boot process. You can toggle a GPIO in the boot loader, in the Linux kernel, or wherever you want. You can toggle GPIO pins almost everywhere, and that's the idea behind this talk.

So, let's start. This is the agenda. I want to talk about the scope of the measurements, what I'm doing, and then I want to go systematically through the different tracing mechanisms we have, and at the end give a summary and some conclusions.

So probably everybody here knows where this comes from. You don't know? No, you can't trust it. Okay, so you're already answering the question: can we trust the values here? This is output from the function graph tracer; it's a snippet, cut down to fit on the slide, not the complete output. What we can see here are the times spent inside these functions, and also the times for the functions calling them. And the question I asked myself is: can we trust these values?

Why did I ask this question? In my last, still ongoing project, I was facing performance problems, so I had to do performance optimizations in our system, and I used the tracing framework for that. I'm using the tracing framework a lot, really on a daily basis, and I was asking myself whether the measured time values, for example here in the function graph, but also with the tracing events, and when applying filters and triggers and kernel probes and all the other things you can do there, are the real time values. Can I take these time values at face value, or are they more or less rough estimates? I wanted to get an idea of what the real overhead is. That's why I made this investigation. But we will not jump directly to the function graph tracer; we want to start a bit more systematically.

So one motivation is performance optimization. Another important point is: when we are delivering our product, when our product is finished, can we afford to deliver it with the tracing framework or not? What is the overhead? And this is something you can actually only measure from the outside, with the scope: you build your product once with the tracing framework compiled in and once with tracing compiled out, and then you compare. So yes, that's a question we will answer in a couple of slides.
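Before that, here is the measurement principle from a moment ago in its smallest form: bracket the code you care about with GPIO edges and let the scope report the elapsed time as the pulse width. A minimal user-space sketch of the idea, assuming a pin that was already exported through the legacy sysfs GPIO interface (pin 70 is just an example; any free pin works):

```c
/* Bracket a region of interest with GPIO edges; the oscilloscope then
 * shows the elapsed time as the width of the positive pulse.
 * Assumes the pin was prepared beforehand, e.g.:
 *   echo 70  > /sys/class/gpio/export
 *   echo out > /sys/class/gpio/gpio70/direction
 */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/sys/class/gpio/gpio70/value", O_WRONLY);
    if (fd < 0)
        return 1;

    write(fd, "1", 1);   /* rising edge: start of the measurement */
    /* ... code under test: protocol stack, boot step, library call ... */
    write(fd, "0", 1);   /* falling edge: pulse width = elapsed time */

    close(fd);
    return 0;
}
```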
So I also used different kernel configurations, and I played around with them a lot. At the beginning I played around with the kernel configuration to find an optimized kernel for the hardware I have. It works like this: I was toggling GPIOs; I changed the kernel configuration, deployed the kernel, measured how wide the pulse is, then changed the configuration again, deployed that kernel, and compared. In the end I had a big sheet of paper with the different kernel configurations, and from that I figured out the optimum kernel configuration for my board. And that is actually the starting point; this is where I started investigating how much the tracing costs.

What I have put into the slides here is only one knob, which can be either CONFIG_PREEMPT or CONFIG_PREEMPT_RT. So I want to do a simple comparison between a kernel with plain preemption and one with preemption and real time. There are many, many more knobs, as you know; I have investigated many of them, but they are not in the slides.

And there might also be a difference in how you use the trace points. Nowadays we have many possibilities for using the tracing events: we can use them through the interface in the tracefs file system, we can use the trace-cmd or perf programs, or we can use them with eBPF. That's also a very nice feature, and especially with eBPF I did some comparisons.

So there's a lot of source code, configurations, and oscilloscope dumps involved, as well as the presentation itself, and I have put everything in the cloud. Here on the last three lines you can see the name of the cloud, the URL, the user, and the password. I will keep it in the cloud for a couple more weeks, so you have enough time to download it if you are interested in the complete details of everything.

So now let's go systematically through what I have measured. First of all the tracing event, and of course it's the tracing event gpio_value, because this is the one we can monitor simply with the GPIO toggling. Then I did the kernel probes. I used the histogram trigger for measuring a time span, then the synthetic events for generating a histogram over these time spans. I used trace_printk, and at the end, from the kernel side, the function and function graph tracers.

Most of these tracing facilities can be combined with filters and/or triggers. I also measured this, but I only brought the example with the tracing events into the slides, to show how much applying a filter or a trigger costs. You can also combine them with the other tracing facilities; what I have seen is that it's quite comparable, so I didn't see any benefit in bringing all of it into the slides.

I also investigated from user space: with the syscall events sys_enter_ioctl and sys_exit_ioctl, so I'm toggling from user space and monitoring those two events. I used uprobes from user space, I used trace markers, and I also used the strace program. That is a completely different story, of course, strace, clearly. But I wanted to include it because with strace you can also do time measurements, so you can see the difference. It's a completely different mechanism.
There are context switches and everything else going on there, so it's just to have an idea of how much the overhead is.

This is the board I used: a BeagleBone Black running at 1 GHz. These are the kernel versions; I used the latest long-term support kernel that was available when I did the measurements. First of all, I created a kernel that was optimized for performance. And I'm not doing worst-case measurements in the sense of a real-time system; these are not real-time worst-case values that I'm measuring here. I'm taking the average value; that's what I'm interested in. When I'm doing performance optimization, I want to know the average overhead I have to bring into my considerations.

So this is a snippet of the kernel driver I wrote for this. It's a very simple kernel driver; you can get it from the cloud if you want. The real driver is larger; this is just a snippet to show what I'm doing. I'm doing the toggling: the function toggle_gpio_sleep is a kernel thread. This kernel thread runs as a real-time task, all the time, as fast as possible, and in this endless while loop I'm doing the toggling five times, calling the function toggle_once.

Maybe you're asking why I'm doing it just five times. After the five times, I'm doing the msleep, and this is just because of the real-time throttling: I didn't want to see the real-time throttling in the scope output. Of course, I could switch off the real-time throttling, but then the question is how to remove this driver without a reboot. So for me the easiest thing was just to put in the msleep, to relax the CPU and avoid running into the real-time throttling. So you would see that happening too? Yes. So preemption and interrupts are still enabled? Yes, yes.

I created the function toggle_once to be able to use kprobes on it; that's why I have this function. Otherwise, of course, I could have put the toggling directly into the loop right there.

And I'm not measuring on the first cycle. Measuring on the first cycle means measuring a scenario with a cold cache, and as we have seen yesterday with Jan, with a cold cache there are many things that can happen. That's not what I want to measure here; I want to measure the overhead of the tracing, not the effects of the cache. That's why I'm measuring on the second pulse, the second cycle.

And why am I toggling five times? Simply because when I toggle five times, I have a nice graph on the scope; with just two toggles it's not that nice. That's the reason. And I'm measuring with 30,000 waveform counts. This was a compromise on how long to wait for the output, because the scope is not able to measure that fast, and the measurement takes some time. But as a check I also ran the whole thing 500 times to verify that the result stays the same. Running it 500 times means running the oscilloscope overnight into the next day; that's why it's 500 times.
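Putting those pieces together, the driver looks roughly like this. This is a hypothetical reconstruction, a minimal sketch only; the real, larger driver is in the cloud archive, and details such as the GPIO lookup and the msleep duration are my assumptions:

```c
/* Minimal sketch of the measurement driver described above. */
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/gpio/consumer.h>
#include <linux/sched.h>

static struct gpio_desc *pin;     /* test pin; the gpiod_get() lookup is elided */
static struct task_struct *task;

/* separate function so a kprobe can be attached to it */
static noinline void toggle_once(int value)
{
	gpiod_set_value(pin, value);
}

static int toggle_gpio_sleep(void *data)
{
	int i;

	while (!kthread_should_stop()) {
		/* five pulses make a nice burst on the scope; the
		 * measurement is taken on the second pulse (warm cache) */
		for (i = 0; i < 5; i++) {
			toggle_once(1);
			toggle_once(0);
		}
		msleep(10);   /* relax the CPU so RT throttling never kicks in */
	}
	return 0;
}

static int __init toggle_init(void)
{
	task = kthread_run(toggle_gpio_sleep, NULL, "gpio-toggle");
	if (IS_ERR(task))
		return PTR_ERR(task);
	sched_set_fifo(task);   /* run the kthread as a real-time task */
	return 0;
}
module_init(toggle_init);

static void __exit toggle_exit(void)
{
	kthread_stop(task);
}
module_exit(toggle_exit);

MODULE_LICENSE("GPL");
```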
So this is the output of the oscilloscope that you will see later: we have the positive pulse, this is T_high, and we have the negative pulse. I'm taking the positive pulse for comparison, and I'm also calculating the duty cycle; you see it here at the bottom. Oh, to be honest, I'm not calculating it; the scope is calculating the duty cycle.

Why is this interesting? Because the duty cycle should of course be around 50%. But sometimes it is not around 50%, and then we have to ask ourselves why, what's the reason for this. And there are some outputs where some additional explanation is needed, exactly because of this.

So now we are at the first real output, a real comparison. Here I'm comparing kernels with a minimal configuration, without tracing compiled in. On the left side it's a minimal kernel without real time, and on the right side it's with real time and preemption. You can see the pulses here, and this is what the scope has measured: the mean time is 141 nanoseconds versus 145 nanoseconds for the positive pulse. For the negative pulse it's the other way around: it's longer for the standard kernel than for the real-time kernel. But the duty cycle seems to be more precise with the real-time patch. So this is not a significant difference, and we can say the real-time kernel does a good job: it produces almost the same timing, and it even seems to be a bit more precise, a bit stricter in its toggling. You can get all these configurations from the cloud; these are the names of the files. And in this comparison, just this one real-time knob is changed.

Sometimes you can see there are two waveforms here. This does not mean anything; it's because of the scope. There's one point in time when I press print screen on the scope, and sometimes I get one waveform and sometimes two. It's a feature of the scope; it has nothing to say.

Because of this, I selected the real-time kernel, and the real-time kernel without tracing compiled in takes 145 nanoseconds for one switch. Now let's take this kernel and switch on the tracing. These are the knobs I switched on, and with these knobs I made the whole presentation. Of course, you could say not all the knobs are necessary if you are doing a simple trace event; but I wanted to have just one configuration with which I could do the whole presentation. Each of these knobs, of course, has some effect, that's clear.

With these knobs, this is the output. The difference is 145 to 168 nanoseconds, so 23 nanoseconds, which is 16%. I did the same with the standard kernel, where it was 19%. So this is the overhead that comes in if we deliver our systems with the tracing framework compiled in, here on the real-time kernel. Very often we want to deliver our systems with the tracing framework, because it offers a lot of facilities for doing measurements, for error detection and that kind of thing. And this is the cost, as I have measured it. Of course, this is for this hardware; for different hardware it can be different, that's clear. But the setup is quite simple, so everyone can do this measurement. And because on our project we are very, very short on computation time, we are delivering without the tracing framework, unfortunately, because we need to save those last 16%.

So it's not the tracing as such, it's the function tracing that causes the overhead? Yes, it adds padding to the functions. So, as I said: every knob adds some overhead.
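The exact knob list is in the cloud archive; purely as an illustration, a configuration of this kind combines the real-time baseline with the ftrace-related options, roughly like this (a plausible minimal set covering what the talk measures, not necessarily the list from the slides):

```
CONFIG_PREEMPT_RT=y            # the real-time baseline kernel
CONFIG_FTRACE=y                # the tracing framework itself
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_DYNAMIC_FTRACE=y        # patch the call sites in and out at runtime
CONFIG_FTRACE_SYSCALLS=y       # sys_enter_*/sys_exit_* events
CONFIG_KPROBES=y
CONFIG_KPROBE_EVENTS=y
CONFIG_UPROBE_EVENTS=y
CONFIG_HIST_TRIGGERS=y         # histogram triggers
CONFIG_SYNTH_EVENTS=y          # synthetic events
CONFIG_BPF_SYSCALL=y           # needed for the eBPF/ply measurements
```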
So this is the important value with which we will continue: 168 nanoseconds. This is also our reference configuration. With this reference configuration, I'm now switching on the tracing events. This is the matrix with all the values and their standard deviations. So now I'm switching on the tracing event, on the reference configuration: the gpio_value event. And as you can see, with the tracing event switched on, we are at 1.36 microseconds. On Monday I heard about 100 nanoseconds, but that was probably about Intel systems. I have verified this measurement with other ARM systems, since I'm working mostly on ARM, and this is realistic: we are at about 1 microsecond. That's more or less the case, and that's the overhead this tracing event brings with it.

Depending on what you're doing, this is a lot or it isn't. For most purposes, 1 microsecond is not that much; you can still live with it. But if you are doing GPIO bit-banging drivers — I also did some drivers where we are generating the clock with GPIOs — and you want to trace something like that... So basically you have a trace point and a jump label? No, no, no. Okay, thank you.

Now, with this tracing event, I also tried many events. I changed the driver so that it contains many individual events that I can enable. I created up to 10 individual events, and we can see the time is increasing linearly. Yes, that's what I expected; I just wanted to verify it.

Then I used filters. I was filtering on GPIO number 70, which is the GPIO I'm printing out right here, and with the filter on GPIO 70 the additional overhead is 218 nanoseconds. And with a trigger enabled — here I enabled a simple traceon trigger, just to see how much it costs — the overhead is another 120 nanoseconds. We can also combine filters and triggers; then we can just add the numbers up. I don't have slides about this, but it's just a matter of summing it up if we are combining them.

So now I was using the histogram trigger, and I created a synthetic event. What I wanted to do is measure the pulse width with the histogram trigger; that's the purpose of this. So what I'm doing with the scope, I'm now doing in software, exactly the same, and I want to see whether there's a difference or not. I created a synthetic event here: on the rising edge I'm remembering the timestamp, the common timestamp, and on the falling edge I'm calculating the time delta, and if it matches, I'm raising this synthetic event, this gpio_high, which is defined right here. The output looks like this, and the latency we get here is the measured positive pulse width as the histogram trigger has measured it, which is 2.4 or 2.3 microseconds. So you can see it's already a bit more than it was in reality. And the output of the scope looks like this: now we are at 4.77 microseconds. So it adds some overhead.
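In terms of the tracefs interface, the steps just described look roughly like this. This is an untested sketch driven from C, with the trigger strings following Documentation/trace/histogram.rst; the names ts0, lat, and gpio_high are my own choices, and in the talk each step was of course measured separately, not all enabled at once:

```c
/* Drive the tracefs event interface from C: plain event + filter +
 * trigger, then the synthetic-event / histogram-trigger pulse-width
 * measurement. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define TRACEFS "/sys/kernel/tracing/"

static void write_str(const char *file, const char *val)
{
	char path[256];
	int fd;

	snprintf(path, sizeof(path), TRACEFS "%s", file);
	fd = open(path, O_WRONLY | O_APPEND); /* append: trigger files can hold several triggers */
	if (fd < 0 || write(fd, val, strlen(val)) < 0)
		perror(path);
	if (fd >= 0)
		close(fd);
}

int main(void)
{
	/* the plain tracing event, plus the filter and trigger timed above */
	write_str("events/gpio/gpio_value/enable", "1");
	write_str("events/gpio/gpio_value/filter", "gpio==70");
	write_str("events/gpio/gpio_value/trigger", "traceon");

	/* synthetic event carrying the measured pulse width (ns) */
	write_str("synthetic_events", "gpio_high u64 lat");

	/* rising edge: remember the timestamp ... */
	write_str("events/gpio/gpio_value/trigger",
		  "hist:keys=gpio:ts0=common_timestamp if value==1");
	/* ... falling edge: compute the delta and fire the synthetic event */
	write_str("events/gpio/gpio_value/trigger",
		  "hist:keys=gpio:lat=common_timestamp-$ts0:"
		  "onmatch(gpio.gpio_value).gpio_high($lat) if value==0");

	/* histogram over the pulse widths, 100 ns buckets, sorted by latency */
	write_str("events/synthetic/gpio_high/trigger",
		  "hist:keys=lat.buckets=100:sort=lat");
	return 0;
}
```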
And the amazing thing, maybe, is that the duty cycle is now not around 50% anymore; as you can see, it's not symmetrical anymore. So what's the reason? One might think: on the rising edge we are just remembering the timestamp, and on the falling edge we are calculating the delta and inserting the synthetic event, so it should be the other way around. Yes, it should be; but it's not the case. The reason is that if you look into the kernel sources, you can see that the event is raised just before actually calling the driver that switches the GPIO to low. So the overhead is not included in the negative pulse; it's included in the positive pulse, and the insertion of the synthetic event happens there too. That's why it looks like this. And that's exactly why I'm also looking at the duty cycle: to be able to clarify such situations. It's not a bad thing; it's just how it is implemented. I think we can live with it. So the additional overhead the histogram trigger brings with it is 3.4 microseconds in comparison to the normal tracing event. And this is the function in the kernel sources where you can see what I just explained.

Now I want to create a histogram out of this information, and for that I just need to add a histogram trigger to the synthetic event, with latency buckets of 100 nanoseconds, sorting by this latency value. The output looks like this: for each time bucket we get the number of hits; I've shortened it here. What I did then is some awk processing to bring it into a form that can be plotted with gnuplot, and I also calculated with awk the average pulse width, with the standard kernel and here with the real-time kernel. The average pulse width I calculated out of this data is 2.8 microseconds. And this is the gnuplot diagram; sorry, yes, it's logarithmic on the Y scale. This is the output from the scope: now we are at 5.9 microseconds, with the same duty-cycle effect as before.

So now I did the same with eBPF. We can also use the gpio_value event there: when the value of this event is 1, I am once again remembering the time, and when the value goes to 0, I'm calculating the time difference and doing the quantize. I'm using the program ply; ply is a very small program, created especially for small embedded systems, for interfacing with eBPF. With ply, I'm getting this histogram. The histogram is not as nice as the one we had with the tracing framework and the synthetic events. The reason is that ply, by its implementation, is only able to use buckets that increase in powers of two; it uses a bitmask internally, and that's why this output is not as nice as with the tracing framework. But the overhead is a bit less than the way I used the tracing framework before.

So now let's go to the kprobes. As you can see here, I used them on the function gpiod_set_value. This is the function which is called whenever the GPIO is toggled to 1 or to 0, so it's comparable to our gpio_value event. And I'm printing out a string and a value. This is the output I got right here: it's about 2.96 microseconds. So the overhead compared to the unused configuration — now we are going back to comparing against the kernel with tracing compiled in but nothing enabled, not against the tracing event — is 2.79 microseconds. That's the overhead of the kprobe.
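A kprobe like that can be created through the same tracefs interface. A sketch, where the probe name gpiod_set and the $arg2 fetcharg (the value argument; this syntax needs a reasonably recent kernel) are my assumptions:

```c
/* Create a kprobe on gpiod_set_value and enable it (sketch). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY | O_APPEND);

	if (fd < 0 || write(fd, val, strlen(val)) < 0)
		perror(path);
	if (fd >= 0)
		close(fd);
}

int main(void)
{
	/* probe the function called on every toggle; record its 2nd argument */
	write_str("/sys/kernel/tracing/kprobe_events",
		  "p:gpiod_set gpiod_set_value value=$arg2");
	write_str("/sys/kernel/tracing/events/kprobes/gpiod_set/enable", "1");
	return 0;
}
```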
Once again, I did the linearity check, and yes, it's linear. I created a special driver in which the function is called 10 times, so we had a backtrace of 10 functions, and I added the kprobes one by one and measured the time values. Then I did the kprobes with eBPF: 3.68 microseconds, which is — let me go back — more than with the plain kprobes. But there's a little bit more logic in there.

So these are the tracing events and the kprobes. Now let's go to the function and function graph tracers. I enabled the function tracer here with the raw interface, just writing "function" to the current_tracer file, and we are getting this output. I'm also doing filtering: I'm filtering for gpiod*, and when filtering, I'm getting three output lines, three functions shown, for each toggle. So on the left side you can see that one toggle takes 4.64 microseconds without filtering, and on the right side it's 3.38 microseconds with filtering.

What I have seen with the function tracer is that as soon as just one function is traced, there's a certain amount of overhead, around 3 microseconds, and that's most of the overhead. With each additional traced function there is additional overhead, but it's not increasing that much. So with the function tracer, as soon as we are tracing the first function, we already have most of the overhead, and with each additional function the overhead increases only slowly. It does not make much difference whether you trace a lot of functions or only a couple; as soon as it's activated, the bulk of the overhead is already there.

Doing the same with the function graph tracer: I selected once again the same example, gpiod_set_value, with the function calling it here. The function graph tracer tells us 16.292 microseconds; we already know that this is a little more than in the real world, and looking at the scope, we can see the output is at 12 microseconds. And with the function graph tracer, I didn't see a difference in the positive pulse width whether I'm filtering on this function or not; in the negative pulse width there was a little difference. So with the function graph tracer, I saw only little difference when filtering or not.

From the audience: set_graph_function will actually trace and go through all the overhead of the function graph tracer, and only decides whether or not to write to the buffer. If you use set_ftrace_filter on just that function, then only that function gets called; and usually the leaf functions are the most accurate. set_graph_function is only there so you can see what's being called within a single function; it doesn't make anything faster, and I actually tell people not to trust the numbers. So if you want to speed things up, don't use set_graph_function; you have to use set_ftrace_filter. It's expected that the result is exactly the same, because all the overhead is still there: by using set_graph_function, you did not remove anything that causes the overhead.

Okay, thank you very much. I will repeat it at home and try whether it makes a difference. Thank you.
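To connect that discussion with the actual tracefs files, here is a sketch of the two variants, again written as tracefs writes from C; per the explanation above, only the first one actually reduces the overhead, the second only narrows the output (run one or the other, not both):

```c
/* Select the function / function_graph tracers via tracefs (sketch). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define TRACEFS "/sys/kernel/tracing/"

static void write_str(const char *file, const char *val)
{
	char path[128];
	int fd;

	snprintf(path, sizeof(path), TRACEFS "%s", file);
	fd = open(path, O_WRONLY);
	if (fd < 0 || write(fd, val, strlen(val)) < 0)
		perror(path);
	if (fd >= 0)
		close(fd);
}

int main(void)
{
	/* variant 1: function tracer, hooks patched in only for gpiod* --
	 * this is the filtering that actually removes the overhead */
	write_str("set_ftrace_filter", "gpiod*");
	write_str("current_tracer", "function");

	/* variant 2: function graph tracer; set_graph_function only
	 * narrows what is written to the buffer, the overhead stays */
	write_str("set_graph_function", "gpiod_set_value");
	write_str("current_tracer", "function_graph");
	return 0;
}
```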
So I think we are a little bit short on time. Five minutes. Okay. So, yes, I did trace_printk. With trace_printk I got 1.22 microseconds. I was surprised; I thought it would be more than a tracing event. It's not. You're not surprised? From the audience: No, it's made for debugging, for debugging race conditions, so it was highly optimized to be very, very fast. Yes. So it's quite fast, even faster than the tracing event. But this trace_printk is not printing any values; it's just a string being printed out.

And this is the summary of the overhead. This is the system with tracing compiled in; it takes 1.19 microseconds with the gpio_value event enabled, and then you can see the additional overhead with the histogram and the synthetic event, with triggers and filters on the tracing event, and with the other tracing facilities we have here.

So now let's go quickly through the user space. I also did the toggling in user space; this is once again a small program in user space, and from user space it took me 1.58 microseconds to toggle once, without any tracing. What we can see here is the difference between 0.19 and 1.58, so about 1.4 microseconds: this is the overhead that comes from the syscall and the VFS, the overhead of calling a syscall, and this is the overhead we always have in user space. We can also use this type of measurement for measuring the overhead of a syscall, of sysfs, of libraries and all that kind of stuff; that's also what I'm doing sometimes. But this is now our reference for user space: 1.58 microseconds.

Now we are enabling the tracing events sys_enter_ioctl and sys_exit_ioctl, and we are getting 3.94 microseconds, which is an overhead of 2.36 microseconds. But we need to be careful: there are two tracing events for one toggle, so to compare it with the gpio_value tracing event, we need to divide it by two, and then we have an overhead of around 1.2 microseconds. So it's in the same range once again.

I also used the uprobes. And I used them with the perf program, because perf nicely calculates all the offsets and that kind of thing, so we don't need to use objdump. With the uprobe, we are getting 16 microseconds — so, yes, quite a bit more than with the kprobes in the kernel, where we were around two-point-something microseconds. And I'm doing the uprobe only on the rising edge, and once again we have the effect that the overhead lands in the high phase of the pulse. So we have 14 microseconds of overhead.

I did the trace marker; I did it the way it's documented in the kernel sources, and with the trace marker I measured 5.81 microseconds. Oh, sorry, not the overhead: that was the time, and the overhead was 4.2 microseconds for one trace_marker write from user space.

And finally, I used the program strace, and with strace we are getting around 93 microseconds. This is not comparable, of course; I just wanted to give you this number so that you see the overhead of strace. It's a completely different story, just for comparison.

So this is the overview. I have a couple more slides about what else can be done with this measurement, but I think we ran out of time, so you can read the slides. We don't have time for questions, and I need to leave anyway, but if you have any questions or suggestions, please write me an email; the address is on the first slide. I am happy to hear your suggestions on what I can improve in my measurements, of course. Thank you.