Good afternoon. I think I'll get started. My name is Gustavo, and I work as a software engineer at Linaro. Today the idea is to share with you the results I've got so far experimenting with compiler code instrumentation for tracing and profiling. This is work I've been doing since the beginning of this year with my colleague Kevin Townsend, hence the title of my presentation: Leveraging Compiler Code Instrumentation for Tracing and Profiling in Zephyr. I'm going to start with a quick anecdote. In the early 1950s the mathematician Claude Shannon was in London, attending a conference with Alan Turing. At that time Turing was working at the University of Manchester with the Mark I computer, and he invited Shannon to go there and see what he was working on. Shannon asked what Turing was doing, and Turing explained that he had a new instruction which would generate an electric pulse, send it to a loudspeaker, and produce a sound like "boop". Shannon was quite intrigued by that. Turing continued to explain that the idea was to put that instruction inside the code — inside loops, for instance — so that how fast the loop ran would determine the pitch of the sound. One would soon get familiar with the sound, Turing explained, and by listening to it one would be able to tell whether the program had hung, or was spending more time in a certain loop, and so on. I find this quite interesting, because it's one of the earliest accounts I know of somebody trying to trace a program as it executed. It also shows that tracing was very important back in the early days of computing, and it continues to be important today because, of course, systems have become much more complex. Tracing and profiling are a kind of dynamic analysis, performed at runtime.
This is very interesting because, in theory, we can capture the state of the system very close to its real state in production conditions. Nonetheless, tracing and profiling can cause — to use a term taken from physics — a sort of observer effect: when you trace and profile the code you also affect the runtime, and so you affect the measurements as well. However, if we understand the properties of the system and the disturbances well, it's possible to keep the disturbance negligible. Given the importance of tracing and profiling, Zephyr already has many options for them. So what was the motivation for trying to trace and profile using features from the compiler? The idea was that we would like to let people trace and profile without any financial or license barrier — to make it more democratic — so people don't depend on additional hardware, or on proprietary or expensive tooling and software, like a J-Link. We do also have gprof, which runs on some platforms in Zephyr, but it only runs on native_posix and native_posix_64, and it's a little bit tricky to get the timestamps on those platforms, so you're not able to use gprof on real physical boards. And we also already have the tracing subsystem in Zephyr. However, with this new system we would like to show, at a very low level, the execution of all the functions called in the kernel and in application code, and exactly when they are called, which is possible when you use the code instrumentation the compilers provide. So it's a little bit different from what the current tracing subsystem provides, which is based on hooks inserted at certain parts of the code.
And the final reason, which is quite a personal one, is that I missed a command-line tool for tracing and profiling like we have in Linux — like perf and ftrace for tracing the functions in the kernel. Given that, it turns out that compilers like GCC and Clang already provide several code instrumentation features. For instance, for code coverage analysis: if you enable that flag, the compiler instruments your code, and for a given input you can see which functions are called, so you can see what part of the code is exercised. This is important for testing, for instance. We also have flags in those compilers for profile-guided optimization: when you turn those on, GCC for instance generates a binary which you run, and it collects profiling information about the functions — the probable hotspots in the code — for a given input. So it's important to have a meaningful input in that case, close to what you intend to run in production. Once you run that binary, it generates profile information; then you recompile the code, and the compiler takes that profile information into account and tries to optimize the functions on the critical path. We also have flags for runtime checks, like checking out-of-bounds accesses in arrays, checking null pointers, and checking the stack canaries. The canaries — I'm not sure everybody knows — are a sort of signature put on the stack; the compiler generates additional code for checking that signature, and if it's been spoiled, it probably means somebody tried to attack the program with a stack-smashing attack, for instance, so the program aborts. And finally, compilers provide flags for collecting profiling statistics, which we can pretty much leverage for tracing and profiling.
So there are two sets of flags which are interesting here. The first one is `-finstrument-functions` and its associated flags for excluding files and functions from being instrumented. There is another flag, available only in Clang, which we'll talk about later. The second set of flags is `-fpatchable-function-entry`. Starting with the first set: the `-finstrument-functions` flag in GCC will pretty much add a hook at the entry and at the exit of every function, and call a handler for each. The handlers take two arguments: the address of the instrumented function, and the call site — the address the function was called from. So with that mechanism in place, we can design a tracing and profiling system on top of it. For tracing, we decided to use a buffer with two working modes. One mode is just a linear trace buffer: once it gets full, tracing stops. The other mode is a ring buffer: you keep recording events, but the oldest ones get discarded and only the newest are kept. We also introduced — or played a little bit with — a concept kind of taken from the compiler world, which is event type promotion. The instrumentation only gives you the information of entering or returning from a function; however, for tracing it is important, especially in Zephyr, to know about context switches, for instance.
However, given the context where we are, we can promote that event to a different one — based on the name of the function, the state of the kernel, or even the state of the machine. This is pretty much the case for the context switch events, which I will show in a bit. The profiling side is currently very simple: the buffer is just an array which keeps accumulators. When a function is called, we save the timestamp; when the function returns, we compute the time delta and accumulate it per function. Currently you can set the size of both arrays in Kconfig. For profiling, it would be possible to use more interesting heuristics to decide which functions to discard when there's no free space — for instance, you could declare which functions are important, so when space runs out, the ones not on that list get discarded — because functions are discovered for profiling as the execution flows, added to the array, and their time deltas accumulated. As you can imagine, the number of events generated in this situation, especially for tracing, is large, and there are a few ways to work around that. The first, which is kind of obvious, is to increase the size of the buffer; but that depends on your board — maybe it doesn't have much RAM — so it depends on the case how much you can increase it to fit more traces in the buffer. I should mention that so far we have just experimented with asynchronous transfer, which means we keep the data in the tracing buffer and collect it later, once the program stops running. We would like to experiment with something synchronous over different channels, like USB and Ethernet.
So anyways, besides increasing the size of the buffer, it's possible to tune the trace points, either at runtime or at compile time. At runtime, we have a trigger and a stop address. The trigger is a function address: when that function is called, the instrumentation gets enabled. The stop address is another function address: when that function returns, the instrumentation stops. With that, you can set a region for tracing, which you can widen or shrink based on your knowledge of the code, to reduce the number of events. The trigger and stop addresses are kept in retained RAM, so they survive a reboot — it will become clear later why that's important. The other way, at compile time, covers the case where even a small region generates too much tracing. For instance, on ST boards, very low-level functions start with the prefix `LL_`, and that might not be interesting for you — it just generates noise, and those events still get tracked in the buffer. So at compile time it's possible to specify a prefix, or even an entire compilation unit like a `.c` source file, to be excluded from instrumentation, and so reduce the noise — again, the number of events tracked — because you're just not interested in them. With those two approaches — setting the trace points at runtime, and setting the exclude list, which can be passed to west at compile time via `-D` CMake flags — you can reduce the number of trace events to fit inside the buffer. That's what's exposed to the user. Internally, we have other ways to disable the instrumentation, both at compile time and at runtime, but users won't touch those.
Just mentioned for completeness: it's also possible to avoid instrumenting a function using a compiler attribute. In that case you annotate the function with `__attribute__((no_instrument_function))` and it will not be instrumented. We use that pretty much inside the instrumentation subsystem itself, because it cannot instrument itself — that would cause recursion and it would not work. We also disable instrumentation at runtime while inside the handler: once the hook is called and we go into the handler to collect the trace and profiling information, we need to disable instrumentation there too, to avoid recursion. But this is just used internally; users won't touch it. And the profiling happens automatically once you set the trace points: once you set the region you want to trace, the profiling data is collected automatically. The data transport, as I mentioned, is asynchronous — that's what we have experimented with so far, using the serial port — but we do want to experiment with both asynchronous and synchronous modes over different channels, like ARM's SWD protocol, and also USB and Ethernet, which are widely available and have wider bandwidth. So how do we enable the instrumentation subsystem? It's quite simple: you just need to add a few configs to your prj.conf. You set CONFIG_INSTRUMENTATION=y and set the main stack size — this is our initial experimental guess, because of the ring buffer functions and that kind of stuff; maybe it's too much and we can change it, but basically that. As for the retained-memory and retention mutex options, those are going to disappear: there is a patch upstream being reviewed right now which will allow disabling the mutexes in the subsystem's Kconfig, because currently that's not possible.
So there is a patch being reviewed. The retention memory is important because we keep the trigger and stop addresses in RAM, and they need to survive the reboot: we set them, then we reboot the system, and then the instrumentation kicks in and collects the tracing and profiling information — that's the flow. And we can't use mutexes there, because the instrumentation will trace functions quite early in the boot, when there isn't a thread context yet, so it's not possible to use those mutexes — it would crash the system. The second thing you need to provide is a devicetree overlay for your board. I used the ST U585 board, so there is already an overlay for it, but it's quite simple to add one for other boards, and this overlay just frees some memory for the retention area. Here is a quick example to explain how that devicetree node is specified. Below, you take the RAM available on the board and shrink it a little, leaving a space at the end — in this case 32 bytes — for the retained memory. Then you define a retained-mem node, and a retention node on top of it, which specify those 32 bytes at the end which will not be overwritten by Zephyr when it reboots, because they are used by the retention subsystem. And that's it — that's all you need to enable the instrumentation subsystem. If everything works, you just recompile your application and it's ready to be traced and profiled. We also introduced a command-line interface tool. We call it zaru, for lack of a better option — we thought of zperf at first, but that was taken, and "perf" alone suggests just profiling while this is also about tracing, so we decided to pick something a bit random. Zaru is a Japanese word for a strainer — a tool used to filter out or drain water, for instance.
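Going back to the overlay for a moment, a sketch of what it can look like follows. This is based on the devicetree bindings used by Zephyr's retention subsystem (`zephyr,memory-region`, `zephyr,retained-ram`, `zephyr,retention`); the addresses assume a hypothetical board with 448 KiB of SRAM at 0x20000000, shrunk by 32 bytes, so adapt them to your board:

```dts
/* Shrink the main SRAM by 32 bytes to make room at the end. */
&sram0 {
	reg = <0x20000000 0x6FFE0>;
};

/ {
	/* The 32 bytes carved out at the end of RAM. */
	sram@2006FFE0 {
		compatible = "zephyr,memory-region", "mmio-sram";
		reg = <0x2006FFE0 0x20>;
		zephyr,memory-region = "RetainedMem";

		retainedmem {
			compatible = "zephyr,retained-ram";
			status = "okay";
			#address-cells = <1>;
			#size-cells = <1>;

			/* Not overwritten by Zephyr across reboots. */
			retention0: retention@0 {
				compatible = "zephyr,retention";
				status = "okay";
				reg = <0x0 0x20>;
			};
		};
	};
};
```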
Why zaru? Because of zaru soba, a dish usually served chilled — and the reason it's Japanese is that Kevin and I pretty much love Japanese food. And it starts with Z; that's the main reason, actually. I'm open to other suggestions. The command-line tool is written in Python, and it uses libbabeltrace2, which has Python 3 bindings. Actually, I skipped one thing — sorry about that. I forgot to mention that before the command-line tool comes into play, we use the Common Trace Format (CTF) to decode the binary: once you transfer the data from the buffer down to the host, you need to decode that binary stream of trace events. For that we use CTF, which has a DSL — a domain-specific language — for describing the trace, and the parser for it is implemented in libbabeltrace, which has the Python 3 bindings. We found some issues with the CTF 1.8 specification: if we have, for instance, both a 32-bit architecture and a 64-bit architecture, we need a separate metadata file for each. The specification is not flexible enough to say, in a single file, that based on a given field, other fields will have different sizes. This is something that has been addressed in the CTF2 specification, which doesn't use the DSL anymore — it uses JSON objects to model the binary — but it's not implemented yet; it's a work in progress, so it's not available to be used. So we are currently on 1.8, which requires one metadata file per architecture, and even per endianness — if you have another endianness you need yet another metadata file — so you end up maintaining many metadata files, which is annoying.
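To make the limitation concrete, here is a trimmed, illustrative fragment of a CTF 1.8 metadata file in its TSDL DSL — not a complete or working metadata file, and the event layout is my invention, not the subsystem's actual one. The point is that integer sizes are baked into the type declarations, so a 32-bit and a 64-bit target each need their own copy:

```
/* CTF 1.8 metadata (TSDL) -- illustrative fragment only */
typealias integer { size = 32; align = 8; signed = false; } := uint32_t;
typealias integer { size = 64; align = 8; signed = false; } := uint64_t;

trace {
	major = 1;
	minor = 8;
	byte_order = le;   /* and a big-endian target needs another file */
};

event {
	name = "func_enter";
	id = 0;
	fields := struct {
		uint32_t address;  /* would have to be uint64_t on a
		                    * 64-bit target: hence one metadata
		                    * file per architecture */
		uint32_t caller;
	};
};
```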
Back to the command-line interface. It pretty much uses the Python bindings for libbabeltrace, and currently we have these commands — or subcommands, if you will — available. `status` gets the status from the system, telling you whether the profiling and tracing parts are enabled; the idea is that you can disable them individually. `reboot` reboots the system, but it also has a sort of ping in place: it reboots the system and then pings it to see if it came up correctly — if not, it complains. Then there are the `trace` and `profile` subcommands. `trace` is the command to set the region you want to trace, i.e. the trigger and stop addresses. It doesn't take raw addresses — you don't need to remember them or inspect the file to find the address of a given symbol; you just give the name of the function, and it inspects the ELF file in the Zephyr image and sets the addresses for the trigger and stop functions correctly. Finally, `profile` is used to extract the profiling data collected in the buffer. The idea was to have a command-line interface which gives us the traces and the profile information on the command line, in text format, but also has a sort of backends or plugins which allow us to export the traces to other GUIs, or even to a PDF file, for data visualization. We decided to experiment with the Perfetto GUI, a tool used a lot in the context of Android, and I have an example here of the Perfetto output. One important thing we found about Perfetto: its input is specified as a JSON file, but nonetheless the context-switch information is taken as-is from the ftrace system in Linux, and we don't have that in Zephyr.
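To make the subcommands above concrete, a session might look roughly like this. The option names are my illustration only — check the RFC for the actual interface:

```console
$ zaru status                            # are tracing/profiling enabled?
$ zaru trace --trigger main --stop main  # symbols resolved via the ELF file
$ zaru reboot                            # reboot the target, then ping it
$ zaru profile                           # pull the buffer, decode on the host
```

The `trace`/`profile` output is what then gets exported, for instance to Perfetto as just mentioned.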
So currently we are kind of faking it — building an ftrace-like event just for displaying it in Perfetto. Ideally the Perfetto input format would have a platform-agnostic way to specify context switches, but there isn't one, so we are pretty much synthesizing the ftrace event out of the kernel context-switch events in Zephyr. You can see the context switch here, inside the yellow box. Here we are tracing a program from the samples — we have a sample for the instrumentation subsystem — and the sample is a kind of ping-pong program, with two threads taking turns executing an I/O loop on a single CPU. You can see the main thread and the other thread which were created, and at the tips of the red arrows you see the state of each thread: the red one is running — not sure if it's possible to see — and the purple one is stopped. For the one which is running, you can see a sort of flame graph upside down. That was actually another idea Kevin had: to bring a kind of flame-graph analysis to Zephyr, pretty much as we have with perf on Linux. Anyways, this is just an example of a graphical user interface — a data visualization for the traces — showing a context switch with two threads. Now, what is the impact of turning on the instrumentation subsystem? We used the GCC compiler, which is supported by Zephyr. I used the sample applications under samples/basic to plot this chart, and as you can see, the number of functions more than doubles. The reason is that the `-finstrument-functions` flag forbids inlining, and as it forbids inlining, further optimizations like dead-code elimination and tail-call elimination don't take place either — so it prevents the compiler from doing further optimizations which are important.
So the jump in the number of functions here is because no inlining is being performed by the compiler, and that is reflected in the size of the final image, which is also more than twice as large compared to when the subsystem is switched off. And if you compare with CONFIG_DEBUG, the situation doesn't improve — there isn't much difference when you turn on CONFIG_DEBUG. However, I labelled this the current impact of turning on the instrumentation subsystem. Why current? Because Clang does have a flag (`-finstrument-functions-after-inlining`) which allows inlining when you use function instrumentation; the thing is that GCC does not support it. There is an issue open in GCC for adding that, so we hope that — either with Clang, or once GCC gets it — the situation regarding image size will improve. Everything I have mentioned so far is published as a request for comments (RFC), so feel free to go there and comment on everything — even changing the name of the command-line tool, now that we know Kevin doesn't like Japanese food that much; we can pick something Italian or Thai. And I promised a demo in the description of the presentation: there is a video attached to the RFC. There isn't enough time to go through the demo live, but the recorded video shows zaru in place, collecting the trace and profiling data, and also exporting to Perfetto; there is also a screenshot of Perfetto, the same one I have shown you. Okay — so far this was just about the first set of flags, but I mentioned there was a second set which might be interesting as well. Why are those flags interesting? Because you could imagine that in a system where all, or a bunch of, functions are instrumented, we have to check, for every function, whether it is the trigger or the stop function, to turn instrumentation on and off.
For every function call, you're going to do a comparison and decide whether to return or to call the hook. This might imply some overhead. We haven't measured how much this overhead is, but it's possible to measure, and we probably will. Anyways, we can imagine there is an overhead, and to work around it we can use the second set of flags: `-fpatchable-function-entry`, which pretty much adds a bunch of no-ops (nops) at the prologue of every function. We also have to use `-fno-optimize-sibling-calls`, for reasons I will mention a little later, probably on the next slide. Anyways, that flag adds the nops, which means we can start to play with dynamic function instrumentation, which is quite interesting. This is quite experimental, but the idea is that we generate the hooks not at compile time but entirely at runtime. The compiler just gives us room by adding the nops at the prologue of the functions, and we add the hooks at runtime, only for the functions we are interested in — pretty much like a JIT engine or compiler would do. This way, we won't have the per-function check overhead we had before with `-finstrument-functions`. However, there are some cons. One is that you have to have a sort of assembler in place, because you need to generate the instructions for the prologue at runtime. Also, with that flag we are only patching the entry of the function, not the exit, which means we need to tweak the stack and the return address on the stack to allow returning properly through the hook. The other thing is that code relocation is necessary: you cannot patch a program running from flash, so depending on your board, the amount of code you can use this feature on will be restricted.
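An illustrative sketch of what runtime patching of the reserved nops looks like — this is loose, Arm-flavoured assembly of my own, not the exact sequence the RFC generates:

```asm
@ f built with -fpatchable-function-entry=2: two nops reserved at the prologue
f:
        nop                  @ room reserved by the compiler
        nop
        push    {r7, lr}     @ the normal prologue follows
        @ ... function body ...

@ f after runtime patching: the nops are rewritten into a call to the hook
f_patched:
        push    {lr}         @ save the real return address
        bl      entry_hook   @ record the event; the hook also rewrites the
                             @ saved lr so the eventual return goes through
                             @ an exit hook (entry-only patching, hence the
                             @ stack/return-address tweaking mentioned above)
        push    {r7, lr}
        @ ... function body ...
```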
And another thing is that it still prevents optimizations. That `-fno-optimize-sibling-calls` flag is there pretty much to avoid tail-call elimination, because with dynamic function instrumentation, if main calls function A and A calls function B, tail-call elimination would let B return directly to main — but we actually need B to always return to A, and A to main. So here — I'm about to finish — is just an example of what we've done. We implemented a sort of assembler inside the code to test it. We have a simple function f here, to which the compiler adds the nops — I exaggerated the number; it doesn't need to be that many, of course. This is function f without being patched, and here is function f patched, with the hooks added: function f now calls a hook at its entry and when it returns, because we are manipulating the stack here as well. Finally, I would like to mention that yesterday I thought of another possibility which would reduce the cons of this approach: we can generate all this code on the host side and transfer it to the target — pretty much as eBPF does, for those who know eBPF in Linux; of course, eBPF has a VM on the kernel side to JIT the instructions. Inspired by that, we could drop the need for an assembler on the target by using Zephyr's toolchain on the host. And I guess that's pretty much it — we are at the top of the hour. Okay, cool. Thanks a lot, folks.