Welcome to my talk, and thanks to all of you for coming. My talk is about extending the performance analysis tool set, so I will be talking about tools you can use in your Linux environment to find performance bugs in different ways. This is about performance, and specifically embedded performance, since this is an embedded conference. When it comes to embedded performance, it's really important to understand that if you're coming from a desktop world, performance will look different: the time inside your embedded system will be spent differently. Let's have a show of hands: who has had the situation of thinking, "I know, this is probably something with the I/O", and in the end the performance bug was somewhere else entirely? Okay, I see some hands. So on embedded we cannot really rely on our intuition. We need to be scientific: we measure, and then we can decide and improve things. Here's the outline of the talk. First I will talk a little bit about how to get sampling performance information on an embedded device with perf, and how to get the debug symbols in, and so on. In the second part I will show you hotspot, a tool that we developed. I already have the t-shirt on; it's quite cold, but still okay. I will demo it for you, and we will go through several examples of what performance defects can look like in hotspot, how you spot them, and maybe also how you fix them. In the third part I have to add a little disclaimer: I'm coming from the Qt world, and I will talk about how we at KDAB introduced LTTng tracing, not sampling but tracing, into the Qt libraries to enhance the experience you have when debugging performance issues. But first, let's talk about Linux perf, and let's talk about embedded. When you want to have perf information, that is, sampling information, you need to have perf running on your system.
Oftentimes that's a problem: sometimes it's a security issue, and oftentimes your manager says, let's not have perf on our real images. So I really suggest you keep a production image and a development image, and the development image should at least have all of these flags set in the kernel. These flags really determine whether you can do useful tracing with perf. The first is: do we have perf at all? The second: can we do dwarf unwinding, which becomes important in a moment? And then: do we have tracepoints, which we can use, for example, to trace the scheduler to find issues with locks and so on. So when you have an embedded system, make sure you have an image that at least has something like this, because otherwise you cannot measure on your embedded system, and if you measure on your desktop instead, you will not find the issue.

Second point: it's really, really important to have debug symbols, but built in release mode. Maybe you know this trick here, this code. It calculates the sum of the integers beginning at zero and ending at a certain integer, and you compile it with and without -O2. What happens? Anyone know the answer? Okay, maybe I will spoil it a bit. Without -O2, you get a normal for loop; you can look at the assembly, and it will be a normal loop. But there was a really smart guy called Gauss who found a closed formula for this, and when you pass -O2, the formula is applied: you get n times (n - 1), and half of that, as the result. That means in some sense it becomes infinitely faster with -O2. So always be sure to have a profiling build. In your build systems there is something like RelWithDebInfo, for example. Please make sure you use it, because otherwise you're not measuring anything that is relevant to performance: on your release systems you will have release builds. So please use -O2, but with debug symbols forced on. Okay.
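A minimal sketch of that sum example; the function names and exact shape are my assumption, the talk only describes the idea. Compiled with -O2, GCC and Clang typically recognize this loop pattern and replace it with the closed formula, so the "loop" runs in constant time:

```cpp
#include <cassert>
#include <cstdint>

// The loop the talk describes: sum of the integers 0, 1, ..., n-1.
// Without -O2 this compiles to an actual loop; with -O2 the optimizer
// usually rewrites it into Gauss's closed formula below.
std::uint64_t sum_loop(std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < n; ++i)
        total += i;
    return total;
}

// What the optimizer effectively emits: n * (n - 1) / 2.
std::uint64_t sum_gauss(std::uint64_t n) {
    return n * (n - 1) / 2;
}
```

You can verify the transformation yourself by comparing the -O0 and -O2 assembly of `sum_loop` on a compiler explorer.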
Second problem: we have no space on our embedded systems. Usually you have user space libraries like these. Again, I'm coming from the Qt world: libQt5Core, stripped, is about 5 megabytes, which is big for embedded systems but okay. With the debug symbols inside, though, it's really big, so you cannot really ship it like that. What you have instead is separate dwarf debug info inside your build system; I think Yocto does this by default. Inside your SDK you will find your debug symbols, which are just ELF objects separated out onto your host system, while on the target you only have the stripped libraries. Okay.

Another problem: now we have the debug info on the host, but we're actually measuring on the target system, so we have an architecture mismatch. What we need to do is take the measurements we did on the target, the stack and stack-pointer samples, and unwind them on the host system, on our desktop. That's not always easy to do.

And then there is Linux perf itself. There we also need to check whether we can do low-overhead measurements using the performance monitoring unit (PMU) inside our CPU. To see whether you have a PMU, just scroll through dmesg and look whether you have a PMU entry somewhere. On Intel you will find it; on some ARM CPUs the counters only ever read zero or "not counted". ARM does not specify that a PMU has to be inside the CPU, so there can be no PMU at all, and sometimes it just gives you bogus values. So please find out whether you have a PMU and whether perf can actually measure things like cycles on your CPU. Okay. But then we can go ahead and record on our device with just this call: we do perf record, because we know now that perf is working on our device, with --call-graph, because we want the stack information, and then we close perf's own parameters.
Then we start our app with the parameters we want to see, and just measure. Okay. Now we have our profiling information and we need to process it. As I said before, when we do the unwinding on the device, we can use perf directly. There is perf report, the command-line interface, which looks like this. I don't really see much information here. This is weird: the numbers do not really add up to a hundred percent, and apparently my biggest problem is somewhere in libc. It's unclear what's happening there. You can actually expand these entries and look at what is happening inside libc and where it's coming from, but QEventLoop::exec does not really tell me a lot here.

This is why we built hotspot. Hotspot is a free and open source Linux GUI for visualizing your perf results. There are already UIs, or at least approaches, for visualizing perf results, for example from Brendan Gregg, who is really great on this topic; do check out his web page when you want to learn about perf and Linux performance in general. But we wrote a tool that can do a bit more, which I will show you in a demo shortly. Internally it uses perfparser, which comes from the Qt Creator people, and which in turn uses libunwind and libdw for unwinding, even across architectures. That's the important part here. Before I come to the demo, I just want to give a big shout-out to Milian Wolff, who cannot be here today: he is the core maintainer, and all credit goes to him.

Okay, let's see how hotspot looks. I have a simple application here, just called example one. It's one of these nice Mandelbrot examples that you maybe know from university. The problem is that it's a bit slow: if I scale it a bit bigger, you see it lags; it takes some time before the image appears.
Now I want to find out what exactly is slow inside here. Okay, so we heard that we just run perf record --call-graph dwarf, meaning we use the dwarf information for recording and later for unwinding, and then I run the example one Mandelbrot binary. I pass it an argument, -b 10, which means: repeat the drawing 10 times. You won't see anything on the screen, but believe me, it's actually calculating the thing 10 times. Okay, we do this, and what falls out of it after a few seconds is this perf.data file. Here you maybe see another problem you can run into: the perf.data file is quite big. But you can set the frequency at which perf records the data, so you can shrink it down and play around a little until you find the sweet spot for the maybe 10 megabytes you have left on your device.

And then, with this perf.data recorded, I can start hotspot, and by default it will read the perf.data file from the current directory. Okay, it starts. Is it too small? Please interrupt me if it's totally too small. It's okay. Alright. First I see what was actually recorded: the total runtime. We are at five seconds. That's going to be important later, because we can improve this. And then we see our hotspots: __hypot_finite, a drawMandelbrot, and a multiplication of complex<double>. That is important here: this is already telling us something about where our hotspot, our problem, might be.

But something that is really useful, and by now widely known in the field, is the flame graph, and I will explain it really briefly. Everything you see here is the stack, growing upwards, and going left to right is the cost of the functions that were called on the stack. Okay.
So I see: inside _start I have __libc_start_main; inside that, my main; inside that, my drawMandelbrot; inside that, std::abs of a complex<double>; and then I come to __hypot_finite and so on. And we can even inspect further: just click on main, and main becomes the new hundred percent, and we can drill down. We will see that in other examples later on.

What you can also do with hotspot, which you cannot really do with the scripts that are out there that also produce these flame graphs for you, is select individual filters, like a time filter. For example, you might be interested in your startup time. Maybe your startup time is your problem, and you want to see what is going on during startup. So I filter on it, and what I see is: I'm actually in the loader. There's lots of dlopen and such going on, and there's no Mandelbrot at all in my startup time. I can also reset the filter again, reset zoom and filter, and go back to the full time frame. In this time frame I should not have those dlopens and such, but I should have my _start, main and so on, and my Mandelbrot, which is where my cost actually is. I filter in on the selection. You can also filter on individual threads and so on. Okay.

We go here. Did I just zoom? I don't know. I can also take this drawMandelbrot and go to another view, the caller/callee view, where I can see who was calling my function, drawMandelbrot. Okay: I was called mostly by main, and inside, I was calling this std::abs of a complex<double> and so on. And here I have the source locations where the costs are actually incurred. So in hotspot you can go here and say "open in editor"; after some time an editor opens up, and we just saw it's line 40. We see: okay, here in line 40.
Here is my std::abs of my complex number, which is what's actually giving me headaches, actually giving me problems here. Small spoiler about the solution: complex numbers. I'm calculating the absolute value of a complex number here. It's essentially a Pythagoras calculation: the square root of the real part squared plus the imaginary part squared. But why take the square root at all? One way to make this faster would be to drop the square root and compare against the squared threshold instead. So I would just replace this with a kind of fast norm: the squared norm instead of the norm with the square root inside. I'm not doing it right now; I'm just telling you.

The other piece is line 42. We have another problem there. Open the editor again, go to line 42. It's actually wrong: it should be line 45. It's a bit complicated sometimes with these stack traces; they're sometimes off. You would see that this complex<double> multiplication is actually slow, and there's another trick: when you multiply a complex number by itself, you can also be faster. Just a simple example of how you can find problems inside your code.

Looking back at the flame graph: there is a simple look you can take at a flame graph and tell whether there's a problem, and which kind of problem it is. This one is the "one big problem" problem, which is the good kind: if you have one big problem, finding it might be easy. The other kind is the death by a thousand cuts, where you pay and pay and pay, but in really small amounts, distributed over your whole code base. We will see this in other examples. So, I have a solution for this one, and I'll also show you what hotspot shows you for the solution: I just replace this in the code, go to example number two, and I have a perf.data here, and when I look at the summary, we had five seconds before.
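A sketch of the two tricks the talk hints at; the function names are mine, the talk doesn't show the actual code, and note that the standard library's std::norm already returns the squared magnitude:

```cpp
#include <cassert>
#include <complex>

// Squared magnitude: re*re + im*im, no square root.
// For an escape test like |z| > 2, compare against the squared
// threshold instead: fastNorm(z) > 4.
double fastNorm(std::complex<double> z) {
    return z.real() * z.real() + z.imag() * z.imag();
}

// Squaring z = a + bi directly: (a*a - b*b) + 2ab*i.
// Three multiplications instead of the four that a generic
// complex multiplication performs.
std::complex<double> fastSquare(std::complex<double> z) {
    const double a = z.real(), b = z.imag();
    return {a * a - b * b, 2.0 * a * b};
}
```

Both transformations preserve the Mandelbrot iteration's result exactly; they only remove work (the sqrt call and one multiplication) from the inner loop.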
It now ran for the same even with 15. It only ran for 1.4 seconds. So it became faster. And if we look at the flame graphs, there are still these plateaus. But now I know about these plateaus. Yes, they're optimal or they're nearer to what I want to have in terms of performance. And now we can also see some parts of the flame graph that are actually becoming more like the usual flame graph, which I will show you in the next example. Here is an application called settings. And this application already has some problems with the unwinding. It doesn't really find the symbols because now I have libraries. Maybe you see that in the binaries here. I'm inside my graphics driver. I'm inside lip PNG. I'm inside lip Z. I'm inside cured, which didn't have information at this point. Let's see whether we can still find something out here. Okay. This is how the flame graph looks right now. So this is how a usual flame graph, I would say when we go to a customer, something is slow. We don't have all the information would look like you have like mixed. You have symbols and mixed. You have no information at all and you need to find your way inside these mountains. But what you can see is that the flame graph looks a bit more like the same flame graph, which is like lots of small peaks. So if you have lots of small peaks, you can be assured you don't have the big problem problem. But you have maybe just many small problems problem. Yeah. So here we can also turn this flame graph around. We can go to bottom up view and see, okay, let's suppose we're on the top of our stack. Where was I? And where was I coming from when I was at the top of my stack? So here I see all the calls that ended up in my graphics driver. This is during rendering. And here I see two paths that come from various parts of the code that were loading PNGs, that were inflating the PNGs. So maybe this was the problem in this case, which actually was the problem in this case. 
So I go there and look: are there any big PNGs I could replace, or something I could maybe preload during boot, which would improve the startup performance of this application, for example? If there are any questions, please interrupt me; there are bonus points for interrupting me with questions.

Let's go to our last hotspot example. You see the unwinding happening now, and it takes a long time. Sometimes you really spend a minute or two unwinding your information, because sometimes you just have lots of stack data to retrieve, maybe megabytes or hundreds of megabytes. And this is something I would call a healthy flame graph. It's a bit hard to see; maybe I can make it like this. If you see a flame graph that looks like this, and you think, okay, maybe it's healthy, then let's turn it around and check in the bottom-up view whether we have lots of small stuff on top of our stacks. There's something in fontconfig, font loading; maybe I need to look at that. But other than that, I'm quite happy with my flame graph, and I'm quite okay saying that with hotspot, at least, I cannot find any other performance issues really quickly, just from this one look at the flame graph. Alright: apart from the fontconfig, which we could fix, we have small parts at the top (or at the bottom, in the bottom-up view), and these many small peaks, which indicate that there isn't that one big problem, and also not the distributed problem, because that would show up in the bottom-up view.

Again about hotspot: the thing I like most is that when I go to a customer with an embedded system and get some measurement data from there, I can unwind off target.
You can provide hotspot with all kinds of hints about where to look for extra information for this unwinding, so you get more and better-looking flame graphs, flame graphs without those question marks inside. You can provide your debug paths: where your split debug information lives. I also just recently learned that there is an elfutils tool, eu-unstrip: if you have separate debug information that you want to merge back into a stripped library, you can also do that on the target. Maybe you have space for this, or maybe a bug is happening and just the top of your stack is unknown, and you want to find out what is going on in just those last frames: then take your debug information, unstrip on the device, and have debug information just for those last frames.

Then you can provide extra library paths, for libraries that you're developing on your own that are maybe not inside the SDK; then the application path, which hotspot needs to look at as well; then the sysroot, which is just the SDK sysroot. Also important are the kernel symbols, which you can get out of the proc file system on your embedded system. After you've done your performance measurement, take the contents of /proc/kallsyms, put it into a file, and take it along for the later unwinding, so you can see kernel information too: so you see that, say, some kernel worker was actually slow there.

That's it for hotspot. Again, please go to GitHub, download hotspot, try it out; it's free. Please contribute to hotspot as well. We recently fixed a big issue with unwinding, so with newer compilers it should now work a bit more stably. Okay.

Now for the second part I want to talk about: LTTng, the Linux Trace Toolkit: next generation. It's nothing that we invented; it comes from EfficiOS and Ericsson, I think, and it's a great source of information for tracing.
That is, tracing information about your kernel. The problem is: out of the box it's mostly about the kernel. There aren't really many user space tracepoints in existence where you can see, at the same time, what is happening in your user space application and what is happening in your kernel, and we want to change this. (This is just from the backup slides.)

I can also show you a tool, which again we did not develop, called Trace Compass. With Trace Compass you can inspect your traces a bit better. Note that we're talking about a different kind of performance analysis now, right? Before, we had an aggregated collection of how many times we were in each function, everything summed up; now we have detailed, timestamped information. We can see that when I start Trace Compass. It's an Eclipse-based program, and it's open source. Trace Compass allows you to visualize CTF, the Common Trace Format, which I think not only LTTng but also other tracing tools support.

What you can do here, and what surprises me the most every time, is zoom in almost infinitely into your running system. You can go here and say, okay, I want to know what is happening here, and you go further: here's an epoll_wait, and further still, this is my Xorg. The red part means: okay, I'm waiting for something; actually, "wait for CPU". And here I can step through all the syscalls that are happening, partly inside the kernel; I see all the kernel workers, for example. Is this too small, maybe? Should I increase the size? Okay. And here I have an exact listing of what's going on, like this epoll_wait: I don't just have the information that there was an epoll_wait. Okay, and here's the sched_switch, because I was waiting before, and now I'm not waiting anymore.
First I get the event from the scheduler that I was switched in, and I can see which process I was switched from; I think this is the idle process here. And yeah, you can see all kinds of information here about what is going on in your kernel. It's just too bad that, up until now, you could not have user space tracing in here, because LTTng also supports user space tracing.

And it's a low-overhead thing to have inside your application. Why is it low overhead? Because the heavy lifting happens outside your process: it uses a fast ring buffer, either inside the kernel or inside your user space, to store all this information. There are several ways of getting tracepoint providers into your application; I'll tell you a bit more about this in a second. You can compile them in at build time, or at run time: you can preload the provider into your application, or even dlopen the tracepoint provider while the application is running, and get the information that way as well. If you don't have the tracepoint provider loaded, the tracepoints usually cost you nothing; if you have it loaded, they cost you only a little, just putting the information into the buffer. Okay.

So what did we do in Qt? In Qt, we added a tool called tracegen, which is a code-generating tool, and added tracepoints at interesting points inside the Qt libraries. For example, there was this PNG loading which was slow. So here I have a "QImageReader read, before reading" and a "read, after reading" tracepoint, so I can measure exactly which image was slow, what I might improve; maybe I can talk to my designers, and so on. So here are all the tracepoints we have, and tracegen goes ahead and creates these functions.
It's actually macros, but it creates these functions, which you can then compile into tracepoint object files and use inside your application. Okay. Here it's for this QImageReader. You might ask yourself what the difference is between tracepoint and do_tracepoint: one asks first whether the tracepoint is enabled, the other just fires it anyway, for when you already know tracepoints are enabled, and the third one, tracepoint_enabled, just asks whether the tracepoint is enabled at all. Okay.

And here is how it looks inside Qt. I have this Q_TRACE macro with this "QImageReader read, before reading" event, and I also pass arguments like the file name. With this information I can now go back to Trace Compass and see that I don't only have kernel information: if I go and filter for my application, which was called chip here, maybe uncheck all and then just check chip, user space tracing, and whatever else you want to check, maybe some kernel information and so on, then we can see that we have some kernel information, like this mprotect, but additionally we should have some user space information as well. This is just the startup; there are just so many things going on here. Here you can see, for example: user space tracing, qtcore, QMetaObject activate, end signal. So some signal was just processed inside Qt. We know we still should improve on this: the sender is just given as an address, where it should really be a class name, so you can see, ah, there was this class sending me something, and then I waited for ages in the kernel.
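The enabled-check pattern behind those three forms can be sketched in plain C++. This is NOT the real lttng-ust API (there, tracepoint(), do_tracepoint() and tracepoint_enabled() are macros, and the buffer is a shared-memory ring buffer); it's just an illustration of why disabled tracepoints are nearly free:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy tracepoint provider illustrating the pattern described above.
struct ToyProvider {
    bool enabled = false;                 // cheap flag checked on the hot path
    std::vector<std::string> buffer;      // stand-in for the real ring buffer

    // "Is this tracepoint enabled at all?"
    bool tracepoint_enabled() const { return enabled; }

    // Unconditional emit: for when the caller already knows tracing is on.
    void do_tracepoint(const std::string& event) { buffer.push_back(event); }

    // Checked emit: a single branch when tracing is off, so disabled
    // tracepoints cost essentially nothing.
    void tracepoint(const std::string& event) {
        if (tracepoint_enabled())
            do_tracepoint(event);
    }
};
```

When the provider is never loaded at all, even that branch disappears, which is why it's safe to leave such tracepoints compiled into production libraries.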
But yeah, this is the start we are providing for LTTng inside Qt. And with the tracepoints we're introducing via this tracegen code generator, we're not just generating LTTng tracepoints, which you can use on Linux, but also ETW events, which you can use on Windows, and there is a nice UI for ETW performance analysis that looks like a mix of hotspot and the Trace Compass I just showed you, but on Windows. So if you're on Windows, you can also look inside this. Okay. As of Qt 5.12 you have what I just showed you inside Qt; even if your application uses a lower Qt version, just try running it with Qt 5.12, try it out, and please give us feedback on which further tracepoints you really want to have inside Qt. Okay.

To summarize, how is LTTng different from perf? First I showed you perf and hotspot, and I showed you how sampling is like holding a stopwatch next to your CPU, from time to time writing down where you are in the stack trace, so you can then see where you were most of the time when the stopwatch was read. With tracepoints, you add them to your code and then get detailed information about each tracepoint with LTTng, and you can inspect those events down to the nanosecond inside your application. Okay.

And that's all. I want to thank you for your attention, and maybe we can have a small discussion on performance issues you've had and how you solved them. If there are any questions, feel free to come to the microphones here in the front. Thanks.

Question from the audience: This sounds like a really nice feature. My question is about the other side of the boundary, because I'm a C developer: could I use this for debugging C programs? Is there something for tracepoints within a C program? Yes.
There are user space tools for LTTng for C programs. Just go to the lttng.org web page; they have really great information on this, and they tell you how to add a simple tracef, which is a kind of printf that goes into the tracing information, or how to add real tracepoints, which you can see as the bars I've just shown you. So: lttng.org, look for user space tracing in C, and you will see how to get user space tracing for C as well.

One more question: is that the only backend infrastructure there is for tracing kernel calls? No, no, not at all. Good question. For example, I think tomorrow, a few hours earlier than now and in exactly this room, there will be a talk about eBPF, which is now the rising star in the world of performance analysis. People write small eBPF programs, or classic BPF programs, put them into the kernel, and collect information there, for example to find their I/O issues or something like that: count something every time it happens in the kernel, or selectively filter for certain events, using these small BPF programs. Read up on Brendan Gregg's web page about this too; he talks a lot about eBPF and BPF. And of course there are other sources of information: the older tools like ftrace and so on obviously still work and are a great source of information. I just wanted to show you LTTng and perf as two examples. It's great that you reminded me that there's actually more; please go ahead and look at everything there is. Yeah, that sounds like a really great feature, indeed.

Okay, are there other questions? If not, then thanks again, and enjoy the rest of the conference.