Alright, hello everyone. My name is David Faure. I work for a company called KDAB; what we do is Qt, OpenGL, and C++. I've been doing Qt for 20 years in the KDE project. One of the things we ended up doing as a company, when working on customer code on Linux, is that we felt the need for better profiling tools. So we ended up writing two open source tools which make it easier to profile CPU and memory, and I'm going to present these two tools to you. They are GPL and available for free, and I'm hoping all of you will benefit from them when you want to do some profiling on Linux. The two tools are called heaptrack and hotspot. Heaptrack is about memory profiling, and hotspot is about CPU and off-CPU profiling based on perf; we'll talk about that one next. So even if you don't like the first one, don't run away: the second one is even better.

Let's start with heaptrack. This is about profiling memory allocations. Think of the situation: my application uses a lot of memory, how can I fix that? If you're lucky, you don't delete anything, and you can use a leak checker at the end of your program to find out what memory you have allocated. That is the easy case; there are many, many tools for leak checking, and this is not what this is about. This is about the case where your application grows and grows in terms of memory over time. It creates many objects, but when you quit the application, everything gets nicely deleted, and then leak checkers wouldn't find anything. So you need a tool that can tell you: you've been allocating memory from this part of the code, and this is where most of the allocations come from. You want to see stack traces of where the allocations come from.

So that's a tool that, we thought, should be easy enough to write, right? How hard can it be? We just need to hook into malloc and free and record every allocation. Easy, right? The hooking part is easy: you write an LD_PRELOAD library which hooks into malloc and free, and you get all of the allocations. Collecting backtraces is a little more involved, but there are ways to do that: you resolve debug symbols and use the DWARF information. Then you have a load of backtraces, and the next step is to collect those backtraces together and present them to the user in a nice way. That's what we did with heaptrack.
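Just to make the hooking mechanism concrete, here is a minimal sketch of such an LD_PRELOAD interposer; this is a simplification for illustration, not heaptrack's actual code. A real tool also hooks calloc and realloc, records a backtrace per allocation, and guards carefully against recursion.

```cpp
// hook.cpp -- build: g++ -std=c++17 -shared -fPIC hook.cpp -o libhook.so -ldl
// Run with:   LD_PRELOAD=./libhook.so ./myapplication
#include <atomic>
#include <dlfcn.h>  // dlsym, RTLD_NEXT (g++ defines _GNU_SOURCE by default)
#include <stddef.h>
#include <stdio.h>

static std::atomic<size_t> g_bytes{0};
static std::atomic<size_t> g_count{0};

extern "C" void *malloc(size_t size)
{
    // Look up the libc malloc behind us in the link order. Caveat: dlsym can
    // itself allocate; real interposers like heaptrack's guard against that
    // recursion, and would capture a backtrace here for every call.
    static auto realMalloc =
        reinterpret_cast<void *(*)(size_t)>(dlsym(RTLD_NEXT, "malloc"));
    g_bytes += size;
    ++g_count;
    return realMalloc(size);
}

extern "C" void free(void *ptr)
{
    static auto realFree =
        reinterpret_cast<void (*)(void *)>(dlsym(RTLD_NEXT, "free"));
    realFree(ptr);
}

// Print a tiny report when the library is unloaded at program exit.
__attribute__((destructor)) static void report()
{
    fprintf(stderr, "%zu allocations, %zu bytes total\n",
            g_count.load(), g_bytes.load());
}
```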
Heaptrack is comparable to the Massif tool of Valgrind, except that Valgrind is very, very slow. Heaptrack doesn't have that problem, because the code runs as usual; only malloc and free have a little bit of overhead. Even better, we can attach to a running process, which is easy to do: you open GDB on the process, force it to load the library, and quit again. Except, of course, this gives a partial view of your application, because you're missing all of the allocations that happened before that. So it can be a little strange: you see a free without a malloc, which can be confusing. In general, you want to run this on the entire program, or on a benchmark or a unit test that you wrote for that problem. Of course, because we resolve all of the debug symbols for the stack traces during recording, this generates a lot of data that has to be processed afterwards, so the GUI is a little slow when loading traces. We have ideas for how to improve that in the future, but for now that's one of the small problems with it. So it's better to do this on a small application, a test case, or a benchmark rather than a very, very large application; or at least, if you do it on a very large application, do it for a short amount of time, not for one hour, because that's just impossible.

The tool is on GitHub and is actually part of the KDE project. I have two hats, right? The KDE open source project and the KDAB company. The code lives in the KDE project, whereas what we did as KDAB is provide continuous integration for it and generate an AppImage, so that you can install it without having to compile it at all. The URL is up there so that you can get it from there. It's really easy to install: all you have to do is download it, make it executable, and run it, and it contains all of the required dependencies.

Right, so let's have a look at what it looks like. Actually, I'll do that as a demo; that's always more fun, especially when things go wrong. We have this little application here. I'm a KDE and Qt developer, so obviously this is going to be a Qt application, and the application works: I can run it, it takes some time, and at the end it says it found a thousand matches. Now, if I want to run this under heaptrack, all I have to do is heaptrack myapplication, and it runs it again, but this time it records all of the memory allocations and deallocations. At the end, it tells me a few stats: you have done, I don't know, a million allocations, and a lot of them were temporary allocations. What we mean by a temporary allocation is this: if you look at all of the events, all of the mallocs and all of the frees, and you spot a malloc and a free on the same area of memory right next to each other, so there was no other malloc or free in between, that is a temporary allocation. It's an object, or some bit of memory, that was allocated, used, and destroyed right away. Sometimes you have to do that, but sometimes you can avoid it.
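To make that definition concrete, here is a tiny made-up example (not the demo's code) of the pattern that shows up as temporary allocations:

```cpp
#include <cstdio>
#include <cstdlib>

int main()
{
    for (int i = 0; i < 1000; ++i) {
        // A malloc immediately followed by a free of the same block, with no
        // other allocation in between: heaptrack counts each iteration here
        // as one temporary allocation.
        char *buf = static_cast<char *>(malloc(64));
        std::snprintf(buf, 64, "%d", i); // the memory is used only briefly
        free(buf);
    }
    return 0;
}
```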
Right, the nice thing about heaptrack is that it actually tells us what to do next. You just copy-paste the end of the output, which says to run heaptrack again in analyze mode with the recording file that was just generated. If I do that, I get this GUI application, which is too big for this monitor... there we are; I don't know if you can see the bottom like this. The first tab is a summary, where we can see where the largest memory leaks came from, where most of the memory allocations came from, and so on. This is not the most useful view. What you want to do in this kind of tool is switch to the flame graph. We all love flame graphs; they are a very easy way to see where the biggest offenders are. In this view, the wider a rectangle is, the more memory was allocated from that piece of code. If I look at the amount of memory consumed by this application over time... okay, the graph here is not very interesting because it's very linear, but the allocations went up and up and up, right? And what we do in the flame graph is take the peak of that graph: at the point where we used the most memory, where did it come from? That's what the flame graph shows, because right now I'm looking at the memory peak.

So I can see that at the time where this application was using the most memory, the memory came from QString::number over here and some appending over there; I'll show you the code so that we can relate this to the code. The other thing I'm interested in: the summary also said there is a really large number of temporary allocations, and I want to see where they come from. I can go here and switch the thing I'm looking at to temporary allocations, and then I can see that all of them, all 10 million of them, came from the QString constructor in main. If I want to see exactly where, I can go there and it will switch to another view, and I can see where the temporary allocations come from; I can even get line numbers somewhere in there, if the window were big enough, which it is not. Temporary allocations, that's the bottom row; so that actually comes from line 23.

Okay, let's have a look at this code. It's in here; line 23 is where all of my temporary allocations come from. I know probably not all of you do Qt, but think of QString as std::string; it's basically the same. What we do on line 23 is create a string just to compare with it. That is, of course, a temporary object, which shows up as an allocation and, right away, a deallocation of that object. So that's obviously where all of these temporary allocations come from. We can fix that; there are Qt APIs for that. Here I can use QLatin1String.
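The fix looks roughly like this; a sketch with a hypothetical function and literal, since the demo's exact code isn't on the slide:

```cpp
#include <QString>

// Hypothetical match function; the demo's real code differs, but the shape
// of the fix is the same.
bool isMatch(const QString &word)
{
    // Before: constructing a QString just to compare against it means one
    // heap allocation plus one deallocation on every call, i.e. one
    // temporary allocation each time:
    //     return word == QString("needle");

    // After: QLatin1String merely wraps the literal, no allocation happens,
    // and QString can compare against it directly:
    return word == QLatin1String("needle");
}
```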
I rebuild, and now I run heaptrack again: temporary allocations, zero. That looks good already, right? And even if those numbers were mixed up with a lot of other allocations, so that I couldn't see the improvement directly, heaptrack's analyze mode has a diff mode: I pass -d and then my two files, the old recording and the new recording, and it shows the difference between the two. So if I come here, I can see minus 10 million temporary allocations. This is good, right? It's shown in red because it's the biggest number, but it's actually good; I want a minus in there. My new recording has far fewer temporary allocations than the old recording, which also means far fewer allocations overall: not just temporary ones, the overall number of memory allocations went down.

Right, so that's basically it for heaptrack; I need to move on to hotspot. But as you can see, it makes it very convenient to find memory allocations in your code. It is only available on Linux, but this is a Linux conference, so that's fine. It does not yet support recording on one machine and looking at the result on another machine; that still needs to be implemented. We could imagine more great features, but it's already very useful.

All right, now hotspot. The other thing we want to do is profile applications that use too much CPU, run for too long, or even wait for too long. That is something you can do on Linux with perf, but the problem with perf is that it was written by kernel developers. These are really fine people, but making user interfaces is not really what they do. What they made is perf report, a command-line tool, which we looked into, and it's really hard to interpret. We thought: we are Qt programmers, how about we make a nice GUI for this, which is a lot easier to interpret? That is called hotspot, and you can find it on GitHub as well. It depends on a large number of dependencies, but again, we provide an AppImage, so you can install it in one click; that's really easy.

The idea is to do just a subset of what perf can do, because perf can do many, many things. What we want is sampling profiling, of two kinds. The first kind: what is my application doing when it's actually running, when it's on-CPU and the threads are calculating stuff? Where do I spend my time? The other thing hotspot can do is profile the off-CPU time: my application is waiting on a mutex, the network, a file, whatever, and I want to see that and see what the biggest offenders are in terms of waiting for something.

So let me do a demo of that. I'll start hotspot somewhere... if I start hotspot, it shows me, yeah, good enough, I can record data. I have some application... oh yes, let me show you the application first. It's an application that draws a fractal image, and when I resize the window, it is really slow: I'm resizing, and it takes forever to compute the new size. So that's the bug report. How do we figure out why my application is slow? The first thing you need is a benchmark, so that you can reproduce the results. We did that in this application with -b1: the application itself provides a benchmark mode where it doesn't do any user interaction, it simply does one calculation and exits, and that is a benchmark we can then use.

Right, so that's what I can do the recording on. I pass the benchmark here, I enable off-CPU profiling so that I actually do both on- and off-CPU at the same time, and I start recording. Now perf is running the application and recording samples, and when it's done, I can view the results. Again, there's a summary in the first tab that we don't really care about, although there is one thing we always want to look at: how many samples did I get? 10,000, that's good. Sometimes perf messes up and we get eight samples, which is statistically not relevant at all, and you need to rerun or configure perf in a better way. So yes, I have 10,000 samples, which is about 5,000 per second; that's fine.

Now I can go to the flame graph, and if I'm looking at cycles, actual CPU cycles, I can see where my application is taking time. I can see it is taking quite some time in fastNorm; I can look at what that function is doing and whether I can make it faster. It's also spending some time in sqrt, and some time in that firstPass method itself: there are no calls being made there, that is the code of the function itself.

Okay. Now, this program is supposed to be multi-threaded and do all of that in separate threads, so let's have a look at how well it managed to actually run in many threads. I can use the view at the bottom. The stuff at the top is the CPUs, and I can see the CPUs have been doing stuff: I have multiple CPUs running, though it's not orange everywhere, so sometimes some CPUs don't do anything, like CPU 1 over here. More interesting, I can scroll down and see my threads. I have the main thread, two from Qt, and then a whole bunch of threads in my thread pool. And what do I see? Far too many threads, and they keep waiting: most of them run a little bit, then nothing, and then they wait for a long time.
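One of the fixes I'll mention in a moment is simply to size the pool to the hardware. As a minimal sketch, using standard C++ threads here rather than the Qt thread pool the demo actually uses:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

int main()
{
    // Size the pool to the machine instead of hard-coding it: on the
    // 8-core box from the demo this gives 8 workers instead of 20+.
    const unsigned workers =
        std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i)
        pool.emplace_back([] {
            // crunch one slice of the image here
        });
    for (auto &t : pool)
        t.join();
    return 0;
}
```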
And if I want to see what they're waiting for, I can go back to the flame graph and say: show me the off-CPU time, which is really the time where a thread is waiting. Sometimes a thread is waiting for a good reason, right? It waits for work to do. But in this application, they were supposed to go ahead and crunch numbers. So if I look at what they've been waiting for: ooh, there is this mutex lock in here that takes 80% of the waiting time. Now I can go into the code, look at what that mutex is actually locking, and figure out that its scope is far too big; in this example, it can be reduced, and so on. There are a number of things we can do to this code to make the application run much better, and I don't have the time to show all of them: reduce the number of threads, of course (I have eight cores, so there should be eight threads, not 20 or however many), and do the locking in the right way.
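To sketch what "the right way" means for the lock scope (hypothetical names, and std::mutex here instead of the demo's Qt mutex, but the pattern is identical):

```cpp
#include <mutex>
#include <vector>

std::mutex resultsMutex;
std::vector<int> results;

void worker(int begin, int end)
{
    // Too big a scope: taking the lock around the whole computation would
    // serialize all workers; that is what shows up as 80% of the waiting
    // time in the off-CPU flame graph:
    //     std::lock_guard<std::mutex> lock(resultsMutex);
    //     ...compute and append under the lock...

    // Reduced scope: compute without the lock, then lock only for the
    // shared update.
    std::vector<int> local;
    local.reserve(end - begin);
    for (int i = begin; i < end; ++i)
        local.push_back(i * i); // stand-in for the real number crunching

    std::lock_guard<std::mutex> lock(resultsMutex);
    results.insert(results.end(), local.begin(), local.end());
}
```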
But that's the whole idea of this: being able to see where my application is doing some work and spending time on it, and where my application is just waiting for things, and what it is waiting for. Here it's mostly on the mutex. There, it's waiting on a wait condition; that's probably just waiting for work to do, which is probably fine, but you can look into it.

In hotspot, obviously, this is sampling: perf interrupts the program regularly, say 5,000 times per second, and collects backtraces. The whole idea is that if you do that often enough, the data you're looking at is what's relevant. It's not a problem that I may have missed a little bit of code somewhere that never shows up in any backtrace: if it's that small, it doesn't matter. What matters is what actually takes time; statistically, that's what I want to see. But because this is sampling, there is no way to do diffing here, because two runs of the same code would still be a little bit different. We can't do a diff, because we would just get noise in there. It would be nice, but it's simply not possible.

And again, if I didn't know the source code, I could go to firstPass in the Mandelbrot code, open the caller/callee view, and over here I could see the actual lines of code, so I can see where most of the waiting comes from. I can sort by off-CPU time, inclusive cost; this usually works better on a bigger screen. So I sort here, and I would go to this line, 124, and if I open that... somehow, it's supposed to jump there; it must have done it here. Okay, the law of demos: normally you can jump to the line here, but something's broken. Oh well. You would then see the line of code, and that's where the locking happens, obviously.

Okay, right, so that's what I had on hotspot. Basically, what I would like you to do is download, compile, or use the AppImage for heaptrack and hotspot. If all of this was way above your head, you can buy a training from us: the tools are free, but of course we have to eat, so what we do is sell trainings and consulting. And if you have any questions that don't fit here, this is my email, and you can email me. But for now, any questions?

Hi, thanks for your talk. I have a question about cross tools. You said the first tool is not able to run in a cross environment, for example where we collect some data on a small embedded device and analyze it on our regular machine. Do you have any plans to implement this cross functionality? Do you know if it will be implemented, or won't it be?

Actually, you make me realize that what I said was a bit of a simplification. If you have an embedded device, you can use heaptrack on it as long as you have the debug symbols on the device, which I know is a problem in some cases, but there are cases where you can do that: if you have a large enough embedded device, you can copy the debug symbols onto the device, and then heaptrack will work. What doesn't work is what you can do with perf, where you run it on the embedded device without debug symbols and get a trace that is basically just instruction addresses, not resolved to actual function names; you then move that trace to a developer machine, and there you can resolve it with the debug symbols and get stack traces. That is something perf supports out of the box, so for hotspot it works, but it's not something heaptrack does. If you have room enough for the debug symbols, though, you're fine: you can then move the actual recording onto the developer machine and use the GUI to look at it.

Any other questions? You get to run over there; I think it's better for the recording if we have it on the mic.

Can you do pretty much the same thing in batch mode? For instance, if you run something completely non-interactive, can you get some XML file out and download it for later analysis?

I'm sorry, the sound isn't very good; I didn't hear all of it. Did you say some XML file?

Yes. Can you do it in batch mode, completely non-interactively? For instance, you run your application somewhere on a cluster, you get some output in XML, and then you download it for the analysis.

Yes. To make it easy, I showed recording from hotspot, but what I actually do most of the time is record from the command line. You can do perf record and then the same command line, and then, well, you need to ask for backtraces to be included, with --call-graph dwarf; that's why it's annoying and that's why you like the GUI. But you can do the recording on the command line, which means you can script it and automate it as much as you want. That part is the standard perf part: you can do perf record just like you would normally; what hotspot provides is the GUI on top. At least for cycles and similar events, for the on-CPU recording, that's all there is to it. The off-CPU profiling is a little more tricky: hotspot does use perf for that, but there isn't a single event for this in perf, so there is a bit of post-processing happening in hotspot to be able to understand it. It is still possible to do the recording on the command line; I have an alias for this, an off-CPU perf record, but if you look at what it expands to, it's a whole bunch of things that you need to put in there. What you can do is use hotspot once on your machine, go to the summary tab, and check the actual command that it ran, which is shown here; that you can copy, paste, and run again in the future outside of the GUI.

Okay. Yes, at the back over there. Can you bring the mic here? Sure, thanks.

For hotspot, is there any limitation that would prevent it from working with data sampled directly from the kernel instead of user-space programs? Or can I just feed my perf recordings of kernel events into that GUI and it will give me something useful?

Oh yes. As long as you can make a perf recording, it will work in there. I have proof: I can see kernel symbols at the top there. It's all the same to perf, so it's all the same to hotspot.

Okay, that's great.
And all you have to do, because that's the little missing link I didn't show: if you start hotspot in a directory where there is a perf.data file, as recorded by perf, hotspot will just pick it up. The thing it does on startup is load the file from the current directory. So that's it, right? Now it's showing the recording I just did.

Does this work for all architectures? Intel, ARM, PowerPC, MIPS?

If perf works, hotspot works, right? Because it's just a GUI. As far as I know, perf does work on all of those architectures. It is part of the Linux kernel, so it is as portable as the Linux kernel.

Okay, thanks.

And for those of you who want a bit more advanced usage: when you record with perf, you can choose which events you're interested in. I've been showing only cycles, but it could be anything else, and it will show up here in this dropdown. If I do perf record with my own alias, that one enables a few more things; let me do that. All right, I have enabled more events, like instructions and memory loads, or cache misses, these kinds of things. So it's loading all of that... right, and now I can see instructions and cache misses and stores, because this is what I've recorded. Hotspot will pick up whatever you have chosen to record in the perf.data file.

Any other questions? Yes.

With the heaptrack tool, what happens if the application quits or is killed before the report is written?

Do you mean for perf or heaptrack?

With heaptrack.

Yes, with heaptrack the writing happens at the end. I'm wondering... all we need is a signal handler to catch the termination of the application. I'm not sure if we have that; let's try.

I think you said it attaches to the program?

It can attach to the program; it has both modes, just like GDB. What I'm doing here is simply starting the application under heaptrack, which basically sets LD_PRELOAD to a .so file. That's the easy setup, if you can use LD_PRELOAD. It also supports attaching to a running app with -p and the PID. But those are two different modes, right? So I think if you can attach, you can detach as well. But this one runs too fast now; I can't quit it before it's done. Okay, let's do something much bigger. Yeah, I think that worked. Yes, it works. I pressed Ctrl-C to kill it; it's not exactly as if it died, but close enough, and the signal handler caught it and wrote out the file, so that works.

Because I think that's an important feature for embedded systems. Many of them are developed to be safe when switched off, and so no effort is put into a clean shutdown.

Right, that is true. Sometimes you just can't quit the application.

Yes, or the system is safe against power loss, so you don't spend any effort on a real clean shutdown. And that's the point where many of the memory heap trackers fail.

That is true. That's another advantage of doing it this way. So you can just interrupt the app or kill it, and it will catch that. I guess I can do the same with kill -9... well then, maybe not for all of them, because I might have a few lingering around. But let's kill it. And there you go: it did say it just generated a file here, and then I can open the file. So, just kill your app.

Okay, thank you.
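As a sketch of the signal-handler mechanism we were just discussing; this is the general technique, not heaptrack's actual code. Note that SIGKILL (kill -9) can never be caught by a handler, so surviving that requires the tool to stream its data out continuously instead.

```cpp
#include <csignal>
#include <cstdio>

// Stand-in for writing out the recorded allocation data.
static void flushProfileData()
{
    // Real handler code must restrict itself to async-signal-safe calls.
    std::fprintf(stderr, "flushing profile data\n");
}

extern "C" void onTerminate(int sig)
{
    flushProfileData();
    // Re-raise with the default action so the process still dies with the
    // expected status.
    std::signal(sig, SIG_DFL);
    std::raise(sig);
}

int main()
{
    // Catch the catchable termination signals; SIGKILL cannot be caught.
    std::signal(SIGINT, onTerminate);  // Ctrl-C
    std::signal(SIGTERM, onTerminate); // plain kill
    for (;;) {
        // application work
    }
}
```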
I have a question. Heaptrack is, as you said, similar to what Valgrind does; what are its limitations with respect to Massif or its memory leak profiler?

The limitation compared to Massif? The difference is that Massif is really slow, right?

But Massif is probably able to detect leaks of kernel memory or something like that, to track low-level allocations that heaptrack cannot.

As far as I know, Massif hooks into malloc and free using Valgrind, which gives you the same result as heaptrack, except much slower. If you're interested in kernel memory, then of course you need to go with lower-level tools like perf or LTTng, which can look at these things; but that is a different level. I realize I may be talking to the wrong audience here: these are tools we made for application developers, and here there are lots of kernel developers. So it's good to point out that perf can be used with kernel tracing; but heaptrack is meant for applications, not anything lower-level than that.

Just one last question: does heaptrack track page allocations directly, mmap or the page allocation calls into the kernel, or is it a preload library that hooks into the malloc calls, something like that?

It is a dynamically loaded library that hooks into malloc and free. I don't think it hooks into mmap, because from an application developer's point of view, mmap does not allocate memory; I know it kind of does, but it's not the same kind of memory.

Okay, thank you.

But that could be added if we wanted to, right? All right, thanks everyone; have a nice conference.