Hello, my name is John Ogness, and welcome to my talk, "Tracking Down What Happened: A Comparison of Tracing Solutions," where I'll be comparing four different tracing solutions available on Linux.

When I first started thinking about how to do this talk, I wasn't sure how to present it without being biased. I do a lot of tracing myself, and there are certain tools that I prefer, but I really wanted an unbiased approach to choosing a tracing solution. I also didn't want to choose a problem where I knew in advance that certain tracing solutions are better than others. So I attacked it from another direction: first I defined a set of objectives that I wanted to get out of this talk, and I went through it like this.

The first thing I did was define a scenario and a problem that I would like to investigate with tracing, without thinking about any tracers or any particular application. Once I had defined the problem, I would go and choose an application where I could investigate that particular problem. Then I would use the four main tracing solutions available for Linux (ftrace, eBPF, SystemTap, and LTTng) to perform the tracing needed for this problem and scenario.

At the end I rank them based on usability, their ability to present the results, how much they affected the runtime of the application while they were tracing, and how available they are on particular distributions. I chose the four main base distributions: Debian, Fedora, openSUSE, and Arch. So: how available were the tools, and how well did they work on these distributions? And then, of course, an overall summary.

Just a quick disclaimer. Be aware that the ranking at the end (and I know the ranking is what a lot of people really want to see) is based on this particular problem and scenario. For a different problem and a different scenario, the ranking may actually be different, but hopefully this is realistic enough that the ranking applies to many situations.

So the first thing I did was decide what problem I want to investigate, and a common complex problem involves inter-process communication. When we're writing applications, or embedded applications, this is something we often really need to focus on: what is the problem if we feel that our inter-process communication is not efficient, or we're not meeting certain deadlines, or there are certain latencies and we need to know where they are coming from? Obviously there are lots of great profiling tools out there, but the problem with profiling tools is that it's really hard to understand what's happening during the inter-process communication. For example, if I want to know what my maximum latency was: is this maximum latency normal? Does it happen all the time? If I were to create a histogram of latencies, is this a huge outlier, or does it fall within the norm? And of course the most important thing: what happened when we hit this maximum latency? Why did this latency occur? If we're using profiling tools, usually you can get some information.
OK, we know there were a lot of page faults, we know there were a lot of CPU cycles, the CPU was busy. But what you really want is a sequence of events: what actually happened during this maximum-latency situation? I want to know who was on the CPU, who got scheduled, who was supposed to wake up. For these kinds of things you really need tracing. Tracing is an optimal method for the inter-process communication scenario, and that's why I chose inter-process communication as the problem I want to investigate.

Now, I also needed to set up some conditions for these tests: what kinds of things do I want to do, so that I have a basis for deciding whether these tools are good enough for me. Obviously I first need to choose an application that uses inter-process communication. And what I decided is that I want to measure every single inter-process communication that this application performs. If it sends thousands of pieces of IPC data back and forth, I really want the latency of every single one of these measured, the time from send to receive, for every single one. It must not be sampled or anything like that; I really wanted to measure them all. That was important to me. Then, for the transfer with the greatest latency, I want to be able to see the scheduling activity that took place during it. So we're running the application that's doing inter-process communication, one of these transfers is the longest of them all, and I want to know everything that was being scheduled during that time. And the last criterion is that I didn't want to have to modify the application. I'm going to choose an application that already exists and use it as it is; I don't want to have to write code or anything like that. Those were the conditions I set up for these tracing tests.

Now, the application I chose: I'm a real-time developer, and there's a tool called hackbench that we use to generate a lot of scheduling load on the CPU. hackbench is a great tool to just blast the scheduler and make sure that everything is being scheduled correctly. And the way hackbench works is that it uses a lot of inter-process communication.
And so I thought this would actually be a nice application for my tracing tests, if I use the inter-process communication happening within hackbench as the inter-process communication that I want to investigate. The way hackbench works is basically that you create n senders and n receivers, and a little piece of data is sent through Unix local domain sockets from each sender to each receiver. So with ten senders and ten receivers, there are ten senders each sending to ten receivers, which makes a hundred inter-process communication events taking place. That's the way hackbench works, and that's how I'll be using it.

Now, the test system I chose for all of the tests whose results we're going to see was Fedora Linux 34, simply the current version to download; it was actually released last May. I'm not really a big Fedora user, but the reason I chose Fedora is that it was the only distribution that supported all of the features in all of the tracing solutions I will be presenting. It was the only one that could do them all, so it was the obvious choice for performing these tests. When I'm running hackbench, I'm doing it with the real-time scheduling policy SCHED_FIFO at a real-time priority of 70. That's just to make sure that any other noise on the system doesn't really affect my test, because I'm going to do timing tests and I really want to compare these timings fairly, so I used a real-time policy and priority. The value of n, the number of senders and the number of receivers, I chose was 20, so I have 20 senders and 20 receivers, which means there will be 400 inter-process communication events during the test.
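To make that concrete, such a run could look roughly like the shell command below. This is a sketch: the hackbench flags shown (-g for groups, -f for the number of senders/receivers per group, -l for loops) are assumptions that vary between hackbench versions, so check your local --help.

```sh
# Run hackbench under SCHED_FIFO at real-time priority 70 (chrt -f).
# One group (-g 1) of 20 senders and 20 receivers (-f 20), one loop
# (-l 1); Unix local domain sockets are hackbench's default transport
# (-p would switch to pipes instead).
chrt -f 70 hackbench -g 1 -f 20 -l 1
```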
Now, when we're talking about Unix local domain sockets, where one process sends a bit of data through a Unix local domain socket to another process, what actually happens there? What is interesting when I'm tracing? When we talk about tracing, we need to talk about trace points. These are points in code, either in user-space code or in kernel code, that are interesting: when the CPU goes through them, you want to generate an event at that moment, or do something at that moment. They are points in code that are interesting for the thing you are investigating. For inter-process communication with Unix local domain sockets, there are four points that I identified as critical for tracking the IPC.

We start off with the sender in user space. The sender is going to do a write on that Unix local domain socket file descriptor to send the data, and right before that write is called is the first interesting point. That's where I'm going to take a timestamp; that's where I start my clock: how long does this inter-process communication take? Now, when the write system call is made and we go down into kernel code, at some point the kernel is going to allocate a socket buffer for this Unix local domain socket communication. When that socket buffer is allocated, we're still in the context of our sender, but we're allocating the buffer in the kernel. That's the next interesting point, because we're still the sender, but now we're in kernel space.

When this buffer gets handed over to the receiver, the receiver in kernel space is the one that will pull the data from the socket buffer, and then it will actually free that socket buffer. The data is not copied in the kernel; the buffer is handed over to the receiver, and the receiver is the one that frees it, in kernel space. That is the third interesting event, because we're still dealing with the same buffer, but now we're in the context of the receiver. So we had the sender context that moved down into the kernel; now we're still in the kernel, but in the context of the receiver. And then the final event is when we come back into user space as the receiver: the receiver comes out of the read system call, and at the moment it comes out of that system call, that's where we can take a timestamp and stop our clock. Basically, we want to know the time from when write is about to be called to the time we return from read. That is what we're going to call our IPC latency: how long it took to transfer the data from A to B. So those are the four events that are interesting for that.

Now, the scheduling activity. I also said that for the longest latency I want to know everything that was on the CPU or got scheduled during that time. There are two events I can use for that, called sched_wakeup and sched_switch. sched_wakeup is when a task becomes runnable: it should be put on a CPU; it's woken up. And sched_switch is when the task actually gets on the CPU. These two events are quite interesting when you're looking at scheduling, just to see who was getting woken up and who actually made it to the CPU. Keep in mind that these events are interesting during the period of the greatest IPC latency; that's why I'm collecting them.

Now, for events two, three, five, and six there are already trace events in the kernel. The kernel already has these set up, so it's really easy to use those events.
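For reference, the kernel side of these events can be checked through tracefs on a live system. The event names below (sched:sched_wakeup, sched:sched_switch, kmem:kmem_cache_alloc_node, kmem:kmem_cache_free) are my reading of the events being described; verify them against your own kernel:

```sh
# tracefs is typically mounted at /sys/kernel/debug/tracing
# (or /sys/kernel/tracing on newer systems).
cd /sys/kernel/debug/tracing
grep -E '^sched:sched_(wakeup|switch)$' available_events
grep -E '^kmem:kmem_cache_(alloc_node|free)$' available_events
```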
For events number one and four, this is user-space code, and there is no existing event, so I will actually have to create events from user space. There are a couple of mechanisms to do that. There's something called user-space probes (uprobes), where you can dynamically choose points in the executable where these events will be emitted, and there are user statically defined tracepoints (USDT), where I actually modify the code and add the trace points into hackbench. Now, I already mentioned that I don't want to modify code, so I'm only going to use USDT if I have to; we'll see whether I have to or not.

So now we have these different points defined: the four points that are critical for tracing the inter-process communication, and the various events dealing with scheduling, so we can see what's on the CPU during that maximum IPC time. The way I'm able to coordinate all this is that each event, each trace point, has certain information that is also made available; there are certain attributes. For example, at the user-space event, if I know the PID, that's useful: now I know that this task is about to do the write. When the buffer is being allocated in the kernel, there's a kmem_cache_alloc_node event. There I obviously also have the PID, but I also get, for example, the pointer address of the buffer that was allocated. With the free, I again have the PID and the pointer of the buffer that is being released. And obviously when we get back into user space, I also have that PID.

Using this information I can coordinate and create a flow from point one to point four. For example, if I'm at the second event, the kmem_cache_alloc_node event, I can check: did I previously see the first event? Because they are synchronous in time; if I saw that one, then I know we are dealing with this scenario. Likewise, when I get to the third event and I see that this pointer is the same pointer that the previous event stored, then I know which transfer we're dealing with. There's always one attribute that matches with the previous step, so that I can follow the data through. At the end I know the timestamp and the PID of the sender, plus my own PID and my current timestamp; I can put that all together and have the latency right there. The scheduling side is actually quite simple: you have schedule wake-ups and schedule switches, and you need to associate which switch goes to which wake-up. You can see that because the wake-up is for a certain task, a certain PID, and with sched_switch I see which task is now getting on the CPU. Measuring the distance between these two is also the effective latency between waking a process up and it being scheduled.

OK, so with this information, this is everything I need to begin choosing tracers. I haven't talked about any tracing method yet, but now I have actual trace points in the kernel that are available, I can create the user-space trace points (either through uprobes or through USDT), and then I can use these trace points, with exactly this information, to perform my tracing. So now it's just a question of which tracing tool to use.

Let's begin with ftrace. ftrace is built into the kernel; every Linux kernel has the ftrace tracer inside it. It supports all the kernel trace events that are there, and it has full support for uprobes, the dynamic user-space trace events that you can define on demand. It's available through tracefs, which is a pseudo-filesystem similar to proc or sys, and basically you use echo and cat to set things up or to read current settings.
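As an illustration of how far echo and cat go, even the two user-space probes for events one and four can be created this way. A minimal sketch: the binary path and the 0x... offsets are made-up placeholders; you would normally extract the real offsets from the binary with objdump or readelf.

```sh
cd /sys/kernel/debug/tracing
# Event 1: a uprobe right before the sender calls write().
echo 'p:ipc_start /usr/bin/hackbench:0x1234' >> uprobe_events
# Event 4: a uprobe right after the receiver returns from read().
echo 'p:ipc_stop /usr/bin/hackbench:0x5678' >> uprobe_events
echo 1 > events/uprobes/ipc_start/enable
echo 1 > events/uprobes/ipc_stop/enable
```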
So with just echo and cat you can basically do anything you need to do. There is a slightly higher-level tool called trace-cmd that does the echoing and catting for me, but that's a convenience thing; really, all you need is echo and cat.

The ftrace tracing solution is capable of processing data in the kernel. This whole chain (we saw that we have to look at previous events to decide whether this event corresponds to a previous event, because all this IPC happens at the same time and we need to be able to follow the different IPC paths) is something ftrace can track live on the running system. There are features called synthetic events and histograms, and it can use those to track these different IPC flows. Traces are also available for recording, so you can do post-processing afterwards as well, and there's a graphical tool called KernelShark that allows you to view these recorded traces graphically, which can be really helpful if you have some sort of complicated contention situation. So that's also there: KernelShark.

Here is a sample of the ftrace commands. You can see I'm doing everything with echo; there's no fancy tool that I need, and I'm just working in /sys/kernel/debug/tracing, which is usually where tracefs is mounted. You can see this is where I set everything up. There are comments in here that I'm not going to go through right now, because we don't have enough time, but this is just showing how I set up the various synthetic events so that I can track each of the points in this flow. Initially I'm interested in the timestamp before the write; then I want to know the timestamp and the PID of the sender when we're allocating the buffer, and again when we're freeing the buffer. This information keeps being passed on to the next step, and finally, at the end, I have the sender PID, the sender timestamp, the receiver PID, and the receiver timestamp. So this is just collecting all that data through these four steps by creating several synthetic events.
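To give a flavor of what those echo commands look like, here is a heavily simplified sketch of the technique: histogram-trigger variables carry a timestamp and PID from one event to the next, and a synthetic event fires at the end of the chain. It shows only two of the four hops, and the field names (ptr, common_pid) are assumptions, not the exact setup from the talk.

```sh
cd /sys/kernel/debug/tracing

# Declare a synthetic event carrying the latency and the two PIDs.
echo 'ipc_end u64 lat; pid_t spid; pid_t rpid' >> synthetic_events

# On buffer allocation, save a timestamp and the sender PID,
# keyed on the socket buffer pointer.
echo 'hist:keys=ptr:ts0=common_timestamp.usecs:spid=common_pid' \
  >> events/kmem/kmem_cache_alloc_node/trigger

# On the matching free in the receiver, compute the latency and
# fire the synthetic event.
echo 'hist:keys=ptr:lat=common_timestamp.usecs-$ts0:onmatch(kmem.kmem_cache_alloc_node).ipc_end($lat,$spid,common_pid)' \
  >> events/kmem/kmem_cache_free/trigger

# View a histogram of the results, sorted on latency.
echo 'hist:keys=spid,rpid:vals=lat:sort=lat' >> events/synthetic/ipc_end/trigger
cat events/synthetic/ipc_end/hist
```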
Later I can take a look at the data that was collected, and here you're seeing what that looks like. After I've run hackbench (again, I'm just using cat) I can look at the histogram file of the synthetic event ipc_end, my final synthetic event where everything comes together, and here's basically what it looks like. It's not automatic: I have it set to sort on these latencies, the difference from the send to the receive, so I can really easily see the different latencies that occurred during this inter-process communication. We see that there were exactly 400 of them. It's a running system, I start hackbench, and at the end I have the 400 events, sorted on latency.

From here I can also see, for example, the largest latency, which was 6.2 milliseconds, coming from process 10022 sending its data to process 9985. Actually, 9985 is involved in the top four worst cases; this poor process got a lot of latency in the messages that were sent to it. From this information I can go back and look at the trace, for example to grab the actual timestamps of the worst case. So this is that worst case, from 10022 to 9985, and these are its timestamps. That's not automatically generated. It is possible to have the synthetic events create this too, but it's a very simple piece of information that I just wanted at the end, so I manually parsed it from the trace data.

Now, sometimes what you'd like to see is a nice histogram, and with a little bit of Python code (there's not a lot of it; Python is really easy for things like this) I could take the data we just saw, which is also a histogram but is really more like a sorted hash table, and turn it into something that gives a really nice overview. Because what we wanted to know is: is that an outlier? Is this really an extreme situation, or does it fit within the normal picture? And we're seeing that this maximum case, sitting in the 6 ms to 6.25 ms bucket, is actually totally normal within the curve. This isn't some huge, crazy outlier in the latency. So we can see that really easily.
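The Python involved really is small. Here is a sketch of the idea, assuming the latencies (in microseconds) have already been pulled out of the hist output into a list; the bucket size and input values are illustrative.

```python
from collections import Counter

def print_histogram(latencies_us, bucket_us=250):
    """Bucket latencies into fixed-size bins and print a text histogram."""
    buckets = Counter(lat // bucket_us * bucket_us for lat in latencies_us)
    for start in sorted(buckets):
        print(f"{start:6d}-{start + bucket_us - 1:6d} us | {'#' * buckets[start]}")

# Example with made-up values; the real input comes from the trace data.
print_histogram([480, 510, 650, 700, 820, 6200])
```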
And then, finally, with the KernelShark tool we can take a look at these traces graphically and see what they look like. This is the exact same data, all from one run of hackbench. You can't really see it on the slides, it's too small, but this is the same PID; here's the receiver PID, shown graphically. We can see that the CPUs are pretty loaded up the whole time, and if I were actually in the tool, I could see that these are all different hackbench processes running. Just for fun I added some more tasks, because you can look at individual tasks: this, for example, is an RCU task, and we see several kworkers there. And these last two are the actual worst-case pair that's communicating: down here is the sending process, and here's the receiving process. You can see that the sender was done here, but during this whole time the receiver was not scheduled until the very end, and that is that maximum latency. With KernelShark I can pick any particular task to look at and investigate why it did not get scheduled during all this time. We can see there's lots of stuff running, but how fair is it that this task had to wait behind all these other tasks? These are the kinds of things I can easily investigate.

So, the final results for ftrace: the runtime of hackbench increased 44%. Now, this 44% is a real number, but obviously hackbench does a lot really fast, so it's a small amount of time; we're talking about six milliseconds here. Normally it was around four and a half milliseconds, and it increased to around six and a half milliseconds. It's a very small window, and hackbench is doing pure inter-process communication, so the tracer is really doing a lot. If it were a normal application where lots is going on, and what I'm tracing is a small component of that, then you might not see this overhead. Nevertheless, this number is valid as a comparator between the trace methods I'm going to show, because they're all using the same program and I'm comparing how this four-and-a-half-millisecond runtime changes with the different tracers. So it is a valid number for comparing between them, but it might be slightly inflated because I'm using an application that does nothing but inter-process communication, very heavily. So the runtime for ftrace increased 44%.

On the positive side, ftrace has in-kernel processing: at the end, I run hackbench, then I just cat this file, and boom, I can immediately see what my worst latency was. I have graphical viewing with KernelShark, which is very comfortable for looking at what's going on in detail. All I really need is echo and cat to do the tracing, and I can record the tracing, take it offline, and look at it later if I want. There's not a lot you have to install on the machine; you need a kernel. And on all four of the base distributions I tested, ftrace was available, which you would expect, because it's part of the kernel.

Now, some of the downsides of ftrace. You are most likely going to have to do some manual post-processing: some grep and sed, maybe awk, on the traces themselves. Or you'll have to take the results from the synthetic event's hash table and do something like what I did to get the worst-case timestamps. When you're dealing with ftrace, there's usually some of that to do; it's pretty rare that you can say "I only need KernelShark and I can do everything." You'll probably have to look at the traces manually to get some information. Another downside is that, unfortunately, on Debian, synthetic events are not enabled in the kernel. The other three distros have them enabled, but Debian Bullseye does not, so hopefully they will change that in the future; I will certainly file a bug for this.

Moving on: eBPF. eBPF is not a tracer; it's actually built on the virtual-machine runtime inside the kernel. The kernel has its own virtual-machine runtime, and with eBPF I can write programs, load them into the kernel, and they will run inside the kernel in this virtual machine.
Now, when it comes to tracing, you can attach these eBPF programs to trace points, which means that when the CPU goes through a trace point, it can execute this little virtual-machine program, and I can get information that way. This is how I can use eBPF to perform tracing. eBPF can be attached to trace points and to uprobes, the user-space probes, so everything I was able to do with ftrace I can also do with eBPF, as far as the type and number of supported events goes.

There are several different methods for using eBPF. There's the BCC compiler, with which you can write really complex applications; this obviously requires a lot more know-how, but it's also much more powerful. However, there is a very nice tool called bpftrace, and I highly recommend taking a look at it. It has its own script-like language, and you can use this very simple scripting language, and bpftrace will compile the eBPF programs for me and load them into the kernel. bpftrace makes it very, very easy. I was debating whether I should use BCC or bpftrace for this talk, and I found that bpftrace really has everything I need. It is immensely simpler and it works fantastically, so I used bpftrace for the eBPF part of this talk.

The data can also be processed in-kernel through the eBPF programs. When I'm at an event, I can decide whether it's part of the flow I'm tracking and store that information, and when the next event comes, it can check whether it's the third event in the chain and store its information too. So we can do all of this processing live, just as I was able to with ftrace: as soon as the test is done, the data is already there. And eBPF has support for histograms. For ftrace I had to write a little Python program to generate one; with eBPF you get that for free.

Here's an example of what the bpftrace code looks like: the scripting code I give to bpftrace to do the things I want. Just as with ftrace, you can see there are four components here for the four different events. At the beginning, all I'm doing is storing a timestamp; then I start dealing with the pointer of the buffer, the timestamp, the sender PID, and so on. As each event progresses in this flow, I'm storing more and more data, and on the final event I can look at the worst case and everything else to determine the situation. You'll see this pattern in all four of these tracing solutions: four components where we build up these events. They're not called synthetic events here; they're really just hash tables (maps).
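To give an impression of the language, here is a cut-down sketch of the same idea in bpftrace. It shows only the two kernel-side hops, and the probe and field names (tracepoint:kmem:..., args->ptr) are assumptions for illustration, not the exact script from the talk.

```
// ipc.bt: chain two events through maps keyed on the buffer pointer.
tracepoint:kmem:kmem_cache_alloc_node
/comm == "hackbench"/
{
    // Remember the timestamp and sender PID for this buffer.
    @ts[args->ptr] = nsecs;
    @spid[args->ptr] = pid;
}

tracepoint:kmem:kmem_cache_free
/@ts[args->ptr]/
{
    // Matching free in the receiver: record the latency (us).
    @lat_us = hist((nsecs - @ts[args->ptr]) / 1000);
    delete(@ts[args->ptr]);
    delete(@spid[args->ptr]);
}
```

Running it would just be `bpftrace ipc.bt`; when the script ends, all maps are printed automatically, and since bpftrace will not sort map entries for you, piping the output through something like `sort -n` is an easy workaround.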
So here is the result. As soon as hackbench under bpftrace is done, I can immediately see this information; the eBPF program actually generates it and prints it out for me. Here I can see, just like what I generated before with my Python program for ftrace, that eBPF tells me immediately that 5.6 milliseconds was my worst event. The output says nanoseconds, but it's actually microseconds; so yes, 5.6 milliseconds was the worst case for eBPF, and it tells me that the worst situation was from PID 17557 to PID 17520.

With eBPF I also generated all of the scheduler-switching information, so I can take a look at that as well. It shows me all the different schedule wake-ups and switches and so on, so I can identify what the scheduling was between the beginning and the end of the worst IPC communication. It's not graphical, but it's all there. And it's not sorted: with bpftrace it's not possible to sort the hash-table entries, but it's very simple to just sort based on, say, the second column, so it's not a big issue. If I were using BCC, I could sort the hash-table entries; with bpftrace that's not possible, so I had to do a little bit of work to get it into this form, but that's not much. And I can also see, again not graphically but it's all visible, everything that's happening. This makes it a little more complicated, because now I have to work out which wake-ups pair with which switches; in KernelShark this was displayed graphically, and you could really easily see when these different things were happening.

So, a summary. eBPF was the fastest: I only had a 31% increase in runtime; it was faster than ftrace. It has the in-kernel processing, so as soon as the run is over, I have the results immediately. The histogram is automatically generated for me. And all four of the base distributions I tested have support for eBPF.

The downsides of eBPF: obviously you're going to need libbcc for the compiler, plus libclang and LLVM, relatively large libraries that you will need on your system. For a desktop or a server that's not a problem; for an embedded device it might be too heavy, depending on what you need. As I also mentioned, for viewing the traces I'm creating the format myself. I'm printing things out in my own form with my own events, and there is no viewer available where I can really easily look at this information, so I do have to do some manual processing and sorting and grepping to get the information I need. And again, on Debian, uprobe offsets are not supported, so this test wouldn't actually run on Debian, because Debian doesn't support the uprobe offsets that I need. I was not able to do that there, but there was an open bug on Debian, and I will reopen it.

OK, moving on to SystemTap. SystemTap feels a lot like eBPF, but the way it works is conceptually totally different. With SystemTap you also write a script, similar to bpftrace, but this script is converted into C code, compiled as a kernel module, and loaded into the kernel that way. So you're actually running as a native kernel module, not in a virtual runtime as with eBPF. SystemTap also supports all of the trace events: I can write a script, it's compiled into a kernel module, and that module attaches code to the various trace events.
It also supports uprobes, however only based on source line numbers. I can't, for example, choose an actual offset in the text segment, an offset in the executable where I want my uprobe to be inserted; I have to choose a line of code. That will usually translate to what you want, but not always, because a line of code can do many things, and if you're using optimization, it's even more difficult to translate a line to what you actually want. So being able to use offsets is actually quite important. But for this investigation, line numbers were good enough: there was a line I could specify that said, yes, this is right before we call write, and this is right after we've called read.

The tool you use with SystemTap is called stap, and as I mentioned, there's a scripting language available. You just give stap your script, and it will compile the kernel module for you and load it. SystemTap also supports in-kernel, live data processing. Just like with eBPF and with ftrace, you can do the live flow-control processing, so the moment the run is done you can take a look. In fact, you don't even have to wait for it to be done: at any moment you can take a look and ask what the worst-case latency seen so far is, and from which PID to which PID. It's doing this live, the whole time, so you can look at it any time you want. SystemTap also supports outputting histograms, similar to bpftrace, and it supports iterating over hash tables, which is something that bpftrace unfortunately does not support yet; if you really need that, maybe take the time to learn BCC.

Here's what the script looks like in SystemTap. It's extremely similar to bpftrace; both of them are basically modeled on DTrace, so it's almost one-to-one, just syntax changes. If you compare setting up these four different probes against the four bpftrace programs we had before, it's almost identical. So if you know bpftrace, SystemTap is really easy to switch to, and if you know SystemTap, moving to bpftrace is also quite easy.
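For comparison, the same two-hop fragment might look like this as a SystemTap script. Again, the probe points and the $ptr field are assumptions for illustration, not the script from the talk.

```
# ipc.stp: the same two-hop chain, SystemTap style.
global ts, spid, lat

probe kernel.trace("kmem_cache_alloc_node") {
    if (execname() == "hackbench") {
        ts[$ptr] = gettimeofday_us()
        spid[$ptr] = pid()
    }
}

probe kernel.trace("kmem_cache_free") {
    if ($ptr in ts) {
        lat <<< gettimeofday_us() - ts[$ptr]   # aggregate the latency
        delete ts[$ptr]
        delete spid[$ptr]
    }
}

probe end {
    if (@count(lat))
        print(@hist_log(lat))   # print a latency histogram on exit
}
```

Running `stap ipc.stp` compiles this into a kernel module and loads it.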
Here's what the histogram looked like at the end for SystemTap. It also prints out for me the worst latency that I saw, and the two PIDs that were involved in that worst IPC. It prints the list of scheduling events as well, and since with SystemTap I'm able to iterate through the tables, I was able to sort it automatically, so it prints a time-sorted list, and I could do things like adding some metadata. For example, it would print and say: these are the lines dealing with the sender, and here are the lines dealing with the worst-case receiver. So I could iterate through this hash table and print this information out, and this is also done live. At any time (say hackbench doesn't end, it just keeps going and going) I can print this information out, annotated and everything. So it's actually quite nice; this live processing is quite helpful. Again, we have no graphical tool, so I will have to work out later what all the stuff in the middle actually means, or who belongs to what, but the information is there.

The advantages of SystemTap: obviously you have the incremental processing, which is really nice to have, because many times you can't just have your system run, stop, and be done. It's just running, and you want to go in, start grabbing the information, and view it live; incremental processing is really nice for that. You have the histograms, which can be easily generated, and the sorted hash-table viewing, which is really automatic: things just come out of the kernel in the format you want to have them in. SystemTap also has things called tapsets. These are basically script functions that have already been written for me, and there are a lot of really nice functions I can use to make my scripting much easier. If I just want to grab a stack trace, or find out my scheduling state, or something like that, there's a whole bunch of tapsets available, and I basically call a function and get this possibly complex information. Some are network-specific and so on, which makes it really nice for writing more complex scripts. (There's a tiny tapset example after this summary.)

It's available on Fedora, Debian, and Arch. Unfortunately, on openSUSE, I really tried to get it to work, but I could not get it to compile. It was Tumbleweed that I was using, so the latest and greatest from openSUSE, but it kept complaining about missing headers. I looked at it with strace, and it was finding the headers, so I'm not sure why the stap compiler claimed it couldn't find them. Sorry, I couldn't get it working on openSUSE, so openSUSE is not in that list.

The runtime increased 84%, which I actually put in the cons section, because this is almost double what the other tracing methods had. This is definitively slower. Maybe the scripts I was writing are particularly inefficient, I don't know what the deal is, but SystemTap is definitely affecting the runtime more than ftrace and eBPF; in fact, it is the slowest of the four. The uprobe support is limited to line numbers; this might be OK for you, but a lot of people optimize their applications, and then a line number might not mean anything, so that might not be enough. It has a huge set of dependencies, because we have to compile kernel modules. It is possible to build these offline and then just load them, but that gets quite complex; these are just dependencies to be aware of. And because we're compiling kernel modules, they might not actually build: if you're using some sort of exotic third-party kernel, maybe for some reason you can't even build the kernel modules. It's complex, because it relies on kernel-module code templates to build against the kernel. That wasn't the problem on openSUSE, but it can be a problem in general.
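As a tiny taste of the tapsets mentioned above: the scheduler tapset wraps the raw scheduling trace events in ready-made probe points with convenience variables. Something along these lines should work, with the caveat that the exact probe and variable names depend on your tapset version:

```sh
# Print every wakeup using the scheduler tapset instead of raw events.
stap -e 'probe scheduler.wakeup { printf("%s wakes pid %d\n", execname(), task_pid) }'
```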
So, the last one: LTTng. LTTng is different from the other three because it is a complete tracing infrastructure. It has kernel modules, it has user-space libraries, it has user-space tools, all kinds of things, and LTTng is really meant to be a complete solution that you adopt for all of your tracing needs: for user space, for the kernel, for kernel plus user space, anything. So it's more than just a tracing method; it's really an infrastructure. In particular, the libraries are very powerful and make it really easy to write programs that, for example, parse the data.

LTTng supports all the kernel trace events, and it supports uprobes, but the uprobe support is really limited to symbols. I can only specify a function; I can't specify a line of code or an offset inside the program. Unfortunately, that wasn't good enough for this investigation, because I needed uprobes in the middle of the sender and receiver functions; it's not enough to just probe the sender function. So for LTTng I needed to use USDT, which means I had to modify hackbench and add those trace points manually.

The tool available for LTTng is called lttng; that's basically the tool to start the sessions, stop the sessions, and look at things. The traces are recorded, and they are then post-processed. LTTng does not offer live data processing like the other ones did, so you have to be aware that you're going to record a trace and then post-process that trace at a later point, using the various tools and libraries. It also includes a graphical viewer called Trace Compass, which we'll take a look at in a minute; it's quite a powerful graphical viewer.

Here's just a quick look at the patch: I actually had to modify hackbench and add these two trace points. There's a bit more code than this (there's an include, and there's another template file that's generated), but this is the actual code that had to be modified. I just needed to insert the before-write and after-read USDT trace points so that LTTng could react to those events. Before, using uprobe offsets or even line numbers, I could just specify this line here, for example, and it would do the same thing without modifying the code.
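For concreteness, USDT instrumentation of this kind often boils down to a couple of macro calls. Here is a sketch in the classic sys/sdt.h style; the provider/probe names and the surrounding functions are illustrative, not the actual hackbench patch.

```c
#include <sys/sdt.h>   /* DTRACE_PROBEn() static probe macros */
#include <unistd.h>

static void sender(int fd, const void *buf, size_t len)
{
    DTRACE_PROBE1(hackbench, ipc_start, fd);   /* event 1: before write */
    write(fd, buf, len);
}

static void receiver(int fd, void *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);
    DTRACE_PROBE2(hackbench, ipc_stop, fd, n); /* event 4: after read */
}
```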
So with LTTng you are in a different mode: you have to think in terms of post-processing; you're not doing things live. And there are really great libraries available, including in Python. It's called Babeltrace 2; it's a great library, and also a tool, by the way, with which you can look at all these traces. Basically, you just iterate through all of the collected events, and with Python you can then very easily decide what to do. So everything I was doing live with eBPF, SystemTap, or ftrace, I'm doing here in post-processing, but you still see the four events: if I have this kind of event, here's my processing for it, and I'm doing the same thing, slowly collecting more and more data until I get to the end event. The code is almost identical; the difference, and it's a big difference, is that one is post-processing and one is live processing. But I just wanted to show the analogies: the logic and the way it's handled are quite similar. (A small sketch of this iteration follows below.)

This doesn't automatically generate a histogram; here I used my Python code again to produce one. I had a nice hash table that I could iterate, with all this information, and I used Python to generate a histogram again, so that we could see whether the worst case fits inside the curve. And my Python post-processing also printed out this other information as well. So we get all the same information, but it's being post-processed, which is not necessarily a bad thing; it just depends. But it is really nice to have a system that does live processing, where you just say "show me the current state" and it prints it in the right format, without having to extract log files or go offline or anything like that.
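Here is roughly what that iteration looks like with the Babeltrace 2 Python bindings. The trace path and event names are placeholders, and the vpid field assumes the corresponding context was added to the trace.

```python
import bt2

starts = {}  # vpid -> send timestamp (ns)

# Iterate over every message in a recorded LTTng trace.
for msg in bt2.TraceCollectionMessageIterator('/path/to/lttng-trace'):
    if type(msg) is not bt2._EventMessageConst:
        continue  # skip non-event messages (packets, stream begin/end, ...)
    ts_ns = msg.default_clock_snapshot.ns_from_origin
    name = msg.event.name
    if name == 'hackbench:ipc_start':             # placeholder event name
        starts[msg.event['vpid']] = ts_ns
    elif name == 'hackbench:ipc_stop':
        start = starts.pop(msg.event['vpid'], None)
        if start is not None:
            print(f'ipc latency: {(ts_ns - start) / 1000:.1f} us')
```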
Here's the graphical program where I'm now going to investigate the worst-case scenario. This is called Trace Compass, and it really has a lot of stuff in it; I had a hard time deciding what to put on the slide. But here you can see all the different events. This is the receiver, which is finally scheduled way at the end here to receive the data. I'm also showing the sender: you see that the sender was done at this point, and the receiver was getting scheduled at this point, so in the middle, nothing. You can also see that all four CPUs were pretty much full the whole time; there's no idle sitting in there. There are a lot of tools available in Trace Compass, a lot of wizards and intelligent things that can go through and match things up and help a lot; it's a really quite powerful tool. I'm not going to go into it here, it's a talk in itself, but it's quite powerful for doing the post-analysis of what happened.

So, for LTTng, the runtime increased 49%, which I believe is the third best. We have the graphical viewing with Trace Compass. And by the way, I have the 49% under the pros, because it's still almost half of what we had with SystemTap. We have a lot of great libraries for post-processing. Debian, Fedora, and openSUSE have support for it. Unfortunately, on Arch, the lttng-tools package is just plain broken; if you go to the package logs, you can see it has been confirmed broken. So unfortunately, right now, that's not going to work on Arch.

Other things that are a problem, maybe a small problem: the limited uprobe support. We can only use symbols, and in a lot of cases that's probably not going to be enough for you, so you may have to modify your application. Maybe that's not a problem for you, but it's nice to not have to modify applications, to take something as it is and just investigate why it's acting strangely. That's a really nice thing to be able to do. LTTng also has to build its own kernel modules, which means you need the compiler and the kernel headers for that, so you might have the same problem building kernel modules that SystemTap could have. And what I also found is that the kernel modules were not pre-packaged for Fedora and openSUSE. On Fedora and openSUSE, LTTng does work, but you have to download the modules and compile them yourself. Not a big deal, but still: if we look at Debian, for example, LTTng really works out of the box; there's nothing you have to download or manually compile, it just works, which is really nice.

So now, here's a summary based on usability, runtime effects, presentation of results, and availability. I had a really hard time putting these rankings together, because a lot of it is a really close call and it's really hard to decide, but at the end of the day I do feel that this is the appropriate ranking for the different tools. Runtime was the easiest one, because it's just a number to compare; all the other categories were really difficult. I have to say, I don't usually use eBPF, but working with bpftrace was a really pleasant experience; if you have not used bpftrace, I really recommend trying it out. Also, please keep in mind that this is based on the experience with this investigation, so if you have a totally different set of tracing problems, the ranking might be different. And finally, if you assign points to all of those individual ratings, you can build an overall top four. eBPF came out on top, being the fastest, easy to use, and so on. Its biggest drawback is viewing the data; that's really what's missing from eBPF. If it had some sort of official format you could write the data into, so that you could use, for example, Trace Compass to view the eBPF traces, that would be a huge feature, because that's really where it's lacking. Then come ftrace and LTTng, and unfortunately SystemTap, which is also very nice to use; it's not bad, it's just third or fourth place in every category, so in the end it couldn't make it up there. Again, this is based on this investigation; it does not mean that SystemTap is generally a bad tracing solution.

Thank you for your attention, and I hope you learned a little bit about what's out there. I really encourage you to try them all, because it's not that hard, and it's a good experience. You start to learn the advantages that the different solutions have to offer, and then, when you are presented with a certain problem, you might say: hey, I know SystemTap has a whole bunch of tapsets that are great for this domain, let's use that. Or: we need uprobes with offsets, so LTTng is not going to work for me, and SystemTap won't either. Just being aware of these things can really help you decide how to trace your problems. So thank you for your attention, and have fun at Embedded Linux Conference 2021.