Okay, hello everybody. My name is Jiri Olsa. I work for Red Hat, where I maintain perf for Red Hat Enterprise Linux, and I also do some work upstream. In this presentation I will be talking about processor tracing as we support it in perf. In a nutshell, it's a feature that lets you go from the function-level detail that you might know, for example, from ftrace, which is this output, all the way down to the instructions. That's what processor trace allows you to do: you get a complete overview of your workload, all the instructions that were executed in it. As some of you might notice, this is x86 instruction output, and that is the scope of my presentation. In perf we actually have a generic framework for processor tracing, and we support other architectures as well; there are implementations for s390 and for ARM CoreSight. However, today I will be covering only x86, namely the Intel models and the Intel Processor Trace features. So processor trace gives you the instruction flow. For the sake of this presentation you don't need to be familiar with assembly, with one small exception: branch instructions, because that's what processor trace and the related features are based on. Let me quickly explain what branch instructions are about. Basically, they control the flow: whenever the CPU reaches a branch instruction, it can jump to another location in memory, so they are used for conditions and for function calls. There are two types. There are unconditional jumps, where the jump happens every time the CPU reaches the instruction, and there are conditional jumps, where, based on the type of the jump and on some flags inside the processor that are checked before the jump, the CPU either jumps to the new location or not. So those are branch instructions, and they are really crucial for processor tracing. I will start with a really quick example to keep you interested. First of all, you need to find out whether processor tracing is available on your system, and the best way to do that is to run perf list and look for intel_pt. intel_pt is the PMU itself, which provides one single event, also called intel_pt, and this is the event you want to use for processor trace on Intel. I have this sophisticated example. If you want to trace a program like this, you run perf record -e intel_pt//u (the u modifier means you want to trace only user space) and you specify the program. So the program ran, and perf.data now contains the trace of the program; you can display what you actually traced, and that's what the --itrace option is for. If you want to see all the instructions, you can say i0. Ah, no, it's too small; let me make it bigger. In this output, an instruction event sample is synthesized for every instruction, so every line stands for one instruction. You can display more details for every instruction: you can show the actual instruction that was executed, either as a hex dump or translated to assembly, and this is basically every instruction that was executed when I ran the example command from the beginning (the commands are recapped below). So if you go and search for the main function, you will find the routine that you wrote.
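As a recap of that quick demo, here is a hedged sketch of the commands; ./ex stands in for whatever small test program you trace, and the exact --itrace and field syntax can differ slightly between perf versions:

  # check whether the intel_pt PMU is available
  perf list | grep intel_pt
  # trace user space only (the trailing u modifier) while running the program
  perf record -e intel_pt//u -- ./ex
  # synthesize an instruction sample for every instruction (period 0)
  perf script --itrace=i0ns
  # also show the raw instruction bytes, and disassemble them if the xed tool is installed
  perf script --itrace=i0ns -F +insn --xed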
This is the preparation for the libc call, and everything around the main function is basically the dynamic loader and the libc glue. So if you wish, you can check all that code, and it always amazes me how much code is executed for just 12 instructions. So that was just a quick example. Before I actually introduce processor trace itself, I will introduce two other features that are its predecessors, because processor trace on Intel didn't just appear by itself; there was some development before it. One of those features is actually a dead end: that's Branch Trace Store. The other, Last Branch Record, lives on and people are still using it, so let me start with that one. Last Branch Record is a CPU feature: you can configure the CPU so that it remembers the last branches it executed. At any given time you can ask the CPU: how did you get to this point, what were the branches you executed? And that's how we use it in perf. Whenever there's a sample for an event, for example when you're profiling on cycles, we gather information at the point of the sample, like the process ID, thread ID, CPU number, and so on, and one of those pieces of information is also the LBR record. So we ask the CPU what the branches were, how we got here, and that's what we store. For every branch we store the source and destination address, and we also store details about the jump, the usual flags you want to know: whether it was predicted or mispredicted, whether it was part of a transaction, and we can also get the number of cycles spent since the previous jump. Quite a lot of useful information. To find out whether you have LBR support on your system, which you probably do because it's quite an old feature, you just grep the kernel log for LBR and you should see something like this. The "32-deep" is the number of jumps the CPU allows you to store; it's different for every architecture, and there's a small table that gives you the idea. I included this structure; it's the structure used in the kernel to store the LBR data for a sample, and you can see all the flags we store for a jump: mispredicted, predicted, abort, in transaction, and how many cycles. You can see there's a reserved field, so it's actually not a dead feature; for example, the cycles field was added for the Skylake architecture, so, I don't know, three years ago. It's still useful, it's still there, people are using it. How can we actually use it? perf record -b instructs perf to store the LBR data with the samples. perf record -j is basically a filter: it does the same as -b, but you can specify the types of jumps, and because the filter is set in the hardware it's supposed to be faster, so your workload runs faster. We also have support for call chains: when you say --call-graph lbr, the standard frame pointer call chains are mixed with the LBRs, which simply gets you better call chains. Once you have the LBR data in your profile, you can run perf report; by default it will show you the most used branches. And there's a very nice option, --branch-history, which shows you the profile with the LBR data folded into the call chains. So I recommend using those features; the commands are sketched below.
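A hedged summary of the LBR-related commands mentioned above; the workload name and the -j filter value are just examples:

  # sample on cycles and attach the LBR branch stack to every sample
  perf record -b -- ./myworkload
  # the same idea, but filter the branch types in hardware (here: any call)
  perf record -j any_call -- ./myworkload
  # mix frame-pointer call chains with LBR data for better call graphs
  perf record --call-graph lbr -- ./myworkload
  # browse the result; --branch-history folds the LBR data into the call chains
  perf report --branch-history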
How is the performance of LBR? I did a really simple benchmark. The first row is just profiling the standard sched pipe benchmark that we have in perf, so the baseline is 58 seconds. If you include -b, it slows down almost four times. --call-graph lbr is actually quite fast, because it uses just a subset of the LBR and doesn't slow the CPU down that much, so that one is really usable. So that's LBR: just instrumenting the samples with the branch records. The first feature that actually stored the whole trace of branches is Branch Trace Store, BTS. It's also a CPU feature: you configure a buffer in memory, and once you enable BTS, the CPU starts dumping information about every branch the execution goes through. For every branch it stores 24 bytes: the source address, the destination address, and eight bytes for one flag. As I said before, the LBR feature is still going and people are using it; BTS, at least for perf, is a dead end. It's something that predates processor trace; they share some functionality, but people now use Intel Processor Trace instead. Still, how did that feature work? There are two ways to enable it, by using two different events. The first is the branch-instructions event with a period of one, that's -c 1. If you use this, the kernel notices and transparently creates the BTS buffer, and the CPU starts dumping data into it. When the BTS buffer gets full, the kernel steps in, parses the whole buffer, and for every record it creates a branch event and stores it in the perf ring buffer. So the perf ring buffer ends up with all the branch events from BTS, and the decoder is the kernel. If that sounds expensive to you, it's because it really is expensive; it's horrible from a performance point of view, but it works. The other way to enable BTS is with the intel_bts event, and with this event the kernel does not step in: perf record, the user-space side that monitors the data the CPU is generating, takes all the data from the BTS memory and dumps it into the data file. So you end up with a perf.data that mixes data from the BTS buffer and from the perf ring buffer. No kernel is involved, nobody translates the data in the kernel, which means the decoder moved to user space, and that's a nice thing: it's supposed to be much faster. The decoder is hidden behind the --itrace option. With --itrace you're saying: I have data with a hardware trace, in this case BTS; synthesize from it the things I tell you to synthesize. For BTS you can synthesize only the branches and the calls; all the other --itrace options are there for Intel Processor Trace, and I will talk about them later. So, to sum it up: branch-instructions with period one works, but it's not fast. The intel_bts event is disabled when the page table isolation (PTI) feature is enabled, which it probably is on most distros now, so most likely you will not see it on your machine unless you recompile your kernel with PTI disabled. Both ways are sketched below.
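A hedged sketch of the two ways to enable BTS described above (the second one will not work on kernels with PTI enabled); ./myworkload is a placeholder:

  # 1) branch-instructions with period 1: the kernel decodes the BTS buffer into branch events
  perf record -e branch-instructions:u -c 1 -- ./myworkload
  # 2) the intel_bts event: raw BTS data goes into perf.data, decoding happens in user space
  perf record -e intel_bts//u -- ./myworkload
  # synthesize branch events from the raw BTS data
  perf script --itrace=b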
It's also not well maintained at the moment. On top of that, BTS records just the branches, nothing else, so all you get is an overview of how the branches were executed: there's no timing information, and you cannot do system-wide monitoring, you can only attach the event to a single thread. That's all you can do, and I mentioned the overhead already. So there's Intel Processor Trace to save the day, because it really has much lower overhead, mostly because the processor trace dump is very highly compressed data; at some points it's even one bit per branch, so it's a really powerful compression. Moreover, it actually adds more information to the trace: you get timing information, and you can get information about power, like the frequency of the processor at any given point. From a top-level view the functionality is similar to BTS: again there's a buffer you configure for the CPU, and when you enable the feature, the CPU starts dumping information into that buffer. As I mentioned, it's no longer 24 bytes per branch; the processor trace dumps the information in the form of packets, many different packets that form a sort of protocol describing the flow, and the decoder is the guy on the other side trying to decode that thing. What does it look like? In a nutshell, this is the basic starting sequence of packets that decodes the control flow. The first thing you need to know about processor trace data is that the decoder doesn't need to know where the data begins. It can enter the data at any starting point and search for a PSB packet, which is unique, and the protocol says that the data following the PSB sets up the trace. It sets up two pieces of information: the timing, that's the TSC packet, and the virtual address the CPU is currently executing at. It is of course much more complicated, but this is the idea: you find the beginning of the trace, you set up the time, you set up the virtual address, and now you are ready to decode. When you are ready to decode, there's the most common and most important packet in processor trace: TNT, which stands for taken / not taken, and it encodes the branches. Basically, for every conditional branch in the trace you have one bit in this packet. So the decoder needs to have the binary; it walks through the code, and whenever there is a conditional branch in the sequence of instructions, it consults the TNT packet and, based on the bit, taken or not taken, it either jumps to the new location or continues further. So at some point you will see the dump; I will show you an example where it's mostly just TNT packets describing the flow. That's the basic idea and that's roughly how the packets look. Again, it's much more complicated, there are many more packets describing the state of the processor and the power state; this is just a simplification to show how it works. Back to how we actually store the data: it's similar to the BTS trace. For Intel Processor Trace the buffer is called the ToPA buffer, and we deal with it the same way as with BTS: we store all the data from the ToPA buffer into perf.data together with the events from the ring buffer.
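To make the ToPA buffer and the packet stream a bit more concrete, here is a hedged sketch: the value after the comma in -m sizes the AUX area that perf maps for the trace, and a raw dump of perf.data lets you see the PSB/TSC/TNT packets inside the PERF_RECORD_AUXTRACE records (the exact dump format depends on the perf version); ./myworkload is a placeholder:

  # record, asking for a 4 MB AUX (ToPA) area: the value after the comma in -m
  perf record -e intel_pt//u -m ,4M -- ./myworkload
  # dump the raw data file; the PT packet stream lives inside PERF_RECORD_AUXTRACE records
  perf report -D | less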
Again, that means you need the decoder on the user-space side, in perf report and actually in perf script as well. The decoder that decodes the data for you sits behind the --itrace option, and for Intel Processor Trace all of its options are available. So you basically say: I have data with a processor trace, synthesize this kind of information for me. You can synthesize instructions, and you can also specify the period of the instructions; I will show you an example. You can synthesize the branches as with BTS, transactions and power events, and there are some helper options for when you try to debug things. Let's go to examples. I can basically start where I left off in the previous example: in perf.data I now have the trace of that simple example program, the profiling data from running it. If you actually want to check how the trace looks by itself, you use perf report with capital -D, which gives you the raw dump of the data file, and if you search for the PERF_RECORD_AUXTRACE event, this is the event; this is the raw data we got from the CPU. I'm not sure if it's visible enough. Perhaps not. Better? Yeah, you must believe me it's there. Basically there's the PSB sequence and all that, and then you would see all the TNT packets here. Really, you cannot see it? Never mind, it's there. Okay, another example. The key to getting the data out of the processor trace is the --itrace option; that's actually the decoder of the data, and we have really nice man pages for it. We generally suck at man pages, but for processor trace they are really good. So if you go to the perf report or perf script man page and search for itrace (much better, right?), it will give you the whole scope of what you can do with the processor trace data. As I said, you basically tell the decoder: if you use i, give me all the instructions. And the beauty of it is that you can also specify the period, and the period can be much smaller than you could specify for a hardware event. So if you use perf report and say give me the instructions, and the period I want the instructions at is 2 nanoseconds, you basically get a profile where the profiling tick is 2 nanoseconds. You can change that and get a different profile; you will get a different number of instruction events for every period. That basically allows you to do profiling after the real profiling, a sort of offline profiling, and you can use really small periods, so for small workloads it can show you some meaningful data. To actually see the whole flow, I was using --itrace with instructions at period zero, which gives me all the instructions. We have also encoded several of these options as standalone options that you can use directly. If you say --call-trace, it configures the decoder to give you just the calls and the returns, and it basically shows you the call graph of the execution from the processor trace dump. So if I go to the main function, you see it's just main being executed over here, it goes to fwrite, you see the dynamic loader kicking in here, and the actual write over here. We have some variations on those options; this one shows you the call graph together with the instructions, which is really helpful sometimes to get an idea of what was actually executed under those functions. The decoder invocations I used are sketched below.
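A hedged sketch of the decoder invocations described above; the period value is just an example, and the standalone --call-trace and --insn-trace options need a reasonably recent perf:

  # every instruction (period 0)
  perf script --itrace=i0ns
  # "offline profiling": synthesize instruction samples at an arbitrary period
  perf report --itrace=i2ns
  # just the calls and returns, shown as an indented call trace
  perf script --call-trace
  # full instruction-level trace (add --xed to disassemble, if xed is installed)
  perf script --insn-trace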
You can also trace the kernel, of course, and you can trace it system-wide. The only thing you change on the command line I was using is that in perf record -e intel_pt, instead of u you say k, you ask for system-wide, and you say you want it for one second. So now I have a one-second processor trace of the whole system, and you can run perf script with --call-trace, which basically gives you an idea of what was executed during that second. There's a little glitch in tracing the kernel, though. The thing is that the kernel is not static: we have all these jump label mechanisms that change the kernel code on the fly, while the decoder walks through the debug info. So at some point the decoder will not match, because the executed code is actually different from what is in the debug info, and you will get errors like "trace doesn't match the instruction". We have a workaround for that. In the perf RPM we have the following script: perf-with-kcore. Okay, maybe I can make it huge. perf-with-kcore. What it actually does for you is run the profile the same way as if you weren't using it, but at the end of the profile it stores the kcore image of the kernel, so you get the image of the code that was actually used during your profile, and it stores it in a way that makes it available to report and script. So the example is here; I need to remove the previous data first. Okay, you say record, the name of the data directory, the event for the kernel, system-wide, and for one second. It records the data and at the end stores kcore for you. If you look at the result, it actually creates a directory with the perf data together with the kcore image. And when you run perf-with-kcore script with --call-trace, you now get the correct trace without any errors, because kcore is used to resolve all the code. One last thing I'd like to mention before the end is the snapshot mode. Even though the data is highly compressed, it's still a lot of data if the CPUs are busy and you want to trace for a long time. The guys from Intel introduced snapshot mode, which sets everything up as in the normal mode but does not continuously store the trace data; it just waits for you. You need to send a signal to the record command, and once perf record gets the signal, it dumps the trace data to the data file. So you can keep the trace running; some data still gets stored in the data file, because we have helper events to track the scheduler, like who is running at the moment, so the file still grows, but not by much, and at the point you actually want the trace, you send the signal and the trace data lands in your data file. From the performance point of view, this is the comparison; it's the same benchmark I used for LBR. I didn't measure the non-snapshot mode because I wouldn't have enough disk space for this benchmark, it's really a lot of data, but with snapshot mode the performance is even slightly better than with --call-graph lbr, so in this mode it's really usable. Even letting it run over your workload is actually not that bad. I will skip that one. Just to give credit to the guys who made it: those are the Intel guys. Again, it's lower overhead, you can do system-wide tracing, and there's the timing information. And that's actually it from me; the kcore and snapshot commands are sketched below.
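A hedged sketch of the kernel-tracing and snapshot workflow above; pt_data is just an example directory name, the perf-with-kcore argument order should be checked against the script's own usage output, and the snapshot is triggered with SIGUSR2:

  # system-wide kernel trace for one second, keeping a kcore copy for decoding
  perf-with-kcore record pt_data -e intel_pt//k -a -- sleep 1
  # decode against the saved kcore, so the trace matches the live kernel code
  perf-with-kcore script pt_data --call-trace
  # snapshot mode: tracing runs, but the AUX data is only dumped when perf gets SIGUSR2
  perf record -e intel_pt// -a --snapshot &
  kill -USR2 $!    # take a snapshot
  kill -INT $!     # stop recording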
Are there any questions? I heard something that it's supported in KVM now, but I don't know the details; it's not my area. Sorry, I'll repeat the question: whether there's support in QEMU. Yeah, sorry, I don't know. The next question is about the support for s390, the Z. That came in just a few months ago, so I believe it's there. I've never used it, I don't have the hardware, but the patches are coming in and being acked, so I guess it must be in some shape that actually works. Cool. Thanks.