Hi everyone, my name is Jackson Huff, and welcome to this presentation at LinuxCon 2022, where I'm going to make the Linux Performance Counter API really easy. We're going to go through these slides fast, so get ready. This presentation matters because, as far as I can tell, there is no other tutorial on the Performance Counter API (PC API for short) anywhere on the internet. That's crazy. I want you to think about why there's no other tutorial besides this one; maybe you'll make your own after watching this. Either way, I had to figure all of this out on my own, and because there's no other tutorial like it, there's also no other demo program like the one you'll see very soon. Before we get started, some quick self-promotion. I'm a high school graduate in Orlando, Florida, USA, Planet Earth, planning to attend the University of Central Florida later this year to study engineering and maybe music or Spanish. A fun fact: some people think I'm from Puerto Rico, but it isn't the truth — pero no es la verdad. Just something you might not have known about me. I've been involved in IT for over a decade, and I've known about and been learning Linux for about ten years of that. Along the way I picked up some networking, which earned me the CCNA certificate in August last year. Finally, you can check out more about me at my personal website, JGHUF.com, where you'll also find my other projects. Okay, here's the painless overview of this painless presentation.
We'll start with an introduction to the PC API, the Performance Counter API: why you need the PC API, how the PC API works, implementing the PC API, and then the demo program — in bold, because that's going to be the coolest thing here. After that comes a brain dump: how can we take the PC API even further? Brain dump? If you're familiar with the CCNA world, that's kind of a bad word, but here it's a good one — another term for it is recall practice, and it helps us remember important concepts and use them in the future. Finally, the conclusion. Okay, checking OBS — three minutes, we're making good time. Here's a group question: what are performance counters anyway? I'm giving you 15 seconds to think about this. If you haven't heard of performance counters before, think about what the term might mean; if you already know what they are, I still encourage you to think it through. Okay, 15 seconds, go. All right, time's up. So what is a performance counter anyway? Performance counters are a way for developers and chip designers to check, objectively, how fast CPUs are going. They measure objective, truthful metrics, in contrast to what a lot of marketers would want you to think. A CPU isn't just "fast" or "slow", "20% faster" or "50% slower"; performance counters reveal all the metrics you could possibly want to objectively measure performance across processors. Phew, what a mouthful. (I'm coughing a lot because I have a cold, so if I need to blow my nose there'll be a jump cut — sorry.) Now, a quick history of performance counters. They started as a secret Intel processor feature — we're talking Pentiums from the 1990s.
Those processors had certain undocumented instructions, and some clever people figured out through reverse engineering that these secret instructions exposed secret registers containing all the statistics you could want from performance counters. These counters exposed useful statistics that the designers at Intel probably used to design better Pentiums, but never revealed to end users. However, those clever people showed the feature was so useful for end users that Intel documented it, AMD copied Intel, ARM copied Intel and AMD, and the rest is history. All right, so what do you use performance counters for? If you're a software developer, perhaps you've heard of cache optimization: sequentially aligning variables or compressing your data so it all fits in the data cache, or compressing instructions so they all fit in the instruction cache. Either way, that's cache optimization, and it makes your programs faster. Performance counters let us check exactly how optimized something is — and let us optimize not just for the cache, but for the TLB, instruction pipeline stall counts, and more. So performance counters can let us radically improve our performance, and that's going to be important, as we'll see, because we need better performance. Today, ironically enough, even though our CPUs are orders of magnitude faster than those original 1990s Pentiums, our programs are orders of magnitude slower than they could be. Over here we've got bloated Electron apps like a glorified text editor or a chat app. What about streaming video delivery? That's actually a better use for performance counters, because a little change here and there adds up to a lot of change overall.
For example, Netflix could save millions of dollars a year if they made their streaming CDNs just 1% more optimized — that 1% adds up over their thousands of servers. (I can't talk today.) What about high performance computing? HPC is probably the biggest use of all for performance counters, because if you can optimize your programs, your task gets solved faster, and you save millions of dollars when you're running on big supercomputers that may have cost billions. Performance counters let you save that time, which is a win-win. And finally, you get to know the nitty-gritty about all your programs and how they behave. Actually, there's a fifth reason I didn't include here: the statistics from performance counters just look cool to a lot of people, including me, even though I know exactly what they're measuring. Another win-win. So here's the Netflix success story. For a while, Netflix suffered from unoptimized video CDNs — I digress, I already said that. To fix it, Netflix hired someone who knew performance counters really well. I can't remember if it was on Linux, FreeBSD, or something else, but performance counters work much the same across OSs because they're specific to processors, not the OS. Some testing, experiments, and bar charts later, Netflix is reaping the rewards and rewarding their stockholders, because they optimized their video delivery and efficiency metrics. You could say they applied cache optimization and improved their IPC — actually, IPC measures overall processor efficiency, and cache optimization is just one means to that end. You can measure both with performance counters, and that's great. Today, that person, whose name I can't remember, is a legend in the performance optimization world. Okay.
Now let's start with a PC API secret: it's not all about the Benjamins, it's not all about the Pentiums — it's all about the syscalls. The only actual interface you have with the Linux kernel when using the PC API is through syscalls: the perf_event_open syscall (invoked via the __NR_perf_event_open number), the read syscall, and — I forgot to add it to this slide — the ioctl syscall. Wow. These three simple parts are the only, how should I say it, connections to the kernel you have to make. So if you're worried about complex kernel interfaces or something like that, don't worry so much; it's really about how you use the data you get. Here's another PC API secret: until now, you've needed to cast a memory buffer through the void* C type — wow, what a mouthful again — for almost every read you have to do to get data from your performance counters. However, modern C++ is here to help with these ugly memory conversions by giving us explicit conversions and generic code — C++ calls them templates, other languages call them generics; I prefer "templates" because the name tells you more about what they are. These two features are the secret behind agile PC API use, and my prediction is that generic code and explicit conversions are what will make the PC API widespread. I'm hoping. My cough's getting to me, sorry. So here's another group question I want you to think about: why have there been zero tutorials on the PC API until now? I already touched on this, but I'm giving you another 15 seconds to think about it. Okay, get set, go. And that's 15 seconds. So what's your answer? Maybe you can put it in the chat — I'm recording this in the past, so I don't actually know, but put your answer in the chat if you can. Thank you. So here's how the PC API works — the big bird's-eye view of how you use it.
So first up, find the PID of your process, whether from user input or generated somewhere else in the program. In this case, it will be the PID of a process other than the one using the performance counters — it's tricky to explain, but it just has to be a different PID than the program running the counters has. Second step, find all child PIDs. This is surprisingly tricky, but there's an easy modern C++ solution, as I'll show you later in the demo. Third step, open a file descriptor for each event, for each child PID — and this presents another problem that I'll show you how to fix later. Fourth step, reset and enable the counters. Really easy. Fifth step, wait. Really easy. Sixth step, stop the counters. Still easy. Seventh step, read the file descriptors into buffers, and read those buffers into normal variables. Tricky, but if you just copy my template, it just works. Last step: go back to reset-and-enable. And that means this is all an infinite loop, which is great. So here's an important choice we have to make. As you can see here, counting counters count over some specified time period — technically, whenever they're enabled, which in this loop means they'd be counting during the wait step. What about sampling counters? Sampling counters count based on some trigger or breakpoint, which could be a hardware or software breakpoint. Sampling counters are a fair bit trickier to use, and they require some really spooky memory management — you've heard of mmap; that's exactly what you have to use — and you have to specify the exact breakpoints you want to track, which is itself tricky too. So we're going to use the counting method for both the demo program and this presentation. Sorry if I don't know how to say that better. Okay, let's move on.
Actually, one more thing I want to say: there's a third type of performance counter, and it interfaces directly with the Linux kernel to measure how long syscalls and other low-level actions take. However, that kind of counting is even funkier than sampling-type counters, so I'm not going to talk any more about it here — maybe in a future presentation, when I also cover sampling counters. I digress. So here's the plan of action for the rest of this presentation. I'll show you the basic steps you would use in C first. Then we'll split those steps into C++ building blocks — I should say modern C++ building blocks, because they use some funky but useful modern C++ features. And the demo program will use these modern C++ building blocks, which I'll show you individually. Because I'm recording this as a video, I'm just going to guarantee there will be some way to get the complete demo code with the modern C++ building blocks after LinuxCon — I don't know exactly how yet, but there will be a way. Let's start with headers. The PC API requires some very specific C headers. The top header specifies the __NR_perf_event_open macro. The second one defines more than just breakpoints, but also the events. The third one solves a problem we're going to encounter with excessive file descriptor use. The fourth tells us how to use syscalls, and the fifth how to use the ioctl syscall specifically. We're already 17 minutes in — either way, let's get started with structs. On the left here, the struct called read_format helps us — well, it doesn't actually tell the PC API anything; it's the format the PC API uses to store data in a buffer. When we read a file descriptor into that buffer, the data will be in this read format, so we're just specifying a struct that lets us access the variables inside it. It feels really wrong, and I don't like it.
What about the second struct here, called pcounter? That's just a modern C++ abstraction. (Sorry for the noise — I just had to blow my nose.) So right here we have the modern C++ abstraction struct, and here we have the explicit conversion, using reinterpret_cast, from a pointer into the buffer to read_format data. That is funky, and I hate it, but it works. So if you wanted just one function to abstract away performance counter stuff, this function would set up the counters and run the main loop for counter processing, as you saw earlier: reset and enable counters, read the fds into buffers, and so on. The demo program is just this function, but put into main instead — everything in that program could be put into its own little function (or should I say big function), and either way it'll work the same. So what comes next? A problem. How do we get all the child thread PIDs of our input process PID? I discovered that on Linux, a PID is per thread, not per process — unlike Windows and, I believe, the BSDs — which is kind of sad. So how do we solve this? If you haven't looked at the slide yet, think about it for a couple of seconds. Okay, here's how we do it: we use a C++ directory iterator and a regex that parses this directory. Plug in a PID, say 1234; we parse the directory, count up the number of entries found, get the names of those entries, and return a list of all of them — and those entries are the child PIDs in this example. Because this is so hacky, I believe Linux needs a better solution now — today, not tomorrow. On the other hand, my hacky solution works perfectly: on my Zen 3 software development monster it only takes two milliseconds to complete, and on my Intel laptop, I believe, three milliseconds. Fast either way. So what about problem two? Each event per counter uses a single file descriptor.
So I want you to think about what that actually means. Each event per counter uses a single file descriptor, each counter is per child PID, and each child PID is per thread. So imagine we had 200 threads in a program and wanted to measure 40 events in total: 40 times 200 — if you don't know math, that's 8,000. We'd need 8,000 file descriptors to store all those performance counters. Crazy. This presents a problem, because by default Linux users have a generous yet limited resource limit for file descriptors — on my system, I believe it's 4,096, not even close to 8,000. That means we run out of file descriptors when we track a program with oodles of threads, and some programs, like Minecraft servers, come with hundreds of threads right out of the box, so oodles of threads isn't particularly unusual. How do we solve this? We just get the hard limit and resize the soft limit up to it. Really easy — so easy I'll show the implementation in the demo program. All right, moving on to problem three: how do we decide what we want to measure? Answer: the PC API docs provide an immensely long list of event macro options. Wow, is that wordy. An event is one metric each counter can measure. On the right here, for the type PERF_TYPE_SOFTWARE — a macro itself — we can select one of these config options, which is an event: PERF_COUNT_SW_CPU_CLOCK, which measures the CPU clock, or task clock, page faults, context switches, or CPU migrations. Wow, and that's just a tiny sample of what you can choose from, because there are at least 40 events in total, so you're covered either way. So how do we set up our counters? Easy — just fill in this template for the initial syscall that returns the file descriptor of our counter. The file descriptor is a long, and __NR_perf_event_open tells the Linux kernel that we want to make a performance counter.
"settings" is a struct that holds our performance counter settings — as you'll see later, we don't need to define this struct ourselves in our code; it's a premade struct we use to pass settings in. "pid" is the PID of our target thread, or process if your process is single-threaded. Negative one for the CPU parameter means we aren't targeting any particular CPU core; set it to zero or above to target the core with that number. "gfd" is our group leader file descriptor, or set it to negative one to create a group leader — I'll cover group leaders very soon. The last parameter, zero here, is for sampling counters. So how do we handle errors with performance counters? Because, as it turns out, performance counters on Linux are really flaky — and I mean it. But it's easy: just check whether the file descriptor returned by this call is equal to negative one. If it's zero or above, it worked just fine. You can then do an additional check of the errno variable to figure out the exact problem — and as it turns out, there are lots of exact problems you can hit; I've covered most of them in the demo program. It's important to check for errors because performance counter errors can cascade to other counters. Easy. So let's configure our — oh, yes — let's configure the event. I shouldn't say configure the struct, because the event is just one field of the struct. The PC API uses C macro constants to store the event type. Events are what you want to measure, and some events are really strange — these all have to do with cache-type events. I don't know why they do it this way, but here's how it works. You select the type PERF_TYPE_HW_CACHE, then you select which cache you want — I'm selecting the data TLB — and then you have to bit-shift this macro here left 8 bits. That's how the double-chevron notation works: it shifts whatever is on the left by the number of bits on the right.
So it shifts PERF_COUNT_HW_CACHE_OP_WRITE left 8 bits, and it shifts this one left 16 bits. Wow, what a mess. Overall, this measures data TLB cache write misses. Easy. So how does a group leader work? Group leaders provide a handy abstraction that makes counter control a lot easier — consider a group leader a mini abstraction that makes enabling and disabling easier. (Well, I just did something weird. Where was I?) So group leaders are like an easy abstraction over file descriptors: when called, they act on behalf of all members of that group. An additional advantage of group leaders is that all members of a particular group are scheduled on the CPU at the same time, which means you can do accurate addition, subtraction, multiplication, and division across their values without having to worry about the accuracy. Tricky. And sorry. So how do you use group leaders? In the syscall here, pass negative one as the group fd to create a new group leader; the return value is the file descriptor of your new group leader. Then plug in that file descriptor — not a PID, as I mistakenly said — in order to make an event a member of that group, and enter the same file descriptor in that parameter for future counters. This is really tricky; I'm sorry again. And as I said before, group leaders are all or nothing: if one counter fails, all of them do, because they all have to be scheduled on the CPU at the same time. An unfortunate side effect of that, I'm sorry to say. So what about the infinite loop? You go into the infinite loop once you've made all the counter events for each PID of the process, and the loop runs for as long as you want to monitor performance using performance counters. That means if you want to, say, stop using performance counters, you close all the counters, as you'll see in the demo program, and then you exit the loop.
That's just a simple break in C or C++; also, a simple while(true) here works absolutely perfectly. Yay. So the next step is to reset and enable our counters. When first created, counters are in a state defined by what you configured, but their values are undefined. How do we fix that? We first reset the counters so their values are a defined value — zero — and then we enable them so they can start counting. Here's how that works: use the ioctl syscall on every file descriptor that is either a group leader or has no group leader. What does that mean? When you call ioctl on a group leader, all members of that group are affected; however, some counters might not be in a group, so you call those separately as well. So use these two calls to reset and enable: ioctl(fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) — the flag here says we want to affect all members of a group instead of just one — and the same with ENABLE in order to enable the performance counters. So now we sleep. We just let the Linux kernel collect data with the performance counters on the processor while we do other stuff in our program. You'll want to wait some reasonable amount of time, such as five seconds here, and an easy cross-platform way to sleep — working on Windows, macOS, Linux, anywhere — is the C++ standard library's std::this_thread::sleep_for(std::chrono::seconds(5)). Easy. What about disabling the counters? Disabling counters is also easy: same idea as resetting and enabling, just with a different macro — ioctl(fd, PERF_EVENT_IOC_DISABLE, ...) with the group flag to affect the whole group. Easy. Reading — how do we read from these performance counters? We use the read syscall, and the return value of this read is a long, so you'll want to make a long called size to store the amount of data read. In practice it will be no larger than an int; a long is just safer.
So you'll be reading from the file descriptor of every counter, reading the data into the buffer, with a maximum of the size of the buffer — read at most as much data as the buffer can hold; I need a better way to say that. And you have to perform a memory check: whether size is equal to or greater than this equation. The way this works: if you remember the struct from earlier, there were two values for each entry in that array, and because both of them were long longs, that's 16 bytes per entry. So it's 16 times the number of events in that group, plus another long long for the whole group — that's the equation there. If size is equal to or greater than that, the buffer is viable to extract values from. Here's an example. Reading, part two: now we iterate over nr, the number of events recorded in each read_format struct. In each iteration, check whether the id of the value you just read is equal to the id of the desired event. It sounds so technical, but that's exactly what we're doing. Then, if this check is true, assign the read value for that id to a user variable. So here we check size against the equation, iterate over the buffer, check whether the id equals the known-good id, and then assign a variable the value paired with that id. It's tough to explain, but as you'll see in the demo program, it's actually really easy. So how do we calculate the PID deltas? What is a PID delta? It's just the difference in PIDs, because many programs dynamically create and destroy threads — and that leaves performance counters either trying to track PIDs that no longer exist, or not tracking PIDs that do exist now compared to, say, five seconds ago.
So here's how we do it — you'll see the complete process in the demo program (man, that demo program is getting legendary now): there's a list of current PIDs, blah, blah, blah — I don't actually have time to say it all, because we have to leave time for the demo. One more thing regarding performance counters: the PC API causes security issues. Although the security issues might not actually be important, because the only viable attacks known so far are proof-of-concept attacks that can extract only a few bits — not kilobytes, not megabytes, just a few bits — of memory from other programs by using performance counters on those programs. But because it's a non-zero security risk, Linux locks the PC API for PIDs other than the calling process. Tough wording, but remember: if you're measuring the same process that's using the performance counters, there's no security issue there, so there are no extra steps to take. But if you are using the PC API on PIDs other than the calling process, Linux locks it behind the CAP_SYS_ADMIN capability, or — since Linux 5.8 or 5.9, I can't remember exactly which — the CAP_PERFMON capability. So how do you keep good security? First, use a recent Linux version, as you always should, but you should also assign the CAP_PERFMON capability instead of assigning CAP_SYS_ADMIN or running the performance counter code as root. For my demo program, try this command to add the CAP_PERFMON capability: sudo setcap cap_perfmon+pe demo — p and e stand for "permitted" and "effective", and you need both; "demo" is what I called the binary of my demo program. You might not have setcap installed already, but it's really easy to install. Okay, let's move on. Actually — that's all, folks. Those are the bare-bones basics of using counting-type Linux performance counters.
So I might cover sampling-type counters in the future, but for now, we're just using counting-type. So it's time for the demo I was telling you all about. Let's put all of these concepts together, using smart C++ building blocks, to take a user-entered PID and monitor another process. This demo steals almost all of its code from a real-world project of mine, but I've trimmed the unimportant stuff to leave a shining gem of something you can learn from. Okay, let me fix OBS so you can see the program, and I'll be back. All right, I'm back. So here's the demo program in VS Code — one of those bloated Electron apps I mentioned earlier. I've got the whole program ready to go. We start with all of our headers, then all the C++ standard library includes — let me double-check something; yep, good — then we have the classic namespace fs = std::filesystem, which just makes the code cleaner. Then we have our read_format struct, which we define so the compiler knows how the data is going to be structured when we get the raw data in the buffer. Then we have our pcounter abstraction here. What this does is abstract all the events away, so we can just think per PID instead of per event per PID — that makes it a lot easier. These values store the event, the event ID, event value, and event file descriptors. And the buffer here — here's the buffer-size equation. It's actually identical to the one I showed you earlier: the maximum number of counters times 16, all plus eight. Then here's that explicit conversion using reinterpret_cast. So here's get_process_child_pids; here's how that works. We start by making a list for all of our PIDs, then a regex of /proc plus some number of digits plus /task — we optimize this — then a try/catch just for errors. Then we iterate over the directory using the filesystem directory_iterator.
And we select the directory /proc, then the PID we want to get all the child PIDs of, plus /task. This is our directory entry here, a const auto reference in the loop. Then we add to pids using emplace_back, converting to a long after replacing the rest of the path with nothing. It actually just works. Then we return the whole list of PIDs. Easy. So let's set up a counter. I have a lambda function here to initialize all the arrays, because they otherwise start with unknown values. Then we configure a struct, because we have some common settings here — we abstract everything into a lambda instead of having to configure a struct from scratch for every event we want to use. So we set the perf type, the size of the struct, disabled = true — that's why we need the reset-and-enable step. I declare a pair early on, and then we specify that we want to use groups and the ID system. Then we set up the event: the classic syscall template here, and we check for errors — here are all the possible errors, and there are a lot of them. Then, for all of our events, we configure the events using these lines here. I've got two events here, one for CPU cycles and the other for instructions. Easy. We create our counters using a user-provided list of counters and the list of PIDs acquired from get_process_child_pids — it just wraps everything: for all of the PIDs passed in, we set up new counters for all events for each PID. (I need water, but I mistakenly put my water cup away. Can't believe I did that.) So how do we get rid of counters when we're done with them? We use something I call cull counters. We have some insanely nested loops here, but it boils down to closing all of the group file descriptors that we don't need anymore, and then erasing the whole performance counter from our list of performance counters. What a mouthful. So what about resetting and enabling counters? That's really easy.
For all counters, we iterate over all the groups within them, and for the beginning of each of those groups we apply reset and enable to the whole group. Easy. Now, disabling counters is the same exact thing as above, just with the disable ioctl instead. So how do we read from counters? We set up the long size for all of our counters, and we check that it's a sane amount. Now, I want to tell you something: this check was the source of days of useless Valgrind sessions, because, as it turns out, the Linux kernel reclaims memory from you if you close a file descriptor that isn't being used — and unset file descriptors in my arrays defaulted to a value of zero. You know what else has a file descriptor value of zero? The standard input of a program. Because I was accidentally closing standard input, whenever I entered some string I'd get a segmentation fault, because the kernel was getting rid of memory that should have been there. So we have to check that size is a sane amount for a performance counter read — and I hope that helps you avoid days of Valgrind sessions as well. After that, we do all the standard procedures I showed you earlier: we get our size, check the size, then store all the IDs and the values acquired from those IDs into variables. How much time do we have left? Six more minutes. Okay. So here's how we solve the resource limit problem: we set up an rlimit struct from the resource include here, get our hard limit, resize the soft limit, then apply the new soft limit. Really easy. Then we set up the user variables we want to use in our program — where we store all the counters and PIDs. We get a user-entered PID, get the current child PIDs, and create counters for all of them, then we get into the infinite loop: reset and enable counters, sleep five seconds, disable counters, read counters, and set cycles and instructions to zero because we want to reset them.
And then, for each counter for each thread, we add the values into cycles to get the total for the whole program — that's why we have the += addition here. And then this is how we do the PID delta process: new PIDs equals blah, blah, blah; clear the diffed PIDs; a temporary variable; get the difference; create counters for what's in the new PIDs but not the old, and so on, vice versa; continue. Then we display our results here — remember to convert to floats when you divide. And then we go back around the loop. Really easy. So now let's put this program to use; let me get it ready. In my downloads folder, I have the demo program. Enter a PID — let's enter "banana". That's an invalid PID, because banana isn't a number. I didn't actually show you this before, but here's how you assign the CAP_PERFMON capability: setcap cap_perfmon+pe demo. That's easy. Let's check with getcap demo — yes, we can see we have the CAP_PERFMON capability. So let's use demo, but we need a PID to monitor. I don't know — oh, let's get this guy's help. He's a musician from Colombia, and he's really good. I know this audio player is called Totem, even though it doesn't actually tell you that here. So let's go to top: OBS, Xorg, blah, blah, blah — and we also see Totem here with a PID of 52273. 52273 — messed that up — 52273. Wait five seconds. Okay, it looks like we got this many cycles and this many instructions, and a kind of junky IPC of about one. Generally, if your IPC is above one, you're in a good position; if it's below one, you're memory limited. That's just a really general rule of thumb, so it looks like we're not too memory limited here. All right, let's try OBS. OBS is 7657 — 7657. All right, we're getting a lot more cycles but comparatively fewer instructions per second, and a really bad IPC of 0.6.
So that means we're definitely memory limited here, which makes sense considering my laptop has kind of junky memory compared to my Zen 3 monster. All right, what else can we check? Actually, do I even have time? No — so let's go back to LibreOffice. Let's take it further: I have these problems with the demo — actually, these problems are opportunities to make the demo better. We only have two events in one group, while there are forty-plus events available, and so on and so forth; there's a non-zero delay to enable, disable, and reset counters; blah, blah, blah. And thank you for attending this presentation. Remember to check me out at www.jghub.com, and see more performance counters in action at my GitHub repo, where I stole all the demo code from. Thank you.