Good afternoon, my name is Arnaldo and I'll be presenting together with Jiri, and with material from Joe Mario, who unfortunately couldn't come this time. He provided the material for the database optimization using perf that will be presented at the end.

At the beginning I'll be comparing perf with other tools. It's complementary: perf lets you do lots of things using hardware facilities like the CPU counters, the PMUs for power measurement, for the GPU, or for NIC drivers. There is a multitude of counters available today, and different ones for different architectures, thousands of them, and we hope you get an idea of what is possible to do with perf after you see this presentation. So perf is just one more tool in the tool chest. It does some things more efficiently: it will not be polling /proc to get information about processes or mmap events, so it has a lot more flexibility.

I start by comparing it a little bit with vmstat, a well-known tool that people use. It shows several metrics every second, which is what I asked for. There are several of those things that you can get from perf as well, without reading /proc, and you can restrict them to just one process, or just one CPU, and so on and so forth.

perf stat is the first thing you use with perf. It just counts things, and there are lots of things you can count. In this specific case I'm counting context switches on the whole system; I haven't specified any workload, so it infers that I want information about the whole system. There is the interval print, the equivalent of vmstat's interval, which prints every second: a count and then the event. You could specify context switches together with several other events as well; you could say cache misses or anything else, and it would count the number of cache misses or CPU instructions that took place in the last second, or in whatever interval we use. It runs until you press Ctrl+C. There are lots of other events.

Then perf targets, which is something powerful. Almost everything you can do with perf -- some things are constrained by the hardware -- you can do system-wide, or for a specific CPU or set of CPUs, for a cgroup, for a PID and its children or just that PID, or for a TID. You can even say "for this TID on that CPU", all sorts of combinations.

This is perf stat with some threads: I'm looking at all the bash threads in the system. When I started this I was just pressing Enter in some of the terminals, so we were seeing the number of context switches in the short form, "cs". Around the fifth second the system was idle, there was no interactivity with the user, so there were no context switches.

One other interesting thing you can do is to count SMIs, System Management Interrupts, just to showcase one other unusual event that is available in modern machines. People working with determinism, with real time, are interested in knowing if there are SMIs taking place in their system, because they happen behind the OS's back; the OS has no way to prevent them from happening. If they are happening and they are getting in the way of determinism, causing spikes in, let's say, packet processing, you should try to figure out why that SMI is taking place.
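A minimal sketch of the invocations being described; exact flags, and whether the SMI counter exists, depend on the perf version and the hardware (the msr PMU has to expose it):

    # count context switches on the whole system, printing every second,
    # until Ctrl+C is pressed
    perf stat -e context-switches -I 1000 -a

    # the same, but targeted: just one thread, or just one CPU
    perf stat -e context-switches -I 1000 -t 1234
    perf stat -e context-switches -I 1000 -C 2

    # count System Management Interrupts, where the msr PMU exposes them
    perf stat -e msr/smi/ -I 1000 -a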
Sometimes you have to go to the firmware of your machine and disable some things, like thermal throttling, knowing the consequences of doing so. In the past we had to do all sorts of things to test this. You can try it on your machines: one way to generate an SMI on most systems is to press the mute button. When you press the mute button, control goes outside of the operating system, but it's really quick, so in this case it's not much of a problem; it's just a way for you to test the measurement of SMIs. So when I was running this, I pressed the mute button, and at some point there were some SMIs taking place on my system.

The way you use perf stat is: either you have two different machines, or you have two versions of the software, or you are trying some tuning, and then you run it before and after and look at the things it measures by default. Those are the counters it measures, and when it measures, for instance, cycles and instructions, it provides a useful metric, instructions per cycle, calculated from cycles and instructions. The same for branches and branch misses: it says how many are taking place per second and how many of them are being missed, out of all of them. So this can tell you something about your workload.

In this case what I was testing was the effect of the page cache. I just did a find over all of the files in the kernel source tree and threw the output away to /dev/null. Before the first run I dropped the caches, to see how much work has to take place on a cold cache; if I run it again, you see it went from 318 milliseconds down to 68, so that's the effect of the cache. But this could be anything else you would like: you could try running your program on just one CPU versus more CPUs and see if that makes a difference, so you run it before and after just to get an overall view of the counts of things that are taking place.

Shell completion is interesting, because there are so many events. If you use it, it calls the other tool, perf list, to get all the events that are possible on your machine, and then when you hit Tab you get the possible events that start with that string. In this case it is to demonstrate some PMU events like power/energy-cores and power/energy-gpu, which say how many Joules of energy are being consumed by specific parts of your system. If you try this with the interval mode, -I 1000, then every second you will see how it is, and if you start some workload, or move your mouse around, or whatever, you are going to see what that activity results in, in terms of power consumption. This is something that is available in recent hardware; this notebook I have has it, and I use that quite a lot.

So that was perf stat. There are lots of other things you can do with it, but that was just to showcase it. Then you have top, which polls /proc, reading several files every second or every interval you determine, and shows it on the screen. And we have perf top. perf top does the same, but it is not polling /proc all the time: the perf infrastructure in the kernel has ways for you to ask to be notified, via an interface which is a ring buffer, when an event like a thread creation or an mmap happens. So the effect on the system is diminished.
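A sketch of the cache-effect comparison and the energy counters mentioned above; the kernel tree path is illustrative, and the RAPL events require supported hardware:

    # cold cache: drop the page cache first (as root), then time the find
    sync; echo 3 > /proc/sys/vm/drop_caches
    perf stat find ~/git/linux > /dev/null

    # warm cache: run the same thing again without dropping caches
    perf stat find ~/git/linux > /dev/null

    # energy consumption in Joules, printed every second
    perf stat -e power/energy-cores/,power/energy-gpu/ -I 1000 -a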
So in this case there is perf top --hierarchy. By default it uses the cycles hardware counter and samples 4,000 times per second. perf stat was about counting; this is about sampling. And that "ppp" suffix is the precision level: depending on your hardware it may use, on Intel systems, the PEBS component in the CPU, and some of those allow you to compensate for skid. Sometimes you are sampling something and, without using PEBS, the sample may be attributed one instruction later than where it really happened, so when you look at the annotation you have to take this into account. But if your machine has this -- perf probes the system to see if it is present -- then you are going to have no skid.

In this case the hierarchy view shows that I was building the kernel. cc1 was the one taking most of the samples, or where most of the cycles were being spent, then the kernel itself, then libc, and then the overhead for perf, 2% in this specific case, probably because it was at the start of the monitoring and still populating its symbol tables. If you press Enter on libc, you see, inside libc, the functions where those cycles are being spent: in this case it was mostly memory allocation, malloc, malloc_consolidate, things like that, and string comparison. And in the kernel it was clearing pages, getting things from the disk and from the page cache, all sorts of things. The "entry" functions started to appear more lately because of the Spectre and Meltdown mitigations.

This is another interesting thing with perf, I don't know if other tools have it: if you do perf top -h and then a substring, it shows just the options that contain that substring. In this specific case you can use, for instance, the percent limit to restrict what appears on the screen to entries above a threshold; you could say you are only interested in entries with more than 1% or 2%, to reduce it or make it more compact.

Then GDB. GDB is a well-known tool that every developer uses, and generically it has those kinds of features: watch, backtrace, et cetera. You will see that perf has all of this, but without requiring the workload to be stopped. The tool in perf that takes the functionality GDB has and makes it available in different ways is perf probe, so you can hook into arbitrary code, be it in the kernel or in user space. There is blacklisted code in the kernel that you cannot probe, because it is part of the infrastructure perf itself uses and we would get into a feedback loop, so those places are not available. It uses the mechanism in the kernel that has the best performance: putting a probe at some arbitrary place inside a function is one thing, but if you are putting it at function entry or exit, then because of ftrace, modern kernels have facilities to do this more efficiently than using a kprobe with a breakpoint.

So here we combine perf probe with perf stat: we define a probe somewhere, in this case on the libc DSO, at the start of the malloc function. After you do this, as the tool tells you, you have this new event in the system, up to the point where you reboot: probe_libc:malloc, that's it.
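A short sketch of the commands described here; the libc path is illustrative and differs between distributions:

    # sample cycles system-wide, grouping by executable / DSO / function
    perf top --hierarchy

    # show only the help entries matching a substring, then use the
    # threshold option to hide entries below 2%
    perf top -h percent
    perf top --percent-limit 2

    # define a new probe event at the start of malloc in libc
    perf probe -x /usr/lib64/libc.so.6 malloc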
So then you do perf stat -e with that event, and sleep 1 as the workload; the sleep program does those 31 mallocs. You could use this to write other kinds of tools, pairing mallocs with frees, or other pairs of functions, like pthread mutex lock and unlock, so that you have a general way of counting events in some workload. It defines a new event, you can use it, and it is inactive when it is not being used: you put it there, and if nobody is using it, it is not activated. When you activate it, that is when the probe is actually put in place, and so on. And you can use it multiple times, in different sessions, with other tools, not just perf, but perhaps with ftrace or with other things.

And where can we put a probe? perf probe -L icmp_rcv shows the source code lines, the source information, for that function; I did not specify any DSO with -x, so it means I am interested in the kernel. I know that icmp_rcv -- even the name of the function is clear enough -- is where ICMP packets, ping packets, are processed. So I ask for it to be listed, and it finds the matching kernel source, from the kernel debuginfo package or from the kernel source if you built it yourself, and shows the offsets, in number of lines from the start of the function, where you can insert probes. You cannot insert a probe on line 53 because it is just a comment. And then I decided to put a probe exactly inside the if block. So you see the source code -- for the kernel here, but it works for user space as well -- where you want your probes, and it shows where probes can be put.

Adding the probe: you do perf probe icmp_rcv:59, and this inserted two probes. I don't know exactly why it did that this time; I didn't have time to investigate, last time I did it it was just one. Sometimes this happens because a function is inlined and it expands in different places, so it puts the probes in all of them, but in this case it doesn't seem to be that, it seems to be some sort of bug. So, the same thing we did with malloc for user space, now for kernel space, and not at the function start but at some specific point.

And then I can use it together with perf trace: perf trace -e probe:icmp_rcv* expands to the two events, and it triggers every time execution reaches the point where I put the probe. If I combine this with --call-graph dwarf and then send a ping, I see the ping all the way from the main loop in the ping binary, going down to the code that deals with the loopback interface in the kernel, and then back to the point where the probe was. So you see the whole network stack.

Then we go to strace. strace is another thing that now has a counterpart in perf, using the facilities that perf provides. This is something I show in several presentations, to show the overhead: if you do a dd copying five million one-byte blocks from /dev/zero to /dev/null, it is just a lot of syscalls, and we measure the impact that the syscall monitoring performed by strace and by perf trace has. To measure the strace overhead, you run it with -e accept -- I took this one from a Brendan Gregg presentation; dd doesn't use accept, but even so the overhead is there, because strace has to process all the events, and notice that accept never shows up. [Audience:] You should rather use -e none; -e none means trace none. -- Okay, okay.
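A sketch of the sequence just described, assuming the probe_libc:malloc event from the previous step already exists; line 59 is the line used on stage and will differ on other kernel versions:

    # count malloc calls made by a 1-second sleep
    perf stat -e probe_libc:malloc sleep 1

    # list the probeable source lines of icmp_rcv, then probe line 59
    perf probe -L icmp_rcv
    perf probe icmp_rcv:59

    # trace the new probe(s) with DWARF callchains while pinging localhost
    perf trace -e 'probe:icmp_rcv*' --call-graph dwarf

    # the syscall-heavy workload used for the overhead comparison
    dd if=/dev/zero of=/dev/null bs=1 count=5000000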
But it would be equivalent, yeah? Because accept is not available on all architectures, so it's not a portable example. Okay, okay. So strace in this case was 25 times slower, which was actually better than when Brendan Gregg measured an example for his blog, where it was 42 times slower at the time; he has a whole blog post about it. If you use perf trace instead, the overhead is more like 1.41 times, because it doesn't use the ptrace interface. It doesn't stop the process to get the information about the syscall and so on; it just puts it in the ring buffer. If the buffer is well sized, events will not be lost, and if events are lost, you will be notified and can retry the experiment with a bigger buffer size. It does the same kind of thing: it collects the syscall enter and exit tracepoints that are triggered at those points. But there is a problem: it is only for integer args. If an argument is a pointer to a file name, it just shows the pointer, so you need something else to get the contents.

That is something I have been working on lately: augmenting syscalls using eBPF. perf has integration with eBPF; it was contributed by a guy from Huawei -- the company that makes telephones in China and has had troubles with the United States lately -- and what they provided was really nice. You can compile and link perf with libLLVM and have one big binary, with the same problems that BCC has, or you can have it as an external toolchain that gets activated. You pass a C file with -e to any of the perf tools, say perf trace -e hello.c, and it sees: oh, this is not an event, this is a C program. So it uses clang to build it into a .o object file targeting BPF, then loads it into the kernel, sets up maps and all the things you need, the same things that bpftrace or BCC do. That's the power of BPF: you extend the kernel without writing kernel code, and you just connect it to events.

This is one of the examples: you can now use perf trace -e open* to get open, openat, open_by_handle_at, and then I use another feature of perf trace, --max-events, because I only want the next 10 events. In this case I was building perf on a set of containers, and clang 6.0 was the one running at that time in some container -- I don't know exactly which distribution, but you can see it running. And you see that the output looks more like strace than before; that's because you can configure it that way.

There is an example in the perf sources, under tools/perf/examples/bpf, that augments the raw syscalls and is precompiled, so you can use it without clang, without the toolchain. And even if the kernel changes, because the ABI we are using is the syscall ABI, which doesn't change, you can run the same object file generated by clang on different kernel versions.

And then there were several things that made the output of perf trace a little bit different: there was the duration of calls and things like that, and it was not showing arguments that were zero, to be more compact. The idea is that the output of perf trace can be compared to the output of strace, making each a regression test for the other.
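A sketch of the perf trace usage described above; hello.c stands in for whatever eBPF C program you pass, and building .c "events" requires a perf with the clang/LLVM support mentioned here:

    # trace the open family of syscalls and stop after 10 events
    perf trace -e 'open*' --max-events 10

    # pass a C program as the "event": perf builds it with clang into a
    # BPF object, loads it and attaches it (hello.c is a placeholder)
    perf trace -e hello.c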
You can compare them; the only thing is the pointers, which change, for mmap and so on, so you have to make the comparison tool take that into account and eliminate those. And I was asked to do this. And this one shows only the next 10 times that the close syscall fails, in the whole system; it is a shell trying to close file descriptors that are not open.

So that's the end of my part of the presentation; now Jiri can go on and tell you about one experience with yet another tool that is in perf, perf c2c. Sorry, you have some questions? [Audience:] Just one thing, what went wrong with this pipe? Is it a wrong pipe? -- Yeah, I don't know, I have to check it. It's strange, I didn't investigate it. -- I don't know if it was closed twice. It looks like a race condition. -- Yeah, perhaps. When it broke, it was a pipe, and on exiting, it was no longer a pipe. perf trace has this thing where, when it tries to augment the file descriptor number, it looks in procfs, so maybe that's a race.

So Jiri is going to talk about perf c2c, which is another tool, one that measures cache line contention when the hardware has this feature in PEBS, and he will describe something really interesting: a possible optimization for PostgreSQL that we discovered while we were preparing the presentation. A happy coincidence, actually.

[Jiri:] So, are you okay? Okay, so I have two examples. As Arnaldo said, the first one is a potential disaster, because Joe Mario should be here instead of me, and he's not -- yeah, that's the disaster; what happened? Right, I shouldn't have said that. Okay, no disaster. So Joe should have been here talking about the c2c data that we have, and afterwards there is another example of how you can very easily build perf from the sources, not even from the kernel git tree where perf lives: we actually now produce tarballs. I'll show you.

So let's start with the c2c example. Basically, Joe was profiling a Postgres benchmark -- this is the name of the benchmark; he would probably explain more about it, but basically it makes the server very busy. Some details about the architecture: it was a server with two NUMA nodes, using NVMe storage that I was told is really fast. That's also one of the reasons he managed to find the issue: the performance wasn't being eaten by the storage code, so once the storage was really fast, the memory issue showed up. For more details on the hardware, there is a nice feature in the report: we store almost all of the information about the server in the data file, so if you run perf report --header-only -I, you get all the information about the server the data was recorded on.

This is just some basic information about what he actually measured in the end. If somebody is familiar with the output from that benchmark, this is what Joe was looking at when he reached a speedup. So basically after we made the change -- Joe found the issue, wrote the patch, applied the patch, built a new binary -- these are the two results, before and after. With the patch we actually managed to get almost a 4% speedup on Postgres v11; there were better results on older Postgres.
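Two small sketches of what was just mentioned: showing only failing close() calls with perf trace, and dumping the recorded header information from a perf.data file (the --failure option needs a reasonably recent perf):

    # show only the next 10 close() calls that fail, system-wide
    perf trace -a -e close --failure --max-events 10

    # print the environment stored in perf.data: host, CPUs, cmdline, ...
    perf report --header-only -I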
We will not go through the record phase, because you would need to run the benchmark and you would need the server; we don't have it, but we have all the data that Joe captured. This is the record command Joe was using: perf c2c record, all user -- that means capture only user space data -- do it system-wide, and do it for five seconds. So while the benchmark was running, he got five seconds of profile data from all the user space applications.

Afterwards -- this is also a not-so-well-known feature -- you can run perf archive, which collects all the binaries that the data references. You get a tarball, perf.data.tar.bz2, and you can unpack it on your own machine and use perf c2c report as if you were on the machine where the data was recorded. That's basically what Joe provided to us and what we will go through. [Arnaldo:] And this can even be cross-architecture: you can collect on arm64 and do the analysis on x86_64, or any other combination. -- Exactly.

I made the data available -- actually I thought everybody would be running those commands along with me, never mind. This is where we placed the data; if you go to that web page, I put all the steps that I will go through into a file, so you can just copy and paste them into the terminal. c2c.tar.gz is all the data I will now be working on, and the binaries I will use afterwards are just the Postgres binaries, before and after the patch was applied. So basically now I will go through the c2c report and show you what I think was the way Joe looked through the data, identified it in the source code, came up with the patch, applied it, and got the speedup. Okay.

Maybe just a quick introduction, I'm not sure everybody is familiar with c2c. In a nutshell, it allows you to identify the shared data structures in the system that sit within one cache line and are accessed from many places in the system. If there is a data structure like this and it is being really heavily accessed from all the CPUs, it will show up in the c2c report. And that's basically what we are after: the cure, if you identify accesses like this, is to separate them in the structure and align them to the cache line, so you ease up the contention on that data. So this is what c2c is in a nutshell; it's much more complicated than that, and we will see it in the output right away.

So, if you go ahead and download that tarball I was talking about, we have two sets of data, the original and the modified one. Let me go here. Let's go first through the original data: this is the profile that Joe made on the system without any patch applied, the original v11 Postgres server running. Let me check the steps. What I just did -- I cannot see, is that okay? okay -- I ran this command, which basically takes the tarball that perf archive spat out and unpacks it into the .debug directory, and now you are actually able to run perf c2c report over the data.

Okay. So this is the basic display of the c2c report: every line is a cache line. Let me just go through the fields; the most important is this one, HITMs.
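A sketch of the record/archive/report round trip being described; the flags approximate what Joe ran, and the paths are illustrative:

    # on the server: record cache-to-cache activity, user space only,
    # system-wide, for five seconds
    perf c2c record --all-user -a -- sleep 5

    # bundle the binaries and symbols referenced by perf.data
    perf archive perf.data

    # on the analysis machine: unpack next to perf.data, then report
    mkdir -p ~/.debug
    tar xf perf.data.tar.bz2 -C ~/.debug
    perf c2c report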
And it actually shows the situation where you have a data structure being accessed from multiple CPUs: one CPU is trying to store to the structure and another one wants it as well, and it ends up as a HITM access, which means "I want to access this data, but it is in another CPU's cache line and it is modified, so I have to wait really long to actually get it". The cache lines are sorted by HITMs by default.

If I show the details for this one -- you press 'd' -- it lists the details of that cache line. This is basically what we were able to capture, the view of one cache line. Those two columns are the HITM accesses, the reads, where a CPU wanted to read something that was modified in another CPU's cache line; and those are the accesses where some other CPU was storing to the line. And this is basically how the cache line was accessed across the system. A very important field here is the CPU count: you can see the number of CPUs that were actually fighting over the data. It was a server with 64 CPUs, so you can identify really hot spots just by looking at the CPU count. Another really important piece of information is the source: it shows you where this access was initiated from, so you can go to the source code, find the file and the line number, and see who actually caused it.

As you can see, this is the offset within the cache line, and all the accesses are at offset 4, so there is not much we can do about this cache line. If there were different offsets, we could say: there is an access at one offset and another at a different offset, let's move them apart and see what happens. But that's not the case for this cache line. Where we actually made the difference is the second cache line, which is not as hot, but you can still see HITM accesses from all over the place, and the offsets are really different now, going from 0x10 to 0x28. And again, the CPU count is quite high, so it's really stressed code.

If you go to the source line, you get to the places where the code makes the accesses, and you need to figure out what each line is actually executing at each offset. So for each offset you have the information where in the source code this access happened, and you need to go to the sources and find out which structure on this cache line you are actually looking at. That's what Joe did: he went through the Postgres sources and identified that it is the BufferDesc structure. The three highlighted fields are the offsets that you can see in the data: the first ones are buf_id and state, which is some sort of lock in Postgres, and then this LWLock structure, which is a common structure in Postgres that does the locking. Its state field is the main part of the lock, the field that all the CPUs are fighting over, and that's what you can see here: 0x14 and 0x18 are the first ones, and the 0x28 offset is the last one.

So the idea for the fix was to separate this lock away from the other fields. Joe made a patch that did the following change: he aligned the lock to the next cache line, and he also had to align the structure itself.
So it will all be allocated on a cache line boundary, and having this aligned attribute, plus this one, made sure that the content lock will be in its own cache line. Here it is in the code. And if I go to the modified data -- oh, it's not visible again, right? Never mind, it's on the web, you can go there. So for the modified data I unpack it like in the previous case, I'm just unpacking the archive and running the report. He put the new binary in place, ran the same benchmark, did the monitoring with perf c2c record, and we ended up with this profile. You can see the first cache line is still there, that very heavy cache line we actually didn't touch. But on the second one we got rid of the last access, the one at offset 0x28; just by moving it away from that cache line we eased up the load on it, it showed up in the overall performance, and we got the speedup of 4%, which is actually nice.

Now I'm asking you for questions. Okay, so I will not go through the rest; it's on the internet. You can build perf from the sources, all the steps are there, you just need to download and execute, it's really nice. If there are any questions, now is the time.

[Audience question, partly inaudible, about probes:] ...there is a fast facility in the kernel for function entry, instead of just a breakpoint -- what about user space, does that exist there? [Arnaldo:] Let me repeat the question. He asked -- I had said that in the kernel, for function entry and exit, you have a faster way to put a probe than a breakpoint, which is that the kernel nowadays builds functions with a prologue and epilogue containing nops, and when you enable function-graph tracing it goes there and substitutes the instructions, without breakpoints, and it works. For putting a probe at a function start you have that, so perf uses it instead of putting a breakpoint there. He asked if in user space there is something similar. I don't really know; I mean, uprobes are there, and there are other things being worked on that do some magic there, but I have not been following the optimization of user-space probes that much lately.

[Audience:] A question about eBPF. There is a limit on the size of the program. What should we do if we want to put in some bigger stuff that doesn't fit? [Arnaldo:] The thing is that eBPF has tail calls, for instance: you can start in one program and then go to another one. But the idea of eBPF is to be small. In the next presentation, my presentation about BTF -- the type format -- and the observability improvements that are being made for BPF so that you can see what is happening in those cases, you are going to see output from bpftool showing the sizes of typical BPF programs. So if it's that big, it doesn't really fit well with the BPF idea, which is to be something fast that doesn't get in the way that much. [Audience:] I was just thinking of converting strace into a pure eBPF thing, and it has a lot of code; you could just fetch everything in the kernel and parse it. [Arnaldo:] It's not that easy. For strace, let's say, I think we should discuss this offline, but the thing is, it's interesting for filtering things at the origin.
The initial idea of BPF -- before eBPF, in the 90s, for packet processing -- was to do exactly that: to filter things at the origin, to get just what you want. If you want everything, then you don't need BPF to filter it at the origin; you would just get everything and process it in user space. And then the facilities that we have now would already be sufficient, if it were not for getting the pointer contents. But the good thing with BPF for such a use case, for strace, let's say, would be to filter things at the origin. And it depends as well: if you are doing something that is not syscall-intensive, then it would not be a problem even to use ptrace, let's say. Any other questions? So okay, thank you. Thank you.