or modifying the cache line. And by just moving that sequence counter out of that cache line, the performance of the benchmark went up another 18%. That's a textbook case, but it shows how powerful this is. So when I started working on Linux, I looked and said there's really no easy way to identify this. And a few of us at Red Hat developed an addition to the perf tool, and Jirka will talk about it more later. I will first demonstrate what the problem is and show a little reproducer of how bad it is, and then Jirka will introduce the addition to the perf tool that will help you find, in five minutes, what took us over a month to find in the case I just mentioned. Where this is of interest: if you're developing in the kernel, or if you're developing in user space where you have processes across the system accessing shared memory, or if you have a multi-threaded application — maybe one big process with lots of threads — where the threads are running on different NUMA nodes. So just a quick overview: where does my program get its memory, which accesses are expensive, how to find out where they're happening, and how to resolve them.

Just for reference, we'll use the basic architecture above, very simple: nodes, CPUs on the bottom, L1 and L2 caches, the last-level cache, and main memory. So we have a simple two-socket system. When you do a memory access — say a load instruction — where are you getting your memory from? You'll either get it from the level one cache, the level two cache, or the last-level cache, or your local memory, or you can go out to a remote node's memory, or you can get your load resolved from a remote node's cache. There are a number of different mechanisms that memory controllers use to handle the handshaking between the different NUMA nodes when you modify a variable cached on another NUMA node. That's cache coherency; the mechanisms are called snoop modes and vendors change them — snoop, home snoop, directory-based, and so on. I'll keep that at a high level.

Here's one of the cases where memory references can hurt. Suppose we have a variable foo that lives in memory up here on node one (the diagram happens to show three nodes, but that's not important), and you are a thread executing down on this node over here, node zero, and you want to modify foo. The cache coherency protocol wants to go ask the home node where foo is located and say, give me a copy of foo. Now suppose foo was recently modified there: in memory foo equals seven, but it was just modified by CPU 11 — that CPU incremented it — and it's now equal to eight in that CPU's cache. The memory controller cannot give the old stale copy back to the requester. It has to do some handshaking to say, give me the updated copy, make sure the updated copy is synchronized in memory, and then I'll give the requester back the updated copy. There are optimizations to make that more efficient, but you can see there's a lot more work involved. So in the ideal world, all accesses would be local: you get everything from your local caches or your local memory. In the less-than-ideal world, you go across to another NUMA node once and you pull something into your caches.
As long as you're not saturating the buses between the NUMA nodes, and, with multi-way caches, assuming you're not getting evicted from your caches, this will work. You might pay double the cost: if going to your local memory costs X, you might pay a price of two X in cycles to fetch it, but once you have it, it's there for you to use. The real pain comes when you go to get that memory and another thread of execution is modifying it. Every time they modify it, trouble happens and a lot of handshaking occurs.

So to look at this stuff at a high level, the tools today answer: where is my memory located, and where are my threads of execution running? You probably saw earlier there was a discussion of lstopo and numastat; I'll just show a brief example of that. There are easy but crude ways to find out where your threads are executing. The ps command shown here will show you the last CPU each of your threads ran on; you can also run top, press 'f', and select the last-used-CPU field, and it will show you that on every update. That's crude because you're only getting the updates periodically. You can then go deeper if you want and run a trace command and look very deep, and it's giving you data, but it's hard to put it all together and say: how is this helping me or hurting me, and where is my problem? There are some tools — perf mem and Intel's numatop are some ways to find out where you're getting your memory from — but they still don't put together where my memory is, where my accesses are, and where the sharing occurs. Here's numatop on one system, to give you an idea of your system.

With numastat, you look and say, all right, how is my memory spread across the nodes? Here I fired off two Java processes and just let them run wherever they landed on the system. On node zero and node one, you can see the megabytes of memory allocated by each process. So you can see that wherever the execution is, the pages of memory are allocated all over; they're spread across the box and they're bouncing around. So you can be pretty sure that threads of execution on node zero are very likely touching pages over here on node one and vice versa — not ideal. If your application's footprint fits on one NUMA node, you can pin it — this is the same example again — pin the application to a particular node. Here it's node zero on top and node one on the bottom. Except for maybe a few shared text pages, the application's memory is largely pinned to the NUMA node that you asked for. And in this case I actually pinned the CPUs of execution as well, so that's the ideal case. That's not always possible.

So as I mentioned before, numastat will show you where pages of memory are, but how do you know where they're being touched? Let me see if I can stand back a little bit. What you can't see is: where are threads of execution contending for the same memory and cache line? Where are your threads accessing memory on remote nodes? How often are they contending? When your threads are executing, are they touching memory on your local node or are they going across? And if multiple threads are contending on shared memory — and I treat the kernel as one big shared multi-threaded application, because every thread in the kernel can see all the memory — performance starts to tank once you're contending across NUMA nodes. So here's a simple example. Take the structure at the top.
You have a reader and a writer and a structure, and I have two implementations of this structure: one where the two fields are packed together, and one where they're separated by a cache line. In the first example, this is what they look like in memory, and in the second example, with the padding, the read field and the write field are each in their own cache line. A cache line is 64 bytes in size on the majority of architectures out there; there are some where it's 128 bytes. Whatever it is, we're just referring to a cache line, and my diagrams here show 64 bytes. I created this example for this demonstration, but it goes to show what we have seen. Suppose we have one writer thread in a loop, where it modifies the write field and then just pauses for a moment before going back to do it again. And the reader thread does the same thing: reads the variable and then pauses before doing it again. The question is, if each of these runs 500 million times, how long does it take the reader thread to complete those 500 million iterations of the loop? If you're on a two-socket box, with one reader thread and one writer thread, you can start out with a two to four X difference, depending on how the system is configured: with the reader and writer fields together like this, versus separated like this, the reader will complete that loop two to four times faster in the separated case. On a four-socket box, the reader thread will speed up 20X if they're separated. And the nice thing is that when they're separated, you can have as many readers as you want and they all run fast and unimpeded. However, in the packed case, with a writer thread present, the more readers you add, the more they all slow down. We'll talk about why in a second. Every time the reader thread fetches the cache line to read it, once it's modified by the writer, the reader thread has to throw away its copy and go back and get another one. If there are multiple reader threads, as soon as that cache line is modified, they all have to discard their copy, get in line, and get a new copy. It's like lock contention in the kernel: the more waiters there are, the more the scaling hurts, because the memory controller will only let one in at a time.

This slide pretty much says what I just said. On large systems, you see this pain a lot more. We had an example: the SPECjbb benchmark is one that a lot of the hardware providers like to run and publish numbers on, and many were publishing that benchmark on RHEL 6, which is a 2.6-based kernel. They hesitated to move to RHEL 7, or really anything newer than a 3.10 kernel, because the performance of SPECjbb on these large systems was a lot slower. Finally it was no longer feasible to stay on RHEL 6, so they moved to RHEL 7, and we started digging in. It turned out there was a cache line in the kernel scheduler that had two hot variables, and that hotness showed up on the large systems this vendor was providing — 16 sockets. By separating that one cache line with two hot variables into two cache lines, SPECjbb's critical-jOPS performance went up 146%, just by moving one field down into the next cache line. And in much of our work, this is what we see, whether it's individual processes on shared memory or threads: when you get a lot of contention, your performance will plummet.
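Just to make that concrete, here is a minimal sketch of the kind of reader/writer reproducer described above. The struct and field names are mine, not the exact benchmark code, and a real measurement would pin the two threads to different NUMA nodes and time the reader loop:

    #include <cstdint>
    #include <thread>

    // Both fields packed into the same 64-byte cache line: false sharing.
    struct packed_pair {
        volatile std::uint64_t write_field;
        volatile std::uint64_t read_field;
    };

    // Padding pushes the read field into its own cache line: no false sharing.
    struct padded_pair {
        volatile std::uint64_t write_field;
        char pad[64 - sizeof(std::uint64_t)];
        volatile std::uint64_t read_field;
    };

    // volatile keeps the loads/stores in the loops for this illustration; a real
    // benchmark would use atomics or barriers and pin each thread to its own node.
    template <typename T>
    void run(T& s) {
        std::thread writer([&s] {                 // dirties the cache line on every pass
            for (long i = 0; i < 500000000L; ++i)
                s.write_field = s.write_field + 1;
        });
        std::thread reader([&s] {                 // re-fetches the line after each write
            std::uint64_t v = 0;
            for (long i = 0; i < 500000000L; ++i)
                v += s.read_field;                // time this loop: packed vs. padded
            (void)v;
        });
        writer.join();
        reader.join();
    }

With the packed layout, every increment by the writer invalidates the reader's copy of the line; with the padded layout the reader's line is never dirtied, which is where the two-to-four-X (and, on bigger boxes, roughly 20X) difference comes from.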
When we first started developing this — the tool that Jirka will talk about in a minute — I was running Linpack on a four-socket box and kept increasing the number of threads to find the knee where performance started going down; I wanted to find the maximum number of threads I could run Linpack with. When I ran it at its peak, I looked in the kernel and sure enough, there was again another kernel scheduler data structure where two hot fields in the same cache line were contending with each other. I separated them into two cache lines and was able to get another 25% more CPU threads scheduled to Linpack, and the performance went up accordingly. It turned out, when people looked hard at it, that the real fix was different: gee, we don't really need to be accessing this. Of the two hot fields, one didn't need to be accessed that often, so they made it much less hot — very cold. The cache line then only had one hot field in it, and that gave the same speedup in performance.

So, to summarize: CPU cache line false sharing — multiple threads or multiple processes accessing or modifying the same cache line — can get very costly once you start going across NUMA nodes. Atomic operations — your GCC atomics, the fetch-and-add and fetch-and-decrement builtins, whatever they are — all use locked instructions. They lock the cache line, and they amplify the performance penalty. You'll see these performance hits on two- and four-socket machines, and you really see them as you go larger.
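As a small sketch of that atomics point — field names here are made up for illustration — a locked increment next to read-mostly data drags that data into the contention, and the usual cure is to give the hot counter its own cache line:

    #include <cstdint>

    struct stats {
        std::uint64_t hits;      // hot: bumped by every thread with a locked instruction
        std::uint64_t limit;     // read-mostly, but it shares the hot cache line
    };

    struct stats_padded {
        std::uint64_t hits  __attribute__((aligned(64)));  // hot counter owns its line
        std::uint64_t limit __attribute__((aligned(64)));  // read-mostly data lives elsewhere
    };

    void record_hit(stats* s) {
        // Generates a lock-prefixed add: the CPU takes the whole cache line
        // exclusive, so readers of `limit` pay for the contention too.
        __sync_fetch_and_add(&s->hits, 1);
    }

The same applies whether you use the __sync/__atomic builtins or std::atomic; the locked instruction underneath is what makes the shared line expensive.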
Now, a quick intro to the feature we added to the perf tool: c2c, for cache-to-cache. Jirka recently merged it upstream. We're hoping to get it into the next version of RHEL, and we have a prototype build available for RHEL or CentOS that you can run. It needs an Ivy Bridge or newer Intel processor. What's nice about it, as he'll show you, is that you run it for a few seconds on your application and it shows you all the readers and writers to your cache lines, sorted by the hottest: the cache line virtual address; the offsets into the cache line that each reader and writer is accessing; the PID, the TID, the instruction address, the function name, and the source file and line number where it's happening — so you know right where to go in your sources; and the node and CPU numbers where these accesses occur. You might look and say, gee, I thought this function was only going to run on this NUMA node — why is it being accessed from every node on the whole box? It also shows the average latency the loads are taking. You can't see latency on store instructions — it's not available in the hardware — but it will show you when hot variables are sharing a cache line. And what's even more interesting: suppose you're a user program and you're using a pthread mutex. Those are hot enough on their own. They're 40 bytes, and with a cache line being 64 bytes, the last thing you want is other hot variables in that cache line — your application will slow down. So this will show you, for example, if your mutex is spilling across two cache lines or sharing its cache line with other hot variables. And with that, over to Jirka.

Okay, so this part of the presentation is mainly the practical one, to show how the c2c command actually works, and we have a speedup example that I will show later, where we used c2c to find the contention. So, basically, c2c is another perf subcommand. It's a sampling subcommand, which means you have a record phase and a report phase: first you record the data, then we display it using the report command. What are we actually sampling? The Intel memory events, and each sample of those events is basically either a load or a store in your application. It's not that you get samples for all of the loads in your application — there are just too many of them — the CPU just picks up some random subset, and that's what appears in your data. Each sample of a memory event, load or store, provides the virtual address, the place we actually accessed in memory. It provides the type, so we know where we got the data from: was it a cache, or main memory, or remote or local memory? And we also get the latency, so we have some idea, in cycles, of how expensive that load was. So that's the summary: it's a sampling command, and we get this kind of information.

First, really briefly, how you actually record the data. perf c2c record is just a wrapper over the standard perf record command; it configures everything for you, and there's not much we leave to the user in terms of the events, so you don't need to care. There are just a few options, like -u and -k, where you choose whether to monitor user space or kernel space, and -l with a latency number, where you can say you're only interested in loads that took more than the number of cycles you specify. The rest of the options are whatever you're used to specifying with the standard perf command — your workload, system-wide mode, a PID, call chains, basically anything — and you provide the command the same way you do for standard perf record. Here are some example record invocations that we actually use; this is just for reference for when you go back to the slides and try c2c yourself.

The report is where it actually gets interesting. We sort the data — I'll have more details on this later — and display it to the user. We have two interfaces: perf's text user interface, with navigation using the arrow keys so you can go through the data, and the standard stdio interface that you can dump to a file and look at in your own viewer. I will start with the stats; that's one of the first steps when you look at the data — you want to see what you actually got. We have three statistics tables. This is probably the most important one: it shows how many loads, how many stores, and details like how many loads were resolved in the level one, level two, or level three cache, and how many of them ended up in main memory. Probably the most interesting number here is the number of HITMs. A HITM is a kind of memory access, and it's really crucial for understanding the rest of the c2c output. So what is a HITM? Joe was actually talking about it — he didn't use the HITM name itself — but it's the case where you are experiencing false sharing. The situation is that CPU 5 has some memory in a cache line, owns that cache line, and writes to it; that means the data is in CPU 5's cache line and it's modified.
And now CPU zero wants to use this memory, so it sends out a request for it, and it gets the response: okay, we hit the data in a cache line, but it's modified — and that's why it's called a HITM: we hit the data, but it's in a modified cache line. That means CPU 5 needs to write the data back and CPU zero needs to load it again, which is really expensive, and that's why we're interested in it. This is what happens during false sharing: someone is writing to the memory, another guy on another CPU wants to read it, and they're sharing one cache line. It can be a remote or a local HITM access, depending on whether the other CPU is on a remote node or the local one. So this is what we're actually looking for when we process the data in the report: we sort all the data by cache line, by virtual address, and we look for cache lines that experience HITM accesses and store accesses, because that's basically the signature of false sharing.

As a first output, we show the list of the cache lines we detected. From the HITM point of view, these are the most expensive cache lines — the ones with the most HITM accesses and store accesses that we were able to find — and for each cache line we display this kind of information: how many HITM accesses there were, how many store accesses. And because we have all the other information about the accesses, we display basically everything we have, so you can also see the other accesses to the cache lines and where they were resolved — level one, level two, remote or local memory. So this is for a start, just to show whether we found some contended cache lines and how many.

The heart of the output is this list that we call the Pareto distribution, where we display the details for each cache line. For each cache line, we list the offsets we detected being accessed and the way they were accessed. These two columns show remote and local HITM accesses, and the next two columns show the store accesses. So this gives you the idea of which offsets of the cache line were accessed in which way. Moreover, for each cache line, we show where the access came from. So you know how the offset was accessed, and now you see who is behind it. We take all the data we have from the sample and try to convert it to source line information. Sometimes we're more successful than other times, but when it works, you get a direct pointer into the source code. So you can see that for those offsets, those guys are responsible; you go to the sources, find out what variable was accessed, map the variable onto the cache line, and you know who is behind it. We also display the latency. As I said, for each sample we get the latency information — we see how many cycles were spent on each access — and for each offset we maintain the average access latency, which can also be helpful. It's very useful to see which offsets are more expensive than the others. If I could say something about that: on a recent x86 processor, accesses to the caches might be five to 25 or 30 machine cycles, depending on the cache level; local memory, roughly 50 to 200 cycles; remote memory, 75 to 200 cycles.
Look at the average latencies for all the remote HITMs, all the local HITMs, and then all loads in general. You can see that these load latencies — these are the loads in your program — are often taking a whole lot longer than the numbers I just quoted. When there's a lot of contention, it's not uncommon in this tool to see some of your average load latencies go up to 5,000 or 10,000 cycles, and then you can dump the raw data, sort all your loads by the worst latency, and see exactly which locations in your program are causing the worst latency. We've seen latencies go out to 65,000 cycles, and the reason we don't see anything higher is that that's the maximum of the processor's counter. But anyway, it's pretty alarming to see some load latencies go well over 50,000 or 60,000 cycles. So, go ahead.

Okay, so that's basically it for the display. We detect the cache lines that show the access patterns of possible false sharing, we display those accesses in detail, and we give you the pointers to go check the sources and find out what you can do about it. We actually have a very nice speedup example that we're very proud of, because we managed to speed up perf itself, which is almost impossible. The application we picked was the multi-threaded report. It's a very recent feature, developed by Namhyung Kim, and it's going to be upstream in a few weeks. The idea is that sometimes the report can take a very long time, so we split the workload into multiple threads: each thread gets a portion of the data file, parses it, and then everything is merged together. It's faster than single-threaded, of course, and we just used c2c to find out if we could make it even faster. It turned out that we could.

So how does the multi-threaded report work in terms of balancing the load? Basically, it finds out the number of CPUs in the system and creates the same number of threads, and as I said, each thread parses its portion of the data file and the results are merged together. So we ran this example: we picked a really big data file, ran the c2c record session, and c2c showed us five cache lines with heavy contention. You can see there's a large number of HITMs and stores. Just five, but it was a good start. Let's look at the details — I'll show the details for one of the cache lines; the rest were very similar. First, when you get this detailed output for a single cache line, you check the access patterns, looking for an offset that stands out because it's being accessed only for reading or only for writing. Here, offset 18 stands out because it has only the HITM accesses, no stores, while the rest of the cache line is being used for both reading and writing. And when you look at who is responsible for those accesses, the first offset stands out because the function behind it is local to perf, while the rest of the cache line is accessed by functions from the pthread lock implementation. So this is one point where you can see there's probably something you can do, because those two users of the cache line are really different. We were lucky and got the accesses translated to a source file and line. So if you go and investigate, you find that this access ends up in a structure with two fields.
The first one, the entries field, is eight bytes long, and the rest of the structure is the mutex — the pthread rwlock. So what you see in the cache line in the output is access to this structure: offset 18 is basically the entries field, and the rest of the cache line is the mutex. And as Joe noted before, you always want your mutex separated into its own cache line. There are several ways to do this; it depends on the program, but you can pad the members so they fall into different cache lines, or you can use the GCC aligned attribute. That's what we did in this case. As I said, we had five cache lines with contention and we did the same thing for all of them; it was basically just shifting the fields of the structures. And when we ran again, we got a very nice speedup of 45% from those changes — really easy, one-line changes — a 45% speedup in the threaded report.

So that was nice. We ran the record session again, just to check — sometimes when you make changes like this you create new contention somewhere else. And we found we were left with two other shared cache lines showing a lot of contention, and when we investigated more deeply, we found that those two cache lines basically correspond to locks: there are two locks in the threaded perf report, and those are their accesses. So the last thing we figured we could try: those cache lines are shared across the whole system, so we see a lot of remote HITM accesses for each of them. The server we used was a four-node system, and by pinning all the threads to a single node, we got an even bigger speedup. What we effectively did was change all the remote HITMs into local HITMs, and it showed: it's even faster if, instead of four nodes, you run it on a single node. There are some obvious conclusions, like that the multi-threaded perf report doesn't scale so well; we probably need some design changes, because sharing locks across the whole system is not a nice idea. Another conclusion is that we were very surprised how, as Joe said, in a matter of minutes you can actually see the false sharing. The report for a cache line points you directly to the source file, and from there you can tell whether the false sharing is real or not. And yes, moving to a single socket shows the remote-to-local HITM speedup; that was another conclusion. We have some further steps to minimize contention that we want to continue working on.

So this — we're near the last slide — is just some goodness. Pack read-mostly variables together; keep them away from your hot written variables. Don't be afraid to waste space: putting hot variables in their own cache line is goodness. Align your data structures to cache line boundaries. I don't program in C++ every day, but I have an example at the end where you can allocate a class on a cache line boundary and get your alignment in a C++ class. Lower the granularity of locks: the longer you hold your locks, the more contention you'll have. And use compile-time asserts.
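For reference, here is a rough sketch of the kind of change described above — the struct and field names are illustrative, not the actual perf structure — using the GCC aligned attribute to give the lock its own cache line, plus the compile-time assert that comes up next:

    #include <pthread.h>
    #include <cstddef>

    struct report_data {
        unsigned long entries;                               // hot 8-byte counter
        pthread_rwlock_t lock __attribute__((aligned(64)));  // lock starts a fresh cache line
    };

    // Compile-time assert: fails the build if someone later adds or removes
    // fields and drags the lock back onto the same cache line as `entries`.
    static_assert(offsetof(report_data, lock) % 64 == 0,
                  "lock must start on a cache line boundary");

The same effect can be had by hand-padding with a char array, as in the earlier reader/writer sketch; the attribute just lets the compiler do the padding for you.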
The worst thing that can happen is you get your program all set up, everything's aligned, and then a month later — say you have an embedded data structure — another developer on another project adds or removes fields above yours, and all of a sudden your alignment gets shifted and all the work you did gets thrown out. So there are compile-time asserts where you can say: I assert at compile time that the offset of this field has to be so many bytes from the beginning of the data structure. Then if they do add or remove fields above you in your embedded data structure, it will not compile, and your error message gets printed out for them. That's a nice way to keep someone from undoing your work. Next slide. Questions?

The question was: will this work inside of a virtual machine? The answer is no, and the reason is there's something in the Intel PMU called PEBS — Precise Event-Based Sampling. It's a feature that was added about four or five years ago and it gives you much more precise events. The CPU is going so quickly that by the time you get an event without PEBS, the PC that gets reported is far away from what actually happened. PEBS is not supported in virtual machines yet. It's something Intel has not supported; they've tried to get it working, it's not stable, and it's just not there. So perf c2c, perf mem — anything that uses PEBS — will not work inside a virtual machine. Any other questions? Yes. Can we get the information just for a cgroup and not for the whole machine? I don't think so at the moment, no. We don't have any sorting by cgroup. There is some work in that area, but not c2c related, and not at the moment. Then there was another question. Can you go back to the slide? Oh, you can.

If you have a C++ class and you want to align it, because some of the fields inside of it are causing you trouble with alignment: take your new — suppose your declaration is something like "Foo *ctx = new Foo()". Change it to a call to an aligned allocation of 64 bytes, so you're allocating the storage aligned, and then add a second line to do the new on that aligned memory. That will give you an aligned C++ class. Once you know the base of your class has been aligned on a cache line boundary, you can use your compiler primitives to align the fields inside. Classes are a little more difficult because instances can be allocated in initialized data or BSS or wherever.
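Here is roughly what that looks like in code — Foo is a hypothetical class, and the "second line to do the new on that" is written here as a placement new on the 64-byte-aligned storage (std::aligned_alloc is C++17; posix_memalign works the same way):

    #include <cstdlib>
    #include <cstddef>
    #include <new>

    class Foo { /* ... fields you want cache-line control over ... */ };

    Foo* make_aligned_foo() {
        // aligned_alloc wants the size rounded up to a multiple of the alignment.
        std::size_t sz = (sizeof(Foo) + 63) & ~std::size_t(63);
        void* mem = std::aligned_alloc(64, sz);   // storage starts on a cache line boundary
        if (!mem)
            throw std::bad_alloc();
        return new (mem) Foo();                   // construct the object in that aligned storage
    }
    // To release it later: call the destructor explicitly (p->~Foo()), then std::free(p).

Once the base of the object is cache-line aligned like this, the alignment attributes you put on the fields inside the class actually line up with real cache line boundaries.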
So, if you're on any version of RHEL or even CentOS — I think the binary there might even work on some other distributions; I think I even tried it on SUSE — you can grab that binary to play with. The best thing to do is to go upstream, download it, and build it yourself, but if you don't want to do that, just grab that binary and play with it. There's a blog here that will explain everything we just talked about in gory detail. Question: how much statistical noise gets introduced by using perf? It does slow your program — it will consume a CPU and slow your program down — but your memory accesses and their patterns should be the same. Maybe by slowing down one CPU you could cause your memory access patterns to change, but largely you will get the same patterns and you will see them. I've never worried about the slowdown when I'm profiling, because the reads and writes are still there.

Question: which Intel servers, as I mentioned before, can you run this on? Ivy Bridge and newer; the PEBS events we need were not supported in the earlier servers. There was another question here? Yes: shouldn't the compiler align the classes automatically? malloc aligns by default on 8-byte boundaries, and I think new does as well, because underneath I believe it uses malloc. Sure, well, I think so, but if you're doing a lot of small allocations, it would really hurt you if all allocations were on a cache line boundary. Any other questions? One more, yes. The question was whether there's any extra filtering on the latency afterwards. It's in the kernel — you just specify the latency and it all happens in the kernel; it's part of the event configuration, so the CPU does it for you. There's another one. Yes, the question was: is there any work or plans to feed the reporting back into your sources, so you get a sort of self-correction? No, none that I know of. There is a lot of interest, though — anyone here familiar with the pahole tool? It shows the breakdown of your data structures. There is interest in mapping this onto that, so you could get a dump of all the data structures in your program or kernel and have the report output mapped right onto them.

Any more? Yes: is it a problem that we can't run this on other architectures? If the other architectures have the idea of alignment and respect cache lines — if they have a cache line size of 64 bytes — the improvements you make for your program on x86 should carry over to those architectures. If they have a cache line size of 128 bytes, you're going to miss some things. And there's nothing stopping other architectures from plugging in the underpinnings so that this works there as well; this just got rolled out. So — anything else? Yes: is there any hardware support necessary for the events? On x86, Ivy Bridge or newer. And as far as kernel versions, I'm not sure where it got introduced, actually, but it's not that recent. As for the hardware, it was introduced in Ivy Bridge, I'm pretty sure, but the results are getting better and better with every CPU model — we got better results on Haswell than on Ivy Bridge, for sure. So the newer, the better. Is that it? Yes, a lot of the overhead is perf itself. One of the things in Jirka's output: he showed that you can increase the sampling frequency if you want, and sometimes I increase the frequency to sample more often, and then you'll see the perf cycles go up. So it's largely perf, and I'm sure a good part of it is in the kernel as well. Yes, you can use numactl and pin the perf tool to any CPU you want. Also, the overhead usually depends on what you actually want in the sample: if you configure it to record call chains and other expensive stuff, it will be much slower. Yes, call chains are slower. But I always like to do it without call chains first, to get the high-level view and say, oh wow, I do have a problem there, and then I'll run it with call chains turned on to see which callers bring me up to that point. I usually do it in two passes. Okay, thank you very much. Thank you. Thank you.