[Pre-talk logistics in Czech, largely unintelligible in the recording: timing of the talk, handling of questions, microphone use, and a note that the session may be streamed or recorded.]

Ready? Welcome to the next presentation. Please welcome Jiri Olsa with his presentation, "Perf memory profiling".

Hi, my name is Jiri Olsa. I maintain perf for Red Hat Enterprise Linux and I work on perf upstream. This presentation is about how you can use perf to profile memory. It's not the complete set of perf features that help profile memory, just some new ones. I will start with some basic topics, describing what we have in the memory cache and how it actually works.
Then I will go through some basic events, and then into more detail about two fairly new features that were introduced lately: the first is CQM and the other is C2C.

A little disclaimer at the beginning: everything I say about caches and hardware is connected to the x86 architecture, namely to Intel CPUs. I will also show some perf examples, but not everything is upstream yet, so if you actually want to try what I show today, you'd better build perf from the source in this git tree, from this perf/core branch; there's a nice how-to there about installing perf from sources.

OK, let's start with the basics. Memory is the device in the computer that stores information, temporarily. The problem with memory is speed: it cannot keep up with current CPU speeds, and that's why we are interested in profiling memory accesses. When a stream of instructions goes through the processing unit, every access to memory, load or store, adds the latency of going to main memory, and that's something perf can actually help with a bit.

Main memory is slow compared to the CPU, so people put a small memory between the CPU and main memory: the cache. It's a smaller, faster memory. If the data is not in the cache, you need to go to main memory, and you don't load it byte by byte: you load whole cache lines, which for Intel means 64 bytes at a time.

Nowadays the cache is actually a hierarchy, not a single small cache. The first level is a really small, really fast memory, then there's a second level and possibly more, each level bigger and slower. The caches sit in the same place as the processing unit, so accessing them is much faster than going to memory. What the hierarchy looks like in real life varies from architecture to architecture; here you can see some of the latest Intel microarchitectures, Broadwell and Skylake being the latest ones. Basically, you need to know what CPU you are running, then consult the Intel software developer manual or the optimization manual, and there you can find out what the cache hierarchy is: how many levels and what sizes you have in your CPU. Most of the time you will get this picture: you have a socket, inside the socket several cores (any number of cores per socket), each core has its own dedicated level 1 and level 2 caches, and all cores in the socket share the last-level cache, in this case level 3. This is the most common picture you will get.

How can you find out the topology of the CPUs and caches you are running on? You can use perf report, because any time perf stores data to the perf.data file, it also stores information about the CPU topology. So you run perf report on the data with capital -I (--show-info) and you get all the information needed about the topology. Here you can see the CPU topology: CPU 8 has core ID 2 and socket ID 1.
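As a side note (not from the talk), if you just need the line size and cache sizes of the machine you are on, glibc also exposes them through sysconf(3). A minimal sketch; the _SC_LEVEL* constants are a glibc extension, so this is Linux/glibc-specific:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* glibc extension: cache geometry without parsing sysfs */
        printf("L1d line size: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
        printf("L1d size:      %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
        printf("L2 size:       %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
        printf("L3 size:       %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
        return 0;
    }

On a typical Intel box the line size prints 64, matching the 64-byte cache lines mentioned above.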
The rest of the picture is the topology of the caches, so you can see how many level 1 caches you have and which CPUs share them; level 3 here, for example, is shared by the whole socket. So this is how, from the data file perf stores, you can find out what the topology was. (To a question:) whether it also contains the NUMA topology: it is there, it just didn't fit in the picture.

About sizes and speeds: these are the latest Intel CPUs, and these are common sizes and latencies for the level 1 through level 3 caches. The level 3 size depends on the CPU configuration; it can be 2 megabytes or several tens of megabytes. Four cycles is a quite common latency for level 1; for level 2 and level 3 it gets higher, as you can see, and going to main memory can cost even hundreds of cycles.

Before I get to the cache features, a quick picture of the very common perf events that help you profile caches. If you run perf list and grep for "cache", you will see all the events that deal with caches: cache-misses and cache-references give you the numbers for the last-level cache, and the rest of the events are self-explanatory; there are events for the data cache, the instruction cache, everything you want to measure.

OK, those were the basics. Now the first feature I want to talk about: cache quality of service monitoring (CQM). This feature was introduced on Intel CPUs starting with the Haswell microarchitecture, and it helps you identify how the program, or whatever entities you monitor, are using the last-level cache. It can give you the occupancy, how many megabytes some item occupies in the last-level cache, or the bandwidth, how many megabytes were taken from main memory through the last-level cache over time. Only the first part of the feature is implemented so far, so we have just one event, intel_cqm/llc_occupancy/, which gives you the last-level cache occupancy for the workload you are measuring.

How it actually works: there is a register on each CPU that you can fill with an ID, and you assign that ID to the workload you want to monitor, let's say a process. Each time the process gets scheduled on the CPU, the CPU tags all its loads and stores in the last-level cache with this ID, and later you can query this feature and get back how many bytes in the last-level cache were touched by this entity.

That's what you can do with perf stat. If you run perf stat on the llc_occupancy event with -a for system-wide and capital -A to break the counts down per CPU, you get output showing how many bytes each CPU has stored in the last-level cache. Another usage example: you can monitor a workload given to the stat command as usual, and it will show how the last-level cache is used during that workload. And the last example: it's also possible to attach this event to a running process, as with any other perf event, and it will show you how the process is using the last-level cache.

The other part of the feature, memory bandwidth monitoring, is not done yet; the last version of the patches was posted last year in June, so we are still waiting for a new version to come. OK, that was CQM, monitoring of the last-level cache.
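An aside on the generic cache events mentioned above: perf stat drives them through the perf_event_open(2) syscall, and you can do the same from C. A minimal, self-contained sketch (not from the talk) that counts cache-misses for a piece of code, with error handling trimmed:

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CACHE_MISSES; /* perf's "cache-misses" */
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        /* pid 0, cpu -1: this process, on any CPU */
        int fd = perf_event_open(&attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... the workload you want to measure runs here ... */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count;
        read(fd, &count, sizeof(count));
        printf("cache-misses: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }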
The other feature I will talk about is C2C, which stands for "cache to cache". It's actually not that new: it was introduced about two years ago by Dick Fowles, Don Zickus, and Joe Mario. At the time it didn't get to upstream for various reasons; now we are trying to slowly push it to upstream again, and we have a new version of the tool that will hopefully make it upstream very soon. All the examples I have today are from the new tool we are now developing.

OK, so what does C2C actually do? It monitors loads and stores and gives you an idea of the cache line contention during your workload. This is very tightly connected with cache coherency, so let me explain a little about the cache coherency details.

Let's explain on an example. Say you have a setup with two CPUs, each with its own cache, sharing one memory, and each of them accesses variable a. If CPU 0 accesses variable a just for reading, the memory gets loaded into CPU 0's cache. Now CPU 1 writes to variable a, so the memory gets loaded into CPU 1's cache and changed, say to the value 2. Now CPU 0 wants to read the memory again, and this is the problem for cache coherency: what value should it actually get? What cache coherency ensures is that whenever a program accesses memory through the cache, it gets the latest value. It somehow ensures that the update from CPU 1 reaches CPU 0, so the program on CPU 0 gets the new value and not the stale value it had before.

How is it implemented? Cache coherency is implemented in the hardware. To implement it, Intel uses something called a snooping cache: every cache can see all the traffic, all the messages, on the shared bus. So any time a CPU wants to change something in its cache, it shouts on the bus, "I am going to change this and this," and all the other caches see it and behave accordingly. The other part of the snooping cache is that for each cache line we hold one of three states: modified, shared, and invalid. Together, the per-line states and the messages on the bus complete a state machine, where the transitions are the bus messages, and that's how cache lines are maintained within the CPUs.

Let me explain on the example. Same story: CPU 0 reads the memory. When it reads, it shouts on the bus, "I want to read this cache line." CPU 1 sees the message and just doesn't care, because it doesn't have this cache line loaded in its cache. Step 2: CPU 1 changes the memory, so it shouts on the bus, "I am going to write to this memory." CPU 0 sees this message and thinks: OK, I have this cache line loaded, I need to do something about this, because CPU 1 is going to change it. So it moves its cache line to the invalid state, while CPU 1 loads the memory, changes it, and puts its cache line into the modified state. And the last step: CPU 0 reads the memory again, saying on the bus, "I am going to read this memory." CPU 1 sees the message and says: OK, I need to do something, I have this memory loaded in my cache and it's modified. What it needs to do is store the data back to memory, and after that CPU 0 loads the memory into its cache. (To an audience question:) yeah, I will get to that, thanks.

So this is just the basic MSI protocol.
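To condense that walkthrough, here is a toy model (not from the talk) of the snooping side of MSI. It only illustrates the transitions just described, not how real hardware is built; in particular, whether a modified line that another CPU reads ends up shared or invalid differs between protocol variants, so the shared outcome below is an assumption:

    #include <stdio.h>

    enum msi_state { MSI_INVALID, MSI_SHARED, MSI_MODIFIED };
    enum bus_msg   { BUS_READ, BUS_WRITE };  /* what another CPU shouts on the bus */

    /* What a snooping cache does to its own copy of a line when it sees
     * another CPU's message about that line. */
    static enum msi_state snoop(enum msi_state mine, enum bus_msg msg)
    {
        if (mine == MSI_INVALID)
            return MSI_INVALID;   /* we don't hold the line, we don't care */
        if (msg == BUS_WRITE)
            return MSI_INVALID;   /* someone else writes: our copy goes stale */
        if (mine == MSI_MODIFIED) {
            /* bus read of our dirty line: write it back so the reader
             * gets the latest value (the writeback itself is elided) */
            return MSI_SHARED;    /* assumed; some variants drop to invalid */
        }
        return mine;              /* a shared copy can stay shared on reads */
    }

    int main(void)
    {
        /* step 2 of the example: CPU 1 announces a write to a line CPU 0 holds */
        enum msi_state cpu0 = snoop(MSI_SHARED, BUS_WRITE);
        printf("CPU 0 copy: %s\n", cpu0 == MSI_INVALID ? "invalid" : "valid");
        return 0;
    }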
I am not sure plain MSI was ever actually used in some CPU, but all the coherency protocols, all the coherency mechanisms used nowadays in Intel, are based on this MSI; they are extended versions of it. Extended means there are some extra states, and those extra states allow, for example (this was the earlier question about whether the data needs to go through main memory), the value to be forwarded directly over the bus. So they are enhanced versions of the MSI protocol, but the idea is the same. It's good to know what's behind it, because a very simple access to a variable can cause a lot of traffic on the memory controller and on the bus, and that's why memory profiling is important.

Cache coherency is also behind false sharing, which is a real performance issue. The point of the issue is that one cache line can store two variables: a cache line is 64 bytes, so two variables fit in very easily. Take the previous example, with variable a plus another variable z, and say we are in the state where CPU 1 has the cache line with variable a in the modified state. Another CPU comes in and says: OK, I want to read variable z, which happens to be in the same cache line. What will happen? CPU 2 again shouts on the bus to read this line; CPU 1 sees it and thinks: but I have this cache line modified in my cache, so I need to do something about this. It has to store the data back, update the memory, and only after that can CPU 2 actually use the cache line. Just to see it clearly: a separate process accessing an unrelated variable can cause a lot of traffic and a lot of latency on other processors that have nothing to do with the z variable at all. That's what false sharing is about, and this is the issue C2C tries to address, tries to help profile.

And what do we have to actually monitor this? There are two basic events that were introduced quite a long time ago: mem-loads and mem-stores. Like any event in the perf system, you can count them or sample them, but they are a little more enhanced; they provide additional data. First, they provide the virtual address: any time you load or store something in memory, you get the virtual address, which is good, because that's what we are going after. Next, for the loads, you can specify the latencies you are interested in; say you are interested only in loads with latencies bigger than 30 cycles, and it will give you only the loads that fit that range, with four cycles being the minimum number. And the third thing is the data source, the history behind the load: you get a sample from a load instruction, but how did the data get back to the CPU cache? Was it a hit in L1, L2, L3? If we didn't hit L3, what happened afterwards, did we go to the bus, did we see this cache line modified in some other processor? That's probably the most costly case: you want to load something that is loaded in some other CPU's cache and modified. We have the same for the stores. As with everything, it's not that simple: you will not get all the loads or all the stores. The Intel software developer manual says these events randomly select the loads and stores you see in the final data, so keep in mind you are not getting all of them, just some random subset.
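Since this pattern drives the rest of the talk, here is a self-contained sketch (not from the talk) of false sharing that you can point perf c2c at. The two counters are unrelated, but with the plain layout below they share one 64-byte line, so the two threads keep invalidating each other's caches exactly as described; the 64-byte line size is the x86 assumption from earlier:

    /* falseshare.c: gcc -O2 -pthread falseshare.c */
    #include <pthread.h>
    #include <stdio.h>

    struct counters {
        volatile long a;   /* thread 1 only writes a */
        volatile long b;   /* thread 2 only writes b, in the same cache line */
    };

    static struct counters c;

    static void *bump_a(void *arg)
    {
        for (long i = 0; i < 100000000; i++)
            c.a++;
        return NULL;
    }

    static void *bump_b(void *arg)
    {
        for (long i = 0; i < 100000000; i++)
            c.b++;
        return NULL;
    }

    int main(void)
    {
        pthread_t ta, tb;
        pthread_create(&ta, NULL, bump_a, NULL);
        pthread_create(&tb, NULL, bump_b, NULL);
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);
        printf("a=%ld b=%ld\n", c.a, c.b);
        return 0;
    }

Run on two different cores, this typically runs noticeably slower than a padded version (like the fix shown after the next example), purely because of the coherency traffic.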
In addition to the load latency events there are the data-address events. They are much the same as load latency, except they don't allow you to specify a latency threshold for the loads; otherwise, in a sense, it's the same: there is a separate event for each data source, and you get the data address of the loads or the stores.

So we have the problem, we have the events, and C2C combines all this together. It provides record and report commands: to monitor a workload you just run perf c2c record with your workload, and you can use any modifiers you would use for a normal perf record. Report is the main guy that does all the job: it loads all the data, all the stores and all the loads, sorts them by cache line, and sorts the cache lines by the most costly ones. At the end, you see on the screen the data sorted by cache line, most costly first.

I have three examples prepared. The first is a made-up one, where we already know there is an issue, just to show what C2C actually displays; the other two are from the real world, actually identifying false sharing issues in the kernel.

OK, so first the simple one. Imagine you have a structure shared between two processes that are meant to run only on separate CPUs; one is storing to variable b, the other is loading from variable a, and variables a and b happen to be in the same cache line. You can actually use a very nice tool here, pahole: you point it at the binary and say which structure you are interested in, and it shows you the layout of the structure; if the structure is bigger than in this example, it also shows the assignment of cache lines within the structure, which is very helpful for identifying these issues.

So I ran perf c2c record on this guy, and this is what I got. Among other things, the first entry is this cache line. As you can see, it's still a work in progress: most of the time we see only the addresses, but if you are lucky enough you will see the data symbol. This is the basic output of c2c report: first the sorted cache lines, then the data symbol if there's luck and it got translated, then the total number of records that belong to this cache line, and then the data-source details, so you can see there are some stores here and some loads here. You can navigate to the cache line you want and press 'd', as in detail, and you will see the whole cache line: the offsets within the cache line and the way it was accessed. We know, because it's a made-up example, that we have one structure with two variables, and from this output you can see the first variable ended up at offset 16 and the other at offset 24; one is being stored to and the other is being loaded from. You can get the numbers for the loads here; there are several load and store columns I will not get into too much. And you can press 'd' again on an offset to see all the guys responsible for touching that offset of the cache line; again you can see the code it maps to, and the actual processes that touched this memory.

In this made-up example, the solution to make it faster is to put the variables into separate cache lines; actually only one line is necessary, and it puts b into a separate cache line. In the pahole output you can now see variable a separated from variable b by 64 bytes, so variable b ends up in a separate cache line.
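In C, that "one line" fix can look like the following sketch; the aligned attribute is GCC/Clang syntax, and 64 is the assumed Intel line size:

    #include <assert.h>
    #include <stddef.h>

    struct shared {
        long a;                               /* only loaded by one process  */
        long b __attribute__((aligned(64)));  /* only stored by the other;
                                                 now starts its own line */
    };

    /* pahole would now show a hole after a, with b at offset 64 */
    static_assert(offsetof(struct shared, b) == 64, "b has its own cache line");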
A more interesting example is a scheduler speedup we were able to find with this feature. We have this benchmark, perf bench, which creates many processes that communicate with each other through pipes, so there is a lot of scheduling going on; any time you make a change to the scheduler, it will most likely show up as a speedup, or a slowdown, in this benchmark. So we recorded it like this, perf c2c record with the benchmark, and this is what the report looks like. You get a lot of cache lines, because the data was monitored system-wide, so all the memory accesses in the system end up in the data file. Again: the cache lines that were accessed, the virtual address of each cache line, with luck the translation of the cache line into variable names, and then the access details for all the cache lines. So you monitor your workload, then go through this output and try to find access patterns. What helps is that for each cache line we now actually get the call chains, so for each cache line you can see the pattern, the call chains through which it was accessed.

For the issue I am talking about, this cache line is the interesting one. If we press 'd' we see the details about this cache line, and you can see it got accessed on almost all the offsets; but what is interesting here is that offsets 8, 16, and 24 are accessed only by loads, you can see there were no stores recorded there, while from offset 32 on it's a combination of loads and stores. If you see something like this, it's kind of a heads-up that something like false sharing might be happening here.

So you go to the first offset and get the details about how it was accessed. The first column is the code address, the address of the instruction that caused this event to generate a sample. Let's go and check the first one: I copy the address, check the disassembly, and find that this address belongs to this instruction, which is part of this macro. Let's go and check the macro: you can see it basically expands to this for-loop, and this for-loop only touches a member of the sched_entity structure. If you go check sched_entity, you get this output: here is the parent pointer, which is where the sample came from, then you have two other pointers, and then you have the sched_avg structure, which fits nicely with what we have here: the first three items, 8-byte pointers, are only being loaded, and the rest of the structure, the sched_avg structure, is being written to. On the picture it looks like this: sched_entity shares the cache line, this part is read-only, and the other part is written, and also read some. If you split the cache line, if you put sched_avg into a separate cache line, you actually get a speedup of 20 seconds on the benchmark. So it was very helpful: a really small change that makes a big deal in the scheduler and in the benchmark.
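The shape of that change, as a hedged illustration rather than the literal kernel patch (the *_like types are made-up stand-ins for the real scheduler structures):

    #include <assert.h>
    #include <stddef.h>

    struct sched_avg_like {                 /* stand-in for struct sched_avg */
        unsigned long load_avg, util_avg;
        unsigned long long last_update_time;
    };

    struct sched_entity_like {
        void *parent;                       /* the three read-mostly pointers: */
        void *cfs_rq;                       /* loaded constantly, never stored */
        void *group_node;

        /* aligning the frequently written averages pushes them onto their
         * own cache line, so updates stop invalidating the pointers above */
        struct sched_avg_like avg __attribute__((aligned(64)));
    };

    static_assert(offsetof(struct sched_entity_like, avg) == 64,
                  "avg is on its own cache line");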
The last example I have is about running the same benchmark, but on a system that has ftrace enabled, with the function graph tracer. ftrace is the kernel's internal tracing; if you enable it, it adds extra work, extra tracing of kernel code into the trace buffer, so it adds extra overhead; it's not that big, but it is there. You monitor the benchmark the same way, and the c2c report gives you similar output: the first column is the cache lines, the other column is the data symbol translated from the cache line, and the rest is the access data associated with each cache line. You go through all this data and again look for patterns; I will make it quick and find the one.

OK, this one: this is a cache line shared by two variables, where one is only being stored to and the other is only being loaded from. If you go to the sources, you will see that the first variable is actually a lock guarding another global variable: the lock is trace_cmdline_lock, the variable is saved_cmdlines, and they are sharing the same cache line. So while the lock is being read and written, this variable is only ever being read. If you put them into separate cache lines, you get a speedup of 8 seconds, which is not much, but the change is not big either, so it's a nice speedup for a two-line change.
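The ftrace fix follows the same pattern: a lock that is written on every acquire and release was sharing a line with data that is only read. A hedged userspace sketch of the idea (a pthread mutex standing in for the kernel's spinlock; the names are made up):

    #include <pthread.h>

    struct cmdlines_like {
        pthread_mutex_t lock;     /* hot: stored to on every lock/unlock */

        /* read-mostly data on its own line, so lock traffic no longer
         * invalidates readers of the saved command lines */
        char saved[128][16] __attribute__((aligned(64)));
    };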
That was the last example. As for the future, the plan is to merge the tool to upstream as soon as possible, because it's been out of tree too long and not many people know about it; it can find some issues quite quickly, and they can have a big impact on performance. That's it for me; if you have any questions...

[Question inaudible] Sorry, maybe it's the state of the tool: sometimes the cache lines are from the stack, so there's no way to translate them to variables, and sometimes the symbol resolution doesn't work well enough and the variable name just isn't there; then you get the disassembly of the vmlinux, go there, and you need to find out yourself which variable is being accessed.

[Question inaudible] Sorry, it depends on whether they provide those load latency events, which I'm not sure you can use in virtual machines; it depends on whether those events are available in your guest. All the measurements were taken on bare metal, not in a virtual environment.

[Question about pushing such changes into the kernel] You need to justify the case and persuade the developers that it actually makes a big difference. It's quite common, but you need to have the case, really justify it, and show it's the best solution you can get for the issue. From what I've heard from others like Joe Mario, they have many places where they would split cache lines that will never get into the kernel, because it makes some key structures really big; so it will never get to upstream, even though it actually makes a difference in performance, I forget, maybe something like 20%.

[Question: with false sharing, if the other CPU does not read z but writes to z?] A write is actually a load in the sense that you need to get the memory into the cache line before you write to it, so even if it's a write, it's still considered false sharing in this situation; you will just see stores instead of loads on the other side.

Again, the data I measured is mostly from Haswell, which is actually the architecture we got the best results from; it works on Broadwell and Skylake as well, but we don't get such good data as from Haswell.

They are telling me I am out of time, so thank you.

[After the talk] Where do you, as Red Hat, get hold of the latest perf code, so we can have a look at that new tool? It's on the git tree from the slides; I'll point you to the slides, and I will be there too.

[Czech audience chatter between talks, mostly unintelligible in the recording.]

So you're just running it and looking for patterns? You were suspecting something, and that's why you ran it? Yeah, well, I guess most of the time you don't run it system-wide like this; I was doing that just for the presentation, and I found this issue only because of the presentation. Most of the guys using this are actually trying to profile their applications, so they are looking at data from their applications; they do have a problem, and you actually need to know the application. It's not like the tool will give you the answer to everything; you need to go from the cache line to the code and try to see how it fits.

[Closing exchange garbled; recording ends mid-question.]