Right. This is the first of the two talks which I have; they're building to some extent on top of each other. So you might think 90 minutes is a lot of content, but I have versions of this talk where I could probably talk to you for a week about this. So the expectation is not really that you will be able to do much with this afterwards, to fully understand these kinds of things. This is just to raise interest, as is the case for most conference talks. So don't be too disappointed if you are not able to sit down and immediately be an expert in these kinds of things. At the same time, the topic, especially PMCs, is of course nothing new. And in the very early days people were of course expecting: oh yes, they're very easy to use, because they're telling me exactly what is going wrong with my program. So in the first part of this talk, I will hopefully be able to dispel that myth a bit, so that you actually realize that you have to do quite a bit of work to understand what is meant by all this. So where is this misconception coming from? Those who are old enough might remember machines where instructions were executed nicely, sequentially, and so on. We always had a nice series of stages which each instruction went through. This started early on: if you look at the Intel 8008 processor, the 8080, and so on, they all had these nice stages. Each instruction went through a couple of memory cycles and so on, and each stage executed part of an instruction. We knew exactly how the program was behaving: we just had to look at the instruction, look up how many cycles that instruction would take, and we knew exactly what the throughput was. From that we were able to compute how fast the program was, and we were able to see: well, if we make this and this and this change, how much faster does the program become?
This already was weakened by the introduction of pipelined processors, where now, instead of having just one single instruction in flight, we have multiple instructions which can be executed in parallel, as long as they are not executing in the same stage. This is something which was added early on, by recognizing that the individual stages are executed by different parts of the CPU. So why not keep all of the parts busy by executing different stages of different instructions at the same time? Each instruction here is one of these horizontal bars, and the yellow part indicates that at any point in time, ideally, one part of each instruction is being executed. The result is that, in this case of a typical five-stage pipeline, we can have up to five instructions in flight at any point in time. Of course, it's usually not the case that a program completely maps to this kind of execution scheme all the time. You have bubbles in there; you have dependencies where you cannot start with the execution of one instruction until you actually have the results of a prior instruction, etc., at which point some of these rows are basically empty and the instruction throughput falls. This is something which makes performance estimation quite a lot harder. It was still possible, though, because execution was predictable, in the sense that memory accesses etc. back in the days when these kinds of processor architectures were prevalent were still mostly predictable, because the CPU core speed and the memory speed had not yet diverged that much from each other. Therefore we were still able to estimate to some extent what the time would be. But we frequently, and very early on, got help in the form of what is usually called a timestamp counter of some sort.
On the x86 front this was introduced in the 586 already: a dedicated register from which, at any point in time, we were able to request a number, a 64-bit number in this case, indicating how many cycles of some sort have elapsed since the beginning of whatever, usually the last reboot of the machine. Each increment of the counter corresponds to one cycle, and by subtracting two readings we knew exactly how many cycles fell inside an interval which we want to measure. This was useful: we could also count, at the very least, how many instructions were executed, and therefore compute cycles per instruction, which up to this day is a very important measure. We were able to estimate how efficiently a program was executed, and we were also able to measure what kind of effect a certain change to the program actually had. These were absolute numbers, and we were able to use them to effectively measure how a program behaves. This kind of thing is still possible today for some of the simpler processors which you mostly find in the embedded realm. They are usually not out-of-order-execution processors, as they are called; they are in-order, just like this, and maybe they have simple five-stage pipelines like this as well. This is still a very valid method if it applies. But hardly anyone outside the realm of the embedded development world is using this type of processor today. It looks more like this. So I always like to ask the question: what is it? Because the connoisseur of microarchitecture will know exactly what it is. No one? It's obviously a Sunny Cove core. (Audience: I was trying to figure out which CPU.) Yeah, well no, that's a core. This is just an x86 core implementation, not the entire CPU, just a single core, and more specifically the Sunny Cove architecture: not Skylake, but the generation after Skylake. That's what this is.
So this looks like a complicated picture, and it is even more complicated when you're actually using the thing, because basically all of these arrows and all of these boxes which you see here are individually acting parts of the system, each of which, at any point in time, can be working on something or not. They introduce problems related to overuse; they introduce problems because of their latencies, etc., etc. Quite honestly, every single line which you see here has some importance when it comes to analyzing the performance of programs which you're actually running on these kinds of things. To understand and explain the performance of a program, you would have to see, for every single sequence of instructions, what is actually going on in the CPU, and explain this based on these individual pieces. And some parts are even left out here completely: on the right-hand side you see there's a connection to the other cores, and more correctly, first of all, to what Intel calls the uncore, the parts which support all the cores, to which, among other things, the main memory and the last-level cache are attached. These things are left out here entirely. And those matter because of what I mentioned before, the discrepancy in speed between the cores and the memory, where the memory is orders of magnitude slower. So we have to introduce caches, etc. But caches also introduce inconsistencies; they introduce randomness into the execution. Couple all of these things together, and what I mean is that today you are absolutely not capable of describing how a program really behaves when it is executing on such a core. It's absolutely impossible. There's so much randomness going on that from one run to the other it will behave differently, period.
Even if this randomness were not there, just look at the number of different things which are mentioned here. All of them have their own manifestation when it comes to the utilization of the CPU by the various instructions your program is using. You would have to model each of them and understand how they interact with each other. This is truly impossible. In addition, we have the problem that even if you say "I can try to understand this for my program", this is not what the next program will do. Different programs can stress different parts of the CPU completely differently, and therefore one analysis doesn't translate at all to the next one. Every single program has to be analyzed individually. And even that is a simplification, because a program usually is not homogeneous over time in what it is actually executing. At the very least, every program has a start-up phase of some sort. Then it has perhaps some I/O phase, some compute phase, then perhaps another I/O phase, and then perhaps it repeats the whole thing, etc. All these different phases stress the system differently, and you would have to map the different phases to the different parts of the code, then to the appropriate units inside the CPU, and make an analysis for each of these different phases as well as for the program as a whole. So you see, this is a tremendously hard problem nowadays, and we need help. So what is actually possible today? Of course, this has not gone unnoticed by the folks who are actually producing these kinds of machines for us. So let's start with a nice little analogy. As an example here, assume I have a pipe system. Any physicists here will know something about hydrodynamics. We don't use water; we use some compressible fluid. If you do it that way, it's easier; it's a nicer mapping.
You assume you have something coming in at the top and you want to somehow control the way it comes out at the end. You have different means: you have different lengths of pipes, you have valves in different places, and so on, and at any point in time they can behave differently, and you want to compute something about this. This is a somewhat rough analogy for what is going on in a big CPU system. You have pipes with different diameters, which correspond to different-sized pathways inside the CPU, be it for transport or for calculation and so on. You have different lengths, so it takes different amounts of time for the material to be transported through, which by itself, of course, regulates the flow. We have these kinds of things inside the CPU, too, where performing an addition is very fast, while something like a division, by the nature of the operation, takes a longer amount of time. We have the equivalent of valves which are closed or open; if they're closed, the pathway cannot be used. Similarly, we have shared resources in the CPU which, for the execution of a specific program, might not be available for some period of time. This has to be taken into account. We also have the fact that we might only be able to insert material into the system and extract it at a certain maximum rate, which means that even if we are doing something in the middle which is fantastically high-performing and can transport as much as we want, it will not have a positive effect, because we cannot produce or consume appropriate amounts of material; in our case here, instructions.
We have, of course, limitations when it comes to reading the code and the data from memory, or storing data, and so on, which puts an absolute limit on what we can consume and produce, and this translates to the pipe system as well. Finally, the material which comes in at the top must also go out at the bottom. This means the two ends are not independent of each other: we cannot just consume more for arbitrarily long, because the internal volume of the system (in the CPU sense, the number of instructions which can be in flight at once) is finite. Actually, it's a reasonably small number. Depending on the complexity of the CPU it can grow; nowadays we can, in theory, have 100 instructions in flight before any of them has to be what is called retired, before they actually terminate, but there is a fixed limit. We have to take all of these things into account when we're analyzing the program, because at some point we might make assumptions about how the program behaves, but any of these limits might be violated and then our assumptions are wrong. So it's very difficult: we look at the individual pieces, but then we also have to keep all these boundary conditions in mind when we're actually analyzing the code. So let's go back to the picture which you saw before. What can actually go wrong when we are executing code there? This is a tiny, tiny little subset of what is possibly going on. In some situations we actually call these hazards: situations where, during program execution, something is not right and we have to do something, usually meaning waiting. I'm not going to read through all of them right now; the details don't really matter here.
You just see that we have different types of situations in which we are waiting for something to happen, and they map to individual pieces in the actual CPU core. The execution of an instruction inside the core does not happen at random places; instructions actually follow a path through this core implementation. At the very top you see the level-one instruction cache, which feeds the incoming data, which is the program, into the instruction decoder and so forth, which translates the assembly language, the binary code, into something the CPU can actually work on. By the way, whoever thinks x86 is a CISC architecture really doesn't know how CPUs work. Every CPU nowadays is a RISC architecture, because the first thing an x86 processor does is translate every instruction into RISC instructions, which are called micro-ops in Intel terminology. So the instruction gets translated into these micro-ops, which are put into a waiting queue, and when all the preconditions are met, plus the appropriate unit which can perform the action the individual micro-op is asking for is free, then it gets executed. After the execution, the result might linger around for a while, until the logic tracking the sequence of instructions says: by now the result of this instruction can be used; it has been verified that this is the right way of doing things, and therefore the results are written into the permanent state. So there are all kinds of instructions in this reorder buffer and so on; there can be 100 instructions in flight, floating around, at any point in time.
Usually it's not that many, because the program has too many possibilities to actually run into hazards, and therefore you rarely fill this thing up. But really, really well-written code, with lots of linearly executed instructions, could potentially fill the reorder buffer completely and therefore make the processor really happy, because it can just issue many, many instructions right away. The number of instructions which are in flight at any point in time is also something which has changed over time. In the beginning, as I mentioned, we were working on one single instruction at a time; then, with pipelining, up to five instructions. Here, we can potentially issue four or five (depending on the architecture) micro-instructions per cycle to the individual, what they're calling here, pipelines. And inside the pipelines we can have multiple instructions too: some of the pipelines do not finish in one single cycle, and they can have multiple instructions in each of their pipeline stages as well. So you have not only a couple of instructions; you have dozens, hundreds of instructions to be decoded at any point in time, and you might even have a dozen instructions being executed at the same time, and that perhaps every single cycle.
In fact, if you do the analysis on normal code on x86 today, and you look at the number of instructions which at any point in time are executed at the same time, or issued from the reorder buffer to the execution part: if this dips below two, you know your program is in really, really deep trouble. We are aiming for instructions per cycle, which is another measure which is quite important; we want the IPC value to be larger than two, maybe even larger than three, at any point in time for the program. That's a well-performing program; anything lower is really, really bad. So you can imagine there are so many things going on, and at any of these stages, represented here by these red dots, something can go wrong. There are all kinds of different things: they map to caches, to the situation that you don't actually find the data in the caches, or you have to write something out, or you have a dependency of one instruction on the next instruction, and they all map to different parts of the CPU itself. So it's not that you can look at any individual part in isolation; you have to look at the whole thing and try to find out, from feedback from the CPU, what your program is actually lacking in most situations. And just to make sure: this was one of the latest generations, but this is not unique to it. Here is a new picture, one of the new pictures which I've got this time around: this is an ARM Cortex-A72 core. You probably have never seen it exactly in this form; I drew this myself to map somewhat closely to what the x86 core you saw on the previous slide looks like. There's not that much difference. Especially if you're not really that well versed in the microarchitecture of CPUs, you're forgiven if you think it's pretty much the same. We have the same things: different ports in which different types of instructions can be executed, various cache levels, decoders, et cetera. So the problem
is exactly the same regardless of what kind of CPU you're using, as long as it's not one of these very trivial ones, like microcontrollers or so on. The problem we have to solve is not unique to the big x86 cores; it's a general problem. One thing I also should point out at this point: the list of points which I have here on the right-hand side is in this case a little bit shorter than the list on the previous slides, but all of the points which you see here on the right-hand side are also present on the previous slide. So it is easy to map the specific events which are causing problems onto whatever architecture the CPU core actually has. It's not that every single architecture has a completely different set of hazards which we have to look into. There's a lot of commonality there. How and exactly where we find them, and perhaps how we can observe them, might differ from one CPU to the next, but if you look at it with a little bit of abstraction, that's something which we can actually do, and that's something which we are going to look at now. But more importantly, before we actually get to that point: the concept of a counter was introduced quite a long time back. As I mentioned, the TSC, the timestamp counter, was introduced in the 586; soon after that, in the 686, we got the first instances of programmable counters, so measurement counters, inside the CPUs. Actually, this architecture was not really at the forefront of these kinds of things. Quite honestly, back in those days, the architecture which was most prominent when it comes to these kinds of things was the DEC Alpha; Joe might actually like that shout-out. So DEC Alpha was ahead of its time when it came to these kinds of things: they had these performance counters all throughout the CPU. One of the concepts behind these performance counters is that yes, it would in theory be possible to have, for every single one of these hazards and
whatever kinds of things, an individual counter assigned to them. The logic which you need for that is actually pretty small. But the thing is that, just as counters, these things are not really that useful. We need a little bit of additional logic: we not only want to measure things, we also want to be able to react to the counter being incremented, in a general way. We are getting back to this in a bit, but in general we want to be able to say: when this event has happened 4,000 times, for instance you missed the level-one cache, then let the program know that this actually happened. So we have some logic associated with the counters, and this pretty much makes it impossible to have a thousand of these counters spread all throughout the core implementation and have them all available. What actually happened is that pretty much every CPU since then provides us, per core, with a certain number of programmable performance counters, to which you can basically assign a specific event at any point in time. This became much more sophisticated as time went by. In the early days we had, I think, 2 or 3 performance counters on x86, and each of the counters could only be assigned events from a certain class, etc. Nowadays we have a lot more: if you turn off hyperthreading, I think we have 8 counters available per core, and they can be pretty much freely assigned; the number of events goes into the hundreds now. If you look at a spreadsheet for, for instance, a Skylake core where all of these events are listed, the list is long; it's incredibly long. You have so much choice to select whatever event you want to count. You can assign pretty much any of them to arbitrary counters, and for each of the counters there is an interrupt logic, where you can say: after n of these counter events, raise an interrupt. So this is all
possible. As I mentioned at the beginning, Alpha was at the forefront of this, and the engineers at DEC, back in the days when they were not Compaq yet, were actually good, and they wrote a seminal paper about statistical profiling, which was the first one from which we actually learned how we should program and profile these kinds of large programs, and they used this functionality of interrupts from performance measurement counters. The quality of the implementation of the CPU really determines the number of counters which are available for you to use. But not only that: the CPU vendors also charge you more for more counters; they disable counters if you don't pay them enough. Specifically on x86, if you look at the low-end parts, even the normal i5 and i7 cores which you can buy nowadays, they are crippled in their core implementation. Theoretically they could have huge numbers of counters, but those are reserved for the expensive ones, either the i9s or the Xeon cores, which are basically the same; the only differences are the sizes of the level-1 and level-2 caches and so on, the rest is pretty much the same for all these parts, except that the expensive ones have more counters available for you. Which means that people like me, for our workstations, have to buy the expensive ones. Unfortunately it also means that if you want to do this kind of work, it's hard, because you might not have this functionality. So don't be discouraged if you try to do some of these things on a laptop or on a small machine and you don't get far. That's by design; not by my design, but by some evil people who want to charge you money for it. All right. Finally, there's one other aspect which is important. Those were just pictures of the core, but part of what is also governing the performance is obviously the part to which the cores are connected: the memory, the last-level cache, these
kinds of things. So we also need counters there, and that's the problem: how do you govern the reading of values which are not directly attached to your core? How can you do this fast enough, without delaying things, when at the same time every single core attached to the uncore part might want to read out the same value? If you know something about electronic design, you know that having many readers of the same value is actually a big problem. These things are very hard to implement right. You have certain things in the uncore parts which need to be available on a kind of bus system, which makes it slow, blah blah blah, all kinds of problems; I'm not going into the details here. But in general, a good implementation still provides you with pretty much the same level of information for the uncore parts, except of course it cannot be done with quite the same precision, especially when it comes to time resolution: the timing of the individual core is at a higher resolution than that of the uncore parts, and more importantly, the clocks might not be completely in sync, so the attribution of events might not exactly match what you're seeing. There are therefore at least two classes of events we're looking at, core events and uncore events, and we have to treat them carefully when we're analyzing them. So, I already mentioned most of this: the thing is that it's not just about providing counters, it's also about doing something with them. The one thing which I haven't mentioned here is that over the years we always had machine-specific or version-specific counters pretty much everywhere. Over time it became clear which of the events we pretty much always want to count, and so some of these counters were what is called architected. In Intel nomenclature, architected just means that they were elevated; it was said that, from now on, every
single future CPU will have to provide this kind of event as well, so we don't have to select it explicitly, and therefore they actually dedicate specific hardware to these kinds of events. On the most recent CPUs you will find that, in addition to the timestamp counter, you also get counters for the number of instructions which have been executed, etc. They are always there, so you don't have to reserve some of the precious programmable resources for these events. But this is something which varies from one CPU version to the other; some older ones don't have that. So when you are trying to use these things: if you are, for instance, trying to use six different events, which easily map to the four PMCs plus the fixed-function counters you have available on, let's say, any Skylake-X machine, and you try to do the same thing elsewhere, you will perhaps not be able to, because that machine doesn't have that many independent PMCs available for you to count with. These things you have to keep in mind; there will always be differences. The differences will be even larger if you are trying to use the same technology not just for x86 on Intel but also for x86 on AMD, which will have a completely different set: their core implementation is different, so the counters will be different. And you will have even more problems if you are trying to do the same thing on other architectures, like for instance ARM. It's possible; you just have to be a little bit more careful, and your programs which are doing this have to be aware of these limitations and work around them. The good news is that the tools we have available nowadays can do this. So what if you want to monitor many of these events and the number of counters which you have available is too small? Well, we can still do that, but we use the same trick we use everywhere: time multiplexing. And if you are not aware of the limitations
of the number of counters which you have, the tools will usually silently move you into this mode, the time-multiplexing mode, and this of course has consequences. What they are doing, basically, and here is an example: you divide up your program execution into intervals of a certain length, tau in this case, and every tau, whatever duration that is, you switch to a different set of events which you are counting in your PMCs. The result might look like this: with four counters available, certain events are swapped in and out, and in the end you have some leftover interval. A good program which allows you to monitor these kinds of events will actually tell you: I have gone through, was it, 8 event cycles and then 0.8 of the last event cycle, and here are the counts for all of these events in all of these specific intervals. All of the information which you see here could be accessible at the termination of the program, for us to actually analyze the code. The result is that we can create a table like the one on the right-hand side; more correctly, we can look at the first three columns, the first one of course being the name. We can reconstruct the total count and the total time each individual event was being counted; we can reconstruct this from the data the performance analysis tools provide us with. But now comes the important point: if you're not aware of this, or if you deliberately use more events than there are counters available, extrapolation happens. We know that the total program execution time is 8.8 intervals, and any measure we want to print out has to be based on the same time basis; the only time basis which makes sense is the full length of the program. Therefore, events which were not counted for the entire duration of the program run will have their counts scaled up appropriately, and therefore they have a much lower precision than the other ones, specifically if your program
is not homogeneous over the execution time. If you have, as I mentioned at the beginning, some start-up phase, then some compute phase, and perhaps the compute phase itself varies over time; if your event happened to be scheduled during a period of, for instance, low arithmetic activity, but you're counting an arithmetic event, the extrapolation which you're getting out of this in the end will be completely off. On the other hand, you can have exactly the opposite effect: the event is scheduled during the short period of time when you're actually doing arithmetic in the program, and you overestimate. From the counters themselves, and from the tools, there is no way to detect this. At best, you can run it multiple times and hope that there's enough randomization that the events do not always coincide with the same parts of the program execution. But much better, if your program is reliably repeatable, would be to not rely on time multiplexing at all, but instead use multiple runs of the same program with different sets of event counters. Again, this is not something tools like perf will tell you at this point; they will just say, well, here's the result, be happy with it. You have to actively look out for this yourself. So that was a lot of complexity already, but it gets even worse, to some extent, because how do we actually use this? If we want to use these tools by themselves, and we measure the numbers which come out of the counters: counters obviously provide us with absolute values. Say the program executes 4 billion CPU cycles and does 1 million memory writes; this is an output which a performance analysis tool can give us about the program. Is this good or bad? It's not about the variance; you can repeat it, and I can say I've repeated it many times and perhaps it's plus or minus a tenth of a percent. That doesn't matter in this case;
the question is not how reliable the data is. If this is a program which computes the state of the universe at this point in time, I would say it's a fantastic program; if it's a program which computes the time of day, I would say it's a lousy program. This is not enough information for us to judge the program, regardless of what the variance is. We have to look at every single one of the performance counter values in context, and the most reliable way to do this is to look at ratios. A long-running program obviously will take more cycles, but perhaps that's actually the expected result: it does more computation, so obviously it runs longer. Therefore a ratio, going back to the numbers on the previous slide, saying how many memory operations we are doing per cycle, is a more reasonable value to look at, and there we can potentially argue: that's a reasonable value, it's a good value, it's a bad value. Ratios are the foundation for performance analysis when it comes to these things, at the top level, to make judgment calls. Later on, in the actual process of optimizing a program, where we want to see whether a certain change improves the program, absolute values certainly will make a comeback, because then we can say: for the same input, the program now requires 10% fewer cycles. That's an absolute judgment call; that's definitely possible. But for a general analysis of the performance of a program, absolute values are pretty much useless, and we need to rely on ratios. Ratios are everywhere; I mentioned a couple of them already, for instance CPI and IPC, which are inverses of each other: cycles per instruction and instructions per cycle. Having a CPI value which is even 0.5 or higher usually indicates that something is
wrong. So that's on an x86 CPU; for an ARM CPU, I would argue that's probably not actually that much of a problem in many situations, especially if you have in-order execution cores, because then you don't have this kind of thing. So even the ratios have to be looked at in context. But if you are looking at a homogeneous environment, for instance an x86 server running a program that does a certain amount of work, comparing the ratios is actually a very reasonable thing. Here again, though, think back to the problem which I mentioned about time multiplexing: because you are dividing one value by the other, and the time slices are small, etc., the variance you get out of the time-multiplexed numbers can make the resulting ratios completely useless. So you also have to be very, very careful when you are doing this kind of thing. So, ratios: CPI, as I said, is one of them; another one which definitely makes sense is the number of cache misses per memory operation, things like that, which are common sense when you think about what your program does. They are all going to make an appearance when you do the performance analysis of your program. Another one worth mentioning at this point is the probability of a branch miss. Branch misses create bubbles in the CPU, and they are one of the worst offenders when it comes to creating badly performing programs; if you have a branch misprediction probability of, let's say, even 20%, your program sucks, and so does its performance. Things like that are easy targets to measure without even looking at the inner details of the program; getting these kinds of top-level measurements is actually very useful. Going back to the example which we had before: if you have two counters and you are measuring them, how do you now compute these various ratios? For instance, how do you compute the CPI value? Do you take the cycle count and the number of retired
instructions from the first interval, divide them, and produce, in this case, five different values? That's useful for knowing how the program is evolving over time, but the data is not always available, and it will fluctuate quite dramatically. So perhaps you want to instead sum up the values. But here you see that we are counting the number of cycles in every single time interval, while we are counting the number of instructions only once, because we have a limited number of counters. How do you compute it then? Do you average the number of cycles per interval and divide that by the 66? If you are any kind of scientist, at this point you will try to attach units to the time intervals, and you will say: no, this doesn't work, because you are putting completely unrelated values into the same equation. The 66, for instance, has nothing to do with the 95 cycle count of the second time interval; it just happens that if we had measured the number of retired instructions in that interval, we might have gotten a number in this neighborhood as well, but there is no foundation for believing this is the case. Still, if you have few resources available to measure these things, this might be the best you can get, but it means you have to be aware of the fact that this number is noisy and not an absolute number. Oftentimes you will see programs quoting these numbers with seven digits after the decimal point, which is stupid; it simply means people are not thinking about it. For the misprediction counts here, we have values always measured at the same time, but which of them are you looking at? Do you always look at the sums when they are available, or not? Because in some situations you might actually want to know how a specific part of the program is behaving. So it is very quickly becoming very much a big data problem, and I like to throw in that keyword here.
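As a sketch of the pitfall just described (all numbers are invented, merely in the shape of the table on the slide), here is the difference between a ratio computed from counts measured in the same interval and one computed from a cycle average and an unrelated instruction count:

```python
# Toy model of multiplexed counter readings. Cycles are counted in every
# interval, but retired instructions only in interval 1, because there
# were not enough hardware counters. All numbers are invented.

cycles_per_interval = [90, 95, 110, 80, 120]  # one value per interval
instructions = {1: 66}                         # measured only in interval 1

def cpi_matched(interval):
    """CPI from two counts measured over the SAME interval -- sound."""
    return cycles_per_interval[interval] / instructions[interval]

def cpi_mixed():
    """CPI from an average cycle count divided by one instruction count
    from a different interval -- the units do not really match, so this
    is a noisy estimate at best, not an absolute number."""
    avg_cycles = sum(cycles_per_interval) / len(cycles_per_interval)
    return avg_cycles / instructions[1]

print(round(cpi_matched(1), 2))  # 95 cycles / 66 instructions
print(round(cpi_mixed(), 2))     # 99 average cycles / 66 instructions
```

The two results differ, and nothing in the data tells you which one is "right"; the mixed version only has a statistical justification, which is exactly why quoting it with seven decimal digits is meaningless.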
Because we can measure huge numbers of values, all of a sudden you have the possibility to do all kinds of computations on them, and the probability that you're doing something stupid if you don't know how the CPU works is really, really high. The important thing is that you need to adjust, for instance, for situations when your program is not being counted for certain events; but by doing this you are introducing possible errors, and you have to keep them in mind in any analysis you base on that. So much for all these nice things, whether you can use them, and the bad ways not to use them. How would we actually use them? The step which I mentioned before, these ratios for the entire program, is useful to get an overview, but after that we actually want to learn something more about the program as it proceeds. Just getting an overview and saying, well, your program runs with a CPI of 5, which is bad, is not enough; you actually need to be able to say where in my program I am wasting that much time. For that you have a couple of different possibilities. At the very least you can go and explicitly analyze certain parts of a program and measure how they behave there. This is very much possible, but it's also time consuming, because there is no easy way to inject the measurement code, whether in the binary or in the source code, for all kinds of combinations of function calls, etc. The easier way, and this goes back to the seminal paper from DEC back in the day, is statistical profiling, which they introduced as something which doesn't require modification of the binary, and where you can still learn from the outside, without doing anything to the binary or the source code, how the program behaves over time, but with limited precision. So both of
these methods are useful and should be applied, but neither of them really provides one definitive answer. For instance, here I have another little example, going back to the counts which we have seen before. We could use these measurements not just as single values, as we have seen before, but do something like this, where we compute two values, the CPI, cycles per instruction, and the branch mispredictions, over time. In this case the red lines indicate CPI and the green lines indicate the mispredictions, and we get a picture which looks like this, which means that for many parts of the program one ratio or the other is not present; that's again the time multiplexing problem. We have to either live with this, or we can work around it in many situations. But in general, what you get out of this is just these numbers: based on the time the program has been executing, here are the points at which the ratio is this. Note there is something missing, which is the reference from these time periods back into the code, into the flow of the program itself; this is not provided by just observing these values, you need something else. So, as I said before, there are these two ways. The first one is that we can use explicit counting. There are interfaces available through which ordinary user-level programs can access these counters. The interface which I've been using for a very long time is called PAPI, and it allows you to create event sets and so on and then just call PAPI_read, and it reads out the state of certain counters, period, very simple. Out come 64-bit values which we can store somewhere, and subsequent calls will return different absolute values; the difference is the time spent, or the number of events, between the two calls. So this is easily doable, but as I said, you would have to integrate this kind of
code in every single function of interest inside your program. It's something which I personally do when it comes to measuring the effects of a certain change, because then the absolute values are definitely useful, and I can use them to compute ratios as well; but it's usually too much effort to try to cover the entire program with that. The other way, as I mentioned before, is statistical sampling, and the picture you see here is very similar to what I tried to show you before, where certain events are counted at any point in time; we are still counting the absolute values. The difference is that instead of just counting the events, we are asking the processor to give us a notification every time event 1 has been seen 5 times, and similarly for event 2 every time its counter has been incremented 10 times. This is why you see at the top the user-level part, the program being executed; after it hits event 1 five times, an interrupt is raised immediately, and some counting code gets executed which not only counts this event but, because it's an interrupt, can also query where the program was interrupted, and it can remember this information. With modern CPUs there is much additional information we can collect, because there is something called the LBR, the Last Branch Record, through which we can basically know where the program was before. This kind of information can be recorded, and then the program is resumed; the counter gets reset, and if it again hits 5 we have another interrupt, and the same thing happens for the second event. The result is that through these interrupts we are collecting not just the total number of events which have happened, but also statistical information about where the events happen. By the way, limits of 5 and 10 are unrealistic; on a normal CPU we are counting them in multiples of 2,000 or 4,000, something like that.
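The scheme just described can be simulated in a few lines (a toy sketch, not real PMU behavior): a stream of events tagged with the code location that caused them, an overflow threshold, and an "interrupt handler" that records the interrupted location. Note what happens when the sampling period divides the length of a repeating pattern in the stream: every sample lands on one location, even though that location causes only a quarter of the events.

```python
from collections import Counter

def sample(events, period):
    """Count events; on every `period`-th event, 'interrupt' and record
    where the program was at that moment (statistical attribution)."""
    samples = Counter()
    count = 0
    for location in events:      # each event is tagged with its location
        count += 1
        if count == period:      # counter overflow -> interrupt
            samples[location] += 1
            count = 0            # reset counter and resume the program
    return samples

# Invented event stream: location `hot` causes three events for every
# one event from `cold`, in a strictly repeating pattern.
stream = ["hot", "hot", "hot", "cold"] * 1000

print(sample(stream, period=4))  # aligned: every 4th event is 'cold'
print(sample(stream, period=5))  # coprime period: reflects the 3:1 mix
```

With `period=4` the profile attributes everything to `cold`; with `period=5` the sample shares approximate the true 75/25 split. Real event streams are rarely this regular, but the aliasing risk is the same statistics.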
These events simply happen that frequently. But it means that we get information about locality, about where inside the program certain operations are predominant, and by measuring different events over the same program in different runs, we know, for instance, where the memory-heavy stuff is, where the arithmetic-heavy stuff is, and so on. We can map this to the binary code and from there to the source code. This sounds good, almost too good, and we have to be aware of the fact that the resulting information is really coarse-grained; it's statistical. We only get interrupted every, in this case, 5 or 10, in reality 2,000 or 4,000, times the event happened. So if we have a piece of code where, by accident, an arithmetic instruction is executed exactly every 2,000 events, while the rest of the arithmetic instructions happen somewhere completely different, we can see completely different results. This is statistics, and we have to be aware of it; we also ought to apply proper statistical analysis techniques to the data, which unfortunately still isn't done, because, well, you actually have to know statistics for that. So the accuracy is limited by that, and it is also limited by the fact that we have out-of-order execution cores. If you have up to 100 instructions in flight at any point in time and an interrupt is raised, it's actually not that easy, and you can imagine why, to fault a specific instruction for the event: how do you find the one of those 100 instructions which is exactly the one that caused this event? This has been solved in later iterations of the CPUs, but a lot of effort went into it. Intel introduced what is called PEBS, Precise Event-Based Sampling, I think about 10 years ago, and they have improved it ever since; you can see that there are different levels of this. By the way, you have to request this through the perf
program; you append :p, :pp, or :ppp to the event name to request the specific level of precision you are asking for. These things are not as easy to do, but they are possible for some of the events nowadays, because they have proven to be useful; it meant that silicon manufacturers actually had to spend a lot of effort on implementing this correctly. So let's see an example of how this works in reality, and I am really limiting myself to what you have already seen. perf is the program which is part of the Linux kernel, after earlier efforts in the form of the OProfile program, which is still lingering around; those were developed independently of the kernel, the interfaces never matched up, and at some point the kernel developers got sick of it and implemented their own thing, which is perf. perf has a huge amount of functionality, and the events, as we are calling them, which it can record can be queried at any point by just running perf list. What you see here are just the first lines if you run this today on a high-end system; your tiny little laptop might not have as many events, although many of the events are not hardware events, they are software events. The events which you are getting there number in the hundreds, and those are only the ones which have actually been given names. There are other ways of specifying events, so-called raw events, where you can basically program the performance measurement counter directly and instruct perf to do this by giving it the numbers it has to put in which register, etc. With this you can query every single event which exists on your machine, and there is a huge number of them; usually, if you are just diddling around with it, you will have a hard time making sense of all the things which come out. But if you just
want to get an overview, the good news is you don't have to worry about this in general. There is a subcommand of perf called stat; if you use perf stat without anything else, it will give you a nice overview with some commonly used counters, and it will print out the appropriate values over the entire run of the program. You give it as the last parameter the name of the program which gets executed, but you can also monitor an entire running system with this tool, etc.; that is all possible and easily done. You can also specify the events which you are interested in in the specific instance; that's possible as well. You see there it specifies cpu-cycles and instructions, and then cpu-cycles, branches, branch-misses, etc. These are commonly defined names and they work across architectures; that's a nice thing about perf, because normally the events, when you look at the manuals of the specific processors, have completely different names, but they have been unified inside perf, many of them at least. The interesting thing which you see here is the notation with the curly braces; anyone want to make a guess what this means? If we want to compute ratios, one of the things which I told you before is that dividing one number by another only makes sense if they have actually been counted at the same time. So you see here, for instance, 2.10 instructions per cycle; that's one of the ratios. If I let the CPU count these events not always at the same time, the ratio at the very best has a large error. By using these curly braces I'm telling the CPU that the counters listed inside always have to be scheduled at the same time.
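A sketch of why this matters, with invented readings: when groups of counters are time-multiplexed, a ratio is only trustworthy if both counters come from the same group, i.e. were running over the same time slices. In the spirit of what perf does, each raw count can also be extrapolated by the fraction of time its counter was actually scheduled:

```python
# Toy model of counter groups that the kernel time-multiplexes.
# Each reading: raw count, time the event was enabled, and time it was
# actually running on a hardware counter. All numbers are invented.
readings = {
    # group 1 -- scheduled together, so their ratio is meaningful
    "cycles":       {"raw": 50_000,  "enabled": 100, "running": 50},
    "instructions": {"raw": 105_000, "enabled": 100, "running": 50},
    # group 2 -- ran in the other half of the time slices
    "branches":     {"raw": 20_000,  "enabled": 100, "running": 50},
}

def scaled(name):
    """Extrapolate a multiplexed count to the whole run."""
    r = readings[name]
    return r["raw"] * r["enabled"] / r["running"]

# Sound: both events from group 1, counted over the same slices.
ipc = readings["instructions"]["raw"] / readings["cycles"]["raw"]
print(f"{ipc:.2f} instructions per cycle")

# Scaled totals are fine as whole-run estimates, but a ratio of counts
# taken from different groups inherits the multiplexing noise.
print(scaled("branches"))
```

Within a group, the scaling factors cancel and the raw ratio is exact; across groups, each count carries its own extrapolation error, which is exactly the "large error" case.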
I can have multiple of these curly-brace groups, and each of these groups can then be time-multiplexed against the others, but it means that I can definitely compute the ratios based on reasonable values; that's an important thing. This was actually added not that long ago. Raymond, do you remember when this was added? No, but I've seen instructions per cycle be way off, and it's likely because of this. Yeah, this kind of thing. This is something which was only added, I think, 4 or 5 years ago; for a long time we went without it. So that's important to keep in mind: you have to think about what you're measuring before you compute the ratios, exactly for this reason, and the good news is we now have ways of making sure this is the case. Keep this in mind when you're doing your experiments. So this was about the absolute numbers and getting an overview; I mentioned that for getting spatial information we need to do sampling, and sampling is done differently. There is the subcommand record. It takes an event specification exactly the same way as before, so you say perf record, dash e, and then all these names, then the program name at the end; it runs the program, and it creates by default a file called perf.data, but you can specify a different file name with the dash o parameter, depending on what you need. This then contains the data, and you can see here that this file is already 5 megabytes in size; a huge number of events have been recorded, and each of the individual records has all kinds of information in it, like, as I mentioned, where in the program the event actually happened, and more importantly, on modern CPUs, not only where it happened but where the program was before. All kinds of information is in there. To look at the data you don't have to do it manually, obviously; there is the subcommand report. You run perf report, which by default looks at the perf.data file, and it will spit out something like this.
It will say: I have two different groups of counters (count the curly braces up there), and I can select them individually and look at them. At the top level it will look something like this; this is abbreviated, so some of it is missing. It basically sorts by how frequently the event was counted in certain parts of the program. This is what you can then use, and more importantly, inside the perf program you can even go up and down with the cursor keys and press a key that says: show me the source code; it will then tell you exactly at which instruction and which line the events actually happened. The important thing here is that this is not the raw data which has been collected; this is already pre-analyzed data. The perf program, as part of perf report's start-up, will aggregate the data points it has measured which are in the same location, for instance the same function or function group, and present them as such, and then provide you with ratios, basically saying: so-and-so many percent of all the samples happened here. That's what you can see here, and it even tries to be more helpful by showing you in red: oh, this is a little bit much, this is over my threshold, so you can look at exactly where things might have gone wrong. I don't say that, for instance, this being red means this is bad; it just means that this event happened there frequently. If in the entire program there is exactly one place where you do all the arithmetic operations, and you're counting arithmetic operations in the performance counters, then obviously it gives you a number of, let's say, 80% of all arithmetic happening in this location; that's not bad. You have to interpret the numbers in the context of the program and what you know about it.
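The aggregation step just described can be sketched as follows (not perf's actual implementation; the sample records and the threshold are invented): group the raw samples by location, turn them into percentage shares, and flag the shares over a threshold, remembering that "flagged" only means frequent, not bad.

```python
from collections import Counter

# Invented sample records, as an interrupt-driven profiler might have
# collected them: one code location per recorded sample.
samples = ["main", "compute", "compute", "memcpy", "compute", "main",
           "compute", "memcpy", "compute", "compute"]

def report(samples, threshold=0.30):
    """Aggregate samples by location into percentage shares, highest
    first, flagging shares over `threshold` (the report-in-red case)."""
    counts = Counter(samples)
    total = sum(counts.values())
    return [(loc, n / total, n / total > threshold)
            for loc, n in counts.most_common()]

for loc, share, flagged in report(samples):
    print(f"{share:6.1%}  {loc}" + ("  <-- over threshold" if flagged else ""))
```

If `compute` is the one place where all the arithmetic lives, its 60% share in an arithmetic-event profile is expected, not a problem; the interpretation has to come from you.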
If you want to have something more, you can actually request the individual data records by running perf script; there are a couple of arguments you can pass there as well. Then you have access to the individual records which have been recorded, and you can do your own analysis based on that. You can also be crazy like me: I have written my own library to do these kinds of things, which is a little bit faster, because instead of going through a textual representation it uses another form of representation of mine; that's faster. Alright, so that was the first part. Any questions? I think one of your summer mentees has been looking at perf record under KVM, in virtualization, and that's something we've wanted to get better use of for some of our work, and I was kind of surprised that it's at a very early stage. We run all of our code in virtualization, in the cloud, so why aren't there real tools for doing that, or do you not need them, or what do you think about that? You can run perf inside the container or inside the virtual domain; what they are looking at is measuring the virtual domain from the outside. Yeah, in our case we have a custom operating system that doesn't have the support inside. If there is no support inside, then you have to do it from the outside, and the problem is that many of the perf events are PEBS events, and PEBS is not supported in a virtual environment on Intel CPUs at the moment. So there are rich perf features on bare metal, but they are not available in the VM, usually. So I would not do really sophisticated performance analysis like this except when I run bare metal. Just to follow up on that: if you are running a virtual machine with a workload in it, and really nothing else, can you still count those events external to the virtual machine, like actually monitor the hardware and sort of by proxy get an understanding of what is happening inside the virtual machine? perf does allow you, on bare metal, to reach into the virtual machine and sample events from in there; I just don't know if there is any overhead or lossage because
of that. This also probably depends on the exact processor version, but more importantly, I don't think you can get the IP recording, and you don't get the LBR information, because that's a different address space; you might be able to count the events and so on, but not get much more information than that. By the way, Joe's day job is to work with perf, not mine, so he has to do all these kinds of things. When you are reaching into a guest machine and asking for samples, you do pass in pointers to your symbol table, but it doesn't do the resolution in that sense; there's a limit to what it can do, because with the double page table walk and so on, the amount of work which would have to be done is far too much for an interrupt. This is something which will probably improve over time, as more and more of the workloads people are interested in fall into this category. If I remember, in the first days when we got VM support, none of that worked at all; it was added over time. Can I ask another question? You mentioned the mapping of the x86 CISC instructions to the microcode. The only level I've ever worked at is the CISC instructions, but we know that those mappings can have many, many RISC instructions backing them, so is there any way to hook into the actual microcode that's being executed? If you're an Intel employee, yes, because quite honestly, the way they test new functionality is that they can reprogram the microcode and introduce completely new functionality. We have seen this with the security problems of the last two years: most of them Intel was able to fix by microcode updates, at least with time penalties. So this thing is incredibly powerful, but it also to some extent explains why Intel has to go to such great lengths to keep up with the performance of AMD, by using all this out-of-order and speculative stuff and so on in the first place. I don't think AMD has such a
highly abstracted microcode as the Intel side has; they map more closely to the hardware. Alright, so, what's the time actually? Sarah, do we have to start right away? Yeah, but you're my room coordinator. Yeah, it's true. So can I continue right away, or do I have to give you a break, a little bio break? Alright, I say you have to be back in two minutes, so that, I hope, in five minutes I can continue.