That's okay, it's 3:50 and we have until 4:40, so about half an hour. It would be something like that, perhaps, if you were really looking only at this value — but you should never do that. One of the things you can always measure, at any point in time, is time itself: wall-clock time, cycle counts, and so on. If you see those moving in the wrong direction, then in general you don't do the optimization. But there is quite a lot of overhead introduced in the decoder of x86. That part of the chip actually consumes a substantial part of the entire energy. On RISC processors — RISC-V or something like that — the decoder is trivial, as you have now seen. But on x86, I think something like 5 to 10% of the entire energy is spent on the decoder part. So having fewer instructions is actually a really positive thing, because the microcode decoding by itself has approximately the same overhead for each individual instruction. Mapping work to one more complex instruction which expands into multiple micro-instructions is probably better than having two simple instructions. Right — "Do you want to capture that in your...?" — not in this one, but you can capture it in other ways. So it's always dangerous to look at exactly one value. All right, let's continue. We have at least half an hour of material left, perhaps more. I'm skipping over that part. Okay. So you can actually look at the counters which we have available.
One of the tools I'm going to talk about in this part uses a website which is run by Intel, or at least by Intel employees, called 01.org, where you can find, for the various CPU versions, tables which you can download — and which the tools themselves download. Among them, for instance, for my home machine, there is the Skylake-X core events table, release version and so on. You can download it and look at the number of events. This is just the core, not the whole processor — the uncore part is separate. The core alone has 445 events which I can measure. The next version, Cascade Lake, has 2,347 events you can measure. So they are actively working heavily in this area; they added all kinds of stuff. But just imagine: you are in charge of analyzing your program. Which of these events do you actually use? Some things, like CPI, will always be there, but that's just a subset of what you want to look at — recall the complexity of the machine's architecture. One thing can help us: if we can find some uniform values with general applicability, regardless of the architecture version and even independent of the CPU itself, that helps. We still need awareness of the microarchitecture, as we have seen before, but we can abstract some of the details away. This does not solve the issue completely, and it also means that if you really want to optimize for the last little bit of performance, you still need to look at individual events and their definitions and so on. But for general, first-level optimization it's fine to use some abstraction and follow a methodology for analyzing programs. So I'm going to talk about one of these methodologies now, and, like most of what I've shown, it's based on Intel processors — that is simply what I've mostly worked with.
It might change in the future, but so far this is mostly about x86 processors, and that's where the terminology was introduced. You can start by thinking about the picture you have now seen a couple of times: instructions start their life at the top, in the instruction cache and the decoder et cetera, and travel to the bottom, at which point they get retired. Even in this picture there are certain boundaries we can identify which signal certain steps in the lifetime of an instruction. I've painted them in nicely right here. This is not a 100% accurate picture, but basically the top line is associated with a couple of events where micro-ops are issued: the instruction has been decoded and split into micro-ops, and if it doesn't make any progress, then perhaps some resources are missing. Then we have the second line, where a micro-op is going to be executed — it has been determined that the resources are available. There is an event called UOPS_EXECUTED, which is the event we can potentially count there. And at the bottom there is another event we can count, UOPS_RETIRED: this is when the instruction is done, when all of its operations have completed — at that level we might count something like that.
This is the starting point of the methodology. If we think about all CPUs in a similar fashion and try to characterize parts of the program by counting higher-level, more abstract events than the individual events the CPUs provide, we can come up with something. An Intel engineer, Ahmad Yasin, has actually written the paper on this. I'm not sure whether he was the one who developed it, because I know the terminology has been used inside Intel for a very long time — I've seen people using it, but when I asked whether I could have access to it, they just smiled. The publication of the paper came after those smiling incidents, and here's the reference, which you can look up. It's generally called the top-down method; TMAM is what they often call it. This is one of the figures from that paper, and it's a nice summary. What it shows is a top-down analysis: we start with the question about the program's execution at the top, and then successively ask individual questions. Is it front-end bound? Is the program hindered by bad speculation? Is the instruction running through successfully and being retired? Or is the problem in the back-end? That's the first level of questions. Depending on the answer, we can go down and ask, for instance, if it's front-end bound, what is the problem with that: is it latency, so that the decoder doesn't get enough instructions, or is it bandwidth,
so that far too many instructions are coming in and overwhelming us? These kinds of things are looked at successively. You see here we have potentially four levels of different events we can look at, and that's the top-down method, which is advertised by Intel, and we also have tools now implementing it to some extent. Each of these levels has a different way of getting to the answer. For the first level, the way the paper tells us to compute it is: we first measure whether a micro-op was allocated or not. Depending on the answer, we look at whether it has been retired or not, or whether there have been any back-end stalls. With these questions we can categorize a specific slot into one of the four classes, increment the corresponding count, and overall report statistics saying: in so-and-so many cases the program was front-end bound, back-end bound, etc. So it's not that the entire program is assigned just one class — it's a counting thing, a statistical thing in itself. This is a nice methodology which can basically be applied by everyone. If you follow it, you still get a lot of value out of it even if the microarchitecture changes dramatically. Whether you then interpret the numbers correctly and make the correct changes to your program is a different question, but at least you can analyze things. So how does this look? There are various ways of doing it. perf itself, as you will see later in the example, has some of this built in, but there is another tool, which I will show you in a moment, which has a much more complete implementation of the top-down model and which uses perf underneath. Basically all of them have some form of encoding of, for instance, what "retiring" means.
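The level-1 decision tree just described — allocated or not, retired or not, back-end stall or not — can be sketched as a small function. This is a toy model of the classification from the TMAM paper, not the actual counter arithmetic the tools run; the function and field names are made up for illustration:

```python
def classify_slot(uop_allocated: bool, uop_retired: bool,
                  backend_stall: bool) -> str:
    """Level-1 top-down classification of one pipeline slot.

    Toy model of the TMAM decision tree:
    - slot not allocated: the stall is either in the back end
      (resources busy) or in the front end (no uops delivered)
    - slot allocated: either the uop eventually retired (useful
      work) or it was thrown away (bad speculation)
    """
    if not uop_allocated:
        return "backend_bound" if backend_stall else "frontend_bound"
    return "retiring" if uop_retired else "bad_speculation"


def summarize(slots):
    """Count slots per category and return percentages — the
    statistical, per-slot view the lecture describes."""
    counts = {"frontend_bound": 0, "bad_speculation": 0,
              "retiring": 0, "backend_bound": 0}
    for s in slots:
        counts[classify_slot(*s)] += 1
    total = len(slots)
    return {k: 100.0 * v / total for k, v in counts.items()}
```

So a run is never "front-end bound" as a whole; each slot is binned, and the report is the distribution over all slots.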
How do I compute this value? Given all these performance counters, which of them do I have to measure to decide, for what percentage of the time, where the problem was — or whether the program was reasonably retiring instructions and there was no real performance issue at all? These values are computed using formulas, and the important thing is that the formulas potentially vary depending on the microarchitecture. They will definitely vary across CPU architectures, but they can vary even within a microarchitecture family as counters are added, etc. All counters which are not architectural are not promised by the CPU manufacturers to exist in the next revision, so all of them can change. It's therefore important to have a formulation of which counter values — or, more importantly in many cases, which ratios (note the terminology: "retiring" is actually a ratio) — are computed in which way. The 01.org site which I mentioned before has data files for many of these architectures. Here, again for my Skylake machine at home, is the way all the tools compute the retiring value: Retiring is the number of uops which are retired, divided by the slots; the slots are the pipeline width times the core clocks; core clocks, depending on how the sampling is done, is computed in one of several ways; and clocks is computed this way, et cetera. This is written as a kind of code which gets executed, and based on the sampled values which are available, the tool computes the ratio. Such formulas are available for all the different parts of the top-down model, and more — there are many, many more ratios defined in the Intel manuals, which I'll come to at the end, which we can utilize for our program analysis. But for the top-down analysis we just need a couple of these
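As a concrete illustration, the Skylake-style level-1 "Retiring" formula just described can be written out roughly like this. The event name follows the usual Intel spelling, and the pipeline width of 4 matches Skylake-class cores, but both are assumptions you should check against the 01.org event file for your own CPU:

```python
PIPELINE_WIDTH = 4  # issue slots per cycle on Skylake-class cores (assumption)

def slots(core_clocks: int) -> int:
    """Total issue slots available: pipeline width times core clocks."""
    return PIPELINE_WIDTH * core_clocks

def retiring(uops_retired_slots: int, core_clocks: int) -> float:
    """Level-1 'Retiring' ratio: fraction of all issue slots that
    retired useful uops (UOPS_RETIRED.RETIRE_SLOTS / SLOTS)."""
    return uops_retired_slots / slots(core_clocks)
```

For example, 3 billion retired slots over 1 billion core clocks yields a Retiring fraction of 0.75 — three quarters of the machine's issue capacity did useful work.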
things, and they are defined, per CPU architecture, in these nice little files which we can use automatically. That's good news — we don't really have to do much work ourselves. The tool I've mentioned which implements this is called pmu-tools. It's well known in the Linux world and written by Andi Kleen; Andi has been working for Intel for quite a number of years, and he has published this as an implementation of the top-down paper. It's a Python script, or a set of Python scripts, and it's actually really nice. The tool uses the data which you can also download yourself, independently, from the 01.org site; its events directory has these files for all kinds of CPU architectures. So how do you use it? perf itself, as I mentioned, has an initial version of this built in — but only level one; remember, there were four levels. Level one is available in perf itself: you specify the --topdown argument, but you also have to do system-wide counting using the -a argument. That's kind of a bad thing, because it means you cannot monitor an individual process; you have to look at the entire running system, and if it's noisy — if you have many other things going on — you have to be careful in interpreting the data. But you can work around this to some extent. For instance, you can tie the process you're looking at to an individual core: in this case, taskset -c 0 means the process will always execute on core 0, so in the output you only have to look at the line for S0 (socket 0), core 0. My counts will only be in that line, not the others — otherwise it would be confusing: "why are there these different numbers, they don't apply to my code", et cetera. So you have to work around some limitations of perf. In general I've never really used that; I use the toplev tool, which is part of the pmu-tools set. It's a Python script; you run it and can give it parameters.
For instance, in this case I can tell it: this is a single-threaded program, so don't bother setting up performance monitoring on all cores, only on the core I'm actually going to use — which makes things much, much simpler than what you would normally get. Here I just run a level-1 analysis on the program, and as you can see, the numbers correspond pretty well: back-end bound is 20.2% — that's the red line — and here it is 20.95%. In its output it doesn't even bother printing front-end bound and retiring and so on; it immediately flags the potential problem, exactly what the red color here shows. So at that level they're basically equivalent. But with pmu-tools, and specifically toplev, you can ask for the other levels. Now I specify level 2, so it runs the level-1 and the level-2 tests, and here I also told it not to use multiplexing. Think back — what do I mean by multiplexing? I'm not repeating it because I'm out of time, but you know what it means now. It means the program has to run multiple times: it says "run 1 of 6" up to "run 6 of 6", because it does not measure multiple performance counters at the same time; it measures them in different runs. That is always a good thing to do, provided (a) the program is repeatable and (b) it doesn't run for days before you get a result. If it runs in a reasonable amount of time, always do this, because the precision is so much higher. So you do this and you get the second-level analysis, which is not just "BE", meaning back-end bound, but specifically "BE/Core": the back end is core bound. You can go back to the picture and see what that actually means, what the next level is, et cetera. You can do this up to level 4, and there's another mode where it does all kinds of other things as well. So this kind of thing already exists; you don't
have to write anything; you just have to interpret the numbers appropriately — and even that is not so easy, because all these tools do is look at the entire execution of your program and assign a number to each of the level results. This doesn't give you any information about what is happening where in the program, and I already mentioned in the first part that you also need spatial resolution. You don't get that from a simple overview; you need to do sampling. The good thing is that even there you don't really have to think much if you're using toplev: if you use --show-sample, or even --run-sample, it will analyze, according to the top-down method, what your program might actually be suffering from, and then issue an appropriately matching sampling command. So it's similar to the perf record we had before, but the events it records are chosen to match the conditions found in the program. The point is that much of the microarchitectural knowledge you would otherwise need — "if I observe this kind of behavior, what should I look at in detail?" — is already encoded in the script. Andi has done all the work for you. If you do that, then all of a sudden you don't just get information about the program in terms of the levels of the top-down method; you get information in terms of events, exactly what you need. It says: in this part of the address space, such-and-such happened. And if you want, you can then also use perf script to look at the temporal resolution, to see exactly when, and where in the code, most of the events occurred. That's easily doable. You would then just run perf report, and as I mentioned before, it looks at the files — this is at the level of the actual assembly code — and it will show where in the source code the various events popped up, and you can
look at them. The important thing here is that, as I mentioned, we have PEBS precise sampling available, but not for all events, so there's always a fuzz factor. Don't assume that because the number says 10.53% happened here, it is exactly at this location — it's somewhere in the proximity. There's always some fuzzing you have to accept in your analysis, but this gives us a lot of information. All right. Now, as I also mentioned before, just knowing where the events happen doesn't really tell us how they became a problem, because for that you need to know, for instance, from where the function (not the program) was called. For that we now have facilities in the processor as well. The processor includes in its state a set of registers called the LBR, the last branch records. Depending on the processor version, there are 4, 8, or 16 — on the most recent ones, 32 — slots in this register set, and every time the processor executes an instruction which does not simply fall through to the next instruction, it adds a new record to the LBR. With that, by taking an interrupt and dumping the contents of the LBR, we can trace back where the program came from, with a certain resolution. If you have a tight loop, of course, the 32 records will soon be exhausted; but if you have a function calling a function calling a function and so on, we can very easily reconstruct the call trace. So that's easily doable nowadays. More importantly, for many situations — and this is something which today's tools exploit somewhat, but not fully; this is where some of my own work and tools come in — there is another extension to the processor which Intel provides, called Processor Trace, PT. This technology gives a very low bit-rate recording of every single instruction that gets executed, so you can actually record the trace of the entire program without
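The LBR mechanism just described is essentially a small hardware ring buffer of (from, to) branch pairs, overwritten oldest-first. A toy software model — purely illustrative; the real records live in model-specific registers read by the kernel — might look like this:

```python
from collections import deque

class LastBranchRecord:
    """Toy model of the LBR: a fixed-size ring buffer of taken
    branches. Depth 32 mimics recent cores; earlier parts had
    4, 8 or 16 entries."""

    def __init__(self, depth: int = 32):
        self.records = deque(maxlen=depth)  # old entries fall off

    def branch(self, from_addr: int, to_addr: int):
        """Record one taken branch (what the hardware does)."""
        self.records.append((from_addr, to_addr))

    def dump(self):
        """What a sampling interrupt would read back, oldest first."""
        return list(self.records)

lbr = LastBranchRecord(depth=4)
for i in range(6):                 # 6 branches into a 4-deep buffer
    lbr.branch(0x1000 + i, 0x2000 + i)
# only the last 4 branches survive — a tight loop quickly
# exhausts the buffer, exactly as described above
```

This also shows why a deep call chain reconstructs nicely (each call is one record) while a hot inner loop overwrites everything useful within a few iterations.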
significantly slowing down your program. If you record this along with the perf data, we can actually trace the entire program execution: by associating the timestamp counters of the events we are counting with the timestamps in the PT records, we can find out exactly where what happened in the program and reconstruct it. It is a lot of work, but it's all possible to do. The upshot is that as a programmer, with pmu-tools and a toplev analysis, you can get started very easily: you let the scripts do the analysis and you get somewhere. But this gives you a global view of how the program performs — a one-number summary of the program itself. As I mentioned before, a program has different stages, and to be really useful you basically have to create microbenchmarks for the different parts of the program; ideally you are able to isolate them. Sometimes you can do this by providing different binaries which each do just one part of your program. In a compiler, for instance, you create one binary which does just the scanning, another which does just the IR, the intermediate representation, another which does only code generation, then the individual optimization passes, et cetera. If you can do this, you can individually profile the pieces which make up the program and analyze them. If you profiled the whole thing at once, you would not necessarily know what you are looking at at any point in time; it becomes so much more complicated. This is the part no tool can help you with: when you are writing the code, you have to make sure yourself that you can do something like this. Okay. I already mentioned that the top-down method uses all these ratios, but they are not all of them. Intel
has had, for the longest time — AMD has a similar manual, but they are not as good at keeping it up to date and complete — what they call the optimization manual. There is a chapter, or more correctly an appendix, called "Using Performance Monitoring Events", in which they list, for the different microarchitectures, dozens, hundreds of ratios, with explanations of what they measure plus the way you actually use and compute them. That is incredibly useful. It's an extension of the top-down method: you would use the top-down method first, of course, but if you're really going deep you can measure even more things, and all of these events you should be able to get through the perf interface. But after a change — once you have analyzed things and made a change to your program — how do you actually see whether you've made progress? Yes, you can do statistical analysis again, but oftentimes the changes you make have small effects. How do you differentiate that positive or negative change from the noise? This is why you usually should use something else. I mentioned it in the first part as well: there is a set of interfaces called PAPI, and with it you can easily read absolute values of certain counters. Here is the example: before the work I'm investigating, I read the counters; afterwards, I read them again; then I just compute the difference. The result is that I can count the exact number of events from inside the work I have done. This does not mean there is no variance in the numbers if I repeat it — as I mentioned at the very beginning, CPUs are basically stochastic by themselves; there are so many independent events that no repetition is exactly the same as another. But you have absolute values, and if you are measuring very short intervals of time, for which normal statistics will not give you a good enough
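The read-before/read-after pattern just described is simple to sketch. PAPI itself is a C library, so `read_counters` below is a hypothetical placeholder for whatever raw-counter read your setup provides; to keep the sketch runnable anywhere, it just reads a monotonic time source instead of a hardware event:

```python
import time

def read_counters():
    """Stand-in for a raw hardware-counter read (e.g. via PAPI).
    Here the single 'counter' is elapsed nanoseconds."""
    return [time.perf_counter_ns()]

def measure(work, *args):
    """Read counters, run only the work of interest, read again,
    and return the work's result plus the per-counter deltas."""
    before = read_counters()
    result = work(*args)
    after = read_counters()
    deltas = [a - b for a, b in zip(after, before)]
    return result, deltas

# usage: wrap exactly the region you care about, not the whole run
result, deltas = measure(sum, range(1_000_000))
```

The key point is the placement: the reads bracket precisely the code under investigation, so the delta is an absolute event count for that region alone, not a whole-program statistic.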
result — in this case you can actually get really, really good results. For certain events you sometimes see variances of literally one or two counts, not one or two percent: individual events. That is really nice, and if you are changing some code you also know exactly where the effect can be measured — just add the appropriate calls. The thing is that you have to set up the event set which is used in these calls, and this is a sketch of how to do it (I have code for this): I usually prepare my program so that I can set an environment variable with the names of the events I want to measure, and I have code like this to parse the environment variable and program the appropriate events. To count different event sets in different runs of the program, you just change the environment variable and run it again. This makes it really easy to run through lots and lots of experiments — remember the rule about separate runs I mentioned. All right, that was the quick version of this part, and I think I've run out of time. Any questions? Yes, Subin? [Question about Intel VTune.] VTune is proprietary; I don't touch it. Ask Joe, he knows it — he's reaching for the mic, stand up. [Joe:] A few years back we played with VTune while working with the developers of perf, and everything in VTune you can do in perf. A lot of what VTune does is exactly what was two or three slides back, where it talked about all the ratios: if you want to look at a hot area here, look at these events, in this ratio, and here's what it means — VTune just makes that easy to look at. All of that is available in perf, and in fact Andi Kleen's tools do a lot of what VTune does, in open source. VTune is the 300-pound monster; it's very heavyweight. Anything else?
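The environment-variable setup described above can be sketched like this. The variable name `PERF_EVENTS` and the `program_events` hook are made up for illustration — substitute whatever your counter library (PAPI's named-event interface, perf_event_open, ...) expects:

```python
import os

def parse_event_list(var: str = "PERF_EVENTS") -> list:
    """Read a comma-separated event list from the environment, e.g.
    PERF_EVENTS=UOPS_RETIRED.RETIRE_SLOTS,CPU_CLK_UNHALTED.THREAD
    Returns [] when the variable is unset, so the program runs
    unmodified outside of measurement experiments."""
    raw = os.environ.get(var, "")
    return [e.strip() for e in raw.split(",") if e.strip()]

def program_events(events: list) -> None:
    """Hypothetical hook: hand the event names to the counter
    library before the measured region runs."""
    for e in events:
        print(f"programming counter for {e}")

program_events(parse_event_list())
```

Each experiment is then just `PERF_EVENTS=... ./myprog` with a different list — no recompile, which is what makes sweeping through many event sets cheap.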
Sure — all right, Tommy. [Question:] I was just wondering: do the times when the events actually trigger ever get correlated with the workload, or can you assume they're random? No — you're counting certain events, and things like, say, last-level cache misses are of course correlated with the work at the moment you register them. But this is where the tracing, the association of the events with timestamps and with the exact instruction, comes into play. In the earlier versions of the CPUs they were happy just to raise the events; later, with PEBS and so on, they are actually able to record the exact instructions. So it depends on the implementation of the CPU — and on how much you pay for it. [Follow-up:] Sorry, to ask a little more clearly: what about time-sharing? You should never time-share. If you do time-sharing, all bets are off — literally, so don't do it. The only time to time-share is if you have a gigantic, never-ending process which you cannot afford to measure in many repeated runs, and where you measure for hours and so on, so that the numbers actually mean something — but not in normal program development.