Good afternoon everyone, ladies and, yes, gentlemen. I'm very glad that you're staying behind to the last minute, especially in a room with such a peculiar layout. I'll do my best when I have to point out things on the slides; let's say I'll do my best. Anyway, my name is Pawel Moll. In case you were wondering, I'm Polish, although I live and work in the UK, in Cambridge, for a company you may have heard of called ARM. I'd like to talk to you about a feature of compilers that has been there, as we'll see, for many, many years and is not widely known or used. It hides behind, obviously, a three-letter acronym, FDO, or others, as we'll see in a second. It could be considered a magic option that will make your program run faster. Is it?

The presentation has a very simple and standard agenda. I'll talk about the basics of the problem, give some examples with two particular flavours of FDO being used, and we'll finish up talking about real-life deployments.

So, not to waste time, let's start with terminology: TLAs, three-letter acronyms. I've already used FDO, which can be expanded as feedback directed optimization, or alternatively feedback driven optimization. It is also widely known as PGO, standing for profile guided optimization. There is a standing joke that anything new we invent in 21st-century computer science has already been done by IBM in the '70s, and indeed they did do this in the '70s, on the mainframe, and they called it PDF, profile directed feedback. If you look closely at Sun's documentation (maybe slightly later than the '70s, but still quite a long time ago) you will find something called PFO, profile feedback optimization. It's all the same stuff; at least to me, unless I hear any opposition, I'll assume it's all the same stuff.

And what is it about? This technology, let's call it that, is supposed to help the compiler make decisions. The compiler has a lot of decisions to make during the compilation phase. A disclaimer here: I'm not a compiler engineer, I'm not a compiler guy. I'm a normal person, I work on the tools side. I got into the FDO story from the profiling side rather than the code generation side, but I still had to learn a bit about compilers and I'll try to pass that knowledge on to you. If we have any compiler engineers in the room, I will not try to compete with you; meaning, don't kill me.

The first and most obvious decision a compiler has to make is to answer the question: is "then" more probable than "else"? If you have a very simple if statement, with a condition and one possible outcome or the other, the compiler has to decide which one to consider more likely and which one to optimize for. We will see an example of that in one of the later slides. Is a function worth inlining? That's an incredibly important question these days; most of the compiler-driven performance increases these days come from proper function inlining. I won't show an example of that, because it matters most with huge code bases and my example is very simple. And the other classic question is: should I unroll the loop? Unrolling the loop just means that instead of generating a single loop body, you generate a number of them in series, trying to resolve a couple of resource conflicts. We'll see an example of that one later.

Normally, compilers answer these questions using a bunch of quite complicated and pretty much finger-in-the-air heuristics.
They will have some code saying that if it is Tuesday and the code looks like this, we'll try to go that way. Obviously I'm joking a little bit, but not that much: there is no hard science behind it, there's a lot of guessology behind it. And this problem is quite old. The first optimizing compiler we have ever heard of came from IBM, obviously, and it was a Fortran compiler. We'll see an example from its manual in the next slide, and there is an interesting fact about it: Fortran itself, the language, defined a FREQUENCY statement that could be used within the code, around conditional statements, to give hints to the compiler about which path is more probable than the other, which path will usually be taken by your program. And we see a very similar mechanism in GCC today. There is a built-in function, __builtin_expect, which pretty much takes an additional argument that is a hint to the compiler, one way or the other. This function is used hidden behind the macros likely and unlikely that you may have seen in Linux kernel code. If you look at the GCC documentation, though, right around __builtin_expect, you will find a sentence, and I quote: "programmers are notoriously bad at predicting how their programs actually perform". And this is true; we have seen it, again, even within the Linux kernel a couple of times.

So the obvious idea is to solve the problem automatically: run your program on real data, measure the branch frequencies (in the case of an if, measure how many times the "then" is taken and how many times the "else" is executed) and just optimize for the more frequent case. That's an obvious idea, and obviously IBM thought about it, maybe not with Fortran, but later on. An interesting fact: the Fortran compiler, when doing basic block alignment, which is more or less what I was talking about (how you arrange the generated bits and pieces of code to make the flow optimal), did something I consider completely bonkers, though it was probably a good idea at the time: a compile-time simulation. Out of the whole space of possible combinations of all the basic blocks, it ran a Monte Carlo simulation, trying to figure out, randomly, which one would be best. As far as I know, no one is doing this today. Probably in ten years' time someone will come up with this idea again, and it will turn out that IBM thought about it some time ago.

This is a photograph of the original manual for the FORTRAN Automatic Coding System. If you look closely (I'll try to show it on both screens, around this area here) you will find that the problem of the compiler making decisions will be 60 years old this Saturday: this manual is dated the 15th of October 1956, and the 15th of October 2016 is this Saturday. Nothing has changed.

I will use a very simple example. I must say that I have stolen the idea for the algorithm from the AutoFDO tutorial on the GCC Wiki, although I have implemented it myself, I'm proud to say. Well, if you haven't spotted that it's bubble sort yet, it's time to go back to your university classes. The important thing is that there is a critical operation within this algorithm: if one element is larger than the next one (the previous one, in this particular case), you swap them. That's the basics of bubble sort, very simple; a sketch of that kernel follows below. So let's get to the meat: instrumentation-based FDO.
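For reference, a minimal C sketch of the kernel being described, a reconstruction rather than the exact code from the slides, with Linux-kernel-style likely()/unlikely() macros included to show where the manual __builtin_expect hint mentioned above would go:

```c
#include <stddef.h>

/* Manual hinting, as discussed above: kernel-style wrappers around GCC's
 * __builtin_expect. A programmer could write if (unlikely(...)) below. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* The critical operation of bubble sort: compare two neighbours and swap
 * them if they are out of order. The whole FDO story in this talk revolves
 * around how the compiler lays out this one if statement. */
static void bubble_sort(int *data, size_t n)
{
    int done;
    do {
        done = 1;
        for (size_t i = 1; i < n; i++) {
            /* Is "then" more probable than "else"? Without a profile the
             * compiler has to guess; with one, it knows. */
            if (data[i - 1] > data[i]) {
                int tmp = data[i - 1];
                data[i - 1] = data[i];   /* the expensive path: two stores */
                data[i] = tmp;
                done = 0;                /* clear the "done" flag: the while loop goes round again */
            }
        }
    } while (!done);
}
```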
This is the classic approach to profile-guided optimization, or feedback-driven optimization, and it is available in GCC, in LLVM and probably in most of the proprietary compilers. The idea is quite simple. You build your program with a special compilation option (in the case of GCC, -fprofile-generate) that injects extra instrumentation into your code; we'll see an example of what that instrumentation may look like later on. Then you run your program. By the way, my bubble sort just goes over a statically defined list of 30,000 random integers and tries to sort them. The data is constant, not generated at run time, so the initial state is always the same. So I run my program, the instrumentation inside the code measures how many times the critical operation has been executed, and it writes that information out, one file per object file, into the working directory. Now you build your code again, this time with -fprofile-use, and the compiler rebuilds the code taking that information, the data captured during a real program run, into consideration.

So let's have a look at the effect of compiling my example program with GCC 4.8 (quite an ancient one; there's a reason I've used this one) with -O3. A word of explanation: this is AArch64, ARM 64-bit assembly. Don't worry if you don't speak A64, it's not an issue. The important thing I want to point out is that the code is quite simple. LDR is a load, so we see two loads: we load two elements from memory. Then we have a compare, comparing one against the other, and if less-or-equal, meaning the "else" case (we don't do anything in the else case in this particular code), we just skip the operation. One moment, I'll do this twice, on both screens. We load the two elements into two w registers. We compare them. When we have to swap, we store them back to memory in a different order; you can see that the registers are swapped. And we also update the "done" flag to show that the while loop has to go round again. I'll repeat it on this side: we load two integers from memory into the registers, we do the comparison, and if we have to swap the elements, so we are executing the if statement body, we store those same elements back in a different order, with the registers swapped. That's all, nothing more, plus updating the done flag.

Now, what are the expensive operations here? Memory accesses. So obviously we don't want to do stores if it's not necessary, and it's not necessary if we don't have to swap the elements; that's why we skip them with a branch. And the branch itself: branches are expensive. That's a completely separate discussion, we can chat about it later; for the purpose of this experiment, let's believe that branches are expensive even if you have good hardware prefetchers, hardware branch prediction units and so on. Therefore we want to avoid branches, and avoid memory operations, if possible.

The -fprofile-generate build generates pretty much the same code, shown in red; as you can see, the red code is exactly the same: load, compare, branch, store and move. The green italic code is the added instrumentation, and as you can see there's quite a lot of it; it has rather bloated this very, very simple function. The big blocks of green code at the beginning and the end are run once and forgotten; they are preparation.
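To make the two-step flow concrete, here is a hedged sketch. The command lines use the flags just named, and the counters below are only a conceptual C rendering of what -fprofile-generate injects (the real instrumentation is emitted by the compiler, and it also registers code that dumps the counts into the per-object .gcda files at exit). The three counters correspond to the three increments discussed in a moment.

```c
/* Two-step instrumented FDO flow with GCC, roughly:
 *
 *   gcc -O3 -fprofile-generate bubble.c -o bubble   # build with instrumentation
 *   ./bubble                                        # training run; writes .gcda file(s)
 *   gcc -O3 -fprofile-use bubble.c -o bubble        # rebuild using the recorded counts
 *
 * Conceptually, what the instrumented build adds to the kernel is little
 * more than a set of execution counters along these lines: */

static unsigned long if_checked;   /* how many times the if condition was evaluated */
static unsigned long swapped;      /* how many times the then-block (the swap) ran   */
static unsigned long outer_loops;  /* how many times the while loop went round       */

static void bubble_sort_counted(int *data, unsigned n)
{
    int done;
    do {
        outer_loops++;
        done = 1;
        for (unsigned i = 1; i < n; i++) {
            if_checked++;
            if (data[i - 1] > data[i]) {
                swapped++;
                int tmp = data[i - 1];
                data[i - 1] = data[i];
                data[i] = tmp;
                done = 0;
            }
        }
    } while (!done);
}
```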
In our case this preparation, a kind of preamble for the profiler and a postamble gathering the data, does not matter at all, because it's the main function and you only run it once. If you were instrumenting some particular hot function, you would see the overhead of those extra instructions in your profile. What I wanted to point you at are the small green areas in the middle. They are pretty much: increment x6, increment x8 and increment x10. I'll repeat it here, I know it's a bit hard to see: every now and then we are incrementing x6, x8 and x10. If you look closely, you will find that x6, the first add, is incremented every single time the if check in your source code is executed. x8 is incremented every single time the swap happens, so when we enter the if-then block. And the last one, x10, measures how many times the whole while loop went round. That's more or less what the instrumentation does: it measures how many times each of the basic blocks, or in the case of the while, the loop, has been executed.

And what can the compiler do with such information? As it happens, in the data set I have in my example, the swap operation happens less frequently than the non-swap. Previously, if you remember the original code, we had effectively optimized for the swap case: the compiler made the decision to optimize as if the if statement body was always executed. A reasonable assumption; it didn't know better, it had to pick one or the other, and it picked that one. When I provided it with the real information, the real data for this data set, it spotted that the store, the swap block, is executed less often than the whole if statement, and in this case, if you look closely, you will find that it has optimized for the non-swap condition. Only if we have to, if the comparison returns "greater than", do we branch to the two store instructions that swap the elements. Meaning we skip the expensive operations as often as possible, with the overall goal of reducing execution time.

So let's have a look: how did we do? Not very well, indeed. A couple of data points here. "Time elapsed" is the wall-clock time measured for the program to run; "cycles" is pretty much equivalent, the same metric. The profile generation, in quite a simple case where we have only added three add operations that operate on registers (very, very cheap) plus some preamble and postamble code executed once, already has 2.3% overhead. For more complicated code this number can go up; I'll have some examples later on. And unfortunately, in this case the optimized code, the code optimized for this data set, is not performing particularly well; I would say even worse. Now, there's a good reason for it: GCC 4.8, as I mentioned, is a special version. It was the first version of GCC with AArch64 support, and it's hardly doing any optimization whatsoever; it was about generating correct code, nothing more than that. I can assure you that the next example will show noticeable and important improvements. One thing that I want to point to, to attract your attention to, is the last row: the IPC, the instructions-per-cycle metric. It's a personal interest of mine; I've spent the last three or four years of my life on performance analysis and optimization.
IPC is very commonly quoted as a kind of proxy for program performance: if you can execute more than one instruction per cycle on most modern processors, that's good, so a higher IPC means your program is better. Just keep those numbers in mind: 1.1, 1.45, 1.44. Good.

[Audience: So if I want to use the profile, I actually need to do both builds?] Yes, and, well, we'll talk about it; that's one of the most important issues with instrumentation-based profiling, and that's what puts people off. Make no mistake, the numbers there are both against clean -O3, so you see the separate overhead of profile generation, and then I was hoping to see some improvement with profile use. But yes, you have to do these two steps separately.

Okay, GCC 6.1. Even with plain -O3 it already generates slightly better code, and we'll see it in the numbers. The algorithm is exactly the same; it's just that instead of two separate instructions to load the two elements from memory, there is a single load-pair instruction doing exactly the same thing. Nothing else has changed: we load two elements from memory, we compare them, and if we don't have to swap them we branch to the "else" condition, where nothing happens.

The profile-generation build, the instrumentation added by the compiler with -fprofile-generate, hasn't changed much either; again you will see a couple of adds here and there. There is one funny thing about it that attracted my attention, although it's really irrelevant: you can see that there are only two "add 1" instructions, because the compiler has optimized the instrumentation as well. It realized that the for loop will always be executed 29,999 times, so there's no point in adding one 29,999 times; it's perfectly good enough to add 29,999 once, every time the outer loop is executed. So even the instrumentation can be optimized. 6.1 is much better than 4.8 was.

And now we have used the profile. This machine is a Cortex-A57; it's got a data prefetcher. I carefully chose the size of the data set, the 30,000 integers, so that it fits in L2 (assuming 4-byte integers, that's roughly 120 KB) and I don't see huge, unpredictable delays from the interconnect, but it will always miss L1. That's how it was chosen, and that's why we'll see significant differences in a second. What you can see here is quite a dramatic change in the generated code: loop unrolling. The compiler was told that this loop is pretty hot, so it is worth unrolling. What you see is lots of load pairs; it's just doing the content of the for-loop body a number of times, one copy after another. That's point number one. Number two: it has optimized for the non-swap case as well. What you see in the code is a sequence of load (or load-pair), compare, branch, load, compare, branch, and so on, and the same on the other side: load, compare, branch, and only if necessary, to swap, a load and so on; the stores are in a separate block of code. So it has done both: unrolled the loop and optimized for the branch frequency. And this time with pretty damn good results. The profile-generation instrumentation cost... sorry, -O3 itself, clean -O3 on 6.1 versus 4.8, on pretty simple code, is already over 1% better; there is a difference between 4.8 and 6.1. Profile generation looks slightly more expensive, but probably only because we have already saved that 1%, so it's a gap of 2-3%, and this is a very simple use case.
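To picture the unrolling at source level, here is a rough illustration of what happens to one pass of the inner loop: several copies of the body in a row (which is part of what lets the compiler use load-pair instructions) plus a small remainder loop for the leftover iterations. This is not the compiler's actual output, and the factor of four is chosen purely for the sake of the example.

```c
/* One inner pass of the bubble sort, manually unrolled by four. The real
 * compiler does this at the instruction level, on the profile's say-so. */
static void inner_pass_unrolled(int *data, unsigned n, int *done)
{
    unsigned i = 1;
    for (; i + 3 < n; i += 4) {   /* four compare/maybe-swap steps per iteration */
        if (data[i - 1] > data[i]) {
            int t = data[i - 1]; data[i - 1] = data[i]; data[i] = t; *done = 0;
        }
        if (data[i] > data[i + 1]) {
            int t = data[i]; data[i] = data[i + 1]; data[i + 1] = t; *done = 0;
        }
        if (data[i + 1] > data[i + 2]) {
            int t = data[i + 1]; data[i + 1] = data[i + 2]; data[i + 2] = t; *done = 0;
        }
        if (data[i + 2] > data[i + 3]) {
            int t = data[i + 2]; data[i + 2] = data[i + 3]; data[i + 3] = t; *done = 0;
        }
    }
    for (; i < n; i++) {          /* remainder loop for the last few elements */
        if (data[i - 1] > data[i]) {
            int t = data[i - 1]; data[i - 1] = data[i]; data[i] = t; *done = 0;
        }
    }
}
```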
And with the loop unrolling and the branch-frequency optimization together, we have saved about 23% of execution time. Have a look at the IPC: it's pathetic, it's below 1, and the code still runs faster in wall time. As I say, it's just a hobby of mine; I'm showing that IPC is not always a direct proxy for the efficiency of your code. If anyone wants to have a chat later on, I can talk for hours about why this is the case.

Now, if we can get a 23% improvement practically for free, why isn't it used by everyone, every day? Because it's not really free. Problem number one with FDO, PGO, whatever you call it, is the training data set. I have cheated: I ran my training run, the profile generation, on exactly the same data that will be used in production. Now try doing the same with Firefox. Obviously, you could try running your profile-generation build on every single web page available in the world; I wish you good luck with that. There are people who have tried and failed, maybe not at that particular thing, but as I'll mention later, Firefox does have support for FDO in its build system; it's just not used by default. SPEC 2006, a very widely known set of benchmarks ("real-life" benchmarks, as they market it), has a special set of training data: carefully selected training data that is supposed to represent the code flow of the program when it is run on the production data. Another important point is that it has been chosen so that it runs faster: if a full run of, say, the povray benchmark of SPEC 2006 can take five minutes to execute, the training data will only take something like 30 seconds. And there is a paper evaluating this particular workload, showing why that training data is a good representation, a good equivalent, of the production data. You can immediately tell it's quite a hard problem if academics have touched on it. I would say even more: it is a very hard problem, because that's pretty much the only academic reference you can find around it, meaning it's so hard that even academics don't want to talk about it. So this is problem number one: how do you create a representative set of training data for your code? That's hard.

Number two: the overhead of the profiling. 16% is quoted across SPEC 2006, and it can go all the way up to 100 times in particularly bad cases. And last but not least, as was pointed out, you have to run your build twice. That may not be a problem with my bubble sort, which compiles in no time whatsoever, but it's a different story with a build that takes a couple of hours. Your QA and infrastructure people (DevOps, as they are called these days) may not be particularly happy about blowing up their build process to run twice. And you also have to remember that you have to interleave those two build stages with the training run, the profile-generation run, which adds further time and complexity to the build system. So there is a reason why FDO is not used that often.

So let's go back to about 2008, when a paper authored by a bunch of guys at Google shows up: feedback-directed optimization in GCC with estimated (that's the important word) edge profiles from hardware event sampling (again, an important difference). The flow is very similar, kind of. You build your program, but this time normally: just clean -O3, -O2, whatever your poison is. Sorry? [An audience member asks about the -g flag in the build command.] Ah, in a second, all right.
Yes, it's there for a reason; well spotted. The -g, the symbols, the debug information, will not impact execution time, but it will impact the image size. Well spotted, and we'll talk about this later. Then you run your program on your training data, whatever it is; but again, your code itself doesn't have any instrumentation whatsoever. What you do is you perf it: just the normal, standard Linux perf record command. You may notice that there's a -b there; the eagle-eyed among you have probably spotted it, so it's not exactly the completely default perf record session, and we'll get back to this later. Then, and this is the important thing, you run a separate tool that takes the perf data and generates the profile. It's a separate tool, meaning you can do this separately, somewhere else; you can take the perf data from your system and process it elsewhere. It's a tool created by the Google guys; it has had a number of reincarnations along the way and is currently available on GitHub, where you can find it without any problems. And then, like before, you build your code again, this time taking the profile generated by that tool into consideration. There are a couple of gotchas here, and we'll talk about them later.

Let's first talk about the advantages of sampling-based profiling; we'll see the results for my example in a second. Number one: if you have ever used perf with the default record command line, you have probably noticed that the overhead is measurable, but it's nothing like 16%, let alone 100 times. It's manageable. Let me put it like this: it's good enough, as you will see in a second, for Google to run it in their data centres. For me, that's a kind of stamp of approval. As I said, the profile generation, the AutoFDO tool, can be run offline: you can run your stuff in your data centre, if you wish, collect data as they do, and generate your profiles later on. There is no need to create a special training data set: the overhead is so small that you can pretty much carry on with normal operations and run it on your production data. In the Google data centre case, they just run it in the background on every single Google search you execute. And you could say that they are, in effect, also running it for every single web page someone is currently looking at; it won't be all the web pages in the world, but when it's Google, that's a pretty good representation of what people want to look at, I would say. So you just run it on the real data, on production data.

The important thing here is that not only can you do this on production data, you can also aggregate the data. Take Firefox at home: you could profile your Firefox every single time you read some website, on Monday, Tuesday, Wednesday, Thursday and so on, and the aggregated data should represent your interests pretty well. The profile from a whole week of web browsing should show that you are most likely looking at cnn.com or whatever else, bbc.com, or bbc.co.uk if that's where you're from. Then the aggregated profile can be used on Sunday to build the Firefox for your next week's browsing, which will be faster. That's more or less the idea.
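Pulling the AutoFDO flow together in one place, here is a hedged sketch, written as a C comment so it sits alongside the earlier examples; the create_gcov option names are from memory of the tool's documentation and may differ between versions.

```c
/* AutoFDO flow, roughly (no instrumentation in the binary at any point):
 *
 *   gcc -O3 -g prog.c -o prog        # normal optimized build, plus debug info
 *   perf record -b ./prog            # run on real/production data; -b asks for branch records
 *   create_gcov --binary=./prog --profile=perf.data --gcov=prog.afdo
 *                                    # separate, offline conversion: perf samples -> profile
 *   gcc -O3 -g -fauto-profile=prog.afdo prog.c -o prog
 *                                    # rebuild, taking the profile into consideration
 *
 * The conversion is a separate tool, so it can run somewhere else entirely,
 * and on perf data aggregated from many runs. (With mainline GCC you may
 * also need to force the profile version, e.g. -gcov_version, as discussed
 * later in the talk.) */
```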
The guys at Google have a release cycle: they release their set of search tools every now and then, and every new release is built using the data collected with the previous release. So you start with nothing; if you started from scratch, your first build would have no profile whatsoever. Well, tough; it won't be perfect, right? Version one is never perfect, version two will also leave something to be desired, and probably version three will get something right. But that's the idea: you build on real-life production data from your previous release to create a new revision, which will be better, or not, as we'll see in a second. For my example, I guarantee you we got some improvement; not as much as before, well, you'll see.

The third important thing about the AutoFDO profile, the Google idea, is that the profile is source-based. Traditionally, the profiles generated by GCC, the instrumentation-based ones, are tied to an object file, and a sampler will just record, say, a PC, and then you have to do some extra work to figure out what was really going on there. As you will see in an example later on, this profile is source-based: there is a function name, an offset from the first line of the function in your source code, and a note that we have attributed, say, 14 samples to that particular line. It's like what you see with perf report or perf annotate. And last but not least, because it's a single run, you don't have the extra training run or the extra build stage, which makes it much easier to integrate. Sorry, I forgot about one important thing: because this profile is source-oriented, it will actually survive some code changes, up to a point. If you have 10,000 functions in your code (well, probably closer to a million), you will not change every single one of them with every release; you'll change maybe a hundred. So the profile information will still be valid for the 999,900 or so others. One of the papers evaluates this degradation of the profile over time if you don't update it, and yes, it gets worse: if you keep a single version of your profile and keep applying code changes, the effect goes down, you get less improvement from the profiling, but it is still there. So it's about balancing how often you update your profiles, how often you do your formal release, against the time you spend gathering data and changing the code. And all of this makes it much easier to integrate with your build systems.

This is the effect of the AutoFDO build of my example. I won't bore you with the details; there are two important observations. Number one, the loop has been unrolled. Number two, it did not optimize the if/else condition: all around this code you will see combinations of load, compare, store instructions. Meaning that the statistical profile, which by definition is just a statistical representation of the shape of your program's execution, did not provide detailed enough information to spot that the non-swap case is more common than the swap one.
In the instrumented case you get exact numbers, because you are counting every single time something happens. Here it may have been borderline, something like 48% versus 52%, and with a statistical profile, which is the case here, you may simply miss that particular fact.

The numbers are still quite interesting, though. The rightmost column on the slide shows the AutoFDO-based improvement: minus 14%. Well, if you save 14% of execution time in a Google data centre, I think you have just paid for your salary for the next 100 or 200 years. You'd probably get a promotion as well, so maybe it would be slightly less. Again, an IPC of 0.65, miserable, and it still runs faster than my 4.8 case. Again, just my hobby.

Now let's talk briefly about the quality of sampling profiles, because we've just seen that it did not notice the potential for optimizing the swap versus non-swap condition. There are ways of improving the accuracy. I just ran perf record with the default command line, meaning it takes a sample every so many instructions, or every so many milliseconds, and the samples land on whatever instruction happens to be executing. But what we really want to measure is the behaviour of branches, and certain CPUs can provide a hardware PMU event triggering only when a branch is taken; executed, sorry, and it's the executed one in particular that is interesting. If I had extended the perf record command line to ask for a branches-executed event, I would automatically have improved the quality of the profile, because I would be focusing on the branches, and I could probably increase the sampling rate (reduce the sampling period) as well. So that's one thing that can be done. The important thing is that the PMU has to work in a special mode called precise sampling; Intel has it under the name PEBS. I must admit that at ARM we don't have any real-world processors doing anything similar yet, but keep up to date, things are coming.

The second thing, and probably the most important thing about generating profiles for AutoFDO, as you can read in the paper titled "Taming Hardware Event Samples for FDO Compilation" (again the same bunch of guys from Google; you can spot a pattern here, you can see where it's all coming from), is branch history. It's a feature of the processor whereby, every time you take a statistical sample, of a branch for example, it also provides you with an exact (and that's the important thing: exact) history of the branches that happened just before it. So although you are still sampling, still creating a statistical representation of your program flow, every single time you take a sample you have exact and precise information about a short period of your program's execution. And that's the key feature here: the LBR, the last branch record. Again, I must admit we don't have anything like it at ARM at all, but stay tuned.

The third one, and the ultimate solution for obtaining very accurate information about program execution, is processor trace. And I can say that both the Intel guys and we do have processor trace, and Linux can do most of it today.
If you watched some of the presentations from Tuesday, and some of the presentations from the Tracing Summit yesterday, you will know that there can still be a slight overhead: if you use it in the classic Linux-style way, there can be some overhead to using processor trace. That can be a surprise for many embedded guys with a background of JTAG debuggers, trace boxes and so on. One of our friends, who is not here, measured 20% (in a particularly pathological case, granted, but he measured 20%) of overhead from using Intel PT with perf. So it's not free. Processor trace, I should probably mention, will give you information about literally every single instruction executed in the program flow, every single one. You will know exactly which branches were taken and which were not, and you will have the full history for the full run, potentially, for a price: the price is a huge amount of data. So you may consider doing this only for critical code sections, performance-critical code sections.

Here are some real-world numbers from SPEC 2006, or rather a subset of the benchmarks from SPEC 2006. The blue bars are the improvements claimed in yet another Google paper, this time called "Hardware Counted Profile Guided Optimization", in which they analyse training data versus production data on SPEC 2006; worthwhile reading, really, it's a good one. The red results are the ones measured by... sorry, the other way around: the red ones are from the Google paper, the blue ones are mine, as you can obviously tell from the colour of my polo shirt. The differences I kind of ignore: we had two completely different environments, I was running a different version of the processors, this data is pretty old (it's from last year, before AutoFDO was upstream in GCC, so they were using a special GCC branch from within Google when they were doing their paper; they were using different code), and so on and so on. What I was trying to answer was: is it real? As in, I read the paper, I noticed the 15% improvement on some of the SPEC benchmarks, and that's something our marketing guys would love to see; they would kill for such an improvement. I just wanted to see whether it's real. And it actually is: I was getting pretty good results in certain cases, and the ones that differ from the Google ones I simply attribute to measurement error. In general, it works, if you put the effort into doing it.

Number one, the tools are not exactly mature. The AutoFDO tool can break. If we have some eagle eyes in the audience, you may have noticed that I passed an extra parameter to the AutoFDO tool, gcov_version, set to 1. The profile file the tool generates contains a version number, and the tool has a hard-coded version number matching whatever the last Google branch was; mainline GCC, the FSF one, expects a different version, and you just get GCC complaining that the number is not correct, so I had to force the version that GCC is expecting. And it requires debug symbols, because the tool itself has to resolve the PCs sampled by perf and match them against the debug information in order to generate the source-based profile; that's why the -g was there. It can be an issue, but there are ways of dealing with it, and we'll see how Google dealt with it. And the last thing is an interesting one, and it also shows that the whole solution is not perfect yet.
You can get quite a bimodal distribution in the results of this exercise. It has been observed that, let's say, revision 1 brings you a 10% improvement, revision 2 only 3%, revision 3 10% again, revision 4 4%, and so on and so on. There are a number of explanations for it, but the most probable one, and the one that we have observed, is that running your perf, so generating the profile, on already-optimized code can lose some of the hotspots, can lose some information about what is worth optimizing. One of the examples is given here, a very simple case: if a condition holds, the result takes one value, otherwise the other. That's a very common case; you can even write it as a conditional expression. It can be implemented with branches, as we saw previously: load, compare, then branch, one option, and so on. Or, and this is a common kind of operation across different architectures, with a single instruction; in our case it's called CSEL, conditional select. It is not a branch: it is always executed, and it does pretty much that, so depending on the condition, the result is taken from one register or the other. Apparently that's the case with libquantum, one of the benchmarks used by SPEC 2006, where this is a critical operation. The first time you build your code, the FDO profiling run simply sees this branch in the generated code, so the compiler does the obvious thing. By the way, there are also good reasons, which I don't want to get into, why CSEL is not always generated; it's not an obvious answer for everything. But the compiler will immediately spot the fact and generate a CSEL instead. So the next time we run the same profiling, the branch is not there. None of the hardware blocks I've mentioned will spot it; none of them will drill into the CSEL and tell you what the CSEL did. The information is lost. So the next time you build your program, you are generating a branch again, and you are taking the penalty. An example that has to be kept in mind.

Very quickly: I've already mentioned that FDO is also available in both flavours in LLVM. The instrumented approach is there; it just takes different arguments, but the flow is pretty much the same. And AutoFDO, again, where the flow is slightly different: you used to have to use a different version of the AutoFDO tool that generates an LLVM profile, because they used to have completely different profile formats. I'm pleased to report that 3.9, or maybe it will be 4.0... sorry, 3.8 can now use the same gcov-format profile as GCC does, and in 4.0 I think the command-line parameters are also unified. But that's an implementation detail; the basis of operation is exactly the same.

And this is an example of the text version of a profile generated for LLVM. I've chosen to show this one because it is human-readable text. This, obviously, is part of Dhrystone, everyone's favourite benchmark, or at least marketing's favourite benchmark; they would kill more than one person to get a proper improvement in Dhrystone numbers out of any compilation tool. In the top right corner is an excerpt from the profile. You see a function name, with 14 samples attributed to the function preamble. The yellow "5: 14" means that the line at offset 5 inside this function, so offset 5 from its first line, has been attributed 14 samples, and so on and so on. The important thing is the last line, at offset 8.
From that line there were some calls to another function, Proc7, and it has been measured ten times. One can wonder why there is this discrepancy between 14 and 10. The answer is quite simple: this call-count data comes from the execution of the called function (it has been called from that line, so this edge has been generated), and since it's a statistical profile, we simply sampled Proc7 less often than that line of Proc3. So here, again, you can see the statistical nature of the whole thing. But statistically speaking, if you had "if ... Proc7() else something-else" on this line, you should still, in theory (that's the basis of sampling profiling), see a representative number of calls for both functions. These profiles are not exact; they are, or at least should be, statistically relevant, and if they are, you will still get very useful information out of them.

So now, to finish up, let's talk about where it is being used. I have already said that it's not that widely used after all. It's kind of an open secret that a number of commercial products do use FDO. I know... sorry, I have been told, that portions of the Windows kernel, and not only portions, the performance-critical portions of the Windows kernel, are built with profile-guided optimization. I've already mentioned Firefox, and also CPython: they have support for FDO in the build system, it's just not used by default, so the packages that you get from Debian, for example, will not be optimized using FDO. Now, allegedly our colleagues at Intel contributed a kind of static FDO profile to the CPython code base that immediately brings something like a 5% improvement on the interpreter loop. I haven't found any evidence of it; it may be in a custom branch of theirs that I simply didn't find. But it would in principle be possible: you could contribute profiles to the source repository and just keep them updated with every release, very similar to what Google is doing in their data centres.

The Google data centres are obviously using it; that's the birthplace of AutoFDO, and as you will see in a second, they use it quite extensively. And so do their friends at Chrome: both the browser and the OS are using AutoFDO on production data, which I read as: on Chromebooks, they run perf on customers' Chromebooks and then feed the data back. That's just my interpretation. And in the last couple of months, or maybe the last year, the AutoFDO keyword has shown up in Clear Linux, Intel's initiative to generate a cut-down distribution for VMs, for virtual machines. They say that the packages there are tuned micro-architecturally, for Haswell in particular, and also built using AutoFDO. I couldn't find any evidence, but that's what they clearly state on the website: AutoFDO is being used.

So, the Google guys. "AutoFDO: Automatic Feedback-Directed Optimization for Warehouse-Scale Applications": a good paper, I encourage you to read it; the picture is stolen from there. The important observations are, number one, the bottom left corner: they run perf on everything they run in their data centres. Everything. The data from perf is then collected and stored in a sample database. They have also got a separate store of debug information; that's what it comes down to. Every single time they do a build, they strip the binaries and release the stripped binaries into production, but they keep the debug symbols in a separate place.
That way they are able to generate the profile, built from aggregated information from the last days of execution, which is then used to build the source again, by the same compiler, into the next release binaries. It is a pretty complex infrastructure, but as I said, in the case of a Google data centre, or any data centre, every percent of savings translates into big money. So they are investing in it because they have a return on investment, and with their workforce they can afford it.

Now, what is the future of this FDO infrastructure? The Google guys are doing a pretty damn good job with LLVM these days; they are pretty much shifting in the LLVM direction quite dramatically, and they are keeping LLVM in mind when it comes to FDO. It is not on a par with GCC yet, but it will be pretty soon. There will be more and more hardware, from ARM and from Intel, that provides data relevant for this use case: more and more precise information about code execution. And the last thing: everything we are talking about here, optimization based on real execution data and so on, was invented by IBM in the seventies, and still hardly anyone uses it. That's for statically compiled code; funnily enough, JITs, managed environments, whatever you call them, have been doing this for a good couple of years: V8, OpenJDK. In many JIT cases you will find code that measures the code that has just been generated and potentially recompiles it with data captured at run time. So it is happening; it's just that static code is, as always, behind.

So, a friend of mine, when I showed him this title, joked that traditionally, if a presentation has a question in its title, the answer should be no. And I was very happy to accommodate him and say that no, FDO is not a magic "make my program faster" compilation option; although if you use it carefully, know what you are doing and are ready to invest in it, it can bring significant improvements. AutoFDO dramatically reduces the entry barrier; it is much simpler to do. So give it a try. Just make sure to measure the results, because as you have seen, sometimes you may be surprised. Thank you. And obviously I'm happy to take questions now; we have a coffee break next, so, you know, stop me in the corridor, whatever.

[Audience question, partly inaudible: ...based on AutoFDO or something similar?] So the question is whether an optimization like that can be done at run time. Yes and no. The idea is very old; it's called dynamic binary optimization. It was successfully deployed on HP's RISC architecture: the HP RISC guys, kind of by definition, didn't care about static optimization, they were doing everything online, and out of that a tool called Dynamo happened. There were people who tried to deploy the same approach on x86, with miserable results. Something good did eventually come out of that experiment: it's called DynamoRIO, and it is now a dynamic binary instrumentation framework that can be used like Valgrind or Pin, if those mean anything to you. Interestingly enough, because it has that dynamic binary optimization pedigree inside, if you run SPEC 2006 under DynamoRIO, where normally you would expect an overhead from using DBI tools,
certain benchmarks will actually show some improvement, because of the way DynamoRIO executes: it creates a dynamic code cache of your program, reshuffling blocks of code as it sees fit, creating an optimized version on the side and running that one. So yes, there are some examples; not that successful on x86 though, and the same applies to ARM, by the way.

[Audience: What is the status of AutoFDO upstream, in GCC or LLVM?] So, with GCC it's there, both, as you saw: both the instrumentation and the AutoFDO approach.

[Another audience question, about whether the profile depends on particular hardware events.] Well, the profile is not tied to the events the hardware supports. Once you have used your profile-capturing tool, and potentially your converter tool, you end up with edge frequency information, and that's what the compiler wants; it doesn't care how you arrived at that knowledge. If you allow me to scroll back slightly, let me just quickly find the slide. So let's talk about AutoFDO in particular; here we are, that's the process. The -b instructs perf, well, asks the hardware PMU on x86, to include the last branch record, so extra information with every single sample, in the perf data. But it's not strictly necessary; you can skip the -b. It just improves the AutoFDO quality so much that I wanted to have it on the slide, so that if anyone uses it as a reference they get the good version; it would be unreasonable not to use it, on x86 at least, because as I mentioned we don't have this on ARM. By the way, one thing I kind of skated over is the fact that I actually captured the profile for AutoFDO on x86 and used it to build my ARM code. That's perfectly fine; it works, because it is source-based information. But anyway, yes, -b makes sure that the perf.data has enough information for the create_gcov tool, the AutoFDO tool, to create good edge frequency information in the profile that is then fed to GCC.

When it comes to Intel PT in particular, the same tool, in some newer version, does pretty much the same thing; there is an article somewhere on Google, I can find you the link if you want. Instead of pointing it at perf.data with the profile option, you use the trace option with the trace file, and it will do the same stuff. Interestingly enough, these guys have observed no difference whatsoever between using the LBR and trace; there was no extra improvement, the LBR provides good enough information to extract every last bit of optimization possible with GCC. It will work with the future ARM hardware as well; that's important. Today you won't be able to... sorry, I had it working on my desk, in a very convoluted and hacked-up way: I was doing this on Dhrystone and I got 5%, whereas with plain perf sampling I got about zero-ish, which is not surprising. But Dhrystone is a very special case, and I had to restrict myself to Dhrystone for certain practical reasons, namely the size of the data. Our friends at Linaro (Mathieu was here, I think in this very room, on Tuesday) showed the state of CoreSight integration with Linux, and I think the patches for integration with perf are either in 4.9 or due in 4.10. So any day now you will get the same kind of level of integration with perf as you get with Intel PT, and then we can start talking about using it in the tool. It will happen. Okay, and the same applies to LLVM, some of the same things.
Because LLVM can now take gcov, the same format, you don't even have to use the other version of the AutoFDO tool; yes, you will get the same information. Both compilers want the same thing: they just want edge information. As it turned out, gcov was doing a better job of describing it than the original LLVM profile format, so they had little choice but to start taking gcov as well.

If not, I will probably say goodbye to you. The closing game is next, or maybe the one after next; there is a talk in between first. Oh, there's a keynote first, and then the game. Do not miss the game. For God's sake, do not miss the game. If it's your first time at ELC, do not miss the game. It's the best part. Thank you.