Yeah, hello. As Renato said, my name is Kai. I'm one of the maintainers of the LDC compiler, and today I'm talking about profile-guided optimization, which was recently added to the compiler. Okay, there are a lot of specific words in there, so let's explain what it all is.

So what is profile-guided optimization? It was recently added to LLVM, and it's based on a simple idea: I can do better optimizations if I have more information. And what is the source of more information I can use for optimization? Well, I can do a lot of static analysis, and there's a lot of theory about it, but that's hard work, and you can't always determine everything just by looking at the source. So another way to get information for optimization is to look at the running application and get the information from the application itself. To do this there are two approaches. One approach is to sample the application: I just stop the application at certain intervals, look at where the application is, and store this data. That's one way to do it, but I'm not going to cover it in this talk. I'm only covering instrumentation of the source. In LLVM this is a common approach; we have, for example, the sanitizers, which also add instrumentation to do certain things.

So for optimization, what information do I want to know? For example, I want to know how often a function is called, or, if I have a branch, how often the branch is taken and how often it is not taken. I can simply count this. For example, I can count how often function A calls function B, to know the dependency. So I add a lot of counters to the source, run my application, collect all this data, and dump it to the hard disk, and then I have something which we call a profile of the application. Yes, this is also the main drawback: we have to run the application to create the information.
So the quality of the information we derive depends on the runs we do. If we run the application with an average input, I can expect a good result. If I run it with an exceptional input, so that I only cover a specific case, then I get a profile which does not cover everything, and this can even result not in an optimization but in a performance loss, because I'm pointing the optimizer towards the wrong places. So it's a good way to get information to use in optimization, but you have to be careful. The very, very important point with every optimization, by the way, is that you have to measure. To say "this is a proven optimization, I just use it blindly and everything will be good" does not work, so be sure to measure your results.

Okay, so this first part was about what profile-guided optimization is. The next question is: what is this D language? Has everybody heard of D? Okay. By the way, David, the other maintainer of LDC, is also here, so we have very competent support. So what is D, for the people who have not heard about it? D was created as a systems programming language. It first appeared around 1999 or 2000, and it was really meant to be what the name implies, a better C++, but today we are proud of having our own language. We use a C-like syntax, so you can immediately read it if you are familiar with C or C++. We also use static typing, the same as C++. What's better: we have good module support. This will come to C++ in the next standard or so; we already have it, and it's very good. And one of the strong points of the language is that we support many paradigms: we have polymorphism, so you can do object-oriented programming; we also support a functional style if you prefer that; we have generics and a lot of other styles. It's all well integrated, so you don't have to commit to one style or another; you can mix them and it all works well together.
Templates are also known from C++, but D takes them a step further. We have real template metaprogramming, with a nice addition, static if, so you can choose which template is selected based on some conditions. That makes template metaprogramming very easy to use; it's not the hard stuff from C++, which is very nice. Yeah, we use a garbage collector; this is always a point of discussion, whether it's good or not. The language uses it, but there's also work to replace it, or let's say to make it a plug-in, so you can use manual memory management, for example. It has some drawbacks, maybe, but a garbage collector is always a safety guard: you don't have certain memory problems with it. Another nice, unique feature is what we call compile-time function execution. You can have certain functions executed by the compiler, not at runtime but at compile time, so that you don't have to run them at runtime; the result is simply inserted into the binary. That can make things very fast. For example, if you have regular expressions and you know the expression at compile time, then you can compile this regular expression into code, and this code gets all the optimizations done by LLVM, and that's very, very fast. So there are certain use cases for it, and that makes it very nice.

Yes, LDC is one of the D compilers available, and it's a combined effort, let's say. We have a reference compiler, the DMD compiler, and the front end is nowadays written in D itself. LDC uses this front end and adds some C++ glue code to generate LLVM IR, and then uses the LLVM back ends, so we support a lot of the back ends LLVM supports. And since we are just reusing the reference front end, we can say we are really compatible with the reference compiler, while generating really industrial-strength machine code. That's very good.
If someone is more interested in the compiler: I think in 2014 I gave a talk at this place about the inner workings of LDC, so you can download the slides from the FOSDEM page if you want to know more about it.

Okay, so we have the profile-guided optimization part, which is provided by LLVM, and we have the LDC compiler, and now we need to plug this all together. Yes, the latest release already has it, so if you go to our GitHub page and download the binaries for Linux, you get it for free. But of course the implementation was not for free; one guy, Johan, made the effort to really implement this, and there was, let's say, a lot of work that needed to be done. First, you have to add the source code instrumentation, which boils down to inserting calls to LLVM intrinsics at the right places. If you look at the pull request, you see a typical pattern. We have a PGO (profile-guided optimization) class, more or less copied from the same class in Clang, and this class tracks the state of the current function. So what you always see is: I get a reference to this tracker class, I inform it about the statement I'm currently in, and at the right place it has to do something with the counters. For example: okay, I'm entering the function, so I have to increment the counter corresponding to this function. That's the common pattern. There is also code to count whether a branch is taken or not taken, and all of this must be added. So we have a big visitor, more or less, which goes through the abstract syntax tree and has to do all this work. D is a full-featured language, not a toy language, and parts of the compiler, the front end, as I said, are written in D, so that's really a lot of work, and there's no automatic way to add it. Another point is that now that we have the instrumentation, we also need to add the profiling pass to the pass manager. And yeah, one of the
things we do with LDC is that we support different versions of the LLVM backend, and this always leads to some kind of ugly code. The instrumentation profiling pass is fairly new; it was added in LLVM 3.7, I think, and with LLVM 3.9 it was renamed. So just after a year we had to add an if to say: okay, if you are using a newer version of LLVM, then use this legacy name, and otherwise use the old name. That's also a common pattern in the glue code of LDC: we always have to check the LLVM version and then use this name or that name, or a different order of parameters. That's sometimes a bit annoying to support.

Okay, what else needed to be added? We have the profile data: we collect all these counters, and the counter data is attached to a function or to some basic blocks, for example for loops or ifs, and it must be written to disk at the end of the program. This required a bit of fiddling with how to identify where I am in the source code. So you have to add functions for reading and writing this profile data, and when I say reading, this also means we have the data and must bring it to LLVM: reading implies that we go through the data and attach it to the LLVM IR as metadata. So it's a bit of work. And not everything is just lowered to machine instructions by LLVM; there's also a small runtime library we have to link against. So first, our driver needs to know that if we switch on profile-guided optimization, we have to link an additional library; that part is easy. The other part is that we need the library itself, so we took the easy approach and added the profile-rt library to our repository, and we did this for each supported LLVM version. So this is a big, big directory which contains the profile runtime from, I think, LLVM 3.7 up to LLVM 5 now, because that's the easiest way to do it; otherwise we would
have another external dependency, which would be annoying. Okay, but that's the whole story, and with this done we are ready to use profile-guided optimization with our LDC compiler.

I said the latest release contains it, at least for Linux, because as far as I know this is not available for Windows, for example. But on Linux you can go to our GitHub page, download the compiler, and everything works as described here. So what do you do? You have a test program, or whatever application you have; I called it pgo-test.d. First, we have to add the instrumentation during the compile process, so you use this lengthy switch, -fprofile-instr-generate; I think it's the same name used by Clang. This not only compiles the module but also adds the instrumentation. One good hint here: give the binary a different name, so you know it is the instrumented one. The reason is simple: adding all these counters is a lot of stuff, and if you are out of luck you have a counter for every basic block, which is a big performance impact. You do not want to distribute this instrumented version to any user, so just use a different name and everything is fine.

Okay, so you now have the instrumented binary; just run it. When you run it, it will write the profile data to the hard disk at the end. Two things here: the default name is default.profraw, and you can change this name, which you want to do for two reasons. First, default is a very generic name, and if you do this a lot, the file will always be overwritten. Second, you will do many runs, and for every run you want to collect the data and later merge it into one profile, so you really need several of these profile files. For this you can either give the switch a name after an equals sign, or you can use an environment variable, and in this name you can also use some parameters which can be replaced, for example with the
process ID. So you can really say: I have a script file here, and I run my test application with ten different inputs, and I get ten different profile data files. And what do you do when you have these profile data files? You need to convert them, and for that we have the ldc-profdata tool. There's no magic in it; it's just a renamed llvm-profdata. It's renamed because we are tied to a specific LLVM version, and there can be many LLVM versions installed, for example on Linux, and with this we make sure we are using the right one for our tool. This tool has just two basic options, merge and show. Here you use the merge option: you give it all the .profraw files you have generated, and it creates the final profile data file; with the output parameter you just give it a name. There's some format conversion in it, but basically it merges all the profiles together and creates a format which is again readable by the compiler. And then you're done: you do the second compile, in this case with the -fprofile-instr-use switch to pass in the profile data, and then LLVM hopefully gives you a better-optimized version of your program. And again, the hint: measure the result, because this only looks at the runtime behavior of the application, and it can go wrong. It might not, but it can go wrong, so measure the result.

Okay, what kind of optimization is done with this? With LLVM 3.9 a new optimization was introduced which exclusively uses the information derived from profile-guided optimization, and it is called indirect call promotion. I think it's a very important optimization, because indirect calls, calling a function through a function pointer, are used very often, for example in object-oriented languages; but also in C it's common to have a pointer to some function and do the call through this pointer. The optimization we can do here is: when we know which function is called
through the pointer, then I don't use the pointer, I call the function directly. And okay, a pointer can point to many functions, so we want to know, let's say, the most likely called function. How do we get this information? We simply count how often each function is called and from where it is called, and that's exactly the information we collect with profile-guided optimization in our profile data. So this is a nice optimization, and the benefit is: if we know the most likely called function and replace the indirect call with a direct call, then we enable other LLVM optimizations, for example function inlining, which means that for the most common function there is a very, very fast path, because we have eliminated all the branches in it.

Let's look first at the high level and then at the low level. This is the D code, and I think it should be really readable for everyone. I have a simple function which is really simple, it just returns one value, and there's also a function pointer to it. To make things more interesting, we have a loop and call this function in the loop. There's no sense behind it; the function is not doing anything really useful, it only demonstrates calling the function indirectly through a pointer. And what does the indirect call promotion do? It's quite simple: it checks whether the pointer points to the most likely called function. Okay, we have only one function here, so that's quite easy. If that's the case, then we call this function directly. For sure, this is only a heuristic method; the compiler does not know what may have happened to the variable in other places, so we also have the fallback that says: okay, if it's not this function, then just do the call through the function pointer. So this is a very simple approach, and for sure you can do it yourself, you don't need compiler support for it. With other optimizations that is not
that easy, but if you have good knowledge about which function gets called, then you can do it yourself. And you can really do it in D with a neat syntax: you can define a template which hides the check for the most likely function and then either calls the likely function or uses the function pointer. Then we have this nice syntax, which is a very special feature of the language, and I think it's readable: we have the function pointer, we apply this likely-call template, and this template has the most likely function as a parameter; everything else is hidden in the template. So we could do all this ourselves, without compiler support. If you know the target, you can do it this way, but it's not really a good approach, because in most cases you don't know which is the most likely called function. Look at LLVM: it has millions of lines of code, and at one specific place, which is the most likely called function? You don't know. So let the compiler do the work.

Okay, let's go down to the IR level. How is the indirect call represented at the IR level? It's quite simple: we load the value of the pointer into a virtual register, and then we just call the function through it. Very simple: load the value and call. Next we have the instrumented version, which is the intermediate step. We have to add a counter for the function and one for the loop, because if you remember, I had a loop, and we need to count how often the loop body runs. At this point we also need to record which target we call through the pointer. Loading and calling the function is of course still present, but there's a lot of stuff around it. When we run the instrumented application, it generates the profile data. You can also use the show argument of the ldc-profdata tool and look at the generated profile, and for
example, you see the icp function: it has one counter instrumented in it, and it was called 1000 times, which is what I would expect from my source. There's also the main function with two counters in it, one for the function and one for the loop body, and the function was called one time. So there you can get an overview of your profile. And now we use the profile data to optimize the function, with the -O3 switch, so we really optimize. In this case you see the load of the pointer, a compare with the most likely function, the icp function, and then a branch: if it's not equal, we go to the original indirect call, that is the fallback, and in the other case we jump over it. And yeah, this function was very simple, so it was inlined by LLVM, and in the phi instruction you see: okay, if you come from the for-body, then use the value 42. So in the most likely case there's no call at all, and that's what we want to have. You also see that the profile data is attached as metadata by LLVM.

What's the result of all this? Johan, who did the pull request for it, also tested it, and if you only look at the D part of LDC, the compiler becomes faster by about seven percent. This was measured on a test case from an industrial project, so it's not some simplistic benchmark but a real test case from an application. That's a very good result in my eyes. He also had some more ideas, and for these there is so far only a sample implementation. He called it virtual call promotion, which is the same as indirect call promotion but applies to classes, which use a vtable structure to do virtual calls. With a similar approach, these virtual calls can be replaced with direct calls. That's a very interesting approach, but not yet committed. So my summary is: it's really worth it to have profile-guided optimization in our toolbox now
because if you do it right, you can get a nice speed gain, and I think making an application seven percent faster just by running it several times is a very good result. Okay, so I'm done now. Are there any questions?

Yes, so, I did not really mention it; the question was how much overhead there is, and the overhead is really noticeable. You don't want to distribute an instrumented binary; I think it was about a factor of two or so, it's really slow. So you do not want to distribute this. More questions?

Excuse me, could you repeat? So the question was why we chose this approach with the source code instrumentation. That's the generic way to do it; it works with all back ends, so we simply added it and we have it there. Okay, thanks.

I was wondering whether you could profile memory accesses, memory allocations? Yes, okay, so the question is whether the instrumentation could also be used for other kinds of things, for example memory allocations. At this point, no. There is also only this one optimization, the indirect call promotion. But for sure, profile-guided optimization is an area of active development in LLVM, so yes, it should be possible; there's just no work on it yet.

Yeah, so the question is: if the function is not inlined, is there then no big benefit in using this optimization? And yeah, it really depends, and it depends on the way the CPU works, because it has a lot to do with the branch prediction in the CPU. If you're doing an indirect call and the CPU cannot predict the right target, then there may be a lot of work the CPU has to do internally. So it really depends on your target whether this is a good optimization or whether it has no benefit; there is no general answer in this case, we really have to measure it. That's a kind of problem here, because we use data which is collected at runtime, and from that you don't have the absolute
knowledge about all the possible inputs and all the possible branches which can be taken. You would need some kind of, let's say, global program analysis to really deduce that this is the only target for this pointer, and that's not possible.

Now imagine that setting the pointer is not a global setting, so LTO doesn't know what the target is, but PGO says this is the most likely one. Then you could hoist the whole thing out of the loop and say: if the pointer is this function, return 42,000, otherwise go through the loop, and that would be a big saving. Yes, and LTO is also supported by LDC, so this can be combined. Okay, more questions? If not, then thank you very much.