Okay, I guess we can start now. My name is Andi Kleen and I'm talking about GCC link-time optimization and the Linux kernel. This is actually more about the compiler than the kernel, but my background is kernel, so I'm not really a professional compiler developer or anything. So you might ask me something about the compiler that I don't know, but let's see how far we get. This was a bit of a hobby project for me; I did it over a long time, mostly on weekends, so it actually took over a year to get it booted. A lot of people helped me on this, just very quickly: especially Honza Hubička and H.J. Lu, with a lot of the changes that were needed for this. Other people helped too, so there's contribution from others. So why LTO? You want to optimize over the whole binary. Initially GCC only optimized per function, then it got a global optimizer that worked per file, that's IPA. With LTO you want to optimize over the complete program. You could do this before, but it was pretty complicated, because you had to change your makefiles a lot. So LTO is basically pushing the optimization phase towards link time. One thing I'm especially interested in for the kernel is that the kernel has a lot of what's called include hell, where you have inline functions in header files and includes of includes, because the inlines often need a lot of other dependencies. Sometimes there's something you cannot inline, because you would get a loop in the include dependencies.
I had a few cases where I couldn't do something simply because of this problem. But if you have the capability to inline without headers, so without having the code in a header file, you can avoid this problem. I mean, one problem with headers is that they slow down the build somewhat, because the header files get bigger and bigger, so there are some other issues too. Another motivation for LTO is that you don't want to change the makefiles significantly. People actually did global optimization before by just including everything into one file, and there used to be a GCC option which did this kind of automatically, but this is very intrusive, especially on large systems with very complicated build systems. So it's much better to have something which just plugs into the existing build. GCC got its first LTO in 4.5, but it had some limitations, and later, in 4.6 and 4.7, came what's called WHOPR, which is more or less parallel LTO. Otherwise, if you do all the optimizations in a single thread at link time, that's just too slow, it doesn't scale. So they switched over to doing it in multiple phases. I'm just giving you a very quick overview, because I'm talking about some of the details later. Essentially you start off with what's called LGEN, which is just the standard parsing and code generation, and you also compute some optimization summaries. Traditional LTO actually generated code at this point too, what's called fat LTO: you did the full code generation, but you also put the intermediate representation into the object files, so you essentially generate everything twice. The next step is when you go to the linker: the linker detects that the object file contains LTO data and then calls back into lto1.
That's the compiler proper. It first analyzes all the dependencies: it builds a global call graph, looks at the types, merges the types which come from different files, and computes some more summaries, but it doesn't actually look at the actual code, it just looks at the summaries; that's the WPA phase. Then it generates partitions, which are parts of the call graph that belong together, so you can optimize them together, and eventually it actually calls the back end on each partition in parallel; that's called LTRANS. It actually generates a makefile and runs it with make -j, and then it does inlining inside the partition, runs optimizations between functions, so global optimizations for real, then optimizes the functions again, runs the back end, and finally generates object code. So that's the current scheme in GCC. The goal of this partitioning scheme is basically to do as much as possible in parallel, because doing everything single-threaded is just too slow. Yeah, that's right, nobody builds a large system with make -j1 anymore. There are a lot of optimizations which can be enabled by LTO. It doesn't support all optimizations, because some optimizations always need to look at global state, so it relies on optimizations which first generate a summary and can later actually work on the summary. The optimization I'm especially interested in is, for example, inlining.
I think that's very important; it was one of my primary motivations. Then you can do function cloning for specific arguments, so that if something is always called with a specific constant, you can make a special version of it. You can remove unused code; I personally didn't consider that too important, but some of my users do, they're very interested in code size. You can increase alignment, for example for vectorization. What's also interesting is that because it looks at everything, it can actually cache some global statics in registers, because it can prove that a function call doesn't clobber that register; in some cases that can give you a nice improvement. Then devirtualization, which is very interesting, but right now we cannot use it in the kernel; everything that's green on the slide is what I currently cannot use in the kernel. Then ABI changes, which are useful for vectorization, but not usable in the kernel; constant propagation; and replacing structures with individual variables, which can sometimes give large wins, because you might actually put part of the structure into registers. I actually had cases in kernel benchmarks where such an optimization gave me real numbers, like 2%, and it was a very big benchmark, so 2% was a big improvement; people do a lot of work for 2%. With LTO this can be done globally. And constructor/destructor merging; I would actually like to use this at some point, but right now, because the kernel doesn't use standard constructors or destructors, that doesn't work. So these are the optimizations that can be enabled by LTO over the complete program. So, build time. I mean, it's somewhat slower.
You can see those numbers there: the rows with -LTO mean without LTO. It definitely costs you: the wall-clock build time is more than four or five times slower for the small config. The user time, if you look at it, is also much higher. And here I computed the parallelism, that's the user time divided by the real time, and you can see that LTO is a lot less parallel than non-LTO, so it definitely cuts some things. And it has a lot more memory consumption; as an example here I have the minor faults, and LTO does a lot more minor faults. 4.8 is slightly slower than 4.7, but not very much. Actually, when I first did my benchmarks, 4.8 was a lot slower than 4.7 with LTO, but then I figured out I actually had some debug option enabled, so that was a mistake. Mostly it's very similar; I think the difference is slightly more for non-LTO than for LTO, but it's not a big difference. So the build-time performance is mostly comparable between compiler versions, but in general it costs you: the build time is slow and the parallelism is not good enough. So I was looking a little bit more at the parallelism problem. This is just plotting the runnable processes: I basically did a vmstat over the build and plotted how many processes are runnable. You can see at the beginning the LGEN phase, where it just reads all the source files and generates the objects; this is pretty parallel. This is a system with 16 CPU threads, 8 cores with 16 threads. Then at some point there's a long phase, which is WPA, where it's not so parallel, and then it generates something in parallel again.
You actually see this happen three times. One of the reasons is that the kernel uses something called kallsyms, which is a built-in symbol table, and unfortunately it does multiple links to build it: it first has to do a link, then create the symbol table, then link again to put in the symbol table generated from the first link, and you have to do this in multiple steps because something can change. And this costs you here, because with LTO you're essentially doing a large part of your optimization multiple times. Unfortunately, I cannot disable kallsyms in all cases, so it has to do this. So anyway, you see that there are some parallelism issues; especially at the beginning of the LTO there's a long phase where things are essentially single-threaded. Another interesting artifact you can see at the end: these are my modules. This config was mostly static, but it had a few modules. Normally, when the modules build, there's the jobserver; the jobserver is the mechanism in make by which the sub-makes don't generate too many jobs, they ask the main make for job slots. But there's currently a bug: the jobserver doesn't work when you run it from LTO, the pass-through sometimes fails. And that causes these spikes at the end, because it generates more jobs than configured. I hope we'll fix that at some point, but right now the bug is still there. So you can see: one problem is the linking three to four times, and the other problem is that there's a long phase which is essentially single-threaded. I already mentioned kallsyms links two to four times; you can actually disable it, but normally it's extremely useful to have the symbol table in the kernel, because otherwise, when something crashes, it's much harder to figure out what's going on, and
there are various features in the kernel which depend on it. So right now this is not fixed. I actually tried to fix it by changing part of the link over to an incremental link, but currently that generates a kernel that doesn't boot at all, even without LTO, so I haven't traced down yet what that problem is. I hope it will get fixed, but right now it's still there. So here are some differences: for example, if you disable kallsyms, how much faster it gets with LTO; it costs you about two thirds of the build time currently. I don't have a 4.8 number here, but it's fairly similar to 4.7. Then I was looking more at the memory usage, because as I said there are problems with memory. First, if you look at the minor faults, the graph at the bottom, the minor faults go up quite a lot, and this is even with 2MB pages. I actually did some tuning on the GCC garbage collector to map better to 2MB pages with transparent huge pages, but even with that, the minor faults are going up quite a lot. I mean, with 2MB pages you have fewer minor faults, because every fault maps 2MB instead of 4KB, but it's still going up quite a lot. The more interesting part is the active memory; here I was plotting the same build. You can see, first, there's quite a bit of active memory use at the beginning, but that's not too bad, it's about the same as what you would expect from a normal build; that's just the LGEN phase. But then there are spikes in the middle, which are the type merging. This is when it reads all the type information from all the object files: they all have their own types, but it needs a global symbol table for the global types, so it has to merge all these types together, and this causes very large memory spikes. Another problem I had at the beginning, back in the 4.6 times, was that I had TMPDIR on tmpfs, and in GCC
the driver puts the partitions into TMPDIR. The problem was that there were some issues with the way the partitioning was done, so it was putting a lot of data into /tmp, and at the same time it was using a lot of memory to do the type merging and generate code and so on, and all of this competed. So I got some hard-learned lessons out of this. The first one: don't put the compiler's temporary files into a tmpfs /tmp; build your program with TMPDIR set to the object directory instead. Otherwise, with TMPDIR on tmpfs, the temporary files essentially compete with all the memory consumption of the compiler. The other problem, and this is also related to the jobserver bug, which probably needs to be fixed: a too large -j with large modules can also cause memory problems, because if you're really unlucky, the type merging phase of a large module comes together with the vmlinux link, and you might also exceed your memory. So you have to be a little bit careful about that. The partitioning algorithm got a lot better with 4.7 and also 4.8, in terms of how much data it generates, but I still recommend not to put the temporary directory into tmpfs; this can still cause problems. So here's a large build, the allyesconfig build. This is a much larger configuration: you basically enable everything you can and put it into the main vmlinux, so it's like the worst case for LTO in terms of memory consumption and build time. And the build time is pretty long; I disabled the kallsyms symbol table here,
so it only did the back end once. But you see, it's something like 42 minutes on our 8-core system; it's really quite a long operation, and with kallsyms it's even longer. And with that one I currently have a 15-gigabyte memory peak, so you definitely cannot compile it on your laptop, unless you have a really big laptop. I should add this is a somewhat unrealistic configuration, because normally people don't use allyesconfig; it's more something you build-test with. But it's still an interesting test case for GCC. So you can see the memory can go up quite a bit, and there are really long phases again which are single-threaded, and most of that is actually the type merging. I can actually prove it here: I profiled it, and you can see that most of the time is spent just merging and looking up hash tables, and most of it is type merging. I was actually told by Honza, who did a lot of LTO work on things like Mozilla, that he thinks the way the kernel does types is the worst case, worse than other programs.
I cannot tell you if that's true or not, but at least I know that I spend a lot of time in type merging. So I actually spent some time looking at this and trying to make it faster. Type merging is a somewhat complicated problem, because in the tree data structures that GCC uses, when you have a type, for example a structure and two pointers to it, you have a recursive data structure which points to all these things, and this all has to be unified: you have different versions from every object file, and they all have to be unified into one. It does this with iterative hashing: it iteratively hashes the objects and then puts them all into hash tables. It actually has two different hash tables, and, this is a really scary thing I hadn't seen anywhere before, it has hash functions which have another hash table as a cache. So it's a pretty, let's say, complicated operation, and I actually ended up, in the large allyesconfig case, with several gigabytes of hash tables at the peak, and it was just spending a lot of time there. So I did some investigations into how to make this faster. For example, one thing I tried, because you saw in the earlier profile that a lot of time was actually spent in malloc, a little bit down there, was tcmalloc instead, the Google malloc. It was actually the biggest win: I got nearly a 10% improvement. Unfortunately, it also increased the memory consumption by about 45%. So it's faster, but it uses more memory. But it's a pretty nice improvement; it might even be something interesting for general users, and GCC could investigate a different malloc. There seem to be some possibilities here.
That seems to be different Some some possibilities here then another thing I tried is because I noticed that a lot of time was spent increasing the hash table because the it uses like a standard Knute hash and if the hash table fills up by 50% it doubles it to the next prime And it's been a lot of time resizing the hash and especially with my large hash table. It was Oh Over two gigabytes. It was spent a lot of time just doubling it and then reallocating it and copying it and so on so I wonder one experiment I did was to with the small set up by was to We just added added a tunable for the hash sizes and just started off with them with the mains with the actual The target size so I knew it before I measured it before I start off with the size There's a similar technique like what's for example a threat sanitizer in a previous talk was using you say Virtual memory is free. I just preallocated. I might not use all of it, but I can use it So it's actually a valid technique. Yeah, so you could say I mean it might in some cases use some more memory because If you don't compress the hash table by a smaller hash table to hash by hash table You must use more memory but if you don't use too much of the hash table still valid to have like just I look at a really large space and use use it without having paying the overhead of regular like copying it and doubling it and so on so this gave me something like About 4% No, sorry. No, it's about somewhere 2% so starting with the full hash on the small. Yeah, that's that question 0% So I mean the chart is just various deltas. I got from the improvements So so and these are the deltas in terms of full time and this was more conflict So you see larger improvements if you use the large conflict The viax is a delta and build time and second and percent of seconds So another thing I tried was to improve the hashing So I used the Moomer hash instead of the J hash and the other thing I did I did multiple changes. 
One thing I also noticed is that there's actually a lot of what's called pointer hashing, where it just hashes pointers; it's a somewhat weird concept, but it does it, you can see it here in the pointer_map. But the pointer hashing was not actually using all the bits of the pointer, so I changed it to a different hash function that also uses all the bits. The other thing it did that was a little bit weird: it generated a Jenkins hash and finalized it, but then it iterated on that, so it took the final value and hashed more values into it. But most of the iterative hashes I've looked at keep an expanded internal representation and just add more into it. I mean, I'm not an expert on hash functions, but it seems to me that would probably be a better way to mix all the hashes. So I switched over to murmur, which is a modern hash, and it keeps an expanded internal representation and just adds more into it. For example, I was looking at the collisions in the hash table data type in GCC: originally it had over 90% collisions, and I got that down to 70%, and then 64% with these changes. And it gave me something like a 4% build time improvement for this case. So there were some improvements you can do with the hashes, but even with that, it was still pretty slow. I also tried a splay tree at some point, like the splay tree GCC has; I don't have the numbers, but it was really, really slow. The splay tree doesn't seem to perform well.
So I actually gave up on that. I still think there must be more changes possible, so in the future it might be possible to do more on the hashing. One idea, for example, was to precompute the hash in LGEN and stream it into the object files already, because to my understanding the biggest problem with the hash table is that there's not enough locality of reference: if you do hashing, you just jump all over several gigabytes of memory, and that's just slow. So the idea was: if you make the initial hash table large enough and you precompute the hashes, you can stream out the types in hash order, and then if you fill in the types later, it should be roughly in the order of the hash table and not all over it. But this only works because the initial phase doesn't know the collisions; it only works if you make the hash tables large enough that you have very few collisions, because otherwise, if you have a collision, you're already out of order again. The other idea, and I'm not very sure whether it would work or not, is to only merge per partition: so actually not do the merging before you partition, but do the merging in parallel when you do the partitions. But there might be some dragons here, so I'm not fully sure. And there might be other possibilities to parallelize WPA further, but I'm not sure. So, incremental linking. This was another issue I spent quite some time working on. The kernel build system uses ld -r and ar extensively, and originally the problem was with ld -r: if you only have LTO C files in there, that's okay, but if you also have some assembler files in there, .S files, they get thrown away during the LTO phase, because the compiler reads in the object but just ignores everything it doesn't know about, and any assembler in there just disappeared. Obviously that didn't work for the kernel, because it has, well, not that many, but it has
a substantial number of assembler files. Then the other problem, originally, with ld -r was the LTO object sections: they all had the same name, and what happens is that the link merges everything with the same name into the same section. But the problem was that every function, or every individual object file which was there before the ld -r, was in its own single section, and GCC didn't understand this: you just had a lot of concatenated sections, but it was only looking at the beginning, so it didn't really work. So one thing I did was to just add random suffixes to the LTO sections, so that you have multiple ones that don't get merged during the ld -r. The problem was that originally I made the hash too small, so there were some really nasty build problems: in rare cases you got some weird linker errors. It turned out I needed to double the size of the random suffix, but then it worked. But anyway, there was still the problem that you drop the assembler. So my original approach was: I implemented what's called slim LTO; the actual option is called non-fat LTO. The idea was that original GCC always generated everything twice: the binary code in the object file, and then the LTO information, and later on, when you do LTO, you read it back and throw away the code which was generated first. Generating it twice has a few advantages: especially, sometimes when people make a mistake in the build system and something doesn't get LTO'd, then you have a fallback, because the symbol still appears somewhere; it just falls back to non-LTO. But of course it's slower. But anyway, the original idea was: we do slim LTO, it should be faster.
It was faster, and the other part of the idea was: everything left over that's not LTO must be assembler, so we can just change the compiler to pass it through and generate it on the outside again when it generates the object files. That was the original idea. But it actually turned out to be done in a different way: H.J. Lu fixed it in the Linux binutils. He added a similar heuristic there, where the binutils detect this case and basically pass it through when they call the linker plugin. Unfortunately, the problem with this approach, which worked for me, was that binutils mainline didn't want it. The argument was something like "we don't want people to use ld -r", and then the discussion stopped. I mean, I am not rewriting all the kernel makefiles to not use ld -r, so I said okay, I'll just use the Linux binutils, if you don't want to support it. The patches are out there, but they're not in mainline; this just means you currently have to use the Linux binutils from H.J. Lu to do this. So that was incremental linking. Then, kernel modifications. I don't want to go through all of this, but I had to do quite a lot of changes to make things work, so I had a pretty large patchkit. One especially big area was section attributes: the kernel marks some data sections and so on as init-only, so that they can be thrown away after the startup of the kernel. But the problem is, a read-only section has to be marked read-only versus read-write, and normally the linker doesn't enforce this, so often people got it wrong.
For example, they put the const in the wrong place; sometimes you get const char * versus char * const and so on. The problem was that LTO enforces this: it actually errors out when you put something read-only into a read-write section, or the other way around. So I added a lot of changes to fix this; most of the patches got merged by now. The next problem was top-level inline assembler. I'm using -fwhole-program, and people there were often assuming either that, if something is in top-level inline assembler and it's in the same file, the C code can just reference it, or the other way around. But if LTO doesn't know about this, it might just discard the symbol, because as far as it can see, nobody references it. So it essentially means you have to mark everything that's used in top-level inline assembler, and everything it defines as a global symbol, as visible. Yeah, that's one thing. And there were a few related bugs: sometimes people were assuming that something is in the same file, but partitioning can actually move things to different files.
So sometimes you had problems there, and there were a few changes like this; I had to add a lot of visibility annotations, especially also for assembler. Then a big problem: GCC, if you use -fwhole-program, essentially turns everything into static, because it essentially is static, and for static symbols, standard GCC for some reason adds a .number suffix to the symbol name. This caused a lot of problems, because there are various tools in the kernel which postprocess objects, and a few other things, and they all couldn't deal with this. At least one feature which doesn't work because of that is modversions; this is a check for an exported symbol that takes the type and generates a hash of the type, and there's still some problem there. So I had to do a lot of changes for these numbered symbols. We had also one interesting change: there are some auto-generated symbol names which can be very long, and we overflowed a buffer in one of the tools which generates the table for kallsyms, which also caused some really interesting bugs. So I also had to disable LTO for some files. Then 4.7 still has some bugs in the partitioning, where sometimes you have to make random things visible, otherwise they disappear. I also had to write gcc-ld: with LTO you have to call gcc to link, because the compiler is doing the linking, but the kernel calls ld, so I did a wrapper which turns the ld options back into gcc options; it's a pretty hairy shell script. And a few fixes for optimizations, but those were very rare, so I didn't have a lot of problems with that, just a few.
But it was still a substantial kernel change; this, for example, is the size of my current patchkit. Debugging: currently debugging can be more difficult, because LTO does a lot of additional inlining. For example, one thing is, if you have a static kernel, there's a lot of code which is called once at the beginning, and if it's not called indirectly through a constructor, it all gets inlined into the same function, because GCC has the heuristic that if something is called only once, you inline it, because it's free. So you end up with some really, really big functions. One problem is, if something goes wrong in there, normally you can look at the backtraces and say okay, it happened in this driver, or it happened here or there; but if it's all in a single function, then this is much harder. So it makes it somewhat more challenging to look at crashes. One idea I had, but it would be quite a bit of work to do this, was an inlining-aware backtracer: basically look at the DWARF2 inline information, and then say okay, this region is from an inline, and put markers for that into the kernel symbol table. We could do it; I know he's cringing a little bit, but I think it might be useful in some cases anyway, because even without LTO you might still have a lot of inlining. I'm not sure I want to do this, but it might be a possibility to handle it. The compiler was another problem: I actually hit a lot of compiler bugs. There are a lot of regressions; for example, somebody just added a new patch to the 4.8 branch after the 4.8 release, like in the last month or so, which breaks my build again, and I had a lot of similar problems in the past. So I triggered a lot of interesting compiler bugs. The problem is, I don't have simple test cases, because normally you have a file, you bisect it, and the compiler crashes; but I don't have a file, I have a complete build tree,
which is really big. So it's actually pretty hard to generate a test case. It's possible to delta it, to try to track down where it is, but it takes a long time; it usually takes me over a day to delta something. So this is definitely a big problem. And it's not traditional compiler debugging with just a single process. [Audience question, about stack usage from inlining.] I haven't checked it, but modern GCC has a heuristic in the inliner which limits stack growth from inlining, and we enable it for the kernel; but I can check it if you want. I believe this has also been improved in modern GCC: old GCC had the problem that it always allocated all the locals separately, but at some point they added a special change to allocate this more smartly, I don't remember in which version. But it's a good point, I should probably check that; I haven't. So it's a complex multi-process setup, and it can actually be quite difficult, if something crashes, to attach the debugger to the right process, because even if you just start with the linker, it goes through multiple calls to different programs until it actually gets down to your compiler. So for attaching and profiling I had to use some special tools. I found that the most useful way to debug the compiler is to enable core dumps: there's an obscure option called -dH which says, if something bad happens, dump core, and then you just look at the core dump. I found this is the simplest way to at least see what's going on. So debugging is definitely a challenge with LTO, both for the compiler and for the kernel. So, global optimizations: what actually got enabled? This was my small config, and I was looking at what happened here. I had 21k new inlines between files, of those 13k unique, so that's simply quite a lot of additional inlining going on. By the way, there's
a nice way to dump a Debug file in gcc which gives you all the information And actually wanted I I did a script to put it on to graph this and generate a big graph But it overwhelmed graph with so the pdf was not readable I was actually hoping to plot it on a plotter like they have a large plot at work But I think it was too big for that Because I had too many notes But if somebody wants to I still have the pdf so why do you want to see it? But it's not very readable Okay, but anyways so inlining I got quite a bit new inlines So scalar replacement so that means that that it breaks up structures and does it on the on the function level There was no change of money So I have a few but not too many and I didn't really change it so partial inlining that means that it Inline only part of For example, you always have a if at the beginning it might inline the if if it think that it's a hot if The partial inlining I got quite a few 16 plus So constant propagation means it generates Generates a special version of the function For example, it determines that the function is always called with the same argument So it can actually generate a special version of the function which is optimized for that argument So you have like a for example, you always call function with one So you generate f dot one and you you have a you just constant propagated in the function So quite a few more of that 40 percent So there's also an optional flag It's called cloning for constant propagation. 
So normally it does this constant propagation only when it's unique — when it's always the same argument. But with cloning for constant propagation, it can decide to do multiple versions: it can say, okay, a lot of the calls — not all of them, but a lot — have this constant argument, so I'll do a special clone, a special version, for this. I got quite a few of those — from originally 266 to over a thousand. But unfortunately it cost me 200k, so 2 percent of text — it's a little bit big. So right now I have it as a config option, and it's not clear it's helping all that much. But these are some of the possibilities.

So, static instruction changes. I just looked at the vmlinux, comparing it — these are always the deltas. I have 4 percent fewer calls, so there's a lot of inlining going on, and my functions are over 8 percent larger on average — but the standard deviation is also going up somewhat. I have over 8 percent fewer pushes — so there's definitely a lot of optimization going on in terms of register use and so on — and similar with pops, like 7 percent fewer pops, and something like 1.5 to 1.8 percent fewer moves. So there are definitely noticeable changes in the generated code.

Runtime, I would say, was somewhat mixed. For example, this is LKP — that's a suite of kernel test cases. On the newer CPUs there's a small improvement, but not very much, and on the older CPUs there are some larger improvements. My theory is that the newer CPUs can deal better with worse code, so the older the CPU, the more it benefits from the better code. There were a few other things. Was there a question?
Okay. But I mean, I would say the wins currently are not extremely large. We actually tried a few different things. One issue — my theory of what's going on, why I don't see larger wins — is that the kernel does a lot of indirect function calls. There are a lot of indirect pointers, for example in the VFS, so you have to go through the pointer, and that's always a barrier for the optimizer: the optimizer cannot look through it. So one thing I want to try, which I think is very interesting, is to get some kind of devirtualization going. Devirtualization means you have some hint, or you use a profiler, that says this pointer most likely goes here. So essentially you can add an if: if the target is this one, call it directly and inline it, otherwise go through the indirect pointer. So you can actually optimize across this.

Another thing I want to try is Atom testing, because Atom is a small core and it's much less tolerant of bad code than a big core. This is also something I plan for the future. So right now, I would say, there are a few wins, but it's not really compelling at this point in terms of performance so far.

But I have some more. For example, text size. For my small config on x86 it actually goes up, about 3 percent in text size. But if you have a really small config — basically the smallest config — it goes down by around 50k. I think the reason for that is that the standard kernel is like a library, and if you don't use everything, there are some unused functions that can be thrown away. I should add one problem: this works best when you have a non-modular kernel, because with a modular kernel some stuff is exported again, and that prevents it from being removed even if you don't use it. So generally LTO works best non-modular, because every module and every export is a barrier the compiler can't look through.

So for example, one thing I had from Tim Bird: he ported LTO to ARM, and he was very interested in text size, and he got some improvement on text size — it was about 4.7. I think he also got a slightly faster boot, but it's not a big improvement. This was an embedded system with a relatively small config, so they were interested in using this — they're really fighting for every byte of flash. So it helps somewhat there.

So these are the improvements I would like to have. Especially per-file options — this actually breaks some stuff for me, because there's no way to set a different option for different files. The most common case — and I have a partial patch for it, but it's not integrated right now — is, for example, the function graph tracer in the kernel, which instruments every function using instrument-functions, but it has to be disabled for some files where it doesn't work. Right now you cannot disable something per file, because all the options apply to everything in LTO. So there are some changes needed there.

Then, fix the passing of options. In theory, if you specify an option at the object file build, it should be used at the end in LTO — it actually has code to do this — but it only works for something like half the options. So I had to manually hack up my makefiles to handle this. This is something that could be improved; right now it doesn't work very well.

Then, having a standard GCC LD wrapper: when the makefile calls ld instead of gcc for linking, you need a wrapper. Right now I have my private wrapper, but it would be much better if there was a standard one.

Then, don't error on an option mismatch — as I already mentioned, this is just a pain. As far as I know it's completely useless, because there shouldn't be any reason to error out when this happens.

Then, a better procedure for LTO bug reporting — I think that's really important, because I hit a lot of bugs, and I can kind of deal with them, but I'm not sure that everybody can.
So I think there needs to be a better way to generate test cases from LTO — maybe some automated delta procedure or whatever — but right now, for the compiler, it's a big issue.

Another problem I had is type mismatches. Fortunately I didn't have a lot of them, but a type mismatch means that something is different between header files although it appears to be the same type — and LTO can't deal with that. Everything has to be the same type, essentially from the same header file. Right now you just get a very cryptic error, and it can be hard to figure out, because if the type mismatch is not in your top-level type but somewhere at the bottom of an indirectly referenced struct — and every struct that's referenced is part of the type — then it can be very challenging to figure out where the mismatch actually is. As far as I know, the current best way is a special patch from Richard Biener that you add to GCC, and then you have to run GDB on GCC and call a special function from the debugger. That's currently the best way to do it, so this is definitely not very good. For type mismatches, I think there just need to be much better error messages.

Then the X66 postfixes — I believe there's already some work on this, but as far as I know it hasn't really been fixed yet.

Another problem I had was top-level reorder. Normally, when GCC does partitioning, it will freely reorder everything to fit the partitions well, so that the partitions fit the call graph. But the kernel has what's called initcalls — these are essentially constructors, but done in a custom way — and these initcalls cannot be reordered, because the code doesn't tolerate it when some initcall which used to be late is suddenly early. I had a lot of problems with that, so I had to use what's called -fno-toplevel-reorder. But it's only needed for the initcalls, and it would be nice if it were per symbol, so I don't have to disable everything. I did some measurements: I'm losing some inlining because of this, since disabling the top-level reordering changes the partitioning, and then it cannot match the partitions as well as it could to the call graph — and you can only inline inside a partition. So essentially disabling the reordering changes the inlining. So I would like to have it per symbol, or some other solution for this.

Another thing that's really a pain currently, especially if you have assembler, is to manually figure out what needs to be externally visible. It would be nice if there was an automated tool — I think you could actually discover it at the linker level, with a plugin or something like that. The other problem is: when you have, for example, a function which is called by top-level inline assembler, right now the only way to handle this is to also make it externally visible. It would be better if you could mark the inline assembler to just say: this references something. The same thing with statics — sometimes assembler calls statics, but the only way to describe that is to make them global, because otherwise there's no way to say this is used by something else which is assembler. So I think this is something that should be fixed in GCC, if possible.

So, status: I mean, it kind of works on x86 — that's what I primarily use — and on ARM and MIPS. Tim Bird is doing ARM and Ralf Baechle is doing MIPS. It's up to date with Linux 3.8, and some options are still disabled, especially function tracing, gcov and modversions, due to some of the problems mentioned. Okay, that's what I have. Here are just my references, and I'm open to questions.