So, if you're still awake: every year I have to drag Jakub out here, he's not coming voluntarily. He's a recluse, even though he lives just a couple of hours away from here. We're trying to get you more interested in the compiler side. Last year we already had a talk on compilers, about optimization et cetera. This time it's a little different. We're touching the same topic, but it's mostly about how you can actually peek into the compiler, how you can see what is going on and understand it. Because, believe it or not, when you invoke GCC or G++, that's not a monolithic thing. There are all kinds of things going on in the background. Hopefully after this talk, if something goes wrong or doesn't work exactly as you expect, you will be able to peek into some of these details. And this is also something dear to our hearts: when something goes wrong, give us much better feedback on what is going on. So, to state the quite obvious: a compiler is a very large program that does a lot of things, and it's quite hard to understand what each part of the compiler does and how it fits together. This is just a small picture of the general sequence of steps during compilation. We take the source code, then there is a front end. GCC has multiple front ends: C, C++, D, Ada, Go, and many others. They transform the source code into something the middle end can use.
Then there is the middle end, generic code which handles input from all the front ends and optimizes it. The code generator then works on another intermediate layer, another language which describes things closer to the machine code and which is specific to each of the back ends we support, like i386 or AArch64 and other architectures. GCC emits assembly as a text file. Other compilers sometimes emit the object file or binary directly; GCC does not. It emits a text file that needs to be assembled by another program, usually the GNU assembler. What this produces is an object file, which is then linked by the linker, and that can actually be executed. So this is the architecture that compilers like GCC, but also Clang, follow. That doesn't mean every compiler has to look like this. You will often find people on sites like Hacker News saying, oh, my compiler is 100 times faster than GCC. Well, but they don't implement the same thing. As Jakub mentioned, we need a clear separation of the front end and the middle end because we want to support multiple languages, which also means we need a representation of the program that is independent of the front end, or mostly independent, as it is handed to the middle end, to the optimizers. The same is then also true for the back end. Because we are targeting, I don't know how many there are, 50, 60, 100 different CPU architectures with the same compiler, there needs to be an interface between the middle end, the optimizers, and the back end, the code generator, which is generic enough that it can actually target all the different architectures.
If you are writing a C compiler which only targets, let's say, x86 or ARM or something like this, you can write something which is 100 times faster than GCC, but that's not the same beast. It's something much simpler: the individual steps we just saw can be merged together, skipped, optimized, et cetera. So don't forget this. This is a critique you hear often, but it's completely invalid given the goals GCC has. Different compilers can have different goals, and if compile time is the most important goal, then of course the compiler might generate slower code but be much faster itself. For C and C++ we actually have a preprocessor as well, which we'll talk about later. The preprocessor does the preprocessing phase described in those two standards. In the past, like 15 years ago, GCC had a separate program which did the preprocessing and then piped the result into the C compiler. That's not done anymore. If you want, you can preprocess separately and then compile the result, but unfortunately that loses some information: for instance, we have some warnings which depend on comments in the source code, and those are gone after preprocessing. And there are other things there. We can do what are called precompiled headers, which also shortcut some of this. It means that some of the system headers which you see at the top don't necessarily have to come in in textual form; they can come in some binary form, which can speed up compilation significantly for large projects, specifically C++ projects. And of course every one of you is following the C++ standardization process and knows that in C++20 we will get modules.
Modules will also be implemented basically on the front end side, where instead of injecting the system headers, and even the headers of other projects, in textual form, they will come in this module form, which most likely will have some binary representation. Modules are already mostly implemented in GCC, but on a side branch; it hasn't been merged, unlike coroutines, for instance. So it will be next year that they make it in. And with modules there are some issues, like with make dependencies and such. This picture also shows what we do with LTO, link-time optimization, in which we are able to look at the whole program or whole shared library, analyze it, and optimize it even more. What actually happens is the front end emits an intermediate language, which is then optimized by something like 70 different early optimizer passes. We also store information about different attributes of the functions, and store everything into the object files. Later, at link time, we can read it again, see the whole picture, split it into small partitions which we then actually compile, and run a further 150 optimizations. We'll see a little later how that process works, and then we go into some more of the details, because there you can actually see this. The process here in general works at what is called the compilation unit level. Every single input file, a .cc or .c file, is handled individually, and for each of them an object file is produced. Even if the compiler is given multiple .cc files on the command line, they are handled individually. This can be a problem at the highest optimization levels, because you might want cross-compilation-unit optimizations to happen, and this is where, for instance, LTO comes in.
We'll get to that later in a bit more detail. This slide just shows that there are some points where there are interfaces between the different intermediate languages. The interfaces are unfortunately not very clean. There is a way, for instance, for the middle end to call language hooks and ask some questions of the front end. But these are usually done before the LTO data is written, because afterwards it can come from many different languages. On the other side, there are target hooks, callbacks which provide machine-specific information already in the middle end, so the target can change different things there. So all the compiler engineers among you, I guess half of you, will be able to extend the compiler specifically at these interfaces. If something completely new comes along, theoretically, yes, you can do all kinds of things there. You can replace pieces, as long as you follow the interfaces that are there. As Jakub said, it's not 100% clean at the moment, but if someone really wants to and invests the time, it can be done. At least the interfaces are attempted to be kept at a minimum, so you can actually implement something there. For instance, the D front end was added just recently. This was possible because there is a more or less fixed interface to the middle end. And we have a couple more coming in; I think Pascal and Modula-2 might want to come in sometime soon. Okay, so this slide shows what processes are run when you actually run the g++ program. As you can see, g++ is not the actual compiler. It's a compiler driver that calls other processes, creates temporary files, and so on. All the triple dots are abbreviations; there is a lengthy path there.
You are not supposed to care what that path actually means in practice. But what I left in is the 9 there: my computer had GCC 9. You can have multiple different compilers installed on your machine and, more or less, work with them at the same time in the same installation. They're all hidden somewhere in a deep hierarchy which you should not care about. The compiler is also built as a relocatable package, so if you store it at some other path, it should still be able to find its files on a path relative to the compiler driver. Here we see that when we compile a project with two C++ sources, the driver actually runs the compiler first. In the past it would first run the preprocessor and store the result in a file or pipe it into the compiler. cc1plus is the compiler; it's called with the first source and produces assembly somewhere in the temp directory. Then the driver invokes the assembler and assembles that into an object file. One can use the -pipe option for the driver, and then instead of using temporary files it pipes the output of the compiler into the assembler. There are things to notice here: as Jakub said, the preprocessor does not run separately anymore as it used to, but you can still get at it, and we are actually going to show this. What you see here is that the compiler, cc1plus, outputs a file with .s, lowercase s, as the extension. That's deliberate: a lowercase .s in GCC terminology means an assembly file which does not need preprocessing. If you're writing assembly by hand, you're supposed to use a capital .S, which means assembly code that requires preprocessing; in most cases that is what you want. And the compiler driver, gcc or g++, knows about these extensions.
It also knows about the extensions we used to use before we had the integrated preprocessor: .i, which is preprocessed C code, and .ii, preprocessed C++ code. You can still use these today; if you put .i and .ii files on the g++ command line, the compiler still knows how to handle them. The other thing to point out is that if you have large projects, with many, many files to compile, it is a bad idea to invoke the compiler once with all the files on the command line. Why? You don't get multiprocessing. Nowadays we have multicore machines, my machine at home has thirty-something cores, so I want to run the g++ or gcc driver multiple times in parallel to get compilation happening at the same time. So don't do it like in this example, with multiple files on the same command line. Another reason is that if you change just one of those files, you unnecessarily recompile the others, which have not changed. Make has an option to put only the files which actually changed on the command line, but it's still a bad idea. Don't ever do this; have a separate rule for every single input file. collect2 is just a wrapper around the linker; on Linux it's mostly useless, it's mainly there for some weird architectures. It used to be very important in the past: it did things the linker itself did not. GCC can be used on proprietary architectures as well, where we don't have control over the linker and might have to work around idiosyncrasies or failures or bugs in the linker, and collect2 allows us to do that. Again, GCC is not just modular in the way it's structured, but also a compiler which can be used on all kinds of architectures.
Nowadays we don't use any of the proprietary Unixes anymore, but back in the day we had SunOS, Solaris, Ultrix, and so on. The unifying thing was that on all of them we used GCC. And on all of them, unless we were able to use GNU ld as the linker, we used the system linker; the system linkers were not uniform, and we had to work around their limitations. This is how collect2 came into being. collect2 can do things like parse the error messages of the linker, and if some symbol is not defined, instruct the compiler to compile something again; this was used for some ways of doing C++ compilation which these days are not actually used. This slide just shows that in the distribution, I think, ccache is installed by default. That's another program, early in the search path, which remembers preprocessed sources that have already been compiled in the past, and avoids recompiling anything that has not changed. It's quite fragile, which is why it's specifically on the slides, because it requires a lot of knowledge about the compiler it invokes, and every once in a while something breaks ccache: the assumptions it makes about the compiler are not valid anymore, and then you have to invoke gcc or g++ directly to do the compilation yourself. But most people really have no idea that this is going on, that when they type gcc they are not actually getting the GCC binary, they are getting ccache. ccache does the preprocessing separately and then feeds that to the compiler, so there's the issue with lost warnings and exact locations and so on. There is a way to do it better in GCC: we have the -fdirectives-only option, which does a limited kind of preprocessing that just merges the included files and does nothing else, but ccache unfortunately does not use that option.
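What -fdirectives-only does can be seen by comparing the two preprocessor outputs; a sketch (names invented):

```shell
printf '#define ANSWER 42\n' > conf.h
printf '#include "conf.h"\nint main(void){ return ANSWER - 42; }\n' > prog.c

gcc -E prog.c -o full.i                      # full preprocessing: macros expanded
gcc -E -fdirectives-only prog.c -o light.i   # only #includes merged, macros kept

grep '#define ANSWER' light.i   # the macro definition survives
grep 'return 42 - 42' full.i    # here it was already expanded
```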
If you invoke the compiler driver with -v, you can see a lot of detail about many things. There's tons of text coming out, as you can see. Much of it is perhaps not that interesting for the normal user, but there are a couple of things which should be. For instance, look at the lower highlighted parts: these are the paths the compiler looks at first, before it looks anywhere else. Before it would normally look into /usr/include, there are actually a couple of directories it checks first, for various reasons. There are things like offloading, offload compilation, which we'll talk about later: which offload targets are available you can only see from this output; normally this is not something a compiler advertises in any other way. So this kind of information is there. And if you want to replicate the compiler, because you might want to compile it yourself to poke at these things, either you have to know all the configure options by heart, because they select these kinds of things, or you look at the second highlight, which shows you exactly how to replicate the compiler: you can get the appropriate source code, and that's the command line for the configure script to get exactly this compiler. So it helps you get at all this information; and that's just one of the screens. Then it shows the exact command lines of the different processes it runs. There is the cc1 command line, which shows how it compiles the C file, and later on there will be the assembler invocation somewhere. Here the input is .c, so C code: cc1 is the C compiler, cc1plus the C++ compiler.
These command lines are useful when one uses, for instance, the -save-temps option, which doesn't use random temporary files that are immediately removed but actually keeps them on the file system; then you can just cut and paste those command lines, run them by hand, and change something. LTO, as we said before, is link-time optimization, where the compiler can see the whole picture, and it's something that's going to be enabled next week in Fedora, for the entire distribution. That's a big thing. We'll probably have to opt out a couple of packages, either because they use inline assembly in a way that's incompatible with LTO, or because there are other issues, and we will see how well it works on the 32-bit architectures, whether the larger compilations actually fit into the available RAM on the build servers. We've made strides there: you can compile Firefox with LTO and it still works on a decently sized machine. And SUSE actually switched to LTO by default last year, so they already have some experience; we are trying to gain it as well. As we mentioned, LTO is something which happens, as the name suggests, at link time. The first part is basically the same as in the previous slide; the only difference is passing the -plugin option to the linker, pointing it at the LTO plugin, and that's basically it. The linker has a plugin system: if the linker sees specific sections in the object files which the plugin is interested in, it invokes callbacks in the plugin, lets it handle them, and gets back an object file which replaces the original. And when we say the first part is pretty much the same, there's one change which is not visible here but which Jakub already mentioned: additional information is emitted into every single object file.
Basically, the intermediate representation of all the code of the compilation unit is part of the object file, in addition to the normal generated assembly code for the module. Well, there are actually two modes of LTO. One is the slim mode, which emits just the LTO data and not actual assembly, just the symbols, and that's it. That has a disadvantage: it's something we can't easily use in the distribution, because for static libraries, for instance, we want to strip the LTO data, since it depends on the exact compiler version, and we don't want to recompile the static libraries every time we change the compiler. So in the distribution we emit fat LTO objects by default, which include both. If you don't invoke the linker plugin and just throw the LTO data away, the object links normally, as if you hadn't compiled with LTO at all; but if the linker plugin kicks in and optimizes, the normal assembly can be completely ignored and replaced later on. The linker plugin spawns the lto-wrapper program. The weird syntax with the at sign at the start, @file, is a way to store multiple options in a text file, so that temporary file includes many, many options which are not listed here. The lto-wrapper invokes the compiler driver again, passing some options, and the sources the driver tries to compile are actually not normal source files but the object files; it's instructed to treat them as LTO input, so it reads the data from the object files and compiles that. lto1 is the equivalent of cc1 or cc1plus, but in this case we don't need to differentiate between the languages anymore: as we said before, the intermediate representation, once we are out of the front end, is mostly independent of the language.
So we have this as a way of unifying the process, and we only have to make sure lto1 can handle all the object files, which is why the intermediate representation is part of the object file. In this case it's done in two steps, as you see there, for each of the input files. The first phase is WPA, whole program analysis, the phase where the compiler sees all the objects at once. Because the program could be extremely large, it really can't read all the bodies of all the functions; it mostly parses just summary information and decides, also based on command-line options, how many partitions to use. A partition is some related code which needs to be compiled together to be beneficial. WPA then splits the data from the input object files into other object files, which are fed to the next lto1. The second step is done once per partition, so for a very large project there can be many partitions; this is a very small project, so it has just one. The lto-wrapper can even invoke make and ask it to handle the spawning of the processes that run the compilation of the different partitions. If it's invoked from some outer makefile, like when the GCC driver is invoked from a parallelized make, it can even talk to that make and cooperate on how to split the work across processors. This only works with GNU make, which can act as a jobserver, and the right options need to be passed down from the command line. So there are ways to handle these things. But let's get to a couple of examples of where LTO is actually worth all this trouble, all these additional processes that are going on.
Imagine you have compilation units and you're calling one function from another. Especially in C++ code you have things like template functions, and you often get told to write very small functions, which means the function call overhead might actually be larger than what the function does by itself. There is a very frequent optimization countering that, called inlining: it takes the function definition and puts it in place of the call. If you compile the normal way, with individual compilation units, this cannot happen when the definition and the use are not in the same compilation unit. Well, it depends: if you place those small functions into header files, it can. Yeah, in header files, but if they are in different files, then you don't have this possibility, and LTO works around that, as one of the possible things. It actually sees, as Jakub said at the beginning, the whole program at some point, and it can decide: oh yeah, that function is actually a candidate for inlining. And this happens automatically. Before there was link-time optimization, whole-program optimization, developers who wanted this had to carefully design things with header files and other tricks to make sure they got exactly the program layout and optimization they wanted. With LTO this is not needed, and given that the main effort in programming now is really to get the program done in the first place, and to optimize only if there's really the need or the possibility, optimization is a second thought most of the time. So having LTO do these things for you is a big, big win, and this is why we're enabling it for the entire distribution.
There are many other things it can do. It can find out that all the callers of some function, which may be in some other translation unit, pass, for instance, some range of integer values, and it can propagate that range into the function and optimize it. You can see this in the generated code: you have function names, and some of them have "clone" somewhere in the name; if that's the case, this kind of optimization kicked in. It can take all kinds of forms. There's range propagation. It can also be that the function is always invoked with 2 as the second parameter; that's constant propagation. We can also optimize on the value returned from a function, if we find out the function returns only this value or only this range. Or, for instance, some functions take a structure but actually use just a single member of it; in that case we can change the calling convention and pass just that element, and many other things, or return just one element. So all kinds of things can go on. LTO is a big thing; it adds cost, but it's worth it in most cases, hopefully. Especially when you combine it with profile-guided optimization, where you compile an instrumented program, run it on some benchmark or something which represents normal usage, and then compile again with the gathered data: how many times some function has been called, how many times some if statement went to this branch, how many iterations some loop had. So if you're writing code, make sure your production code uses LTO. You don't need it in the normal development cycle, because it's additional work, but whenever you come to the point where you want to see how the program would behave in production, turn it on and compile this way. It should not change the semantics.
With fat LTO it at least doubles the compile time, because it needs to compile everything normally and then do the LTO compilation as well. Question from the audience: if I'm passing, say, -O2 or -O3 to GCC, a lot of that work is happening in the compile stage; is it worth throwing those optimizations away for the extra-quick compile time? It completely depends there as well. Even with LTO, some optimizations are already carried out early. But the question goes in a different direction: how do you get the highest compile speed? Obviously, if you're not doing any optimization, you're not spending any time in those parts of the compiler. But quite honestly, I'm not sure; I haven't done the experiment recently. About 20 years ago I did an experiment comparing compiling without -O, without any optimization, and with -O1, and the difference in the size of the generated code in many cases negated the effect: with -O1 you at least get some simple optimizations going on, which can in some cases reduce the size of the assembly file, and of the internal representations, so dramatically that compilation isn't actually slower. It might be different today; back in those days machines were much, much slower, and maybe we're over that point now, so it would be interesting to measure again. But I almost never compile without a minimum level of optimization. The only reason to skip optimization is if you're losing debug information, but we have had -Og for a long, long time now, I don't know how many years, maybe 10. If I use -Og, do I still see in GDB that variables are optimized out? Yes, that can happen, but it's getting better.
-Og has some advantages for debugging over -O0, and other disadvantages. We have a plan for -Og to also artificially use all variables at the end of their scope, which would make sure they stay alive. Theoretically the DWARF debug information can represent these kinds of things, and Alexandre Oliva, who unfortunately left us for another opportunity last year, was working on many of these things; I think he still continues to work on making the debug information generation much more accurate. We don't actually have this on the slides, but just to give you an impression of why this is so difficult: the optimizer itself, which on the beginning slides you saw as one block, is actually many, many different optimizers. Jakub mentioned something like 50 to 70 passes before LTO and then another 150 for the back end, for code generation. For every single one of these transformation steps, to maintain the debug information in a format the debugger can use, we actually have to do two transformations, the code transformation and the debug information transformation, at the same time and in sync, and that's difficult. This is why we lose some of this information; in many cases the compiler simply pessimizes and says, I give up, I won't track this variable anymore. We also have the self-imposed requirement that -g must not change the generated code; it just adds the debug information. So if you compile without -g and later find out you need to debug, you should be able to reproduce the same code. And before you complain: the compilers we had up to the mid-90s, other than GCC, allowed you to use either debugging or optimization.
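That self-imposed requirement, -g does not change the generated code, can be spot-checked by comparing the disassembly of the two objects; a sketch (file names invented):

```shell
printf 'int f(int x){ return x * 3; }\n' > g.c
gcc -O2 -c g.c -o nog.o
gcc -O2 -g -c g.c -o withg.o

# Same instructions either way; -g only adds debug sections.
# sed drops the objdump header lines, which contain the file names.
objdump -d nog.o   | sed '1,3d' > nog.txt
objdump -d withg.o | sed '1,3d' > withg.txt
diff nog.txt withg.txt   # no output: identical code
```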
Pick one. So GCC is much better. Offloading is another thing which has recently become much more important, because we have all those GPGPUs and other hardware which can do stuff much faster in parallel than the main CPU. In C, C++, and Fortran there are various extensions which allow you to write code that uses the offloading hardware explicitly. One of them is OpenMP; OpenACC is another standard; and there are other ways to do it, CUDA and whatever. When GCC is invoked on a program which includes some offloading code, written with OpenMP or OpenACC pragmas, we again have the same first commands as during normal compilation, except an extra option is passed somewhere. The new thing is again done from the linker. During compilation we already store, again in the LTO format but in different sections, the functions which are intended to be offloaded, and the compiler can have different offloading back ends. One of them is nvptx, for NVIDIA devices. In GCC 10 we also have the GCN back end for AMD, and there is HSAIL, also for AMD, a virtual assembly code which is then compiled further; it could theoretically target others, but doesn't at the moment. There is also an emulator which emulates it on the host in different processes. The difference here, a nuance you might not have picked up from the slide, is that we are not emitting the entire program as intermediate representation into the object file, only the specific functions which are meant to be offloaded to the accelerator. And variables too. This is slightly more efficient when it comes to compilation time, but of course it's very specific, and the compiler has to make the right decisions about what to offload and what not.
And then one can either use the default — when you don't specify anything, -fopenmp just enables the language extensions — or you can use -foffload=something, with many options, and choose exactly: I want this offloader, or I want to pass these options to that offloader. This slide shows just a single offloader, to NVPTX, but you can also potentially have the same program ready for multiple offload targets, and at run time it will use the appropriate one. What basically happens is that the linker plugin creates a fat binary which contains the main program and a special data section which contains, for each offloading target, some binary or text blob — it depends; for NVPTX it is actually text. And you see this line in the middle of the yellow block, the nvptx-none assembler: that's the PTX assembler, which in most cases will come from an NVIDIA package. Well, this assembler is actually a dumb program which does almost nothing, because assembling the text file basically preserves it; it just verifies that it is correct. But it has an option: if the NVIDIA binary-only program ptxas is in the search path, it can actually use it and check the assembly completely. And the linker also does not do much; it basically concatenates the different assembly files together into one thing. I suppose this is some kind of cross-compile splitting of the program — so if I have code that must run in an NVIDIA kernel, it goes there, and the rest of the code runs on the host, automatically? Yeah, well, the normal host code is compiled by the first cc1 in there, or a second one, depending on which host code you have. That's normal compilation, and it just does one additional thing: it puts, as a data section, the LTO bytecode of the routines and variables which need to be offloaded.
And then the linker plugin can make sure that gets assembled and linked if it needs to be, and put into the special data section of the binary. The code which is prepared for offloading can be an entire function or a subset of a function — you have to look up in the OpenMP and OpenACC standards how you specify these things. And you said part of the code is for the device and the rest is for the host — the host gets the entire code as well. So theoretically, and in practice, the program can also run without, in this case, an NVIDIA card in it. It will just run slower. At run time what happens is that the runtime library queries the hardware: do I have NVPTX hardware? Yes, I have three devices, fine, I can use them — or no, I don't have any, and in that case there is a fallback. OpenMP actually allows you to say that a specific function will be offloading-only, or offloading-only for this and this target, then select between different devices, and all kinds of things. But normally everything which needs to be offloaded is emitted twice, once for the host. So here we only spoke about the compiler invocation; we could draw a similar image for invoking a program which uses offloading, and there you would also see all kinds of magic going on in the background. So, as we had the slide with the different interfaces: those are different kinds of languages. The preprocessed data is a way to write preprocessed code in a text file. Then another interface is the abstract syntax tree. Then we have the different levels of the middle end, and then some intermediate language which is much closer to the hardware, RTL, and then assembler. And all of them can be emitted and can be inspected. Over the lifetime of the code going through the compiler you can look at all the intermediate steps, and we're going to look at these things.
This is oftentimes not that easy to understand, as you will see. But the compiler can also emit summary reports of some form. Other compilers are admittedly a little bit better at some of these things, but we are trying to catch up, and you will see later a couple of examples of helper programs — for instance scripts which I wrote — to make this easier. First we show how you can record the switches some source code has been compiled with. There are two options — or three, with the annobin plugin as the third option. The first option, -frecord-gcc-switches, emits a section which has different strings for each of the options, and those strings are merged. I find this option actually quite unusable, because it's then hard to find out which options were used for which functions or which parts of the binary. It depends a little bit on how you are building things: if different files are compiled with different, individual options, this is not that useful — but it is very easy to add to your makefile. The other one, and that's actually the default, is -grecord-gcc-switches, and that includes the options in the DWARF producer string in the debug information. That has the advantage that in the debug information you can map: this function is covered by this compile unit, and that compile unit has been compiled with those options. The next slide shows how you can preprocess C source code: you just use -E, and those lines starting with hashes are about entering or leaving some header file or source file, or just moving the location. All the preprocessor directives are gone, except for these markers. From this you could reconstruct the parts of the header files and the source file which the compiler actually sees.
This can be really helpful in some situations, especially if you're using header files from projects you are not really familiar with, and you have to find out: what am I actually seeing, which macros are defined, and so on. So this is the output you get for that. The magic numbers in there — 1, 2, 3, 4 — mean: 1 is that you are entering a new header, 2 is that you are leaving some header, 3 is that the header is a system header, and 4 is about implicit extern "C" for C++ headers — some headers are implicitly extern "C" and others are not. You can also preprocess and ask that it dump the macros which are defined. You can use either -dM or -dD: -dM shows just the macros, and -dD shows the preprocessed file intermixed with the defines. This is something you might not know about: the compiler by itself is defining, I don't know, 100, 200 macros which are predefined. You might be familiar with the macros which indicate what kind of CPU you are using, but there are many, many more, as you can see here — and especially when it comes to C++, as we see on the next slide. Everyone here is of course a C++ developer using the latest standard, so you will know that there's a feature in the standard, ever since C++11, where you can query inside the source code which feature of the standard is actually available in the implementation. This is all done using macros. The most important one is __cplusplus, which tells you the year and month of the standard, so you can preprocess something depending on whether the current C++ standard is newer than C++14, for instance, or older than C++17. And recent versions of C++ add many, many further macros, one for each small feature.
So you don't have to have something like autoconf configure scripts to determine: does my compiler actually support this? You can put the appropriate checks inside your source code. It can either be something like a static_assert saying, well, I don't compile without this feature, or it can be working around the absence of the feature. There's usually no easy way to find out what your compiler actually supports, so this is why I have this script: you just run it and it lists the C++ features, or you run it specifying which standard version you are targeting. With that you can find out all the features which are there, which you can rely on and work with. This is just one way you can use this -dM output. These are actually compiler-predefined macros. Not only that — the library also has its own, but for the library ones you need to include a header. Well, and the script does that for you. And the exact values also sometimes change: in C++14 a macro can have one value and in C++20 another value. Those macros — the standard definitions — were for C++17 kept on the side in a separate document, SD-6, and in C++20 they were merged into the main standard, so you can see there the minimum values and so on. Anyway, that's a nice way for the compiler to automatically advertise what it actually does and supports, and it's all done using macros. And this matters because when different compilers implement new features, they don't implement everything at once — it takes three years for a new C++ version, and during those three years the compilers gain different features one by one. So. Now the fun starts. This is just a very small, stupid example where I just wanted to show some features of the dumps. GCC has the -fdump family of options, where you select what kind of dumps you want. The tree dumps are the GIMPLE dumps.
IPA is about the interprocedural optimizations; RTL is the backend stuff. And the third word in that option is which exact passes you are interested in, so you can write -fdump-tree-gimple, for instance, and it dumps just one file; if you write "all" it dumps all of them, and there are many of them. As you can see, there are 192 of them in this compilation, for this exact compiler version, and they are numbered so that you can see which pass runs first and which runs next. There are some gaps, because some passes have gates which depend on some options, so some passes might not be invoked in a particular case. And don't rely on the numbers and the names always being the same — between compiler releases these things might change: they might change order, some of them might go away, others get added, et cetera. You can see some numbers at the end, like dce4: some passes run multiple times at different points of the pass queue, and so they are numbered depending on where exactly they appear. We promised to give you insight into the compiler — here you have it. Imagine your project with a thousand files, and for each of them 192 dump files in addition to the output object file. Have fun. So this is the first tree dump, the "original" dump. That's the interface between the front end and the middle end — actually it's still part of the front end, late in the front end. It uses data structures which can be deeply nested; for instance, you can see those expressions in there, which are not normalized in any way. There are already some optimizations happening, as you can see: in the source code there was 2 * x + 12, and now you see (x + 6) * 2. So some optimizations already happen at this level, but most don't. What's kind of weird is that you see the declarations of the variables twice.
That's because the first ones, the less indented ones, are just lists of the variables which are in the function, and the more indented lines are the declaration statements of the variables, where the variables actually start living. For most variables that's where the actual assignment of the initializer to the variable happens, but for instance for VLAs and things like that it matters even more, because the size of the variable is decided at that point. So this is what you want to look at when you're interested in what comes from the front end — it's what the compiler sees in the first place, and if you already have a mistake there, if you made wrong assumptions, you know what to look for. As I said, the front end trees — well, different front ends have different trees, but what you see in the original dump are the C/C++-ish trees which most of the front ends emit as the interface to the middle end. The previous dump is immediately lowered, at the start of the optimizations, into this form which we call GIMPLE. Initially it was still trees, but with some requirements, like: we don't allow deeply nested trees; we want just very simple assignments, where this is the lvalue we assign to, and at most one, two, three, maybe four operands, fixed for the particular operation — for most of them one or two. And so you can see it lowered this way. There are no basic blocks at this point, and it's not in SSA form, but we already use some SSA names as temporaries — you can see the _1 in there, which is used just as a cheap anonymous integer variable in this case. And you can see all the gotos, because there are no basic blocks at this point. So you see, this is practically C code.
And actually, yeah, you can easily translate it by hand into C code if you want. Or — I said something about the first three words of the dump option, but there are also modifiers for the dumps: you can say you want details, or you want to modify the dump in a certain way. And there is a way to produce something that another front end we have can read back, so it can start from these dumps. That's usually not interesting to users, but for compiler developers it's really nice. So for instance, for the ternary operator you can see it creates a temporary variable, iftmp.0, assigns the values in different parts of the code, and then assigns the result into a. Let me speed up a bit. This is then lowered further — this is actually dumped from the "optimized" dump, which is the last one in the middle-end part. Here you can see basic blocks, and it's in SSA form, so you can see the PHI in there: in SSA form we require that every SSA name is assigned just once, so we need to solve the problem that the same variable could have different values when coming from different edges of the control flow graph. Basic blocks just means there is one entry point into the block and one exit point — if you have a jump somewhere into the middle, that breaks a basic block. There can be multiple edges coming into a basic block and out of it, with different properties. What you can also see in this picture are the debug statements, which say, for instance: at this point the variable a is equal to this value. And the source code contained a variable which has been completely optimized out — there is nothing left, nothing needed for it — but in the debug information we store how you can compute that optimized-out variable, so it can be propagated into the debug information. And the (D) in the parentheses, for parameters, is the value passed to the parameter.
And for other variables it means an uninitialized value — like if you forgot to initialize some variable somewhere; that's represented by the (D) as well. Well, here's something compiler developers actually don't use that much, something I added back in, I think, '97: we can also have graphical output of the compiler's intermediate stages. This is the equivalent of what we saw before in textual form, now in a nice graphical form. Nowadays we emit this in the dot format, for which there are all kinds of viewers; for the presentation I converted it to SVG, but using xdot you can display it graphically on the screen. And for those who really want to follow this, you can see here, really nicely represented, especially the probabilities of the individual edges which the compiler will assume. These can either be guessed by the compiler using heuristics — this compares some variable to some exact value, so it's more likely to be some other value than this exact one, or: the loop usually loops and doesn't break, and so on — but the counters can also be recorded from profile-guided feedback. Or you can use what's called __builtin_expect — that's a compiler feature — or nowadays, in standard C++, [[likely]] and [[unlikely]]. Which means that if you're looking at this and you disagree with the probabilities, you can actually do something about it. And then we have a completely different language: the "optimized" dump I showed before is lowered to this form, where each machine instruction is usually written as one RTL instruction, with all the operands, and there are details like: this is this register, and so on. It still has several different forms.
Before register allocation we use pseudo registers for the different values, which the register allocator then allocates either to hard registers or to memory. This is quite a late dump, so it's after register allocation. Basically this is the pseudo machine code representation — something which can be directly mapped to what the machine actually executes. Some architectures have very weird assembler, like s390 with those one- or two-letter mnemonics, and I usually just read the dumps, which are more readable to me and explain what the instruction actually does. This slide shows, for a particular instruction — the compare pattern, cmpsi-style — how it's actually written in the machine description. We have a language which lets you describe, in this RTL language, the properties of the instruction. So we say for this instruction that it sets the flags register — the machine register which contains the result of the comparison, and which on i386 is modified by most instructions — to the comparison of one operand with another operand. There are predicates for those, used mostly before register allocation: nonimmediate_operand, for example, says the operand cannot be an immediate value; it has to be a register or memory. And the last string in there are the constraints — something you can also use in inline assembly — saying this value needs to go into this class of registers, or memory, and so on. The last line is the actual assembly code — well, the actual assembly gets constructed from it; as Jakub said before, GCC at the moment emits a textual file, and this is the string from which it is built. This mode suffix — forget about it; it simply means the register can be used at 8 bits, 16 bits, 32 bits, 64 bits.
Yeah, the angle brackets are about macroization — we use this one instruction pattern for multiple different mode sizes, so for 8 bits and so on; otherwise we would need four different versions of it. That's in order to simplify things. And the curly braces in there are because we support multiple different kinds of assembly syntax: for Intel assembly there is one order of arguments, for AT&T the other. So with this, what you can also see is how the compiler actually generates code. It has all these patterns, and it has to match them against this representation of the program which you see on the left-hand side. By using the different patterns, optimally, it has to cover the entire representation of the program. Once it has that, it can just concatenate the appropriate patterns and has code. It's that simple — compilers are trivial. On these patterns you can run passes like the combiner, which tries to propagate the RTL of the operands into further instructions and tries to match the result; if it matches, it has a new instruction, and that can do more magic. We have special optimizations even at that level — peephole optimizations — which can do architecture-specific and CPU-specific things. Anyone know 68k assembly? That processor has a special addressing mode where an access to a variable also increments the pointer. Well, very late, you put this in a peephole optimization and it takes advantage of it. It's not something you want to do in the compiler in general, because it's the only machine like that. So, we already talked about the SSA form — let's skip it. This auto-increment was a 68k-specific thing — well, the framework was thought out so others could use it, but I think 68k was the only one which actually implemented it. It doesn't matter: there are simply features in processors which are specific to individual ones, and we can take advantage of them using these kinds of rules.
On this slide I want to show a small function extracted from one of the SPEC benchmarks which contains undefined behavior. That's the statement in yellow: if the loop runs all 16 iterations, it references the array element d[16], which is after the end of the array, and that's undefined behavior in C and C++. SPEC actually refused to fix it. What is actually in SPEC is the function without the if. So we changed the compiler so that in those cases, where the user writes this loop of exactly 16 iterations with no early exit, we just warn that it invokes undefined behavior in the 16th iteration and stop doing anything more about it. But if there is an early break, then the program itself is not necessarily invalid — it's only invalid if the sum of the values up to that point is not 512. So here I want to show you which optimizations actually change the behavior of this small program. You can look at the last dump, and you see there's no comparison of the loop iterator against 16. But if you look at some early pass you see it there: the comparison, is k less than or equal to 15. So we can grep for which pass was the last one that still had this statement, and look at the next dump and what's going on there — and that's the VRP dump, the early VRP. It shows all the interesting information: we analyze the number of iterations of the loop, find out that there would be undefined behavior if it iterated more than 15 times, and in that case decide that the number of iterations must be below this number, and propagate that into the value ranges. And then we see this comparison — if k_5 is smaller than or equal to 15 — and we have a value range for that SSA name, and it's zero to 15. So the comparison is always true, so we optimize it out. And we get an infinite loop because of that.
But that's only because the program was invalid. And of course you see especially the security folks saying that the compiler should not do these optimizations — but you're telling the compiler to do that. You're telling it to optimize according to the rules of what you're putting in there. The compiler doesn't know anything else; it only sees the code it is given. The compiler is not trying to do any harm; it just uses the assumption that the code being compiled is valid. This is another test case, where as you can see the programmer probably expected an infinite loop. But when you actually run this, compiled by a recent GCC, it's not an infinite loop — it will crash. That's because, in VRP again, we find out that the dereference of the pointer p is done only if the pointer is equal to null, so it is changed explicitly into a read from the null pointer. And later on you can see another pass making this assumption: assume loop 1 to be finite. Because the C and C++ standards have a requirement that programs actually make progress, and infinite loops are actually not valid in C++. If it's a plain infinite loop without any exit, GCC does not optimize it out; but if it's a loop which has some exit, then it assumes it must be finite. In that case it removes the loop, just dereferences null, and that's why it crashes. And with this you can find out what you have to do to restore the behavior in case the compiler misunderstands you — you then have to follow the rules under which you can do this. So, we're almost there — let's go very quickly now. The compiler has lots and lots of features and options. There's the --help option, which gets you generic information, and you can break it down to get options about specific topics, as you can see here.
For the optimizer alone there are 243 options, and for the parameters — things you can set on the compile command line to influence, in some way, how the compiler behaves — there are another 221. The major difference between parameters and normal options is that we don't guarantee that the parameters will be preserved exactly the same in future versions. So there are tons of things you can influence, but we're not really here to tell you about all of them. As I said, most of the passes in the compiler have gates, and what's often present in those gates is a check of some option. What you can see on the next slide is another script of mine, which I wrote at some point, which you can use. This will tell you, for instance, which of the options are enabled at the different optimization levels. You can still select them individually, but by just using the -O options you can see that some of them get turned on the higher the optimization level is. This is just a subset of them; it scrolls, it's quite long. Another modifier of the tree dumps is "missed", which shows messages about which optimizations were not performed, and for what reason — it tries to show the reason. Unfortunately we don't do this for too many things at the moment; this is work in progress. And it is also a little bit hard to read. So, next slide: we can actually save this information in a JSON file — one of our colleagues wrote this option, -fsave-optimization-record. It looks like this; this is completely unusable. So I wrote yet another script, on the next slide, which you can get there.
You point it at the JSON file which the compiler is emitting, and it will show you output like this, which injects the appropriate messages, intermingling them with the source code. We don't claim that this is complete in any way or form. This is something where, if you're interested in compilers, you can easily make progress very quickly yourself: just get in there and make sure we emit more and more of these useful messages in all the different places. And this just shows what the compiler actually produces — it doesn't produce object files, it produces text files. This is a complete representation, something an external program can handle, so in theory you could insert your own step between assembly generation and assembling into object files, transform this, or collect additional information from it. If you want more information, you can use -fverbose-asm, which emits a couple of additional pieces of information so you can trace things back to where they come from in the source code. If you emit debug information, that information is usually present as well, through the .loc lines, but with the verbose assembly it's much easier to read. There is another option, -dP, which allows you to intermix the RTL of each instruction with the actual assembly. All right, I think this was the last one. So, just a little overview: compilers are scary, they're big pieces of software, but they're not impenetrable — you can actually look into them. Are there any questions? Well, we work on GCC, so Clang doesn't exist. No, it doesn't have to be like that, you know. Yeah, it's not a sufficient speedup that it matters.
It's not, really. Well, you would need a very slow file system — on Windows, perhaps, it would matter much more, the file system there is really slow — but quite honestly, not much anymore, especially if it's /tmp, which is some memory-backed file system. Exactly, it just doesn't matter. Can I get the presentation somewhere? Well, the presentations will at some point be made available through the website, whatever it is — they will be collected. But I will also put this up on my server at some point, as I usually do. You mentioned earlier that when you call GCC or G++, it generates one object file per source file — is that not parallelized? Well, that's what I meant before: that's sequential. This is why you don't put all the files on the same command line — except for LTO. In the LTO case, you compile the files and then you do the linking, and the linking would take too long if it wasn't parallelized, so it splits the work into smaller partitions — the related functions which need to be optimized together. For something like Firefox, for instance, it can use many partitions. But this only works because it cooperates with make: by using the make jobserver, you can guarantee that the machine is not completely overloaded. Otherwise we would have to add this ourselves, but there's no need, because you can invoke the compiler multiple times and then you have complete control over this. Like, I think in the default LTO build of LLVM there's no locking and no such coordination of the parallelism, and at that point you just run out of memory. So that's just a bad idea. Okay, thank you very much.