So, this is Jakub and me; we have been working on compilers and these kinds of things for quite a few years now, and still, unfortunately, not too many people know about this kind of stuff. So we're going to talk a little bit about compilers as an intro, but more specifically we are trying to get you to understand what is necessary to actually write optimized code. Because, believe it or not, it's not just a matter of specifying an option on the compiler command line; the compiler tries to do something, but it might not actually be what you're expecting it to do. You will see a little bit in this direction, but none of that will replace you actually doing something yourself in this area, and there's a lot, believe me, a lot which you can do and which you have to do to actually write highly optimized code.

Maybe the first question: who here actually works on compilers, not just with compilers? All right, so that's our audience.

So, a little bit of an overview of what compilers are. For most people a compiler is a black box, but compilers are actually pretty structured, as you can see here. The inputs to a compiler are some form of source code, obviously, and for most programming languages some form of library, in whatever form it comes, including some from the system, either text or binary. All of that is consumed by the part of the compiler usually called the front end; probably everyone has heard of these terms. In a well-designed compiler system, that is the only part specific to a programming language. The rest of the compiler, if it is done correctly, is language independent. For GCC, for instance, this means we have front ends for C, C++ and Fortran, but also for other things like Go and D, and these parts are implemented only in the front end itself; the rest is identical. There might be some specific pieces elsewhere which are tailored to a programming language, but in general there are not.

So everything is fed into the front end and then passed on to the other pieces. The second piece on the slide is what's called the optimizer. The optimizer is the most interesting part, and most of this talk will be about it. We don't have to run the optimizer; we can immediately proceed to generating code, and that part, the back end, is the architecture-specific part of the compiler. So we have the language-specific part in the front end and the architecture-specific part in the back end. The back end knows, for instance, about x86 or Arm and how to generate code for it, and in the end it spits out what is called machine code, which is what the processor can actually execute.

For the front end I have a little example. I took this code from somewhere, so don't take it too seriously: everyone knows Euclid's algorithm to compute the GCD, and that's the code, in a kind of Python notation; I used that for various reasons which are not of interest here.
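The slides show the example in Python-like notation; since later slides switch to the C version, here is the same algorithm in plain C for reference (a sketch, the exact code on the slides may differ slightly):

```c
/* Euclid's algorithm, C version of the slide example. */
unsigned int
gcd (unsigned int a, unsigned int b)
{
  while (b != 0)
    {
      unsigned int t = a % b;
      a = b;
      b = t;
    }
  return a;
}
```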
What the front end does as a first step is called tokenization. It splits the serial text of the program; in most programming languages this is independent of the lines, while in Python the line endings also have a meaning. In any case, it creates a token for everything of substance inside the source code. For this specific piece of code it can look something like this: you get tokens, and in this case def is not just a character sequence, it's recognized as the keyword def. For things it does not recognize as keywords it says: this is an identifier, and the name of the identifier, in the second box there, is euclid. So this is the first task of the compiler. It's very boring stuff; the old compiler books harp on these topics, but nowadays no one cares about this, we have tools to do these kinds of things.

The interesting thing is that the tokenized form is still a very linear form. We have not yet discovered what the code actually means in terms of nesting and dependencies. This is then expressed in something called the abstract syntax tree, an example of which you can see here. There's absolutely no demand for a unified form of the syntax tree; every compiler comes up with its own form, and some compilers have multiple forms. GCC, for instance, has multiple ways of representing the same information internally. But this is one possible way of representing the abstract syntax tree for the program which we saw before. Now you can see this is not linear anymore; we actually uncover the structure of the program by encoding it into this tree form. We see what depends on what. That's part of the language syntax, and the grammar describes how the compiler understands the source code to make this possible.

Once we have that, we can go to the next, more complicated level and translate this into basic blocks. After the lexical analysis and the syntax analysis are done by the front end, plus whatever the language-specific definitions require, compilers usually create the control flow graph, which is a directed graph whose vertices are the basic blocks. A basic block is a sequence of statements; the important properties are that there are no jumps into the middle of a basic block and no jumps out of the middle of a basic block. You can have jumps to the start of a basic block, and at the end of a basic block you can have branches. The edges in the graph are either normal control flow, for instance a goto gives you a single edge and a conditional branch gives you two edges, or special kinds of edges. For instance, there are exception handling edges for when an exception is thrown, and that needs to be represented for the optimizations to be able to see all the possible paths of control flow. longjmp and setjmp have very complicated control flow as well, and that needs to be expressed in the edges too. On this picture you have the Euclid algorithm lowered into simple operations.
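To make the basic block properties concrete, here is the GCD function again, written with the block boundaries spelled out as labels and gotos (a sketch; the labels are made up and do not match GCC's block numbering):

```c
/* gcd with its basic blocks marked: jumps only ever target the
   start of a block, and only the end of a block can branch. */
unsigned int
gcd (unsigned int a, unsigned int b)
{
  unsigned int t;

 header:               /* loop header: entered from above and from the latch */
  if (b == 0)
    goto done;         /* conditional branch: two outgoing edges */

  /* loop body, also the latch: one edge back to the header */
  t = a % b;
  a = b;
  b = t;
  goto header;

 done:                 /* exit block */
  return a;
}
```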
Various compilers have different forms of internal representation; usually it's language neutral, and various compilers have multiple forms of internal representation. For instance, GCC starts with code like this after the front end has passed it to the middle end, and then this is lowered into static single assignment form, which is something that was developed at IBM in 1988 and these days is used in most compilers. The static single assignment (SSA) form has some very good properties for optimizations, namely that each variable is assigned just once, so you can attach various properties to these assignments, or, as we say, name versions. This is achieved by adding a subscript, a version, to each variable: if you assign a new value to a variable, you create a new subscript for it. This page also shows some weird functions at the beginning of some of the basic blocks; those are phi nodes, written with the Greek letter phi, and that's the way you express that the value of, say, a3 is either a2 or a1, depending on which branch you took into this particular basic block.

The point here is that with SSA we got, through this notation, the possibility to actually express optimization algorithms. Before we had this, there was really no way for an algorithm to be expressed as a kind of formula; we had to say: we have this index, we have some state, and if this condition is met, then we can potentially perform this kind of optimization. Every single optimization inside the optimizer is a transformation in the mathematical sense: something which keeps the semantics completely intact while transforming one program into another one, where the second one is hopefully cheaper to compute. That's the whole thing. And when we define optimization steps, we have to do this in some way or form, and SSA finally gave us the possibility to express it. We can write down the specific algorithms as mathematical formulas: if this variable is in this set and not in that set, and this condition is met, then you can do the following transformation on the block. That's the great thing about SSA. This is why everyone is using it, and if you're honest and you really want to learn about writing optimized code, unfortunately you have to learn about this as well, as Jakub will show us.

For the optimizations I would note that they only care about valid programs. If a program has undefined behavior, then it's completely valid to optimize it away or do anything at all; the optimizations are based on the knowledge that valid programs do not have undefined behavior.

For the SSA form I would note that not all variables are actually rewritten into SSA form. At least in GCC, variables which are aggregates, structures and so on, live in memory and need different treatment, and variables which have their address taken also need to be treated similarly. You also see the tmp1 and tmp2 here; in theory, if you translate the code back, you would expect to see tmp1 and tmp2 again. The point is that these variables live only inside the block, their use does not survive stepping out of the block, and therefore they don't need special handling either. But once a program is converted into SSA form, it basically doesn't matter much what the variables originally were. All you care about are the indexes of the SSA versions.
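SSA and phi nodes are not valid C, but the idea can be simulated by hand. Here is a sketch of a small if/else join where every name is assigned once on each path; the PHI comment mimics GCC's dump notation, and the code itself is made up for illustration:

```c
/* Hand-simulated SSA: each a_N gets its value in one place, and
   the join point selects a value depending on which edge was
   taken, which is exactly what a phi node expresses. */
int
ssa_sketch (int x)
{
  int a_1 = x + 1;
  int a_2 = 0, a_3 = 0;   /* initializers only to keep C happy */
  int a_4;

  if (a_1 > 10)
    a_2 = a_1 - 10;       /* reaches the join from the then-block */
  else
    a_3 = a_1 * 2;        /* reaches the join from the else-block */

  /* In a GIMPLE dump this would be: a_4 = PHI <a_2(bb3), a_3(bb4)> */
  a_4 = (a_1 > 10) ? a_2 : a_3;
  return a_4;
}
```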
In some compilers the versions are unique per variable, like in this example where you have a1 and b1; in GCC, for instance, you would have a_1 and b_2, because the same version number is never used for multiple variables. This makes things easy. And after you do this transformation, for most purposes, except for debugging, you don't care what the variable originally was; it's just something that holds a value. Hopefully you can see that this makes things so much easier: you don't have to worry about the differences between variables, every single variable gets one value assigned for its entire lifetime, and every value has one single name. So that's nice.

The control flow graph also shows a loop. You can see there's a header with a phi node; that's the loop header. And there is a loop latch, the basic block with a3, which is usually the single basic block inside the loop that is the predecessor of the loop header. If a loop has multiple latches, some algorithms don't work very well, but such loops are easy to transform. For instance, in this case you could think that the two blocks above the latch could have edges directly to the loop header; they can, in some form, but then you would have multiple latches.

GCC has over 100 optimization passes, and some of the optimization passes do multiple optimizations. It also has hundreds of optimization options. There is unfortunately no one-to-one mapping between the options and the passes. We do have some infrastructure: the -O level options enable some optimization options and disable others, and each optimization pass has a gate test which says whether the pass will run at all. Those gates can check not just the flag which controls the optimization, but also other properties, like whether the function contains a loop. Some of the passes also have subpasses; if the parent pass has its own gate and that gate says no, then none of the subpasses run. That's important, because if you dump the GCC options, with gcc --help or something similar, it can show many optimization options as enabled even for -O0, but those passes are actually not run, because they are subpasses of passes which are gated on optimizing at all.

So, this is a little script which I wrote; you can see the address there, you can download it. It tries to summarize all this, with information gathered from GCC. The most important thing I want to point out are the blue header lines: they list all the optimization options which actually exist. There is no -O99. Yes, we parse it, and that's exactly the point: many people believe that because the compiler accepts it, it must exist. There is no -O99, believe me. There are people who picked this up in the early, early Linux days and have been putting it in every single makefile out there since. The options in the script are the ones which really exist; it would help if people just discovered this. And this is just a subset; as Jakub said, we have hundreds of them, and this goes on screen after screen after screen.
But it helps you perhaps discover a little bit about what the options are, for each of these individual ones. The thing is, if you don't like whether a specific option is enabled at the -O2 or -O3 or whatever level you're using, you can enable or disable it individually: just use -f followed by the option name to enable it, or -fno- followed by the option name to disable it. GCC allows that. It's usually not advisable to flip hundreds of optimization options individually, because then you can get into strange territory; but for some things, especially when you're running into compiler bugs, it's very useful. That, aside from blaming him, of course.

So, this is the GCD algorithm written in C, just as a reference, because now we are going into compiler territory. This slide shows how you can use a dump option to produce actual text files which represent the intermediate representation of the program at each pass. The -graph suboption tells the compiler to also emit a dot file, which can be processed by dot to create a picture with the CFG visualization. Here the entry block is something virtual. Hopefully you can see the correspondence between the code and the structure from the previous slide; it's exactly the same structure, and the compiler can construct this even for very large functions. dot comes from the Graphviz project, or whatever the package is called when you install it, and it works reasonably well even for large graphs. I don't know whether Jakub still uses it, but I always use this kind of thing when I'm hunting for code which doesn't compile to what I expected; in some cases I like it better than the textual representation. What you see at the bottom there is exactly the same information, exactly the same, just in textual form.

As I said, GCC has multiple internal representations, and one major one is GIMPLE, which is statements written in tuple form, and that's what you can see here. This is from a very early pass, right after the CFG construction; there is no SSA yet, but you already have the CFG. We have some other examples later on where the SSA information is in there, and then you can see the correspondence there as well. Then there is the machine-dependent part of the compiler; that uses a different intermediate language, RTL, which is much closer to the actual instructions, but still generic enough that many optimizations which are common to all architectures can be done on it.

The textual dumps are numbered from the first pass to the following ones, and the t indicates the GIMPLE, the tree dumps. There are also i dumps, which are for the interprocedural optimizations, and then there is r, which is for RTL. For the GIMPLE dumps the language is C-like, except that the basic blocks are spelled out, and the phi nodes and debug statements are written as comment-like lines starting with a hash. It's not exactly C, but if you want, you can nowadays even write programs in GIMPLE.

So this slide shows how many passes we have. At the top you see it's -Ofast, it's the Euclid program again, compiled with that, except that it uses a little compiler plug-in which I wrote; again, at the bottom there you can see where to get it. It just prints the name of each pass.
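If you want to reproduce such dumps yourself, a sketch (the pass numbers in the generated file names vary between GCC versions):

```c
/* euclid.c -- compile, for example, with:
 *
 *   gcc -O2 -c -fdump-tree-all euclid.c
 *
 * which writes one text dump per tree pass, e.g.
 * euclid.c.231t.optimized (numbering differs between versions).
 * The -graph suboption of a specific dump, e.g.
 *
 *   gcc -O2 -c -fdump-tree-optimized-graph euclid.c
 *
 * additionally emits a .dot file that Graphviz can render:
 *
 *   dot -Tpng euclid.c.231t.optimized.dot -o cfg.png
 */
unsigned int
gcd (unsigned int a, unsigned int b)
{
  while (b != 0)
    {
      unsigned int t = a % b;
      a = b;
      b = t;
    }
  return a;
}
```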
So this is a subset; it's not even everything. It's a subset of the passes which GCC runs with all, or almost all, optimizations enabled. Again, not all passes do something: sometimes passes exit early and don't actually walk the CFG or the intermediate language; other passes do walk it but don't find anything interesting there. Well, in this program specifically, you can imagine there's not that much to optimize. But they still run. These are the dumps; again, the list is incomplete, because it doesn't even include the IPA dumps. You see at the bottom there, there are 252 files in the directory after I do this. A huge amount.

So this is a dump from the end of the GIMPLE, the tree optimizations, when we remove the SSA form. Some other compilers keep the SSA form even longer, until register allocation; we do not, but we still keep the basic blocks around until the end of compilation and just switch the language inside the basic blocks. What you can see here is that there are probabilities on the edges. Those are either guessed by various heuristics, for instance if you are testing a pointer for non-null, then it's predicted as likely that the pointer is non-null, and there are many other rules like that. Or the probabilities can actually be measured. That's what we call feedback-directed optimization, or profile-guided optimization, which is very useful especially for compute-intensive applications: you build the application twice, once with a special option that tells it to gather this information, then you run the application with some typical workload, and this generates data with counts for the different edges and other information, for instance about the typical sizes handled by string operations. Then you run the compiler again, and it uses this measured information instead of the guessed information: this part of the code is cold and so can be optimized for size, and this part is hot, so let's vectorize it and do the other expensive optimizations.
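The two-step build described here looks roughly like this; a minimal sketch, with made-up file names and workload:

```c
/* hot.c -- sketch of the PGO workflow:
 *
 *   gcc -O2 -fprofile-generate hot.c -o hot
 *   ./hot typical-input.dat      # writes hot.gcda with the counters
 *   gcc -O2 -fprofile-use hot.c -o hot
 *
 * In the second compilation GCC replaces its guessed branch
 * probabilities with the measured ones: the error path below is
 * laid out as cold code, the loop gets the expensive optimizations. */
#include <stdio.h>
#include <stdlib.h>

int
main (int argc, char **argv)
{
  if (argc < 2)
    return EXIT_FAILURE;

  FILE *f = fopen (argv[1], "r");
  if (f == NULL)                    /* measured: almost never taken */
    {
      perror ("fopen");
      return EXIT_FAILURE;
    }

  long sum = 0;
  int c;
  while ((c = fgetc (f)) != EOF)    /* measured: the hot loop */
    sum += c;

  fclose (f);
  printf ("%ld\n", sum);
  return 0;
}
```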
It's quite important that you also have the possibility to override this. For a long time we have had __builtin_expect as a compiler builtin, and C++20 has the [[likely]] and [[unlikely]] attributes which you can just use, and then you're overriding whatever the compiler guessed; that information, I think it's 90%/10% or something like this, is then plugged into the branch probabilities. And this is quite important for you to look at: if you look at your program and you know where the hot loop is, make sure that part is actually annotated appropriately, because, as Jakub said, the compiler uses this information to decide how much optimization effort to spend. Not everything gets heavily inlined, loop-unrolled or vectorized; that would blow up the code size, and in some situations it would actually make things worse. The compiler concentrates the heavy-duty optimizations, which may require a lot of compilation time and a lot of memory, on those parts which it knows will be executed more often. So if you have a loop which is always hot, and an if which handles an error condition which almost never happens, you never want to spend much effort on the error path. You can mark this explicitly, or, through PGO, have it uncovered automatically by running your code and then recompiling using the information.

This is not new: the PGO stuff, the profile-guided optimization support, has been in there for 15 years at least, something like that, so it's something everyone should already know about and use in their programs. The only thing is that you need a representative workload to run before you do the recompilation, and that's what most often breaks things: if you run the test suite, the test suite covers the corner cases and not the usual workload; on the other side, if you use just a simple benchmark, then the benchmark will be fast, but the workload needs to be typical for the real application. What GCC also does with this information is an optimization which splits functions into multiple code sections, one for hot code and one for cold code, because for the iCache you want the hot code packed together.

So, that's the title... sorry, the book. This is the only real book I know of which really covers SSA and optimization on it, so if you're really interested in this, I think that's still the book to get. It's quite old, 2004.

Now, some very simple examples of optimizations. Dead code elimination is removing non-useful code, and there are actually two kinds of dead code elimination in this function. One is that the return can be optimized away by removal of unreachable code: basically you create a basic block and then find that the guarding condition is never true, because an unsigned value is never smaller than zero, so you throw the block away immediately. The other dead code elimination comes afterwards: you have the tmp variable which is assigned a value that is never used, and the computation has no side effects, so it can be thrown away as well. Dead store elimination is usually about memory; in this case it's more like dead code elimination again, but if you think about the variable as living in memory, then the value you write first is overwritten by the result of the function call without ever being read, even until the end of the function, so the store can be removed as well. Of course, if you don't know whether the called function has any side effects, you can't do that.
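Both kinds of elimination, sketched in one place (the code is made up to reconstruct the slide examples):

```c
int g (void);   /* external: the compiler must assume side effects */

unsigned int
dce_example (unsigned int x)
{
  unsigned int tmp = x * 3;   /* dead code: tmp is never used */
  if (x < 0)                  /* unreachable: unsigned is never < 0 */
    return 0;
  return x + 1;
}

int
dse_example (void)
{
  int l;
  l = 42;      /* dead store: overwritten before anyone reads it */
  l = g ();    /* this call stays: g may have side effects */
  return l;
}
```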
Common subexpression elimination is something which is done in several places in GCC, both on GIMPLE and on RTL. It's done through value numbering: the compiler finds out that two expressions compute the same thing and can just use a temporary. On GIMPLE it's usually called FRE, full redundancy elimination, and there is also PRE, partial redundancy elimination. What I often see is people making their code ugly because they think: oh, I don't want to write an expression like this twice. They introduce yet another obscure variable somewhere, although the repeated expression might be easier to understand for someone reading the code. In general you don't have to worry about this; this optimization has been around for decades and the compiler is pretty good at it.

The slide also shows a possible optimization of comparisons: if you test whether the tmp variable is in a range, that can be done more efficiently than with two comparisons, in this case by subtracting the lower bound and doing a single comparison; the comparison should be done in an unsigned type so that it doesn't introduce undefined behavior. There are other optimizations for comparisons against a set of values; for instance, you can use an AND instead of, or after, the subtraction, and that can cover even bigger sets.

This next slide shows the result of common subexpression elimination. There is a plain _2 in there; I haven't mentioned yet that the way of writing an SSA variable in GCC is the variable name, an underscore, and the version number, and if the variable name is missing, then it's an anonymous SSA version which doesn't bind to any user variable; it's a temporary. At the start of the function there is a list of the types, and you see the SSA form has a phi node here: at the end you see PHI, and the first argument is the value _12, which, as you can see, comes from basic block 3, that's what the bb annotation means, and the second one is 0, which comes from basic block 2, so it's a constant, not a variable; that is possible in this notation as well. You can directly translate that into the code which you see on the right-hand side. This is actually pretty easy to understand once you get over the hurdle of this strange phi stuff; even without the phis it's not very hard to transform it back into valid C: the variables are declared, you use just those variables, and you only need to resolve the phis.

This next one shows another optimization, value range propagation. GCC does that only for a subset of types, like integers and, to some extent, pointers; it doesn't do it for vectors, and it doesn't do it for floating-point values, where in some cases it might be interesting as well. What value range propagation figures out is that, because of the test in the if, in the else arm the value of a is always smaller than or equal to 3. In that case it can optimize away one of the cases from the switch, because it can never be reached. Another thing it does is build a value range for a at the return statement: if a is initially bigger than 3, then the result will be bigger than 0; if it's initially 3, then it will be 13; and otherwise it will be 0. You can see it in the final block: a is either 0 or 13 there. So the compiler does a pretty good job, and the test whether a is smaller than 0 can be optimized away.
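Here is a sketch of what such a value range propagation example can look like; the code is hypothetical, reconstructing the slide:

```c
/* The ranges in the comments are what the compiler derives. */
int
vrp_example (int a)
{
  if (a > 3)
    return a;          /* range here: [4, INT_MAX], so a > 0 */

  switch (a)           /* range here: [INT_MIN, 3] */
    {
    case 3:
      a = 13;
      break;
    case 42:           /* impossible, since a <= 3: case removed */
      a = -1;
      break;
    default:
      a = 0;
      break;
    }

  if (a < 0)           /* a is 0 or 13 here: test folded away */
    return -1;
  return a;
}
```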
Now, we can actually do the same thing even if there is an a += 3 in there, and the reason is that this is done in signed integer type, and in C and C++ signed overflow is undefined behavior. This is an important thing to see, and I put it in deliberately, because people don't expect it. The optimizer, as Jakub said at the very beginning, only does transformations based on the assumption that the code is correct. Incrementing a signed integer variable so that it wraps around makes the program invalid, and you cannot expect anything from the optimizer at that point: it will transform the code just like the previous example, but all of a sudden, if you look at the variable a at runtime, it might actually be negative while the compiler treated it as positive. You are writing bad code. There are some who use this deliberately, but those are lousy programmers. If you can't live with that, -O0 is your friend.

[Audience question] For this I would mention that we have the sanitizers; -fsanitize=undefined, UBSan, will find this. [Question: do you mean signed, or does it hold for unsigned too?] For unsigned, the wrapping is well defined. So if you want it well defined, just cast to unsigned and cast back to integer afterwards; the cast back is implementation-defined in C, and in C++20 it is well defined, because C++20 forces two's complement, so everybody agrees on the result.

All right, this is another thing. When we're looking at optimizations, it's often not obvious what the compiler, or which of the compiler passes, actually did. In GCC 9 one of our colleagues added an option, -fsave-optimization-record, which writes out yet another file; it's named after the source file, with an extension ending in .json.gz, so it's a compressed JSON file, and it often contains a tremendous amount of data. Unfortunately we are at the very beginning of using this file, and it's really not usable by just looking at it yourself. So I wrote yet another script, the third one today, which you can also get, and if you run it on one of these generated JSON files, it annotates the source code with the information from the file. This gives you information about what the compiler actually rejected: the red lines, for instance, mean, this is a negative, we could not do this transformation; for some things it shows, I did the following optimization. The other lines are the annotations the compiler currently spits out. We are at the very beginning with these kinds of things, and this is something we have to work on a lot more for GCC 10; I will work with David to make this much more usable. Don't expect the world from it right now; sometimes it works, sometimes it doesn't. But if you find some use for it, or if you find problems with it, let me know; David Malcolm is the one who wrote the compiler side of it, and we are actively looking into it.

We have a couple more slides, and we're supposed to finish, but we already took some of the questions, so let's go on for a couple more minutes. Here are a couple of examples of optimizations. The first one is called tail calls. Anyone who has ever done functional programming, especially Scheme or something like that, knows the term, because there it is a well-defined mechanism: a function which calls another function as its very last operation can, in principle, be compiled into code which does not return through the intermediate function.
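The two shapes discussed on the slide, sketched with made-up functions:

```c
int g (int);

/* A true tail call: nothing happens after g returns, so the
   compiler can emit a jmp to g instead of call plus ret. */
int
f1 (int x)
{
  return g (x);
}

/* Not a tail call: the +1 must still run after g returns, so f2
   needs a real call and its own frame. */
int
f2 (int x)
{
  return g (x) + 1;
}
```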
So, that's the good case: on the right-hand side, in the second block there, you see the implementation with the correctly performed tail call optimization. Instead of using a call instruction, it actually jumps to the other function after setting things up. But often you have to work to make this happen. In the second column, the right-hand column, you see that there is an additional operation performed on the result of the called function, and in that case we of course cannot perform the tail call optimization, because we still have to add one to the result. This is nothing the compiler can handle by itself; you have to do something. Using tail call optimization can be really, really useful: if you have lots and lots of recursion, something which looks like recursion, and tail call optimization applies, you don't use up the stack. This is really important in some situations, and languages like Scheme actually demand tail call optimization. But to make it work, you sometimes have to rewrite the code a little.

There are other things which stand in the way, and you only discover them by looking at the generated code. Take a look at the code on the left-hand side: it again looks like a normal tail call situation, similar to the code on the previous slide, but there is a little bit of code in the middle, at the beginning of the function, and that code screws everything up, because after the call the compiler thinks it has to generate code to unwind the stack frame. As Jakub mentioned earlier today, GCC does escape analysis on which variables escape, and if they escape, like in this case, it stops doing the tail call optimization. The thing is that while the string escapes, it is actually no longer alive at the call: the scope in which it is declared has already ended, so the destructor for the string could have been run before the call. But it isn't, GCC doesn't handle this case, and so it prevents the tail call optimization.

So again, this is something you have to do explicitly. If you rewrite the code slightly, and this comes from a real-world example, I contributed essentially this kind of patch to another project: I split the code out into a separate function, in that case it was some error handling, and to also get the tail call optimization I marked that function as not inlinable. With this I actually get the code which you see here, which is much, much shorter than what we had before. So this is possible, but it's often not possible automatically, and you have to actually look at the generated code, because you get no information from the compiler saying, I didn't do this. For this talk I came up with all kinds of examples, and Jakub as well, and again and again it was: why didn't GCC do this automatically, why didn't it do that automatically? It's a bug-finding bonanza.
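A sketch of that rewrite pattern; all the names here are made up, and the real patch was against C++ code with a std::string, so this only shows the shape of the fix:

```c
/* Splitting the cold error path into its own non-inlined function
   keeps the hot path free of local objects that would need cleanup
   after the call, so the tail call survives. */
int next_stage (int);

__attribute__ ((noinline))
static int
report_error (int err)
{
  /* build and emit the error message here; anything that needs
     cleanup lives and dies inside this function */
  return -err;
}

int
step (int state)
{
  if (state < 0)
    return report_error (state);  /* cold path */
  return next_stage (state);      /* hot path: a genuine tail call */
}
```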
You saw this kind of code before, the switch statement. What if you know that in the function g here, the parameter a can actually only have a couple of values? The function f here is a general one, for instance this can be error handling, but in some cases you might know that the error you get here can only have a few specific error codes. So how do you express this? The full code is actually pretty long: it has a conditional jump in there and a long jump table. But if you just add the little bit of highlighted code which you see here, there is a compiler intrinsic called __builtin_unreachable which tells the compiler: if execution gets here, that basically cannot happen, this branch is dead. Which means that in the other part, the else part, the consecutive code, the condition of the if actually holds. This is a way to tell the compiler: I know something about this variable. We have contracts coming up, but that's a different thing; this kind of thing exists today, and it can have a remarkable impact on the quality of the code you're getting out of the compiler. And again, the sanitizers are able to catch it if you're wrong, because instead of honoring the __builtin_unreachable they transform it into a library call which kills the application. Why would you use another compiler? You're talking to GCC guys; that's nasty talk.

All right, the last class of things we're talking about is vectorization. This is a loop which can be vectorized on many architectures; in this case, also so that the code stays smaller, we used 128-bit vectors first. It can be vectorized either way, but if you don't tell the compiler that the res and a arrays, or whatever objects they may point to, don't overlap, the compiler has to insert additional runtime checks which verify that the two arrays don't overlap. You see that in the prologue of the loop: you get very large code and a runtime check. In many cases you know that the two don't overlap, and there are several ways, written out in this test case, to tell the compiler about it: C has the restrict qualifier, in C++ we have __restrict, or you can use pragmas. This is one of the differences between C/C++ and Fortran, and when people come from the Fortran side they forget about restrict and get bad performance, because in Fortran you cannot have overlapping objects. That's also why sometimes, no matter how hard you try in C++, you still don't get what Fortran would give you.

Looking at what we're doing in this loop: the first three instructions there have a p in the mnemonic, and that indicates packed, vector instructions, which actually perform multiple operations at the same time, in this case on two doubles; you can have more, and if you use full AVX-512 or 256-bit AVX you get even more out of this. But then you have something after the loop, this .L3 label which you see here: this is the cleanup part. If there is an odd number of array elements, one and a half vectors' worth, say, then you have to handle the remainder as scalars, and that's what happens there. But again, if you know that the element count will always be a multiple of 16 or whatever, then you can tell the compiler about that too, using for instance __builtin_unreachable, and it will be able to optimize away the scalar cleanup loop. Is that important? We have to go to the next slide; I think we have just a couple more examples on vectorization.

We also have the possibility to vectorize functions like square root and that kind of thing; on Linux, at least, there are parallel versions of these. For square root we don't even need the library, because it's an instruction on most architectures, but for things like sine, cosine, tangent, exponential, logarithm and so on, which compute values many times over, we can still parallelize these loops completely.
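A sketch of the restrict variant, with a hypothetical function mirroring what such a test case does; compiled with -O3 this can vectorize without the overlap checks:

```c
#include <stddef.h>

/* restrict promises the compiler that res and a don't overlap, so
   no runtime overlap check is needed before the vector loop. */
void
axpy (double *restrict res, const double *restrict a,
      double x, size_t n)
{
  /* promising that n is a multiple of 16 lets the compiler drop
     the scalar cleanup loop after the vector loop as well */
  if (n % 16 != 0)
    __builtin_unreachable ();

  for (size_t i = 0; i < n; i++)
    res[i] = res[i] + x * a[i];
}
```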
This works really well nowadays, but the test case shows that if you try it with, for instance, -O2 or -O3, it actually does not work, because POSIX, I think, mandates that sqrt sets errno if the value is smaller than 0. That's a very unlikely case, and that's why GCC normally, for the scalar code, uses the hardware instruction and just compares the value against 0; if it's smaller than 0, then some cold code calls the original library function to do the errno handling. And that's what breaks the vectorization in this case, and that's why you need -Ofast or something like it; -ffast-math and -Ofast both work around these kinds of things. And that was the last one. We'll make the slides available; there's only one more slide there. So, hopefully you got an impression of how complicated it actually is to write optimized code, and how hard the compiler works behind the scenes. And to actually verify that the code the compiler generates for you corresponds to what you think, there's no way around it: you either have to look at the GIMPLE code and the assembly, or just trust the magic of the compiler doing the right thing all the time. All right, hopefully you got an impression. Thanks.