Hi, my name is Kem, thanks everyone for coming. Today we are going to talk about some compiler optimizations, and I will be citing GCC and Clang along the way; in some cases you will see what you can do with one versus the other. These optimizations will not necessarily get you better or faster code every time, but you will understand what they are and how to use them effectively, and we will learn that while they are generally good, whether they help depends on where you apply them. We will cover a few of the optimizations here, but there are a lot more, so I will have links you can follow to see which ones fit your needs.

First, a short introduction to the compilers. Clang is a relatively new compiler infrastructure built on top of LLVM. It has a limited set of front ends: C, C++, and Objective-C. Besides Clang there are several other tools in the LLVM project, so if you are interested, go have a look; they have a lot of nice tools. There is the Clang Static Analyzer (scan-build), which does static analysis, and clang-tidy, which helps you write better C/C++ code; the list is long, so see whether any of those tools fit into your workflow, they might be handy. The LLVM project was first released in 2003. The latest stable release right now is 3.9, from December last year, and 4.0 is coming out in about a month, I think; it is already at the RC stage. A lot of people call it "C-Lang", but it is actually pronounced "Clang".

As for GCC, I think most of us here know what it is: the GNU compiler suite. It can be built as both a native and a cross compiler, so cross-compilation infrastructures like Buildroot, OpenEmbedded, and several others use it to build toolchains for other architectures. It has many language front ends: C, C++, Fortran, Ada, Go, and a few others you will see, so it has a lot more front ends than Clang, for example. It is modular in nature, meaning it has the concept of a front end, middle end, and back end, and it supports multiple architectures. If you compare the lists of supported architectures, GCC has way more; I have provided a small link here, and if you check it you will find that almost every architecture has support in GCC. Clang only supports a handful of them right now, ARM, x86, PowerPC, and MIPS to name a few, and then some others, but not as many as GCC, so if you are working on one of the other architectures, Clang may not be an option for you yet. GCC's latest stable release is 6.3, again from the December time frame, and the next major release will be 7.0, probably coming out late summer this year. Both compilers have a major release roughly once a year, and then there are point releases, essentially bug-fix releases, that happen every few months until the next major release, so it is pretty predictable when a new compiler will come out.

We will start with the optimization flags, the -O flags. The -O flags are actually collections of individual optimizations underneath, and I will show how you can see what they are doing. -O0 is unoptimized, meaning it applies no optimization: a simple conversion from your C code to assembly. It is generally unoptimized code, but it should always work, and it is what you get when you pass no -O flag at all.
You can use it when you are debugging issues in your code and things like that. -O1 is general optimization; it makes no trade-offs for speed or size. It applies optimizations that generally do not degrade any of your conflicting requirements, such as debug info, compilation speed, or code size. -O2 is more aggressive: it applies additional optimizations, which can affect your size, and it includes some speed optimizations. -O3 is more aggressive still. It applies more inlining, so you can end up with more code, but it may run faster. Again, you have to measure whether it really runs faster, but if your architecture supports it, it most probably will. -Os is optimization for size: it is like -O2, but it favors code size, so it drops the individual optimization passes that increase size. Then there is a newer switch, -Ofast. It does -O3 and additionally enables inexact math, calculations that are not compliant with IEEE floating-point rules, so if your workloads are fine with that, you can enable it. -Og is for the debugging experience. Most of the time people use -O2 -g, and then many times you do not get an accurate debugging experience; the debugging is coarse, and you find yourself jumping around between lines and so on. -Og is designed to give you a good debugging experience while still applying certain optimizations, so you are not debugging completely unoptimized -O0 code. Clang has one more option, -Oz, which optimizes for size even further: it disables all loop vectorization, which GCC does not have an equivalent for, so in theory it can generate more compact code. Another difference you will see is with plain -O, which translates to one of these levels: in Clang it translates to -O2, and in GCC it translates to -O1. It is an interesting fact to know when you use that option.

Feel free to ask questions or make comments in between. Yeah. Sorry, could you repeat it? A bare -O is the same as -O1 in GCC; if you pass no -O option at all, you get -O0. Yeah. You mentioned -Og is better for debugging, but usually -O0 is good for debugging too; can you clarify why -Og is better? Yeah, so -O0 applies no optimizations at all; whatever code the compiler generates, it just emits as-is. With -Og it still applies the optimizations that do not degrade your debugging experience, so you get some level of optimization. Yes. And that is a good point as well: some code checks for optimization flags. When you enable optimizations, the compiler defines __OPTIMIZE__, and some code depends on that and checks for it.
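As a minimal sketch of that kind of check (the macro is the real one GCC and Clang define whenever any level above -O0 is active):

```c
#include <stdio.h>

int main(void)
{
#ifdef __OPTIMIZE__
    /* Defined by GCC and Clang at -O1, -O2, -O3, -Os, -Og, etc. */
    puts("compiled with optimization enabled");
#else
    puts("compiled at -O0");
#endif
    return 0;
}
```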
Talking of optimization flags, in GCC there is a way to see all the different passes it will run. If you dump the generated assembly with -S -fverbose-asm, you will see the full list of enabled optimization passes in the output; I have only shown a few options here, but the list is huge. For example, this is the output for -O2; you can do the same for -O1, -O0, and so on and see which options or passes are enabled at each particular -O level. There is another option in GCC, the optimizers help (-Q --help=optimizers), which I have shown here: it gives you a nicely formatted output for every optimization and tells you whether it is enabled at that level or not. So if you are looking for a specific optimization and its state, use this option. And when you enable the -O options and find you need to disable one of the optimizations they turn on by default, you can very well do that: just specify the corresponding -fno- option. In this example I am passing -fno-aggressive-loop-optimizations, and you can see that reflected in the option passes right there. You can look at the specific optimization options; I have provided the link here. There is a lot of documentation and I highly recommend you go through it; it explains really well what each pass does. So as you can see, -O means a lot when the compiler is working underneath; it is doing a lot with your code.

Now a little about aliasing. There are options in the compiler to enable strict aliasing, which helps the compiler generate better code. If your code has aliasing problems, it is better to avoid them by writing aliasing-clean code yourself, or specifically to help the compiler follow the strict aliasing rules. You can find out whether you have aliasing issues by enabling the warning (-Wstrict-aliasing), which shows where your code violates those rules. And C99 has the restrict keyword: when you know your pointers do not alias, use restrict, and the compiler knows it can apply the optimizations, because you have told it these pointers do not alias. If you look at the code I am showing here, the first function can alias, because the aliasing rules say that pointers to the same type may alias, so the compiler cannot optimize well there. In the second example, the rules say that pointers to different types cannot alias, so even though the function does the same thing, the second one generates more optimized code because of the strict aliasing rules. Sometimes we do conversions; keep in mind that conversions can result in aliasing too. In this example, if I pass a long and inside the function I cast it back to int, the effect is similar to the first function. So you can help the compiler by following these rules. Another thing: there are bugs in compilers as well; GCC has had a history of them, and as of 6.0 an issue was fixed where the 8-bit fixed-width integer types were not treated at the same level as unsigned char and char, so they would break the aliasing rules. That was fixed in GCC 6.0, so if you are on an older GCC, you might still have that bug in your compiler. So with aliasing, it is important that you either help the compiler by letting it know, or write your code so the compiler has less work to do.
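Here is a minimal sketch of the kind of example being described (the function and parameter names are illustrative, not from the slides):

```c
/* Same types: v and n may alias, so the compiler must assume a store
 * to v[i] can change *n and reload it on every iteration. */
void scale_same(int *v, int *n)
{
    for (int i = 0; i < *n; i++)
        v[i] *= 2;
}

/* Different types: under strict aliasing (on by default at -O2),
 * an int* and a long* cannot alias, so *n can stay in a register. */
void scale_long(int *v, long *n)
{
    for (long i = 0; i < *n; i++)
        v[i] *= 2;
}

/* C99 restrict: we promise the pointers never overlap, which gives
 * the compiler the same freedom even when the types match. */
void scale_restrict(int *restrict v, int *restrict n)
{
    for (int i = 0; i < *n; i++)
        v[i] *= 2;
}
```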
Next, inlining. You can use the inline keyword to hint the compiler: inline by itself will not force inlining, but it tells the compiler you intend this function to be inlineable. You can force inlining with function attributes: always_inline, written __attribute__((always_inline)), is a function attribute that forces the compiler to inline the function. GCC actually has three different implementations of inline semantics. If you are using an older compiler that did not default to C99, you get the GNU89 inline behavior; with C99, which is the default for recent GCC releases, you get the C99 inline behavior; and then there is C++ inlining if you are using C++. So be aware of that.

Inlining is an interesting optimization. A lot of people use it, and it may or may not help you. It really depends on, for example, how big your function is: if you do excessive inlining with forced inline, believing you know better than the compiler, you might create large function bodies that spill out of your cache, so you cannot use the instruction cache as efficiently as you otherwise could. In some cases it is the opposite: if the function is small, you inline it in a loop, and it fits into your instruction cache lines, you can get a big performance boost. So when you apply inlining, look at the code; I always say, look at what GCC, or whatever compiler, is doing with your code. If it is inlining, look at what it is inlining. Secondly, it is often important to know your profile too: you might inline a function that grows the code but is not a hot function, so overall your performance gain is nothing. You have to know where inlining is most effective, because it does carry a penalty in code size; keeping those things in mind, you can probably use the inline keyword effectively. Older compilers did not have good heuristics to determine what to inline and what not to; newer compilers are very good at determining inline limits and identifying whether a function is inlineable. A lot of existing code has forced inlining in it, and that becomes quite static from architecture to architecture: when you move that code to another processor, it may not work as well as it worked on that old MIPS processor. So it is good to let the compiler decide, and take control only when you have issues and you know a function should be inlined for a particular case; then you can use the always_inline attribute.

Now stack optimizations, which are quite interesting. There is a diagnostic option, -fstack-usage, which dumps information into a separate file about how much stack each function uses. You can take those values and do some math on how much stack your call chain may use, and that might help you redesign your function or algorithm. Look at what contributes to stack: local variables contribute, and function parameters contribute, so if you write functions with very many parameters, you might want to avoid that and pass pointers to structures instead. Returning structures by value can also add to stack usage, as can a lot of local allocations, so watch out for those things; this is a good option for finding them. If you are using a lot of stack, check whether you have recursive functions, which are bad for stack, or deep call chains; in some cases I have seen call chains going into the hundreds, and if you can curtail those, it improves your stack usage. You can also turn it into a warning: -Wstack-usage= is a good option that says, if my stack frame grows above this size, warn me. So if you are on a constrained system, which to a certain extent all systems are, it just depends what your limits are, this is a good option to put in: your build breaks, and then you know somebody has added a function that goes over the limit.
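A hedged sketch of how those options fit together (the flags are real GCC options; the functions are made-up illustrations):

```c
/* Build with:  gcc -fstack-usage -Wstack-usage=512 -c stack_demo.c
 * -fstack-usage writes stack_demo.su with per-function frame sizes;
 * -Wstack-usage=512 warns about any function whose frame may exceed
 * 512 bytes, so sum_down() below trips it. */

struct blob { char buf[4096]; };

/* Returning a large struct by value costs stack in the caller. */
struct blob make_blob(void)
{
    struct blob b = { {0} };
    return b;
}

/* Recursion multiplies the frame cost: every call adds its locals. */
int sum_down(int n)
{
    char scratch[1024];         /* large locals count toward the frame */
    scratch[0] = (char)n;
    if (n <= 0)
        return scratch[0];
    return n + sum_down(n - 1);
}
```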
Then there is -fconserve-stack. All of these options, by the way, I did not find in Clang; they are all GCC options. -fconserve-stack tries to minimize stack usage: it will try to reuse locals rather than allocate duplicates, and so on. The code might run slower, because it has to spill and reload more, but if you are looking at optimizing your stack usage, this is an option you might want to use.

For size optimizations, most of them are taken care of by optimizing for size: if you use -Os, it applies a whole bunch of optimizations. In addition, you can try tuning your stack boundaries, setting your stack alignment to, say, 8 or 16. It is generally 16 on x86, and you can make it 8; accesses will slow down, but your functions will use less stack. Both compilers have an option for this, just named a little differently. You can use another option, merge constants (-fmerge-constants in GCC), to make the compiler work harder at finding identical constants and merging them, which eventually reduces your total size; as you can see, GCC and Clang both have that option, just under different names. And there is omit frame pointer (-fomit-frame-pointer), which you can try: it removes certain instructions from your function entry points, but debugging will suffer. So if you want a good debugging experience, you might want to keep the frame pointer; it is a trade-off at that point. Some people enable it for certain functions they are sure they will never debug, and keep the frame pointer in areas where they might have issues and want to debug. Again, as I was saying, it is a mix: identify your workloads and enable these options accordingly. One size does not fit all.

There are two options I want to highlight: -ffunction-sections and -fdata-sections. Both compilers have them. What they do is put each function into its own ELF section, and similarly create separate sections for global data and initialized data. Then, when you enable garbage collection in the linker, the linker is much better able to find unused functions, for example, so it can throw a lot of code away. If you do not use these options, there is a single .text section in the object file the linker sees, and it may not be able to identify much to remove. But when you use these options, many times the code breaks. The reason is that now everything goes into its own section, and you might have entry points the linker does not recognize as entry points: it sees no call chain and decides this whole set of functions is unused. There are ways in the compiler to mark your functions as used. So keep that in mind: when you switch to these options you might see smaller code that then does not run, and usually the problem is exactly that you have not marked your entry points, because entry points are what the linker uses when it works out which functions and which data are unused.
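As a minimal sketch, assuming a plain ELF target (the attribute and linker flag are real; the function names are illustrative):

```c
/* Build with:
 *   gcc -ffunction-sections -fdata-sections -c gc_demo.c
 *   gcc gc_demo.o -Wl,--gc-sections -o app
 */

/* No caller anywhere: with --gc-sections the linker can drop the
 * whole .text.unused_helper section this lands in. */
int unused_helper(int x)
{
    return x * 3;
}

/* An entry point reached only from assembly or a vector table has no
 * visible call chain. __attribute__((used)) stops the compiler from
 * discarding it; a KEEP() entry in the linker script additionally
 * protects its section from --gc-sections. */
__attribute__((used)) void irq_entry(void)
{
}

int main(void)
{
    return 0;
}
```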
So those are a few points that can help you reduce code size, along with the general optimize-for-size options we talked about earlier.

Now I want to talk about profile-guided optimization, PGO. It is quite useful; in newer compilers it works really, really well. What it does is feed the compiler actual execution data that it can use to optimize further. There are basically two kinds of profile-guided optimization you will find. One is more statistical in nature: it can do less, but it is low overhead. The other is instrumented: it is more accurate, gives you more options, and optimizes your code more, but it is quite intrusive; at a minimum it doubles your builds. You have to do an instrumented build, run it to collect the instrumentation data, and then compile again using that data. So there are the options I mentioned here, -fprofile-generate and friends, and these are the steps you have to follow for the more precise collection of data. When you run the instrumented code with your training data, meaning whatever inputs you feed into your application, it generates extra files, so you need a file system on the device; it has requirements of that sort. If you are doing embedded development, you have to make room for those files to be written, be able to take them off the box, and feed them back into your compile process, so if you are cross compiling, this can be a somewhat involved process. However, it helps you optimize your code a lot more. The third step, after you have collected the data, is to rebuild your code using the profile data. A lot happens underneath, but overall what it feeds in is information like: is this path taken, is the if branch taken more often than the else, or the other way around. All that data is collected and annotated, so when the compiler recompiles your code, it uses those annotations; that is the additional information the compiler gets in the final build, and it can lay out the branches properly. It knows this branch is taken more often than the other one, so it optimizes using that value range.

So, yeah: is the profile repeatable? It depends on your training data, really, and on a live system you have interrupts and things like that, so I do not think you will be able to regenerate exactly the same profile data every time. It is fairly steady most of the time, but many times what you do is many runs: you got 10 percent and you are excited, you want 11 next time, and suddenly you see 9. The reason is that your training data may be the same, but the system was processing other things meanwhile. So generally the improvements you get are within a range. As I mentioned before, it is a little harder to use, because of the instrumented run and then collecting the data and feeding it back into your build, so it is a bit of an involved process; and because you are compiling the code twice, it also degrades your compile time, and when you feed a lot of accurate data in, the compiler takes more time to compile your code.
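A hedged sketch of the three-step workflow (the flags shown are GCC's real ones; Clang's equivalents differ slightly, and the program is a toy):

```c
/* Step 1:  gcc -O2 -fprofile-generate pgo_demo.c -o pgo_demo
 * Step 2:  ./pgo_demo                  (writes .gcda profile files)
 * Step 3:  gcc -O2 -fprofile-use pgo_demo.c -o pgo_demo
 */
#include <stdio.h>

static int classify(int x)
{
    /* The profile records how often each side of this branch runs,
     * so the final build can lay out the hot path first. */
    if (x < 0)
        return -1;
    return 1;
}

int main(void)
{
    long sum = 0;
    for (int i = 0; i < 1000000; i++)
        sum += classify(i % 100 - 1);   /* mostly the positive path */
    printf("%ld\n", sum);
    return 0;
}
```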
There is a lighter version called AutoFDO, and I have provided a link for it. It uses perf, and it is a sampling-based profile, so it is a lot faster: all of the collection happens underneath, and you do not go through the three steps we were just talking about. But again, it is sampling-based data rather than instrumented code, so you can miss things: if a function is small and runs between your samples, you might miss it entirely. It gives good results, but the instrumented one is better; you just have to pay more to use it.

Next, link-time optimization. When you are compiling your code, the compiler sees it file by file: your compilation unit is a single file, and that is the only view it has. What link-time optimization does is give the compiler a whole-program view, so it can do a lot more. For example, if you have a global function and nobody uses it, the compiler does not know that until it has the whole-program view; at that point it knows that even though the function is global, it is not used by anyone. And there are other optimizations it can apply. Both compilers can do LTO, and the example here shows how you would do it with Clang: it generates LLVM bitcode for your source file, with intermediate data about the symbols and types and all that, and emits it into your object file. Then, when you link, you have to specify the -flto option; with Clang I think you do not need to specify it again, it can detect that there is LTO data, but with GCC you have to tell it to use -flto, and then it invokes the link-time optimizer. What happens is that you invoke the compiler, it has a plugin mechanism that invokes the linker, and the linker plugin is able to invoke the optimization passes, because the metadata to do all those optimizations is in the bitcode. So it can see the whole program and do a lot more optimization on your code.

With -flto= you can say full or thin. Full is the classic mode: at link time, all the bitcode is merged and optimized as a single unit, which gives the most visibility but makes the link one single, slow process. Thin is faster to compile and gets you almost the same optimization gains, but it needs the gold linker; the reason is that while linking is a single process in the full case, with gold the thin variant can do threaded links, launching many threads, and that is how it links faster. GCC has similar options, as you can see; instead of generating LLVM bitcode it generates GIMPLE bytecode, GIMPLE being GCC's intermediate representation, and the rest of the process is pretty much the same. There is a -fuse-linker-plugin option, similar to what Clang has, and of course it needs a linker with plugin support. What that means is that binutils has this plugin support, and when your linker has it, the other tools that read objects while linking your application, the archiver and so on, can work with LTO objects too. Then there is -ffat-lto-objects, which basically says: generate objects that can be linked in both LTO and non-LTO mode. It writes the normal object code alongside the bitcode, so if you provide the result as a library and somebody does not use LTO, they can still link it, and somebody using LTO gets the bitcode. One of the usual caveats, and there are always compromises, is that -g might not work well with this: work is being done in this area, but you may find that when you link your application with LTO, debugging is even worse. Keep that in mind.
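A minimal sketch of what LTO can remove across files (the flags are real; the two files and function names are made up for illustration):

```c
/* a.c  --  gcc -O2 -flto -c a.c     (Clang: clang -O2 -flto=thin -c a.c)
 * Global, so within this one translation unit the compiler must assume
 * some other file might call never_called(). */
int never_called(int x)
{
    return x + 1;
}

int answer(void)
{
    return 42;
}

/* b.c  --  gcc -O2 -flto a.o b.c -o app
 * With the whole-program view at link time, the optimizer sees that
 * never_called() has no callers anywhere and can drop it. */
extern int answer(void);

int main(void)
{
    return answer();
}
```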
Auto-vectorization. There are multiple loop optimizations done underneath, and we are not going to talk about all of them, just auto-vectorization; the others are put together by your -O options underneath, and they do a lot of loop optimization. I highly encourage you to look at those loop options: the compiler is doing a fantastic job underneath. What auto-vectorization is about: nowadays pretty much all processors, even in the embedded Linux space, have some sort of SIMD unit. ARM has NEON, there are SSE and AVX units on Intel x86, PowerPC has AltiVec, and MIPS has similar things. Auto-vectorization is about using those units to speed up your code. It is enabled at -O3; if you do not use -O3, you can use -ftree-vectorize, the specific option to enable it, and of course you also have to enable the SSE or AltiVec options, otherwise the compiler will complain: you want vectorization, but you do not seem to have a SIMD unit, so what can I do?

How is -O3 bad for size, then? Sorry? I will tell you how it is bad for size. The question was why -O3 is bad for size: because by default it is unrolling your loops. Forget about auto-vectorization; ordinary loop unrolling alone does it. If it finds your loop iterates 100 times, it will unroll the body four times so the loop only runs 25 times, if it can do that, because it is trying to save branch instructions. Again, it is subjective whether that improves your execution, but you are obviously replicating code four times, and that definitely adds to your code size. What auto-vectorization then does is identify those code snippets or loops and execute them using SIMD instructions. A normal scalar loop executes a lot of instructions; here the compiler uses all the SIMD width to do whatever calculation your loop is doing, so it executes much more quickly. It is enabled by default at -O3, so you do not have to do anything per se, but you can help the compiler when you write a loop and you think the compiler should look harder at it, that it is something it can auto-vectorize: you can provide hints through pragmas. The pragma I have shown is a good option in Clang, for example, and I found GCC has something similar: put the right pragma above your loop and the compiler will work harder to vectorize it.
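A hedged sketch of a loop that vectorizes cleanly, using Clang's real loop-hint pragma (drop the pragma for GCC, or use GCC's #pragma GCC ivdep):

```c
/* See what got vectorized:
 *   clang -O2 -Rpass=loop-vectorize -c vec.c
 *   gcc -O3 -fopt-info-vec -c vec.c */

void add_arrays(float *restrict a, const float *restrict b,
                const float *restrict c, int n)
{
    /* restrict promises the arrays do not overlap, which is usually
     * what unlocks vectorization in the first place. */
#pragma clang loop vectorize(enable)
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```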
There is a lot more to the vectorization features: the vectorizer can go beyond a single basic block, look across basic blocks, see whether similar operations are being done in this block and that block with no dependence between them, and coalesce them together. That is superword-level parallelism (SLP); it is a more aggressive optimization, it takes a lot longer to compile, but it can give you a lot more optimization, at the cost of compile time. What I saw is that both compilers support superword-level parallelism, and Clang goes a bit further and runs it in two phases: a first phase, and then, once all the restructuring is done, it runs the pass again, so anything it can catch the second time, after the reordering has happened, it tries harder to find.

Now, target-specific optimizations: the CPU-type options -march, -mtune, and -mcpu. Look very closely at them when you are optimizing for your architecture; they can give you a lot of gain, as well as pain, depending on how you use them. -march tells the compiler which instruction sets you allow it to use; if it chooses the wrong instruction set, it might emit instructions that are not available on your processor. So it is very important to understand which architecture your CPU implements, very important in embedded Linux, where there is a plethora of different CPUs. -mtune tells the compiler how to tune for performance: all machines have different latencies for different instructions, and with -mtune you are telling it the latency characteristics of the particular processor you will run on. The compilers have all that data for scheduling, so processor-specific scheduling is used when you let it know which tuning you are after. And -mcpu tells it what features you have; for example, ARMv8 has the security and crypto extensions, so do you want to use them or not, and whatever extra processing units you have, you can enable with these options. They are not to be used every time; most of the time you can get by without them, but if you have performance-critical applications, they help a lot. And SIMD, as we discussed under auto-vectorization: use it as much as you can. ABIs are also able to take advantage of these units nowadays. On ARM, for example, the ABIs differ in how you pass your function parameters, so you can use the hard-float ABI, which uses the vector floating-point registers for passing parameters; if you have, say, audio-related applications, you can pass floats and doubles in those registers.

On target ABIs, I just made a small note here: there are some obscurities, especially in MIPS. At least in the old ABI, if you are using position-independent code (PIC), calls are very expensive, so additional ABI work has been done; -mplt, for example, helps reduce a call from three instructions to two, because you are letting the compiler know you have local PLTs. You can explore more target-specific options using gcc --target-help. I did not find a similarly nice way in Clang to dump the individual options; maybe there are ways, but there was no ready-made option I could use.
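A minimal sketch of those options on ARM (the flag names are real; the CPU and architecture names are examples you would swap for your own hardware):

```c
/* Typical invocations:
 *   gcc -march=armv8-a+crypto ...   allow ARMv8-A plus crypto instructions
 *   gcc -mtune=cortex-a53 ...       schedule for Cortex-A53 latencies
 *   gcc -mcpu=cortex-a53 ...        roughly -march plus -mtune for that core
 *   gcc -mfloat-abi=hard ...        hard-float ABI on 32-bit ARM
 *   gcc --target-help               list all target-specific options */

/* With the hard-float ABI, x arrives directly in a VFP register
 * instead of being copied through integer registers. */
double half(double x)
{
    return x * 0.5;
}
```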
And here I am pointing you to a documentation subsection that documents, for every supported architecture, the architecture-specific optimizations. Take a look at that for whichever architecture you are building.

Next, built-in functions. Built-ins are provided to use specific instructions that the compiler is not otherwise able to schedule. Generally, when building your code, the compiler only uses a subset of the instruction set; but when you know that your algorithm involves, say, particular multiplications, you can use built-ins to do those calculations effectively. What they do is use the specific instructions your architecture provides for those computations. They are all documented really well, but what I saw is that they are different for different compilers and different for different architectures. So if you are using them, know what you are doing, and do not use them without any guards: put guards around them if you use them. Sometimes the compilers also use built-ins themselves; if the compiler does it, you are okay, because it knows which architecture you are targeting and will do the right thing. But if you are explicitly using one in your code, you have to be aware that the particular built-in you are using may not work on MIPS, or may not work on PowerPC, for example.

Now, unsupported GCC extensions in Clang. I am highlighting this because Clang is a newer compiler and does not support everything that is in GCC, and you will find that existing code that builds with GCC may not directly compile. Some issues are just that Clang catches more errors that were not caught by GCC, but additionally it can be that the extensions have not been implemented. There is actually a section on the Clang site that lists a lot more; I am just listing the ones mostly relevant to a lot of software. First, variable-length arrays in structs. This is a typical case found in glibc and in the Linux kernel, and the Clang community will not implement it, because VLAs in structs are not supported by any standard, so they do not want to implement it. If you have that kind of code, you have a problem to fix, but there are ways to fix it too: for example, a trailing array in a struct whose final dimension is zero is accepted, whereas the way they are used in glibc and the kernel is a little different and more extensive. __builtin_apply is another one. Nested functions, again, I have seen used in glibc, and in several other components I have seen them too; they are not supported in Clang because they are not part of any C standard. __builtin_va_arg_pack I have not encountered myself, but the Clang website lists it as unsupported, so I believe they must have run into it somewhere. And forward declaration of function parameters: this is very commonly found. What you do is declare an integer parameter and then use it in a later parameter; it is what you do when you are doing the VLA stuff, so numbers one and four here are tightly connected. This is not accepted by Clang. So many times there were questions: this works fine, and suddenly Clang is complaining, something must be wrong; but it is intentional that they do not want to support it. There are other things listed there; some things are intentionally not supported, and some may get supported depending on what use cases are found. Yes, a question? No, it is the forward declaration of a function parameter; that is what it is. You are declaring a function parameter, the function takes that parameter, but you declare it first and then use it in the real parameter, which is your array; that is where you use it. It is a GCC extension; it is a smart thing, but it is not standard.
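A minimal sketch of that extension next to a portable alternative (the function names are made up):

```c
/* GCC accepts this parameter forward declaration; Clang rejects it. */
void fill(int n; int buf[n], int n)
{
    for (int i = 0; i < n; i++)
        buf[i] = 0;
}

/* Portable C99: just declare the size parameter first. */
void fill_portable(int n, int buf[n])
{
    for (int i = 0; i < n; i++)
        buf[i] = 0;
}
```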
So I think, in summary: help the compiler, and it will help you; it is give and take. Evaluate the impact of optimizations: one optimization may not work on a different workload, and there is no one way of doing it all the time, so always evaluate the impact, and know your cache sizes, your architecture, and your latencies. And profile your code before you optimize; data is king. If you collect the data on how your application is doing, you will find that most of the time what people say, oh, it is the string functions, may not be the string functions; it may be something else, because in the embedded space it might be one of your own functions causing it and not a string function. So it is very important that you have that profiling information, identify your ten hot spots, and optimize just those; you will get a lot more, and applying all these techniques there will have much more impact than a global-level change, like switching from -O2 to -Os and things like that. Those are very disruptive; you have to measure your workload. Writing portable code is good: we saw so many cases where portability can become a problem and you may not be able to take advantage of various tools. And do not over-optimize. A common pitfall many people fall into is over-optimization, which in compiler terms is actually called pessimization: you wrote such clever code that the compiler does not understand it anymore, so it says, okay, I am going to do the best I can. The compiler always falls back to very conservative code generation when it does not understand the code: it has to identify patterns, and if it does not, it falls back to the safest code it can generate, because correctness is more important to the compiler than anything else. It has to do the right thing, and whatever it concludes at that point, it cannot take chances. So keep that in mind: write code that the compiler can understand easily. That is all I had. I have some time for questions; actually, I am over time already, so we can take one question. If not, thank you very much.