We are always looking for better performance, and at the software level that means getting the best optimization possible. So how can we do that? The topics I will cover are compiler optimizers, the overhead of compiler optimizations, what we should keep in mind while writing our code to get better performance, and some profiling tools, so that we can measure the performance of an application.

Moving forward: since I am going to start with compiler optimizers, before I talk about specific optimizers let us look at the phases of compilation and how it works. When we write high-level language code, we compile it into an executable file. So what happens in between, from your high-level code to the executable? There are various phases during this process. First, the compiler converts your code into an intermediate representation which is independent of your machine architecture. On that representation it does the first level of optimization; there are many passes in this first level alone, and it makes a lot of changes to get better performance, but only if you pass the optimization flags to the compiler. Then it converts your code into a machine-dependent intermediate representation; in GCC this is the RTL representation. I will not get into the specifics of these intermediate representations, but RTL is the machine-dependent representation you get when compiling with GCC, and the second level of optimization happens there; after that you get your assembly code. It is not included in the diagram here, but after the assembly stage the assembler comes in, which converts the code further into object code, and then there is link-time optimization as well.
So there are three levels at which optimization can happen. Now, talking about the first level of optimization, I will go through some examples.

First is common subexpression elimination. As you can see, we have two expressions here, k and r, and the subexpression i + j appears in both, while the values of these two variables i and j do not change between the two statements. So we can eliminate the repetition: compute i + j once beforehand, save it in a temporary, and use that temporary in both expressions. That way the i + j computation is done only once, not twice.

Second is function inlining. Here we have a function square, which does nothing but take one argument, multiply it by itself, and return it; it just calculates the square of the argument passed to it. Then from some other function, say main or any other function defined in your program, we have a for loop iterating over a huge range; in each iteration it adds 0.5 to the iterator, squares that, and adds it to a variable sum, accumulating everything. Whenever there is a function call, at runtime a new stack frame has to be created and the control flow has to jump to a new address, the address of the square function, so there is a lot happening. To avoid that overhead we can inline the function where possible: within the loop itself we do sum += (i + 0.5) * (i + 0.5). But then we are computing i + 0.5 twice, so we take it out into a temporary, compute it once, and then do sum += t * t on that stored value. So we inline the function to avoid the overhead of calling another function.
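As a rough sketch in C (the exact expressions assigned to k and r were not spelled out, so those right-hand sides are illustrative), the two transformations look like this when done by hand:

```c
/* Common subexpression elimination, before: i + j is evaluated twice.
   The right-hand sides for k and r are illustrative, not from the slide. */
int cse_before(int i, int j) {
    int k = (i + j) * 2;
    int r = (i + j) + 5;
    return k + r;
}

/* After CSE: i + j is computed once and reused. */
int cse_after(int i, int j) {
    int t = i + j;
    int k = t * 2;
    int r = t + 5;
    return k + r;
}

/* Function inlining, before: a call (stack frame + jump) per iteration. */
static double square(double x) { return x * x; }

double sum_with_calls(long n) {
    double sum = 0.0;
    for (long i = 0; i < n; i++)
        sum += square(i + 0.5);
    return sum;
}

/* After inlining, with i + 0.5 hoisted into a temporary so it is
   computed once per iteration instead of twice. */
double sum_inlined(long n) {
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        double t = i + 0.5;
        sum += t * t;
    }
    return sum;
}
```

With -O1 and above, GCC applies both transformations itself; writing them by hand here is only to show the shape of the rewrite.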
Then we have loop unrolling. Inside the loop here we have an array y, and we are assigning to each element a value equal to its index. It is pretty simple and straightforward: for every index of the array, we assign that index as the value. So instead of looping over it, we can write the assignments out directly; obviously the array is not very long here, so we can write it that way, and even if we do not, the compiler will do it for us when this optimization is enabled. What happens is that in every iteration you avoid the increment and the bounds check (the initialization happens only once anyway).

Next is loop-invariant code motion. In this example we have a for loop, and inside it we have x = y + z, but the value of y + z does not change anywhere inside the loop. Then we do the computation a[i] = 7 + i * (x + x), and x is also not changing. So the x = y + z statement has no role inside the loop; we can take it out, and since x does not change within the loop, we can hoist the x + x part as well. Whatever is not varying inside the loop can be taken out, so the computation happens only once and we are not repeating it in every iteration. Of course, we can only do this kind of thing when the variables do not change.

Then we have loop peeling. In loop peeling we try to restructure the loop so that it can run in parallel; the idea behind it is to parallelize the computation.
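Both loop transformations can be sketched as below; the a[i] expression is reconstructed from the spoken description as 7 + i * (x + x), so treat it as illustrative.

```c
/* Loop unrolling, before: increment and bounds check on every iteration. */
void fill_loop(int y[5]) {
    for (int i = 0; i < 5; i++)
        y[i] = i;
}

/* After full unrolling: the loop control overhead is gone. */
void fill_unrolled(int y[5]) {
    y[0] = 0; y[1] = 1; y[2] = 2; y[3] = 3; y[4] = 4;
}

/* Loop-invariant code motion, before: y + z and x + x never change
   inside the loop, yet both are recomputed every iteration. */
void licm_before(int a[], int n, int y, int z) {
    for (int i = 0; i < n; i++) {
        int x = y + z;
        a[i] = 7 + i * (x + x);
    }
}

/* After hoisting the invariant computations out of the loop. */
void licm_after(int a[], int n, int y, int z) {
    int x = y + z;
    int t = x + x;
    for (int i = 0; i < n; i++)
        a[i] = 7 + i * t;
}
```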
Now, if we talk about compiler optimization levels: there are three main levels, enabled with the flags -O1, -O2, and -O3, and there are many optimizers at each level, some of which I am going to show with examples. At the first level, the optimizations happening are mostly machine-independent; at the second level they are a combination of machine-independent and machine-dependent; the third level is pretty advanced and again mostly machine-independent.

Coming back to loop peeling: this is more of a machine-dependent kind of optimization, because we do it to parallelize the loop. We try to peel off the iterations that make parallelization impossible. In the current state of the loop, the assignment into the y array at one index can only happen after p has been assigned the previous i, so each iteration depends on the one before. Say, for example, we have an array of size 20: each iteration uses the current index and its previous one, with p initialized to 20 and two variables being maintained. If we take just the first computation out of the loop, the remaining body can be written using only one variable. That makes parallelization possible, because each remaining iteration depends only on that one variable, and since there is only one statement, no statement needs to wait for another to execute before the next one can.

Then we have dead code and dead store elimination. In this example we have a variable called b which is not actually used anywhere, and if we look at the code, we return c with no condition attached, and even after that there are two statements. Those are unreachable, so we can remove them directly; in a further pass the compiler can also recognize that b is never used anyway, so even that assignment can be removed.
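A minimal sketch of the dead-code case just described; the variable names b and c follow the talk, while the surrounding statements are assumed for illustration.

```c
/* Before: b is a dead store (never read afterwards), and the two
   statements after the unconditional return are unreachable. */
int dead_before(int a, int c) {
    int b = a * 2;   /* dead store: removed by dead-store elimination */
    return c;        /* unconditional return */
    a = a + 1;       /* unreachable: removed by dead-code elimination */
    b = b + 1;       /* unreachable */
}

/* What the compiler effectively leaves behind. */
int dead_after(int a, int c) {
    (void)a;         /* a no longer contributes anything */
    return c;
}
```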
Then we have the copy propagation transformation. Here there is a variable y to which we assign the value of x, and then we do the computation 3 + y. The copy y is pretty redundant; we could have directly said 3 + x, and that simplifies the operation.

Then we have jump threading. Here we are checking two conditions: first whether a < 0. If that comes out false, obviously a is going to be >= 0, so we did not need the second if test; it could have been avoided with a plain else. These kinds of things can be avoided, but when we are writing the code we miss them, so it is better to be careful while writing the code itself; if we are not, the compiler optimizers will get these kinds of optimizations done for us.

Then we have loop vectorization. This is again a machine-dependent kind of optimization, wherein the plain, simple array loop is split into vector operations. For this kind of thing to happen your architecture needs SIMD support; only then does it make sense. You break the whole computation into vectors so that, say, four lanes can run in parallel, and that is what happens in loop vectorization, but it is helpful only when your system has support for it.

Then we have peephole optimization. It again works as a machine-dependent optimization, on the machine-dependent intermediate representation, so only after your code has been lowered toward assembly does it try to see if there are any redundant instruction sequences which can be removed.
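Before moving to the peephole example, here is a sketch of the copy propagation and jump threading cases just described:

```c
/* Copy propagation: y is only a copy of x, so 3 + y becomes 3 + x. */
int copy_before(int x) {
    int y = x;       /* redundant copy */
    return 3 + y;
}

int copy_after(int x) {
    return 3 + x;
}

/* Jump threading: when a < 0 is false, a >= 0 is necessarily true,
   so the second test can be threaded away. */
int sign_before(int a) {
    if (a < 0)
        return -1;
    if (a >= 0)      /* always true when reached */
        return 1;
    return 0;        /* unreachable */
}

int sign_after(int a) {
    if (a < 0)
        return -1;
    return 1;        /* the redundant test is gone */
}
```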
So here we have a = b + c and then d = a + e, and there are move operations happening: b is moved into r0, c is added to r0, and then r0 is moved into the variable a. Then, for the second operation, a is first moved back into r0; but in the previous instruction we already moved r0's value into a, so r0 already holds the same value and we are just overwriting it with itself. That instruction is redundant and could be removed, and that is what is happening here.

I have drawn this chart based on a very simple example, where I was summing the square of every value over some 10,000 iterations. With -O0, which is with no optimization, it took this much time; with -O1 it comes down significantly, and -O2 and -O3 gave me pretty much the same result. That was a very simple program; just consider that our software is usually huge, and there, if you don't do the optimization, it makes a significant difference. That is when you use compiler optimizers.

But the downside is this: with optimization enabled there are so many optimization passes happening (I just gave you examples of a very few; there are many more of them), so it takes longer to compile. And we always have bugs in our code, so when you hit a bug or some unlikely, undesirable situation, you tend to load the program in gdb and debug it. At that point it happens that you have optimized code: you try to get the value of a variable in your program and it says the value is optimized out, so you don't really get to see it, and it becomes a little difficult to map what you see back to your source code. You can still do it by looking at the assembly code, if you are comfortable with that, but mapping it directly to your source becomes a little tricky while debugging. So that was about what happens when you use the compiler
optimizers. Now, having looked at some of these examples, what are the things we should take care of when we are writing our code? The main point is that loads and stores are very heavy operations, so knowing your memory hierarchy and how it works comes in very handy for writing optimal code and getting better performance. We know registers are part of your processor; then comes the cache, then main memory, and obviously your program ultimately resides on disk. When you run it, the first three levels come into the picture. The size is smaller at the top of the hierarchy but the speed is higher, so with that size limitation you should write your code to make the most use of registers, and you should try to access your data in a way that gives you mostly cache hits rather than misses; I will come to that point with a simple example. Accessing anything from memory is very expensive in terms of time, so you should see how you can minimize that.

Here I have an example for that: a swap operation written using two methods. In the first method I just initialize a temporary variable, save into it the first variable passed in, and then do the swap operations. The point is that I am introducing a new variable, and that new variable gets pushed onto your stack, so in a way you are accessing your memory here, making one extra memory reference. In the second method you are not creating any extra variable; you just do XOR operations to exchange the values, so you still have only the two memory references, and the exchange happens through arithmetic. In terms of memory accesses this one avoids the extra reference the first method makes. These are the kinds of things we can think of when writing code: what are the alternative ways to do the things we want to achieve, and which of them will fetch the best performance.
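The two swap methods can be sketched as below. One hedge worth adding: with optimization on, the temporary in the first version usually lives in a register rather than on the stack, so the XOR version is not guaranteed to be faster in practice; the comparison here is only about counting memory references, as on the slide.

```c
/* Method 1: swap through a temporary -- one extra stack variable,
   hence (in the slide's accounting) one extra memory reference. */
void swap_tmp(int *a, int *b) {
    int tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Method 2: XOR swap -- no temporary; the exchange is pure arithmetic
   on the two locations.  The aliasing guard matters: if a and b point
   to the same object, the XOR sequence would zero it out. */
void swap_xor(int *a, int *b) {
    if (a != b) {
        *a ^= *b;
        *b ^= *a;
        *a ^= *b;
    }
}
```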
Then comes the profiling part. For any memory profiling you can use valgrind; for time-wise profiling, as in how the program is doing in terms of time, you can use perf. For memory profiling in the performance sense, I would mostly be checking cachegrind, to see how many cache misses and cache hits are happening and what I can do to make it better. You run it as valgrind --tool=cachegrind and then specify your executable.

Here I have a small example. Say you have a multi-dimensional array with r rows and c columns, and you access it with i, the row index, first and the column second, with the column getting incremented in the inner loop. When you create a multi-dimensional array, the elements of a row sit in consecutive memory locations: this element comes first, and in the next memory location the next one goes. So when you read the first element from this array, the chances are that the hardware will pull in as much of the following data as the cache capacity allows and put it into the cache, so that in the next iteration, whatever data you are reading, you get a cache hit and don't need to read it from memory. Whereas in the other version, the inner loop iterates over the rows, so you are reading down a column first and then moving across: the read operation is happening in that manner. You read the first element and it brings in data, say up to the cache's capacity of c elements of that row, but in the next iteration you are reading from a different row, so there is a cache miss; again it pulls in that row's data, and again the next iteration misses. So on every read you are having a cache miss, and for every miss two things happen: the cache has to be updated, and ultimately, to get the data, it still has to be read from memory anyway.
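A minimal version of the access-pattern example (the array name and the 512x512 size are my own choices for illustration):

```c
#define ROWS 512
#define COLS 512

static int grid[ROWS][COLS];

/* Row-major traversal: the inner loop walks contiguous memory, so each
   cache line that is pulled in gets fully used before moving on. */
long sum_rows_first(void) {
    long s = 0;
    for (int i = 0; i < ROWS; i++)
        for (int k = 0; k < COLS; k++)
            s += grid[i][k];
    return s;
}

/* Column-major traversal: each access jumps a whole row ahead, so
   nearly every read misses the cache. */
long sum_cols_first(void) {
    long s = 0;
    for (int k = 0; k < COLS; k++)
        for (int i = 0; i < ROWS; i++)
            s += grid[i][k];
    return s;
}
```

Running this under valgrind --tool=cachegrind and comparing the D1 miss counts for the two functions makes the difference concrete.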
All of that is overhead and extra time, and you can minimize it simply by looking at the order in which you access your array elements to get better performance.

For the perf part, running perf and viewing the report gives you a record of what is happening in the user space of your program. Here, with the same example that I mentioned earlier, I get this output: the maximum CPU time is taken in this function, sum. So if I am not very happy with the performance or the overall time of the program, since the maximum time is going into this function, my first attempt would be to open the code of the sum function and check what is happening there and what I can do to get better performance.

Then the GCC optimizations: you can use the command gcc -Q --help=optimizers to get the complete list of optimizers that are available with the compiler, and with gcc -Q -O2 --help=optimizers you specifically get which of all those optimizers are enabled at the -O2 level; similarly, instead of -O2 you can use -O1 or -O3 to get a list of what is enabled at the different levels. And there is an equivalent LLVM command to get the list of optimizers with LLVM. That's all. How much time is left? Sorry, five minutes. Okay, any questions so far?

You can see here the list of optimizers, with a one-liner explanation of what each optimizer does. One thing to note here is that even with no optimization, that is -O0, when you are not really optimizing your code, some of the passes are still enabled implicitly: you see here it says "enabled" for some of them. So even the binaries we have in Fedora, RHEL, or any other distro, which are mostly non-optimized binaries, can behave this way: when you are doing some kind of debugging, you sometimes try to print some value and it says it is optimized out, because some of
the optimizers are enabled by default. So, if I want to see what happens when I am optimizing my code while compiling with GCC: right now I have these files here, dc.c and dc.o among them, and I compile just to an object file with the dump option (is the font fine? okay). You will notice that certain files get generated, and the compiler numbers them for you: there is dc.c, the original, and then the GIMPLE dump; GIMPLE is the first-level intermediate representation in GCC, and the number in each file name tells you which pass happened after which. So this is the first one that happened, and with the numbering you can trace the order; in between there are some passes which were probably not triggered with the flags I passed, since there are so many passes and they are brought into the picture only with certain flags you use during compilation. So with no optimization flags in place, -fdump-tree-all gave me just the passes that do happen; if I enable optimization, the optimization passes happen here as well, and it gets you the output of all of them, so you can trace back what the code was at one level and what changed between two passes. With GCC it is fairly simple to track down what kind of change is happening in which pass. And if you noticed earlier, the --help=optimizers command gives you the names of the optimizers, such as -fdce; here in the output you will see some file names ending in "dce", and that is how you know a file is the output of the dead-code-elimination optimizer, so you can track down what changes are happening at each pass of the compilation. That was -fdump-tree-all, which gives you the dump output of all the first-level optimizers, the machine-independent ones. There is also something called -fdump-ipa-all, with which you get the output of the interprocedural (link-time) passes, and something called -fdump-rtl-all, with which you get
the dump of all the machine-dependent passes, all the changes that are happening at that level, so it becomes easy for you to track down and see all the changes. Thanks, everyone. Any questions? If you have questions, we can take them offline outside. Okay.