Okay, let's get started. Thanks everyone for coming. Today I'm going to talk about some system optimization techniques, or best practices, that you can use especially on embedded Linux, and you can extend most of it to other embedded systems you might be working on. Primarily I'm going to cover GCC, and Clang to a certain extent — some things are Clang-specific, some are GCC-specific, but most of it is general and could benefit you. By the end of the talk you should have a good idea of how to interact with the compiler to get the maximum out of it.

So the agenda, as I said: the tools, some compiler optimization switches, data types — you can do a lot with how you arrange your data and help the compiler that way — and variables and functions, because you can do a lot with how you deal with variables and how you define your functions. Then a series of optimization tips that I'll share, and in summary we'll recap what we talked about, with some takeaways. Feel free to ask questions and start discussions if you have ideas; I'll share what I have, and you might have further effective tips — it would be great if you share those too. So please ask questions during the talk, otherwise we'll also talk towards the end.

Tools: know your compiler toolchain. On Linux we generally deal with GCC, and as developers we tend to assume our compiler is GCC, but it's important to know what your compiler actually is. In open source we have GCC-based toolchains and Clang-based toolchains, and then there are vendor toolchains. If you're doing embedded work, certain vendors are well known for providing proprietary compiler toolchains, so take a look at that when you work on an embedded systems project. Each of those compilers has different things to offer, and they document a lot of it in their manuals. In many cases they offer additional compiler-specific optimizations or constructs that might be beneficial for you. Go through those — very, very important — and you'll see why as we go through this talk.

The other thing is to understand the memory layout of your system. When you're working on a desktop you have pretty much a virtual memory map and you don't worry about it too much, but when you're doing embedded systems, your application might be distributed across different kinds of memory: NAND flash and other memory technologies will be in play, data will be placed in one area, read-only data somewhere else, and knowing all of that is very important. The way you see it is through the memory map: there are linker scripts that define the memory map, and you can look through those to see how the memory is laid out. That way, when you're doing your data partitioning and data access, you can write efficient access methods for those kinds of data. If you don't have a specific linker script of your own, that means you're using the default linker script provided by the toolchain — but you're still using one, so you can still dump it and look at what it's trying to do. It's common across all applications, so it will do the same things, but it's worth a look if you're interested.

The other thing, along the same lines, is to generate map files.
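As a quick sketch of both of those (plain gcc shown here — prefix with your cross-compiler triplet as needed):

    gcc -o app app.c -Wl,-Map=app.map   # ask the linker to write a map file
    ld --verbose                        # dump the default linker script in use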
If you generate map files, the linker will dump your code, data and BSS sections, which symbols are where and at what addresses — the map gives you a good view of what the total memory layout looks like.

Then there are these tools — I'm pretty sure you've been using them; if not, they're very effective for all this inspection of your binaries and objects, and I've listed a few of them here. They come primarily from the binutils package. objdump basically takes an ELF file — that's your output file; in fact not only ELF but other formats too, though on Linux we primarily use ELF. You can disassemble with it too: if you want to see what the compiler did for you, you can disassemble your code and look at the generated code corresponding to the source you wrote. It's very educational to look at that, and sometimes quite unexpected — you expected a certain piece of code to generate one thing and you see something different there. size is another utility: if you're interested in size measurements, you can use it to actively manage the size of your application; you could put it into your CI system and measure the size whenever a change is added. readelf again displays contents: if you want to look at your relocations and the other pieces — what the different sections are — you can use readelf, and it does symbol dumping as well. nm is a symbol lister; it shows you all the symbols and the addresses they're at. strip is a utility to strip away any debugging info and symbols — primarily because when you ship code on the target, you don't want to spend your precious flash on debug or symbol info unless you need it. Look at the various options it has to offer. And there are more tools: I've given you a few here, but the suite you get from binutils is quite a bit bigger than this. Besides that, there are additional tools you'll have for debugging purposes, and we'll talk about those as well.

Now, optimization levels. -O0 is just translation: you get a one-to-one translation and can see what the compiler does without any optimization — it simply translates your code from the higher-level language down to assembly. -O1 (or plain -O) does some optimizations without regressing debugging: speed improves a bit and your debugging experience stays nice. -O2 is a moderate level of optimization and the most commonly used in projects. Always look at what's being used in any project you work on — especially on embedded Linux you might be pulling in packages that set their own optimization levels, and some of them, for example media-centric packages, tend to use more aggressive optimizations; it's always good to know what they're doing. -Os is for size, and -Og is a relatively recent addition for better debuggability. If you look at these, they're high-level options that translate into collections of smaller optimization passes in the compiler.
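You can ask GCC directly which individual passes a given level enables — a quick sketch (these are standard GCC flags; Clang documents its own ways to inspect this):

    gcc -Q -O2 --help=optimizers > o2.txt   # each -f... option and whether -O2 enables it
    gcc -Q -O3 --help=optimizers > o3.txt
    diff o2.txt o3.txt                      # exactly what -O3 adds on top of -O2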
So -O1 or -O2 is really a collection of those optimizations, and you can look through the documentation — I have a link on a subsequent slide — where you'll see the hundred or so optimizations the compiler applies as a result. What you get is a curated subset that it applies for you. -Oz is also relatively new; it's essentially -Os with further, more aggressive size optimizations. What that means is you'll give up some performance: you're telling the compiler you care about size far more than anything else, so you're fine if the code isn't optimized for speed. -O3 is the counterpart on the performance side: if you want more performance, use -O3 — but be a little cautious with it, because it's very aggressive and it might break certain assumptions and certain behaviors the standards recommend. So when you use -O3, my recommendation is to have your code working well with -O2 first, then bump to -O3 and go from there. And there's another one called -Ofast: it's essentially -O3 plus optimizations that aren't strictly standards-compliant, fast-math being the typical one, and it exists on Clang as well. There's a link here, and it's a very handy link: go there and look at what all these optimization passes are and how they're grouped together to form these meta optimization options. I couldn't find an equivalent page for Clang, but they document each option, and there is a way to dump the optimizations that get applied at a certain level. I didn't include it here, but it's similar to what GCC has.

Other options you might find interesting are on the security side. For some of you it may matter, for some it may not, but be aware that you'll pay some penalty in the code: if you enable these options, they add extra checks, which improves your security but slows your code down. -fstack-protector is an age-old feature in compilers. It has various levels, and you can enable any of them — the most commonly used is -fstack-protector-strong — and there are more, like function-level stack protection. You're juggling here between the performance and the security you want. Then there's the _FORTIFY_SOURCE define. What it does is link your code against additional checking functions: if you have a memcpy, it will use the prototype of the __memcpy_chk function instead. These functions have additional checks — they validate your arguments and so on — so you get more warnings during the build and you can fortify your sources a bit more. And there are options for checking format strings, so you don't end up with format-string overflows. These are very handy options if you're doing security-related work or just want to fortify your source code a bit. And you can turn all these warnings into errors, so if somebody adds offending code later on, you know the build will fail. Very handy options.
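To make that concrete, here's a sketch of a hardened flag set (standard GCC/Clang spellings; remember that each added check costs some performance):

    gcc -O2 \
        -fstack-protector-strong \
        -D_FORTIFY_SOURCE=2 \
        -Wformat -Wformat-security -Werror=format-security \
        -o app app.c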
So here's a question for everyone: does the compiler support -O4? Any guesses? It does — but -O4 is essentially -O3. In fact you can specify -O99 or -O3000, it doesn't matter. So be aware that this number is not to be raised indefinitely to somehow force the compiler to compile better and better. Why do I bring this up? Because in the end the compiler is also software: it can do certain things, but it doesn't have an infinite amount of power that you can keep extracting more out of.

So, data types. We use integers, floats, all the basic types and complex types, but it's important — especially in the embedded space — to know your processor word size. The word size is the integer width your processor natively processes: it could be a 16-bit processor, or 32-bit, or 64-bit, and a standard C type may not map onto it one to one. Always try to use types that map onto the word size of your processor. The reason is that when you use types that are smaller or bigger, there's an extra penalty: the compiler has to do conversions — sign extensions, zero extensions — and all of those come into play as you operate on that data. You might also see different processors behaving differently on different data types: a RISC-based processor and a CISC-based one can have quite different characteristics. So it goes back to the know-your-processor-architecture argument.

Delegate to the compiler as much as you can. I know people have defined their own type systems: when you write a big piece of software, you have this one file called types.h where you define whatever type system you like. That was fine in the past, but C99 and newer have abstracted those bits and provide the facilities to you. You have the fixed-width types like uint32_t, the minimum-width types like uint_least32_t, and for the fastest representation, types like uint_fast32_t. If you use these, you really don't have to worry about the underlying processor architecture as much, because the compiler does the right mapping for you. And the portable fixed-width types come in 8-, 16-, 32- and 64-bit variants, so you get a good abstraction based on the width of data you want. Highly recommended — both GCC and Clang will thank you for it.

Variables and functions. I'll talk about const a little bit. When you qualify your data this way, it's additional help for the compiler: you're annotating your code with these keywords, telling the compiler this data is not going to change, so it can be more aggressive with its optimizations. There's immutability, and there's documentation — you're stating to whoever reads your code that this value isn't meant to change. And whenever it's used elsewhere, the compiler can give you better diagnostics about writing to a const or converting between const and non-const. So there's a lot of opportunity for the compiler to optimize if your data stays consistent. I highly recommend thinking, when you define your data set, about which data could be const — and ideally as much of it as possible would be const; that would be best. The more you think about it, the better it will probably be for your program, and the more you'll extract from the compiler's optimizations.
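A minimal sketch of both points (the CRC constant is just a made-up example):

    #include <stdint.h>

    /* const: may be placed in read-only memory, and the compiler can fold it */
    static const uint32_t crc_poly = 0xEDB88320u;

    uint8_t        flags;   /* exactly 8 bits, portable across compilers      */
    uint_least16_t id;      /* smallest type with at least 16 bits            */
    uint_fast8_t   counter; /* at least 8 bits, whatever is fastest here --
                               often a full 32-bit word on a 32-bit core      */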
Function parameters: know the ABI, especially the calling convention. It differs per processor architecture — they have different register sets, and therefore the compiler has different ways of passing parameters to your functions. I'm taking ARM as an example here. On 32-bit ARM, the first four parameters are passed in registers; if you have more than four, the rest are passed on the stack. And it's important to note that those are four word-sized slots — 32-bit words. Now suppose you want to pass a long long somewhere in between. Make sure the parameters are organized so there's no padding. What happens otherwise: say you have a parameter that's eight bytes long and another that's four bytes, and you place the eight-byte one after the four-byte one. Guess what happens? One register goes empty, because the eight-byte value has to be aligned to an even register pair: the first parameter goes in r0, r1 is left empty, the long long takes r2 and r3, and your next parameter spills onto the stack. But with a little reordering — pass the long long first so it lands in the first two registers, then the four-byte value goes into the third register — the fourth register is still available, so your next parameter is passed in a register as well. So keep an eye on alignment whenever you're laying out function parameters.

Floating point has its ABIs too. On ARM, for example, there are three floating-point ABIs: soft, softfp and hard. This is about how floating-point values are transferred through function parameters. When you have a vector floating-point unit on your system, you want to take advantage of it, so it's important to understand what your chip has and enable the right floating-point ABI. In many cases you may not be able to pass parameters in floating-point registers, but you still might have NEON or a similar SIMD vector floating-point unit. With softfp you're telling the compiler: once the data is in, you can use the vector floating-point unit, but I cannot pass parameters in floating-point registers. You still get some performance boost, especially in loops and with loop unrolling, but the parameter passing still happens through the integer registers and the stack. With hard float you get both: the processing happens on the floating-point unit and the parameters are passed in floating-point registers too, so there's no copying back and forth between the general-purpose registers and the vector unit. And soft means you don't have any floating-point unit at all: all floating point has to be emulated in software. Different architectures have different considerations here, so always refer to the TRMs for your hardware to find this out.

Avoid globals and static data in loops. The reason is that with global data, the compiler can't keep the value cached in a register: it has to assume someone else might observe or change it, so on every iteration of the loop it goes out to the real memory location to read or write it. Be aware of that fact — if you're accessing global or even static data and looping over it, that loop will be slower.
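A sketch of the usual workaround, with a hypothetical global called threshold: copy it into a local first, so the compiler is free to keep it in a register for the whole loop.

    extern int threshold;   /* hypothetical global, for illustration */

    int count_above(const int *buf, int n)
    {
        int t = threshold;  /* read the global once, outside the loop */
        int count = 0;
        for (int i = 0; i < n; i++) {
            if (buf[i] > t) /* 't' can now live in a register */
                count++;
        }
        return count;
    }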
Use volatile only when really needed. By using it you're telling the compiler: these accesses must not be rearranged or optimized away — I want this exact sequence. Usually you do that when accessing hardware registers and the like. But don't use it just to get the compiler out of your way — you wanted to do something, the compiler was aggressive and optimized it away, you put in a volatile, and suddenly it works again. What you've done there is trample on the compiler rather than fix the problem the right way. So be aware of what volatile is actually for.

Similarly, if you have function calls in loops, check whether the function gets inlined. If it's inlined, fine. If not, you're incurring a function call on every iteration, which disturbs register and cache usage — all the nice things about your loop's cache behavior go out of the picture. So make sure loops call small functions that can be inlined, or otherwise avoid calls in them entirely.

The other thing is attributes. There's a whole bunch of them, and we could spend all day explaining each one, so I've just given you a pointer here — read through them. Attributes are very, very helpful for expressing what you want to do with your data. You want an alignment? There's an attribute for that, so you can place data at a certain alignment boundary. You want to check types, or do various things with your data representation? You want to let the compiler know a particular variable is used, so it shouldn't be optimized away? You can add an attribute for that. You help the compiler by doing these things, and you also get your program to behave correctly at higher optimization levels. GCC has function and variable attributes, and compiler attributes in GCC and Clang are pretty similar — you might find a few differences; Clang has a few additional ones, GCC a few different ones, but overall there's a very big common set. So what do you lose when you use them? Portability. Make sure that when you use attributes, you conditionalize them with compiler defines: if you're on Clang, use this; if not, use something else. If you're worried about portability, that is important.
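A minimal sketch of that, assuming GCC or Clang (the buffer and tag are made-up examples); in the final #else you could also put an #error if you'd rather be told loudly about an unknown compiler:

    #if defined(__GNUC__) || defined(__clang__)
    #  define ALIGN32 __attribute__((aligned(32)))
    #  define KEEP    __attribute__((used)) /* keep even if unreferenced */
    #else
    #  define ALIGN32 /* unknown compiler: fall back to no attribute */
    #  define KEEP
    #endif

    ALIGN32 static unsigned char dma_buf[256];    /* aligned for a hypothetical DMA engine */
    KEEP static const char build_tag[] = "rev-1"; /* not optimized away if unused */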
Okay, so, some optimization tips. Create baselines. Many times we go off to optimize, we're excited about it, and we don't set a target for ourselves or even a starting point — we just say, well, it was slower, now it's faster. It doesn't work that way: when you go optimizing there's always low-hanging fruit and you will certainly improve things, but then it gets tougher and tougher, and you have to stop at some point. That's why it's important to establish your baseline, set your target, and march towards it.

Find good measurement tools. Optimization needs a lot of measurement, so find good tools: there are good profilers, debuggers and tracers out there that can give you a lot of information about the nature of your code — put them to use. You can also augment your compilation with other tools: if there's an additional tool that gives you useful information — even a proprietary one — go for it, that's fine. And experiment: once you have all this in place, you can make changes, take measurements, and look into the code — if I do it this way, what does the compiler do? This gives you a proper framework to dig in and see the actual effect, instead of going haywire thinking, oh, I believe it runs faster now because I changed the data structures.

Consider portability. Porting matters if you envision your software running on different platforms — and it always ends up that way. You may not start out like that, but generally you'll have, say, a different product line where you want to reuse the code, so it helps a lot with code reuse. And follow standards: the compiler has options to enforce standards in your code. For example there's C99, and there's GNU99, which is C99 with GCC extensions; if you want to work across compilers — you want to use Clang and others — just use plain C99 for best results. It's easy to give up portability: in many cases you're under pressure to deliver the product and don't have the time to do things in a very portable way. But it does come back to you eventually. So look at what compilers offer that can still be handled portably. For example, pragmas are very helpful, and you can have compiler-specific pragmas: if you write #pragma GCC optimize ("unroll-loops") — that's the example I have here — then under Clang it's effectively a no-op. So what you can do is an #if/#else chain: if GCC, do this; else if Clang, do that; and in the final #else, error out. Then if you're suddenly building with a third compiler, your code won't compile, and you know you have to do something about this pragma that only works on the other compilers. So there are ways to manage reasonably portable code there.

Okay, so which one is better — what do you think? This one is a ternary operator and this one is an if/else statement. Correct: they're the same. Many times you'll hear that on some architectures or some compilers it may not be the same, but the point remains: don't assume that smaller source code generates smaller machine code. The ternary operator here generates exactly the same amount of code the if/else generates — and in some cases it generates worse code. So always measure; look at what's actually generated from your code. That's what matters here.
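A reconstruction of that comparison — the slide's exact code wasn't captured, but this is the shape of it; at -O2, GCC and Clang typically emit identical code for both:

    int max_ternary(int a, int b) { return (a > b) ? a : b; }

    int max_ifelse(int a, int b)
    {
        if (a > b)
            return a;
        return b;
    }

Check for yourself with gcc -O2 -S and compare the assembly of the two functions.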
Anyway, moving on: optimizing for stack. Stack size is very important — know what your stack size is. Default applications are assigned certain stack limits, but in embedded systems you define your stacks yourself, so know what they are; they're usually small in embedded systems, and you have to help the system operate within that limit. Large arrays, lots of local variables, deep chains of function calls — they all end up eating a lot of space on your stack, because every call adds a frame. So be aware that whenever you use those kinds of constructs, there's going to be pressure on your stack.

There are some tricks here, in GCC as well as in Clang — for example what's called tail-call, or sibling-call, optimization. What that means is: if you have a function foo that calls another function bar at the end, with no dependency on it — you're not post-processing bar's output when it returns — then the compiler can arrange the code to go into foo, go into bar, and return from bar straight back to foo's caller. You avoid some of the stack that would otherwise be used for the call to bar. So there are these cool optimizations you can help the compiler with, by writing code so that it qualifies for things like tail-call optimization.

If/else: most of our code is if/else spread everywhere. Put the most likely code on the hot path. If you have if / else if / else if chains, make sure the most likely case is at the top, so you avoid the checks further down and it's only in the rare conditions that the full set of checks has to run. How do you get that information? You do some profiling of how your application behaves — collect some profile-guided-optimization data and it will show you your hot path; then rearrange your code that way. Also, compilers optimize switch statements really well, so if you have simple conditions, try converting them into switch cases. Compilers do that themselves, by the way: when you write simple conditions, internally the compiler will translate them into a switch form if it can. But if you do it yourself, you know the code, and the compiler doesn't have to do the work for you. And remember that the compiler has to be very pessimistic and very conservative: if it thinks it can't safely do the right thing, it drops that optimization and moves on. So any hint you give arms the compiler to optimize your code better.

Then there's tail-call recursion elimination, which, as I discussed a bit earlier, helps you generate better code. Tail recursion is interesting. Take a factorial function. In the first case you're processing the output: you call factorial and then multiply the result by x, so the compiler will not be able to do the tail-call optimization. In the second case you return the recursive call directly — you're not manipulating the return value — and as a result the compiler can apply the tail-call transformation we talked about, and that code will be much, much faster than the code above. These are simple transformations that help the compiler.
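A reconstruction of that example, since the slide's code isn't in the transcript — the shape is the standard one:

    /* Not a tail call: the recursion's result is multiplied afterwards,
       so each call needs its own stack frame. */
    long factorial(long x)
    {
        if (x <= 1)
            return 1;
        return x * factorial(x - 1);
    }

    /* Tail call: the recursive call is the last thing that happens, so with
       optimization enabled, GCC/Clang can turn this into a plain loop. */
    long factorial_tail(long x, long acc)
    {
        if (x <= 1)
            return acc;
        return factorial_tail(x - 1, x * acc);
    }

Call it as factorial_tail(n, 1); the accumulator carries the partial product.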
So, in summary: help the compiler and it will help you. As I said, the compiler has to be very conservative — if it finds an optimization it can't prove safe to apply, it won't apply it. And remember it only has a certain amount of time: it can't compile forever, otherwise you'd complain about slow compilation. It operates under those constraints, and if an optimization turns out to be very expensive, it's going to drop it.

A do-while is better than a for loop, because the loop condition is tested at the bottom: the compiler doesn't need an extra test before the loop is entered. If you look at Linux kernel code, or generally any other well-established code base, you'll see they use a lot of do-whiles — in fact there are macros for writing those loops with do-while, so they don't even leave it up to you.

And we talked about the pragmas and the function attributes: annotate your code. In many cases the compiler may not be able to detect that a piece of code can be vectorized; if you tell it, with a pragma, that this particular piece of code is vectorizable, that's a hint, and the compiler will try harder on that particular piece of code, or on that subsection of a function. So annotate your code. And use intrinsics when possible: the compiler provides a lot of intrinsics, and they're well documented. Use them as much as you can instead of writing your own functions for the usual manipulations — they give you more optimized versions of those.

Avoid separate release and debug modes. This is very common: you use one optimization level while developing and a different one when you release your code. It's not a good idea if you want to be able to debug the same code that's in the field. So choose one good option set that you can both debug and deploy; keeping two different streams will give you a lot of trouble.

And we talked about knowing your system: the processor architecture, how wide its buses are, the DRAM — how much RAM you have — what kind of flash, your clock speeds. This will, again, help you design your system well. Profile before optimizing: many times you look at a slow program and think you know where it's slow — most of the time you're wrong. Profile it before you optimize; very, very important. Use gprof or other profilers out there, find your top ten hot spots, and go after those, because most of the time the bulk of the load is concentrated in them. Delegate to tools — we talked about that: use tools as much as possible, delegate to the compiler, don't try to overstep it. And inline assembly is nice, but it can be problematic when you have different compilers — portability is always a consideration — so use it judiciously, I would say.
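As one example of leaning on the compiler instead of hand-written bit twiddling or inline assembly, a small sketch using a builtin that both GCC and Clang provide:

    /* __builtin_popcount maps to a single instruction on processors that
       have one, and to an optimized library routine where they don't. */
    int bits_set(unsigned int v)
    {
        return __builtin_popcount(v);
    }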
So I think with that, that's pretty much what I had. Thank you very much — any questions, any comments? We still have a couple of minutes.

Yes — very good question. The question is that when we optimize for different compilers, portability and optimization are always at two different ends of the game, and that's very true. What you can do is use the compiler defines: you can use them to isolate compiler-specific pieces that optimize your code without affecting the portability of the rest. That's the compromise you can strike between the two. And there are the approaches we talked about initially — where I didn't say one compiler would do better than another — those are common constructs that help you across all kinds of compilers. Then there are certain pieces that give you extra bang for your buck, but at the cost of portability. So there are still ways to manage that, even though you have to add some additional code. Okay, any further questions?

Yeah — yes, yes. So I think what you're asking is whether you can keep compiling across all of your files. There's the bitbake -k switch, and there's a similar one for make, so you can set it in your makefiles. It's not a compiler option, because the compiler only sees one file at a time: it will either error out on that file or compile it successfully. But if you're compiling a sequence of files, you have to tell your make system to keep going. So there are switches for make and the other build systems to do that. Okay, thank you very much.
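For reference, a sketch of those keep-going switches (the image name is just an example target):

    make -k                         # keep building other targets after an error
    bitbake -k core-image-minimal   # BitBake's equivalent (-k / --continue)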