Thank you, ladies and gentlemen. My thanks to my able assistant there, and to the wonders of rebooting Linux. So: we're a compiler and operating systems company these days, but we're best known as a compiler company, and we're doing a lot of work on RISC-V, both GCC and LLVM. Our particular speciality is deeply embedded systems, and for deeply embedded systems you want your code to be very compact, so my focus is on how we make the compilers generate very compact code: -Os, and if you're on LLVM, -Oz as well. You can only improve something if you can measure it, and this talk is about measuring the size of compiled code for RISC-V. It's to help compiler designers. So this talk is about seeing inside the compilers: looking at what's working and what's not, investigating it and understanding it. One thing I want to be clear about is that I'm not trying to tell you which is the best processor or the best compiler.

We've looked at three architectures: the DesignWare ARC HS, the Arm Cortex-M4 with the Thumb-2 instruction set, and RISC-V RV32IMC. Those are all 32-bit architectures, they've all got 16-bit short instructions, and in all cases we've been looking at them without hardware floating point. The DesignWare ARC HS is a bit out of place there; we really ought to have looked at their EM architecture, which is their embedded architecture. The reason it's HS is because the customer we're doing this for is using HS at the moment and wanted a comparison with that. So they're all similar architectures, and what we're trying to do is find out what each compiler does well and what it does badly.

So, what to measure, and how are we going to measure it? Well, we're going to use BEEBS. How many people here have heard of BEEBS? OK. It came out of some research into energy-efficient compilation that we did with Bristol University six or seven years ago, and it's the Bristol/Embecosm Embedded Benchmark Suite.
It's a free and open source benchmark suite. It aims to have a mixture of types of programs: some that do a lot of branching, some that do a lot of memory access, some that do a lot of integer operations and some that do a lot of floating point. Many benchmark suites assume they're going to print out the answers, which, if you're deeply embedded, is a real problem, because you generally don't have printf. So we try to have minimal I/O here, and the results are captured without actually printing anything out. There's a paper on the research behind this on arXiv; the link's on the screen.

It was originally just 10 programs. Those are shown in the table on the left, and they're color-coded according to how much their code reflects the different types of operations. BEEBS version 2 now has 79 benchmarks, and I've got a task to do, which is to reproduce that table for all 79 benchmarks, to show that we've still got a good mix of programs.

So what are we going to measure? Well, let's look at it as a picture. Broadly, in an embedded system, as in any program, you've got a mixture of code and read-only data, which always go into ROM or flash; some initialized data, which is writable,
so you put it in RAM; and uninitialized data, BSS, which also goes into RAM. Any RAM you've got left over you'll use for your stack and your heap. So you might think we should look at the size of code plus read-only data, because that's going to tell me how big my ROM has got to be. There's a bit of a question over that, because initialized data has to be initialized from somewhere, and you usually put that into ROM as well. But for the purposes of this talk I'm just going to focus on the size of code and read-only data, because that's the key thing, and also because that's the figure that the standard Linux size program gives you.

So, the absolute statistics: for each of those 79 programs we're going to capture the size of the code and the read-only data, and we're going to record a total. That total is going to be dominated by the large programs, because one large program will have as much code as ten small ones. We'll also record the size of the largest program and the size of the smallest program.

But we'll also look at some relative statistics. We're looking at three processors, and I've taken RISC-V as the baseline because I'm talking in the RISC-V dev room. It could have been any one of them; it doesn't matter, it's just an arbitrary choice. We'll look at each program in turn, relative in code size to RISC-V: is it bigger, is it smaller? And we'll take an arithmetic average of those. Because all these ratios should be close to a hundred percent, we're not going to get distortion, so the arithmetic average should be fine.
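The pipeline just described — run size over each binary, then express each target's code size relative to the RISC-V baseline and take the mean and extremes — can be sketched in a few lines. This is an illustrative sketch only: the function names, the sample size output and the toy figures below are mine, not the actual BEEBS tooling.

```python
def parse_text_size(size_output: str) -> int:
    """Pull the 'text' column (code plus read-only data) out of the
    Berkeley-format output of the standard `size` program, e.g.:

        text    data     bss     dec     hex filename
        1024      64      32    1120     460 ns.elf
    """
    header, row = size_output.strip().splitlines()[:2]
    return int(dict(zip(header.split(), row.split()))["text"])


def relative_stats(sizes: dict, baseline: str = "rv32imc") -> dict:
    """Express each target's per-benchmark size as a percentage of the
    baseline, then record the arithmetic mean (fine here, since the
    ratios all cluster around 100%) plus the best and worst cases."""
    stats = {}
    for target, per_bench in sizes.items():
        ratios = [100.0 * per_bench[b] / sizes[baseline][b]
                  for b in sizes[baseline]]
        stats[target] = {"mean": sum(ratios) / len(ratios),
                         "min": min(ratios),
                         "max": max(ratios)}
    return stats
```

The "min" and "max" entries are the interesting ones for a compiler writer: they point at the benchmarks where the two code generators diverge most.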
We don't need to worry about geometric or harmonic means instead. We'll also look at the smallest and largest relative sizes: in other words, which program on this processor is so much better than RISC-V, or so much worse? Those are really interesting, because they're the ones where the compilers are clearly doing something different. So those are the statistics we're going to record.

So let's just run some numbers. I don't expect you to be able to read the individual programs off here, but let's run the programs: compile them, put them through size, and measure them. There we've got, in yellow, the ARC processor; in red, the Arm processor; and in blue, the RISC-V processor. You can see some programs are big and some programs are small. What you might notice is that the Arm results, the little red results, never get below a certain point: there are quite a lot of small programs where there's a sizeable spike for Arm. In fact, if we look at the detailed statistics, we see that the minimum size you get from an Arm program is 4k, whereas ARC and RISC-V are both under 1k. If we look at the maximum code size, the biggest program for RISC-V is notably a lot bigger than the biggest for Arm or ARC. And if we put the relative percentages in, on average Arm is coming out at 222 percent the size of RISC-V, and that is because every small program is coming out four times as big, so it's skewing the average. ARC, on average, is actually slightly better than RISC-V in this case. As for the smallest relative sizes: ARC has some programs it does twice as well as RISC-V, Arm similarly has some at around twice as good, and both Arm and ARC have some really pathologically different cases where they're far bigger than RISC-V.

So why does Arm do so badly with small programs? Well, let's look at the smallest program, a program called ns, and look at the symbols in there by running nm on it.
We can see there are 61 symbols defined in the ARC program, 130 symbols defined in the Arm program, and just 43 symbols in the RISC-V program. And in particular, look at the last one there: there are two write functions. Why is a program that does no I/O showing write in its list of symbols? The answer is that the Arm startup code is pulling in a lot of standard C library code. That C runtime startup, crt0, is a problem: it's distorting the results, and it's not allowing us to see what's really going on in the compiled code, because it's just a big chunk of code bolted on to every Arm program.

So let's put in a dummy crt0. We're only looking at code size; we don't have to run these programs. So let's just put a dummy in there, and when we do, we get a different graph, and you can see most of the Arm spikes on small programs have gone away. Not all of them — we'll come back to that later — but most of them have gone away. When we look at the number of symbols now, we've mostly got a small number of symbols: we've got rid of all the symbols not just from the startup code, but from the C library code it was pulling in as well, so we're getting a better comparison.

And things change around: now Arm is looking relatively much better. It's no longer 222 percent of RISC-V on average.
It's 111 percent on average. ARC, which was better, is now looking a little worse. The smallest programs are now tiny, because basically they are tiny little programs, and the biggest program is, not surprisingly, unaffected, because the size of the startup code was trivial.

Now, some of the larger programs do use the C library — all the big ones have to use it for some things, even if they're not doing I/O — and what we observe is that these programs are often noticeably larger for ARC and Arm. Not in every case, but Arm and ARC seem to have noticeably bigger programs. So let's see whether the standard C library is causing a problem. Just as we put in a dummy crt0, we'll put in a dummy standard C library: it just has dummy functions, with no content, for any C library function you're going to use. What we see here is that now, suddenly, Arm looks very good, and in particular the very worst Arm program relative to RISC-V is now only 18% bigger — none of those several-hundred-percent-bigger cases. So the pathological cases have gone away for Arm, suggesting there is some pathological code in the C library that's linked into one or two programs. The problem hasn't gone away for ARC: its worst program is seven times as big as its equivalent RISC-V program. And we see now Arm is starting to look really quite good: it's 17% better than RISC-V.

OK, so what's actually useful is to have a look at the total size of code: once with the standard C library in, and then with the C library taken out. What we can see is that when you take the C library out, for ARC or RISC-V the total code size goes down about 6%, but for Arm it goes down 24%. That's telling me that the Arm standard C library seems to be very, very big compared to the others.
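The symbol counts quoted a moment ago came from nm. A throwaway sketch of the kind of script that produces them — assuming nm's default three-column output format, with the function name being my own, not part of the BEEBS tooling:

```python
def defined_symbols(nm_output: str) -> list:
    """Return the names of symbols *defined* in a binary, skipping
    undefined ('U') references.  nm's default output has three columns
    for a defined symbol, e.g. '000100b4 T main', and two for an
    undefined one, e.g. '         U write'."""
    names = []
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] != "U":
            names.append(parts[2])
    return names
```

Counting `len(defined_symbols(...))` per binary, and grepping the list for things like write that a no-I/O benchmark has no business containing, is exactly the diagnostic that exposed the crt0 and C-library distortion.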
It may be — we're still trying to find out — that it's actually been compiled for speed. There's only one C library, and if it's been optimized and had all its loops unrolled, of course it's going to be big. You really need to have a multilib variant for compact code as well. We also know, because we've seen it there, that the library is not built at one function per file, so if you pull in one function, you're going to pull in a whole load of other code, and you can't do garbage collection of sections on it. But we're getting rid of more confounding effects: the distortion in my figures from not looking at the compiled code — from having this big wodge of pre-compiled C code stuck on the end — we've got rid of that.

There are some notable variations, though. There are some programs, cubic and frac, where Arm and ARC both do well compared to RISC-V, Arm in particular. The clue to what's going on comes when we look at two programs, matmult-float and matmult-int, which are the same program — a matrix multiplication — one done with floating-point numbers and one done with integers. We see that on the floating-point one, Arm does really much better than RISC-V, whereas on the integer one it's almost the same. That gives a bit of a clue that Arm seems to do a very good job with floating point. ARC is a bit more variable; I'm not quite sure what the message is on ARC. Now remember, these are all chips without a hardware floating-point unit.
So this is about floating-point emulation, and that's done in libgcc, because this is the GCC compiler. Let's see whether Arm has actually got a super-compact floating-point emulation. We'll do exactly the same thing: we'll put in a dummy libgcc. Now we run the same measurements again, and things are starting to get much closer; we're seeing fewer extremes. If we look at the table of data, what we see is that now Arm is not so good compared to RISC-V, and indeed ARC is coming more into the norm. And we notice now that the pathological cases for ARC have gone away, suggesting that ARC's pathology lives in some of its floating-point emulation routines.

So now we've got rid of distortions from a block of startup code, distortions from a pre-compiled C library, and distortions from the emulation library. We're looking much more at just the code we actually compiled for this test. And we can see the impact of the floating-point library if we look at the absolute totals: for ARC and RISC-V, when you take the floating-point emulation away, you get rid of a third of the code; when you do it for Arm, you only get rid of a fifth of the code. So Arm has clearly got a much more compact floating-point emulation.
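The arithmetic behind these library-footprint figures is worth pinning down: link each benchmark suite once against the real library and once against content-free dummy stubs, and the shrinkage is that library's share of the total. A minimal sketch — the function name and the toy numbers are illustrative, not from the slides:

```python
def library_share(total_with_lib: int, total_with_stubs: int) -> float:
    """Percentage of total code size attributable to a pre-compiled
    library, measured by linking once against the real library and
    once against content-free dummy stubs of the same functions."""
    return 100.0 * (total_with_lib - total_with_stubs) / total_with_lib
```

Applied to the C library this gave roughly 6% for ARC and RISC-V against 24% for Arm; applied to libgcc, roughly a third for ARC and RISC-V against a fifth for Arm.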
Now, I've actually taken away the whole of libgcc, and of course libgcc does more than just floating-point emulation, so we do need to check that it is the floating-point emulation, and not some other emulation in there, that we're worrying about.

The last thing is to say: well, we looked at libc; let's look at libm, the math library, because we really ought to see whether that's having an effect. Does one architecture have a particularly good math library? Actually, when we look at this, the figures don't change much from the previous round, and when we look at how much is saved in the absolute code size, the math library accounts for about six percent of the total — and it's the same for all architectures. Almost certainly everyone's using the same C math library and just compiling it, so it's all going to be proportional; it all comes out at six percent. But it's still a block of pre-compiled code, and if we're looking at the effect of the compiler, we ought to take it out.

So if we look at the overall summary, we have the figures at the top for the baseline, where we just compiled the programs and measured them. But as we go down the lines, we've stripped out the distortion of a block of pre-compiled startup code; the distortion of pre-compiled C libraries of differing quality; the distortion of the emulation library and how its emulation is done; and the distortion of the pre-compiled math library. If you take all of those out, in the bottom line there, really all you're looking at is the size of the code you actually compiled, and for the purpose of the compiler writer, that's the line that's going to give you insight. Now, I've just put the averages there: if RISC-V is at 100%, Arm is coming in about 8% better, and ARC is coming out about 14% better. But I emphasize this is not a beauty contest, because there's a lot of variation. If we go back up a couple of slides, we'll see there are some programs where
Arm is worse and some programs where it's a lot better, and the same for ARC: somewhere it's very much worse and somewhere it's much better.

So the takeaway is: for useful results, compiler writers need to consider only the code they compiled. Libraries and startup code can confound the results, so remove your startup code, remove your C library code, remove your emulation library code and remove your math library code. That's easy in this case, because I'm only looking at code size. You have to try to remove the same factors when you're measuring execution speed, and that's a talk for next year, because there the code actually still has to work. But code size is what we were looking at.

What comes out of this is some useful graphs. You can compare ARC against RISC-V, in relative size, and order them, and the interesting ones are the ones at each end: the programs — cubic, for example — where ARC is nearly twice as good as RISC-V, or (I can't read the name from here) the one at the other end, where ARC is almost twice as bad as RISC-V. Those are the ones we're going to delve into, because why is the compiler — which is still GCC in both cases — so different for those programs? There should be something at one end that ARC is doing really well that RISC-V isn't, and something at the other end that RISC-V is doing well and ARC isn't. So this is good for both the ARC compiler team and the RISC-V compiler team. We can do the same for Arm. The variation is less there, but I'd still like to know why there are some programs where RISC-V is 18% better, and I'd certainly want to know why cubic, again, is 50% better. And of course we can compare ARC against Arm, which gives us insight as well. So we've got a three-way comparison here.

So what did we learn about GCC?
Well, first of all — and this ties in with Krste's email this morning — some new instructions would help. Now, we've only actually been taking these results and trying to apply them to the compiler since the first of January; my colleague Jörn Rennecke has been doing that work, and I actually wrote this slide at nine o'clock this morning to get the very latest data from him. If you had an instruction to add 14-bit constants, you'd take more than 1% off the code size of those benchmarks across the board. If you had a 48-bit instruction to load a 32-bit constant, you'd take another 1% off.

And then we can apply compiler techniques, which is what we can actually do today — the instruction ideas will feed into future discussions about new instruction set extensions. One technique is millicode, which is where you find common code sequences that you keep on using and you pre-package them. Of course those packages take up some space, but then, everywhere you would have generated a long sequence of instructions, you just generate a quick call and return. Just the simplest millicode, we think, will save us a third of a percent, and in compiler terms, across the board, that is actually a decent win for something fairly straightforward.

Putting common subexpression elimination into the linker is actually quite a useful win too. We haven't yet got good enough estimates for the other techniques, but an increasing theme with modern architectures is that you need to use the linker to do optimizations. That's not link-time optimization, which is the linker feeding stuff back to do global optimization in the compiler; this is actual optimization in the linker — taking relaxation to great extremes.
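The millicode trade-off described above is simple enough to write down: outlining only wins when the per-call saving across all occurrences outweighs the one shared copy plus its return. A back-of-the-envelope sketch, counting in instructions; the function and its defaults are my own illustration, not Embecosm's model:

```python
def millicode_saving(seq_len: int, occurrences: int,
                     call_size: int = 1, ret_size: int = 1) -> int:
    """Rough size win (in instructions) from outlining a common
    sequence: every occurrence shrinks to a single call, at the cost
    of one shared copy of the sequence plus its return.  Negative
    means outlining would grow the program."""
    inline_cost = seq_len * occurrences
    outlined_cost = call_size * occurrences + seq_len + ret_size
    return inline_cost - outlined_cost
```

This also shows why very short sequences aren't worth outlining: with a two-instruction sequence appearing twice, the call/return overhead eats the entire saving.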
So we think that will give us a good win as well. There's more advanced millicode: Jörn's done some initial experiments suggesting that if you had millicode functions that do a scaled indexed load, that would actually be a win.

There are, in various places, dead register loads, where you load a register, copy it somewhere else, and then use the copy — why didn't you load it into the second register in the first place? That can be fixed with a peephole optimization, and it seems to be surprisingly common. We need to look at the register allocation to see quite why it's doing that, but if we can't fix it there, we can do a peephole.

We can also re-roll some loops. There are still places in code compiled for compactness where loops are being unrolled, and there are circumstances where that's a win even for code size, but they are very few.

So: this is a talk about measurement, not really about compiler optimization, but even in one month we've found a whole load of things that between them will give us a few more percent off the compiled code size.

So what do we need to do? We need to do more measurements. We need to repeat this for LLVM — it's all packaged up, so that will be easy to do, and I hope to do it this week, not least because I've got a customer breathing down my neck for the data. I want to give ARC a fairer crack: it wasn't fair to compare HS, which is a high-performance, Linux-class processor.
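Before moving on: the dead-register-load pattern mentioned a moment ago is concrete enough to sketch as a toy peephole pass. This is an illustration of the idea only — a real pass sits on GCC's RTL and must prove the first register is dead; here instructions are just (op, dest, src) tuples and deadness is assumed:

```python
def kill_dead_register_loads(insns):
    """Toy peephole over (op, dest, src) tuples: rewrite the pair

        ('lw', rA, addr)
        ('mv', rB, rA)

    into a single ('lw', rB, addr).  A real pass must first prove rA
    is dead after the move; this sketch simply assumes it is."""
    out, i = [], 0
    while i < len(insns):
        a = insns[i]
        b = insns[i + 1] if i + 1 < len(insns) else None
        if b is not None and a[0] == "lw" and b[0] == "mv" and b[2] == a[1]:
            out.append(("lw", b[1], a[2]))  # load straight into rB
            i += 2
        else:
            out.append(a)
            i += 1
    return out
```

The real question raised in the talk still stands: the better fix is to understand why register allocation produces the pattern at all, with the peephole as the fallback.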
I should look at their EM processor — as I say, this work was driven by the customer — but I think, you know, ARC did very well given it wasn't in the same class. I'd also like to separate out code and read-only data: there are optimizations that push stuff into data to drive the code, and you do need to break those apart. And I'd actually like to look at initialized writable data, because that often goes into ROM as well.

We've only spent four weeks on the compiler analysis, and I think over the coming months there's going to be a very fruitful set of optimizations that come out of looking at those extremes on the two graphs: finding out why is Arm twice as good, why is ARC twice as good, why are we actually good here, what are we doing well? And that's beneficial not just to RISC-V, but generally: to Arm, to ARC and to the wider GCC community.

OK, so, the resources for this talk. BEEBS is freely available; it's free and open source software, available on beebs.eu. We modified it a bit for this talk to make it easier to drive everything through, and there's a bit more information that will eventually go into a new release of BEEBS, but I've put up there the GitHub link to the actual branch we used. This is all written up as an application note, which is in review at the moment and should appear sometime next week. And that's my talk. Thank you. Questions?

[Audience question]

Yes — so the question is about Arm's multi-register push instruction; that one keeps coming up. The initial analysis suggests it's not quite as big a thing as you might think, not least because modern compilers are quite good at optimizing away prologues.
Prologues aren't rigidly stuck at the start of the function any more. But that's also a big driver behind the early millicode work, because millicode can actually give you a push/pop of multiple registers, and a good global link-time optimization phase should be able to let you decide whether or not to pull that millicode in. If the millicode doesn't actually win there, that's probably indicative that the multi-register push isn't as big an advantage as you thought.

[Audience question]

Right — so, I'm happy to look at any other processors. The reason we've done these three is because I was paid to do them, so, you know, anyone who wants another processor, my pockets are open. On the first part of your question, about using object code: yes, we could use object code. What we'd want to do — we didn't do it in this case — is this: one of the two biggest improvements in compiler optimization over the last 20 years has been link-time optimization, and I want to be able to link-time optimize but not pull the library in. So I'd need to work from object code that I then link-time optimize. It is a possibility; it's a different approach. That's why we didn't, in this case, just trivially say "let's look at all the object files". It doesn't always give you the same answer, but it's an alternative approach; it's not right or wrong, I don't think.

OK, I'm going to look this way, because I've been taking questions over there. Yes?

[Audience question]

You can't, and I am the first to say that if you're benchmarking, the best benchmark code is your own application. We're looking at code size here; when it comes to performance, there are all sorts of other factors — for example, where you position your code in flash makes a big difference to energy efficiency. So this is only one aspect of what you do. There's no reason why this approach can't be used with any code; this talk is about an approach. I've used BEEBS, it works for us,
and it was appropriate in this case, but you can take any computational kernel you want and apply the same thinking to it. What we're trying to find out is what the compiler does well and what it does badly. So yes, do it with anything you want, whatever's most relevant to your use case.

More questions? Yes.

[Audience question]

Yes — the question is: did we use newlib or newlib-nano with Arm? And I cannot, for the life of me, remember. I have a suspicion it was newlib-nano, because we were looking at small code, but I'd have to go away and check. Even if it was newlib-nano, I would still want to take it out: newlib-nano is a lot smaller, so there would be less distortion, but you should still get rid of pre-compiled libraries anyway. But a good question.

OK, thank you very much indeed.