Welcome, of course, to FOSDEM. I'm Jeremy Bennett. I'm chief executive of Embecosm, but I am an engineer by profession, and this is an engineering talk. If you're wondering why I look like a 1960s civil servant out of an Ealing comedy, I'm part of OpenUK, the trade body for open source businesses in the UK. It has got a new lease of life with a new chief executive; they've got a stand downstairs, they're all wearing hats like this, and so I'm advertising it. It is boiling hot, so I'm going to take it off now. That's a little less formal.

The subject of this talk is using benchmarks to improve compilers. Our day job, what Embecosm does, is developing LLVM and GCC compilers. If you've got a new processor chip you're bringing out, we're probably the company you come to to help you bring up GCC, LLVM and all the tools around them. I have been involved with the Embench project for the last year or so. I'll tell you a bit more about that, but this is really a talk about a process: I'm not magically going to tell you all the secret tricks for improving a compiler, I'm going to tell you how you can find out what those tricks are.

So, first of all, about Embench. The first version of Embench will be launched at Embedded World later this month. It's Embench 0.5, because we still consider it our first experimental version, and it's a free and open source set of benchmark programs for benchmarking computers. Let me tell you a bit more about it.

We've been trying to benchmark computers for a long time, going back to 1972 and Whetstone. How many people here have ever run Whetstone? How many people here were born before Whetstone was written? Then we've got LINPACK. LINPACK is still used, it's the basis of the TOP500 supercomputer list, but it was written in 1977. Dhrystone came along, inspired by Whetstone: the first general synthetic benchmark for running on systems and evaluating how a general computer performs. Then you get the SPEC industry consortium forming to say a single program isn't enough, let's have a set of programs. Very well respected; the downside is you have to pay to play, it's not freely available. Then you find CoreMark. That came out of EEMBC, and I'll say a bit more about EEMBC in a moment, but CoreMark is what everyone quotes; I've just been told the RISC-V dev room downstairs has been quoting CoreMark figures for the lowRISC platform. The problem is that CoreMark is like Dhrystone. It's a new Dhrystone: a synthetic benchmark, a single program, and it has done one thing, which is to improve the ability of compilers to optimise CoreMark.

Then there's another mainstream one, MLPerf, which you may have seen in the news recently. That's a new benchmark suite for measuring the performance of machine learning systems, very much for the AI world. There is a connection, which I shall come to in a moment.

Then there are some that are less well known; the top row of the slide is the well-known ones. There's EEMBC, well known in the embedded space, with a set of embedded benchmarks. Again it's pay-to-play, it's a consortium, it's not freely available, but its suites have been quite widely used over time. When we get to the free and open source world, you've got MiBench. MiBench has been around for nearly 20 years; it's a set of small programs you can use for evaluating systems, widely used in academia because it's freely available.
We did some work on energy efficiency, and some of you may have come to the dev room five years ago to hear about it, looking at how compilers could influence energy efficiency on embedded systems. We needed a set of benchmarks suitable for embedded systems, which means they don't use printf and they don't use libraries very much. That's the Bristol/Embecosm Embedded Benchmark Suite, or BEEBS. Actually BEEBS isn't new either, because it builds on MiBench and one or two other benchmark suites. Lastly, you've got the timing analysis benchmark suite, TACLeBench, which came out only a couple of years ago. Not many people have heard of that, but it's a good set of programs, and in fact TACLeBench builds on previous benchmark suites as well.

We came to the conclusion that there was nothing perfect for general use, so we decided to create a new benchmark suite. It's called Embench, and we drew seven lessons from all those existing benchmarks.

It's got to be free. If you don't make it free, you end up like SPEC and EEMBC. Why does everyone use CoreMark and not the full suites of its parent, EEMBC? Because CoreMark is free and EEMBC is not, even though EEMBC, being a set of programs, is the better option.

It must be easy to port and run. If it's a pain to bring up, people won't use it.

It must be real programs. The problem with synthetic benchmarks is that they try to capture a bit of everything and end up being a bit of nothing. We wanted a benchmark suite built on real programs, as SPEC and EEMBC are.

You need a supporting organisation to maintain it. Look at Dhrystone: there's no organisation behind it, so it stays the same forever and ever.

People still want things simplified, they want a single number, so you need a summarising score: "this is my Embench score". There are all sorts of ways you can summarise. We concluded that benchmark suites can be dominated by one big program or one fast program, and geometric means help to level that out, while quoting the geometric standard deviation lets you see the variability, so you can tell whether one program is dominating the results. We also make the data relative: instead of taking the geometric mean of raw program sizes when we're measuring code size, we take each measurement relative to a baseline value. I'll show a tiny sketch of this scoring arithmetic in a moment.

Lastly, some of those benchmark suites are purely academic, like TACLeBench and MiBench, and are not heavily used in industry. Others, like EEMBC and SPEC, are done by industry but aren't widely used in research, because you have to pay for them and they're focused on industry's needs. CoreMark is free and was developed by industry, but hasn't had the input of the latest academic thinking. So we wanted involvement from both academia and industry.

What was the plan? For the first six months of last year we were a small group of four. Dave Patterson, the originator of RISC-I and one of the originators of RISC-V, now Professor Emeritus at Berkeley, drove it; that's the connection to MLPerf, because he was also behind MLPerf. I came in because of my involvement with BEEBS, and Embench is derived from BEEBS. Palmer Dabbelt came in because the RISC-V community was very interested in this; at the time he was at SiFive, and he's now at Google. And Cesare Garlati came in because it's not all about compute speed: he's very much looking at how fast things react, at context switch times and interrupt handling times. It's more than just compute time.
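Before going on, to make that summarising score concrete, here is a minimal sketch of the arithmetic I just described. It is illustrative only: the real Embench tooling is a set of Python scripts, the numbers here are made up, and none of this is the project's actual code.

```c
#include <math.h>
#include <stdio.h>

#define NBENCH 4

/* Made-up code sizes for each benchmark on the platform under test
   and on the reference platform. */
static const double size[NBENCH] = { 1886.0, 1052.0, 2390.0, 1566.0 };
static const double ref[NBENCH]  = { 1754.0, 1198.0, 2214.0, 1560.0 };

int
main (void)
{
  double log_sum = 0.0;

  /* Work with ratios relative to the baseline, so one big benchmark
     cannot dominate the summary. */
  for (int i = 0; i < NBENCH; i++)
    log_sum += log (size[i] / ref[i]);

  double gmean = exp (log_sum / NBENCH);	/* geometric mean */

  /* Geometric standard deviation: a multiplicative spread that shows
     how much the individual benchmarks vary. */
  double sq_sum = 0.0;
  for (int i = 0; i < NBENCH; i++)
    {
      double d = log (size[i] / ref[i]) - log (gmean);
      sq_sum += d * d;
    }
  double gsd = exp (sqrt (sq_sum / NBENCH));

  /* One geometric SD up and down is a divide and a multiply, not a
     subtract and an add; these are the error bars on the charts. */
  printf ("score %.3f (range %.3f to %.3f)\n",
          gmean, gmean / gsd, gmean * gsd);
  return 0;
}
```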
For six months we met every month in California, in SiFive's offices. In June we announced the project to the wider world and opened up a wider group. If you go to embench.org you can join the mailing list, you can join the monthly calls, and there's a git repository with all of this in it; you'll find all the links there.

The goal is to launch a first version at Embedded World at the end of this month, one that is reasonably stable and that people can try. But it is only a first version, because one of the things that came from our principles is that you've got to keep this alive. Embench in two years' time will be a different set of programs, because we don't want all the compilers to be tuned to our set of programs and end up merely good at optimising that set. We're going to keep changing them. That's what SPEC has done, it's what EEMBC has done.

We've tried not to do everything at once, because otherwise you won't succeed. So we've focused initially just on the small IoT class device: 64k of RAM and ROM. We've got 19 benchmarks, which are small real program kernels aimed at deeply embedded computing. We'd like a couple more, though I think we're running out of time for 0.5: we'd like Bluetooth LE represented in there, and we'd like elliptic curve DSA, but it's quite hard to find a real elliptic curve DSA implementation that will fit in a program that size.

We also have an early version of a context switching benchmark, the thing Cesare is interested in, and we would really like more help with that. It's only ever going to be an exemplar, because it has to be written in assembler. It's not a program you just compile and run; it's a description of how you write an assembler program to measure context switching on your platform.

At this stage we've got simple Python build and run scripts, and that's where the last-minute panic work is being done at the moment: getting them clean and easy to use, ready for three weeks' time. So far most of the testing has been with Verilator models and simulators. The latest work is getting it running on real hardware, because it's no use if you can't run it on real hardware, and for deeply embedded systems that's not quite as easy as for something running Linux. And we're widening the architectures: you'll see me talking about Arm and RISC-V today, because I want to do comparative work, but we've run it on ARC, we've run it on Atmel AVR, and we'll do more as we go forward.

So we've got a benchmark suite, and here are the 19 programs. None of these are new; we've taken them from other people. They're free and open, GPLv3 licensed. They're hundreds to a few thousand lines of code each, and I think the biggest is about 10 kilobytes of code and data. They've all been normalised to sit in a loop so that they take about four seconds to run on any platform, and there is a parameter to control that; the sketch below shows the shape of it.
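This is roughly what that wrapping looks like. The names LOCAL_SCALE_FACTOR and CPU_MHZ and the stand-in workload here are illustrative rather than the project's exact interface, so check the Embench sources for the real thing.

```c
/* Illustrative sketch of the per-benchmark scaling loop. */

#define CPU_MHZ            1	/* set for the platform under test */
#define LOCAL_SCALE_FACTOR 542	/* tuned per benchmark so one run takes
				   roughly four seconds at 1 MHz */

/* Stand-in for one pass of the real benchmark kernel. */
static int
do_one_pass (void)
{
  static unsigned state = 42;
  state = state * 1103515245u + 12345u;
  return (int) state;
}

/* The harness times this function.  The repeat count scales with the
   clock rate, so every platform runs for about the same wall-clock
   time regardless of how fast it is. */
int
benchmark (void)
{
  int res = 0;
  for (int i = 0; i < LOCAL_SCALE_FACTOR * CPU_MHZ; i++)
    res = do_one_pass ();
  return res;
}
```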
The other thing is that they're chosen to be a mix, and this is something that came from BEEBS: we looked at the proportion of branching, memory access and ALU operations in each program, and you can see that some have lots of branches and some very few, some have lots of arithmetic and some little, some are memory-heavy and some are not. There's no floating point in here. We decided we'd worry about floating point separately; there's almost none in these. You can't stop real programs having a little bit of floating point, but we've thrown out the explicitly floating-point ones.

So that's the benchmark suite. As I say, please join in. It is a community effort; no one is paying for it other than people giving up their time. The group is chaired by Dave Patterson, so it's a good chance to get to know Dave if you want to talk to one of our great computer scientists. I'm the vice-chair, so I'm the one who worries about how meetings happen and so forth.

This is not really a talk about Embench, though; I just wanted you to know about it, because this is a talk about compiler optimisation. So, what affects Embench results, or indeed any benchmark results? First, the instruction set architecture. How good is your architecture: is it Arm or RISC-V or ARC or AVR? Which extensions are you using: is it the Armv7 or the Armv8 architecture, are you using Thumb or Thumb-2, are you using RV32 with which extensions, M, C, B, whatever? What compiler are you using: one of the open ones like GCC or Clang/LLVM, or a proprietary one like IAR? Which optimisations have you turned on: are you optimising for size, or for speed? And you might think that a brand new architecture's compiler would be immature while an old architecture's compiler would be mature, so today I'm going to look at Arm and RISC-V: RISC-V quite new, Arm around for a while.

The other thing that can affect results is the libraries. I'm not going to talk about libraries today, but you can make your programs as good as you like: if they rely on libraries, and even embedded systems rely on libraries for things like floating point emulation, you're going to have some library code in there. Embench at the moment, and this is still a bit of a debate, excludes libraries when measuring size, because even the little emulation libraries and the C runtime startup can completely dominate the code size of programs this small (less so the performance), and we don't want to keep measuring the same thing over and over.

So, if I want to use this for compiler work, let's have a comparison matrix, because what I want to do is learn from other people, just as in the previous talk you had the question about LLDB and GDB learning from each other. Let's look in two dimensions. First, what can we learn by comparing Clang/LLVM with GCC for a particular architecture? And secondly, what can we learn by comparing Clang/LLVM across different architectures: what does it do well on one architecture and not on another? The idea is to share knowledge, to learn from others.

At the gross level, we can compare two compilers. Here's RISC-V LLVM against RISC-V GCC: I've looked at code size, and at what the various optimisation levels do to code size.
You can see that as we increase optimisation, code gets bigger. Notice that everything is normalised to a reference platform; the reference happens to be the code size of RV32IMC compiled with -Os using GCC 9. That's actually going to change for the release, when the reference architecture will be the Cortex-M4, but I'm using what we've got today. And you can see the little bars: those are one geometric standard deviation up and down. This point is close to the reference, because this is the reference configuration for size, -Os, and its error bars are quite small; up at this end they're quite big.

We can also look at code speed. Embench doesn't actually prescribe the metric: you can use it for code size or code speed, and, though we haven't done it yet, in principle for energy efficiency too. If we look at code speed, we find that, obviously, when you compile for small code things go a bit slower, and when you go for high optimisation things go a bit faster. This time it's normalised against RV32IMC with -O2 and GCC 9. Perhaps surprisingly for RISC-V, you don't get a huge amount of speed-up; optimisation is not as effective as you might hope.

So we can do that gross comparison, which gives you the big picture. And we can look by architecture. For architecture I'm going to focus on code size, because I haven't got good reliable measurements of Arm execution speed, and since I'm focusing on embedded, the rest of this talk looks at optimisation for code size. The same analysis applies to optimisation for speed; it's just that I've only got 40 minutes.

So, code size by architecture, with yellow for Arm. Here you see that Arm is the more mature architecture: when you optimise for size, it does better than RISC-V. When you optimise for speed, Arm's code size goes out the window, but that's partly because Arm is doing aggressive optimisation there. If I did have the performance figures, you'd see Arm gets much more of a speed-up when you turn on -O3, at the expense of a lot of loop unrolling and inlining and so forth, so the code size gets bigger. That's exactly what you would expect to see.

So far so good, but this doesn't hugely help the compiler writer beyond letting me talk about how good my compiler is in the round, and that's not what I'm interested in. Let's go down to the next level and look at individual benchmarks, comparing LLVM and GCC code size. This is for RISC-V with -Os, optimising for size; I've compared at -Os because GCC does not support -Oz. You can see, just visually, that for some benchmarks LLVM has bigger code and GCC smaller, and for others it's the other way round. But it's a bit all over the place. It's much better if you sort the benchmarks: this is exactly the same data, but sorted by the difference, as sketched below. At this end you've got aha-mont64, and for that one the code size with LLVM (there are no means here, because these are individual benchmarks) is nearly twice as big as what you get with GCC. So we could learn something by looking at that one: what is it about that particular benchmark that LLVM does so dreadfully on compared with GCC?
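The sorting step itself is trivial; here's a minimal sketch of the shape of that analysis. The benchmark names are real Embench programs, but the sizes are made up for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sort benchmarks by the ratio of LLVM code size to GCC code size,
   so the outliers at each end show where to dig.  Sizes invented. */

struct result
{
  const char *name;
  double llvm_size;
  double gcc_size;
};

static struct result results[] = {
  { "aha-mont64", 2742.0, 1466.0 },
  { "huffbench",  1336.0, 1652.0 },
  { "picojpeg",   6046.0, 4568.0 },
  { "crc32",       230.0,  230.0 },
};

static int
by_ratio (const void *a, const void *b)
{
  const struct result *ra = a, *rb = b;
  double da = ra->llvm_size / ra->gcc_size;
  double db = rb->llvm_size / rb->gcc_size;
  return (da < db) ? -1 : (da > db);
}

int
main (void)
{
  size_t n = sizeof (results) / sizeof (results[0]);
  qsort (results, n, sizeof (results[0]), by_ratio);
  for (size_t i = 0; i < n; i++)
    printf ("%-12s %.2f\n", results[i].name,
            results[i].llvm_size / results[i].gcc_size);
  return 0;
}
```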
Up at the other end we can look at huffbench and say there are some benchmarks where LLVM is getting it right. What is LLVM doing right with huffbench that GCC isn't so good at? That helps both the GCC and the LLVM teams. This is not about who's got the better compiler; it's about what each can learn from the other.

We can do the same analysis on Arm. You can see that for the cubic program and the crypto programs, Arm's code size is really good. If you want to write a small crypto program, choose Arm as your architecture, not RISC-V; that's the message there. But there are others, solving the n-body problem and a state machine, where the compiled code for RISC-V with LLVM is better than that for Arm. We can look at both of those: the Arm people can ask what the RISC-V compiler is getting right that Arm could learn from. So we're starting to see that there are some interesting programs here.

Let's go deep, because this is where we get the useful information. There's a really good option to nm, --size-sort, which sorts your symbols by size. I've run nm --size-sort on the Clang and the GCC compilations of the program with the biggest difference. I've only shown the text symbols; I've stripped out all the data symbols, because I'm not interested in those at the moment. I draw your attention to this: there appear to be three functions in the GCC image that are not in the Clang image. Those have clearly been inlined. And you can see that the benchmark body is 0x16e bytes on GCC, while the benchmark body over there is 0x5c6: much, much bigger, three times bigger, on Clang/LLVM.

Let's delve in deeper. If we look at the source code, mont64.c, we see that those missing functions, mulul64 and modul64, are called many times, partly from montmul but mostly from the benchmark body. If I inline them, I put three copies of each into the benchmark body, which, given I'm optimising for size, is possibly not such a good thing. The third one that's missing, the xbinGCD function, actually appears only once, and inlining something that appears only once is probably not a bad idea for code size.

We can see this if we disassemble the benchmark body: in Clang's you see a series of code sequences, mul, mulhu, add, and that same sequence appears three times, which is the function inlined three times. In GCC's benchmark body you don't see those multiply sequences at all; the calls stay out of line. What that's telling us is that there's arguably too much inlining going on in LLVM at -Os. Now, there is a case for saying, well, LLVM has -Oz, and maybe the inlining gets turned down at -Oz. But this gives you the insight: if you're worried that -Os is not yielding what you want, you could look at inlining. In fact, you can look at the debug information, because there's a helpful DW_TAG_inlined_subroutine tag to tell you when something has been inlined.

If we compare all the binaries, LLVM and GCC, we see that in general LLVM at -Os does far more inlining. If we want to improve LLVM for RISC-V, one thing we could do is tune down the heuristics that control the amount of inlining. And I'll draw your attention to picojpeg, third in the list of differences: it has 180 inlined functions compared to 40 in GCC, so even GCC does plenty of inlining there. The sketch below shows the trade-off in miniature.
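To make that trade-off concrete, here's a minimal sketch modelled on the mont64 situation. The helper's body and names are stand-ins rather than the real benchmark code, and whether a compiler inlines is a heuristic decision, so treat it as illustrative.

```c
#include <stdint.h>

/* On RV32 a 64-bit multiply is emulated, so this small-looking helper
   compiles to a substantial mul/mulhu/add sequence. */
static uint64_t
mul64 (uint64_t a, uint64_t b)
{
  return a * b;
}

static volatile uint64_t r1, r2, r3;

void
benchmark_body (uint64_t x, uint64_t y)
{
  /* Three separate call sites.  A compiler that inlines mul64 at each
     one duplicates its multi-instruction body three times; one that
     keeps a single out-of-line copy pays only three short calls.  At
     -Os the second choice is usually smaller, which is the difference
     observed between Clang/LLVM and GCC on this benchmark. */
  r1 = mul64 (x, y);
  r2 = mul64 (x + 1, y);
  r3 = mul64 (x, y + 1);
}
```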
But the point is that picojpeg, with that huge amount of inlining going on, is still only third in the list, not top. So there's more going on in code size than inlining. We've done a bit more digging, and this could go on for ages and ages, so I won't. If we look at cubic at -Os, which is one where inlining is less of a factor (there's no inlining going on at all in cubic), cubic is still worse on Clang than GCC. What we see is that the main routine, solveCubic, is notably bigger, and that suggests having a look at solveCubic. The first thing we notice is that the first thing solveCubic does is set up its stack frame, and on LLVM it has a far, far bigger one: 1,424 bytes compared to 304 bytes on GCC.

The reason becomes clear if you look at how the code is translated; I've picked out one particular line to illustrate it, but we looked at the whole thing. Look at this: when it wants to load a constant, LLVM explicitly materialises a 32-bit constant by loading the upper immediate and then adding in the lower bits. GCC has a small data area where it has placed those constants once, and it can do a single load: 32 bits of instruction here, 64 bits needed there. That's one of the reasons for the code size, and it's part of why you end up saving so much on the stack, which is why the stack frame has got so big. And if you go into the debugger and look at the small data section, the constants you would expect are indeed in there. So GCC is making aggressive use of a small data section, and that hasn't made it into LLVM yet. There's an optimisation we could add to Clang/LLVM by looking at what GCC does. I've sketched this below.

But we can also learn something in the other direction, because the first two loads GCC does are two 32-bit instructions to load 32-bit constants, and those constants are 0 and 0x40000000. LLVM is actually intelligent here. GCC uses small data, but it uses it mindlessly; LLVM doesn't use small data, but it's quite clever about constants it can do tricks with. It knows it can use a short instruction, c.li, to load a 32-bit zero, and it knows it only needs to load the upper bits to set that big constant. So in this case GCC used 64 bits of instructions where Clang/LLVM used only 48. The GCC people can learn something from Clang/LLVM on that one. And the point of the benchmark suite is that it gave you the pointers to find this; that's why a benchmark suite, and the analysis that goes with it, is useful.

I'm going to finish with a passing comment, because you can do the same thing comparing Arm and RISC-V. Very quickly: you see things like these compound instructions, where you've got an exclusive-or of r2 and ip into r0, but you're allowed to rotate the second operand. You've got these synthesised operands, and that's really useful for cryptographic code, which gets heavy use in nettle AES, the AES benchmark. What you do in 32 bits on Arm needs a total of, if I've added it up right, 96 bits on RISC-V, and that structure comes up a lot in that cryptographic code. That's why Arm comes out well on the cryptographic stuff. It's something to bear in mind when looking at instruction set extensions for RISC-V; we can't do much about it in the compiler, because it's an architectural thing.
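Here's a sketch of that constant-handling difference. The instruction sequences in the comments are my reconstruction of what the slides showed, with illustrative registers, offsets and constant; the Arm literal-pool pattern I'm about to mention is included for comparison.

```c
/* How different toolchains materialise a 32-bit constant on RV32.
   Sequences reconstructed for illustration, not compiler output. */

unsigned int
scale (unsigned int x)
{
  /* Clang/LLVM (no small data area) builds the constant inline,
     two 32-bit instructions per use:
         lui  a1, 0x9e378
         addi a1, a1, -1607      # a1 = 0x9e3779b9
     GCC places the constant in .sdata once, and each use is a single
     gp-relative load:
         lw   a1, offset(gp)
     Arm does something similar with a literal pool after the function
     and a pc-relative load:
         ldr  r1, .Lpool
     LLVM, though, is cleverer about special values: zero is a single
     16-bit c.li, and a constant like 0x40000000 with a zero low half
     needs only the lui, with no addi at all. */
  return x * 0x9e3779b9u;
}
```

Which approach wins depends on the constant: the small data area pays off when constants are reused, while the clever materialisation pays off for special-case values.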
So sometimes the answer you'll be given is that you can't fix this, but other times you can. We observe other things too. Arm makes heavy use of constant pools, and we saw a bit of that idea in GCC: constants at the end of functions, and a global area with short accesses to global constants via a register, whereas RISC-V tends to have many explicit global loads and stores. The other thing we see is that Arm famously has conditional instructions; those turn into explicit branches on RISC-V. We can do something about constant pools, because we can implement them in Clang/LLVM. The other two are architectural; that's over to the chip designers to sort out.

I hope that was useful as an exploration of how you go about the day-to-day job of improving a compiler. Standard benchmarks provide a useful comparison tool, and comparison between different environments can identify optimisation possibilities, by comparing compilers and by comparing architectures. But some problems can't be fixed by the compiler, and none of it works without benchmarks. I've chosen Embench because I've worked on Embench and it's being launched next month, so it's a good chance to talk about it, and we use it professionally in our day-to-day job. But there are other benchmark suites out there; you can use any of the ones I talked about earlier. There are also other compiler-specific suites, and I particularly draw your attention to our colleagues from Western Digital and their repository of tiny code fragments that break compilers. It's not a benchmark suite so much as an "is your compiler any good" suite, and that's a great set of code fragments to look at. Exactly the same analysis applies; it's the same technique.

So thank you very much. That's my company address, and embench.org. I do encourage you to get involved with Embench: if it's going to succeed as a free and open benchmark suite, it needs many, many hands. I should take questions, and I've already got one.

So the question is: how can you get involved with Embench? You go to the website and sign up for the mailing list; it's a conventional mailing list. One of the things announced on that list is the date and time of the monthly call, which is the third Monday of every month at 8am California time, so 5pm in Belgium. The call is run over Zoom; I know the concerns over using Zoom, but we can't get everything right. So that's how you get involved. And there's a GitHub repository: send in your issues, send in your pull requests. My next three weeks will be spent dealing with the queue of pull requests and issues that have been raised, and the more the merrier. And it goes on: it launches next month, but it's going to be renewed year after year, so new benchmarks are always welcome.

So the question is: are we going to version Embench? The answer is yes. Version 0.5 this year. The intention is to renew it every two years. I suspect the first full version of Embench will come out after only a year, because I think that will be enough, and then every two years after that. And I've already got the first submission of a program to go into the benchmark suite next time round.

I'll take this one over here. For the benchmark numbers you showed for performance, what's the reference platform used there? So the question is: what is the reference platform used for Embench, and what hardware implementation is it?
It is the RI5CY ("risky") core from the PULP project at ETH Zurich. That's an open implementation, and specifically it is a particular commit of the Verilog source from their GitHub repository. Thank you.

Do you plan to separate the floating-point benchmarks from the fixed-point benchmarks, because it skews the numbers? So the question is: are we going to separate out floating-point benchmarks? The answer is that this first benchmark suite is not called just Embench, it's Embench IoT: it's specifically IoT class, integer only. We see two directions in the immediate future. One is a floating-point suite aimed at that same class of processor, because there is embedded floating point: you can get a Cortex-M4 with a floating point unit, and you have RV32IF. The second is benchmarks appropriate for application class processors, the things that run on an x86 or an Armv8-A class, AArch64 type architecture. Those are projects for the future, and we're looking for people who want to join in and take ownership of them, so I would expect that in future there will be a series of Embench suites for the different classes, in each version.

Very good question from Jerome: are those speed figures absolute numbers, or are they relative to the clock rate? The answer is we report both. The absolute value tells you how fast your processor is; the value per megahertz tells you how efficiently you've implemented it. We did do some figures comparing the RI5CY core from ETH Zurich with an early SiFive core, and we found that, once you divided by the megahertz, the RI5CY core was about 5% more efficiently implemented. The speed data I gave you were from a Verilator model of the RI5CY core, which I happened to clock at notionally one megahertz, so it was divided by one and the two figures are the same.

Any more questions? Thank you very much. Perfect timing.