Hi, everyone. My name is Greg. I'm a hardware engineer from lowRISC. Previously I've worked at Arm on A-class processors and memory systems, and at Broadcom on GPUs, and it's really great to finally be working on some open source hardware, which is what we do at lowRISC.

So today I'm going to talk about Ibex, which is our microcontroller-class CPU core. It has a two-stage pipeline, it's a very simple 32-bit core with U mode, M mode and PMP, it supports the EMC and IMC configurations, and it's written in SystemVerilog. It came to us from the PULP team at ETH Zurich, which I'm sure many of you are familiar with, where it was known as Zero-riscy; we picked it up and have developed it from there. We've done a lot of work on improving the RTL, and a lot of verification on it.

It's being developed by lowRISC, the company I work for. We are a not-for-profit company working on open source silicon, using collaborative engineering: we work with various organizations to build open source hardware designs. A notable use of Ibex is in our recently announced OpenTitan project, which you may have heard of, an open source silicon root of trust, and we hope there are going to be many more things it gets used in.

Today's talk is about improving the performance of Ibex. What I'm really talking about here is trying to reduce the total number of cycles spent executing some benchmarks; in this particular talk that means CoreMark and Embench. CoreMark because, well, everyone talks about CoreMark numbers, and Embench because it's a nice new open source benchmarking suite; I felt it had a great range of workloads and thought it was a useful thing to look at.

You do have to be careful when you're working with benchmarks: you've got to be careful you're not just optimizing for the benchmark. It's very useful to analyze what's going on in a benchmark and use that as a guide to the things you can do, but you then need to take a step back and think, is this actually generally useful, or am I just making CoreMark specifically quicker? I would also say that the improvements I mention here are going to be configurable options in Ibex, so you'll be able to choose between a smaller, simpler Ibex core, or a faster core that's a bit bigger and a little more complex.

The trial system we're running these simulations on is Ibex simulated under Verilator, with just a dual-ported memory containing code and data with a single-cycle access latency. I feel this is a fair analog of a best-case real system; obviously you can get other systems that don't have quite this setup and thus won't perform as well.
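To make that setup concrete, here is a minimal sketch of what such a memory model might look like. This is my own illustration, not the actual lowRISC simulation harness; the module and port names are made up, and byte enables and error responses are left out. It just captures the dual-ported, single-cycle-latency behavior described above.

```systemverilog
// Minimal sketch of a dual-ported memory with single-cycle access latency,
// as might sit behind the instruction and data ports in a Verilator
// testbench. Illustrative only; names are hypothetical.
module sim_mem #(
  parameter int unsigned DepthWords = 64 * 1024
) (
  input  logic        clk_i,
  // Instruction port
  input  logic        instr_req_i,
  input  logic [31:0] instr_addr_i,
  output logic        instr_rvalid_o,
  output logic [31:0] instr_rdata_o,
  // Data port (byte enables omitted for brevity)
  input  logic        data_req_i,
  input  logic        data_we_i,
  input  logic [31:0] data_addr_i,
  input  logic [31:0] data_wdata_i,
  output logic        data_rvalid_o,
  output logic [31:0] data_rdata_o
);
  // Word-addressed backing store; a testbench would preload the program
  // here, e.g. with $readmemh.
  logic [31:0] mem [DepthWords];

  // Every request is answered exactly one cycle later.
  always_ff @(posedge clk_i) begin
    instr_rvalid_o <= instr_req_i;
    instr_rdata_o  <= mem[instr_addr_i[31:2]];

    data_rvalid_o  <= data_req_i;
    data_rdata_o   <= mem[data_addr_i[31:2]];
    if (data_req_i && data_we_i) begin
      mem[data_addr_i[31:2]] <= data_wdata_i;
    end
  end
endmodule
```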
So how do we go about working out what's going wrong with our performance? You don't have to do anything complicated; I didn't. Literally run some benchmarks, trace the simulation, open up the trace in GTKWave and just start having a look around. What you're doing here is trying to identify interesting behaviors that you can then poke into more, and then determine whether they're actually causing you performance issues. So it's not a quantitative analysis, but it's a quick and easy way to get a rough idea of what may be going wrong.

So I ran CoreMark on our Verilator simulation, opened it up in GTKWave and had a poke around, and here are a couple of things I found.

The first thing to look at is this example here, which is a conditional branch. You can see the instruction down at the bottom there: it's a branch if not equal, and this particular condition is passing. The thing I have in the red circle is the first cycle of us executing this branch. We've pulled the two operands out of the register file and, lo and behold, they are not equal, so the comparison result is positive and we are going to branch. But before we can branch we have to work out where we're going to, so in the next cycle we use the ALU again to calculate the branch target. We're reusing that resource, first to compute the condition and then to do an addition, and then finally we branch. So this has taken a total of three cycles, and during the latter two of those cycles the processor is stalled, because we stay in the same pipeline stage while this is happening. If we could improve this somehow, we could speed up our execution.

The next thing to look at is loads, and again we end up stalling on every single load. The reason for this is, again, our two-stage pipeline. In the first cycle we request the data; we have a one-cycle access latency, so in the next cycle we get it back, but we need to wait that extra cycle for the data to come back, so we end up stalling the pipeline. We actually have the same behavior on stores: we wait for a response from the store to tell us if there was an error, for example if we accessed a non-existent address, so we end up stalling for at least one cycle on stores as well.

So we've got these stall cycles around branches, loads and stores, which clearly isn't particularly good for performance. Now we're going to do some quantitative analysis: we're going to use the performance counters to see what's going on, dig in, and try to determine how much this is actually impacting our performance. We've just identified two or three interesting behaviors that are probably slowing us down; let's see what the actual effect is.
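The stall figures below come from performance counters in the RTL. As a rough illustration of what such a counter amounts to, here's a sketch of a cycle counter alongside an event counter gated by a stall condition; the signal names (clk_i, rst_ni, stall_branch_target) are hypothetical stand-ins, not Ibex's actual counter implementation.

```systemverilog
// Illustrative fragment: a free-running cycle counter plus a counter that
// only advances on cycles where the pipeline is stalled computing a taken
// branch's target. Dividing one by the other gives the percentage of
// cycles lost to that behavior. Signal names are made up for this sketch.
logic [63:0] cycle_count_q;
logic [63:0] branch_stall_count_q;

always_ff @(posedge clk_i or negedge rst_ni) begin
  if (!rst_ni) begin
    cycle_count_q        <= '0;
    branch_stall_count_q <= '0;
  end else begin
    cycle_count_q <= cycle_count_q + 64'd1;
    // Asserted on the cycle spent re-using the main ALU for the target.
    if (stall_branch_target) begin
      branch_stall_count_q <= branch_stall_count_q + 64'd1;
    end
  end
end
```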
We run our simulation across a variety of benchmarks using these performance counters, and we get some results that tell us how much time we're spending stalled. In this first graph, on the x-axis you can see the various benchmarks: on the far left is CoreMark, on the far right is a geometric mean of all of these numbers, and in the middle is the Embench benchmarking suite. What we have here is the percentage of total cycles spent calculating the branch target; going back to the branch example, this is the percentage of total cycles we're spending in that second red circle. You can see that on average around 4% of our time is spent doing this, so if we could remove this cycle we could reduce the number of cycles we spend on these benchmarks by around 4%. Different benchmarks have different amounts of branching, which is unsurprising.

We can do a similar thing for memory. Note that the y-axis has changed in scale, but otherwise it's a very similar graph: the percentage of cycles spent waiting for a memory response, whether that's the load data coming back or the response from the memory system on a store. This is a significantly bigger chunk of time; you can see it's around 15% on average, and different benchmarks have different amounts of memory activity, so it's fairly spread out how much performance you can potentially gain by reducing the number of cycles spent hanging around for memory.

So we've identified these things, and they definitely seem to be slowing down these benchmarks. We can ask: if we improve them, are we just improving benchmarks, or are we improving other things too? I think it's pretty safe to say everything is going to use branches and memory accesses, so yes, this is generally useful, and yes, these are major things slowing down the Ibex core right now. We need to improve things.

The first thing we're going to do to improve our branch performance is to introduce a new ALU, a branch target ALU. On the right-hand side here I have a very simplified diagram of the Ibex pipeline. We've got our two stages: instruction fetch, which grabs the instruction out of memory, and then decode/execute, which the instruction moves into and sits in until it is finished and written back to the register file. If we want to remove this stall cycle around computing branch targets, all we need to do is add in a second ALU. Rather than spending two cycles using the main ALU twice, we use the main ALU to compute the condition and decide whether or not we're branching, and at the same time perform the addition that works out where we're branching to. We increase the area of the core a little bit because we're adding in some extra logic, but we reckon we're going to get around a 4% performance gain out of it, so this is probably a good idea.
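To show the idea in RTL terms, here's a simplified fragment of how a taken branch can resolve in a single cycle once there's a dedicated target adder. This is my own sketch with made-up signal names, not the actual Ibex implementation, and it ignores compressed instructions, exceptions and interrupts.

```systemverilog
// Illustrative fragment: with a dedicated branch target adder, the main
// ALU evaluates the condition while the target is computed in parallel,
// so a taken branch no longer needs a second execute cycle.
// All signal names here are hypothetical.
logic        branch_taken;
logic [31:0] branch_target;
logic [31:0] fetch_addr_next;

// Main ALU evaluates the condition (e.g. BNE: operands not equal)...
assign branch_taken  = instr_is_branch & (alu_operand_a != alu_operand_b);

// ...while the branch target ALU computes PC + immediate in the same cycle.
assign branch_target = pc_ex + branch_imm;

// The next fetch address can be selected immediately (ignoring compressed
// instructions, exceptions and interrupts for simplicity).
assign fetch_addr_next = branch_taken ? branch_target : pc_ex + 32'd4;
```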
So we do this. It's nice and straightforward, but there are other things we need to consider, namely the implementation: the physical impact of these changes. We're going to add some logic, so that's going to add some area, and it's also going to alter the timing, that is, what frequency we can run the processor core at.

So I built an experimental synthesis flow using Yosys, the open source synthesis tool, if you're not familiar with it, and I did timing analysis with OpenSTA. I pulled in the Nangate 45 nanometer cell library that you can get out of the OpenROAD repository, and I used this to generate some implementation numbers. I would caution that tools like Yosys and the Nangate 45 nanometer library aren't going to achieve the best numbers compared to what you could potentially do with commercial tools and libraries, so this flow is not here to say "here are the best frequency and area numbers you can get out of Ibex". But it is very useful for seeing relative changes as we make these improvements, and for seeing where the timing pressure is and which chains of logic are limiting our frequency, and thus working out where we can improve things.

So here are the results of implementing the branch target ALU. I've just put the CoreMark results in, but you'll see some Embench results later. CoreMark per MHz has gone up by about four and a half percent, which is what we expected: great. Area has increased a little bit, but not too much: great. The problem is that we've also lost some frequency: our fmax, the maximum frequency the core can run at, has gone down by 13 percent. So actually this means we haven't necessarily increased performance overall, because if we're running at maximum frequency, that maximum frequency has just gone down. Our CoreMark per MHz has gone up, but overall we end up with a lower CoreMark result; we end up with a slower core.

So can we do anything about this? This is kind of disappointing: we had this nice, simple change which seemed obvious, but actually we've started going slower. We need to dig into the implementation and try to work out what's going wrong. Yosys allows you to do this: it's got some nice tools for selecting out logic and examining paths and things, and it produces these wonderful diagrams which you can spend a very long time staring at. Obviously it's not particularly easy to just pick things out of something like this. This is a graph of some of the logic inside the Ibex core as produced by Yosys; you can't immediately go "aha, it's this line here", but with some knowledge of the design you can spend some time examining it, have a bit of a think, and work out what's going on, which is what I've done.

So here's a far simpler diagram which actually explains what we've messed up. For people who aren't familiar with synthesis and logic implementation, what sets the maximum frequency of your design is the length of the longest chain of logic in it. You have a whole bunch of logic gates all connected together, and the longer those chains get, the longer it takes for a signal to propagate down them and the slower your clock can run. What I have here, this gray line, is the longest path in our design. We start from the flopped instruction, the instruction we read out of the instruction fetch stage in the previous cycle and are now executing. We have the decoder, which is a blob of logic working out what we're going to do and setting up the controls for this instruction. That then feeds into the main ALU, which is going to compute our results for us.
And then on the right-hand side I've got what I've labeled the PC mux selection logic. What this is doing is choosing what our next PC is going to be, which then feeds into the instruction address output on the right-hand side there: the address we're fetching from next in instruction memory. It's actually quite complicated to decide this, because there are quite a few things going on: you might be branching, you might just be going to the next instruction, there might be an exception, there might be an interrupt, and we actually have a prefetch buffer trying to fetch ahead, which also affects the instruction address. So this PC mux selection has quite a lot of things to decide, and it's a reasonably complex blob of logic.

What we've done here is this. Previously, if we were going to branch, we'd spend two cycles over it: in the first cycle the main ALU computes the condition and then remembers whether or not the condition passed; in the next cycle the main ALU computes the branch target, which comes out of the ALU, up this dotted line here, into this mux, and out to give us the next instruction fetch address. We have just removed that dotted line, because we now have the branch target ALU, and we've introduced this new line here: the condition coming out of the main ALU, feeding into the PC mux selection logic. So we've taken what was the longest path in the design, namely that dotted line, and added a little bit of extra logic onto it, because the PC mux selection logic is now on that path: we're no longer saving the condition, we're using it immediately. And this has slowed us down.

So how do we make ourselves faster? Well, you need to have a look at the blobs on that path and work out which of them we can improve. For the main ALU there's probably not much we can do: you implement it as best you can and that's as fast as it's going to be. What we can probably do, though, is look at how we're setting up the operands for the main ALU. These red lines coming in here are the control lines selecting which operands we're using, and what we're going to try to do is make those main ALU operands turn up earlier, so the main ALU result comes earlier. The thing causing them to be quite late is the decoder. This is a big, complicated blob of logic; it basically controls the entire design, and it's trying to work out what the instruction is doing. We need to make it go quicker.

There are a few things I needed to fix to do this; sadly I've only got time to discuss one of them, which is the instruction flop fan-out. This thing here, which I've just called the instruction flop, connects to an awful lot of logic, all kinds of gates, and in order to physically drive all of those gates it needs a bunch of buffering to push that signal out everywhere, which slows it down quite a bit. So if we could somehow reduce that fan-out, we could speed up our design. What we do is simply make a copy of the instruction flop, so we now have two of them, and we split up the decoder: it looks at a replicated version of the instruction and uses that purely to decide the ALU operand selection and the ALU operation, while the decode for everything else comes from the other register. In doing this we make those red lines appear earlier in the cycle, so all of this returns earlier and we help fix the path. As I said, there are some other things we needed to do as well, but I haven't got time to discuss them.
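As a rough sketch of the register duplication idea (the signal names and opcode checks here are my own illustration, and the real change in Ibex is more involved than this), it looks something like the fragment below.

```systemverilog
// Illustrative fragment: reduce fan-out by keeping two copies of the
// instruction register. The duplicate feeds only the timing-critical
// ALU-related decode, so that logic sees a lightly loaded, earlier-arriving
// copy of the instruction, while everything else decodes from the original.
// Signal names are hypothetical.
logic [31:0] instr_q;      // drives the bulk of the decoder
logic [31:0] instr_alu_q;  // duplicate, drives only ALU-related decode

always_ff @(posedge clk_i) begin
  if (instr_new_i) begin
    instr_q     <= instr_rdata_i;
    instr_alu_q <= instr_rdata_i;  // same value, separate physical flop
  end
end

// Timing-critical decode uses the duplicate copy...
assign is_branch  = (instr_alu_q[6:0] == 7'b1100011);
assign is_alu_reg = (instr_alu_q[6:0] == 7'b0110011);

// ...while everything else keeps decoding from the original register.
assign rf_raddr_a = instr_q[19:15];
assign rf_raddr_b = instr_q[24:20];
assign is_load    = (instr_q[6:0] == 7'b0000011);
```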
After doing this, plus some extra improvements, I've now got a better implementation. Area has actually gone down a bit; I think that's because I've reduced the amount of buffering in the design by duplicating that register. Sadly, I still haven't recovered the frequency. Now, I'm not too worried about this, for a few reasons. One: the tools I'm using, Yosys and ABC, the optimizer it uses, don't actually take I/O timing constraints into account. The path we've got here is an output path, and there's a constraint on it saying how early it needs to appear, but you can't actually feed that into Yosys, so it can't really target the optimization at it. I think if I had some other tools that were capable of doing this, I could probably optimize this path a lot better and hopefully make the rest of this problem go away. The other thing is that this is a microcontroller: you're not necessarily wanting to run it at the maximum possible frequency. But it is important that we don't just keep adding things that slow it down and slow it down. So overall it's probably a pretty good optimization; someone who really wants to run Ibex at the maximum frequency they can may want to turn it off, but they can do that, because it's a configurable option.

The second thing I've done is add a third pipeline stage to Ibex. Now, I've made this look very simple here, but it's actually quite a complex thing to do; it's just one of those things I don't really have time to dig into. What we've done is that an instruction now goes into decode/execute and computes its result, then in the next cycle it sits in this writeback stage here and writes itself back to the register file. This gives us an extra cycle to wait for the result to turn up from memory, or for the response to a store, so we lose a stall cycle from loads and stores. This is going to help solve our stall problems around memory. As I've said, it's a reasonably complicated thing to do, and I just don't have time to go into all the details of what's involved.
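Since there isn't time for the full details, here is only a heavily simplified sketch of the basic idea, with hypothetical signal names. It leaves out all of the hazard, forwarding and stall logic a real implementation needs (for example, stalling writeback if a load response is late), which is exactly the part that makes the real change complex.

```systemverilog
// Illustrative fragment: instead of the execute stage writing the register
// file directly, results are captured into a writeback register and written
// one cycle later. That extra cycle is where a load's data (or a store's
// response) can arrive without stalling execute. Names are hypothetical.
logic        wb_valid_q, wb_is_load_q;
logic [4:0]  wb_rd_addr_q;
logic [31:0] wb_result_q;

always_ff @(posedge clk_i or negedge rst_ni) begin
  if (!rst_ni) begin
    wb_valid_q <= 1'b0;
  end else begin
    wb_valid_q   <= ex_valid & ex_rf_we;  // instruction moving EX -> WB
    wb_is_load_q <= ex_is_load;
    wb_rd_addr_q <= ex_rd_addr;
    wb_result_q  <= ex_result;            // ALU / branch / CSR result
  end
end

// Register file write now happens from the writeback stage. For a load,
// the memory response arrives during the WB cycle and is written directly.
assign rf_we_o    = wb_valid_q & (~wb_is_load_q | data_rvalid_i);
assign rf_waddr_o = wb_rd_addr_q;
assign rf_wdata_o = wb_is_load_q ? data_rdata_i : wb_result_q;
```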
And when I implement that, area goes up, as you might expect, but we get about a 20% improvement in CoreMark per MHz. So there's a notable area cost, but it's outweighed by our performance gains. Fmax has basically stood still; this is the writeback stage along with that new branch target ALU, so it hasn't really affected it. It's still those branch target ALU changes that are dominating our timing and keeping us at the fmax that we have.

So, a nice complicated graph: the same thing you've seen before, benchmarks along the bottom, CoreMark on the far left, the geometric mean on the far right, the Embench suite in the middle. It shows the total speed-up, and the breakdown of what the branch target ALU has done for us and what the writeback stage has done for us on top of that. Overall the geometric mean there is 21.3%, which I think is a really quite significant gain in performance for around 7% area, which seems like a pretty good deal to me. Obviously it doesn't affect all benchmarks equally; not all benchmarks are memory or branch bound. I think some of these benchmarks in particular use, say, a lot of multiplies, so that's what's slowing them down instead.

And there we go: a quick whistle-stop tour through some of the performance improvements I did. If you'd like to find out more, you can check out our Ibex repository here on GitHub. Not all of the work I've just been discussing is in the main repository yet, so I've put up a special Ibex FOSDEM branch in my own Ibex fork, which you can go and take a look at if you want to recreate my results or play around with what I've done. You can also check out the lowRISC website. We're currently hiring, so if you're interested in working on open silicon, we're hiring for both software and hardware positions; do get in touch. And if you have any questions about this work or anything else, feel free to get in touch and drop me an email. Great, well, thank you very much.