 Greetings, RISC-V friends, and here is another video. This isn't going to be a single topic, it's going to be a bunch of topics, because I have a bunch of little things to update about. I'm building a RISC-V processor, not on an FPGA. I'm going to talk about a new update to my ALU design, and I'm going to talk about a tester board, another idea about building a tester board, and based on that I'm going to talk about the unused signals that I have available to me, and then I'm going to talk about looking at all the instructions to see if I have to use any of those unused signals, and then I'm going to take a closer look at the branch if equal instruction, and I'm also going to be taking a closer look at the load and store instructions, and also the jump and link instruction, and then I'm going to talk about the instruction controller, which is the thing that will just fetch instructions from memory and execute them. Alright, let's talk about a new revision of the ALU. So I was talking to a friend at work about how the ALU is designed, and basically I said it, you know, here is the ALU, here is source one, here is source two, and here is the destination, and then of course you have a function that goes along with it, and I talked about the carry look ahead adder, and basically, again, just to go over it really quickly, you have a bit slice, in fact you have multiple bit slices, and you also have a carry look ahead unit, and the idea is that with the bit slice you have four bits at a time, each four bits and so on, oops, four, and you have the four bit result, four bit result, and you also have a propagate and generate line, propagate and generate, propagate and generate, and based on the propagate and generate lines, which don't depend on any of the carry-ins of the bit slices, you can determine what those carry-ins are supposed to be using the carry look ahead unit. 
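As a rough functional sketch of the carry look ahead idea just described (not the actual memory contents of the ALU), the Python below splits a 32-bit add into eight 4-bit slices. The names `slice_pg`, `lookahead` and `add32` are made up for illustration. The point to notice is that the propagate and generate lines are computed from the operands alone, and the carry-ins are then derived only from those lines.

```python
MASK4 = 0xF

def slice_pg(a4, b4):
    """Group propagate/generate for one 4-bit slice. No carry-in is needed."""
    p = a4 ^ b4  # per-bit propagate
    g = a4 & b4  # per-bit generate
    group_p = int(p == MASK4)  # a carry-in would ripple through all four bits
    carry = 0
    for i in range(4):
        # With a carry-in of 0, the carry out of bit 3 is the group generate.
        carry = ((g >> i) & 1) | (((p >> i) & 1) & carry)
    return group_p, carry

def lookahead(pg_lines, carry_in=0):
    """Carry-in for every slice, derived only from the P/G lines.

    Written as a loop here for brevity; in hardware this is one flat
    combinational function (or one memory lookup), not a ripple.
    """
    carries = []
    c = carry_in
    for group_p, group_g in pg_lines:
        carries.append(c)
        c = group_g | (group_p & c)
    return carries, c

def add32(a, b, carry_in=0):
    """32-bit add built from eight 4-bit slices plus the look ahead unit."""
    slices = [((a >> (4 * i)) & MASK4, (b >> (4 * i)) & MASK4) for i in range(8)]
    pg_lines = [slice_pg(x, y) for x, y in slices]
    carries, carry_out = lookahead(pg_lines, carry_in)
    result = 0
    for i, ((x, y), c) in enumerate(zip(slices, carries)):
        result |= ((x + y + c) & MASK4) << (4 * i)
    return result, carry_out
```

For example, `add32(0xFFFFFFFF, 1)` gives a zero result with the carry out set, because every slice propagates the carry that the lowest slice generates.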
So the carry look ahead unit would look at the propagate and generate bits, and generate the carry-ins for this. So the problem, though, is that when I explained this, I realized that what I'm actually doing is I'm taking a layer of bit slices, and I'm feeding an input into it, then I'm taking a carry look ahead unit, and I'm feeding the input of the carry look ahead unit from the output of the bit slices, then I'm taking the output of the carry look ahead unit and feeding it back into the bit slices, and then taking the result of the bit slices, and then outputting it, and this is just not going to work. So these are memories, and the idea is that, you know, here's the address, and here's the data, here's the address, and here's the data. So the problem is that what I've got here is an unclocked feedback loop right over here. That's the unclocked feedback loop. So what would happen is I would put an address into this first memory, get the data out, and feed that as the address to the second memory. The second memory would then generate some data, but of course that data is only guaranteed to be stable after a certain amount of time, like let's suppose, you know, I think it's going to be 10 nanoseconds. So in other words, when this gets fed back, the address is basically going to be pretty much random. It's not going to settle until 10 nanoseconds later, which means that the data here is not going to settle, which means that the address here is not going to settle, which means that probably the data here is never going to settle. So this is just a bad idea, and that's what my original ALU was designed on. So I needed to modify this. So in order to modify this, basically what I said was this. I have a bit slice, which only generates, propagate and generate outputs. That goes to the carry look ahead unit. 
Then I have a bit slice over here that the carry look ahead unit feeds, and also the input feeds, and then that is the output, and you can see that I don't have any feedback loop here. So really the only thing that this thing does, it's not really a bit slice, it's just propagate and generate. This is actually the bit slice over here. So we have one layer of memory here, another layer of memory here, and another layer of memory here. Now of course, when I feed the address into this memory, the data is not going to settle until 10 nanoseconds after that, which means that the data out of the CLU is not going to settle until 20 nanoseconds after I initially put the address in. And of course, this line right over here doesn't change, so it's only this line that changes. And that means that this memory incurs an additional delay of 10 nanoseconds. So from beginning to end would be about 30 nanoseconds. And again, there are no feedback loops, which means that after 30 nanoseconds, I'm guaranteed that the output is stable and correct. So that's my new design for the ALU, and we can take a quick look at it. So here is the ALU, and there's just a lot to it, but basically it flows from left to right. Over here on the left are the inputs, and then there is a layer over here of buffers. And then here is a bunch of memory that generates the propagate and generate data. That feeds, let's see, where is the carry look ahead unit? I think this is the carry look ahead unit right over here. So that generates all the carries. And then here is the section of memory that holds the actual bit slices, and that feeds buffers which go to the output. All this other stuff is additional buffers and ROMs in order to initially load the memory on boot up. So that's basically that. In terms of the printed circuit board, that's kind of what it looks like. It's pretty huge. It's got a lot of stuff in it, and it took a long, long time.
We can even look at the 3D display. For some reason, KiCad is not displaying the bodies of the chips, but basically that is what it's going to look like. And I guess you can just sort of imagine chips on here. I've looked online and I have no idea why the chip bodies are not actually showing up. But in any case, I have over here this nice little sort of testing. I bring out some of the signals to make sure that the thing boots up properly. And unfortunately, I had to put some of the flash chips on the other side, not flash chips, some of the RAMs on the other side. And these two are actually the boot counters, and a bunch of capacitors and a few resistors ended up on the other side as well. But that's okay. I mean, this thing is just going to be hand soldered. It's fairly straightforward. So anyway, that is the ALU. Okay, so anyway, I was talking with my friend about this and he got fairly excited about it and wanted to help design a testing board for it. Now, I had originally designed a testing board, but I was never really satisfied with it. So what we did was we decided to do something like this. Now, Adafruit has this neat thing where there is a board with a USB connection on it and it uses an FTDI chip and there are just GPIOs on the outside. Now, this FTDI chip is an FT232H, which has the ability of taking in and sending out data on the USB, but also configuring its GPIOs for things like SPI and I2C and RS232 and some other things and also general GPIO. So it's kind of a versatile little board and I kind of like that. It's only $15. So what we decided to do is take this and use the outputs to drive some serial to parallel IO. So here is the schematic. Up here in the upper left is just a DC to DC converter because I want this to also power the backplane as well. So it powers all the cards.
So I've got a little jack and I can put in apparently, according to the specs of this converter, anywhere between 3.45 and 17 volts, but I'm just going to say, well, it's 12 volts. And that outputs 3.3 volts to power pretty much everything. And I put in a little power LED because of course you should always have a power LED. This thing on the left side is the Adafruit FT232H breakout board, which I say must be configured in MPSSE mode. That basically just means that it can output I squared C. In addition to outputting I squared C, it can also, I believe, output GPIO on the other pins that aren't used. So I have a reset line and an output enable line. Okay. The nice thing is that I'm using this chip. This is a PCA9698. It is basically an IO expander and it has five groups of eight GPIOs. So 40 GPIOs in total. And in order to fully populate all of the signals that I have on the backplane, I only need four of them for 160. And actually I don't need some of those signals. But in any case, this is how I grouped it. So that's nice. So the I squared C from the FT232H breakout feeds all of them. Each chip is on a different I squared C address. And that's pretty much the tester. So I basically feed commands into the USB, which changes these signals and can read some other signals and send them back over the USB. At least that's the theory. We'll have to see. So the printed circuit board looks like this. Well, it doesn't really look like a lot. Okay, so here is a socket where I can put the breakout board. This is the DC to DC converter. Now, this is actually an LGA, a land grid array. Unlike a BGA, it doesn't have balls. So this is going to be a new technique for me in terms of soldering. I'm kind of hoping that it'll work. It really bothers me that I will have to basically heat this up and hope that all of the solder melts properly. But apparently people do it. So I guess it's worth putting that technique in my toolbox. Here's the jack over here on the right side.
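To make the arithmetic of four expanders times five banks times eight bits concrete, here is a small Python sketch of how tester software might map a signal number onto a chip, bank and bit. The linear numbering and the names `locate` and `pack_banks` are assumptions for illustration only. The real grouping is whatever the schematic defines, and no actual I squared C traffic is shown here.

```python
CHIPS = 4            # four PCA9698 expanders on the tester
BANKS_PER_CHIP = 5   # each expander has five 8-bit groups
BITS_PER_BANK = 8
TOTAL_SIGNALS = CHIPS * BANKS_PER_CHIP * BITS_PER_BANK  # 160

def locate(signal):
    """Map a hypothetical linear signal number (0..159) to (chip, bank, bit)."""
    if not 0 <= signal < TOTAL_SIGNALS:
        raise ValueError(f"signal {signal} is outside 0..{TOTAL_SIGNALS - 1}")
    chip, rest = divmod(signal, BANKS_PER_CHIP * BITS_PER_BANK)
    bank, bit = divmod(rest, BITS_PER_BANK)
    return chip, bank, bit

def pack_banks(levels):
    """Turn {signal: 0 or 1} into one byte per bank, per chip.

    Returns a list of CHIPS lists, each holding BANKS_PER_CHIP bytes,
    which is the shape a per-bank output write would need.
    """
    banks = [[0] * BANKS_PER_CHIP for _ in range(CHIPS)]
    for signal, level in levels.items():
        chip, bank, bit = locate(signal)
        if level:
            banks[chip][bank] |= 1 << bit
    return banks
```

With this layout, signal 159 lands on the last bit of the last bank of the fourth chip, which is where the figure of 160 signals comes from.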
And here are the four GPIO chips. I also have a jumper here because I'm not quite sure whether the I squared C pull-up resistors will need to be 5 volts or 3.3 volts. The whole thing should be 3.3 volts. This breakout board is actually powered by the USB 5 volts. And I'm not actually using that 5 volts, except possibly for the I squared C pull-ups. I think the I squared C will be fine with 3.3 volts. The GPIO chips which take the I squared C signals are 3.3 volt chips, so I guess that'll work. Anyway, so I can use that as my tester. And my friend is going to be writing some Python code to basically send and receive data from this tester board in order to test the ALU. And then of course we could write more programs to test all the other boards out, which is great. So let's see, here's what it looks like in 3D because that's always kind of fun to look at. Hooray, there we go. On the back, I put just a few capacitors. I just couldn't fit them in on the front because with 40 IOs per chip, this got pretty dense pretty quickly. So anyway, that's what that printed circuit board looks like and I will hopefully be getting that done over the next two weeks or so. So the next thing that I want to look at is, well, I have a backplane and I have defined a lot of the signals on it, so how many signals don't I use on the backplane? And looking at the schematic for the tester, which of course I would like to be able to hook into all of the signals, we can look at the right side, which is the card edge connector, which I broke the schematic symbol into multiple units just to sort of group the signals nicely. So you can see that here is source 1 and here are the signals for source 2, there are 32 of those, and here are the signals for the destination, 32 of those. Here are all the power and ground signals, there's a lot of those, they sort of intersperse among the other signals, which helps provide a little bit of isolation between signals. 
And then we have a lot of the control lines, so there are four boot control lines, there's one shifter select, there's an ALU select along with three ALU functions. There's the RS1, RS2, and RD register selects, five bits each. And then I have three lines to basically say, read RS1, read RS2, and write the destination. Now these signals over here, I don't actually need anymore. I thought I did at one point but I don't. Basically these are the so-called not connected signals, there are 24 of them. So what I decided to do was look at the rest of the instruction set in order to determine what other signals I need to define. So I looked at the instructions and I decided to draw representations of the instructions. Now there's this neat book that I had as a kid, called Programming the 6502 by Rodnay Zaks. He actually founded this publisher called Sybex, which had a lot of low level programming and hardware books out in the 80s. This is a 1983 book. And basically, like all programming and hardware books and microprocessor books of the era, it started with really basic concepts. I mean, what is programming? That's pretty basic, like flow charts and stuff, bits. And then it went into what is the 6502? Again, basic programming tips that really don't have anything to do with the microprocessor itself, except maybe BCD arithmetic because the 6502 had a BCD mode. And then it went over the 6502 instruction set. Now, I really liked the descriptions of the 6502 instruction set. So we can go down to one of the pages. This is the store Y in memory. So each of these instructions was basically a one pager showing what the function of the instruction was, what the format of the instruction or the machine language encoding of the instruction was, a description. And then it had some interesting diagrams for how the instruction worked. The 6502 has addressing modes. So how does that map to the format?
The effect on the flags and some extra instruction codes, I guess, also based on addressing modes. Here's a two pager because the store accumulator in memory had a whole bunch of addressing modes. Here's a really simple one, set interrupt disable. That's the format. It's just that thing. So, you know, I kind of like these because I guess, you know, this was the first time that I saw an instruction sort of graphically represented. And I like that. So I decided to emulate that. So here's an example. This is the ADDI instruction or add register immediate. And I put some, you know, vital statistics about the instruction, you know, what the format is, what the opcode type is, what it looks like in assembly. So this is ADDI, destination register, source register one and a 12-bit signed integer. What it actually does in terms of, you know, I guess a sort of pseudo-algorithmic way of representing the instruction. So basically this says the contents of the destination register gets the contents of source register one, plus the sign extension of the immediate value in the instruction. And that's basically just the description. The data path. I drew this register file. And here's the instruction encoding down here showing that there are parts of the instruction that point to the register file, different registers in there. You know, this register goes into one end of the ALU, the immediate value gets sign extended and goes into the other end of the ALU. The function of the ALU comes from the funct3 section of the instruction. And then the result of the ALU goes into, in this example, x30, which is determined by the RD part of the instruction encoding. So I thought that was kind of neat. And then I put in an example. So here. And I think this is actually wrong because this probably needs to be negative one.
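As a small worked sketch of the ADDI data path and of why the assembly operand should be written as negative one, the Python below encodes and executes `addi x30, x1, -1` using the standard RV32I I-type layout. The function names are invented for illustration. The 12-bit field in the encoded word is 0xFFF, but the value the assembler accepts is the signed integer -1.

```python
MASK32 = 0xFFFFFFFF
OPCODE_OP_IMM = 0x13  # RV32I opcode shared by ADDI and the other I-type ALU ops
FUNCT3_ADDI = 0x0

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement integer."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def encode_addi(rd, rs1, imm):
    """Build the 32-bit word for `addi rd, rs1, imm`."""
    if not -2048 <= imm <= 2047:
        raise ValueError("immediate must fit in a 12-bit signed integer")
    return (((imm & 0xFFF) << 20) | (rs1 << 15) | (FUNCT3_ADDI << 12)
            | (rd << 7) | OPCODE_OP_IMM)

def execute_addi(word, regs):
    """Apply an ADDI word to a list of 32 register values (x0 stays zero)."""
    rd = (word >> 7) & 0x1F
    rs1 = (word >> 15) & 0x1F
    imm = sign_extend(word >> 20, 12)  # this is what feeds the second ALU input
    if rd != 0:
        regs[rd] = (regs[rs1] + imm) & MASK32
```

Passing 0xFFF to `encode_addi` raises an error because 4095 is out of range for a signed 12-bit value, while passing -1 produces a word whose immediate field holds 0xFFF.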
Here, I'll just change that to negative one because that's actually what it is, because it is actually a 12-bit signed integer and 0xFFF is not a valid 12-bit signed integer. Negative one is. So the destination is x30, the source is x1, and then you're basically just adding negative one, i.e. subtracting one. So here I have the encoding of the instruction with the program counter pointing to it and the initial contents of the registers and then of course afterwards. Well, the program counter got incremented by four and x30 changed so that it is x1 minus one. And I put some notes down here. So that's kind of neat. You know, this basically just uses the ALU and I already know that, you know, there's an RS1 bus, an RS2 bus and an RD bus and there are some register selection signals. So I've already had all that and I've already covered those in other videos. We just save this. Then I got around to looking at the branch if equal instruction. So what does the branch if equal instruction look like? So here it is, branch if equal. It looks at source register one and source register two. And it takes a 13-bit signed integer that is even. So in the immediate value in the instruction, you don't actually have the lowest bit because of alignment. So here is the function. Basically, the program counter gets either changed by the immediate value if the two registers are equal. Otherwise, the program counter just gets incremented by four. And here's the data path. And that's really what I wanted to look at. So here we can see that we can use the ALU and I've put in a function called equal. Now, my original design of the ALU did not have an equality function. It did have a less than function because there is a set if less than instruction, but there wasn't any set if equal instruction. So I figured, well, I just didn't need that function.
Well, it's actually nice to have because not only is there a branch if equal and branch if not equal, but there's also a branch if less than and a branch if not less than or in other words, branch if greater than or equal to. So the branch if less than instruction can be implemented using the ALU because of course we're comparing RS1 and RS2. But there wasn't any function in the ALU that could be used for branch if equal. So I just threw that in. So that's a new function. The output of the ALU goes essentially to a multiplexer because the output is either going to be zero or one. And if it's zero, in other words, if it doesn't compare as equal, then I just use four. Otherwise, I use the immediate value and I add that to the current program counter. Great. So that was one modification to the ALU that I made. And the interesting thing is that I didn't actually need to make any hardware modifications with the exception of outputting the actual control lines for equal. I had to use one of the encodings for equal. But aside from that, I just had to change the memory contents in order to give us an equality comparator, which was kind of neat. So here is the example. If X1 and X3 are equal, then change the program counter by minus eight. Now, unlike I think some microprocessors, that offset is from the current program counter, not the program counter if it were to point to the next instruction. So that's interesting because if you use zero as an offset, basically that means don't change the program counter. So here is the encoding of the instruction. Here is the program counter pointing to that. Here are X1 and X30. I intentionally made them equal. So of course, after that instruction, the program counter goes back by eight. Now you might be wondering, well, you know, in terms of alignment, and here I have this whole note about misaligned instructions. So the program counter has to point to a 32-bit aligned address in memory. 
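The branch behaviour described above, with the ALU's equal output steering a multiplexer between four and the immediate, can be sketched in a few lines of Python. The names `beq_next_pc` and `MisalignedFetch` are hypothetical, and the starting address in the usage note is an arbitrary assumption. The alignment check reflects a processor without the compressed instruction set, where a taken branch to an address that is not a multiple of four raises a misaligned fetch exception.

```python
MASK32 = 0xFFFFFFFF

class MisalignedFetch(Exception):
    """Raised when a taken branch targets an address that is not 4-byte aligned."""

def beq_next_pc(pc, rs1_value, rs2_value, offset):
    """Next program counter for `beq rs1, rs2, offset`."""
    if offset % 2 or not -4096 <= offset <= 4094:
        raise ValueError("offset must be an even 13-bit signed integer")
    equal = int(rs1_value == rs2_value)  # the ALU's new "equal" function: 0 or 1
    step = offset if equal else 4        # the multiplexer on the ALU output
    target = (pc + step) & MASK32        # offset is relative to the current PC
    if equal and target % 4:
        raise MisalignedFetch(hex(target))
    return target
```

For instance, with the program counter at 0x100 and equal registers, an offset of -8 yields 0xF8, an offset of 0 leaves the program counter where it is, and an offset of 2 raises `MisalignedFetch`.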
And if it doesn't, then you're going to get a misaligned instruction fetch exception. Well, you might say, well, that means that the program counter doesn't really need the bottom two bits because the bottom two bits should always be set to zero, right? And unfortunately, we're talking about even addresses here, which are 16-bit aligned. Well, the reason for that is that there is a compressed instruction set where the instructions are 16-bit aligned. So that's why you are actually able to specify even addresses, because those are valid in the compressed instruction set. I'm not implementing the compressed instruction set, and if your processor doesn't, then it should throw a misaligned fetch exception if you're pointing to somewhere in the middle of a four-byte word. Okay, so that's the branch if equal instruction. Let's take a look at the load and store instructions. Okay, here is the load half-word instruction. So in terms of assembly, you have a destination register and a source register. And what you're doing is you're taking the source register and then you have this signed 12-bit offset, which you add to the contents of source register one. And then you go to the half-word in memory at that address. You sign extend that to 32 bits and then store that in the destination register. And here's the picture. Here's the register file. So basically we can see that we take the source register and add it to an offset. And that points into memory. This is any byte in memory. So you're taking two bytes from memory and you are treating that as a little-endian value and sign extending it to 32 bits and then storing that into another register. And here's an example where we are taking x1 and subtracting 4 from it and using that as the address and storing the contents of that address into x30. So here's the program counter and here is the address that we're targeting, sort of.
If you look at x1, so if you look at x1, you see that we're pointing to f6 and of course f6 minus 4 is f2, which is halfway through this 32-bit word. Which doesn't matter because memory is byte oriented. So we're just pointing to, well it turns out to be c5, c1. And then we're going to sign extend c5, c1 to ff, ff, c5, c1. And that's what gets put into the destination register. And there's a little bit of an explainer over here because, well I guess I didn't want to assume that people were familiar with little endian and big endian and what it means in terms of bytes. Well actually because that sometimes confuses me, so I put all this notes for me. Anyway, so that's load half word. Now the interesting thing is, here's the source register 1, so that is the source register 1 bus. And here is the destination register bus. So really all we have for this offset, you know, where do we put this offset? Now you know that we have those 24 signals left over. Could we put this offset in those 24 signals and then use the ALU for this? That would mean using RS2. And I suppose we could, except for the fact that the output of this plus block, if that were the ALU, would be required to be the destination bus, destination register bus. But we've already taken up the destination register bus. So this cannot be the ALU, which means that there has to be an adder somewhere else. So that was just one of the mental notes that I made for myself. Let's take a look at the store instruction. So here is the store instruction. In this case, I chose to write up the store byte instruction. So you take a 12-bit signed offset, add that to RS1, and you take RS2. This should be the contents of RS2. Oh wait, that's the assembly, right? So in terms of the function, we sign extend the immediate value in the instruction, add that to the contents of RS1. Treat that as the address in memory or the byte address in memory. Then you take the contents of register RS2. 
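The load half-word example above, and the address calculation that the store byte instruction shares with it, can be sketched as follows. The memory is modelled as a small `bytearray`, the full value of x1 is assumed to be 0x000000F6 (only the low byte is given above), and the function names are made up for illustration. The addition of the offset is written as ordinary arithmetic here precisely because, as noted above, it cannot be the ALU and needs its own adder.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement integer."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def load_half(memory, regs, rd, rs1, offset):
    """lh rd, offset(rs1): read two bytes little-endian, sign extend to 32 bits."""
    addr = (regs[rs1] + offset) & MASK32          # the separate address adder
    half = memory[addr] | (memory[addr + 1] << 8)  # low byte is at the low address
    if rd != 0:
        regs[rd] = sign_extend(half, 16) & MASK32

def store_byte(memory, regs, rs1, rs2, offset):
    """sb rs2, offset(rs1): write only the lowest 8 bits of rs2 to memory."""
    addr = (regs[rs1] + offset) & MASK32          # same adder, same calculation
    memory[addr] = regs[rs2] & 0xFF

# Reproduce the example: x1 points at 0xF6, offset is -4, so the address is 0xF2.
memory = bytearray(256)
memory[0xF2] = 0xC1  # low byte of the half-word
memory[0xF3] = 0xC5  # high byte of the half-word
regs = [0] * 32
regs[1] = 0xF6       # assumed full value of x1
load_half(memory, regs, rd=30, rs1=1, offset=-4)
```

After this runs, `regs[30]` holds 0xFFFFC5C1, matching the sign-extended value in the example.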
You only take the lowest 8 bits and you store that into memory. So that's what this looks like. And again, we've got RS1 and RS2, so those buses are being used. Which means that, again, this plus block cannot be the ALU because we've already used up RS2. We could have the destination register over here, but again, because RS2 is not the second argument to the ALU, we have disqualified the ALU from being this plus block. So obviously there needs to be this extra plus block. Okay, next let's look at the jump and link instruction. Okay, the jump and link instruction. This is the assembly language format, so there's a destination register and also a signed 21-bit offset that must be even. And the functionality is basically you take the next instruction, you take the address of the next instruction and put that in the destination register, and then you change the address of the next instruction to execute based on sign extending the immediate value. And this is basically how it works. We have a destination register, so that is where the next instruction is going to be stored. And then you have this immediate offset, which you have to add to the current program counter and put back into the program counter for the next fetch cycle. And here's an example. X1 is the destination register and you're jumping to the offset 12. So here is where the current instruction pointer points to. Here is the address of the next instruction and here is the destination register, X1. So of course when we execute this instruction, the program counter increments by 12, that puts it here. And the address of the next instruction, which would be where you return to, if this were a subroutine, is put into X1. And I have a little note here about what happens if you're trying to jump to something that is outside the bounds of this 21-bit signed integer. Well, you can use these two instructions, which allow you to do that. 
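Here is a minimal Python sketch of the jump and link behaviour just described, with the two additions kept separate: one adds four to the program counter to produce the return address, the other adds the sign-extended offset to produce the jump target. The function name `jal` and the starting address in the usage note are assumptions for illustration, and the misalignment check assumes no compressed instruction set.

```python
MASK32 = 0xFFFFFFFF

def jal(pc, regs, rd, offset):
    """jal rd, offset: link the next instruction's address, return the new PC."""
    if offset % 2 or not -(1 << 20) <= offset < (1 << 20):
        raise ValueError("offset must be an even 21-bit signed integer")
    link = (pc + 4) & MASK32         # first adder: address of the next instruction
    target = (pc + offset) & MASK32  # second adder: offset from the current PC
    if target % 4:
        raise ValueError("misaligned jump target without compressed instructions")
    if rd != 0:                      # x0 is hard-wired to zero
        regs[rd] = link
    return target
```

Starting from a program counter of 0x40, `jal(0x40, regs, 1, 12)` returns 0x4C and leaves 0x44 in x1, which is the address a subroutine would return to.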
So now the problem is that we have these two adders, and we could in theory use RS1 and RS2. Where RS1 is the current program counter and RS2 is 4, and then that allows this to be the destination register. That's one possibility. But going to the next instruction is something that we're going to have to do all the time. So we're always going to have to add 4 to the program counter and then store it back into the program counter. So it's probably not a great idea to use the ALU. So obviously we need this adder block, and we're going to have to have another adder block. So that's the jump and link instruction. So the other question is, well, if you look at this, this is actually a 21-bit offset. So are we going to have to use 21 signals to go to somewhere? And the answer is I don't think so, because if the instruction goes into some instruction controller, and the program counter is also on that instruction controller, well then we can just take these bits out of the instruction and put an adder on the instruction controller and put another adder on the instruction controller for adding 4, and then just, you know, route the output of this adder to the PC and route the output of this adder to the destination bus and then just go ahead and clock everything, and that stores the destination register and the program counter all at once. So that is an initial idea of how the instruction controller would work. And so we get to the final topic of this video, which is what the instruction controller looks like. So first, we're going to talk about the fetch and execute cycle. So we have some program counter, and we have to point it at memory, and the output of the memory is going to be the instruction. So we will have an instruction register, which is clocked. So what this basically allows us to do is we have the program counter that's already set up with the address that we want. 
It flows into memory, and then, you know, however many nanoseconds later, the instruction comes out and is presented to the instruction register. At that point, we can go ahead and clock the instruction register and now we can use the instruction without having to worry about modifying the PC and having the instruction change on us. Okay, so that's basically phase one, the fetch phase. Now, the second phase is to basically decode this, and, you know, this goes into all the control signals, decoding, and, you know, so we would have, you know, maybe RS1 address, the RDA address, the destination register address, you know, we would set up, oh, I don't know, maybe it's a shift instruction, so we have the output enable for shift set up, and, you know, we would set up all the control lines, and basically all that flows through whatever card is activated, in this case it's the shifter. The results are computed, you know, that all happens in however long it takes, and then we go and latch or register all of the results. So in this case, if it were a shift, so this would actually clock the destination bus into the destination register. Simultaneously, what we want to do is we want to take the PC and add four, because we're not doing any jumps or branches or anything like that, and we want to get to the next instruction. So that sort of implies that here's the PC register, and, of course, well, that's PC, right, PC, and that means that we want to clock this on the second phase. So simultaneously we modify the program counter and also all the destinations that, you know, need to be set up, which might include, for example, if this were a store instruction, that might include writing to memory. 
So it's an open question at this point, whether I need to do some sort of a phase two followed by a phase three, which is actually going to be this, to change the program counter, because once you change the program counter, that means that, you know, you're going to be accessing memory, you're going to be fetching another instruction. So I'm not sure whether this is going to be three phase or two phase. So definitely phase one is fetch. So one possibility is that this is execute and you don't have a third phase. The other possibility is we have fetch, we have execute, and then we have, what, I don't know, next because that is going to modify the program counter. So I'm not sure which of these is going to be the correct one to use. If execute and next are in the same phase, then I suppose it's possible for the PC register to get updated and things to happen before the, say, registers can be updated or before the memory can be updated. So one of the problems is that, you know, let's consider a store instruction. So let me go ahead and get rid of, well, the whole thing, say. So let's suppose we have a store instruction. Well, the store instruction, first of all, requires the PC to be incremented by four. And it also requires, you know, some offset from some register to be treated as the address of memory. So here's memory. And we're going to have some sort of a multiplexer here. Here's the address. Here's the data that presumably goes on the destination register. So we've got two things happening at the same time. We have the destination register getting clocked and also the PC getting clocked at the same time. Oh wait, this, yeah, okay, sorry, that's actually a load. The store would actually look like, you know, this. Some source register, you know, maybe this is RS1, this is RS2, and then of course we clock the memory by writing it. 
So again, if we do that simultaneously with updating the program counter, and then of course this multiplexer is going to be, here is phase one and here is phase two. So this is phase two and this is phase two. So obviously the address that we want is the address during phase two. And then of course once we go ahead and write memory and rewrite the PC, we want to switch to phase one. So, I don't know, maybe it'll work. So here is the basic clock. Here is phase one, phase one. Here is phase two. It's going to be like this. So let me put little arrows here where things happen. So in terms of this multiplexer, what will the multiplexer look like? So during phase one, that's when we do the fetch. So basically the multiplexer is going to look like PC plus four over here. Or will it? No, actually it won't. Because what's going to happen is we're going to latch the instruction and the instruction is basically going to set up all of the phase two stuffs. So the multiplexer is going to have the address. So then once the address is set up and everything else is set up, including the data and all the control lines, then we go ahead and have the rising edge of phase two clock everything into memory. And also switch over the multiplexer to PC plus four. So hopefully now because these will be on different cards, this signal and this signal are on different cards, which means that they will be slightly offset. So it's entirely possible, for example, for phase two, the rising edge of phase two to appear at the program counter over here, while a tiny bit later the rising edge of the leading edge of phase two appears at the memory. So, you know, it's entirely possible for the multiplexer to output the incorrect value by the time the memory gets clocked or something like that. So, you know, I may actually have to go to a third phase. It depends. I guess we'll have to find out. So anyway, that's really all I wanted to talk about. 
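The two-phase idea can be modelled in software as a rough sketch, though a model like this cannot show the clock skew between cards that might force a third phase. In the Python below, phase one clocks the instruction register from memory at the current program counter, and phase two computes both the result and the next program counter from the old state before committing them together. The class name `Controller` and its structure are invented for illustration, and only ADDI is implemented.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement integer."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

class Controller:
    """Toy two-phase fetch/execute model that understands only ADDI."""

    def __init__(self, program, base=0):
        # Word-addressed program memory, keyed by byte address.
        self.mem = {base + 4 * i: word for i, word in enumerate(program)}
        self.pc = base
        self.ir = 0
        self.regs = [0] * 32

    def phase1_fetch(self):
        """Clock the instruction register from memory at the current PC."""
        self.ir = self.mem[self.pc]

    def phase2_execute(self):
        """Compute everything from the old state, then commit on one edge."""
        word = self.ir
        opcode = word & 0x7F
        funct3 = (word >> 12) & 0x7
        rd = (word >> 7) & 0x1F
        rs1 = (word >> 15) & 0x1F
        if opcode != 0x13 or funct3 != 0:
            raise NotImplementedError("this sketch only handles ADDI")
        result = (self.regs[rs1] + sign_extend(word >> 20, 12)) & MASK32
        next_pc = (self.pc + 4) & MASK32  # the dedicated PC + 4 adder
        # Simulated clock edge: destination register and PC change together.
        if rd != 0:
            self.regs[rd] = result
        self.pc = next_pc

    def step(self):
        self.phase1_fetch()
        self.phase2_execute()

# addi x1, x0, 5 followed by addi x30, x1, -1
cpu = Controller([0x00500093, 0xFFF08F13])
cpu.step()
cpu.step()
```

After the two steps, x1 holds 5, x30 holds 4, and the program counter has advanced to 8. Because the instruction register is only updated in phase one, changing the program counter in phase two does not disturb the instruction being executed.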
Over the next probably two weeks, possibly three, I'm going to get in all the parts for the ALU and the tester and order the printed circuit boards, get them in, solder them up together and hopefully things will work. But until then, I'll see you next time. I'm building a RISC-V processor, not on an FPGA.