That's okay, right? Yeah, it looks pretty okay already. Okay, hello everyone. Today I'll be talking about designing a 16-bit processor using Verilog on a Basys 3 board, which is a Xilinx board.

So just some background. I happen to be learning Verilog and VHDL because of the EE2026 module at NUS. During the module, they actually pass you a Basys 3 FPGA board, which uses the Artix-7 architecture. There's a whole bunch of logic cells, and it's actually pretty nice. They also teach you how to use the development environment, Vivado Design Suite, which is a bit old, but whatever, it's okay. And this is what the board looks like.

So this is not actually part of the module anyway. I just wanted to build a 16-bit processor because when you're building something, you get to really learn how it works. So it's really fun to go ahead and design your own processor. And also I have a lot of time.

So for specifications, I settled on eight internal registers. There are no special registers inside, so there's no register that's always zero or something like that. It's a 16-bit instruction set architecture, so the instructions are 16 bits long. I have implemented some fake RAM temporarily inside the FPGA, and it's 16-bit addressable to keep things simple for now. And it's a very simple pipeline. I'll explain more of the pipeline later on.

So the very first part of the processor will be a register file. And what is a register file? Well, it contains the eight 16-bit internal registers, inside this block diagram. For the basic ports, the inputs are on your left and the outputs are on your right. So the inputs are the clock, the write enable, the overall enable, the selectors for A, B, and D, and the data input for D.
So how this works is that the selectors for A and B are three bits long, and they select between the eight internal registers, and those will appear on the outputs, data A and data B, and those buses are 16 bits wide. Select D allows you to select a register for you to write to. Data D usually comes from the ALU, so the output from the arithmetic logic unit will go into data D, and select D will decide which register that gets written into. The write enable and the enable are there because of the pipeline later on, which uses enable assertion to make sure that all of the modules get enabled in a certain pattern, which I'll explain more later on. And the clock is there because every CPU needs to be synchronously clocked. If not, everything will happen at once, and nothing will work.

So this is the basic Verilog for something like that, and it's pretty simple. It's just the part here; what it does is take the selected register and output it to data A or data B. That's essentially it.

Okay, so for FPGAs, there's this special thing. You know, if you have code, you usually just compile it every time, but for FPGAs, you don't generate stuff and test it on your board every single time. This is because I found that every time you generate a bitstream, it takes a few minutes, and it's really slow. And also because you want to make sure that your hardware works before you flash it onto your FPGA. So that's the need for simulating at regular intervals. Each separate submodule you would simulate separately, and you write specific test benches to bring it through all the steps and make sure that everything is working properly before you spend the time to go ahead and generate a bitstream. And a bitstream is basically the file that you flash, or sort of flash, onto the FPGA to set its logic blocks.
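As a rough illustration of the register file ports described above, here is a small behavioural model. It is a Python sketch, not the Verilog shown on the slide. The class name, the method name, and the choice that a read in the same clock as a write returns the old value are all assumptions made for the example.

```python
class RegisterFile:
    """Software model of an 8 x 16-bit register file (illustrative only)."""

    def __init__(self):
        self.regs = [0] * 8   # eight general registers, none hard-wired to zero
        self.data_a = 0       # registered output for port A
        self.data_b = 0       # registered output for port B

    def clock(self, en, we, sel_a, sel_b, sel_d, data_d):
        """Model one rising clock edge."""
        if not en:
            return            # module disabled: hold registers and outputs
        # Read ports: 3-bit selectors pick one of the eight registers.
        # Assumption: reads see the value from before this edge's write.
        self.data_a = self.regs[sel_a & 0b111]
        self.data_b = self.regs[sel_b & 0b111]
        # Write port: data_d (normally the ALU result) goes to register sel_d.
        if we:
            self.regs[sel_d & 0b111] = data_d & 0xFFFF


rf = RegisterFile()
rf.clock(en=1, we=1, sel_a=0, sel_b=0, sel_d=4, data_d=0x4444)  # write R4
rf.clock(en=1, we=0, sel_a=4, sel_b=0, sel_d=0, data_d=0)       # read R4 on port A
print(hex(rf.data_a))  # 0x4444
```

A model like this is only a way to reason about the expected waveform; the test bench in the simulator is still what checks the actual hardware description.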
So for the register file, what I'm doing in the test bench process is that you read registers zero and one, you write 0xFFFF into R0, and then a bunch of stuff happens, and then you check on R4 at the end, and R4 should have 0x4444 after that. And this is the waveform that you get. You see that there's 4444 there, so it works, trust me on that.

So the next part is actually the instruction set decoder. But before you decide on the instruction set decoder, you need to decide what the instruction set is because, as I said, it's a custom instruction set. So I had to design my own instruction set architecture from scratch. The features are meant to be the bare minimum. The functions in summary, and I'm leaving out a few: add and subtract, signed and unsigned, the compares, the bitwise logical operations, and then the jumps, the loads for the immediate values, the fetch from and write to memory, and yeah, that's about it. There are exactly fourteen opcodes in my case. And the full instruction set is available on GitHub; you can go to the GitHub page and you'll see a README file, which has all the details.

So when designing the instruction set, I wanted to make sure that it's relatively easy to parse inside the FPGA, because I didn't want the slices of the instruction that go to the outputs to be dependent on the opcode. I didn't want to change that around on a conditional with the opcode, so that there are no multiplexers there. And I realized that you need a four-bit opcode for the instruction format because there are fourteen opcodes in total. Four bits will give you 16 opcodes, so there are two unused opcodes.

And so to do that, we have to classify the instructions that we have into the different instruction formats. The first one there, RRD, is two source registers and one destination register. An example of such an instruction would be add or subtract.
The second one is RD, for example read memory, which is one source and one destination register. RdImm, for example load register, is one destination register and an immediate value. Imm is just the immediate value. RR is two source registers and R is one source register.

So if you trial and error a bunch of times, this is what you get. The first four bits are the opcode. I have a flag there, which is a bit awkward to have because I could integrate it into the opcode, but it's a flag for now. And then it's sliced into the destination register, the first source register RA, and the second source register RB, and that overlaps with the immediate value down there.

And there are a bunch of quirks in this ISA, in that we only have eight bits for the immediate value, which means if you want to load a value into a 16-bit register, you take three steps. You have to load it twice, eight bits once and eight bits again, and then you have to OR the two eight-bit halves to get the total 16-bit value. That's implemented through a load high and a load low. Load high means you load the eight bits into the higher eight bits of the 16-bit register, and load low means you load the eight bits into the lower eight bits of the 16-bit register.

Yeah, so now coming on to the instruction set decoder. This is what the ports will look like. You have a clock, an enable, and an instruction that's 16 bits wide. The outputs will be the selection for A of the register file, the selection for B of the register file, the selection for D of the register file, the ALU opcode, which goes to the ALU and determines what operation it does, the immediate value, which is sliced out from there as well, and the write enable, which goes back to the register file to set its write enable.

Okay, we'll skip this. The test bench process is just to put an instruction on the instruction bus and see if the instruction is correctly decoded. And this one is. Yeah. This is a preview.
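The fixed-slice idea behind the decoder can be sketched in a few lines. This is a Python illustration, not the speaker's Verilog, and the exact bit positions are an assumption: the talk gives the field order (opcode, flag, destination, RA, RB, with an 8-bit immediate overlapping the low bits) but not the bit numbers. The names `Decoded`, `decode`, `load_high` and `load_low` are hypothetical.

```python
from typing import NamedTuple


class Decoded(NamedTuple):
    opcode: int  # 4 bits
    flag: int    # 1 bit
    sel_d: int   # 3-bit destination register
    sel_a: int   # 3-bit source register A
    sel_b: int   # 3-bit source register B
    imm8: int    # 8-bit immediate, overlapping the RA/RB fields


def decode(instr: int) -> Decoded:
    """Slice a 16-bit instruction the same way for every opcode.

    Assumed layout (hypothetical bit positions):
    [15:12] opcode, [11] flag, [10:8] rD, [7:5] rA, [4:2] rB, [7:0] imm8.
    """
    instr &= 0xFFFF
    return Decoded(
        opcode=(instr >> 12) & 0xF,
        flag=(instr >> 11) & 0x1,
        sel_d=(instr >> 8) & 0x7,
        sel_a=(instr >> 5) & 0x7,
        sel_b=(instr >> 2) & 0x7,
        imm8=instr & 0xFF,
    )


def load_high(imm8: int) -> int:
    """Place the 8-bit immediate in the upper byte of a 16-bit value."""
    return (imm8 & 0xFF) << 8


def load_low(imm8: int) -> int:
    """Place the 8-bit immediate in the lower byte of a 16-bit value."""
    return imm8 & 0xFF


print(decode(0x1A30))
# Three steps to build a 16-bit constant: load high, load low, then OR.
print(hex(load_high(0x12) | load_low(0x34)))  # 0x1234
```

Because every field is cut from the same bit positions regardless of the opcode, the slicing itself needs no opcode-dependent selection; unused fields are simply ignored by the instruction that does not need them.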
I'm gonna do a part two on this, and I'm building a C assembler to assemble assembly into my custom instruction set. Yes.

Okay, moving on to the ALU, which is sort of like the brain of the entire processor. It takes in the opcode, the two outputs from the register file, which are data A and data B, the immediate value, the enable bit, and the clock. The outputs will be the data output, after it does the arithmetic or the logic, and then the branch, which is actually used for the jump instruction. Actually, the branch is only a one-bit value; it's not 15 to zero, it's only one bit. When the branch goes high, it does a jump operation in the program counter, which we'll talk about later.

Okay, now we have the basic parts down. We have the ALU, the register file, and the instruction set decoder. We can put them together in this way. So the instruction set decoder connects to the register file, which connects to the ALU, and there are a few loops there, but I didn't draw them out. And this is what the test bench code looks like. I just put three instructions in: load register R2 with 0xFF, load register R1 with 0x01, and then a subtract of R2 and R1, putting the result into R3. And the output there, if you can look all the way here, says 00FE, which means it works.

So there's a problem there. If you notice in the test bench, we have a 30-nanosecond delay between setting adjacent instructions, and that has to be hard-coded. If you set it to one clock cycle, which in this case is 10, you realize that things don't happen in order, because for the ALU to do its job, it needs to get the output from the register file. And for the register file to do its job, it needs to get the output from the instruction set decoder. But each component here takes one clock cycle to set its output values correctly.
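The timing issue just described can be illustrated with a toy model of three clocked stages in a chain. This is a simplified Python sketch under the assumption that each stage simply captures, on every clock edge, whatever the previous stage was showing before that edge. The function name `run` and the instruction labels are made up for the example.

```python
def run(inputs_per_cycle):
    """Clock a decoder -> register file -> ALU chain once per input.

    All three stages update at the same edge from the values that were
    present before the edge. Returns the ALU-stage contents after each clock.
    """
    decoder = regfile = alu = None
    trace = []
    for instr in inputs_per_cycle:
        # Simultaneous update: the right-hand side uses pre-edge values.
        decoder, regfile, alu = instr, decoder, regfile
        trace.append(alu)
    return trace


# Hold each instruction on the bus for three clocks (30 ns at a 10 ns clock).
print(run(["I1"] * 3 + ["I2"] * 3))
# [None, None, 'I1', 'I1', 'I1', 'I2']

# Change the instruction every clock instead.
print(run(["I1", "I2", "I3"]))
# [None, None, 'I1']
```

With a three-clock hold, the ALU stage has caught up with the current instruction by the end of each window. With a new instruction every clock, the ALU stage is still two instructions behind the bus, which is the mismatch that the enable-based control unit is meant to sort out.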
So what you need to do is wait for each of them to give the correct output before you enable the next one to use the output from the previous one. And that's where we come to pipelining. And so this is basically what had happened: you get completely wrong values. Up there it says FF01, which is definitely not the correct output value. And that's because everything is haywire there.

Okay, so pipelining comes from the control unit. Basically it just uses enable assertion on all the enable pins I described in the previous modules. And it's essentially a state machine. It uses a six-bit internal register and sets the enable for the fetch cycle, the decode cycle, the register read, the ALU, the register write, and the memory. So that's basically the order in which the pipeline proceeds. It fetches, it decodes, the register file reads and sets its output values. The ALU gets those output values and sets the output for the register file to write. And then, last, it fetches from or writes to memory. Yep. So this is just a state machine.

The second integrated test uses this pipeline, and it's a bit more complicated now. It loads high 0xED into R1. Also, at the top, it loads 0xFE into R0. It does a logical OR of R2 and R1 and puts the result into R0, and then it loads 0x01 into R3. It loads 0x02 into R4, it adds them together, and it does an OR overall. And at the end, you get 0x02FF, and it's the correct output.

So now we have most parts of the thing working. It's pipelined, by the way. You can see the pipeline here: decode, register read, ALU, register write. Register write has two high bits because it has to enable both the register enable line and the register write enable line. That's why there are two output bits there.

So for the program counter, its job is to fetch the correct instructions from memory.
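The control unit described above, a six-bit register that enables one stage at a time, can be sketched as a one-hot ring counter. The one-hot encoding is an assumption (the talk only says the register is six bits wide), and this is a Python model rather than the speaker's Verilog. The names `STAGES`, `ControlUnit`, `reg_en` and `reg_we` are hypothetical.

```python
STAGES = ["fetch", "decode", "regread", "alu", "regwrite", "memory"]


class ControlUnit:
    """One-hot state register: exactly one of six enable bits is high."""

    def __init__(self):
        self.state = 0b000001  # start in the fetch stage

    def clock(self, reset=False):
        """Advance to the next stage on a clock edge (wraps after memory)."""
        if reset:
            self.state = 0b000001
        else:
            # Rotate left by one within six bits.
            self.state = ((self.state << 1) | (self.state >> 5)) & 0b111111

    def active_stage(self):
        return STAGES[self.state.bit_length() - 1]

    def reg_en(self):
        """Register file enable: high for both register read and register write."""
        return bool(self.state & 0b010100)

    def reg_we(self):
        """Register file write enable: high only in the register write stage."""
        return bool(self.state & 0b010000)


cu = ControlUnit()
for _ in range(7):
    print(cu.active_stage(), cu.reg_en(), cu.reg_we())
    cu.clock()
```

In the register write stage both `reg_en` and `reg_we` are high, which corresponds to the two high bits mentioned for register write in the waveform.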
So it takes in a clock, and it takes in any program counter value you want to set it to, in case you want to jump to a specific program count. It takes in an opcode, which determines whether it stays at the current program count value, increments by one, resets and goes back to zero, or sets the output program count to the program count that you input. And it connects to the fake RAM that I implemented. The fake RAM is 16-bit addressable, which is why the program counter bus right now is 16 bits wide. Now, this is basically the state machine. Well, not really a state machine; it's basically the decoder that maps the opcodes to the outputs.

The third integrated test puts everything together. It puts the program counter, the pipelining control unit, the ALU, the register read, the register write, and the instruction decoder all in the same test bench. And then this is what's written to the memory of the fake RAM. The first is a load register to the low position with 0xFE. So this is essentially the same as the last one, except for the last part: what we do is add R3 and R4 and put that into R3, back into where it took R3 from, and then we jump back to that instruction. So that basically does an increment cycle there. It should increment: one, two, three, four, five, six. Yeah, it starts from zero and goes all the way up to whatever. And so if you observe the cyan-colored thing at the top, you'll see that it increments. I think it's a very low frame rate, but it goes from zero, one, two, three, four, five, six, seven, yep, and on and on.

And you notice the pipeline got more complicated as well, because we added in the fetch and the memory stuff in this stage of the pipeline. And the fetch is basically doing the enable for the program counter, yeah.

Okay, so wrapping up with the summary. I'm actually gonna have a part two on this. So this is all purely in simulation. We haven't synthesized actual logic to go on the FPGA yet.
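The four program counter behaviours mentioned earlier in this section (hold, increment, reset, set) can be modelled as below. This is a Python sketch with made-up opcode encodings and a made-up loop address; the talk does not give either, and the real design is in Verilog.

```python
# Hypothetical encodings for the program counter's control opcode.
PC_HOLD, PC_INC, PC_SET, PC_RESET = range(4)


class ProgramCounter:
    """16-bit program counter driven by a small control opcode."""

    def __init__(self):
        self.pc = 0

    def clock(self, op, pc_in=0, en=True):
        """Model one clock edge; returns the program count after the edge."""
        if not en:
            return self.pc                   # fetch stage not enabled: hold
        if op == PC_INC:
            self.pc = (self.pc + 1) & 0xFFFF
        elif op == PC_SET:
            self.pc = pc_in & 0xFFFF         # jump target supplied by the ALU
        elif op == PC_RESET:
            self.pc = 0
        # PC_HOLD falls through and leaves the count unchanged.
        return self.pc


# Toy version of the increment loop from the third test:
# an add of R3 and R4 into R3, followed by a jump back to the add.
pc = ProgramCounter()
r3, r4 = 0, 1
LOOP = 4                                     # hypothetical address of the add
pc.clock(PC_SET, LOOP)
seen = []
for _ in range(3):
    r3 = (r3 + r4) & 0xFFFF                  # the add at address LOOP
    pc.clock(PC_INC)                         # move on to the jump at LOOP + 1
    pc.clock(PC_SET, LOOP)                   # the jump: load the target address
    seen.append(r3)
print(seen)  # [1, 2, 3]
```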
So what I'll be doing is making sure that it synthesizes properly and flashing it onto the FPGA, and there are a few improvements that you can do. First, you want to clean up the instruction set architecture, fix the quirks that are there, and finish up the C assembler. And also, you can actually add memory-mapped I/O to this. So if you map certain parts of RAM to an LED or something on the FPGA, you can get it to toggle on and off by setting the value at that particular memory address in RAM.

And if you want to look at the GitHub page, it's there. I'll be uploading the slides later on so you can take a look. The FPGA assembler one is the one that has the full detailed instruction set architecture. And I think that's it, yes. Thank you.

I actually have five minutes more. Anyone have questions or suggestions? Yes.

Can you explain how you came up with the ISA? What minimum did you decide to include, and why did you include each instruction?

Okay, so the ISA. Deciding on the minimal set of instructions that you need, it's basically gonna be the really bare minimum. I could actually pull up the instruction set, but I don't have internet. The basic instruction set is add and subtract; those are the basic arithmetic operations that you need. There's no multiply and no divide because those require more complicated stuff. So it's just add and subtract. And both add and subtract need signed and unsigned variants, so that's four right there. And of course, oh, I forgot to mention that when I did the add and subtract for signed and unsigned, instead of putting them in as separate opcodes, I used the flag bit. So, for example, add signed and add unsigned are the exact same opcode, but the flag bit will be zero or one. Okay, coming back. So that's the basics. And then you need all the logical operators. So that's OR and XOR.
And then on top of that, you need load register, which loads any immediate value into a register, because you do that pretty frequently also. On top of load register, you need memory instructions. So you need write to memory and read from memory. And then after that, you'll need your jump instructions, your jump on conditional. So if you want to do a conditional jump, you need a conditional jump instruction there as well. And also a normal jump to an immediate value and a normal jump to a register because, as I said, I only have an eight-bit immediate value. So I can't jump to a memory address that is more than eight bits wide. The way I solve that is by writing to a register first, forming my 16-bit value, and then jumping to it. Yes. Yeah, that's the basics that you need.

All right. Thank you. So you have no support for a stack?

No, not yet. I'm planning to get there eventually.

Is this a von Neumann or a Harvard machine?

Hm?

Is this a von Neumann or a Harvard machine?

Oh, oh. I think it will be more of a Harvard architecture. Yeah, at the end of the day.

What's the total line-of-code count?

I have no idea. I did not count the code. But if you go to the GitHub, you can probably see it. I'm pretty sure, yeah. But it's not very long, I think. I think each module is within a hundred lines, a hundred and something lines. And, I mean, Verilog is extremely compact compared to VHDL. VHDL is based on Ada, and it's so not C, basically. I prefer Verilog simply because what takes you a hundred lines to write in VHDL takes you like 20 lines in Verilog, yeah.

So how complex would the addition part be? Is that all handled by Verilog itself?

Hm? For?

The addition.

Oh, the addition, yeah. It's handled by Verilog. So for both signed and unsigned addition; signed just requires you to add a $signed and put your variable inside, and it handles the addition stuff for you.
It's just a plus operator, when you model it behaviorally, yeah.

The ALU, is it a 16-bit ALU?

It's a 16-bit ALU, yeah, if you look at it. It takes in 16-bit values and outputs a 16-bit value. And actually, what it outputs is only a 16-bit value, but internally it's a 17-bit register so that you can account for overflow and stuff like that. And so eventually, I'll be able to add an overflow bit to this so that you can see if it overflows.

Very cool. Any other questions?

Yeah, will we have an Arduino version of it?

Yeah. I'm sorry. Slowly, maybe, but it takes time to build up everything to get to that point, yeah. It becomes more and more complex as the instruction set grows, and a 16-bit instruction set is probably very, very limited. Like you already see, the ISA is packed to the maximum number of opcodes already, yeah.

Okay. All right, I guess if you've seen him around, you already know.
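As a footnote to the answer about the ALU's 17-bit internal register, the sketch below shows how the extra bit exposes a carry out of a 16-bit addition, and one common way to detect signed overflow. It is a Python illustration under the assumption of two's-complement arithmetic, not the speaker's Verilog, and the function names are made up.

```python
def alu_add(a: int, b: int):
    """Add two 16-bit values using a 17-bit intermediate.

    Returns (16-bit result, carry bit taken from bit 16 of the wide sum).
    """
    wide = (a & 0xFFFF) + (b & 0xFFFF)  # always fits in 17 bits
    return wide & 0xFFFF, (wide >> 16) & 1


def signed_overflow(a: int, b: int, result: int) -> bool:
    """True when both operands share a sign bit and the result's sign differs."""
    return bool(~(a ^ b) & (a ^ result) & 0x8000)


res, carry = alu_add(0xFFFF, 0x0001)
print(hex(res), carry, signed_overflow(0xFFFF, 0x0001, res))  # 0x0 1 False

res, carry = alu_add(0x7FFF, 0x0001)
print(hex(res), carry, signed_overflow(0x7FFF, 0x0001, res))  # 0x8000 0 True
```

In two's complement, the 16 result bits of an addition are the same whether the operands are read as signed or unsigned; what differs is how overflow is detected, which is one reason a single flag bit can distinguish the two variants.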