Hello, happy new year, everyone. This is video number five of the series on creating a 6800 CPU on an FPGA using nMigen. I'm going to try to keep these videos down to about half an hour each; I think going way over an hour is probably just too much. That necessarily means a lot of the coding will be skipped, along with a lot of the intermediate debugging steps, and I'll just show the most interesting parts. So that's what this video is about: I add a few instructions, namely STA, store accumulator, and I also upgrade all of the existing instructions to handle not only accumulator A but also accumulator B. This is exciting. So let's go straight to the video. Again, happy new year 2020, and let's hope the year brings perfect vision. Before we get started, I wanted to show you this one file I wrote, which basically implements the CPU we've been writing on the Lattice iCE40 HX8K evaluation board. The reason I did this is not to see whether it works, but to see how big the result currently is in terms of LUTs and cells, so that as we go along we can compare: by adding instructions, how many extra LUTs did we use, how many extra flip-flops, and so on. So basically, just as we saw in the second video, or maybe the first, I create my platform, and these are the resources over here that I'm defining. Now, I've defined this utility function called bus, and what it does is make numbered resources. For example, this is the name of the resource, ADDR, and the number of pins is the number of those resources. So this first pin is address zero, the next one is address one, and so on up to address 15. All this function really does is create those resources one by one.
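The idea behind that helper can be sketched in a few lines of plain Python. This is only a conceptual model; the real helper builds nMigen Resource objects, and the name, signature, and return shape here are assumptions for illustration:

```python
def bus(name, pins, count):
    """Hypothetical sketch of the `bus` helper's idea: expand one
    logical bus into numbered single-bit resources, e.g.
    addr_0 .. addr_15, one per physical pin."""
    pin_list = pins.split()
    assert len(pin_list) == count, "expect one physical pin per resource"
    # Each entry pairs a numbered resource name with its pin location.
    return [(f"{name}_{i}", pin) for i, pin in enumerate(pin_list)]
```

Called as `bus("addr", "A1 B2 C3", 3)`, it yields `addr_0` through `addr_2`, which mirrors how the address resources are requested one by one later on.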
So we have a resource for clock one and clock two, which are going to be our clock phases, phase one and phase two, and then there's this global reset. Now, I've chosen to place clock one, clock two, and reset on these globally buffered pins because of course they're going to be used by many, many flip-flops. They're going to have a huge fan-out, which means using a buffered pin. Here's the bus for address, here's the bus for data (notice the data is I/O), and here's an extra bunch of debug pins that I put out. This just reproduces the reset state, so that if I actually put it onto the board and look at the reset state pins, I can make sure it goes from zero to one to two to three and then stops there. And then there's the read/write pin, which, since we're outputting it anyway, I may as well assign to a pin as well. The default clock is clock one, and the default reset is just reset. There are no connectors, and everything else is basically the same. In terms of the actual code, this is the module that envelops the core; there's the core right there. You can see that I'm creating two clock domains instead of one. The first one is phase one, which is on the positive edge, and the second one is phase two, which clocks over on the negative edge. So I get the signals for those, I get the pins for clock one, clock two, and reset, and then I assign the pins to the signals. For example, clock1.i (the .i indicates that this is an input pin) gets set to phase one's clock, and clock2.i gets set to phase two's clock. The resets for the clock domains all get set to the global reset. Now, to hook up the address lines, I just iterate through the address resources. Here's how you get address pins zero through 15, and for each one of those I assign the CPU's address line to that pin's output. The data is just a little bit different.
For the data, we request data pins zero through seven, and the CPU's data-out line is assigned to the pin's output, while the pin's input is assigned to the CPU's data-in line. You choose between the two by setting the pin's OE, which is simply set to the inverse of CPU.RW: if CPU.RW is one, that's a read, and if CPU.RW is zero, that's a write. So if it's zero, we want the output to be enabled, which means I have to invert CPU read/write. In case I decide to actually put it on the board, I also have a fake memory, so I don't have to bother programming an external ROM; it's right there in the FPGA. This is hooking up the reset state, this is hooking up the read/write line, and that's really all there is to it. So let's go ahead and compile this and see what we get. All we have to do is run the Python file, and the output appears in the build directory. As before, there are plenty of files here. The bitstream is top.bin, but really I want to look at top.rpt. If we go all the way down to the bottom, we can see that our current CPU, as it stands, uses 167 LUTs and 282 total cells, 14 of which are carry units. That's interesting, because for all my bellyaching about having PC = PC + 1 all over the place, it doesn't look like it actually used very many addition units. I don't really know what's going on there, but okay. The other interesting thing is top.tim, which is the timing report. Near the bottom are the critical path reports: the tool basically goes from an input to an output and estimates the time it takes for the signal to get from one end to the other. And this is the interesting bit right here: it says you can run the clock at 126.68 MHz at top speed. That's a lot faster than the original 6800.
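That output-enable trick can be captured in a tiny behavioural model. This is plain Python standing in for the pin logic, not the actual nMigen tristate API; the function name and arguments are made up for illustration:

```python
def data_bus(rw, cpu_dout, ext_din):
    """Behavioural model of one bidirectional data pin group.

    rw: the CPU read/write line (1 = read, 0 = write).
    The pin's output enable is the inverse of rw, so the CPU
    only drives the bus during a write cycle; otherwise the
    external device's value is what the CPU sees.
    """
    oe = 0 if rw else 1          # oe = ~rw
    if oe:
        return cpu_dout          # write: CPU drives the bus
    return ext_din               # read: external memory drives it
```

So with rw low the accumulator's value wins, and with rw high the fake ROM's value wins, which is exactly the inversion described above.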
So let's definitely load the CPU onto the FPGA board and see what happens. Now, the fake ROM is just a reset vector that jumps to $1234 when the program starts, and at $1234 we jump back to $1234. This is a three-byte instruction, so we'd expect the address lines to go from FFFE to FFFF in order to load the start vector, then to 1234, 1235, 1236, and then back to 1234. So let's hook up the board. What I've done is hook the LEDs up to the low byte of the address lines, so we should be able to see exactly what happens: all the LEDs go on, then one of the LEDs goes off, that's FFFE, and then we should see 34, 35, and 36. And I got that backwards: it was FFFE to FFFF and then 34, 35, 36. I've also implemented this with a 1 Hz clock, so we can see that the CPU is actually executing properly. Great. So let's take a look at the instructions we've implemented out of the extended group. We've already done jump, and we've done five out of the ten instructions in the next block, so why not complete the set? We'll implement AND, BIT, CMP (compare), EOR (exclusive or), and ORA. Here's the ALU, and I've made some modifications: I've added the functions we want to implement. The first thing you'll notice is that BIT is actually the same as AND, except you don't store the result. BIT is a bit-test instruction, so it just sets flags, but otherwise does exactly what AND does. Same thing with CMP: compare does exactly the same thing as subtract, except the output isn't stored; you only care about the flags. BIT and CMP are typically followed by a branch instruction, which looks at those flags. So I've added AND, EOR, and ORA. And one change I made is to load: instead of taking its input from input one, I've changed it to input two.
I just wanted to make it a little more regular, because the operand for the ALU always seems to end up on input two. For example, for add, the accumulator is on input one and the operand is on input two. Same thing for subtract: the accumulator is on input one, you compute input one minus input two, and the operand ends up on input two. So I wanted to do the same thing for load: when you load the accumulator, input one doesn't matter, and input two is the thing you want to load into the accumulator. So here are AND, EOR, and ORA. They're pretty straightforward. The flags are set according to the table, which means V is always reset, the negative flag is always set to the high bit of the output, and the zero flag is set if the output is zero. Now let's take a look at the core. I've made another bunch of modifications here. First of all, I've added all of the instructions that we want, but second, you'll see that the calls to implement those instructions are a little different, and definitely more regular. Instead of having one function per instruction, I realized that all of these instructions basically do the same thing: they put the accumulator on the ALU's input one, the operand on the ALU's input two, and then possibly store the output of the ALU back into the accumulator. That store doesn't happen in two cases, namely compare and BIT, so I have an extra Boolean flag called store, which is true by default. The only other difference is the function the ALU is supposed to execute. So why not coalesce all of these into a single function? Again, we're not optimizing or refactoring at the hardware level; we're doing this at the Python level. There may still be hardware optimizations to do, but we're not going to do that now, because it would get way too confusing.
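The flag behaviour for the logical group can be written out in a few lines of plain Python. This is a sketch of the semantics from the flag table, not the nMigen ALU code itself; the function names are invented here:

```python
def logic_flags(out):
    """N/Z/V flags for AND, BIT, EOR and ORA on an 8-bit result."""
    return {
        "V": 0,                      # overflow is always reset
        "N": (out >> 7) & 1,         # negative = high bit of the output
        "Z": 1 if out == 0 else 0,   # zero = output is all zeros
    }

def op_and(a, operand):
    result = a & operand
    return result, logic_flags(result)

def op_bit(a, operand):
    # BIT is AND without storing the result: the accumulator is
    # returned unchanged and only the flags reflect the AND.
    _, flags = op_and(a, operand)
    return a, flags
```

Running `op_bit(0xF0, 0x0F)` leaves the accumulator at 0xF0 but sets Z, which is exactly the "same as AND, just don't store" behaviour described above.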
So here's the new super-instruction for the ALU operations. It just takes the function and whether you want to store the result or not. We always set mode extended, we always read the operand of the instruction onto source bus two, we always put the accumulator onto source bus one, then execute whatever function we said we would, and then possibly store the result back into the accumulator. For load, note that we're still putting the accumulator on source input one, and this is why I changed the ALU to load from source input two, which is the operand: for load we just ignore input one, which is fine. Again, I just wanted to make it more regular. The only other thing I did was alphabetize these functions so that they're a little easier to find if you're scrolling back and forth. Obviously, if you have an IDE where you can just go to the definition, it's right there. Now, in terms of formal verification, you can see on this side of the IDE that I've added a whole bunch of formal verification files for AND, BIT, CMP, EOR, and ORA. Let's take a look at AND. It's fairly straightforward, right? All of these registers do not change. Because we're doing mode extended, this hasn't changed from any of the previous files. There's input one, which is A prior to the instruction; there's input two, which is the operand; and the output is whatever the accumulator is afterwards. I set the Z flag if the output is zero, I set the N flag if the high bit of the output is set, and I reset the V flag. Then I assert that the output, that is, the accumulator after the instruction, is equal to input one logically ANDed with input two, and then I assert all the flags. So this is fairly straightforward. Let's look at a slightly more complicated one, which is compare.
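The coalescing idea, one parametrised routine in place of one function per instruction, looks roughly like this in plain Python. The names are hypothetical and the real routine emits nMigen statements rather than computing values directly:

```python
# One lambda per ALU function; results wrap to 8 bits.
ALU_FUNCS = {
    "ADD": lambda x, y: (x + y) & 0xFF,
    "SUB": lambda x, y: (x - y) & 0xFF,   # CMP = SUB with store=False
    "AND": lambda x, y: x & y,            # BIT = AND with store=False
    "EOR": lambda x, y: x ^ y,
    "ORA": lambda x, y: x | y,
    "LD":  lambda x, y: y,                # load ignores input 1
}

def alu_extended(func, acc, operand, store=True):
    """One routine for the whole extended-mode ALU group: the
    accumulator goes on input 1, the operand on input 2, and the
    result is written back only when store is True."""
    result = ALU_FUNCS[func](acc, operand)
    return result if store else acc
```

CMP and BIT simply reuse SUB and AND with `store=False`, so only the flags would change; everything else about the cycle sequence stays identical.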
For compare, basically the same thing happens, except of course A doesn't change, since we're just doing a compare. Input one is again the accumulator and input two is the operand. Now, I've defined two other signals called signed input one and signed input two. They're basically straight-up copies of input one and input two, except that they're signed signals. This is the indication to nMigen that whenever operations are carried out on them, they should be carried out signed instead of unsigned, and that really matters for comparisons. For example, the Z flag, the zero flag, is based on whether input one minus input two is zero; in other words, input one has to equal input two for the zero flag to be high. That's pretty straightforward. The negative flag is not negative in the sense of less than zero. If you take input one minus input two and then test less than zero, that would be an unsigned comparison, and two unsigned signals will never be less than zero no matter what you do to them. So instead, what I did is basically the same thing as the formal verification for subtraction: it's just input one minus input two, and then you take the high bit. Okay, so that's N. And there's a little explanation here about why you can't just take the signed inputs and compare them to see if they're less than zero: it just doesn't work, because N comes straight out of the unsigned subtraction. Okay, now if you look at this chart of branch instructions and the tests they do on the flags, you can see that branch if greater than or equal (and remember this is a signed comparison) is equivalent to XORing the negative flag with the overflow flag and checking that it's zero; in other words, checking whether N is equal to V.
So remember that if input one is greater than or equal to input two as a signed comparison, then N has to equal V. I write this down over here: greater-than-or-equal is true if and only if N equals V. Again, this is a signed comparison. So I set up, and this isn't even a signal, it's just an expression, a check that signed input one is greater than or equal to signed input two. Then, if greater-than-or-equal is true, V equals N; otherwise it's the opposite of N, and that's the V flag. Notice that I didn't do it this way in the ALU. Again, this is the philosophy of writing it once and then writing it a different way, using different calculations, in order to formally verify that you've done the right thing, or at least that you're consistent. And this is the carry flag: the carry flag is true if input one is less than input two, as an unsigned comparison. And that's all the formal verification you have to do: check that these four flags are correct, and in fact they are when you run it. So I'm just going to run formal verification on compare. Okay, the cover statement worked, and bounded model checking worked as well. Now again, this doesn't prove beyond a shadow of a doubt that compare actually works. It just means that the assertions I've written pass formal verification. But because I've implemented compare effectively in two different ways, it means my thoughts about compare are consistent, and a lot of what I've written in the code comes straight out of the documentation for the 6800 processor. So I can be pretty sure I've implemented subtract and compare properly. I won't bother going through all the other formal verifications, because they work just fine. This is part of a reference card, or poster, that I've been working on for the 6800, and it shows the grid of the opcodes.
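The flag formulation used in the verification side can be written out concretely in plain Python. This is a sketch of the check, with an invented helper for reinterpreting an 8-bit value as signed:

```python
def signed8(x):
    """Reinterpret an unsigned 8-bit value as signed (-128..127)."""
    return x - 0x100 if x & 0x80 else x

def cmp_flags(in1, in2):
    """Flags after CMP in1, in2 (8-bit), per the verification view:
    Z and C from direct comparisons, N from the raw high bit of the
    difference, and V derived from 'GE holds iff N == V'."""
    diff = (in1 - in2) & 0xFF
    n = (diff >> 7) & 1                     # high bit of the difference
    ge = signed8(in1) >= signed8(in2)       # the signed comparison
    return {
        "Z": 1 if in1 == in2 else 0,
        "N": n,
        "V": n if ge else n ^ 1,            # if GE then V == N, else V != N
        "C": 1 if in1 < in2 else 0,         # unsigned borrow
    }
```

For example, comparing 0x80 (that is, -128) against 0x01 gives a difference of 0x7F, so N is zero, but the signed comparison fails, forcing V to one: exactly the signed-overflow case that makes deriving V this way a useful cross-check against the ALU's own calculation.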
So for example, $01 is NOP. What were the other ones we did? We did jump extended, that's $7E. So you can see here all of the instructions from $00 to $FF. The ones in black are the ones that are not documented. One of the more famous ones is HCF, or halt and catch fire; that's $9D or $DD. What HCF does is make the address lines increment every cycle, and it just keeps doing that until you reset the processor. It's said that this was a little test mode they added to the 6800. NBA I found in some article or other about the 6800: it ANDs B with A, and I think it stores the result back into A. And BRN is an instruction that I verified on the transistor-level 6800 simulation that visual6502.org has. If you look at the branch instructions, you'll see that they come in pairs, where one is the negation of the other, and since BRA is branch always, it would make sense for BRN to be branch never. And that's what it is: the branch-never instruction. This set over here I determined is actually subtract with the carry forced set: it's as if you were doing a subtract-with-carry instruction, except the carry is always set to one before the subtraction. But in any case, we can now look at the instructions we have implemented, which I've highlighted in green. Okay, so there's NOP, there's jump, and there are these instructions. We can look at this table and decide what we want to do next. I kind of want to do store A next, because there's a hole in our implementation, so why not? Okay, so here is store accumulator. Extended mode, $B7, is what we're going to implement. It just takes the accumulator and stores it into the memory location given by the operand, which is of course the opposite of what LDA does.
If we look at the flags, it's exactly the same as LDA: the V flag is always reset, and the N and Z flags are set according to the accumulator. So let's look at how many cycles the thing is supposed to take. Here is store A for extended mode, and we can see it's supposed to take five cycles. It looks like the read/write line is set to zero, which means we're doing a write. Or wait, is this the read/write line? Let's see... yep, that's the read/write line. So the read/write line here is set to zero, so we're doing a write. And this line is the valid memory address (VMA). So there's actually one cycle where we load the destination address onto the address lines but aren't yet ready to do the write, and it's on the next cycle that the data is actually put onto the data lines and the write occurs. So let's see if we can code that up. The first thing I've done is copy ALU extended and substitute read-byte with exactly what it does, which is down here, because of course we're not going to read a byte, we're going to write one. During cycle two we want to get ready to set the address lines to the operand, so that on cycle three... and remember that in the documentation cycles are one-based, which is kind of unfortunate, while here cycles are zero-based. So our cycle three is actually documentation cycle four, which had valid memory address at zero, and nothing actually happens during it. However, we do set things up for the write, and of course we also need to output what we want to write on the data-out lines, and what we want to write is just self.a. So we don't need that.
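The cycle-by-cycle bus activity being described can be sketched as a little table in plain Python. This is a behavioural model of the data-sheet timing as read here, not the core's code; cycle numbering is zero-based, one less than the data sheet's:

```python
def sta_extended_cycles(pc, a, target):
    """Bus activity for STA A, extended mode ($B7), as a list of
    (address, vma, rw, data) tuples; rw=1 is a read, rw=0 a write."""
    return [
        (pc,     1, 1, None),  # cycle 0: fetch the opcode
        (pc + 1, 1, 1, None),  # cycle 1: fetch address high byte
        (pc + 2, 1, 1, None),  # cycle 2: fetch address low byte
        (target, 0, 0, None),  # cycle 3: address out, VMA low, no write yet
        (target, 1, 0, a),     # cycle 4: accumulator driven onto data bus
    ]
```

Five entries, with the destination address appearing one cycle before the data does, matches the five-cycle timing diagram discussed above.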
Now, this is set up during cycle three, which means that on cycle four, the last cycle of the instruction, the write actually takes place. Here, for verification, we want to say that we have written something: we're going to write to whatever the address lines are, and I guess we need to set up the address lines as well. So we're going to write this data to that address. Okay, this is complaining that self.dout is not callable; that's because I forgot the .eq. Okay, so that's that. And then we have another cycle at the end. Wait, what I selected wasn't cycle equals four. So this is the last cycle. Well, I guess that's it: we just end the instruction. What else would need to be done? I think that's probably about it. Yeah, that's pretty much it. Now, at this point I ran formal verification on store A and it worked just fine, so I was pretty happy. Except when I started to implement the B version of the instruction, that is, store B into a location, things began to fail, and it went rapidly downhill from there. So I did a lot of digging, and many rounds of formal verification later, I finally got it to work. There was just one mistake that eluded me until the very end of the whole several hours of formal verification. Okay, do you see the error? It's right here: I should actually be putting the load operand on source bus two, that's input two. So let's run formal verification once more. Cover passed, and... great. Okay, so we finally have a working store A instruction. Now, I do want to point out that when I first implemented the store A instruction, bounded model checking worked just fine, and the reason it worked is that I had not set up formal verification properly. So it fooled me into thinking that formal verification passed. And this can happen with any sort of testing, even unit testing.
You may write your unit tests wrong, and then they will pass. So too with verification: if you write it wrong, verification may very well pass when it shouldn't. Hopefully, as you implement more instructions and add more to the verification, you eventually reach a point where verification fails, and then you realize you've done the verification wrong, go back, re-verify everything you've done, and find more errors. So again, writing formal verification tests is not a magic wand that suddenly makes everything work. You still have to be careful. What I want to say is that the output of formal verification is better than unit testing in that it tests a lot more things; in fact, it tests pretty much all of your inputs. If you write the tests wrong, your output is going to be wrong, but if you're sure the tests are right, then formal verification has covered all of the inputs, which unit testing would not be able to do. And now this means we can fill in store A extended with green. We've now filled in almost this entire row; there's still JSR, LDS, and STS to do. But I think what I want to do now is grab a really quick bonus and implement the B row, because the only difference is accumulator B, so it should be fairly easy. Looking at the way the instructions are encoded, we can see that this entire row here uses the A register while this entire row here uses the B register, so the only difference is one bit. We should just be able to change our instructions to check that one bit and use B instead of A. So, looking at the code: for example, for LDA A, LDA B would have this bit position here be a one. So in order to handle both, we're just going to put a don't-care in that position.
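The one-bit difference is easy to check in plain Python. The opcodes below come from the 6800 opcode map; the helper name is invented here:

```python
def uses_accum_b(opcode):
    """Bit 6 of the opcode selects the accumulator: 0 -> A, 1 -> B."""
    return (opcode >> 6) & 1

# A-register / B-register opcode pairs differ only in bit 6 ($40):
LDAA_EXT, LDAB_EXT = 0xB6, 0xF6   # LDA A / LDA B, extended mode
STAA_EXT, STAB_EXT = 0xB7, 0xF7   # STA A / STA B, extended mode
```

Since each pair differs by exactly 0x40, matching with a don't-care in bit 6 decodes both variants at once, and the core just multiplexes between self.a and self.b on that bit.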
And now of course we have to update the comments and the name of the function, and then modify the function itself. We have this expression here that gets bit six of the instruction, which is going to be one if we want register B and zero if we want register A. Here's where we're using register A right now, so I can just replace that with a multiplexer: if B is true, then self.b, otherwise self.a. Okay, that's the easy part. The slightly harder part is how to write to A. One of the problems with using a multiplexer is that you can't use it on the left-hand side of an eq, which is kind of unfortunate, but oh well, we'll just replace it with an If. And that should be all there is to it for the ALU instructions. Let's do the same thing for store A. Well, okay, all of the tests have now passed, so let's take a look. Now that we've added a whole bunch of instructions, let's go ahead and fill in the chart. Okay, so those are the instructions we've now implemented; we're getting there. Now that we've implemented all of those instructions, let's try to compile for the FPGA itself. What did I call it? cpu_lattice. And let's take a look at how many LUTs and cells we've used. Okay, 527: that's pretty much a doubling of where we were. That's fine, I'm okay with it. Again, we can certainly do some optimizations after we're done with all of the Python code. Let's also look at the timing report, which now says you can only run at 65 MHz. Oh well; again, maybe we can optimize that, so I'm totally fine with it. Now, the thing is that our clock frequency was set to 12 MHz, and I'm not really sure what the IceStorm tools actually do with that, but they may just use it as a constraint and say anything over 12 MHz is fine. So I kind of wonder what happens if I set the clock to 70 MHz. I'll just go ahead and quickly do that.
Okay, just for fun. Well, I guess that means it failed. Yep, the maximum clock frequency was 68 MHz, so it just didn't work. Okay, well, that's fine; again, maybe optimization will solve this. So that's where we are right now. How's that, Cat?