Hi, I'm Rob. I thought it might be fun to make a series of videos about making a CPU on an FPGA. I just decided to work on that project, and I thought it might be interesting to make some video blogs about how I'm doing with it. And specifically, I'm going to make a CPU that can run this: Zork 1. If you don't know what Zork is, go look it up on Wikipedia. It was the first commercially successful, or actually the very first commercial, text adventure game. What we now know today as interactive fiction. This version is the Apple II version. This box is actually from 1984, but the game itself first came out around 1980. The interesting thing about Zork 1 is that even though it says it's for the Apple II, which ran on a 6502 processor, the game itself did not run on the processor. It ran on a virtual machine that ran on the processor. And in this way, they were able to come out with the same game for, say, the TRS-80 Model I, or whatever other machine there was. The IBM PC, for example. All they had to do was change the virtual machine, but the game itself was completely unchanged. So the history starts in the early 70s with a guy named Will Crowther. He was an avid caver, and he would often go to Mammoth Cave in Kentucky. What he wanted to do was create a computer program that could sort of recreate the experience of going through the caves, but on the computer. And he called it Colossal Cave Adventure. He worked on it throughout the early 70s at BBN, which was the defense contractor that worked on the ARPANET. In fact, Will Crowther worked on some of the early routing protocols for the ARPANET, which, of course, is the predecessor of our beloved Internet. It ran on a PDP-10, which was a mainframe of that time, and it was written in Fortran. It was essentially complete in 1975, and at that point it made its way to Stanford University, where it caught the attention of Don Woods, who decided to dig into the program and add some more fantasy elements. 
And he just called it Adventure. He completed that in 1977, and then released the game and its source code on, I would say, the web, but actually on the ARPANET. Which at that time, of course, connected mainly universities and defense contractors. So it went from Stanford to MIT, where at least three students, if not more, Dave Lebling, Marc Blank, and Bruce Daniels, got ahold of it. And they rapidly and vastly expanded it into a huge adventure game. They called it Zork; at the time, Zork was MIT slang for something unfinished, because it was never finished. And they worked on it for about two years, through to 1979. At that point, they had the great idea of starting a company, and they called it Infocom. So from 1979 through the end of 1980, they worked on Zork 1, which was essentially the first third of the original Zork. And of course, they had to write the software for the personal computers of the day, which barely had 64 kilobytes of RAM. Some even only had 48 kilobytes. The TRS-80 Model I, by the way, did not run on a 6502 processor; it used a Z80. The Apple II did. So they were thinking about how they could write the software so that it could run on the very limited personal computers of the day. And they came up with the idea of a virtual machine. It wasn't quite a new idea, but nevertheless, they were probably one of the first commercial implementers of a virtual machine. So the program that they wrote was called ZIP, which stood for Z Machine Interpretation Program. The Z Machine was the virtual machine that they wrote the game for, and it was an interpretation program because it took the actual game, which was written in Z Machine code, or Z code, and converted it into instructions that the actual computer could understand. So in that way, they could write the game itself once, and then any time they wanted to release the game for a new machine, all they would have to do is rewrite the ZIP, which simply interprets the instruction set of the Z code. 
So Zork 1 came out in December 1980, and it was a smashing success. From the beginnings of Infocom in 1980, starting with Zork 1, they released the rest of the adventure as Zork 2 and Zork 3, and that took until basically the third quarter of 1982. After that, they released more games, including the text adventure version of The Hitchhiker's Guide to the Galaxy, until 1986, when they were bought out by Activision, and that was pretty much the beginning of the end. At that point, Infocom became just a brand inside of Activision. They did release a couple of other games, until 1989, when Infocom as a brand was pretty much shut down. In 1997, there was one more release called Zork: The Undiscovered Underground, but it wasn't really Infocom, it was just trading on the name. As for the Z Machine itself, it went through several versions. It started with version 1, and by the end of Zork 3 it was at version 3. So all three of these games were essentially written for Z Machine version 3. As they wrote more games, they needed more capabilities, so they added more instructions, and they went to version 4, and then finally to version 5. Now, there is documentation on these various versions, which I've linked down below. Take a look at it, because we will be using that document extensively in order to implement the Z Machine on our FPGA. There is a rich community of writers of interactive fiction today, and they have brought the Z Machine up to version 8, or maybe even version 9, I'm not quite sure. There is also another virtual machine called Glulx, I think that's how you pronounce it, G-L-U-L-X. And that is a completely different machine from the Z Machine. It starts from a completely new base, so perhaps we will look into actually implementing that after we finish our versions of the Z Machine. Now, at its core, the Z Machine looks pretty much like any other CPU. It has a core processor, and it's got input and output, and it's got memory. 
In this case, the memory was limited to 64K, making this a 16-bit addressing machine, and the input and output are text only. Now, the interior of the processor is also 16 bits, in that its registers are all 16 bits wide. However, the memory itself is byte-oriented, so you can access any byte of memory, and if you wanted to, for example, load a register, you would have to load the first byte and then the second byte. So let's take a little look at the instruction set that we will need to implement. Let's talk about the opcodes. They are 8-bit, so we have basically 0 to 255. And what we're going to do is take a look at, first, this particular section of opcodes. These are called the zero-operand opcodes, because they take no operands. Which is kind of convenient, because it means that essentially the operation is described by a single byte. So, for example, no-op. That doesn't take any operands, and it doesn't do anything either. There is an instruction called return true, and basically all this does is return from the current subroutine with a true value. Again, it doesn't take any operands, so it's described by a single byte. And another zero-operand opcode is called newline. All newline does is output a newline onto the text output stream. Again, it takes no operands, so it's described by a single byte. Okay, so the next series of opcodes is called the one-operand opcodes. These are things like increment, or return a value. These take a single operand, so of course there is one byte for the opcode, and then a certain number of bytes for the operand. And the one-operand opcodes are divided into three sections. The first takes a large constant, that being a 16-bit value, or two bytes. The next takes a small constant, which is a one-byte constant. And then there is a var. The way that variables work in the Z machine is that variables are described by a single byte from 0 to 255. And they're divided like this. 
So we first have variable zero, and this is basically the stack pointer. It's always the stack pointer. So any time you want to, for example, increment variable number zero, what you're actually doing is simply incrementing the stack pointer. From 01 to 0f (in hexadecimal), these are reserved as local subroutine variables. So these are the locals, and there are 15 of them, from 1 to f. Every time a subroutine starts up, it can basically say how many locals it uses, and of course it can't have any more than 15 locals. Each local, because again the Z machine is a 16-bit machine, has 16 bits, or two bytes. So for example, if I were to increment variable 1, that means that I would be incrementing the 16-bit value that's in local variable 1. And the rest of the variables are all globals. So there are, let's see, 240 of them. And again, they are 16-bit values, but they are available to the entire program no matter where you are. They're not tied to any subroutine; they are global variables. So for example, if you were to increment variable number 10 (hex), that means that you go to the 16-bit value at global 10. Typically, whenever you start up a subroutine, you have to reserve an area for the locals. This is usually placed on the stack, and we'll talk about stacks and stack frames in a little while. And likewise with the globals: they are actually stored somewhere in memory. Okay, now let's talk about this first section of opcodes. These are the two-operand opcodes. They specifically take either two small constants, or a small constant and then a variable, or a variable and then a small constant, or two variables. So for example, if I were to add two numbers, that would be two operands. Say I wanted to add, for whatever reason, a small constant to local variable 1; I could do that with one of these opcodes. Now the question is, where do we store the result? 
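Before moving on, the variable numbering just described (zero is the stack pointer, 01 through 0f are locals, 10 through ff are globals) can be sketched as a small C++ helper. This is just an illustration; the function name, and the choice to number globals from zero, are mine, not from the Z machine document:

```cpp
#include <cstdint>
#include <string>

// Classify a Z-machine variable number, per the scheme described above:
// 0x00 is the stack pointer, 0x01-0x0f are locals 1-15, and 0x10-0xff
// are the 240 globals (numbered here from 0; the numbering base is my choice).
std::string classifyVariable(uint8_t var) {
    if (var == 0x00) return "stack";
    if (var <= 0x0f) return "local " + std::to_string(var);
    return "global " + std::to_string(var - 0x10);
}
```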
There is actually another operand, but it's not counted in the two operands, that tells you where to store the result. And it's always going to be a variable. So for example, I could add a small constant to, say, local number 1 and store it maybe in local number 2, or in global 10, or even possibly the stack pointer. So now let's talk about this other section. These are also two-operand. Well, they're called two-operand, but sometimes they actually have more than two operands. The encoding of this instruction takes a variable operand field, and variable in this case just means a variable number of operands. The number and type of the operands are encoded in this variable operand field according to this table. The field is one byte (or, for a couple of special opcodes, two bytes), and each operand type is encoded in two bits. So bits zero zero mean it's a large constant, zero one is a small constant, one zero is a variable, again the stack pointer, a local, or a global. And one one means that this is the end of the operands. There are no more operands to follow. So, for example, if we wanted to have, for whatever reason, a three-operand instruction, and we wanted it to have, say, a large constant, a small constant, and then another small constant, and then stop, that would be encoded as zero zero, zero one, zero one, and then we would stop with one one, and then this other byte wouldn't be present, because of course we don't need it. Now remember, of course, that we were dealing with machines that could only access 64K, and memory was very precious, and that's why you would actually leave off this extra byte instead of simply filling it with all ones: you could save another byte. Now where is this useful? Well, as you can see from the table over here, there is no way to specify a large constant. 
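The two-bit operand-type encoding just described can be decoded with a few lines of C++. This is a sketch (the enum and function names are mine): the byte is scanned from the most significant pair of bits down, and a 11 pair ends the operand list.

```cpp
#include <cstdint>
#include <vector>

// Operand types as described above: 00 large constant, 01 small constant,
// 10 variable, 11 means "no more operands".
enum OperandType { LargeConstant = 0, SmallConstant = 1, Variable = 2, Omitted = 3 };

// Decode one operand-type byte: four 2-bit fields, most significant first.
std::vector<OperandType> decodeOperandTypes(uint8_t typeByte) {
    std::vector<OperandType> types;
    for (int shift = 6; shift >= 0; shift -= 2) {
        OperandType t = static_cast<OperandType>((typeByte >> shift) & 0x3);
        if (t == Omitted) break;   // 11 ends the operand list
        types.push_back(t);
    }
    return types;
}
```

For the three-operand example above (large, small, small, stop), the byte would be 00 01 01 11, or 0x17, which decodes to exactly three operand types.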
If you were limited to this, you would actually have to put your large constant into a variable using, for example, one of the one-operand instructions with a store variable. And then, of course, you could use var with var, or small with var, or whatever. So this is a convenient way of being able to encode a large operand. What I'd like to do now is just go over one of the examples that's in the Z machine document that's linked down below. This is where you want to multiply a large constant by a variable and store the result in a variable. So the opcode for multiply, when you have a variable operand field, is d6, and after that follows the variable operand field. In this case, we're specifying a large constant, which is 0 0, followed by a variable, which is 1 0, followed by no other operands. So you fill the rest with one one bits. Then the first thing to follow is the large constant, which is, of course, two bytes, and that's 03 e8, which in decimal is 1,000. Then follows the variable. We encode that with 02, meaning local number 2. And then after that comes the store variable, since this is an operation that stores something. And in this case, it stores the result of the multiply in variable 00, which is the stack pointer. And that's basically how you would decode one of these more complicated instructions. Now, the very last section of opcodes that we're going to go over is called the variable-operand instructions. They also take a variable operand field, so they can take anywhere from 0 to 8 operands. This differs from the two-operand versions in that these can be instructions that take any number of operands, whereas those usually take two operands. There is an exception. For example, there is the jump-if-equal instruction, which actually compares its first operand with all the other operands. And if any of those operands is equal, then it jumps. So in this section, you have things like call. 
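Going back to that multiply example for a moment, the full byte sequence from the spec is d6 2f 03 e8 02 00, and picking it apart in C++ looks like this. The struct and function names are mine; the byte layout follows the walkthrough above:

```cpp
#include <cstdint>

struct MulInstruction {
    uint16_t largeConstant;  // first operand (16-bit)
    uint8_t  variable;       // second operand (a variable number)
    uint8_t  storeVariable;  // where the result goes
};

// Decode the worked example above: d6 is the multiply opcode in variable form,
// 2f is the operand-type byte 00 10 11 11 (large constant, variable, no more),
// 03 e8 is 1,000 stored big-endian, 02 is local 2, 00 is the stack pointer.
MulInstruction decodeMulExample(const uint8_t* b) {
    MulInstruction ins;
    // b[0] == 0xd6 is the opcode; b[1] == 0x2f gives the operand types.
    ins.largeConstant = static_cast<uint16_t>(b[2]) << 8 | b[3];  // Z-machine words are big-endian
    ins.variable      = b[4];
    ins.storeVariable = b[5];
    return ins;
}
```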
So when you call a subroutine, you want to pass it a certain number of operands. Rather than passing them on the stack, you can pass them inline as operands. So the first thing that we're going to do is discuss the general architecture of a Z machine that we can put on an FPGA. And we're going to talk about the hardware description language that we're going to use to program the FPGA with our Z machine implementation. Okay, so before I start on the architecture of the actual machine, I just want to make a few remarks about some lies that I may have told, or maybe some oversimplifications. The first lie that I told is that there was only 64K of memory. In fact, for up to version 3, you could address up to 128K of memory. In order to see how that works, we have to look at the memory map. So the memory map looks something like this. Memory is divided into three areas. The first is dynamic, the second is static, and the last is called high. And the reason that it's called high should be apparent after we talk about how to access memory above 64K. So the first section, dynamic memory: this is memory that can be read and written. The second and third parts are read-only memory. The static area basically contains data, tables, things like that. The high memory can also contain data, but mainly it contains code, which is the Z machine routines, the actual program. So because these are read-only, they could be stored on read-only media, such as, for example, a floppy disk. They don't have to be stored in memory. The dynamic memory, of course, does need to be stored in RAM. This is why you can actually get away with 64K of RAM. But in order to address anything in high memory, the Z machine used a trick, which is that any time you want to refer to data or a code routine in high memory, the address is divided by two, which means that you can represent the address of data or code in high memory with a 16-bit address. 
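That divide-by-two trick is a one-liner in the other direction. A sketch, assuming the version 3 rule where unpacking a packed address simply doubles it:

```cpp
#include <cstdint>

// In version 3, a routine or string in high memory is referred to by a
// "packed" 16-bit address, which is the real byte address divided by two.
// Unpacking doubles it, so 16 bits can reach 2 * 64K = 128K of memory.
uint32_t unpackAddress(uint16_t packed) {
    return static_cast<uint32_t>(packed) * 2;
}
```

Note that the result needs 17 bits, which is exactly why it no longer fits in a 16-bit register and only works for read-only high memory reached through this unpacking step.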
So because you could actually address up to 128k of memory, we'll just change this number to 128k, and that should be good enough. So now let's talk a little bit about the stack. This is another lie that I told. Originally, remember when we did the multiply example, I showed the multiply result going into the stack pointer. That doesn't actually happen. You can't actually read and write the stack pointer itself. With a couple of exceptions, when you read the stack pointer, that actually pops from the stack and gives you the result of that. When you write to the stack pointer, that actually pushes onto the stack. So you never actually get access to the stack pointer itself. And now let's talk a little bit about stacks and stack frames. So when you make a call to a routine, you have to be able to remember your local variables and remember the address that you want to return to when you exit your subroutine. So a typical frame looks something like this. When you make a call, what you do is you store the previous frame pointer, which we'll get to in a moment. Then you store the return address. This is the address that you want to return to after your subroutine is done. Then you have an area for your local variables, and then you may have some extra stack space that you can push and pop onto. And the location of the top of the stack is given by the stack pointer. You also have a special pointer called the frame pointer, which points to the beginning of your frame. Now, when you return from a subroutine, basically what you do is you know that the return address is stored just after where the frame pointer is pointing to. So you remember that. You also know what the previous frame was because you're pointing to that using the frame pointer. So all you have to do is go to the return address and then replace the frame pointer with the previous frame pointer. And now you're pointing at your previous frame, which remembers all your locals from your caller. 
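To make the frame mechanics concrete, here's a toy C++ model of the call/return scheme just described: save the caller's frame pointer, save the return address, reserve room for locals, and unwind all of that on return. The layout and all the names here are mine; the Z machine leaves the exact frame format up to each implementation.

```cpp
#include <cstdint>
#include <vector>

// Toy model of a call stack with frames, as described above.
struct Machine {
    std::vector<uint16_t> stack;
    uint32_t framePointer = 0;        // index of the current frame's base
    uint32_t instructionPointer = 0;

    void call(uint32_t routine, uint8_t numLocals) {
        uint32_t newFrame = stack.size();
        stack.push_back(framePointer);        // save caller's frame pointer
        stack.push_back(instructionPointer);  // return address (toy: fits in 16 bits here)
        for (int i = 0; i < numLocals; i++)
            stack.push_back(0);               // room for the callee's locals
        framePointer = newFrame;
        instructionPointer = routine;
    }

    void ret() {
        instructionPointer = stack[framePointer + 1];  // saved return address
        uint32_t prevFrame = stack[framePointer];      // caller's frame pointer
        stack.resize(framePointer);                    // discard the whole frame
        framePointer = prevFrame;
    }
    };
```

(A real implementation would need a wider return address, since code can live above 64K.)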
So that's how frames and subroutine calls actually work. So it's pretty clear that we're going to need some sort of a stack pointer as well as a frame pointer. So that will go as registers in the processor. Now, because we also have global variables, it would be kind of useful to be able to know where the global variables are stored. And that will also go in a register in the processor. And finally, we need to know which instruction we're pointing to, so where we're actually running from. And that's called the instruction pointer. And that also goes into a register in the processor. So now that we have the minimum number of registers in the processor, let's talk about how the processor actually executes instructions. There are a couple of ways that a processor can be built to execute instructions. And what we're going to do is build an architecture called micro-programming. The way that micro-programming works is that the processor can be defined as having a bunch of special purpose hardware. For example, an ALU, an arithmetic logic unit. It could have a branch unit, which is hardware dedicated towards calculating the addresses of branches. It could have multiplexers. It could have special hardware dedicated to the input and output. So the way that you access and tell all of these smaller pieces of hardware what to do is using a micro-program. So every instruction that the processor executes can be broken down into a series of more fundamental instructions. And those instructions are called micro-instructions. So when the processor gets an instruction or an opcode from memory, it actually executes a micro-program using micro-instructions. So in effect there's a tiny processor inside the actual processor. And the purpose of that tiny processor is to execute the opcodes and control the hardware of the processor. 
And this is an easy way of building a processor, because if you need to change the implementation of one of your opcodes, you can easily do that by just changing the micro-program in the processor. So I've added a couple of special purpose blocks in the processor. The first thing you'll notice is that we have a micro-program, and this is the equivalent of the ROM in the processor. We also have a micro-instruction pointer, which tells us where we're running at the moment. We also have some branch hardware, an arithmetic logic unit, a memory controller, and a bunch of extra registers. So let's think about how the simplest instruction works. The simplest instruction is no-op. And I'm not talking about the micro-instruction, but let's call it the macro-instruction. So how does no-op work? Well, first of all, you have to be able to fetch the opcode from memory. Otherwise, you don't know that you're actually executing a no-op. So the very first thing to do is to fetch the next byte from memory. Right away, we know that we have to instruct the memory controller to retrieve whatever memory is at the instruction pointer and store it somewhere. And where we're going to store it, I just arbitrarily called it register mm, for "from memory". Now, the next thing that we have to do, based on what mm actually is, is execute the micro-program for that opcode. So the very first thing that we're going to need is a table that tells us where in the micro-program the no-op opcode is. So what I did was I gave us some sort of micro-instruction that has some kind of a table somewhere, and you index into that table using mm, and you load the micro-instruction pointer with that. Now, of course, at some point, you're going to have to increment the macro-instruction pointer. So let's put that into the micro-program. 
So what I've done is I've put that immediately after the read from memory, because almost always, when we get an instruction, or when we get memory at the instruction pointer, we're going to want to increase the instruction pointer. Now, what exactly is a no-op? Well, it's no operation. You do nothing. Well, of course, you can't just do nothing. You have to do something. You have to go to the next instruction. So the micro-program for no-op would simply be: go back to the beginning. And that's our basic driving loop for the whole processor. We get the memory at the instruction pointer. We increase the instruction pointer. We find out where in the micro-program to execute the opcode. We execute the opcode, and then we go back to the beginning. So that's how no-op works. And now you can see that this section of the micro-program basically never changes. That's where you start all of your processing from. And then after that comes, one after the other, what to do when you try to execute each opcode. Now, the other interesting thing is, because these are all bits of dedicated hardware, you don't have to execute these one at a time. You can actually execute them in parallel. And this is what's known as horizontal micro-programming. The idea is that the micro-program memory doesn't have to be 4 bits or 8 bits wide. It could be 20 bits. It could be 30 bits. It could be 50 bits. It could be however many bits you need to control all the various pieces of hardware in the processor. So you can do a whole bunch of things in parallel. So here's an example of what I mean when I say that you can execute things in parallel. Let's suppose, for example, that the incrementing of the instruction pointer takes place using the ALU hardware. So we instruct the ALU to read from the instruction pointer, add one, and write back to the instruction pointer. Well, the writing back to the instruction pointer doesn't happen immediately. That only happens on the clock edge. 
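Before digging into the clock-edge details, the basic driving loop described above (fetch the byte at the instruction pointer, increment the pointer, look up where to go for that opcode, execute, repeat) can be sketched in C++. The dispatch table here stands in for the micro-program's jump table, and all the names are my own:

```cpp
#include <array>
#include <cstdint>
#include <functional>

// Toy model of the driving loop: fetch, bump the instruction pointer,
// then dispatch through a table indexed by the opcode byte.
struct Cpu {
    std::array<uint8_t, 65536> memory{};
    uint16_t instructionPointer = 0;
    bool halted = false;
    // One handler per opcode; this plays the role of the micro-program table.
    std::array<std::function<void(Cpu&)>, 256> dispatch;

    void step() {
        uint8_t opcode = memory[instructionPointer];  // fetch at IP...
        instructionPointer++;                         // ...and bump IP
        if (dispatch[opcode]) dispatch[opcode](*this);
    }
};
```

Here a no-op is simply a handler that does nothing: step() has already advanced the instruction pointer, and the loop comes back around to the fetch.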
So while you're telling the memory hardware to read from the instruction pointer, go to RAM, and write the result to the mm register, you're also telling the ALU to read from the instruction pointer, add one, and write back to the instruction pointer. Those writes don't take place until the clock edge. That's why they're both able to use the instruction pointer. And then on the next clock edge, the instruction pointer and the memory register get written with their new values all at the same time. Now, this is just one design decision, to have the ALU be responsible for incrementing the instruction pointer. You might instead have part of the memory hardware contain its own ALU that's extremely limited. You know, maybe all it can do is add one, or add some offset, to the instruction pointer, because typically all you're going to be doing with the instruction pointer is either incrementing it or going to another address based on an offset, or possibly loading the instruction pointer with a constant. For example, when you jump to a location or when you go to a subroutine. So that would be an extremely limited ALU, and it could be its own dedicated hardware. Just for the sake of illustration here, what I've done is I've used the ALU, which maybe can also perform addition, subtraction, multiplication, division, modulus, comparisons, that sort of thing, as the calculation hardware for the instruction pointer. You don't have to do that. It all depends on your analysis of the entire machine's opcode repertoire. And that's in fact what we're going to have to do. We're going to have to look at all the instructions and see what kinds of hardware we will actually need to implement those opcodes. And it would be nice to get some efficiency by being able to parallelize internal micro-operations. Now, one of the things that I didn't talk about is how you control the micro-instruction pointer. 
We did have a step over here where we load the micro-instruction pointer, but of course we also have to increment the micro-instruction pointer. So in step one, assuming that these were serial operations and not parallel operations, every one of these instructions, with the exception of the load, would have to have an instruction that tells the micro-instruction pointer to increment itself. So this is yet another parallel operation. And we can have the micro-instruction pointer be its own bit of dedicated hardware that knows how to increment itself and how to load itself with some value. So for example, in step four, "go to step one". Well, step one is actually, you know, maybe micro-address zero. So what we would do is load the micro-instruction pointer with zero. And that would mean that on the next clock pulse, we go ahead and start executing from there. So we know that we need a micro-instruction pointer that can be loaded with a value and that can increment itself. Now, what happens when you reset the processor? Well, when you reset the processor, you typically want to start from the very beginning, which means that you need to be able to clear out the micro-instruction pointer. You know, maybe the very first instruction that you execute in the micro-program is always going to be at micro-address zero. That's pretty convenient, so let's just go with that. So clearly we have three operations that we can tell the micro-instruction pointer to do: load with zero, load with a constant, or increment. These are three instructions that we can tell the hardware of the micro-instruction counter, you could call it, to do. So of course, since that's three different types of operations, we can encode them using two bits. So maybe zero zero means clear yourself, maybe zero one means increment yourself, and maybe one zero means load yourself with a constant. 
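That three-operation repertoire can be modeled behaviorally in C++ before we ever write any System Verilog. This is a sketch using the two-bit encoding just proposed (00 clear, 01 increment, 10 load); the struct and method names are mine:

```cpp
#include <cstdint>

// Behavioral model of the micro-instruction pointer hardware described above:
// on each clock, a 2-bit command selects clear, increment, or load-constant.
struct MicroInstructionPointer {
    uint16_t value = 0;  // assumes a 64K micro-program, i.e. a 16-bit pointer

    void clock(uint8_t command, uint16_t constant) {
        switch (command & 0x3) {
            case 0b00: value = 0;        break;  // clear (also used at reset)
            case 0b01: value++;          break;  // increment
            case 0b10: value = constant; break;  // load with a constant
            default:   /* 11 unused for now */ break;
        }
    }
};
```

A model like this is also a handy reference to check the Verilated hardware against later.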
And of course, that constant would have to be an operand, and we're not quite sure how many bits that operand needs to be, because we're not sure how big the micro-program is going to end up. Let's just assume, for the sake of argument, that the micro-program is 64K; that means the operand would be 16 bits. So you would need 16 plus 2 bits, or 18 bits, just to tell the micro-instruction pointer what to do. Now, the memory hardware will need more bits. The ALU will need more bits. The branch hardware will need more bits. So it's clear that our instruction word is a lot bigger than 18 bits. It's a lot bigger, maybe, than even 32 bits. That's why this is called horizontal microcode: each word is huge and horizontal. Now, the other nice thing about micro-programs is that, as I said earlier, the implementation of an opcode can change. So as we go from version one to two to three and so on, some of the implementations of the opcodes will change. And we can implement that by simply pulling out the old implementation of that part of the micro-program and replacing it with the new part. So what I've done here is I've listed the operations that we're going to want the micro-instruction pointer hardware to do: clear, increment, and load with some constant address. Now, I know that later on we're probably going to want to add a branch instruction, because you can't really have a computer unless you have a branch instruction. That's one of the definitions of a computer. So we are probably going to have to add more operations. But since we are going to use a hardware description language to program the FPGA, we don't have to actually design the hardware for this by hand. We sort of design the hardware using software. There are several hardware description languages out there. Which one are we going to use? Well, there are three major flavors of hardware description language. There's VHDL, there's Verilog, and there's System Verilog. 
And I've categorized them according to what kind of programming language they look like. VHDL was based on Ada, so it's very Ada-like. Verilog has the flavor of C, and, just for the sake of making an argument, System Verilog is what I consider to be a better version of Verilog. So it's sort of C++-like. This is not quite accurate, but the point is that if you're familiar with the syntax of Ada, then VHDL should look very familiar. If you're familiar with the syntax of C or C++, then Verilog and System Verilog should look familiar. Just for that reason, I'm going to choose one of these two. I'm going to eliminate VHDL from consideration because I'm not familiar with Ada. So it's either going to be Verilog or System Verilog. Now, I looked at Verilog and I looked at System Verilog, and it seems like there are many differences between them. With System Verilog, you can describe things in a more modern way. For example, today we don't really use global variables. We have variables that are defined in namespaces. Well, you can have that sort of thing in System Verilog; they're called packages. You can sort of get that in Verilog, but what you end up with is more like include files. Everything ends up being global. So for that reason, I'd like to use System Verilog. Now, there's a very nice bit of free software called Verilator. You can actually write your System Verilog program, pass it through Verilator, and what it does is convert it to C++. Then you can write your test bench in C++ and actually test your hardware. Now, it doesn't do timing analysis. That's really up to the particular FPGA vendor's software. But the point is that you get to at least validate that your hardware does what you think it does, even if you don't know how fast it does it. 
So I think that the very first thing that we're going to do, finally getting away from the whiteboard, is we're going to write some simple System Verilog to implement this micro-instruction counter. We're going to pass it through Verilator, and we're going to write a simple test bench in C++ in order to exercise our hardware and make sure that our hardware does what we think it does.