Greetings, RISC-V friends. So another short video. I think the videos are just gonna be shorter now because there's not a whole lot of code to go through. All the code has pretty much been formally verified and it's just a matter of refactoring things and maybe correcting a few errors. So that's gonna be one topic. And the other topic is the printed circuit boards, the schematics. I'm building a RISC-V processor, not on an FPGA. So this diagram should look a little bit familiar from last time. It's basically the ROMs that I'm using to sequence the entire RISC-V processor. There's a problem with this. Well, there are actually two problems with this but one should be fairly obvious. And that is if I have these inputs here, these 12 inputs, so let me just grab a thing. Okay, so five of these 12 inputs going into the trap ROM and I'm expecting these 32 outputs but also this other output that enables the other ROMs. The problem is that the ROMs that I'm using, they're not RAMs, they're Flash ROMs. So they're slower. The ones that I've chosen are 70 nanoseconds. So you can see that from the time that the inputs are stable, the output data will take 70 nanoseconds to become stable and get to the right value. So if I input the data to the trap ROM, then 70 nanoseconds later, the enable signal will either be high or low. It will become stable to its final state. So the sequencer ROM in the meantime is also taking 70 nanoseconds to get from its inputs to its outputs. So depending on which one gets there first you might have a glitch. So we're gonna have to wait a little, just a little while longer to get the final output. So instead what I've decided to do was take the enable signal, oh and not only that but the enable signal itself also has some propagation delay between the time that it enters the sequencer ROM and the time that the data lines actually go high impedance.
So we're sort of wasting time here and I looked at the enable signal itself and I determined that it's actually fairly simple. It only takes a few inputs and the output is the enable signal. So I moved that out of the trap ROM and into some extra separate logic. And if you want you can look at the GitHub link down below to see how exactly that was done. So let's see. So another problem that I had is can you tell what's wrong with this circuit? Well, nothing, right? You give it an input, you get an output. What's wrong with this circuit? Well, this is your classic oscillator. It will oscillate as fast as it possibly can. And that actually can be exploited. In fact, in a lot of chips, there's an onboard oscillator that kind of looks something like this. And in fact, in order to slow things down, what they do is they typically put an RC network in between so that the signals actually take a little while longer to go from input to output. So that's your typical onboard oscillator. So there's nothing particularly wrong with doing that. It's just that you have to be wanting to build an oscillator. All right, there's another circuit. Let's see if I can get this right. Cross, cross, input, input, output, output. What's wrong with this? Well, nothing if you're trying to build a flip flop. And that's what this is. It's a flip flop made out of NAND gates. So I guess what I'm trying to say is, in general, if you have a combinatorial circuit and you have some inputs and you have some outputs and you tie some of the outputs to some of the inputs, then either you're building an oscillator or a flip flop. And if you're not really doing any of those, then you shouldn't be doing it at all. So with that in mind, what I noticed, and I think this was in either a YouTube comment or one of the Twitter comments, but I took a closer look and I did find an error, is that in the nMigen code, I was basically modeling everything as logic. I didn't model things as ROMs.
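As an aside on the two circuits drawn above, here is a minimal Python sketch of what tying outputs back to inputs does. It assumes an idealized unit gate delay per step, and the function names are made up for illustration, so it is a toy model and not a timing simulation of real parts.

```python
def nand(a, b):
    """Two-input NAND on 0/1 values."""
    return 0 if (a and b) else 1


def inverter_loop(steps, start=0):
    """One inverter with its output tied back to its input.

    Each step stands for one gate delay, so the output flips every step
    and never settles: an oscillator.
    """
    out = start
    history = []
    for _ in range(steps):
        out = 1 - out
        history.append(out)
    return history


def nand_latch(set_n, reset_n, q=0, q_n=1, max_steps=10):
    """Cross-coupled NAND gates with active-low set and reset inputs.

    Both gates are re-evaluated each step until nothing changes, which is
    the point where the feedback loop has settled into a stored state.
    """
    for _ in range(max_steps):
        new_q = nand(set_n, q_n)
        new_q_n = nand(reset_n, q)
        if (new_q, new_q_n) == (q, q_n):
            return q, q_n
        q, q_n = new_q, new_q_n
    raise RuntimeError("latch did not settle")


if __name__ == "__main__":
    print(inverter_loop(6))                # keeps toggling
    print(nand_latch(set_n=0, reset_n=1))  # set the latch
    print(nand_latch(1, 1, q=1, q_n=0))    # hold the stored value
```

The inverter loop never reaches a steady value, while the NAND pair settles and then holds its state once both inputs go back high, which is the oscillator versus flip flop distinction made above.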
So the whole sequencer ROM is sort of like the hardware implementation of the software, which is described as logic. Now, being described as logic means that just from a strict cause and effect perspective, I might have several parallel lanes, let's call them, of inputs and outputs. So this could be, let's see, so there are 76 outputs. Maybe this is 20 outputs, this is 16 outputs, this is two outputs, and these are the rest. And the inputs could maybe look something like this. So really, the sequencer is composed of a bunch of independent modules, essentially. Well, in hardware, I implemented this entire thing as a ROM, so basically none of it is independent anymore. So in other words, I can't just take, say, this is one of the independent sections, I can't just take the output of that and feed that into the input of the other section, because to a ROM, these are all the inputs and these are all the outputs, and there's 70 nanoseconds between input and output. And that's only if the inputs remain stable. Sorry that the diagram just disappeared on you, but apparently my camera feed froze and everything else was lost. So rather than just seeing a freeze frame of me on the bottom, I'll just repeat what I was saying. So let me give you a concrete example of this problem. So we have an instruction, and this comes from memory, it goes into the instruction register, and the idea is that one of the instructions actually needs to know whether the immediate value that's encoded in the instruction is zero or not, and it does one thing or the other, depending. I think it's one of the CSR instructions. So what we need to do is in the instruction, there's an opcode, and based on the opcode, that determines how the immediate value is encoded in the instruction. So we'll have a decoder here, the instruction goes in there, we extract, so from the opcode, we get the opcode format, and that can feed into the decoder to give us the actual immediate value of the instruction. 
Okay, so we also need some sort of a comparator to see if this is equal to zero, and that's basically a one-bit signal. So of course, the opcode, so this goes into the sequencer ROM. So of course, the opcode also goes into the sequencer ROM. And the idea is that if you're performing a CSR instruction, you'll look at this other one bit, and that will determine, that will make the outputs look different. So in the abstract logical representation, in the software, what I actually was doing is this determination of the opcode format, I actually didn't have that as a separate circuit, and instead I said, well, I've got the opcode feeding into the sequencer ROM anyway, so I may as well just make the sequencer ROM output the opcode format, and then the opcode format can be used by the decoder. So you can kind of see where this is going now. So the problem is that there's 70 nanoseconds between the opcode coming into the sequencer ROM and the opcode format being determined. That goes into the decoder, and if it's a CSR instruction, then this bit is going to be a one or a zero, and then there's going to be another 70 nanoseconds before the outputs reach their final state. So that's number one, it's 140 nanoseconds, not 70 nanoseconds. Now of course, in the code, this part was independent, which meant that it basically ran in parallel, so there was no real problem. I mean, yes, there's some small propagation delay, but not 70 nanoseconds worth. Now the other problem is that with these ROMs, what happens is you have an address, and let's suppose you set it up at that point and it just remains stable, nice and stable. The idea is that the access time of the ROM is 70 nanoseconds, so 70 nanoseconds later, the data comes out and it's fine and everybody is happy. The problem is that during those 70 nanoseconds, you have to keep the address lines stable. 
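The 140 nanosecond figure above is just the ROM access time counted once per pass through the ROM. A small sketch of that arithmetic, assuming the only delays that matter are the 70 ns access time and an optional lump of other logic delay (the function names and the `other_logic_ns` parameter are illustrative, not from the project):

```python
ROM_ACCESS_NS = 70  # access time quoted for the Flash ROMs


def settle_time_ns(rom_passes, other_logic_ns=0):
    """Worst-case time for outputs to settle when a signal has to go
    through the ROM `rom_passes` times in sequence."""
    return rom_passes * ROM_ACCESS_NS + other_logic_ns


def max_clock_mhz(rom_passes, other_logic_ns=0):
    """Upper bound on clock rate if one cycle must cover the settle time.
    Ignores register setup time, clock skew and similar margins."""
    return 1000.0 / settle_time_ns(rom_passes, other_logic_ns)


if __name__ == "__main__":
    print(settle_time_ns(1))  # opcode format computed outside the ROM
    print(settle_time_ns(2))  # opcode format fed back through the ROM
```

With one pass the bound is 70 ns; with the opcode format looping back through the ROM it doubles to 140 ns, and that is before considering that the address changes partway through.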
They can't change because if they start changing, that means you're starting to look at other addresses in the ROM, which means that the data is gonna start looking different. So one of the things that you have to do is when you change the address lines, you must keep them stable until you're ready to read the data. Now, while the chip is reading the data, the data lines can actually do anything they want. And that's another consequence of having to wait 70 nanoseconds, that access time. The data that comes out is actually unstable until 70 nanoseconds after the address lines start becoming stable. So because of that, if you look at the sequencer ROM and if you treat all of these inputs as the address and all of these outputs as the data, then what's actually happening is you're feeding the address lines into the sequencer ROM. 70 nanoseconds later, the outputs come out, but then the outputs go ahead and immediately change the address lines again. So now it's even worse than that because it's not necessarily 70 nanoseconds later because the moment you start applying the address, the data lines can do whatever they want. So the opcode format is busy, changing at random basically, which means that this bit is changing at random, which means that the outputs are changing at random again. And in effect, what you've built is an oscillator or a flip flop or something that's basically not stable. So I noticed a few cases of that in the code. I think I noticed two cases of that. And in both cases, I was actually able to separate that out of the sequencer ROM so that all paths from the input to the output, there was never a combinatorial feedback loop going on. So I fixed that problem. Again, ran it through complete formal verification in prove mode and everything worked. Prove mode now takes 60 minutes running on eight processors, which is pretty good. So that's that. 
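One way to look for this kind of problem is to treat the design as a graph of which block drives which signal and search it for a loop. The sketch below does that with a depth-first search. The signal names, and the `format_logic` block standing in for the logic pulled out of the ROM, are hypothetical labels for the example in this video and not names from the actual code.

```python
def find_cycle(graph):
    """Return one combinational loop as a list of nodes, or None.

    `graph` maps each node to the nodes it drives. Nodes currently on
    the search path are GREY; reaching a GREY node again means a loop.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for nxt in graph.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in list(graph):
        if colour.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Opcode format produced by the sequencer ROM itself: there is a loop.
before = {
    "opcode": ["sequencer_rom"],
    "sequencer_rom": ["opcode_format", "control_outputs"],
    "opcode_format": ["imm_decoder"],
    "imm_decoder": ["imm_is_zero"],
    "imm_is_zero": ["sequencer_rom"],
}

# Opcode format produced by separate logic: no path returns to the ROM.
after = {
    "opcode": ["sequencer_rom", "format_logic"],
    "format_logic": ["opcode_format"],
    "opcode_format": ["imm_decoder"],
    "imm_decoder": ["imm_is_zero"],
    "imm_is_zero": ["sequencer_rom"],
    "sequencer_rom": ["control_outputs"],
}

if __name__ == "__main__":
    print(find_cycle(before))
    print(find_cycle(after))
```

This is only a structural check. It says nothing about timing, but it does flag the case where a ROM output eventually feeds one of that same ROM's address lines.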
So now what I wanna do is look at some of the schematics that I'm working on in order to implement the sequencer. Okay, so here's a look at the multiplexers slash registers in the sequencer. So what we have alongside the left-hand side are the sources of data. You can see that there's a bunch of them. And along the top side are the outputs. So this is kind of like a crossbar bus where you've got a bunch of inputs and you select from those inputs and you route them to the outputs. And some of the outputs happen to be registered like the program counter or the memory address register. Some of the others are not registered like the X or the Y or the Z bus. So what I did was I just basically marked out what input can go to what output based on what the sequencer wants to happen. And based on that, I was able to basically enumerate all of the input signals and all of the output signals and which would be connected to what. Now, if you count up all of the input sources, these are just, I'm talking just about the 32-bit input sources. So that's that upper left section. The others are less than 32-bits. But so let's ignore that for now. But if we count up just the 32-bit inputs, we see that there are 17 of them, which is kind of unfortunate. I mean, it would be nice if it were just 16, which could be encoded in just four bits. But you've got that one extra one that now needs an extra bit. But there are some savings that we can do. And one of them is if you look at the three gray inputs, those are actually derived from other inputs. So for example, we have, so for example, we have Z as an input, but we also have Z shifted to the left by two as another input. You don't need to actually have a separate 32-bit bus running throughout the entire processor just to hold Z shifted left by two. You can actually compute that locally as it were. The same thing with the mem address. 
Sometimes we need the mem address and sometimes we need the mem address with the least significant bit set to zero. So again, that doesn't need to run as a completely parallel 32-bit bus throughout everything. It can be just computed or generated locally when it needs to. And I'll explain what that means a little later. Now the ones in green are a little more limited use. So the trap cause and the instruction really are only used when there's a trap. There's no other reason to use them. And they only go to one place. So strictly speaking, we may not even need those as 32-bit inputs. And then the shamt, that's the shift amount. Well, there's actually only a limited amount of shifts that you want to do, up to 31 basically. So that could be encoded in just five bits. So you don't really strictly speaking need that as a 32-bit input either. So if you discount all of those, then you're just left with 11 32-bit inputs, which is nice. But again, even with those locally computed ones, now you've got three extra ones. But in theory, you don't need to select those green ones because again, those are very limited use and you don't need a sort of general crossbar to handle those. So now we're comfortably below the 16 input limit. So let me show you one of the cards that I initially tried. Okay, so this may look a little bit complicated, but it is actually fairly straightforward. So the idea is that along the left side, we have some card edge connectors. And I've found, I don't have them with me, but they're actually quite small. They're something like about that big, so something like 65 or 70 millimeters. So three of them would extend out to maybe 22 centimeters, so 220 millimeters. That's really not bad. And each one of those is a 168 pin connector. It's fairly small pitch, I think it's 0.6 millimeters. It's a surface mount thing. I really should get one. So here they are, they're quite small.
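Going back to the input counting a moment ago, the "extra bit" complaint is just the number of select lines needed to pick one of N inputs. A short sketch of that arithmetic, using the counts mentioned above (17 inputs in total, 11 once the derived and limited-use ones are discounted, 14 with the three locally computed ones added back); the function name is made up for illustration:

```python
def select_bits(n_inputs):
    """Number of binary select lines needed to choose one of n_inputs."""
    if n_inputs < 1:
        raise ValueError("need at least one input")
    return (n_inputs - 1).bit_length()


if __name__ == "__main__":
    for n in (17, 16, 14, 11):
        print(n, "inputs need", select_bits(n), "select bits")
```

Anything from 9 to 16 inputs fits in four select bits, which is why getting from 17 down to 14 matters and getting from 14 down to 11 does not change the width of the select field.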
And if I take one out of the package, you can see all the little surface mount pins, hopefully you can see that. This is the card edge side. There's a little protective sheet on top, which is also used for pick and place machines to pick them out of the carrier here. But they're very small, which is nice. I like small, they're dense. There's 168 pins, and they've also got registration pins, so you can only put them into the printed circuit board one way. And there's no moving around of this thing. So you put it in and it's pretty much guaranteed to line up properly, which is really nice. So in any case, that's what these left side things are. So that's my connector over there. Now, even though it has 168 pins, I just can't bring myself not to put grounds next to every data signal. So because of that, you don't actually have 168 signals. In fact, you only have 106. So D0 over here, all the way up to D105 right over here, along with of course ground, but we've also got 3.3 volts power on four of the pins, which I'm kind of hoping will be sufficient. And if it isn't, well, I guess I'm gonna have to think about things. But in any case, so I end up with only 106 signals. Now, in terms of 32-bit buses, a 106 signal connector really can only have three of those. So I have 96 of those signals for the buses. And then I've got 10 extra signals that I can use for things like control or clock or something like that. So this is just the edge connector. Now, if we go to the next one, we've got this internal bus interface, which breaks out the connector into specific buses. So for example, right here, for this edge connector, I've got mem data read over here, mem data write over here, and mem address over here. And then I've got some of those selection lines, which will determine which thing gets multiplexed to what. And I've also got some clock lines down here. So I've got three of those basically.
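The pin budget for the connector described above works out as follows. This is a sketch of the arithmetic only. The ground count assumes every pin that is not a signal or a 3.3 volt pin is a ground, which the video implies but does not state outright.

```python
TOTAL_PINS = 168   # pins on one card edge connector
SIGNAL_PINS = 106  # D0 through D105
POWER_PINS = 4     # 3.3 V pins
BUS_WIDTH = 32


def bus_budget(signal_pins=SIGNAL_PINS, bus_width=BUS_WIDTH):
    """Return (number of full buses, leftover signal pins)."""
    return divmod(signal_pins, bus_width)


def ground_pins(total=TOTAL_PINS, signals=SIGNAL_PINS, power=POWER_PINS):
    """Assumption: all pins that are not signal or power are ground."""
    return total - signals - power


if __name__ == "__main__":
    buses, spare = bus_budget()
    print(buses, "buses,", spare, "spare signals,", ground_pins(), "grounds")
```

That gives three 32-bit buses using 96 signals, with 10 left over for select lines and clocks.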
So now I'm assigning some of those 32-bit signals to some of those connectors. Now over here, I've got a bunch of buffers. And like I discussed last time, all those buffers, the outputs feed into a single bus. And you turn on one of those buffers. So what do the buffers look like? Well, they just look like this. They're 16244s. These are 16-bit buffers. Each one takes four capacitors because each one has four power pins. So they each need their local decoupling or reservoir capacitor. So that's what that is. And of course, there's an output enable line for the entire 32-bit signal. So depending on which of those output enable lines is selected, one of these buffers will be active and that output will go to a register if this happens to be registered. Now this is going to be the card for the program counter, which is registered. So at the end of this, we have a register. And it's just a 16374, or rather a pair of those, which is a pair of 16-bit registers. Again, they also have an output enable, but I'm always gonna tie that to ground because I always want it to output. And then there's a clock that goes into it. So that's the 32-bit registers. These are the buffers. Now the idea here is that, again, like I said, if we don't want to select any of these other signals like PC plus four or X to go into the program counter, then it should just go into itself. And that way we're not applying combinatorial logic to the clock lines, which is never a really good idea. And I've also added these pull-downs to all of the address lines because, so there are two reasons for this. The first reason is that when you've got an input to a circuit, you never really want to leave those inputs floating at any time because if they do float, then the inputs can actually float to a value where it's both zero and one.
So you've sort of got, effectively in the circuit, the MOSFET that goes to power and the MOSFET that goes to ground both turned on, which means that you're gonna draw a huge amount of power, which again is not a good thing. So you never really want to leave inputs floating, especially on CMOS circuits, which is what these are. We've also got some kind of a modification here. So remember that I said that Z shifted left by two can be sort of locally computed. Well, this is what I mean. So the card edge connector contains the 32-bit signal for Z. That is fed on this card to a sort of just a rerouting network. And this just shifts the signals over. Of course, you still need one buffer for Z shifted by two. And you need another buffer for Z and then you select one or the other. And that's how that works. And then finally, this is a three to eight decoder. Strictly speaking, it's a four to eight decoder because one of the selection lines goes into the enable. And the idea here is that if I don't select any of these inputs, not even the PC, if I don't select any of these buffers, then the pull-downs pull down those signals to zero, which essentially allows me to reset the program counter to zero, which is kind of neat. So that's a sort of poor man's reset, I guess. But anyway, so this was my idea. And I went ahead and turned it into a printed circuit board. I haven't ordered the printed circuit board. I just turned it into one to see if it can actually be routed and what it would look like. And sure enough, it looks pretty cool. Okay, so here is the routed board. You can see the card edge connectors down here. So the idea is that the sequencer itself is sort of like a base, kind of like a motherboard with a bunch of slots. And then you would put these registers, these crossbars into the slots. And there would be a line of them. And the buses would go across the slots.
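To make the program counter card described above a bit more concrete, here is a behavioral sketch in Python. It models the locally derived inputs as pure rewiring and the buffer selection as "one source enabled, or none, in which case the pull-downs win". The names (`next_pc`, the dictionary keys) are invented for this example and do not come from the schematics.

```python
MASK32 = 0xFFFFFFFF


def z_shifted_left_2(z):
    """Z shifted left by two: on the card this is only wiring, bit i of Z
    goes to bit i+2 of the buffer input and the low two bits are zero."""
    return (z << 2) & MASK32


def mem_addr_lsb_cleared(addr):
    """Memory address with the least significant bit forced to zero."""
    return addr & MASK32 & ~1


def next_pc(sources, selected):
    """Value presented to the PC register at the next clock edge.

    `sources` maps a source name to its 32-bit value. `selected` names the
    one buffer whose output enable is active, or is None when no buffer is
    enabled, in which case the pull-downs drag every line to zero.
    """
    if selected is None:
        return 0
    return sources[selected] & MASK32


if __name__ == "__main__":
    z = 0x00000123
    sources = {
        "pc": 0x1000,           # feeding the PC back into itself holds it
        "pc_plus_4": 0x1004,
        "z": z,
        "z_shl_2": z_shifted_left_2(z),
    }
    print(hex(next_pc(sources, "pc")))
    print(hex(next_pc(sources, "z_shl_2")))
    print(hex(next_pc(sources, None)))  # nothing selected: reset to zero
```

Holding the register is modeled as selecting its own output, which matches the point about not gating the clock, and selecting nothing gives the poor man's reset.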
Kind of like the entire machine works with a sequencer card, an ALU card, a shifter card, and that sort of thing. So basically you've got the sequencer card itself is kind of a motherboard, which is a bit odd, but okay. And we can view this in 3D. And that's what this looks like. So there it is in 3D. These are where the resistor networks, those pull downs are supposed to go. I don't have the package 3D things for these, so they don't really show up. On the back are where all the decoupling capacitors went. There's kind of a whole lot of them. You sprinkle them on like salt. But that's basically what it would look like. Anyway, somebody on Twitter said, I noticed that it says PC register. Does this mean that you're gonna have a separate board for every register? And I thought about that and I said, but that doesn't really sound like a great idea because each one of these printed circuit boards is gonna have a setup cost. Also, I really would like to have these fingers, these card edge fingers electroplated with gold because that will allow me to insert them and remove them more cycles basically because this is a prototype, I'm probably gonna be plugging this in and out. And if even one of these signals, if one of these pads gets degraded because I didn't go with the gold option, I'm gonna start kicking myself. And the gold option has like an additional $15 setup fee per printed circuit board. So, it could get pretty expensive if I've got a whole bunch of these, like 10 of them and they're all different. So instead, I am working on a sort of common printed circuit board and it looks, I haven't finished it, I just started and it looks something like this. Okay, so over on the right side, this is just the leftovers from the original circuit which I haven't routed yet. But this is what I've got. So on the left side, I've got four card edge connectors. Okay, so now we're up to four. These are the bus interfaces and now they're a little more generic. 
So if I open one up, you've basically got three buses, bus A, bus B and bus C or as I like to call them, the bus to the short-term lot, the bus to the long-term lot and the bus to the terminals. Pause for laughter. Okay, so with that, I can now define what bus A, B and C are for each of these edge connectors. So I've got all of the signals that I'm gonna need. I've got one, two, three, so 12 of them. And here what I'm doing is I'm setting up buffers for each of those. Now not every card is gonna be populated with every buffer and certainly, they're not gonna be populated with every register. In fact, only the register cards are gonna be populated with one set of registers. The pure multiplexers that aren't registered like the cards for the X and the Y and the Z bus don't even have registers. So the idea is that I can define the function of a card simply by populating whatever chips are required for that card. The card is fully routed, it's just that if some chips are not present, then that function is just not present. So for example, for the PC register, I may not have the chips for the instruction buffer because the PC doesn't get anything from the instruction ever. So that is really what I'm working on right now. So the cards are gonna be bigger. They're gonna take up more space simply because I need to route everything as if all the chips were present. But aside from that, I think this is probably a better solution given that every printed circuit board has at least a $30 to $40 setup fee. So the cards themselves are cheap. They're like, I don't know, like $8 or $9 each in four layers and even if they turn out to be $12 or $13, that's still a lot better than $30. So anyway, that's what I've been working on. I'll have more to report next week. So probably by next week, I will have hopefully either fully routed one of these boards and have a tentative 3D view or I've thrown the idea away entirely because it just didn't work out.
So we'll see. So I guess with that, I really don't have much else to say. So I hope you have a good weekend and I will see you next week. Bye. How's that, Cat?