Thank you. The year is 3014. After a series of demotions, I find myself as the sous-chef, the assistant sous-chef, on the intergalactic spaceship Icarus. One second, I don't have presenter notes, and that's going to kill me. Oh, thank god. After an unfortunate accident involving our phase matter array and a terribly hungry dark matter cephalopod, which winds up killing the captain, the science officer, the engineer, and of course our beloved computer, I find myself stranded in space. All I have is a power supply, a bucket of NAND gates, and this book. I am, everyone join in, lost in space. So let's build a computer and get home, shall we? Welcome to GoGaRuCo 3014.

To set some expectations, this is a highly technical talk. I'm hoping it is also a highly approachable talk. I have 229 slides. I have to go incredibly, incredibly fast. I will not have time for Q&A, but I will be out in the break afterwards. And very important, this is not my field of expertise at all. The last time I did anything considered hardware, I was paired with an ex-Navy electronics guy, so I didn't exactly learn anything. So I'm terrified.

Seattle.rb has a study group, and we studied this book in the spring of this year. The curriculum hails from a university in Israel. It's taught by two professors, whom you can see there. I'm not going to try to pronounce their names because I don't like fucking that up. It's since been taught in several American colleges and apparently even high schools. Quite frankly, I was blown away by how good this book was. Really, I love this book. I love this book so much, I'm here doing this talk.

The book goes from the very basics up to a running computer with a game or other application, whatever you want to write on it. It's very open-ended. It's a lot of stuff to cover in 12 weeks, let alone 30 minutes. We're only covering the hardware aspects, and that barely touches machine language. This is the layout of the book, the actual chapters. This is a diagram from the book itself. As you can see, we're only going to be doing chapters one through three and five, and hinting at chapter four.

This, more than anything, is the thesis of the book and the theme of this talk. We don't build a computer. We build Boolean logic. And with that, we build arithmetic. And with that, and so on. Every layer is entirely dependent upon the layers below it. This is bottom-up design at its finest.

Very important, though, perhaps the most important thing in this talk: this is not Comic Sans. See those t's? They're all slightly different. That's because this font is actually hand-drawn. It's called Sketchnote. It is actually related to the book that Jess showed earlier. And I think it's really awesome, so I'm using it throughout.

So let's get started with Boolean logic. Let's start with a NAND gate and the most obvious question to follow: the fuck is a NAND? It's really simple. It's a not-and. The little circle is a NOT, and the D shape is an AND. Together, they make NAND. The lines are just pretend wires; they're not actually part of the symbol. The truth table for it is exactly the opposite of an AND. Where AND loves agreement and returns true when it's given truths, NAND hates agreement and returns true anytime there's a false. If we were to code it, it might look something like this. Simple enough. The savvy Rubyist might try to add it to Object.
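A minimal Ruby sketch of that idea, both as a plain method and as the Object monkey-patch a savvy Rubyist might reach for (this is a guess at the shape, not the actual slide code):

    # NAND as a plain method: true unless both inputs are true.
    def nand(a, b)
      !(a && b)
    end

    p nand(true,  true)   # => false
    p nand(true,  false)  # => true
    p nand(false, false)  # => true

    # The savvy-Rubyist version: hang it on Object so any value responds to it.
    # (This replaces the two-argument version above, since top-level methods
    # live on Object too.)
    class Object
      def nand(other)
        !(self && other)
      end
    end

    p true.nand(true)     # => false
    p true.nand(false)    # => true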
The traditional symbol for NAND is actually really boring, and it turns out that the turned ampersand Unicode character is way cooler, so I decided to go with that. NAND is special. Like NOR, it's considered one of the two universal gates. That means you can build all the other logic gates out of just one of these types of gates. And we're going to do that with NAND to build a whole computer. So as I said before, complexity arises from simplicity. We're going to build NOT, AND, OR, and all the others from just this lowly NAND.

First thing we're going to build is NOT. So how do you do that using only NAND? The truth table is simple enough. It just returns the opposite of its input. How do we get that from this? Well, if you look closely, it's already there. If you give a NAND two truths, you get the only false result from it. If you give it two falses, you get a true. So wiring the input signal to both of the input wires gives you a NOT.

So now let's work on AND, but this time we can use NOT and NAND. How do we build AND from those? Well, an AND is really just a negated NAND. So we expand that to NOT, NOT, AND, let the double negative negate itself, and we get our AND back. So we have a NAND on the input, it passes its output to our NOT, and then we replace the NOT gate with its NAND equivalent to get AND.

Next up is OR, this time using the gates we have available now. We're probably all familiar with OR, but how do we build that from AND, NOT, and NAND? After all, they're all kind of andy. It's actually pretty easy using De Morgan's law. So let's get a quick refresher on that using everyone's favorite, Venn diagrams. Everyone loves Venn diagrams, right? Maybe not that one. Maybe even more like these. So let's start. We've got sets A and B. Take the union of the two and we have OR. The opposite of that would be NOT (A OR B), which is the same as the area outside the union. This can be equivalently expressed as NOT A AND NOT B, which is everything outside the union of A and B. That's exactly what we don't want, so let's negate that too, giving us the equivalent of A OR B using our gates. And that's De Morgan's law in a nutshell.

So let's implement it, shall we? First, flip it using De Morgan's. We take the source code and translate it from the bottom up: NOT A, then NOT B, then we AND the results, then we NOT the result. The chip layout's totally straightforward at this point. But when we look at it in terms of code, we can actually see some simple optimizations right off the bat. Remember that we built AND with two NANDs and NOT with one NAND. Here we can see that NOT AND is just NAND, so let's replace three NAND gates with one. That's better. Next, let's replace the NOTs with their equivalent NANDs, making our OR out of three NAND gates instead of five. That's a very nice optimization which really adds up in the long run.
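Here's a rough Ruby sketch of those constructions, with NOT as a NAND wired to itself, AND as a negated NAND, and the optimized three-NAND OR (the method names are mine, not the slides'):

    def nand(a, b)
      !(a && b)
    end

    # NOT: wire the same signal to both NAND inputs.
    def not_(a)
      nand(a, a)
    end

    # AND: a negated NAND (the NOT NOT cancels, leaving NOT of NAND).
    def and_(a, b)
      not_(nand(a, b))
    end

    # OR via De Morgan's: NAND of NOT A and NOT B, three NAND gates instead of five.
    def or_(a, b)
      nand(not_(a), not_(b))
    end

    p and_(true, false)   # => false
    p or_(true, false)    # => true
    p or_(false, false)   # => false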
Okay, we're in the swing of things, right? Let's tackle XOR and we're pretty much done with standard logic. We now have OR available to us, so it's much easier to do. XOR, or exclusive or, is just one or the other: it's A AND NOT B, OR B AND NOT A. That's really easy to express literally, but it adds up to one plus one for the NOTs, two plus two for the ANDs, and three for the OR, for nine NAND gates. Let's convert it and see what we can do to clean it up. Working left to right, let's replace the NOT gates with their NAND equivalents. Then, let's expand the AND gates to their two-NAND equivalents, and finally, let's expand the OR to three NAND gates.

At this point, it's entirely equivalent to what it was before, but you can see some stuff clumping up, and you might be able to detect some patterns really quickly. Specifically, there are NOT NOTs in the stream that can cancel each other out. You'll have this anytime you have an AND gate feeding into an OR gate, because AND ends with a NOT and OR starts with a NOT. So let's remove them. Next, there's a non-obvious optimization. I had to get it by working out the truth tables, and I'm sure there's some smarter way of coming up with it, but as you know, I'm just an assistant sous-chef. This is hand-wavy, but knowing that NAND hates agreement, we can share a single NAND in the front for the same effect, dropping the NAND gates down to four, which is a huge optimization in the grand scheme of things, because there are a ton of XORs inside of a computer.

MUX is the first name that doesn't have an obvious analog in Ruby. It sounds big and scary, but it's really just an if statement. When the select flag is true, we pass B through to the output as is, and when the select flag is false, we pass A through. That's it. It's really simple. It's just a dumb name. So how do we make it in chips? It is a dumb name. Multiplexer, come on. It's either the select flag AND B, or NOT the select flag AND A. What takes some getting used to for me is that we're calculating both of these all the time, and we're using OR to choose the right result. Converting it to equivalent NANDs gets you the following. I'm skipping the optimization steps for the sake of time, but it's exactly the same process as we went through before.

DMUX has no analog in Ruby. It's the opposite of an if statement. It's still pretty easy. It just decides which of two outputs to assign to, based on the selector, ignoring Ruby's scoping rules. So here's our truth table for DMUX. When the selector is true, the input goes to output A. When the selector is false, the input goes to output B. And really, it's just parallel computation: A is just input AND selector, B is input AND NOT selector. There are no optimizations to be had here, so it's a straight port to NANDs. In source, I would actually just leave it as AND and NOT to keep the gates clear.

So that's the end of our binary operations. We now go into wider branching. OR8Way returns true if any of the eight inputs are true. It would be implemented in Ruby quite simply. Very straightforward, albeit messy, just to make it fit on a slide. Because we're using Ruby's OR, it would short-circuit on the first truthy value. You might implement OR8Way like this: seven OR gates, with eight inputs and one output. It's simple, but it has a flaw. The implementation is serial. It has to go through seven gates to get an answer. But this, same inputs, same output, but by implementing OR8Way as a binary tree, it drops the depth from six to two.

Wide MUXes are a mainstay in computers, as you'll see in a bit. MUX4Way takes a two-bit selector to decide which of four inputs it will output. Much like the theme of this talk, a four-way MUX is simply built of more two-way MUXes. As you may guess, an eight-way MUX is built of four-way MUXes. Absolutely no different here, or here. Trivial. With a bit of thought, you can work these out on your own; it's really just an exercise in editing and figuring out the macros of your editor. For the sake of time, I'm gonna consider this an exercise for the reader, or the viewer. Just remember that complexity builds from simplicity, and you'll do fine.
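A rough Ruby sketch of MUX, DMUX, and the tree-shaped OR8Way, reusing the NAND-built helpers from above (the method names and the array-of-bits framing are mine):

    # The NAND-built helpers from before, compacted.
    def nand(a, b); !(a && b); end
    def not_(a);    nand(a, a); end
    def and_(a, b); not_(nand(a, b)); end
    def or_(a, b);  nand(not_(a), not_(b)); end

    # MUX: an if statement in chip form. Per the talk's convention, sel true
    # passes b through and sel false passes a through. Both legs are always
    # computed; OR picks the winner.
    def mux(a, b, sel)
      or_(and_(sel, b), and_(not_(sel), a))
    end

    # DMUX: the opposite of an if. Route the input to one of two outputs,
    # returned here as [a, b]; again, both sides are computed in parallel.
    def dmux(input, sel)
      [and_(input, sel), and_(input, not_(sel))]
    end

    # OR8Way as a binary tree: combine pairs, then pairs of pairs, so a signal
    # only passes through three levels of gates instead of seven in a row.
    def or8way(bits)
      until bits.size == 1
        bits = bits.each_slice(2).map { |x, y| or_(x, y) }
      end
      bits.first
    end

    p mux(true, false, false)        # => true  (a passes through)
    p dmux(true, true)               # => [true, false]
    p or8way([false] * 7 + [true])   # => true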
16-bit logic. Sounds complex, but it really isn't. It's just 15 more of the same simple thing in a single chip. A 16-bit NOT would look roughly like this, and so on. Again, trivial, and an exercise for the reader. And with that, we have all the basics we need to start computing.

Now we're gonna build Boolean arithmetic. Specifically, we're gonna build adders. Not these kinds of adders. Nor him. Just good, simple arithmetic. So remember back in elementary school how you were taught to add two numbers. You start with the least significant digits. You calculate a sum and a carry. You repeat until you're done. Using two inputs to calculate a sum and a carry is called a half adder. Using two inputs and the previous carry to calculate a sum and a carry is called a full adder. Computers add numbers just like elementary school kids. The only difference is that computers use base two, not base ten.

The half adder is easy peasy. The sum is one only if exactly one of the two inputs is one, that's one or the other, so it's XOR. There's a carry only if both of the two inputs are one, and so it's AND. If you take two half adders and glue them together, you get a full adder. The sum is one if an odd number of the three inputs is one, so that's in essence a three-way XOR; we just chain them together. There's a carry if two or more of the three inputs are one; you glue those together with an OR.

So now we know how to add two bits and a carry. Let's build that up to add 16-bit numbers. So here you go. See how easy it is? You wire up 16 full adders to their 16 inputs, you give a zero to the first carry in, and each carry out becomes the carry in for the next adder, and it's chained through. But Ryan, you say, doesn't this have the same problem as OR8Way? Very smart, yes it does. What they don't teach you in grade school is that there's more than one way to add numbers together. For example, I was tutoring a fellow student in junior high on absolutely rudimentary math, and he blew me out of the water when he added numbers left to right. At the time, I'd never even thought about it. I'd never tried it, thought about it, read about it, anything. After all, my way worked.

Indeed, it's possible to calculate all the carries that each column would have in parallel, but to be able to do that, you have to rearrange things. So we go back to the full adder. You have the sum and the carry. We're only interested in the sum, because it's the carry that's causing all this delay, passing all the way through. So we remove the carry, and we wind up with what's called a partial full adder, or a PFA. It calculates a sum, plus whether that column would propagate an incoming carry, or whether it would generate a carry on its own. Then it's just a simple matter of using that to calculate all the carries, and you do that with what's called a CLA, or carry look-ahead adder. The CLA takes four propagate/generate pairs and a carry input and generates four carry outputs. You take four PFAs and one CLA, wire them together, and you have a four-bit adder. You take that one level deeper and hook four four-bit adders together with another CLA, and you have a 16-bit adder, but this time it's logarithmic depth instead of linear. That's easy, right? It kind of is, until you actually look at what a CLA is. You take all those propagate and generate bits, and you use them to calculate all the carries in parallel, and you have what I call a carry look-ahead monster. This is as clean as I can diagram it. It is all in NANDs. Sorry about that. This is as bad as it gets.
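Before looking at the carry look-ahead version, here's a rough Ruby sketch of the ripple-carry approach just described: a half adder, two of them glued into a full adder, and sixteen of those chained through their carries. Bit arrays are least-significant-bit first, and the names are mine:

    # Half adder: sum is XOR, carry is AND (on 0/1 integers).
    def half_adder(a, b)
      [a ^ b, a & b]                  # [sum, carry]
    end

    # Full adder: two half adders glued together, their carries ORed.
    def full_adder(a, b, carry_in)
      s1, c1 = half_adder(a, b)
      s2, c2 = half_adder(s1, carry_in)
      [s2, c1 | c2]
    end

    # Ripple-carry add of two 16-bit numbers given as bit arrays, LSB first.
    # Each carry out feeds the next full adder's carry in, one after another.
    def add16(xs, ys)
      carry = 0
      xs.zip(ys).map do |x, y|
        sum, carry = full_adder(x, y, carry)
        sum
      end
    end

    three = [1, 1, 0, 0] + [0] * 12   # 3, LSB first
    five  = [1, 0, 1, 0] + [0] * 12   # 5
    p add16(three, five).first(4)     # => [0, 0, 0, 1], which is 8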
So, do I go into all the crazy details about how a CLA works? This is the Scheme code that I used to actually generate the NANDs. It's not exactly readable on a slide, is it? No, not really. Honestly, it's gross. It does a very good job, but I wouldn't want to look at this code day to day. So here it is as the raw mathematical definition. This defines carry, propagate, and generate at every level. Please note, though, that it's recursive. So if we fully expand it out and generate four levels with the recursion expanded, it's written and formatted this way, which is admittedly a bit tedious, but it's very digestible. At this point it's just a matter of wiring, and I can handle that. But you don't really have to do all this extra complexity to finish the book. I just went down this rabbit hole because I found it fun, and I'm a little insane.

So, an incrementer is a thing that adds one to the input. That's it. Here's the most straightforward way to implement it. You wire up a bunch of half adders and feed the first one a hardwired one, and it does the job. This is how you'd do it if you cared about speed. It's sort of terrifying. This is from a Z80. It is apparently incredibly fast, but I honestly don't want to grok that. I decided the former was an entirely acceptable implementation, and I stuck with it instead.

Next up, the ALU, or arithmetic logic unit. This is as complex as this entire book got, well, on the hardware side. One second. It has two 16-bit inputs, X and Y, six flags to choose what operations to apply, two output flags to report on the state of the output, and the actual 16-bit output. Here's the hardware definition declaration for the inputs and outputs, and I put this here to show you how ugly it is. It is gross. Here's the implementation. It's also gross, but really I have this code in here because this part down here took me a long time to solve, and it fuckin' blew my mind when I figured it out. You don't need to understand this code. I really just wanted to show how cool it is by making it bigger. This MUX is sending parts of its result to four different places. We don't have a programming language analogy for that. We treat functions like mathematical functions. They're black boxes. You put some inputs in, you get an output out; you're pulling. This is pushing its results to four different places in parallel. Mind-explody.

This will probably make it more understandable. I find this much easier than source. The eye is able to pull out much more of the flow much more easily, as long as the layout of the diagram is good. There are two things to note here. The AND16 and the ADD16 are always calculated, every time, every clock cycle. The outputs go to multiple places, and sometimes they just go in parts. The top wire is one bit. The bottom two wires are each eight bits of the 16-bit result. I find that mind-blowing. But since it's just a chip, you think of it more like this. You only have to deal with the inputs and outputs. It's a function. The cool thing is those six little flags. They are essentially our instruction set. With those flags, you can make all sorts of results. You have constants, negations, increments, adds, and lots of stuff. Mark Armbrust, who's the most helpful participant in the Nand to Tetris forums, actually designed an ALU with just 440 NAND gates. It's kind of amazing. Once you get used to looking at this stuff, though, you can actually scan over the diagram and see a tree of ANDs, a row of NOTs, ORs, et cetera. It's not complex. It's just really busy.
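As a rough sketch of what those six flags do, here's the ALU over plain Ruby integers instead of wires. The flag behavior follows the book's spec as I understand it: zero and/or negate each input, pick add versus and, optionally negate the output, and report zero/negative status. The Ruby framing and names are mine:

    MASK = 0xFFFF   # work in 16 bits

    # zx/nx: zero then negate x; zy/ny: same for y; f: add vs and; no: negate out.
    # Returns [out, zr, ng]: the result, is-it-zero, and is-it-negative.
    def alu(x, y, zx:, nx:, zy:, ny:, f:, no:)
      x = 0 if zx
      x = ~x & MASK if nx
      y = 0 if zy
      y = ~y & MASK if ny
      out = f ? (x + y) & MASK : x & y   # f picks add versus and (in hardware both are computed and a MUX chooses)
      out = ~out & MASK if no
      [out, out.zero?, out[15] == 1]     # top bit set means negative in two's complement
    end

    p alu(5, 3, zx: false, nx: false, zy: false, ny: false, f: true, no: false)
    # => [8, false, false]               plain x + y
    p alu(5, 0, zx: false, nx: true,  zy: true, ny: true,  f: true, no: true)
    # => [6, false, false]               one flag recipe for x + 1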
Okay, now we get to the good stuff. All this calculation stuff is pretty 19th century. It's fine if all you want to do is dial some knobs, pull a lever, and calculate a trajectory to fire a cannon. That's what we built this stuff for originally anyway. So, you need memory to do something useful. Let's make some.

First, what's the difference between combinational logic and sequential? Combinational chips are very straightforward. You have some inputs that feed into some logic and spit out some outputs. Sequential chips, however, are more complex by design. They have the same inputs feeding some input logic. That feeds into its memory, and that feeds into its output logic that goes out to the output. The memory's actions are synchronized to a signal from the computer's clock.

So here's our most basic memory element, the DFF, or data flip-flop. And here's basically how it works, although this is not normal Ruby, because time is a variable here. The magic is that the output now is defined as the input from the tick before. In Nand to Tetris, they treat this as an atom. You get the DFF chip the same way you get the NAND chip, and you build everything else out of it. Not one to take things for granted, I had to poke some more, and I found out that the DFF can be created with NANDs itself. Mark, the same Mark, wrote a very nice article detailing how it works. You start with a NAND, you build up what's called an RS latch, you clock it, double it up to build an RS flip-flop, and then finally you extend that to a DFF. But that's, again, just details. You don't need it to finish the book. To me, though, they're interesting details. There's a big reason why I wanted to do this book, and that's it: I wanted to learn the stuff that I didn't learn in college.

So, you have a DFF, which outputs its previous input. But to have memory, you have to hold on to that data. You have to be able to set it when you want it to change and leave it the same when you don't. And that's where the Bit comes in. The Bit takes an extra input, the load flag. Since we know that memory is involved, it must have a DFF, so we just rename that call to DFF, and we actually have our basics already. Since this is just an if statement followed by two different calculations, it's actually really easy to translate. The if is just a MUX, the memory is just a DFF, and the awesome part is the fact that the DFF feeds back into the MUX, giving the previous result to its current input. That's really all there is to being a Bit.

But a single bit is not very useful. We want fatter values. We want numbers. So we do the same thing we did with AND16, and we go wide. And that's really all our register is: a bunch of Bits, some input, a load flag, and some output. There really isn't anything to it. It just packages it up in a more convenient form.
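Here's a rough Ruby sketch of that Bit: a MUX choosing between the incoming value and the held value, feeding a DFF whose output loops back around, with tick standing in for the clock edge (the class and method names are mine). A register is then just sixteen of these sharing one load flag.

    class Bit
      def initialize
        @dff  = false     # the stored state (the DFF's output)
        @next = false     # what the MUX selected this cycle
      end

      # The MUX: take the new input if load is set, otherwise keep the old value.
      def set(input, load)
        @next = load ? input : @dff
      end

      # The clock edge: the DFF's output becomes whatever it saw last cycle.
      def tick
        @dff = @next
      end

      def out
        @dff
      end
    end

    bit = Bit.new
    bit.set(true, true);   bit.tick
    p bit.out              # => true
    bit.set(false, false); bit.tick
    p bit.out              # => true, load was off, so it held its value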
But even a single register, while it is a very powerful concept, is rather limited. We want to address a nearly arbitrary amount of RAM. So we have many registers, and we need to provide an address for which register we want to talk to. Otherwise, it has the same structure as a register: input, output, load bit. We've just added an address for which value we want to read or write. But this is sequential logic, so we need to figure out the addressing part. Given an address, how do we route the input to the right place? How do we get the right value from the right place to the output? Well, for a RAM8, we use a DMUX 8-way to route the load bit to the right register, and a 16-bit MUX 8-way to wire the right register to the output. The input goes to all the registers; only the addressed one actually loads it. And if we want bigger, well, that's where the complexity-builds-from-simplicity thing comes back in. We're really just stacking these like Legos, building a bigger bank of memory from the previous Lego sizes, each step up, until we have a reasonable amount of memory. So, how would you build a RAM64 from a RAM8? Again, that's an exercise for the reader.

So now we've got logic and memory. We're on the home stretch. The next thing we need is a counter, the thing that remembers where we are in our program. It's responsible for providing the address of the next instruction in ROM that we're to execute. And it looks something like this. It's ugly, but if you ignore the blah blah logic and look at the bottom, where the register feeds back into the incrementer, that's the part that blows my mind. We just don't do that in programming.

So now we have all the components we need to finally build a computer. We've done so much and come so far that I'm afraid the next slide is more than a bit of a letdown. That's a computer. Yep. So, we have some memory. We have a CPU, which is mostly just an ALU and some logic. We have input and we have output. Our CPU takes three inputs and calculates four outputs. It gets a value from memory, an instruction to run, and potentially a reset flag telling it to start over at PC 0. It calculates a value to go to memory, whether to write it, where to write it, and what the next PC value is. Figuring out the CPU and all of its internals was the most rewarding task that I had throughout this entire book. I leave this as another exercise for the reader.

The ROM is where the program gets stored, and it starts running at address zero. It's just like the RAM we just covered, except that it's read-only. The keyboard is just a single word of direct memory access to the current value of keyboard input. Like the keyboard, the screen is also direct memory, but it's read-write and it's addressable, so it has an input value, an address, and a load bit. The memory chip covers the RAM, the screen, and the keyboard. Because the screen contents and the keyboard value are memory-mapped, they can be addressed via normal pointer access, which means you can do a lot of neat pointer arithmetic, walk across your screen, and do fancy pattern things with it.

So, our original abstract architecture diagram is almost more complex than the real thing. Building the computer at this point took less than a minute in source. You have three components, you wire them together, and, though it isn't shown here because there are bi-directional wirings that go all over the place, that's it. Boom. Done. It literally took me a minute to do. It was kind of a letdown. That really is all it takes to make a computer once you've built all the components.
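As one concrete piece of that, here's a rough Ruby sketch of the program counter: a register whose output loops back through an incrementer, with reset, load, and inc flags deciding the next value (the priority order follows the book's PC chip; the Ruby framing and names are mine):

    class PC
      attr_reader :out

      def initialize
        @out  = 0
        @next = 0
      end

      def set(input, load: false, inc: false, reset: false)
        @next = if reset then 0                      # jump back to address 0
                elsif load then input                # jump to an explicit address
                elsif inc  then (@out + 1) & 0xFFFF  # the incrementer feedback path
                else @out                            # hold
                end
      end

      def tick
        @out = @next   # clock edge
      end
    end

    pc = PC.new
    3.times { pc.set(0, inc: true); pc.tick }
    p pc.out           # => 3
    pc.set(0, reset: true); pc.tick
    p pc.out           # => 0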
So, let's go home. It turns out that the Apollo Guidance Computer, or AGC, is an amazing parallel to this project. Julian Simione gave an amazing talk about the Apollo program's development process earlier this year at MountainWest. If you haven't seen it, I highly recommend it, and it's on Confreaks. It turns out that the AGC was built entirely out of NOR gates. I'm really not pulling the wool over your eyes with this talk. This is entirely possible, and it did send people to the moon and back. If I didn't have these giant things and didn't feel like a complete klutz working on a breadboard, I would actually start making one of these for real, just for the fun and the challenge of it. Thank you.

Because I have four minutes of time, one more thing. MiniTest is an awesome Ruby testing framework of awesomeness. It is awesome. It is fast. It is simple. It is powerful. It is clean. It's in Ruby, and it's pluggable. minitest-bisect is a plugin that I just released a couple of weeks ago, and it isolates random test failures. Has anyone had their tests randomly fail and not known why? We wanna talk. This is called a test order dependency error, and I have a slightly antiquated video showing a previous version of this. So if we have a whole bunch of tests, and we pretend that each dot takes a couple of seconds to run, and we have a reproducible failure, we can throw minitest-bisect at it. The files phase shown here doesn't exist anymore because of the way MiniTest loads and randomizes its tests, but basically it's trying to figure out the minimal set of files it can run to reproduce this failure, simply to make the next phase faster. I've cut this out, but the idea is the same. It's figured out that there are two files needed to reproduce, and now it's trying to figure out the minimal number of methods to reproduce this, and it's done. That's all it takes to figure out the minimal reproduction for this test order dependency failure. And you can install that now with gem install minitest-bisect.

minitest-gcstats is something that Aaron and I brainstormed in the coffee line and implemented and released that day. So if your tests are slow, is it your implementation, or is it GC? Is it because your tests produce too much junk? Well, with minitest-gcstats, you can actually figure out why. You run your regular tests with a dash-G flag, and you get an output of your top tests by memory allocated. Aaron ran this against ActiveRecord and figured out that the worst test produces nearly a quarter million objects. You can get that now with gem install minitest-gcstats. Thank you.