 You might know me from a little pet project called Babble. I also work at this company called Cloudflare. I work on the UI team. You should all come work there, because we use this really awesome, terrible, terrible UI framework. Anyways, you can also follow me on Twitter at the James Kyle. So you want to know how to build a compiler, or maybe you don't, and now you're just stuck listening to me. Either way, I'm going to teach you. The format of this talk has a bit of a backstory. I got into compilers through Babble, or 6 to 5 at the time. I always understood what compilers did from a conceptual level, but I wasn't really able to grok them until the last year and a half. The reason I was able to get into them is because when I first looked at the 6 to 5 code base, it was the first time I was able to read the code of a compiler and really understand it. After Babble blew up into a massive project, I became interested why we weren't also blowing up in the number of contributors. I learned that most people are pretty scared of compilers. It's thought of as this totally unapproachable computer science-y thing that only the nerds of the nerds can understand. I realized that we need to do a better job of teaching people and making it more approachable to people. So I decided that I was going to go out and really learn everything there was to know about compilers. I jumped onto Amazon, and I ordered the essential books on compilers. And about three weeks later, I discovered why compilers were totally unapproachable. Those books are 500 pages of the most dull explanation of compilers that one could ever have written. I love compilers, and I was so bored, so bored reading them. I couldn't even come close to finishing them. The learning problem with compilers is not that they're complicated. It's that we teach that they are complicated. So what I'm going to do today is give you an over-simplified but reasonably accurate overview of what most compilers look like, and then I'm going to try and make it more real to you by walking through the code of an actual compiler that I've written. So what even is a compiler? A compiler is simply a tool that transforms source code written in a higher level language into a generally lower level target language. But let's go back to the fundamentals to explain why they exist. A computer's CPU executes millions of tiny little operations, operations so tiny that you would never want to write them out manually. And they're sort of pure form. They're written as binary. And this is known as machine code. Obviously, there is no way a programmer could understand and be productive writing code like this. So CPU architectures support a mapping of binary operations as an easier read language like this. And this is known as assembly code. It's a super low-level programming language that gets converted into binary code by a thing called an assembler. Externally, an assembler looks a lot like a compiler. It's taking one type of code and translating it into a lower level type of code. However, internally, they're much simpler than compilers. Ignoring any kind of optimization an assembler might be doing, they're really just a straight mapping of one syntax to another. However, assembly code is still too low-level to be really productive in. So the next level up is source code. And this is where the vast majority of programmers enter. These are programming languages that you are familiar with and would actually write large pieces of software in. Source code is different than assembly code. In most cases, it does not map directly to machine operations. It's meant as an abstraction to allow humans to understand what is happening and be productive in writing code. As such, an assembler wouldn't be able to translate source code down into machine operations. We need something that can understand a more complicated syntax and transform it very intelligently into optimized lower level code. And this is where a compiler comes in. Since compilers are doing complicated operations, they have been broken down into a couple different stages. These stages vary compiler to compiler, but they all fit within these three primary groups. Parsing, transformation, and code generation. Parsing is taking raw code and turning it into a more abstract representation of that same code. Transformation takes that abstract representation and manipulates it to do whatever the compiler wants it to. And code generation takes the transformed representation of the code and turns it into a new string of code. Parsing gets broken down into a couple different phases. The two biggest ones that you will see are lexical analysis and syntactic analysis. Lexical analysis is pretty simple. It takes the raw code and it just splits it up into these little things called tokens. Tokens are an array of tiny little objects that describe an isolated piece of the syntax. They could be numbers, labels, punctuation, operators, whatever. Syntactic analysis will then take these tokens and reformat them into a representation that describes each part of the syntax and their relation to one another. This is known as an intermediate representation or an abstract syntax tree. An abstract syntax tree or AST for short is a deeply nested object that represents the code in a way that is both easy to work with and tells a lot of information. Now, this might be hard to visualize so let's look at what this code would be. So we would represent tokens like this. We have an array of objects with a type of paran or name or number and each of them have the same format of a value. An abstract syntax tree gets represented like this. You have a much more descriptive format for them. You have a program with a call expression named add with parameters that could be a number literal or a nested call expression. And this is the final representation that we'll use for the rest of the compiler. The next type of stage in a compiler is transformation. Again, this just takes the AST form from the last step and makes changes to it. It can manipulate the AST in the same language or it can translate it into an entirely new language. And let's look at how we would transform an AST. You might notice that our AST has elements within it that look very similar. These are objects with a type property. Each of these is known as an AST node. These nodes have defined properties on them that describe one isolated part of the tree. We can have a node for a number literal with a value of that number or a node for a call expression with a name and the parameters. And when transforming the AST, we can manipulate nodes by adding, removing, replacing properties. We can add new nodes. We can remove nodes where we can even leave the existing AST alone and create an entirely new one based on top of it. And since for our purposes, we're gonna be targeting a completely separate language, we're gonna focus on creating an entirely new AST that is specific to the target language. In order to navigate through all of these nodes, we need to be able to reverse through them. This traversal process goes to each node in the AST depth first. So for the following AST, we would go to the top level program and see that it has a body and it was an array of nodes and we go to the first one, we go through them linearly. We had a number literal which doesn't have children. So we move on to the next one, which is a call expression which does have children and so forth. The reason I use the word visiting is because there is this pattern of how to represent operations on elements of an object structure. The basic idea here is that we're going to create a visitor object that has methods that will accept different node types. And when we traverse our AST, we will call the methods on this visitor whenever we encounter a node of the matching type. In order to make this useful, we will also pass the node and a reference to the parent node. The final phase of a compiler is code generation. Sometimes compilers will do things that overlap with transformation, but for the most part, code generation just means taking our AST and stringifying it out. Code generators work several different ways. Some compilers will reuse tokens from earlier or others will have created a new separate representation of a code so they can print it linearly. But from what I can tell, most use the same AST we just created and that's what we're gonna focus on. Effectively, our code generator will know how to print each of the different node types and it will recursively call itself to print nested nodes until it's printed one long string of code. And that's everything. That's all the major pieces of a compiler. And that's not to say that every compiler looks exactly like this, but compilers and compilers serve many different purposes so they might have a lot more steps than I've detailed here. But now you should have a general high-level idea of what most compilers look like. So you can all write your own, right? Yeah, no, of course not. When I was thinking of how to create this talk, I knew I wanted to frame things from a more comfortable perspective, but I also wanted to teach in the same way that I learned by reading actual working code. In my proposal, I declared that I would write the world's smallest compiler thinking to myself, that would be super easy. I just have a couple of requirements. It has to demonstrate what a real world compiler would do. It has to do something complicated enough to justify building a compiler. And it has to be small enough, a piece of code, that it can be explained in a conference talk. It turns out I was a little bit confident and that's really friggin' difficult. I tried all sorts of things and eventually I just gave up. There was no way I was gonna write a compiler for this talk, it just wasn't a practical idea. A few months passed by and now I'm learning about list languages. I play around with a couple different languages and I discovered how powerful their compilers are. Interested, I started looking into how list languages get compiled because their syntax is very different than the C-like language that you're all familiar with. And then I realized that list was the perfect syntax for this compiler. It's a super simple syntax to parse and it's really easy to translate into a new syntax. I played around with the idea and after only a couple of hours I had a viable candidate for a conference talk. The entire source code is around 250 lines of code. If you're unfamiliar with lists, it looks something like this. We basically just put the parens on the outside. So in JavaScript it's that, that, that, that, that. And then you could do a mathematical operation like that, just that, that, that. And the cool thing that Yehuda pointed out to me was that handlebars, he pointed out that handlebars syntax is actually a type of lisp. I didn't even, like, I didn't even incurred to me, but it is. And so yeah, that's what we're gonna do. We're gonna take a lisp syntax and we're gonna translate it into a C-like syntax with JavaScript that you're all familiar with. So here it is, the code of our compiler. We're gonna start off with our first phase of parsing, lexical analysis, of this thing called the tokenizer. We're gonna take our string of code and we're gonna break it down into an array of tokens. We'll start by accepting an input string of code and we're gonna set up two things, a current variable for tracking our position in that string like a cursor. And then we're gonna set up a tokens array for pushing our tokens to. We start by creating a while loop where we are setting up our current variable to be incremented as much as we want inside of the loop. We do this because we want to increment current many times inside of the while loop because our tokens can be any length. It's not one loop per character. We're also gonna store the current character in the input. The first thing that we want to do is check for an open parentheses. This will later be used for call expressions, but for now we only care about the character. We check to see if we have an open parentheses. If we do, we push a new token with the type paren and set the value to an open parentheses. Then we increment current and we continue on to the next cycle of the loop. Next, we're gonna check for a closing parentheses and we'll do the same exact thing as before. Check exists, push a new token, increment current, continue. Moving on, we're now gonna check for white space. This is interesting because we care that white space exists to separate different characters, but it isn't actually important for us to store as a token. In fact, we would only just throw it out later on. So here, we're just gonna test for existence and if it does exist, we're just gonna continue on incrementing current. The next type of token is a number. This is different than we've seen before because a number could be any number of characters long and we want to capture this entire sequence of characters as a single token. So we start this off when we encounter the first number in the sequence. We're going to create a value string that we're gonna push characters to and then we're gonna loop through each character in the sequence until we encounter a character that is not a number, pushing each character that is a number to our value in incrementing current as we go. After that, we push our number token to the token's array and we continue on. The last type of token will be a name token. This is a sequence of letters instead of numbers. They are the names of the function, the call expressions in our list syntax. Again, we're just gonna loop through all of the letters pushing them to a value just like we did with numbers and we push the value as a token of the type name and continuing. Finally, if we have not matched any character by now, we're just gonna freak out and throw an error. The end of the tokenizer, we just simply return the token's array and that's it, that's the entire tokenizer. So now is the parser. Here, we're gonna take this token's array that we've generated and we're gonna reformat it as an AST, like that. Okay, so we define a parser function that accepts our array of tokens. Again, we keep a current variable that we use as a cursor but this time, we're gonna use recursion instead of a while loop. So we define a walk function. Inside of the walk function, we're gonna start by grabbing the current token. We're gonna split each type of token off into a different code path starting off with the number tokens. We test to see if we do, in fact, have a number token. If we have one, we'll increment current and we will return a new AST node called number literal and we'll set its value to the value of our token. Next, we're gonna look for call expressions. We'll start this off when we encounter an open parentheses. We'll increment current to skip that open parentheses since we don't care about it in our AST. We create an empty node with the type expression, a call expression and we're gonna set the name as the current token's value since the next token after open parentheses is always gonna be the name of that function. We increment current again to skip the name of the function and now we want to loop through each token that will be the params of our call expression until we encounter a closing parentheses. So, we create a while loop and we continue it until it encounters a token with a type of paren with a value of a closing parentheses. Now, this is where recursion comes in. Instead of trying to parse a potentially infinite number of nested sets of nodes, we're gonna rely on recursion to resolve everything inside of it. To explain this, let's look at our list code. You can see that the parameters of add are a number and a nested call expression that includes its own numbers. You'll also notice that the tokens array that we have, you can actually see it, but if you look at the code, you can see multiple closing parentheses and we have to make sure that they match. So, we're gonna rely on the nested walk function to increment our current variable past the nested closing parentheses. So, back to the code, we'll call the walk function which will return another node and we'll push it to the parameters of our node. Finally, we will increment the current one last time to skip the closing parentheses and we'll return the node. Again, if we haven't recognized the token type by now, we're gonna freak out again and throw a syntax error. Now, we're going to create an AST which will be a program node. And you'll notice that it has a body which have effectively statements of that program. And we're gonna keep calling our walk function in a loop until we've run out of tokens and just keep pushing those as statements into our AST body. And the reason that we're doing this inside of a loop, as you can see, is because they can be next to each other rather than nested. At the end of the parser, we will return the AST. Next up, the traverser. So now we have our AST and we want to be able to visit all the different nodes in that AST with a visitor. We need to be able to call the methods on this visitor whenever we encounter a node with a matching type. So we define a traverser function which accepts an AST and a visitor. And inside, we're gonna define two functions. A traverse array function that will allow us to iterate over the array and call our next function. Traverse node. Traverse node will accept a node and its parent node so that I can pass both to our visitor methods. We'll start by testing the existence of a method with a matching type. And if it does exist, we will call the node and its parent. Next, we're gonna split things up again by the current node type. We'll start with our top level program. Since program nodes have a property named body that has an array of nodes, we'll call traverse array and traverse down into them. Remember, traverse array will in turn call traverse node through causing the tree to be traversed recursively. A lot of words. Next, we'll do the same for call expressions and traverse their params. In the case of number literals, we don't have any child nodes to visit, so we'll just break. And again, if we haven't recognized the node type, we'll throw an error. Finally, we kickstart the entire process by calling traverse node with our AST with no parent because there is no parent. Next up, the transformer. Michael Bay doesn't go in here. Our traverser, our transformer is going to take our AST that we have built and pass it to our traverser function with a visitor and it will create a new AST. This new AST is designed for JavaScript. In fact, it's actually the Babel AST that's right there. So we have our transformer function which will accept the Lisp AST. We'll create a new AST which like our previous AST will have a program node with a body that has an array. Next, I'm gonna cheat a little and create a bit of a hack. We're gonna use a property named context on our parent nodes so that we can keep our traverser flat. Inside of the methods of the visitor, we're just gonna keep pushing nodes to that context and you can see it's basically just mapping the old AST to the new one. Normally, you would create a better abstraction for this but for our purposes, this keeps things simple and like easy to read. So we'll start by calling the traverser function with our AST and a visitor. The first visitor method accepts number literals. We'll create a new node also named number literal and we'll push it to the parent context. Next up, call expressions. We'll start by creating a new node call expression with a nested identifier that is the name of the function. Next, we're going to define a new context on the original call expression node that will reference the expression's arguments so that we can keep pushing arguments to it. Then we're gonna check if the parent node is a call expression. If it's not, we're going to wrap our call expression with an expression statement. We do this because the top level call expressions in JavaScript are actually statements. Last, we push our possibly wrapped call expression to the parent node's context and continue to build up our AST. At the end of our transformer function, we'll return that new AST that we've just created. Now let's move on to our last face, the code generator. Our code generator is going to recursively call itself to print each node in the tree into one giant string. We'll break things down by node type. And if we have a program node, we'll map through each node in the body and we'll run them through the code generator, joining them with a new line. For expression statements, we'll call the code generator on the nested expression and we'll add a semicolon because we like to code the correct way. For call expressions, we will print the callee, which again was that identifier of the name of the function, add an open parentheses, we'll map through each node in the arguments array and run them through the code generator, joining them with a comma, and then we'll add a closing parentheses. For identifiers, we just wanna return the node's name, which is a string. And for number literals, we just wanna return the node's value, which is also a string. Again, haven't recognized the node, freak out. Finally, we have our compiler function. Here, we're gonna link together each part of the pipeline. We'll take the input and pass it to the tokenizer and create our tokens. We'll take the tokens and pass it to our parser and create an AST. We'll take the AST and pass it to the transformer to create our new AST. We'll take the new AST and pass it to the code generator which will create our output string and we'll finally return our output. And that's it. That is all of the code in our compiler. And, but let me prove that it works to you. The tokenizer, actually I need to move up to that screen. The tokenizer will take our string and generate tokens and if I can match my cursor, boom, it does. Let me prove it again and boom, it does. The parser will take our tokens and generate a new AST and boom, it does. It does, yeah, there we go. The transformer will take our generated AST and turn it into a new AST. The code generator will take the new AST and generate the output and it does. Finally, from beginning to end, running it through all of them, we will run it through the tokenizer, the parser, the traverser, the transformer, the code generator and it generates our code. And that is it.