So, my name is Nick. I'm a web platform engineer and horse mask enthusiast at Opower. Sometimes we write software, but we also have a large collection of horse masks, and it's my responsibility to maintain them all. Not really; I just have that one in my desk. Opower is a company that uses software and behavioral prodding to try to get people to reduce their energy use. So it's nice being out here in Boulder, Colorado. I'm from DC, which is a little more urban, so to actually see the nature that we're trying to save by fighting climate change is kind of cool. So far we've prevented six million tons of CO2 from entering the atmosphere, which is equivalent to taking a million cars off the road for a year.

But that's the last thing I'll say about CO2. Instead, I'm going to tell you how you can make your own programming language. First we'll talk about how you take some text that you typed into your editor and make the computer actually run it. Then we'll talk about the industry-standard tools you would use to do this. You might say, well, why not just go straight to those tools? I have tried that. It does not work very well. They get very tricky to use if you don't understand what's going on under the hood. So that's why I'm going to spend the bulk of the time explaining that to you, and then we'll look at the tools at the end. Feel free to ask questions at any time.

So, first off, why do we care about any of this?
Well, we can't assume that the set of languages we have now is going to be perfect forever. Sometimes we have one language that we like and we want to translate it into something else. So we have GWT, which is Java to JavaScript; FunScript, which is F# to JavaScript (anything involving F# has to be fun); and PyScript, which is Python to JavaScript.

Another reason might be that you want to make a DSL. Say you have some game, and maybe your character designer is not a programmer, but you want to give them a high-level language to say something like: when the character walks in, come up and hit them with the axe, or whatever you do in games. You can make a DSL for them to express these ideas so that they're not coming to you with every request. It's also just fun, and I think it makes you a better programmer to know how this stuff works under the covers.

But the big reason that we see new languages is that something about the way we build programs has changed and existing languages just aren't cutting it anymore. When computers got powerful enough that we didn't need to optimize every instruction by hand, C was invented. Of course, I say "we" as if I'm old enough to have been writing machine code directly, but anyway. Then, when people decided that building real-time distributed systems would be a good idea, it turned out existing languages made that pretty difficult, and so Erlang was built specifically to address some of those problems. Or CSS, for instance, was originally designed so that you could have a little bit of styling on the static homepage for your dog, and now it's controlling the look and feel of massive, complicated web applications. CSS just doesn't really scale for that, and so Sass was invented.

So, for the purposes of our conversation about how to make your own programming language, I've made my own programming language called LambdaScript. It does not solve any big problem in computing, but it is simple enough that I can fit the grammar on one
slide, and that was the main criterion. It also has an actual lambda in the language as the function keyword, which is the kind of thing you get to do when you're the one making the language, and which also emphasizes how impractical this is.

So this is the identity function. Hopefully this looks pretty similar to things you've seen before. We're declaring a function named f. It takes a single argument named x, and then we just return that. This is what function application looks like: we're applying f to x, and then we're applying the print function to that.

And then this is the extent of the LambdaScript I've written. It's an is-palindrome function, and then I'm applying it and printing the result to see what it does. You can see we have this little syntax here where you have a regular expression inside something like an accessor expression. What this evaluates to is: apply that regular expression to the string that this variable contains, and the whole result is the first match group. I don't know if that's actually a useful thing to have at a native level in a programming language, but when you're the one making it you can experiment and see whether it's useful or not.

You'll also notice there are a lot of these brackets here to denote precedence. This is the result of growing the language organically and running into these things, and I'll tell you a little later why I have so many of those. But one thing I'd recommend if you were doing this is to write out a lot of code in your sample language before you start to actually compile it, and just see: hey, does this actually make sense?
So we have our LambdaScript and we want to make it do something useful. If you're looking at this, you might be starting to see some structure and figure out in your head how you could execute it. But to the computer it just looks like this: a stream of text with no inherent structure or meaning. And we want to make it look like JavaScript. Of course, this could be anything that the computer can execute directly. I chose JavaScript because I understand it better than machine code and because JavaScript is basically ubiquitous at this point. You can run it on servers, on drones, even in browsers. So if our language compiles to JavaScript, we already have a pretty good install base.

I implemented this in F#. I chose it because it has a great type system and it makes it very easy to compose functions, and the compilation is essentially just a pipeline of various functions strung together. So if F# makes that easy, then I feel pretty good about that. It also has very simple unit tests, and something like this, where it's easy to set everything up as pure functions, really lends itself to unit testing. And finally, I just can't get enough of Hindley-Milner type inference, so any chance that I had to work with that I was going to take.

So this is the pipeline. It's a little bit simplified, but we're going to go through each step and explain what's happening. There's the F# that I wrote to implement this up top, and you can see what I'm talking about with the nice chaining of functions. We have the front end and the back end. The front end is the stuff that you as the programmer would see: the syntax and the semantics, how the program looks. The back end is the generation. You can envision a modular architecture where you could swap them out. So you could say, okay:
Well, today LambdaScript is targeting JavaScript, but maybe we swap that out for a back end that targets the JVM or the CLR or something else.

So the first step is tokenizing. We take one big string and we need to split it into atomic units of meaning, the smallest possible things that are interesting. As the language designer there are several decisions that you have to make at this point. Just some examples: Is whitespace significant? Does your tokenizer just strip out whitespace, or does it emit whitespace tokens to be looked at later? How are string literals and comments delimited? If you have a comment, does it run to the end of the line? Can you have open and close comment markers? Things like that.

So this is my tokenize function. You can see the test in the upper right-hand corner. We pass in a list of strings where each string is a line in the input file, and we get out a list of strings where each string is a token. The way we do that is we have this regex for splitting the strings into smaller strings; we map that over our input and flatten it all out.

Now, this actually has a bug in it. Can anyone see what the problem with this approach is? I worked with this code for a while before figuring it out, so don't feel bad if you don't see it immediately. But does anyone see anything? We're just doing a split, so here, for instance, any time that we see a space we're going to split into two separate tokens. But if you have a string literal that has a space in the middle of it, well, congratulations, you now have two separate tokens: one opening the string and one closing the string. And that's not really what you want.
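To make the bug concrete, here is a minimal Python sketch (the talk's implementation is in F#, and its actual regex isn't shown in the transcript). It assumes the split is on whitespace and that string literals are delimited by double quotes; both function names are made up for illustration. The second function shows one possible way around the problem: scan character by character and remember whether we're inside a string.

```python
import re

def tokenize_by_split(lines):
    """Naive tokenizer: split every line on whitespace and flatten the result."""
    return [tok for line in lines for tok in re.split(r"\s+", line) if tok]

def tokenize_by_scanning(lines):
    """Consume input one character at a time, tracking whether we are in a string.

    Assumption: strings are delimited by double quotes and cannot span lines.
    An unterminated string is simply emitted as-is at the end of its line.
    """
    tokens = []
    for line in lines:
        current = ""
        in_string = False
        for ch in line:
            if in_string:
                current += ch
                if ch == '"':            # closing quote ends the literal
                    tokens.append(current)
                    current = ""
                    in_string = False
            elif ch == '"':              # opening quote starts a literal
                if current:
                    tokens.append(current)
                current = ch
                in_string = True
            elif ch.isspace():           # whitespace ends the current token
                if current:
                    tokens.append(current)
                current = ""
            else:
                current += ch
        if current:
            tokens.append(current)
    return tokens

source = ['print "race car"']
print(tokenize_by_split(source))     # ['print', '"race', 'car"']
print(tokenize_by_scanning(source))  # ['print', '"race car"']
```

The split version breaks the literal in two; the scanning version keeps it whole because the space is only treated as a separator when we're not inside a string.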
So a better approach here is not to split at all, but to progressively consume input and know what state you're in. If you're in the middle of a string, then you know you're just going to keep adding to that string until you find a close quote; likewise if you're in the middle of a comment, and so on. That would have gotten around this issue.

Next we do lexical analysis. If we're using natural language as an analogy, this would essentially be asking what part of speech each token is. You can see some of these are just constants, like lambda, and others are parameterized: func name has a payload of f, and arg name has a payload of x. The reason we need that is that it's not much help knowing the programmer is trying to declare a function if we don't know what the name of that function is. So we'll use those later on when we're actually generating the program.

There are a few other types of decisions that we'd make here. For example, what characters are allowed in variable names? Maybe you don't allow numbers, or maybe there are certain symbols that you can't have, and then you would emit an error instead of emitting a variable-name lexical symbol. Are there any reserved words? Maybe there's a situation where you can have the word "if" and it's syntactically obvious that it's a variable name and not a control flow structure, but as a language designer you might just say, no, you can't have a variable named "if"; that's just too bad. And then other things: how do you write a decimal literal? How do you write a string literal?
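The lexing step described above can be sketched in a few lines of Python. This is only an illustration, not the talk's F# code: the symbol kinds, the `RESERVED` set, the identifier rule, and the use of a single generic `IDENTIFIER` kind (instead of the separate func-name and arg-name symbols on the slide) are all assumptions made here.

```python
import re
from typing import NamedTuple, Optional

class Symbol(NamedTuple):
    kind: str
    payload: Optional[str] = None   # only parameterized symbols carry a payload

class LexError(Exception):
    """Raised when a token cannot be turned into a lexical symbol."""

# Hypothetical design decisions for this sketch:
RESERVED = {"if"}                                   # may never be a name
FIXED = {"λ": "LAMBDA", ".": "DOT", "(": "LPAREN", ")": "RPAREN"}
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")  # no leading digit
STRING = re.compile(r'"([^"]*)"')

def lex(tokens):
    """Map each token string to a Symbol, or raise LexError."""
    symbols = []
    for tok in tokens:
        if tok in FIXED:                            # straight pattern match
            symbols.append(Symbol(FIXED[tok]))
        elif tok in RESERVED:
            raise LexError(f"{tok!r} is reserved and cannot be used as a name")
        elif (m := STRING.fullmatch(tok)):          # regex lookup, keep the value
            symbols.append(Symbol("STRING", m.group(1)))
        elif IDENTIFIER.fullmatch(tok):
            symbols.append(Symbol("IDENTIFIER", tok))
        else:
            raise LexError(f"unrecognized token {tok!r}")
    return symbols

print(lex(["λ", "f", "x", ".", "x"]))
```

Constant symbols come from a plain table lookup, while strings and identifiers go through a regex and keep the matched text as their payload, which is the part a later stage needs in order to know what the function or variable is actually called.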
Questions like these, about names, reserved words, and literals, are all things you would have to design at this point.

So this is the core of the lexing function, and again you can see a test where we take in a list of strings and emit a list of lexical symbols, some of which are parameterized with those inputs. With the mapping, the way we're implementing it, most of the time it's just a straight pattern match: okay, if you see the lambda, then you're always just going to emit the lambda symbol. Sometimes we're doing something a little more clever, where we do a regex lookup, get the value out, and parameterize the symbol with it.

After we do that, we need to start parsing the program to understand how all these parts relate to each other. That produces what's called the CST, or concrete syntax tree. We'll get to ASTs a little later. Here you have some decisions like: What are the allowable arguments to a function? What can you pass in? How do you apply a function to a value? In some languages you have parentheses; in others you just put the function name before the value, et cetera. What is the precedence of different operators? If you have a plus b times c, how does that actually get evaluated?

In order to do this parsing we're going to need a grammar. I would recommend this book if you're interested in this stuff, because it is quite useful. What Mr. Parsons wrote in this book is that a grammar is a finite set of rules for generating an infinite set of sentences. So here is our finite set of rules. This is the entire grammar for LambdaScript in pseudocode, and the thing that lets you generate an infinite set of sentences is the mutual recursion. Do you have a question? Okay.

So, for instance, you can see an expression can be a function declaration or any of these other things, and a function declaration is a lambda, followed by a function name, followed by an argument name, followed by a func dot, followed by another expression, which brings us back up to here.
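As a rough illustration of how a finite rule set yields unboundedly many sentences, here is a Python sketch of just the two rules spelled out above, written as plain data. The full LambdaScript grammar is on the slide and not in the transcript, so the `IDENTIFIER` alternative and all the symbol spellings here are assumptions.

```python
# Each nonterminal maps to a list of alternatives; each alternative is a
# sequence of terminals (UPPER_CASE) and nonterminals (CamelCase).
GRAMMAR = {
    "Expression": [
        ["FuncDecl"],
        ["IDENTIFIER"],          # assumed stand-in for "any of these other things"
    ],
    "FuncDecl": [
        ["LAMBDA", "FUNC_NAME", "ARG_NAME", "FUNC_DOT", "Expression"],
    ],
}

def nested_declaration(depth):
    """Expand Expression -> FuncDecl `depth` times, then end with IDENTIFIER.

    FuncDecl ends in Expression and Expression can be a FuncDecl, so the
    two rules refer to each other and `depth` can be as large as we like.
    """
    if depth == 0:
        return ["IDENTIFIER"]
    lam, name, arg, dot, _expression = GRAMMAR["FuncDecl"][0]
    return [lam, name, arg, dot] + nested_declaration(depth - 1)

for depth in range(3):
    print(" ".join(nested_declaration(depth)))
```

Two rules are enough to produce a valid sentence of every length 1, 5, 9, 13, and so on, which is the "infinite set of sentences" from the quote.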
I'm really good at drawing with this pen. Because of that recursion, we can create arbitrarily complicated structures even though these rules all fit on one slide.

Let's take a brief tangent to talk about grammars in general. You may have seen this, or even tried to do it yourself, where someone says: I need to parse some HTML; I know, I'll use regular expressions. This is one of my favorite Stack Overflow questions of all time because of the response that it elicited. The answerer totally deserves all the thousands of upvotes he got for coming up with that. The reason that trying to use regular expressions summons Zalgo is the Chomsky hierarchy. Regular expressions are designed to parse regular languages, and HTML is a context-free language. The higher you go in this hierarchy, the more complicated the grammar and the more complicated the parsers have to be.

That's true: there are a lot of extensions that various languages have put in that take them beyond the pure definition of a regular expression. Regardless, it's still not good enough to do HTML. That's why these distinctions are important: if you have a mismatch between the complexity of your grammar and the power of your parser, it's just not going to work.

The grammars we write for most computer languages are context-free, and what that means is that they're not context-sensitive. So you can't do something like enforce, on a grammatical level, that you must declare a variable before you use it. You have to come back later and inspect the AST to check that.

So here's the grammar for LambdaScript in F#. Don't worry if you can't read all this, because we're going to zoom in on just the function declaration part, the identity function that we've been working with. What this says is: if I have this sequence of things on the left-hand side, then I'm going to emit a function declaration with those things as the children, and this allows us to
build up this tree. The way that we do this is with an algorithm called bottom-up parsing. There are different algorithms, and different algorithms can get tripped up on different things. This is the one I implemented because it's fairly easy to explain and it's conceptually similar to what most industry-standard tools do.

The way bottom-up parsing works is you have a parse stack and you have input that you haven't looked at yet. At each step you look at what's on your parse stack and you try to find a matching rule. If you find one, that's great: you put the result on the stack. If you don't, then you take more input from the unconsidered input. So lambda by itself doesn't match anything. Lambda followed by a function name doesn't match anything. So we just keep going until we get to here.

Now this sequence matches what I was showing you earlier in the grammar, so we can rewrite it as a function declaration. Then there's that other rule I showed you, that a function declaration is an expression, so we rewrite it as one. Now we don't have any way to reduce this further, and we're out of input, which is what we would reach for if we couldn't find any matching rules. So we need to figure out whether this is an acceptable end state to be in or not, and in our grammar, yes, expression is a valid top-level place to be. If it wasn't, we would throw a parser error at this point.

If you have ever had the experience of writing out a function and then forgetting the last half of it, say you forget the closing bracket, and the compiler throws an error, this is what's happening. It gets to the end of your input.
It's not in an acceptable end state and it doesn't have anything new to read.

So this is just that algorithm in F#, and it's exactly what I told you. We start over here by trying to find something we can reduce, a rule that matches. If we find such a rule, that's great: we take the things that matched off the stack, replace them with the element they reduce to, and move along. If we didn't find one, well, then we need to take some more input. If we have no more input, then we check whether we're in an acceptable end state or not. If we are, then we move on, and if we're not, then we return None.

This is another thing that I've figured out in retrospect: it's actually not great to just return None if there's a parser error, because when you string this whole thing together and you put in some code and you get None out, you have no visibility into which step rejected it. So if I were setting this up again, each phase that can fail would have its own type of error, ideally parameterized with some information about why it's failing.
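Here is a minimal Python sketch of the reduce-or-take-more-input loop just described, with the retrospective advice applied: instead of returning None it raises a dedicated `ParseError` that carries the stack contents. It is not the talk's F# code. The rule table covers only the function-declaration example, the `IDENTIFIER` rule is an assumption, and it uses a loop where the talk's version recurses.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    kind: str
    payload: Optional[str] = None
    children: list = field(default_factory=list)

class ParseError(Exception):
    """Parser-specific error that records what was on the stack when we gave up."""
    def __init__(self, message, stack_kinds):
        super().__init__(f"{message}: {stack_kinds}")
        self.stack_kinds = stack_kinds

# (right-hand side, left-hand side) pairs; assumed subset of the grammar.
RULES = [
    (("LAMBDA", "FUNC_NAME", "ARG_NAME", "FUNC_DOT", "Expression"), "FuncDecl"),
    (("FuncDecl",), "Expression"),
    (("IDENTIFIER",), "Expression"),
]

def find_reduction(stack):
    """Return (length, lhs) of the first rule matching the top of the stack."""
    kinds = tuple(node.kind for node in stack)
    for rhs, lhs in RULES:
        if len(rhs) <= len(kinds) and kinds[-len(rhs):] == rhs:
            return len(rhs), lhs
    return None

def parse(tokens):
    """Bottom-up parse of (kind, payload) pairs into a concrete syntax tree."""
    stack, remaining = [], list(tokens)
    while True:
        match = find_reduction(stack)
        if match:                                   # reduce
            length, lhs = match
            matched = stack[-length:]
            stack = stack[:-length] + [Node(lhs, children=matched)]
        elif remaining:                             # take more input
            kind, payload = remaining.pop(0)
            stack.append(Node(kind, payload))
        elif len(stack) == 1 and stack[0].kind == "Expression":
            return stack[0]                         # acceptable end state
        else:
            raise ParseError("input ended in a non-accepting state",
                             [node.kind for node in stack])

identity = [("LAMBDA", None), ("FUNC_NAME", "f"), ("ARG_NAME", "x"),
            ("FUNC_DOT", None), ("IDENTIFIER", "x")]
print(parse(identity).children[0].kind)   # FuncDecl
```

Calling `parse(identity[:4])` simulates forgetting the last half of the function: the input runs out with four unreduced symbols on the stack, and the error says so instead of silently producing None. Always reducing the first matching rule is fine for this tiny rule set; a real grammar needs a smarter choice between reducing and reading more input.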
Having that information makes failures a lot easier to track down. And then, if we do have more input, we just recursively keep going.

So we have our concrete syntax tree and we'd like to convert that to an abstract syntax tree. You can see that the concrete syntax tree has some extraneous stuff, like the lambda keyword and the func dot. What we really care about is just, semantically, what the program is trying to do, and that's why we generate this function declaration tree node here. Another benefit of this is that you might have a language where there are multiple syntaxes for declaring functions, for whatever reason, but in this process we unify them all into the higher-level idea: okay, the programmer is declaring a function, and it doesn't matter specifically how they're doing it.

With the AST there are all sorts of interesting things you can do, and there are some tools that just generate ASTs and don't generate the rest of the code because of all these things. Type checking, for instance, is a thing you do at this phase. Syntax highlighting: you may have had an editor that still highlights words like true or false in your code even when they're inside a string literal or a comment.
That's because they're using regular expressions and not an AST to syntax-highlight your code. Optimizations: if you see the same side-effect-free expression being repeated over and over again in a loop, you can move it outside of the loop and reuse its value, things like that. Linting: you can check for common errors that the programmer might make and warn them about it. Checkstyle: if your AST has line and column information attached to it, you can have some style guide and figure out whether the code fits it. Code completion: when you're in your IDE, this is one way that it knows what your potential options are for the next thing you want to type. Automated refactoring: all those cool tools that will rename things for you. And then minification, which is really just large-scale automated refactoring that's trying to make the code as small as possible, or as hard to read as possible, instead of making it easier for you to program with.

After we've done all the things that we want to do there, we turn to the generation side. First we're going to change our LambdaScript AST into a JavaScript AST. You don't have to read that; just know that there's a standard format for JavaScript ASTs and it's easy to write a function that outputs them. Here's an example test: my AST is just the literal with a string "hi", and this is the JavaScript AST that gets generated for that.

From there we want to produce actual JavaScript. The way we do that is using this neat Node module called escodegen, where you give it an AST and it spits out JavaScript, and we're using Edge.js to interop between .NET and JavaScript. As an aside, I was shocked that this worked the first time that I tried it, but it did, so that's kind of cool.

So now I want to actually demo this for you. We'll see how the multi-monitor thing works. Off to a good start.
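As an aside on the generation step described above, here is a small Python sketch of turning a language-specific AST into the standard JavaScript AST shape (ESTree, the format escodegen consumes). The three AST classes are hypothetical stand-ins for LambdaScript's real node types, which the transcript doesn't show; the dictionary layouts follow the ESTree `Literal`, `Identifier`, `FunctionDeclaration`, `ExpressionStatement`, and `Program` nodes.

```python
import json
from dataclasses import dataclass

# Hypothetical LambdaScript AST nodes, invented for this sketch.
@dataclass
class Literal:
    value: str

@dataclass
class Identifier:
    name: str

@dataclass
class FuncDecl:
    name: str
    arg: str
    body: object

def to_estree(node):
    """Translate one AST node into an ESTree-style dictionary."""
    if isinstance(node, Literal):
        return {"type": "Literal", "value": node.value}
    if isinstance(node, Identifier):
        return {"type": "Identifier", "name": node.name}
    if isinstance(node, FuncDecl):
        return {
            "type": "FunctionDeclaration",
            "id": {"type": "Identifier", "name": node.name},
            "params": [{"type": "Identifier", "name": node.arg}],
            "body": {
                "type": "BlockStatement",
                # The function's single expression becomes its return value.
                "body": [{"type": "ReturnStatement",
                          "argument": to_estree(node.body)}],
            },
        }
    raise TypeError(f"no JavaScript translation for {node!r}")

def to_program(node):
    """Wrap a translated node in a Program, as a code generator expects."""
    tree = to_estree(node)
    if tree["type"] != "FunctionDeclaration":   # bare expressions need a statement
        tree = {"type": "ExpressionStatement", "expression": tree}
    return {"type": "Program", "body": [tree]}

print(json.dumps(to_program(Literal("hi")), indent=2))
print(json.dumps(to_program(FuncDecl("f", "x", Identifier("x"))), indent=2))
```

The resulting JSON is the kind of structure you would hand to a generator like escodegen to get JavaScript source text back.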
Okay, I'm going to spend 15 seconds messing with this. Nice, there we go. So this is my Windows 10 VM that I developed this in. Maybe... oh my God. Yeah, yeah, I think you're right. Let's try that. Cool.

Okay, so we just ran some LambdaScript, and I will show you what happened. Unfortunately I have to come over here to see it. There we go. You can see the different phases that it reports; it's cranking through everything we just talked about. And then you can see the JavaScript that was produced, which is not pretty, it's very ugly, but it works, as you can see right here. Then we run the JavaScript, and you can see we emit true and false. That's because over here we were printing is-palindrome of "race car", which is true, and then is-palindrome of "not a palindrome", which is false.

I can edit this. Now we'll see that it's false both times, just to show you that I'm not cheating and it actually did run that. So that was my quick demo.

All right, so how do you actually get started doing this stuff?
There are some tools you can use. Lex and Yacc are the two big ones. There's also Bison, and then there's a JavaScript port of that called Jison. The key insight with these is that the grammar is totally separate from the actual parser. You can see here on the left I have my grammar, matching those different sequences, and on the right I have the parser, and that code is totally separate. So someone said, I can build a compiler compiler, which is what Yacc stands for: Yet Another Compiler-Compiler. If I just have a way for people to give me the grammar and answer a few questions about how the program is structured, I can do the rest of it for you.

So this is a sample input to Lex for LambdaScript. The way the syntax works is that on the left-hand side you give it the string, and on the right-hand side you emit some symbol. On the second line here we actually have a regular expression, and we have a magic variable called yylval that you just assign to. We also have a magic variable called yytext, which is the result of the match. This indexer-like syntax is sort of cheating.
I didn't want to write out what it would really be. And then we return the literal token, and so this is how you parameterize the literal with the value that we're matching.

Then for Yacc, we declare some tokens and then we have our grammar, which looks pretty similar to the pseudocode I had earlier. Now, this Yacc program is not going to do anything other than tell us true or false: this input matches or it doesn't. If we want to get something else out of it, like a compiled program, then we need to add some semantic results to these rules. So we say, for instance, when you see this sequence of tokens, then you're going to emit a function declaration, again with the magic variable you assign to. So we're saying function declaration, where dollar sign one is the func name, two is the arg name, and four is the expression. Then over here we define what those functions are, and over here we say, once you have all those things, how do you actually output some code?

So that's it for how you make your own programming language. Obviously there's more to know, but I hope you're now able to get started. I'm excited to see what cool new functional programming languages you make. Thank you.