way Sue, that actually is how it's spelled, all the H's are silent, so on Twitter you can find me by so many H's. So today I'm going to talk to you about parsing. For those of you who saw Aaron Patterson's talk this morning on the Ruby virtual machine, my talk covers all the boring steps that his talk skipped over, so if you still feel like you want to skip over those steps, now's your chance. A little explanation about the title of this talk: I probably first started thinking about parsing when I was teaching English as a foreign language, and I had to explain why so many words in English, like "flies," have different meanings, and depending on which meaning the word has, the sentence can go in really different directions. So when I started looking into how computer languages are parsed, I noticed that there are actually a lot of similarities to the way humans parse sentences in the languages they hear and read every day. I am a relative newcomer to the programming world, so I never learned about parsers or compilers in school or anything, which is why I have an alt title for this talk: "How I accidentally a computer science, and so can you." Of course, this isn't going to be a comprehensive survey of parsers, but I just wanted to share how I came to learn about them as someone who doesn't have a CS degree, because I had no idea what they were and I sort of ran into them by accident. I had been building Rails applications for a while, and one day I got tired of just blindly trusting all that Rails magic, and I wanted to see how things really worked under the hood. Specifically, I was trying to figure out how routing worked, so I went to GitHub and started poking around the Rails source code, and I came across a file called parser.rb. I was like, cool, I'd done an exercise sort of like this before, so I thought I'd probably know what the code would look like. But then I looked at the file, and it was nothing like what I expected.
It was just a bunch of arrays with numbers. Where was the logic? Who even wrote this stuff? So I was like, okay, that's not enough. Looking closer, I saw that the previous file was actually generated from a different file called parser.y, so I went and looked at that. It still didn't make any sense, and it barely looked like any Ruby code that I recognized, there were semicolons, like, what? So that's when I decided to figure out what all of this meant. To get warmed up to the idea of parsing, let's play a game. I hope that most of you have played it before, but for those of you who may not be familiar, this is the word game where one person asks for certain kinds of words, such as a noun or an adjective, to make sentences. So for example, I need a verb, anyone. Did I hear drink? Sure, drink, great, awesome. The young man drank, awesome. So that's a valid sentence, right? If you can remember your middle school grammar lessons, we can diagram the sentence into a subject and a verb, where the verb is drank, and since all sentences need a subject to actually do the verb, we know that the subject is the young man. Makes sense so far, right? But let's see if there are other possible sentences that could have started with the young man. For example, the young man did something. So here I need a verb and a noun. We can stick with our original verb, drank. Any suggestions for a noun? Beer. Wow, my crowd. So: the young man drank beer. Also grammatical. This time, instead of just subject plus verb, we have subject plus verb plus object, which again abides by the rules of English grammar, if we interpret the young man as the subject, drank as the verb, and beer as the direct object. But still, can we come up with even more possibilities for valid sentences that start with the young man, either using subject plus verb or subject plus verb plus object, or even something else? So for example, I need a noun. Anyone?
So for example, is this a grammatical sentence: the young man the boat? Yes, no, do we know? Based on the previous examples, at first it's easy to think that it's not grammatical, because we assume that the subject of this sentence is still the young man. So then when we see the boat, our brains are like, nope, not a sentence. And this is because we know that a subject has to be followed by a verb. But it turns out you can parse the sentence in a different way. Here, if we interpret the subject as the young, as in young people, we see that it still follows the same rules as the sentence the young man drank beer, which was subject plus verb plus object. We were initially led astray because we tend to think of man as a noun, and therefore in the previous sentence as part of the subject, and not as a verb. This kind of sentence is called a garden path sentence. Garden path sentences are sentences that contain a word that can be interpreted in more than one way, so that you think the sentence is going to follow one structure before pivoting on that ambiguous word and going in a different direction. Time flies like an arrow, fruit flies like a banana follows that pattern, as do the man who hunts ducks out on weekends; the prime number few; when Fred eats food gets thrown; Mary gave the child the dog bit a bandaid; and the woman who whistles tunes pianos. Ambiguity in words or sentence structure can also produce hilarious unintentional second meanings in newspaper headlines, such as grandmother of eight makes a hole in one; man-eating piranha mistakenly sold as pet fish; eye drops off shelf; complaints about NBA referees growing ugly; and my personal favorite, milk drinkers are turning to powder.
So regardless of the ambiguity in natural languages like English, we can all agree that as speakers of a language, there are certain grammar rules we are all aware of that allow us to decide whether or not a sentence is a valid expression in that language. As we've seen from the last few examples, we know that one possible kind of valid sentence in English consists of subject plus verb plus stuff. And again, most of us probably did some kind of sentence diagramming in middle school that looked a little bit like this. At the top of the tree, we have a sentence, which can be broken up into its constituent parts, such as a noun phrase and a verb phrase, which in turn get broken down further until every word in the sentence can be added as a leaf of that tree. In this tree, the leaves are the words of the sentence John lost his pants, but the exact same tree could apply to the young man the boat. These kinds of trees are constructed based on grammar rules. The convention usually used to describe grammars in the computing world is a notation called Backus-Naur form, or BNF. In BNF notation, you have a set of rules that together describe the language, or in other words, can define every possible valid sentence in that language. There are two kinds of symbols used to express each grammar rule, terminals and non-terminals, which we'll get into a little more later. The basic format of each production rule is that there is a left-hand side, which produces a right-hand side, or rather, can be rewritten as what's on the right-hand side. On the left-hand side, you have a single non-terminal symbol or token, which produces either a terminal or a sequence of terminals and non-terminals.
So for example, a super simplified subset of production rules for English sentences, based on the past few slides, might be something like: at the top level, a sentence becomes a subject followed by a predicate, and then a predicate becomes a verb followed by stuff. Here we can see an example of ten production rules that describe both of the sentences we created earlier, the young man drank beer and the young man the boat. These rules make use of six non-terminals, sentence, noun phrase, verb phrase, noun, verb, and article, and seven terminals, which are simply the words the, a, young, man, boat, beer, and drank. Using this grammar, we can start at the top level and keep applying production rules until we generate the full sentence. We start with the sentence, which can be rewritten as a noun phrase and a verb phrase. The noun phrase then translates to an article plus a noun, which in the next step hits the terminals: the for the article and young for the noun. Similarly, the verb phrase, which gets rewritten as a verb followed by a noun phrase, becomes man for the verb, followed by a noun phrase. That noun phrase then becomes an article followed by a noun; the article becomes the, and the noun becomes boat. The same rules apply for the other sentence, the young man drank beer. But what does all of this have to do with computers? It turns out the process by which we parse and understand a sentence in a natural language is not too different from what goes on in a computer language parser, somewhat debatable, but close enough for our purposes. The basic flowchart of what happens to your code can be seen here. Your code gets fed into a lexer, or tokenizer, which breaks up each sentence into recognizable units. These get fed to the parser, which for most computer languages outputs an abstract syntax tree, which can then be walked by the compiler or interpreter to output CPU instructions.
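Written out in BNF-style notation, the rules just described might look something like this (a reconstruction from the description above, so the exact notation on the slides may differ; note that man appears under both noun and verb, which is exactly what makes the garden path possible):

```
sentence     ::= noun-phrase verb-phrase
noun-phrase  ::= article noun
verb-phrase  ::= verb noun-phrase
article      ::= "the" | "a"
noun         ::= "young" | "man" | "boat" | "beer"
verb         ::= "man" | "drank"
```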
Of course, most of us aren't in the business of implementing new programming languages, but by better understanding the lexer and parser steps, we may be able to see how we can use parsers for simpler situations. Rather than a Ruby script, for example, you can feed anything, a log file, a Markdown document, or even just a plain string, into your parser and get some kind of output, which doesn't even necessarily have to be a tree. It could be a different data structure, for example, or even something as simple as a Boolean: you feed your string into your parser and you get true or false to tell you whether or not it was valid. So let's look at lexing, or tokenizing. All this means is turning your input into a sequence of tokens that your parser can recognize. These tokens will then be matched against the grammar rules that describe the language you are parsing. Let's look at a simple example. Let's build a parser that can handle super basic binary operations in math, such as addition, subtraction, and multiplication. The entire language will consist of just an integer followed by an operator, followed by another integer. The grammar rules for this language are pretty simple; we only need three of them to describe every possible sentence in our mini math language. The start symbol here is an expression, which consists of three tokens: a number token, an operator token, and then another number token. The number token can then be rewritten as any digit, notated above using a regular expression, and the operator token can be rewritten as a plus, minus, or times symbol in production rules two and three. So here's a simple implementation of a tokenizer in Ruby. It just uses regular expression matching, pretty much identical to the regexes we saw in our production rules, to grab tokens of either type num or op, and then it returns them all in an array. So when we input the string three plus seven, we should get a three-element array of num, op, and num.
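A tokenizer along those lines can be just a regex scan (a sketch in the spirit of the slide, not the speaker's exact code; the `[type, value]` pair representation is my own choice):

```ruby
# Minimal tokenizer for the mini math language.
# Returns an array of [type, value] pairs, e.g. [[:num, 3], [:op, "+"], [:num, 7]].
def tokenize(input)
  input.scan(/\d+|[+*-]/).map do |token|
    if token.match?(/\A\d+\z/)
      [:num, token.to_i]   # digits become num tokens
    else
      [:op, token]         # +, -, * become op tokens
    end
  end
end

tokenize("3 + 7")  # => [[:num, 3], [:op, "+"], [:num, 7]]
```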
So now we want to pass it to the parser. Ultimately, we want to end up with a tree that looks something like this, so that when you feed the tree to the interpreter, it sees the head of the tree and knows to execute an addition on the left child and right child. A super trivial implementation of a parser might look something like this: you have a parser that only expects three tokens in a specific order and outputs a tree with a single node containing the operator, with two children, which are the number tokens to be operated on. Very easy. But now for some slightly harder math. Take, for example, the expression two times, in parentheses, three plus seven. With this expression, we need slightly more complicated rules to describe it. Whereas before our first production rule was expression becomes num followed by op followed by num, now we have, at the top level, expression becomes num followed by op followed by expression, where expression can then either be rewritten as another expression in parentheses or as a number. The number and operator tokens still translate as they did in the previous production rules. So we tokenize our input and get an array of seven tokens, and this is ultimately the tree that we want to build. But how do we tell the parser to construct the tree correctly? One common approach is to use a top-down parser that looks ahead one token to decide which grammar rule to follow. And just like with garden path sentences, we need to know what comes after a given token to know which rule to use to understand the sentence. For example, knowing that the young man is followed by a noun, the boat, allows us to know what the subject and verb of the sentence are. If you can adhere to a grammar rule each time you consume a token, then you're good to go. But if you can't, then you reach an error state and the parser throws a parse error. So here we start with the first token, which is a number token.
Since peeking ahead at the next token tells us we have an operator coming up, we know that the only possible rule we can follow is the one that says an expression is a number followed by an operator followed by another expression. So we can start building our tree with the node two. Shifting over to the next token, we have an operator. Peeking ahead, we see that the next token is a parenthesis. The only grammar rule that has an operator followed by a parenthesis is still that first one, so we know that so far we're still in the realm of grammatical correctness. And since there's no parse error, we can use that trivial parser from before and set the operator token as the head of our tree. Now we consume the parenthesis and see that the next token is a number. Since we know that a valid expression can either be a number or a number plus operator plus expression, we still haven't violated any grammar rules, so we can continue parsing and update the right child of our tree to be evaluated later. Now we consume the number token, which holds the value three. From the previous step, we knew that we had two options: either interpret the expression as becoming just a number token, which then terminates into three, or as part of the first production rule, which states that an expression can be a number followed by an operator followed by an expression. Since we can peek ahead to the next token, which is an operator, this gives us enough information to know which rule to follow, which is in fact the latter. So now we know that the expression we have as the right child of our tree will in turn be a binary expression, and using the same parser, we know that the three will be the left child of that subtree. Now we consume the operator token and see that it's a plus, peeking ahead to the seven.
We know that this adheres to the same production rule as before, the one we decided on in the previous step, which tells us we're good to keep parsing, so we can add the plus operator as the head of our subtree. Now we reach the seven, which we already know from the previous step has to be an expression of some sort. The lookahead tells us that our current token must terminate into a number. When we consume the final parenthesis, this completes the rule where an expression becomes an expression in parentheses, so now we can build our complete tree. If, on the other hand, there had been no closing parenthesis, then that rule would have been violated and we would not have been able to construct a valid syntax tree. Looking at a summary of the steps we took to build our syntax tree, you may have noticed that we often had to call expression from within an expression call. In other words, the process we went through to parse the sentence was recursive, and since we were building the tree from the top down, this method of parsing is called recursive descent. This is probably the most common type of parser that people try to write by hand, but there are a number of problems with recursive descent. For one, the recursive nature of the process makes it relatively inefficient, so you may end up checking a token against every single grammar rule in your language before you can actually parse it. Or, if you're not careful in how you write your grammar rules, you'll end up with infinite recursion. For example, if you have a left-recursive production rule that says an expression becomes an expression followed by an operator, then to parse an expression you first have to parse an expression, which in turn starts by parsing an expression, and so forth ad infinitum. Finally, sometimes you have limitations in the way or order you write your grammar rules for your language.
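The walkthrough above can be sketched as a small recursive descent parser. This is my own illustration, not the speaker's code: the grammar is a slightly regularized version of the one in the talk (expr → term op expr | term; term → "(" expr ")" | num), and the `:lparen`/`:rparen` token names are assumptions:

```ruby
# A syntax tree node: an operator value with two children, or a bare number.
Node = Struct.new(:value, :left, :right) unless defined?(Node)

# Recursive descent parser with one token of lookahead.
# Tokens are [type, value] pairs, e.g. [[:num, 2], [:op, "*"], [:lparen, "("], ...].
class Parser
  def initialize(tokens)
    @tokens = tokens
    @pos = 0
  end

  def parse
    tree = expr
    raise "parse error: trailing tokens" if peek
    tree
  end

  private

  def peek
    @tokens[@pos]
  end

  def advance
    token = @tokens[@pos]
    @pos += 1
    token
  end

  # expr -> term op expr | term
  def expr
    left = term
    if peek && peek[0] == :op
      op = advance
      Node.new(op[1], left, expr)  # calling expr from within expr: hence "recursive descent"
    else
      left
    end
  end

  # term -> "(" expr ")" | num
  def term
    token = advance or raise "parse error: unexpected end of input"
    case token[0]
    when :num then Node.new(token[1])
    when :lparen
      inner = expr
      close = advance
      raise "parse error: expected )" unless close && close[0] == :rparen
      inner
    else
      raise "parse error: unexpected token #{token.inspect}"
    end
  end
end
```

Feeding it the seven tokens of 2 * (3 + 7) yields a tree with * at the head, 2 as the left child, and the + subtree as the right child, just as in the slides.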
In the case of a calculator language like this one, depending on the order you write your grammar rules in, you'll potentially run into issues with associativity. That is, if you say two minus three plus five, how does the parser know whether to give precedence to the addition or the subtraction? So another approach to parsing instead builds the parse tree from the bottom up. The key difference here is that this approach makes use of a stack that you push tokens onto until there are enough tokens on the stack to fulfill a grammar rule, at which point they get reduced and popped off. For that reason, this kind of approach is also called shift-reduce parsing. Let's go back to our slightly harder math example. We start with the first token, two, and push it onto the stack for the moment. Based on the grammar rules, we know that two is a terminal for a num token, so the stack reduces it to the left-hand side symbol, number. Then we set aside the two as a building block for our eventual tree. Now we shift over and consume the next token, which is a times sign. This gets pushed onto the stack, and then we look at the grammar. The rule that matches is operator becomes multiplication, so we reduce the token on the stack and move it over. Now we shift again, grab the parenthesis, and add it to the stack. Since this looks like it could be the start of the rule where an expression becomes an expression in parentheses, we just keep it on the stack for now and keep shifting. The next token is a three, which we push. Again applying the first grammar rule, we reduce the three to a number token in our stack and set it aside. Shifting over to the plus sign, we add it to the stack as usual. We look at our grammar rules and see that this is an op token, so the next action is to reduce, and once again we set it aside. And our next action, again, is to shift.
We push the next token onto the stack and apply our production rules, which recognize a number token, but also that a number is a type of expression. So our stack executes another reduce, and we put our seven aside. At this point, the top tokens on our stack have fulfilled a rule that our grammar recognizes, namely that an expression can be a number followed by an operator followed by an expression. So now the stack takes the three tokens involved in the rule and reduces them to simply an expression, shown on the stack in green. Now our parser knows what to do with those elements and can create a subtree from the three, plus, and seven, in the same way we previously saw in the top-down parser. That's all the reducing that can be done for now, so again we shift to consume the last token and push it onto the stack. And now the tokens at the top have fulfilled another grammar rule, which is expression becomes expression in parentheses. Since the three tokens at the top of the stack fulfill a grammar rule, the stack reduces the open parenthesis, expression, and close parenthesis into an expression token, as per the rule. So now the tokens we have left on the stack are num, operator, and expression. But now we're again at a state where the tokens at the top of the stack match another grammar rule, which is expression becomes num operator expression. This allows the first tokens we set aside to be added to the tree at the proper level, above the subtree constructed from the previous reduction. Again, we reduce the tokens at the top of the stack into the top-level expression non-terminal. Now we've reached the end of our tokens, and since we were able to perform a valid reduction, we know that the input was a valid grammatical sentence in our math language, and our parser can accept it and create the final tree. So these are two types of approaches to parsing.
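The shift-reduce mechanics just described can be sketched by hand. This is my own simplified illustration, not the speaker's code: a real LALR parser drives this loop from generated tables, whereas here the reduce decisions (including the lookahead check for when a num may become an expr) are hard-coded for this one grammar:

```ruby
Node = Struct.new(:value, :left, :right) unless defined?(Node)

# Hand-rolled shift-reduce loop for the grammar:
#   expr -> num op expr | "(" expr ")" | num
# Stack entries are [symbol, value] pairs; tokens come from the tokenizer.
def shift_reduce_parse(tokens)
  stack = []
  pos = 0
  loop do
    lookahead = tokens[pos]
    # Reduce as long as the top of the stack matches a rule.
    reduced = true
    while reduced
      reduced = false
      top3 = stack.last(3)
      if top3.length == 3 && top3[0][0] == :lparen && top3[1][0] == :expr && top3[2][0] == :rparen
        stack.pop(3)                                  # expr -> "(" expr ")"
        stack.push([:expr, top3[1][1]])
        reduced = true
      elsif top3.length == 3 && top3[0][0] == :num && top3[1][0] == :op && top3[2][0] == :expr
        stack.pop(3)                                  # expr -> num op expr
        tree = Node.new(top3[1][1], Node.new(top3[0][1]), top3[2][1])
        stack.push([:expr, tree])
        reduced = true
      elsif stack.last && stack.last[0] == :num && (lookahead.nil? || lookahead[0] == :rparen)
        num = stack.pop                               # expr -> num (only when nothing can extend it)
        stack.push([:expr, Node.new(num[1])])
        reduced = true
      end
    end
    break if lookahead.nil?
    stack.push(lookahead)                             # shift
    pos += 1
  end
  raise "parse error" unless stack.length == 1 && stack[0][0] == :expr
  stack[0][1]
end
```

Running it on the seven tokens of 2 * (3 + 7) reproduces the reductions from the slides: 7 becomes an expr, then 3 + 7 reduces to a subtree, then the parentheses collapse, and finally num op expr reduces to the full tree.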
Top-down, or recursive descent, parsers are the ones most commonly written by hand, whereas bottom-up parsers, which you might sometimes hear called LALR parsers, are much harder to write. Unsurprisingly, the more grammar rules you have, the harder it is to construct a parser. But fortunately, there are tools that make parsers for you. These tools are called parser generators, and all they need is a grammar file with all the rules that define the language you want, and bam, they'll spit out a parser for you. Those grammar files look like that wacky .y file we saw at the very beginning of the slide deck, but now that we understand how grammar rules are written, we can see it's not that different from the BNF notation we've been using. If you're using something like Racc as your generator, it expects a class with a rule section, and that's where you define your grammar rules. So here you can see that the rule for the non-terminal symbol expression consists of a num followed by an operator followed by another num. When this sequence gets hit by the parser, it executes the Ruby code that you specify in the block. We saw earlier, in the bottom-up parser, that when a grammar rule is fulfilled, the parser performs some action. The grammar file also comes with a tokenizer, which is how the generated parser knows what the tokens you use in your rule set are, but it's pretty similar to the one we saw earlier, so I'm not including it on the slide. All that's left is to pass your grammar file to Racc or Bison, and it should generate a file that can parse whatever language you specified. You don't ever need to do anything with that file, or even look at it, but if you did, it would look something like this, which now makes a little bit more sense. The numbers in the various tables here simply correspond to the different actions the parser needs to take, but for the most part, you don't even need to worry about it, and now you have a parser.
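A Racc rule section along the lines described might look something like this (a sketch of Racc's grammar file format from memory, not the actual slide; the class name, token names, and Node helper are assumptions):

```
class MathParser
rule
  expression : NUM OP NUM {
                 # val holds the semantic values of the matched tokens;
                 # result becomes the value of the reduced expression.
                 result = Node.new(val[1], Node.new(val[0]), Node.new(val[2]))
               }
end
```

Running `racc` over a file like this generates the table-driven parser, the arrays-of-numbers file from the beginning of the talk.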
But, you might ask, when would I actually use a parser? And that's a great question, because there are a lot of situations in which you can use one. For example, you could use one if you're validating a string, such as a URL or an email address, or maybe extracting information from the body of an email or a log file, or even converting some sort of formatted document into another form, for example, Markdown to HTML. But now you might be thinking, well, I've validated strings before just using regular expressions. And it's true, for some simple string input, you probably could just use a simple regex. But has anyone here actually tried to validate, for example, a URL using regexes? I'm sorry, because if you did, following the rules for how URLs are constructed, you might have ended up with something like this, which is terrible. You pretty much never want something like this in your code, because it's hard to read. Don't read it, you're probably all trying to figure it out right now, right? And it probably doesn't even cover all the cases you need it to. And the reason for this is that there's a really important distinction in languages, even string languages, to take into consideration, which is that all languages belong to a hierarchy. The established hierarchy for languages, first introduced by Noam Chomsky, has four types. The first type, type zero, is unrestricted, and this is the category that most natural languages, like English, fall into. These are languages that do not have regular or consistent grammars, and which can be ambiguous. Type one is called context-sensitive, which we don't really care about for the purposes of this talk, but then we get to types two and three. Type two languages are what are known as context-free languages, and most programming languages are type two, but a language doesn't even need to be as complicated as, say, Python to be considered context-free, as we'll see in a minute.
Finally, type three languages are what are called regular languages, and these are the only kinds of languages that can be properly parsed by regular expression engines, hence the name. But what does it mean for a language to be either regular or context-free? For the most part, it all boils down to the grammar. For regular languages, all grammar rules can be expressed with a single non-terminal symbol on the left-hand side, and either a terminal or nil on the right-hand side. That terminal can sometimes have a non-terminal before it or after it, but not both. Context-free languages, on the other hand, have more flexible grammar rules: while the left-hand side is still a single non-terminal, the right-hand side can be a sequence of terminals and non-terminals. And as it turns out, most languages aren't regular. So let's take a really simple language, one you'd think is so simple that surely you could just use a regular expression to parse it. Let's consider the AB language. The rules for this language are that you have a series of A's followed by the same number of B's, so a valid string could be ab, or aabb, and so forth, whereas invalid sentences are things like a string of all A's, or abb, or abab alternating. If we were to try to write grammar rules for this, we might start with something like: the non-terminal A becomes the lowercase a, and the non-terminal B becomes the lowercase b. Then, to construct a sentence, we would have S become the sequence of non-terminals A B, but this rule only gives us one possible sentence, which is just ab. So in order to produce all possible sentences, we'd have to add rules for aabb, and then aaabbb, et cetera, and so on forever, and that obviously doesn't work. This is basically the same problem as figuring out whether you have balanced parentheses. So what is the grammar?
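To make the problem concrete, here's a quick illustration of my own (not from the talk) of why the obvious regular expression falls short: it can check "some a's, then some b's," but it has no way to require the counts to match, because matching counts requires counting, which regular grammars can't do.

```ruby
# The naive regex for the AB language only checks the shape, not the counts.
NAIVE_AB = /\Aa+b+\z/

"aabb".match?(NAIVE_AB)  # => true  (correctly accepted)
"aab".match?(NAIVE_AB)   # => true  (wrongly accepted: two a's, one b)
"abab".match?(NAIVE_AB)  # => false (correctly rejected)
```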
Thanks to parser generators, coming up with the grammar is the hardest part of writing a parser, but sometimes coming up with that grammar is a bit tricky, since it has to cover all possible sentences. This particular problem reminds me of a logic puzzle that maybe some of you have heard before. The problem is that a bunch of gnomes have been kidnapped by an ogre and put in a cave. Some of them have their foreheads painted red, to be eaten right away, and some of them have their foreheads painted blue. However, the ogre gives them one chance at saving themselves. The gnomes are allowed to exit the cave one by one, so that they can see each other but not communicate, and they still can't see the color of their own foreheads. If they can sort themselves into a single line where all the red-painted gnomes are on one side and all the blue-painted gnomes are on the other side, then they'll all be freed. Does anybody actually know the answer to this one? No? The answer is that if each gnome who comes out of the cave inserts himself in between the red and blue gnomes, they'll automatically sort themselves. And if you think about it, this is actually the same way the AB language works. To use BNF notation, a sentence, or statement S, is basically an A and a B with another statement S inserted in between them. If we specify that S can also be nil, that gives us the base case we need to cover all cases of our deceptively simple-seeming AB language. And remembering that a grammar rule for a regular language can't have a sequence of terminals and non-terminals on the right-hand side, we see that this language is in fact a context-free language. So, as I mentioned, not everyone is in the business of implementing a programming language from scratch, but understanding grammars is useful in everyday work, such as understanding internet standards.
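That grammar, S → "a" S "b" | nil, translates directly into a tiny recursive recognizer (my own sketch; like the gnomes, it peels one a...b pair off the outside and recurses on what's in between, and it returns a Boolean rather than building a tree):

```ruby
# Recognizer for the AB language: S -> "a" S "b" | (nothing)
def ab_language?(str)
  return true if str.empty?                 # the nil base case
  str.start_with?("a") && str.end_with?("b") &&
    ab_language?(str[1..-2])                # strip the outer a...b pair, recurse on S
end

ab_language?("aabb")  # => true
ab_language?("aab")   # => false
```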
For example, here are the rules that describe the syntax of HTTP auth headers from RFC 2617, which specifies HTTP authentication. A Ruby implementation of this standard can be found in the HTTP digest auth gem, but you could also implement a parser using these rules just to check the validity of input. Except for email. To validate email, just send the email. So as you've seen, you don't really need that much complexity before you get out of regular-expression-parsable territory. And because of that, you're going to be better off figuring out the grammar rules for the thing you're trying to parse and using a parser instead. This also ensures greater accuracy in what you're trying to parse. For example, there are a number of web servers that use a parser for HTTP requests, such as those built with Ragel, or Jetty. But even as recently as a few months ago, there was a vulnerability uncovered in Jetty, which is used for Java servlets, that printed debug information in the response, so that was bad. The other thing about parsers is that they're often a faster way of getting what you need done. For example, in a web framework, if you use a parser for your routes, then you can build a tree, which makes routing much faster than a linear scan through every route. Bonus points if somebody can tell me the big-O complexity for that. Of course, parsers are hard to write, but thanks to parser generators, all you need to figure out is the grammar. There are many resources that I used to prepare for this talk, and I couldn't list them all on one slide, but here are a few links. If you want to see a full implementation of a recursive descent parser or a shift-reduce parser, I highly recommend the book Constructing Language Processors for Little Languages by Randy Kaplan, as well as Pat Shaughnessy's Ruby Under a Microscope, which Aaron also recommended in his virtual machine talk. So thank you very much.