Our next speaker is Hsing-Hui Hsu. It took me several tries over the course of knowing Hsing-Hui to get that name right. I'm really good at mispronouncing names, as it turns out. You will recognize her because she has too many H's in her name. In fact, all H's, actually. Her Twitter handle, I believe, is SoManyHs, if I am correct. So, Hsing-Hui is a Rubyist. She works primarily in Rails these days. She is particularly interested in parsing, probably in no small part due to the fact that she speaks four languages. Can anyone guess the four languages? French is one of them. Somebody already... there's a spoiler in the audience here. Somebody already knew all the languages. I'll give away the other ones too. Yes, she speaks Mandarin Chinese, French, and English. If you speak Cantonese, do not talk to her. You are not welcome here. She is a Mandarin speaker, okay? She'll probably be nice to you anyway if you try and speak some Cantonese with her. And there's one last language that's super tough to guess, actually. Ruby. Dutch? Oh, cheater. You're such a cheater. I know you're cheating. Nobody just guesses Dutch first try. It's not a thing. So: Dutch, Chinese, English, French, and Ruby. And she's here to talk to you about parsing. Please, a round of applause. Oh, good. Yes, okay, good. So, my name is Hsing-Hui Hsu, and as Jonan said, it doesn't look anything like it sounds. A lot of H's that are all silent, so I go by SoManyHs. And today I'm gonna talk about parsing. A little explanation about the title of the talk: I probably first started thinking about parsing when I was teaching English as a foreign language and had to explain things like why so many words in English have multiple meanings, like "flies." Depending on the meaning of that word, the sentence can go in very different directions.
So, when I first started looking into how computer languages are parsed, I noticed that there are actually a lot of similarities with how humans parse the sentences they hear and read every day. I'm a relative newcomer to the programming world, so I never actually learned about parsers or compilers or anything in school. I actually also have an alt title for this talk, which is "How I Accidentally a Computer Science, and So Can You." So, of course, it's not gonna be a comprehensive talk on all parsers, but I wanted to share how I came to learn about them, because honestly I really had no idea what they were until a couple of months ago, and I feel like I came into them by accident. So, rewinding a few months back, I had just spent a year going through a program called Ada Developers Academy, where I learned, among many other things, how to build Rails apps. We were turning out new apps every couple of weeks, and one day I got curious about how things actually worked underneath all of that Rails magic. Specifically, I was trying to figure out how routing worked, so I went to GitHub and started poking around some of the Rails source code and came across a file called parser.rb, and I was like, okay, cool. I'd written a kind of mini parser before, to better understand what a language interpreter does, so I thought I probably knew what it would look like. And it turned out it was nothing like what I expected. I didn't even recognize this, so I was like, okay, I have to look a little bit deeper, and I saw this file was actually generated using a different file called parser.y, so I was like, okay, let's look at that. It still didn't make any sense, and barely even looked like any kind of Ruby code that I recognized, so I was trying to figure out the best way to get myself to learn this. And I've been told that a great if terrifying way to make yourself learn something is to submit a talk on it, so I was like, okay, sure.
What's the worst thing that can happen if I submit a talk? Either I get accepted and I have to talk about it, or I get rejected. So it turns out the worst thing that can happen is to get rejected and accepted, so thanks a lot, Josh, for putting me through all of that. But anyway, here I am. And to get warmed up to the idea of parsing, let's play a game with some audience participation. I hope most of you have played it before. It's where I ask for a type of word and you have to fill in the blank to make a sentence. So I need a verb. Anyone? Okay, drink. No audience plant or anything. So: "The young man drank." This seems like a perfectly valid sentence. If you can all remember middle school grammar lessons, you can diagram the sentence into a subject and a verb, where the verb is "drank," and since all sentences need a subject to actually do the verb, we know that the subject is "the young man." So it makes sense, right? But are there any other possible sentences that could have started with "the young man"? For example, the young man did something, so I need a noun and a verb. We can use the same verb, drank, but drank what? Beer. Beer, great. I didn't anticipate that one at all. So this is also grammatical. This time, instead of just subject plus verb, we have subject plus verb plus object, which abides by the rules of English grammar if we interpret "the young man" as the subject, "drank" as the verb, and "the beer" as an object, which we know makes sense, because objects are nouns. But still, can we come up with even more possibilities for a valid sentence that starts with "the young man," using either subject-verb or subject-verb-object? For example, can you fill in this blank with a noun? Anyone? Well, you can try "the boat." So: "The young man the boat." Is this grammatical? Or "the young man beer," is that grammatical? You think not, right?
Because, based on the previous sentences, we assume that "the young man" is a subject followed by a direct object, and we think, no, this is not grammatical. But it turns out you can actually parse the sentence in a different way. If we interpret the subject as "the young," as in young people, we see that it still follows the same rules as "the young man drank beer," which was subject, verb, and then object. We were initially led astray because we tend to think of "man" as a noun, and therefore, in the previous sentences, as part of the subject, and not as a verb. This kind of sentence is called a garden path sentence. Garden path sentences are sentences that contain a word that can be interpreted in more than one way, so that you think the sentence is going in one direction, and then it pivots on that ambiguous word and goes in a completely different one. "Time flies like an arrow; fruit flies like a banana" is an example that follows that pattern. Another example is "The man who hunts ducks out on weekends," where we're led astray by thinking of "ducks" as the direct object. "We painted the wall with cracks." You might expect "paint" or something after "with," but "cracks" describes what's on the wall. "When Fred eats, food gets thrown." "Mary gave the child the dog bit a bandaid." And "The woman who whistles tunes pianos." This kind of ambiguity in sentences can also be seen in hilarious actual headlines, such as "Grandmother of eight makes a hole in one." "Man-eating piranha mistakenly sold as pet fish." "Eye drops off shelf." "Complaints about NBA referees growing ugly." And my personal favorite, "Milk drinkers are turning to powder." So regardless of the ambiguity in natural languages like English, we can all agree that as speakers of a language, there are certain grammar rules we are aware of that allow us to decide whether or not a sentence is a valid expression in that language.
So as we've seen with the last couple of examples, we know that one possible kind of valid English sentence consists of a subject plus a verb plus stuff. Most of us probably did that sort of thing in middle school, and we know that a sentence diagram tree looks something like this: at the top level you have the sentence, which can be broken up into its constituent parts, here the noun phrase and then the verb phrase, which can then be further broken down until every word of the actual sentence is added as a leaf of the tree. In this sentence we have "John lost his pants" as the leaves of the tree. But the same tree could apply to "the young man the boat." These kinds of trees are constructed based on grammar rules. The convention used to describe grammars in the computing world is a notation called Backus-Naur Form, or sometimes Extended Backus-Naur Form: BNF. In BNF notation you have a series of rules that together constitute the language, or in other words can define every possible sentence in that language. There are two kinds of symbols used to express each production rule, and those are terminals and non-terminals, which we'll get more into. The basic format for each production rule is that there's a left-hand side which produces a right-hand side, or rather can be rewritten as what's on the right-hand side. On the left-hand side you have a single non-terminal symbol, or token, which produces either a terminal or a sequence of terminals and non-terminals. So what does this mean? For example, in a super-simplified subset of production rules for English sentences, based on what we saw before, at the top level a sentence produces a subject plus a predicate, and the predicate is a verb plus stuff. So here is an example of ten production rules that can describe both of the sentences we created earlier: "The young man drank beer" and "The young man the boat."
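As an aside, production rules like these are easy to write down as plain data. Here's a hedged Ruby sketch (the rule names and arrangement are my own, not the exact rule set from the slides), along with a tiny generator that, by construction, can only ever produce sentences the grammar allows:

```ruby
# Production rules as data: each non-terminal (a symbol) maps to a list of
# alternative right-hand sides; terminals are plain strings. Note that "man"
# appears as both a noun and a verb, which is what makes garden paths possible.
GRAMMAR = {
  sentence:    [[:noun_phrase, :verb_phrase]],
  noun_phrase: [[:article, :noun]],
  verb_phrase: [[:verb, :noun_phrase], [:verb]],
  article:     [["the"], ["a"]],
  noun:        [["young"], ["man"], ["boat"], ["beer"]],
  verb:        [["man"], ["drank"]]
}.freeze

# Keep applying production rules, choosing alternatives at random, until only
# terminals remain. Every string this returns is grammatical by these rules.
def generate(symbol = :sentence)
  return symbol if symbol.is_a?(String)              # a terminal: done
  GRAMMAR.fetch(symbol).sample.map { |s| generate(s) }.join(" ")
end
```

Running `generate` a few times yields sentences like "the young man the boat": odd-sounding, but valid under the rules, which is exactly the point of the garden path examples.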
These rules make use of six non-terminals: sentence, noun phrase, verb phrase, noun, verb, and article. And seven terminals, which are just the words "the," "a," "young," "man," "boat," "beer," and "drank." Obviously some of those fall into more than one production rule category. Using this grammar we can start at the top level and keep applying production rules until we generate the full sentence. We start with: sentence becomes a noun phrase and a verb phrase. Then the noun phrase becomes an article and a noun, which in turn hit the terminals "the" and "young" for the article and noun respectively. Similarly, the verb phrase can then be broken down into a verb and a noun phrase, and so forth, until you finally get the whole sentence. Pretty straightforward. The same production rules can be applied to "The young man drank beer," just in a different order, and you get the same sentence, so we know it's valid. But what does this have to do with computers? This is just grammar, English grammar. Well, it turns out the process by which we parse and understand sentences in a natural language isn't that different, at an abstract level, from what a computer language parser does. Here's the basic flow chart of what happens to your code: you have the source code, which gets passed into something called a lexer, or tokenizer, which produces a bunch of tokens. It reads the sentences and picks out recognizable units. Those tokens in turn get fed to the parser, which in most computer languages spits out a syntax tree, which your compiler or interpreter then walks to output CPU instructions. But of course, most of us don't spend a lot of time implementing new programming languages. By better understanding the lexer and parser steps, though, we may be able to see how to use parsers for simpler situations. So rather than a Ruby script, you can feed anything,
a log file or a markdown document or even just a plain string, into your parser and get some kind of output, which doesn't even necessarily have to be a syntax tree. It could be a different data structure, or even something as simple as a boolean. So, starting with lexing, or tokenizing. All this means is turning your input into a sequence of tokens that your parser can recognize. These tokens can then be matched against the grammar that describes the language you're parsing. Let's look at a really simple example: let's build a parser that can handle super basic binary operations in math, such as addition, subtraction, and multiplication. The entire language will consist of an integer, followed by an operator, followed by another integer. The grammar rules here are pretty simple. We only need three of them to describe every possible sentence in our mini math language. The start symbol here is the expression, which consists of three non-terminal symbols: a number token, then an operator token, and then another number token. The number token can then be rewritten as any digit, and the operator token can be rewritten as either a plus, minus, or multiplication sign, in production rules two and three. So here's a really simple implementation of a tokenizer in Ruby, using the StringScanner library and regular expression matching pretty much identical to what we saw in our production rules, to grab tokens of either type number or operator, and then return them all in an array. So when we input the string three plus seven, we should get a three-element array as our output tokens: number three, operator plus, and number seven. Now we want to pass those tokens into the parser, and ultimately we want to end up with a tree that looks something like this, so that when we feed it to the interpreter, it looks at the head and knows to execute addition on the left child and right child.
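A sketch of that tokenizer-plus-parser pipeline in Ruby might look like the following. The method names and the token and tree shapes are my own, since the slides' code isn't reproduced in the transcript; the tokenizer uses StringScanner as described, and the parser is the trivial three-token version:

```ruby
require "strscan"

# Lexing: turn the input string into NUMBER and OPERATOR tokens.
def tokenize(input)
  scanner = StringScanner.new(input)
  tokens  = []
  until scanner.eos?
    if scanner.scan(/\d+/)            # \d+ also accepts multi-digit numbers,
      tokens << [:NUMBER, scanner.matched.to_i]  # a small extension of "any digit"
    elsif scanner.scan(/[+\-*]/)
      tokens << [:OPERATOR, scanner.matched]
    elsif scanner.scan(/\s+/)
      # ignore whitespace between tokens
    else
      raise "unrecognized input at #{scanner.rest.inspect}"
    end
  end
  tokens
end

# Parsing, trivial version: expect exactly NUMBER OPERATOR NUMBER and build
# a one-node tree with the operator at the head and the numbers as children.
def parse(tokens)
  left, op, right = tokens
  unless tokens.length == 3 &&
         left[0] == :NUMBER && op[0] == :OPERATOR && right[0] == :NUMBER
    raise "syntax error"
  end
  { op: op[1], left: left[1], right: right[1] }
end
```

So `parse(tokenize("3 + 7"))` yields `{ op: "+", left: 3, right: 7 }`, the tree an interpreter can walk to execute the addition.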
A really trivial implementation of this parser might look something like this: you have a parser that only expects three tokens and outputs a tree with a single node containing the operator, and two children, which are the number tokens to be operated on. Really simple. But now for some slightly harder math. Take, for example, this expression: two times, in parentheses, three plus seven. With this expression we need slightly more complicated rules to describe it. Whereas before our first production rule was expression equals number followed by operator followed by number, now at the top level we have expression equals number followed by operator followed by expression, where the expression can then be rewritten as either an expression in parentheses or just a number. The number and operator non-terminals still translate as they did in the previous production rules. So we tokenize; to save space, the slide just shows the two, the times sign, and so forth. And this is ultimately the tree that we want to build. But how do we tell the parser to construct the tree correctly? One common approach is to use a top-down parser that looks ahead one token to decide what grammar rule to follow. Just like with garden path sentences, we need to know what comes after the current token to know what kind of rule to use to understand the sentence. For example, knowing that "the young man" is followed by "the boat" allows us to know what the subject and the verb of the sentence are. If you can adhere to a grammar rule when you consume each token, then you're good to go and you can keep parsing; but if you can't, then you hit an error state and you can't continue parsing: a syntax error. So here we start with the first token, which is a number token. Peeking ahead at the next token tells us we have an operator coming up, so we know that the only possible rule we can follow is the one that says an expression is a number followed by an operator and then another expression.
So we start building our tree with the node two. Shifting over to the next token, we have an operator. Peeking ahead, we know that the next token is a parenthesis, and the only grammar rule that has an operator followed by a parenthesis is still the first one, since we know that an expression can also start with a parenthesis. So far we're still in the realm of grammatical correctness, so we can keep parsing, and we can set the operator as the head of the tree. Now we consume the parenthesis and see that the next token is a number. Since we know that a valid expression can either be a number, or a number plus an operator plus an expression, we still haven't violated any grammar rules, so we can continue parsing and update the right child of the tree to be, potentially, an expression in parentheses. Then we consume the number token holding the value three. We have two options: either interpret the number as just terminating into a three, or as part of the first production rule, which states that an expression can be a number, an operator, and an expression. Since we can peek ahead at the next token, which is an operator, that gives us enough information to know which rule to follow: the first one, and not the second one. So now we have an expression that will turn into a binary operation, and following the same process, we know that three will be the left child of that subtree. Now we consume the operator token and see that it's a plus. Peeking ahead to the seven, we know that it adheres to the first production rule, which tells us we're good to keep parsing, and we can add the plus operator as the head of the subtree. Then we reach the seven. We know from the previous step that it has to be an expression of some sort. Lookahead tells us that we have a closing parenthesis, which completes the rule expression equals parentheses-expression. So now we can build our complete tree.
If, on the other hand, there had been no parenthesis at the end, then the second production rule would have been violated, and we would not have been able to complete a valid syntax tree. So that's building from the top down. Looking at the summary of steps we took to build our syntax tree, you might have noticed that we often had to call expression from within an expression call. In other words, the process that we went through to parse the sentence was recursive, and since we were building the tree from the top down, this method of parsing is called recursive descent. This is probably the most common type of parser that people might try to write by hand, but there are a number of problems with recursive descent. For one, the recursive nature of the process makes it relatively inefficient, because you might end up checking a token against every single grammar rule that you have before you can parse it. Or, if you're not careful with how you write your grammar rules, you'll end up in an infinite loop. For example, if you have a production rule that says expression equals expression-operator-expression, then when you hit the first token you go back to expression, which produces expression-operator-expression, which goes back to expression, and so forth, ad infinitum. Finally, you sometimes have limitations in the way you can order or write your grammar rules for a language that uses a recursive descent parser. In the case of a calculator language like this one, depending on the order in which you write your grammar rules, you'll potentially run into issues with associativity. That is, if you say two minus three plus five, how does the parser know whether to give precedence to the subtraction or the addition? So another approach to parsing instead builds the parse tree from the bottom up.
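Before moving on, the top-down walkthrough above can be sketched as a small recursive descent parser. This is my own sketch, assuming tokens shaped like `[:NUMBER, 2]`, `[:OPERATOR, "*"]`, `[:LPAREN, "("]`, and `[:RPAREN, ")"]`; notice how `expression` calls itself, and how a single peek at the next token decides which rule to follow:

```ruby
class RecursiveDescentParser
  def initialize(tokens)
    @tokens = tokens
    @pos = 0
  end

  def parse
    tree = expression
    raise "syntax error: leftover input" unless @pos == @tokens.length
    tree
  end

  private

  def peek
    @tokens[@pos]
  end

  def advance
    tok = @tokens[@pos]
    @pos += 1
    tok
  end

  # expression -> primary OPERATOR expression | primary
  def expression
    left = primary
    if peek && peek[0] == :OPERATOR          # one-token lookahead picks the rule
      op = advance[1]
      { op: op, left: left, right: expression }  # the recursive call
    else
      left
    end
  end

  # primary -> "(" expression ")" | NUMBER
  def primary
    tok = advance
    raise "unexpected end of input" if tok.nil?
    case tok[0]
    when :NUMBER then tok[1]
    when :LPAREN
      inner = expression
      closing = advance
      raise "expected closing parenthesis" unless closing && closing[0] == :RPAREN
      inner
    else
      raise "unexpected token #{tok.inspect}"
    end
  end
end
```

Parsing the tokens for `2 * (3 + 7)` yields a nested tree with `*` at the head and the `+` subtree as its right child, just as in the walkthrough. And the infinite-loop hazard is visible here too: if `expression` began by calling `expression` (left recursion), it would never consume a token.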
The key difference here is that this approach makes use of a stack that you push tokens onto until there are enough tokens on the stack to fulfill a rule, at which point they get reduced and popped off. For that reason this kind of approach to parsing is also called shift-reduce. Going back to our slightly harder math example: we start with the first token, two. We push it onto our stack for the moment, and based on the grammar rules we know that two is a terminal for a num token, so the stack reduces it to the left-hand-side symbol for a num. Then we set aside the two as a building block for our eventual tree. Then we shift over and consume the next token, which is a times sign. This gets pushed onto the stack, and then we look at the grammar. The rule that matches is operator equals multiplication, so we reduce it on the stack and move it over. Then we shift again and grab the parenthesis. We add it to the stack, and since this looks like it could be covered by the rule where expression equals parentheses-expression, we keep it on the stack for now and just keep shifting. The next token is now three, which we push. Again applying the first grammar rule, we reduce the three to a number on our stack and set it aside again. Shifting over to the plus sign, we add it to the stack as usual. We look at the grammar and know that this is an op token, so our next action is reduce, and once again we set it aside. Our next action is to shift. We push the next token onto the stack and apply our production rules, num equals seven and expression equals num, which recognize the seven token and reduce it to an expression, and then we set it aside. At this point the tokens on our stack have fulfilled a rule that our grammar recognizes, namely that expression equals number followed by operator followed by expression. So now the stack takes the three tokens involved in the rule and reduces them to simply an expression, which is shown on the stack in green.
So now our parser knows what to do with those elements and can create a subtree from the three, the plus, and the seven, in the same way that we saw previously in the top-down parser. That's all the reducing that can be done for now, so again we shift to consume the last token. Again we push it onto the stack, and now the tokens at the top of the stack have fulfilled another grammar rule, which is expression in parentheses. Since those three tokens at the top of the stack fulfill a rule, the stack reduces them into simply an expression and puts it on the stack. So now on the stack we have, left to right: number, operator, expression. But now we're at a state, again, where the tokens at the top of the stack have met another grammar rule, which is expression equals number operator expression. This allows the first tokens we set aside to be added to the tree at the proper level, above the subtree constructed from the previous reduction. So again we can reduce the tokens at the top of the stack into the top-level expression non-terminal. And now we've reached the end of our tokens, and since we were able to perform a valid reduction, we know the input was a valid grammatical sentence in our math language, and we can accept it and create the final tree. So those are two types of approaches to parsing: top-down, or recursive descent, which again is most commonly written by hand, and bottom-up, which you might sometimes hear called LALR, and which is much, much harder to write by hand. Unsurprisingly, the more grammar rules you have, the harder it is to construct this kind of parser. But fortunately there are tools that make parsers for you, with names like Yacc, Racc, and Bison. They're called parser generators, and all they need is a grammar file with all the rules that determine the language you want to parse, and they'll spit out a parser for you, which you can then feed any input and it will do the right thing.
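The shift-reduce walkthrough above can be sketched in a toy way like this. It's my own sketch, not how a generated parser actually works: a real Yacc-style parser drives its shifts and reduces from generated tables with lookahead, whereas this version just greedily reduces whenever the top of the stack matches a rule, which happens to suffice for this mini grammar:

```ruby
# Toy shift-reduce parser for the mini math grammar. Shift each token onto a
# stack, then reduce while the top of the stack matches one of these rules:
#   NUMBER        -> expr
#   expr OP expr  -> expr   (builds a subtree)
#   "(" expr ")"  -> expr
def shift_reduce_parse(tokens)
  stack = []
  tokens.each do |type, value|
    # shift: numbers reduce to expr immediately, everything else as-is
    stack << (type == :NUMBER ? [:expr, value] : [type, value])
    loop { break unless reduce!(stack) }
  end
  raise "syntax error" unless stack.length == 1 && stack[0][0] == :expr
  stack[0][1]
end

def reduce!(stack)
  top3 = stack.last(3)
  if top3.length == 3 && top3.map(&:first) == [:expr, :OPERATOR, :expr]
    left, op, right = stack.pop(3)
    stack << [:expr, { op: op[1], left: left[1], right: right[1] }]
    true
  elsif top3.length == 3 && top3.map(&:first) == [:LPAREN, :expr, :RPAREN]
    _open, inner, _close = stack.pop(3)
    stack << [:expr, inner[1]]
    true
  else
    false
  end
end
```

Feeding it the tokens for `2 * (3 + 7)` reproduces the walkthrough: the parenthesized `3 + 7` reduces first, then the parentheses, then the outer `2 * expr`, leaving a single expression on the stack, which is how we know the input was valid.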
Those grammar files are like the wacky .y files that we saw at the very beginning of the talk, but now that you understand how the grammar rules are written, we can see it's not that different from the BNF notation that we've been using. Here you can see that the rule for the non-terminal symbol expression consists of a number token followed by an operator token followed by a number, and when this sequence gets hit by the parser, it executes what's in the Ruby block, which calls send with the values at the first and second indexes of your match. The grammar file also comes with the tokenizer, but it's pretty much the same thing that we saw before, so I'm just leaving it off the slide. Again, the grammar rule is defined here as just num, and the block is very similar to that trivial parser that we saw in the very first simple math example. So all that's left is to pass your grammar file into Racc or Bison, and it should generate a file that can now parse whatever input for the language that you specified. You don't even need to look at the output that your parser generator generates, but if you did, it would look something like this. The numbers in the various tables here simply correspond to the different actions that the parser needs to take, such as shift or reduce or go to such-and-such rule, but for the most part you don't even have to worry about it, and you already have a parser. By now you might ask, when would I actually use a parser? It's a great question, because there are a lot of situations in which you can use a parser. For example, you could use one if you're validating a string such as a URL or an email address, or maybe extracting information from the body of an email or a log file, or even converting some kind of formatted document, for example markdown to HTML. But now you're thinking, well, I've validated strings before and I've just used regular expressions, so why can't I just use that?
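(For reference before moving on: a Racc grammar file for the simple math language might look roughly like the sketch below. This is my own guess at the shape of the rule described above, not the slide's exact file; in Racc, `val` holds the matched values and `result` is what the rule returns, so the `send` call computes the operation directly.)

```
class MiniMathParser
rule
  expression : NUMBER OPERATOR NUMBER { result = val[0].send(val[1], val[2]) }
             | NUMBER                 { result = val[0] }
end
```

Running racc on a file like this (for example, `racc mini_math.y -o mini_math_parser.rb`) would then generate the table-driven parser file.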
And it's true, for some simple string input you might just use a regular expression. But has anyone here actually tried to validate a URL using a regular expression? Because if you did, you might have ended up with something like this, which is terrible. You pretty much never want something like this in your code, because it's hard to read, definitely hard to maintain, and probably doesn't even cover all the cases that you need it to. The reason for this is that there's a really important distinction between languages, even string languages, to take into consideration, and that is that all languages belong to a hierarchy. The established hierarchy for languages was first introduced by Noam Chomsky, and it has four types. Type zero is unrestricted, and this is the category that most natural languages fall into. These are languages that don't have regular or consistent grammars and can therefore be ambiguous, as we saw with all those headlines and garden path sentences. Type one is called context-sensitive, which we don't really care about for the purposes of this talk. But then we get to types two and three. Type two languages are what are known as context-free languages, based on context-free grammars. Most programming languages are type two, but you don't even have to be a language as complicated as Ruby to be considered context-free, as we'll see in a minute. And finally, type three languages are what are called regular languages, and these are the only kinds of languages that can be properly parsed by a regular expression engine. But what does it really mean for a language to be either regular or context-free? For the most part it just boils down to your grammar rules. For regular languages, all grammar rules can be expressed as a single non-terminal symbol on the left-hand side and either a single terminal or nil on the right-hand side. The terminal can sometimes have a non-terminal either before it or after it, but not both.
So you can have a rule such as A becomes xB, or A becomes Bx, but you can't have both. Context-free languages, on the other hand, have much more flexible grammar rules: while the left-hand side is still a single non-terminal, the right-hand side can be a sequence of terminals and non-terminals in any order. And as it turns out, most languages aren't regular. Let's take a really simple example language that you might think you could just parse with a regular expression. This is the "ab" language, which consists of a series of a's followed by the same number of b's. Valid sentences are ab, aabb, aaabbb, and so forth, whereas an invalid sentence would be something like aa, where you don't have any b's, or vice versa, or where they're interspersed, or the counts don't match. If we were trying to write a grammar for a language like this, you might start with something like: a non-terminal A becomes the string "a," and similarly a non-terminal B becomes the string "b." Then you can construct a sentence as A followed by B. But that only produces the single sentence ab. In order to produce anything else, you might add rules like sentence becomes aabb, then sentence becomes aaabbb, and so forth, ad infinitum, and that doesn't work, because then you have an infinite number of grammar rules. So what's the grammar? Thanks to parser generators, you don't normally have to do much more than come up with the grammar, but sometimes the grammar itself can be a little bit tricky. This particular problem actually reminds me of a logic problem that I'm sure some of you have heard before. The problem is: there are a bunch of gnomes in a cave that have been kidnapped by an ogre, and some of them have their foreheads painted red, to be eaten right away, and some of them have their foreheads painted blue, to be eaten later.
However, the ogre gives them one chance at saving themselves, which is that the gnomes are allowed to exit the cave one by one, and without communicating or speaking to each other in any way, they have to sort themselves into a contiguous line with the red-painted gnomes on one side and the blue-painted gnomes on the other. They don't know what's on their own foreheads, because the cave is pitch black. Does anybody know the answer to this one? I know some of you do. The answer is that the gnomes, as they exit the cave, insert themselves in between the red and blue ones. And if you think about it, this is actually how this grammar ends up working. To use BNF notation, a sentence, or statement S, is basically an "a" and a "b" with another statement token inserted in between them: S becomes a S b. And if we specify that the statement token can also be nil, that gives us the base case we need, and these two rules cover all cases of the deceptively simple-seeming ab language. Remembering that a grammar rule for a regular language can't have a sequence of terminals and non-terminals on the right-hand side, we see that this language is already a context-free language. So as we've seen, you don't really need that much complexity before you get out of regular-expression-parsable territory. Because of that, you're usually going to be better off figuring out the grammar rules for the thing you're trying to parse and using a parser instead. This also ensures greater accuracy in what you're trying to parse. For example, there are a number of web servers that use a real parser for HTTP requests, such as Mongrel, whose parser is built with Ragel, or Jetty in Java. But even as recently as last week, there was actually a vulnerability uncovered in Jetty, which again is a Java server, that leaked debug information in the HTTP response, which is bad. The other thing about parsers is that they're often a much faster way of getting what you need done.
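The S becomes a S b (or nil) grammar from the gnome trick translates almost line for line into code. Here's a hedged Ruby sketch (the method name is my own) of a recognizer for the ab language, the thing a classical regular expression can't express:

```ruby
# Recognizer for the a^n b^n language, straight from the grammar:
#   S -> "a" S "b"   (strip one "a" off the front and one "b" off the end)
#   S -> nil         (the empty string is a valid base case)
def ab_language?(str)
  return true if str.empty?                      # S -> nil
  str.length >= 2 && str.start_with?("a") && str.end_with?("b") &&
    ab_language?(str[1..-2])                     # S -> "a" S "b"
end
```

So `ab_language?("aabb")` is true while `ab_language?("aab")` is false. (As an aside, Ruby's regex engine actually supports recursion with `\g`, so it can go beyond regular languages; the point stands for classical regular expressions.)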
For example, in a web framework, if you use a parser for your routes, then you can build a tree for your routes, which makes your lookup much faster than a linear look through every route. Extra credit points for whoever can tell me the complexity of that. Of course, parsers are hard to write, but thanks to parser generators, all you need to figure out is the grammar. There were a ton of resources that I used to prepare for this talk, and I couldn't list them all on one slide, but here are two links if you actually want to see an implementation of an actual recursive descent parser (this one happens to be in Python) and one for shift-reduce parsing. Once again, I am Hsing-Hui, SoManyHs on Twitter; on GitHub, Elfers. I am a Rails developer at CareZone, and we're hiring, so if you want a job, come talk to me. Thank you.