been an excellent conference so far. We have had some amazing speakers and some very enlightening topics. I love that Mountain West always seems to hit all over the place. You get systems programming, you get web programming, you hear about people's pet projects. This afternoon we're going to hear about some cool stuff to do with different kinds of systems. So it's really awesome to be here. I just want to start off by saying that.
Today I'm going to talk about parsing expressions, and specifically using them in the Ruby programming language. By a show of hands, can I see really quickly how many people are familiar with the idea of parsing expressions, or have heard about them or used them before? How many people have used a library like Treetop? Okay, excellent. So we're speaking to a pretty good audience here.
I just want to talk a little bit about myself. I love Ruby. I've been using Ruby for about three years now. I work for Path in San Francisco. Path.com: we're a small startup working on an iPhone app and a social network, and I handle all of our website and our APIs and everything.
So I know something about you, a dirty little secret. Most people in this room have a very dirty little secret that's hiding out somewhere deep in the bowels of their app. And it looks something like this. How many people have one of these lurking deep in the bowels of their app? Okay. So first of all, what does this do? It parses a URL. Man, we've got some geniuses here. I didn't figure that out. It parses URLs. All right, now we know what this does. This particular example came from daringfireball.net. John Gruber wrote this one. It's supposed to be pretty powerful. I worked on an app one time that had something that looked like this, and I was like, what? So it started with HTTP, an optional S, or FTP.
So apparently we're supporting FTP as well with this regular expression. And when I saw this thing, I was like, yeah, man, I guess, whatever. Hopefully whoever did that knew what they were doing, because I sure as heck don't know what that's doing right there.
So then we're cruising along, developing our app, and one day we hit a bug. And I was like, what's the problem here? It's having problems validating a URL. And the URL looks something like this, right? It was somebody's Blogspot URL that had a long subdomain. Any guesses on the performance of this regular expression on this URL? Any guesses on how long it actually takes to validate this URL? 12 seconds? 97 seconds. I was like, man, why is this failing? We've got our Unicorn processes, and they're just being slaughtered after a minute, right? Because if they run for a minute, the master is like, what the heck are you doing? Something is terribly wrong. Die. So every time somebody tried to save a user account with a Blogspot address like jimmyandjane.blogspot.com, they were getting a 500 because the server just died. I was like, man, what is going on? It's this beast that's hiding out deep in my app, and I have no idea what the heck this thing is doing.
Beware if you're using something like this, because I know that most of the time the tendency is to say, I need to validate a URL, I need to validate an email address. Solved problem, right? People on the internet have done that before. Google, help me out. "Validate URL Ruby." Ah, there it is. Copy and paste into my app. Lunch, right? You're done for the day.
So let's take a step back and talk a little bit about the problem we're trying to solve with these regular expressions. What is the core of the problem? It is text. We've got text from all sorts of different sources.
We've got text coming in from our users when they're filling out forms on our website. We've got text coming in from web services, whether I'm pinging some API for XML or JSON or I'm reading a config file that's got YAML in it. I've got text coming back from a service like Redis that's just sitting on a socket and sending me back binary data. I've got to parse that. I've got all these different sorts of text coming back. And for some, I have really good parsers, right? For XML, for JSON, for YAML, the standard formats, I have awesome parsers. I can just pull out my Nokogiri, or my YAJL or YAML or whatever, and get the job done.
But unfortunately, for all the other cases, we don't really know what to do. It's like, oh, I guess I'll just use a regex, right? I'll just use my regular expression to parse that stuff, because I don't really have a good parser for it. And it really has become a hammer for most of us, because we just see a nail there. It doesn't matter if the nail is made of glass or steel or whatever, you just whack it and hope that your tool works. But the thing is, regular expressions really weren't designed to do a lot of the jobs that we're doing with them.
So what are the alternatives? I'm guessing that most of us just use regular expressions all the time, and I'm not saying you should never use them. They're obviously really good for some things. They're a very powerful tool. But for certain other types of text, we need something more powerful. For example, who has read RFC 3986, the spec that talks about how to actually build a URI? It's pretty lengthy. The grammar for validating a URI is actually pretty lengthy.
And so what happens when somebody tries to build a regular expression to parse that? They open up the spec and they're like, oh man, I don't know what I'm going to do with this. Here, this regular expression works for the majority of the cases. And then they move on. But what we really would like is a tool that enables us to take a spec like that, something that we find in an RFC, something that's written in an ABNF-style syntax, and turn it into code, right? So there's no guesswork on the way from the RFC to the code. So my parser isn't just somebody's good idea. It's actually something that was specced out beforehand.
So let's talk about a couple of qualities that we would like when we're building our parsers. Obviously we want it to be fast. We don't have time to just parse stuff, right? That's only the beginning of what we need to do. We just want to parse it and get on with whatever we want to do with that data.
We want it to be simple, because if I cannot read it, I cannot maintain it, and I have very little understanding of the code that is in my app. Maybe some of you can understand 1,200-character regular expressions, but I don't get them very well.
Here's a fun one. I want it to be modular. Lots of times I'm reading a spec, the URI spec for example, and it says a host could be either a canonical host name like www.facebook.com, or it could actually be an IP address. Well, what is an IP address? An IP address is an entirely different spec, right? So I'm thinking about this in Ruby land. Ideally I've got an IP address module and a URI module, and I just want to include the IP address module into my URI module. I want them to be separate things so that I can use them for different cases and then combine them as needed.
Also, I want this thing to be flexible, right?
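To make the modularity point concrete, here is a minimal sketch in plain Ruby, not Citrus itself. The module names `IPv4Rules` and `HostRules`, the `Checker` class, and the simplified host-name pattern are all hypothetical and far looser than the real RFCs. The only thing it is meant to show is one set of rules being included into another.

```ruby
# Hypothetical rule modules; names and patterns are illustrative only.
module IPv4Rules
  # True when text is four dot-separated decimal octets in 0..255.
  def ipv4?(text)
    parts = text.split(".", -1)
    parts.length == 4 &&
      parts.all? { |part| part.match?(/\A\d{1,3}\z/) && part.to_i <= 255 }
  end
end

module HostRules
  include IPv4Rules # reuse the IP address rules instead of copying them

  # A host is either an IPv4 address or a (simplified) dotted host name.
  def host?(text)
    ipv4?(text) || text.match?(/\A[a-z0-9-]+(\.[a-z0-9-]+)*\z/i)
  end
end

class Checker
  include HostRules
end

checker = Checker.new
checker.host?("192.168.0.1")      # true, via the included IPv4 rules
checker.host?("www.facebook.com") # true, via the host-name branch
```

The IP address rules stay usable on their own, and the host rules pick them up with a single `include`.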
I don't want it to be very rigid. In other words, when I pattern match on a string and I get back an A, if I'm parsing sentences in the English language, I know that that A should be interpreted literally. But if I'm parsing hexadecimal numbers, I know that that A isn't a letter at all, it actually represents a number. So I want to be able to change the interpretation based on what it is I'm parsing, okay?
So these are some high and mighty requirements for a parser. Let's look at this again. Is it fast? No. It depends, right? In this case, if you don't have a super long subdomain, yeah, it'll run pretty fast. Is it modular? Can I take this and include it into something else, or can I reuse any of this? Or is it just sort of dead? Is it flexible? Do I have any control over how the tokens that it parses are going to be interpreted, without writing a whole bunch of other code that goes in afterwards and says, okay, if match zero equals this, then this, otherwise this? I have to do a lot of work at the end to actually use this thing.
So let's talk about an alternative: parsing expressions. They were first discussed at MIT by Bryan Ford in 2004. They are a declarative alternative to a generative style like regular expressions, and other CFGs in general. You can find out more about them at these two URLs. And don't worry, I'm posting all these slides and everything after today.
Let's do a quick comparison of parsing expressions versus regular expressions. Parsing expressions are declarative, regular expressions are generative. I'm not going to go into too much detail there, but the difference is actually pretty important. Parsing expressions are able to be recursive. Regular expressions are not able to be recursive, which is one of the key reasons why parsing expressions are pretty good at parsing computer formats.
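The flexibility requirement from the start of this part, the same matched A meaning different things in different contexts, can be sketched in plain Ruby. The `LetterValue` and `HexValue` modules are hypothetical names made up for this illustration; they are not part of any library.

```ruby
# Two hypothetical interpretations of the same matched text.
module LetterValue
  def value
    to_s # in English text, "A" is just the letter
  end
end

module HexValue
  def value
    to_s.hex # in a hexadecimal number, "A" stands for ten
  end
end

as_letter = "A".dup.extend(LetterValue)
as_hex    = "A".dup.extend(HexValue)

as_letter.value # "A"
as_hex.value    # 10
```

The matched text is identical in both cases. Only the module mixed into it decides what it means.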
A lot of computer formats are highly recursive. You have structures nested within other structures. Parsing expressions are pretty readable. Regular expressions, not so much. Parsing expressions are pretty easy to maintain. Regular expressions are kind of difficult to maintain once they grow past a certain size, right?
This next comparison refers to how they go about parsing your data. A parsing expression is not ambiguous. In other words, as you're reading the tokens in a stream, if you see a certain token, you make a decision to go from that token in a certain direction. Whereas a regular expression might see that token and then, depending on what the next token is or what the hundredth token from there is, it may have to backtrack and take a different route. Parsing expressions are actually pretty fast. Regular expressions, for most cases on small text, are faster, right?
So I was looking into this idea. I was like, yeah, I really want to use parsing expressions. And I went to the Treetop library. For a couple of reasons that I won't go into in detail here, I just didn't like Treetop. To me, it seemed very slow. It wasn't being actively maintained. And it was very difficult to deal with the results that I got back. There was just too much code there. I mean, we're just parsing text. This should be done in maybe a thousand lines of code. So what I actually did is I wrote a library called Citrus. You can download it with just gem install citrus, or you can go to this URL to find out more about it.
So let me go back one slide. I skipped over this slide. This is the way that Citrus specifies its parsing expressions. You've got a grammar, in this case called Addition. It looks very much like Ruby code. But don't be fooled, it's not actually Ruby code.
It just looks like Ruby code to make you feel comfortable. So you've got two rules in this grammar. One is called addition, one is called number. Now the key ability of a rule is to be able to refer to another rule by its name. So in this case, addition can refer to number by just saying number. A lot of the syntax is very similar to what you'll find in regular expressions, to make it familiar to you. So in this case, a number is any of the characters 0 through 9. I've got a little character class there. And an addition, we're going to say, is a number followed by one or more pluses and then another number, right? It's pretty simple. So this is what I want my grammars to look like. And I'm going to go into some more advanced examples of how we can take RFCs, put them into grammars like this, and build parsers from them automatically.
So let's go briefly through the Citrus syntax. It's going to be kind of a whirlwind. Most of it should be pretty familiar to you. Like I said, it looks like Ruby code, but it's not actually Ruby code, so we need to know how Citrus is going to parse these files when we write stuff in there. We can use double-quoted strings to match a string of characters, or likewise single-quoted strings. We can use a special kind of string called a case-insensitive string. Just put it in backticks and it will match the lowercase hello world or any variation of that case. That's pretty useful for lots of different RFCs that say something like the case is not important or not significant. You can put in escaped characters. These double-quoted strings are interpreted exactly like Ruby strings, so I can put a hex character in there. I can put it in these case-insensitive strings as well. I can actually stick a regular expression inside of it as a last resort.
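The two-rule Addition grammar described at the top of this part can be approximated by hand in plain Ruby with the standard library's `StringScanner`. This is only a sketch of the idea of one rule calling another by name, not what Citrus generates. It assumes the grammar reads as `number ('+' number)+` with `number` being one or more digits, which is one interpretation of the slide, and the method names are made up for the example.

```ruby
require "strscan"

# number <- [0-9]+   (assumed shape of the rule)
def number(scanner)
  scanner.scan(/[0-9]+/)
end

# addition <- number ('+' number)+   (assumed shape of the rule)
def addition(scanner)
  return false unless number(scanner)

  count = 0
  loop do
    start = scanner.pos
    if scanner.scan(/\+/) && number(scanner)
      count += 1
    else
      scanner.pos = start # undo a dangling "+" before giving up
      break
    end
  end
  count >= 1
end

# True when the whole input is a valid addition.
def addition?(text)
  scanner = StringScanner.new(text)
  addition(scanner) && scanner.eos?
end

addition?("1+2")     # true
addition?("12+30+4") # true
addition?("1+")      # false
```

Just as in the grammar, `addition` refers to `number` simply by calling it.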
If there's something where I just really need to use a regular expression, because I know how to do this already, because I'm familiar with regular expressions, I can. So Citrus is not a purist approach. It's more of a hybrid approach. If I've got a little bit of text that I really want to match quickly with a regular expression, I can drop down to a regular expression primitive and get the job done. You can put flags on it too, I think.
Character classes are what you would expect. In this case, we would match a hexadecimal character. Same thing, but with hex characters. They're exactly like you would expect. They're interpreted just like Ruby. A dot matches any character.
So when we get to repetition, the star means zero or more. Now you can put a number on either side of the star. The number on the right means the maximum number of times this may match. The number on the left means the minimum number of times this must match. So those two cases, *1 and 1*, you're probably used to seeing as question mark and plus, which Citrus also supports. So you can specify your repetition in this way. In this example, we would match the string zero or more times. Pretty elementary stuff. In this scenario, we would match the string hello world a minimum of two and a maximum of three times.
So this is a sequence. If I want to say you must match something immediately followed by something else, I just separate them with some white space. So I can say match the string hello, white space, then match the string world. I also have ordered choice. So I can say match ABC or match DEF. This is ordered, now. So remember, once I match ABC, I'm not going to go back later and try to match DEF. I'm matching ABC and continuing from there. That way the parse tree is not ambiguous. For any valid parse, there's only a single route through your grammar, which actually helps it to be quite fast.
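Sequence, repetition with a minimum and maximum, and ordered choice can be sketched with a few lambdas in plain Ruby. This is not how Citrus is implemented. The helper names `str`, `seq`, `choice`, and `repeat` are hypothetical, and a "parser" here is simply a lambda that takes the input and a position and returns the new position, or nil on failure.

```ruby
# Match a literal string at the current position.
def str(text)
  ->(input, pos) { input[pos, text.length] == text ? pos + text.length : nil }
end

# Sequence: every parser must match, one right after the other.
def seq(*parsers)
  ->(input, pos) { parsers.reduce(pos) { |at, parser| at && parser.call(input, at) } }
end

# Ordered choice: commit to the first alternative that matches.
def choice(*parsers)
  lambda do |input, pos|
    result = nil
    parsers.find { |parser| result = parser.call(input, pos) }
    result
  end
end

# Repetition with a minimum and maximum number of matches.
def repeat(parser, min: 0, max: Float::INFINITY)
  lambda do |input, pos|
    count = 0
    while count < max
      nxt = parser.call(input, pos)
      break if nxt.nil? || nxt == pos # stop on failure or an empty match
      pos = nxt
      count += 1
    end
    count >= min ? pos : nil
  end
end

seq(str("hello"), str(" "), str("world")).call("hello world", 0) # 11
repeat(str("ab"), min: 2, max: 3).call("abababab", 0)            # 6
repeat(str("ab"), min: 2, max: 3).call("ab", 0)                  # nil

# Once "ab" has matched, the choice never goes back to try "abc".
committed = seq(choice(str("ab"), str("abc")), str("d"))
committed.call("abcd", 0)         # nil
/\A(?:ab|abc)d/.match?("abcd")    # true: the regex backtracks into "abc"
```

The last two lines are the point about ordered choice. The regular expression backs up and tries the second alternative, while the ordered choice has already committed to the first one.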
In this example, I can use parentheses to group things together. So I can say A or B, one or more times. Pretty simple stuff. I also have positive and negative lookahead. So this will say match the string hello followed by the string world, but don't consume the world portion of the string. Just consume the hello portion. This one is a negative lookahead. It says match any character as long as an ABC does not start there. So as the parser is going through the string, it considers this character and says, is there an ABC here in the input? If so, I'm not going to match. Otherwise I get a match and consume that character.
This is a common idiom. For example, this would match one or more characters that are not the start of ABC. So if you want to say match all characters up until you see this sequence of characters, match anything up until you see ABC in the input string, you can do something like this. And since it's so common, I reduced it down. You can just use the tilde operator, which is kind of common in some parsing expressions, to say match one or more characters up until you see ABC.
So let's talk about what I mean when I say a match. Matches are built into trees. A match is essentially a node in this tree, and it may have any number of sub-matches. These are all lazily instantiated so that the thing is pretty fast. All I have to do is say try to match on this string. It'll build one node, and that node knows how to build all of its child nodes, so that when I query it I can extract information from it and interpret what that node means in the context of my application. It can dynamically build that information out. So the initial parsing is pretty fast.
So let's take this example. This is a visual of how these matches are organized into trees. I start with this string, one, space, plus, two.
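The lookahead operators and the "match everything up until ABC" idiom from the start of this part can be sketched in plain Ruby the same way. Again this is a hand-rolled illustration rather than Citrus's implementation, and `str`, `ahead`, `not_ahead`, `any_char`, and `up_to` are hypothetical names. A parser is a lambda from an input and a position to a new position, or nil.

```ruby
# Match a literal string at the current position.
def str(text)
  ->(input, pos) { input[pos, text.length] == text ? pos + text.length : nil }
end

# Positive lookahead: succeed only if parser would match here; consume nothing.
def ahead(parser)
  ->(input, pos) { parser.call(input, pos) ? pos : nil }
end

# Negative lookahead: succeed only if parser would not match; consume nothing.
def not_ahead(parser)
  ->(input, pos) { parser.call(input, pos) ? nil : pos }
end

# The dot: any single character.
def any_char
  ->(input, pos) { pos < input.length ? pos + 1 : nil }
end

# The idiom (!stop .)+ : one or more characters, stopping before stop.
def up_to(stop)
  guard = not_ahead(stop)
  dot = any_char
  lambda do |input, pos|
    start = pos
    pos += 1 while guard.call(input, pos) && dot.call(input, pos)
    pos > start ? pos : nil
  end
end

pos = str("hello").call("helloworld", 0)   # 5
ahead(str("world")).call("helloworld", pos) # 5: world was seen, not consumed
up_to(str("abc")).call("xyzabc", 0)         # 3: stops right before "abc"
up_to(str("abc")).call("abc", 0)            # nil: needs at least one character
```

The lookaheads return the position they were given, which is what "don't consume" means in practice.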
It might be broken down, depending on my grammar, into this subtree. I might have a one-space, which is further broken down into a one and a space, and I might have this two over here that's matched by some other rule. The cool thing about these matches, these nodes, is that since I'm in Ruby and these are just Ruby objects, I can extend these objects with any Ruby module that I want. So I can say I've got a node and I want to extend it with some methods that are going to help me extract information from that node. They're going to help me interpret it. So Citrus not only has to do with the parsing of text, it also goes a step beyond and allows me to define these modules. I call them semantic blocks. It's just a block of code that's going to get extended onto your node in your match tree, so you can call methods on that node and it becomes much more useful to you instead of just being a little string of text.
So let's get into some code. This is a classic example of something that you just can't do with a regular expression. This is a grammar called ParenChar, and my first rule here is paren_char. Basically I'm just trying to match an open parenthesis followed by something else followed by a close parenthesis, right? That something has a label on it, so I'm going to call it letter. What I call letter is either another paren_char or a char. A char in this case is just any letter a through z. So you can see this is going to recurse to find a single letter that is nested in any number of balanced parentheses. This is a classic example of something that regular expressions just can't do, because they're not able to be recursive. So I'm going to take that expression there, surround it with some parentheses to group it, and apply this semantic block to that entire expression. I'm going to say the value of this block is letter.value.
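Here is a rough plain-Ruby sketch of that recursive rule and of extending a match with a module, assuming the grammar reads as `paren_char <- '(' letter:(paren_char / char) ')'` and `char <- [a-z]`. The `Match` struct, the `ParenCharValue` module, and the method names are hypothetical stand-ins for illustration. Citrus builds its own match objects and does this work for you.

```ruby
require "strscan"

# Hypothetical stand-in for a semantic block: methods mixed into a match.
module ParenCharValue
  def value
    letter.value # delegate to whatever the labeled "letter" matched
  end
end

# A bare-bones match node: the text it covers plus an optional labeled child.
Match = Struct.new(:text, :letter) do
  def value
    text # the default value is just the raw matched text
  end
end

# char <- [a-z]
def char(scanner)
  text = scanner.scan(/[a-z]/)
  text && Match.new(text)
end

# paren_char <- '(' letter:(paren_char / char) ')'
def paren_char(scanner)
  start = scanner.pos
  if scanner.scan(/\(/) && (letter = paren_char(scanner) || char(scanner)) && scanner.scan(/\)/)
    text = scanner.string[start...scanner.pos]
    return Match.new(text, letter).extend(ParenCharValue)
  end
  scanner.pos = start # failed: rewind so the caller can try something else
  nil
end

# Parse the whole input, or return nil when it is not a valid paren_char.
def parse_paren_char(text)
  scanner = StringScanner.new(text)
  match = paren_char(scanner)
  match if match && scanner.eos?
end

parse_paren_char("((a))").value # "a"
parse_paren_char("((a)")        # nil: unbalanced
```

Calling `value` on the outer match recurses through every nested paren_char until it reaches the char, whose default value is its own text.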
What is letter? Letter is a paren_char or a char. In the case of paren_char, it's just going to recurse and call this block again on that paren_char. In the case of char, it's just going to give me the letter. This value method is defined any time you attach a semantic block to something. The default value for something is just its string value. A node is just a piece of the original input string, and so its default value is just its raw string value.
So let me go through one more example, and then we're going to do some live code, which everybody's advised me not to do, but I think it's more fun. How many people are looking forward to live code? So now watch it happily fail.
So this is the grammar that we already saw, Addition. It's a number followed by one or more pluses and a number. In this case I don't want to get back a string of text for a number, since it's a number. So what I want to say is, when I get that string of text, call to_i, and that's the value of a number. Now that I'm getting an integer back from my number rule, I'm going to tack a semantic block onto the addition rule. In case you had trouble following that transition, this is the code that I tacked on. The whole thing is in parentheses, and this is the value of an addition. It takes all of your numbers, so this captures method goes deep and finds all of the times that number matched. And it basically just says add them all up: inject into zero and add the value every time you come across an n, which would be a number node.
So let's pop into the terminal and see some of this stuff in action. Let me show you here, I've got some Citrus files. These Addition and ParenChar files are exactly the same ones that we just went through.
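The "call to_i on each number, then inject into zero" step can be sketched in plain Ruby without Citrus. The method names `addition_captures` and `addition_value` are hypothetical, and the input shape (digits separated by single plus signs, with at least one plus) is the same assumption about the Addition grammar as before.

```ruby
require "strscan"

# Returns the integer value of every number in "n+n+...", or nil when the
# input is not a valid addition.
def addition_captures(text)
  scanner = StringScanner.new(text)
  first = scanner.scan(/[0-9]+/) or return nil
  numbers = [first.to_i] # the value of a number is its text run through to_i
  while scanner.scan(/\+/)
    digits = scanner.scan(/[0-9]+/) or return nil
    numbers << digits.to_i
  end
  numbers.length >= 2 && scanner.eos? ? numbers : nil
end

# The value of an addition: start from zero and add each number's value.
def addition_value(text)
  numbers = addition_captures(text)
  numbers && numbers.inject(0) { |sum, n| sum + n }
end

addition_value("1+3")   # 4, an Integer rather than the string "4"
addition_value("1+2+3") # 6
```

Because each number has already been turned into an Integer, the addition only has to fold them together.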
So let me fire up IRB. Okay, so I've got Citrus here. I'm running on Citrus HEAD, which I actually just committed to like 10 minutes before walking into this room, so it's nice and fresh. So let's load that Addition grammar: Citrus.load addition. In this case it's just going to look in the same directory and load up my Addition grammar. It gives me all the grammars that it finds in that file in an array. Now grammars in Ruby are really just Ruby modules, so I can take a grammar and include it in another module, whatever.
So let's say my match is going to be Addition.parse. Now we said it was a number followed by one or more pluses and then numbers. So I'm like, great, that matched. What is the value of that match? It should be four, four the integer, not four the text, right?
Let's load our other grammar, ParenChar, and let's say match is ParenChar.parse. I'm going to say a, but I'm going to make a mistake. This thing shouldn't match. Let's check it out. This is one of the nice things about Citrus that I just couldn't find anywhere else: it's going to give readable error messages. Because as programmers we are not perfect, and we mess up stuff all the time, especially when playing with new technology, so we need a good, readable error message. So check it out here. I've got: failed to parse input on line one at offset two of the input. That's a zero-based offset. So that tells me, oh okay, that input was invalid, right? So let's parse some valid input. Now the value of the match is just supposed to be letter.value, and I get back the letter a, right? So these are pretty elementary examples of these kinds of things. Let's get out of here.
Okay, so Citrus actually ships with some pretty cool examples in the examples directory. So here it is. I've got a couple of examples, all the Citrus files with their accompanying tests. So if you're wondering how would I actually use this
thing, it's got an accompanying _test.rb file that shows how you would actually use it from the beginning, right? So I've got a couple of them. For example, this calc example is actually really cool. Okay, so I've got my Calc grammar. This is designed to interpret mathematical expressions exactly as Ruby's interpreter would. So I can use any of Ruby's integers or floats or numbers with underscores in them or any of that weirdness that Ruby does, and I can use any of the operators. White space will be appropriately ignored, and operator precedence will work right. And again, this has a whole suite of tests that go along with it, so I invite you to try to break it if you can. That would actually be really cool, because it would get better.
So let's fire up Citrus again and I'll show you how that works. Let's load up that calc example. Citrus also has a require function on it that searches your load path. That's kind of cool, because then you can use it just like Ruby's require, but for Citrus files. So I've got this Calc module. Something that's cool about these that I haven't shown you yet is they're actually pretty inspectable, so they're pretty good to use at runtime. I can say Calc.rules, and, let me do this so that it's clear, I can see that in the grammar Calc here are the rules, and here's what their rule definitions look like. Gosh, that's so small. So if I want to know what a rule is even doing, I can inspect it right in my terminal with Calc.rules.
So let's first get an expression. I'll say the expression is going to be 1 plus 6 to the 6th modulo 2 times 8.5. Okay, so I can say eval expression, and I get 1.0. Wow, how did I do that? Ah yes, I
told you. Let me change that. There you go, that's some weird number, right? Anybody recognize that number? Okay, so now let's say we want to do this with Citrus. So let's say match equals Calc.parse of the expression. We want to know the value of this, right? It should be that weird number. Okay, so it's pretty cool. You can actually play around with parsing. What we're doing is taking this text, scanning through it, parsing individual tokens, and then interpreting them exactly like Ruby would.
So last night, at the hack session, I was thinking about this problem with parsing URIs and doing URI validation, and I was like, there's a spec for this, why don't I just code it up into a Citrus grammar? So let's take a look at the one that I did last night. How long did it take me? 25 minutes to take an RFC and basically stick it into code that can generate a parser from the RFC, right? So there's no guessing. There's no, hmm, I wonder if this actually validates or not. It's spec as code, right?
So I hop into here. Notice that I've got this grammar. First of all I'm requiring the IPv4 address and IPv6 address grammars, because those are part of the URI spec, and then I'm including them in my UniformResourceIdentifier grammar. I called it that because URI was already used by Ruby's URI. And so I've got these rules, and you can see the spec definition right above the actual rule definition. I apologize that there's no syntax highlighting. I should actually get back into TextMate, because I did write a TextMate plugin so that you can get pretty syntax highlighting. I go back and forth with TextMate, and right now I'm on the TextMate kick. So I'm including these other grammars, and I'm just going through it and saying, here's the rule from the spec, here's what it looks like as Citrus code. It's very similar in most cases. And so I can
take this and do the same thing I did earlier. So let's see, Citrus, the URI example. Ooh, where am I going? No, what I need to do is set the load path, because when I say require at the top, it's actually going to look in the load path. So let's say load path, unshift, File.expand_path. Then I'm going to try that again. Yes. Okay. So let's take this and let's say, what did I name this module, UniformResourceIdentifier.parse, with that long subdomain. Let's see. Yeah, it works, and it didn't take 97 seconds.
I'm sure some of you are going to have questions, like how am I actually going to use this in my app, or when is it a good idea versus a regular expression. How much time do I have left? I'm over? Cool. Just remember to always use parsing expressions for the right job. Otherwise you're going to end up like that scene, how many of you remember that scene? It's despicable. And he's like, that doesn't look good on you. Okay, thank you very much.