Let me introduce myself quickly. My name is Michael Jackson. I'm not dead, as you may have thought. I'm here. I am living and working right now in San Francisco for a company called Path. We are making an iPhone app, and I do the website slash Rails backend API stuff for our app. So if you haven't heard of us yet, you may soon. We're about to launch, and that's pretty exciting. So I want to talk briefly today about a problem that we all have to deal with. And that is text, right? We deal with text all the time. It's coming in to our systems. It's going out of our systems. We're parsing it. We put it in a file. We read it back out. We parse it again. We package it up and send it over the wire to somebody else who parses it and tries to understand it. We're always dealing with text in our computer systems. And one of the most common tools that we have to deal with text is this. So who can tell me what this does right here? Look, there's a www in there, I think. It's basically a really readable and extremely convenient way to parse a URL. I actually pulled this from Daring Fireball. This is John Gruber's best shot at giving a nice, liberal regular expression for parsing a URL. Truth is, it's really ugly, actually. So it's kind of turned into a hammer, right? We need to parse text a lot. And so we start off parsing little bits of text with a regular expression. See if something matches this regular expression. Great, it does. We can pull it out and we can move on. We can shove it into the database. We can create an object from it, right? But then when it comes to significant amounts of text, like, I don't know, a URL, or an email, or an iCal invitation, or an event, or a microformat, or something like that, where we've got to parse a non-trivial data set, regular expressions kind of start to break down. They don't scale very well at all. And they're difficult to read.
They're difficult to maintain, especially with a large data set. So I'm going to make the argument today that we should probably not use regular expressions when we're parsing non-trivial amounts of text. And I'm going to propose an alternative for doing this sort of thing that I think you'll find very attractive and pretty fun to use as well. But first, I want to define a couple of terms that I'm going to be using throughout the rest of the talk. So a parser, what is a parser? I just want to make sure that everybody's on the same page. When I say parser, what I mean is a lexer. It's just barely smarter than a string scanner. A string scanner reads a string character by character. A parser is able to group that string, group characters into tokens or recognizable pieces. But it's not very much smarter than that. That's all a parser does. So somehow we have to tell the parser what pieces it should recognize, which pieces of this stream of characters are important, which pieces we want. That's where our grammar comes in. So a grammar is basically just a well-defined set of rules for a subset of some data. So I could say a grammar for the English language, for example, dictates that proper nouns should start with a capital letter and that sentences should end with a period. That's just an example. The cool thing about grammars is that you can actually take grammars and combine them, combine these various areas of domain knowledge, to increase your ability to parse. Once again, an example with natural language would be: I know English, and I am also familiar with speech patterns of the Southern United States. So when I get to New Orleans and somebody says "N'awlins", I can understand, I can parse that. So I have these different grammars, these different sets of rules by which people speak. And I know them both, and so I can increase my ability to parse that information.
Another good example is just straight up ERB. ERB is not a Ruby interpreter. It's not an HTML interpreter, if you will. But it does kind of straddle the line. When it reads a document, it knows what is HTML and what is Ruby. And it passes the Ruby off to the Ruby interpreter and puts the result back in the document, and then you get your HTML document. So it combines these two grammars to get its job done. But that's still not enough to actually execute a program. What we need on top of that is an interpreter. We need an object that is smart enough to know what these symbols and these tokens actually mean. When I get them out of the input, what do they mean? What am I going to do with them? An example of this in Rubyland is the Ruby interpreter. But you could also have a tool like RDoc that will read the same file that the Ruby interpreter reads. When the Ruby interpreter reads it, it interprets it in a way that it executes that code. But when RDoc reads the file, it doesn't execute the code at all. It's just looking for class names, method names, and blocks of comments about them. So its interpretation of that file is, I need to go and write something about this. There's documentation here that I need to extract. So multiple interpretations can be had for the same set of data. So ideally, our goal when we're parsing this text is that we want to be able to do all of this. We want to be able to take some text. We want to be able to parse it pretty quickly and efficiently because we don't have a lot of time to spend parsing our text. And then we want to be able to take grammars and combine them in meaningful ways, depending on what I want to get out of that text. And we need to be able to interpret this text based on the context that we're in, based on our end goal for our program. So these are the goals that I would describe for a really good text parsing slash interpreting tool. So let's head back to this. Does this help me do any of that?
Just one really, right? It helps me to parse out the text that I want. Is it easy to describe to this regular expression the various grammars that I have? Various sets of rules, various requirements. Say I need to parse an email. The email spec, for example, includes MIME, right? Which is its own spec. So if I was going to parse an email, it would be pretty difficult to tell this expression: you're supposed to parse email, oh, and by the way, when you encounter a MIME document, you need to head over here and use this grammar, because that's what describes MIME. Is it easy to change context in the middle of a parse or extract any information from this besides just strings? No. Usually when you're using a regular expression you're gonna parse out the text. You're going to put it in the database or put it in an object somewhere, and then you'll query that object, right? You'll ask that object, does your phone number have a valid prefix, or whatever. But you have to do some more work to actually interpret the tokens that you get out of this text. So the alternative that I'm going to describe today is called parsing expressions. Parsing expressions were first discussed by a man named Bryan Ford at MIT in 2004. He wrote a pretty interesting paper on it that you can find at these URLs here. But basically his premise is that there's a better way to parse highly structured data, which by the way is the kind of data that we're dealing with all the time in computer programs, right? We're not dealing with natural language, we're dealing with computer languages, we're dealing with formats like JSON and XML and you name it. They have specs, they're well-defined, right? So we should be able to parse them in a more efficient way. This table I'm just going to go down really quick, because most of you are probably familiar with regular expressions. I'm going to kind of compare the two, right?
So some of the terms here might be a little bit technical if you don't do a lot of things with language, but don't worry about it, it's just kind of a quick comparison. So parsing expressions are declarative. Regular expressions are generative. What that means is a parsing expression declares the strings that may be found in your input. A regular expression is able to generate all possible strings that might appear in your input. It's a subtle difference, but it is significant, as we'll see later on. Parsing expressions are able to be recursive, regular expressions are not. You can get some degree of recursiveness in regular expressions with back references, but it's not really recursion. Parsing expressions are readable, regular expressions are less so. Now, yeah, you can make the argument. I mean, I read regular expressions just as well as anybody. Yeah, they're readable to an extent, but you'll see that they're not very readable once they start describing the parsing expressions. Easy to maintain, difficult to maintain. I've already made that argument. Parsing expressions are not ambiguous. Now, this is kind of a technical one. What that means is when you parse something with a parsing expression, there is only one valid tree for that parse. In other words, there is only one route you can take through the grammar to achieve that parse. With regular expressions, well, not so much with regular expressions, but with context-free grammars, which is the family of thinking that regular expressions kind of come from, they use different methods to try and determine which match you want. For example, longest match. Any of you who have worked with Yacc or Bison will kind of understand that idea. So there's this idea of ambiguity, right? Which we don't need when we're parsing computer languages and computer formats. Of course, regular expressions are faster. They're implemented in C, but parsing expressions are not slow.
They're actually really, really fast. So this isn't gonna slow you down, so don't worry about that. I mean, unless you're doing like, and then I don't know why you're doing it in Ruby anyway. Okay, so the gems that we're talking about today. There are two gems that I am aware of that do parsing expressions in Ruby. The first one is Treetop. Who here has tried to use Treetop for a project? Right on, awesome. I've tried to use Treetop as well. It didn't go so well. I had a couple of problems with it, and so I wrote Citrus. I'm not here to bash Treetop. I just want to kind of discuss some of the differences between them and move right along. Treetop is a parser generator. What that means is you take your description of your grammar, you put it into Treetop, and Treetop generates a file for you, right? It's the old compilation step that you get when you do compiled languages. Citrus, on the other hand, is a parser combinator. I think personally that's better suited to dynamic languages. You kind of lose some of the advantage of using a dynamic language like Ruby when you generate Ruby code. And so people say, well, the reason you generate the code is so it can run as fast as possible, right? That's a valid argument. I think if you're generating byte code, you want to do that, but in this case, it doesn't apply. With the latest branch of Treetop and the latest version of Citrus, running Treetop's benchmarks on Ruby 1.9.2, Citrus is actually about 50% faster on my laptop. So you can take that benchmark with a grain of salt, but it's just kind of to prove the point that code generation is not the key to speed. The key to making an algorithm fast is the algorithm. It's not code generation. And so besides that, the differences are just in API and ease of use. Citrus has some really nice debugging tools and things built into it, so I'd encourage you to take a look.
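To make the generator versus combinator distinction concrete, here is a minimal sketch of the combinator style in plain Ruby. It does not use Citrus or Treetop, and it is not how either library is implemented; `str`, `seq`, and `choice` are hypothetical helper names made up for this illustration. The point is only that each rule is an ordinary object built at runtime, so there is no generated file and no compile step.

```ruby
# A rule is a lambda taking (input, pos). It returns the position after
# the match on success, or nil on failure. All names are hypothetical.

# Match a literal string at the current position.
def str(literal)
  ->(input, pos) { input[pos, literal.length] == literal ? pos + literal.length : nil }
end

# Match every rule in order, each starting where the previous one stopped.
def seq(*rules)
  ->(input, pos) { rules.reduce(pos) { |at, rule| at && rule.call(input, at) } }
end

# Try each rule in order and keep the first one that matches.
def choice(*rules)
  lambda do |input, pos|
    result = nil
    rules.each do |rule|
      result = rule.call(input, pos)
      break if result
    end
    result
  end
end

# Rules are combined like any other Ruby objects.
greeting = seq(choice(str("hello"), str("goodbye")), str(" "), str("world"))

p greeting.call("hello world", 0)
p greeting.call("hello there", 0)
```

Because `greeting` is just a value, it can be built, inspected, and recombined in an IRB session, which is the advantage of the combinator approach the talk is pointing at.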
For the rest of the examples today, the Citrus syntax is slightly different from the Treetop syntax, so Citrus is the one I'm going to be using for the rest of the presentation. Here's how you can get it. There's my website with a little link there. You can get some documentation on it and read about it. Okay, so let's talk about syntax. Let's talk about when you're writing your Citrus, when you're writing your grammar, right? How do you do this? So a string, a double-quoted string, hello world in this case, hello space world, is going to match hello space world. Pretty basic. Same thing with a single-quoted string. Citrus has case-insensitive strings. So if you enclose your string in backticks, it'll match any case of that string. Now these strings work just like regular Ruby strings. You can have hex-escaped characters in them and things like that. Same thing in the string with the backticks, the case-insensitive strings. You can escape characters, things like that. Also, when you do need to drop down to regular expressions because you just want to get something done, it's not the common way to do things in Citrus, but you can also use regular expression literals. If you just have something that you want to express very succinctly, which regular expressions are good at, right? They're good at not writing a lot of characters. So you can use the flags just like you would in Ruby. You can use character classes. In this case this would match a hex character, right? Any case. This obviously supports ranges. It's pretty intuitive. They work just like you would expect them to. There's the dot. The dot obviously is borrowed from regular expressions. It will match any single character, and in Citrus that includes newlines. So let's say, for example, I wanted to match zero or more of any character: dot star, right?
It's very similar to regular expressions so that you can get the hang of it quickly. This is a little bit different, however: this star number notation, or number star. Basically on the left side of the star is the minimum number of times it must match, and on the right side of the star is the maximum, right? So in the case when you have just the star, the minimum is zero and the maximum is infinity. When I have star one, the minimum is zero and the maximum is one. Likewise, with one star, the minimum is one and the maximum is infinity. But you can just abbreviate those as question mark and plus, which you're used to from using regular expressions again. So these stars will also work when you're using string literals. This rule here will match the string hello space world any number of times. This will match the string hello space world two or three times, you know, consecutively. Okay, so a sequence. If you wanna match one expression followed by another expression, just separate them by a space. So you see this rule here, this will match the string hello followed by the string world. Similarly, the vertical bar can be used to represent ordered choice. Now it's important that we talk about ordered choice. This goes back to being ambiguous and non-ambiguous. When the parser hits this rule, it's going to say, does ABC match? If yes, ignore the rest of the rule, match ABC and move on, right? If no, then we're going to try and match DEF. So that's important to understand. So let's get into something a little bit more complex. A or B, inside parentheses, plus. So basically what that means is, let's match A or B one or more times. Notice the parentheses are needed there because we've got order of operations going on here.
There's actually, it's all explained on the website and in the readme, but there's, you just need to pay attention to your order of operations when you're writing citrus code. For the most part, it works just like you would expect, just like Ruby's order of operations work. So there's also a limited degree of context sensitivity through lookaheads, positive and negative lookaheads, right? So in this example, I'm going to match the string hello. It must be followed by the string world, but I'm not going to match that. I just want to make sure it's there, right? So the parser is going through the string and it says, I match, oh, I can see hello. Let me check if it's followed by world. It is great, return hello. Otherwise, if it's not, we don't match anything, right? Same idea except negative lookahead. So basically what we're saying here is don't match ABC and then match any character, right? So in this case, the parser is going through the string. As long as we don't see ABC, we match a character and we put it into the buffer. As long as we don't match that, we put it in the buffer there. So that'll match any single character, for example, that is not ABC. So this is kind of a common idiom that you see sometimes in grammars where they'll say match, this basically is saying, match everything up until you see ABC in the input, right? Match any character where you don't see ABC one or more times, right? That's a little complex. It's a little bit, so there's actually a shortcut operator. It's a conditional lookahead. You can use a tilde and you can say, match one or more characters until you see this. And you can put any expression there. You could put a regular expression there or a dot or a character class or anything there. Okay, so I just, before I go into, before I continue, I just kind of want to go into some of the, let's see. I just want to go in and do some demonstrations, right? So let me bump up the size here. Is that gonna be good for everybody? Good, right. 
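A few of the operators described above (bounded repetition, positive lookahead, and the "match until you see ABC" idiom) can be sketched in plain Ruby to show what they do to the input. This is only an illustration under my own assumptions: `repeat`, `match_followed_by`, and `match_until` are hypothetical helper names, and none of this is Citrus code or Citrus's implementation.

```ruby
# Bounded repetition: match `literal` at least `min` and at most `max`
# times in a row. Returns the number of characters consumed, or nil.
def repeat(input, literal, min, max = Float::INFINITY)
  count = 0
  pos = 0
  while count < max && input[pos, literal.length] == literal
    pos += literal.length
    count += 1
  end
  count >= min ? pos : nil
end

# Positive lookahead: match `literal` only if `following` comes right
# after it. `following` is checked but not consumed.
def match_followed_by(input, literal, following, pos = 0)
  here = input[pos, literal.length] == literal
  ahead = input[pos + literal.length, following.length] == following
  here && ahead ? literal : nil
end

# Negative lookahead idiom: consume one character at a time for as long
# as `stop` does not start at the current position. Requires one or more
# characters, so it returns nil when `stop` is right at the start.
def match_until(input, stop, pos = 0)
  start = pos
  pos += 1 while pos < input.length && input[pos, stop.length] != stop
  pos > start ? input[start...pos] : nil
end

p repeat("hello worldhello world", "hello world", 2, 3)
p match_followed_by("helloworld", "hello", "world")
p match_until("xyzabc", "abc")
```

Note that in each case the lookahead only inspects the input; the text being looked for is never part of what gets returned.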
So here I am in my RubyConf directory. I pushed all of this, by the way, up to GitHub in mjijackson slash RubyConf 2010. So you'll know where to get the slides and the code and everything after the presentation. But so here I am, I'm in my directory here and I've got, actually let me just, okay. Let me just fire up IRB. That's the easiest way to show you how this all works. Okay, so in IRB, require citrus. So we've got Citrus here, loaded. So let's say our rule is, we're gonna have a Citrus rule that's going to say match the string A, okay? So we can say rule.test A, see if it matches. Great, one. The number that is returned from rule.test is the length of the resulting match, right? So let's see if it matches B. No, okay, so no match. So okay, we're kind of getting the hang of this. So let's make something slightly more complex. Let's say match A or B, right? We just saw this example. Rule.test A, good, B, good, C, no go, okay? Let's do something a little bit more complex again. Let's say match A or B any number of times, or one or more, let's just say. Actually we could just make it any number of times, zero or more times, okay? Rule.test A, B, A, B, A, B, good. We got the whole thing, right? We got six characters in our match coming back. Let's see what happens if we stick a C on the end of there. We should still get six out, right? Because the rule is only going to consume up to what it can consume and then it will return. So basically that's kind of how they work, and you can fire up your IRB and just play with them just like this. This is, again, kind of one of the API benefits to not generating code: you can just create objects at runtime and just play with them and inspect them. So anyways, let's get on with it. So when rules match, they create a match object, right? A match object subclasses string, okay? So a match is just a string, right? That's what we're doing. We're going through the input, we're matching strings.
We organize them into a tree structure. Anybody who's familiar with any kind of parsing should be familiar with the idea of an abstract syntax tree, right? So you have this tree of strings. We're fast because we only instantiate the top node when we do a parse and that node knows how to instantiate its children but it only does it when it needs to, right? So that's how we keep it lightweight. We keep our memory usage down. So we can take these matches and we can extend them with Ruby modules, right? To help us interpret things at runtime. So let me display this visually. Say I had like a string that was like one space plus two, and let's just say that this is what the parse tree looks like, okay? The node, I've got one node that's one space, one node that's a plus and one node that's a two. So let's say I want to give one of these, because these are all just objects, right? Let's say I want to give one of these objects some special behavior. Let's say I want this one down here. I can extend it with some module that I have, right? That's got some methods on it. And it's really nice because it's flexible like this. You can, this is where you can tack on whatever methods you want to the nodes in your tree at whatever level you are. So that might sound a little bit confusing, but it's actually, it's a very powerful, useful technique. Let's hop into some examples and hopefully this is all gonna gel. So let's take a simple grammar like this. This is called our addition grammar, right? We've got our top level rule, which is gonna be the one that we're gonna execute when we say grammar.parse, right? So this is our top level rule and it's the first one and it says number followed by one or more plus number, right? So what is a number? A number is, you can see down below the second rule in this grammar, a number is any of the characters zero through nine, right? This is pretty basic. 
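The idea of giving one node in the tree some special behavior by extending it with a module can be sketched with nothing but core Ruby. This is an illustration under my own assumptions, not Citrus's internals: `Match` here is a hypothetical stand-in for a match object (a String subclass, as described in the talk), and `NumberValue` is a made-up module name.

```ruby
# Hypothetical stand-in for a match object: just a String subclass.
class Match < String; end

# A module carrying the extra behavior we want on some nodes only.
module NumberValue
  def value
    to_i # a match is a string, so String#to_i is available
  end
end

# Three leaf nodes for the input "1+2".
nodes = %w[1 + 2].map { |text| Match.new(text) }

# Extend a single object at runtime; the other nodes are untouched.
nodes.last.extend(NumberValue)

p nodes.last.value
p nodes.first.respond_to?(:value)
```

`Object#extend` adds the module's methods to that one object only, which is what makes it possible to attach different interpretations to different nodes of the same tree.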
It's kind of nice that these rules, when you put them in a grammar, can refer to one another by name. So that's actually really handy. So let's go ahead and run this in the console. So I get in here and I can see I've got my addition.citrus file in there, and that's in the Git repo that I was telling you about. So let's clear this and let's say irb -r citrus. So let's say Citrus.load addition, okay? An interesting side effect of Citrus.load is that it returns an array of all of the grammars, which are really just Ruby modules. It returns an array of all of those in the order they were defined in the file. So here we've got this Addition module. So we can actually do something kind of cool. Let's require pretty print and let's pretty print Addition.rules. So we can see the rules that Addition has in it. There's one called addition and you can see its definition, and there's one called number and you can see its definition. In this case, we've taken that character class and compiled it into a regular expression for speed, right? So we don't have to check every character. So let's say match equals Addition.parse. Now addition matches a number followed by a plus sign and a number, and we can do that one or more times, right? So there's a number, a plus and a number. We get a match. What happens, for example, if we don't get a match? Let's try taking out that number. Well, here we get a nice little error message. This is a parse error that's been thrown. You can see that we've got information about what line our parse failed on, what offset it failed at. Here's a little help to figure it out, in case you don't wanna actually go to that line and offset in your input. So this is actually one of the features that I really wanted to incorporate into Citrus, because you go crazy if you can't see what's going on at a low level sometimes. So let's go back up to our good parse, and so that works.
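The behavior shown in the console (a successful parse of a number followed by one or more plus-number pairs, and a parse error carrying an offset when the input is cut short) can be approximated with Ruby's standard `StringScanner`. This is a hedged sketch, not Citrus: `ParseError`, `parse_addition`, and the error message format are all assumptions made for illustration, and the offset Citrus itself reports for a given input may differ.

```ruby
require "strscan"

# Hypothetical error type that remembers where the parse stopped.
class ParseError < StandardError
  attr_reader :offset

  def initialize(offset, input)
    @offset = offset
    # Show the input with a caret under the failing offset.
    super("Failed to parse input at offset #{offset}\n#{input}\n#{' ' * offset}^")
  end
end

# addition := number ('+' number)+
# number   := [0-9]
def parse_addition(input)
  scanner = StringScanner.new(input)
  raise ParseError.new(scanner.pos, input) unless scanner.scan(/[0-9]/)

  pairs = 0
  pairs += 1 while scanner.scan(/\+[0-9]/)

  # Require at least one '+' number pair and no leftover input.
  raise ParseError.new(scanner.pos, input) unless pairs >= 1 && scanner.eos?
  input
end

puts parse_addition("1+2")

begin
  parse_addition("1+")
rescue ParseError => e
  puts e.message
end
```

Having the offset on the error object, not just in the message, is what lets a caller point at the exact spot in the input where things went wrong.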
Let's expand this grammar a little bit and let's add on a little method here. So that looks kinda like Ruby's block syntax, right? So remember I was talking about us being able to extend these match objects with any number of methods, right? So here we can say take that number, which is gonna be a string zero through nine. It's gonna be one of those characters, and we're gonna define a method on it called value, which is just something that's implicit in Citrus. If you don't define a method name, it defines a method called value, which is going to return to_i, right? I said a match is just a string. A string has a method to_i which will return the integer value of that string. It'll coerce it into an integer. So that's fine. So let's go back to our grammar and see if we can do that. So let's say Addition.parse one, and let's say the root rule that we wanna start at is that number rule, okay? So cool. We just parsed the string one and we've got this match. So let's say match.value. Okay, cool. So we get a one back. That's an integer, right? So our block worked. When we called value on our number match, we got back the integer value. Let's do something a little bit more complex. Let's tack on some semantic data, right? So this is all we're doing now. We're just tacking on methods that help us interpret what these strings actually mean, except we don't have to go back and actually do this at runtime, right? We stick it in our grammar. We forget about it. So basically what we're going to do is this: these match objects have a method called find, which lets you find matches by their name, right? So in this case, we wanna find matches that have the name number, which should be pretty self-evident. And true, in the second parameter there, means we wanna go as deeply into the tree as it takes. So we want to recurse into this tree and find all the matches named number.
And basically we're just gonna add them up, right? So the value method of an addition match should return us all of those matches added up. So let's head back to the console. So let's say we wanna parse this and we wanna say one plus three, right? So we parsed it, good. So let's see match.value. Four, right on. So we should be able to put any number of numbers in here, right? Whoops, where'd I go wrong? Okay, that may be a bug. I'm using the beta software, excuse me. Okay, I'll take a look at that after the conference or after I get done. So okay, we've got a grammar here called paren letter, right? So this is actually a really good example of recursion. This is something that regular expressions are terrible at, that they just cannot do. Let's try and match a letter A through Z, any case, that is inside some parentheses, right? So we wanna match an open parenthesis and then paren letter or a letter and then a closing parenthesis. So what this allows us to do is match a letter that is surrounded by a balanced set, the same number, of opening and closing parentheses, right? So let's head back to my console. Let's see, am I still, oops, okay. So I've got it in this paren letter file, right? So I'm gonna say irb -r citrus. Okay, so I'm going to say Citrus.load paren letter, right? Okay, so we've got a paren letter grammar. So let's say ParenLetter.parse. We're gonna parse paren, A, paren. We're good. Let's head back out here and extend this grammar with some more information. So let's say I wanna actually extract the letter from an arbitrarily nested number of parentheses. So I'm gonna find the first match that's named letter and return it. Let's head back here. So just to demonstrate that we're doing okay parsing parentheses, we're fine. And then let's say dot value, and we should get back the string A. And all we're using is our grammar here.
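The recursive paren letter rule described above can be written as a small recursive function in plain Ruby, which makes the recursion explicit. This is only a sketch under my own assumptions, not the grammar from the slides and not Citrus code: `paren_letter` and `parse_paren_letter` are hypothetical names. It also counts the nesting depth, which is the next thing the talk adds to the grammar.

```ruby
# paren_letter := '(' (paren_letter | letter) ')'
# letter       := [a-zA-Z]
#
# Returns [letter, depth, next_pos] on success, or nil on failure.
def paren_letter(input, pos = 0)
  return nil unless input[pos] == "("

  # Ordered choice: try the recursive rule first, then a plain letter.
  inner = paren_letter(input, pos + 1)
  if inner
    letter, depth, close = inner
    depth += 1
  elsif input[pos + 1].to_s.match?(/\A[a-zA-Z]\z/)
    letter = input[pos + 1]
    depth = 1
    close = pos + 2
  else
    return nil
  end

  # The closing parenthesis must balance the one we opened.
  input[close] == ")" ? [letter, depth, close + 1] : nil
end

# Succeed only if the whole input was consumed.
def parse_paren_letter(input)
  letter, depth, stop = paren_letter(input)
  stop == input.length ? { value: letter, nesting: depth } : nil
end

p parse_paren_letter("(((a)))")
p parse_paren_letter("((a)")
```

Each recursive call handles exactly one pair of parentheses, so the opening and closing counts must agree no matter how deep the nesting goes.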
If we wanted to make it a little bit more complex and actually get the level of nesting, what we could do is find paren letter, which would give us the number of times we recursed, plus one for the current time that we're on, right? So, I mean, again, that works. So this is actually some syntax that is new, that we haven't seen yet. If we want to define more than just a value method, what we can actually do is use def and end. So everything in between your curly braces is just Ruby code, right? So if there are no methods defined and there's just something like to_i, for example, like we had in the previous example, then we'll just create an anonymous module and we'll tack that onto your matches. Otherwise, we're gonna create a module from these methods and tack that on, right? So let's head back there and we'll kind of test this one. Let's tack on one more and then we'll call nesting on it. So it's nested three levels deep, right? If we nest it one more, we should get four, so we're good. So do you see how, as we're parsing through the data, we don't have to go back later and say, okay, I got this string out, okay, now what? Now what does it mean to me, right? I can define what it means right there inline in the grammar, so that when I get it back out, I don't have to do any more work. This is actually really nice for when you've got a large spec. I mean, let's just face it, it's really prone to errors if you're gonna try and parse anything significant with a regular expression, and it's gonna be difficult to extract this kind of value. All you get back is strings. So I think I have a few more minutes, do I? Let's see, 52, I started at 10 after, so 42 minutes in. Okay, yeah, I've got a little bit of time. Let's take a look at, there's actually a grammar that ships with the Citrus code in the examples folder.
So let's exit out of here, let's go back up into my citrus directory and let's say irb -r citrus. Okay, I should have shown you the directory listing, but anyway, let's say Citrus.load examples slash calc, okay. This calc example is actually pretty cool. It's able to interpret mathematical expressions exactly like the Ruby interpreter would do, right? So you can use underscores in your numbers, it'll respect your operator precedence, it will ignore the white space and everything. So we could say something like this. Let's see, so our expression is gonna be two times eight, or three times eight, I guess, I hit the three. Okay, so let's say Calc.parse expression, dot value. We get 24 back, nice. So let's create a little test that'll say def Calc.test. Let's test an expression, and what we wanna make sure is that the Ruby value of that expression is the same as the calc value for the expression, right? Cool. So we could pass any expression here now. We could say Calc.test and pass it two plus five. True, good. Two plus five times four, we're good. Two plus five times four to the seven, we're good. Let's, I don't know, throw in some random parentheses. How about like, no, that's not gonna change anything. How about we divide it by that? I don't know, make that a float. Boom, we're good. So basically, you get the idea, right? It's actually a really interesting example if you pop open the examples folder, because you'll see that we've entirely mimicked Ruby's mathematical expressions in about 100 lines of really readable, nice Ruby code. And so it's useful for that. So you might have seen other examples of this out in the wild. A lot of them use the treetop gem right now. For example, the new mail parser in ActionMailer for Rails 3 uses the treetop gem under the hood to parse email messages. I saw, what was it, Heist.
It was a Scheme interpreter written in Ruby, which I thought was cool. And that uses treetop under the hood. So basically, people are parsing programming languages with parsing expressions. They're parsing documents that are well defined, like email documents. I have people emailing me telling me they're parsing natural language. Now, natural language kind of has a caveat. Parsing expressions aren't good at natural language. There is lots of ambiguity in natural language. So I'll tell you, if you're thinking about going down that road, you're probably gonna run into some roadblocks. But for computer languages, for text formats that are highly structured, parsing expressions are a really, really nice alternative to the regular expression. And I hope I've been able to give you some idea about that today, and I hope you'll give parsing expressions and the Citrus gem a shot. Thank you.