Video equipment rental costs were paid for by PeepCode screencasts. Alright, good afternoon everybody. I'm going to be talking about yet another DSL; this one will help you do parsing. Let me first give you a little history of how I came to write this package. In the past, like I'm sure many of you, I'd written all these custom, handcrafted parsers — typically what you'd call recursive descent — but unfortunately they were in Perl. Later on I got tired of that and went to some different tools; one of them was ANTLR, for parser generation. It still wasn't good enough. What I wanted to find was a parser generator that would, first, use the same language you're generating the parser with. Instead of having a separate side file with BNF or whatever, I wanted the grammar to live in the same language you're working in. I also wanted something that would unify the lexer and the parser. Everything else I found, including Racc, yacc, and lex, and ANTLR too, kept them as separate beasts, so I wanted to make something unified. I actually went searching for a language that might help me do that, and that's why I learned Ruby. Ruby has the right mix for it: operator overloading, blocks, and duck typing all make it work very well. And this is actually my fourth rewrite of this package.

First of all, I want to show you how Grammar works to build a parser for whatever language you need to parse. You start with these grammar objects. A grammar is just a Ruby object — that's all it is — a Ruby object that holds information describing the grammar of something. To build a grammar for some language you want to parse, you build it piecemeal, starting from the lowest level and building up. All you use are regular old Ruby operators and methods, plus a few blocks to get your actions in, and that's how you build your grammar up. You end up with this grammar object, which is just a representation of the language you want to parse. Then I have what I'm calling an engine — a grammar engine — that takes that grammar and generates something more concrete: a parser. That is the thing that will actually parse your input. So there's a layer between the grammar and the thing that understands the final parsing at the end, but the whole thing is just Ruby objects and methods.

To contrast this with what most people use for parsing, regular expressions: this table shows the equivalence between regular expressions and Grammar. On the left are regular expressions — symbols and operations that live inside a string, where you combine pieces of regular expressions to create a bigger regular expression. Grammar, on the other hand, is just straight Ruby. You don't put it in a string or a separate file; it's a DSL. Where a regular expression uses a vertical bar to mean alternation, I use Ruby's OR operator to accomplish that; I use plus to accomplish a sequence; and there are operators here to accomplish positive and negative lookahead assertions.
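To make that mapping concrete, here is a self-contained toy sketch — not the Grammar package's implementation — of how Ruby's operator overloading lets | stand for alternation and + for a sequence, the same way the regex metacharacters do:

    # Toy illustration only: a matcher returns the number of
    # characters it consumed, or nil on failure.
    class Toy
      def initialize(&match) @match = match end
      def =~(str) @match.call(str) end

      def |(other)   # alternation, like regex a|b
        Toy.new { |s| (self =~ s) || (other =~ s) }
      end

      def +(other)   # sequence, like regex ab
        Toy.new { |s| (n = self =~ s) && (m = other =~ s[n..-1]) && n + m }
      end

      def self.char(c)   # leaf: match a single character
        Toy.new { |s| s[0, 1] == c ? 1 : nil }
      end
    end

    ab     = Toy.char("a") + Toy.char("b")   # like /ab/
    a_or_b = Toy.char("a") | Toy.char("b")   # like /a|b/
    p(ab =~ "abc")       # => 2
    p(a_or_b =~ "bcd")   # => 1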
At the top here are the leaves of your grammar, the things you start with to build up: characters, character ranges, and so on. Let me say a little about the input sequence. It can be anything, really; it just has to look like an external iterator, where I use the brackets method to get the next item in the sequence. And the items in that sequence could be anything. Typically characters, if you want to compare it to regular expressions, but they could also be tokens, if you want a parser that consumes the results of lexing. At the lowest level is the grammar called Element, which I abbreviate as E on the previous slide. It takes tokens out of the stream and compares them with a pattern that you give it, using the triple-equals method. Again, the pattern can be anything; it just needs to work with whatever's in the input stream. And then we have a few more leaves of the grammar there.

On the other end, the output side: a lot of other parser generators produce abstract syntax trees. For the way I do parsing, maybe a better description of my approach would be an abstract syntax forest. Instead of each grammar parsing to a tree, it can generate just a sequence of things — what you could call an output stream. The default is that the output stream is exactly like the input stream, and you use different methods to change that. The discard method, applied to any grammar, makes it start throwing away the input rather than putting it in the output stream. Redirect sends output to a temporary buffer and then lets a block do whatever you want with it. Group is basically a specific version of redirect that collects into a buffer and then appends to the output; group is how you create hierarchy, how you'd build an abstract syntax tree.

At this point, I'm going to go ahead and give a couple of simple examples. Here's a unit test of Grammar — I don't know if that's readable. At the top I'm just creating a few helper methods. Down here is the input method I use to get things into the form I need for my input iterator: it converts a string into a StringIO, and then getc gives back something that looks like a lambda that I use for getting the next token, which in this case is just the characters of a string. In this first example, all I'm doing is parsing a single digit. The grammar is just that first line: a single Element that matches a character in the range ?0 to ?9. Then here are a few examples parsing out the first digit of a string — and here I'm appending to the previous output, which is why both the 1 and the 2 show up. And then it shows a few invalid conditions. We build on that to say a sequence of digits: I have a method called repeat1 that does one or more digits, and then I can parse that. This time I also wrote digits plus EOF, which is why this case gives an error: the 'a' is not EOF at that point. Let me go back; I'll show you a few more examples a little later.
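Since the slide itself isn't reproduced here, this is a reconstructed sketch of that digit example. Element, repeat1, the EOF grammar, and the getc-style input come from the talk, but the constructors and module layout are guesses — read it as pseudocode for the real API:

    require 'stringio'
    require 'grammar'   # the package under discussion (gem name assumed)

    # helper: wrap a string so the parser can pull one token per call;
    # anything answering the brackets method, like a lambda, will do
    def input(string)
      io = StringIO.new(string)
      lambda { io.getc }        # input[] yields the next character
    end

    digit  = Grammar::Element[?0..?9]   # one char in '0'..'9' (constructor assumed)
    digits = digit.repeat1              # repeat1 = one or more (from the talk)
    whole  = digits + Grammar::EOF      # digits then end-of-input (EOF name assumed)

    # digit would accept just the leading '1' of input("123");
    # whole would reject input("12a"), since 'a' is not EOF there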
Another thing you can do with your grammars: you can take the output of one grammar and connect it to the input of another. You can do that because a grammar is agnostic about what's in its input and output streams, so any grammar could be acting as a parser or a lexer, whatever you want to call it. I have one method, supply, which takes the grammar for a single token, uses it to form a lexer, and connects that to a parser represented by another grammar. Pipe is similar, except it does some multithreading, so you can have a multithreaded lexer-parser combination: instead of sharing the input and output of the two grammars directly, it uses a producer-consumer pipe to connect lexer and parser. But you can go beyond that, too, and do as much chaining as you want: a preprocessor that feeds a lexer, that feeds a parser, that feeds something else — even a compiler after that. Each step just takes an input stream and generates an output stream, and the data gets refined into whatever you want.

I wanted to talk about the category of parser this is. It's called an LL parser. If any of you have written your own parser, those typically parse from left to right, and those are also LL parsers. So I'm basically generating a parser that looks like what you would write by hand. It does make its parsing decisions one token at a time, whereas what you'd write by hand might use regular expressions and look a little further ahead. But I do have a backdoor around that one-token restriction, called backtrack. You can take any grammar and apply the backtrack method to it, and it effectively lifts the one-token limit on parsing decisions — it gives you infinite lookahead capability at that point.

Recursion. This is one of the more interesting parts of Grammar. Here's one way of creating a recursive grammar: inner and outer represent the same grammar, just different parts. Inner represents the recursive reference inside, and outer is the final thing. You call it this way, and wherever inner appears, that represents a reference back to the entire thing again. In addition to that, I look at where inner sits to make some optimizations. If inner is in the middle of a sequence, the parser just uses plain old recursion, calling the grammar again to do more parsing. If it's on the right side, with nothing occurring after it, I generate a loop around it instead, which is equivalent to what compilers do under the name tail call optimization. That eliminates the extra recursive calls which would otherwise blow up the stack while parsing, and it creates a pretty significant speed benefit. The last case, left recursion — when the reference occurs on the left side — also implies a loop. This one is interesting because historically left recursion has only been in the domain of parsers like Racc or yacc, which are, I believe, called LALR parsers. Mine is an LL parser, but it still does left recursion; I believe I've seen only one other LL parser do that, so this is quite unique.
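As a self-contained illustration of what that left-recursion rewrite amounts to, outside the Grammar package: a left-recursive rule like expr := expr '-' NUM | NUM gets evaluated as "parse one NUM, then loop," which also preserves the left associativity you wanted from the recursion:

    # hand-rolled version of the loop that the left-recursion
    # optimization produces (illustration only, not Grammar's code)
    def parse_expr(tokens)
      value = Integer(tokens.shift)     # the non-recursive alternative, once
      while tokens.first == "-"         # the loop replacing the recursion
        tokens.shift
        value -= Integer(tokens.shift)
      end
      value
    end

    p parse_expr(%w[10 - 3 - 2])   # => 5, i.e. (10-3)-2: left associative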
The slide shows an example of how I do it. If you start with a grammar like this, I break it into two pieces: I look at that grammar twice. The first time, I substitute the recursive reference with a failure, which causes the recursive branch to fail, leaving just the y part. The second time, I treat the inner reference as always passing, which gets me the x part, and then I put a loop around it. So the first form gets boiled down into the equivalent of the last one. Typically, with most LL parsers, you would have to write your grammar like that last form; but with this one you can write it with left recursion, which is often the most natural way to write things. Even for a simple expression, which I'll show you, left recursion is the more natural style.

Okay, now to the example. Let's go back to the digits example and expand on it a little. First, let me go back: previously, when we parsed the 2 and 3 here, they came out simply as two characters — the character 2 and the character 3. That's not really useful on its own. What we'd really like is to convert them to an integer instead of leaving them as characters. So in this example I've just added the group method. It shoves the results of parsing that grammar into a string — I gave it String there — and then hands the string to a block, which in this case converts the string to an integer. So now I'm getting what I want, which is integers when I parse an integer: the 1, and the 23 there.

You can expand on that. This one shows some recursion. In this case I'm putting recursion around parentheses: when I encounter parentheses, I want to recurse back to my original expression. Right now the expression is just my integer; I'm going to keep expanding on that. This is also another way to set up the recursion, where I start with an empty grammar and then shove the definition into it afterwards. So we can parse however much parenthesis nesting we need. And then this one is an error condition, because we got mismatched parentheses at that point.

The next example adds a unary minus. In this case, if we see a minus sign, we want to discard it, because we don't really want the minus in the output stream; then comes another primary, and then here's our action, which negates the last item on the output stream and puts it back. Or the whole thing can just be a plain old primary by itself. And then this shows a few examples. Expanding again, here I'm going to show you some left recursion, because product here is on the left side of its own definition: if we see a product, then times, then another unary, we put the product of the two back onto our output stream. And then this is the last one — I just added another level for sums. So our final expression is basically a sum that contains terms with products in them; a product might contain a unary; the unary might have parentheses; and inside those there can be anything — another expression — or else just an integer. That's our final grammar for an expression. And here are some examples of using the thing. In this case I'm taking the expression and evaluating it, but it could just as well be building a tree instead, just by changing the actions. The actions I gave make it do evaluation, but they could be arbitrary; you could do whatever you want.
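To make those action semantics concrete, here is a self-contained sketch — not the package's internals — of what the output stream looks like while something like -(2*3) is being evaluated. Each action pops its operands off the stream and pushes a result back, so evaluation happens as parsing proceeds:

    output = []
    output << 2                  # integer action pushed a 2
    output << 3                  # integer action pushed a 3
    b, a = output.pop, output.pop
    output << a * b              # product action: pop two, push 2*3
    output << -output.pop        # unary-minus action: negate the last item
    p output                     # => [-6]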
I'm going to come back to this in a little bit. So, I mentioned this grammar engine before; let me talk about it a little more. You can think of the grammar engine as a visitor on top of the grammar. The grammar we're dealing with is a big tree of grammars, and the engine's job is to traverse that grammar and do something with it. It can do anything, but the typical use is to generate a parser. Right now I've implemented three engines — actually one more, but three of them are here. One of them I called Ruby0, and it doesn't really generate a parser: when it traverses the grammar, it parses at that time instead of generating a parser. The next one, Ruby, generates a Ruby parser, which you then take and run later. And I have another one, derived from that, which compiles to C, so we get a performance boost there. Here are a few more ideas for other engines: one that generates C code directly, one that generates native Ruby C code, and one that generates native C++ code. An engine also doesn't have to do parser generation at all: one could traverse the grammar tree and generate a dot diagram instead. The other thing is that this whole grammar layer is a very, very lightweight layer. It's basically just a user-facing layer; all the work is done by the engine itself. And there's nothing that says the engine has to be an LL parser, either. I could put a packrat parser or something else underneath as an engine, because, like I said, that layer is just there to help you build your grammar in a nice, DSL-friendly way; it basically provides all the operator overloading, a nice surface.

Now, if you noticed in the previous examples, the action blocks look like straight Ruby code, but they don't do what you might think. The reason is that these action blocks have to generate code, because I'm doing parser generation. So if you have a plain old Ruby object, you have to send it through the brackets method to get a piece of code that I can use for parser generation. Method calls, on the other hand — any method calls inside one of these blocks — just work: they do what you expect, except that they don't execute immediately. Let me go back to this example. This simple one here, the to_i: it didn't actually convert the string to an integer at the time the Ruby block executed. What it did was generate a piece of code that says string.to_i. That's what all of these things are doing: generating code, not executing right there. The way I do it is with a class that overrides method_missing to generate a piece of code that applies that method. It's a neat little trick with method_missing, but unfortunately it only helps me with method calls; for anything else, I can't use that kind of trick. So, like I said, with raw objects you have to use the brackets method. And for any built-in Ruby constructs, you have to use facilities of the engine to do the code generation, because I don't want to execute those things right then; I want to generate the code that does them. For example, if you want an if statement, you have to use the if method of the engine. The or operator, same thing. Even a semicolon — a sequence of things — you have to write with steps.
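Here is a self-contained sketch of that method_missing trick — an illustration of the technique rather than Grammar's actual class. Calls on the object build up source text instead of executing:

    # calls on a Snippet don't run; they return new Snippets whose
    # code strings get pasted into the generated parser
    class Snippet
      attr_reader :code
      def initialize(code) @code = code end

      def method_missing(name, *args)
        call = "#{@code}.#{name}"
        call << "(" << args.map { |a| a.inspect }.join(", ") << ")" unless args.empty?
        Snippet.new(call)        # generate code, don't execute it
      end
    end

    s = Snippet.new("string")
    puts s.to_i.code             # prints "string.to_i" -- deferred, not run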
I also have the variables grammar, which can be used to create variables and share them between multiple grammars, so you can keep whatever state you need. That actually lets you write context-sensitive parsers, where most parsers can only handle context-free languages. You can keep arbitrary state and have the actions affect parsing decisions, which removes a limitation that other parsers normally have.

So, here is performance. I did a performance comparison of several parsers I found for Ruby. Other than this bottom one, grammar/Ruby0, the ones toward the bottom here are packrat parsers, I believe. I wrote JSON parsers for all these different parser generators, and this shows how many characters per second each one pushes through a big piece of JSON that I generated. As you can see, the packrat parsers typically aren't doing too well. Here's Racc: this is the straight Ruby version, and the C extension gives you a little more speed. This grammar/Ruby entry is my pure Ruby implementation; it's faster than even Racc with its C extension, and it's getting close to what my hand-coded parser did. That one is a one-character-at-a-time hand-coded parser; I believe I started with James Gray's version — a lot of this came from one of the Ruby quizzes — and then optimized it as much as I could. The one near the top is mine compiled to C. And at the very top, that is pure C: a C extension that exists just for JSON parsing. So I'm getting closer to that one, but not quite. But I don't think you'll find any other Ruby parser generators that are going to be this fast. I think I missed a few slides... yeah, I did.

Okay, I just wanted to talk about some of the lessons I've learned so far from writing this grammar code. One thing: duck typing is your friend, and it can add a lot of flexibility. Here are some of the concepts where I use duck typing: the input sequence, the output sequence, the pattern matching on elements, and, as I discussed earlier, the code generation — that's all duck-typed. It allows a lot more than if you were thinking in terms of one specific type. Also, when I'm using duck typing, to extend the flexibility further I typically try to limit the number of methods I rely on. For most of these I actually use just one method, and that's it. The best method to use is probably either call or brackets, so the object looks like a lambda; a lambda is probably the easiest thing to use with it.
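A small self-contained example of that single-method duck typing: as long as the input answers the brackets method with the next token, a lambda over an IO and a lambda over an array of tokens are interchangeable:

    require 'stringio'

    io = StringIO.new("abc")
    from_io = lambda { io.getc }           # characters from an IO

    toks = [:id, :eq, :num]
    from_lexer = lambda { toks.shift }     # tokens from a lexer, say

    # a consumer only ever calls [], so either source works
    p from_io[]      # => 97 on Ruby 1.8 (the Fixnum ?a), "a" on 1.9
    p from_lexer[]   # => :id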
Another thing that was useful is layering. Putting a layer between my grammar and my engine added a whole lot of flexibility. In the previous incarnation of this thing, the engine was part of the grammar, all one piece, and I had very limited flexibility with that. Now I've opened it up by separating the two, because I can plug in any old engine I want, or make a new engine to work with it. Prototyping was also useful. If you noticed on that previous slide, Ruby0 was at the bottom of the list — that was my prototype. It didn't do code generation at all; it traversed the grammar tree, and while it was traversing, it would parse. So it was slow, because it didn't have the parser generation stage, but it was a good prototype: it allowed me to prototype the whole API, and later I came along and built a new engine that did full code generation, and that's where I got my speed. And yes, I would suggest using lambdas and blocks liberally; they add a lot of flexibility. Another thing: initially, when I was designing the methods in this grammar, I would try to make one method do a whole bunch of things for me, so I had nice short names for methods. Unfortunately, that caused more trouble than it was really worth. I found that if you make each method as simple as possible, you can document it easily, it's easier to debug, and it even provides a performance benefit. So I think it's better to minimize your overloading and have nice simple methods, instead of trying to jam everything into complex ones. The only benefit you get from that is reusing the name, and I don't think it's worth it.

Here are some of my lessons learned in getting the performance numbers you saw earlier. Minimize object creation; that helps out a lot. Minimize method calls: as you can see, the parsers I generate are almost completely flat — actually, I'll show you one in a bit. Don't repeat yourself: for performance, make sure your generated code sequence doesn't make the same decisions multiple times. And, as I already alluded to earlier, keep things simple, so that the simple common cases don't have to go through all this extra overhead.

I meant to do this one earlier; let me go back to this example and show you some code, like this primary here. This was one of the examples I gave, and I'm spitting out the code that it generates. This is the unary test: it parses a digit, or — sorry, let's go back to primary — it parses a digit, or it starts with a parenthesis and recurses back to itself. That's the example I want to show here. Here I'm calling this method — this is the main piece of code that you call first — and it calls this one up here. It first parses a digit and then possibly more. Notice that in the grammar you specified ?0 and ?9 to represent the characters, but the generated code has a little bit of ugliness, because those are just Fixnums: 48 is ASCII for '0', and 57 is ASCII for '9'. So this part looks for a digit, then more digits; if it wasn't a digit, it drops down to the alternative here: a left paren, then a call back to itself, then a right paren. So that's an example of what gets generated.
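As a rough sketch of the shape of that generated code — not the literal output — with the flat Fixnum comparisons described above (48 and 57 are ASCII '0' and '9'; 40 and 41 are the parentheses):

    def parse_primary(input)
      c = input[]                        # pull the next character code
      if c && c >= 48 && c <= 57         # a digit: 48..57 is ?0..?9
        c = input[] while c && c >= 48 && c <= 57   # more digits
        true    # (the real generated code also hands back the
                #  one-token lookahead left in c)
      elsif c == 40                      # '(' is ASCII 40
        parse_primary(input) && input[] == 41       # recurse, expect ')'
      else
        false
      end
    end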
I believe that's it. That's where the project is — RubyForge. That's all I had. Any questions? Yes?

[An audience member asks about another parser package, Yambit.] Yeah, I haven't heard of that one. Is that a gem package, or what is it? It's for Scheme — okay, no, I haven't heard of it. One of the other parser packages I can think of that's similar, for C++, is called Spirit — from Boost, that's right. This is similar to Spirit in that you write the grammar in the same language. But I think this one has quite a bit more flexibility than that. Anybody else? Yes?

[The question refers to an earlier discussion of something Grammar does that's special, different from the other parsers.] Left recursion? Left recursion, yes. I believe the only other place I've seen it done was in one of the functional languages, and it did it in a completely different manner: I think they counted the number of tokens to the end of the sequence and, based on that count, limited the recursion. So they actually did perform the recursion, but limited its depth. Normally, with an LL parser — a recursive descent parser — left recursion implies an infinite loop, and you never get out of it. So that's what's kind of unique about what I'm doing.

Yeah, that's a good question — very good question — about Ruby 1.9. I did have this working at one time, but I changed some code and now it seems to be broken with 1.9. With Ruby 1.9, if you're parsing a string, the characters you pull out are also strings. Previously I did have it working, and the code it generated simply had strings instead of Fixnums. So it did work. You don't get as much of a performance benefit, because strings are full objects; but the latest Ruby 1.9 wasn't horrible — before that, it was very bad — so it looks like it will still work. And the code will actually be a lot more readable, because instead of ASCII values it will have real strings. When I had it working on 1.9 before, it was still a little bit faster than the numbers in that table. But I don't think the Ruby-to-C compilation is working on 1.9 yet, and I know you wouldn't get nearly that performance benefit going to C, because Fixnums are something you can optimize quite a bit. Yes?

[Are there any other projects known to be using Grammar?] No, I don't know of any. I just released this last night. What I discussed here is version 0.8, which I just released. If we go back to that chart — you see grammar 0.5; as of yesterday, that was all that was out there, just 0.5.
Grammar 0.8 is the other entries. Comparing the pure Ruby versions, there's a 5x speedup from one version to the next. And this version adds a lot more flexibility, because the previous one didn't have the layering between engine and grammar. But no, I don't know of anybody using it yet. Yes?

[The question is about the motivation for writing Grammar: a particular parsing need, or curiosity?] It started that way. My background is IC design, microprocessor design, so I dealt with lots of file formats. And at the time, I was going off on my own — I wanted to start my own company; I didn't get very far, though. I got interested in doing this because I first wanted to parse something, and then I got sidetracked and did this whole thing. Yes?

[Are there any size limits in the implementation when parsing large inputs?] No. It just depends on what you want to put in your syntax tree. In terms of storing the input itself, it doesn't store anything — well, except for one case: if you call the backtrack method, it has to buffer the input from that point on, and once it's past that point, it can wipe the buffer clean. But yes, it could handle gigabyte files. It just depends. Say you have terabytes of a huge expression and you're running that piece of code I showed earlier: that would use almost no memory, because it's not generating anything; it's just continuously evaluating the expression. That's all it's doing, so there's no tree involved. If you generate a parse tree, then the parse tree would take memory, but that would be it. Anybody else? Well, thank you.