Okay, let's keep going. Next up on the podium we have Michael Padden with regular expression derivatives in Python.

Hi everyone, it's great to see that everyone got up early to hear a talk on derivatives. That's weird. Thank you very much anyway. My name is Michael Padden. I live and work in Tokyo, even though I have an Australian accent. My day job is at Google, where I work on the Chrome browser. We're always looking for new people to join the Chromium open source project, so if you're interested, you're very welcome; please reach out to me. But that's not what I'm talking about today. Today I'm talking about a private project; this is a personal project, not a Google project.

The motivation for my project is very simple. I wanted to generate scanners, the things that tokenize text, with guaranteed linear performance and Unicode support. As it turns out, there are a lot of tools out there, but none quite fit what I wanted, so I wanted to build my own. And I came across a really great paper by Owens, Reppy and Turon, which describes how you can use regular expression derivatives to build a deterministic finite automaton, a state machine. That sounded like exactly what I wanted. They say in their paper that regular expression derivatives have been lost in the sands of time and that few computer scientists are aware of them. So hopefully after this talk, a few more computer scientists will be aware of this technique. It's pretty cool.

A quick refresher: regular expressions are really, really simple, as it turns out. You've probably all used them in real life, but when you start looking at them theoretically, they boil down to a very small set of things. There's the null expression, which matches nothing. There's the empty string. There's a single symbol, a, b, c, d or whatever, from an alphabet, which we normally write as a big sigma. There's concatenation, two expressions one after the other. There's the star operator, zero or more repetitions. There's logical or, which we write with a pipe in POSIX regular expressions, but here we'll write with a plus. And in this particular formulation we also have "and" and "not", which turn out to be kind of handy to have.

So, the examples down the bottom: a bunch of letters in a row is a regular expression, so "hello" matches the word hello. The second example matches either the letters a, b, c or the digits 1, 2, 3 in sequence. And the third matches an a, any number of b's, including zero, and then a c. Again, if you've used POSIX regular expressions, you're probably very familiar with all of this.
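(To make that small set of constructors concrete: the talk builds a class hierarchy of this kind later on. The sketch below is illustrative, with names of my own choosing, not the actual classes from the Epsilon source.)

```python
# A minimal sketch of the constructors described above (Python 3.7+).
# Frozen dataclasses compare structurally and are hashable, which matters
# later when states are put into sets.
from dataclasses import dataclass

class Regex:
    """Base class for all regular expression nodes."""

@dataclass(frozen=True)
class Null(Regex):
    """Matches nothing at all (the empty set)."""

@dataclass(frozen=True)
class Empty(Regex):
    """Matches the empty string."""

@dataclass(frozen=True)
class Symbol(Regex):
    """Matches one symbol from the alphabet."""
    char: str

@dataclass(frozen=True)
class Concat(Regex):
    """left followed by right."""
    left: Regex
    right: Regex

@dataclass(frozen=True)
class Star(Regex):
    """Zero or more repetitions."""
    inner: Regex

@dataclass(frozen=True)
class Or(Regex):
    """Either side matches (pipe in POSIX, plus here)."""
    left: Regex
    right: Regex

@dataclass(frozen=True)
class And(Regex):
    """Both sides match."""
    left: Regex
    right: Regex

@dataclass(frozen=True)
class Not(Regex):
    """Anything the inner expression does not match."""
    inner: Regex
```

(The third example above, "an a, any number of b's, then a c", would come out as Concat(Symbol('a'), Concat(Star(Symbol('b')), Symbol('c'))).)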
Okay, another refresher: deterministic finite automata. These are state machines, and basically we want to turn a regular expression into one of these. The idea is you have a bunch of states: a start state; some accepting states, the ones with the double circles, where things match; and an error state. And then there are the transitions, the arrows that take you from state to state based on which symbols you read. It's a really simple idea; you've all seen state machines before.

Now, did you know you can take the derivative of a regular expression? What does that mean? It sounds very complicated, but it's really a simple idea: you take a regular expression and you feed a symbol into it, and what you have left is the derivative.

So if I have the regular expression a and I feed the letter a into it, what I'm left with is the empty string. Nothing's left to match: I've matched. If I've got a b and I feed an a into it, I get null: nothing matched. If I have a followed by b and I feed an a into it, what's left over is b. It's a really simple idea. And in the fourth example, if I've got an a* and I feed an a into it, I've still got an a*. That's regular expression derivatives. They were invented by a mathematician called Janusz Brzozowski in 1964. This is not new stuff, but it's very cool.

More generally, there's a bunch of rules. I'm not going to go through them one by one; the real point is that you can easily reduce all these rules to a simple program. On the right-hand side over there, there's a helper function which is used in one of the rules, but which is also used to check whether a regular expression is nullable, and that turns out to be really useful later on. The idea is that a regular expression is nullable if it matches the empty string.

Okay, so why is all this stuff useful? Well, here's some Python code. This is almost exactly the code in my project, and it's very simple, so I'll walk you through it. Here's a regular expression, and that's your start state. You add your start state to your set of states, and you push it onto a stack. Then you go into a loop: you pop a state off the stack, and for every symbol in the alphabet you take the derivative of that state, and that's a new state. If I haven't seen that state before, I add it to my set of states and push it. And in any case, I add a transition, because I've found a new way to get to that state. I run that loop, and at the end I've got a state machine. The accepting states are the ones that are nullable, and the error state, the one where you know you can never match anything more, is the null state. And that is it. That's really cool: you can produce a deterministic finite automaton from a regular expression with that code.
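(The slide code isn't reproduced in this transcript, so here is a sketch of the nullability helper, the derivative rules, and that construction loop, reusing the illustrative classes from the earlier sketch. It's a reconstruction from the description above, not the actual Epsilon source, and needs Python 3.10+ for match/case.)

```python
def nullable(r: Regex) -> bool:
    """Does r match the empty string?"""
    match r:
        case Null() | Symbol(_):
            return False
        case Empty() | Star(_):
            return True
        case Concat(l, s) | And(l, s):
            return nullable(l) and nullable(s)
        case Or(l, s):
            return nullable(l) or nullable(s)
        case Not(inner):
            return not nullable(inner)

def derivative(r: Regex, a: str) -> Regex:
    """What is left of r after it consumes the symbol a."""
    match r:
        case Null() | Empty():
            return Null()
        case Symbol(c):
            return Empty() if c == a else Null()
        case Concat(l, s):
            first = Concat(derivative(l, a), s)
            # If the left side can match the empty string, the symbol
            # may instead be consumed by the right side.
            return Or(first, derivative(s, a)) if nullable(l) else first
        case Star(inner):
            return Concat(derivative(inner, a), r)
        case Or(l, s):
            return Or(derivative(l, a), derivative(s, a))
        case And(l, s):
            return And(derivative(l, a), derivative(s, a))
        case Not(inner):
            return Not(derivative(inner, a))

def build_dfa(start: Regex, alphabet: set[str]):
    """The loop described above: one derivative per symbol per state.

    Caveat: with the plain constructors above this need not terminate;
    it relies on the canonicalizing smart constructors described later
    in the talk, so that equivalent states actually compare equal.
    """
    states = {start}
    transitions = {}                    # (state, symbol) -> state
    stack = [start]
    while stack:
        state = stack.pop()
        for symbol in alphabet:
            nxt = derivative(state, symbol)
            if nxt not in states:       # a state we haven't seen before
                states.add(nxt)
                stack.append(nxt)
            transitions[(state, symbol)] = nxt
    accepting = {s for s in states if nullable(s)}
    return states, transitions, accepting, Null()   # Null() is the error state
```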
There's only one problem. Remember that bit in the code that said "for every symbol in the alphabet"? Well, Unicode has a lot of symbols, so you'd be running through some pretty big loops. That's no good.

So the second thing that makes this whole approach cool is that rather than doing that, you compute derivative classes. Instead of going through every symbol in the alphabet, you can work out which sets of symbols actually matter to an expression and which don't; we call those sets classes. Looking on the right there, for the regular expression a there are only two sets of symbols that matter. One is the letter a, which is obviously important to that expression, and the other is everything that's not an a. So we only need to take two derivatives when running the algorithm on the regular expression a. There are more examples on the slide, but we don't have time for them. The important thing is that, again, there's a set of rules, and you can easily program them. The red is the change in the algorithm: rather than going through every symbol in the alphabet, we go through every class in our derivative classes, take any symbol out of each class, and use that to generate the next state. It's really that simple. So now we can handle large alphabets, which is nice: Unicode works.

And this is the last and perhaps coolest thing. Rather than doing this for one regular expression, we can run the algorithm over a vector of regular expressions, a list of them. And that's very simple: the derivative of a vector of regular expressions is just the derivative of each element, and that's another state for us. The same goes for the derivative classes: we just intersect them all. So with a very simple idea we can take a whole bunch of regular expressions and turn them into a single DFA, and that's exactly what we need if we want to do tokenization.

So let's go and see what this looks like in Python. When you implement this stuff in Python, there are a few key decisions you've got to make. How do you represent these large sets of symbols? Because if you're handling Unicode, you've got lots of symbols. How do you represent the regular expressions? How do you compare expressions for equality? Because remember, we need to see whether we've seen the same state before. And how do you build a scanner out of all this?

Large sets of symbols first. This is pretty interesting, and it's important to get right, because it's right in the guts of the code and it affects the efficiency. The easiest way is to represent a set as disjoint, ordered intervals. In the case here, a to z, A to Z and 0 to 9, you can represent the intervals by their code points as a tuple of tuples. And this is kind of cool, because now you can test for membership using bisect, which is in the standard Python library, and that's O(log n). Union, intersection and difference can all be implemented as O(n) algorithms, which is also nice.

It's really tempting to subclass collections.abc.Set, which is standard in Python 3, and pretend you've got a set of integers. But as it turns out, that's a really bad idea, and it all comes down to hashing. We need to put our sets into sets of sets, which means we need a hash; if you remember the previous algorithm, we had sets of states. The standard hash algorithm requires you to iterate over every element, and that's really slow if your set conceptually contains lots and lots of integers. So we subclass tuple instead and make it look a bit set-like. We could have ignored the requirement that all sets with the same members hash to the same value, but I didn't want to do that, so this was a bit tricky.
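(The vector idea above fits in a couple of lines; a sketch, reusing the derivative() from the earlier sketch:)

```python
# A scanner state becomes a tuple of expressions, one per token rule;
# its derivative is just the elementwise derivative.
def vector_derivative(v: tuple[Regex, ...], a: str) -> tuple[Regex, ...]:
    return tuple(derivative(r, a) for r in v)
```

(And the interval representation for symbol sets might look like this. Again a sketch of the approach just described, with names of my own, not Epsilon's actual types:)

```python
# Symbol sets as a tuple of disjoint, ordered, inclusive (lo, hi)
# code-point intervals; membership testing is O(log n) via bisect.
from bisect import bisect_right

# a to z, A to Z and 0 to 9 as code-point intervals.
WORD = ((ord('0'), ord('9')), (ord('A'), ord('Z')), (ord('a'), ord('z')))

def contains(intervals: tuple[tuple[int, int], ...], ch: str) -> bool:
    cp = ord(ch)
    # Index of the last interval whose lower bound is <= cp.
    i = bisect_right(intervals, (cp, 0x10FFFF)) - 1
    return i >= 0 and intervals[i][0] <= cp <= intervals[i][1]

assert contains(WORD, 'q') and contains(WORD, '7') and not contains(WORD, '!')
```

(Because the tuple holds a handful of intervals rather than every code point, the built-in tuple hash is cheap, which is exactly the property the talk needs for putting these sets inside other sets.)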
To represent the expressions, we just create a class hierarchy. We create an expression class, and then we subclass it for all the different types of operations. That means we can easily build our expressions as trees: here's an expression, and here's the tree that comes out of it. We just parse the expression and produce a tree. It's pretty standard.

And the last thing that was tricky, as I said before, is how we compare these expressions. This is actually interesting. What we do is try to always produce the expression trees in a canonical form; this is called weak equivalence. If we always produce them in the same canonical form, we can compare them just by comparing the trees structurally, and if the trees are the same structurally, it's the same regular expression. So there's a bunch of rules, again, that we can implement pretty easily.

As a real-life example of this, we use a smart constructor: we use __new__ instead of __init__. The reason we use __new__, which most people don't use in Python, is that we don't always want to create a new object. Sometimes a smart constructor wants to return an already existing object, or an object of a different type from the one we thought we were constructing. That's what we're doing here in the concatenation operator, where we check whether the left side is a concatenation, and if so, we just reorder things so it's always the same structure. Then we have a bunch of checks like: is the left side null? Then the answer is null. Is the right side null? Then the answer is null. We don't need to actually create a tree at that point. So we go through a bunch of tests like that, and only at the bottom do we actually construct the object. __new__ is really kind of cool for things like this, and if you didn't know it existed, it's worth going and looking at how it works.

And finally, we can build a scanner. I've got two minutes. We can build a scanner, and it's really very, very simple. Once you've got the DFA, you run through the symbols in your text until you get to a point where you can't match anymore, where you reach the error state. As you run through, you remember whether you've seen any accepting states. And if you did see an accepting state, you say: wow, I found a token. You return it, rewind back to the end of that token, and start again. So you don't go through the DFA just once to tokenize; you go through it once for each token. That's the basic approach. But again, it's not much code. It's very, very simple.
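(A sketch of that scanning loop, assuming a DFA object with a start state, an error state, a per-state transition map, and a token kind attached to each accepting state; all of these names are mine, not Epsilon's API:)

```python
def scan(dfa, text: str):
    """Maximal munch: run to the error state, emit the last accepted
    token, rewind to the end of that token, and start again."""
    pos = 0
    while pos < len(text):
        state, i = dfa.start, pos
        last = None                               # (token_kind, end_pos)
        while i < len(text) and state != dfa.error:
            state = dfa.transitions[state].get(text[i], dfa.error)
            i += 1
            if state in dfa.accepting:            # dict: state -> kind
                last = (dfa.accepting[state], i)
        if last is None:
            raise ValueError(f"no token matches at position {pos}")
        kind, end = last
        yield kind, text[pos:end]
        pos = end                                 # restart after the token
```

(And going back to the smart constructor from a moment ago, its shape might look like this, following the description above rather than the actual Epsilon code. A plain class rather than the frozen dataclass from the earlier sketch, because a frozen dataclass would forbid the attribute assignment in __new__:)

```python
NULL = Null()  # a shared, already existing null expression

class Concat(Regex):
    def __new__(cls, left: Regex, right: Regex):
        # Canonical form: keep concatenation right-associated, so that
        # (a.b).c and a.(b.c) come out with the same tree structure.
        if isinstance(left, Concat):
            return Concat(left.left, Concat(left.right, right))
        # Left side null? The answer is null. Right side null? Null.
        # No need to build a tree at all; return the existing object.
        if isinstance(left, Null) or isinstance(right, Null):
            return NULL
        # (Further simplification rules, and the structural __eq__ and
        # __hash__, are elided from this sketch.)
        self = super().__new__(cls)
        self.left, self.right = left, right
        return self
```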
A simple example of the code I built. Here's the input. You can define letters and digits here as fragments, because we don't actually want tokens from those. Then we put them together into four different types of tokens, and all the things on the right are, of course, regular expressions. So we type that in and put it through the tool that I built, and you get a DFA, which of course you can't read at this size, but it looks very pretty from a distance. There's a lot of detail there. And then, obviously, you can run that in linear time, and it supports Unicode.

A slightly larger example: I took a Pascal lexer from the Internet and ran it through Flex, and Flex came up with 174 states. My tool, which is called Epsilon, produced 169 states, so roughly the same. So it seems to work.

As I said, the tool is called Epsilon, and it's up on GitHub. Beta testers and contributors are welcome. There's a bunch of stuff not done yet. I need to do start conditions, which is something that Flex supports and I don't support yet. The tool can generate different code targets: I've got Python, and DOT for visualizations, at the moment, but I don't have C, so I need to add a C target. The actual regular expressions are pretty close to Perl's, with a few extra features. So that's kind of where it's at, and again, I very much welcome any contributions.

So, acknowledgements. Epsilon was directly inspired by and based on the work of Owens, Reppy and Turon, which I referenced. And of course, without the work of Brzozowski, none of this would be possible. Thank you very much. And I think we've exactly run out of time, so any questions I'll take later. Thank you very much.