Okay, so we'll start with an announcement. The announcement is that, to spice up the class and to give those for whom this is too easy a reason to live, there is a contest. It's a parser contest which has two parts. One of them is going on essentially now: the homework parser that you are working on. You've got a working parser, but it's sort of a snail parser, and you really need to turn it into a fast one. We'll measure the performance when you are all done. So Sunday night, well, maybe Monday night, Sean implemented, or is about to deploy, a system where you will submit your parser; we'll measure it always on the same hardware on our server, and we'll post the current standings at the end of every day, anonymously, on Piazza. So you invent, presumably, some secret ID, "the parser hacker" or something, and you will see yourself at the top, maybe. So this is part one. Part two will happen at the end of the semester when you are done with your own parser generator. So that's the story of PA4 and PA5. In PA6, you will write some fun little compilers, and then at the end we'll measure everything, look at the code, how beautiful it looks, and we'll have a winner. Last time we taught the class, the winner was, unbelievably, five times faster than our solution. So I'm curious whether somebody can beat that. So much for the announcement. Today there are no laptops in class, but for those who really need something to occupy their brain, I do have a puzzle. The puzzle is: write a regex that tests whether a number is prime. You know the solution already? Well, you have some time to think about it. Hint number one: it needs to be a regex, not a regular expression. If you don't know what the difference is, you'll know by the end of the class, but I'll give you a hint what it is. Regexes come with an extension to regular expressions, and this special notation, backslash one, matches the string matched by the first group.
So this regex — c, then a group of double-a or double-b, and then the special thing here — matches what? It matches c, then aa or bb, and this last part matches whatever this group, which is the stuff in parentheses, matched. So it matches only c-quadruple-a or c-quadruple-b. That's pretty much all you need to know. So remember what's special about it: it doesn't match particular characters. It matches the characters that were on the input, so it essentially makes a reference to the result of the earlier match. So get your paper and pencils out, start working on it, and hurry up, because I have another puzzle coming up in the middle of the lecture. Actually, much sooner than that. All right. So for those who still want to listen to the lecture, where are we? We are now at the stage where we have seen enough tools, like coroutines, using which you can build control-flow constructs, and parsers, using which you can essentially compile input into target languages and build various tools like the regular expression highlighter which you saw during the demo last time. So we are essentially here. We should now be able to start seeing the value of programming under the abstraction, which means programming the things that other programmers use, rather than being consumers of the programming technology. And so we are here, and we'll start looking at small languages and various constructs from now on. Over the next three lectures we look at languages, today at regular expressions, because they are the most studied small language, but we've seen a bunch of them already: the unit calculator, grammars themselves are languages, regexes are, jQuery; Prolog is not so small, but we saw a lot of the spectrum. So what we'll do with regular expressions: they're already invented, but I'll walk you through a rather revisionist history of regular expressions that I twisted for pedagogical purposes, so that we have the feeling of inventing them ourselves.
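To make the backreference concrete, here is a small JavaScript sketch of exactly that example; the anchors `^` and `$` are my addition so the regex must match the whole string.

```javascript
// The backreference \1 re-matches the exact text captured by group 1,
// so this regex accepts only "caaaa" and "cbbbb" as full strings.
const re = /^c(aa|bb)\1$/;

console.log(re.test("caaaa")); // group 1 captured "aa", \1 matches "aa" again
console.log(re.test("cbbbb")); // group 1 captured "bb", \1 matches "bb" again
console.log(re.test("caabb")); // group 1 captured "aa", but \1 cannot match "bb"
```

Note that `\1` is not "aa or bb again" — it is whichever of the two the group actually matched, which is what puts regexes beyond plain regular expressions.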
And then we look at issues of how you actually integrate them into some other language, because with regular languages by themselves you cannot do much, but they can take problems from regular programming and make those problems smaller and more concise. We look at two different ways of implementing them, and then we look at subtle issues of semantics between true regular expressions and so-called regexes. So let's go. We'll start by looking at what people use them for, and I have five sample uses of regular expressions. There are hidden slides in the slide deck which you can see after I post them, so I'll go through them relatively quickly here; the details you can see at home. Web scraping is what you saw in the first lecture, right? You got the HTML page and you had to find within it, presumably with regular expressions, certain strings of interest, and then perhaps find mailing addresses and turn them into URL links into Google Maps. Cucumber is a Ruby framework for testing. It allows you to write your test cases pretty much in natural language, and your test cases look exactly like what you see over there. So you can see that this is essentially the description of a test in a string language, which you then process partly with Ruby overloading and partly with regular expressions over strings. Lexical analysis is the one use of regular expressions that typically is taught in compiler classes; usually you don't learn anything else but that, but we'll cover a broader range. What happens in lexical analysis, or scanning, is that you look at the input characters, you break them into so-called lexemes, which are these, and then you return tokens, some of them associated with the lexemes, some of the lexemes converted into numbers, such as here. So the product of the lexer is the list of tokens as they appear on the input, some of them with additional attributes. A lot of string processing is about just name manipulation.
Imagine you have configuration files which describe how you run your tests. You might have variables, such as here, which describe a directory, and there is a string manipulation language which expands these variables into a name, properly handles backslashes so that you can use them in strings, and does the substitution recursively if necessary. These are surprisingly hard to get right. And then there are the usual search-and-replace problems: you may want to write a program that replaces this backward string with the appropriate forward string. So that's the spectrum. And here is the second question, if you want to keep yourself busy; it is about lexing, the typical compiler problem. As you've just seen, a lexer is the front end of a compiler or interpreter which reads the input document character by character and produces these tokens, right? Those that you see here. One, another one, another one, another one. And you can think of those as being produced before anything else happens, right? After that the parser comes and processes these into an abstract syntax tree, okay? Sometimes, however, that's not how it works. Sometimes you cannot perform the lexical analysis, the chopping into lexemes, without cooperation with the parser. So, a question, if you know JavaScript, and you've probably seen enough of it in homework one: find a JavaScript scenario, a fragment of a JavaScript program, where you cannot look at the program, chop it down into these tokens, and then hand them to the parser to do its job. In other words, we are looking for a case in JavaScript where the tokenizing itself needs to be syntax-sensitive. The parser needs to tell the lexer: at this point I'm in such and such a state, therefore tokenize like this, as opposed to: I'm in a different state, okay? So you can think about that. The answers will come later in the lecture. Okay, so now we are ready to do the revisionist history of regular expression invention.
It didn't quite happen that way, but it doesn't matter, because it is going to illustrate what happens when you design your own small languages. When you want to design your own small language, in other words, when you want to simplify programming, it really starts by looking at how people program and saying: there is a lot of crud in here. I want to simplify it. I want to write constructs, abstractions, that allow you to talk directly in the logic of the program and hide everything else, the mechanics, the plumbing, under the abstraction. I need to figure out what sort of programming model is suitable for hiding things under the abstraction. Believe it or not, it's not always procedural programming. You cannot always hide the plumbing in a procedure and just say, oh, I'll call the procedure and then I'll call another, the way, say, sockets hide network programming in the calls to open, close, connect, and so on. So let's look at regular expressions. As a case study, we look at implementing a lexer, which is the tokenizer in the front of the compiler. We'll write it by hand in imperative style, the way you would do it in C or JavaScript. We'll look at where the plumbing is, and then we'll design the abstraction, okay? And we'll do a really simple scanner. It only has four different kinds of tokens. An identifier, which is a sequence of one or more letters or digits starting with a letter, a typical description of a variable. Then it has one double equals sign, and a plus, and a times, okay? That's all it has. So here is how it could look. I'll let you read the imperative version; it'll take you just a few seconds. This should be pretty trivial. Each time we call this scanner, it reads the next token from the input. Well, the token technically is this value here, and the lexeme is the chunk of characters on the input, but often we use "token" to describe both.
So it reads the token from the input, consumes what has been read, and leaves the rest of the input unchanged, so that when it is called again it can give you the next token. So it reads a character and checks it for equals. If it is equals, it reads another one, checks for double equals, and then it returns the appropriate token. Otherwise it checks for plus and times, and if it is a letter, it will read characters in a loop until it sees something that is not a letter. It needs to return that character to the input, because it had no choice but to perform this lookahead so that it reads the longest identifier, right? So that it consumes the full identifier, and then it returns the ID. I'm not doing everything I should be doing here; in particular, error checking is not handled. So if this is your input, which is not legal because it contains a single equals, and we only support double equals in our language, that case is not going to be handled. So let's assume that the input is lexically valid. We assume that nobody would be so mean as to give us incorrect input, because then things would be even nastier. And how nasty could they be? Well, people write these imperative lexers partly because they are fast, partly because, I don't know, they grew out of small lexers and then just grew bigger and bigger, and this is a fragment of a lexer that I believe is still in the Firefox browser. It is the lexer for the JavaScript interpreter in Firefox. They may have replaced it with something else by now, and of course it is much bigger, hundreds of lines, and needless to say this might be difficult to maintain, right? You have things like goto statements for error handling, and, I don't know, I wouldn't like to be the maintainer of this lexer. So let's look at it and see where the logic is and where the plumbing is. So can you identify the logic in the program? The thing that conveys the specification, that is different for every lexer you write, and that we actually have to write, as opposed to having it generated by a compiler.
So how many pieces of logic are here? How small can that be? So let's start with what we actually need to keep in the programming model that we want to discover. All right, that's fine. I'm happy to get an answer from somebody who's just scratching his ear. Okay, so somehow the mapping of this to that, right, would be good to maintain, all right? So this is our logic, the mapping of this to that. All right, what else do we need to preserve, if we had some way of hiding the rest? Okay, so somehow this and this, or that, and presumably a mapping of ID to this, okay, but also the fact that this is happening in a loop somehow, right, that needs to be preserved. Is there anything else we need to preserve? Similarly, there are plus and times, of course. What else do we need to preserve? Is this enough to say? Uh-huh, okay. So I think you are describing a scenario where we may have ABC and then plus ABC, and you would like to recognize that this is one ID, a plus, and another ID. So clearly the code here does it correctly, right? Because it will eventually run the loop, it will run into a lookahead character which is not a letter or a digit — it will be a plus — it will return it to the input, then read a plus, then read another ID. So the code does it correctly. Do we need to somehow say anything about the boundary? So let's go back to the table and see if it would be enough to say that, right? If I just give you this description, I'm not telling you anything about boundaries, and I think that would, in this case, be sufficient to tokenize. Often what scanners say is: this is whitespace. Tokenize it also, but throw it away, don't pass it down to the parser, right? So whitespace, comments, all that would be thrown out. So I don't think we need to say that specifically. So indeed, this is essentially the logic, what you identified, but there is much more plumbing to it. So what plumbing do we have? Can we identify the elements of the mechanics that actually do the work?
So what is the first piece of plumbing? Yes, you'd rather not see that in your code, so that is one thing. This lookahead and backtracking — it's not quite backtracking, but pushing back of the lookahead — you'd rather not see. Okay, what else is nasty? So this would be — okay, so the entire decision tree that you build from if statements, okay, excellent. What else might be plumbing? The reading itself, right, of the characters, the fact that they disappear from the input; that's the same in all scanners, so why make it manifest at the top level? Do you see anything else? So let's see, I have a few here. So the reading of the character, right, is the how part, how things are done, as opposed to what needs to be done. That's one thing. Then the returning itself — okay, this is not such a big deal, but why should I even see the return, when we really just care about the mapping between the lexeme and the token, okay? Then we have the undoing; we got that. Then there is essentially the decision tree built out of ifs and while loops that you need to build manually, okay? So that's all. So can we hide the plumbing? Well, we would like to avoid building that tree; the call to the read method, we avoid it; the call to return; and the explicit lookahead code should not be there either, okay? So ideally we want code that looks just like that specification. Can we get close to that, right? So what we want is to separate the how — the plumbing, how the computation proceeds — from the what, which is the description of what we want to do. And yes, there is such a model: finite state automata. How would you know? Well, you look at the code somewhat differently and you will see that what we do in every state of the automaton is: we read a character, then we compare it with c1. If it is c1, we make a transition here. If it is some other character, like c2, we make a transition into a different state.
And then we repeat it, and repeat it, until in some state we say, oh, return a token, because by now we have identified the entire lexeme. Now, how could you see that there is a finite state automaton in your code when you look at the logic? Well, you look here and see: well, this is our logic. This is where we read the characters. And the transitions here are actually the comparisons with the characters here. So this is the star and the plus, this is the letter, and here is a letter or a digit, letter or a digit, and you get your finite state automaton. So that's how we essentially abstracted it away, by finding the logic and suppressing the plumbing. It's not always that way, because depending on the situation, the thing that you want to highlight and the rest that you want to hide are different, right? But it's best to talk to people who program in the domain and ask them: what are the things that you use to describe what the program should do? What is the vocabulary that you use to talk to each other? And that often identifies the concepts that should show up. So the declarative scanner has two parts. The declarative part, which is the specification: what are the lexemes, what are the tokens that we return when we find them? And there is the operational part, the how, which is common to all scanners and hidden, like in a library. This is what does the actual reading and pushing back of the lookahead and so on. Nothing surprising here, but it's good to distinguish the declarative and the operational: the one that we would like to describe concisely and the one that we would like to hide, okay? So we are not quite done. We have identified that state machines do the work, and with them we can describe everything, but it would be painful to describe the lexer with a state machine, because you don't want to draw states and lines and circles. That would be a bit painful. So what do we need? We need a notation for state machines.
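One way to see the declarative/operational split concretely is a table-driven scanner: the declarative part is just a list of (pattern, token) pairs, and one generic loop — the hidden "how" — does all the reading. This is a sketch with names of my own choosing, not the lecture's library.

```javascript
// Declarative part: what the lexemes look like and which token each maps to.
// Order matters (ID before EQ would be wrong the other way around for some
// languages; here the patterns don't overlap). null means "match but discard".
const spec = [
  [/^[A-Za-z][A-Za-z0-9]*/, "ID"],
  [/^==/, "EQ"],
  [/^\+/, "PLUS"],
  [/^\*/, "TIMES"],
  [/^\s+/, null], // whitespace: tokenized, then thrown away
];

// Operational part: identical for every scanner, so it can live in a library.
function scan(input) {
  const tokens = [];
  while (input.length > 0) {
    let matched = false;
    for (const [re, tok] of spec) {
      const m = re.exec(input);
      if (m) {
        if (tok !== null) tokens.push([tok, m[0]]);
        input = input.slice(m[0].length); // consume the lexeme
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error("lexical error at: " + input);
  }
  return tokens;
}

console.log(scan("ab == x + y"));
```

All the plumbing from the imperative version — reading, lookahead, pushback, the decision tree — has disappeared into `scan`; only the specification remains visible.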
And this is what happened in the 50s with Kleene, who used state machines to describe human cognition, the states of the human brain, and he invented the notation that we use to this day — concatenation, repetition, alternative — and that became the textual language of state machines. And in our example, a, b-star, the whole thing starred, describes the state machine that we have here. So that's essentially it; it's all familiar to you, but it shows a sort of revisionist history of how you could have discovered it yourself. So here are regular expressions today. In addition to what Kleene had, people invented a bunch of other notations, such as plus, character classes, things that do not match, digits, words, and so on and so on, okay? And it goes so crazy that this blog entry probably describes the state of regular expressions today really well. So I'll let you read it, because some of them are interesting. And yes, indeed, you can overdo your support for regular expressions and make it easy to do things with regular expressions and hard to do them with anything else, and then perhaps you are in the wrong spot of the spectrum, where you have a language that forces you to do things in an unnatural way. Okay, so now, how do we implement regular expressions? We'll go over two implementations, but before we do that, let me tell you a true story. Okay, so the true story starts with this, okay? Here we have a regular expression. What it does is: you first need to match X, then any character one or more times, right? Then this whole thing here one or more times, and then X at the end. So this string will match, and this one will not match, okay? And now I have a question for the class. There will be four answers, and the four answers are: which of these will be fast? One: this one is going to be very fast, and this one will also be fast, maybe negligibly slower. Two: this one will be slow and this one fast. Three: this one fast and this one slow. Four: both will be quite slow.
So who thinks it's number one, that both of these will match really fast? No one votes for one, okay. Who thinks it's number two? Who thinks it's number three? Okay, and who thinks it's number four? Okay, so we are leaning towards the right intuition. So why do you think three wins here? Why would three be the right answer? So the answer is that since there are many different ways to match this against these characters — say, maybe this, and maybe then three, and maybe five, and then one — there are many different ways to partition that string, and you need to try all possible ways before you can conclude that there is no match, okay? That's a correct answer. But if there is a match, you just need to find one way to match those groups against this string. You match and you are done; you don't need to find other ways, okay? So that's indeed almost correct, okay? So let's go to the browser here. Can you read this? I hope you can. So this is the Chrome browser — I mean, the JavaScript console. And I'm going to match this string against this regular expression, okay? This should match, and indeed it returns right away. But now I'm going to make it non-matching, and I'll show you the CPU load. And it is going to do what you described. You will see now a nice green bar. It took longer because it's running on battery power, but I will overwhelm it easily. And now it may not even terminate, you'll see. Oh, it did finish. But a few more equals signs would really mess it up. All right, so indeed your prediction was correct. It's like that in JavaScript. It would be the same in Python; we don't need to try it, you can try it at home. But look here. I'm using a different tool. And what does this tool do? This is an echo: I'm printing this string, terminated with X, to gawk, which is a version of awk, a tool created in the 70s. And what it does is print those lines of the file which match this regular expression, right? And it does it fast.
And now I make it a non-matching one. And it does that fast as well. And indeed I can make it as large as you want and it's still very fast. So what is going on? I didn't cheat, okay? It clearly doesn't use the brute-force method. So what method does it use? Why can it conclude so quickly that it doesn't match? Well, if it could go from both ends, it would help, because then you cut the exponent of the exponential in half: rather than one big tree, you have two smaller trees that meet in the middle. But let's make it this big, so that even that trick would not help, and it still does it really fast. So what does gawk do? Yes? Well, it could, but I don't believe it does. It actually leaves the state machine, which has essentially two nested loops, intact, and it still does it fast. How do you do it on those project assignments? That would not be the right spirit here. I think they do want to do a good job now. Sorry to pick on you. So, no, it doesn't do that. So let's look at how this is done. We'll start with the implementation that is indeed based on backtracking, which is slow on some pathological cases and fast when you have a match. But the pathological case, it turns out, is not so uncommon. I don't know whether it happened to you in homework one, but the last time we taught the class, students were again playing with Greasemonkey and manipulating web pages, but the problem was different. The problem was to find mailing addresses — identify what looks like a mailing address, of course with a regular expression, and then wrap it into a link such that when you click on it, it goes to Google Maps and shows you that address. And it's not easy to find mailing addresses, because they come in different formats and they can be interwoven with HTML tags.
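The browser demo from a minute ago is easy to reproduce yourself; a sketch follows. One caution: on the non-matching input, a backtracking engine really does take exponential time, so keep that string short.

```javascript
// X(.+)+X: an X, then one-or-more groups of one-or-more characters, then X.
const re = /X(.+)+X/;

console.log(re.test("X=========X")); // matches: one way to split the middle is enough
// On a NON-matching input (no trailing X), a backtracking engine must try
// every way of partitioning the middle into (.+) groups before giving up.
// k middle characters can be partitioned in 2^(k-1) ways, so this is
// exponential — keep the string short or the call will appear to hang:
console.log(re.test("X========="));
```

With nine `=` characters the engine explores only a few hundred partitions, so this finishes instantly; each additional `=` roughly doubles the work, which is why a few more equals signs in the demo "really mess it up".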
So students wrote really crazy regular expressions, and in the process some of them wrote regular expressions that would time out. Did that happen to you during homework one? Okay, so we need to go back to mailing addresses so that you get the experience. And here is a sort of shortened regular expression that would time out. What it does is match a word followed by spaces, right? So this is essentially a word, this is a bunch of spaces, and you repeat that some number of times, okay? So word, spaces — this is a list of words. And then at the end you need the final literal there; the student was probably looking for something like an address that ends with Way or Street, okay? And that would time out. And it is the same story as with the X, because you need to try all possible backtrackings here, only to see whether you can match that final literal, right? And why does it run so slowly? Because you really need to eat only so much of the input so that you then find this literal here; but there are many ways to do the eating, and you have a condition at the end that must line up exactly, unlike here. So the slow rejects happen in practice. Okay, so how do we implement regular expressions with backtracking? So let's start with this regular expression. A simple solution would be: we turn it into a context-free grammar, and then we know how to write a parser for a context-free grammar with Prolog, and we are done, okay? So let's do this as an exercise. Let's turn this regular expression into a context-free grammar. To make it easier, let's start with a simpler one, a-star. Who can turn it into a context-free grammar? Perfect, so this is that, okay? So let's now do a, b-star, all starred. So what would be the grammar for that? So S goes to what? Let's think about it compositionally. Can we first get a grammar for these strings, right? Let's call the nonterminal B, and it will be exactly what we had before, right?
B goes to b B, or epsilon. And now we'll write the grammar for the whole thing: it will be a, then whatever is generated from B, okay? And then the whole thing can repeat, so we'll do S again, or epsilon. Did I get it right? Yeah, that's it. So this is the grammar for our example. And the reason it works is that regular expressions are a subset of what you can express in a grammar. Not everything that you can express in a grammar can you express with regular expressions, but the other way around, you can. So can somebody give me an example of a grammar that I cannot express with regular expressions? Think about what regular expressions would not be able to express. It will be clear at the end of the lecture, but let's see, okay? Right: checking whether a string has so-called balanced parentheses, right? Checking that the number of these parentheses and these parentheses is the same, that they are balanced. Why can't I do it with regular expressions? Uh-huh. And a simple argument, essentially along your lines, is that you may need to count the parentheses on the left so that you know how many you have seen. And the automaton that corresponds to the regular expression — there is a correspondence between regular expressions and automata — needs to have a fixed number of states. So you could have a state for: oh, I have seen one parenthesis, and another one, and another one; and this would be an automaton that checks whether the string has a balanced number of parentheses. You read one, and then when you read the closing one, you are done; it does have a matching number; similarly for here, similarly for here. And you can build it like this, except you can never make it large enough to handle strings of arbitrary length, because the number of states must be fixed. It's a finite state automaton, but the strings I can feed you on the input can be arbitrarily large. So balanced parentheses you cannot do.
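The grammar we just built for (a b*)* — S goes to a B S or epsilon, B goes to b B or epsilon — can be turned directly into a recognizer, one function per nonterminal. The lecture does this in Prolog; here is a JavaScript sketch of the same shape (deterministic in this case, since a and b never overlap; in general you would backtrack, which is exactly what Prolog gives you for free).

```javascript
// Recognizer for the grammar of (ab*)*:
//   S -> a B S | epsilon
//   B -> b B   | epsilon
// Each nonterminal is a function from a position to the position after
// the parse from there; the tail recursion is written as a loop.
function accepts(s) {
  function B(i) { while (s[i] === "b") i++; return i; } // B -> b B | epsilon
  function S(i) {                                       // S -> a B S | epsilon
    while (s[i] === "a") i = B(i + 1);
    return i;
  }
  return S(0) === s.length; // accept iff S consumes the entire string
}

console.log(accepts("abbab")); // true: (a bb)(a b)
console.log(accepts("ba"));    // false: cannot start with b
```

The empty string is accepted too, since both nonterminals can derive epsilon — just like the regular expression, which can repeat zero times.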
But back to this: once we have the grammar, we turn it into a Prolog program, as we saw about two lectures ago, and we are done. Prolog backtracks, and we would have a backtracking implementation of regular expressions. Another way to do it would be to turn this regular expression into an AST, presumably with a parser for regular expressions. And how does the AST look for this regular expression? How many nodes do we need? Four, five, three? So what would be the root node of the regular expression? Yeah, somebody said a star. It is a star. What would be the child of it? Or does it have more children? How many children does it have? That's a skill you should slowly start acquiring: look at the structure of pretty much any program and see the AST that captures the logic of it. So how many children? Who thinks one? Who thinks two? One, two, okay. So what are the two children? Well, let's think about the star node. The star node is what? It corresponds to this regular expression, right? A regular expression followed by a star, and it means that no matter what regular expression you have, you can repeat it zero or more times. So, is this operator a binary or a unary operator? It's a unary operator, right? And the one argument to it is this R. So therefore we'll have one child here, because it's a unary operator. What is the child of the star? It's a concatenation, all right. And is concatenation binary or unary? It's binary: it takes two strings and concatenates them. So what is here? This is a, and what would be here? This is this star here, right? So that star there is this one, and this star here is this one here. And it has one child, and that's b, right? Okay, now what do we do with the AST? We do with the AST things like translating it to code or interpreting it directly. So how would we implement it with coroutines, right?
If you remember homework two, which was optional, but we'll get to it before the exam: homework two showed you an implementation of coroutines, essentially by giving you three library calls, or maybe four, with which you can implement regular expressions. So something like this would be implemented by calling library functions, and it would look something like star of cat of char a and star of char b. So this is nothing really, just a simple conversion of the AST into what looks like a Prolog term, but these are actually calls to a library. You would call that. Then there was another call, called something like match. You would give it the actual string, with the regular expression as the second argument, and it would return the match. And it would work beautifully, because these functions — star, cat, char — were implemented with coroutines, and so the backtracking was hidden under the covers. So you could just translate the AST into this program and be done with it. And you could then write in Lua something like this: for each match, let's call it just M, in matches. Now what does this do? It iterates over the matches of the string S, yes, against the regular expression. And how would this be compiled? When you read in the program, the parser will of course parse the whole program, but it will recognize that this thing over here is a regular expression, call a parser for regular expressions, create this AST, and then, by walking over the AST, it would emit essentially this code. So you would translate this fragment into this fragment and then just call the loop, which, using coroutines, would actually interpret that regular expression and do the match.
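A sketch of what such a star/cat/char library could look like. I'm using continuation-passing in JavaScript instead of Lua coroutines, so this mirrors the lecture's calls in shape but is my own implementation; the point is the same — the backtracking is hidden inside the combinators.

```javascript
// Each combinator takes (string, position, continuation) and calls the
// continuation with the position after its match. A false return means
// "this way failed, try the next alternative" — the hidden backtracking.
const char = c => (s, i, k) => s[i] === c && k(i + 1);
const cat = (a, b) => (s, i, k) => a(s, i, j => b(s, j, k));
const star = r => (s, i, k) =>
  k(i) || r(s, i, j => j > i && star(r)(s, j, k)); // zero reps, or one more
  // (the j > i guard stops infinite loops on empty matches)

// match succeeds if SOME way of matching consumes the whole string.
const match = (re, s) => !!re(s, 0, i => i === s.length);

// The lecture's AST star(cat(a, star(b))), i.e. the regex (ab*)*:
const re = star(cat(char("a"), star(char("b"))));

console.log(match(re, "abbab")); // true
console.log(match(re, "ba"));    // false
```

Translating the AST into this program is purely mechanical — each AST node becomes one library call — which is exactly what the compiler's regex parser would emit.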
And this is how it happens in all the languages today: there is a little parser for regular expressions which builds this AST and then creates code like this — in some cases some bytecode, but in Lua with coroutines you can do it easily like that. So that's all we need to say about coroutines for now; we'll get back to them later. And let's look at how we'll implement it with automata. We look at two kinds of automata: deterministic and non-deterministic. How many people know what the distinction between the two is? Okay, so we'll go over it so that everybody understands. So, deterministic finite automata. We'll use them as recognizers. What are recognizers? You give the automaton a string, and the automaton will tell you: I recognize the string or I don't. Essentially it tells you whether this is a string from the set of strings that it recognizes, from the language that the automaton defines, or it is not, okay? So when you talk about identifiers, this would be recognized as an identifier, and this would not, because that's a floating point number, okay? Automata are usually drawn as graphs, which we've seen. There is a normal state, a start state, which has this special arrow, and there is a final state. The final state is special; we'll look at it in a second. And then there are these transitions. You make the transition when the next character on the input is A, okay? And so here is the transition: you go from one state to another when A is the next thing on the input. And so, when is a string accepted? The string is accepted if, going from the start state, you can consume the entire string all the way to the end, and when you do so, you end up in the accepting state. You reject it otherwise. So you reject it when maybe you end up in the final state but you did not consume the entire string, or you consume the entire string but you don't end up in the final state. So again: it's accepted when you eat the whole string and you are in the final state, okay?
So here is an example. This automaton describes the identifiers of JavaScript. They need to start with a letter, underscore or dollar sign, and then there are zero or more of the same things plus digits. So when I give you A underscore one, it will read A here, then it will read underscore here, then it will read the digit one here, and it will end up here when we are done reading the string. And this is an automaton for integer literals. Integer literals can start with plus or minus and then they have one or more digits. Here is the final state again, but you don't need to have the sign; it's optional, therefore you can go directly in here. Okay, so what does this automaton recognize? Look at it and try to write in English a description of what this automaton recognizes. So who's got a concise description of the set of strings recognized here? Okay, so any binary string that ends with two consecutive zeros. That's almost right. Can somebody give a counterexample? That might be a good game to play. Almost right. This one ends with two zeros but it would not be recognized by this automaton, right? Because you will read one, and then read one again, and then you do zero, zero; you are in the final state; you cannot read any more characters because you do not have this edge, and you have read only up to here. So you landed in the final state without reading everything, okay? So let's try to refine the description of these strings. Except at the end, excellent, perfect. Now how about this one, right? So it's a binary string ending with two or more zeros, and it could have single zeros interspersed between the ones, all right? And now this is the string that you initially described, right? Which says it ends with two zeros. It could have more, but the last two characters are zeros. Let's be precise about that. All right, so I think we understand that.
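To make the "ends with two zeros" discussion concrete, here is one way to encode a DFA for binary strings whose last two characters are zeros, written as a transition table. This is a sketch of the standard three-state automaton for that language, not necessarily the exact one on the slide.

```python
# DFA for binary strings ending in "00", as a transition table.
# States count how many trailing zeros we have seen so far: 0, 1, or 2+.
delta = {
    (0, '1'): 0, (0, '0'): 1,
    (1, '1'): 0, (1, '0'): 2,
    (2, '1'): 0, (2, '0'): 2,
}

def ends_in_00(s):
    state = 0
    for c in s:
        state = delta[(state, c)]   # one table lookup per character
    return state == 2               # accept iff two+ trailing zeros
```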
So there is a formal description of what an automaton is. It just pins down what's really involved: the characters we are considering, the alphabet; the set of states; which one is the start state; you could have one or more final states; and then there is the transition function, which tells you, if I'm in such a state on such a character, I make a transition to another state. That's all there is to it. Now NFAs. Why would we care about non-deterministic automata? Non-deterministic automata are essentially these oracular things that we have seen before. They are like DFAs except they have a little extra twist. What happens in DFAs? In each state, we could make at most one transition. Meaning: we are in this state, the next character is A; we can make a transition on A into exactly one state. Did we have to make a transition? No. If no outgoing transition was labeled with the character that is on the input, we would just get, so-called, stuck. So we did not have to make a transition, but if we made one, that transition always had to be into exactly one state. In a DFA you could not have two outgoing transitions with the same label on the same state, because then you would not be able to decide. And there were no transitions that you could take without consuming a character; there were no so-called epsilon transitions. Why was that significant? What simplification does it give you? You could not have optional parts, but the simplification for the implementation of the DFA was that you could always decide what transition to take by just looking at the character. If I give you an automaton where this is labeled with A and this is labeled with A, and now A is the next thing on the input, you cannot decide which transition to make by looking at the character, because it's A and you could decide to go from here to here, or to here, and you don't know how without really looking into the future, or asking the oracle.
And indeed this is what non-deterministic automata do: they have the power to ask the oracle. So they are somewhat harder to implement, but they are more concise. They do allow multiple outgoing transitions for one input, they can have these epsilon moves, and there is a reason why we are talking about them. And they still need only finite memory. You can have only a fixed number of states, say 50 or 70; the number of states cannot grow as the input grows, in other words. And so they can still be in multiple states at once, because if you could make a transition here along A as well as there along A, well, you can think of it as asking the oracle: do I go here or do I go there? Or you can think of it as trying to be in multiple states at once, trying all these alternatives simultaneously, seeing which one of them pans out and manages to consume the string. It's still a finite set, because the number of states you can be in is finite. So that actually saves us quite a bit. You'll see, okay. So what strings does this consume? One or more ones. And you see it's an NFA, because when you are here initially, this is a start state, and when you see this string you don't know whether you should go here or make a self-transition, because you don't see the future; but the oracle would tell you that on one, one, one you need to do one and another one and then you go here. Now, there is the epsilon transition, so when you are here you can stay here, or you could make a transition without consuming any input. Why is it useful? Well, maybe there is a transition like this, and it's useful to have the freedom to get into this state without consuming anything. You'll see that soon. So let's play the NFA on this string. We start here; we are in this state. When we read one, what do we do? We have no choice but to make a self-transition, right? Now when there is a zero, what do we do?
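The "multiple states at once" idea can be sketched directly: instead of one current state, keep the set of live states, and take every enabled move from every one of them. All names here (nfa, eps, eps_closure, nfa_accepts) are illustrative assumptions.

```python
# Simulate an NFA by tracking the SET of live states.
# nfa maps (state, char) -> set of successors; eps maps a state to its
# epsilon-successors (states reachable without consuming input).
def eps_closure(eps, states):
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in eps.get(q, ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def nfa_accepts(nfa, eps, start, finals, s):
    current = eps_closure(eps, {start})
    for c in s:
        step = set()
        for q in current:                    # every blue token...
            step |= nfa.get((q, c), set())   # ...takes every possible move
        current = eps_closure(eps, step)
    return bool(current & finals)            # accept if ANY token finishes
```

Tokens that get stuck simply contribute nothing to the next set, which is exactly the "some of them may disappear" behavior from the slide.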
You don't know what the future of the input is, and neither does the automaton. So what do you do? You could ask the oracle, or download the slides and look at what the future of the input is and act accordingly, but no, you cannot do that. So what would you do at this point? We could try one path, right? Essentially that would be implementing it Prolog-style, with backtracking: you would somehow remember the input and then come back to the previous state, essentially checkpoint the input and try again. Or, what we can do: we try it along both paths. And why is that possible? Because we have a finite number of states. So essentially you say, okay, I don't know which one to take, but I'll take both choices, and that corresponds to being in multiple states at once. So we make both transitions, see? We went here and we also went here, which means we are in this state as well as this state, simultaneously. Now, as we get the next character, are all of these states going to survive, or will some of them disappear because we essentially got stuck? So what if we get a one? What do you think will happen if we get a one? And we got a one, look at this. So in what states are we going to be after we read in this one? Are we going to be in this state? Yes, because this blue thing, the blue token, will travel along this one, using this edge as the enabler, all right. How about this state, are we going to be here? No, because this token cannot get here, because this expects a zero, exactly. How about this, are we going to stay here? Yes, because this token can travel here. The token can replicate, right? You can go from one state to multiple. So indeed, this is what happens. This should not show up as blue; let me edit that, okay? So that's it. Now, why are NFAs attractive? Well, I think you can figure out what it means: you accept the string if any of the states can get to the final state, because that means the oracle could have made the right choice.
But we don't ask the oracle; we keep all the possible alternatives in flight. These are our blue tokens. So NFAs and DFAs are equally powerful. It turns out that you cannot express anything more with NFAs; you still cannot do balanced parentheses, right? So regular expressions and finite automata have the same power to specify what strings are and are not in the language, and NFAs are no more powerful than DFAs. You could, yeah, you could even parallelize them, perhaps. So that's a good idea, but they are definitely more compact, so let's look at this first. So this is the NFA for the strings that do what? How would you describe the set of strings here? Binary strings which, definitely, end with two zeros. What is the condition on the rest? Any number of zeros and ones can come before, yes. And indeed this is the NFA for the language that we expressed before with that DFA. So there are fewer edges, and it is sort of more readable, because if you care to say it must end with two zeros, this is all you need to do. And essentially, what are we saying here? This part is: we don't care, anything from the alphabet of zero and one can repeat there, right? This essentially says: don't care what the prefix is, as long as it ends with two zeros. Here, in order to say that with a DFA, we need many more transitions. So NFAs are nice and composable, because you say: anything here, concatenated with the condition that it ends with two zeros. And this compositionality we are going to exploit when compiling regular expressions into NFAs. It will be your first real compiler; that's one you don't have to implement, because I guess it's too easy. So, can we convert from here to here, and why would we want to do that? So first, suppose somebody converted it for us; maybe we cannot do it automatically, maybe we need to ask a human. But if you could get a conversion from NFA to DFA, what would the benefits be?
Now think about how you would implement a DFA. What would you use? Backtracking? Coroutines? Something cheaper than that. You would just have one variable which tells you: I am in state seven. Then you would read the next character from the input and you would say, oh, when I'm in seven and I see character C, I make that transition. So you would have a table: here are the states, and here are the characters, A, B, C, D and so on. And now you see: oh, if I'm in state seven and I see character B, I go to state, maybe, six. That table is our transition table. So one load from this two-dimensional table tells you how to make a transition. No coroutines, no backtracking, nothing. It's really efficient. So if we could convert this into that, then we could get the efficient table-based implementation of DFAs. Can NFAs be as efficient as DFAs? Well, not quite, because you need to keep track of which of the many states you are in. But one can convert NFAs to DFAs automatically. How? Well, essentially, you create a bunch of new states. How many? Well, if this one has N states, you may need up to two to the N states. So it could be exponentially many. And each new state corresponds to one particular configuration of those blue tokens. So maybe this state here corresponds to being here and being here; this state maybe corresponds to a configuration where we have tokens everywhere. That's why it's an exponential number: because there is an exponential number of token configurations. Luckily, it's still a finite number. So you have a finite, although exponentially large, number of these states. And now you place transitions between these states: if on character A you get from one token configuration to another token configuration, you make that transition. And then you can simplify the state machine; there are algorithms for that. So in the worst case it is exponential, but if you are lucky, it is not.
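The conversion just described, one DFA state per reachable configuration of blue tokens, is the subset construction. Here is a sketch in Python, ignoring epsilon moves for brevity; nfa_to_dfa and run_dfa are illustrative names, not a library API.

```python
# Subset construction sketch: each DFA state is a frozenset of NFA states
# (one "token configuration").  We only build configurations that are
# actually reachable, which is why the exponential worst case rarely bites.
def nfa_to_dfa(nfa, start, finals, alphabet):
    # nfa: dict mapping (state, char) -> set of successor states
    start_set = frozenset({start})
    dfa, seen, todo = {}, {start_set}, [start_set]
    while todo:
        S = todo.pop()
        for c in alphabet:
            T = frozenset(t for q in S for t in nfa.get((q, c), ()))
            dfa[(S, c)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    dfa_finals = {S for S in seen if S & finals}  # any token finishes
    return dfa, start_set, dfa_finals

def run_dfa(dfa, start, finals, s):
    state = start
    for c in s:
        state = dfa[(state, c)]     # one table lookup per character
    return state in finals
```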
And for typical automata, it is not exponentially large. So you can use the NFA to think about the problem, to design it, to compile into it, and then run these well-known algorithms to turn it into a DFA implemented with this really efficient table. Which is what grep does, and this is why it was so efficient. So before we go on: why was this automaton so efficient? We actually had a plus here, but never mind. Why was it so efficient? The automaton looked like this: we make a transition on X, then we accept any character here, and any character here. The dot here means any character, and the edge over there is an epsilon edge. So, did I compile it correctly? This automaton doesn't actually try to partition the string in all possible ways, right? We believe that this automaton does correspond to the regular expression over there, and it simply can be in multiple states. Essentially, as it goes through these equal signs, which really messed us up before, because there are so many different partitionings if you use backtracking regular expressions, it will just be in this state and that state most of the time, reading those equal signs. Eventually you run into an X, or actually you read to the end, right? You read the string to the end, and you realize: oh, I read the whole string and I'm not in the final state, because this is the final state. And so by reading each character once, in linear time, you recognize that this doesn't match. Now the question is, how do we take a regular expression like this and compile it into that? So think about that. I did it by hand, and big regular expressions you do not want to compile by hand. So you'll get to it. But before the compiling of a regular expression to an NFA, let's first look at the puzzle. So, who's got an answer to the primality test puzzle? You have an idea? That counts a lot, because I didn't solve it myself. I, of course, read a lot about regular expressions, and this one was just great, okay? So let's start. Do you? So, test for... okay, that's excellent.
So you are using essentially a unary encoding of the number. If you have the number seven, you will have seven ones. That's a unary encoding, not binary: unary. The number of ones is equal to the value of the number. That's exactly what we need. So the first step, you're good, okay? We can express that conveniently in Python, by the way, with the string one times the number that we care about, because the star here repeats the string n times, okay? So far so good. Now how do we test for primality of this number? Exactly: if a composite number, like eight, can be written as four times two, then we will find a match for four ones, and then we insist that this match repeats; in this case the group appears twice. Exactly, so I love this. So how do you write it? You say, well, this is how many ones? Two or more. Why? Because, you know, seven is one times seven, so we would always find the trivial factor one; clearly the factor needs to be at least two, right? So this essentially says the factor is greater than or equal to two, and this essentially says: repeat it one or more times. Why does the multiplication work? Because, remember, this is a unary representation. So if you write a number like 16, then that's four and four and four and four. Okay, is that clear? Please. Right, so you need to do this sort of ugliness here: you need to say match the whole thing, match the entire string, okay? And you need a special case for one, of course. But let's see if I can actually run it. I thought I had IDLE running here... okay, I do have it here, okay? So what does group one print? Group one prints the first group's match. So this is the decomposition of the number: for the unary string of length eight, we decompose it, and eight is divisible by four. And let's try a big number and see whether it chokes. So a hundred is divisible by whatever that is. Can somebody give me a large prime number? Is 107 a prime? Okay, so this is actually a correct answer.
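The test just demonstrated can be reproduced at home in a few lines. This is a sketch of the well-known back-reference trick described above, using re.fullmatch in place of explicit anchors; is_prime is an illustrative name.

```python
import re

def is_prime(n):
    # Unary-encode n as a string of n ones.  The group (11+) tries every
    # candidate factor of size >= 2; \1+ insists that factor repeats to
    # cover the whole string.  A full match therefore means n is composite.
    # The n >= 2 check handles the special cases zero and one.
    return n >= 2 and not re.fullmatch(r"(11+)\1+", "1" * n)
```

Matching is where all the work happens: the regex engine backtracks through candidate factors, so this is a cute demonstration, not a fast primality test.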
There is no match here; if I do this directly, I get an error about a NoneType object or something. If I do print, I get None, which means no match, which means 107 is a prime number. What? Okay, and I should be prepared to reset the machine, maybe. Oh, it worked. It did work, okay. Well, you can try this at home; this is one thing you can try at home. But careful: this is not a regular expression, it's a regex. A regular expression does not have the power to refer to previous matches, because these labels, in regular expressions or in automata, are fixed characters from the alphabet. With this back reference, we are essentially saying: what is here is whatever was found there. You see how much extra power that gives you. You can now insist on this being what was matched there. So effectively you're adding a variable that has a different value in each execution, as opposed to a constant. And all of a sudden you can factor numbers. So you see the power of the constructs that you give to your language, and how much they let you do. And indeed, these regexes with back references we cannot implement on top of finite state machines. Okay, so how do we compile to those finite state machines? Of course we create an AST. And what is at the top level of the AST? It's a plus, okay. And how many children does plus have? Is it unary or binary? It has one, okay; which one? It is this OR, okay. That one is binary. Clearly it has a concatenation, and this is A or B. Now this one has what? Ah, C, thank you. And this one, is it a concatenation or a star? A concatenation, because star has higher precedence, so it needs to be lower in the tree. And this here would be A D, okay. And now we want to walk that tree bottom up, yes. Well, so, yeah, the dot is somewhat overloaded here. In the regular expression language that Kleene invented, it is indeed concatenation.
But concatenation often omits its operator, because when you just put things next to each other there is this implicit, sort of empty, concatenation operator. And regexes, the practical implementations of regular expressions, use the dot for "any character". But let's assume here that it is concatenation, so that our operator is explicit. Yeah, so that's somewhat unfortunate. Okay, so we'll propagate bottom up, and we'll build the final program from smaller programs. How? We'll build a program for this subtree, we'll build a program for that subtree, and then we'll merge these two programs here and propagate the result up, okay? This is how compilation happens. Now, what are those programs that I propagate up the tree? We are compiling into finite state machines, so what will I be propagating? Numbers? Strings? State machines, right, because we are composing things from state machines. So indeed, it will be state machines. So now, this is great, exactly, okay. There's always Control-C; it's doing something, so maybe it will change its mind. I think it is trying to get on AirBears. All right, so let's start here, and maybe I'll explain it on the board, and then you can use the slides at home. So we're going to learn how to translate all the primitives from the grammar of regular expressions. If what I find in the leaf is, well, let's start with a character A: I want to build a state machine for it, which I can pass up. How will the state machine look? It should be a state machine that accepts exactly this string. I have a start state, it has an A transition, and it has a final state. And I will pass this up, and I will tell the compiler up the tree: this is my start state and this one is my end state. And the compiler up there doesn't need to know how many states are in here; it just knows that this is a state machine. Okay, now what about epsilon? It's the same, right?
I could do the same thing, but I'll just put the epsilon here, and it will work. I have more states, more transitions, but who cares? Now, how about if I want to compile R1 or R2? I get two state machines from the tree evaluation, one for this, one for that. Each of them has a start state and an end state. How do I compose them? So we have our two state machines; they can have whatever inside. Now, how do I link them? To both, with epsilon, okay. All right, that's exactly it. Then we have a new state machine. How about the star? If I want to do R star, I get the state machine for R, which has a start state and an end state, okay. How do I turn it into a state machine for R star? I'll create a new state, just to make it simpler. And now an epsilon edge will go from where to where? Start state to the end state, all right, and then this way back; and this would be our new end state, okay. Well, is that enough? Or could we now potentially accept strings that we do not want to accept? I'll leave this for you to think about at home, yes. Oh, this is epsilon, right, because the whole thing matches an empty string, so you need to have a way of getting from the start to the end through an empty string, okay. So sometimes, when you are too parsimonious, as we are here, reusing this state as the end state of the inner machine as well as of the big thing, you may end up accepting more strings than you want, just because now you permit more transitions. That's one thing you need to be careful about when building this composition. But that's essentially it. This algorithm, by the way, is what enables grep to be so efficient, and it was invented by a Berkeley student, after he left Berkeley and worked for AT&T; but this is part of the Berkeley tradition. Perl, as it turns out, was also invented at Berkeley.
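The whole bottom-up construction can be sketched in code. Every fragment has exactly one start and one end state, and the composite rules only touch those two, never a fragment's insides; fresh states sidestep the "too parsimonious" trap just mentioned. The names (Frag, lit, cat, alt, star, accepts) are illustrative, and this follows the standard Thompson-style construction rather than any particular slide.

```python
# Thompson-style regex-to-NFA construction sketch.
import itertools
_ids = itertools.count()

class Frag:
    """NFA fragment: one start, one end, edges as (src, label, dst).
    A label of None is an epsilon edge."""
    def __init__(self, start, end, edges):
        self.start, self.end, self.edges = start, end, edges

def lit(c):                               # single character
    s, e = next(_ids), next(_ids)
    return Frag(s, e, [(s, c, e)])

def cat(a, b):                            # concatenation: a then b
    return Frag(a.start, b.end, a.edges + b.edges + [(a.end, None, b.start)])

def alt(a, b):                            # alternation: a | b
    s, e = next(_ids), next(_ids)
    return Frag(s, e, a.edges + b.edges +
                [(s, None, a.start), (s, None, b.start),
                 (a.end, None, e), (b.end, None, e)])

def star(a):                              # Kleene star: a*
    s, e = next(_ids), next(_ids)
    return Frag(s, e, a.edges +
                [(s, None, a.start), (a.end, None, e),
                 (s, None, e),                # zero repetitions
                 (a.end, None, a.start)])     # loop back for more

def accepts(f, s):
    """Naive set-based simulation of the fragment, for checking."""
    def closure(states):
        states, changed = set(states), True
        while changed:
            changed = False
            for src, lab, dst in f.edges:
                if lab is None and src in states and dst not in states:
                    states.add(dst)
                    changed = True
        return states
    cur = closure({f.start})
    for c in s:
        cur = closure({dst for src, lab, dst in f.edges
                       if lab == c and src in cur})
    return f.end in cur
```

For example, the language from earlier, binary strings ending in two zeros, is cat(star(alt(lit('0'), lit('1'))), cat(lit('0'), lit('0'))).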
Larry Wall was not in computer science; he was in linguistics. So we do have this mixed heritage here on campus, and it's only appropriate that we cover both and let them battle it out in the experiment which you saw. Thank you for your patience today, okay. Thank you.