I can never tell if that works. There we go. OK, so we'll start off today by introducing the lead TA, Mohsen. Mohsen, why don't you stand up? Give a nice princess wave to the class. So this is Mohsen. His email and his office hours are on the syllabus. He's also in charge of the submission system, so if you have questions about that, feel free to talk to him. Do you have anything you want to say? No, just start early on the project. All right, good. Yeah, start early. It's not going to be easy all the time. I like that. Cool. Any questions before we start today's lecture? No, everybody wants to get right into it. All right, I like that. So we left off on Monday talking about programming languages and how they work, and we're going to look at how a program can actually take raw bytes from a text file or whatever, interpret them, and perform some kind of computation, because that's really what we care about here. And so the first thing we're going to talk about is how we do that first step. How do we take those bytes and turn them into something that a computer can actually try to understand? So our programming language needs to have a clearly defined syntax, right? Why might that be nice? Somebody raised a hand slightly. There you go. So it can be parsed? It can be parsed by who or what? A program, right? So we're going to write a program that's going to parse it. We want the syntax to be well defined so that the computer can parse it. Sorry, yeah? We need them to call it because we know what it is. Yeah, so that we as humans know, right? So it needs to be unambiguous in the sense that when we write something, we know that the compiler or the interpreter is going to interpret it exactly as we specified. Because otherwise, we may write one thing, but it interprets it as something else. And then our program crashes or, worse, does the wrong thing.
And it's a heart monitor, and now you've just killed someone. Yeah, so we know what can be done and what can't be done. We know what a valid program looks like, right? The syntax defines that. So yeah, it's one of the first things, right? When you learn how to program in a programming language, you learn the very basic syntax. If it's Lisp, you'll see parentheses all over the place, and if you're not familiar with that, it may be terrifying. But if it's C, you learn that every statement has to end with a semicolon, right? That's part of the syntax. And so these things need to be well defined, and also so that the compiler writer and the compiler itself can understand programs written using the syntax and can enforce that syntax to make sure the program is valid. So think of it kind of like a system, like that example I used on Monday: a series of components. The input here is a series of bytes. And what we wanna do is get from this string, this sequence of characters, of raw bytes, to some kind of program execution. So the very first step, and that's what the lexer's job is, is to assemble this string of characters into something that a program can understand. And the output here is gonna be a sequence of tokens. So you can think of this as kind of a higher level, like, rather than, well, maybe I have an example on the other side. So yeah, okay, so in English, right, we have an alphabet. I hope everybody here can speak some English so you understand what I'm saying and understand the slides. So what's the alphabet of English? It's easy stuff, like kindergarten stuff, right? Or later, if English isn't your first language, and I promise I won't make fun of you. The alphabet, yeah, so what kind of letters, or what makes up the English alphabet when you think of it like that? 26 letters, A through Z.
Yeah, 26 letters, A through Z. Is that it? Well, there are also numbers. Zero through nine, and then the... Zero through nine, so the numerals are there. And then the various symbols, of course, like the period. Right, and then the punctuation marks, and then all kinds of crazy other punctuation marks that aren't used so much, like a long dash or a shorter dash, where if you get into doing a lot of writing, you end up knowing the difference. So yeah, we have question marks and exclamation points and periods, at least in English, and this is what we're gonna focus on, right? We think of each of them as one letter, or one tiny symbol. But is that how everyone here thinks about and processes language? I should say up front, I'm obviously not a linguist, so we're talking about things at a very high level here. So, does anybody here think about things in the form of letters? Like when I'm speaking, are you processing P-R-O-C-E-S-S, okay, that's "process"? I'd say no, right? Hopefully nobody's doing that. If you are, that's pretty impressive actually. So we don't think that way, because we think in abstractions, right? So what are the abstractions on top of the letters that we think about in English? Words, yeah, so we group letters together to make words. So we have a higher abstraction than just the raw alphabet. So we have words, but how do we know if a word is valid or makes sense? I was gonna make up some word, but that'd be really weird. Blah, blah, blah, blah, blah, right? You may be able to make characters from the sounds that I just made, but is that a word? Somebody, yeah? No, no, sorry, behind you. Yeah, like a dictionary, right? So the words are defined in a dictionary. This is where the metaphor breaks down a little bit, because English is a constantly evolving, changing language.
So there'll be words that I say and you understand that are not exactly in a dictionary. Maybe Urban Dictionary is the new, better way to think about that, but I probably don't wanna go there. So we have the words, right, at a high level. But words are just words, right? Are they all identical, all the same? No, right, we've split them up and categorized them. So what are some of the categories of words? Somebody from the middle row, yeah. Maybe not one as well. Yeah, so different types of words, right? So where they are and how they're used, the parts of speech: nouns, verbs, adverbs, articles, all that kind of stuff. And then, well, maybe some of us do when we're writing, but we don't just write or speak words, words, words, words, words, right? We group those together into a higher level, being sentences, yeah, exactly. And then sentences into paragraphs, and paragraphs into everything else, yeah, that's pretty good. Yeah, it could be anything: paragraphs into essays, letters, I don't know, what other types of writing would there be? Abstract. What? Abstract. Abstract, oh yeah, like an abstract of something. Like a paper, yeah. Anything else? I don't know, a poem maybe? And maybe that actually doesn't fit, since it doesn't always have sentences or paragraphs, but anyways, the point being you can think of it as this series of abstractions, right? So at a very high level you just have some piece of writing that's made of paragraphs, and paragraphs are made of sentences, and sentences are made of different parts of speech that follow certain rules, and those parts of speech are composed of words, which are composed of letters. Makes sense? This is not a super shocking surprise to anyone, right? I mean, even if it's been a long time since you've studied raw English grammar, you've at least internalized this stuff, right?
Good, okay, so we actually use very similar concepts when we talk about how to analyze and define the syntax of a language and how to think about and understand a programming language. So we also have an alphabet in a programming language. So what would be the symbols, or the alphabet, of a random programming language? What was that? The reserved words. No, close. Parentheses, yeah, exactly. Parentheses, characters, like individual ASCII characters, so actually kind of similar to English, right? So we have the lowercase letters, uppercase letters, zero through nine, most of the punctuation. You can see we have more characters in use here than maybe in English, right? I very rarely use a curly brace when I'm emailing, or especially when I'm talking to somebody, I'm not sure how you would even do that. But in programming, that's actually a very handy and nice construct. But it's just a byte, right? It's just an ASCII character. And the angle brackets are a lot more important in programming languages. And really at this level, the important thing is that right now these things have no meaning. They're just one-character symbols, right? So a less-than symbol in a C program, for instance, maybe means the less-than operator. But in HTML, it's actually the character that starts an HTML tag. So you have the same character, the same letters and the same symbols of the alphabet, but in different languages they actually mean different things. Questions so far? Okay, so yeah, just like in English, we're gonna create abstractions from this low-level alphabet. And the abstraction we're gonna use here is analogous to words. In this case, we're gonna talk about tokens. And so that's when we get into, sorry, here, yeah. So that's when we get into the reserved words: a reserved word would be a token, or something like double equals, right?
So that's composed of two equal sign characters. But conceptually, when you look at it and when the compiler looks at it, you know that in C that's actually the equality operator and not two equal signs put together. And in C, a single equal sign by itself is assignment, right? So those are actually two completely separate ideas, equality and assignment. And yet there's a one-character difference between them, which can lead to problems if you're not careful. Some other things, like less than or equal, right? That's also two characters, one token. So "while", this is what we talked about, oh. It moved to one side of the deck. So "while" in a lot of languages is a reserved keyword, right? So you can't name a variable "while"; "while" is thought of as a token, the token while, and "if" as well. So what we're gonna learn about is how we precisely define these patterns. How do we specify a pattern that says this is the exact token that I'm talking about? For some of these things, it's pretty easy. So for double equals, what would the pattern be? Like the equality operator? If a specific character is followed by the exact same one, in this case equals, then it's taken as one token. It performs this operation rather than simply creating equals, equals. Yeah, so very good. A little bit in depth, I like it. So yeah, two characters. You have the bytes, the ASCII characters for two equal signs next to each other, and that means the token, the equality operator, or whatever the language wants to call it. At this stage, we don't even care what it actually does later on, because right now we're just worried about taking those two characters, which don't mean anything, and turning them into the token equality operator, which does mean something. So for these examples, right, it's pretty clear. But what about something like an identifier? So an identifier would be a variable name or a function name, right?
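To make the fixed-pattern tokens concrete, here is a minimal sketch in Python of turning raw characters into tokens. The token names (EQUALITY, ASSIGN, and so on), the tables, and the tokenize function are all made up for illustration; they are not from the lecture or from any real compiler. The sketch assumes the usual trick of trying the two-character patterns before the one-character ones, so that two equal signs come out as one equality token rather than two assignments. Words that are not reserved are just labeled WORD here, because a precise pattern for identifiers has not been given at this point.

```python
# Hypothetical token tables, for illustration only.
TWO_CHAR = {"==": "EQUALITY", "<=": "LESS_EQUAL"}
ONE_CHAR = {"=": "ASSIGN", "<": "LESS"}
KEYWORDS = {"while": "WHILE", "if": "IF"}


def tokenize(text):
    """Turn a string of characters into a list of (token_name, characters) pairs."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        two = text[i:i + 2]
        if ch.isspace():
            i += 1                      # skip blanks between tokens
        elif two in TWO_CHAR:
            # Check two-character tokens first, so "==" is one token.
            tokens.append((TWO_CHAR[two], two))
            i += 2
        elif ch in ONE_CHAR:
            tokens.append((ONE_CHAR[ch], ch))
            i += 1
        elif ch.isalpha():
            # Collect a run of letters, then see whether it is reserved.
            j = i
            while j < len(text) and text[j].isalpha():
                j += 1
            word = text[i:j]
            tokens.append((KEYWORDS.get(word, "WORD"), word))
            i = j
        else:
            tokens.append(("CHAR", ch))  # anything this sketch does not know
            i += 1
    return tokens


print(tokenize("while a <= b"))
print(tokenize("a = b == c"))
```

The second call shows the one-character difference mattering: the single equal sign becomes an ASSIGN token and the doubled one becomes a single EQUALITY token.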
Identifiers aren't something we can just specify as, ah, it's exactly these two characters right after each other, right? So we actually have to have some way to define and express exactly what an identifier or a variable name can look like. Okay, so we're gonna formalize this slightly, just because it's gonna be easier to talk about things. So we're gonna actually define what a string is. I don't know if you've ever thought about that before, but the way we're gonna define it is alphabet symbols in a sequence, right? So whatever alphabet we're talking about, which depends on the specific programming language, we're gonna put some number of alphabet symbols together to make a string. So really the important thing here is that a string over an alphabet is finite. Here we're gonna represent the alphabet as the capital sigma character. And so we're saying that a string is a sequence of these characters. And there's a couple of important things here. One you're gonna get very familiar with: epsilon. Epsilon is going to represent the empty string, which means a sequence of characters with length zero. If you try to think about that in a programming language context, it may mess you up, but epsilon is gonna represent this empty string. So what if you concatenate a string S with epsilon? You get S, right, exactly. So a very simple concept, but this is where we start at the very bottom to define precisely what we mean when we talk about strings, and we're gonna build up to how to parse and understand tokens from the raw symbols. So yeah, we have this equality, right? Epsilon either before or after S, concatenated together, and that's what two symbols next to each other means, is the same thing as just S.
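As a small sketch of these definitions in Python: the two-symbol alphabet and the helper name is_string_over are assumptions made up for this example, and Python's built-in str type stands in for a string over the alphabet. The empty Python string plays the role of epsilon.

```python
SIGMA = {"a", "b"}   # an example alphabet (assumed for illustration)
EPSILON = ""         # the empty string: a sequence of length zero


def is_string_over(s, alphabet):
    """True if every symbol in s is drawn from the alphabet."""
    return all(symbol in alphabet for symbol in s)


s = "abba"
print(is_string_over(s, SIGMA))        # every symbol of s is in SIGMA
print(is_string_over("abc", SIGMA))    # "c" is not in SIGMA
print(len(EPSILON))                    # epsilon has length zero
print(EPSILON + s == s == s + EPSILON) # concatenating epsilon changes nothing
```

Note that is_string_over(EPSILON, SIGMA) is also true: with no symbols to check, the empty string counts as a string over any alphabet.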
And so in our examples, we're gonna be talking about patterns, which will be regular expressions, and we're gonna be talking about strings, and seeing whether those regular expressions or patterns match those strings. So to keep this distinct and separate, every time we talk about a string, we're gonna put it in either double quotes or italic and dark blue. Let's see, does this dark blue come out in the back? It's also italic, just in case of color issues. So obviously this may be trivial, but just so that we're clear: for a string in between double quotes, the double quotes are not part of that string, right? Correct, they're just delimiting the beginning and end of that string. Any questions on strings? Okay, so now we're gonna move up from strings and talk about collections of strings. We're gonna talk about a language. So once again, reiterating: sigma's gonna represent the set of all symbols in an alphabet, right? So that's our alphabet. Is it finite or infinite? Finite. Finite, right? It'd be very difficult to write in any language where you can invent new symbols and just pull out new things. So yes, the alphabet needs to be finite. So what we're gonna do is define sigma star as the set of all possible strings that can be constructed from our alphabet. So sigma star is a set, and it has all the strings that can ever be created by combining any of the symbols in our alphabet into a string, with no length restrictions or anything. Does that make sense? So yeah, this is basically like if you put an infinite number of monkeys typing on a keyboard that has all the symbols in sigma. I think at some point they will type all of the strings in sigma star, but that may not be true because of weird infinity things. So think about it, right?
So it's all possible things that could ever be typed or created or whatever from the characters in sigma. Okay, yes? Does that make sigma star finite or functionally infinite? What do you think? I'm thinking functionally infinite, because the strings have no upper bound on their length. So yeah, so what's the difference then? Or, okay, so you're saying finite or infinite, right? So what does everybody else think? Infinite? Why? And then two of these would be the same length as? Right, yeah, so the big thing is the length, right? That was exactly what you brought up in your question. So even though the alphabet that we're drawing from is finite, a limited amount, you can always add one more character to the end of whatever your biggest string was in sigma star, and that string is also in sigma star. Right, so you can keep creating longer and longer strings forever. So here, when we're talking about a language, right, okay, we have this set of all possible things that could ever be written given these characters or symbols. So we're gonna define a language L as a set of strings over, yeah, okay, I like this definition actually better. So we're gonna define a language as some subset of this huge, infinite, any-possible-string set. Which makes sense: there's an infinite number of ways you could type every possible character in the English alphabet, but not all of those strings that you're gonna generate are gonna be English strings. Right, so the set that is the English language is actually a smaller subset of that crazy huge infinite set. Sorry, yeah, it's coming up, so good, good. Good question. Actually, I don't know if it's coming up, but the answer is yes: epsilon, the empty string, is a part of sigma star. All right, we'll see. Okay, so we already talked about this. Is sigma infinite? The alphabet? No, exactly. Is sigma star infinite? Yes.
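One way to get a feel for sigma star is to list its strings shortest first. The sketch below does that in Python for an assumed two-symbol alphabet; the function name sigma_star and the small example language are made up for illustration. The generator never finishes on its own, which is the point: for any string it produces, a longer one is still to come, so the code only ever looks at a prefix of the listing.

```python
from itertools import count, islice, product


def sigma_star(alphabet):
    """Yield every string over the alphabet, shortest first. Never terminates."""
    symbols = sorted(alphabet)
    for length in count(0):                      # lengths 0, 1, 2, ...
        for combo in product(symbols, repeat=length):
            yield "".join(combo)


SIGMA = {"a", "b"}                               # assumed example alphabet

# Only a finite prefix can ever be printed; epsilon ("") comes first.
first_seven = list(islice(sigma_star(SIGMA), 7))
print(first_seven)

# A language is some subset of sigma star. This hypothetical one has two strings.
L = {"ab", "ba"}
print(L <= set(first_seven))                     # both strings show up in the listing
```

The first seven strings are the empty string, then "a" and "b", then the four strings of length two.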
Yes, sigma star is infinite by definition. Is L infinite? Yes. No, yes it is. No, that's a subset, a little subset has to be finite. All right, wait, you wanna say something, let's go here first. You take any subset, the subset must be finite, you can't. What does somebody else think? Unless it's uncountably infinite, sigma star, but I don't think it is. Let's go in the back on the stairs there. There are a lot of seats, man, if you wanna sit. L is maybe infinite depending on the language. L is maybe infinite depending on the language. What do you think? Let's have a poll. Well, wait, let's get some more input. Yeah. Actually, an example is the reals: they have infinite subsets, like the rationals, but also non-infinite subsets, like the number three, so there you go. Right, so yeah, there's a case where you have an infinite set that's a subset of another infinite set, right? So take the set of all numbers greater than two. There's an infinite number of numbers there. But the set of all numbers greater than three is a subset of that, a smaller one because it doesn't include three, and it's still infinite. So, is it yes? Is it no? We don't really have a whole lot of information about L. I would say not enough to determine yay or nay. Yeah, that's a good way to look at it, right? You look at what's here, and there are no constraints. All we're saying is L is a subset of sigma star. So we could have a language that had five strings in it, right? And that could be a language, and that language is finite. Or we can have a language like, I mean, English, right? You can keep making sentences longer. I wish I had this example. There's an example where you make an arbitrarily long sentence using the word buffalo. Has anyone seen this? Okay, next class.
Yeah, you can keep adding buffalo, and each buffalo means something, like the place Buffalo or the animal buffalo. Buffalo is also plural. So it can keep going forever. So even in a language with one word, buffalo, you can make longer and longer strings without end, so that language would be infinite. Yeah, which is kind of nice. It means that you can write a lot of programs and you'll never run out, yeah. Why would that make the language infinite? Because that's just an infinite sentence. That doesn't mean that the language is infinite. So the language contains all of its strings, right? L is the set of all strings in the language. So the way we're thinking about L, as we define it, it's just a subset of sigma star. But the meaning behind it is that it's a language that we care about. So if you think of the English language, right, L has, let's say, every string that we would consider an English sentence. Let's keep it at the sentence level. But there is no bound on the length of an English sentence. You can continually add clauses at the end and make it longer and longer. So every one of those longer and longer sentences is in that set L, which means that it's infinitely big. Any other questions? Does that make sense, kind of? It's a little weird. I mean, it's not critical, but it's kind of cool to think about. So Whitespace, from the last class, is that, like, finite or infinite? There is, like, white space, and there could be, like, some symbols. Is Whitespace, so, okay, there is another thing I just thought of. There are the physical language constraints. There may actually be a limit on the size of a C program, like a C file, that you could possibly write, just because of the compiler constraints. But theoretically, there's nothing.
So if you look at one C expression, right, like one plus one, you can always add a plus one to the end there and keep on making a bigger and bigger and bigger string forever. So the Whitespace example, I believe, is the same, because you could just add computations to the end and the program can grow without bound. There's no limit on how large that program could be. And it's still a Whitespace program, so it's in the Whitespace language. All right, so that brings us to regular expressions. Before we dive in here, I want to poll the room to see who has a good amount of experience using regular expressions in programming. Is it from class or from outside-of-class stuff? Outside of class. Outside of class, cool, awesome. Okay, so what we're gonna learn here is regular expressions from a more fundamental approach. And they're gonna be less powerful than the regular expressions that maybe you're used to using in JavaScript, Ruby, Python, whatever language. So try to empty your mind of your prior knowledge, but keep it close by so you can refer to it. If you look at these and assume that they work exactly the way the regular expressions you're used to work, you're gonna have problems. That make sense? But the concepts map very closely. Okay, so we're actually gonna use regular expressions to define the tokens of our language. And we actually run into, well, I'll get to that in the next slide. So we really like regular expressions in a lot of computer science. They're pretty compact. We're gonna use a regular expression to define a language, as we'll see, so we can define an infinite set of strings with a finite representation. They're fairly expressive, so we can actually express some pretty complicated, interesting things here. They're precise, and this is very important for this class.
So when you give me a regular expression, I know exactly what strings it matches, what it doesn't match, and what language it describes, and vice versa. So there's no ambiguity there. They're also widely used; you saw all your classmates who use regular expressions. They come up a lot in programming, so it's a good tool to have in your toolkit. And the other thing, and we're not gonna get into the specifics, is that it's very easy to generate an efficient program that matches a given regular expression. So you give me a regular expression, and I, or obviously the computer, can create a program that can efficiently match any text input and look for that regular expression. I wasn't gonna go into the details, but how familiar is everybody with NFAs and DFAs? Some. What class is that from? It's only 255, 255, and that's not required for this class. No, because it has a higher number, okay. That's one good reason. Okay, yeah, so you'll see it when you get to that class. For people that do have the background, I'll just briefly mention that you can easily and precisely transform a regular expression into an NFA that you can use to match text. So that's why it's very efficient and pretty awesome. Questions? Okay, here's where we run into, I don't know, maybe it's because I'm a big nerd, but I always really like it when I find a recursion loop in real life. So here we actually have to define the syntax of regular expressions while we're trying to learn how to define and analyze syntax. Right, so that's kind of awesome. So we're gonna do it, not informally, we're gonna be very precise about what we mean, but just so we're clear about what we're defining: right now we're gonna define the syntax, what a regular expression looks like. And hopefully this helps later when we talk about syntax versus semantics, and you can think, oh yeah, regular expressions.
So the syntax is just what it looks like, and the semantics is the meaning behind those operators. Okay, so a regular expression is either the empty set, or epsilon, and one thing to notice, right, these are not within double quotes, they're not italic and dark blue, so these are regular expressions we're talking about here. Or a, where a is an element of the alphabet sigma. Or R1 bar R2, where R1 and R2 are regular expressions. Or R1 dot R2, where R1 and R2 are regular expressions. Or parenthesis R, where R is a regular expression. Or R star, where R is a regular expression. Yeah. I think we're gonna get to it in just a second. So the empty set is a set that, so yeah, we'll get there in a second. So hold that thought, yeah. Can we also use the dagger if we want it? Nope. So actually, I believe if you look at the book, I've changed the notation from what they use in the book. They actually use the dagger, the plus sign, for the bar in this example, but the bar is how most regular expression implementations that you use in a programming language operate. So that's why I've decided to shift it over here. It does create slight problems when we talk about parsing, but we'll cross that bridge when we get to it, and I don't think it'll be a huge issue. Other questions? So just to be clear, at this stage we're just defining what a regular expression looks like. So if I give you any random string and ask, is this a regular expression, you'll know: it'll either follow these patterns or it won't. Okay, but the real question is what does this mean, right? Because right now it's just a bunch of notation. Okay, so a regular expression defines a language. And a language is a set of strings that's a subset of all possible strings over the alphabet.
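The syntax rules just listed can be written down as a small data type, one case per rule. This is only a sketch in Python: the class names (EmptySet, Epsilon, Sym, Bar, Dot, Star) and the show function are invented for this illustration and do not come from the lecture or the book. Parentheses are not given their own class here, on the assumption that they only group; the nesting of the objects already records the grouping. Nothing in this block says what an expression means; it only captures what one looks like.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmptySet:
    """The regular expression written as the empty set."""


@dataclass(frozen=True)
class Epsilon:
    """The regular expression written as epsilon."""


@dataclass(frozen=True)
class Sym:
    """A single symbol a, where a is an element of the alphabet."""
    symbol: str


@dataclass(frozen=True)
class Bar:
    """R1 bar R2."""
    left: object
    right: object


@dataclass(frozen=True)
class Dot:
    """R1 dot R2."""
    left: object
    right: object


@dataclass(frozen=True)
class Star:
    """R star."""
    inner: object


def show(r):
    """Write an expression back out in the lecture's notation, fully parenthesized."""
    if isinstance(r, EmptySet):
        return "∅"
    if isinstance(r, Epsilon):
        return "ε"
    if isinstance(r, Sym):
        return r.symbol
    if isinstance(r, Bar):
        return "(" + show(r.left) + "|" + show(r.right) + ")"
    if isinstance(r, Dot):
        return "(" + show(r.left) + "." + show(r.right) + ")"
    if isinstance(r, Star):
        return "(" + show(r.inner) + ")*"
    raise TypeError("not a regular expression: " + repr(r))


example = Bar(Sym("a"), Dot(Sym("b"), Star(Sym("c"))))
print(show(example))
```

Anything that cannot be built from these six cases is not a regular expression in this notation, which is the "it'll either follow these patterns or it won't" check.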
So the regular expression is gonna define a language, and this is how we figure out what the language is based on the regular expression. So this is gonna give semantic meaning: when we see that syntax, what does it actually mean? So the language described by the regular expression empty set is the empty set. And the language, so I did do it here, okay, the language defined by the regular expression epsilon is the set containing epsilon. So here's where we get into the difference. Where is it, up there? Who had that question? Nobody remembers asking that like five seconds ago? Whatever, doesn't matter. Okay, so the point is the difference here, right: the language described by the regular expression epsilon matches the empty string, epsilon. So in this example, the curly braces denote a set, and the language described by the regular expression epsilon is the set containing the empty string. So that will match epsilon, the empty string, whereas the empty set regular expression matches nothing, so that set has nothing in it, yeah. The population of the language represented by the null set, is that one or is that zero? Is it the one? The population? The cardinality of the set? How many things are inside the language represented by the null set? Ah, zero. I actually have a better example later, but yeah, it's basically two braces, left brace, right brace, and nothing in the set, so the cardinality of that set is zero, whereas the cardinality of the set containing the empty string is one. Make sense? Yeah. Okay, so we build up a little bit more. The language described by a regular expression that is one character, or one symbol from our alphabet, is the set containing the string of that symbol, okay? Does this make sense so far? It almost seems too easy, okay?
And so, once again, I'm gonna make sure we're very clear: on the left is the regular expression a, and on the right is the set containing the string "a", which are two separate things. Okay, so now we're gonna get into the actual operators. The language described by regular expression one, bar, regular expression two, is the language described by regular expression one, union the language described by regular expression two. So, can somebody briefly summarize set union? Not formally, but this is kind of important. Let's go over here, this side. Yeah, anything that's in either one, you put the two of them together, and with sets it doesn't matter if an element shows up more than once, right? So you just put the two sets together. So yeah, union, very simple. So that's what we're doing right here. The bar symbol by itself doesn't mean anything. All it means is that when we see the bar, we know exactly how to operate. We can replace R1 bar R2 with the language described by R1, union the language described by R2, and then we follow all the rest of the rules. So then if, let's say, R1 was a, we would replace L of a with the set containing "a", and if R2 was b, the regular expression b, we'd replace L of b with the set containing "b", take the union, and the result would be the set containing "a" and "b". So that's the very long way of saying it; in a second it should be very clear what it means semantically and why we've chosen the bar symbol. But for our purposes, it could be any character, right? So that's kind of cool, that's what we're learning about in this class: how to actually define our symbols and give them meaning, essentially. Question? Okay, so then the dot operator, and this is where, slightly, okay, so you can see that this is not a recursive definition, but very close.
So the language described by the regular expression R1 dot R2 is equal to the language described by R1, dot, the language described by R2. Right now, yeah, that's very ambiguous, so we're gonna define that formally later, but first we're gonna go through examples of the bar operation. So at the top is the definition, right? The language described by R1 bar R2 is the language described by R1, union the language described by R2. And one thing you may wanna do in your mind, that maybe you haven't thought about yet, is type check this equation. If you think of L as a function, it takes in a regular expression and it returns a set of strings. So here L of R1 is gonna return a set of strings, L of R2 is gonna return a set of strings, and the union of those is a set of strings, so everything type checks. Okay, so let's look through some examples. We're gonna break this down. So we have the language of a bar b. So a bar b is what? A regular expression? A bar b is a regular expression, right? I just wanna get that there. So what's gonna be on the right side of this equation? The first step. Almost; the very first step. Applying the rule at the top, right? So we're just gonna substitute. Exactly, we're gonna substitute a for R1 and b for R2, and we're gonna get L of a union L of b, and then we're gonna apply the other rule, right? So what's the language described by a? The set a. And the language described by b? The set b. Oh yeah, there we go. So now you can see here, it's the set containing the string "a" and the set containing the string "b", and we'll union those together. The set containing "a" and "b". Exactly, cool. Okay, this gets slightly more complicated. So the language described by the regular expression a bar b bar c.
So applying the first rule at top once, right? We're gonna get the language described by A or B. Union the language described by C. So does everybody see that A bar B is a regular expression? In this case, it's R1 in the example on top. And so we know from above that the language described by the regular expression A bar B is the set containing the strings A and B, right? And so we will replace that there with the language described by the regular expression C, set containing C. So yeah, then we union those together, so we add C to the set, right? And now we've got a nice set there. So let's go back to the original question. I think, oh, we'll continue for a little bit. Okay, so now let's do this. So the language described by A or Epsilon. So we're gonna first apply the rule on top, right? Just separate it and say it's the language described by the regular expression A, union with the language described by the regular expression Epsilon. So what is that? Right, the set containing the string A, union the set containing the string Epsilon, and then when we union those together, what is that set? Set A. So we're gonna raise our hand for this one. Let's go here. Set A. Set A. Is that right? Don't say no. That's not right. So it's Epsilon comma A. It's Epsilon comma A, who said that? Epsilon comma A, right? So yeah, so that's kind of where you were getting to that Epsilon empty set distinction, right? So there was empty set, it was just A. Exactly, if it was empty set, it would be just A. So yeah, that's important to remember because here we're saying the language of A bar Epsilon is the set matches the set containing A or the empty string. So it'll match either empty string or A. Okay, what about this one? You wanna go right to the end, tell me what that set is? Set containing Epsilon. So the set containing Epsilon, perfect. Okay, and we already talked about this, but this kind of gets into it. So the set containing Epsilon is not equal to an empty set. 
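The bar examples worked through here can be sketched in a few lines of Python, treating a language as a plain set of strings. This is only an illustration: the helper names lang_char and lang_union are made up for this sketch, and the empty Python string stands in for epsilon.

```python
# A language is modeled as a Python set of strings.
# EPSILON, lang_char and lang_union are hypothetical names for this sketch.
EPSILON = ""  # the empty string, not the empty set


def lang_char(c):
    """L(c) for a single character c: the set containing just that string."""
    return {c}


def lang_union(l1, l2):
    """L(R1 | R2) = L(R1) union L(R2)."""
    return l1 | l2


a_or_b = lang_union(lang_char("a"), lang_char("b"))   # {"a", "b"}
a_or_b_or_c = lang_union(a_or_b, lang_char("c"))      # {"a", "b", "c"}
a_or_eps = lang_union(lang_char("a"), {EPSILON})      # {"", "a"}
a_or_empty = lang_union(lang_char("a"), set())        # {"a"}

print(sorted(a_or_b_or_c))
print(sorted(a_or_eps))
print({EPSILON} == set())  # the set containing epsilon is not the empty set
```

The last three results line up with the distinction made in the discussion: unioning with the set containing epsilon adds a string, unioning with the empty set adds nothing.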
Okay, so now that we've gone through these examples, maybe in excruciating detail, how would you describe the semantics of this bar operator? Yeah, go ahead. It's essentially like a Venn diagram: you have a circle for A, you have a circle for B, and then you just have everything colored in, in this case both A and B. How would you describe that in one word? And. Or. Yeah, okay. Right, so yeah, it's and, I mean, you're thinking about sets, right? You're putting two sets together and you want both of them, so you're thinking union is kind of and. But really, when you think about it semantically, when I say A bar B, what I really mean is I want to match either A or B, right? So the string A or the string B. Does that make sense? So I'm pretty sure that's why actual implementations of regular expressions use the bar, because when I see it, I read it as OR to myself. Questions on this operator? All right, now we have to dig into this dot operator. So this is definitely a bit of a cop-out, right? Here we're saying, okay, this is the definition, and then we're introducing a new dot operator between two sets. So now we actually have to define that. What does it mean to take one set dot another set? And this is maybe another time where that programming language type mindset can come in. Within the L on the left is R1 dot R2. That whole thing is a regular expression, so the dot in there is a regular expression operator. On the right, this dot operator is operating on what? Strings? Sets of strings, right, exactly. So this dot operator we're gonna define on sets of strings.
So for two sets of strings, A and B, we're gonna define A dot B as the set containing xy, where x and y written next to each other means string concatenation: x concatenated with y, for all x in the set A and for all y in the set B. Does that make sense? Let's go through some examples, because this is mathematical notation and it takes a while to get used to it. Okay, so an example. We have a set A containing the strings aa and b, and a set B containing the strings a and b. So what is A dot B? First thing to think about: how many elements are in the set A dot B? Four. Okay, so what are those four strings? Anybody wanna take a stab? Somebody who hasn't spoken yet? Yeah, I think it's aaa, aab, ba, bb. Is he right? Yeah. Exactly, so there's four, right? You take every element in A and append every element in B to it, and those are the elements of the set A dot B. So you take aa, the first element of A, and concatenate the string a from the set B onto it, and that's the first element, aaa. The second element is aa concatenated with the second element of B, which is aab. Then you take the second element of A concatenated with the first element of B, which is ba, and then the second element of A concatenated with the second element of B to get bb. Does everybody see how this follows exactly from that definition? Okay, so is ab an element? Wow, I'm not sure why I put this here, because that's very confusing with the example there. So ab should hopefully clearly not be in A dot B, right? For one, we've defined what A dot B is and we've written out the entire set, so that should be very clear. The other reason is we're not concatenating anything within B with itself. We don't concatenate the a and the b within B, which is the one way you might get ab. Okay, so what about this?
The strings aa, b, and epsilon in A, and the strings a and b in B. So how many elements is this gonna have? Six? Let's take a vote. That's wisdom of the crowds. I should record this as data for a paper. Four, who thinks four? Who thinks five? Six? Seven? Eight? No, maybe one person. Okay, great. So does somebody wanna go with whatever they think it is? Let's go back there. Oh, I was pointing to him. Sorry. I had to look at the answer so I could check. Next one, you gotta be ready. All right. Okay, yeah, so this is aaa, right? The aa in A concatenated with the first string in B, a. Then aab, ba, bb, and a and b. So why the last a and b? Epsilon. Epsilon, right, exactly. We're taking the epsilon from A and concatenating it with the string a from B, and that is just a, right? We talked about that not a half hour ago. So that's six elements. Does that make sense? And does everybody see why there's no epsilon in A dot B? It's the empty string, yes. So there's an epsilon in A, but there's no epsilon in A dot B. Why? Because there's no epsilon in B. Right. The way I like to think about this is as string concatenation of those two sets: every string in A dot B is something from A followed by something from B, so the prefix of the string has to match something in A and the rest has to match something in B. Only if epsilon is in both A and B can you get epsilon in A dot B, because if you concatenate two epsilons together, you get epsilon. But because there's no epsilon in B, you're always gonna have something at the end of every string in A dot B. Make sense? Questions?
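The dot operator on sets of strings can be written directly as a set comprehension. This is a minimal sketch: the function name concat is an assumption for illustration, and the empty Python string plays the role of epsilon. It reproduces the two examples from the discussion.

```python
def concat(A, B):
    """A . B = { xy : x in A, y in B }, where xy is string concatenation."""
    return {x + y for x in A for y in B}


# First example: A = {aa, b}, B = {a, b}
first = concat({"aa", "b"}, {"a", "b"})
print(sorted(first))   # ['aaa', 'aab', 'ba', 'bb']

# Second example: epsilon (the empty string) is added to A
second = concat({"aa", "b", ""}, {"a", "b"})
print(sorted(second))  # ['a', 'aaa', 'aab', 'b', 'ba', 'bb']

# Epsilon is in A . B only when it is in both A and B.
print("" in second)                        # False
print("" in concat({"aa", ""}, {"a", ""}))  # True
```

Because the result is a set, duplicates collapse automatically, so the size of A dot B can be smaller than the size of A times the size of B in other examples.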
Okay, so now we gotta talk about operator precedence, which is again one of these things that we're gonna talk about in great depth later, but it comes up here when we talk about regular expressions. So the question is, what does this regular expression mean? A bar B dot C. What are the options? Yeah, go ahead. Going from left to right, it could be A bar B grouped together, then dot C, or it could be A bar, then B dot C grouped together. Yeah, exactly what I have here, right? If you think about math operators and using parentheses, we could mean parentheses A bar B, dot C, or A bar, parentheses B dot C. So just like in math or a programming language, it's up to us, right now, since we're the ones defining regular expressions, to define what it means when we see A bar B dot C. Nothing in our syntax prevents this, so you, me, anyone can write this regular expression, and I need to be able to understand exactly what language you're talking about. Okay, so just like in math: how do you read A plus B times C? B times C, plus A. Why? Operator precedence. Operator precedence, and it's on the board, okay. So that's the order of operations defined in math, and we've all been using that for a very long time. In math, multiplication has the higher precedence. That means when you see A plus B times C, you know that the B and C happen together, and then whatever the result of that is gets added to A. Here we're gonna do exactly the same thing, and this is one of those things that's all based on mutual understanding and definition. So here, we're gonna define that the dot operator has a higher precedence than plus. That way, when we look at this A bar B dot C, which one of those two options are we talking about? The second one, right, so A bar, parentheses B dot C. When you say plus, do you mean bar? Ah, that's a good point.
Yes, bar. See, this is one of those things where you say, I put in these errors intentionally to make sure you're awake and paying attention, and then I make a note to change it later. Okay, so now that we've established that the dot operator has higher precedence than the bar operator, how do we interpret the language described by the regular expression A bar B dot C? Does anyone wanna go all the way? How many strings are in this set? Two. What are the strings? All right, yeah. Because we know that the dot has higher precedence, B dot C groups together, so the bar is the outermost operator and that's the rule we apply first. We say it's the language described by A, union the language described by B dot C, and then we work out both of those: the set containing A, union the set containing the string BC, and the union is the set containing A and BC. Over here, nobody scratching their head? Awesome. Questions on operator precedence as we're talking about it here in regular expressions? Okay, so there's some things we haven't defined yet. We're still talking about regular expression syntax. We've defined the empty set, epsilon, and a single character from the alphabet. We talked about concatenation, the dot, and we talked about the bar. So what do we have left? Star. Star, and there's one more. Parentheses. Parentheses. Yeah, maybe I should swap these around. So parentheses are defined very simply. We're just gonna remove the parentheses, right? This just gives us the formal way to specify them and say what parentheses mean. So that when we say something like the language described by parentheses A bar B, dot C, so the same thing as before but now with the parentheses, we can say formally that, well, parentheses A bar B is a regular expression.
So at a high level, this is a regular expression dot C. So we first apply the dot rule and get the language described by parentheses A bar B, dot the language described by C. Then, applying the parentheses rule at the very top here, we drop the parentheses and get the language described by A bar B, dot the language described by C. So does anyone wanna tell me what this is? Yeah. AC, BC. AC and BC, yeah. The language described by A bar B is the set containing A and B, and then dot the set containing C, so this regular expression describes the language containing the strings AC and BC. Does that make sense? So that's how you wanna try and read these, and if you've never seen them before, it definitely takes some experience. When you look at this regular expression, parentheses A bar B, dot C, you can think, okay, first there's gonna be an A character or a B character, and then that's followed by, so concatenated with, a C. So you can say it's gonna match AC and BC. But on homeworks and exams, it's always best if you follow all these steps. I mean, show your work, right? That way we can give you partial credit if you make a mistake, whereas if you just put the wrong answer, the wrong answer is wrong. Okay. Questions on parentheses? Okay, the star. So this operator's called the Kleene star. Actually I don't know how to pronounce it, I'm gonna say "clean", so it's a "clean" star. We need to define the language described by R star, and this is where the definition is gonna be a little interesting.
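To make the two readings of A bar B dot C concrete, here is a small sketch that computes both languages with sets of strings. The helper names union and concat are assumptions for this sketch, not anything from the slides.

```python
def union(l1, l2):
    """Language of R1 | R2."""
    return l1 | l2


def concat(l1, l2):
    """Language of R1 . R2: every string of l1 followed by every string of l2."""
    return {x + y for x in l1 for y in l2}


A, B, C = {"a"}, {"b"}, {"c"}

# Dot binds tighter than bar, so a | b . c is read as a | (b . c).
dot_first = union(A, concat(B, C))
print(sorted(dot_first))   # ['a', 'bc']

# Explicit parentheses force the other reading: (a | b) . c
bar_first = concat(union(A, B), C)
print(sorted(bar_first))   # ['ac', 'bc']
```

The two results differ, which is why the precedence has to be fixed by definition before anyone writes an expression like this.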
So I'm gonna give an example first, and then we're gonna define it formally. The language described by R star is the set containing the empty string, epsilon, union with the language described by R, union with the language described by R concatenated with the language described by R, union with that thing three times, union with that thing four times, and so on to infinity. Does that kind of make sense? Let's define it more formally, because this may help. We can think of it as a recursive definition. The base case is L with the superscript 0 of R, which we define as the set containing epsilon. And then, like we do when we define things recursively, we've defined our base case and we ask what we do for any arbitrary i. L with the superscript i of R is equal to L with the superscript i minus 1 of R, so the previous one, dot the language described by R. So does everybody see that the L with the superscript gets expanded out all the way down to 0? For any i you give me, I can expand it out that many times. L0 is the set containing the empty string, and all the rest are just copies of L of R. Everybody see that? So if you give me a specific i, that will be L0 of R, which is the set containing the empty string, dot L of R, dot L of R, dot L of R, i times. And so here we can formally define the language described by R star using this superscript notation: it's the union of L superscript i of R over all the i's greater than or equal to 0, from 0 to infinity. The i greater than or equal to 0 should be written under the union symbol, but PowerPoint doesn't really do that very nicely.
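The recursive definition can be sketched in Python. Since the full union runs to infinity, this sketch only unions L superscript 0 through L superscript n for a chosen n, so it is a finite approximation of the star, not the star itself. The names concat, power and star_upto are assumptions made for illustration.

```python
def concat(l1, l2):
    """Set concatenation: every string of l1 followed by every string of l2."""
    return {x + y for x in l1 for y in l2}


def power(lang, i):
    """L^i: L^0 is the set containing the empty string, L^i = L^(i-1) . L."""
    result = {""}            # base case: the set containing epsilon
    for _ in range(i):
        result = concat(result, lang)
    return result


def star_upto(lang, n):
    """Union of L^0, L^1, ..., L^n (a finite slice of the infinite union)."""
    out = set()
    for i in range(n + 1):
        out |= power(lang, i)
    return out


print(sorted(power({"a", "b"}, 2)))   # ['aa', 'ab', 'ba', 'bb']
print(sorted(star_upto({"b"}, 3)))    # ['', 'b', 'bb', 'bbb']
print(len(star_upto({"a", "b"}, 2)))  # 7 strings: 1 + 2 + 4

# Two edge cases where even the full star stays finite:
print(star_upto({""}, 5))    # {''}  (star of epsilon)
print(star_upto(set(), 5))   # {''}  (star of the empty set)
```

Raising n only ever adds longer strings, which is one way to see why the star of a language containing any non-empty string is infinite.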
So, I think it's fairly clear of the LI's union R, right? So, the question you should ask yourself is, does this definition match up with this example up here? Not exactly? So, whose base? What base? Yeah, this is what this means. That's what I'm saying. It's one of those syntax things, right? So, we're saying this is implied. So, basically we're unioning everything starting from I is equal to 0 all the way to infinity. All those unions of L of that I. So, the union of L superscript R versus L superscript 0 of R, union with L superscript 1 of R, union with L superscript 2 of R, and so on, infinity. But the question is, does this match this example here? Why or why not? So, obviously, you know, I'm not going to write out the infinite thing, but the triple, but the ellipses there imply infinity. So, does anybody disagree that it doesn't match? Some people not sure. I'm sure of what the question is. Let's look at some examples I think that'll help. Okay. So, the rule is we're going to union all of these L superscript of I's starting from 0 all the way to infinity. So, maybe it's clear, maybe it's not clear. So, L R star, that language described by R star, is it finite or infinite? Infinite. Is it definitely infinite? No. It depends on the language. No. It depends on the language. I say, so we're talking about a language. We're talking about L L of R star, right? It depends on the regular expression, right? I think there may be only one case that it's finite, maybe more, but so we'll talk about that in a second. But in most cases, it's going to be infinite by an infinite set. So, we're not going to be able to write it out exactly like we were able to write out before. So, the language described by A bar B star, so what is that going to be? So, let's first take the bar, right? So, here we're going to say that the star is going to bind very tightly as far as operator preference, right? So, it's going to apply to the closest regular expression. 
So, it's not going to be parentheses A or B, star. We'd have to write that with parentheses. So the bar operator splits this into the language described by A union with the language described by B star. Okay. So what's the language described by A? The set containing A. Great. So what does the language described by B star start with? B? I think you said empty set. I mean, empty string. Okay, epsilon. Yeah. So we have the string A, which comes from the bar operator, the language described by the regular expression A. And then we have epsilon. And then we have B, BB, BBB, BBBB, and you keep going, right? You can always add one more B to any string. Any string that is all Bs is within this set. So everybody see how this set is definitely infinite? But it's still described by this regular expression. So if I give you any string, I can ask: is this string inside the language described by this regular expression? And that's what we mean by matching. If I give you any string, you'll be able to tell me yes or no, because it's either A, or the empty string, or all Bs, or it's not in the language. Does that make sense? Okay. What about this? Parentheses A or B, star. What's the first thing in this set? Epsilon. Yeah, exactly. Remember that L superscript zero of R is the set containing epsilon. So anytime you see a star, there's always going to be an epsilon there. So what's that going to be unioned with? Say that louder. A and B. Yeah, the set containing A and B. That would be L1, right? What about the next one? It's going to get tricky.
Someone in this row? I can call on people. Yeah. So what we're doing here, right, is we're taking the language described by A or B, dot the language described by A or B. And that is AA, AB, BA, BB. Right? And then what about the next one? Let's go more this way. You're kind of in the middle. You're still not off the hook. So, yeah, then it's the set containing A and B, dot the set containing A and B, dot the set containing A and B, right? So it's all possible combinations. You can also think of it as all strings of length three composed only of A's and B's. And the one after that is going to be all strings of length four that are composed of A's and B's. And the same thing with the next row, right? It will be all strings of length five composed only of A's and B's. And so hopefully you can see that again, even though this set is infinite, there's an infinite number of strings and I can only fit three or four rows here, still, if I gave you a string, you can say whether it's in this set, right? So if I gave you the string ABC, is it in this set? No. No, why? Because C is not in the set. C is not in the set, exactly. But if I gave you the string ABABABA? It doesn't matter what I say, as long as I don't say anything that's not an A or a B, it's in the set. Yeah. So what this language is describing is all possible combinations of A and B along with... An infinite number of times, exactly. Yeah. So it also depends on our alphabet, right? If our alphabet is only A's and B's, this could be the set sigma star. Not that it really matters, but maybe just think about it. This could be every possible string in our language, if A and B are all the symbols in our language. So if we describe binary, that would be the star of binary. Yeah.
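The yes-or-no membership question can be tried out with Python's re module. This is a sketch under one assumption: that Python's regular expression syntax agrees with the lecture's operators for these two patterns (bar for union, juxtaposition for concatenation, star binding tightest), which it does for the simple cases here. The function name in_language is made up for illustration.

```python
import re


def in_language(pattern, s):
    """True when the whole string s is in the language described by pattern."""
    return re.fullmatch(pattern, s) is not None


# (a|b)*: any string made only of a's and b's, including the empty string
print(in_language(r"(a|b)*", ""))         # True
print(in_language(r"(a|b)*", "abababa"))  # True
print(in_language(r"(a|b)*", "abc"))      # False, c is not allowed

# a|b*: the star applies only to b, so this is "a", or zero or more b's
print(in_language(r"a|b*", "a"))    # True
print(in_language(r"a|b*", "bbb"))  # True
print(in_language(r"a|b*", "ab"))   # False
```

re.fullmatch is used instead of re.search or re.match because membership in the language means the entire string has to be described by the expression, not just a piece of it.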
But if our symbols are, let's say, the lowercase alphabet, then this just matches any string that only contains A's and B's, in no particular order. So we don't care about the order or anything, as long as every character is an A or a B, or the string is empty, that's the other important thing. Questions? Yeah. So that means R star is always an infinite set, right? I would say not every time. If you have the language described by epsilon star, it's just epsilon, right? Because no matter how many times you do that, you're only ever adding epsilon. So it's the set containing epsilon. So I think in that one case it's finite, but I think for every other case it's infinite. But don't ask me to prove it right now. More questions? Is L of parentheses 1 or 0, star, described by the same set? Yeah. So that kind of goes to this point: it all depends on what you choose as symbols, right? Here if we just substitute in 1 for A and 0 for B, then this describes every possible binary number. What happens when you combine the empty set with all these operations? The short answer is it'll probably never come up. I think for these operators it mostly just doesn't do anything. So if it's A or the empty set, that's just A. Except maybe for star. I think the empty set star would be the set containing epsilon, by definition. Cool. All right. Let's talk quickly about tokens. We've got like five more minutes. Oh, whoa, whoa. We got time. All right. So now we've covered regular expressions, and what we really want to do is define tokens using these regular expressions. So we want to define a letter generally, right?
So, in English, how would we define a letter? Regular expression. Does anybody want to do this super verbose? But just a letter. Just one, like one, let's say that alphabet is like all printable ASCII characters. You want a letter? So, now I guess. Or take a shot? Yeah. It's one of 52 specific characters, upper and lower case. With what in between them? That. With what in between them in the regular expression? With the syntax that we talked about. What was it? Bars. Yeah, bars. So, yeah, so we just, we have all the characters we care about. Each of a single character is a regular expression. We're going to put bars between them all. So, now we say a letter is a regular expression defined by A or B or C, lower case A, or lower case B, or lower case C, all the way to uppercase A, uppercase B, and so on. A digit? A digit? Same thing with zero through nine. Exactly. So, zero, one, two, three, four, five, six, seven, eight, nine. So, some people who maybe are already familiar with regular expressions where they have the bracket syntax that you can specify a character class. This is, you don't need that, right? You can build that with this. Okay. So, what, somebody want to, what is the rule for an identifier? I'm going to say it's specifically a variable, but we'll call it identifier in C. So, what is like the regular, how would you write a token from what you know about C, about how that works? So, without white space, we're not going to consider white space right now. We'll go somewhere else. Perfect. Yeah. So, that actually messes up my example, because I forgot about that. But, yeah. So, so I just have a letter as first, but yeah, I forgot you could start with definitely an underscore. You can't start with a number though. It's actually, if you've programmed a language that allows that, it's actually can become, can be nice sometimes. So, this is kind of the syntax for a regular expression that defines an identifier in C. 
But, you'll notice that I did some cheating here. So, what am I forgetting here? So, is this a regular expression based on what we've seen so far? No. No. No, why? Yeah, no concatenation. Yeah, no concatenation, right? So, this is where we say that since we're all very, we've got all this thing down, we're going to cheat like a math and take a shortcut. And just like you don't always write X star Y, you mean X multiplied by Y. We're going to say X Y, and you know that that means multiplication, right? So, the same thing here in regular expressions, which makes sense if you think about the definition, right? The definition is concatenation. So, if I put two characters together, then that means, so, you think about the regular expression foo, matches the string foo because it's foo f dot o dot o. Yeah. So, do you have to exclude taking a Y over hip in that regular expression as well? It depends. It's the long answer. Yes, it depends on the language and how you do it. You can ask me a question on Wednesday, or Monday, on Monday, and we'll probably get there. Okay, so, let's... Okay, so question, does this match that regular expression? Yes. Yes? Yes? Why? Starts with a letter and it follows by zero or more letters, digits, or other stuff. What about this one? No. No, very good. Okay, see you guys on Monday.
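The letter, digit, and identifier definitions can be sketched by literally joining single characters with bars, the way it was described, and then checking strings with Python's re module. Several things here are assumptions for illustration: the names LETTER, DIGIT, IDENT and is_identifier, the use of Python's regex syntax as a stand-in for the lecture's notation, and the exact shape of the identifier rule (a letter followed by zero or more letters, digits, or underscores). As noted in the discussion, real C also allows a leading underscore, which this version leaves out.

```python
import re
import string

# letter = a | b | ... | z | A | B | ... | Z, built with bars only
LETTER = "(" + "|".join(string.ascii_letters) + ")"

# digit = 0 | 1 | ... | 9
DIGIT = "(" + "|".join(string.digits) + ")"

# identifier = letter (letter | digit | _)*
# Writing the pieces next to each other means concatenation.
IDENT = re.compile(LETTER + "(" + LETTER + "|" + DIGIT + "|_)*")


def is_identifier(s):
    """True when the whole string matches the identifier expression."""
    return IDENT.fullmatch(s) is not None


print(is_identifier("foo"))      # True
print(is_identifier("x1"))       # True
print(is_identifier("my_var2"))  # True
print(is_identifier("1x"))       # False, cannot start with a digit
print(is_identifier(""))         # False, needs at least one letter
```

No bracket character classes are used, which mirrors the point that a character class is only shorthand for a long chain of bars.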