Being here, I notice we have a little bit less attendance on Fridays for some reason. I don't know why that is. You get a three-day weekend anyway, I guess; y'all are trying to do a four-day. Any questions before we get started? We're going to get started on syntax analysis. Let's do it.
Okay, syntax analysis. So what did we talk about last week? Regular expressions. Yes, regular expressions for what purpose? Lexical analysis. Yes, and what does lexical analysis do? It makes tokens out of bytes, perfect. Input: bytes. Output: tokens. And how does that work if you zoom into the box and rip it apart? How does it actually work on the inside? Regular expressions. And when you have a bunch of regular expressions that you want to use to express tokens, how do you decide which token matches the bytes? I think it's the longest matching prefix rule, is that right? Yes. I always want to say it a different way, because for some reason that sounds more natural to me, but I guess my instincts are bad there. Perfect.
So we're now at the next stage of the pipeline. We've got a sequence of tokens coming from the lexer, and now we need to transform it into something useful, right? Just a list of tokens, like id, dot, id, if, curly brace, id, num, doesn't really mean anything. We want to turn it into, in the vaguest terms possible, something useful. So let's think about it this way. Is every sequence of tokens valid? No. I'm not going to define the tokens precisely; let's say we have a token for num, a token for dot, a token for id, and a token for plus, and leave it at that. So these are our tokens. How would we describe these tokens? Using English? Regular expressions. Yes, we would write regular expressions to define what each of these things means, right?
And so the lexer gets in a series of bytes (I'm going to stop drawing these edges) and outputs a sequence of tokens, okay? So let's think about what this means in, say, a programming language, when you have something like id dot id. Is this something that happens in your programs? Yes. When, in what language? Java. When doing what? Calling methods. Calling methods or accessing fields, right? So this would be something like foo.bar. In Java this would be: on object foo, I'm accessing field bar. So is this a valid sequence of tokens? Yes. All the time? Well, when would it not be valid in Java? If the method doesn't exist. Okay, let's hold on to that for a second, because that's definitely a problem. The method might not exist, and that's going to affect our compilation. But what I'm trying to get at here is a feel for what it means for a sequence of tokens to be valid or not valid.
So let's think about this. What happens at the very start of our Java programs? We're talking about bytes, right? At the very top of your program, what does it look like? Import. We have an import, whatever, whatever, whatever, and then maybe other imports. And these are actually dotted; this would be id dot id dot id. So if you think about this in terms of tokens, there would probably be an import token, an id token, followed by a dot, followed by an id, followed by maybe a semicolon. This is going to be faster, but that's fine. So this would be that one. So here, an id dot id is valid. But on the next line, right after an import, can I have foo.bar? Why not? It's not in a main method. It's not inside a method body. Even then, let's say I'm inside a class, right?
So in Java, everything has to be in a class with some name, right? I have a curly brace that opens the class. If the next thing after this curly brace is just foo.bar, is that valid in Java? So let's go back to the original question. Is id dot id a valid sequence of tokens? Yes. Sometimes. But at what times? It depends, in some sense, on the context where it's used. We can't just arbitrarily throw in any id dot id and expect it to be valid Java code. There are only certain places where we can actually use it. Cool.
And then other sequences of tokens we know are probably always wrong, like dot plus dot. Right? I think so. Can you think of a reason why this would be a valid sequence of tokens? You can maybe think of a language where this does happen. What about id plus id? Would this maybe be valid? Yes. What would this mean semantically? It's basically an addition. An addition of what? Of one variable to another. Yeah, one variable to another. This would be different, maybe, than num plus num, right? But they're all valid syntax. And going back to the earlier point: id plus id, we already said, may be valid depending on context. But what if you've never declared that id as a variable, say, in Java? Then it wouldn't compile, right? So it would be semantically invalid. Those are additional checks that we're going to look at later.
So what we're worried about right now is: is this sequence of tokens that the lexer is giving us a valid sequence of tokens? And we already saw we can't answer this question in isolation, because of this id dot id problem. We can't just say, yes, any id dot id is definitely okay. So this is part of the goal of syntax analysis: checking whether this sequence of tokens is a valid sequence of tokens.
So we already kind of looked at some of these things. Like decimal dot num, does that make sense? No, right? It's syntactically invalid. It doesn't mean anything in our language. Cool. Dot dot dot num id dot id, right? You can have any combination, and the lexer will correctly tell you, hey, this is a dot, this is a dot, this is a dot, this is a num. But now we need to decide whether that's right, and if not, we need to throw an error.
Okay, so we already learned about regular expressions, right? Please say yes. Or we can at least agree that you are currently learning and refining your regular expression skills. We studied and looked at regular expressions. So if we want to tell whether a sequence of tokens is valid, we kind of already have a way to describe a sequence of characters: we can use a regular expression to define a set of strings. So why don't we use regular expressions here to define what the valid sequences of characters are? Sorry, sequences of tokens. So let's use regular expressions.
Let's say we have a program, and let's try to describe a program with regular expressions. What's a program composed of? Let's say it's a very basic language. It maybe has a main method. Does it have to? No, not necessarily. You can write a .c file that does not have a main method, which you're going to use and import. You can compile that to an object file and then link against it later, which is exactly what the .o files that we gave you in project one are. So let's think about each line. Most lines in a C program, what do they end with? Semicolon. Right. So let's call that a statement: a statement ends in a semicolon. So a program is composed of statements, and can we have just one of them?
How many statements can we possibly have? Infinitely many. How do you do that with regular expressions? Star. Statement star, right? So this says a program is zero or more statements. So what kind of things can a statement be? What makes it a statement? Tokens, yes. Here we're only talking about tokens; our basic characters are going to be tokens. Maybe a keyword and an ID. So what would a keyword followed by an ID be? Ah, so we can declare a variable. Okay, what are some other things that statements can be? Function calls. What else? Operators, like plus. What else? Something that can end the program, like a break or a halt or something like that. A control statement: if, while, unless in some languages, loops, fors. Right?
Let's go back to the operator. So we have plus here. Let me cheat real quick. Okay, cool. So plus is an operator. On the left-hand side, what should we have? Why is this not going away? It's like it's burned into the page. We're all seeing this, right? So statement is going to be... ignore the ghost. So we have a plus sign. What can be on the left? It could be an ID. What else? A num. A decimal. Right, it could be a lot of different things. Okay, cool. Let's just go with nums, so we have num plus num.
So let's think. I have a statement: five plus five. Five is going to be what token? Num. Plus is going to be the plus token. And the other five is going to be num again. So are these tokens in the language described by the regular expression for program? Yeah. Right?
I have num plus num. That's in there; that's described by the regular expression. Awesome. So what about this one, five plus five plus five? Is this a valid expression? Yes. No? I mean, in C, would we like this to be a valid statement? Do you want it to be? Yeah. Right, we want five plus five plus five. And what does this mean? Well, we have the plus operator, and we have five. And then we have another plus operator, and that's going to be five plus five. So the left-hand side of this plus operator is really a statement, not just a number. Well, okay, we already said we can't have recursive regular expressions. So how do we change this so that we can match this extra plus? Put the plus num into parentheses and star it. Interesting. Okay, now we have num followed by plus num, starred, so we can do num plus num plus num. One problem this gets us into is just num by itself. Is that a valid statement? Maybe, maybe not. But okay, that gets us out of this problem.
Now, can we have parentheses in a statement? Can we do a statement like five plus five, and then could I add any number of parentheses around it? What's the key thing that I need, though? An equal number of opening and closing parentheses. Right, they should be what? Balanced. Yes, like a scale. They should be balanced. I want to know whether this sequence of tokens is valid or not valid, so I want to know whether these parentheses are balanced or not. So how do we write a regular expression for balanced parentheses? We need a token for left and a token for right; call them LP and RP. Okay, what goes in the middle? Why can't we do this? It goes recursive. Can I maybe do something like this? What would this describe? Any number of lefts and rights? What specific form is this in?
So let's think. I can do a left parenthesis, then any number of left, right pairs, and then a right parenthesis. So this would be like this. Does this one match? Yeah, this one does. Does this one? Where do you put it? It could be at the left. This matches? No. Why not? It has to be left, right, left, right, all the way through. So what are we trying to say here? We just said balanced. What does balanced mean? The same number of left and right parentheses.
So let's try to boil this down. This is what you do when you're trying to think about, okay, what are the bounds of what I can do? Let's just call these a and b. I want a regular expression that describes the set of all strings with some number of a's followed by an equal number of b's. How do I do that? These are the strings that I want. I want some regular expression R to describe (we can include epsilon or not, it doesn't matter) ab, aabb, aaabbb, and so on. So what's R? a star b star? What strings does this produce that are not in here? aab. aaa. Right. What else? Something like R equals capital A capital B, whole thing starred, where A is a token that matches specifically a and B is a token that matches specifically b. Like this? Cool. So what does this do? This gives ab, which is in there, but then it's abab, and then ababab. You'll notice this is exactly the same thing we tried up here with the parentheses. So what's the problem? What are your brains hitting when you try to think of a regular expression to match this? We can't get the same number of each. What is it about that number, though?
We're trying to make them repeat the same number of times, but if you separate the stars, that doesn't do it. So think about the regular expression operators that you have. What does star give you? Zero or more. Do you have any way of specifying how many times you want something repeated? No. Do you have any way of capturing how many times a star repeated? No, not with any of our regular expression operators. So fundamentally, this R does not exist. We cannot write a regular expression to capture this, and this is a very basic part of a program. If we broke the rule that we couldn't have recursion, we kind of already saw that we could easily deal with this: we could say, yeah, we can have a statement here. I think you'll prove that in 355, but I don't know. For now you can take my word for it.
And so that's what we have here. This is a slightly more expanded example, because I wasn't writing everything down. We have program, composed of statement star. Statements are expressions, or if statements, or while statements. Operators are plus, minus, multiply, divide. An expression is a number, an ID, or a decimal, followed by an operator, followed by a number, an ID, or a decimal. So we can see that an expression is num operator num, and it also matches foo minus bar, which is ID minus ID. But one plus two plus three, as we saw, we can't do. We can maybe add an operator followed by a num, ID, or decimal, all starred, like we did, and that gets us into a good place.
But the expressions and statements that we want to describe, regular expressions fundamentally cannot capture. They are not expressive enough. You cannot use regular expressions to define the set of balanced parentheses that we want. So we need something that has more expressiveness. But when we gain expressiveness, we lose a little bit in performance. So there are trade-offs there.
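As a small illustration of the two attempts from the board, here is a minimal Python sketch that checks a*b* and (ab)* against a handful of strings. The helper names (in_target, candidates, samples, disagreements) are made up for this example and are not from the lecture. It only shows that these two particular patterns miss the target; it is not a proof that no regular expression exists.

```python
import re

# Target language from the lecture: n a's followed by exactly n b's (n >= 0).
def in_target(s: str) -> bool:
    n = len(s) // 2
    return s == "a" * n + "b" * n

# The two candidate regular expressions tried in class.
candidates = {
    "a*b*": re.compile(r"a*b*"),
    "(ab)*": re.compile(r"(ab)*"),
}

# A few hand-picked strings, some in the target language and some not.
samples = ["", "ab", "aabb", "aaabbb", "aab", "abab", "ba"]

def disagreements(pattern):
    """Return the sample strings where the regex and the target language disagree."""
    return [s for s in samples if bool(pattern.fullmatch(s)) != in_target(s)]

for name, pattern in candidates.items():
    print(name, "disagrees on", disagreements(pattern))
```

The first pattern accepts too much (it lets the counts differ), and the second accepts the wrong shape (alternating pairs instead of nested ones), which matches what we saw on the board.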
So this is why I said at the beginning that regular expressions are very fast and very performant: because they're not as expressive. Does that make sense? What do I mean by expressiveness? I mean, literally, you cannot describe this R here. In English, we can describe this set, the language L that R would describe, very easily: the set containing all strings that have some number of a's followed by an equal number of b's. That set exists. But we cannot define that set using a regular expression. Using context-free grammars, which is what we're going to learn about, you can define that set in a formal way.
Okay, so context-free grammars. We need a new way of describing this. You can think of it as relaxing that restriction on recursion. We're going to define things that look kind of like regular expressions but are definitely not regular expressions, and we'll see that they can be recursive. And the way we're going to do this is with rules.
So let's try to think whether we can define something that could do this. Let's say we have some starting symbol, S, that we're going to start with. I want balanced lowercase a's and b's. Whenever I see a capital S, I'm going to replace it with something. So what would I want to replace it with to capture balanced a's and b's? a S b. Or a b. Or epsilon. All right, we'll do it. So what am I saying here? What I'm saying is: start from S. Well, let's get rid of the ors for a second. So I have three rules. S can be a lowercase a, another S, and a lowercase b. Or S can be a b. Or S can be epsilon. So I start from S, and I apply one of these rules randomly.
Let's say I choose rule three. What string is that going to generate? The empty string, epsilon. Is that in my set? I start with S again. Let's say I apply rule two. What is it going to produce? a b. So I get the string ab. Is that in the language? Okay. I have S again. Let's apply the first rule. What am I going to get? a S b. Can I stop here? Could this be a finished string? Why not? Yeah, because I still have to get rid of the S. So now which rule do you want to apply? Let's go with rule two. Good call. So what string does this represent? aabb.
So what would happen if I applied these rules in every possible way to generate all the strings that can come from S? Would epsilon be in there? Yes. Would ab be in there? Yes. Would aabb be in there? Yes. These are all definitely yes, because we just showed that. But would this longer string be in there? Sure, it should be. It would be this S producing a S b, and then that S producing another a S b, and so on and so forth. If I keep applying rule one, I can generate strings of matching a's and b's. So we talked about this very intuitively and informally. Now we need to actually define things.
Yes, please? What about abab? abab is not what we're trying to describe right here, because here we're just trying to describe matching a's and b's. I can replace the a with a left parenthesis and the b with a right parenthesis, and now we have this particular type of balance, where we have some number of a's followed by an equal number of b's. Could we get rid of the middle rule? Yes. Is the result the same? No? Why? Does someone want to raise their hand? Yeah: a epsilon b is ab. Yes. Right, so we go a S b, and this S now goes to epsilon. What's a concatenated with epsilon concatenated with b? ab.
And if instead I chose a S b for that S, and then the inner S went to epsilon, that would be a concatenated with a concatenated with epsilon concatenated with b concatenated with b, which is aabb. So, actually, surprisingly, that rule is redundant. Cool. Any other questions on this example?
All right, let's define some things. This is called a context-free grammar. We're going to define a grammar. I know so far we've been talking about very concrete things, like bytes turning into tokens. Here we're going to talk about context-free grammars a little bit abstractly, because I think they're a little more complicated than regular expressions, and it's a little harder to build up an intuition for what they are and how they work. So we're going to start by looking at them like we did here, with symbols like S's and a's and b's. And then towards the end, when we've finally mastered this, I'm going to show you how real programs use context-free grammars to parse really cool stuff, using tokens.
So for a context-free grammar, we need an S. Why did we use this S here? What did the S mean? Okay, let's phrase it a different way. Where did we start whenever we were building one of these trees, when we wanted to find a string that this grammar defines? S, yes, because it was the start. So we need some kind of starting symbol. And how come I never expanded this a or this b? There's no rule for that? Yes, there's no rule. I don't have anything about how to translate an a into something else, right? I can replace an S with an a followed by an S followed by a b, or I can replace an S with an epsilon. But how do I replace an a? I can't.
There are no rules for that, so I just leave it as is. I'm going to get back to that in a second. So we're going to call this a... oh man, I almost messed it up again... a terminal, right? If you think about applying all these rules, when we get to something that's a terminal, we cannot go any farther. That's as far as we can go. What's the opposite of a terminal? Yeah, a non-terminal. I hope I'm spelling this right. Cool. And S is the start; that's where we're going to start, so it's going to be our starting non-terminal.
Now, we're going to do a little bit less on formally defining exactly what a context-free grammar is and what it must have. But this is the information we need. What is the starting non-terminal? In most cases, by convention, we will use S. So if you see a context-free grammar with a capital S in it, that is the starting non-terminal. The other thing: in all of our more abstract examples, uppercase characters will be non-terminals and lowercase characters will be terminals.
And then we also have our epsilon. What is epsilon? That's a question you may not know the answer to. Well, it's definitely not a non-terminal, right? We can't replace epsilon with anything. But unless it's by itself, it doesn't end up in the output string. So I guess we can call it a terminal. But be aware that epsilon needs to be treated a little bit specially, especially when we start going over algorithms for calculating things about a context-free grammar, which is what you're doing in project three. So even if a grammar is not 100% specified, like here, how do we know a and b are terminals? Yeah, there are no rules for them, right? We didn't specify any rules telling them where to go and what to do, right?
If we had something like a goes to S, then a would be a non-terminal in this example, right? But I will not try to trick you like that explicitly. I mean, it could happen by mistake, but that's how you should think about it. Okay. So we have terminals, non-terminals, and the starting non-terminal.
The syntax of a context-free grammar is basically this: we have a left-hand side, followed by an arrow, followed by a right-hand side. When we get to project three, it'll be really cool: you're going to write a program that reads in a context-free grammar, and the description of how to read it in is itself a context-free grammar. So there's some left-hand side. Can you have terminals on the left-hand side? No, it doesn't make sense, right? That's part of the reason why it's a terminal. Can you have multiple non-terminals? Let's say, can I replace two S's with a b? That is a type of grammar, but not a context-free grammar. In context-free grammars, you have only a single non-terminal on the left-hand side, then an arrow, followed by any number of terminals and non-terminals, or an epsilon, on the right-hand side. And basically this means that anytime you see an S, you can replace it with little a, big S, little b. Questions? Cool, all right.
We can also call the rules productions. So you can think of it as S produces little a, big S, little b. Oh, and we already did this example for matching parentheses. We can use the bar symbol for or, just like in regular expressions. So this is read: S can go to left parenthesis, S, right parenthesis, or epsilon. It's exactly the same as writing the rules separately; it's just a little bit easier to write. Okay. So, from a context-free grammar, when we drew these trees, what were we doing?
How did we do this? What was the first step? What was the first thing I wrote down? S. Why S? You have to start somewhere. Yes, you do have to start somewhere, and S is our starting non-terminal. So it's very easy to remember: you've got to start somewhere, so start at the starting non-terminal. Right, we wrote down S, and then what did we do? We chose one of the rules, and then we replaced S with the right-hand side of that rule. So in some sense, and we'll get back to this, it's almost like we're building a tree, by expanding each non-terminal and replacing it with one of its valid rules.
Another way to think about that is as a derivation. So let's say I have S goes to left parenthesis, S, right parenthesis, or epsilon. A derivation would be: start from S and apply one of the rules. We write each step with a double arrow. So I start with S and apply one of these rules. Which rule? Sure, epsilon. Then do I apply another rule? No, so my derivation is done. Now let's apply the other rule: left parenthesis, S, right parenthesis. Am I done? No, so apply one of the rules again. Let's go to epsilon, the second rule. That gives left parenthesis, epsilon, right parenthesis, which is the same thing as just a pair of parentheses. Right? So each of these double arrows is one step in our derivation. Basically, we're asking, how do I get a string that is described by this context-free grammar? Well, I start from the starting non-terminal and keep applying rules until I can't apply any more, and the resulting string is definitely in the language of my context-free grammar. We can keep doing this for two, three, four steps. Every step is a derivation step. Cool.
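The derivation steps just described can be sketched in a few lines of Python. This is only an illustration under some assumptions: the names RULES, derive, and show are made up for this example, rule numbers are 0-based (0 is the parenthesis rule, 1 is epsilon), and the empty list stands for epsilon.

```python
# Grammar from the lecture: S -> ( S ) | epsilon
# Each right-hand side is a list of symbols; the empty list stands for epsilon.
RULES = {"S": [["(", "S", ")"], []]}

def derive(choices, start="S"):
    """Apply the chosen rule numbers one at a time and return every
    intermediate form, starting with the start symbol itself."""
    form = [start]
    steps = [form]
    for choice in choices:
        # Find the non-terminal to expand (this grammar never has more than one).
        i = next(k for k, sym in enumerate(form) if sym in RULES)
        form = form[:i] + RULES[form[i]][choice] + form[i + 1:]
        steps.append(form)
    return steps

def show(steps):
    """Join the forms with the double arrow; an empty form is shown as ε."""
    return " => ".join("".join(form) or "ε" for form in steps)

print(show(derive([1])))        # apply the epsilon rule immediately
print(show(derive([0, 0, 1])))  # two parenthesis rules, then epsilon
```

Asking for another step after the last S is gone raises an error in this sketch, which lines up with the point that a derivation is done once there are no non-terminals left to replace.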
So now let's get a little more complicated. Here I'm switching back a little bit to tokens and lexers. The token would just be num; say you have the token num and the tokens for multiply and plus, but we're leaving the characters here. So I'm saying an expression produces an expression plus another expression, or an expression produces an expression times another expression, or an expression is a number. So can I derive an expression? Let's start with expression. I can apply one of these rules, let's say the second one. That gives expression times expression. But now we have a question. Which of these two expressions do I replace first? Does it matter? No? Yes? Maybe? Why?
Let's say I change the right expression to three; that's some number. Then I change the left expression to expression plus expression, so I have expression plus expression times three. I change one of those to two, and then I change the other to one. So I have one plus two times three. Since these are all nums, it should really be num plus num times num. Is that in the language of this context-free grammar? Yes, and I know because at every step of my derivation I correctly applied one of these rules.
So as we saw, there are different ways to derive this string. At this step, we had a choice: which expression do we derive first? And this will help us answer questions about context-free grammars that are very interesting, at least in terms of syntax analysis and eventually parsing. So, leftmost derivation. Just from the title, what would you assume this is? Which thing do we choose to derive? The leftmost one at that time. Always expand the leftmost non-terminal. This is incredibly easy to remember. So if we say, show a derivation that is leftmost, just always do the leftmost one. So, same example.
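To make the choice of which expression to expand concrete, here is a hedged Python sketch of the expression grammar with a switch between leftmost and rightmost expansion. The names RULES and derive, the single letter E for expression, and the 0-based rule numbers are assumptions made for this illustration only.

```python
# Expression grammar from the lecture, with tokens as terminals:
#   E -> E + E | E * E | num
RULES = {"E": [["E", "+", "E"], ["E", "*", "E"], ["num"]]}

def derive(choices, leftmost=True, start="E"):
    """Expand the leftmost (or rightmost) non-terminal with each chosen
    rule number and return every intermediate form as a string."""
    form = [start]
    steps = [" ".join(form)]
    for choice in choices:
        # Positions of all non-terminals still waiting to be expanded.
        positions = [k for k, sym in enumerate(form) if sym in RULES]
        i = positions[0] if leftmost else positions[-1]
        form = form[:i] + RULES[form[i]][choice] + form[i + 1:]
        steps.append(" ".join(form))
    return steps

# Both derivations start with E -> E * E and end at the same token string.
left = derive([1, 0, 2, 2, 2], leftmost=True)
right = derive([1, 2, 0, 2, 2], leftmost=False)

print(" => ".join(left))
print(" => ".join(right))
```

Both runs end at num + num * num, but the intermediate forms differ: the leftmost run rewrites the left E first, while the rightmost run turns the right E into num before touching the left one.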
So you could ask the question, is this a leftmost derivation? Why not? Okay, at this step, is this the leftmost non-terminal? Yes. Is it also the rightmost? There's only one. Okay, now in this step we're deriving the right one, not the leftmost one, so this is not a leftmost derivation. Cool. Does this one take the leftmost on every step? Yes. Which leads us to a rightmost derivation, which is what? Yes, always derive the rightmost non-terminal. It's incredibly easy: always expand the rightmost non-terminal. Everybody know their right from their left? The left hand makes an L. You run into that in physics, right? With the right-hand rule or whatever, you're trying to figure out the forces while taking a test, and if you actually do it with your left hand you get the wrong answers. Okay, so is this a rightmost derivation? Yes. Okay, cool.
So the cool thing here is that we're basically drawing a tree when we do derivations, right? I mean, I drew arrows, which kind of messes up the tree, but just think of them as lines. So I have S, and S goes to left paren, S, right paren, and this S goes to epsilon, right? So this is a tree. What's a tree? It has nodes. You guys are describing actual trees; this is a computer science tree. Some of those terms do apply, so I'm mostly messing with you. A root: what's a root? The start of the tree. Which is weird, because we draw the root at the top when you'd expect it to be at the bottom. So would S be the root of this tree? Right. What are some of the other things? Leaves. What does leaf mean? A node with no children. What's a child? We didn't talk about that. This is getting oddly philosophical over here. What are these things?
Edges. They establish the parent-child relationship. Without edges we'd just be drawing circles, or not even circles here, right? So I have these nodes: S, left paren, S, right paren, epsilon. This S is the root; it's the top of the tree. S has three children: left parenthesis, S, right parenthesis. Does the order of these matter, the order of the siblings? Sometimes. For what we want to do here, yeah. Why? Convince me that it is important. We're concatenating the rule. What are we concatenating, though? The leaves. We're concatenating all the leaves together to get the resulting string. And so if I have this other tree, is it the same? Do these produce the same string? No. So the order of siblings is important. And where does it come from? From our production rules, exactly from this rule. Cool. So, a little preview: drawing this tree is the same thing as doing this derivation. We'll do that on Monday.
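As a rough sketch of why sibling order matters, here is a tiny Python tree whose leaves are concatenated left to right to get the string. The Node class and the tree_yield function are hypothetical names for this illustration, and treating an epsilon leaf as contributing nothing is the assumption that makes the result come out as a plain pair of parentheses.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    symbol: str
    children: list = field(default_factory=list)  # ordered left to right

def tree_yield(node):
    """Concatenate the leaves left to right; epsilon leaves contribute nothing."""
    if not node.children:
        return "" if node.symbol == "ε" else node.symbol
    return "".join(tree_yield(child) for child in node.children)

# Tree for S => ( S ) => ( ), using the grammar S -> ( S ) | epsilon
tree = Node("S", [Node("("), Node("S", [Node("ε")]), Node(")")])

# The same nodes with the outer siblings swapped
swapped = Node("S", [Node(")"), Node("S", [Node("ε")]), Node("(")])

print(tree_yield(tree))
print(tree_yield(swapped))
```

The two trees contain the same nodes, but reading their leaves left to right gives different strings, and only the first ordering follows the production rule.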