All right, well, let's get started today. We'll start off, anybody have any questions about first sets, follow sets, anything we've covered so far? Questions on project three, homework two, anything? Yes. Yeah, so last class you did the follow set example with emails. Yes. You started talking about parsing. Yes. And you were asking, does each non-terminal get its own parse function? So what we're writing here, and I'll put this up here since this is where we left off on Monday. So what we're writing here is parse functions for every non-terminal in the grammar. Exactly, so we wanna know how we're gonna parse an input string based on our lexer, by getting tokens from our lexer. We're gonna say, okay, well we know we're gonna start on a starting non-terminal, that's how we started: parse S, the starting one. And then by looking at the next token, right, we should be able to determine which of the rules that S produces does this string correspond to. Or does this string, or token, correspond to a syntax error, so it's not a string in our language? So in the next one of these parse functions, we'll look at what happens when we get to a terminal. Right now we've only been parsing non-terminals. Any other questions about where we are right now? For the homework, do you want us to just model it after this, or do we have actual return values or anything like that? So for the homework, you can model it after this. This is really what we're looking for here. And it should be valid-ish pseudo code. It doesn't necessarily need to compile; if you forget a semicolon, we're not gonna take off points. But if it's completely wrong, no braces, or it doesn't make sense, or we can't read it, then that's on you. So yeah. For the project, I was looking at the first section, just, like, getting the program and trying to parse it.
And it looked like you couldn't just write it recursively for that part. I'd have to look at it directly, but yes, there's some ambiguity in it, and it's intentional in the sense that it's not gonna change. You can write essentially a custom parser for that. So it doesn't need to be in this format like we're doing right here. Any other questions? So as we can see on the board, where we left off was the function parse display name list. And remember here at the very top, on the top left, we have the rule: display name list goes to a word followed by display name list, or display name list goes to epsilon. And we already went through and calculated the first and follow sets for everything in the grammar. So by checking the next token, by calling get token, we can say, well, if its type is in the first set of word display name list, then we know that that must be the rule we've chosen. Or rather, that's the production rule that generated this string. So that's why here we check: if t type is atom or quoted string, then we're going to put that token back, unget it, then call parse word and parse display name list again, and then print out to say, hey, this is the rule that we chose. Otherwise, the other rule, right, there's two rules here, the other rule is display name list goes to epsilon. Now, when that production rule is chosen, there's gonna be nothing in the resulting string, the resulting tokens, right? There's no token that corresponds to epsilon. So how do we know if that production rule is taken? The way we do this is we check the follow set of display name list. What is the token that follows display name list? And that's what we calculated with follow sets. So that's why here we're checking: if the t type is equal to a left angle bracket, then that means this is the epsilon production. And so what we do is we unget the token, right?
So we put that token back because we're not parsing that token, and then we don't do anything else, because our production rule is display name list goes to epsilon, so there's nothing for us to parse or consume from the input. So then we just return. Otherwise, we're gonna have a syntax error. Questions on this? Yeah. What if epsilon were in the first set here for display name list? So if that were the case, then, so there's two rules for deciding if a grammar has a predictive parser, right? The first rule says if you have rules A goes to alpha and A goes to beta, then the intersection of the first sets of alpha and beta better be empty. So in this case, here we have A is display name list, alpha is word display name list, and we also have a rule where A is display name list and beta is epsilon. When we started this, it was actually Monday, so it was a while ago, but we took the first sets of each of those, we said, what's the first set of word display name list and what's the first set of epsilon, and is there an intersection between the two? And we showed that there wasn't, so we're not going to worry about that. But if there was, then we'd stop, and we wouldn't be able to say that this grammar has a predictive parser. Shouldn't the print come before the calls to parse word and parse display name list, in this else-if clause? Yeah, so we actually don't know until we've parsed a word and parsed a display name list whether this production rule is correct, because it might not be a word, or the display name list could be wrong. Because we've only checked one token ahead to make sure that the next token is valid, but there could be multiple tokens there.
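The function just walked through can be sketched in code. This is a minimal sketch, not the course's actual implementation: the token names (ATOM, QUOTED_STRING, LANGLE), the list-backed lexer, and recording rules into a list instead of printing are all illustrative assumptions.

```python
# Sketch of parse_display_name_list. Token names and the Lexer interface are
# assumptions for illustration; the course's real lexer may differ.

ATOM, QUOTED_STRING, LANGLE = "atom", "quoted_string", "<"

class Lexer:
    def __init__(self, tokens):
        self.tokens, self.pos = list(tokens), 0

    def get_token(self):
        t = self.tokens[self.pos]
        self.pos += 1
        return t

    def unget_token(self):
        self.pos -= 1  # push the last token back

def syntax_error():
    raise SyntaxError("syntax error")

def parse_word(lexer, rules):
    t = lexer.get_token()
    if t == ATOM:
        rules.append("word -> atom")
    elif t == QUOTED_STRING:
        rules.append("word -> quoted_string")
    else:
        syntax_error()

def parse_display_name_list(lexer, rules):
    t = lexer.get_token()
    if t in (ATOM, QUOTED_STRING):      # FIRST(word display_name_list)
        lexer.unget_token()             # parse_word will consume this token
        parse_word(lexer, rules)
        parse_display_name_list(lexer, rules)
        rules.append("display_name_list -> word display_name_list")
    elif t == LANGLE:                   # FOLLOW(display_name_list): epsilon rule
        lexer.unget_token()             # '<' belongs to whoever parses it next
        rules.append("display_name_list -> epsilon")
    else:
        syntax_error()
```

Note that on the epsilon branch the '<' token is pushed back and left unconsumed, exactly as discussed: this function parses nothing for epsilon and lets its caller deal with the angle bracket.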
So that's why we wait until both those functions return to say, hey, we know this rule was chosen here, and it successfully completed because control flow came back to this point. Great, okay, so now we're going to move on to the next function. This is the first one we'll see that has terminals in the production rule. So here we're going to talk about parse angle address. And the rule here is: angle address goes to left angle bracket, address specification, right angle bracket. We've calculated first and follow sets for all these things here. And our function is going to do very similar things, right? The first thing we're going to do is call get token, because we want to make sure that this is a valid string that was generated by this rule. So what are we going to check the first set of? Are we going to check anything? What are we going to check? Anyone? Yeah. The first set of the whole right-hand side, starting with the left angle bracket. Right, so yeah, it's very mechanical. We check the first set of the entire thing on the right-hand side, which in this case is left angle bracket, address specification, right angle bracket. And we know that the first set of that is the left angle bracket. It's a terminal, so whatever string this rule generates, that first token is always going to be a left angle bracket. So we check: is the type of the token we just saw equal to the left angle bracket? And then what do we do next? So I guess the question is, what do we traditionally do? Yeah. Unget token. Yeah, do we do unget token here? You want to say that louder? It's going to consume the... Yeah, so specifically here, we don't call unget token, because this is a terminal in our string. This left angle bracket is going to be in that resulting string. Right, it's a token that we care about. So we actually want to consume that token and start parsing right after that left angle bracket. So we're not going to push it back and call unget token.
So then we're going to call parse address specification, right? So we're going to recurse into and call that function. It's going to do something. And then what do we do after that returns? Get token again, and what are we going to check? We need to check if it's the other side of the angle bracket. Right, so the right angle bracket, the greater-than symbol. So yeah, I kind of did this the opposite way: here, we're checking for a syntax error, right? We know if we saw a left angle bracket and then an address specification, if the next thing is not a right angle bracket, then we have a syntax error. So that's what I'm checking for here: hey, if the t type is not a greater-than symbol, then there's got to be a syntax error. Then once all of those checks pass, now we know, okay, this is the production rule that we correctly parsed. That means this is where we're going to print out that this is the production rule we just parsed. So there we see again why we don't call unget token after we check the greater-than symbol, right? Because that is a token that we want to consume. All right, so what do we say if it's not a left angle bracket when we first got that first token? Error, right, yeah. There's no other rules; there's no possible match. It's gonna be a syntax error if we're trying to parse an angle address and there's not a left angle bracket. So that's where we output the syntax error. Questions here? Yeah. When does parse address specification call get token? Okay, good. Yeah, so you can kind of think about it this way. Each function in isolation is going to consume only those terminals that are contained within its right-hand sides, right?
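Putting those pieces together, parse_angle_address might look like the sketch below. The token names are illustrative assumptions, and parse_address_specification is a stand-in that consumes a single assumed ADDR token; the real function recurses as described in the lecture.

```python
# Sketch of parse_angle_address: terminals are consumed (no unget_token),
# and the non-terminal in the middle is handled by its own parse function.
# Token names and the one-token address body are assumptions for illustration.

LANGLE, RANGLE, ADDR = "<", ">", "addr"

class Lexer:
    def __init__(self, tokens):
        self.tokens, self.pos = list(tokens), 0

    def get_token(self):
        t = self.tokens[self.pos]
        self.pos += 1
        return t

def syntax_error():
    raise SyntaxError("syntax error")

def parse_address_specification(lexer, rules):
    # Stand-in: treat the address body as one ADDR token for this sketch.
    if lexer.get_token() != ADDR:
        syntax_error()
    rules.append("address_spec -> ...")

def parse_angle_address(lexer, rules):
    if lexer.get_token() != LANGLE:    # FIRST of the right-hand side is '<'
        syntax_error()
    # No unget_token: '<' is a terminal of this rule, so we consume it.
    parse_address_specification(lexer, rules)
    if lexer.get_token() != RANGLE:    # the terminal '>' must come next...
        syntax_error()                 # ...otherwise it's a syntax error
    rules.append("angle_address -> < address_spec >")
```

The two get_token calls with no unget_token show the pattern for terminals: check the token type, and if it matches, keep moving; if it doesn't, it's a syntax error.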
So this is why the only tokens we consume are the left angle bracket and the right angle bracket, and we're gonna let parse address specification consume whatever tokens it needs to consume, so it will move the lexer through the sequence of tokens until it comes back, and then we say that next token right here better be a right angle bracket. Otherwise, it's a syntax error. So then we consume that, and then we return and say, okay, we've properly moved through the input, the sequence of tokens, to capture this parsed angle address, yeah. What is parse address specification returning, then? Because you're just calling it as a function. So what if it runs into an error and prints out an error statement or something? So it's part of the semantics we define here that syntax error is gonna halt execution and just output an error. So we know it's not gonna return unless it was successful in parsing. How you actually implement this can vary, right? You can have parse address specification return something if you care about what it returned. You can have it manipulate some global data. You can have it build a tree; actually, each of these steps could build a tree, and so you can build the parse tree directly from here. So really, however works for you. Is there another hand in this direction? Okay, now we keep going. So now we see how address specification works. We know that address specification produces a domain atom... domain atom, I forgot what the second part is. Atom at, yes, okay, that's its own name, that's different. Okay, a domain atom at followed by a domain, or a quoted string at followed by a domain. So we calculate first and follow sets here. So what's the first thing that we do in this function? Yeah, get token, we want to get the token, we want to see what the next token is, right?
Same thing, so we need to know which one of these rules to take. So we call get token, and what's the first set that we're going to check? What are we going to check? Here it is. Yeah. Domain atom at. Yeah, domain atom at. We're going to check the first set of, remember, domain atom at followed by domain, because that's the entire right-hand side of that production rule. And we know that that is domain atom at. So we're going to check: if the type is domain atom at, what's the next thing we're going to do? It's okay to be wrong. We're going to call what? Next token, get token. Get token? Why? What are we going to check with that token? Basically, when we're viewing that token, we are trying to see if it matches with the second part of the rule. Right, so what's the second part of the rule here? Domain. Domain, a non-terminal. So you're right in that we're not going to call unget token, right? Because we consume that token; we want to consume the domain atom at token. But what we're going to do next, because we know we've chosen this rule, because there is a domain atom at, is we don't need to check the first token of domain at all; we just call parse domain directly. So it can parse however it needs to parse. And then when that returns, we know this is a proper domain that was just parsed. So now we're going to print out, hey, this is the production rule that we chose here. Any questions on that? Frank? Okay, what do we check on the next branch? Right, yeah. QSA. QSA, yeah, so quoted string at. So remember, once again, we're checking the whole thing, the first set of quoted string at followed by domain. And we're going to check, hey, if the t type is a quoted string at, then what are we going to do? Parse domain. Say it louder? Parse domain. Yeah, parse domain.
Okay, so we've consumed the quoted string at of this production rule. And we don't want to put it back, because we want to actually consume that token. And then we're going to call parse domain, which will recursively go and parse whatever it needs to parse. And then we're going to print out, hey, this is the production rule that we took: address specification goes to quoted string at followed by domain. What do we do if it's not one of these two tokens? Syntax error, right? It's got to be an error if it's not one of these two. Questions on this? I feel like I can get the hang of this. Yeah, awesome. Cool. Okay, we have a few more, so we should be good then. Okay, so we have the very simple rule: domain goes to dotted atom. Dot atom? I mean, it's dot atom, right? Dotted atom, dotted atom, okay. This is what happens when you use these shorthand names. Okay, so we're going to write the function parse domain. What's the first thing we're going to do? Get token, yeah, we want to get the token, right? And then what are we going to check? First set of DA. Yeah, the first set of DA, so the first set of the entire right-hand side, and we know here that it's just dotted atom. So we're going to check, hey, if the t type is equal to dotted atom, then what do we do? We print the rule. Why? Because there's nothing after it. Do we call unget token here? Fair question. Right, so we've called get token, we've consumed a token. We know that this type is the type for the one rule that we have, domain goes to dotted atom. And so we're done, right? We don't need to put anything back. We don't need to recursively call any functions; there's no non-terminals here. So all we do is print out, hey, we matched this rule, yeah.
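The two functions just walked through can be sketched as follows. Again, the token names (DOMAIN_ATOM_AT, QUOTED_STRING_AT, DOT_ATOM) and the rule-recording list are illustrative assumptions standing in for the course's actual token types and print statements.

```python
# Sketch of parse_address_specification and parse_domain, with assumed
# token names; recording rules into a list stands in for printing them.

DOMAIN_ATOM_AT, QUOTED_STRING_AT, DOT_ATOM = "da_at", "qs_at", "dot_atom"

class Lexer:
    def __init__(self, tokens):
        self.tokens, self.pos = list(tokens), 0

    def get_token(self):
        t = self.tokens[self.pos]
        self.pos += 1
        return t

def syntax_error():
    raise SyntaxError("syntax error")

def parse_domain(lexer, rules):
    # Only one rule: domain -> dot_atom. Consume the terminal; nothing to
    # put back and no non-terminals to recurse into.
    if lexer.get_token() != DOT_ATOM:
        syntax_error()
    rules.append("domain -> dot_atom")

def parse_address_specification(lexer, rules):
    t = lexer.get_token()
    if t == DOMAIN_ATOM_AT:            # FIRST(domain_atom_at domain)
        parse_domain(lexer, rules)     # terminal consumed; recurse for domain
        rules.append("address_spec -> domain_atom_at domain")
    elif t == QUOTED_STRING_AT:        # FIRST(quoted_string_at domain)
        parse_domain(lexer, rules)
        rules.append("address_spec -> quoted_string_at domain")
    else:
        syntax_error()                 # neither first set matched
```

Both branches consume their leading terminal without unget_token and then call parse_domain directly, since once the first token disambiguates the rule, there is no need to peek at domain's first set.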
For that comment after the t-type check, shouldn't it say we check first of domain? No, it's always the right-hand side of the rule. Because if there's multiple options, you want to check the first set of the entire thing. Remember, in this case, there's only one rule, so it's the same. But if domain had two production rules, the first set of domain would include the first sets of both rules, right? And we want to disambiguate between the two rules. Should we check for end-of-file after checking whether the t type is DA or not? So the question is, should we check for end-of-file? So why would we check for end-of-file? You mean because it's in the follow set here? Maybe. Yeah. So it's actually one of the things we could check for, but we're gonna let something else check it for us, basically. Because we don't know that we're actually at the end of the string, right? There could be an end-of-string after us, or there could be something else. But we were called from somewhere. So we know, hey, we consumed our domain by consuming that dotted atom. So we'll let whoever called us deal with that. Okay, so what do we do if it doesn't match, if it's not a dotted atom? Syntax error, right? It's an error. If we're trying to parse a domain, it's gotta be a dotted atom. So when do we call unget token again? So the sense of it is: you wanna call unget token, you wanna put the token back, when you're not actually going to consume it. And when you wanna consume it is when there's a terminal in your production rule. A terminal means, hey, that token's gonna appear in the sequence of tokens, right? So we wanna consume it to move our lexer to the next token. So basically the idea is, hey, if the next thing is a non-terminal, then you want to put that token back and let that non-terminal's parse function consume it. You just use that token to peek.
You peek at that token to decide which of these two rules applies. All right, so we've got one more, I think: word goes to atom or quoted string. Here we have first and follow sets. So when we parse word, again, what are we gonna do off the bat? Get token, yeah, we need to get the token. We need to peek at the token to know what to do. So what are we gonna check first? T type against what? Against atom, yeah. So we wanna disambiguate between these two rules. So we're gonna check the first set of atom, which is obviously just atom. So we say, hey, if the t type is atom, then what do we do? Do we unget token? Do we call some other parse function? Print, yeah, we just print. Maybe it's kinda easy to tell with the spacing here, but if you think about it, we wanna print because we're done. We've consumed that input token, we've moved the lexer forward, and we've parsed a word. We've done what our function is supposed to do. Now what do we check next? Do we just say else syntax error? T type is what? Quoted string. Quoted string, so we're gonna check first of quoted string, right? Because we wanna disambiguate this rule. We're gonna check, hey, if the t type is equal to a quoted string, then what do we do? Print, yeah, we wanna print. Say, hey, we got this rule. So this is the production rule: word goes to quoted string. Okay, what do we do if it's not either of these two tokens? Syntax error, right? It's gotta be an error, because a word will only generate either an atom or a quoted string, nothing else. So if you're trying to parse a word and the next token is not an atom or a quoted string, that's a syntax error. Yeah, definitely. When do you check for follow? So you check for follow if there's an epsilon in one of the first sets, or if there's an epsilon in one of the rules.
So if it can produce nothing, the only way you have to tell which rule to take is by looking at the next token and checking it against the follow sets. And these checks, this disambiguation, follow directly from the rules we have for deciding if you can have a predictive parser. Right, you can have a predictive parser if the right-hand sides' first sets are disjoint. That means you can choose which rule. So that's why we always check. And the second rule is: if you have a rule where A goes to epsilon, then it better be the case that the first set of A does not intersect with the follow set of A. So this is how you can tell, hey, is it one of the other A rules, or is it the epsilon one? So they're very linked here. Okay, and so, ah, I didn't have this slide up. So this is my thought process, at a very high level, of how you write a predictive recursive descent parser. First, for every non-terminal A in the grammar, you're gonna write a function called parse A. And that's exactly what we've done here, right? Because we wanna parse non-terminals. And then for each production rule A goes to alpha, where alpha is some sequence of terminals and non-terminals, we check if getToken is in the first set of that sequence of symbols. So that's what we're checking with every if statement there. Hey, if the next token is in this first set, well then we know we should choose this rule. If it's in the other one, then we know we should choose that rule. If it's in none of the first sets of any of the rules, then we know it's a syntax error. And then for every terminal and non-terminal in alpha, we go through, right? We wanna properly parse that right-hand side. So if it's a terminal, we just check that, hey, this is the correct token that we expect.
If it's a non-terminal, we wanna put the token back and call that non-terminal's parsing function. And the last rule, yeah, is just the syntax error. So if it's not in the first set of A, which means it's not in the first set of any of the right-hand sides of A, then it's a syntax error. You just do this for each of them. Oh, and this also checks, let's see, that getToken is in the follow of A. So yeah, this handles the epsilon production rule: where we have a rule that A goes to epsilon, we know we need to check the follow set of A to make sure that that's the rule we choose. Otherwise, syntax error. Questions on this? Yeah. What if first of alpha contains epsilon, for this right here? Oh, so that would be this rule right here. So there's a special-case rule here: if alpha is equal to epsilon, then check that getToken is in the follow set of A. That's checking whether that rule applies. Hmm, I'm gonna have to revise this slide. Yeah, I think I should probably change that to: if epsilon is in the first set of alpha. Yeah, that's correct, you have to check the follow of A. And by the two rules, there's only gonna be one rule that has epsilon in its first set, right? Because the very first rule for a predictive parser says that the first sets of all the right-hand sides must be disjoint. So only one rule is gonna have epsilon in its first set. And if that's the case, then you'll be able to tell whether to take that rule, whether the token is in a first set or in the follow set. Oh, you would print that rule, because that is a rule, right? If it's A goes to epsilon, then you would print that rule, whatever that production rule is. Exactly, yeah. So I'll make a note of it in this slide before I post it. Any other questions here? Yeah.
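The whole recipe, assembled, gives one parse function per non-terminal of the lecture's email grammar. This is a runnable sketch under assumptions: the token names are illustrative, rules are recorded into a list instead of printed, and the top-level rule mailbox goes to display name list followed by angle address is an assumption (the lecture's actual start symbol may differ).

```python
# The full recipe: one parse_X per non-terminal, first-set checks on each
# branch, follow-set check for the epsilon rule, terminals consumed directly,
# non-terminals handled by unget_token plus a recursive call.
# All names here are assumptions for illustration.

ATOM, QUOTED_STRING, LANGLE, RANGLE = "atom", "quoted_string", "<", ">"
DOMAIN_ATOM_AT, QUOTED_STRING_AT, DOT_ATOM = "da_at", "qs_at", "dot_atom"

class Lexer:
    def __init__(self, tokens):
        self.tokens, self.pos = list(tokens), 0

    def get_token(self):
        t = self.tokens[self.pos]
        self.pos += 1
        return t

    def unget_token(self):
        self.pos -= 1

def syntax_error():
    raise SyntaxError("syntax error")

def parse_word(lexer, rules):
    t = lexer.get_token()
    if t == ATOM:
        rules.append("word -> atom")
    elif t == QUOTED_STRING:
        rules.append("word -> quoted_string")
    else:
        syntax_error()

def parse_display_name_list(lexer, rules):
    t = lexer.get_token()
    if t in (ATOM, QUOTED_STRING):      # FIRST(word display_name_list)
        lexer.unget_token()             # non-terminal next: put the token back
        parse_word(lexer, rules)
        parse_display_name_list(lexer, rules)
        rules.append("display_name_list -> word display_name_list")
    elif t == LANGLE:                   # FOLLOW(display_name_list): epsilon rule
        lexer.unget_token()
        rules.append("display_name_list -> epsilon")
    else:
        syntax_error()

def parse_domain(lexer, rules):
    if lexer.get_token() != DOT_ATOM:   # terminal: check and consume
        syntax_error()
    rules.append("domain -> dot_atom")

def parse_address_specification(lexer, rules):
    t = lexer.get_token()
    if t == DOMAIN_ATOM_AT:
        parse_domain(lexer, rules)
        rules.append("address_spec -> domain_atom_at domain")
    elif t == QUOTED_STRING_AT:
        parse_domain(lexer, rules)
        rules.append("address_spec -> quoted_string_at domain")
    else:
        syntax_error()

def parse_angle_address(lexer, rules):
    if lexer.get_token() != LANGLE:     # terminal '<'
        syntax_error()
    parse_address_specification(lexer, rules)
    if lexer.get_token() != RANGLE:     # terminal '>'
        syntax_error()
    rules.append("angle_address -> < address_spec >")

def parse_mailbox(lexer, rules):        # assumed start symbol
    parse_display_name_list(lexer, rules)
    parse_angle_address(lexer, rules)
    rules.append("mailbox -> display_name_list angle_address")
```

Feeding it the token stream for something like `Alice <alice@example.com>` (one atom, then `<`, domain atom at, dot atom, `>`) walks every function above and records the chosen production at each return point.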
So when you say they have to be disjoint, you just mean that there can't be a common element in the first sets of any of them. Exactly, yeah, that's what I mean, right? So check the first set of one right-hand side and the first set of the other right-hand side; if there's any element in common, then they're not disjoint. They have to be completely different sets. Okay. Yeah, nothing in common, whatever way you wanna think about that. Any other questions? Cool. I'm sure as we get closer to homework two's turn-in, there'll be more questions. Okay. I know I just said it, but any other questions before we move on? Anything to do with parsing? Okay. Cool, so now we come to the question. Yeah. Dude, I just started, though. Okay, go ahead. Do all syntax errors come from the grammar being predictive? No. A syntax error just means that the input doesn't conform to our grammar. The predictive part means we can look at the next token, and that's why we always call getToken: by looking at the next token, we can decide which rule in our grammar to choose. The other way to do it is actually what the MailGun folks did when they wrote their parser. It's not predictive. It actually tries all possible parsing combinations. So it says, hey, first try to parse it with these rules. If that fails, then try to parse it with the other rules. And if we've tried all the rules, all possible combinations, and they all fail, then that's a syntax error. So would you only do that when the first sets aren't disjoint? Yes, yeah. You would do that if they're not disjoint, if there are common elements in the first sets. So actually, if you take the grammar that they have for email addresses, you'll find out that it's not predictive for exactly that reason. That's why I had to introduce those extra tokens, so we could disambiguate it. Yeah. So would that be ambiguous, or a regular language? So I don't think so.
So it's definitely not a regular language, because regular languages are regular expressions, right? I don't think that language can be described by regular expressions. And ambiguous means two different parse trees for the same string, which is different from predictive. Predictive means we can choose which rule to take. I'm fairly certain you can have a grammar that's not predictive but still unambiguous. It will always give you one tree; only one of the parses will be correct. It's just that you can't tell which rule to choose by looking at just the next token. So it could be that by looking two or three tokens ahead, you could tell. For our purposes, that gets very, very complicated; it's not as clean as this. So if you want to decide by looking at just the next token, I think you have to introduce new tokens, because the problem is really that you can't tell which rule to choose just from the tokens that you get back. You could maybe also add more grammar rules, but I don't know if that would always work. I think it would depend on the specific grammar. Let's go back. So I guess it would be about how clear you are. So for the homeworks and for exams, it's all about us being able to grade your work. So if you have one step and you say, this is never going to change because of X, Y, and Z, that's totally good, because you're explaining your thinking there. So yeah, it makes sense not to write those extra iterations. Otherwise, you know, you're still applying all the rules there, it just doesn't change. So you can also just write them out. Yes, your grammar has to be unambiguous for me in particular. So are we using the leftmost derivation? We are, I think, using the leftmost derivation, right? Or is it the rightmost derivation? Possibly. That's a good question that I'd have to look at a lot more. I don't know the answer off the top of my head.
So I think that we are doing the leftmost derivation, because we always recurse to the left, right? We're consuming the input from left to right. So could you start from the other way? Probably. I don't think it would change things significantly, so I'm not gonna say yes or no. Any other questions? Cool, all right, so now we get onto semantics. So, up until now, just to make sure we're all on the same page: we've looked at lexical analysis, right? That's all about defining how to turn a sequence of bytes into a sequence of tokens. So we've seen how to define regular expressions, and we've seen how to apply those regular expressions, how to disambiguate when there are multiple regular expressions that match. So we know how to go from regular expressions that define tokens to a lexer that's able to pull those tokens out. Then after that is syntax analysis, where we're concerned with specifying valid sequences of tokens and, specifically, turning those tokens into a parse tree. So semantics is very logically related, even though it may seem like we're shifting a little bit. With semantics, we're concerned about, well, what does that parse tree mean, right? Up until this point, you don't really care what you're parsing when you're parsing an email address, right? You just have terminals and non-terminals, and you can write a parser for that and parse it, but it's parsing words and domains and domain atom ats and quoted strings. You don't really care what it's parsing; you can write a parser for it and be able to parse it. So with semantics, we're gonna actually go ahead and talk about, well, how, for a programming language, can we define what the semantics are?
What's the meaning behind the language constructs, or the meaning behind this specific operator? What things are valid in our language and what things aren't? And we're gonna look at the interplay of, well, if you design it a certain way, there are things you can do when you're compiling it that maybe you can't do another way. Okay, so we want to define a programming language. We've written all this parser stuff; we're able to get a tree from it. But what does that mean? So how have you seen language semantics defined? Like, how did you learn Java? Or yeah, let's go with Java. Is that the first language that you learned here? Does anyone remember how that went? You banged your head against the wall, or your computer. Yeah, pretty much that. Just constantly; I taught myself Java. So it was a case of: check the wiki, try to code, check the wiki, try to code. I've definitely done that as well. Stack Overflow. What about in class, how do they teach you in class? Yeah. You relate it to something that you already know, with pseudocode? With pseudocode? So, implement something you already know. Wait, is that it? Explain it by relating it to something you already know, with pseudocode. Okay, is that how you learned Java? Well, I mean, I started with Visual Basic. Okay, awesome. We won't get into that example. I'm not proud of it. Okay. Okay, yeah, so you relate it to some pseudocode that maybe you know, or you see it and you relate: okay, this is a Visual Basic implementation of this pseudo algorithm. So I'd say that's maybe kind of learning by example, where you're looking at examples. What else? Maybe read a book. Did you just read the book? We did. They would give us examples of, you know, what it looks like.
Yeah, rather than just memorizing certain sections of it, they would kind of focus on one small part at a time. Okay, yeah, so kind of like memorization: here are the rules. They were handed down to us from on high, from, I think, James Gosling in the case of Java, right? So, you know, you just learn them and apply them. Anything else? Maybe read the Java specification. You say it like it's a joke; it's a real thing. Yeah? Did you learn it in a Java class? Right, so they teach you the various concepts in the language, probably by examples in class. But yeah, you could read it online: there is a language specification for Java. We'll get into it. But what properties do we want from language semantics definitions? So, say I tell you I just invented this really awesome programming language. It's gonna change your life, and you're gonna be able to create a startup around this language, and you're gonna be 10 times faster than the competition, in half the memory, at four times the speed, whatever. So you say, that's awesome, I really wanna learn that. How do I know what your language is and what it means? How would I tell you, or give that to you? What would you want from that? Different variable types, something more high level, right? So, I mean, one way would be, I just write it down. I give it to you and I say, hey, here's what everything in my language means, in English. Is that good? Would you like that? It's not a wrong answer; you had your hand about three-quarters up? Yeah, like on GitHub, maybe, like documentation. So I give you documentation on GitHub. What about the language itself, right? Like, how do you know how to write a program in that language? But you want the words to be descriptive.
The words to be descriptive, yeah. So we want, and sometimes we want it to be precise, right? Descriptive, I'd say, is another word for precise. You wanna know, when I write this code, what is it going to do after my program gets compiled and executed? What does this code mean, precisely? So I'd say preciseness. What other types of things would you like? Yeah? Consistency. Consistency, yeah. I'd say that's a language design goal, what they call the principle of least surprise, right? You don't wanna use the plus-plus operator and have it do something completely different in your language. In terms of documenting the language semantics, I'd say you want some type of predictability, right? So that you can read the specification, understand the semantics, then look at some code and say: I know exactly what that code's gonna do, based on what this document the professor's given me says. So you want some form of predictability. What if I just gave you a one-page document? Would you be suspicious of that? Yes. Why? Because I feel like there'd be points where I'd have to guess what you were trying to think. Right. So as much as we probably dislike going through hundred-page programming language specifications, the flip side is, if your specification isn't complete, then you may be able to write a program and say, huh, this program is syntactically valid, but your specification doesn't say what it should do or how it should operate. So what's gonna happen when I run this? Is it gonna do anything? Is it gonna explode? Is it going to do the right thing half the time and the wrong thing half the time? Is it gonna depend on what compiler I use?
So that's another thing, right? If other people take your specification and try to implement a compiler for it, is it gonna be the same? Are you gonna be able to write the same program and have it run the same way twice? So we've kind of been hinting at it: what are some ways you can specify these semantics, hopefully with some trade-offs between these three properties? The first one is maybe the example I've already given: English. Use English, or whatever language you're most comfortable in, but the standard technical language is English. So write it in a human language, specifying exactly what the semantics of your program are. Any other ways? Is this the only way? Yeah. Pseudocode. Pseudocode, do you wanna explain that? Maybe related to one of the earlier examples. We'll see. So, think about how Python started. Was it readability they wanted? Almost. So it's definitely not an English specification, right? Can I define a new language if I don't have a specification to go from? Encapsulation. Encapsulation of what? Okay, well, I'm thinking more in terms of deliverability, right, how I communicate my specification to you. One way I can do that is write a big document. What would be the other way? Yeah. That's a good one, I hadn't thought of that; it's actually also related to the third one. So yeah, you can relate it to something that people already know. Can we do it through implementation? Yeah, so implementation, right? I just give you the compiler and say, bam, here's how this language works. If this program compiles, it's a valid program in that language, and whatever this compiler does with your program, that's what it should do. That's my language.
So yeah, that's called the reference implementation, right? And honestly, that's how a lot of programming languages start. You're developing this toy programming language for yourself and you think, oh, maybe people would like this. So, hey, here's a compiler for this language I wrote. There's not a lot of documentation, but whatever the implementation, the compiler or the interpreter, does, that's what the programming language should do. Okay, so this third way is related to some of these other concepts, yeah. So what you're saying for that reference implementation is, you give someone like a block of code, and there's nothing else to go on, no documentation? Is that what you're doing? Well, at the most basic level, you give them no documentation, you give them a program. You say, here's my compiler. Now, you can still easily specify the syntax, right? You can give them the tokens, you can give them the grammar in exactly the format we've been looking at; that's how people define grammars. Those are very easy to precisely specify. But for what it does, what some people do is say, hey, here's my language, here's the compiler for it. Obviously, you probably also give some examples or something to get people started, but the fact is, it's not a Python program unless it runs on the standard Python interpreter. Or at least it used to be that way; it may have a specification now. And then there's maybe a third way, which is kind of related to these. Anybody here a math major? Yeah. So we can actually formally define our language using math. We can specify the exact semantics of every single operator and construct in our language using formal language. We'll look at some examples.
So I've pulled it up. Anybody wanna take a guess at how long the latest C documentation is? Well, it's not the latest, it's just C99 that I found. Guess how long it is. I feel like a higher-or-lower game. That was lower; all right, now we're just doing binary search. So, 530 pages long is the specification. And it contains language like this: "An identifier can denote an object; a function; a tag or a member of a structure, union, or enumeration; a typedef name; a label name; a macro name; or a macro parameter. The same identifier can denote different entities at different points in the program. A member of an enumeration is called an enumeration constant. Macro names and macro parameters are not considered further here, because prior to the semantic phase of program translation any occurrences of macro names in the source file are replaced by the preprocessing token sequences that constitute their macro definitions." So if you really study this, it makes sense given what you know about C. And this is just one paragraph, the paragraph that defines what an identifier is. This is what an identifier means in C, per the standard. So what would be some problems with using English, or some benefits? Yeah, you don't know how to actually use it. Yeah, if there aren't many examples, it would just be a lot of language like this, and you'd read the whole thing and be like, okay, I don't really know what I'm supposed to do here. Is this the right way to write it or is it not? And remember, this is just one paragraph that I took. There are like four after it, just in that one section, that clarify exactly what an identifier can be used for and what it means. This paragraph specifies what things an identifier can denote; somewhere else there will be equally long definitions for an object, a function, a tag, a member of a structure, a union, an enumeration. All of these things have to be precisely defined.
You can think about it as legalese for programmers, because what happens if there's ambiguity? Say it's unclear what an identifier can denote; it's just a name. Then what is the compiler gonna do when you feed it that? Yeah, so we don't know, as programmers, what the compiler is gonna do when we feed it a program. And if I have Microsoft writing a C compiler, and, who else writes a C compiler? Apple, the open source community, GNU, like GCC, right? If one of them says, well, I think a thing is this, and another says, no, I think it's these things, then now you have two different compilers that compile programs differently, and in effect you maybe have two different languages. So you have a huge headache and problem there. So yeah, it could be ambiguous, or it could be ignored. This is just a piece of paper, or not even that anymore, right? It's just a document, a PDF somewhere that some committee agrees is the latest C standard. But if no compilers implement it, does it matter? You can specify that if it's a C program, it's got to do X, Y, and Z, but if the compiler writers are lazy and nobody writes a compiler that does that, then the language feature is gone, even though it's in the specification. Or it's implemented but not correct. Does that ever happen to anyone? I hope not, but the compilers themselves could have bugs, right? If the compiler has a bug and doesn't conform to the specification, and you're trying to use that compiler, well, you're kind of out of luck. Maybe you file a bug with them, but yeah, you could read the specification, know precisely what it should do, and have that not be the case. And we kind of talked about this, this is about completeness, but what about something the specification doesn't mention?
Is that a good thing? Is that a bad thing? Have you ever encountered that in your programming careers? Yawning, that's a good sign. We're actually gonna see an example of this later, right? There are certain cases where you can write a program that's syntactically valid, that passes type checking and everything, but still contains undefined behavior. And so maybe on some compilers and operating systems it does one thing, but on others it does another thing. So this is the trade-off: if you made the specification a thousand pages long, maybe you could identify all those cases and exactly what they should do, but is anybody gonna read that thousand-page document? But I'd say English specifications in general are good for having multiple implementations of the same language, because you can say, well, everybody should code to the spec, and then you fight about what the spec means and hopefully clarify things and come to a consensus. It's a long, difficult process. Okay, anybody program Ruby? Some people? Okay, so up until 2011, Ruby was specified by a reference implementation. Matz is the name of the guy, a Japanese developer, who created Ruby, and they had documentation, but the specification was: hey, the semantics of the Ruby language are defined here, by this Ruby interpreter, Matz's Ruby Interpreter. So any program that the reference implementation runs is a Ruby program by definition, and it should do whatever the reference implementation does when it executes that program. So what are some of the pros and cons here? If I give you minimal documentation, but I give you a compiler or an interpreter, what are you gonna do with that? Is it good? Is it bad? Well, it avoids the problem we just had with English-language documentation, where the spec can be ambiguous or incomplete.
Yeah, so I'd say it's precisely defined, but the caveat is: on a given input, right? The trick here is, if you have any question, like, hmm, what does this language feature do in this really weird circumstance where I, I don't know, take an address, add something to it, and then call some eval on it, whatever — you can actually write a program that does that, run it on the reference implementation, and whatever happens there is Ruby behavior; that's how Ruby's going to behave. But what would be the downside? So here it's precisely specified if you give it an input. What are some downsides? Testing by example doesn't really tell you the edges of what something really is. Yeah, fuzzy edge cases, right? And related to that: let's say you test some edge case and it does x, y, and z, whatever, it doesn't matter. Is that actually how the language developer intended it to work, or is it a bug in the reference implementation? So what would happen if there are bugs in the reference implementation? There'd be bugs in the program. But then, say you've released Ruby 1.0, whatever, and you wanna fix some weird edge case; can you do that? You gotta think: you've released this program, and people may be writing software that relies on this behavior, because this is the reference implementation, this is Ruby.
So if you change that behavior, you could break applications that are relying on those edge cases, which are really a bug you didn't intend. But now, well, that is Ruby, that's the Ruby language. So you have to be very careful: maybe updating to Ruby 2.0 has to be rethought and redone, or you have to analyze usage and all kinds of crazy stuff to say, well, if I can show that a very, very small percentage of people are relying on this one behavior, then maybe I can fix it or change it. But yeah, bugs become part of the language, in essence, oftentimes. What about if I write an interpreter for a language, and I write it in, let's say, C, but I write it on Linux, and you, well, nobody in here only has access to Windows machines, but let's say you did, right? Is that a problem? Basically, too bad for you. Yeah, or maybe it just straight up doesn't run on your system, right? And so here's this language; how are you gonna write a program in a language where you can't even run the interpreter, when the interpreter is the only definition of the semantics of that language? If you had a document, right, a 500-page document, if you were stuck on a desert island with a super weird computer architecture, you could take just the C documentation and write C programs on paper, right? By checking the documentation, you could validate and verify that they are actually valid C programs, and that they should do exactly what that specification says. When you get to a computer, you could type all that up and compile it, right? It would be terrible, but you could do it. But with the reference implementation, if you can never run it on your system, you can't tell if your program is valid or not valid, or how it should behave.
So yeah, to me that's one of the real downsides here: the portability of the reference implementation. Okay, so, formal specifications. This is actually kind of cool: you specify the semantics of the language constructs formally, and there are different approaches; I'll show you an example of one approach next. So what are some pros or cons here? The what? There shouldn't be any ambiguity. Yeah, so the pro would be no ambiguity, right? Everything is precisely defined. Usually, the way it works is you define some abstract machine, and then you say, okay, for every single operator in the language, this is how it works on this abstract machine, and every implementation has to do it that way, otherwise it's not valid. So you can very precisely define that, and you can make sure that all parts of your language are defined. And some other really cool side effects: you can actually then prove properties about either the language itself or about programs written in that language, using the formal specification, because you know exactly the semantics of all the operations. What about downsides? Why don't we do that for everything? Sounds awesome, yeah. It can be difficult to understand or hard to read. Definitely. I guess we could argue whether it's less difficult to read than an English specification of C. I'll show you an example, and when you see it, your eyes are gonna go huge and you'll be like, this is crazy. But once you build it up piece by piece, right, just like learning a programming language, you learn it piece by piece and by examples, and then you expand it into this big thing, it's not as scary, and it's actually kind of easier. It can be easy-ish to understand.
So, this is an example for JavaScript. This is JavaScript formally specified by Professor Ben Hardekopf and his students at UC Santa Barbara; he was on my PhD committee. So this is their abstract interpretation semantics for JavaScript. I can give you a brief overview: you're specifying, for each operation, how you change the state of the machine, where the machine is specified by a set of names and locations and values. In this column here, you have all the operators and operations in the language that are specified formally. And here, let's see, these are premises. Oh, this one is specifying an if. So if this premise holds, if the condition is true, then this is the resulting state of executing that construct. Is this super easy to understand? No, and it shouldn't be, right? You have to study this for a while to understand what it all means. But the point is to show you that you can do it, and then you can do cool things like having a provably sound static analysis for a crazy language like JavaScript. JavaScript is very dynamic and very weird. Any other questions on this? Okay, so we talked at a high level about what semantics is, and about some of the ways we can define the semantics of a language. But now let's drill down: what are the things that we care about semantically? What do we need to give semantic meaning to in a programming language? This is not rhetorical. What are some things you need to know? Yeah. Conditional statements. What do conditional statements do? And related to that, how do the Boolean operators work? The double ampersands and the double ors, right? In most languages it's short-circuit evaluation.
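The flavor of a rule like the if rule described above can be sketched in generic big-step operational-semantics notation (this is a standard textbook form, not the exact UCSB formulation): if the condition evaluates to true in state σ, and the then-branch transforms σ into σ′, then the whole if-statement transforms σ into σ′.

```latex
\frac{\langle e, \sigma \rangle \Downarrow \texttt{true}
      \qquad
      \langle s_1, \sigma \rangle \Downarrow \sigma'}
     {\langle \texttt{if}\;(e)\;s_1\;\texttt{else}\;s_2,\;\sigma \rangle \Downarrow \sigma'}
```

A symmetric rule with `false` in the premise and \(s_2\) in place of \(s_1\) handles the else case; a whole formal semantics is just a collection of rules like this, one or two per construct.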
So when you write a program you can rely on that; you know that the second clause is only gonna execute if the first clause is true in an AND expression, or false in an OR. So you can rely on that behavior. Whereas in another language, it could be the semantics of that language that it evaluates every single operand of an OR or AND expression. What else? Actually, that's a good example, also about ifs: there are languages that define if and also define an unless operator, which is exactly the opposite of an if. Unless this statement is true, execute the following code. In some cases that can actually improve readability, because instead of "if not this thing, then do this", where you have to think, oh, that's only gonna happen if that condition is not true, you can say "unless this condition, do this". What else? Yeah, typing. So what are the typing rules? Do we even have types? How do types work? What can we automatically convert between certain types? Can we do that? Can we not? Do we have to specify it? How do we cast types, and what does that mean? Yeah, types, definitely. What else? Memory management. Yeah, I'll kind of throw that under variables. So, variable naming: how do variable names work? How do we refer to variables in another file, in another program? How do we link? Can we even do that? How do we call functions that other people wrote in other libraries? So yeah, variables, and then memory management: when do things get freed? When do they not get freed? Where does that come from? Do we have to worry about that, or do we use a language where we don't have to worry about it? What else? What other parts of languages do you know about? Yeah? Operations. Operations? In what sense?
Yeah, so the semantics of all the operators: the plus operator, the subtraction operator. In some languages those are actually defined as functions. So there's a function called plus, and there's a certain syntax that says this can be an infix operator, so you can write something plus something else, or you can just call plus with two arguments. So yeah, there are gonna be some built-in operators. Think about C: what does the star mean? What does the ampersand mean? What else? Yeah? Functions. Yeah, functions. How do functions work? In some languages you only have procedures, right? Functions can't call other functions; you call one function, it does something, and then it returns, it's done. That would be one part of the semantics: what does a function mean? Anything else? Yeah. Scope mapping. Scoping, yeah, we're gonna get into that, kind of under variables. How can you refer to a variable by name, and how long is that name valid for? And, kind of connected to functions: what are you passing into functions? Parameters, right? How does that work? How are parameters passed to functions? Does the function get a copy of that parameter? Can it change the parameter that you pass in, so that the caller can then see that change? So what are the semantics of parameters? Okay, so we talked about types, we talked about operators, what other kinds of language features are there? Yeah? Scoping. Scoping? Yeah, that's under variables, we're gonna actually get to that next. What was it? Comments. Comments? I'd say that depends on the language, right?
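On the parameter-passing question above, a quick Python sketch shows the two behaviors in one language: mutating the object the caller passed in is visible to the caller, while rebinding the parameter name is not.

```python
def mutate(items):
    # Mutating the object the caller passed in: the caller WILL see this.
    items.append(99)

def rebind(items):
    # Rebinding the local name to a brand-new list: the caller will NOT see this.
    items = [1, 2, 3]

data = [7]
mutate(data)
print(data)  # [7, 99]
rebind(data)
print(data)  # still [7, 99]
```

Languages differ precisely here: C passes copies of values (you need explicit pointers to get the mutate behavior), while Python always passes object references but makes rebinding purely local.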
So yeah, in a lot of languages, comments are just syntactic elements that are basically white space. But in a language like Python, the doc strings on each of the functions are actually part of the metadata of the function, and you can query the doc string of a function at runtime. So in that case, those comments are part of the language's semantics. Yeah. Conditionals. Conditionals, yeah. So I'd say conditionals, while statements, do-while, if there's a do-while. How do those work? What are the semantics there? When do they execute? When do they not execute? You gotta think back to when you were first learning about programming. All these things you learned about, right, had some semantic meaning, and you were just learning it as, that's the Java thing, this is how programming works. But then you learn another language and it's like, oh, this does things differently. What about exceptions, right? How does exception handling work? Do you have a language like Java, where you have to precisely declare what exceptions can be thrown from your function, so that anybody who calls you has to catch those exact exceptions? Exceptions can be very crazy. In some languages, if there's an exception, it goes back up the stack, the caller can do something, and then actually go back to the function that had the exception and continue execution. So you can have all kinds of crazy stuff. So yeah, we talked about control structures. Something else around there: constants, right? Are there constants? Like in C++, you can specify that certain variables are const, which means they shouldn't be changed. Whether that's actually enforced or not depends on the language specification itself. And methods in here? Oh, yeah, we missed object orientation, right? How do classes work? How do methods of classes work?
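The Python docstring point above is easy to check: unlike an ordinary `#` comment, a docstring survives into the function object's runtime metadata, so it really is part of the language's semantics rather than discarded white space.

```python
import inspect

def area(r):
    """Return the area of a circle with radius r."""
    return 3.14159 * r * r

# An ordinary comment like this one is gone after parsing, but the
# docstring is attached to the function object and queryable at runtime:
print(area.__doc__)         # Return the area of a circle with radius r.
print(inspect.getdoc(area)) # same text, cleaned of indentation
```

Tools like `help()` and documentation generators rely on exactly this metadata.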
What happens when we have subclasses? How does inheritance work? How do we know which function's going to get called in the object-orientation hierarchy? Yeah. So would, like, visibility fall under this? Yeah, visibility. I'd put that under classes. That's a thing that specifies what you can access from a class, from outside that class, from subclasses or child classes, all that kind of stuff. So these are all the things that we've got to define. Yeah. Keywords? Yeah, that's actually really good. I don't think it's on this list, but yeah. You gotta think of them kind of as the built-ins, right? What do these keywords mean? I guess the other thing to point out that's not on here: we don't really care about the standard library at this point. So, the standard API functions — if we're talking about C, we don't really care about printf and those kinds of things. They should have documentation, but however they work, they're still expressed in terms of the language semantics, because in the end we're just calling a function. Here we're talking about the semantics of the language itself. Yeah, statements and expressions. Yeah, that's a good point, that's not on here. So there are various rules. Like in Java, the condition of an if or while statement has to be a boolean, so accidentally assigning with = where you meant to compare with == usually won't compile; in C and C++ it will, and that can cause some confusion and problems. Yeah, that's good. Okay, so we're gonna start with variables. We're gonna touch on a lot of these areas. And what I really want you to focus on and get out of this is that these are all more or less arbitrary decisions that a person or a team made when designing these languages. So these things aren't set in stone.
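The statements-versus-expressions point can be seen across three designs: C treats assignment as an expression (`while ((c = getchar()) != EOF)` is idiomatic), Java only rejects it when the result isn't a boolean, and Python historically forbade assignment in expressions entirely. Since Python 3.8 there is a separate "walrus" operator `:=` that opts back in explicitly, so the classic `=`-for-`==` typo remains a syntax error:

```python
# Plain `=` inside a condition is a SyntaxError in Python, which rules out
# the classic C bug of writing `=` when you meant `==`.  The walrus
# operator `:=` (Python 3.8+) re-enables assignment-in-expression,
# but only when you ask for it by name:
chunks = []
data = iter(["ab", "cd", ""])     # simulated read loop; "" signals end of input
while (chunk := next(data)):      # assign AND test in one expression
    chunks.append(chunk)
print(chunks)  # ['ab', 'cd']
```

The design choice here is the semantic decision the lecture is pointing at: the same surface feature (assignment) can be a statement, an expression, or both, depending on what the language designers decided.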
It's not that all languages have to be like this. We're gonna talk about some of the standard semantics of these different aspects. So the first thing we're talking about is declarations. Some constructs, as we know, have to be declared before they're used, right? Variables are a construct, and in some languages like C or C++, we have to define a variable before we can use it. And often we want to associate the declaration with a specific name. So we want to say: I'm declaring a variable foo here, and it has type integer, and this is how I specify that in the language. So this would be like: int i. This is a variable declaration; we're declaring a variable whose name is i, which has an attribute that is its type, the specific type int. But this isn't the only way to design a language, right? So how does Python do declarations, do variables? (I'll say Perl is similar, but I don't know enough to make that point.) I'm not talking about types here; type is just an attribute of a name. It's still a variable, right? We can assign to it, we can change it. But you don't have to declare it before you use it. You just say, hey, target is equal to test_value plus 10, and now target is a new variable. This can be handy, because you don't have to say: in this function or this block, I'm going to use a variable called target and a variable called test_value and so on. When you need it, the language automatically creates a new variable for you.
A downside here is that if you accidentally fat-finger target, forget a letter or add an extra one, it'll gladly create a new variable for you and your program just continues and does the wrong thing. I've personally done that a lot. Okay, so the main question when we're talking about variables, and this is what we mean by scoping rules: once a name is declared, how long is that declaration valid, and what other places can refer to that same declaration by name? So what are some options? Global? Yeah, so the entire program can access that variable name. You could think of a programming language where every variable declared is accessible by anything in your program. It could happen. What are some other boundaries or extents you could have? Private. Private, so in that case, only in that class definition. And that could be the same file or different files, depending on how the classes are defined. Yeah, what else? Only a function. Yeah, sometimes we only want a variable to exist in a specific function. What about an entire file? That's kind of like a private variable: we don't want other files to be able to refer to our variables, but we want this file to. In that case, the file level is how things are divided up. What about really global? Think about something crazy: what if you had variables that could be referred to by every program ever written in that language, right? You'd write some variable and it would be globally accessible to anybody, ever. Does this ever happen? Not really, in the case of programming languages, but.
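The fat-finger hazard described above is easy to demonstrate in Python: since assignment silently creates a fresh variable, a typo produces a new name instead of a compile error (the names below are just illustrative):

```python
target = 0
test_value = 32
target = test_value + 10   # intended update: target is now 42

targett = target + 1       # typo!  silently creates a NEW variable
print(target)   # 42 -- the "real" variable never got the increment
print(targett)  # 43 -- the accidental one holds the result
```

In C, `targett = target + 1;` without a prior `int targett;` declaration is a compile-time error, so the typo is caught before the program ever runs. That is the trade-off the lecture is describing.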
Now you have to contend with other people and make names that are super unique, yeah. What about APIs called by a third party? Yeah, usually I'd say those are namespaced by whatever API you're using. What about more global? Yeah. IP addresses. IP addresses, yeah, that's a good one. What about on top of IP addresses? Ports? I'd say ports are mapped to IPs, so we can have two identical port numbers that refer to different machines. And how do you get to an IP address? Hostname, yeah, so DNS, right? DNS is a global system where you have to declare a globally unique identifier. Anybody do mobile development? Yeah, so what about there, what has to be global? Web services? Kind of. I'm thinking of how the package names of Android applications have to be global, in the sense of marketplace-wide unique. That's why most Android apps are named like com-dot-something-dot-something. So that's how you get that. Okay, so the flip side of this question is: when we see a name, how do we map it back to the declaration that it refers to? Scope is what we call this: what's the scope within which this declaration can be referred to? And so there are two important things here. How long is a declaration valid for, and how do we resolve a name? When we see a name, how do we know what declaration it refers to? Okay, that's not a question, that's time. We'll pick this up next class.
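Those two questions, how long a declaration lives and how a use of a name is mapped back to it, have a concrete answer in Python: each name is resolved by searching outward through nested scopes (local, then enclosing function, then module-global, then builtins), so the same name can denote different declarations at different points in the program:

```python
x = "global"

def outer():
    x = "enclosing"          # shadows the module-level x inside outer
    def inner():
        x = "local"          # shadows both outer scopes inside inner
        return x             # resolves to the innermost declaration
    return inner(), x        # x here resolves to outer's declaration

print(outer())  # ('local', 'enclosing')
print(x)        # 'global' -- the module-level binding is untouched
```

Other languages make different arbitrary choices here, which is exactly the point: scoping is part of the semantics a language designer has to pin down.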