 Can you hear me in the back? Yes? Yeah? Good? Okay. Great. All right, let's get started for today. First off, we're going to kick off with some kind of housekeeping class related issues. First thing, the recordings, I'm trying to find the optimal path to get those out to you, and it seems that the first path I chose is very slow, so I will be announcing that on the mailing list when those are up, and I'm going to try to get it so that those will be posted a lot earlier than they are now. Questions on that? What will be posted? The recordings. So like right now I'm recording the lectures plus the slides. Any other questions? All right. Okay. So first thing, Homework One is being assigned today. So the only thing that we will use Blackboard for is homework submissions basically. So I've created a Blackboard site for 340. You should all have access to it if you don't. Let me know. We can click through to this Homework One, do, you can click through to it, and here's where the assignment is. So the assignment's right here. Let's see if I open that up. It's going to happen. Three questions, multiple sub-questions, goes over what we were talking about in class. It's going to be due next Wednesday because we don't have class next Monday. So we have a little bit of extra time there. Please submit it as a PDF file. So no, don't submit a doc. Don't submit a JPEG version of a thing you took with your phone. Those are fine. You could do that. Just convert it to a PDF first because that allows me and the TAs to give you feedback through Blackboard. So they've specifically requested that I request from everyone, only submit PDFs. And this is really important because it will help us give you feedback about how you're doing on the assignment. So if it turns out you completely, let's say, you made one tiny mistake in the middle of a problem, right? We can actually give you feedback and say, here's exactly where you went wrong. 
You've got to fix that in the future so you don't lose points on the exam. Questions? Pretty simple. Second thing, Project 2. Okay, Project 2 is also due on the 9th before midnight. So you have a week and a half. This is slightly more complicated than Project 1, but not incredibly more complicated. So the idea is you're going to be given a lexer. We will give you a lexer and in a .c and a .h file. And you are going to implement a program using the API of that lexer. You're going to write it in C or C++ again. So the API for the lexer is here. There's two important functions. I'm just going to briefly go over this at a high level. The details are all in on the website. The API for the lexer has two functions, getToken and ungetToken. So getToken reads a token from standard input and returns the metadata to you. UngetToken just causes the next call to getToken to return the same value and to not read from standard input. So this would be maybe if you wanted to peek ahead in your lexer to see what the next token would be. You can then say ungetToken to put that back. And then to get the return value of getToken, there are three global variables defined in the .h file. These will be set after the call to getToken appropriately. So the type of the token, the actual current token, if the token is one of these types, the length of that token, so the length of that string, and then the line number. And so here is the lexer file. I'll just briefly go over this. So these are where the variables are declared. Here is the getToken function, and here is the ungetToken function. So you can read through the .h file and .c file to understand how they work. And the descriptions are here in the file. OK. So questions about the lexer? OK. So the lexer is what you use to actually write project 2. And so what project 2 is, you're going to read all the tokens from standard input using the lexer functions. You're going to store that plus additional data into a linked list. 
And there are specific conditions here that you have to follow about what tokens to store in this linked list. Then you have to output in reverse order those specific tokens in this format, line, the token type, and the token value. So here, let's look at an example. So this was your example input with four lines. You're going to call getToken, getToken, getToken, fill the linked list of these tokens and output in reverse order. So the last thing is going to be 200, which is a number, 100, which is a number on line 3. Line 3, there's a number, 340, 2, there's an ID with programming, and so on and so forth. And then let's see if Mohsen got this set up. So yep, OK, good. So we have project 2 already on the submission site. So you can get the c file, the .h file, two test cases, and then a place to submit. There's four test cases, two of which we're giving to you, two of which we're holding back. The only, well, the big difference here in the grading is we're actually going to grade how you program and the style that you use, or not style, but what function to use. So we're going to do the automated test cases, but the TAs and I are going to also look at and analyze your code so that you're following these guidelines, but you actually implement a linked list, so that's required. And that you're using the C allocation functions, malloc, or Calloc, and freeing that input before your program quits. This means you're not allowed to use the C++ new operator. So those are the only restrictions. Questions? OK, so no questions, so that's good. Obviously we're here for office hours or the mailing list, all that kind of stuff. If you have questions, clarifications, please bring it up. Yeah? Are there any libraries that you don't want to use? I don't want to stand. Good question. Yeah, that would be fine. I'm fine with that if you use a standard linked list function, as long as it's in the C++ standard library. 
So if you're using, including somebody else's linked list, but that could be overkill. You could probably implement a linked list very quickly, with four or five lines. Other questions? Submission? Do I want to submit now? No, because what if my tests fail? I haven't tested that. Cool. Acts the material. So to go back to what we've been talking about, we've been talking about regular expressions and how we can use regular expressions to define a pattern or a language, a set of strings that we want to match these regular expressions against. So we looked at the star operator. We understand how it works. It basically means zero or more. We looked at the bar operator, which is, we think of as an or. And we looked at, let's see if I have concatenation somewhere here in this example. No. We have the concatenation operator, which concatenates strings together and matches strings with concatenated symbols. Any questions on regular expressions before we go forward? Is Daniels hot in here, or is it just me? Hot. Yeah. Okay. I'm going to have to make another complaint. Nobody can see a thermostat that we can mess with. No. Seems dangerous. Okay. Well, I'll ask. I'm going to get more info about the room to see if I can mess with those. Okay. Well, let's carry on. Okay. So tokens. So remember, the whole goal, the reason why we started talking about regular expressions is because we needed a way to specify, hey, I'm interested in these patterns, because these are the tokens that the rest of my program interpretation, my compiler or my translator, needs to interpret. So what we're going to do is we're going to define our tokens by regular expressions. So, for instance, we might want to define a token called a letter, right? And so a letter would be all of the lower case English letters and all the upper case English letters separated by the bar operator, right? So this regular expression, how many, so what length of strings is this regular expression going to match? One or more? 
Who says zero? One. One or more? Nobody? One person? So there are no star operators here, right? The dot, dot, dot is indicating, hey, A through Z lower case and A through Z upper case, right? So there are only bar operators here. There's no star, which means that it's only going to be matching strings of length one. Okay, so how would we define a digit? Yeah, so zero through nine separated by the bar operator, right? So exactly the same way here. Okay, so I'm trying to think. We actually maybe went over this a little bit on Wednesday, didn't we? Yeah. Yeah, okay, I was just looking at ID. Okay, cool. All right, that's good. Yes, okay, so this is a good little review. Okay, so does this string match that regular expression for ID? Yes. Yes, it starts with a letter and it's followed by zero or more letters, digits, or underscores, right? What about this string? No. No, why? Because it starts with a number. It starts with a number. Awesome. Your memory's better than mine. Okay, so these tokens, we want them to be a little bit of an abstraction, right? We don't want to deal with digits and letters, right? We want to deal with tokens and words, thinking about it in English. So the question is, how do we define a number as a regular expression? So what do you think? How do we define a regular expression that defines a number? Anyone want to take a start? Yeah? In the front? Yeah. So the first question should be, what do I mean by a number? Right? Do I mean the natural numbers? Do I mean decimal numbers? Do I mean scientific notation numbers? Do I mean Roman numerals? Right? So it's important to talk about what we're defining. So we're going to go with just the, well, I'm going to use the wrong term. We're going to do zero up to infinity, right? So I think the natural numbers. So we're not going to include negatives. We're not going to think about that yet. So that's all we want to define. So no decimals. Okay.
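Looking back at the letter, digit, and ID definitions from a moment ago, here is a minimal sketch of how they might be written with Python's `re` module. The names `LETTER`, `DIGIT`, `ID`, and the helper `matches` are made up for illustration, and the character classes are shorthand for the long chains of bar operators on the slide.

```python
import re

# [a-zA-Z] is shorthand for a | b | ... | z | A | ... | Z,
# and [0-9] is shorthand for 0 | 1 | ... | 9.
LETTER = r"[a-zA-Z]"
DIGIT = r"[0-9]"
# ID: a letter followed by zero or more letters, digits, or underscores.
ID = rf"{LETTER}(?:{LETTER}|{DIGIT}|_)*"


def matches(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


print(matches(LETTER, "a"))    # a single letter
print(matches(LETTER, "ab"))   # two letters: LETTER has no star, so length 1 only
print(matches(ID, "abc_1"))    # starts with a letter
print(matches(ID, "1abc"))     # starts with a digit
```

Using `re.fullmatch` rather than `re.match` is deliberate here: the question in class is whether the whole string is in the language, not whether some prefix of it is.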
So the first thing we try right is digit star. So is this correct? No, because it has the empty string. Do you agree? Yes. Right. Is the empty string a number? No. No, right? That doesn't make sense. The empty string shouldn't be a number. We wouldn't want, well, unless we say nothing is not a number, right? But we care about the number zero. We don't care about a string with length zero. So we have things like, okay, 1, 3, 2, right? 132. Is that matched by this regular expression? Yes. Yes. Right? It's zero or more digits. But then we have the empty string. So how do we fix that? Digit. Okay. Somebody raise a hand. Raise a hand. Yeah. Digit, digit star. Yeah. This is good. So we're going to say we want at least one digit followed by zero or more digits, right? So is that going to allow the empty string? No. Why? Because there's only one star operator. Because there's only one star operator, right? So the star at the end means ending with zero or more digits. But the digit at front means it has to start with zero through nine, one of those characters, right? 1, 2, 3, 4, 5, 6, 7, 8, 9. And the empty string is not in there. Therefore, this regular expression is not going to match the empty string. So it's going to match 1, 3, 2, right? 132. Yes. So the 1 matches the first digit and the 3 and the 2 match the digit star. Right. That make sense? What about zero? Yes. Yes. Right? So it's going to match the first digit and zero or more is going to match the empty string, zero, zero. So that's going to match our regular expression. What about this? Yes. Yes. Do we want this? Somebody raise your hand. Do we want this? Yeah. No. Because if you ended up having to do something like this, you could end up with an infinite loop or at the very least wasted memory, trying to figure out how many zeros. I'm not sure I agree. In what part? What do you mean wasted memory? Where? Yes. If you had a program that was trying to generate numbers, it's just wasted. 
Because numerically, one zero is the same as infinitely many zeros. Yes. So that I buy. So I would buy that, yeah, in the sense that, well, it doesn't really make sense, right? I mean zero through infinity, right? And this many zeros, I don't even know how many it is, right? It still represents the number zero. It doesn't matter how many zeros you add. So it's not really a number. It's not really the language that we're trying to define or the token that we're trying to define. And the thing to think about, so what does your favorite programming language do when you input this? Anybody know? What happens if you input this into Python? If you did this many zeros plus 1, would it work? Would it parse? Would it execute? Yes? Is this from experience? Does this give you 1? Right, it'll give you 1. So, yeah, actually I tested this. In Python, it does work. I guess it makes sense to be a little more flexible. But let's say we're trying to be more precise, right? We don't want to allow somebody to write this many zeros because it may actually be a mistake, right? Maybe you forgot a 1 at the front. So instead of adding a million or 10 billion, whatever that is, to your number, you're adding zero. So then how can we get rid of it? Can we, I guess, can we get rid of it using regular expressions? Yeah, doing or zero where? So how would you change that regular expression? Do you want to think about it while I go to somebody else? All right, we'll go in the back. We don't want to effectively redefine digit. We could use a not operator. We don't have a not operator, yes. So don't think about nots. It messes things up. So it would just be 1 through 9? Right, so there we go. That's one thing we can do. So we can't define a not. We can't say not zero. But we can define, hey, anything that's a digit that's not zero, right? So we can define a new thing we call a p-digit. That's 1 through 9. So excluding zero.
So then how would we, well, one way to do this would be, let's say, start with a p-digit and then follow it by any number of digits, right? So there's a hand. Let's go there. You can't represent zero. Right, exactly. So why? But why? Which part of the regular expression doesn't match zero? Exactly, the p-digit. So this would match 1, 3, 2, right? It starts with a p-digit and is followed by any number of digits. But it doesn't match zero. It also doesn't match all zeros, right? So that's good. So then now what can we do? Right, now we can or that entire regular expression with zero, right? So if we do p-digit followed by any number of digits, or zero, now we've matched a lone zero by itself and we've matched any number that starts with anything but a zero, so 1 through 9 followed by any number of digits. So does it match 1, 2, 3? Are you sure? Yes. Yes, okay. What about zero? Yes. All zeros? No. No, why? Yeah. Because we only have the option of having one zero. Exactly, so even though we have the or, right, the or just means or one zero. So it's either a p-digit followed by any number of digits, right? This doesn't start with a p-digit, so that fails. And it's not a single zero, so that fails. What about this? No. Why? We don't have any alphabetic characters. Right, nothing in this regular expression represents a letter. So I hope this shows that it's a little bit trickier than your intuition might tell you to define numbers and characters and the things that we think about. So we're going to try to take this a little deeper. So how would we define a decimal number, given what we already have here? Somebody from back there. So the question is, what do we mean by decimal number, right? So let's just say, broadly we just mean a number dot a number.
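Before going on to decimals, the four attempts at the number token can be compared side by side. This is only a sketch: `ATTEMPTS` and `accepts` are illustrative names, and the string `"0000"` stands in for the long run of zeros on the slide, whose exact length is not given.

```python
import re

DIGIT = r"[0-9]"
PDIGIT = r"[1-9]"  # a digit that is not zero

# Each attempt at NUM, in the order they came up in class.
ATTEMPTS = {
    "digit*": rf"{DIGIT}*",
    "digit digit*": rf"{DIGIT}{DIGIT}*",
    "pdigit digit*": rf"{PDIGIT}{DIGIT}*",
    "pdigit digit* | 0": rf"{PDIGIT}{DIGIT}*|0",
}


def accepts(name, text):
    """Return True when the named attempt matches the whole of text."""
    return re.fullmatch(ATTEMPTS[name], text) is not None


for name in ATTEMPTS:
    results = {text: accepts(name, text) for text in ["", "132", "0", "0000"]}
    print(f"{name:18} {results}")
```

Only the last attempt accepts both 132 and a lone 0 while rejecting the empty string and the run of zeros.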
Right, so what would you do to use a regular expression to define that? And you can use regular expressions we've previously defined. Anything else? Yeah, right there. Hold on to that. You guys did it so well the last time, maybe I said you have too much. So the first thing I would try, from what I just said, so we want a number, followed by a period, followed by a number, right? So let's try that first. So we have a number, and then, followed by a dot, then followed by a number. So why is this dot in the middle here of a backslash? Somebody raise your hand to explain? Yeah. Yeah. It's the escape character, right? Did I define an escape character? No, not really. But we're going to do that here. So here we're talking about a regular expression that matches the dot character, right? We're not talking about the dot operator in regular expressions. And so because these characters are the same, right? We're using the dot character as a special character within our regular expression construction, right? We're using dot as concatenation. So to specify, we need a regular expression that matches the dot character. We're going to use a backslash. And this is similar to in programming languages, how you define a new line character by slash n, slash lowercase n, right? So using two characters there to represent one character, the new line character. Yeah, there's a hand. Ah, we're not there yet. We're not here yet. We're still on the escaping. All right, we'll come back. Okay, so this also means, right? Once you add an escaping character, well now how do you match a slash? Slash, slash, slash. You have to do slash, slash. And that's how you get in the crazy escaping where you have like four slashes in a command line argument that then gets passed. So that's what we're going to find here. Okay, so does this match this decimal? What about this? And this one? No, why not? Somebody raise your hand. Yeah, right here. 
Right, so the number after the second decimal, right, so the second decimal is a num, and what did we explicitly work to define a number as? So is zero one a num? No, why? It doesn't start with a p-digit, exactly. It starts with a zero. So this construction doesn't work, right? So then, well let's go back and say, okay, well now let's relax what we're defining after the decimal place as a number. So let's say digit star, right? So then do we match this string? Yes? What about this string? This string? Yes. Good, we're on the right path. What about this string? Yes? No? Yes, but it's bad. Somebody raise their hand. Why does it match? Yeah? Because the star operator, digit star matches the empty string. So here we have, you can think of it as one dot empty string, which matches that regular expression. So do we want one period in our, one dot in our language? Somebody want to argue? No? You said no? I want to say no. You want to say no, why? Because it's not correct notation for a decimal number. Says who? Says who? Us. Yes. We say so, right? We can say that this is not. In general contains at least two digits following the period. At least two? In general. So 1.1 is not a decimal number? No, it's good. So yeah, so the point here is like we define what is a decimal, right? And so yeah, let's go back. You'll disagree? Wait, say that again? Say it louder. I would say that it can be defined. Yeah, so that's a good counterpoint, right? So we could want to say in our language that this, we may want to separate a number, the number 200 from the decimal 200 for purposes of significant figures or maybe with arithmetic. So deciding if we want to multiply as a decimal number or a numeric number. So anybody who's done division in Python probably had it drilled into them that you have to do, you have to put like a constant point zero in order to make sure that your statement gets transformed and does decimal arithmetic on it, otherwise you have crazy problems. 
So yeah, so, but let's say we don't want one point, right? We want to force people to be explicit and say if you want a decimal, you have to say one point zero. Are we going to do that? So why don't we try adding a digit in front of there, right? Oh, actually maybe I'm messing up. Okay, so let's say we want to add a digit first and say digit followed by zero and more digits. So now we get one point five, does that match? Yeah. Two point ten, one zero one. It's not that quick this time. Does it match? Yeah. Somebody explain? Yeah. Yeah, and then more digits after that matches one. Okay, what about one dot? No, right? Because it has to end with a digit. What about this? Is this what we want? Probably not, exactly. Okay, so we're going to actually stop here but you could keep going and going and refining. The point is that you need to be clear about what you're trying to define and how you're trying to define it and what's allowed and what's not allowed versus what you intended, yeah. Wouldn't that last one not work though? Because I thought we'd define num the first number or first digit could it be zero? Or zero, eight. Somebody raise your hand? So we want to respond? Yeah. Right, so remember we wanted to match zero with num so then we had to go in and add the or zero so we would match the single zero character. We didn't skip that slide. So then we could refine this some more and maybe say okay, maybe a P digit there. Anyways, you could continue going this further but we're going to kind of stop this so we can go forward. Anybody have questions on this part about token? Yeah. So is it like every decimal number are uniquely represented who we were having this zero, not zero, zero, zero? Is that specific with the language or what? So the question is, is that specific to the language wanting only one representation of a number? Right, I think it depends, it depends on the person creating the language. 
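The three tries at a decimal token can be checked the same way. This is a rough sketch, not the definition from the slides verbatim: `DECIMAL_ATTEMPTS` and `accepts` are illustrative names, and `NUM` is the p-digit-or-zero definition from earlier.

```python
import re

DIGIT = r"[0-9]"
NUM = r"(?:[1-9][0-9]*|0)"  # pdigit digit* | 0, grouped so it can be embedded

# \. matches a literal dot character.
DECIMAL_ATTEMPTS = {
    "num . num": rf"{NUM}\.{NUM}",
    "num . digit*": rf"{NUM}\.{DIGIT}*",
    "num . digit digit*": rf"{NUM}\.{DIGIT}{DIGIT}*",
}


def accepts(name, text):
    """Return True when the named attempt matches the whole of text."""
    return re.fullmatch(DECIMAL_ATTEMPTS[name], text) is not None


for name in DECIMAL_ATTEMPTS:
    results = {text: accepts(name, text) for text in ["1.5", "1.01", "1.", "1.000"]}
    print(f"{name:20} {results}")
```

The first attempt rejects 1.01 because 01 is not a num, the second lets 1. through because digit star matches the empty string, and the third still accepts trailing zeros such as 1.000, which is one of the places where you could keep refining.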
We could want that because, so what might be some reasons why we would want to disallow multiple representations of the same number? So we're going to step into a language designer's shoes. Yeah, go ahead. Yeah, so one thing would be making sure that everything is... Yeah, parsing uniformly. That's a good way to put it. I'd say uniformity is what you're trying to go for, so that when I look at your code I don't have to parse. When I see a 0.000 I have to think, why would somebody do that as opposed to just zero? Unless it's something special about your language, you're forcing me to do extra work to think about what you meant there. So I'd say maybe you're removing degrees of freedom from the programmer, which some people may think is like a straitjacket, whereas other people would say, well, it forces you to only do one thing. And do you want to take another side there? Yeah. So in that case, for that language, all the regular expressions would be different from other languages, right? So here if we are defining numbers as something like digit followed by digit star, it will be different in Python, right? Because there you will have the luxury of having 001 as well along with 1. Yeah, I mean I may argue with the definition of luxury, right? I mean it just depends on the language. So anybody know how to do octal numbers in C? So how do you do octal numbers? I thought you were raising your hand. Here? Yeah, you start with a zero. So that could be another way, as you could actually represent alternative number bases using these kinds of formats. So zero, one, one, one is a different number than one, one, one, right? That's not one hundred eleven, and I don't know what it is because I'm not going to do octal math in my head, but the point is that languages can kind of abuse this.
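To make the octal point concrete, here is a small sketch of the C convention that a leading zero switches the base. The function `c_style_value` is a made-up helper, it assumes the input contains only digits, and it ignores hex and the other prefixes.

```python
def c_style_value(literal):
    """Read a digit string the way C reads an integer literal:
    a leading 0 (on more than one digit) means base 8, otherwise base 10."""
    if len(literal) > 1 and literal.startswith("0"):
        return int(literal, 8)
    return int(literal, 10)


# Same digits, different numbers: 0111 is 1*64 + 1*8 + 1.
print(c_style_value("0111"))
print(c_style_value("111"))
```

So 0111 read as octal is 73, not one hundred eleven.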
So you define, okay, you define a high level number and then you find different regular expressions to say what is a octal number? Well an octal number starts with a zero followed by digits. Or, and then you would define a regular number as doesn't start with a zero like we said here by any number of zero, zero through nine characters. Same thing, you could do binary, you could do hex values. They just have different syntax of how to specify them. Okay, so now we've gotten familiar with regular expressions. We understood the work it kind of takes to define actually a regular expression. So remember our regular expressions are just tools that we're using to define and build our lexer. And the job of the lexer is to actually turn this series of bytes, these symbols that we've been talking about, into a token. And into, specifically we want a series of tokens. So we briefly alluded to it when we talked about the project two, but we're going to kind of think of lexers from the API level as having a function called getToken that returns a token. So getToken will return the next token from the input stream each time it's called. So you keep calling getToken, getToken, getToken. And that function is turning those raw bytes into abstract tokens for the rest of your analysis to work on. Questions? Okay, so have we seen, we're actually going to specify these tokens using regular expressions. And we talked about it a little bit. So going back to maybe the first day when we talked about that whole pipeline. The first file, which is a series of bytes going into the lexer and what's coming out is tokens. And specifically it's going to be a sequence of tokens. So it'll be something like number, ID, number, operator, ID, decimal. And so that's really the lexer's job is to do that. So you give it a set of regular expressions that you care about and say what tokens they are and it goes and parses that. So now we're going to kind of look at how that works. 
So let's say we're going to build on what we've already done. So we're going to define an ID, which we defined earlier, right? So an ID is a letter followed by any number of letters, digits and underscores. We're going to define a dot operator, a dot token, with the dot here. So why might we want to do this? Like, is the dot character significant in languages? So if you're using object-oriented programming, in a lot of languages, like Java, you can use the dot operator to call a method on an object. You can also access fields in structs in C with the dot operator. All kinds of cool stuff. So yeah, it's important for it to be its own kind of token because semantically we care about that dot character. Okay, we define numbers as we defined them before. So a p-digit followed by any number of digits, or the character zero. We'll define decimals as we defined them earlier, even though they allow a few redundant characters. So the question is, given these four tokens, right? When our lexer calls getToken on a string, it's only going to return one of these four values, right? So it's not going to return letter or digit because we don't care about those. It's only going to return these four tokens that we've defined in all uppercase. So what does getToken return on this string? Somebody raise their hand? Somebody said one. Is one right? Somebody's shaking their head. Somebody tell me why one is not right. Raise your hand. Yeah. One is not a token, right? One is, yeah. So one is not one of these four tokens that we care about. It may be, let's say, the value of a token, right? So whatever characters that token matches, that's its value. If it matches a number, then we can look and see what those actual characters are. But what we care about right now is the token. So is it going to return num? Is that one point one, or is this number one out of a list, one ABC one?
This is a string that is, let's see, one, two, three, four, five, six, seven, eight, nine. Nine characters long, right? So that's why it's italicized in blue because this is the string. Okay. So is it going to return num? No. No? Yes? Yes. No? Maybe. Is it going to return decimal? Yes. No. No? Yes. Maybe. Is it going to return ID? No. There's a lot more maybes and ums in that one. So we want to raise our hand and defend one of these, argue for one of these. Yeah, in the back. Should it return to all four? I think if its decimal would be 1.1, it would stop the ABC. Then it would be the ID which would be ABC one and then stop there and do a dot which is the dot and have num. So why wouldn't it return num and match one? But it matches both of them. So why? Right? So one matches the num, right? Does it? Is one a number? Yes. Yes. Right? It matches that string. Why? Because it's a P-digit. But it also matches the decimal, right? 1.1. So the question is here we have a case where we have more than one token that matches the input string. So now the question is how do we deal with this and how do we decide which token actually matches? So that's where it comes up, the longest matching prefix rule. And it actually is literally right there in the name. So the lexer is going to start from where it last parsed from the next input symbol. So from the string, the last thing that it parsed, it's going to try to find the longest string, right? So you can think very greedy. Like we don't want to just match a number if we can match a decimal number, right? And so we're going to try and find the longest string that matches a token. And this makes sense, right? So you wouldn't think as you're programming is 1.1 the calling the method one on the one object. It's a decimal, right? It's a decimal of 1.1. So what about if there's a tie? What if two strings are the longest? Yeah, so you could literally come up with anything, right? 
I mean, the point is we have to have a standard. So by default we're going to break ties by giving preference to the token listed first in the list, right? So, let's see if I go back here. We're going to choose an ID over a number. We're going to choose a number over a decimal, but only if the length matches. So this would be here. So if we just had the input string 1, which one of these is going to be returned? Which token? You want to raise your hand? Yeah. Huh? Which one? I can't hear you. Number. Number, num. Yeah, exactly. Perfect. So when we just have the string 1, we get num. The 1 could be the start of a decimal, but on its own it is not a complete decimal, so num is the only full match, and if two tokens ever did match the same length, the one defined first would win. So does 1 match ID? No. No. Does 1 match dot? No. No. Does it match num? Yes. Yes. Okay. Good. We're done. Okay. Questions on the longest matching prefix rule? So we're going to go over an example, but I want to give a chance if there's any pressing question. Okay. So this is how we're going to do it. So we're going to walk through step by step how our lexer is going to work and operate on that string we previously looked at with those four regular expressions, right? So we have ID, dot, num, and decimal. And so the way to think about it is we're going to look at the input string. We're going to keep track of what has matched so far, what regular expressions and tokens still have the potential to match, and then we're going to keep track of the longest match that we've had. So it should be kind of clear as we step through an example. So here's the input string, right? That character means that we're basically before the input string. So the next character we're going to consider is that one. So at the beginning, right, we have this input string. We haven't matched anything. We haven't consumed any of the input. We haven't moved down the input string at all. And so any regular expression is possible, right? So "all" here is not a token, right?
All represents all of the tokens. So right now at the start, everything is possible. We haven't matched anything yet. So the first thing we do is we look at that first character, one. So which of those four tokens does it match? Num and decimal. So somebody raise your hand. Yeah. Num and decimal. Num and decimal. Does one match decimal? No. No, right? Why? Exactly. Num and decimal. Yeah, so a decimal has got your num, a dot, followed by something afterwards, right? So it doesn't match it, but it has the potential to match, right? So under matching we're going to put what actually matches. So num matches, so we're going to put num in the matching column. And under potential, we're going to put decimal because what we've seen so far has started to match that, but we haven't finished the match yet. What about ID and dot? Do they match? No. No, right? So we don't put them anywhere. We just get rid of them. So does everybody see that by inspecting that one character, we've matched num and we have the potential to match decimal. And then after we inspect this, we're going to move on to the dot character, right? And so now, does one dot match num? No. Right? So our longest match that we've seen so far is num with a length of one. So that's what's in that far right column. In the far right column, we have num comma one. That means the longest match that we've seen so far is length one. Have we matched anything new yet? No. What was it? Where's your hand? Yeah. So the question is why don't we keep dot as a match? Let's go here. Right, so dot doesn't match one, right? So we've started at the beginning of the string. What we're trying to do is go through the string to find the longest run of input characters that matches one of our tokens. So we've started at one. So we know that that can't be dot, so we're not going to match on dot.
So right here we're looking at the dot character, but it's not in our matching or potential, it's not in our potential. So we know that it can't be dot. Any other questions? Yeah. If a num is the longest match, the following letter would be like A, so they're matching a dot, right? So we'll get there. But right now, so you've got to think we haven't returned anything yet. Our lexer hasn't returned anything yet, because we haven't, there's still a potential that we could match a decimal. So what we're going to do is we're going to, so we're still, you've got to think we're starting at the beginning here, right? And we've said, okay, we looked at one. Okay, it could be one. It could be a number that we've returned, and then we look one more and we see, ah, there's a dot character. It still could be a decimal at this point. We don't know yet. And then, well, we'll see in a second. So now we're going to look at the next character, right? So we're going to keep going on and we're going to look at the matching character. So we're going to see that that one. So we're going to look at that one and we're going to say, aha, we've matched a decimal, right? 1.1 matches a decimal, right? Yes? Somebody just say yes. Convince yourself that it's yes, and then say yes, yeah. Because we, that's the, yeah, exactly, that's the last complete match that we've seen. So right now we haven't looked any further, so it could be 1.11111, and so we're going to keep going. So that's the thing is we're going to keep matching until we don't see anything more when there's no more potential, and then we're going to return that as the token. More questions? Will we continue? Okay, so that's why there is, ah, here, we've matched decimal, but there's still the potential for a decimal, right? Because we haven't looked any further, there could be another number after here, which means that we can still match that decimal. Okay, so what happens when we go to the next character? 
A, what's, so the next character is A, right? Right, so do we match what our potential is? So now we have no more potential, right? So now nothing matches, we have no potential, and so we say, hey, that longest, the longest token that I've seen was length 3 and it was a decimal, right? So now at this point, now I can stop, because I know there's no more potential tokens that match, right? I've gone all the way through, I've gone as far as I can in this string, I've gone three characters, I saw a decimal token, so now I'm going to actually return that, and so I'm going to represent that here, so we're going to return that, I'm going to underscore all the characters that we return as part of that decimal, and then our lexer is going to return decimal. And so now we stop, right? So the underline means, hey, we've consumed or we've looked at this input string, so now we're going to stop here, and the next time somebody calls getToken, we're going to start from that A character. Does that make sense? Okay, so then somebody calls getToken. What is going to be the matching? Do any of the four regular expressions match, or tokens match? ID, any of the other ones? No? So we're going to match ID, we also do have the potential for ID here, do we have that in there? We need to update this. Okay, so we're going to say, okay, we've matched ID, but we can actually still have the potential to match ID, so we keep going. So we go to the next one, which is the B character. Does that match ID? Yes. Yeah, so say that again louder. So the question is, it's a single string, so why does it continue here? So as opposed to what, I guess, what's the alternative? We haven't parsed this string. It's a single string. So anyway, so it's not a single string, it's just input. So you've got to think, these are just input bytes into our program. We don't know, you're kind of injecting your own intuition as, ah, there should be spaces in between these things, right? 
But that's actually defined as part of the tokens. So we haven't defined any of that. So white space, we don't care at all about white space. In fact, white space isn't even a part of our language. So there's no such thing as a string or like a single string versus multiple strings. All there is is an input string, and we're going to return all of the tokens in that input string, consuming all of it. Does that make sense? It should be. Yeah, it's a mistake. Yeah, because we can still match ID here, so that's why we're going to keep going. I think there's also that problem on the second row here. Num should be in potential, because there is a potential after you match one there to be multiple digits after that too. So I'm going to have to update that as well. This is why I don't like giving out slides early, because there's going to be problems that 240 people are going to see. It's not going to update them. Okay, so we've matched an A, we've matched an ID, we're going to continue matching an ID, right? So that's the potential. So what happens when we match, when we look at the B character? It's still an ID, right? So it's still going to be in matching and still going to be in potential. What about the C character? It's the same. It's the same, right? So it's still going to be in matching, it's still going to be in potential. What about when we look at the one? It's still ID. What about num? Why don't we consider num right now? Say that again? So it doesn't start with a pdigit. Yes, exactly. So we're not even considering num right now, right? Because it's not in the potential set right now. So we don't even think about num because we didn't start matching a num with the string that we've looked at so far, right? Exactly. So ABC doesn't match num. So we don't care that we're looking at a digit, because right now we only care about the ID. So we look at one, and it's going to match and still have the potential, so we're going to have another ID.
And then we're going to look at dot. So what's that? Dot. Does that match ID? No. Does that match dot? No. Ambiguously worded question. Yes, it does match, but we don't care about it because it's not in the potential set right now. We only care about the ID. So then we've got here, we have nothing in the matching set, nothing in the potential set, and then we say, hey, the longest match that I saw was an ID of length four. And so now we return this, right? As, hey, I've consumed four characters from the input. I'm going to return the ID token, and I'm going to return ABC1. So now what happens if we call getToken again? Right? Yeah, so we're going to look at this next character, which is the dot character. We're going to say, hey, we match dot. Do we have a potential to match anything else? No. No, why? None of our other tokens begin with dot. Exactly, but why isn't dot in the potential set? Because dot is a single character. Right? Because dot is a single character. I can't possibly be matching more. Right? So then we look, well, I guess we can look ahead here. We didn't probably have to do this. Does that make sense? We look ahead here and we say, okay, well, we didn't match anything here, obviously. So let's put, we say, the longest token that we've seen is dot with length one. And yeah. So do we include the dot in the longest match of the ID? Here? Yeah. Yeah, I think we could. Yes, exactly. I think we could. Yeah, so we didn't have to look one more because we actually knew there was no potential for dot to be there. So we would say, hey, the longest match I saw is dot and it's length one and there's no potential for me to match more. Should we include ID one, two, three? Which, wait, what? Do you include ID one, ID two? I would say no if you have something that matches. Right? Because you haven't finished, like, you have to take the longest anyways. 
So if you're continually matching an ID, you don't really care about the fact that you've already matched something with, like, two versus something with, like, three, right? Exactly. So only when you stop matching, then you can say, okay, this is the longest I've seen of that token. Okay, so then we return the dot and then we look at number and then we'll end up returning that. At that point, doesn't it also have the option of being a decimal? At what we're just pointing at two, doesn't that also have the potential of being a decimal? Yes. Yes. Yeah. But then we go to here, right? So here it could be a number or a decimal. And then when we get to the end of the string, we know it can't match a decimal, right? Because it's got to match the dot, the period. But there is nothing there. And so that doesn't match. But the star, the zero or more on our number is going to match. So yeah, that will go in here. So I'm going to update this. Yeah. How are the tokens defined? How do you think they're defined? Huh? So I guess, what's specifically your question? Who defines them or how do we define them? Yeah. So the question is in project two, how are the tokens defined in the lexer? So that's part of what I kind of want you to figure out. I mean, that's part of why we give this to you. It's more of a, so there is a way that you can give, so there's programs called lexers, that you give them regular expression tokens and they'll spit out a tokenizer. Oftentimes, you can also hand code it. So that one is hand coded basically. So it, you'll see it has ways to, but it uses this, you know, this is a general purpose algorithm that you can run and match any set of tokens on any string. You'll get that and you'll see it's kind of, the cases are small, like in this toy language we're talking about, it's not super difficult. Questions? Okay. Then we can talk about a spaceship real quick, or a rocket. 
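The walkthrough above can be sketched in a few lines of Python. This is only a sketch, not the Project 2 lexer: the four token definitions live on the slides, so the regular expressions below are assumptions (ID as a letter followed by letters or digits, NUM as 0 or a pdigit followed by digits, DECIMAL as a NUM, a dot, and at least one digit), and the names get_token and tokenize are hypothetical. Instead of tracking the matching and potential columns character by character, it asks each rule for its longest match at the current position and keeps the longest one, with ties going to the rule listed first.

```python
import re

# Token rules in priority order; earlier rules win ties.
# These definitions are assumptions standing in for the ones on the slides.
TOKEN_RULES = [
    ("ID", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("DOT", re.compile(r"\.")),
    ("NUM", re.compile(r"0|[1-9][0-9]*")),
    ("DECIMAL", re.compile(r"(?:0|[1-9][0-9]*)\.[0-9]+")),
]


def get_token(text, pos):
    """Return (type, lexeme, new_pos) for the longest match starting at pos."""
    if pos >= len(text):
        return ("EOF", "", pos)
    best_type, best_len = "ERROR", 0
    for name, pattern in TOKEN_RULES:
        m = pattern.match(text, pos)
        # Only a strictly longer match replaces the current best,
        # so the first-listed rule wins when lengths are equal.
        if m and len(m.group()) > best_len:
            best_type, best_len = name, len(m.group())
    if best_len == 0:
        # Nothing matched: report one bad character and skip past it.
        return ("ERROR", text[pos], pos + 1)
    return (best_type, text[pos:pos + best_len], pos + best_len)


def tokenize(text):
    """Call get_token repeatedly until the input is exhausted."""
    tokens, pos = [], 0
    while True:
        kind, lexeme, pos = get_token(text, pos)
        if kind == "EOF":
            return tokens
        tokens.append((kind, lexeme))


print(tokenize("1.1abc1.1"))
```

On the input 1.1abc1.1 this gives the same four tokens as the table: a DECIMAL 1.1, an ID abc1, a DOT, and a NUM 1.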
So there's a story that maybe some of you have heard that this Mariner 1 spacecraft, I believe it was a satellite, a NASA satellite crashed, and when they looked into why it crashed, it was a Fortran bug that I'm going to talk about. Unfortunately, it's really exciting to talk about that, but unfortunately it's not true. So I actually thought for a long time, because I heard that story, that this was actually a true story, but it's not. So we actually, we brought this up earlier, right? We were discussing the issue of white space, like what does it mean? Like why do we consider that whole string as a token? And so in some programming languages, white space is not significant at all, meaning it is 100% optional. And even if you think about it, right, in most programming languages, white space is not always significant, right? For instance, you know, the parentheses, space 5, space plus, space 10, space close parentheses, right? That's the exact same thing as parentheses 5 plus 10, parentheses, right? So that parentheses 5 plus 10, close parentheses, right? That's one single string, no white space in it, but there's multiple tokens that are coming out of there, right? It's going to be left paren. It's going to be a number. It's going to be a plus operator. It's going to be another number, followed by a right paren, right? So that makes sense. You're kind of doing that intuitively, even though you think like white space is important, and it is, I'll agree, it is visually definitely like you have to, like it's good practice to write things in a way that makes it easier to read later, but it's not required. So the way this bug went is that Fortran, it turns out, is one of these languages where white space is completely ignored. So the only things defining how to interpret things are the tokens. So there's no significant white space at all. And so the true version allegedly, well, the allegedly true version of this story is that they found this bug while testing.
So they were testing for a mission, and this code had worked fine when the precision didn't need to be this high, but on whatever new mission they were doing, the code broke, and they found out that the cause was this bug. So we're going to talk about it. So anybody have Fortran experience? Yeah? Like how much Fortran experience? Like what's a little bit? Like a Hello World, or like people pay you to write Fortran programs? No? Oh, cool. Okay. So you had to learn how to read Fortran at least, right? Good. So what does this do? It's like, yeah. I wish it had a view. I have no idea. I don't know how to read it. I don't read Fortran. I appreciate that. So it's an implicit loop, right? So basically like this code is a for loop, right? At least as far as my understanding, it is a looping construct that's implicit in one line of code. So this is kind of like looping from I as 1 to 100 and then jumping to line 15 or something like that. That's great. How do you even like do this? All right. So what's the difference between that line and this line? Well, okay. Yes. That's correct. What's the semantic meaning behind this second line? Yeah. The second line has one value on the right side of the equals sign, as opposed to above, where there are two. So, yeah. So the, so yeah, do you want to... So an infinite loop. That would be even better than I think what actually happened. So the problem is the lexer is parsing these tokens, and in the first case above, it can see, oh, there's a do and then a 15 and then an I and then an equal sign, oh, and then the comma means that it's going to then interpret this later as a loop. The problem is when you have this dot. Now, without that comma, it's no longer a looping construct and it's variable assignment. You're assigning the value 1.100 to the variable DO15I, right, because white space is not significant. So now you have variable assignment instead of a loop. So is this bad? Probably. Why is it really bad?
Why is it worse in my mind than turning into an infinite loop? Because what? It will continue the program after it's assigned it. Yeah, exactly. Your program is going to continue to execute. So you make one tiny one character mistake and if your program, if that were the case and your program doesn't compile, well, that's fine. Your program doesn't compile and you fix that mistake, right? It's the same as forgetting a semicolon, right? You fix that mistake and you compile the program again and you fix that bug. But here, not only does it compile, it doesn't, if it was an infinite loop, it would run forever and when it hit this code, you'd be like, hmm, program not doing anything, that's weird, I should fix that. But here it doesn't do anything. It just doesn't do this loop that you thought it was doing and instead assigns a variable. So this is kind of like what I like to think about is this silent failure mode, right? Where silently fails and you have no idea until you debug like, why isn't my loop getting called? That's really weird. That wouldn't be your first instinct. So this is where the tokenizing rules and whitespace being significant, all that kind of stuff can really come to effect. It's very similar in C when you are assigning, when you accidentally forget and you assign a variable in an if statement rather than doing equality with two equal signs, right? So there's one character difference, two that silently fails and then you start taking branch conditions when you think it shouldn't happen and all kinds of madness. Okay, questions? So I don't know what the moral here is, probably don't program in Fortran, but... Yes. Or do and then it's pretty interesting for class. Don't worry, we still got 10 minutes. Why? Oh, why did it crash? That's a good question. I looked at it, it's like, who asked the question, so I can know normally, yeah, I don't know. 
I think it's like aerospace stuff, like the one part failed and then the other part was like shaking too much and too much resonance or something. I guess typical rocket stuff, I would imagine. So what you're selling is you're not a rocket scientist. I am not a rocket scientist. My brother is actually a rocket scientist, so, yeah. How does it feel to not be a rocket scientist? It feels fine. I'm okay, I'm okay with my life choices. You all could be rocket scientists. You know, you gotta go, just go to a job in NASA and there's like SpaceX. Is the answer what? I don't understand. Just speak up. The final, oh, to our longest prefix matching? That was actually what the student up there said actually right off the bat. So it's a decimal, right? It matches the one point, was it one point one? And then the ID matches ABC one and then dot and then a number. So four tokens. Yeah, the really interesting thing, right, is because you have 1.1 in two different places and one place is the decimal and another place it's part of an identifier and part of a dot and a number. So that's kind of what that's there to illustrate. Right, yes. So we didn't talk about it because I think it'll be pretty clear from your homework but there's usually two special tokens that aren't part of the language. One is the error token, right, which says, so if we saw some input that didn't match one of our four rules, right, we would say, ah, that's an error. Our program, our lexer just kind of swallows that and returns an error token. And then the second one is end of file, right? So you wanna know when you actually get to the end of string but there's no more input tokens. So those are the two kind of special characters. So really, what you do is you keep calling get token until you either call error that you can handle, either handle or quit, or end of file and then you do whatever you need to do. Questions? Yeah. It depends, yes. 
What our lexer does is when it calls an error, it just, whatever it got the error, it's gonna, at the next call, it's gonna start the input from there. So it kind of ignores it. But, yeah, that's up to whoever's writing the lexer about how they wanna do that. Yeah, yeah. Is there a special place where the lexer stores the string? Is there a special place where the lexer stores the mappings between, oh, you mean between the string that it read in? Yeah, so you'll see one way to do that in the homework, or in the project too. So, yeah, one way would be get token returns everything, right, like a whole big value. The way we've done it is you call get token, it returns the type of token, and then there's global variables that you've grabbed those values from. So you can see what line that token appeared on, you can see what the value of that token was, and all that stuff. So now we get to move on to syntax analysis. So, we've looked at the lexer, right, and the lexer's job is to turn those raw bytes from the source code and turn it into a sequence of tokens, right? So, y'all right up there? Sounds like somebody fell out of their chair. Okay, so the goal of syntax analysis is to take those series of tokens and to, I'm gonna say very broadly, to turn it into something useful that the rest of the program can operate. But the problem is, so what we've seen so far, right, with the lexer, does the lexer care about if the program is syntactically valid, does it have any notion of validness of a program? Somebody wanna answer? Does it? Yeah? No, it does not. No, it does not. Is it true? You wanna defend your answer? Mm-hmm. Great. Great. I know. That's why I asked it. So, yeah, it's a tricky, so in some sense, yes, right? So, our tokens define the regular expressions that we use to define our tokens define our language, right? So, we're saying what it means to be an ID, what it means to be a number. 
Basically, the way we've defined a number, right, it's an error to, or it's gonna be an error maybe to include two zeros, like that would be an error. So, we're doing some kind of checking, but what we really wanna do is saying, okay, does this program make sense? Let's think about just a mathematical program with one plus two plus three, right? So, we can define the tokens for one, for numbers, and we can define plus for the operator, but how do we know that they're in the right order? We didn't do like one, one plus. Right, yeah, question. Does the compiler check for syntax before it actually does lexer? Does it, does the compiler check for syntax before it does lexing? No, because it needs something to operate on. So, like the syntax doesn't operate on the raw characters. It's got, those have to be lifted to a token that it knows how to operate on. Okay, so we'll look at things. So, we need to, so yeah, this is an example. So, we need a way to specify and check if a sequence of tokens is valid, right? So, what we've looked at so far is just how to generate those sequence of tokens, how to say if our tokens were valid or not valid, but we need some way to say, well, do those tokens make sense? So, for that simple example, right, so num plus num, right, that makes sense, right? A number plus another number, right? We can interpret that and do something useful. What about like decimal dot num? No, it's really weird, right? Like why, this doesn't really make sense. And I mean, unless you have white space in here, this would be a very weird, so assuming we have white space in between our tokens, like this doesn't really make sense. What does it mean to have a decimal dot a number? Id dot id, does this make sense? Maybe, yeah, maybe in an object-oriented language or maybe with structs or something. So, we have some identifier dot another identifier. What about this? 
So, I guess the other answer, right, is that, well, it really does depend on our language that we're talking about, that we're trying to define the semantics for, or the syntax for, sorry. This could be maybe dot dot dot is like an ellipsis operator and you can use it in really interesting places. I don't know, but that probably would be from what we consider languages that would be invalid. So, the question is, can we use regular expressions to check the syntactic constraints that we care about? So, let's try. So, we're gonna say a program is composed of many zero or more statements, right? Kind of makes sense. You write a program as essentially a series of statements, maybe with other things. And so, we're gonna say a statement is either an expression, an if statement, a while statement, or some other jump. We're gonna define a regular expression called operators, which are either plus, minus, multiply, and divide. So, I did not do the escaping, or do I need escaping here? No, I didn't do the escaping of the star operator here, but I probably should have, but it's pretty clear here. I'm talking about four characters here. Okay, now I need to define an expression, right? So, we'll just consider kind of the numerical operators. So, we're gonna say numbers, IDs, decimals, some operator, numbers, IDs, decimals, right? So, let's define an expression as it's either a number, an ID, or a decimal. Some operator followed by number, ID, decimal. Does this make sense? So, we've already defined decimal, we've already defined num, we've already defined ID. So, well, that, okay. So, it's five plus 10. Is that matched by our regular expression program? Somebody, yes, why? It's a number and operator. So, first it's a program which is zero or more statements, right? And a statement is an expression. An expression is, right, number, operator number. Exactly. Okay, what about foo minus bar? The tide turns. See, like the group dynamics. If somebody shouts one thing. 
So, yeah, same thing, right? It's definitely a program. It's an expression. It is ID, operator, ID. What about this? Not by our current regex. However, if we write it so that an expression can itself contain an expression, it could be possible. Then it's not a regular expression. So, we can't. Well, I mean, by having another definition, like another regular expression, having five. Possibly. Okay, so we're going to leave it here, but you should try to convince yourself: so is one plus two plus three matched by our regular expression? No.
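One way to convince yourself is to try it. The sketch below is an illustration under a few assumptions: it works on token types rather than raw characters, writes a token sequence as space-separated names, uses OP as a stand-in name for the operators rule, and the helper is_expr is hypothetical. It only encodes the expression rule exactly as defined in class, an operand, one operator, and another operand.

```python
import re

# An operand is a NUM, an ID, or a DECIMAL token.
OPERAND = r"(?:NUM|ID|DECIMAL)"

# The expression rule as defined in class: operand, operator, operand.
EXPR = re.compile(rf"{OPERAND} OP {OPERAND}")


def is_expr(token_types):
    """Return True if the whole token sequence matches the expression rule."""
    return EXPR.fullmatch(" ".join(token_types)) is not None


print(is_expr(["NUM", "OP", "NUM"]))               # five plus 10
print(is_expr(["ID", "OP", "ID"]))                 # foo minus bar
print(is_expr(["NUM", "OP", "NUM", "OP", "NUM"]))  # one plus two plus three
```

The first two sequences match and the third does not, because the rule as written allows exactly one operator.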