All right. Very quickly: Project 2 is released. I sent an email out on the mailing list and you should all have received it, so all the details are on there. We're not going to go over it now, but there will be four recitation sections. Go to your recitation section, or go to any recitation section, so you can talk about Project 2. Any other regular-expression-based questions for Anna? Everybody got regular expressions? Yeah, so regarding homework problem one, the (d) one has a plus sign in it, so how do you do the plus sign? Is there a plus sign in regular expressions, or in the alphabet? No. Okay, there you go. I can see it now. A good way of showing your work on that one is showing a sequence of steps, breaking it down and applying the rules we talked about here. So for instance, we talked about... that's what we've got here. Right, so in this example here, say we were asking what the language described by the regular expression "open parenthesis, a or b, close parenthesis, c" is. Then we could break it down like this. We'd maybe also want one step for the language defined by a union the language defined by b, and then another step where we replace that. You just want to follow all the rules, so that we know you're applying the rules in the right places. That is exactly why we went and defined all of these. Do you want our answers on the document you gave us, or separately? I'm not sure what that means. Like, you gave us a doc? Oh yeah, if you want to put it in a doc, that's fine. It has to be a PDF, though; I'm not going to accept anything else. But would it be fine if the answers are just in their own separate document? Yes, that'd be fine.
All right, let's continue. We started writing tokens. That's where we left off last week, when we started trying to define tokens.
We started by defining a decimal number, and we also talked about why we need to use a backslash here: we want a regular expression that matches the dot character, the dot that's in the alphabet, not the dot that is the concatenation operator in our regular expressions. So we did all that. We talked through these examples. We got somewhere close-ish to what we wanted, but we still had this problem of double zeros and multiple zeros. We're not really going to go into this any further. I just wanted to give you the idea that it can be a little tricky to write a regular expression that matches exactly what you want and doesn't match anything you don't want. So we built up regular expressions, and we tried to develop some intuition about how we could write regular expressions to define tokens. So what is the overall goal of lexical analysis? Why are we doing all this? What was that? To get tokens. From what? A sequence of characters. A string is a sequence of characters. But I actually think of it a little bit lower, as a sequence of bytes, because that's what a file is, right? A file on your computer is just a sequence of bytes. So we have a sequence of bytes coming into our program, and lexical analysis takes those and abstracts them into tokens. That's our goal. In Project 2, in the second half of Project 2, you're going to be given a lexer. A lexer does lexical analysis: it turns the sequence of bytes into a sequence of tokens. And we're going to talk about lexers as if they have an API with a method called getToken. So what's an API? Application programming interface. It's a set of pre-written functions that you are allowed to draw from. So we're going to say we have this lexer, and we're going to learn how the lexer works.
In Project 2, part two, you're going to be using a lexer we're giving you. It has a function called getToken, which reads from the input string and returns the next token in the input string. And we've been talking about regular expressions because we want to define tokens with regular expressions, so we know precisely what they mean and what they don't mean. Going back to that initial image, this is the part we're focusing on. We've built up our knowledge of regular expressions; now we're going to apply them to extract tokens from bytes. So the question is, what does getToken actually do? Ideally, every time you call getToken, it returns whatever the next token is from the input string. So here you'd get NUM, ID, NUM, operator, ID, DECIMAL, left brace, right brace, whatever. It's reading through the input string, the bytes, and telling you which tokens they are. This might be a dumb question, but did you tell us what a token is? What is the essence of a token? Because it seems like we're classifying different things in the stream that we're reading in. So what's a token and what's not a token? Kind of philosophical. A token is whatever you, the person defining the tokens, define it to be. With regexes? Yes. Are we going to be reading in this byte stream, and when we encounter something, do we run it against several different regexes or just one? Yes, so that's exactly what we're going to get into, and hopefully that'll make it clear. But the question about what a token is, is a really good one. It's really just an abstract concept. A token could have a one-to-one mapping with a character, like a left angle bracket or a left parenthesis. Those are probably tokens in the language, and each one is a single actual character. It could be more than one character, like the characters "if" mapping to a single token. Or, in this case, NUM: there's no limit on how many characters a NUM can match.
We just know what a NUM has to look like. It can't start with zero, but it can be any number of digits that doesn't start with zero. So now the question is, how do we define these, and once we've defined them, how do we actually use them to put these things together? Okay. In our lexical analysis, essentially what we're going to do is define a series of tokens. These are going to be the tokens that are important in our language. So it does all tie together. I guess you could think of tokens as completely arbitrary, apart from a language, but the purpose is to define the language. So let's say we have a language that has an ID token: letter, concatenated with (letter or digit or underscore) star. This is what we already derived. Or mostly; I think we had that an underscore could start an identifier, but let's stick with this for now. Then the DOT token. So what does this regular expression match? Just the dot. Just the dot character, exactly. We know because of the backslash. The NUM, we said, is a pdigit concatenated with any number of digits, or just zero. And DECIMAL is NUM followed by a dot, followed by a digit, followed by any number of digits. So we'll just use this for now. So, are letter, pdigit, and digit tokens? The tokens are the things that the regexes match. Letter is a regular expression. Digit is a regular expression. And ID is also a regular expression. But what I'm saying here is that these are the tokens I care about. Letter, digit, pdigit: those are just little helper definitions for me to define what I want. When I'm talking about ID, I want a letter. So instead of saying I want a, or b, or c, or d, and writing that all out here and here, I can define it in one place and then use it. And most of the time, I'm pretty sure almost all the time, we'll use the convention that an all-uppercase name denotes a token. But do these tokens only refer to regular expressions or these helpers?
Here, what do you mean, where? Like what? Because it's not in letter and it's not in digit. Oh, it's literally right here. This is an underscore character, in this regular expression right here. I understand that, but what does it mean? It is the underscore character. I actually didn't define the alphabet, but the alphabet here would be a through z, uppercase and lowercase, all the digits, plus underscore, plus the dot character. Okay, because you don't have it in digit. Right, because the underscore is not part of digit; digit just defines the digit characters. So then don't you have to define underscore somewhere? Doesn't it have to be in one of the alphabets? No, no. Letter, digit, pdigit: those are just helper regular expression definitions. Okay. We wrote them out here, but we're using the same definitions we already had. Right, I know, but I'm looking at the slide from last time. Like this. So these are those definitions here. Right, I understand that, but I guess my question is, if there's an underscore character, where did it come from? You have letters and digits up there, but you never said there's also this other thing, an underscore character. The main problem is that I never defined a sigma. Okay. So informally, I never defined what the alphabet is. It's basically a through z, capital A through capital Z, zero through nine, underscore, and dot. And from here on out we'll keep it informal; that's basically where it comes from. I skipped that step, but when you're doing these, if the character appears in a regular expression, it's in the alphabet. We're getting a little bit further away from the precise mathematical definition. Okay, good question. So we have these regular expressions, right? DECIMAL is composed of what? NUM followed by a dot, followed by a digit, followed by digit star.
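The definitions above can be written down and tried out. The following is a minimal sketch, assuming Python's `re` module stands in for the lecture's regular expression notation; the helper names, the `TOKENS` table, and the `in_language` function are illustrative names, not part of the course's lexer.

```python
import re

# Helper definitions: plain regular expressions, not tokens themselves.
LETTER = "[a-zA-Z]"
DIGIT = "[0-9]"
PDIGIT = "[1-9]"

# Token definitions, written in terms of the helpers.
# The backslash makes DOT match the literal dot character.
TOKENS = {
    "ID": f"{LETTER}({LETTER}|{DIGIT}|_)*",
    "DOT": r"\.",
    "NUM": f"({PDIGIT}{DIGIT}*|0)",
    "DECIMAL": f"({PDIGIT}{DIGIT}*|0)\\.{DIGIT}{DIGIT}*",
}

def in_language(token, s):
    """True if the whole string s is in the language described by the token."""
    return re.fullmatch(TOKENS[token], s) is not None

print(in_language("ID", "abc_1"))     # starts with a letter
print(in_language("ID", "1abc"))      # starts with a digit
print(in_language("DECIMAL", "1."))   # needs at least one digit after the dot
print(in_language("DECIMAL", "1.000"))
```

Asking whether a string "matches" a token here means asking whether the entire string is in the language, which is why the sketch uses `re.fullmatch` rather than a search.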
So we already said that digit by itself is not a token. We're not interested in a single digit. But NUM is a token, right? So here we have a case where a token is composed of another token. Do you technically need the digit in between the dot and the digit star, or could you just do dot, digit star? If you do just digit star, it can be empty, so you'd get something like "one dot" with nothing after it. This way you have to have at least "one dot zero", but there's still the possibility of "one dot zero zero zero zero zero", which maybe we want in terms of accuracy. Okay. Okay, where is it going? Okay, numbers. Great. So the question is, let's think about this: given what I already have, could I write a regular expression like this? Is this a regular expression? ID is ID concatenated with NUM or digit? Yeah, so this is a recursive regular expression definition, right? It's referring to itself here, and regular expressions must be finite. I don't know if we discussed that or not, but the regular expression itself must be a finite string. This is never going to be a finite string. So this means that when we say NUM is equal to pdigit concatenated with digit star, or zero, and then we use NUM here in DECIMAL, we're not causing any recursive definition. We're just literally using equality. You could replace every instance of NUM with the thing on the right, just like you could replace every instance of digit with zero or one or two or three or four and so on, so you get a finite regular expression when you do all those substitutions. So this is a helpful shorthand for us, but the important point is that you can't have any recursive regular expressions. Any questions there? So, getToken has an API. It takes in nothing; you can think of it as reading from the input string. And we want it to return a token.
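To make the substitution argument concrete, here is a small sketch of expanding named definitions into one finite regular expression. The `{NAME}` reference syntax, the `DEFS` table, and the `expand` function are assumptions made for illustration; they are not the notation or tooling used in the course.

```python
import re

# Named definitions; {NAME} refers to another definition (assumed syntax).
DEFS = {
    "LETTER": "[a-zA-Z]",
    "DIGIT": "[0-9]",
    "PDIGIT": "[1-9]",
    "NUM": "({PDIGIT}{DIGIT}*|0)",
    "DECIMAL": r"{NUM}\.{DIGIT}{DIGIT}*",
}

def expand(name, defs, in_progress=()):
    """Replace every {NAME} reference by its definition, recursively.

    Raises ValueError if a definition refers to itself, directly or
    indirectly, because the expansion would then never be finite.
    """
    if name in in_progress:
        chain = " -> ".join(in_progress + (name,))
        raise ValueError(f"recursive definition: {chain}")

    def substitute(match):
        inner = expand(match.group(1), defs, in_progress + (name,))
        return "(" + inner + ")"

    return re.sub(r"\{([A-Z]+)\}", substitute, defs[name])

decimal = expand("DECIMAL", DEFS)
print(decimal)                              # one finite regular expression
print(re.fullmatch(decimal, "12.50") is not None)

# The recursive definition from the lecture: ID = ID (NUM | DIGIT)
bad = dict(DEFS, ID="{ID}({NUM}|{DIGIT})")
try:
    expand("ID", bad)
except ValueError as err:
    print(err)
```

Using NUM inside DECIMAL expands to a finite string, while the self-referential ID never bottoms out, which is the distinction the lecture draws between shorthand and recursion.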
So what does it return on this string: 1.1abc1.2? Let's walk through it one by one. Could it be DOT? Why not? It's got to read from the beginning, right? And the first character here is a 1. Could 1 possibly match DOT? No, it can't possibly match it. So there's no way that's going to match. Could it match ID? No, because an ID must start with a letter. Right, exactly, and this is not a letter. Could it match NUM? Could it also match DECIMAL? We have a tie here. So what do we do? We give up and move to the next one? Why? Or we could stop. You have to continue to read the string until you've eliminated all of the possibilities. Or we could just stop and say it's a number. Actually, does 1 match DECIMAL? Is 1 in the language described by DECIMAL? No. No, but it could be. Yeah, right? Because it matches the start here. So one way to do it would be to say: the very first token that is matched, return that token. But does that actually give us what we want? Think about it like this. We have NUM defined like this, and I have an input string of 1234. That's my input string. So if we go with that logic, and say I have other tokens too, look at the first character. Does it match this token? Yes. So would we want to stop? Yes, in your scenario. In my scenario, yeah. So getToken would return a NUM. That part is a token, so then it starts from here. And what would this be? A NUM. And then another NUM, and another NUM. So does this make sense? What do we want NUM for? Why did we write NUM? Any number. Yes, any number, the whole number. Right? Yes, the strings 1, 2, 3, and 4 are each in the language defined by NUM. But so is the string 1234. And based on what we really want, abstractly we know this is a number.
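The difference between the two strategies on the input 1234 can be sketched as follows. This is an illustration only, assuming a single NUM token and Python's `re` module; both function names are made up for the example.

```python
import re

# NUM: a nonzero digit followed by any digits, or just zero.
NUM = re.compile(r"[1-9][0-9]*|0")

def stop_at_first_match(s):
    """Tokenize by returning a token as soon as any prefix matches NUM."""
    tokens, pos = [], 0
    while pos < len(s):
        for end in range(pos + 1, len(s) + 1):
            if NUM.fullmatch(s, pos, end):
                tokens.append(("NUM", s[pos:end]))
                pos = end
                break
        else:
            raise ValueError(f"no token matches at position {pos}")
    return tokens

def take_longest_match(s):
    """Tokenize by returning the longest prefix that matches NUM."""
    tokens, pos = [], 0
    while pos < len(s):
        best = None
        for end in range(pos + 1, len(s) + 1):
            if NUM.fullmatch(s, pos, end):
                best = end          # keep extending while prefixes match
        if best is None:
            raise ValueError(f"no token matches at position {pos}")
        tokens.append(("NUM", s[pos:best]))
        pos = best
    return tokens

print(stop_at_first_match("1234"))  # four separate NUM tokens
print(take_longest_match("1234"))   # one NUM token for the whole string
```

Both results are built from strings in the language of NUM; only the second agrees with the intent that 1234 is a single number.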
Right? And so we want getToken to return just one NUM, which would be this whole string. So yes, there's an alternate way you could do this, but this one makes more intuitive sense. Does this make sense with what our goals are, with what we're trying to do here? So here: 1.1abc1.2. Is it a NUM? Is it a DECIMAL? Could it be an ID? So based on what we just talked about intuitively, what do we think should be the first token returned here? 1.1 as a DECIMAL. 1.1 as a DECIMAL, right. And when we do that, you can think of that part of the input string as being consumed. getToken is saying, okay, I know 1.1 is a token and that token is a DECIMAL, so it's going to return DECIMAL. But the next time you call getToken, it's going to read from abc1.2, as if you had called it on the string abc1.2. Because it's trying to turn this input string into a sequence of tokens. So when it tells you that something is a DECIMAL, bam, it's made its decision. It's done with that input, and now it's going to start reading the rest of the input string. And when it starts reading from here, what's it going to be? Is it going to be abc as an ID? abc1. Why abc1? Just like we saw with 1234. Yes, abc is in the language described by the regular expression ID, but so is abc1. Is "abc1." in it? No. So we go back and say abc1 is an ID. And now if we call getToken, what would the next one be? DOT. DOT. And then a NUM. Yay. It seems kind of weird, because if you look at this and I ask you what tokens are here, you may say a DECIMAL, an ID, and then a DECIMAL. But that's because you can look at the end and say, oh man, if I just took that 1 away from the ID, I could make a DECIMAL instead of a DOT and a NUM. But lexing does not do that; there's no backtracking or anything.
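Repeated getToken calls on this input can be sketched as below. This is a hedged illustration, not the lexer handed out for Project 2: the `Lexer` class and `get_token` name are hypothetical, returning `None` at end of input and raising `ValueError` when nothing matches are assumptions (the lecture leaves those cases open), and ties between equally long matches go to the token listed first.

```python
import re

# Tokens in list order: ID, DOT, NUM, DECIMAL.
TOKENS = [
    ("ID", re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")),
    ("DOT", re.compile(r"\.")),
    ("NUM", re.compile(r"[1-9][0-9]*|0")),
    ("DECIMAL", re.compile(r"([1-9][0-9]*|0)\.[0-9][0-9]*")),
]

class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0  # everything before pos has been consumed for good

    def get_token(self):
        """Return (token_name, lexeme) for the longest match at self.pos."""
        if self.pos >= len(self.text):
            return None  # assumption: None signals end of input
        best_name, best_end = None, self.pos
        for name, pattern in TOKENS:
            # Only try ends strictly beyond the best so far, longest first,
            # so an earlier-listed token wins any tie.
            for end in range(len(self.text), best_end, -1):
                if pattern.fullmatch(self.text, self.pos, end):
                    best_name, best_end = name, end
                    break
        if best_name is None:
            raise ValueError(f"no token matches at position {self.pos}")
        lexeme = self.text[self.pos:best_end]
        self.pos = best_end  # consume; this decision is never revisited
        return best_name, lexeme

lexer = Lexer("1.1abc1.2")
token = lexer.get_token()
while token is not None:
    print(token)
    token = lexer.get_token()
# prints DECIMAL 1.1, then ID abc1, then DOT ., then NUM 2
```

Each call starts from the current position with no memory of earlier tokens, which is why the trailing `1.2` comes out as part of an ID followed by DOT and NUM rather than as a DECIMAL.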
It's very much a greedy algorithm: I figure out which of these tokens matches the longest, and then I say, bang, this is the token. Then I consume all of that and I say I'm never going to go back and revisit what I did before. This is always going to be a DECIMAL, and this whole thing is always going to be an ID. Okay. So this is called the longest prefix matching rule. Wait, the longest matching prefix rule; I should probably read it if I'm going to say exactly the thing and mess it up. The idea is that every time getToken is called, the lexer wants to find the longest token that matches at the current point in the input string, and it's going to say that's your next token. So what happens if we get into a situation like this: we have NUM, just like we always have, and then I have another token, let's call it INUM or something like that, which is digit concatenated with digit star, where NUM starts with pdigit. Now let's say I have the string 1234abc. This is my string, and I call getToken using these tokens. Which token is going to be returned? What does the longest matching prefix rule, I'll say LMPR, tell us? So, which token is going to be returned? NUM. Why?
Because it comes before INUM. But we're talking about longest. So NUM matches how many characters here, starting from the beginning? Four, up to this 4. And INUM matches how many characters? Right, so they're both the longest, and so we need a tiebreaker. The convention we're going to use in this class is to break ties by giving preference to the token listed first in the list. So whichever is higher up on the list of tokens, that's the token we choose if there's ever a tie. Now we need to go back, too far back. Now we need to figure out: okay, you were doing this in your head, tokenizing this 1.1abc1.2, right? But how do we actually do that? What's the algorithm, so that you can do it and the compiler can do it? We need to take this longest matching prefix rule into account. Let's see if copy-paste works. Okay, these are my tokens, and this is my input string. My input string is, what was it, 1.1abc, and we'll do 1.1 again. It looks weird, because it's going to parse differently even though it's symmetric. Okay, so try to think through what you were doing. This is the input string. We haven't called getToken yet, so we haven't read in or consumed any of the input. Looking at the first character of the input string, it's a 1. Okay, so which of these tokens match? NUM. DECIMAL? Does DECIMAL match? Remember, and this is something we've been using informally when we talk about regular expressions, this is basically part 2 of question 2 of your homework: when I ask whether 1 matches the regular expression DECIMAL, what I'm asking is whether the string 1 exists in the language described by DECIMAL. No. Is it in the language described by NUM? Yes. So that's a match, right? Okay, so we've only looked at this character here. So now if we stopped here, if let's say there was no more
string left; we haven't looked at the next character, and we have no idea what's left. So we're considering only this 1; we're not considering anything after it. Just looking at that 1, we'd say a match could be a NUM. We don't know what's next, but we know NUM, and we know its length is 1: we've matched a NUM of length 1. And we specifically know what doesn't match: ID and DOT don't match. But we have another category, the ones that have the potential to match. So what's the difference? Does that mean we know there's something else in the string, and therefore it could fit into another category if we continue reading input? We still don't know if there is anything else left yet, but yes, exactly. The way we want to think of it is that 1 has matched so far in, let's say, DECIMAL. It hasn't actually matched yet, but it could continue to match. So DECIMAL is definitely a potential. Why? One way to think of it is that there exists a string in the language described by DECIMAL that starts with 1, that has the prefix 1. If that's true, then I don't know what's coming afterwards, but I know there are strings that start with 1, so I could potentially match something. So can ID potentially match? What about NUM? We just said NUM currently matches, right? The string 1 is in the language defined by NUM. But are there other strings, longer than that, that start with 1? Yeah. So I'd say these are potential matches. Then I go and look at the next character. I'm going to try to draw a straight line; I'll get so good at this. Okay, so now I'm going to consider this next character, the dot. Now, do I need to ask whether the string "1." matches ID? Why not? It's not a potential match. We've just decided in that last step what could
possibly match, and we said that if ID can't match 1, if there are no strings in ID that start with 1, then there's no possible way "1." is a potential match. So the previous step actually limited the number of regular expressions we have to consider in this step. So now I ask about DECIMAL, and let's use both of these words: does "1." match, and is it a potential? We have a 1, we have a dot, but what do all the strings in DECIMAL have to have after that? A digit. So it doesn't match, but it's still a potential, right? Exactly. Okay, what about NUM? No. The string "1." is not in the language, so it doesn't match, and no strings in NUM start with "1.", so it's not a potential either. So there's nothing here in the match column: nothing matches "1.". So maybe we need to go back and say the longest one I could find was NUM with length 1. But we still have the potential for DECIMAL to match, so we need to keep going. Now we check here, and we're considering the string 1.1, but only against the regular expression DECIMAL. Does 1.1 match DECIMAL? Yes, and we're going to say its length is 3. Do we stop there? No, because it's also a potential match for DECIMAL. Exactly right. So 1.1 is also a potential match for DECIMAL, and we need to keep going. I'm going to check the next character, so the string is 1.1a. Does that match DECIMAL? No. Does it have the potential to match DECIMAL? No. Empty set here; we have nothing here. And so should we go on to the next character? Why?
There's no more potential, right? Nothing could possibly match 1.1ab. It actually doesn't matter what the next character is; we already know it's not even possible for that to match. So then we ask, which was the longest match that we've seen so far? DECIMAL with length 3. And so that's what we're going to return as the longest match we've seen: DECIMAL, length 3. And if there had been a tie between two regular expressions at the same length, we would choose the one that was higher up on the list. Now I call getToken again. What happens to my input string? Yeah, so we did read the a off of the input stream. Does it get put back by some magical convention, or did it get eaten? We read it, exactly. We read four characters: 1.1a. But we still want that a. Why? Because it might be part of the match for the next token. Definitely. But why do we know we can safely put that one back, and not the one before it? We didn't use it. Yes, we did not use it. The token is only length 3. So how many characters from our input stream compose this token? The first three. So by returning a DECIMAL and saying that the length is 3, we're saying, okay, great, these three are removed. It doesn't matter how many I had to look ahead to determine which one was the longest match. I know that the last token I gave you has length 3, these three characters, 1.1. So when I call getToken again, where am I going to be reading from? Yeah, but if you're reading from this stream, abc1.1, and I call getToken here, is the lexer smart enough? Because the condition we just went through with the a could have happened for an arbitrary number of characters: I read all of these and there was still potential, but we decided we didn't want them because the longest match happened to be of a different type. Is the lexer capable of
storing all of that and putting it back where it's supposed to be before we try to parse the next token? Yep. Yeah, it's not magic; it's pretty easy. You can think of it as reading everything in and then going backwards again by some number of characters. You can think about doing that even if you really are reading from an input stream. Because what if we had, let's say, a token called FOO, which is a star followed by b, and then I have BAR, which is, let's say, a star or a, and I have the input stream of however many a's you want, followed by c. If I call getToken on this, how can I differentiate between FOO and BAR? The b, exactly. If it ends with a b, then I know it's a FOO; if it doesn't, then I know it's a BAR. But I can only know that if I read all of those a's, and then I read this c and say: it doesn't match FOO and it doesn't have the potential to match FOO. It also doesn't match BAR, but the longest match I've seen so far is BAR, and so all of those a's are going to get consumed. You may have to go all the way to the end, because you can have things like a star b, so you have to read in a bunch and then go back. But here's the key: once you make the decision that it's a DECIMAL of length 3, that's the decision. So it is a little bit confusing that I said there's no backtracking. There is going back in the input, because you may need to read a lot of characters forward, but you never change your decision about what getToken returned. The calls are always independent. Basically, the lexer has no memory. It doesn't know that it just read a DECIMAL; it just knows that now the input is abc1.1. I saw some hands; let's re-raise them. How does the lexer identify the potential if it doesn't know the future? We don't know the future, so how does the lexer identify whether it's a DECIMAL or not? How do you identify, based on the
regular expression, that DECIMAL is a potential when it isn't matching 1 right now? The short answer is that the lexer can tell whether a token already can't match, or whether it's still in the process of potentially matching. You go over more of this in 355, when you talk about regular expressions mapping onto DFAs. You would know, based on the DFA, that you're still in the DFA, so you still have the potential to match and you haven't gone to a failed state yet. If you go to a failed state, then you know it's not possible. So now, what happens when I call getToken on this? Now I'm reading this guy, a. Could it be an ID? Wait, I need to do my match column first. Does it match ID? Yes. Does it match DOT? Does it match NUM? Does it have the potential to match ID? Does it have the potential to match DOT? No. The potential to match NUM, or the potential to match DECIMAL? So we have ID, and we have an ID of length 1. Perfect. Now I'm here, considering the string ab. It matches ID, an ID of length 2, and does it have the potential to match ID? And we can go again with abc. Does it match? Yes. Does it have the potential to match? Yes. So we do it again, and now we have abc1. Does that match ID? Right, because we said an identifier is a letter followed by any number of letters, digits, or underscores. Does it have the potential to match ID? Yes: with abc1, if there's another digit, underscore, or letter after it, it'll still work. So we say ID. We do it one more time, and here we get to this dot after the ID, and it can't match. Exactly, so we're done here. We say it's an ID of length 4, and we return that from getToken. So the next time we call getToken, what's the input string going to be? ".1", exactly. And we can step through this again and see that it would return DOT and then it would return NUM. So if I asked you a question, if I gave you tokens like this and then
we said, okay, what does getToken return if you keep calling getToken until it returns no more tokens, what is the sequence of tokens it would return? You would say DECIMAL, ID, DOT, NUM, and you'd want to show your work in exactly this form. I have a question. Yes? What would happen if you reach end of file? You would use your last match. So what if there wasn't a last match? Or you can think about returning an error if it doesn't match any tokens. What do you do if it doesn't match any tokens? Here's a good tip for taking tests. If we haven't really talked about what happens when tokens don't match, and nowhere do I say "in case the tokens don't match, return this token", and you're doing the test and you get to a point where you think, oh no, there's no token that matches, what should you do? Start over. Yes. Don't ask me if it's possible for there to be no tokens; I'd just have to say "do your best", that's what's fair. But based on the questions and context, that should be a sign. Cool, okay. So this is exactly what we did, and I'm only going to walk through it briefly, because we all just figured this out and did it. This is the way you show your work when doing this lexical analysis, when doing these getToken calls. So why is it important to show your work on something like this? Yeah, partial credit. But why is partial credit so important on this kind of stuff? What happens if you mess up on the first token? Everything else gets completely messed up. So if you just write down four tokens and the first one's right and the last three are wrong, that's 25%. But if you show this, then we can take points off here and say, okay, if we assume we start from this string, which is wrong, but the rest of it is correct from there, then you get points for all of that. Does that make sense? Seems fair. Cool. So the only other slight difference here from when I was doing it by hand, where it's a little harder with the room, is I do
like to keep track of the longest match that we've seen so far, and you update that as you go. But besides that, these are the exact same columns that we have used so far. We make it a little more clear here what we're deciding. We said ID of length 4, then the DOT, return that, and NUM, return NUM. Okay, so why is this important? Why is lexing important? We've actually reached the end of lexing. This is how you do it; this is it: define tokens, use the lexer, and the lexer spits out tokens. So here's why this is important. I don't know if you know Mariner 1. I actually don't know if it's a shuttle or a rocket. It's an unmanned vessel. Maybe I have it in my notes; let me check real quick. "Click to add notes." No. Okay. So in that example we just had, with the numbers and decimals, whatever happened to whitespace? Is whitespace important in your programming languages? No? Eliminate all the whitespace in your Project 1 or Project 2 C code and see what happens. Is it going to work the same?
Yes? Whitespace doesn't matter and the C compiler doesn't care? I still think there are cases, keywords followed by each other, maybe. The actual compiler doesn't just eliminate all whitespace when it lexes. Yeah, that's where it matters: function signatures, void and the function name. How do you know which one is which? Could you have a function name that starts with i? By the way, that's the preprocessor; the compiler never sees the includes. Yeah, interesting. Okay, so in most languages, if you eliminate all whitespace it would be a problem. Languages like Python, who programs in Python? Some people. There whitespace is actually significant; indentation is how you define blocks. Also, I'll say, if you don't use whitespace for indentation when you have braces, you're a monster, because someone is going to have to read that code. If you bring it to my office and it's just gobbledygook, all on one line, if statements and while statements, that can be a source of errors, because you're not indenting; indented code is easier to read. So, I don't think it was a shuttle, but an unmanned space thing. It exploded. Have any of you written code that caused something to explode? You have? You raised your hand. No? Yeah, physically explode, not virtually explode; cause a physical object to explode. That's hardware stuff, though; have you done that in software? Most of the time it doesn't happen. This thing exploded. Why? Well, it's because of lexical analysis. So in most languages whitespace is kind of significant, but sometimes not. Yes? So whitespace is important in C because of the space between the type name and the variable name. Anyways, okay, these kinds of things are important. In Fortran, all whitespace is ignored completely. What's the difference between these two lines? I mean, this is not a deep question. What's one difference between these
two lines? Literally a one-character difference. Have we all made one-character mistakes before? Did they ever explode a space shuttle? Probably not. Most of the time when we make a one-character error, either the compiler will stop us and tell us what we did wrong, or it will cause some kind of error, or the program will be mostly the same but something is slightly wrong. In this case we have two completely different lines of code. The second one sets a variable called DO15I equal to 1.100, and the first one does a for loop for I from 1 to 100. So they changed a for loop into a variable assignment, it still compiled and ran, and it caused a spacecraft to crash because of one character: lexing caused it to be parsed as a variable assignment instead of a for loop. Isn't that a typing problem? The type system doesn't even come into effect here. Because why did it get typed as an integer instead of a decimal, or whatever, in that case? Oh sorry, it's a decimal, whatever DO15I is. I'm not sure how Fortran's type system works. The main problem is really that they didn't want to create a variable; they wanted to do a loop, but instead they initialized a variable. something across