So, thanks everyone for being here today. Let's see. Okay, where we left off, sorry. Okay. For those of you that are wondering, homework one will be posted today. So, how will you find out when homework one is posted? The course site. The course site, that's one way. The mailing list. The mailing list. Yeah, it'll be posted on the mailing list, right? That's part of why I have the mailing list, so I can tell you all about these things. So, make sure you're checking it. Make sure it's delivering emails to you. All that kind of good stuff. All right. Okay, so we already talked about this. We talked about all the different types, the semantics of regular expressions, right? So, to refresh everybody's memory, we're trying to find a way to define a concise language that can express the tokens that we want to extract from these sequences of characters in a string. Okay, let's do something. We're going to define a letter as any character from a through z, lowercase, or A through Z, uppercase, right? Those dots are not actually part of the regular expression we're defining, but they're signifying that we're going from small a to small z, and from uppercase A to uppercase Z. Then we'll define digit as 0 through 9, right? So, a digit is anything that is a numeral 0 through 9. Okay. Another great thing about regular expressions is we can combine them very easily to do the things that we might want. So, I think we talked about this maybe a while ago, but just to refresh everybody's memory: what is the pattern for a variable name in, I think, C or C++ or Java? What is it? Letter digit. Letter digit? What does that mean? You can describe it. You don't have to write it. Okay. It's like you have to have this one letter, and then you can have digits. Okay. So, it has to start with a letter. It can't end with a digit? Is that in Java or...? I don't know. It can end with a digit. It can't start with a digit. Right, exactly. So, can you have exclamation points or parentheses in your variable names? I don't know. No. Yeah. 
So, we can actually, just using these two regular expressions, right, letter and digit, write that regular expression. So, what's that regular expression going to look like? We'll call it an identifier, right? That'll be our general term for variable names or function names, anything where we're going to identify something. So, how do we write that as a regular expression? How do we capture that it starts with a letter? We say... Yeah. Letter, union, digit, and then... Almost. Okay. So, the unions, right, remember, are in the semantics, in how we define the meaning of the regular expression, right? But here we want to write the regular expression whose language is all valid identifiers. Right. So, we can only use the dot, the concatenation operator; the OR operator, the bar; or the star operator. I mean, we can use parentheses too. Letter OR. Letter OR. So, if we have that at the start, right, that means you can start with either a letter or a digit. Letter AND, like, every combination of letter or digit after the letter. Yeah. So, the first thing we need is a letter, right? We want to say that, okay, the first thing of an identifier is a letter. And then we want to concatenate that, so the dot operator, followed by letter OR digit... Any combination. Any combination, which would be star. Yeah, exactly. And we can also do OR. We'll add the underscore next, so you can add the underscore in there. Like that. So, it looks something like this: letter, then, in parentheses, letter or digit or underscore, star. So, what's missing from this? The dot. The dot operator? The dot operator where? In between the letter and letter. In between this letter and the parentheses? Yeah. Yeah, there should be a dot there. Why is there not a dot there? This says it's implied, but that also... You're reading the slide. I know. So, why is there not a dot there? Because I'm lazy? It just makes sense. Because it's like implied multiplication. Yeah, so that's the convention we're going to use. So, we've left out the dot here, right? 
And so basically the convention we're going to use is that concatenation is implied when we have two regular expressions next to each other. It's just like multiplication: when you're taking a math class, do you always write x star y to mean x multiplied by y? Or 5 star y? No, you just write 5x or xy, right? And that means x times y. So, how come this isn't L, E, T, T, E, R all concatenated together? Because we're ensuring that it starts with a letter, or at least that something exists. Because since we have the letter or digit or underscore under the Kleene star, that part could be the empty string. Correct. The question, though, is why does this regular expression not define something that starts with an L, followed by an E? Because letter is a token? Yes, we've defined letter as a name, right? We've defined it as a regular expression. So, when you see letter here now, we're not talking about L concatenated with E, concatenated with T, right? We're talking about the regular expression we defined. And, you know, you can just drop that whole definition in here for every instance of letter, right? And it's exactly the same regular expression. It just sounds like a lot more work. It is a lot more work. Okay. So, let's test our understanding. Does this match the token ID? Yes. It's a letter followed by any combination of digits, letters, and underscores. So, the first character matches this letter here, right? And the rest is, what, any combination of letters, digits, or underscores? Letters, digits, underscores. What about this? No. Oh, why not? It has to start with a letter. Start with a letter. Yeah, exactly. So, if you think of the language that ID describes, right, that this regular expression describes, and you wrote out every single string in that language, does this string here ever appear in that set? No. Exactly. 
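The identifier definition discussed above can be tried out with a short Python sketch. This is only an illustration, not anything shown in the lecture: the names `LETTER`, `DIGIT`, `ID` and `is_identifier` are made up here, Python's `re` syntax stands in for the lecture's notation, and the test strings are hypothetical since the slide's examples are not in the transcript.

```python
import re

# Helper definitions, mirroring "letter" and "digit" from the lecture.
LETTER = r"[a-zA-Z]"
DIGIT = r"[0-9]"

# ID = letter (letter | digit | _)*   with concatenation implied by adjacency.
ID = rf"{LETTER}(?:{LETTER}|{DIGIT}|_)*"


def is_identifier(s: str) -> bool:
    """Return True when the whole string s is in the language of ID."""
    return re.fullmatch(ID, s) is not None


print(is_identifier("count_2"))  # starts with a letter
print(is_identifier("2count"))   # starts with a digit
```

Because `LETTER` is substituted into `ID` as a whole sub-expression, it behaves like the named definition in the lecture rather than the characters L, E, T, T, E, R.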
Because it starts with a number, and all of the strings that this regular expression defines will start with a letter, right? Because of how we've written the regular expression. Questions? Wait, what is it you guys hear? It's just the buzz from the speaker. Oh, okay. Cool. That's good. Okay, so then the next question. It seems pretty easy, right, to write a regular expression to define a token that we want to capture in our language. We were able to define the identifier, the ID token, that says, okay, it starts with a letter and is followed by any number of letters or digits or underscores. How do we define a literal number in our language? How would you write a regular expression to do that? Yeah. Exactly. So one way you could say it is, well, a number is just digit star, right? Any number of digits, that's a number. So I want you to think about whether that's correct. Does this match the literal 132? Yeah, it does, right? Does it match the empty string? Yes, it does. It does. Probably not exactly what we want, right? Because the empty string is not a literal number, right? Okay, so we want to avoid that. So then what can we do? Could you concatenate digit with digit star? Yeah, so digit concatenated with digit star, right? Just put that in front of it. And here we're using our convention again, not writing the dot operator but implying it. So now does 132 match? Yes. Yep. What about zero? Does that match? Yep. Yes. What about this one, all zeros? Yep. Do you want that to match? Technically, it's zero. It depends on how much memory you have to spare. If you're doing this on an Arduino, you probably don't want something like that to be valid. Aren't those both the same zero? Yeah. It probably could be. Could be? I might be parsing that out in some way, using pieces of it. I don't know. I see it differently. I don't see it as a zero. 
Right, it seems like a mistake. Right? Like, maybe the programmer originally had a one in front here, and accidentally deleted that one and left all these zeros, right? You know, one property you'd maybe want your language to have is that there's a canonical value for everything, right? You don't want a thousand different ways to say the literal zero, maybe only one. Or if you think about a language like C or Java, starting with a zero means it's an octal number, which is a completely different number system. So, you know, maybe you don't want, well, that's not really the case here. But let's try and get rid of this. Can we get rid of this using regular expressions? I think so. So how would we do this? What would our regular expression look like? First we're going to define a little helper regular expression, right? We're going to define something that's just one through nine, pdigit. So then how are we going to use this to update our description of a number? Concatenate pdigit with digit star. Yeah, so we say it has to start with one through nine, and then it can be any number of digits. So now we say, okay, it's got to start with one through nine. So does 132 match that? Yes. What about zero? No. Well, pdigit can be an empty string, right? Can it? Can it? So what's the language described by pdigit? Oh, yeah, never mind. Yeah, I don't think zero would be possible, right? Yeah, so the language described by pdigit is the set containing the string 1, the string 2, the string 3, 4, 5, 6, 7, 8, 9. And when you concatenate pdigit with digit star, there's an epsilon in the language of digit star, but pdigit does not have epsilon in it. So do the strings that this regular expression describes include epsilon? No, epsilon is not one of them. 
They're all at least length one. Wait, so now digit is what? Is zero included in digit? Yes. So then that would be okay. A zero there, later on, is okay, but the problem is, we're saying it has to start with one through nine. Does this start with one through nine? No. Okay, so it's not okay. Yeah, then it doesn't match. That's quite good. You could just take that whole pdigit digit star, and then put an OR, and just one literal zero. Yeah, and I guess the other thing is, does it match this? No. A bunch of zeros. We're making some progress. Getting there. Exactly. Okay, so now I'm being a little more explicit in how it's written, right? Pdigit digit star, or zero. So we've got 123. We've got zero. We've got all zeros. What about this last one? Is this matched? Nope. No. No? So does 123 match? Yes. Yes. Does zero match? Yes. Are all the zeros matched? No. No. And is this one matched? No. No. The one with letters in it is not in the language of our regular expression. Yeah. Sorry, how does the single zero work? Up here at the top, the OR. Yeah. Right? So it's saying this whole thing, or zero. The entire thing. Exactly. Yeah. So basically, we kind of just put in a special case that says, okay, we'll match zero. It has to be exactly zero, or it has to be one through nine followed by zero through nine. Any number of them? It says it's got to start with one through nine. Mm-hmm. Followed by any number of digits, zero through nine. So that matches the 123, right? The 1 matches pdigit, and the 2 and 3 are matched by digit star. So that matches all those digits. Okay. So then zero is okay here. Right. Exactly. So 123 is okay. Zero is okay. All zeros is not matched by this regular expression. And 1901ADB, or whatever the letters are, it doesn't matter, it doesn't match, right? Because there are letters in there. That's not in the language of this regular expression. All right. So now we've defined a constant integer number, right? So we can use this token. 
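As a rough check of the three attempts at a number token, here is a small Python sketch. The names `NUM_V1`, `NUM_V2`, `NUM` and `matches` are illustrative only, and Python's `re` syntax stands in for the lecture's notation.

```python
import re

DIGIT = r"[0-9]"
PDIGIT = r"[1-9]"  # the "one through nine" helper

NUM_V1 = rf"{DIGIT}*"                 # first attempt: also accepts the empty string
NUM_V2 = rf"{DIGIT}{DIGIT}*"          # second attempt: still accepts 0000
NUM = rf"(?:{PDIGIT}{DIGIT}*)|0"      # final version: pdigit digit* | 0


def matches(pattern: str, s: str) -> bool:
    """True when the whole string s is in the language of pattern."""
    return re.fullmatch(pattern, s) is not None


for candidate in ["123", "0", "0000", "1901ADB", ""]:
    print(repr(candidate), matches(NUM_V1, candidate),
          matches(NUM_V2, candidate), matches(NUM, candidate))
```

Each refinement removes strings from the language: the second drops the empty string, and the final one drops leading zeros while keeping a single literal zero as a special case.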
Anytime we see a number, we'll be able to say, hey, yeah, this token is a number. And we can differentiate that from identifiers, which is very good. Okay. Let's make it slightly more complicated. How do we do a decimal number? I think we'd have a number, a dot, and a number. Yeah. So that'd be the way to do it, right? We take the number that we already have. We know a decimal is pretty much a number, a dot, and a number. Right. A number can't have leading zeros, though. Okay. One big question first. Why is there this backslash here in front of that dot? To escape that dot. To escape it, yeah. So it's an actual period, right? The dot operator is a special symbol in our regular expressions. So if we want to specify a regular expression that matches the character dot, the symbol dot in the alphabet, we're going to preface it with a backslash. It'd be the same thing if we wanted to use a bar, right? We'd put a backslash in front of it, too, or if we wanted to match parentheses. Yeah. But the problem is we specifically made num so that it couldn't have an excessive amount of zeros in it, whereas that's actually very important when you're talking about decimal math. Yes. Yes. Okay, we'll get to what happens there. So this escaping business seems very easy. You just add a backslash. But then how do you match a backslash? You have to escape the backslash, too, right? Which is very familiar if you're doing, you know, C programming and using C strings, or Java strings, or string literals. You have to deal with this escaping. It can get quite crazy. And actually a lot of security problems come from this escaping business, but that's another issue. Okay. So let's look at this. Does this match 1.5? Yes. So we have a number. Does 1 match num? Yes. Yeah. It matches pdigit followed by zero or more digits, and here it's zero digits. Followed by dot, followed by a number, the 5, which is a pdigit. That works fine. 2.10? Yep. Yeah? Mm-hmm. Two's a number. 
10's a number. A dot in between works. Good. What about 1.01? Nope. Why is that one no good? Why is that? Pdigit doesn't include zero, so after the dot it can't start with zero. Exactly. We explicitly wrote our num so that 01 does not match. Right? We said that a num has to begin with one through nine. Right? So if we just use this definition, num dot num, we can't actually match the decimals that we want to match. How do we fix it? What did we use the last time? What did we say? Instead of using num, could we use digit for the second one? Yeah. So let's say, instead of num, okay, we just want digits, right? We want digit star, zero through nine, any number of digits. Are you following this? Mm-hmm. Okay, that's fine. Cool. Okay, so then do we match 1.5? Yeah. Do we match 2.10? What about 1.01? What about "1."? Unfortunately. Yes. So close. No. No? Well, the dot's not followed by a digit. It's followed by digit star, and digit star can be zero or more digits. So digit star includes epsilon. Oh, so then it passes? Exactly, yeah. So this "1." is in the language described by this regular expression. Or any num followed by a dot. So we can do what we did before. Yeah. Digit, digit star. Yeah, so we can say, okay, well, let's change it. We'll add digit before the digit star. So a decimal has got to be a number, followed by a dot, followed by a digit, followed by zero or more digits. So does 1.5 match that? The number, the dot, the digit, and this digit star goes to epsilon, so it doesn't match anything. 2.10? 0.01? Does that match? Yeah. Yeah. Digit includes zero, right? Digit includes zero. Yes. Digit includes zero, pdigit does not include zero. Okay, is "1." in the language of that regular expression? No. No. Right? We've said, okay, great, we've gotten rid of that. What about this? Nope. 0.00? Well, wait. 
The number does include zero. A number can be the one zero, right? Because num is pdigit digit star, or zero. The dot matches here, the next zero matches the digit, and digit star matches the last zero. Do we want this? Yes. Possibly. Yes? How else do you show significant figures? Yeah, so one argument for this would be, well, if you're using this for scientific calculations, right, you want to specify the zeros because that will specify your significant figures. Do computers normally work like that? No. No, right? Because the literal is just defining a value, and it's going to be translated to whatever, a Boolean or float, sorry, not Boolean, a double or a float, depending on the size, right? Yeah, so we got rid of all zeros in the other one, right? So do we want this? Might as well get rid of it. Yeah, might as well get rid of it, right? Because, you know, you can keep adding zeros here. It's not really what we want, though. But we'll stop there. I don't know, maybe we do want this, because it's getting kind of ridiculous, right? But you can see how you can keep trying to refine these to restrict the strings that are matched, so that the parser knows exactly how to understand and interpret a literal decimal, and the programmer knows how to write and express a literal decimal, right? These are very important things. It seems like you can never quite get it perfect. It's tricky, yes. So there's an interplay here, right? We could get into the expressiveness of regular expressions versus other kinds of languages, but that's really more for your 355 class, so we don't touch on that here. But yeah, there's an interplay between how precise you can get these, versus how correct they are, versus how easy they are for somebody else to understand. 
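The three versions of the decimal token can be compared with the following Python sketch. The names `DEC_V1`, `DEC_V2`, `DECIMAL` and `matches` are made up for illustration; `\.` is the escaped literal period, as discussed in the lecture.

```python
import re

DIGIT = r"[0-9]"
PDIGIT = r"[1-9]"
NUM = rf"(?:{PDIGIT}{DIGIT}*|0)"

DEC_V1 = rf"{NUM}\.{NUM}"               # num . num        : rejects 1.01
DEC_V2 = rf"{NUM}\.{DIGIT}*"            # num . digit*     : accepts "1."
DECIMAL = rf"{NUM}\.{DIGIT}{DIGIT}*"    # num . digit digit*


def matches(pattern: str, s: str) -> bool:
    """True when the whole string s is in the language of pattern."""
    return re.fullmatch(pattern, s) is not None


for candidate in ["1.5", "2.10", "1.01", "1.", "0.01", "0.00"]:
    print(candidate, matches(DEC_V1, candidate),
          matches(DEC_V2, candidate), matches(DECIMAL, candidate))
```

The final version still accepts `0.00`, which is the case the lecture leaves open as a design decision.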
So that's the job of the language designer when they define the syntax of the language: saying, okay, how do I specify it? And we didn't even talk about how you specify hex values or binary values or octal values, or how you define constant string literals, right? Those are also interesting. How do you do escapes inside string literals? Some languages have multiple ways to define string literals. Python has single quotes, double quotes, triple double quotes, all kinds of crazy stuff. And all that has to be defined, not by the programmer, but by the language designer. Okay. So to recap what we want to do: we're trying to take a sequence of bytes and turn it into a sequence of tokens. So we've seen how to define tokens. How do we define tokens? There's rules. Rules using what? Regular expressions. Regular expressions. Yeah, exactly. So we define some regular expressions. These define the tokens in our language. And our goal is to take this series of bytes and turn it into a sequence of tokens. So now, not only in class, we're going to think about how the lexer actually works, right? We're thinking about a piece of software, and the job of this lexer is to turn this series of bytes into a sequence of tokens. So what we're going to say is this lexer has a method called getToken. And getToken reads from the input stream and returns the next token, the token that matches the start of the input. And this is the lexer. The lexer is doing its job here of looking through the series of bytes and figuring out, based on the tokens that are defined, which token matches the first part of the input stream. And, you know, we'll use this method in class, and we're going to use it also on the programming assignments. So for the next programming assignment, we'll give you this lexer, and you'll have to use this getToken method and understand how it works. So does it find the tokens in a string? Sure, good question. 
What do you mean by find? Does it look and say, okay, this is a pattern, this matches, it's a token? Yes. So yes, and we're going to see exactly how it works, because we have to understand how it works, right? This class is about dispelling magic, right? So it's not just something that we give you. It's a thing that you know how to do, and you're going to understand exactly how it works and how it decides which token to return. Okay, so our tokens are specified using regular expressions, right? And tokens are what we want to get back. So our lexer is taking in bytes. They can be from a file, from standard in, from the network, whatever. It's taking in a sequence of bytes and turning those into tokens, so that we get something like, hey, there's a number, followed by an identifier, followed by a number, followed by an operator, followed by an ID, followed by a decimal, whatever. Everything that you, the programmer, or that the other parts of the program interpretation process, need to know to understand this language. So let's say we have these tokens, right? We have our identifier token, ID: letter concatenated with letter or digit or underscore, star. Okay, to make things easier, we'll define a specific token called DOT, which is just the dot. Then we'll have a number, NUM. We define our number exactly as we did before, right? Pdigit concatenated with digit star, or zero. And DECIMAL as num, dot, digit, digit star. So the question is what token would getToken return on this specific string? Remember, the lexer is essentially consuming the input, so it wants to return the first token, the token that matches the start of this string. What would it return? Just one token? If you called it once, right. So this is the thing. Our API is getToken: we call it, and it returns us whatever token, ID, DOT, NUM, or DECIMAL. We're assuming that the order it checks in is the order listed? Yes, so the order always goes left, well, top to bottom in this case. 
Like it checks ID before DOT. Maybe? I don't know. We're not getting into how it actually works yet. We're getting into how we think it should work, yeah. Right, so why num or decimal? Where does num match? Num matches the what? The one? Well, num matches just the 1, right? And decimal matches what? 1.1? Okay, so num, dot, digit, digit star. Would it return num? Num, decimal. I think it doesn't match any of the tokens above. You have to do like a combination. It doesn't match any of the tokens above? I don't think it does. I think you have to do a combination of them to match the exact string that was given. Okay, yeah, so why? Because you have ABC in the middle. So you have 1.1 and then you have ABC. That is given by letter or something like that, which is an ID, for example. You would combine num and decimal with ID, for example. So you'd have multiple tokens, right? Yeah, so... There's really a bunch of ways to interpret that. Exactly. So every time we call getToken, it's going to return us the next token. Remember, we want to turn these bytes into a sequence of tokens. So our first call to getToken returns a token, and the second call, if there's a token left, should return another token. And then we call it again and it will give us another token, right? So it's doing it in series, essentially going through this string. So yeah, eventually we will want multiple tokens from here. The question is, when we call it the first time, is it going to return num? Is it going to return decimal? Is it going to return ID? Which one, and why? Decimal. Why decimal? Because if it's going along left to right, it can do 1.1, that is a decimal, until it hits that letter, and then it's... It could return num and then return dot. It really goes back to what he was saying: which one does it look for first? It takes the longest part of the string that's a token, before it's not a token anymore. Yeah, so in some sense we want our lexer to be greedy, right? 
We want it to try to match as much of the input as possible, and that's the token we want it to return. Right? So we want it to first return decimal, because we want it to consume as much as possible, right? Otherwise it doesn't make sense how you would program with it. If you had any token that was a subset of another token, right, you could never get the longer token back. So if we had num and decimal here, well, all decimals begin with a num, so you'd always get that num before ever getting the whole decimal. Right? And the reason why we wrote these tokens is because they are important to our language, right? The decimal and the num are important when we interpret our programming language. So yeah, we want it to return decimal. And after it returns decimal, what do we want it to return? ID. ID, and then how much would it consume? It'd go up to the dot, up through the one. Right. So ABC1 is an identifier, right? Exactly. And then what would it return? Dot. Dot is the token that it matches, and then what would it return? Num, the 2. Exactly. Okay, cool. So you've just discovered and learned on your own the longest matching prefix rule, which is a very long way of describing what you just said. What it says is, when we're lexing and we have these tokens, our lexer always wants to match as much of the input string as possible. Sorry, it returns the token that matches as much of the input string as possible. And that's the key. This is the only thing this thing does. Yeah? What if you wanted ABC and 1.2? Oh man. So, yeah, in this language, with the tokens we've described here, it wouldn't be possible from this string. You would need to put something in between. Yeah, you could put a dot in between here, and then that would separate it. You'd have an ID of ABC, a dot, and then a decimal of 1.2. In normal programming languages, how do you separate your tokens? Space. So you'd have to define space tokens. Exactly. 
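The longest matching prefix rule on the lecture's example string can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the lexer handed out for the assignment: `TOKENS`, `get_token` and `tokenize` are hypothetical names, and the sketch relies on `re.match` returning the longest possible prefix for these four particular patterns (which holds because their alternatives never compete for the same prefix, but is not true of Python regular expressions in general).

```python
import re

# Token list in priority order: ID, DOT, NUM, DECIMAL (top to bottom).
TOKENS = [
    ("ID", re.compile(r"[a-zA-Z](?:[a-zA-Z]|[0-9]|_)*")),
    ("DOT", re.compile(r"\.")),
    ("NUM", re.compile(r"[1-9][0-9]*|0")),
    ("DECIMAL", re.compile(r"(?:[1-9][0-9]*|0)\.[0-9][0-9]*")),
]


def get_token(text, pos):
    """Return (name, lexeme) for the longest token starting at pos, or None."""
    best = None
    for name, pattern in TOKENS:
        m = pattern.match(text, pos)
        # Strictly longer wins; on equal length the earlier-listed token stays.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


def tokenize(text):
    """Call get_token repeatedly until the input is consumed."""
    pos, out = 0, []
    while pos < len(text):
        tok = get_token(text, pos)
        if tok is None:
            raise ValueError(f"no token matches at position {pos}")
        out.append(tok)
        pos += len(tok[1])
    return out


print(tokenize("1.1ABC1.2"))
# [('DECIMAL', '1.1'), ('ID', 'ABC1'), ('DOT', '.'), ('NUM', '2')]
```

Note that the result is DECIMAL, ID, DOT, NUM, not the "decimal, identifier, decimal" a human reader might expect, because the identifier greedily consumes the `1` after `ABC`.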
You'd define white space as one of your tokens. So you'd say, okay, everything is like this, and then you'd have another token that says white space, and that way you could separate things by white space and the lexer would be able to understand that. That's a token too? Yeah, exactly. White space is just a token, right? Which makes sense, because you can program with white space, right? As we showed on the first day, right? So how many different types of white space are there? A bunch. Infinite? A bunch, yeah, definitely a bunch. There's spaces, there's tabs, there's new lines. There's a lot of extra ones that aren't used anymore. Line feed, carriage return, LF, what is it, CR? Backslash r. The DOS line ending is different from the Linux line ending. There's like four just right there. There's a bunch. What about comments? Oh yeah, slash slash. Slash slash for the start of a comment. Slash star, star slash. Yeah, exactly. Cool. So this slide is just describing the rule in more detail, and there's actually not much more detail. I guess the only thing to think about is this. We decided we want the longest length, right, the longest part of the input string that matches a token. What if I have two tokens that match the same length? Whichever one was written first. The answer is arbitrary, right? But what we'll use in this class, and it will be specified, is that the one higher up on the token list is the one we want, but only if there's a tie on length, right? So we always want to get as much as possible. If that doesn't decide it, then we go by whichever one's higher up on the token list. Does that make sense? We've got to break the tie somehow, right? So I didn't get it. I mean, in this example right here, I don't think it's possible to have both of them match? Oh. Yeah, no, I don't think it's possible to have both match here. We'll see examples where it's possible, where it could be an ID or something else. 
I think we could have identifiers start with... not start with these. Oh, I get it now, okay. Exactly, yes. Which one do you pick then? Exactly. But it's only after the length is the same. You always want to go as far as possible, getting the most length. And then if there's a tie, you choose the one that's higher up on the precedence list. Question? Yes. So it has to go through and try every token on the string, and if two match, say, okay, which one of those came first? So the more tokens you have, the longer it takes. Yeah, so we'll see there's actually a really easy way to do this by hand, so that's what we're going to look at. And, you know, you actually can do a little bit of the reverse of what you said. Instead of, for this string, going through each of these tokens to see which matches, you go through the string a character at a time, keeping track of which ones currently match. And then when you get to a point where none of them can match anymore, you go back to the last match you had, and if there was more than one at that length, you ask which one was higher precedence. That's your token. Cool. So the idea here is we're going to start from the next input symbol. This just means we're consuming that string. Every time we call getToken, we're going to move down through that string. And we're going to find the longest string that matches. Longest matching prefix: the name comes simply from that. And we'll break ties by giving the token listed first. 
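The character-at-a-time procedure described above can be sketched with one small hand-written state machine per token. Everything here is an assumption made for illustration (the state numbering, `classify`, `TOKENS`, `get_token`, `tokenize`); the lecture does not show an implementation. A token is "potential" while its machine is still alive, "matching" when its machine is in an accepting state, and the longest match so far is remembered so the lexer can fall back to it.

```python
import string

def classify(ch):
    """Map a character to a symbol class used by the transition tables."""
    if ch in string.ascii_letters: return "L"
    if ch in "123456789": return "P"
    if ch in "0_.": return ch
    return "?"

D = "P0"  # shorthand: any digit class

def table(*edges):
    """Build {(state, cls): next_state} from (state, classes, next_state) edges."""
    return {(s, c): t for s, classes, t in edges for c in classes}

# (name, transitions, accepting states), listed in priority order.
TOKENS = [
    ("ID", table((0, "L", 1), (1, "L" + D + "_", 1)), {1}),
    ("DOT", table((0, ".", 1)), {1}),
    ("NUM", table((0, "P", 1), (0, "0", 2), (1, D, 1)), {1, 2}),
    ("DECIMAL", table((0, "P", 1), (0, "0", 2), (1, D, 1),
                      (1, ".", 3), (2, ".", 3), (3, D, 4), (4, D, 4)), {4}),
]

def get_token(text, pos):
    """Return (name, length) of the longest match starting at pos, or None."""
    potential = {name: 0 for name, _, _ in TOKENS}  # every token starts alive
    longest, length = None, 0
    while pos + length < len(text) and potential:
        cls = classify(text[pos + length])
        length += 1
        # Advance each live machine; machines with no transition drop out.
        potential = {name: trans[(potential[name], cls)]
                     for name, trans, _ in TOKENS
                     if name in potential and (potential[name], cls) in trans}
        # Record the first-listed token that matches at this length.
        for name, _, accepting in TOKENS:
            if name in potential and potential[name] in accepting:
                longest = (name, length)
                break
    return longest

def tokenize(text):
    pos, out = 0, []
    while pos < len(text):
        found = get_token(text, pos)
        if found is None:
            raise ValueError(f"no token matches at position {pos}")
        name, length = found
        out.append((name, text[pos:pos + length]))
        pos += length
    return out

print(tokenize("1.1ABC1.2"))
print(tokenize("1.A"))
```

On `1.A`, DECIMAL stays potential through `1.` but never matches, so the lexer falls back to the remembered NUM of length one, which is the reason for keeping the longest-match column.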
So we'll go through an example of this now I'll probably maybe do another example just to kind of show some stuff but the idea is we're going to make a table and so and I I don't know maybe it seems tedious to do it like this but this is the way you can do it it's I mean it's very easy eyeballing it will usually not work because for instance back to this example with 1.1 ABC12 when you look at this as a human you think oh it's going to be a decimal an identifier and a decimal because that's how your brain parses it because you've been used to looking at these things and separating things in different categories but really that's not how the the lecture is going to parse it and so this is a very easy way that you can do to show and demonstrate exactly what's happening in every step and why you chose why you chose a certain string over another one so this is going to be very in-depth so do I want you to do it like this on your homework and your midterms and exams yeah why? because if we show our work it's like a map to follow exactly, if you just look at it and write down some tokens and they're wrong let's say you made a mistake on the very first one I don't know that it's wrong so you get 0 but if you show this table and I can see your derivation process so I can see oh maybe you messed up getting the token for the first input but all the other tokens after that given that input were correct so then you can get partial credit so that I can understand your thinking so this is a way of you describing your thought process when going through and doing this lexic okay so we have three columns here on the left we have the string the left column we have the string the next column we have matching the next column we have the potential so these are the potential tokens that can match and on the right we keep track of the longest match okay so you can see maybe the caret is before the string so we haven't parsed anything yet we haven't called getToken so what regular 
expressions could possibly match at this point anything anything right that's why I have all there any token could possibly match because we haven't even started looking at anything so it's as if the stack is full with all the potential tokens yeah exactly and we're going to every character we're going to look at every token and potential to see if it could possibly match so we're going to examine the first token and we're going to say okay the first symbol could potentially match one no more decimal does decimal match the string one no potential it's a potential to match so it could potentially match it matches the first character there but we haven't reached the end of that regular expression we don't know if that string is actually in there or not does identifier match no does dot match no we check we check both of those we check decimal so we keep decimal and potential and then we check num so does num have the potential to match yeah yeah right because number it has the potential to match right because it's first one through nine which matches and then zero or more because of that zero or more right it has the potential to continual match and um and remember we're not looking at the rest of the string now right we're only looking one symbol at a time but num also matches right now right so we could stop if we stopped here we'd say that num matches this first character so that's why we put num in the second column then we go to we're going to look at the next character the dot character do we look in that test for identifier to see if identifier matches one dot why not exactly it's not one of the potential ones so we're not even looking at it we don't care about it we've already decided it doesn't match one there's no way it can match one dot right so of decimal and num which ones have the potential to match one dot decimal it's going to be currently matching so it's actually it's a good point it should not be in matching right here I'm going to update this why 
not? Because 1 dot is not a decimal. Exactly, 1 dot is not a decimal. Before, 1 matches num, because 1 by itself matches that regex; if we stop right there, that matches. But when we go to decimal: does 1 dot match decimal? It has to have a number following it. Exactly, it has to have a number, so it still has the potential to match there. So what are we matching right now? Nothing. Exactly, we're matching nothing at this point. So in order to jump to the next token, it has to match nothing twice? Well, we have to keep on trying, because we don't know where this is going to end up. It could be that decimal never matches; maybe there's an A right after the dot or something. So we don't know right now. If it's an A, it jumps to a different token? Yes, but we keep track of it in the fourth one. So up to this point, num no longer matches, but it did match, right? A num of length 1. So we want to keep track of that, because we don't know what's going to happen when we go forward. We may fail; we may not match any more tokens. And so we want to know what was the last token that we saw that was valid. So we keep track here: num. What's the thing next to num? The token itself? No. Close, that's good. The number of tokens found? The number of... yeah, the length of that token, and it's length 1. Okay. So now we're going to move one more character over and look at 1.1. So which tokens are we evaluating? Decimal. Just decimal, right? Exactly. We've already decided, by looking character by character, that 1 dot whatever could only possibly match a decimal; no other regular expressions or tokens that we have could match that. So does 1.1 match a decimal? Yes. Yes, it matches, so decimal is going to be in the matching column. Does it have the potential to match? It doesn't have the potential to match anything else other than decimal. Correct, but decimal has the potential to match.
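The potential and matching columns traced so far can be reproduced with a small sketch. The token definitions below are assumptions made for illustration (num as 0 or a nonzero digit followed by digits, decimal as a num, a dot, and at least one digit, ID as a letter followed by letters, digits, or underscores), and the names `char_class`, `TOKENS`, and `status` are hypothetical, not from the slides. Each token is a hand-written DFA: a token is "matching" if the prefix leaves it in an accepting state, and has "potential" if at least one more character could still be consumed.

```python
def char_class(ch):
    """Map a character to the symbol class used by the DFA tables."""
    if ch == "0":
        return "zero"
    if ch in "123456789":
        return "nonzero"
    if ch == ".":
        return "dot"
    if ch == "_":
        return "underscore"
    if ch.isascii() and ch.isalpha():
        return "letter"
    return "other"


# name -> (transition table, accepting states); every DFA starts in state 0.
# These definitions are assumptions for illustration only.
TOKENS = {
    "ID": ({(0, "letter"): 1, (1, "letter"): 1, (1, "zero"): 1,
            (1, "nonzero"): 1, (1, "underscore"): 1}, {1}),
    "DOT": ({(0, "dot"): 1}, {1}),
    "NUM": ({(0, "zero"): 1, (0, "nonzero"): 2,
             (2, "zero"): 2, (2, "nonzero"): 2}, {1, 2}),
    "DECIMAL": ({(0, "zero"): 1, (0, "nonzero"): 2,
                 (2, "zero"): 2, (2, "nonzero"): 2,
                 (1, "dot"): 3, (2, "dot"): 3,
                 (3, "zero"): 4, (3, "nonzero"): 4,
                 (4, "zero"): 4, (4, "nonzero"): 4}, {4}),
}


def status(prefix):
    """Return (matching, potential) lists of token names for a prefix."""
    matching, potential = [], []
    for name, (table, accepting) in TOKENS.items():
        state = 0
        for ch in prefix:
            state = table.get((state, char_class(ch)))
            if state is None:  # dead: this token can never match the prefix
                break
        if state is None:
            continue
        if state in accepting:
            matching.append(name)
        # Potential: some transition leaves the current state.
        if any(src == state for (src, _) in table):
            potential.append(name)
    return matching, potential


for prefix in ["", "1", "1.", "1.1", "1.1A"]:
    print(repr(prefix), status(prefix))
```

Each printed row corresponds to one row of the table: the empty prefix leaves every token with potential, and the prefix 1 dot leaves only decimal with potential and nothing matching.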
Right, after we match 1.1 we can continue going if there are more digits. And then we persist the longest match, right? Because that's still the longest match. But the current longest match is 1.1. That's true, that's wrong, right? It should say decimal and 2... actually 3, right? Just 3; that's the third character. Yeah, that's the best match up to now. But then num would need to be on the second row as well? Yes, exactly. So I think it's more of a stylistic difference. We're matching decimal, and we still have the potential to match decimal, right? So we could put it here with 3, but since we can still match decimal, there could be more after that, so it could be 4, it could be 5, it could be 6. Exactly. So for me, when it goes from matching to not matching, that's when we mark it and put it in the longest match. Does that make sense? I mean, you can do it the other way, and actually that makes more sense, so I may update that. So here, one way to do this would be to put num here on the second row and say, okay, at this point we know we've matched num with length 1, but we're still potentially matching num. So, exactly, we'd have to update it for every single row that we go down, right? Because num could keep matching, and we'd do num 1, num 2, num 3 until it finally stops. So what I've done here is keep track of, essentially, when it goes from matching to non-matching. When we've finally stopped matching that num, then if that length is longer, we write it here. Permanently? Exactly. Didn't we just say earlier that we technically don't match anything on the third line? The third line, right, this also needs to go away. So it basically goes until it has the potential of being nothing else? Exactly, and it has a valid token already matched, which is why it can continue. So remember, we're matching everything from where we started to where the current character is. So that's the thing: on the third row, when we
look at dot, we're not matching dot itself; we're matching 1 dot and seeing which regular expressions match there, right? But decimal still has the potential to match, because 1 dot matches that prefix. That's where the prefix comes from. Yes, that's wrong; I'll update that before I post them, and if I don't, remind me when you're looking through these slides and looking at your notes. Okay, then let's go to A. So what happens when I go to the next character? Does it still have the potential to be a decimal? When I look at A, is 1 dot 1 A a decimal? This one's still looking at 1 dot 1. Yeah, go to the next one, then. So there's no potential: there's nothing that could potentially match and there's nothing currently matching. But what's my longest match up to that point? Decimal. Exactly. 1, 2, 3, because it went 1, dot, 1, so it was 3 steps. Exactly. So by looking at all the tokens I have and looking at this input string, I know, okay, the longest match I have is 3 characters: the first 3 characters of this string, 1.1, and that's a decimal, and that's the longest match. So then the next step, and this is what I do here with the underlining, is we would return decimal. So when we call getToken the first time here, we're going to return decimal, and we're going to say that the 1.1 of the input string is a decimal. That's why it's underlined. But now I want to keep going, right? I want to know the sequence of tokens. So how is this going to play out? Remember, getToken consumes the input that it uses to return a token. So the next time I call getToken, 1.1 is already done: we've already decided it's a decimal, and we never go back and change that decision. So now we're going to start doing our lexing from here, from after 1.1, before ABC. Yes? Just as a hypothetical, let's say that there wasn't a 1 after that first dot. So we start off,
say it's a num; then we go to the dot and say it could be a decimal; then we go to that A and we say nothing matches, decimal doesn't match. But in that case it would be a num, a dot, then an ID, right? So then when we have no more matching and no more potential, we would technically need to match dot? It would be num first, right? It would match num first, then it would get to that dot, and then we would go back. So we'd return that one token, num; we would consume that first character. And then when we called getToken again, we'd look at the next character; we'd say all of them are valid, and the one that matches is dot. We go to the next one, nothing matches, so we'd return dot. Shouldn't that technically happen here? If this was a completely different example. Well, I mean, no, in this example, shouldn't that happen? Yes, yes, yes. We'll get to it in the second part, because this ID is going to consume ABC1. Yes? The potential for A? So, we haven't looked at anything yet; that's why the current position is technically between the 1 and the A. So we found our match, we've returned decimal, and we consumed the first three characters. It's as if we started from a brand new string; that's why on this row I got rid of that 1.1. So now it's like our input string is ABC1.2. So now we have the potential for all to match. I'm just going to step through this real quick; we'll go over this more on Monday too. We are going to match an ID. We say an ID could possibly match: it has the potential to match, and it's currently matching when we look at A. We step through, and we see ID currently matches and potentially matches. We step through all the way until we get to 1, which still matches. When we finally get to the dot, we say, okay, ID no longer matches, so the longest match is an ID of length 4. So we return that as our token. And then here we'd say, okay, the potential (I'm missing a potential here) could be all. I look at this first one; the only thing that matches is the dot. There's actually no
potential, so I'd go to the next one. I'd say, okay, the longest match was a dot of length 1, and so I return that. And then I finally look at the last character and return a num. And then if I called it again, I'd return end of file or something like that. Okay, cool. So, thanks.
I don't know. Next one? You're not taking the... If there were 1 dot 1, that could be that? No, not that. I got the last one. It gets a lot better. So could I have the last one? These abilities. Okay. Your program will help people if you make it. I'm not going to be connecting. Yeah, so it isn't interesting. I'm being on the book. Yeah, you're starting to move. Yep. Star, star. The point that would make things usable. Yes. Last door, the result of any indicator which side you could push on to open it is back. It may look nice, but it's terrible design. You guys have to test to see where it needs to go. Yes, it might be the right way. It still has potential. So you have some ways of indicating which side you push on. A better side. It's more fluid. It's an interesting course. He's not making it easy, because if it had to... I'm taking that as one of my busy full group of... That's interesting. Next semester, are you taking it to make it deeper? Now, seven o'clock. Seven o'clock. I do not have... So you have a two 90 quits? Oh wow, two 90 quits. Yep, but that's fine. I might have one more, because the teacher is terrible. It will be another 5-out again. You're going to need time in this class. Yep. You see the time. The system is going to start, or a theory. Theory books and undergrad research institute. What do you do there? Is that a class? It's basically... alright, the what research system? What do you do then? So when we start... All science. Is it math, is it computer science? So then we look at one... Basically you close something. You what? What I was doing. You basically do... Yeah, my project is a fleet of AI around the world. Sounds like a video game. Sounds like a video game or a reality video. Let's go. Go in his mouth. That sounds interesting. So then we see the basic games like
chapter 4 I don't have enough for complicated games I saw this guy he signed a fleet of flying does it match and they they held a sheet and if you drop something it could also be like it fell in the sheets so that it wouldn't pull the joys and backtrack it would come together and then it would spread out and then they would throw it the joys would come together and then they would catch it again I was like how could it be cool yeah it's now let's look at one dot a thing that I am interested in that is a lot of lasers that hunt for the longest matching token goes back to where we have them already and that will say I'm going to find the process and get the obvios so this token can manage your lasers matched with them how is it if I still can't find anything after design of the token so that if they try to go back to the set I return num of like one and hold up a high energy beat this is basically where it returns somehow that would affect the fun but it wouldn't affect you or it would be some sort of device that would shred it off like in stuff like this I can have a shield I would like to look for something that has no potential so you can either dot or it's not that you can't look at anything else it can't be a decimal or it can't be a signature I can turn it all over it either be a physical bear or we have a set that would kill like a laser would be something that would that whole would be that would be the easiest choice so when you have your sound or a vibration we're using num digit those are the tokens if you could focus the sound yes you could make them into pin size we have why would it continue would it really continue without potential that's it right there it actually has no potential so wouldn't it just be ABC wouldn't it just look at ABC yes because we're not right exactly like well I think in your garage like it pulses across and if something obstructs it the garage stops so if a large object like a human comes through then you know to shut off the 
moonbug comes through it means so much sound okay yeah it's a whole world it's a good explanation oh it stops when you drop stuff yeah yeah it's all easier because it's turning and small bugs so when we get the first one it'll be today cool beans how long are we gonna have a week and a day well just because the project was due next week I will say I've seen so many people clamoring to get the first homework I was laughing when I was reading that thread when people were like how are we not having homework why are you complaining it was hilarious it seems like I've never they want the homework but you won't get it yeah exactly I would take that as a massive compliment it's not terrible I would assign you a homework on Friday eat this machine that was the con we haven't got a post and then we don't only get a day or two to work that was probably useful there's a difference that's much better that's much better sweet
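As a recap of the getToken walkthrough in this lecture, here is a minimal sketch of the longest-match rule. It is an illustration under assumptions, not the course's reference implementation: the token definitions are assumed (num as 0 or a nonzero digit followed by digits, decimal as a num, a dot, and at least one digit, ID as a letter followed by letters, digits, or underscores), and `char_class`, `TOKENS`, `get_token`, and `tokenize` are hypothetical names. All tokens are run in lockstep one character at a time, the last (token, length) that matched is remembered, and when nothing is left alive the lexer backs up to that match and consumes only those characters.

```python
def char_class(ch):
    """Map a character to the symbol class used by the DFA tables."""
    if ch == "0":
        return "zero"
    if ch in "123456789":
        return "nonzero"
    if ch == ".":
        return "dot"
    if ch == "_":
        return "underscore"
    if ch.isascii() and ch.isalpha():
        return "letter"
    return "other"


# name -> (transition table, accepting states); every DFA starts in state 0.
# These definitions are assumptions for illustration only.
TOKENS = {
    "ID": ({(0, "letter"): 1, (1, "letter"): 1, (1, "zero"): 1,
            (1, "nonzero"): 1, (1, "underscore"): 1}, {1}),
    "DOT": ({(0, "dot"): 1}, {1}),
    "NUM": ({(0, "zero"): 1, (0, "nonzero"): 2,
             (2, "zero"): 2, (2, "nonzero"): 2}, {1, 2}),
    "DECIMAL": ({(0, "zero"): 1, (0, "nonzero"): 2,
                 (2, "zero"): 2, (2, "nonzero"): 2,
                 (1, "dot"): 3, (2, "dot"): 3,
                 (3, "zero"): 4, (3, "nonzero"): 4,
                 (4, "zero"): 4, (4, "nonzero"): 4}, {4}),
}


def get_token(text, pos):
    """Return (name, lexeme, new_pos) for the longest match starting at pos."""
    if pos >= len(text):
        return ("EOF", "", pos)
    live = {name: 0 for name in TOKENS}  # at the start, all tokens are alive
    best = None                          # (name, length) of the longest match
    i = pos
    while live and i < len(text):
        cls = char_class(text[i])
        i += 1
        next_live = {}
        for name, state in live.items():
            table, accepting = TOKENS[name]
            nxt = table.get((state, cls))
            if nxt is None:
                continue  # dropped; this token is never examined again
            next_live[name] = nxt
            if nxt in accepting and (best is None or i - pos > best[1]):
                best = (name, i - pos)
        live = next_live
    if best is None:
        return ("ERROR", text[pos], pos + 1)  # no token matches here
    name, length = best
    # Back up to the end of the longest match and consume only that much.
    return (name, text[pos:pos + length], pos + length)


def tokenize(text):
    """Call get_token repeatedly until EOF, collecting (name, lexeme) pairs."""
    result, pos = [], 0
    while True:
        name, lexeme, pos = get_token(text, pos)
        result.append((name, lexeme))
        if name == "EOF":
            return result


print(tokenize("1.1ABC1.2"))
print(tokenize("1.A"))
```

The second call is the hypothetical raised in class: with no digit after the dot, the lexer falls back to the remembered num of length 1, and the dot and the ID are returned by later calls. With these four definitions no two tokens can match the same string, so no tie-breaking rule is needed in this sketch.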