So let's continue our discussion from where we left off. We started looking at how to design a lexical analyzer, what it is supposed to do, and what kinds of issues we may face while designing it. One issue we looked at was free format versus fixed format. And we started looking at whether keywords are reserved or not, with some examples from PL/1, where we found that keywords are not reserved, and the kinds of problems that can give rise to. There are examples where reading them, parsing them, understanding them requires some skill; it is not straightforward.

Another issue, in a similar language: take something like a declaration. DECLARE is a keyword which says: I want to declare such-and-such variables of a certain type. So I write DECLARE, give a list of names in parentheses, and at the end I may say that all these variables are of a certain type. But it is also possible that DECLARE is actually a function, or a procedure name, and I am passing all these names as parameters. Now, when do I figure out whether I am dealing with the keyword DECLARE, or with an array reference, or with a procedure or function call? The prefix appears to be the same. That means I have to have arbitrary lookahead, because this list of arguments can be arbitrarily long: only when I look at the character after the right parenthesis will I know whether I am dealing with a declaration, an array, a procedure, and so on.

Now, this problem becomes really difficult. Remember, when we said we were going to design a lexical analyzer, the input comes through a buffer, and the input pointer moves over the buffer as I read characters, with the possibility of pushing something back into the input stream. If this list is arbitrarily long, the whole list may not fit into the buffer. That means I keep loading buffers and moving forward, and only after flushing many buffers do I encounter the deciding character — but by then I have already flushed those buffers. So because of arbitrary lookahead, either I need very large buffers, or, since even very large buffers may not be sufficient, I may have to reload buffers.

So I gave you some examples from Fortran and PL/1. Now, people may argue that these are vintage languages, nobody uses them, and therefore this problem does not exist today. So the question is: have we resolved this issue now? Do you know of languages that are very prevalent today, which you use commonly, and which still have these kinds of issues, where I cannot tell, without the context, what kind of lexeme I am dealing with? Anyone has an example in mind?

Here is an example from C++. We have this syntax for templates, and we have the stream input and output operators, which use the same symbols. Now, if I have nested templates and I see ">>", what is it? Is it coming because of the nesting, or because of the stream operator? Just looking at it, I don't know. So it's not that we have resolved all these problems. The only thing we have to remember is that all the problems I have described so far cannot be resolved by the lexical analyzer alone. I need to know something about the context in which these tokens occur. That means I have to shift this problem, or part of this problem, to the syntax analyzer.
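To make the C++ case concrete, here is a minimal sketch (a fragment, assuming the usual declarations for vector and cin; note that before C++11 the first line in fact had to be written with a space, as "> >"):

    vector<vector<int>> v;   // ">>" here closes two nested template argument lists
    cin >> n >> m;           // ">>" here is the stream input (extraction) operator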
Only context tells me what kind of symbol I am dealing with. If the symbol occurs in the template context, then I know. See, when I am just tokenizing, I will not know: at that point I have no other information, and I have no clue what these tokens are. So if it occurs in the nesting context, I say these are two closing angle brackets; and if it occurs in the stream syntax, I say this is the stream operator. But I will know this context only in the subsequent phases. So most of these problems cannot be handled by a lexical analyzer alone; I need to pass the information on to the syntax analyzer, and resolve there what these tokens are. Is this issue clear to everyone?

So the question that comes now is implementation: how do we tokenize, and how do we describe tokens? Here is a set of lexemes which are basically numbers — a floating point number, another number, and yet another number — and we want to break this text into a sequence of tokens. This is the first problem we are going to face.

Here is yet another example — I'll use a lot of examples to describe what we are trying to develop. Now, what is the difference between the two? If I look at the first one, what does it say? It says: if x is equal to 0, then a is assigned a value; and because of the priority and associativity of the operators, I first do a left shift on x and then assign the result, while the condition itself is just a Boolean comparison. But in the second, if you see, an additional 'f' has come in: it reads 'iff'. Now how do I break this? Is it the keyword 'if' followed by something, or is this 'iff' a variable? Any answers? Any feedback?
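The two candidate readings are roughly these (a sketch; the exact statements on the slide may differ):

    if  (x == 0) a = x << 2;   // "if" is the keyword; "<<" is one shift token, not two "<"
    iff (x == 0) a = x << 2;   // keyword "if" followed by "f(...)", or one identifier "iff"?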
Student: As far as lexical analysis is concerned, we should consider 'iff' as an identifier, but the code should lead to an error in subsequent phases of compilation.

And why do you say that we should consider 'iff' as an identifier? What is the rationale for doing that?

Student: We consider the token to end at the white space.

So that actually depends on the language. If you were dealing with a different language, like Fortran — I gave you an example yesterday — then you would have tokenized it differently: I would not have considered this an identifier, because the language would have required a different tokenization. The language is going to play a big role, and somewhere the implementation issues will come in. So what you have to remember here are two parts. One part is the specification. But specifications sometimes may not be complete; they may leave something to the implementation, and at that point the implementation starts coming into the picture. So we'll talk about both of these. Yes, you have a question?

Student: Since the lexical analyzer has to make a decision but does not have the whole information, can it not decide at all — send both interpretations to the subsequent phases, and they will decide?

So the comment being made is: let the lexical analyzer not decide; it can send multiple interpretations to subsequent phases, and those phases, when they have more information, can resolve them. True, that is possible — that is what we were discussing a few minutes ago. But suppose the lexical analyzer can make a decision. Then there are some issues which are going to be implementation issues.

Let me give you another example of an implementation issue. When I give specifications of tokens like identifiers, I say an identifier is a letter followed by a stream of letters or digits, as already given, right? Now suppose X10 is an identifier, and X100 is also an identifier. When I see X100, should I start saying that, since I already know X10 — maybe it is already in the symbol table — and since X10 also matches the identifier specification, I can tokenize here and say this is actually the identifier X10 followed by a number? We should not, right? But that is a matter of implementation, because both readings match the specification: somebody can say, well, X10 is also an identifier, so break there, emit X10 and then the rest, and the subsequent phases will say: that's an error, that's not possible. So it's a matter of implementation. What we are using here is a principle known as maximal munch: we always want the longest match. But whether you are really able to do that will depend on the implementation; the languages will not always clearly specify it.

So let's keep discussing these issues. How do we break input into tokens efficiently? One point is that tokens may have similar prefixes — here is an example where the prefixes are similar — and we have to be aware of this. And we actually want to look at each character only once. This idea of repeatedly pushing characters back into the input stream and moving the pointer back and forth over the input is going to be an overhead. Once I have taken a character into the buffer, I want to handle it very efficiently.

Continuing with our description: tokens can be described by regular languages. You are familiar with what regular languages are. They have certain properties, a very nice mathematical theory has been developed around them, and we can develop tools for handling regular languages. Since we have discussed this in detail in the theory of computation course, I don't want to get into the properties of regular expressions; I just assume you remember them, and if you do not, go back to your notes or the books and read over them.

The notation we are going to use is the standard one. Σ is a set of characters, and a regular expression r denotes a language L(r). Then I can define rules using the standard operators. ε is a regular expression denoting the language containing just the empty string. If a is a symbol in Σ, then a is a regular expression denoting the language {a}. And then I can use operators: if r and s are two regular expressions denoting the languages L(r) and L(s), then r|s denotes the union L(r) ∪ L(s), and the concatenation of r and s — r followed by s — denotes the concatenation of the languages L(r) and L(s).
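Collected together, the standard denotations (including the closure operation mentioned next):

    L(ε)    = {ε}
    L(a)    = {a}                              for a ∈ Σ
    L(r|s)  = L(r) ∪ L(s)
    L(rs)   = L(r)L(s) = { xy : x ∈ L(r), y ∈ L(s) }
    L(r*)   = ⋃ over i ≥ 0 of L(r)^i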
So this is the union operator, and this is concatenation; and then I can talk about the closure operation. As I said, this is just for the sake of revision: these are the operators I am going to use to specify my regular definitions, or regular languages, and there are several examples.

What we want to use is something known as a regular definition. This is something we need to understand, because now we are getting into the issue of implementation, and we are moving ahead of pure specification. The idea is: if I have some regular expression r, I want to give it a distinct name, because I don't want to keep writing these regular expressions out in expansion. I can see almost a similarity with writing a function: you say, look, there is some work which needs to be repeated again and again, and therefore I don't want to repeat that work; I give it a name and every time I just use the name. There is no concept of an argument here — that is as far as the similarity goes — but to every regular expression I am going to give a name rather than writing it out over and over again.

So we will say: I have a regular expression r1, and I give the name d1 to it. Then I have a sequence of these regular expressions, and I give them names. The only property I need to worry about is this: the first definition in the sequence is a regular expression over Σ. When I look at r2, because I already have d1 — a name for the first regular expression — r2 can be a regular expression over Σ ∪ {d1}. And when I look at r3, because both names are available to me, d1 and d2 can stand in place of their regular expressions, and I don't have to repeat them: r3 is a regular expression over Σ ∪ {d1, d2}, and so on. So a regular definition is a sequence of this form, where the only property I follow is that each regular expression ri is over Σ and all the definitions which are already available. There is no forward reference, and no concept of recursion here. All we are saying is: I don't want to repeat a regular expression again, so I use a very concise notation. That is the only new thing.

So here are a few examples of how I use regular definitions for some regular expressions. Here is a phone number. I gave you the example: suppose I have a lot of these numbers and I want to pull out all the code information. I can describe this by saying: what is my Σ? It consists of the digits and the special characters left parenthesis, right parenthesis, and dash. Then I can say that the number actually consists of a country code, which is a string of digits; then an area code, which again is a string of digits, but which I put in parentheses; then the exchange information, which is a list of digits; and then the phone number itself — 7586 in the example — which also is a list of digits.
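Written out as a regular definition, one plausible rendering of that description (the exact grouping and the position of the dash are assumptions):

    digit    → 0 | 1 | … | 9
    country  → digit+
    area     → ( digit+ )
    exchange → digit+
    local    → digit+
    phone    → country area exchange - local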
And then I can say that every number must have this form — that is the valid form. In fact, one thing you can clearly see here is that when you actually make a call, you just send a string of digits, and when they do billing, they are able to find out to which number you made the call: they can find out whether it was an ISD call, what the STD code was, and so on. You can write a program and process all that information.

But then an implementation issue comes in. What is a valid country code, for example? So far I am only saying that this is a regular expression. But for the sake of implementation I may say that no country code can be more than three digits, or no area code can be more than four digits, and so on — I may want to put certain restrictions. Then the implementation issue comes in, and the tools we have will provide you with more ways of specifying this: they will let you say that this is a string of at most 2 digits, or at most 3 digits, and so on. So I can also put these kinds of additional constraints. This is not part of regular expressions, but it is part of the tool you are going to use for implementation. Do you remember that in a tool like lex, you could put these kinds of restrictions on the maximum length of a string? Remember, somewhere we said that an identifier can be of maximum length 32, right? For implementation I need to remember that, and it can be part of your specification: it will say that if your country code is more than 2 digits long, that's an error. So this is where we slowly start getting into the implementation issues: we have specifications, but for the sake of implementation we need more precise information about what is valid. I am not going to accept an arbitrarily long string of characters and say this is an identifier; I am going to put a restriction — whether it is 8, 16, or 32, and so on. Is this issue clear to everybody?

More examples, before I get into programming languages. Here is an email id. If I want to describe this using regular definitions, what is my character set? It is the letters, and then I have the dot and the @ symbol. This becomes my character set, and then I can define my letters: rather than writing the alphabet over and over again, I give it the name letter. Then a name is nothing but a string of letters, and an address is a name, followed by the @ symbol, followed by a name, and so on. You can write different kinds of descriptions — you may separate out the domain information, for instance. This is just one of the possible descriptions; it is not the only one for this domain.
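In the same notation, one plausible way to write the email description (the domain structure here is an assumption; as said above, it is only one of several possible descriptions):

    letter  → a | b | … | z | A | B | … | Z
    name    → letter+
    address → name @ name ( . name )+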
So let's get into more examples, something closer to what you are going to implement as part of your projects and as part of programming languages. I want to give the specification for an identifier — something we have been discussing right from the beginning. An identifier consists of letters and digits, and it always starts with a letter, which is followed by zero or more letters and digits. That is the kind of specification I can have for an identifier. What you see here is a combination of the specification and the definition names: I have taken this alphabet and given it the name letter. You could have used any name here — letter is just something that immediately tells me, when I read the specification, what it is I am trying to read; there is nothing sacrosanct about it. I could have used anything. So here is the letter specification, which says a letter is either a or b and so on, lowercase and uppercase, all captured in this definition; similarly all the digits from 0 to 9; and an identifier can be specified as a letter followed by letters and digits.

Similarly, suppose I want to write unsigned numbers in Pascal — these are floating point, or real, numbers. I have digit, which is 0 to 9. Then I define another name, digits, which consists of digits but has at least one digit — it cannot be empty — so I use a new operator, plus: digits is digit+. Then the fraction part consists of a dot followed by at least one digit; or the fraction may be completely missing from my number. Then I may have an exponent: the letter E followed by an optional sign. The sign of the exponent is either plus or minus, or it may be missing; if it is missing, the default interpretation is that it is positive. And if I use this notation, then I must have at least one digit after the E — I cannot have a number that ends at the E with nothing saying what the exponent value is; that is invalid. Or the whole exponent may be missing, so it is optional: I put an epsilon there. And then a number is nothing but digits, followed by a fraction, followed by an exponent.

So in this way, I can keep developing the complete set of regular definitions for whatever language I have. Do I need to pause and discuss this, or is it clear to everyone what we are trying to do here? You have already done this when you were using lex, and you already know how to use regular expressions; I am assuming this is not something very new — it is just a matter of notation. Can we move ahead?

So now, once we have regular expressions as specifications, we also need to worry about the implementation. I can write specifications, but somewhere there are going to be issues of implementation. Implementation-wise, what is the problem? Regular expressions describe almost all the lexical structure we deal with; they are sufficient to describe the tokens. Obviously, as I said, in many cases they may not be able to determine the exact token, and the lexical analyzer may pass that on to subsequent phases. But remember that these are only specifications, and I still need to worry about the implementation part.
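Collected in one place, the Pascal unsigned-number definition just developed:

    digit    → 0 | 1 | … | 9
    digits   → digit+
    fraction → . digits | ε
    exponent → E ( + | - | ε ) digits | ε
    number   → digits fraction exponent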
So the question now is: what is it that I am trying to implement? What is the input to the lexical analyzer? I am saying that I have a string S and a regular expression R, and the question I want to ask is whether this particular string belongs to the language specified by R. This is the question I am trying to answer while doing lexical analysis of programs in the compiler. But if I give just a solution which answers yes or no — although that is the basis — it is not sufficient for our purpose, because we are trying to generate information for the subsequent phases. If I just say, yes, it belongs to this language — big deal, so what? What do I do with it? So I also want to tokenize it; I want to generate information. Tokenization is really the implementation part. The goal is not just to give the yes/no answer — that answer only tells me whether it is an erroneous token or not. The more important answer for me is the partition of this input into tokens. And that is where all this information about maximal munch, ordering of tokens, and so on will come into the picture, along with the tools we use.

So let's look at how we tokenize — parsing will come later. We want to write regular expressions for the lexemes of each token, like we have written for numbers and identifiers. Then, once I have written all these definitions, I construct R, which is the union of all of them: for each token — number, identifier, and so on — I have a regular expression Ri, and using the union operator I construct one large regular expression R = R1 | R2 | … | Rk. Now my input is a sequence of characters x1…xn, and I want to check whether, for some value of i with 1 ≤ i ≤ n, the prefix x1…xi belongs to L(R). So I scan my input from left to right, character by character; I look at a prefix of it and ask: can this be a token, a valid word in the language I have? That is the question I want to answer. And if x1…xi belongs to L(R), that means it actually belongs to L(Rj) for some value of j, and I need to find out which one. We want to find the smallest such j — and when I say smallest, it is not in size but in order. The regular expressions are listed in a certain order, and therefore, when it comes to implementation, it becomes really important in which order I specify my regular expressions.

For example — coming back to what you already saw in lex — suppose I give an order where keywords come later than identifiers. Then every keyword will also be matched as an identifier. But if I say keywords come first, before identifiers, what happens? We first try the keyword match, and only if it does not match, or if there is a longer match elsewhere, does it go to the identifier rule. So order matters. Different tools are going to give you different ways of ordering. And suppose you don't use a tool at all — suppose you want to write your own C program. Then what is the order?
Okay — you'll have to worry about what kind of order you are going to use, and that becomes part of your implementation. So remember this is not something you can just ignore: whether you are using a tool, or writing a C program, or an assembly language program — whatever implementation you choose — the implementation decisions matter as much as the specifications. You'll have to worry about when to stop and when to say that you have reached a word boundary. If you are not careful about that, you can generate sequences of tokens which are invalid tokens. We'll take more examples of that shortly.

So, once we have identified that a prefix matches one of these regular expressions, what do we do? We remove it from the input and start tokenizing the rest of the input. Basically: in this string, I find a prefix which becomes a token; I remove it, and start looking at the beginning of the next token; then I find the next token, and so on, till I have exhausted all the input. So I keep going back and tokenizing my input again and again, and each time through this iteration I generate one token.

The algorithm normally gives priority to tokens which are listed earlier. So if 'if' is listed as a keyword before identifiers and I encounter 'if' first, I am going to say that this is a keyword. But how much input do I consume? That is where we say we want the longest match. Normally, all lexical analyzers will pick up the longest possible string from the input: when I have something like this, I will not just stop here, but go all the way until I can no longer consume a character under this rule. Suppose after this I have a less-than sign: I am not able to consume that particular symbol into this token, because the extended string will not match any of the specifications I have. Then I say this is my word boundary, and the earlier point is not my word boundary, right? So that becomes the implementation.

So regular expressions provide a very concise notation, and a good algorithm requires a single pass — I don't want multiple passes over my input; that is an efficiency issue.

So how do we break up text? Here are two examples. If I write something like elsex = 0, there are two ways I might break it: either else followed by the identifier x, or the single identifier elsex. The maximal munch principle says the second, because the whole of elsex matches the specification of an identifier. So normally the longest match is the one which is going to win, and ties are resolved by priorities: if you have common prefixes, then you need to prioritize the rules. A lexical definition thus consists of regular definitions, priority rules, and the maximal munch principle. You can see that here I have clubbed everything together: not just the specification, but also the implementation, which says in which order the token rules apply and that the longest match wins. Together, this becomes the specification for your lexical analyzer, not just the specification of the tokens. All three have to come into the picture.
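Here is a minimal C++ sketch of such a scanner loop — the "write your own C program" case — with illustrative rule names. The longest match wins, and among equally long matches the rule listed earlier wins:

    #include <cctype>
    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    // Each matcher returns the length of the longest prefix of in[pos..]
    // that it accepts (0 means no match at all).
    std::size_t matchIf(const std::string& in, std::size_t pos) {
        return in.compare(pos, 2, "if") == 0 ? 2 : 0;
    }
    std::size_t matchIdent(const std::string& in, std::size_t pos) {
        std::size_t j = pos;
        if (j < in.size() && std::isalpha(static_cast<unsigned char>(in[j])))
            for (++j; j < in.size() && std::isalnum(static_cast<unsigned char>(in[j])); ++j) {}
        return j - pos;
    }
    std::size_t matchNumber(const std::string& in, std::size_t pos) {
        std::size_t j = pos;
        while (j < in.size() && std::isdigit(static_cast<unsigned char>(in[j]))) ++j;
        return j - pos;
    }

    struct Rule { const char* name; std::size_t (*match)(const std::string&, std::size_t); };

    int main() {
        // Priority = position in this list: the keyword rule is listed first.
        std::vector<Rule> rules = {{"IF", matchIf}, {"ID", matchIdent}, {"NUM", matchNumber}};
        std::string input = "iff";
        std::size_t pos = 0;
        while (pos < input.size()) {
            std::size_t best = 0;
            const char* token = nullptr;
            for (const Rule& r : rules) {
                std::size_t len = r.match(input, pos);
                if (len > best) { best = len; token = r.name; }  // strict '>' keeps the earlier rule on ties
            }
            if (best == 0) { std::cerr << "lexical error at " << pos << "\n"; break; }
            std::cout << token << " \"" << input.substr(pos, best) << "\"\n";
            pos += best;   // remove the lexeme and continue with the rest of the input
        }
        // Prints: ID "iff" -- maximal munch beats the keyword rule's priority,
        // because priority only breaks ties between equally long matches.
    }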
Student: Will the maximal munch principle be used for every language, or only applied for specific languages?

For the lexical specification, you apply it for the specific language, yes.

Student: For a C-like language, it will not be used.

For a C-like language? What is an example — why would it not be used for a C-like language? I cannot immediately think of one; maybe you can tell me.

Student: In C we separate by white space; we have word boundaries. So we can look at those and tokenize, instead of using the longest-match principle.

So the only thing you have to worry about in C is exactly something like this, right? If I don't use maximal munch, will I break it this way or into that way? Take any keyword — take 'if', for example.

Student: But in C, if you consider the '=' as the boundary, then the token will be 'iff'.

Yeah, but see, I will not have reached the '='. I start reading from here: I read 'i', then I read 'f', and then I have to take a decision — do I read one more 'f', or do I stop there?

Student: Can we not decide by reading up to the boundary?

That is part of your implementation, right? If you say: keep consuming till you hit a word boundary — that is precisely saying: go for the longest match. And the word boundary is a different thing. In this case, my whole program is just one large string — my input is a stream of characters — and I am breaking it into tokens. Where do you specify word boundaries? Is specifying word boundaries part of your specification? When I write my regular definitions, is there some place in the specification where I say: here is a word boundary?

So the question is exactly this: how do I specify my word boundaries? Let me go back. Here is the set of specifications I have for letter, digit, identifier, and numbers. And I can add one more specification — call it addop, an add operator, specified as plus or minus. This can be my complete specification for expressions which consist of identifiers, numbers, and operators, right? Now, where in this do I specify what my word boundaries are? I could also throw in one more specification for keywords, with a list of keywords — but a keyword is not really separable either; it would equally be matched as an identifier. So where do I specify that I have certain word boundaries in this specification?

Student: We cannot specify the word boundaries.

So if I cannot specify word boundaries, how do I know what my word boundaries are?

Student: It is implicit.

What do you mean by implicit?

Student: We are reading; when the lexer comes to 'f', we look ahead one character and find what's next. If it is something like an equals sign or white space, then we say the token is 'if'; and if we find another 'f', we continue.

I understand that part. I am asking: where is it specified?

Student: It cannot be specified.

So if it cannot be specified, but I have to implement a lexical analyzer, where do I capture this? Is it part of your specification or part of your implementation? It is part of your implementation: you have to use principles like longest match; it is not specified anywhere, right? So now, if I go forward once again to the point I summarized: lexical definitions consist of regular definitions, priority rules, and maximal munch. Maximal munch is the principle I am using here, saying that I want to go for the longest match.
And that is when I can say: keep on reading. Suppose I did not have maximal munch, and I replaced it by first match. Then what would happen?

Student: The token would be 'if'.

So that means your lexical analyzer has to consist of implementation rules as well as the definitions.

Student: Sir, for applying the maximal munch principle we need to have a—

Wait, wait — let's not worry yet about how I am going to implement maximal munch. That is really not the point here; that point will come later. All I am saying at this point is that when you talk of a lexical analyzer, it is not sufficient to say: here is a set of regular definitions, just go ahead and implement it. You need to specify something more: either that we always go for the longest match, or that the first specification that matches, in the listed order, is the one used, rather than trying to exhaust everything in the sequence. That is part of your implementation, and you cannot ignore this part when implementing a lexical analyzer. That is the only point at this point of time. So not everything is captured just by the definitions: the definitions only say what the valid tokens are, but how the input is tokenized comes from this other part. That is where all these word boundaries get captured. Is this point clear to everyone?

In fact, this is a very important point, because when you start implementing a lexical analyzer and writing lex specifications, if you are not careful about the order, suddenly you will find the token sequence changes. Internally, lex assumes certain things: it matches specifications in the order they occur, and it goes for the maximal match. Now, if you are not careful about that — if you don't write your specifications in the correct order — then suddenly you will find that the sequence of tokens is different. I can take two sets of the same specifications, change the order, and the resulting sequence of tokens will be different. That is what we have to be careful about. It is an implementation issue: the same set of specifications, written in a different order, and suddenly the tokenization happens differently. Do we agree on this? Shall we move ahead?

Okay. What we want to do now is move slightly away from regular definitions, and introduce a new notation: transition diagrams. Remember, one thing we said was that we don't only want to use tools for the implementation; somewhere I may want to do a manual implementation — I may want to write C code directly, for the sake of efficiency, rather than stopping at specifications in regular definitions. But when I am writing that C code, I don't want to write it in an ad hoc manner; I want some systematic way of specifying, some systematic notation for saying what kinds of tokens I have. For that, I introduce transition diagrams. Again, I won't worry about their formal properties and so on; I'll go straight for implementation, and you will find that transition diagrams are something you have already seen. The point is: regular expressions are a kind of declarative specification, and the transition diagram is really the implementation part.
So again, I'll keep talking about implementation here, because ultimately that is what you have to do. A transition diagram consists of a set of alphabet symbols belonging to Σ, a set of states, and transitions from one state to another on certain inputs; and then we have a set of final states and a start state. When I say there is a transition from state s1 to state s2 on a, I mean: if I am in state s1 and the input is a, then I go to state s2 — and we have a pictorial notation for this. When I reach the end of a token in a final state, then we accept; otherwise we reject. You will find that what I am talking about is essentially a finite state machine.

For the notation: a state I will draw as a circle, a final state is two concentric circles, and a transition is an arrow; a transition from state i to state j on a symbol is drawn as an arrow labeled with that symbol. With this, transition diagrams become very easy: I specify my tokens, and the diagrams take me directly toward the implementation.

So how do we recognize tokens? What we'll try to do is develop transition diagrams for this kind of specification, where I have enriched the language: I have relational operators, identifiers, numbers, and delimiters, which are blanks, tabs, or newlines; and white space, which is one or more occurrences of delimiters. We want to construct a lexical analyzer which returns a token-attribute pair, and here I want to go for a manual implementation: I don't want to stop at regular definitions; I want to convert these regular definitions into transition diagrams, which will then capture all the implementation details.

So let's start with identifiers. If I try to develop the transition diagram for identifiers: I am in the start state — what is the first character I can see? The first character I can see is only a letter, right? Anything else is invalid. Seeing a letter takes me to the next state, and in that state I can continue to see either a letter or a digit, so I have a self-loop with the label letter-or-digit. Then I can go to a final state on seeing 'other' — and what this label means is that anything which is not a letter or a digit takes me to this final state. In this final state, I now want to do something more, because we are talking of implementation: I want to return a token and an attribute. What is the token here? It's an identifier, so I'll say my token is id. And what is the attribute? It could be either the lexeme or an entry into the symbol table.
So if this lexeme is newly identified, I am going to put this information into the symbol table; or, if it was already in the symbol table, the attribute is a pointer into the symbol table corresponding to this lexeme. But then I need to do something more — and what is that something more? We have consumed one extra character, the 'other' one, and I must specify somewhere that this extra character is returned to the input stream, so I put a special mark (a star) on the final state. You can see that what I am describing now is an implementation notation — which is why I did not simply call it a finite state machine — because it is capturing all the implementation details. The star reminds you that the last character consumed must be returned to the input stream. So this becomes your specification for the implementation of an identifier.

Now let's do the relational operators. I am in some start state. Let's first take just these two: my relop is less-than or less-than-or-equal. In the start state, what can I see? Whether I am trying to match the one token or the other, I will always see a less-than first, and that takes me to a new state. In this new state, either I see an equals, or I see something else. If I see an equals — do I need to look at more? Suppose I have nothing more in my specifications; then by longest match this is what I match, so I reach a final state and return relop with, let's say, less-than-or-equal. And I don't return anything to the input, because I am not consuming any extra symbol. But if I reach the other final state, via 'other', then I have matched relop less-than, and I must return one character to the input stream. So this becomes my implementation.

Now let me enrich the whole thing. Suppose the less-than can also be followed by a greater-than. Then in this middle state, I add another edge, labeled greater-than — and as soon as I add this edge, the definition of 'other' changes. 'Other' is always all symbols not on the other outgoing labels: Σ minus those symbols. If I did not have this edge, 'other' was Σ minus the equals; as soon as I add the edge, it is Σ minus the equals and the greater-than. And this new edge immediately says what I have seen: this is '<>', so I say this is relop, not-equal. I can keep adding more and more edges to this, and that will give me the full specification.

Once you have understood this part, I can take you to the full transition diagrams. For the relops, one part of the diagram handles greater-than-or-equal: it says my token is relop and the lexeme is greater-than-or-equal. Many times you will find additional cases, like plain greater-than; but as soon as I have matched greater-than-or-equal, that is sufficient to decide, and the 'other' edge takes me to the plain greater-than case.
But in that case, I also have to return one character to the input stream. So let me just finish this transition diagram, and then we can break. For the second part, if I am now trying to capture everything: if I see a less-than and then 'other', I know I have reached a final state for less-than — and whenever I consume 'other', I must return it to the input stream. If I see less-than followed by equals, I reach the final state for less-than-or-equal, and I am not returning anything. If I see less-than followed by greater-than, I reach another final state, for not-equal, and again I am not returning anything. And if I merge in all the relational operators, then in the start state I can see not only a less-than but also a greater-than — which brings in the transition diagram we just built, so it gets captured here — and in the start state I can also see an equals, in which case I immediately know that I reach the state that matches equals. So you can see that this becomes your implementation specification for the relational operators. So let's break here, and we'll meet in the afternoon at 4:30.
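To connect the diagrams to code, here is a minimal C++ sketch of how the identifier and relational-operator diagrams above translate into a hand-written scanner (token and function names are illustrative, not from the lecture):

    #include <cctype>
    #include <cstddef>
    #include <iostream>
    #include <string>

    // Token kinds for the relops <, <=, <> (not equal), >, >=, = and identifiers.
    enum TokenKind { ID, LT, LE, NE, GT, GE, EQ, ERROR };
    struct Token { TokenKind kind; std::string lexeme; };

    // Identifier diagram: start --letter--> state 1, loop on letter|digit,
    // accept on "other". The retract action is realized by peeking before
    // consuming, so the extra character is never taken off the input.
    Token getIdentifier(const std::string& in, std::size_t& pos) {
        std::size_t start = pos;
        if (pos >= in.size() || !std::isalpha(static_cast<unsigned char>(in[pos])))
            return {ERROR, ""};
        ++pos;                                     // start -> state 1 on a letter
        while (pos < in.size() && std::isalnum(static_cast<unsigned char>(in[pos])))
            ++pos;                                 // stay in state 1 on letter|digit
        return {ID, in.substr(start, pos - start)};
    }

    // Relop diagram: '<' may extend to "<=" or "<>"; '>' may extend to ">=".
    Token getRelop(const std::string& in, std::size_t& pos) {
        if (pos >= in.size()) return {ERROR, ""};
        char c = in[pos++];
        if (c == '<') {
            if (pos < in.size() && in[pos] == '=') { ++pos; return {LE, "<="}; }
            if (pos < in.size() && in[pos] == '>') { ++pos; return {NE, "<>"}; }
            return {LT, "<"};                      // "other": retract by not consuming it
        }
        if (c == '>') {
            if (pos < in.size() && in[pos] == '=') { ++pos; return {GE, ">="}; }
            return {GT, ">"};
        }
        if (c == '=') return {EQ, "="};
        --pos;                                     // not a relop at all: retract
        return {ERROR, ""};
    }

    int main() {
        std::string in = "count<=10";
        std::size_t pos = 0;
        Token t1 = getIdentifier(in, pos);         // ID, lexeme "count"
        Token t2 = getRelop(in, pos);              // LE, lexeme "<="
        std::cout << t1.lexeme << " | " << t2.lexeme << "\n";
    }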