Welcome to the lecture on lexical analysis. In this lecture we will discuss lexical analysis in detail: what lexical analysis is, why it should be separated from syntax analysis, what tokens, patterns and lexemes are, what the difficulties in lexical analysis are, and finally the recognition of tokens. For this we require the fundamentals of finite state automata and transition diagrams, and for the specification of tokens we require regular expressions and regular definitions. We will study all of this, and then take a look at a very effective tool called Lex for lexical analyzer generation. To recap, this is the block diagram of a compiler. The lexical analyzer is the first component in a compiler: it takes a character stream as input and outputs a token stream which goes into the syntax analyzer. This is where lexical analysis is actually performed. Now let us go into the details. What exactly is lexical analysis? The input to a lexical analyzer is a high level language program. It may be written in C, C++, Java or any other language, but the common feature of all these programs is that they are sequences of characters; at this stage there is no other distinction between them. The output obtained from a lexical analyzer is a sequence of tokens, and the tokens go into the parser for syntax analysis. The lexical analyzer does a lot of cleaning up of the input. For example, it strips the blanks, tabs, newline characters and comments from the source program, because these are not important for parsing; once the token stream is formed they are not needed at all, so they are removed by the lexical analyzer before the tokens are passed on. The lexical analyzer also keeps track of line numbers and associates error messages with the various lines of the source code.
The error messages might actually have arisen because of syntax analysis, semantic analysis or even lexical analysis, but the line numbers are tracked by the lexical analyzer, and the errors are then associated with the source line numbers of the program. The lexical analyzer also performs preprocessor functions, for example #define and #include in C. #define defines a macro and #include includes a file. The effect of a #define macro is nothing but replacing a particular name with a sequence of characters. The lexical analyzer performs this expansion: wherever that name appears, it replaces the name with the sequence of characters mentioned in the #define macro and then submits the expanded sequence to the rest of the lexical analyzer. #include simply says: this particular file is also part of the program, so perform compilation on it as well. In other words, #include means start reading from a different file and then do lexical analysis, parsing and so on on that file too. Now, why separate lexical analysis from syntax analysis? I already mentioned this briefly in the last lecture; let us recap. The first reason is a software engineering one: it simplifies design. A compiler is a very large piece of software, millions of lines of code, so making it modular is essential, and making the lexical analyzer a separate module helps in reducing the complexity of building a compiler. As I already mentioned, the I/O issues are limited to lexical analysis alone: the error reporting, reading from different files because of #include, and so on. Separation also makes the parser more compact and faster, apart from the lexical analyzer itself being very fast.
Why? The reason is that lexical analysis is based on finite state automata, which are much easier to implement in the form of tables than to implement as part of a pushdown automaton, which uses a stack. This will become clear as we go on in the course and study parsing as well. Comments and blanks need not be handled by the parser, so why not remove them in the lexical analyzer itself? That makes the work of the parser a little lighter. A parser is obviously more complicated, and keeping track of line numbers, names, comments and so on is completely unnecessary for it. The lexical analyzer is a better place to take care of these, and this makes both the lexical analyzer and the parser more efficient. Now let me define tokens, patterns and lexemes, and then go on to the operation of the lexical analyzer itself. Let us take a running example, a programming language statement similar to C: float abs_0_Kelvin = -273; Here, as we know, float is a reserved word, abs_0_Kelvin is a name, a variable (it could also be treated as a constant, depending on the usage we want for it), and -273 is an integer constant. On this running example we are going to show what tokens, patterns and lexemes are. A string of characters which logically belong together is a token. For example, the word float and the word abs_0_Kelvin are separated by a blank and are obviously two different strings of characters. So we can safely say that float is a token, abs_0_Kelvin is another token, the assignment symbol = is one token, the minus sign is another, the number 273 is one more, and the semicolon is the last token in this statement. Once the tokens are identified, they are passed on to the syntax analyzer.
The tokens are treated as what are known as the terminal symbols of the grammar specifying the source language; this will become clear as we go on. That makes the life of the parser a little easier: it need not worry about the characters making up the token float, it only needs to deal with some integer code standing for float, that is all. Internally, tokens are represented as integers, and that makes the token stream very efficient. Then, what is a pattern? The set of strings for which the same token is produced is called a pattern. We are going to define what are known as regular expressions to specify these patterns later, but for the present let us understand what patterns are. A pattern is said to match each string in the set of strings it describes. For example, in the running example the word float is a pattern on its own, because no other string matches that pattern. But for an identifier, or name, we have a more general pattern, L(L|D|_)*, where L stands for a letter, D for a digit, and _ is the underscore; the vertical bar is read as "or", and the star stands for iteration, that is, repetition any number of times, zero included. So the pattern says: a letter, followed by a letter or a digit or an underscore any number of times. Let us see how abs_0_Kelvin is produced: we have a letter first; then "letter or digit or underscore" is exercised to produce two more letters, then one underscore, then a digit, then another underscore, and then six more letters. That is the sequence of choices exercised to produce this particular name, and that is the pattern for identifiers, or names.
Then = and -: these match patterns which are just themselves and nothing else. Finally, for the integer constant we have the pattern D+, where D is a digit: a number of digits together, with the plus saying that the digit occurs at least once. So a number cannot have zero digits; it must have at least one. Using this pattern, D exercised three times gives us 273. And of course the semicolon is also a pattern which matches itself. So these are the patterns. The central theme of specifying lexical analyzers is to come up with a formalism to specify patterns which cover the entire set of programming language constructs, and we are going to define regular expressions, which do this job very admirably. Finally, what is a lexeme? A lexeme is the sequence of characters matched by a pattern to form the corresponding token. In other words, the characters which form a particular identifier are the lexeme of that identifier token. The string float, the string abs_0_Kelvin, the string =, the string -, and the string 273 are such character sequences, and they are called the lexemes corresponding to the tokens. We have talked about tokens very briefly; let us discuss them a little more to understand how tokens are used for specifying various parts of a programming language: keywords, operators, identifiers (we will always say identifiers instead of names), various types of constants (integer constants, floating point constants), literal strings (these are string constants), punctuation symbols such as parentheses, brackets, commas, semicolons, colons, and many more. These are the various types of tokens possible in a programming language, and we need to specify patterns for each one of them.
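The patterns of the running example can be written down directly as regular expressions. The following is my own sketch, not part of the lecture material: each pattern from the slide is rendered as a Python regex, and a small function classifies a lexeme by trying the patterns in order (the keyword pattern is listed before the identifier pattern on purpose, since float also matches the identifier pattern).

```python
import re

# Patterns for the running example `float abs_0_Kelvin = -273;`,
# a Python-regex rendering of the float, L(L|D|_)* and D+ notation.
PATTERNS = [
    ("FLOAT_KW", r"float"),                 # matches only itself
    ("ID",       r"[A-Za-z][A-Za-z0-9_]*"), # L (L | D | _)*
    ("NUM",      r"[0-9]+"),                # D+
    ("ASSIGN",   r"="),
    ("MINUS",    r"-"),
    ("SEMI",     r";"),
]

def classify(lexeme):
    """Return the token name whose pattern matches the whole lexeme."""
    for token, pat in PATTERNS:
        if re.fullmatch(pat, lexeme):
            return token
    return None
```

Note the role of ordering: because FLOAT_KW comes first, the lexeme float is classified as a keyword even though it also matches the identifier pattern; this "longest match, first rule wins" discipline is exactly what Lex uses.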
A unique integer representing the token is passed by the lexical analyzer to the parser. As I told you, each token is given a name by the designer, so representing it by an integer is very efficient. But the integer corresponding to a token does not say everything about it; we also need extra attributes, extra values, for these tokens. Let us see what values are required to specify a token completely. For names, or identifiers, we need the lexeme of the token, that is, the string corresponding to the token, or a pointer into the symbol table where the lexeme is stored by the lexical analyzer. In summary, we require the string corresponding to the name, and it must also be accessible during parsing, semantic analysis, code generation, and so on. That is the attribute we want for an identifier. For integer numbers we want the value of the number; similarly, for floating point numbers (token float_num) we want the value of the floating point number in the appropriate representation, and so on. For strings we need the string itself. The exact set of attributes is up to the compiler designer: it is possible, for example, that the token for an identifier carries the string itself, while some other designer may store the string in the symbol table and provide a pointer into the table. That is a description of the kinds of tokens that arise in programming languages. Now let us also discuss the difficulties in lexical analysis. Certain languages do not have any reserved words: while, do, if, else and so on are reserved in C, but they are not reserved words in the programming language PL/1. And in FORTRAN, some keywords are context dependent. Let us take an example: consider DO 10 I = 10.86.
Here DO10I is an identifier and DO is not a keyword, even though the statement looks like a DO loop in FORTRAN. Because the statement reads DO 10 I = 10.86, DO10I is taken as an identifier, a name, and DO is not separated out as a keyword. But if the statement were DO 10 I = 10,86, then it is definitely a DO loop and DO is a keyword. In fact, the tokens for this second statement would be: DO, a reserved word; 10, a label; I, an identifier; =; the integer value 10; a comma; and the integer value 86. Whereas for the previous statement we have just one identifier, DO10I, then =, and then the floating point constant 10.86. So the sequence of tokens for a statement can be very different depending on whether it is a DO statement or an assignment statement. Handling such features requires what is known as lookahead. When we see DO 10 I, it is not yet possible to determine whether this is the beginning of a variable name or of a DO statement; we need to scan past the 10 and find whether it is followed by a comma or a dot. If it is a comma, it is a DO statement; if it is a dot, it is an assignment statement. Only after reaching the comma or the dot can we determine the token sequence, even for the earlier part of the string: whether it is one variable followed by =, or a reserved word followed by a label, a variable, and then =. The token streams are very different, and which one applies can be determined only after seeing the dot or the comma which succeeds the characters 10. Also, blanks are not significant in FORTRAN and can appear in the middle of identifiers, but in C, C++, Pascal and so on this is not the case.
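The FORTRAN lookahead just described (and the insignificance of blanks that makes it necessary) can be sketched in a few lines. This is a toy of my own, not the lecture's code: it strips blanks, checks for the DO prefix, and then looks ahead past the = for a comma versus a dot; real FORTRAN scanners must also handle parenthesized commas and other cases this sketch ignores.

```python
def is_do_loop(stmt):
    """Decide whether a FORTRAN-style statement is a DO loop.

    `DO 10 I = 10,86` is a loop; `DO 10 I = 10.86` assigns to the
    variable DO10I.  Blanks are insignificant in FORTRAN, so we strip
    them before looking ahead for the comma.  A simplified sketch only.
    """
    s = stmt.replace(" ", "").upper()
    if not s.startswith("DO"):
        return False
    _, sep, rest = s.partition("=")
    # Lookahead: a comma after '=' signals the loop-bounds list.
    return sep == "=" and "," in rest
```

The point of the sketch is that the decision cannot be made token by token from left to right; the scanner must buffer the input up to the comma or dot before emitting any token.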
In those languages blanks actually separate the various tokens. A lexical analyzer cannot catch any significant errors except very simple ones, such as illegal symbols; the rest of the errors are caught by the parser. If an error occurs, there is very little the lexical analyzer can do apart from skipping characters in the input until a well formed token is found: skip one character, try to find a token, skip another character, try again, and so on until it succeeds and finds a meaningful token. Now, the specification and recognition of tokens. Regular definitions, a mechanism based on regular expressions, are very popular for the specification of tokens; we will look at the details of regular definitions and regular expressions in the following lectures. Regular definitions have been implemented in a tool called Lex: if we write regular expressions, the tool Lex automatically produces a lexical analyzer. So we will discuss regular expressions and then token specification using Lex, regular definitions and so on. We also use transition diagrams, which are nothing but variants of finite state automata; they are used to implement regular definitions and to recognize tokens. Transition diagrams are usually used to model the lexical analyzer before translating it into a program by hand. By the way, it is possible to write lexical analyzers by hand as well; in the old days that is exactly what was done, and even today, for very small languages, compiler designers sometimes write lexical analyzers and parsers by hand for efficiency's sake, so it is not an artificial situation to consider. When we design lexical analyzers to be implemented by hand, we use transition diagrams to model them and then translate these diagrams by hand into programs.
Lex, on the other hand, automatically generates optimized finite state automata from regular definitions; it does not require transition diagrams as a method of specification. We will first study finite state automata and their generation from regular expressions in order to understand transition diagrams and the working of Lex itself. So far we have looked at tokens: we know what tokens are, but we still do not know how to specify them, and once we learn how to specify tokens we will see how to translate the specifications into programs. In our study of token recognizers, or lexical analyzers, we need to define languages, finite state automata and regular expressions. Let us go through some of the definitions and understand them. First, a symbol. A symbol is really an abstract entity, and we do not define it; it is assumed to be known to everybody. It is like a set: sets are not defined mathematically, they are just abstract, and everybody is supposed to understand them. Examples of symbols are letters, digits, and so on. What is a string? A finite sequence of juxtaposed symbols, that is, symbols placed one after another; such a sequence of symbols is called a string. If we consider the letters a, b and c, then abc, bca, ba and so on are all strings over these symbols. It is easy to see that you can form any number of strings, infinitely many in fact, given just these three symbols. If we write |w|, it stands for the length of the string w, that is, the number of symbols in the string. Epsilon is the empty string, and it is of length 0; these become very important for our formal definitions later on. What is an alphabet? It is a finite set of such symbols. In our case, most of the time we will be using characters as symbols, and a set of characters, as appropriate to various programming languages, will be our alphabet. And what is a language?
It is not a natural language such as English, nor a programming language such as C or Pascal; it is a mathematical entity called a language, and it is defined as a set of strings of symbols from some alphabet. So you take symbols such as a, b, c; that set of three symbols becomes an alphabet; you can form any number of strings over it; and if you take some of those strings and put them in a set, that set becomes a language. Languages need not all be finite, nor all infinite: there are finite languages and there are infinite languages. The null set, and the set containing only the empty string epsilon, are both languages, by definition. The set of palindromes over {0, 1} is an infinite language, because there are infinitely many palindromes you can form using the two symbols 0 and 1. The set of strings {01, 10, 111}, just three strings over the alphabet {0, 1}, is a finite language: it contains only three strings, which is a finite number, and therefore it is a finite language. Now, if sigma is an alphabet, then sigma star is the set of all possible strings over sigma. If you take the single symbol 0, then starting with epsilon we get 0, 00, 000, four 0s, five 0s, and in general 0 to the power n with n >= 0; this set is infinite, and it is the set of all strings over this sigma, the single-symbol alphabet containing 0. Similarly, you could take sigma as {0, 1}; then the set of all possible strings formed using 0 and 1, obviously an infinite set, is its sigma star. Having said what sigma star is, namely all possible strings over a particular alphabet, every subset of sigma star is a language. That is the formal definition of a language.
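Although sigma star is infinite, any finite prefix of it is easy to enumerate. The following sketch, my own illustration of the definition above, lists all strings over an alphabet up to a given length, in order of length; the function name is hypothetical.

```python
from itertools import product

def sigma_star_up_to(sigma, n):
    """All strings over the alphabet `sigma` of length at most n:
    a finite prefix of the infinite set sigma*.  Length 0 contributes
    the empty string epsilon."""
    return ["".join(p)
            for k in range(n + 1)
            for p in product(sigma, repeat=k)]
```

For sigma = {0} this yields epsilon, 0, 00, 000, ...; for sigma = {0, 1} it yields epsilon, 0, 1, 00, 01, 10, 11, and so on, and any subset of the full infinite list is a language over sigma.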
You can have finite subsets of sigma star, and obviously you can also have infinite subsets, so there are finite languages and infinite languages. Now comes a tricky statement: the set of languages over sigma star is uncountably infinite. Why? In general, if you take a set and form the set of all its subsets, that is known as the power set; for a finite set with n elements, the power set has 2 to the power n elements. But sigma star is an infinite set, so the number of subsets of sigma star is also infinite, and in fact it becomes what is known as uncountable. I cannot really spend much time on what uncountability is, because that is part of discrete mathematics. Roughly, a set of numbers like 1, 2, 3, and so on is countably infinite, whereas the set of all subsets of an infinite set such as sigma star is uncountably infinite; in some sense, uncountably infinite is "bigger" than countably infinite in the mathematical sense. Now look at this carefully: each language must have a finite representation, otherwise we cannot even talk about it. A finite representation can obviously be encoded by a finite string; how long that string is, is up to the representation. Thus, if you choose a particular sigma, each string in sigma star can be thought of as representing some language over the alphabet, because each such string encodes some finite representation, which in turn represents some language over sigma.
Now, sigma star itself is countably infinite, whereas, as I said, the set of languages over sigma star is uncountably infinite, which is a "bigger" infinity; hence there are more languages than language representations. The moral of the story is: no matter what kind of finite representation you come up with for languages, so that they can be processed by machines, that is, by compilers, interpreters and so on, that representation will have its limitations. In other words, you cannot come up with representations for every possible language; there are more languages than language representations. But that is not a big disadvantage for us, because there are more than enough languages available in the representations we choose, and they are more than sufficient for our practical purposes today; what the future holds, of course, we have no idea. Now we come to the classification of languages. There are what are known as type 3, or regular, languages. Regular languages can be represented in a finite way using what are known as regular expressions: the languages may be infinite, but the representations are finite, which is the basic idea of any representation. Then there are type 2, or context-free, languages; again these may be infinite, but context-free grammars, which we are going to study later, form a finite representation of these languages, since grammars are finite objects.
Then we have context-sensitive grammars, which represent context-sensitive languages; again, context-sensitive grammars are finite representations of infinite languages. And type 0 grammars are finite representations of type 0 languages. So here is the hierarchy: regular languages are weaker than context-free languages, context-free languages are weaker than context-sensitive languages, which are in turn weaker than type 0 languages. This hierarchy is known as the Chomsky hierarchy, named in honour of Chomsky, who proposed it. Let us look at some example languages. Say sigma is the three-symbol set {a, b, c}, and that is our alphabet. The first language, L1 = { a^m b^n | m, n >= 0 }, is the set of strings consisting of a number of a's followed by a number of b's, where m and n are not related; both are simply greater than or equal to 0. So you would have epsilon, then a, then b, then ab, then a^2 b^3, and so on; these are just examples of strings from this language, which is an infinite set with no relationship between m and n. That is very important. L2 is a similar language, but it is { a^n b^n | n >= 0 }: the number of a's equals the number of b's, and the b's follow the a's. In L1 the numbers of a's and b's are unrelated; in L2 they must be equal. The first language can be represented using regular expressions; the second cannot, and needs context-free grammars. Now look at the third language, L3 = { a^n b^n c^n | n >= 0 }: its strings are a number of a's followed by an equal number of b's followed by an equal number of c's. The difference from L2 is the c part: L2 only required the a's and b's to be equal in number, but here we add c's, equal in number to the a's and the b's.
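Membership in the three languages L1 = { a^m b^n }, L2 = { a^n b^n } and L3 = { a^n b^n c^n } can be checked procedurally; the sketch below is my own illustration, not the lecture's code. Note that the checker's difficulty does not reflect the hierarchy (a Python program can decide all three easily); the hierarchy is about which formalism, regular expression or grammar, can describe the language.

```python
def in_L1(s):
    """a^m b^n with m, n >= 0 and unrelated: a regular language."""
    i = 0
    while i < len(s) and s[i] == "a":
        i += 1
    return all(c == "b" for c in s[i:])

def in_L2(s):
    """a^n b^n: context-free but not regular (counts must match)."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

def in_L3(s):
    """a^n b^n c^n: context-sensitive but not context-free."""
    n = len(s) // 3
    return len(s) % 3 == 0 and s == "a" * n + "b" * n + "c" * n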
Once we add the c's, the language fails to be regular and also fails to be context-free, but it is what is known as a context-sensitive language. So we require regular expressions to represent L1, context-free grammars to represent L2, and context-sensitive grammars to represent L3. Exhibiting a language which is type 0 but not context-sensitive is outside the scope of this course; it is very intricate and needs many more mathematical arguments, so I am going to omit it. Now, what exactly are automata? Automata are machines. What do these machines do? They accept languages, and those languages correspond to the classes we have already defined. Finite state automata accept regular languages, which can be specified using regular expressions. Pushdown automata accept context-free languages, which can be specified using context-free grammars. Linear bounded automata accept context-sensitive languages, which can be specified using context-sensitive grammars. Turing machines accept type 0 languages, which can be specified using type 0 grammars. These are the four types of automata which are extremely important in the study of languages and automata theory. For our purposes we restrict ourselves to finite state automata and pushdown automata: finite state automata are used for lexical analysis, and pushdown automata are used for parsing. There are many applications of automata. Finite state automata have been used extensively in switching circuit design and, as I already mentioned, in lexical analyzers. The Unix tools grep and awk perform string processing and are based on finite state machines. Object-oriented design, for example UML, uses what are known as statecharts, which are nothing but extensions of finite state automata.
Modeling of control applications, for example elevator operation, can easily be specified using finite state automata. Parsers of all types use pushdown automata. And in compilers there are tree automata and similar extensions of finite state automata, used for code generation and other purposes. Let us begin our discussion of the finite state automaton. A finite state automaton is an acceptor, or recognizer, of regular languages; I have already mentioned this. It is really a machine, and it can be programmed. Let us look at the formal definition. A finite state automaton is defined as a five-tuple, a quintuple. The first component is Q, a finite set of states. Then there is sigma, the input alphabet of the machine. Then we have delta, the transition function; it requires a little more explanation, which I will give after we run through the definition. One of the states in Q is designated as the start state and is represented as q0. Finally, F is a subset of Q, and the states in F are the final states. Now let us get back to delta. As I said, the finite state automaton is a machine, and delta tells you how the machine progresses from one state to another on consuming a particular symbol from the input. Delta is a mapping from Q x sigma to Q. In other words, we write something like delta(q, a) = q1, meaning: when the machine is in state q and the next input symbol is a, the machine changes state and goes to the state given by delta(q, a). So, in one move from some state q, the finite state machine reads an input symbol, changes state based on delta, and gets ready to read the next input symbol. Once I show you an example, it will be very clear. A finite state automaton accepts its input string
if, starting from the start state q0, it consumes the entire input string and reaches a final state. Both conditions are important: it is not enough to consume the entire input but end in a non-final state, and it is not enough to reach a final state with some input still remaining. Both must happen, and the machine must start from the start state q0; it cannot start from any other state. In such a case the automaton is said to have accepted the input. If the last state reached is not a final state, the input string is rejected: the machine reads the entire input but ends in a non-final state, so the input is not accepted by the finite state automaton. Let us take a simple example. Here q0, q1, q2 and q3 are the states of the automaton. q0, which has an incoming arrow with no source, is the initial state, and two states, q0 and q2, are special: they are drawn with double circles, and these are the final states, so F = {q0, q2}. The delta function is shown by the arcs and their labels. delta(q0, a) = q1: from the state q0, on input a, the machine goes to state q1. Similarly, from state q1 on input c it goes to state q2, and from q0 on input b or c it goes to state q3. Let us look at the formal description on the next slide. As I said, it has four states q0, q1, q2, q3; sigma = {a, b, c}; q0 is the start state; and F = {q0, q2}, so you can observe that q0 and q2 are the final states. The transition function is given by the table below: from q0, on a the machine goes to q1, on b it goes to q3, on c it goes to q3, and so on.
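The table just described can be simulated directly as a dictionary lookup. This is my own sketch: the transitions not spelled out in the transcript are assumed to fall into the dead state q3, consistent with the language "a single a, followed by any number of a's and b's, followed by one c" that the lecture attributes to this machine.

```python
# The example DFA: Q = {q0..q3}, sigma = {a, b, c}, start q0,
# F = {q0, q2}.  q3 is the dead (trap) state; transitions not
# shown on the slide are assumed to enter it.
DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q3", ("q0", "c"): "q3",
    ("q1", "a"): "q1", ("q1", "b"): "q1", ("q1", "c"): "q2",
    ("q2", "a"): "q3", ("q2", "b"): "q3", ("q2", "c"): "q3",
    ("q3", "a"): "q3", ("q3", "b"): "q3", ("q3", "c"): "q3",
}
START, FINAL = "q0", {"q0", "q2"}

def accepts(s):
    """Run the DFA: consume the entire input, then check the state."""
    state = START
    for ch in s:
        state = DELTA[(state, ch)]
    return state in FINAL
```

Both acceptance conditions from the definition appear here: the loop consumes the whole string, and the final membership test checks that the last state reached is in F.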
Now let us look at the machine again and see whether a given string is accepted or rejected. Take the string abc. We start from the start state q0; the first symbol is a, so we go to state q1. The second symbol is b; from q1 on b the machine stays in state q1 and consumes the b. On the final symbol c it goes to state q2. The input is exhausted and the machine has reached a final state, so the string abc is definitely accepted by the automaton. You can now see that a single a, followed by any number of a's and b's, followed by one c, describes the set of strings accepted by this machine. Whereas any string which begins with b or c takes the machine into state q3, from which it is not possible to get out; the machine remains there. Consider the string bac: on b it goes to q3, on a it stays in q3, and on c it stays in q3. So on bac, starting from the start state q0, the machine ends in q3 after exhausting the input, and q3 is not a final state; therefore the string bac is rejected by the automaton. So the language accepted by this machine is the set of all such strings beginning with an a and ending with a c; epsilon is also accepted, simply because the start state, without consuming anything, is itself a final state. So epsilon, the input string of length 0, is also accepted. Now another example. Again we have four states q0, q1, q2, q3; that is the set Q, and the state q0 is our start state as usual. It is not mandatory to make q0 the start state; even though that is the notation used in many textbooks, it could be q3 as well, and a designer can use a different notation if he or she desires. The set F is just {q0}: only q0 is a final state, and all the others are non-final.
So, in other words, if the input takes the machine from q0 to some other state after being exhausted, the input is not accepted; but if it brings the machine back to the initial state, the input is accepted. Delta is always shown either in the form of a table or in the form of a picture; a picture is easier to understand, so we have used pictures in our case. So, what is the language? The language accepted is the set of all strings of 0's and 1's in which the number of 0's and the number of 1's are both even. You can check this: after a single 0 the machine leaves q0, and a second 0 brings it back, so it circulates between those two states on 0's. Similarly, if after a single 0 it consumes a single 1, it moves to yet another state, and it must consume one more 0 and one more 1 before it can get back to q0. If the machine consumes an odd number of 0's, or an odd number of 1's, it will never get back to q0; it will remain in one of the other states q1, q2 or q3, and therefore those strings are not in the set of accepted strings.

So, we have seen examples of finite state automata. The language accepted by a finite state automaton is the set of all strings accepted by it: starting from the start state, the input string x must take the machine to a final state. Here we extend the delta notation so that its second argument is a string rather than a single symbol; that is a very straightforward extension. So the language is L(M) = { x | delta(q0, x) is in F }: all strings of this kind, which take you from the start state to a final state, are in the language accepted by the finite state automaton. This is what is known as a regular language, or a regular set.
So, this is how we define a regular language, or rather, one of the ways in which we define it; later we will also see regular expressions and regular grammars, which are other specifications of regular languages, but that is not the task right now. Of course, it can be shown that for every regular expression a finite state automaton can be constructed, and for every finite state automaton a regular expression can also be constructed. We will look at this briefly a little later.

Now, the finite state automata we have seen so far are of a special kind: they are what are known as deterministic finite state automata. Why are they called deterministic? Deterministic automata do not permit more than one transition from any state on a particular symbol, whereas non-deterministic finite state automata allow 0, 1 or more transitions from a state on a given input symbol. To show you a simple example: on a 0 from the state q0 the machine can either remain in the state q0 or go to the state q3, and on a 1 it can either remain in q0 or go to the state q1. This is exactly what non-determinism is: the machine can decide to stay in q0 on a 0, or it can decide to jump to q3 on that same 0. Such an automaton is called an NFA (non-deterministic finite automaton); the earlier kind is called a DFA (deterministic finite automaton). An NFA is the same five-tuple, but the transition function delta is very different. For example, delta(q, a), which was a single state in the case of a DFA, is here the set of all states p such that there is a transition labelled a from the state q to the state p. So it is a set, and any one of the elements of the set could be chosen by the automaton at any point in time.
So, for example, in this case, from the state q0 on a 0 the delta function gives the set {q0, q3}, and on a 1 it gives the set {q0, q1}; so on a 0, either q0 or q3 can be chosen by the machine. Delta in the case of a DFA was a function from Q x Sigma to Q: for a given combination of state and symbol, a single next state was possible. Here it is a function from Q x Sigma to the power set of Q: a set of states, a subset of the set of states, is possible. A string is accepted by an NFA if there exists a sequence of transitions corresponding to the string that leads from the start state to some final state. This is very similar to the DFA case, where we said the machine should go from the start state to the final state on the input; here it can go to any one of the final states, and it is not necessary that it reaches the same final state every time. If some such sequence exists, we still say the string is accepted.

Now, every NFA can be converted to an equivalent deterministic finite state automaton that accepts the same language as the NFA. This is a very powerful result, which says that non-determinism does not really add anything to finite state automata. We are going to look at that result in some detail later on.

So, let us look at this example. There is non-determinism here, and the language is the set of strings x such that x contains two consecutive 0's or two consecutive 1's. Let me demonstrate how this works. From q0 the machine can consume any number of 0's and 1's, but finally it should go to q3 and then to q4; that means the string has at least two consecutive 0's, followed by any number of 0's and 1's while the machine loops in q4.
If it takes the other path, the string can have any number of 0's and 1's in the beginning, but then it must consume at least two consecutive 1's to enter q2, after which it can again have any number of 0's and 1's in that state. So two consecutive 0's or two consecutive 1's: that is the rule that makes a string acceptable to this particular automaton. So, we will stop here and continue with non-deterministic automata in the next lecture. Thank you.
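The "there exists a sequence of transitions" definition can be simulated directly by tracking, after each symbol, the set of all states the NFA could possibly be in; this is also the idea behind the NFA-to-DFA conversion mentioned above. Below is a sketch for the last example (strings over {0, 1} containing two consecutive 0's or two consecutive 1's); the state names follow the lecture's figure, but the rows for q1..q4 are filled in from the verbal description, so treat them as an assumption.

```python
# NFA transition function: a (state, symbol) pair maps to a SET of states.
# Missing entries mean "no transition" (the empty set).
DELTA = {
    ("q0", "0"): {"q0", "q3"}, ("q0", "1"): {"q0", "q1"},
    ("q1", "1"): {"q2"},                       # second consecutive 1
    ("q2", "0"): {"q2"}, ("q2", "1"): {"q2"},  # final; loops on anything
    ("q3", "0"): {"q4"},                       # second consecutive 0
    ("q4", "0"): {"q4"}, ("q4", "1"): {"q4"},  # final; loops on anything
}
FINALS = {"q2", "q4"}

def accepts(s: str) -> bool:
    current = {"q0"}   # every state some run of the NFA could be in
    for ch in s:
        current = set().union(*(DELTA.get((q, ch), set()) for q in current))
    # Accepted iff SOME run ends in SOME final state.
    return bool(current & FINALS)

print(accepts("0100"))  # True:  contains "00"
print(accepts("0101"))  # False: no two consecutive equal symbols
```

Each set `current` computed here corresponds to one state of the equivalent DFA produced by the subset construction, which is why the conversion result quoted in the lecture holds.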