We will continue our discussion of the simplification of context-free grammars. Last time we saw how useless symbols can be eliminated from a given CFG to produce a new CFG that is equivalent, in the sense that the new CFG generates the same language as the old one. Today, let us first discuss how to remove so-called epsilon productions. First, the definition: any production of the form A → ε is called an epsilon production. In such a production the left-hand side, as usual in a CFG, is a non-terminal, and the right-hand side consists only of the empty string. In general there can be many epsilon productions in a grammar, and we would like to eliminate all of them to form a new grammar G1. Now, suppose the old grammar G generated the string ε itself. That is, suppose I have a grammar G = (V, T, P, S) and L(G) includes ε, meaning the empty string is in the language generated by G. That means that from S we can derive the empty string. This derivation is not possible unless you have epsilon productions, because somewhere down the line you must be able to remove all the non-terminals that you have generated from S, replacing them by ε, to finally obtain the empty string. So it is clearly not possible to eliminate all epsilon productions from such a grammar G and get an equivalent grammar G1: if G1 does not have any epsilon production, it cannot generate the string ε, and in that sense the two grammars will not be equivalent.
So let me say more clearly what I am trying to say. Suppose L(G) includes ε. Then clearly it is not possible to obtain a grammar G1 without any epsilon productions such that L(G1) is the same as L(G), because if G1 does not have any epsilon production it can never generate the string ε, however much it tries. Any derivation in G1 starts with a non-terminal, and if it finally derives ε, then somewhere all the non-terminals must be erased, and no terminals can be left, because once a terminal is written it cannot be erased by any production. Therefore the only way you can generate the string ε is by having epsilon productions. So clearly we cannot remove epsilon productions from every grammar and obtain an equivalent grammar. Our goal should therefore be the following: given a grammar G, obtain G1 such that L(G1) is equal to L(G) without the string ε. In other words, if G generated ε, then G1 should not generate ε, but G1 should generate all the other strings that G does. On the other hand, if G did not generate ε, that is, L(G) did not contain ε, then L(G1) would be equal to L(G) itself, and in that case our aim is of course a grammar equivalent to G. So this is our goal, and the way we achieve it starts as follows. Step 1: in the given grammar G, identify all the so-called nullable non-terminals. What is a nullable non-terminal? We say A is nullable if A can derive ε, the empty string. In other words, A can finally become the string ε after some steps. Clearly, if there is a non-terminal A in the grammar G such that A → ε is a production in G, then this non-terminal A is nullable.
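In symbols, the goal and the definition just stated can be written as follows. This is only a notational sketch: the set-difference notation and the subscripted grammar name are assumptions chosen here to abbreviate "L(G) without the string epsilon".

```latex
% Goal of epsilon-production removal
\[
  L(G_1) \;=\; L(G) \setminus \{\varepsilon\},
  \qquad\text{so in particular}\qquad
  \varepsilon \notin L(G) \;\Longrightarrow\; L(G_1) = L(G).
\]
% Nullable non-terminal
\[
  A \in V \text{ is nullable} \iff A \Rightarrow^{*} \varepsilon .
\]
```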
Now, the way we identify the set of all nullable non-terminals of the grammar G is by an inductive process, and the base of that induction starts with finding all non-terminals that have such a production. So let me call the set of nullable non-terminals script N, as we have done in some other cases previously. We will define this set script N inductively. The base is that N initially consists of all non-terminals A in V such that A → ε is a production. Remember that any inductive definition of a set starts with a base definition, and for the set of nullable non-terminals the base is all those non-terminals of the grammar that have a production of the form A → ε. Such non-terminals are visible right from inspection of the grammar G: without doing anything, I know that these non-terminals are nullable. It is obvious, but I should mention once more that the grammar G from which we are trying to remove epsilon productions is of the form (V, T, P, S), so V is the set of non-terminals, and that is the one I am using here. Now I should say what the inductive step is; its description is quite simple. At any given stage of the construction I have a set consisting of the nullable symbols identified so far, and at every stage we look at all the productions. Suppose some particular non-terminals are already in the nullable set that I have defined so far, so I know they are nullable. Let us say these are A, B, and so on.
For the sake of example, let us say C is also in N, and suppose D → A B C is a production. Then you can see clearly that, because A, B and C are in the set of nullable symbols already defined, I can start a derivation like this: D goes to A B C using this production, and then, after using some productions, A goes to ε, B also goes to ε, and C also goes to ε. Therefore D can eventually generate the string ε. So in such a case I have discovered D to be a nullable symbol as well, and D is included in the set of nullable non-terminals if it is not already there. Now, what is the algorithm corresponding to this? The base step is easy to see: first I define my initial version of N to be all those non-terminals that have a production of the form A → ε. Then the algorithm examines every production. If it comes across a production A → α where α consists only of nullable symbols, that is, symbols already in N, we check whether A is already in the set N. If it is not, we include this new non-terminal in N, and now N has changed from the old one. Any time we manage to update the set non-trivially, that is the end of one iteration. So in every iteration we start with some N and look through all the productions to see, using basically this rule, whether a new member can come into N; the moment I find one, that iteration ends and I start the next. How many times can this iteration go on?
Clearly, at most as many times as there are non-terminal symbols in V. It cannot exceed that; in fact it will be less, because you already have some members to begin with. Once all the iterations stop, I claim that we have identified the set of all nullable symbols. So the algorithm is very simple. We start with the base case, and then the iterative process starts. Every iteration examines the set of all productions; if it finds a production whose right-hand side consists only of nullable symbols already identified, and whose left-hand side non-terminal is not in the current N, then we stop that iteration, because we have discovered an update for N. We include the symbol we have just found to be nullable in N and start the next iteration, and we go on like this until an iteration finds nothing new. It is not difficult to prove, if you wish, that this algorithm correctly identifies the set of all nullable non-terminals. Firstly, it is clear from the way we have described it that no non-terminal that is not nullable can get into N. Any symbol came into the set either because it was there in the base case, in which case it is clearly nullable, or because it came in during one of the iterations through the application of a production of this kind, and that production is a witness that it is nullable. So every non-terminal that we put in the set N is clearly nullable. The only question you may ask is: is it possible that some non-terminal is nullable but would not get into the set N under the algorithm we have used? Let us provide a proof sketch that every nullable non-terminal will be in N when the algorithm that defines script N terminates. Now, by definition, a nullable symbol is one that can generate the string ε.
The way we prove the assertion that N contains all nullable non-terminals is by showing that any nullable A will eventually get into N. Since A is nullable, by definition there is a derivation that takes A to the empty string, and our proof is by induction on the number of steps in this derivation. It is not surprising that we use induction to prove this assertion, because N was defined inductively, so it is natural to use induction to justify the main claim about the algorithm. So, what are we proving? Let me write it down: a proof that A will be in N. This proof is by induction on the length of the derivation that gives the empty string from A. The base case is that this length is 1. How is that possible? It is possible only if A → ε is a production of the grammar. In that case we have a derivation of length 1: we start with A and use that production to generate ε directly. So the base case is clear, because the base case of the inductive definition of N contains all such A's that have a one-step derivation of ε.
Now, the induction hypothesis is that every A that derives ε in n or fewer steps is a member of script N. Using this hypothesis I would like to show that if B derives ε in n + 1 steps, then B will be in script N. The proof is simple. Consider such a derivation, and consider its very first step: you replace B by the right-hand side of a production whose left-hand side is B. That right-hand side may have a number of non-terminals. It is clear that a production whose right-hand side contains a terminal symbol could not be used in this derivation, because a terminal symbol can never become part of an empty string. So the very first step used a production of the kind B → A1 A2 ... Ak, and clearly all these Ai must be nullable themselves: each is eventually rewritten as ε, and that is how B derives ε. The point is that, since the whole derivation has length n + 1, the derivation that takes each Ai to ε has n or fewer steps, and therefore, by the induction hypothesis, each of these Ai will be in script N. So in our algorithm for identifying the nullable non-terminals there will be a time when A1 through Ak have all been found to be in script N, and when the algorithm examines this particular production after that, it will clearly add B as well. Therefore B also gets into the set of nullable non-terminals, which is what we wanted to prove. What we have achieved so far is the identification of all nullable non-terminals of a given grammar G.
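The identification of nullable non-terminals described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the lecture's own code: the function name `nullable_nonterminals`, the representation of a production as a `(head, body)` pair with `body` a tuple of symbols, and the use of the empty tuple for ε are all choices made here for illustration. It also keeps scanning after an update instead of restarting the pass at once, which reaches the same final set.

```python
def nullable_nonterminals(productions):
    """Return the set of non-terminals that can derive the empty string.

    Assumed representation: each production is (head, body), where body is
    a tuple of symbols and the empty tuple () stands for head -> epsilon.
    """
    # Base case: every head of an epsilon production is nullable.
    nullable = {head for head, body in productions if len(body) == 0}

    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for head, body in productions:
            # Inductive step: the whole right-hand side is already nullable.
            if head not in nullable and all(sym in nullable for sym in body):
                nullable.add(head)
                changed = True
    return nullable


# Hypothetical example grammar: D -> A B C as in the lecture's illustration,
# plus a start symbol S whose only production contains the terminal b.
grammar = [
    ("S", ("D", "b")),
    ("D", ("A", "B", "C")),
    ("A", ()),
    ("B", ()),
    ("C", ("A", "B")),
]
print(sorted(nullable_nonterminals(grammar)))
```

Here A and B enter the set in the base case, C is added on the first pass, D on the second, and S never qualifies because its right-hand side contains a terminal.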
Now we should proceed to obtain what we wanted: a grammar without any epsilon productions that generates the same language as the old grammar, except possibly for the string ε. Let us write this down. The given grammar is G = (V, T, P, S); let N be the set of nullable non-terminals of G; and our goal is to obtain G1 without epsilon productions such that L(G1) is L(G) without the string ε. As we said already, if L(G) had ε, then L(G1) should contain everything other than ε; if L(G) did not have ε, then the new grammar and the old grammar generate identical languages. So how do we do this? First of all, we eliminate from P, the set of productions of G, all epsilon productions. Clearly at this point the grammar does not have any epsilon productions, but it is not yet the grammar that I want. The reason is that if you simply remove all epsilon productions, you may be blocking some non-epsilon strings of the language from being generated. A very simple example: suppose S → A B, A → ε and B → b, and suppose these are all the productions of the grammar. What can this grammar generate? You can see that S goes to A B, A goes to ε, and then B goes to b. Now, if I remove the epsilon production from this set, that part of the derivation tree cannot be there. So I would not be able to generate the string b, which originally I could generate, because S goes to A B and then there is no way of getting rid of this A. So clearly we need to do something more, and that is done by adding some new productions. The rule for that is fairly simple.
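The small example can be written out as two derivations. This sketch reads the example as S → AB, A → ε, B → b, with lowercase b a terminal; that reading of the notation is an assumption.

```latex
\begin{align*}
\text{In } G:\quad
  & S \Rightarrow AB \Rightarrow B \Rightarrow b \\
\text{After deleting } A \to \varepsilon \text{ only}:\quad
  & S \Rightarrow AB \Rightarrow Ab
  \quad (\text{stuck: no production rewrites } A)
\end{align*}
```

So merely deleting the epsilon production loses the string b, even though b is not ε.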
So that is the first thing we do. The second thing is this: suppose A → X1 X2 ... Xk is in P, and suppose some or all of these Xi are nullable; that can happen. Then we will add productions of the kind A → Y1 ... Yk, where each Yi is Xi, or possibly ε if the symbol Xi is nullable. The way we are writing it looks a little clumsy, but the idea is very simple. Let us take a simple example: suppose I have A → B C D, and let us say that of these B and D are nullable. Now, from the way we have written the rule, you should realize that it is not really one production that we are adding; in general we are adding many, because there is a choice: if Xi is nullable, then the right-hand side of the new production can have either ε or Xi itself in its place. In this simple example, B is my X1, C is my X2 and D is my X3. How many new things can I get out of this? I can of course keep everything, so A → B C D is a production we keep; that comes from never using the choice ε for any nullable symbol. Or take the first symbol: A → C D will also be a production, because according to this rule B could be either ε or itself; in the original production it is itself, and here we are choosing to make it ε. The long and short of it is that every string obtained from this right-hand side by substituting ε for one or more nullable non-terminals will also be the right-hand side of a production for A. So A → C D is a new production.
So what are the new things I am getting here? I can get A → C D, where B was replaced by ε. I can also get A → B C, where D was replaced by ε. And I can get A → C, where I replaced both B and D together by ε. These are the new productions that we add in this case. Now, it is possible that all the symbols on the right-hand side are nullable. If we did what I said, I would be allowed to replace each one of them by ε, and then I would get ε itself on the right-hand side. That is not allowed. So, in other words, my rule is: add all productions of the form A → Y1 ... Yk, where each Yi is either Xi or, if Xi is nullable, ε, except the production A → ε. That exception arises exactly when all of the Xi are nullable: if I had stated only the first part, you could have replaced each Xi with ε, and the right-hand side would have been ε itself. That possibility we are removing. So what is our G1, which has the property that it generates every string of the grammar G except the string ε? As we said, first we identify all the nullable symbols of G, and then we eliminate from P all the epsilon productions. If there are nullable symbols, there will be some epsilon productions. So we create this new G1 by first removing all epsilon productions and then adding some new productions, and these are the new productions that we add after identifying all the nullable symbols.
We said: suppose A → X1 X2 ... Xk is a production in P; then we add new productions by removing one or more of the nullable symbols from its right-hand side, except that we will not add any production of the form A → ε. This is how we define our new grammar G1. Very briefly again: we identify all the nullable symbols, then we remove all epsilon productions from G, and then we add some new productions to the set of productions. The final form of the grammar can be shown to generate all the strings of G except the empty string. Let us try to prove this, to establish the correctness of what we are doing. Take the original grammar to be, as we said, G = (V, T, P, S), and call the new grammar, obtained by removing all epsilon productions and in the process adding a few more productions, G1. Possibly we might have removed some non-terminals; terminals we will not have removed. Call the new set of productions P1; the start symbol is still S. What we want to show is that L(G1) is the same as the language generated by the original grammar, minus possibly the string ε. So what we need to show is this: S derives w in G, where w is a string of terminals and w is not ε, if and only if from S you can derive the same string w in the grammar G1. Since we are now talking of two grammars, the derivation symbol we have been using before needs to be qualified to indicate which grammar's derivations we mean.
What we are trying to say is easy to see. Suppose in the original grammar we derive some string w, where w is not ε; then that string will be generated in the new grammar as well. Conversely, if in the new grammar we generate any string w, then w cannot be ε, because there are no epsilon productions in G1, and we should be able to show that w can be generated in G as well. To establish this we will use our standard method, induction, and the induction will be on the length of the derivation. So there are two things to establish: one direction, that if S derives w in G and w is not ε then S derives w in G1, and the reverse direction. However, instead of showing this only for S, it will be more convenient for the proof to establish something stronger: the same thing not just for S but for every non-terminal. So let us say what we want to show. (1) Let A be a non-terminal; if A derives in G some terminal string w, with w not ε, then (let me write "then" in words, because an implication arrow might be confused with our derivation symbol) A derives w in G1. (2) The other way: if A derives in G1 the string w, then A derives the same string in G as well. If we prove this for all non-terminals A, then it is obvious that we have proved the statement for S, because S is one of the non-terminals. So let us establish these two separately. As we said, we will carry out the proof of (1) by induction on the length of the derivation.
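Written in symbols, the two statements to be proved look like this. This is a notational sketch only; subscripting the derivation symbol with the grammar name is an assumed convention for saying in which grammar a derivation takes place.

```latex
For every $A \in V$ and every $w \in T^{*}$ with $w \neq \varepsilon$:
\begin{align*}
(1)\quad & A \Rightarrow^{*}_{G} w \ \text{ implies } \ A \Rightarrow^{*}_{G_1} w, \\
(2)\quad & A \Rightarrow^{*}_{G_1} w \ \text{ implies } \ A \Rightarrow^{*}_{G} w.
\end{align*}
Taking $A = S$ in both gives $L(G_1) = L(G) \setminus \{\varepsilon\}$.
```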
Here the hypothesis is that we are deriving w in G, and what we are trying to show is that the statement holds for every n, n = 1, n = 2 and so on: if there is a derivation of length n, the statement is true for it. Basically, it is induction on the length n. So what is the base? The base is that some non-terminal A derives the string w in one step in G. What does that mean? If in one step we derive w from A, then A → w must be in the set of productions P. Is this not clear? If we derive some string in one step, we can use only one production, and therefore that production must be in P. Now notice that we have assumed w is not ε. Therefore this production would not have been removed by the process we discussed for getting the set of productions of the new grammar G1, which removed all the epsilon productions and added some other productions. So this particular production survives in G1 as well: A → w is in P1, and that means A derives the same string w in one step in G1. This takes care of the base case for (1). Now, what is the induction step? We assume the induction hypothesis, which is that (1) is true for all derivations of n steps or fewer, and for all A. It is a kind of simultaneous induction that we are carrying out: we prove this for every A. In particular, we proved the base case for every non-terminal, and now we are carrying out the induction step.
So (1) is true for all derivations of n steps or fewer, and, as I said, not just for a single symbol A but for all A. Assuming this induction hypothesis, we need to prove the same for derivations of length n + 1. Carrying out this step is also fairly simple. Consider a derivation in G of some w from A, of length n + 1. Say the production used right at the beginning was A → X1 X2 ... Xp, and then we have the other steps finally leading to w. So the first step rewrites A as X1 through Xp, and after that I have n more steps. You can do this only because A → X1 X2 ... Xp is a production in P. Now, it is possible that some of these Xi derive the null string. So, of X1, X2, ..., Xp, let Y1, Y2, ..., Ym be those symbols that are not eventually rewritten as ε in the derivation. Is it clear what is happening? For example, the first production you used might have been A → B C D. It may happen that, in the rest of the derivation, C, a nullable non-terminal, became ε, while the other two, B and D, generated non-null strings. Then B and D are the symbols we are talking about: the ones that do not eventually get rewritten as ε in the derivation. And Y1 through Ym are taken in the same order in which they appear in X1 through Xp.
So, for example, in this case Y1 would have been B and Y2 would have been D, because C was finally rewritten as ε. If that is the case, then first of all A → Y1 ... Ym is in P1, the set of productions of the grammar G1. Why? Because we create all the productions obtained by removing nullable symbols from productions of G, and since w is not ε there is at least one Yi, so this is not the excluded production A → ε. Now it is very clear: Y1 through Ym are not eventually rewritten as ε, so each of them generates a non-null string. Say the string generated by Y1 is w1, the string generated by Y2 is w2, and so on up to wm. Then clearly w = w1 w2 ... wm. But the derivation that takes each Yi to wi uses at most n steps, so we can use the induction hypothesis to say that Yi derives wi in G1 as well; and if Yi is a terminal, wi is just Yi itself and there is nothing to derive. So to generate the same string w in G1 we first use the production A → Y1 ... Ym, and then we use the derivations in G1 to obtain w1 from Y1, w2 from Y2, and so on, and finally I get w1 through wm, which is nothing but w. So we have completed (1). To show (2), the idea is very similar. Again I assume the induction hypothesis; I am not proving the base case, which can clearly be done very simply here too. So suppose this result is true for all derivations of length n or less in the grammar G1. Then I need to show that if I derive some w from some non-terminal A in n + 1 steps in G1, I should be able to derive that same string in G as well. So consider such a derivation starting from some non-terminal A.
The first step is that A is rewritten using a production of G1, because we are talking of derivations in G1. Say the first step uses the production A → X1 X2 ... Xp, and then I have n more rewriting steps to eventually get w. Say that in this derivation X1 eventually gets rewritten as w1, X2 as w2, and so on, and Xp gets rewritten as wp. Remember that nothing can give you ε in the grammar G1, so none of these wi is ε. Now, it could be that the same production is there in G. Then we have no issue: in G we use that same production to come to this point and then use the induction hypothesis. But it might happen that the production you are using came from a larger production of G, with A on the left-hand side and some more symbols on the right-hand side, which were removed because they are nullable. That is, A → X1 ... Xp may have come from A → Y1 ... Ym in G, where some of the Yj were removed to get this production, so m is larger than p. Now we want to show that the same w can be derived in G. As the first step of the derivation in G we use the production A → Y1 ... Ym. The symbols that were removed to turn this production of G into a production of G1 must be nullable symbols. So each Yj that was removed we rewrite as ε; I know I can do that because those are nullable symbols.
So eventually, after some steps, in G itself I will have X1 through Xp, and then we just follow the steps of the derivation in G1. Use the induction hypothesis for this part, because each wi is obtained from Xi in at most n steps. So now we have completed the second part as well. Putting it together, what we have shown is that our process of obtaining, from an old grammar, a new grammar that has no epsilon productions and yet generates all the non-epsilon strings derivable in the old grammar is correct. We still have one more kind of simplification to do, called the removal of unit productions, which we will do in the next lecture.
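As a sanity check on the claim just proved, one can compare the two languages on all strings up to a small length. The sketch below is only an illustration under assumptions: `strings_up_to` is a hypothetical helper, the example grammar for a*b* is made up, its G1 was written out by hand using the rule from this lecture, and every symbol that is not the head of some production is treated as a single-character terminal. A bounded check like this illustrates the theorem; it does not prove it.

```python
def strings_up_to(productions, start, max_len):
    """Terminal strings of length <= max_len derivable from `start`.

    Productions are (head, body) pairs; body is a tuple of symbols and ()
    stands for epsilon. Computed as a least fixed point over all heads.
    """
    heads = {head for head, _ in productions}
    lang = {head: set() for head in heads}
    changed = True
    while changed:
        changed = False
        for head, body in productions:
            partial = {""}
            for sym in body:
                # A non-terminal contributes its known strings, a terminal itself.
                options = lang[sym] if sym in heads else {sym}
                partial = {u + v for u in partial for v in options
                           if len(u) + len(v) <= max_len}
            if not partial <= lang[head]:
                lang[head] |= partial
                changed = True
    return lang[start]


# G generates a*b*, including the empty string.
G = [("S", ("A", "B")),
     ("A", ("a", "A")), ("A", ()),
     ("B", ("b", "B")), ("B", ())]

# G1: epsilon productions removed, nullable symbols optionally dropped.
G1 = [("S", ("A", "B")), ("S", ("A",)), ("S", ("B",)),
      ("A", ("a", "A")), ("A", ("a",)),
      ("B", ("b", "B")), ("B", ("b",))]

in_g = strings_up_to(G, "S", 4)
in_g1 = strings_up_to(G1, "S", 4)
print("epsilon in L(G):", "" in in_g)
print("epsilon in L(G1):", "" in in_g1)
print("same non-empty strings up to length 4:", in_g - {""} == in_g1)
```

The fixed-point computation is exact for the length bound because every substring produced by a symbol inside a derivation of w is no longer than w itself.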