 a unique meaning of the program, which everyone is going to interpret in the same way. Then we looked at certain optimizations, and when we did optimization, we were removing redundancies from the program without worrying about the underlying machine. Then we started looking at the translation of the abstract syntax tree, which was already disambiguated by the semantic analysis. That was something closer to the machine, and we looked at the abstractions available to us at the source level versus the abstractions at the target level. Then we looked at how to convert all these source abstractions into machine abstractions. And we said that the process is going to be that all the identifiers are mapped to certain locations, and then for all the variable accesses, depending upon where these identifiers live, we find out the exact locations from which we can pick up the information. Basically all the addressing mode information comes in here. Then we want to map all the source operators to target operators, and if there is no equivalent target operator, we write some kind of macro for it. Then we convert all the conditionals and iterations into a sequence of tests and jumps. After that we started looking at parameter passing protocols, and this is where at least some discussion is required. What we want to do with parameter passing is to find out where we are going to put all the arguments, and where we are going to pick up the return values from. So typically, if I say call some function p with a list of actual arguments, somewhere in my machine I will have to say what is the place where the arguments are going to be, and I also want to find out the place where I am going to put the return value. 
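To make the fixed-place idea concrete, here is a minimal sketch, in Python rather than machine code, of a call protocol in which the caller and callee agree in advance on where arguments and the return value live inside an activation frame. The slot names and the dictionary layout are purely illustrative assumptions, not any real calling convention.

```python
# A toy sketch of a parameter-passing protocol: arguments and the
# return value always sit at agreed-upon slots in the callee's
# activation frame, so both caller and callee know where to look.

stack = []  # the run-time stack of activation frames

def call(func, args):
    frame = {
        "args": list(args),   # agreed slot for incoming arguments
        "retval": None,       # agreed slot for the return value
        "locals": {},         # space for the callee's local variables
    }
    stack.append(frame)       # push the activation frame
    func(frame)               # run the callee's body
    result = frame["retval"]  # caller picks up the result from the fixed slot
    stack.pop()               # pop the frame on return
    return result

def p(frame):
    # the callee reads its parameters from the agreed slots
    x, y = frame["args"]
    frame["locals"]["t"] = x * y
    frame["retval"] = frame["locals"]["t"] + 1

print(call(p, [3, 4]))  # → 13
```

The point is only that the locations are fixed by convention; a real compiler lays out the same slots as offsets from a frame pointer, which is what the activation frame layout in the next paragraph pins down.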
Now I cannot put all this information at some arbitrary place; I need to put it at some fixed place, and therefore we must have a certain protocol which says how I am going to do parameter passing, where I am going to put all this information, what is going to be the layout of this procedure, where I am going to put activation frames and so on. So I must lay out the complete activation frame of the function, and when we do the run time system and talk about code generation for procedures and functions, at that time the layout of activation frames will become clear to you. And then we must also have some kind of interface to all the libraries we have; we may have mathematical libraries, statistical libraries and so on, the whole run time system, the operating system and so on. So we must know how to make these calls, and for that the compiler has to find out an appropriate mapping. Everyone is in sync with this? So once we do this, then, like the structure of phases we had to begin with: we said we are going to have the front end phases, after the front end phases we had certain optimizations, then we started talking about the code generation phases. We can also do a little bit of optimization after code generation. Some of the optimization we were trying to do earlier was removing all kinds of redundancies present in the program, but some redundancies come in because of the code generation phase itself. When I am trying to do code generation, certain more redundancies will creep in, and I would like to remove them too. And what kind of redundancies may these be? I may generate code which says multiply something by 1, or multiply by 0. Typically this kind of code gets generated when I am doing code generation for arrays or for structures and records. 
At that point of time, if some such code gets generated, we would like to remove it by doing a post code generation optimization. We also try to use the commutative properties of the operators we have. For example, take an expression like x = x + y versus x = y + x. Now, try any C compiler and generate code for these with all optimizations switched off, and you will find that the two code sequences are different. And what could be the reason for that? Why may these code sequences be different? Normally my compiler is going to scan the input from left to right, and for x = x + y it will figure out that the first argument is the same as the left hand side. Therefore, once I evaluate the remaining argument, I just pick up that value and add it to the location where x is already stored. But when it is scanning x = y + x, it will first encounter y, then x; x may be anywhere, even somewhere in the middle of a larger expression. So it will load one operand, load the other, add the two and so on. So x = x + y is the better form, and in fact C already provides a concise notation for this: the update operator, x += y. The whole idea of writing it this way is that I can generate better code. Now, what I am actually doing here is using a property of algebra, that x + y is the same as y + x. This is fine in most cases, unless I am doing high precision numerical analysis. That is why I put a question mark here: when you are doing high precision numerical analysis, if the compiler starts applying such algebraic properties, it is possible that you end up getting different results, and such optimizations should not be applied there. 
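The numerical-analysis caveat above can be seen directly. Commuting a single addition is harmless, but once a compiler starts applying algebraic laws more aggressively, for instance reassociating a chain of additions, floating-point results can change, because floating-point addition is not associative:

```python
# Floating-point addition is not associative, so a compiler that
# "simplifies" expressions using algebraic laws can change results.
a, b, c = 0.1, 0.2, 0.3
left  = (a + b) + c   # evaluated left to right, as written
right = a + (b + c)   # a reassociated form a compiler might prefer
print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

This is exactly why such transformations are normally gated behind explicit compiler flags rather than applied by default.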
And then I will start actually doing instruction selection. Once I have done the intermediate code generation, I look at the complete instruction set of the machine and try to find out how to use it: finding out addressing modes, doing peephole optimization and so on. And once that is done, what happens? So here is the same example running through. This is the disambiguated abstract syntax tree we had at the end of the front end, and now when I do intermediate code generation, it generates code like this, which is not with respect to any particular machine but is close to many machines. So I may have something like this which says that the variable b is actually stored at an offset of 8 with respect to a certain frame pointer. When I add the offset to the value in the frame pointer, what does it give me? It gives me the address where b is stored, but I need the value of b, so I do a dereference here, a memory dereference, and that gives me the value. Similarly on the right hand side, a is stored at an offset of 4 from the frame pointer, and b again at its own offset: I compute the addresses, dereference, and then add the two. Then there is an assignment, which is where the move takes place: I am just taking this value and moving it into this address. And then I am using an if statement here, with a constant 0, which becomes a conditional jump and so on. So this is the kind of intermediate tree I may get. When I do an optimization, the optimizer may say that I am going to keep all these values in registers and not in memory locations. So I may have registers, say b will be in register cx and a will be in register dx, and therefore some optimization happens here. 
I do not have to do all this memory dereferencing, and finally I may end up generating code like this. It just says compare cx with 0, and move-on-zero cx to dx: if it is 0, then just move it. So starting from the expression I began with, I finally generate code like this after going through many transformations. Now you can see that all I have done is change the representation of whatever I started with; I am preserving the same meaning and not changing anything. So how does my compiler look now? This is where we left our compiler, with the front end, and we had left this part of the block diagram empty. If I now fill it in, we have an optimizer, which is an optional phase, and then two phases of code generation: intermediate language code generation, and then the final code generation. This is also historically, or traditionally, known as the back end, the phase which is specific to the machine. This is where machine specific information starts coming into the compiler, and this completes the overall structure of the compiler. Questions, comments, anything? Right now we have no clue of what really happens inside, how the lexical analyzer works, how the syntax analyzer works and so on, but we have some idea that if I take a representation here and apply these phases, then these will be the representations in between and this will be my final representation. At least that part is more or less clear to everyone. So let us move on. What may happen now is that there is some information which is still missing here. 
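Before moving on, the lowering just traced, addresses computed as frame pointer plus offset, explicit memory dereferences, and only then the optimizer eliminating the memory traffic, can be sketched as a tiny intermediate code generator. This is a toy illustration, not any real compiler's IR; the offsets (a at 4, b at 8) are assumed for the example.

```python
# Toy intermediate code generation for "a = a + b": every variable
# access becomes an address computation (fp + offset) followed by a
# memory dereference, exactly as in the lecture's tree.

offsets = {"a": 4, "b": 8}   # assumed frame-pointer offsets
code = []
ntemps = 0

def new_temp():
    global ntemps
    ntemps += 1
    return f"t{ntemps}"

def address(var):
    # address of a variable = frame pointer + its fixed offset
    t = new_temp()
    code.append(f"{t} = fp + {offsets[var]}")
    return t

def value(var):
    # dereference the address to get the value stored there
    addr = address(var)
    t = new_temp()
    code.append(f"{t} = mem[{addr}]")
    return t

# generate intermediate code for:  a = a + b
va, vb = value("a"), value("b")
s = new_temp()
code.append(f"{s} = {va} + {vb}")
code.append(f"mem[{address('a')}] = {s}")

for line in code:
    print(line)
```

Notice that the generated code computes fp + 4 twice (t1 and t6): exactly the kind of redundancy the optimizer removes when it decides to keep a and b in registers instead of memory.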
So for example, let us just take a piece of code: say a = b + c, for which I am trying to generate code. When this goes through the lexical analyzer, what will be the sequence of tokens generated by the lexical analyzer and fed to the syntax analyzer? This is my input, right? So what will the output of the lexical analyzer be, what will my tokens be? Does it matter what the types are? Do I need to know types to find out what my tokens are? No, I do not need to know the types; the declarations will be somewhere upstream in the program. I am only looking at this part of the program and saying: tokenize this. What will I get as tokens, one by one? Somebody can raise their hand; fifty of you start speaking. [Students:] a, then the assignment, then b, then the operator plus, then c. And the white spaces? The white spaces do not contribute anything to the meaning, so I can ignore them; I do not have to worry about the white spaces. But really this is not quite what happens. What happens is, I will say that these are all identifiers. As far as structure is concerned, as far as the syntax analyzer is concerned, to check whether this is a valid expression or not, whether this is a valid assignment or not, it need not know anything about what the variables are; all it needs to know is whether this is an identifier or not. So what I would like to pass on is: identifier, assignment, identifier, some class of operator which is the addition operator, and again an identifier. As far as my syntax analyzer is concerned, this information is sufficient to do the structural analysis. 
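The tokenization just described can be sketched in a few lines: the lexical analyzer emits token classes (the syntax analyzer only needs "identifier", not which identifier), discards white space, and keeps the lexeme alongside the class for the later phases. The token names and patterns here are illustrative assumptions, not any particular scanner generator's notation.

```python
import re

# A minimal tokenizer sketch for statements like "a = b + c".
TOKEN_SPEC = [
    ("ID",     r"[A-Za-z_]\w*"),  # identifiers
    ("ASSIGN", r"="),             # assignment
    ("OP",     r"\+"),            # addition-operator class
    ("SKIP",   r"\s+"),           # white space: contributes nothing
]

def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                if name != "SKIP":          # white space is discarded
                    tokens.append((name, m.group()))
                pos += len(m.group())
                break
        else:
            raise SyntaxError(f"bad character {text[pos]!r}")
    return tokens

print(tokenize("a = b + c"))
# → [('ID', 'a'), ('ASSIGN', '='), ('ID', 'b'), ('OP', '+'), ('ID', 'c')]
```

The class alone is enough for the syntax analyzer; the lexeme carried with it is what the later phases will look up, which is exactly where the symbol table comes in below.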
When it comes to type checking, at that point of time it will need to know the type associated with this identifier, the type associated with that one, the type of the third, and what the exact operator is. This is what my semantic analyzer will do. When I come to code generation and assign certain memory locations to these variables, at that time it will say: this particular variable is put into this memory location. Say this one is at an offset of 4, this one at 8 and this one at 12. So if I look at an identifier, there is a lot of information associated with it. One piece of information is really the string itself, the lexeme: I need to know that the string associated with this identifier is different from that one. I also need to know the type, I need to know the offset, and maybe some other information; from the type I will also get, by default, how many bytes it is going to take when I store it in memory. Now where do I store all this information? If this is the sequence of tokens I am passing on, I cannot afford to lose this information; I need it at the time of code generation. Where do I store it? I need to store it somewhere. And for that, what we have is a place for the information required about the program variables during compilation. I may have various kinds of information: whether this is a keyword or an identifier, what is its class, what is its type, the amount of storage it requires, its address in memory and so on. This I need to store somewhere, and one possible location is right here in the token stream: I can define a structure and say all this information travels with the token. But then you can see I get into obvious problems. And what are the obvious problems? Suppose I change something about a: what will happen? This information has been duplicated at every occurrence. 
If I change anything, I need to go and find all the places where this information was kept and change them. Or, I can say that I have something called a symbol table: all the information corresponding to a is kept here, and all these occurrences point to a's entry. So I create another data structure where the information about all these symbols is stored, and this data structure is the symbol table. At this point of time we will not worry about what actual data structure I employ; you can just assume it is an array of records to begin with, where each record holds the information about one of the symbols in my program. So how does my compiler look now? Along with these phases of the compiler I must have a symbol table, and the symbol table interacts with all the phases. What will each phase do? The lexical analyzer will not have type information, for example; the only thing the lexical analyzer knows is the lexeme associated with this identifier, which is a. When it comes to the type checker, the type checker will figure out what its type is and will put in this information. When it comes to memory allocation, the memory allocator will say: if the type of this is floating point, I need to allocate, say, 10 bytes for it; and since I need these bytes, let me assign a certain address to it, and it records the base address. When it comes to code generation, the code generator will again figure out where to put this value and may need to associate certain registers with it. So all that information keeps going into the symbol table, and therefore you see bi-directional arrows here: all these phases will be able to write into the symbol table and will be able to read from the symbol table. So this is additional information I need, in addition to the flow of data through my compiler. Yes? 
[Student:] Sir, can you explain why writing into the symbol table is required? Why is writing into the symbol table required? So, if I start reading this from left to right, I encounter a first, and the lexical analyzer figures out that this is an identifier. Now where do I store the string a? This a has to be stored somewhere. Where do I store it? In the symbol table. If I cannot write into the symbol table, how will I store it? Similarly, when I come to the semantic analyzer phase and it figures out that the type of a is, say, int, where do I store this information? I need to write it into the symbol table. Similarly, when it comes to the memory allocator, the memory allocator says: I am going to place this variable at an offset of 16 from some frame pointer. Where do I store this value 16? I need to go back to my symbol table. So I need to be able to write into, and to read from, the symbol table. Not all the information will be written by all the phases; each phase is going to write the information which is relevant to that particular phase. Is the question answered? Now, so here is a model of compilation. And interestingly, as I pointed out when we were talking about the history of compilers, this was the model used as early as 1957, in the first FORTRAN compiler. That structure has not changed in the last 55 years. Now, what are the advantages? There must be certain advantages, otherwise we would have changed the structure by now. So what are the advantages of this structure, and what are the disadvantages? [Student:] The code will be modular. The code of the compiler, or the generated code? The complete compiler. The complete compiler. So the code will be modular. Good. So one advantage we clearly see here is that each phase is doing something which is a logical activity, complete in itself. The lexical analyzer is just tokenizing the input; my syntax analyzer is just checking structure. 
My type checker is just checking the types and nothing else. So you can see that we have highly modular code, where each phase is doing something which is a coherent logical activity. That is really the good part of this. What are the other advantages? Other advantages, other than modularity? [Student:] We would not have to rewrite the front end for different machines. Different machines or different languages? Different machines. Let me elaborate on that: the same front end can be reused for a different machine. So if I have one language and multiple machines, then the same front end can be used. Very good. And you should be able to apply the same reasoning to the back end: if I have many languages and one machine, then I should be able to use the same back end. Sounds like a very nice property that we have been able to obtain. So these are the advantages. This is also known as the analysis-synthesis model of compilation, where the front end phases constitute the analysis part: what the front end is doing is basically taking an input and analyzing it to find out whether it is a valid input with respect to the language definition or not; and what my back end is doing is synthesis: synthesizing a machine program for some target machine. Each phase has well defined work, and each phase handles one logical activity in the process of compilation. Now you can see that I can take one phase in isolation, and in the process of debugging, if I find that certain information is not correct, then I know very precisely where to go back and start looking for possible bugs. I do not have to look at the whole compiler; I just need to focus on part of it. Continuing with this, and this is really what some of you already pointed out: I can reuse part of the front end, part of the back end, and the whole compiler becomes retargetable. 
So suppose I am trying to write a C compiler for a new machine: I do not have to rewrite the front end. I pick up the front end as it is and change the back end, and my compiler gets retargeted to the new machine. Similarly, if I have a compiler for a machine and a new language is designed, then I just need to write a front end for it, and I can work with the back end which is already available. Source and machine independent code optimization is also possible, because if you look at the optimization we had, it was independent of both the source specification and the target machine; it was working on a certain intermediate representation. Therefore another advantage is that I do not have to plug in an optimizer to begin with. I can have a full, working compiler, and then at some point of time I can say: here is an optimizer which I am going to plug in, and suddenly your code quality improves. The same compiler will continue to work. That is another advantage. So we have a highly modular structure, and the optimization phase can be inserted after both the front end and the back end have been developed. This is actually a clean model, and this is how people do it in industry: you first have a functional compiler which assures functional correctness, and then you go for an optimizer which you plug in at some later point of time. Now let us look at the various issues which you need to be alert about when you start designing a compiler. What are the various issues we need to face? The structure looks fairly straightforward so far, but as we go deeper into it, you will start noticing the kinds of problems we face. When we take up each phase specifically, we will discuss the problems of that phase: what pitfalls you may notice in a lexical analyzer, what problems you may notice in a syntax analyzer and so on. 
But in general we need to worry about, for example: suppose your input is incorrect, how do you handle incorrect input? Now, when I say the input is incorrect, what does that mean? It is not that you have a buggy program; the program could be buggy, it could have logical errors, but what I am worried about is whether your input is correct or not with respect to the language specification, because the compiler has no way of figuring out whether there are logical errors. So you have to worry about incorrect programs: you need to do what we know as error reporting and error recovery. Now, as your languages become more and more complex, and as your architectures become more and more complex, it is going to have an impact on the compiler; I will not say whether it becomes simpler or more complex. For example, a compiler for the programming language C versus a compiler for a language like C++: the two are going to be very different, the reason being that if I am trying to write a C++ front end, then I have to do so many things in my syntax analyzer and type checker that it could be virtually hell. And there are languages which are even worse than that: if you go and google for a language called CHILL, or a language called Ada, then suddenly you will find that they are much worse; much worse in the sense, let me not sound so negative, they are more complicated. And obviously some of you are going to handle these: not all of you will get to write a C compiler; some of you who are brave hearts will pick up languages like Ada and CHILL and write compilers for them. So the design of programming languages and of architectures is going to have an impact on the complexity of the compiler. And this immediately brings us to the issue of retargeting a compiler. Typically what may happen, or what is desired, is that whenever there is a machine and whenever there is a programming language, I must have a 
compiler for that programming language on that machine. And if you say that there may be 10 to 15 popular architectures and maybe 20 popular languages, then I need some 200 to 300 compilers; somebody will have to sit and code them. This is what is also known as the classical m × n versus m + n problem. What we are trying to do is find out whether there is a better arrangement; some of you already talked about front ends and back ends. If I use that idea, what is the most critical part? I have these m front ends, and I have these n back ends. Now, to connect them I need something, and what is that something? I just cannot arbitrarily say: pick up this front end and pick up that back end. What we are talking about is some intermediate language, and this can also be called a switch box, where you say that all front ends will be able to generate the same intermediate language and all back ends will be able to work with the same intermediate language. So what I need now are just m front ends and n back ends: instead of writing m × n compilers, I can just write m front ends and n back ends, so m + n, and then life is simple. But life is simple only so long as I can design a universal intermediate representation, because there is always a trade off, and the trade off here is that I should be able to design this switch box. Now, is that straightforward? Suppose you have imperative languages, you have logic programming languages, you have functional languages; all kinds of complex languages may be there, and you are saying that I should be able to translate everything into a single intermediate representation. Is that reasonable? Is that realistic? No. Then all this structure goes for a toss, right? So the claim that I can employ any front end with any back end will not work as stated. So 
typically this is how m × n compilers look, and this is how the universal intermediate representation with all these front ends and back ends would look, and the prerequisite for this is that I must have a universal intermediate language. What property must this universal intermediate representation have? It should work for all languages and all possible machines, and there does not seem to be much commonality across this large set of languages. Interestingly, this project started way back, because people realized this problem way back, when the first compiler was written; in the 1950s itself people started talking about it. At that point of time people were saying: look, IBM has one machine, but other machines were getting developed, and at least two or three other programming languages were quickly coming up: COBOL was there, people were talking about Algol and some other languages. So this idea of a linguistic switch box actually materialized within a year of the first compiler. The idea was to come up with some kind of universal computer oriented language; at that time it was called UNCOL, and people gave it various names; computer scientists are very good at inventing such acronyms very quickly. But the whole idea was to have some kind of linguistic switch box which would do this translation, and it was supposed to have certain properties; basically the idea was to reduce the total development effort of compilers for different languages on different machines. And it did not succeed. A proposal was made in 1961 to reduce this development effort, yet other efforts were made later, and what happened was that it is next to impossible to design a single intermediate language which works for everything. So if we do not succeed here, do I go back to the earlier model where I say: forget about this switch box and write m × n compilers? Very 
good, if you know, then what is the solution? [Student:] Sir, the languages are different, there are different types of languages, but some of them share some similarity, so there can be a group of languages which share some similarity, and we can design an intermediate language for those similar languages. Very good. So really this was the solution which was proposed: do not look for a general solution, but look for a solution that will work for a subset of languages and for a subset of machines. There are literature pointers I have given here which you can read at some point of time, but the idea is interesting: do not look for a universal intermediate representation; find out languages which are similar, find out machines which are similar, and design an intermediate representation which will work for that set of machines and languages. So I may not have a solution for the full m + n problem, but I have a solution for a subset of machines and a subset of languages, and this is what was proposed. So common intermediate representations for similar languages and similar machines have been designed. If you look at GCC, for example, GCC runs on many machines, it can compile many languages, and it has a common intermediate representation; and in industry also, many of the compilers you use have a common intermediate representation for a set of languages and a set of machines. That really is a working solution. Now, another important issue we have to face is: how do we know that the code which has been generated is correct? And we already saw that there is no way I can prove that my compiler is correct. Now let us visualize the scenario: when you started programming, maybe in your first year, in ESC 101, which was the first time you used C, and you compiled your C program and it did not work, what was your first reaction? [Student:] Check the program, redo the program. Right. Okay, anyone else tried 
something else? [Student:] Recompile. But if you recompile the same program, nothing gets changed, right? If a program is not working and you repeat the same behaviour, nothing changes. [Student:] Compile with a different compiler. But when you are sitting at that computer, what different compiler do you have? So you started debugging your program. This is a program you wrote, but you did not have confidence in it; you were saying, my program is incorrect, I need to fix it. Did you ever blame the compiler, saying: my program is correct, I do not know who has written this compiler, and that is why my program is not working? Did you ever blame the compiler? No. That means you have this confidence in the compiler: whatever the compiler is doing is correct; what I am doing is perhaps buggy, and therefore I need to check my program. That means somebody must have done a good job of testing the compiler, or a good job of convincing you: do not blame the compiler, look at your program. So how do you generate this level of confidence? The compiler is yet another program, right? So can you generate a similar level of confidence in your own programs, and if yes, how do you do it? I can tell you, when we were doing the equivalent of ESC 101, our instructor was Professor Sahasrabudhe. The professor would come to the lab: are you confident that your program is correct? Yes? Okay, run it. It will take some input, and how does he give the input? He will just put his hand down on the keyboard, and most of the time your program is going to crash, because some random input goes in. If you are saying read an integer, and your input is not an integer, what happens? You are reading some input but not checking whether the input really is an integer; if a character goes in straight away, your program crashes. But what happens if you give incorrect input to a compiler? Have you ever seen a core dump from a compiler? It gives you a nice error message, right? So 
really, a compiler goes through a lot of testing. One way, obviously, is to prove that it is correct, but that is something we cannot do; program proving techniques do not exist at a level where a whole compiler can be shown to be correct. So what we need to do is very systematic testing, which is going to increase our confidence level. What is normally done is that we have something known as a test suite for the programming language. This test suite is independently designed by the language designers; it is not part of the compiler. So I have a language specification, and I have a test suite which is going to test any compiler written for this language. And what does this test suite contain? It contains thousands of programs; sometimes it can run into tens of thousands of programs, and each program tests a specific feature of the language, with documented behaviour saying: for this input, this should be the output of the program. It will also have buggy programs which say: for this input, this is the kind of error message the compiler should give, rather than generating code for incorrect programs. So the suite tests both correct input and incorrect input. So we have a test suite of programs where the expected behaviour of each program is documented, and this test suite is given to the compiler writer, saying: test your compiler against this suite. And all the test programs must be compiled using the compiler under test. Normally this is not done by the compiler writers themselves; it is done by a quality assurance team. There is a classical observation in industry that you cannot find bugs in your own program, because you tend to trust it all the time. So it is someone else's job: the QA team is always a different team, whose attitude is, I am 
just going to dissect it and find as many bugs as possible, and they use these test suites for that. Now how do I go through this testing? Suppose I give you a test suite which contains 50,000 programs and you start testing. You find that 5,999 programs work correctly, you come to program number 6,000, and the compiler crashes or gives an error: your compiler has a bug. What do you do in such a situation? You have to debug your compiler. Now when you debug your compiler, in all probability the bug which was stopping program number 6,000 is gone and it will now compile, but what guarantee do you have that the other programs which you already tested will still compile correctly? In fact the classical situation is that every time you remove a bug, perhaps you introduce two more, and that is what you will notice in the week of April 15 when you come for your project demos: half the teams will tell me that week, "I was trying to remove some bug; that bug is fixed, but now something else doesn't work anymore." So whenever you remove a bug, you have to go through what is known as regression testing. What is regression testing? Whenever you find a bug, you fix it, and then you start from the beginning and test all over again, until you have compiled all the programs in the suite in a single go, with a single version of the compiler, and observed the documented behavior of every program in the suite. You keep doing it again and again until your compiler gives the documented result for all the programs in the suite.
And how do we design test suites? We want to make sure that we do not have repetitive programs. For example, you will notice this when you start doing your project: if I ask how many programs you used to test your compiler, you will show me 5 programs, and every program is a for loop with different values, so one will have i going from 1 to 10 and another will have i going from 10 to 15. These two programs do not give me any new information; they are just repetition, more of the same. You have to make sure that the test programs exercise different parts of the compiler, so you have to do some amount of coverage analysis: most of the language constructs, and therefore most parts of your compiler, must be exercised during this testing process, and ideally the test programs should exercise every statement of the compiler at least once. Usually it is an art; the people who design test suites are experts at it, and exhaustive test suites have been constructed for some languages. Ada, for example, has an exhaustive test suite, and if you notice, Ada is perhaps the only language name which is a registered trademark, because it is the language of the US Department of Defense, and they are the ones who designed the suite. Any Ada compiler in this world, if you want to call it an Ada compiler, must be validated against the Department of Defense test suite; otherwise it cannot be called an Ada compiler. So such exhaustive test suites do exist and people have used them to really test their compilers; this is how you generate a lot of confidence in a compiler. So now, how do we reduce all this effort? All the time what we are trying to do is do things efficiently: we have to write a compiler, we have to test it, we have to do various things, and all the time I am trying to
reduce this development and testing effort, because I want to cut down on time and I want to cut down on cost. So how do I do it? One simple solution is: let us not write compilers at all. Then we do not have to do projects, life will be simple, but that is really not a solution, that is an extreme position; if we do not write compilers, somebody else still has to write them. So what we say instead is: let us not worry about writing compilers by hand; let us imagine a black box. Here is a black box with two input ports. On one side I feed in "Pascal" and "Motorola", and what comes out is a compiler: a Pascal compiler for a Motorola machine. That would be great, right? I just put in two USB sticks, one with the language specification and one with the machine specification, and what comes out is a compiler. Can we think of a black box like this? You have already seen such generators. Let us go back: what are the things you did in CS251, CS252, CS255? What were the modules you did in CS255? There was lex and there was yacc, and later flex. What was flex? What is the full form of flex? Have you read the manual? Do you remember the title of the manual? "flex, a fast lexical analyzer generator." And yacc was "Yet Another Compiler-Compiler", right? So what we were really doing was using generators, and a compiler generator should be able to generate a compiler. You have seen simpler versions of this. What did lex do? What was the input to lex? Anyone? Don't remember?
A set of regular expressions, right? Specifications. And what was the output? C code. And then you just compile that C code using a C compiler. So similarly I can now think of a compiler generator which takes language specifications and target machine specifications and gives me a compiler. If I can create this box, then I don't have to write compilers: I can shift the whole effort from writing compilers to writing a compiler generator. Then, if I can somehow write these specifications, all I need to do whenever I get a new language is write its language specification, and whenever I get a new machine, write its machine specification, and generate a compiler. Very quickly I can generate compilers. Again, this is too big a job to do in a single step. Just as we broke a full compiler into various phases, I want to break this activity into multiple steps as well. When we say we have specifications of the source and the target, look at the source specifications: when I was compiling languages, I said I must have an alphabet, then I must know how to tokenize, then I must know how to check the structure, and I must know how to check the meaning, and so on. So a language specification can be given at several levels of abstraction: lexical structure, syntactic structure, semantics, etc., and for each component I can write a separate specification. For example, when I say something is an identifier, I can write the specification in English: it is a string of characters that starts with a letter, followed by alphanumerics.
And there is a concise notation for this: if you recall regular expressions, I can write something like letter (letter | digit)*, a letter followed by zero or more occurrences of letter or digit; I am using the Kleene closure here. And if your language allows underscores or dollar signs in identifiers, you just make small changes to this expression. Now when I say I want a lexical analyzer, this specification should be the only input I provide. I should not have to worry about which data structures I use, how I read my input, and so on; those are mundane tasks that are the same every time, and only my specification changes. Similarly, I can write syntax and semantic descriptions, and I can also write target machine specifications; we will see how to write these specifications. So I can go back to my compiler and say: if I want a lexical analyzer, don't write it by hand, use a tool, a lexical analyzer generator, which you have already seen is lex, and all I need to give it is the lexical specification, which is just my regular expressions. Similarly, I can have a parser generator: instead of writing a parser, I write the parser specification in terms of a context-free grammar, and what comes out, as you have seen, is C code which is nothing but a parser. Similarly, I can think about the other phases: I can have a tool for each phase and a specification for each phase. For code generation I can have a code generator generator, whose specification is the machine specification. And what is the outcome? I am not writing these phases; I am using these tools and I am writing specifications all the time.
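The idea that only the specification changes while the generator stays fixed can be illustrated with a toy lexical-analyzer generator in Python. This is a deliberately tiny stand-in for lex/flex, not their actual interface: `make_lexer` and the token names below are all hypothetical, and the identifier rule is exactly the letter (letter | digit)* specification discussed above, written in `re` syntax.

```python
import re

def make_lexer(spec):
    """Toy lexical-analyzer generator: `spec` is a list of
    (token_name, regex) pairs -- the specification -- and the returned
    function is the generated lexical analyzer. The scanning loop and
    input handling are fixed; only the spec varies between languages."""
    master = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in spec))

    def tokenize(text):
        tokens, pos = [], 0
        while pos < len(text):
            m = master.match(text, pos)
            if not m:
                raise ValueError(f"cannot tokenize at position {pos}")
            if m.lastgroup != "SKIP":       # whitespace is discarded
                tokens.append((m.lastgroup, m.group()))
            pos = m.end()
        return tokens

    return tokenize

# Only this specification changes from language to language; note the
# identifier rule letter (letter | digit)* from the discussion above.
lexer = make_lexer([
    ("NUM",   r"[0-9]+"),
    ("IDENT", r"[A-Za-z][A-Za-z0-9]*"),
    ("OP",    r"[+*]"),
    ("SKIP",  r"\s+"),
])
```

Allowing underscores in identifiers would be a one-line change to the spec entry, with no change to the generator: exactly the economy the lecture is arguing for.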
Now, will it make life simpler? Definitely; compared to writing everything by hand, the overall effort will be reduced. How much? Normally we say that with these tools the effort goes down to about two-thirds. It is not free, because writing specifications and writing tools is time consuming, but if the total effort in writing the compiler by hand is x, then by using the tools and the specifications I can do it in about 0.65x, which is a lot of reduction: about 35% of the effort is gone. And it is not just the coding effort; your testing becomes easier too, because now when I want to test my lexical analyzer, all I need to check is whether my specification is correct or not. I don't have to worry about data structures, I don't have to worry about looping, I don't have to worry about whether I was reading my input correctly or not. That is one thing we have noticed generates the maximum number of errors: you are reading your input, you either skip a character or read an extra character, and suddenly you find that something cannot be tokenized, or your data structures are not correct. All those errors are gone; they are part of the tool, which needs to be tested only once. And this approach has one more advantage: if you find that the result is not efficient because the tool is not generating good code, all I need to do is improve the tool. If you remember lex, what is flex? Fast lex. People found that lex was not giving very good performance, so flex was written, and from the same specification you now generate much more efficient code. So each of these phases can be made more efficient just by improving the quality of the tools.
So this is the kind of effort we are going to put into writing a compiler. We will not look at the hand-written code that goes into each phase; our effort is going to be in understanding what these tools are and how to write the specifications. This is where the focus of this course is going to be. So how do we retarget a compiler? All I need to do is change the machine specification or the language specification, and I have a new compiler. If I need to modify a phase, I just change its specification. Tool-based development can cut your time down by almost 30 to 40 percent, tool testing is a one-time effort, and performance can be improved by improving the tool itself. And this is the last point as far as this introduction is concerned: how do compilers of the 21st century look? We are comparing the first compiler of 1957 with the compilers of the 21st century. The overall structure is still the same, but what about the effort? The effort is now in the back end, because a lot of optimization happens there, and there are many different kinds of machines, multi-core machines, GPUs, and various other machines, which were not present in that era. So most of the effort has really shifted into the back end, and runtime system integration and code optimization have become much more complicated than they used to be 50 years back. But the front end is now much simpler, because there are very standard tools and there are good mathematical models for handling the front end, like regular languages and context-free grammars.
So the proportion of effort has changed since the early days of compilation: earlier the front-end phases were the more complex and expensive parts, but today the back-end phases are the ones where more of the cost is involved, and the front-end phases are much simpler. This is where I close the introduction; in the next class we will start on the phases of compiling in real detail. So, let's begin.