So, there are some parts we did not discuss in the previous class, so let me start on those. Sir, the preference part is not clear to me. How do I know which two groups prefer each other? Talk to each other. Talk to each other; you are a small class. Then this criterion is going to be exactly like the first criterion, because it will be my group of two, and they will only give a preference for the other group. Which is fine, but I may not agree with that preference. I still hold that freedom. Under the third option, I may choose not to accept your preference; you can only give a preference. Under the first one, as I said, I am not going to change your group of two, but you may give more than one preference, and I will decide how to make the match. So this is not exactly the first criterion; under the first, I could not change anything. So, this is going to be the main text for us: Compilers: Principles, Techniques, and Tools, by Aho, Lam, Sethi, and Ullman. It is the standard textbook used in compiler courses, popularly known as the Dragon Book. You can find a lot of copies of it in the library. If you want a personal copy, you can get one; I think it is available in an Indian edition. But I have also listed a lot of other books, of which at least the first one I also use for some material. So, as I said, this remains the base text. Some material will also be taken from here, which is the book by Scott and [name unclear]. The rest of the books I have listed are good references for compilers, but mostly I will not use them; I will use the first one and, to some extent, numbers two and three. But it is good to know that there are other books. This is another good book. Then this book really talks about a lot of code: it may not go deep into the tools, but it shows a lot of code for compilers. This is, again, a classic book. And here is a much longer list.
If you just want to see these books, many of them are not in the library, but you can always check them in my office. At some point of time I have scanned all these books to find the relevant material, and I found that the first one is the one that can really be used for a course like CS335. The other books have a different focus, and I may not want to use them. Once again, I am not sure how many of you attended the previous class, because there were 70-odd people missing, so let me repeat: all of you have to register on this. Now, there is something wrong with the registration. I got a message from the team that some of you are actually registering on the wrong course. So, please look for IITK, within that select CS335, and look at Winter 2013, this session. You have to look for these keywords and then register for this course. Some of you have actually registered in the wrong course, and I have asked the administrator to shift you from that course to the right course. But please check that you have registered in the right course. Also, as I said, you can give feedback throughout the semester. You can use any forum you wish: personal discussion, anonymous email, email, Piazza, whatever you want. Do things on time; this part we have already fleshed out. We are going to make groups, and we are going to have two to three top project awards, which are going to be in kind, not in cash. Unlike other student activities, these are academic awards, so instead of a certificate you may get a book or some academic material. I will decide what the awards are, but the award winners will be decided by you. You choose the three best projects by voting, and whichever project gets the maximum number of votes wins. That project will be the best project in everyone's eyes; it may not be the best for the grade.
Your phones must be switched off while you are in the class. Please make sure. No cheating, and come on time. Today also we saw an influx almost up to 97. Please avoid that. And as I said, attendance is going to be compulsory in this course. I will regularly post the attendance data on a public forum where all of you can see who is attending how many classes. This data is going to be public; I am not going to hide it. Whatever attendance sheets you circulate, by the weekend your data will appear on the course website. Everyone will know who is attending. Now, going back to something we started discussing towards the end of the previous class: we started looking at a bit of history. To cover it very quickly: historically, we did not have compilers till the mid-50s, and people were writing machine code directly. There are normally two approaches to implementing high-level languages: one is to interpret, the other is to compile, and then there are languages which give you a mix of both. This gentleman is John Backus, who developed the first compiler. For that he got the Turing Award, which is the highest award a computer scientist can win. It is almost equivalent to the Nobel Prize in the sciences, so if you say somebody is a Turing awardee in computer science, that person is considered at almost that level in the world. For this work, he got the Turing Award in 1977. He had a huge impact on how languages were developed and how languages were implemented. And this project, which most people believed could not be done, he made possible more than 55 years back. So in 1957, which was 55 years back, the first compiler was written. An interesting part was the impact of this compiler.
So that really is the huge impact. The interesting part was that there was no theory of compilers at that point. They just wrote a compiler, and subsequent to that people started developing the theory. All the theoretical work on lexical analysis, parsing, code generation, and optimization really happened after they had developed the compiler. And the interesting part is that almost all compilers today still have the same structure as the original compiler, which was released 55 years back. The same structure is preserved. So you can see the kind of insight he had into compiler design and language implementation, and the kind of things they were able to do: bringing something into implementation without having a formal theory, with the theory developed only subsequently. So this is something you have to keep in mind: the kind of things which went into that compiler, and what happened historically in compilers. This is where I would like to close the discussion as far as the introduction to the course is concerned, and get on with compilers, unless you have any questions or comments. So let's move on and try to see what we mean by compilers. What is a compiler? You have been using compilers for at least two and a half years. So what is a compiler? What do you understand by compilers? This understanding is going to lay the foundation for the subsequent things we do in the course, because if we do not know what a compiler is, we will not know how to build one. So the first thing we need to understand is what a compiler is, so that we can make a compiler. Anyone? What is a compiler? Say something. Let me just write down some of these suggestions, and then we can understand it. So a compiler takes a program in a language, and what it gives me is an executable, which executes on a machine.
Okay? So instead of an executable, it can also give me a low-level language. It can also reject the program. So is that an outcome of a compiler? Sir, it can tell us that we have written something the wrong way. Okay, so it can also identify errors; is that what you are trying to say? Any other suggestions? It can translate to any target language. So it can translate the high-level language code into any target language. So you are saying it need not be an executable or a low-level language; it could be any target language. It can optimize and compile. So is there a certain property which is being preserved when I am going through these steps? I am starting with the program, I am doing something, and I am getting some outcome. Am I preserving a certain property in the process? Of the source? The essence of the program. What is the essence of the program? I want to understand. No, it is a serious issue. [unclear] What you have raised is an important point, but you need to articulate it. When you say essence of the program, what do you mean? The semantics: the executable or the target-language program, or whatever is produced, does exactly the thing that the program is supposed to do. Yes, that's right. So let me again give you certain scenarios. I have this program P1 in some high-level language, and then I have the compiler here. The compiler gives me the program P2 in some language. And what you are saying is that P1 and P2 are equivalent. So first, is this doable? No sir. Remember theory of computation? Can I take two arbitrary programs and prove that they are equivalent? No sir. Another suggestion which came was that if P1 and P2 produce the same output from the same input, then they are considered equivalent. Is that what you are trying to say? They need to follow the same process also.
It is not only that P1 and P2 give the same output; P2 should follow P1 statement by statement. And what is wrong with that? We already said that the compiler can optimize. So maybe the process which was used at some level was inefficient, and the compiler now says that it can change the process itself and make it more efficient. So perhaps a different process is acceptable, and we need to talk about process a little more. So first, is this an acceptable definition: two programs, when they are executed on the same input, produce the same output? So I say that P1 is equivalent to P2 with respect to input and output, not in terms of theory of computation, and that P1 is compiled correctly into P2. Is that okay? Sir, how can you execute P1 when it is in a high-level language? What do you mean by executing P1? I never said execute P1. I said P1 produces the same output when the input is given. So maybe there is no machine, but I can do a manual execution. I can go through the flow of the program. I can take the C program and go through a manual execution of it. I may not have a C machine on which the C code will directly execute. All these points you are bringing out are good points. They are relevant, but we need to say what is more relevant and what it is that we pick up as far as compilers are concerned. So we will not know whether the two programs are equivalent, but what we can obviously test is that P1 and P2 are the same as far as input and output are concerned. But then there are certain limitations. For example, suppose P1 is doing sorting, and this one uses bubble sort and that one uses a different, faster sort. Compilers do not have enough intelligence built into them to find out what the algorithm is and then replace it by an equivalent efficient algorithm. Compilers do not do that. They cannot do that.
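The input/output notion of "compiled correctly" discussed above can be illustrated with a small test harness. This is only a sketch in Python, not how any real compiler is validated; the names `bubble_sort`, `other_sort`, and `same_io` are hypothetical and chosen for this illustration. Two procedures that follow different processes are run on the same inputs and their outputs are compared. Agreement on the tested inputs raises confidence, but it is not a proof that the two programs are equivalent.

```python
import random

def bubble_sort(items):
    """Sort by repeatedly swapping adjacent out-of-order elements."""
    a = list(items)
    for end in range(len(a) - 1, 0, -1):
        for i in range(end):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

def other_sort(items):
    """A different process for the same task (Python's built-in sort)."""
    return sorted(items)

def same_io(p1, p2, inputs):
    """Return True if p1 and p2 give the same output on every supplied input."""
    return all(p1(x) == p2(x) for x in inputs)

# A fixed seed keeps the generated test inputs reproducible.
rng = random.Random(0)
tests = [[rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
         for _ in range(200)]
print(same_io(bubble_sort, other_sort, tests))
```

The check says nothing about which algorithm each procedure uses, which matches the point made in the lecture: only the input/output behaviour is being compared.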
So they do not understand the application. Therefore, when we talk of translation, of preserving the semantics, of preserving the meaning, what that means is that the compiler is really changing the representation. For the time being, let's not worry about optimization. What happens in translation is that you have a certain representation at the high level. C uses a representation where you declare certain variables of certain types, you have some data structures, and you have control flow; that is how the computation is represented. When it comes to machine level, or assembly language, or a different language, the representation may be different, but both are essentially computing the same thing as far as input and output are concerned. So a compiler is only going to change the representation; it is not going to try to understand what the algorithm is. And when it comes to optimization, it will not try to replace one algorithm with another, but will try to make sure that the same computation can be done more efficiently by using a different representation of the program or a slightly different order of execution. For example, it may decide that part of the code never gets executed. Somebody may write code like this: if (condition) S1 else S2. The compiler may figure out that the condition is always true, that S2 will therefore never get executed, and decide not to even generate code for S2. So efficiency will come only in terms of eliminating parts of the code which never execute, or changing certain representations so that the same result is achieved with fewer or cheaper computations. So if you say multiply by 2, it may decide that there is no point in multiplying, because that is costly; it may as well add the number to itself.
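The two optimizations just mentioned (dropping a branch that can never execute, and replacing a multiplication by 2 with an addition) can be sketched on a toy program representation. This is a minimal illustration in Python under stated assumptions: programs are nested tuples such as `("mul", "x", 2)`, operands have no side effects, and the names `simplify` and `evaluate` are hypothetical. It is not how a production compiler represents or optimizes code.

```python
def simplify(node):
    """Rewrite a tiny expression tree without changing what it computes."""
    if not isinstance(node, tuple):
        return node                          # a variable name or a constant
    op, *args = node
    args = [simplify(a) for a in args]
    if op == "mul" and 2 in args:
        # x * 2 becomes x + x (safe here because operands have no side effects)
        other = args[0] if args[1] == 2 else args[1]
        return ("add", other, other)
    if op == "if" and isinstance(args[0], bool):
        # The condition is a known constant, so only one branch can ever run.
        cond, then_part, else_part = args
        return then_part if cond else else_part
    return (op, *args)

def evaluate(node, env):
    """Manually execute a tree, looking variables up in env."""
    if isinstance(node, str):
        return env[node]
    if not isinstance(node, tuple):
        return node
    op, *args = node
    if op == "if":
        cond, then_part, else_part = args
        return evaluate(then_part if evaluate(cond, env) else else_part, env)
    left, right = (evaluate(a, env) for a in args)
    return left + right if op == "add" else left * right

program = ("if", True, ("mul", "x", 2), ("mul", "x", "x"))
optimized = simplify(program)
print(optimized)
print(evaluate(program, {"x": 21}), evaluate(optimized, {"x": 21}))
```

Both versions compute the same value for the same input; only the representation has changed, and the rewrite never needs to know what the program is for.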
So it can do only these kinds of optimizations; it cannot work at the level of algorithms. We have to understand very clearly that we are only dealing with some kind of representation and not with the application; trying to understand what the application is lies beyond the scope of the compiler. That is what we have to keep in mind. Therefore, when we start talking about what a compiler is, we deal with representations all the time, and what we say is that it translates one representation of the program into another representation. Beyond that we do not go deeper. Typically, most compilers take the high-level language code and translate it into either machine code or object code, but that does not exclude the scenario where I am translating to another high-level language. I could be writing a compiler from Pascal to C, or C to C++, or C++ to C; those will also be called compilers. But traditionally the term compiler is reserved for the scenario where I take the high-level language and translate it into some low-level representation which can then execute on a target architecture. Now we are coming to the source code. When I am writing source code, most of the time what I have in mind is that a human should be able to read that code and understand it, because in any application we are going to deal with the source code as human beings. So we are going to make it very expressive. We will not worry about any kind of optimization, most of us will not even worry about efficiency, and we will also keep a lot of redundancies in the code just to avoid all kinds of programming errors.
So you want to write very expressive code, but at the machine end I do not want any redundancies; I want to remove them. So remember that when we talk of optimization, I am only trying to remove redundancies from the representation and not trying to improve the algorithm per se. Then there is information about intent. Whatever information I had about the intent of the application, most of the time you will find that in the machine code I will not be able to figure out that intent. Many times when you read the source code you will be able to see what the program does. In the machine code, most of the time, we will not be able to figure out what the program does, so that intent is lost in the process of changing the representation from a high-level language to a low-level one. So this really tells you the functional specification of what a compiler does; these are the kinds of things you have to keep in mind. So how does it translate? The abstractions at the source level and at the machine level are very different. When you write a program in a high-level language, what kind of abstractions do you have? You normally start with variables, and you have operators using which you can build expressions. Once you have expressions, you would like to put conditionals into it, and you would like to have functions, and so on. Depending on the richness of the language, you keep on making it more and more complicated, but basically this is how you are going to start. What about the machine? If you are writing machine code directly, what kind of abstractions are available to you as a programmer? What resources do you have? You will have certain locations: you will have memory, and you will say that I am going to put certain variables into it. You may also have resources like registers, you may decide to put some things on the stack, and then
you have operands. Operands say that I am going to take certain data from memory, certain data from registers, from the stack, and so on; these are my operands. I am going to have various addressing modes. I may say that I am trying to get certain data from memory, but it is indexed, and that indexing information is in a register. Or I may say I will load the base address, and then I am trying to get something from memory where the location is given by some offset with respect to the base address. So these are the kinds of abstractions you have, and what a compiler has to do is translate something which is at the program level into something which is at this level. So source code and machine code are going to mismatch. Some languages are close to the machine abstraction; some languages may be very far from it. Depending upon the kind of language and machine, this gap may be narrow or large. The question is how we then translate across it; that is what a compiler is supposed to be doing. So we want to do this translation. Now, obviously, if I just try to do it in one go, it may not be possible. So normally we say it is a big jump, and therefore, before I make this big jump, can I take a small step? As I keep taking these smaller steps, I will finally end up with a translation. Each small step may be a simple step, but it should be a logically coherent step. Normally compilers are designed so that at a time you take one step but keep on moving in this direction until you have finally reached here. Therefore I need to understand what these small steps are; if I can understand each of these small steps, then I can reach the destination. Is it beginning to make sense? So what is the goal of translation? This is the
translation process I follow, and what is the goal of it? One is that the generated code should be good; it should give good performance. Obviously I also want good compile-time performance, because the translation process itself should not be so heavy that it takes a long time. Imagine a situation where you are sitting at a terminal, you have finished writing a program, now you want to execute it, you compile, and it says it is going to take 30 minutes to compile. Will that situation be accepted? Obviously not. Similarly, if it compiles very fast but takes 30 minutes to execute, it is the same scenario; that will also not be accepted. So we want good performance both for the generated code and for compile time. Now, these next two issues we will take up slightly later. What they mean is that I want to have maintainable code, because I may not be generating machine code all the time. Suppose I am doing a high-level translation, say I want to convert Pascal to C. I do not want to generate C code that others cannot maintain, so that every time I have to go back and maintain my original source code. So I should be able to maintain the generated code, and I also want to maintain my compiler code. As a user I may not be able to change the compiler, but as a compiler writer, if a bug is reported, then I should be able to handle it. Many times you will be using a compiler and find a bug; even in GCC we have found bugs, and somebody has to go and patch it. So obviously you report it back, but suppose they say: our code is so complicated that we do not even remember what to do with it. Again, that kind of scenario is not acceptable. So one should be able to maintain the code, and one should be able to maintain a high level of abstraction as far as the compiler is concerned. As I said, these two points will become clearer as we go along. But another very important issue is correctness. I do not want a situation where I compile a program, and I have written a program with certain
intention in mind, and when the results come they are entirely different, and they are different not because my program has a bug but because the compiler has a bug; the compiler is giving me some arbitrary code. That also is a scenario which is not acceptable to us. So correctness is a very important issue, and people have tried various approaches to it. One approach in the 70s and 80s was to prove that the compiler is correct, to prove that the code you have written is correct. That kind of thing has not really taken off. People were able to prove small programs correct, but how long is a typical compiler? How many lines of code will it have? If I take a C compilation system with full support, how many lines of code do you think it will have? About a lakh (100,000). If I go for something more complex like C++ or Ada, I can hit a million. And if I go for something more complex still, like a full integrated program development environment where multiple languages can be supported, then I may go into several million. With that kind of code size, today's technology does not support proving that what I have written is correct, so you have to go through thorough testing. And obviously, ensuring correctness is going to have an impact on the development cost, because you have to ensure that you have a very high degree of confidence in what you have written. That is something that happens with compilers. So we will talk about how compilers are tested, how we know that the code which is generated by the compiler is correct, and how we ensure that P1 and P2 are roughly equivalent. So this is the high-level view of a compiler. This is the box. We do not know what is inside this box, but we know what the input to this box is and what the output of this box is. Here is the high-level representation, here is the low-level representation, this is the box, and I need to understand what
are these steps I need to take, these small steps, so that rather than jumping in one step from here to here, I can jump from one level to another and slowly keep moving towards the right. So what are those steps? Let me again give you a scenario. Let me just look at this word: what is the meaning of this word? Let us go back to English. Let me give you a simpler case, a small sentence. Do we understand the meaning of this sentence? How do we do that? Okay, so let me mess up the sentence first. Do you still understand the meaning of this sentence? Roughly, but there is an error: there is a character we do not understand. This is not in English. Now, I am not looking at how humans go about understanding a language; I am trying to say how compilers would like to do it. The first thing we need to understand is what the alphabet is. For English, I will normally say that I have these 26 letters, both in lower case and upper case, and then I can also use numbers, and I have some punctuation marks, the full stop and so on. That really gives me the character set. Anything which is outside this character set is not acceptable, because it is not English. So you can immediately flag an error saying: I do not know what this character is, it is not in the character set I know, and therefore it is an error. That way you make sure that we have only proper characters of this language. So to understand a language, to understand the representation, first I must define my character set. What is the next thing I will do? Spellings. If I gave you a spelling like this, you would immediately say: I do not know what this word means. Now, how do I check spellings? Before checking spelling, do I need to do something? You did something which you are not able to articulate. So let me ask:
what happens now? Or what happens now? So somewhere, while reading this, I was able to break it into words. I knew what the word boundaries were. You were saying that whenever there is a blank there is a word boundary, and whenever there is a punctuation mark there is a word boundary. So if I put a character immediately after the dot, you did not mind it; you knew that there is a word boundary. But if I do something like this, then you say there is a problem. So I need to figure out what the word boundaries are. Only once I have figured out all the word boundaries will I be able to check whether something is a valid word or not, because if I have not even identified the word boundaries, how can I say whether this is a word? And how do I check words? There is a dictionary. You can look at the Oxford dictionary; you can look at any English language dictionary. Proper nouns may not be there; CS335 may be a proper noun here and will not be in the dictionary, but the rest of the words you will find. Now let us go back to programming languages. Do I have a set of characters for programming languages? We have. Do I have rules for finding out what the word boundaries are? Do I have a dictionary of C? Yes. So please give me a reference where I can look at the dictionary of C and find whether a word is in it or not. Is every word in the dictionary of C? Sir, we can consider the keywords. There are keywords. So we do not have a dictionary as such; what we have is a set of keywords, which acts as the dictionary, and then we have rules for the construction of valid words. Whatever is not in the dictionary, I will apply certain rules to it and check whether the word conforms to the rules I have. I will have rules for constructing valid numbers and rules for constructing valid words. So I will say count is valid, and this is not, because it does not match any of the rules I have for
the words. So the first step, when I start changing the representation from here to here, is to check whether all the characters are valid, break the sequence of characters into a set of words, and check whether those are valid words or not. If I can do that, then it does not matter yet whether it is a valid sentence or not; at least I have achieved something. So this is really the first step we follow. But before we move into this first step, we also need to look at why we write programs: we are trying to solve a problem, and when I try to solve a problem, the compiler plays only part of the role. There are other things which have to work with the compiler. So let's look at the big picture, where the compiler is part of a program development environment. After writing the program, I need to do a lot more things. So what are the other typical components? You obviously have to start with an editor, after you have booted whatever operating system you have. The compiler obviously has to work under a certain operating system and needs support from it. I must have some editor, and then I must have an assembler, loader, debugger, profilers, and so on. We will look at least at an introduction to all of these. The compiler and all these tools I am talking about must support each other. For example, take a scenario where your editor creates a file and it is not an ASCII file; it saves in a different format, like WordStar or Microsoft Word. If I try to read that, it is not in ASCII format. Now suppose I write a program using, say, Microsoft Word and say: compile this program. The compiler will not be able to handle that, because the file uses some kind of control sequences, some kind of compressed representation. So the compiler must know this interface, and therefore the editor must create files which my compiler can handle, and whatever is the output of
my compiler, the assembler should be able to handle, and so on. So what is the big picture? There is an editor which is being used by a programmer, and you have a source program which comes out of this editor. Then my compiler should be able to handle this program and translate it into assembly code. So you can immediately see that I am talking of an interface here. Now what do I do with assembly code? I can assemble it, which generates machine code. Then I will have to use a linker, because the program may be in multiple files and I want to resolve certain symbols. So I get resolved machine code, and then I am going to load this into certain machine locations and execute it. And what is the outcome of this execution? When I execute a program, what normally happens is that you get an error; you do not get results. When you get an error, what do you do? You need to debug your program, so you start using symbolic debuggers. Therefore the debugger and the compiler again have to work together; the representation which is coming out of the compiler must be understood by the debugger. Then the program will be executed under the debugger's control, and I will get debugging results. How do I use debugging results? I have a certain mental model of how the computation should go. I have a mental model saying that at this point of time the value of this variable should be this, but when I debug it I find the value is different. Then I say there is something wrong with the program, and I go and start fixing it, which is a manual process. Once the programmer has done all the manual corrections, we will go through this cycle again, and at some point of time we will get results. But to make sure that I get results, I must go through all these cycles. So what we need to make sure is that the compiler, which is sitting here, is able to generate information which all the other tools here can use. So this
is the bigger scenario. What I would like to do is stop here today, and tomorrow start going into the details. All right, so let's stop here for today.
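The first translation step described in this lecture (check that every character belongs to the character set, find the word boundaries, then check each word against the keywords or the construction rules) can be sketched as follows. This is a minimal Python illustration under stated assumptions: the keyword set, the character set, and the word rules are a small hypothetical subset chosen for the example, not the actual lexical rules of C, and the name `tokenize` is likewise only illustrative.

```python
import re
import string

# Assumed, deliberately tiny "dictionary" and character set for illustration.
KEYWORDS = {"if", "else", "while", "int", "return"}
ALLOWED = set(string.ascii_letters + string.digits + "_+-*/=<>(){};, \t\n")

# Rules for constructing valid words; a number may not run into a letter.
TOKEN_RULES = [
    ("NUMBER", r"\d+(?![A-Za-z_0-9])"),
    ("WORD", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("PUNCT", r"[+\-*/=<>(){};,]"),
    ("SPACE", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})"
                             for name, pattern in TOKEN_RULES))

def tokenize(text):
    """Check characters, split into words, and classify each word."""
    for ch in text:
        if ch not in ALLOWED:
            raise ValueError(f"character {ch!r} is not in the character set")
    tokens, pos = [], 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"no rule matches the word at position {pos}")
        kind, lexeme = match.lastgroup, match.group()
        if kind == "WORD":
            # Dictionary lookup first; anything else is an ordinary name.
            kind = "KEYWORD" if lexeme in KEYWORDS else "IDENTIFIER"
        if kind != "SPACE":              # blanks only mark word boundaries
            tokens.append((kind, lexeme))
        pos = match.end()
    return tokens

print(tokenize("if (count < 10) count = count + 1;"))
```

With these assumed rules, `count` is accepted as a valid word, while something like `2count` is rejected because it matches none of the rules, and a character such as `@` is rejected before any word is examined. Whether the sequence of words forms a valid sentence is not checked at this step.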