Welcome to today's lecture, In Quest of Higher ILP; ILP stands for instruction-level parallelism. In the last couple of lectures we discussed how pipelining can be implemented and how instruction-level parallelism is exploited in a pipeline. Today we would like to go beyond pipelining.

We have already seen the characteristics of pipelined processors. First, the maximum performance you can achieve is close to one instruction per cycle. Even if there is no stall and no hazard, you cannot go beyond this; it is the upper limit of the instruction-level parallelism that a pipeline can exploit. Second, only one instruction is issued at a time. Although different instructions are executed in an overlapped manner, at any particular instant only one instruction is present in each stage of the pipeline.

We have also discussed the baseline scalar RISC. The term scalar specifies that the processor issues one instruction at a time. Later on we shall discuss superscalar processors, where more than one instruction can be issued at a time. The main characteristics of the baseline scalar RISC processor are these: the issue parallelism is one, that is, only one instruction is issued at a time; the operation latency is assumed to be one; and the peak instructions per cycle is one. Earlier we were using the term CPI, cycles per instruction, which is greater than or equal to one in a pipelined processor. Now we are introducing another term, IPC, instructions per cycle, which is simply the reciprocal of CPI. Why are we doing it? Because in the processors we are going to discuss, the number of instructions per cycle will be more than one.
Since we want a CPI of less than one, we shall use the term IPC instead, which will be more than one in the processors we discuss from now on. Up to the pipelined processor we talked about CPI, because the number of cycles per instruction was one or more. It is only a matter of terminology: CPI and IPC represent the same thing, one being the inverse of the other.

Now the question is, can we go beyond pipelining? We have explored pipelined processors in detail and have seen the different types of hazards: structural hazards, data hazards and control hazards. We have also seen how these hazards can be overcome using suitable techniques. We used loop unrolling and software pipelining to overcome data hazards, and we added enough resources to overcome structural hazards. We have not yet discussed control hazards in detail; that we shall do subsequently.

The limitations of scalar pipelines we have highlighted several times. First, the maximum throughput is bounded by one instruction per cycle. Second, there is an inefficient unification of instructions into one pipeline. Try to understand this particular point. Instructions can be categorized into different types: data transfer instructions, data manipulation instructions, control instructions (also known as control transfer instructions), and a few status manipulation instructions. We know that these instructions do not take the same time to execute. For example, the data transfer instructions in a RISC are load and store, and these involve memory. Instructions that involve memory take a longer time.
Even so, we group them with the data manipulation instructions, where ALU operations are performed on registers. Obviously, getting operands from registers takes less time, so these instructions take less time to execute. But whenever we considered a pipeline, we considered a single pipeline with the stages instruction fetch, instruction decode, execute, memory and write back, and we tried to execute all the different types of instructions on that same pipeline. That is not merely a little unrealistic; in practice it cannot really be implemented in this way. So this is the inefficient unification of instructions into one pipeline: ALU operations and memory stages are very diverse, and then there are floating-point operations. Floating-point operations are quite complex and take several cycles, so unifying them into a single pipeline is not feasible. Nevertheless we made some assumptions; we considered that each stage takes only one cycle, and based on that assumption we implemented pipelining.

Another very important limitation of scalar pipelines is the rigid nature of the in-order pipeline. Instructions enter the pipeline one after another, and an instruction cannot go ahead of the one in front of it. As a consequence, if an instruction is stalled, the subsequent instructions cannot be issued or executed. This limitation comes from the in-order nature of instruction issue, one instruction per cycle: if a leading instruction is stalled, every subsequent instruction is stalled.

These are the limitations of scalar pipelines, and the objective of this lecture, beyond pipelining, is to overcome them.
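The two limits just discussed, the bound of one instruction per cycle and the way a stalled leading instruction holds up everything behind it, can be put into a few lines of arithmetic. The following is only a minimal sketch: the function names and the stall values are hypothetical, and the model assumes an ideal CPI of 1 with stalls simply added on top.

```python
def cpi_with_stalls(stall_cycles_per_instruction):
    """CPI of a scalar pipeline: the ideal CPI of 1 plus the average stall cycles."""
    return 1.0 + stall_cycles_per_instruction


def ipc(cpi):
    """IPC is simply the reciprocal of CPI."""
    return 1.0 / cpi


def in_order_issue_cycles(stalls):
    """Cycle in which each instruction issues when issue is strictly in order.

    stalls[i] is the number of extra cycles instruction i holds up the pipeline
    (hypothetical values); every later instruction is pushed back by them.
    """
    cycles = []
    next_cycle = 0
    for stall in stalls:
        cycles.append(next_cycle)
        next_cycle += 1 + stall
    return cycles


print(ipc(cpi_with_stalls(0.0)))            # ideal scalar pipeline: IPC cannot exceed this
print(ipc(cpi_with_stalls(0.25)))           # assumed: one stall cycle per four instructions
print(in_order_issue_cycles([0, 2, 0, 0]))  # second instruction stalls for two cycles
```

In the last call the two-cycle stall of the second instruction delays the third and fourth instructions as well, even if they are independent of it.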
So the main objective here is a processor with higher ILP. We have seen that an ideal CPI of 1 can be achieved by eliminating data and control hazards. We would like to improve the performance further and achieve a CPI less than 1, or equivalently an IPC greater than 1. In this connection two basic approaches have emerged: the first is known as very long instruction word, VLIW, and the second is known as superscalar. VLIW we shall discuss today, and superscalar processors we shall discuss later on.

Let me highlight the key differences between these two approaches before I discuss VLIW in more detail. If we want a CPI less than 1, or an IPC greater than 1, we have to execute more than one instruction concurrently. The question naturally arises: who will identify the instructions that can be executed concurrently? The primary requirement is that they should be independent. That difference leads to the two approaches. In a VLIW processor, the compiler has complete responsibility for selecting a set of instructions to be executed concurrently. The obvious consequence is that the hardware will be simple, but it will require a smart compiler. The complexity is passed on to the software designers who develop the compiler, instead of to the hardware designers who implement the processor. So in VLIW the compiler is complex and the hardware is simple. In a superscalar architecture, on the other hand, the instructions that can be issued or executed concurrently are identified by hardware, not by software.
Among superscalar processors there are two basic approaches. The first is the statically scheduled superscalar processor: multiple instructions are issued per cycle, the issue is done by hardware, and the instructions are executed in order. The second is the dynamically scheduled superscalar processor, where sophisticated techniques such as speculative execution and branch prediction are used and out-of-order execution is allowed. That is, by incorporating these techniques, instructions can be issued and executed out of order. Obviously a superscalar processor requires more hardware functionality and complexity; we shall see that such a processor is very complex and consumes a lot of power compared with a VLIW processor.

You may ask why VLIW processors are important. The reason is that the hardware cost and complexity of superscalars is a major consideration in processor design. Because the hardware is very complex, it requires a lot of chip area and the power dissipation is very high. To overcome that, VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently. The compiler analyses the instructions, identifies which are independent, and bundles them together; these instructions are then packed and dispatched together.
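To make the compile-time analysis a little more concrete, here is a minimal sketch of how independent instructions might be identified and bundled. It is not the algorithm of any particular compiler: the `Op` record, the register names and the simple greedy, in-order packing are all assumptions made for illustration, and every operation is assumed to complete before the next bundle starts.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Op:
    """One operation: its text, destination register and source registers."""
    text: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def independent(earlier: Op, later: Op) -> bool:
    """True if 'later' has no RAW, WAR or WAW dependence on 'earlier'."""
    raw = earlier.dest is not None and earlier.dest in later.srcs
    war = later.dest is not None and later.dest in earlier.srcs
    waw = earlier.dest is not None and earlier.dest == later.dest
    return not (raw or war or waw)


def pack(ops: List[Op], width: int = 4) -> List[List[Op]]:
    """Greedy in-order packing: open a new bundle when the current one is full
    or when the next operation depends on something already in it."""
    bundles: List[List[Op]] = []
    for op in ops:
        if (not bundles or len(bundles[-1]) == width
                or not all(independent(prev, op) for prev in bundles[-1])):
            bundles.append([])
        bundles[-1].append(op)
    return bundles


# Hypothetical instruction sequence, not taken from the lecture slides.
program = [
    Op("add R1,R2,R3", "R1", ("R2", "R3")),
    Op("load R4,4(R5)", "R4", ("R5",)),
    Op("mul R7,R8,R9", "R7", ("R8", "R9")),
    Op("move R6,R4", "R6", ("R4",)),         # needs R4 from the load
    Op("sub R10,R6,R1", "R10", ("R6", "R1")),  # needs R6 from the move
]

for bundle in pack(program):
    print([op.text for op in bundle])
```

With this sequence the first three operations share one bundle, while the move and the subtract each start a new bundle because of the values they wait for. A real compiler would also reorder operations and respect the types of the functional units, which this sketch does not attempt.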
What you are doing is this: the compiler identifies several instructions, say 4 to 8, and packs them into a single instruction. Say instruction 1, instruction 2, instruction 3 and instruction 4 are put together into a single VLIW instruction. As a consequence the instruction is long. In a normal pipelined processor the instruction width is only one word, but since several instructions have been packed or bundled into a single VLIW instruction, the instruction is wider, and that is why it is known as a very long instruction word. For example, if each instruction is 32 bits and you pack 4 instructions, you need 4 words of 4 bytes each, that is, 16 bytes for a single instruction.

These instructions are then stored in memory. The source code is given to the compiler, and the compiler produces the object code. For a VLIW processor the object code consists of VLIW instructions, each comprising, say, 4 instructions; the next one also comprises 4 instructions, and so on. After the instructions are bundled into very long instruction words they are stored in memory, cache or main memory as the case may be, and then fetched one after the other and executed in order, just as simple instructions are fetched and executed in a pipelined processor. So if you look at the object code after compilation, you will find that the instructions are long instruction words like this.
Each instruction may be, say, 16 bytes long. Then there is the static instruction issue capability. We call it static instruction issue because it is done at compile time: as I explained, the compiler analyses the program and generates the instructions statically, and they are then stored in memory. At run time there is no change; the instructions are fetched one after the other and executed in the same order. This concept has been employed in several commercial processors, including the Intel IA-64.

As I have already mentioned, VLIW processors deploy multiple independent functional units. If you are issuing four instructions, you need four functional units in a single CPU, and these functional units are fed by a single VLIW instruction which, as already mentioned, consists of four operations or instructions. Each field feeds one functional unit: instruction 1 goes to functional unit 1, instruction 2 to functional unit 2, instruction 3 to functional unit 3 and instruction 4 to functional unit 4. These functional units need not be identical, and in fact they are not. Some can be integer units, some floating-point units, some load/store units, which perform only load and store operations, and a fourth type can be a branch unit. So the functional units are specialized and capable of performing different types of operations. For example, integer units can perform addition, subtraction, multiplication and division of integers, and various logical operations.
Similarly, floating-point units perform the floating-point operations, such as floating-point addition, subtraction, multiplication and division. The load/store unit is responsible for storing register values into memory and loading memory contents into registers. So each unit's job is specialized, and multiple functional units of this type have to be present in these processors.

Early VLIW processors operated in lock step; there was no hazard detection hardware at all. It was assumed that the compiler had done the job: it had already established that the operations in a bundle are independent, and that is how they were issued, so no hazard detection was necessary. In lock-step execution, once a bundle of instructions has been executed, the next bundle is fetched and executed, and so on. Now suppose a bundle gets delayed for some reason, for example a cache miss: the load/store unit is delayed because, instead of loading from the cache, it has to load from main memory. In such a case a stall in any functional unit causes the entire pipeline to stall. That is how it was implemented.

Let us now consider a 4-issue static superscalar processor; superscalar here means that there are multiple functional units. During the fetch stage, 1 to 4 instructions are fetched, and the group of instructions that can be issued in a single cycle is called an issue packet or a bundle. If an instruction would cause a structural or data hazard, it is not issued. In the VLIW case the compiler does this analysis and finds out whether a particular instruction would cause a structural or data hazard. A structural hazard means that there are not enough resources.
If enough resources are not available, the instruction is not issued; likewise, if there is a data hazard, that is, a data dependency among instructions, they are not issued together. The instructions have to be completely independent; only then are they issued together in a VLIW processor. Within a single VLIW instruction, each field targets a different functional unit, since the operations have to be executed concurrently.

Some practical, commercial processors based on the VLIW approach are the Multiflow Trace, the Texas Instruments C6x, the Itanium (IA-64) by Intel and the Crusoe processor by Transmeta. In all these cases the processor issues a bundle: it fetches one instruction, that very long instruction word comprises, say, four fields, they are fetched together, and then all four operations are issued.

So the bundle is issued to multiple functional units. Here the first operation is add R1, R2, R3. It is issued to one functional unit, obviously an integer unit, which performs addition of integers. The next operation, load R4, R5+4, is issued to a load/store unit. Then there is move R6, R4, again a data transfer instruction, but between registers; it too is performed by one of the functional units. Finally there is a multiplication, issued to another integer unit, which takes two operands from registers, multiplies them and stores the result in a register.
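The lock-step behaviour described earlier, where a stall in any functional unit stalls the whole machine, can be illustrated with the four operations of this bundle. The sketch below is a deliberately simplified model: the latency values are hypothetical, the functional units are not pipelined, and the next bundle is assumed to start only when the slowest operation of the current one has finished.

```python
# Hypothetical latencies in clock cycles; real values depend on the machine.
LATENCY = {"add": 1, "move": 1, "mul": 3, "load_hit": 2, "load_miss": 20}


def bundle_cycles(ops):
    """Cycles a bundle occupies the machine: the slowest operation decides.

    An empty list stands for a bundle of NOPs and is charged one cycle.
    """
    return max((LATENCY[op] for op in ops), default=1)


def run_lock_step(bundles):
    """Total cycles when each bundle starts only after the previous one finishes."""
    return sum(bundle_cycles(bundle) for bundle in bundles)


# The bundle from the lecture, once with a cache hit and once with a miss.
with_hit = [["add", "load_hit", "move", "mul"]]
with_miss = [["add", "load_miss", "move", "mul"]]

print(run_lock_step(with_hit))
print(run_lock_step(with_miss))
```

In the second case the add, move and multiply units finish early but still have to wait for the load, which is exactly the cost of having no hazard detection hardware to let the other units proceed.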
So this is the schematic explanation of a VLIW instruction. It is a generalized picture; different processors will differ in the details, and I shall consider one particular type in detail. In a nutshell, as I have already mentioned, the issue hardware is simpler, because it does not do any analysis of the instructions; that analysis has been performed by the compiler. The compiler also has a bigger context from which to select instructions to co-schedule. However, compilers do not have run-time information such as cache misses. As I have explained, many things that happen at run time are not known at compile time. A cache miss is a phenomenon that occurs only at run time; by looking at a set of instructions one cannot say whether a cache miss will occur. Scheduling is therefore inherently conservative, and the instruction-level parallelism that can be exploited is limited in scope. Branch and memory prediction is also more difficult: whether a branch will be taken or not, and whether two effective addresses are the same or not, can be identified only at run time. As a consequence, typical VLIW processors are limited to 4-way or 8-way parallelism; as I have already said, the number of instructions that can be issued in parallel is restricted to 4 to 8.

Now we come to Transmeta's Crusoe processor. Transmeta, as I mentioned, is the name of a company; they introduced a processor known as Crusoe, which uses a VLIW architecture.
In Transmeta's terminology, an instruction of the VLIW processor is called a molecule and each field is called an atom, so each molecule consists of 4 atoms. You have different functional units: a floating-point unit, an integer unit, a load/store unit and a branch unit. Instructions of those types are packed into the 4 fields and issued simultaneously. So the compiler generates long instructions having multiple operations meant for different functional units, and the group of instructions that can be issued in a single cycle is called an issue packet or a bundle.

Let me illustrate VLIW execution with the same example that we discussed in the context of pipelining. It is the same high-level language program, in which a scalar value s is added to the elements of an array and the results are stored in memory, and this is the corresponding MIPS code, which I explained earlier. Now let us see how this program can be run on a VLIW processor. The VLIW processor is assumed to perform 5 operations per instruction: 1 integer operation, 2 floating-point operations and 2 memory references, each requiring a field of 16 to 32 bits. The same instructions are executed, but the loop has been unrolled 7 times to avoid delays. You need more instruction-level parallelism to feed the different functional units, and that is why you have to unroll so many times. Each line here corresponds to a single VLIW instruction, and there are 1, 2, 3, 4, 5, 6, 7, 8, 9 of them. That means you need 9 clock cycles to execute the unrolled loop on the VLIW processor.
So unrolling 7 times results in 9 clocks, or about 1.29 clocks per iteration. You are performing 23 operations in 9 clocks, an average of about 2.5 operations per clock. In terms of a simple pipeline, you would say the IPC here is about 2.5, because you are able to perform about 2.5 operations per clock.

But notice one point. Each instruction has 1, 2, 3, 4, 5 fields, and if you look at the different instructions you will find that in many of them some fields are empty. What does that mean? It means that the compiler has failed to identify enough independent instructions that can be executed concurrently. Having 4 or 5 fields does not mean that the compiler can fill all of them; it may not be able to, and as a consequence some fields remain empty. What do we mean by empty? Those fields have to be filled with an instruction known as no operation, or NOP, which means that nothing is performed.

Suppose each instruction has 5 fields, and look at successive instructions. In one instruction you may find that two fields have been filled and the other three remain unutilized; in another, three fields have been filled; in another, again only two; and so on. In this way some of the fields remain unfilled. What happens in such a case? The utilization of the functional units will not be 100 percent; it will be less, because some of the fields remain unutilized. That is the situation here also.
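The figures quoted for the unrolled loop, and the idea of empty fields lowering utilization, come down to simple counting. The sketch below redoes that counting; the function names are hypothetical, and the small three-instruction schedule at the end is invented purely to show how empty fields (written as `None`, standing for NOPs) are counted. It is not the schedule from the slide.

```python
def clocks_per_iteration(clocks, iterations):
    """Average number of clocks spent on one loop iteration."""
    return clocks / iterations


def operations_per_clock(operations, clocks):
    """Average number of operations completed per clock."""
    return operations / clocks


def slot_utilization(schedule):
    """Fraction of instruction fields that hold a real operation.

    schedule is a list of VLIW instructions; each is a list of fields,
    with None marking a field that had to be filled with a NOP.
    """
    total = sum(len(instruction) for instruction in schedule)
    used = sum(1 for instruction in schedule for field in instruction if field is not None)
    return used / total


# Numbers from the lecture: 7 iterations, 9 clocks, 23 operations.
print(round(clocks_per_iteration(9, 7), 2))
print(round(operations_per_clock(23, 9), 2))

# Hypothetical 5-field schedule, only to illustrate the counting of empty fields.
toy_schedule = [
    ["LD", "LD", None, None, None],
    ["LD", "LD", "ADD", None, None],
    ["SD", None, "ADD", "ADD", "SUB"],
]
print(slot_utilization(toy_schedule))
```

The first two lines reproduce the roughly 1.29 clocks per iteration and roughly 2.5 operations per clock mentioned above; the toy schedule has 9 of its 15 fields filled.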
Only about 60 percent of the functional units are used. If you look at the schedule, 3 fields remain unutilized in the first instruction, 3 in the second, 1 in the third, 2 in the fourth, and so on.

Another point you must notice is that you need more registers in VLIW. Whenever you do loop unrolling you need more registers to avoid conflicts, and VLIW needs more instruction-level parallelism, so it needs more unrolling. In the context of simple pipelining, unrolling 3 or 4 times was enough to provide sufficient instruction-level parallelism without any stall. For a VLIW processor you need more unrolling to get higher ILP, and as a consequence more registers; a VLIW processor should be provided with a larger number of registers. Here, since the operations are floating-point operations, effectively all 32 floating-point registers have been used, and in spite of that the functional unit utilization is only about 60 percent. This limitation you have to accept whenever you go for a VLIW architecture.

I was talking about one commercial processor, Transmeta's Crusoe. It allows two different instruction formats: memory, compute, ALU and immediate, where immediate means an immediate data field; or memory, compute, ALU and branch.
With these four fields you can have five different types of operation slots: ALU, a typical RISC-like ALU operation; compute, an integer or floating-point ALU operation; memory, a load or store operation; branch, a branch instruction; and immediate, 32 bits of immediate data. Each field is 32 bits, so the size of the instruction in Transmeta's Crusoe processor is 32 × 4 = 128 bits. As I have already said, the long instruction word, called a molecule, can be 64 or 128 bits long, and a molecule can contain up to four RISC-like instructions called atoms, since the bundle is formed from up to four instructions. All atoms in a molecule are executed in parallel, and molecules are executed in order. These are typical characteristics of VLIW processors, and the same is followed in Transmeta's Crusoe processor.

It uses a simple in-order six-stage pipeline for integer instructions. So not only are there multiple functional units; the functional units themselves can be pipelined, which is an additional complexity that has been incorporated. The six stages of the integer pipeline are two fetch stages, decode, register read, execute and register write-back. Similarly, there is a ten-stage pipeline for floating point, which has four additional execute stages; that is, six plus four stages. So different instructions take different times to execute. That is how it is done in Transmeta's Crusoe processor.
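The molecule formats and pipeline depths just described can be written down as a small data model. This is only an illustrative sketch based on the figures given in the lecture: the format labels, the stage names (in particular the execute stage) and the `fits` check are assumptions for illustration and do not describe Transmeta's actual encoding.

```python
ATOM_BITS = 32  # each field (atom) is 32 bits wide, as stated in the lecture

# The two molecule formats named in the lecture; the keys are hypothetical labels.
FORMATS = {
    "with_immediate": ("memory", "compute", "alu", "immediate"),
    "with_branch": ("memory", "compute", "alu", "branch"),
}

# Six integer stages; the stage names are assumed for illustration.
INTEGER_STAGES = ("fetch 1", "fetch 2", "decode", "register read", "execute", "write back")
EXTRA_FP_EXECUTE_STAGES = 4


def fits(format_name, atom_kinds):
    """True if the given atoms can be placed in the fields of the chosen format,
    using each field at most once."""
    free_fields = list(FORMATS[format_name])
    for kind in atom_kinds:
        if kind not in free_fields:
            return False
        free_fields.remove(kind)
    return True


def full_molecule_bits():
    """Width of a molecule with all four atoms present."""
    return ATOM_BITS * len(FORMATS["with_branch"])


print(full_molecule_bits())                               # 4 atoms of 32 bits each
print(len(INTEGER_STAGES) + EXTRA_FP_EXECUTE_STAGES)      # floating-point pipeline depth
print(fits("with_branch", ["memory", "alu", "branch"]))   # unused compute field becomes a NOP
print(fits("with_immediate", ["branch"]))                 # no branch field in this format
```

The point of the check is simply that a branch and a 32-bit immediate compete for the same fourth field, so a molecule can carry one or the other but not both.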
This is the VLSI chip layout. It can operate in the range of 500 to 700 MHz; the L1 cache is 128 KB and the L2 cache is 256 KB. The main memory is of DDR SDRAM type, and it can be upgraded with SDRAM-type memory; the north bridge is integrated. The package is a 474 BGA, the partner is IBM, the process technology is 0.18 micron, the die size is 73 square millimetres, and production started in mid-2000. These are the details of the implementation.

This is the basic approach followed in Transmeta's Crusoe processor. They have used a novel concept known as code morphing software. What is this code morphing software? It translates from one instruction set architecture to another. The instructions that are fetched correspond to the x86 ISA, and the code morphing software, CMS, transforms them into the VLIW instruction set architecture. So it is a special piece of software, provided by Transmeta, to translate from one instruction set architecture to another. Why is this being done? Because they found that most application software is available for x86 processors, that is, the Pentium and similar processors. The application software corresponds to x86, it is converted into the instruction set architecture of the VLIW, and then it can run on the VLIW processor. At the centre you have the VLIW engine. The outer layer corresponds to PC and internet applications and to operating systems such as Windows and Linux. In between the VLIW engine and this outer layer of application and system software you have the code morphing software, which acts as an interface between the two.
So the code morphing software morphs x86 to VLIW, and the VLIW engine is a high-speed, low-power engine; VLIW plus code morphing is equivalent to an x86-compatible solution. That means a program that has been developed for the Pentium can be run on this processor. The code morphing software is fundamentally a dynamic translation system, a program that compiles instructions for one instruction set architecture into instructions for another. Here, as I mentioned, x86 code is compiled into VLIW code. What the code morphing software does is insulate x86 programs from the hardware engine's native instruction set.

What is the main benefit of this approach? With the advancement of technology, new generations of processors keep coming: Pentium, Pentium II, Pentium III, Pentium 4 and so on. In the same way, the VLIW processor that Transmeta developed back in 2000 will keep being upgraded, and its functionality and characteristics can be improved. As you change the VLIW engine, what has to be modified in the system so that the user is not affected? The only change required is in the code morphing software. As the VLIW engine changes, the code morphing software needs to be modified, and that is provided by Transmeta, so the user is not affected. The native instruction set can be changed arbitrarily without affecting any x86 software at all; only the code morphing software needs to be ported. Moreover, the code morphing software can learn about and optimize the application.
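The idea of a dynamic translation layer that insulates x86 programs from the native instruction set can be sketched in a few lines. This is a toy model and not Transmeta's software: the class name, the cache keyed by block address and the trivial `toy_translate` function are all assumptions for illustration. It assumes that a block is translated the first time it is met and that the translation is kept for reuse.

```python
class CodeMorpher:
    """Toy dynamic translator: translate a block once, then reuse the result."""

    def __init__(self, translate):
        self._translate = translate   # function: list of x86 instructions -> native code
        self._cache = {}              # block address -> translated native code
        self.translations = 0         # how many times translation actually ran

    def fetch(self, address, x86_block):
        """Return native code for the block, translating only on first use."""
        if address not in self._cache:
            self._cache[address] = self._translate(x86_block)
            self.translations += 1
        return self._cache[address]


def toy_translate(x86_block):
    """Placeholder translation: one molecule per x86 instruction, padded with NOPs.

    A real translator would split instructions into RISC-like operations and
    schedule several of them into each molecule.
    """
    return [(instruction, "nop", "nop", "nop") for instruction in x86_block]


morpher = CodeMorpher(toy_translate)
block = ["mov eax, [ebx]", "add eax, ecx"]   # hypothetical x86 fragment

first = morpher.fetch(0x1000, block)
second = morpher.fetch(0x1000, block)        # served from the cache, no retranslation

print(first)
print(morpher.translations)
```

Passing the translation function in from outside mirrors the point made above: if the VLIW engine changes, only the translator has to be replaced, while the x86 block handed to `fetch` stays exactly the same.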
That is, building on experience, the code morphing software can be made more and more sophisticated so as to give more and more optimization, and it provides a platform for future extensions.

Here is a comparison of Pentium and Transmeta processors. The first line gives the processors: two versions of the mobile Pentium II, the mobile Pentium III, the Transmeta TM3120 and the Transmeta TM5400, which is the latest. The process technology was 0.25 micron and subsequently 0.18 micron for the Pentium processors, 0.22 micron for the TM3120 and 0.18 micron for the TM5400. The on-chip cache was smaller on the Pentium processors; Transmeta could provide a larger on-chip cache because the processor was simple. A simple processor needs less chip area, that is, the silicon real estate consumed by the processor part was smaller, and as a consequence they could provide more cache memory: 96 KB on the TM3120 and 128 KB on the TM5400. The on-chip L2 cache on the mobile Pentium was 256 KB; the TM3120 had no on-chip L2 cache, while the TM5400 had 256 KB. Yet in spite of the larger cache, the die size is smaller for the Transmeta processors: 130 square millimetres for one mobile Pentium II, 180 square millimetres for the other version, or 106 square millimetres with the smaller 0.18 micron technology, against around 77 or 73 square millimetres for Transmeta. So this was possible with a smaller die size.
So, in particular, the code morphing software simplifies the chip hardware; that is the conclusion from this. Now, what is the impact of this? You can see an example here. You have got the x86 memory, and after the translation this is the VLIW code. Four fields are filled up where possible, and where nothing is found to fill a field there will be blanks, which is quite natural. The code morphing operation is actually done in two steps: in the first step it divides the x86 instructions into RISC-like operations, and in the second step it translates those operations into VLIW code. So, here these are the Intel x86 instructions and this is the morphed version, the VLIW instructions. This is how it is being done. But the main benefit that you get from this VLIW processor is demonstrated with the help of this diagram. Here you are running the same program, playing a DVD, in both cases. In the case of the Pentium III you can see the temperature profile: the temperature inside the Intel processor is reaching 105 degrees centigrade, so it is getting heated. On the other hand, in the case of the Crusoe processor, model TM5400, running the same program the temperature is only 48.2 degrees centigrade. Now consider this from the viewpoint of low power. Nowadays low power is becoming increasingly important, particularly in portable applications. Most of the devices nowadays, laptops, PDAs, cell phones and what not, are driven by battery. Since they are battery driven, it is very important to have lower power consumption. You also know that the temperature inside the core decides the reliability of the processor. For higher reliability, lower temperature is also very important; it has been found that for every 10 degree rise in temperature the reliability becomes half.
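Coming back to the two-step code morphing operation described above, here is a minimal sketch in Python of what the two steps might look like. This is only an illustration under stated assumptions and not Transmeta's actual algorithm. The toy instruction format `(opcode, dst, src)`, the functions `decompose` and `pack`, the temporary register names `t0`, `t1`, and the conservative dependence check are all made up for this sketch. It assumes four fields per VLIW word, as in the example, and fills unused fields with NOPs.

```python
from dataclasses import dataclass

SLOTS_PER_WORD = 4  # the example in the lecture shows four fields per word


@dataclass(frozen=True)
class Op:
    """A RISC-like operation with one destination and some sources."""
    name: str
    dest: str
    srcs: tuple = ()


NOP = Op("nop", "")  # placeholder for a blank field


def decompose(index, instr):
    """Step 1: split a toy x86-like instruction into RISC-like operations.

    An instruction with a memory operand such as ('add', 'eax', '[a]')
    becomes a load into a made-up temporary followed by the arithmetic op.
    """
    opcode, dst, src = instr
    if src.startswith("["):
        tmp = f"t{index}"
        return [Op("load", tmp, (src,)), Op(opcode, dst, (dst, tmp))]
    return [Op(opcode, dst, (dst, src))]


def independent(op, word):
    """True if op has no register dependence on any op already in the word.

    The check is deliberately conservative: read-after-write,
    write-after-write and write-after-read all force a new word.
    """
    for other in word:
        if other.dest in op.srcs or op.dest == other.dest or op.dest in other.srcs:
            return False
    return True


def pack(ops):
    """Step 2: greedy, in-order packing of operations into VLIW words."""
    words, current = [], []
    for op in ops:
        if len(current) == SLOTS_PER_WORD or not independent(op, current):
            words.append(current)  # close the current word, start a new one
            current = []
        current.append(op)
    if current:
        words.append(current)
    # Blank fields are filled with NOPs so every word has the same width.
    return [w + [NOP] * (SLOTS_PER_WORD - len(w)) for w in words]


program = [("add", "eax", "[a]"), ("add", "ebx", "[b]"),
           ("sub", "ecx", "edx"), ("add", "eax", "ebx")]
ops = [op for i, instr in enumerate(program) for op in decompose(i, instr)]
for word in pack(ops):
    print(" | ".join(f"{op.name} {op.dest}".strip() for op in word))
```

For this toy program the six RISC-like operations end up spread over four words, so ten of the sixteen fields are NOPs. A real translator would reorder operations to fill more fields; the in-order packing here is kept simple only to make the blanks visible.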
So, in the context of low power and reliability, Transmeta's approach is quite good. We find that running the same program leads to smaller power consumption. Now, here some of the problems of VLIW are identified. A large number of registers is needed in order to keep the functional units active. A large data transport capacity is needed between the functional units and the register file, and between the register file and memory. High bandwidth is also needed between the instruction cache and the fetch unit, because fetching has to be done at higher speed. For example, take one instruction with seven operations, each of 24 bits. That means 7 times 24, that is 168 bits per instruction, have to be fetched at this rate. So, these are some of the limitations present in VLIW, and they can be overcome. Another problem is large code size, partially because of unused operations and wasted bits in the instruction words. Then there is incompatibility of binary code: if additional functional units are introduced for a new version of the processor, the number of operations that can be executed in parallel increases and the instruction word changes, so the old binary code cannot run on this new processor. That means backward compatibility cannot be easily implemented in a VLIW architecture. However, if you have software like the code morphing software, then of course there is no problem. Anyway, these problems can be overcome in superscalar architecture, although at the cost of larger chip area and larger power dissipation, and you will see that superscalar architecture has been more widely accepted than VLIW because of the limitations that I have mentioned. So, with this let us stop here. Thank you.