Welcome to today's lecture on case studies; we shall continue our discussion on the evolution of Intel processors. In the last couple of lectures we discussed primarily the Pentium series of processors, starting with the Pentium II, then the Pentium III and the Pentium 4. We have seen that these processors are essentially superscalar and deeply pipelined. As you know, in a superscalar architecture instruction scheduling, that is, instruction issue, is done by hardware, and because of that the hardware complexity keeps increasing as the complexity of instruction scheduling increases. Similarly, a deeply pipelined processor runs at a higher clock frequency, and a superscalar processor consumes more silicon real estate. Both of these, the larger silicon area and the higher clock frequency, lead to more power consumption. So power consumption becomes high, and the question naturally arises: how can we reduce the power dissipation?

One approach is to use VLIW. We have already seen that VLIW, very long instruction word, is an approach in which instruction scheduling is done by the compiler. Since scheduling is done by software instead of hardware, the chip area and hence the power dissipation of the chip are reduced. That is one approach we have already discussed, and there are processors based on it, such as Transmeta's Crusoe processor. Another approach to reducing power dissipation is to use multiple cores: instead of using a single processor within a chip and continuously increasing its frequency to get higher and higher performance, you exploit parallelism in terms of processors. In other words, we move from instruction-level parallelism to thread-level parallelism to get higher performance. We shall discuss that in the next lecture; today we shall focus on the VLIW-style approach as implemented in the Itanium processor.

The Itanium processor is based on the EPIC approach, where EPIC stands for Explicitly Parallel Instruction Computing. It was developed jointly by Hewlett-Packard (HP) and Intel, and the name was chosen to distinguish it from superscalar and VLIW. As we shall see, although it inherits many features of VLIW, many novel concepts have been incorporated into the EPIC architecture, and it is radically different from the superscalar approach. IA-64 is the instruction set architecture developed based on this EPIC approach, and its main features are listed here.

Feature number one: it is a RISC-style, register-register ISA. We know that instruction set architectures are broadly of two types, RISC and CISC. In a CISC architecture the ALU can operate both on the contents of registers and on the contents of memory; for example, an ADD instruction can add the content of a register to the content of a memory location. This is not allowed in RISC: reduced instruction set computers are based on register-register operations.
That means all ALU operations are performed on the contents of registers and the result is also stored in a register; to load and store the registers explicitly you have load and store instructions, which is why RISC processors are also said to have a load-store architecture. For example, to compute a = b + c, a RISC machine must first load b and c into registers, add them with a register-register instruction, and then store the result back to memory, whereas a CISC machine could add a memory operand directly. This is one of the features; RISC processors also have a large number of registers and so on, but let us not go into those details. The main point is that the EPIC framework is based on a RISC-style, register-register instruction set architecture, with many novel features designed to support compiler-based ILP.

Another important feature is compiler-based ILP, instruction-level parallelism. We know that both superscalar and VLIW processors exploit instruction-level parallelism. In a superscalar processor, that parallelism is identified by hardware: over the set of instructions that has been prefetched into the processor, parallelism is detected within that prefetch window by hardware. In a VLIW architecture, on the other hand, ILP is detected by the compiler, that is, by software. When instruction-level parallelism is detected by the compiler, there is much more scope than what can be achieved by hardware in a superscalar processor.

The EPIC approach shares the VLIW idea of compiler-based scheduling but extends it. How does it extend it? We have already discussed the VLIW architecture and seen that it has a rigid instruction format: a single instruction word accommodates a fixed number of operations, and if that much parallelism is not available, some of the fields in the instruction word simply remain empty. The EPIC approach is much more flexible: it provides greater flexibility both in indicating parallelism among instructions and in the instruction format itself, and as a consequence it can reduce the increase in code size caused by the typically inflexible VLIW instruction format. The ultimate outcome is that, compared with a plain VLIW encoding of the same program, the EPIC approach achieves a smaller code size. EPIC also has more extensive support for software speculation: software speculation is used only in a very limited way in VLIW architectures, whereas, as we shall see a little later, EPIC uses much more extensive support for it, including predication, which we shall discuss in more detail shortly.

Based on this approach the IA-64 architecture was defined, and the processor developed from it is known as Itanium. Itanium is the first commercial processor that implements the IA-64 architecture, and later on we shall see that the Itanium was extended, leading to the Itanium 2, which is essentially an upgrade of the basic Itanium architecture.
With this background, let us now focus on the main ideas of EPIC. As I have already mentioned, in EPIC instruction scheduling is done by the compiler, and it has been found that it is easier to find instruction-level parallelism with a software algorithm than with hardware. When it is done in hardware it is very restrictive: you cannot use a sophisticated algorithm, and parallelism is identified only over a small, limited window of instructions. When it is done by the compiler you can use sophisticated algorithms, because in software the overhead is not a serious concern, and so more instruction-level parallelism can be identified.

Doing it in the compiler also does not consume chip real estate. As I have already mentioned, in a superscalar architecture scheduling is done by hardware, which obviously consumes silicon area. Instead of spending silicon on instruction scheduling, that area can be devoted to other purposes, such as more general-purpose registers, more floating-point registers and other special types of registers. Moreover, the compiler can search an entire subprogram to find independent instructions, which is another advantage of the compiler-based approach; as I have already mentioned, hardware-based solutions can look only within a small prefetch window. Other novel techniques such as register windows and a rotating floating-point register stack have also been used in the EPIC architecture; I shall discuss them later. With the compiler-based approach and these enhancements there is a much greater chance of finding ILP at much lower cost: the cost of implementation is lower and you get much better ILP. This naturally leads to long or very long instruction words, which is the basic VLIW approach. As we shall also see, instead of relying only on branch prediction, where whether a branch will be taken or not is predicted based on past history as we discussed earlier, EPIC uses a novel concept known as predication, which is different from prediction; I shall explain it a little later. And to reduce the latency of memory accesses, a technique known as speculative loading is used; I shall discuss speculative loading in detail as well. These are the main ideas used in the EPIC architecture.

Here is a comparison between the superscalar and IA-64 approaches. As we have already seen, the superscalar approach is based on RISC-like instructions, one instruction per word. Multiple instructions then have to be scheduled by hardware to keep the multiple execution units, or processing units, busy: instructions are fetched and more than one of them is issued per cycle so that the functional units stay busy. In IA-64, on the other hand, RISC-like instructions are bundled into groups of three; we have seen that in a VLIW architecture a single instruction word can bundle more than one operation.
As I have already told you, this is the basic idea of a bundle: several instructions are grouped together to form one unit. In IA-64, three instructions are bundled together to form a single instruction word. In the superscalar approach we use multiple parallel execution units, and that is also done in IA-64. In a superscalar processor, reordering and instruction optimization are done at run time by hardware; in IA-64, reordering and instruction optimization are done at compile time by software, that is, by the compiler. In a superscalar architecture, branch prediction with speculative execution is done along one path, which we have already discussed in detail; in IA-64, as we shall see, speculative execution can proceed along both paths of a branch, which is made possible by predication, and I shall discuss it in more detail. Finally, a superscalar processor loads data from memory only when it is needed, that is, need-based loading, whereas IA-64 performs speculative loading: it speculatively loads data before it is needed. Of course there is a possibility that data which has been loaded may never be used, but when it is used, the early load is beneficial. These are the key differences between the superscalar and IA-64 approaches.

Now we shall focus on the details of the IA-64 architecture, and first we shall look at the register model. As I have already mentioned, IA-64 uses a large number of registers compared to superscalar or other VLIW-based processors. It has 128 general-purpose registers of 64 bits each which, as we shall see shortly, carry one additional bit used to support speculation; we shall discuss its use later. It has 128 floating-point registers of 82 bits each, which provide two extra exponent bits over the standard 80-bit IEEE extended format. There are 64 one-bit predicate registers; the predicate register concept is new in IA-64, and I shall explain its operation in detail. There are also 8 branch registers of 64 bits each, used for indirect branches. So you can see that a variety of registers are used. Apart from these, there are further registers for system control, memory mapping, performance counters and communication with the operating system, and not all of them are visible to the programmer.

IA-64 also uses the concept of register windows, which I mentioned is a new concept here. What is this register window concept? Of the 128 fixed-point registers, registers 0 to 31 are always accessible and are addressed as 0 to 31. The remaining registers, 32 to 127, are used as a register stack, and each procedure is allocated a set of these registers. We know that whenever we deal with procedures and procedure calls we use a stack, and on the stack we store information whenever a context switch takes place. Before going into that mechanism, the register set just listed is summarized in the small sketch below.
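The following is a minimal C model of the IA-64 architectural register set just enumerated, purely for illustration: the structure, field names and storage choices are mine, not Intel's, and the extra bit per general register is modelled simply as a flag used for speculation.

    /* Illustrative model of the IA-64 architectural register set.
     * Names and layout are assumptions for this sketch only. */
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint64_t gr[128];      /* general registers r0-r127 (r0 reads as 0)      */
        bool     gr_spec[128]; /* the one extra bit per general register that
                                  supports speculation (set by a deferred fault) */
        uint8_t  fr[128][11];  /* floating-point registers f0-f127, 82 bits each:
                                  80-bit IEEE extended + 2 extra exponent bits,
                                  stored here in 11 bytes                        */
        bool     pr[64];       /* 64 one-bit predicate registers p0-p63          */
        uint64_t br[8];        /* 8 branch registers b0-b7 for indirect branches */
        uint64_t cfm;          /* current frame marker: locates a procedure's
                                  register-stack frame within r32-r127           */
    } ia64_regs;

This is only a bookkeeping picture of what the programmer-visible state contains; the system-control, memory-mapping and performance-counter registers mentioned above are left out.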
When a context switch or procedure call is handled in the usual way, a stack pointer in the processor points to a stack kept in memory: the status of the current context is saved on the stack and later restored from it when control returns. Saving and restoring processor status through memory in this way is time consuming, so in IA-64 it is done using registers instead. In place of a stack pointer, a special register called the current frame marker (CFM) points to the set of registers to be used by a given procedure. The frame consists of two parts, a local area and an output area. The local area is used for local storage, while the output area is used to pass values to any called procedure; in other words, the output area is used for parameter passing, so that the caller can pass parameters to the callee. On a procedure call, the CFM is updated so that register R32 of the called procedure points to the first register of the caller's output area; that is how the register window for the callee is set up.

IA-64 also uses a rotating floating-point register stack. The floating-point registers hold floating-point data, the branch registers hold branch destination addresses for indirect branches, and the predicate registers hold the predicates that control the execution of predicated instructions. Both the integer and the floating-point registers support register rotation for registers 32 to 127: through a form of register renaming, these registers are rotated, which is used for procedure calls and related purposes. Register rotation is designed to ease the task of allocating registers in software-pipelined loops. So this is another novel concept that is used; a small sketch of the register-stack and rotation idea is given below, before we continue with the execution units.

Now let us focus on the different instruction types and the execution units available in the IA-64 microarchitecture. Since it is a VLIW-style processor, there are a number of execution units, or functional units, of different types. The first is the I-unit, where I stands for integer; it operates on A-type and I-type instructions. A-type instructions are the integer ALU instructions, such as integer add, subtract, and, or, and compare. I-type instructions are the non-ALU integer instructions, such as integer and multimedia shifts, bit tests and moves. The M-unit stands for load and store: it can execute A-type or M-type instructions, that is, the integer ALU operations (add, subtract, and, or, compare) as well as memory accesses such as loads and stores for the integer and floating-point registers, which are the M-type instructions.
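As promised above, here is a simplified sketch of the register-stack and rotation ideas in C. This is only the mapping idea under assumed field names; it is not the exact IA-64 CFM encoding, and the single 96-register modulus is a simplification of the real rotating-region rules.

    /* Simplified sketch: register stacking for procedure calls and
     * register rotation for software-pipelined loops. */
    #define STACKED 96                 /* r32..r127 form the stacked area */

    typedef struct {
        unsigned bof;                  /* "bottom of frame": physical slot of r32 */
        unsigned locals;               /* size of the local area                  */
        unsigned outputs;              /* size of the output area                 */
        unsigned rrb;                  /* rotating register base                  */
    } cfm_t;

    /* Procedure call: the caller's output area becomes the start of the
     * callee's frame, so parameters are passed without touching memory. */
    static cfm_t on_call(cfm_t caller)
    {
        cfm_t callee = { (caller.bof + caller.locals) % STACKED, 0, 0, caller.rrb };
        return callee;
    }

    /* Translate a stacked register name (32..127) to a physical slot,
     * applying rotation so a software-pipelined loop sees a fresh
     * register each iteration without explicit renaming instructions. */
    static unsigned physical(cfm_t f, unsigned reg)
    {
        unsigned rotated = (reg - 32 + f.rrb) % STACKED;
        return (f.bof + rotated) % STACKED;
    }

The point of the sketch is only that a call moves the frame pointer within the register file instead of spilling registers to a memory stack, and that rotation is a simple modular renaming applied on top of that.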
Continuing with the execution units, you have the F-unit, the floating-point unit, on which floating-point instructions are executed, and the B-unit, which is used exclusively for branches such as conditional branches, calls and loop branches. Then there are the L+X type instructions; there is no separate execution unit for them, but the I-type and B-type units can be used to execute them. They are used for extended 64-bit immediates, stops and no-ops.

Here is the general organization of the IA-64 architecture. It has a large number of execution units; we have seen that eight or more parallel units are available, and they are shown here for both fixed point and floating point. The 128 general-purpose (integer) registers are shown, the 128 floating-point registers are available, and the 64 one-bit predicate registers are available; I shall discuss their use a little later. So you have predicate registers in addition to the general-purpose and floating-point registers.

Now let us look at the instruction format of IA-64. The format is designed for explicit parallelism and ease of instruction decode: the compiler makes the parallelism that is implicit in the program explicit in the encoding, and decoding is eased by the template field that we shall see shortly. Nearly every IA-64 instruction, sometimes called a syllable rather than an instruction, can be predicated. Predicated means that the low-order six bits of every instruction specify the predicate register that guards that instruction. Each instruction is 41 bits wide, divided as follows: the major opcode is 4 bits, there are 10 modifier bits, and there are three 7-bit fields specifying general-purpose registers GR3, GR2 and GR1; since there are 128 registers, each of these fields requires seven bits. Finally there is the 6-bit predicate register field, PR. Three such instructions together are called a bundle, and one bundle executes per cycle: when execution proceeds, a whole bundle is fetched from memory and executed in a single cycle, and if one instruction stalls, the entire bundle stalls.

Now let us focus on the template field I mentioned. Three instructions of 41 bits each account for 123 of the 128 bits of a bundle, and the remaining 5 bits form the template field; a sketch of this bundle layout is given below. The template field specifies the type of execution unit each instruction in the bundle requires: a bundle comprises three instructions, different types of execution units are needed to execute them, and the template tells the hardware which type of unit each instruction of the bundle needs.
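To make the bundle layout concrete, here is a small C sketch of how a 128-bit bundle packs a 5-bit template and three 41-bit syllables, assuming the template sits in the low-order bits. The order of fields inside a syllable is simplified here; only the field widths (4 + 10 + 7 + 7 + 7 + 6 = 41) follow the description above.

    /* Illustrative packing of a 41-bit syllable and a 128-bit bundle. */
    #include <stdint.h>

    typedef struct {
        unsigned major_op : 4;   /* major opcode                 */
        unsigned modifier : 10;  /* opcode modifier bits         */
        unsigned gr3      : 7;   /* register field GR3           */
        unsigned gr2      : 7;   /* register field GR2           */
        unsigned gr1      : 7;   /* register field GR1           */
        unsigned pred     : 6;   /* qualifying predicate p0-p63  */
    } syllable;

    /* Pack one syllable into a 41-bit value held in a 64-bit word. */
    static uint64_t pack_syllable(syllable s)
    {
        return ((uint64_t)s.major_op << 37) |
               ((uint64_t)s.modifier << 27) |
               ((uint64_t)s.gr3      << 20) |
               ((uint64_t)s.gr2      << 13) |
               ((uint64_t)s.gr1      <<  6) |
                (uint64_t)s.pred;
    }

    typedef struct { uint64_t lo, hi; } bundle;   /* 128 bits */

    /* Bundle = 5-bit template + three 41-bit slots = 128 bits:
     * template in bits 0-4, slot 0 in bits 5-45, slot 1 in 46-86,
     * slot 2 in 87-127. */
    static bundle pack_bundle(unsigned tmpl, syllable s0, syllable s1, syllable s2)
    {
        uint64_t i0 = pack_syllable(s0), i1 = pack_syllable(s1), i2 = pack_syllable(s2);
        bundle b;
        b.lo = (uint64_t)(tmpl & 0x1f) | (i0 << 5) | (i1 << 46); /* low 18 bits of slot 1 */
        b.hi = (i1 >> 18) | (i2 << 23);                          /* rest of slot 1 + slot 2 */
        return b;
    }

The sketch simply shows that three fixed-width syllables plus the template exactly fill 128 bits, which is why the bundle is the natural fetch unit.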
The template field also indicates the presence of stops. The stop is another novel concept, and it is tied to the notion of an instruction group. The instructions within an instruction group must not have register dependences of the kind where one instruction writes into a register and a subsequent instruction reads it, and memory-based dependences are likewise to be avoided; because they have no such dependences, the instructions within a group can be executed in parallel. The template field indicates where the stops are: you may have several instructions that form an instruction group up to a certain point, but the next instruction cannot be executed in parallel with them because of a dependence, and the stop bit marks exactly that boundary. A stop indicates to the hardware that one or more instructions before the stop may have certain kinds of resource dependences with one or more instructions after the stop. That is how instruction groups are separated, within a bundle and across bundles.

Note, however, that when forming a bundle for execution, IA-64 allows you to bundle instructions that do not belong to the same instruction group; that is, dependent and independent instructions may be mixed in a single bundle. The compiler sets the template bits to inform the hardware which instructions are independent, the template can identify independent instructions across bundles, and the instructions in a bundle do not have to be in the original program order: the order in which instructions appear in the original program text need not be the order in which they appear in the bundles you prepare.

Let me illustrate this with an example. Here you can see a number of bundles, and the template field specifies the types of execution units required by the instructions in each bundle. As I have already mentioned, a single bundle has three instructions. Template 0 here specifies that the three instructions need three execution units: slot 0 needs an M-type unit, slot 1 an I-type unit and slot 2 an I-type unit; since you have multiple execution units, these instructions can be executed simultaneously. Similarly, the second bundle is M I I with template 1, and here you can see that there is a stop at its end. After the first bundle there was no stop, so the instructions of the first bundle and of the second bundle together form a single instruction group, even though they are packed into two separate bundles for execution by IA-64. You can also see a case where the stop is present in the middle of a single bundle: M I, then a stop, then I. The compiler must explicitly indicate the boundary between one instruction group and another; on the slide the stops are indicated by heavy lines, and they may appear within a bundle and/or at the end of a bundle. Conventionally a stop appears at the end of a bundle, but in some cases it appears in the middle of the bundle as well, as in the sketch below.
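As a small illustration of the check a compiler performs when placing stops, here is a C sketch. The instruction representation is an assumption of this sketch, not the IA-64 encoding; the rule it captures is that a later instruction reading or rewriting a register written earlier cannot sit in the same instruction group.

    /* Sketch: does a stop have to be placed between two instructions? */
    #include <stdbool.h>

    typedef struct {
        int dst;          /* destination register, -1 if none */
        int src1, src2;   /* source registers, -1 if unused   */
    } ins_t;

    static bool must_insert_stop(ins_t earlier, ins_t later)
    {
        if (earlier.dst < 0)
            return false;
        /* read-after-write or write-after-write on the same register
         * forces the two instructions into different groups */
        return earlier.dst == later.src1 ||
               earlier.dst == later.src2 ||
               earlier.dst == later.dst;
    }

For instance, an add that writes R1 followed by a subtract that reads R1 has a read-after-write dependence, so the compiler must place a stop between them (written ";;" in IA-64 assembly), and that is exactly why templates with a stop in the middle, such as the M I ;; I case on the slide, are provided.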
The compiler specifies all of this, and it facilitates the decoding and execution of instructions.

Now let us come to branching. We know that the earlier superscalar processors rely on branch prediction. Here is a sequence of instructions I1 to I10, and I3 is a branch instruction. When we reach this branch, some prediction is required: will I4, I5 and I6 be fetched, or I7, I8 and I9? That is decided by branch prediction: if the branch is predicted taken you fetch I7, I8 and I9, and if it is predicted not taken you fetch I4, I5 and I6. That is how it is done in traditional processors, in the Pentium III, the Pentium 4 and so on. The IA-64 processor goes beyond this by introducing a concept known as predication. The problem of deciding which side to prefetch, the taken side or the not-taken side, that is, the problem of prediction, is overcome with the help of predication.

The compiler tags each side of a branch with a predicate: the instructions are tagged through the predicate field, the tagged instructions are bundled, and the template bits are set to allow parallel execution of the predicated instructions. Because there is a large number of execution units, instead of executing only one side of the branch, the taken side or the not-taken side, execution is carried out for both sides. However, you have to keep track of which instructions belong to which side. When the outcome of the branch is known, the effects of the correct side are committed while the effects of the wrong side are discarded. In other words, execution is carried out for both sides, but only when it is known whether the branch is taken or not are the results of one side committed, where by committed we mean that the writes into the registers and so on actually take place; the other side is completely discarded. I shall explain this with an example a little later. The advantage is that there is no need to undo anything: earlier we saw that if a prediction is wrong you have to recover, that is, discard some instructions, fetch a new set of instructions and execute them, and that kind of recovery is completely avoided with predication. The time taken to execute the wrong side is at least partially amortized by the execution of the correct side, assuming sufficient functional units are available: executing the wrong side still takes work, but since there are enough functional units the two sides can be executed in parallel, because they are independent of each other; there is no dependence between the correct side and the wrong side.

Let me illustrate this with an example. You have a sequence of instructions I1, I2, I3, where I3 is a branch instruction with two directions: I4, I5, I6 form one path and I7, I8, I9 form the other path. These paths are tagged using the predicate registers, and the lecture example continues after the small sketch below.
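Here is a minimal C sketch of this if-conversion idea. The variable names are hypothetical, and the C only models what the compiler and the predicate registers achieve; it is not IA-64 code.

    /* Sketch of predication (if-conversion):
     * original, with a branch:
     *     if (r1 == 0) { x = a + b; }   // path guarded by P1
     *     else         { x = a - b; }   // path guarded by P2
     */
    static long if_converted(long r1, long a, long b)
    {
        int p1 = (r1 == 0);   /* the compare sets P1 and its complement P2 */
        int p2 = !p1;
        long t1 = a + b;      /* both arms can run in parallel              */
        long t2 = a - b;      /* on separate functional units               */
        long x  = 0;
        if (p1) x = t1;       /* commit only the arm whose predicate is true */
        if (p2) x = t2;       /* the other arm's result is simply discarded  */
        return x;
    }

In the real machine the guard is carried in the 6-bit predicate field of every instruction, so for a short if-else like this no branch instruction is needed at all.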
Returning to the example: for instructions I4, I5 and I6, the predicate field is written with P1, standing for path 1, so you have P1, P1 and P1; for instructions I7, I8 and I9 the predicate P2 is written. Each instruction is thus predicated, and as a consequence, when the outcome of the branch is known, say whether R1 is zero or not, which may take a little time because the compare takes time, you can discard either the instructions predicated by P1 or the instructions predicated by P2. The beauty of this is that instructions I4, I5, I6 and I7, I8, I9 can be executed in parallel, because they are independent. That is the basic concept of branch predication.

Now let us come to another important aspect of IA-64, its solution to memory latency. Data has to be fetched from memory, and what IA-64 does is speculative fetching; the technique is essentially hoisting: load instructions are moved towards the start of the program. That is, load instructions that appear later in the code are moved to an earlier point, so the loads are performed on the assumption that their data will be needed in future. They may not be needed, but hopefully the data will already be sitting in the register when it is actually required. Some of these loads may belong to decision paths that are never executed, and hoisting effectively causes such loads to be executed anyway, even if their results are never required. The loads are therefore speculative, made on the assumption that they may be needed later. A check instruction is placed just before the point where the load result is needed; the check reports any exception raised by the speculative load and commits the effect of the load to the target register.

Here is an example of speculative loading. In the original code there is a load instruction, instruction I7, which appears after the branch. What is done is that I7 is moved before the branch: I3 is the branch, and in the modified code this load, load R4, 0(R6), appears before it as a speculative load; only then do we come to the branch instruction. And at the point where the load originally stood, a load check, that is, a speculative check, is performed. That is how the instruction sequence is executed. So the compiler scans the source code and identifies the upcoming load instructions, and at run time these instructions load the data from memory before it is needed. If a load would trigger an exception, reporting of that exception is postponed; it is not reported to the processor at that point. A small sketch of this load-and-check pattern is given below, and then we shall see how the compiler rewrites the code.
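The following C sketch models the speculative-load and check pattern just described. The helpers ld_s and chk_s are illustrative names standing in for the hoisted speculative load and the later check; they are assumptions of this sketch, not real intrinsics, and a null pointer stands in for "the access would fault".

    /* Sketch of control speculation: load early, check late. */
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct { long value; bool deferred_fault; } spec_reg;

    /* Speculative load: performed early, possibly on a path that is
     * never taken.  A fault is not reported; it is only recorded. */
    static spec_reg ld_s(const long *addr)
    {
        spec_reg r = { 0, false };
        if (addr == NULL)          /* stand-in for a faulting access */
            r.deferred_fault = true;
        else
            r.value = *addr;
        return r;
    }

    /* Check, placed where the original load stood: only if the value is
     * really needed and the early load had faulted do we redo the load,
     * so the exception is taken here, on the correct path. */
    static long chk_s(spec_reg r, const long *addr)
    {
        if (r.deferred_fault)
            return *addr;          /* recovery load: faults for real now */
        return r.value;
    }

The design point is that the exception is deferred together with the data: a load hoisted onto a path that is never taken can never harm the program, because its fault is reported only if the check on the needed path runs.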
As I mentioned, the compiler replaces the original load with a speculative load placed earlier, and at the original point it places a speculative check. The check instruction verifies the validity of the data: if it is fine, no exception is reported; otherwise an exception is reported at that point. This is how speculative loading is done, and you can see how the code is modified.

Here are some other nice features of IA-64. Transferring ILP discovery to the compiler frees up chip real estate: as I have already said, since scheduling is done by software, silicon area becomes available that can be used for more beneficial purposes. It can be used to implement the large general-purpose and floating-point register files, each with 128 registers as we have seen, and also large cache memories with more read/write ports, which greatly reduces memory latency: instead of a single-ported memory you can have multiple read and write ports. It also allows more functional units, which is what makes predicated execution feasible; we have seen that predicated execution requires a large number of functional units, because both paths have to be executed. So this approach of identifying instruction-level parallelism in software has many benefits, because the silicon real estate that is saved can be put to these uses. I have already mentioned predication instead of prediction, which definitely enhances the efficiency of program execution. There is control speculation, the speculative loading of data that we just saw, and there is also data speculation, where a load is moved above a store that might alter the same memory location. All of this also facilitates software pipelining: earlier we discussed software pipelining and saw that, in addition to the loop body, there is prologue and epilogue code at the beginning and at the end; with these features that extra code can be reduced or removed altogether, though I am not going into the details. This leads to maximum utilization of the available functional units.

Now let us look at the pipeline used in the Itanium processor. The front end performs prefetching of instructions, up to 32 bytes per clock, and it can hold up to 8 bundles; each bundle, as we have seen, comprises 128 bits, so 8 bundles means 24 instructions. That is what the front end does. Then instruction delivery is performed in a pipelined way: it disperses up to 6 instructions per cycle to the 9 functional units, and it uses register renaming, which I have already discussed in connection with using registers instead of memory for the register stack on procedure calls. Then operand delivery is performed: it accesses the register file and updates the scoreboard, and the scoreboard is used to detect when an individual instruction can proceed.
We have discussed the scoreboarding approach in detail earlier; it is used here to detect when an individual instruction can proceed, and this is how lock-step operation of the instructions in a bundle is avoided. That is how operand delivery is organized to get better efficiency, and after that the instructions are executed. So these are the pipeline stages: the front end, then instruction delivery, then operand delivery, and then execution of the instructions.

Itanium is Intel's implementation of IA-64. It came out in 2001, was code-named Merced, and was the first implementation of the IA-64 architecture. It used a 64-bit wide system bus, an 800 MHz core and a 2.1 GB/s data transfer rate over the system bus; the fetch width was 2 bundles per cycle; it had 4 integer units and could perform 2 loads/stores per clock; it had 9 issue ports and could issue up to 6 instructions per clock; and it used a 10-stage pipeline at 800 MHz on a 0.18-micron process. It was subsequently enhanced. The 9 functional units are 2 integer, 2 memory, 3 branch and 2 floating-point units, all of them pipelined, and the 10-stage pipeline is divided into the 4 main parts I have already mentioned. It uses three levels of cache: the L1 cache is split into separate instruction and data caches of 16 KB each with a latency of 2 clocks, the L2 cache is a unified 96 KB cache, and the L3 cache is 4 MB, external to the die in the case of Itanium, with a 20-clock latency and 11.6 GB/s of bandwidth.

Itanium 2, code-named McKinley, has a 128-bit wide bus, so the width of the bus is doubled, a core at 1.0 GHz and a 6.4 GB/s transfer rate; the fetch width is again 2 bundles per cycle; it has 6 integer units, 2 loads/stores per cycle and 11 issue ports. As far as the cache memory is concerned, the L1 cache is again split into separate instruction and data caches of 16 KB each, now with a latency of 1 clock instead of the 2 clocks of the earlier processor; the L2 cache is larger at 256 KB with a latency of 5 clock cycles; and the L3 cache is now on-die, 3 MB, with a 12-clock latency and 32 GB/s of bandwidth. Itanium 2 was introduced in 2003, with these somewhat enhanced features compared to the earlier processor.

These are the latencies of some typical instructions, which I have already touched upon: an integer load takes 1 cycle, a floating-point load 5 to 9 cycles, a correctly predicted taken branch 0 to 3 cycles and a mispredicted branch 6 cycles; an integer ALU operation has effectively zero latency, because its result is immediately available in the registers, and floating-point arithmetic takes 4 cycles. And here you can see the layout view of the processor die; this is what it looks like.

With this we have come to the end of the lecture on the Itanium, and more generally on the various types of processors based on instruction-level parallelism. In my next lecture we shall start our discussion on another type of parallelism, namely thread-level parallelism. Thank you.