Welcome to today's lecture on case studies; we shall continue our discussion on the evolution of Intel processors. In the last lecture we discussed various processors, starting with the Intel 4004 introduced in the year 1971, followed by the 8-bit, 16-bit and then 32-bit processors, and we discussed processors such as the Pentium, Pentium Pro, Pentium II and Pentium III, which are based on the P5 and P6 microarchitectures. Today we shall focus on the Pentium 4, which was introduced in the year 2000 and is based on the NetBurst microarchitecture. I shall go into the details of this microarchitecture. The Pentium 4 required 42 million transistors including the L2 cache, was fabricated in 0.18-micron technology, and its clock speed was initially 1.5 GHz, which was of course increased subsequently. It is a 32-bit, 3-issue processor having 8 kilobytes of L1 data cache and a trace cache storing 12 K micro-operations, as we shall discuss, and it has two levels of cache, L1 and L2; the L2 cache is 256 kilobytes. As I mentioned, it was announced in mid-2000 and it executes native IA-32 instructions, which means it is compatible with the earlier processors such as the Pentium, Pentium Pro, Pentium II and Pentium III. So it maintains downward compatibility: if you write a program for the Pentium, it will run on the Pentium 4, because all the instructions used in the Pentium are available in the Pentium 4 as well. As I mentioned, it is based on the NetBurst microarchitecture, it has a 20-stage pipeline, it initially used a 1.5 GHz clock, and it uses 42 million transistors. So, this is the basic microarchitecture of Pentium 4.
So, you can see here that it has a bus interface unit, written here as the bus unit, which interfaces with the system bus, that is, with memory and I/O devices. It has a two-level cache: a first-level cache which is 4-way set associative with very low latency, meaning it is quite fast, and a second-level cache which is also on-die and is 8-way set associative. The front end has a fetch/decode unit, a special type of cache known as the execution trace cache, and a microcode ROM; I shall discuss these things in more detail. Execution takes place out of order, so it supports out-of-order execution, and it performs branch prediction using branch target buffers, of which it has several, with the branch history stored in them. It has an out-of-order execution unit and a retirement unit where the writing into the registers takes place. So this gives an overview of the NetBurst architecture. It was sometimes referred to as P7, since it was introduced after the P5 and P6 architectures, but later on this terminology was not used much; it was also called the Intel 80786 or i786. So the same processor was called by different names, but Pentium 4 is the most commonly used one. Like P6, NetBurst fetches up to three IA-32 instructions per cycle; those instructions are sent to an out-of-order execution engine that can graduate up to three micro-operations per cycle.
So, as I have already mentioned, each instruction is decoded into several micro-operations, typically two to four, and sometimes more when the instruction is very complex, and the out-of-order engine executes three micro-operations per cycle. With this NetBurst architecture Intel was expecting to reach a speed of 10 GHz, but unfortunately that was not feasible, primarily because of power dissipation. As you know, when the clock frequency is increased, the power dissipation also increases, because dynamic power is proportional to frequency; more precisely, P is proportional to C_L · V_DD² · f, where C_L represents the overall switched capacitance, V_DD the supply voltage and f the clock frequency. In spite of the reduced supply voltage, 1.8 V in the case of Pentium 4, the clock frequency could not be increased to 10 GHz. So in spite of their effort Intel failed to do so; they could achieve up to 3.8 GHz, although, as I mentioned, the initial parts ran at 1.5 GHz and were subsequently increased to around 3.8 GHz. So, let us have a look at the instruction set architecture of Pentium 4. The Pentium 4 instruction set architecture comprises the Pentium III instruction set architecture plus SSE2, where SSE2 stands for Streaming SIMD Extensions 2. This streaming SIMD extension is the additional feature provided in Pentium 4, and SSE2 is an architectural enhancement to the IA-32 architecture. The earlier Pentium III processor was also based on the IA-32 architecture, but this is the additional component.
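The power relation just mentioned can be sketched numerically. The numbers below are purely illustrative, not actual Pentium 4 capacitance figures; the point is only that, at a fixed voltage, power grows linearly with frequency:

```python
def dynamic_power(c_load, vdd, freq):
    """Dynamic power dissipation: P = C_L * VDD^2 * f."""
    return c_load * vdd ** 2 * freq

# Illustrative numbers only (not actual Pentium 4 parameters):
# the same switched capacitance, a 1.8 V supply, and the clock
# raised from 1.5 GHz to the hoped-for 10 GHz.
p_15 = dynamic_power(20e-9, 1.8, 1.5e9)   # ~97 W
p_10 = dynamic_power(20e-9, 1.8, 10e9)    # ~648 W

# A 6.7x clock increase needs 6.7x the power budget at the same voltage,
# which is why 10 GHz was out of reach.
print(p_10 / p_15)
```

This is why deeper pipelining alone could not deliver the projected clock rates: the thermal budget, not the logic, became the limit.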
SSE2 extends MMX and the SSE extensions with 144 new instructions, providing 128-bit SIMD integer arithmetic operations and 128-bit SIMD double-precision floating-point operations, and it also allows enhanced cache and memory management operations. With these enhanced features, the performance of Pentium 4 is obviously expected to be higher than that of Pentium III. So, here is the comparison between SSE and SSE2. Both operate on the 128-bit XMM registers; there are different types of registers, as we shall see, and the 128-bit XMM registers are among them. SSE supports only 4 packed single-precision floating-point values, but SSE2 supports many more varieties, as you can see: 2 packed double-precision floating-point values, 16 packed byte integers, 8 packed word integers (here a word means 2 bytes), 4 packed double-word integers, 2 packed quad-word integers, and a double quad word. These are the various data types supported by SSE2. For example, this is how the 128-bit packing is done: each word is 2 bytes, a quad word is 64 bits, so the double quad word is shown as two 64-bit halves, and 4 double words of 32 bits each can be packed into 128 bits. This packing is done for the purpose of increasing memory utilization: in each 128 bits, instead of storing just one value, you can pack several values, so the efficiency or utilization of the memory is higher. As for the hardware support for SSE2, the various functions I have mentioned are supported by the following hardware: the adder and multiplier units in the SSE2 engine are 128 bits wide, twice the width of those in Pentium III.
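As an illustration of the packed formats above, here is a small Python sketch of how four 32-bit double-word integers fit into one 128-bit value. This is a conceptual model of the XMM lane layout, not actual SIMD code:

```python
def pack_dwords(values):
    """Pack four 32-bit unsigned integers into one 128-bit value
    (element 0 in the least significant lane), mimicking the
    '4 packed double-word integers' XMM layout described above."""
    assert len(values) == 4
    packed = 0
    for i, v in enumerate(values):
        assert 0 <= v < 2 ** 32          # each lane is 32 bits wide
        packed |= v << (32 * i)
    return packed

def unpack_dwords(packed):
    """Split a 128-bit value back into its four 32-bit lanes."""
    return [(packed >> (32 * i)) & 0xFFFFFFFF for i in range(4)]

x = pack_dwords([1, 2, 3, 4])
print(unpack_dwords(x))  # [1, 2, 3, 4]
```

A real SSE2 instruction such as a packed add operates on all four lanes at once, which is where the speedup over one-value-per-register code comes from.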
In the case of Pentium III those adder and multiplier units were 64 bits wide, but here they are 128 bits, and there is increased bandwidth in the load/store path for floating-point values: loads and stores are 128 bits wide, and one load plus one store can be completed between an XMM register and the L1 cache in one clock cycle. This helps to enhance the transfer of data between the L1 cache and the XMM registers, which are used for performing the operations. This shows the pipelined architecture. As I mentioned, Pentium III uses a 10-stage pipeline: fetch, decode, execute and all these operations are divided into 10 stages. On the other hand, these are divided into 20 stages in Pentium 4. So there are 20 stages compared to 10 in Pentium III, there are 7 integer execution units compared to 5 in Pentium III, and the branch target buffer is 8 times larger. It also uses an improved prediction algorithm, which I shall discuss a little later, and it uses a special type of cache known as the execution trace cache, which I shall also discuss in detail. Here is another comparison: the microarchitecture of Pentium 4 is NetBurst compared to P6 for Pentium III, the clock frequency is enhanced from 700 MHz to 1.5 GHz, and the number of transistors is higher, 42 million compared to 28 million in Pentium III. The registers are more or less the same: 32-bit general-purpose registers, 80-bit floating-point registers, 64-bit MMX registers and 128-bit XMM registers. However, the system bus bandwidth is increased from 1.06 GB/s to 3.2 GB/s. The maximum external address space is the same, 64 GB in both cases, and the on-die cache size is increased, but organized differently.
For example, it uses an execution trace cache holding 12 K micro-operations, and it has two levels of cache: an 8-kilobyte L1 and a 256-kilobyte L2. So, let me discuss in a little more detail the execution trace cache, which is a new invention used in Pentium 4. This is the primary instruction cache used in the NetBurst architecture. The execution trace cache is essentially the instruction cache, but as we shall see, it is not used in the conventional way; the instructions are not stored the way they are in a conventional cache. The trace cache sits between the instruction decoder and the execution core, and it stores already decoded micro-operations. Here is the difference: instead of storing the instructions themselves, it stores the micro-operations, which means there is no need for re-decoding. The micro-operations are stored straight away, so they can be fetched and executed directly. Conventionally, instructions are stored in the cache and then decoded before they go to the execution units, but here that is not so. Whenever there is a hit, the micro-operations are fetched and directly delivered to the execution units. On the other hand, whenever there is a miss, the instructions are fetched from the L2 cache, decoded, and after decoding the micro-operations are stored in the trace cache. It has been found that the miss rate is very small, about 0.015 percent, which means that reading from the L2 cache is not very frequent; it occurs only occasionally. So, instead of fetching and decoding an instruction again when it is executed anew, the micro-operations can be fetched directly, as I have already mentioned.
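The hit/miss behaviour just described can be sketched as a toy Python model. The `decode` and `fetch_from_l2` callbacks are hypothetical stand-ins for the hardware, used only to show that decoding happens once, on a miss:

```python
class TraceCacheSketch:
    """Toy model of the execution trace cache: on a hit, already
    decoded micro-operations go straight to execution; on a miss,
    the instruction is fetched from L2, decoded once, and the
    resulting micro-ops are stored for later reuse."""

    def __init__(self, decode, fetch_from_l2):
        self.uops = {}                     # address -> decoded micro-ops
        self.decode = decode
        self.fetch_from_l2 = fetch_from_l2
        self.misses = 0

    def get_uops(self, addr):
        if addr not in self.uops:          # trace-cache miss
            self.misses += 1
            instr = self.fetch_from_l2(addr)
            self.uops[addr] = self.decode(instr)
        return self.uops[addr]             # hit: no re-decoding needed

tc = TraceCacheSketch(decode=lambda i: [i + ".uop1", i + ".uop2"],
                      fetch_from_l2=lambda a: "add")
tc.get_uops(0x100)   # first access: miss, so fetch and decode
tc.get_uops(0x100)   # second access: hit, decoded micro-ops reused
print(tc.misses)     # 1
```

With the reported miss rate of around 0.015 percent, almost every fetch follows the fast hit path in this model, which is exactly why the decoder drops out of the critical loop.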
So, this reduces the load on the decoder and also makes execution faster. Here are some details about the on-chip cache memories: as I said, there is an L1 instruction cache, which is the trace cache, an L1 data cache, and an L2 unified cache. The level-2 cache is unified, meaning both instructions and data are stored in it, instead of separate instruction and data caches. The L1 caches, however, are separate: the L1 instruction trace cache and a separate L1 data cache. Here are the sizes of the different caches. The first-level data cache is only 8 kilobytes; it has been kept small and simple deliberately, because as we know, a small and simple cache provides a quick hit time, which has been achieved here with this small 8-kilobyte L1 cache. It is 4-way set associative with a line size of 64 bytes, its access latency is 2 cycles for integer data and 6 cycles for floating-point data, and it uses a write-through policy. Similarly, the trace cache, as I mentioned, stores micro-operations; it can store 12 K micro-operations and is 8-way set associative. The second-level cache is 256 kilobytes and 8-way set associative, with a line size of 128 bytes organized as 2 sectors per line, 64 bytes per sector; its access time is 7 cycles and it uses a write-back policy. The caches are not strictly inclusive: you know that with the inclusion property, whatever is present in the L1 cache is normally also present in the L2 cache, but this property is not always maintained in this processor. It also uses a pseudo-LRU replacement algorithm when a replacement has to be done.
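To make the L1 data cache geometry concrete, here is a textbook-style sketch of how a byte address splits into tag, set index and byte offset for an 8 KB, 4-way, 64-byte-line cache. This is the standard calculation, not Intel's exact internal indexing:

```python
def split_address(addr, line_size=64, ways=4, cache_bytes=8 * 1024):
    """Split a byte address into (tag, set index, byte offset) for a
    set-associative cache with the given geometry. Defaults match the
    Pentium 4 L1 data cache: 8 KB, 4-way, 64-byte lines -> 32 sets."""
    num_sets = cache_bytes // (line_size * ways)   # 8192/(64*4) = 32
    offset = addr % line_size                      # low 6 bits
    index = (addr // line_size) % num_sets         # next 5 bits
    tag = addr // (line_size * num_sets)           # remaining bits
    return tag, index, offset

print(split_address(0x12345))  # (36, 13, 5)
```

Only 32 sets means only 5 index bits, which is part of why such a small cache can be probed in just 2 cycles for integer data.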
So, as I mentioned, the L1 instruction cache is an execution trace cache which stores decoded instructions; it removes the decoder latency from the main execution loop and integrates the path of program execution flow into a single line. This essentially speeds up the execution of instructions, because the decoder latency is not present in the execution loop; there is no need for decoding at execution time, since decoding is done when the trace cache is filled. The L1 data cache is non-blocking; I have already explained what non-blocking means. With a blocking cache, whenever there is a miss, everything stops until the data is read from the next higher level, but here that is not done: it supports up to 4 outstanding load misses, which means execution continues with up to 4 outstanding load misses. As I mentioned, the load latency is 2 clock cycles for integer and 6 clock cycles for floating point, and one load and one store can be performed per clock. It also does speculative loads: it assumes that an access will hit the cache and replays the dependent instructions when a miss happens. This speculative load mechanism is also a new feature introduced in Pentium 4. The L2 cache is included on-die, as I mentioned; it is 256 kilobytes in size, unified, 8-way set associative, and its net load latency is 7 cycles. It is also non-blocking, its bandwidth is one load and one store per cycle, a new cache operation can begin every 2 cycles, and there is a 256-bit wide bus between L1 and L2. This wide bus allows quick transfer of data between L1 and L2: 48 GB/s for a processor operating at 1.5 GHz, and this bandwidth, and hence performance, increases with the processor frequency.
So, as the processor frequency increases, this gives you higher performance. Now, it also does prefetching: a hardware prefetcher monitors the reference pattern and, based on that pattern, brings cache lines in automatically from the main memory, attempting to stay 256 bytes ahead of the current data access location. This prefetching also helps you get the data quickly instead of reading it from memory on demand. It can prefetch up to 8 simultaneous independent streams, which is another important feature: several independent streams can be prefetched and held in the prefetch buffers. I was mentioning the trace cache: the trace cache tries to exploit the temporal sequencing of instruction execution rather than the spatial locality exploited in a normal cache, which I shall illustrate with the help of this simple example. Suppose the program contains instructions I1, I2, I3, I4 and so on up to I6 and I7, and that I2 is a taken branch to I6, so the executed sequence is I1, I2, I6, I7. In a traditional instruction cache you would store I1, I2, I3 and I4 in one line, based on spatial locality. But the trace cache, which is based on the temporal sequencing of instructions, will store I1 and I2 followed by I6 and I7, not I3 and I4, because that reflects the order in which the instructions actually execute. This is how fetching from the trace cache is done. Moreover, the Pentium 4 trace cache has its own branch prediction that directs where instruction fetching needs to go next within the trace cache; whenever there is a trace cache miss, the branch predictor helps direct where the instruction fetch should take place.
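The I1/I2/I6/I7 example can be sketched in Python. The trace builder below is a conceptual model assuming a simple address-to-instruction map, not the real trace-formation hardware:

```python
def build_trace(program, start, branch_taken, max_len=4):
    """Build a trace line in executed (temporal) order, following
    predicted-taken branches, in contrast to a conventional cache
    line that holds instructions in address (spatial) order.
    `program` maps address -> (instruction, branch_target_or_None)."""
    trace, addr = [], start
    while len(trace) < max_len and addr in program:
        instr, target = program[addr]
        trace.append(instr)
        # follow the predicted branch instead of falling through
        addr = target if (target is not None and branch_taken) else addr + 1
    return trace

# I2 is a taken branch to I6, so I3, I4, I5 never execute.
prog = {1: ("I1", None), 2: ("I2", 6), 3: ("I3", None),
        4: ("I4", None), 6: ("I6", None), 7: ("I7", None)}

spatial = [prog[a][0] for a in (1, 2, 3, 4)]      # conventional cache line
temporal = build_trace(prog, 1, branch_taken=True)
print(spatial, temporal)  # ['I1', 'I2', 'I3', 'I4'] ['I1', 'I2', 'I6', 'I7']
```

Half of the conventional line (I3, I4) is wasted fetch bandwidth on this path, while the trace line delivers only instructions that will actually execute.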
So, it removes the decoding cost on frequently executed instructions, at the price of extra latency to decode instructions upon a branch misprediction. It also uses a microcode ROM: when a complex instruction is encountered, the trace cache jumps to the microcode ROM, which then issues the micro-operations. This use of a microcode ROM is also a new feature provided in Pentium 4; after the microcode ROM finishes, the front end of the machine resumes fetching micro-operations from the trace cache. This is how it combines the trace cache and the microcode ROM for supplying micro-operations. Then, for branch prediction, it uses a 4K-entry branch target buffer, which is 8 times larger than that of the Pentium III, together with a two-level predictor. A two-level predictor stores both local and global branch histories, and based on these the prediction is done. It uses a new prediction algorithm; unfortunately the details of that algorithm are not known, because it is proprietary to Intel, but this algorithm reduces mispredictions by about one third compared to the P6-based architectures used up to Pentium III. The branch predictor predicts all near branches, which include conditional branches, unconditional calls and returns, and indirect branches. It does not predict far transfers, which include far calls, interrupt returns and software interrupts; it only tries to predict the near branches, that is, within the page, and it dynamically predicts the direction and target of branches based on the PC, using the branch target buffer.
So, it uses the branch target buffer, which gives you the branch target address, and if no dynamic prediction is available it does the prediction statically: if the branch history, either local or global, is not available, it predicts taken for backward branches (loops) and not taken for forward branches. So the prediction combines two mechanisms: either the two-level dynamic prediction based on the local and global predictors, or, whenever dynamic prediction is not available, this static rule of taken or not taken. Traces are built across predicted branches to avoid branch penalties. The predictor uses a branch history table together with the branch target buffer, and updating occurs when a branch is retired, that is, when the branch has actually completed and its outcome is known; only then is the branch history stored in the branch history table and the branch target buffer. It also uses a return address stack (RAS) with 16 entries, which predicts return addresses for procedure calls: there are many procedure calls, and their return addresses are stored in the return address stack. The design allows branches and their targets to coexist in a single cache line, and it increases parallelism since decode bandwidth is not wasted. All these things are used to enhance the performance of the processor. In addition, Pentium 4 permits software to provide hints to the branch prediction and trace formation hardware to enhance performance; these take the form of prefixes to conditional branch instructions, are used only at trace build time, and have no effect on already built traces.
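The dynamic-plus-static fallback just described can be sketched as follows. This is a deliberate simplification: the real two-level predictor is far more elaborate and its details are proprietary, so only the fallback rule is modelled:

```python
def predict(branch_pc, target_pc, dynamic_history=None):
    """Sketch of the prediction fallback described above: use the
    dynamic (BTB / branch-history) prediction when one is available,
    otherwise predict statically -- backward branches (loops) are
    predicted taken, forward branches not taken."""
    if dynamic_history is not None:    # two-level dynamic prediction wins
        return dynamic_history
    return target_pc < branch_pc       # static rule: backward => taken

print(predict(0x200, 0x1F0))          # backward (loop) branch: taken
print(predict(0x200, 0x240))          # forward branch: not taken
print(predict(0x200, 0x240, True))    # dynamic history overrides static
```

The static rule works because most backward branches close loops that iterate many times, while forward branches often guard rarely taken error paths.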
So, this is how software hints are used by the branch prediction and trace formation hardware, which also helps enhance performance. Then, coming to execution, it uses advanced dynamic execution: a very deep, out-of-order, speculative execution engine. We have already seen that it has a 20-stage pipeline, which is considered very deep, and it allows out-of-order execution in a speculative way. Up to 126 instructions can be in flight, which is 3 times more than in the Pentium III processor; the speculative execution engine dynamically schedules them for execution. Up to 48 loads and 24 stores can be present in the pipeline, which is also 2 times more than in Pentium III. These features are designed to optimize performance by handling the most common operations in the most common context as fast as possible; that is the basic objective of this speculative execution engine. Then, for issue, instructions are fetched and decoded by the translation engine, which builds the decoded instructions into sequences of micro-operations and stores those micro-operations in the trace cache, as we have already mentioned. The trace cache can issue 3 micro-operations per cycle in the issue stage, while the execution engine can dispatch up to 6 micro-operations per cycle, which exceeds the trace cache issue and retirement micro-operation bandwidth: it receives 3 micro-operations per cycle but can dispatch up to 6. This allows greater flexibility in issuing micro-operations to the different execution units; we have already seen that there are 7 execution units present in the processor. And this is the execution pipeline.
You can see here the system interface with 3.2 GB/s bandwidth, which is the external interface, and here is the L2 cache: data from memory is first transferred to the L2 cache, which has its own control logic. Then there are the branch target buffer, the L1 cache, the instruction TLB, and then the decoder and the trace cache. It performs register renaming, using a separate physical register file rather than keeping the renamed values in the reorder buffer. Then there are the micro-operation queues, and the scheduler, which schedules those micro-operations to the different functional units. There are integer functional units and floating-point functional units, fed from an integer register file and a floating-point register file respectively. The functional units present are the store AGU, the load AGU, and ALU 1, ALU 2, ALU 3 and ALU 4, along with the floating-point move unit and the floating-point arithmetic units for multiplication, addition and MMX/SSE. So you can see it has a variety of functional units, and they directly access the L1 data cache and the data TLB. This is the execution pipeline, and the 20 pipeline stages are shown here; I am not going into the details of the different stages, such as trace cache fetch and so on. And these are the execution units: there are 4 dispatch ports, as you can see, and the load and store units have their own dispatch ports. Port 0 and port 1 feed the ALUs, the floating-point move unit, the double-speed ALUs, the integer operation unit and the floating-point execute unit. Double speed means, as we shall see, that execution is performed on both edges of the clock. The different operations performed by the different functional units are listed below: the double-speed ALU performs addition, subtraction, logical operations, store data and branches.
Then the floating-point move unit performs floating-point/SSE moves, floating-point/SSE stores and floating-point exchange; the other double-speed ALU performs addition and subtraction; the integer operation unit performs shift and rotate operations; and the floating-point execute unit performs floating-point/SSE add, floating-point/SSE multiply, floating-point/SSE division and MMX operations. The memory load unit performs all loads, and the memory store unit performs the store operations. As I was mentioning, it uses double-pumped ALUs. What is the significance of a double-pumped ALU? The ALU executes an operation on both the rising and falling edges of the clock. Normally, one operation is performed in one clock cycle, but in the Pentium 4 an execution is done on the rising edge as well as on the falling edge; as a result, these are called double-pumped ALUs. Then comes retirement: at the end of execution, when the micro-operations have completed, the registers are updated and so on. It can retire up to 3 micro-operations per cycle, it supports precise exceptions, and it uses a reorder buffer to organize the completed micro-operations. Since it uses out-of-order execution, a reorder buffer is necessary, so that the results are committed in the original program order. The retirement stage also keeps track of branches and sends updated branch information to the branch target buffer. This is how the retirement stage works. Then, stores and loads are also handled out of order, although stores always commit in program order; 48 loads and 24 stores can be in flight, and store buffers and load buffers are allocated at the allocation stage.
So, a total of 24 store buffers and 48 load buffers are used for holding the operands, and loading and storing are not coupled with the execution of other instructions; they can proceed independently. A store operation is divided into two parts, store data and store address: the store-data micro-operation is dispatched to the double-speed ALU, which operates twice per cycle, and the store-address micro-operation is dispatched to the store AGU. This is how a store operation is performed, for better performance. It also does store-to-load forwarding: data from a pending store buffer is forwarded to a dependent load, and this forwarding also helps to enhance performance. Load stalls still happen when the bytes required by the load are not exactly the same as the bytes in the pending store buffer, but in spite of these stalls, forwarding helps to a great extent. Coming to the last portion, the system bus: it delivers data at 3.2 GB/s over the external interface, and it provides a 64-bit wide bus, as we have seen, with 4 data phases per cycle, that is, a quad-pumped bus on a 100 MHz clock. Now, coming to the important characteristics that we have discussed so far for Pentium 4: as I have already mentioned, the front-end branch target buffer has 4K entries, the execution trace cache holds 12K micro-operations, as I have already mentioned, and the trace cache branch target buffer has 2K entries. So you can see that it has two different branch target buffers: one is the front-end branch target buffer, the other is the trace cache branch target buffer. And then, a total of 128 registers are available for the purpose of renaming.
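The store-to-load forwarding rule above, including the stall when the load's bytes do not exactly match a pending store, can be sketched like this. It is a toy model with hypothetical buffer entries, not the actual hardware check:

```python
def load_with_forwarding(store_buffer, load_addr, load_size):
    """Sketch of store-to-load forwarding: if a pending store covers
    exactly the bytes the load needs, forward its data; a partial
    overlap cannot be forwarded and stalls the load (returns None).
    store_buffer entries are (addr, size, data) tuples, newest last."""
    for addr, size, data in reversed(store_buffer):
        if addr == load_addr and size == load_size:
            return data                    # exact match: forward store data
        overlap = not (load_addr + load_size <= addr or
                       addr + size <= load_addr)
        if overlap:
            return None                    # partial overlap: load stalls
    return "from-cache"                    # no pending store: read the L1

pending = [(0x1000, 4, 0xDEADBEEF)]
print(load_with_forwarding(pending, 0x1000, 4))  # forwarded store data
print(load_with_forwarding(pending, 0x1002, 4))  # None -> load must stall
print(load_with_forwarding(pending, 0x2000, 4))  # 'from-cache'
```

Scanning newest-first matters: the most recent store to an address is the one whose value the load must see.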
So, it uses renaming, and 128 registers are available for this purpose, and as I mentioned it has a total of 7 functional units: 2 simple ALUs, 1 complex ALU, load, store, floating-point move and floating-point arithmetic, as we have shown in one of our earlier diagrams. In later versions of the processor, the data cache is 16 kilobytes, 8-way set associative with 64-byte blocks, using a write-through policy, and the L2 cache is 2 megabytes, 8-way set associative with 128-byte blocks, using a write-back policy. So, here is the list of the different innovations that have been used in Pentium 4. Number one is the use of the execution trace cache, which we have already discussed at length. Then the use of out-of-order execution: as soon as the operands are available, execution takes place, and this of course requires the retirement unit, so that at the end the register updating and everything takes place in order. Then a microcode ROM is used, so that the micro-operations of complex instructions are stored in the microcode ROM, from which they can be fetched directly and sent to the execution units. Then it uses advanced register renaming, with the 128 registers that I have already mentioned being used for renaming, and it does micro-operation scheduling rather than instruction scheduling. It also uses double-pumped ALUs, which likewise enhance the performance of the processor. Then, a higher clock rate is possible because of the 20-stage pipeline: we have seen that the pipeline has 20 stages, and we know that as the number of pipeline stages is increased, the clock frequency can be made higher.
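The advanced register renaming listed above can be sketched as a toy allocator drawing from a pool of 128 physical registers. The `RenamerSketch` class is a hypothetical illustration that omits register freeing and misprediction recovery:

```python
class RenamerSketch:
    """Toy register renamer: each write to an architectural register
    is given a fresh physical register from a pool (128 in Pentium 4),
    which removes false write-after-write and write-after-read
    dependences between instructions."""

    def __init__(self, num_physical=128):
        self.free = list(range(num_physical))  # free physical registers
        self.map = {}                          # arch reg -> physical reg

    def rename_dest(self, arch_reg):
        phys = self.free.pop(0)                # allocate a fresh register
        self.map[arch_reg] = phys
        return phys

    def rename_src(self, arch_reg):
        return self.map[arch_reg]              # read the latest mapping

r = RenamerSketch()
p1 = r.rename_dest("eax")       # first write to EAX
p2 = r.rename_dest("eax")       # second write gets a new physical reg
print(p1, p2, r.rename_src("eax"))  # 0 1 1
```

Because the two writes to EAX land in different physical registers, instructions depending on the first value can still execute out of order with respect to the second write.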
So, it uses a higher clock rate, and I have already mentioned the low-latency L1 data cache achieved by making it simple and small. Then it uses store-to-load forwarding, which we have discussed in detail. As for bandwidth, it gives you higher bandwidth: we have seen that the L2-to-L1 cache bus is 256 bits wide, and the external bus is 64 bits wide; with these two, it gives you higher bandwidth. So you can see these are the various innovations used in Pentium 4, and based on this NetBurst architecture a number of processors became available: the Celeron, Celeron D, Pentium 4, Pentium 4 Extreme Edition and Pentium D. Of course, Intel has since replaced NetBurst with the Intel Core microarchitecture; we shall discuss that later on. And here is the detailed microarchitecture of Pentium 4. Let me start here: this is the system interface, the bus interface unit, and you can see that its interface is 64 bits wide. This is the L2 cache, 256 kilobytes, and the bus between the L2 cache and the TLB and L1 cache is 256 bits wide in this particular diagram. This is the instruction fetch unit, this is the TLB, and this is the trace cache that I mentioned, in which 12 K micro-operations are stored. Here is the register allocation table, and these are the micro-operation queues, one for memory operations and one for the other operations, where those micro-operations are stored. Here the scheduler schedules the different operations to be executed by the functional units: the different ALUs and AGUs, the floating-point unit, the floating-point move and the floating-point execution units. And the 20-stage pipeline is shown on the left side. So this gives you an overview of the Pentium 4 microarchitecture.
So, we have come to the end of today's lecture, in which we have discussed the Pentium 4 microarchitecture in detail. In my next lecture I shall focus on EPIC, that is, IA-64, the architecture used by Intel's Itanium processor, which is a 64-bit processor; that I shall discuss in my next lecture. Thank you.