Hello and welcome to today's lecture on simultaneous multithreading, which is a form of hardware multithreading. So far we have considered multiple threads of an application running on different processors. In the last lecture I discussed multithreading at length, particularly in the context of multiprocessors and multi-core architectures, and we saw how it can be done and what its advantages are. Today we shall focus on a different question: can multiple threads execute concurrently on the same processor? For multiprocessors and multi-core architectures, multithreading comes naturally, and with its help the utilization of the processors or cores is improved. But here we pose a different question: if we have a single processor, is multithreading still relevant, and does it help in any way to improve performance? All these questions will be answered in this lecture. To state it in a sentence, SMT is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. It is particularly useful in superscalar architectures, and we shall see how the various resources available there, the functional units and others, can be used efficiently. Why is this desirable? We shall see that simultaneous multithreading permits multiple independent threads of execution to better utilize the resources provided by modern CPUs.
Modern CPUs are essentially superscalar, having many resources, and these should be fruitfully utilized. Simultaneous multithreading is inexpensive because you do not require multiple processors; a single processor can implement it. It is also desirable from the viewpoint of communication: whenever we use multiprocessing or multithreading there is some communication among the processes or threads, and that communication is faster when all the threads run on a single processor. As a consequence, simultaneous multithreading has become very useful in the context of single-processor superscalar architectures. So let us have a relook at the superscalar architecture. Superscalar processors are now commonplace, but most of the functional units cannot find enough work on average. This is one serious limitation: we have multiple functional units in a single processor, and from the instruction window the issue unit issues different operations to different functional units, but in most cases the utilization of the functional units is very poor. For example, suppose the peak instructions per cycle (IPC) is 6; that means 6 instructions can be issued in a single cycle because 6 functional units are available. In reality, the average IPC is only about 1.5, roughly one fourth of the peak. So you see that just by having a superscalar architecture, utilization remains quite poor.
For example, in this diagram we have a superscalar architecture with four issue slots; there may be more functional units, but at most four instructions can be issued per cycle. The yellow rectangles signify slots where instructions are actually issued to functional units, while the rectangles without the yellow color remain unutilized. We find that not only do several slots within a cycle remain unutilized, but some cycles remain entirely unutilized. In this particular case we have altogether 48 slots; that is, across these cycles, with four instructions issuable per cycle, 48 instructions could have been issued, but unfortunately only 16 are issued. So the utilization is only 33.3 percent. This shows the limitation of the superscalar architecture. One more point you must note: threads share resources. Having multiple threads has an advantage precisely because the threads share resources; which resources are shared we shall discuss shortly. Since they share resources, a small incremental increase in hardware lets you run a number of threads without a corresponding linear increase in chip area; the chip area does not grow linearly as you keep increasing the number of threads. So with a small amount of additional hardware you can support more threads, and that is why multithreading within a single processor makes sense. Let us now analyze the idle cycles in a superscalar processor.
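The slot-counting argument above can be sketched in a few lines of code. This is a minimal sketch with a hypothetical 12-cycle trace chosen so the totals match the lecture's figures (16 instructions issued out of 48 slots); the exact per-cycle numbers are my assumption, not taken from the diagram.

```python
# Hypothetical issue trace for a 4-wide superscalar processor:
# each entry is the number of instructions issued in one cycle.
ISSUE_WIDTH = 4
issued_per_cycle = [3, 2, 0, 1, 4, 0, 2, 1, 0, 1, 1, 1]  # 16 instructions total

def utilization(issued, width):
    """Fraction of issue slots actually filled over the whole trace."""
    total_slots = width * len(issued)   # 4 slots x 12 cycles = 48
    used_slots = sum(issued)            # 16 in this trace
    return used_slots / total_slots

print(f"{utilization(issued_per_cycle, ISSUE_WIDTH):.1%}")  # -> 33.3%
```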
Let us assume that typically four instructions are issued, and as you know a superscalar processor can have different types of functional units: adders, multipliers, floating-point units, a branch unit and so on. Many functional units are idle in many cycles, as we have already seen, and this is particularly true when there is a cache miss. Whenever there is a cache miss, the functional units remain unutilized until the pipeline is refilled with a new set of instructions, so utilization is very poor whenever cache misses occur. We know that the dispatcher reads instructions and decides which can run in parallel. We have already discussed the architecture of a superscalar processor: instructions are fetched by the fetch unit, then decoded, then sent to the dispatch unit, which identifies the instructions that can be scheduled and executed in parallel and issues them to the functional units inside the processor. It has been found that this number of instructions is limited by instruction dependencies and by long-latency operations, since some operations take much longer than others. For these reasons, the number of instructions that can be issued per cycle is very limited. The next diagram shows a four-issue processor across several cycles; X stands for a full issue slot, meaning that slot was utilized, while the green rectangles show empty slots. We find two kinds of waste. The first is vertical waste, which is introduced when the processor issues no instruction at all in a cycle: not a single instruction is issued, so not a single functional unit is busy. In this diagram there are three such cycles of vertical waste.
So we find three vertical wastes, amounting to 12 wasted slots. On the other hand, horizontal waste is introduced when not all issue slots can be filled in a cycle. For example, in the first cycle three slots are utilized and one slot is wasted; in the second cycle two slots are utilized and two are wasted. Counting both types of waste, the utilization of the processor decreases, and it has been found that on average 61 percent of the wasted slots come from vertical waste. This is because of the non-availability of instruction-level parallelism, the limited size of the instruction window, and the limited capability of the instruction-issue hardware. The next diagram gives a pictorial explanation of the difference from multithreading. In the first case, the instruction issue window has been enlarged to a much greater depth: we have a single instruction window of larger depth, and instructions are issued with more speculation at lower confidence, because a larger window crosses more branches, branch prediction must be performed for each, and so speculation proceeds with lower confidence. With multithreading, on the other hand, you can see three different threads here, and instructions are drawn from all three; we enlarge the width of the fetch stage to fetch instructions from different threads. So multithreading can be considered a memory-latency-hiding technique.
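The vertical/horizontal split described above is easy to compute mechanically. Here is a minimal sketch using the same hypothetical 4-wide, 12-cycle trace as before (the trace itself is my assumption; note that its own vertical/horizontal ratio need not match the 61 percent average quoted from measurements).

```python
def classify_waste(issued, width):
    """Split wasted issue slots into vertical waste (a cycle that issues
    nothing wastes all `width` slots) and horizontal waste (a partially
    filled cycle wastes the remaining slots)."""
    vertical = sum(width for n in issued if n == 0)
    horizontal = sum(width - n for n in issued if 0 < n < width)
    return vertical, horizontal

trace = [3, 2, 0, 1, 4, 0, 2, 1, 0, 1, 1, 1]  # hypothetical 4-wide trace
v, h = classify_waste(trace, 4)
print(v, h)  # -> 12 20   (32 wasted slots out of 48; 16 are used)
```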
We have seen that with a single thread, whenever there is a stall, several cycles are wasted; but with multithreading, if one thread is blocked we can issue instructions from another thread, and that is how the functional units are kept utilized. It hides stalls due to cache misses and stalls due to data dependencies: whenever such stalls occur we can take the help of multithreading to utilize the functional units, and it leads to increased efficiency of computation per amount of hardware used. You can look at it from another angle: we have a certain amount of hardware, so what computation are we getting per unit of hardware available? From that point of view, multithreading helps improve efficiency. The question naturally arises: what kind of support does multithreading require? Multithreading does not come free; nothing in this world is free, it is not a free lunch, and you have to pay for supporting it. What is the additional requirement? Multiple states must be maintained at the same time: the processor must replicate the independent state of each thread. With a single thread you maintain one state, and when switching occurs you store the information of that one thread; but with multiple threads you must maintain the states of all of them, one set per thread. This requires a separate program counter, a separate copy of the register file, a separate renaming table for each thread, and a separate page table if the threads run as independent programs.
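The per-thread state listed above can be pictured as a small data structure. This is only an illustrative sketch, with hypothetical field names and sizes (32 registers per thread, a page-table base address standing in for a full page table):

```python
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    """Hypothetical per-thread state that hardware multithreading replicates."""
    pc: int = 0                                           # separate program counter
    regs: list = field(default_factory=lambda: [0] * 32)  # separate register file copy
    rename_table: dict = field(default_factory=dict)      # separate arch->phys mapping
    page_table_base: int = 0                              # separate page table if independent

# One replicated context per hardware thread.
contexts = [ThreadContext(pc=0x1000 * t) for t in range(4)]
```

A thread switch then amounts to selecting a different `ThreadContext` rather than saving and restoring state through memory.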
Memory is shared through the virtual memory management mechanism, which already supports multiple processes, but here you require separate page tables to support multiple threads. In addition, the hardware should provide fast thread switching: whenever we go from one thread to another, we already know that thread switching is much faster than process-level switching, but the hardware should be designed so that it is fast enough to let us use multithreading effectively. It has been found that a thread switch requires about 0.1 to 10 clock cycles, compared to the 100 to 1000 cycles required for process switching, where much more information has to be stored. Moreover, since register renaming provides unique register identifiers, instructions from multiple threads can be mixed in the datapath; so register renaming also helps in supporting multithreading. Multithreading in its most basic form requires the processor to interleave the execution of instructions from different threads: in simple terms, the processor executes instructions from different threads in an interleaved manner. There are three thread-scheduling schemes; that is, multithreading techniques on a single processor fall into three categories. Number one is coarse-grained multithreading, the second is fine-grained multithreading, and the third is simultaneous multithreading, which is of the greatest importance because of its higher efficiency. Let us look at the differences among these three types. The first is coarse-grained multithreading. Here, as you can see, one thread executes for a while; these yellow-colored slots correspond to a single thread.
Then, after four cycles, a switch occurs to another thread, so for the next three cycles another thread executes; again a thread switch occurs, and the next set of yellow-colored slots corresponds to yet another thread. A thread switch occurs only when the active thread undergoes a long stall, for example an L2 cache miss. On an L2 cache miss we have to fetch from main memory, which involves a long delay, and the switch lets us overcome that memory latency. This form of multithreading hides only long-latency events, that is, long stalls such as L2 cache misses. This is coarse-grained multithreading, and it has been found to be easy to implement. For example, here you have five different threads: one thread executes for several cycles, then we go to another thread, then another, and so on. However, implementing it requires pipeline flushing on every thread switch, and this makes it inefficient: whenever we switch from one thread to another, we have to flush the pipeline and fill it with a new set of instructions, leading to inefficiency. Here are the advantages of coarse-grained multithreading. It relieves the need for fast thread switching: since switching occurs only after several cycles, the thread switch need not be very fast and may take some clock cycles. And it does not slow down any individual thread, since instructions from other threads are issued only when the running thread encounters a costly stall.
In this case you are executing different threads, possibly of a single application, and the execution of an individual thread is not delayed, because a particular thread is stopped or blocked only when it encounters a costly stall such as an L2-level cache miss. Of course, there are several disadvantages. Number one: it is hard to overcome the throughput losses from shorter stalls, because of the pipeline start-up cost. We have seen that the instruction pipeline has to be refilled, and that is a start-up cost; when stalls are short, lasting perhaps only one or two cycles, this refilling leads to throughput losses. Since the CPU normally issues instructions from just one thread, when a stall occurs the pipeline must be emptied or frozen, as I have already mentioned, and the new thread must fill the pipeline before its instructions can complete. Because of this pipeline start-up overhead, shorter stalls lead to throughput losses. This is the limitation of coarse-grained multithreading: because of the start-up overhead, it is effective in reducing the penalty only of high-cost stalls; only when the stall time is long compared to the pipeline refill time does coarse-grained multithreading make sense. Coarse-grained multithreading has been used in several processors, for example the SPARC-based Sparcle processor, which provides hardware contexts for 4 threads, with one thread reserved for interrupt handling, and which uses the concept of register windows to provide fast switching. You may recall that this processor has a large number of general-purpose registers.
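The break-even argument above can be stated as a one-line rule: switching only pays off when the stall outlasts the pipeline refill. A minimal sketch, where the 10-cycle refill cost is a hypothetical figure of my choosing:

```python
REFILL_CYCLES = 10  # hypothetical pipeline refill (start-up) cost per thread switch

def should_switch(stall_cycles, refill=REFILL_CYCLES):
    """Coarse-grained multithreading only wins when the stall is longer
    than the cost of flushing and refilling the pipeline."""
    return stall_cycles > refill

print(should_switch(200))  # L2 miss to main memory: long stall -> True
print(should_switch(3))    # short stall: refilling costs more   -> False
```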
These registers are grouped to form register windows, 4 sets of 32 general-purpose registers each, so that each thread can be allocated one register window. Suppose you have 4 register windows: one window comprises 32 registers, another window another 32, and so on. One window may be given to thread 0, another to thread 1, thread 2 uses another window, and thread 3 uses the last one. Whenever a thread switch occurs, only the pointer to the register window has to be modified; there is no need to save and restore all the registers, because they are already present in the CPU, and only the window pointer needs updating on a switch. This scheme is also used in cache-coherent distributed shared-memory architectures. We have not yet discussed the cache coherence problem; I shall discuss it in the next lecture, and we shall see that a cache miss to a remote memory in a distributed shared-memory architecture may take a large number of cycles, so whenever such a miss occurs we should switch to another thread. Similarly, network messages and the like can be handled by the interrupt-handler thread. So we find that coarse-grained multithreaded processors are available and are used in practical applications. Now, coming to fine-grained multithreading: here you have a few active threads, and context switching among the active threads occurs on every clock cycle.
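Before going on, the register-window switch described above can be sketched as follows. This is a toy model, not the actual SPARC window mechanism: it only illustrates that a thread switch is a pointer update, with the window and register counts taken from the lecture.

```python
# 4 windows of 32 registers live in one physical register file;
# a thread switch just moves the current-window pointer.
NUM_WINDOWS, WINDOW_SIZE = 4, 32
phys_regs = [0] * (NUM_WINDOWS * WINDOW_SIZE)
current_window = 0

def read_reg(window, r):
    """Read register r of the thread owning the given window."""
    return phys_regs[window * WINDOW_SIZE + r]

def switch_to(thread_id):
    """No save/restore of 32 registers -- just repoint the window."""
    global current_window
    current_window = thread_id

phys_regs[2 * WINDOW_SIZE + 5] = 99  # thread 2's r5, kept live across switches
switch_to(2)                         # switching in costs only a pointer update
print(read_reg(current_window, 5))   # -> 99
```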
Here context switching occurs among the active threads on every clock cycle, usually in a round-robin fashion, skipping any stalled thread. You may have, say, 7 active threads while the number of instructions that can be issued per cycle is 4. The 7 threads are considered in round-robin order depending on which are not stalled: perhaps threads 1, 2 and 3 issue and one slot remains empty because the others are not ready; in another cycle threads 4, 5 and 6 issue and one slot again remains empty. In this way instructions from different threads are issued to the different functional units over successive cycles. The occupancy of the execution core is much higher than with coarse-grained multithreading, even though we are context-switching in every cycle; that is the reason the CPU must be able to switch threads on every clock cycle, in other words the thread switch must be very fast. The advantage is that it can hide both short and long stalls, since instructions from other threads are executed whenever one thread stalls. The disadvantage is that it slows down the execution of individual threads: a thread that is ready to execute without stalls is still delayed by instructions from other threads. Earlier we saw that with coarse-grained multithreading a single thread does not get delayed, but here a single thread may get delayed, because cycles are shared among the different threads.
So, although a thread may be ready to execute without stalling, we still issue instructions from the other threads, and that is why individual threads may get delayed; but the overall throughput improves, as we shall see. This diagram shows fine-grained multithreading: the yellow-colored slots correspond to one thread, the shaded pink slots to another thread, and so on. There is a context switch in each cycle, so execution moves from one thread to another every cycle, and in different cycles different threads execute. Earlier, a single thread kept executing as long as it did not encounter a stall; here that is not so. In this scheme, vertical waste is eliminated, but horizontal waste is not. Of the two types of waste we saw earlier, the vertical waste disappears with fine-grained multithreading: you can see that not a single cycle is empty. However, if a thread has few or no operations ready to execute, its issue slots will be underutilized: if a single thread cannot supply enough instructions to keep all the execution units busy, those slots remain unutilized, so horizontal waste remains. Issuing instructions from only a single thread in a cycle may again not find enough work every cycle, but cache misses can be tolerated. This approach has been used in the Sun Niagara chip, which has eight cores in a single processor.
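The round-robin-with-skipping policy described above can be sketched as follows. This is a simplified model with a static ready/stalled flag per thread (real hardware re-evaluates stalls every cycle), and it assumes at least one thread is always ready:

```python
def fine_grained_schedule(threads_ready, num_cycles, start=0):
    """Pick one thread per cycle in round-robin order, skipping any
    thread whose ready flag is False (i.e. a stalled thread)."""
    order = []
    t, n = start, len(threads_ready)
    for _ in range(num_cycles):
        while not threads_ready[t % n]:  # skip stalled threads
            t += 1
        order.append(t % n)
        t += 1                           # advance round-robin pointer
    return order

# Threads 0..3; thread 2 is stalled throughout this toy trace.
print(fine_grained_schedule([True, True, False, True], 6))
# -> [0, 1, 3, 0, 1, 3]   (no cycle is ever empty: vertical waste eliminated)
```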
Now, coming to simultaneous multithreading: we are familiar with two types of parallelism, instruction-level parallelism and thread-level parallelism. We shall see how the concepts of instruction-level parallelism and thread-level parallelism can be combined to achieve simultaneous multithreading. The question is: could a processor oriented towards instruction-level parallelism also be used to exploit thread-level parallelism? That is, can thread-level parallelism, which is the basis for multithreading, make use of a datapath designed for ILP? Functional units are often idle in datapaths designed for ILP, because of either stalls or dependencies in the code. What we shall try to do is fill the slots of a single cycle from different threads. Earlier, in a single cycle, instructions from only one thread were issued; but here you can see that in the first cycle the first two instructions belong to one thread and the third instruction belongs to another thread. Unfortunately, the fourth slot remains unutilized, because no other instruction is ready for issue. So SMT converts thread-level parallelism into instruction-level parallelism; that is how the combination of both takes place. Issuing instructions from multiple threads in the same cycle gives the highest probability of finding work for every issue slot, and when we do this the majority of the slots get filled: very few slots remain unutilized with simultaneous multithreading. This technique is also known as hyper-threading, a term coined by Intel.
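The per-cycle slot filling just described can be sketched as a greedy issue routine. This is a deliberately simplified policy (a real SMT core also respects dependencies, functional-unit types and fetch priorities); the instruction names are purely illustrative.

```python
def smt_issue(thread_queues, width):
    """Fill up to `width` issue slots in one cycle from several threads'
    ready-instruction queues (greatly simplified greedy policy)."""
    slots = []
    for tid, queue in enumerate(thread_queues):
        while queue and len(slots) < width:
            slots.append((tid, queue.pop(0)))
    return slots

# Two ready instructions from thread 0, one from thread 1, none from thread 2:
# three of the four slots fill, matching the lecture's example where the
# fourth slot stays empty for lack of ready work.
issued = smt_issue([["add", "mul"], ["load"], []], width=4)
print(issued)  # -> [(0, 'add'), (0, 'mul'), (1, 'load')]
```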
So hyper-threading and simultaneous multithreading are the same thing: instructions from different threads are issued in a single cycle. This diagram shows the different multithreading techniques together, so you can compare their performance and see how the processor resources are utilized in each category. The first corresponds to a simple superscalar architecture executing a single thread; many slots remain unutilized, and only 16 out of 48 slots are used, so the utilization of the processor resources is 33 percent. In coarse-grained multithreading, when one thread stalls another thread is started, one thread followed by another, and so on; utilization increases, with 27 out of 48 slots used, giving 56.3 percent utilization of the resources. Similarly, fine-grained multithreading gives better utilization than the simple superscalar architecture; in this particular example its utilization happens to equal that of coarse-grained multithreading, 56.3 percent, though in general fine-grained multithreading achieves better efficiency. The next case is of course not multithreading but multiprocessing: here we have two processors, with separate jobs on separate processors, each processor running one thread.
So one thread runs on one processor and another thread on the other; here too the utilization is good, 60.4 percent. But when we go for simultaneous multithreading on a single processor, we find that utilization is quite high: 42 out of 48 slots are used, giving 87.5 percent processor utilization. So much for the multithreading categories. Now the question naturally arises: what additional support is required for simultaneous multithreading? I have already told you it will not be free, but what additional hardware resources do you require? We have already discussed dynamically scheduled processors, and we have seen that they have many hardware mechanisms that can support multithreading; many hardware resources are already available in them, and those resources can be utilized for simultaneous multithreading. For example, the large set of virtual registers can hold the register sets of the independent threads. Register renaming, which is used in dynamically scheduled processors, can be reused in simultaneous multithreading: renaming provides unique register identifiers, so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads. Then out-of-order completion allows the threads to execute out of order, giving better utilization of the resources. We have seen that out-of-order execution is allowed under dynamic scheduling of instructions; however, you require a commit unit, with whose help the final writing into the registers is performed in the proper order.
In simultaneous multithreading we can use that same mechanism, and as we shall see, we will need commit capability per thread. You just need to add a per-thread renaming table: a separate renaming table for each thread, and separate program counters, because the threads execute separately. This is the additional requirement compared to a dynamically scheduled processor. Independent commitment can be supported by logically keeping a separate reorder buffer for each thread, and that is how simultaneous multithreading is supported. We have already discussed the simple superscalar architecture, and two of its performance limitations are overcome by SMT. Number one is memory stalls, which we have already seen: whenever there is a memory stall we can issue from another thread, and that is how the memory-stall latency problem is overcome. The second is pipeline flushes due to incorrect speculation; that too is mitigated, because instead of relying on deep speculation you are drawing work from multiple threads. So in SMT multiple threads are executed simultaneously, and one can hide both of these problems, memory stalls and pipeline flushes due to incorrect speculation. Now, what is the anatomy of a simultaneous multithreading processor? We have already mentioned that you do not require multiple processors; you really have multiple logical CPUs. Physically there is only one CPU, but some additional resources are required: only about 5 percent extra silicon is needed to duplicate the thread state information, since the state of each individual thread has to be stored.
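The per-thread renaming idea above can be sketched concretely. This is a toy model, assuming a shared free list of physical registers and hypothetical register numbers; it only shows why two threads writing the same architectural register do not interfere:

```python
# Per-thread renaming tables sharing one physical-register free list,
# so instructions from different threads can mix in the datapath
# without clobbering each other's architectural registers.
free_phys = list(range(64, 96))            # hypothetical free physical registers
rename_tables = {t: {} for t in range(2)}  # one renaming table per thread

def rename_dest(thread, arch_reg):
    """Allocate a fresh physical register for a destination write."""
    phys = free_phys.pop(0)
    rename_tables[thread][arch_reg] = phys
    return phys

p0 = rename_dest(0, 5)   # thread 0 writes architectural r5
p1 = rename_dest(1, 5)   # thread 1 also writes r5 -- a different physical reg
print(p0, p1)            # distinct identifiers, no cross-thread confusion
```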
This is much better than single-threading, because it leads to increased thread-level parallelism and improved processor utilization when one thread blocks. However, it cannot be compared with two physical CPUs: for the same two threads, two separate processors will definitely give better performance, because in SMT the CPU resources are shared, not replicated. This diagram gives a visual representation of simultaneous multithreading: here you have two different threads and an instruction scheduler, which schedules instructions from the different threads onto the different functional units. For example, the yellow-colored slots correspond to functional units serving thread two, and the blue-colored slots to functional units serving thread one. So the instruction scheduler is distributing the threads across the functional units. Now, here are some important issues related to simultaneous multithreading. To achieve it, you have to replicate, extend and redesign some units of a superscalar processor; here we highlight what modifications are needed in a superscalar processor to make it suitable for simultaneous multithreading. You have to replicate some resources, extend some mechanisms, and redesign some units, which I shall discuss very briefly. The resources to be replicated are listed first.
First, the state of each hardware context must be replicated: you need hardware in which the registers and program counters of the different threads can be stored, and a per-thread mechanism for pipeline flushing and for subroutine returns. In other words, you require additional storage for the per-thread state information, a per-thread return mechanism, and replicated entries in the branch target buffer and the translation lookaside buffer. These are the resources to be replicated when we go for simultaneous multithreading. The resources to be redesigned are the instruction fetch unit, which must be reworked for simultaneous multithreading, and the processor pipeline. On the other hand, as far as instruction scheduling is concerned, no additional hardware is required, and register renaming works the same as in a superscalar processor; the instruction-scheduling part is not much different. Here we compare the architecture of a superscalar processor with what is additionally required for simultaneous multithreading. In the superscalar processor you have an instruction fetch and decode unit with a single program counter; reservation stations holding registers where the operands are stored and supplied to the different functional units; the functional units themselves, which perform the execution; and a commit unit. With the help of the commit unit we can have out-of-order execution: results are written into the registers in the proper order, and the registers in the reservation stations are updated. I have already discussed the superscalar architecture in detail.
Now, let us look at what additional things are required when you go for simultaneous multithreading. As I mentioned, you will require a separate program counter for each thread, so the program counters are replicated, as shown here. You will also require a separate commit unit for each thread, which is shown here. And in the register file you will require register renaming, which is not shown here explicitly: if you have a large number of registers, you can assign separate registers to the different threads and use the renaming technique for that purpose. So, you can see how the architecture needs to be modified to support simultaneous multithreading, and you can see that a superscalar architecture can be upgraded to support it quite easily. As far as the instruction fetch unit is concerned, you can fetch instructions for two threads: decode from one thread until a branch or the end of a cache line, then jump to the other. A priority scheme is also used: the highest priority goes to the thread with the fewest instructions in the decode, renaming, and queue pipeline stages. This is how the priority of threads can be used for the purpose of issuing instructions, and only a small hardware addition is needed to track the queue lengths. Then, as far as the register file is concerned, as I mentioned, each thread can have 32 registers, so the register file must hold 32 times the number of threads, plus some more registers for the purpose of renaming.
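The two rules just stated, fetch priority to the least-clogged thread (this rule is known in the SMT literature as the ICOUNT policy), and the register-file sizing arithmetic, can be sketched as follows. The occupancy numbers and the rename-pool size are made-up values for the example.

```python
def pick_fetch_thread(front_end_counts):
    """Return the id of the thread with the fewest instructions in
    the decode/renaming/queue stages (ICOUNT-style fetch priority);
    ties go to the lower-numbered thread."""
    return min(range(len(front_end_counts)), key=lambda t: front_end_counts[t])

# Hypothetical occupancy: thread 1 is clogging the front end,
# so fetch bandwidth goes to thread 0 this cycle.
print(pick_fetch_thread([3, 9]))  # -> 0

def register_file_size(num_threads, arch_regs=32, rename_regs=64):
    """32 architectural registers per thread, plus an (assumed)
    pool of extra physical registers for renaming."""
    return num_threads * arch_regs + rename_regs

print(register_file_size(4))  # 4*32 + 64 = 192
```

Tracking the per-thread occupancy counts is the "small hardware addition" mentioned above; the scheduling decision itself is just this comparison.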
So, the consequence is that you will require a very large register file to support simultaneous multithreading, and when you have a large register file, you know that the access time will be longer. If the number of registers in a register file is small, the decoding logic is simpler and the access time is smaller; but when you have a large number of registers in the register file, the decoding logic is complex and the access time is longer. That is one disadvantage, and it is a disadvantage you have to accept if you are interested in simultaneous multithreading, because simultaneous multithreading gives you much greater benefit, as you have already seen. And, as I have already mentioned, the pipeline architecture needs to be changed compared to the superscalar one. The typical superscalar pipeline has fetch, decode, renaming, queue, register read, and so on; for simultaneous multithreading, to avoid lengthening the clock cycle, the pipeline is extended to allow two-cycle register reads and writes. So, you can see here one additional register-read stage and one additional register-write stage. However, the two-cycle reads and writes increase the branch misprediction penalty, because of this change in the pipeline to support simultaneous multithreading. Then, regarding what to issue: this is not exactly the same as in a superscalar processor. We have seen that in superscalar architecture the oldest instruction is the best candidate and the least speculative instructions are preferred, but in the context of simultaneous multithreading it is not so clear what policy should be adopted for issuing instructions, because the branch speculation behavior may vary across threads: if you have different threads, the degree of speculation may be different for each.
This leads to complications, and based on this, the selection strategies can be oldest first, speculative branches last, and so on. So, the issue policy is a little different and more complicated in the context of simultaneous multithreading. As far as compiler optimization is concerned, the compiler should try to minimize cache interference, and latency-hiding techniques such as speculation should be enhanced, because one of the primary benefits we get from simultaneous multithreading is hiding memory latency; speculation helps to hide this latency, and so it should be enhanced in the context of simultaneous multithreading. Also, the sharing optimization techniques from multiprocessors have to be changed, because data sharing is now good: the threads share the same caches. So, it is a little different from multiprocessor optimization; here the optimization has to be done in a different way compared to a multiprocessor architecture. Coming to caching: the same cache is shared among all the threads, since we have a single cache memory, and performance degradation due to cache interference occurs. As a consequence, this leads to the possibility of cache thrashing. Cache thrashing is the situation where data transfer between the cache and main memory goes on all the time without any real work getting done, and this kind of situation can occur in simultaneous multithreading because the same cache is being shared by multiple threads. So, this is the performance implication of simultaneous multithreading: single-thread performance is likely to go down, as I have already mentioned, because the caches, branch predictors, and registers are shared, and this effect can be mitigated by trying to prioritize one thread.
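The thrashing effect just described is easy to demonstrate with a toy cache model. This is a deliberately minimal sketch, assuming a direct-mapped cache with one line per set and made-up address streams: two threads whose hot addresses map to the same set keep evicting each other, so accesses that would all hit in a private cache all miss in the shared one.

```python
def misses(num_sets, accesses):
    """Count misses in a shared direct-mapped cache: each set holds
    a single tag, so two threads whose addresses map to the same
    set evict each other repeatedly (a toy model of thrashing)."""
    cache = {}
    n = 0
    for addr in accesses:
        s = addr % num_sets          # set index by simple modulo mapping
        if cache.get(s) != addr:     # tag mismatch -> miss, fill the line
            n += 1
            cache[s] = addr
    return n

# Hypothetical streams: addresses 0 and 8 both map to set 0 of an
# 8-set cache. Each thread alone takes one cold miss and then hits;
# interleaved, every single access misses.
t0, t1 = [0, 0, 0, 0], [8, 8, 8, 8]
interleaved = [a for pair in zip(t0, t1) for a in pair]
print(misses(8, t0), misses(8, t1))  # 1 1
print(misses(8, interleaved))        # 8
```

This is exactly why the compiler guidance above asks for layouts that minimize cache interference between co-scheduled threads.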
So, I have already mentioned prioritizing different threads and executing them that way. With eight threads on a processor with many resources, simultaneous multithreading can yield throughput improvements by a factor in the range of 2 to 4. Because of these many benefits, simultaneous multithreading has been used in many commercial processors. The Compaq Alpha 21464 used 4-threaded simultaneous multithreading. The Intel Pentium 4 used 2-threaded simultaneous multithreading, and Intel has reported that it leads to a 10 to 30 percent gain in performance. The Sun UltraSPARC IV is a 2-core, 2-threaded design: in addition to its two cores, it uses 2 threads per core. The IBM POWER5 also used simultaneous multithreading: it has dual processor cores, each an 8-way superscalar with simultaneous multithreading, at the cost of 24 percent area growth per core. So, these are some commercial examples. In the Pentium 4, as I mentioned, simultaneous multithreading goes by the name of hyper-threading in Intel jargon; it uses 2 threads, and the operating system operates as if it were executing on a 2-processor system, although it is a single-processor system that can run 2 threads. When only one thread is available, the Pentium 4 behaves like a regular single-threaded superscalar processor. So, as I mentioned, up to 30 percent performance improvement. So, with this we have come to the end of today's lecture on simultaneous multithreading. Thank you.