we end up having some time to discuss something else. So, there are many acronyms in the title, as you can see. TLP stands for Thread Level Parallelism, something we have not talked about. We have talked about ILP and MLP, that is, Instruction Level Parallelism and Memory Level Parallelism. Thread Level Parallelism essentially talks about extracting threads from a program and running them in parallel. A thread you can think of as a bunch of instructions, a set of sequential instructions or a portion of a sequential program; of course, a thread also comes with another construct, its stack. We will not talk about how to extract threads from a program or how the threads are generated. Some mechanism will generate the threads; all we talk about is how a processor can handle the threads. The processor does not extract any threads. Threads come from somewhere: they could be embedded by the programmer and then generated by the compiler, or they could be extracted automatically by some software. So, we will not talk about any hardware that actually extracts threads. We will assume that the threads are already there and we will only talk about how to manage these threads optimally. HT stands for Hyper-Threading. Some of you might have heard about Intel HT processors. It is just a stylized name for simultaneous multithreading. This particular term was originally coined by researchers in academia; later Intel adopted the idea and named it Hyper-Threading. So, we will see the difference between what was originally proposed and what was finally implemented in the processor. CMP stands for Chip Multiprocessor. This is another way of doing multithreading on a chip. So, we will look at all these architectures. Let us start with SMT, which stands for simultaneous multithreading. Here the basic goal is to run multiple threads at the same time. I would like to emphasize the term "same time", because whatever we have seen in operating systems is actually multiplexing: you run one thread at a time and multiple threads are time-multiplexed on the machine. Here we are talking about running multiple threads at the same time, so it is slightly different, which of course means that the hardware must have support to be able to do that; otherwise it is impossible. So, why is it helpful, or why should we even discuss these things? Because it helps in hiding large memory latencies: even if one thread is blocked due to a cache miss, it is still possible to schedule ready instructions from other threads without the overhead of a context switch. One fundamental property of this type of multithreading is that there is no context switch; all the threads are active all the time. All you have to do is pick up ready instructions from the threads, schedule them through the wake-up and select logic, and they will execute. It also ultimately improves memory-level parallelism, because it essentially allows you to send more memory requests: while one thread has sent a request, you can bring in another thread which will run for a while and send another memory request. And this is exactly what today's graphics processors exploit, at a much larger scale.
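As a rough illustration of this latency-hiding argument (not from the lecture; the cycle counts R and L below are made-up numbers), here is a small C sketch showing that the core's utilization keeps rising with the thread count until there are roughly 1 + L/R threads, enough to cover one thread's miss latency with the other threads' work.

```c
/* Illustrative sketch (not from the lecture): how many threads are needed
 * to hide a cache-miss latency, assuming each thread runs R cycles of
 * useful work and then stalls for L cycles.  All numbers are invented. */
#include <stdio.h>

int main(void) {
    int R = 20;    /* assumed cycles of useful work between misses   */
    int L = 300;   /* assumed miss latency in cycles (e.g., to DRAM) */

    for (int T = 1; T <= 20; T++) {
        /* With T threads, at most T*R cycles of work are available to
         * overlap with one thread's L-cycle stall. */
        double busy = (double)(T * R) / (R + L);
        if (busy > 1.0) busy = 1.0;            /* core cannot be >100% busy */
        printf("threads=%2d  core utilization=%.2f\n", T, busy);
    }
    /* Utilization saturates once T >= 1 + L/R, i.e., 16 threads here; the
     * same reasoning, at a much larger scale, is what GPUs exploit. */
    return 0;
}
```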
You will find that today's GPUs, which stands for graphics processing units, if you look at the most recent GPUs coming out of industry, keep more than 1000 such threads in flight simultaneously. We will not really be talking about that scale of machine; we will be talking about 4 or 8 threads, because the basic concepts are more or less the same. Of course, scaling to 1000 threads does require a certain amount of innovation, because otherwise you will have many problems, like scaling the resources and managing the power budget. So, we will not get there; we will limit ourselves to a small number of threads and see what the basic concepts are. Overall, this improves resource utilization enormously compared to a superscalar processor, because if you think about a superscalar processor, it has a single thread running at a time, and whenever that thread takes a cache miss in the last level of the cache hierarchy, the instruction will eventually come to the head of the ROB, block everything, and the processor will stall for a long time. The resources will be mostly underutilized; that is what is happening today in superscalar processors. Here, what we are really doing is trying to pack many threads into a single superscalar pipeline in some way. It allows you to switch out one thread which is having a cache miss, bring in another thread, and let it run while the cache miss gets resolved for the first thread. The latency of a particular thread may not improve; in fact, it may even worsen, and we will discuss why. But the overall throughput of the system increases, because every unit of time you will be completing more instructions; that is, the average number of retired instructions per cycle increases. So, there are three design choices for single-core hardware multithreading. We are talking about just one processor; that is what the single-core term stands for, and we want to do hardware multithreading inside, meaning we want to support multiple threads inside. Here are the three options. One is coarse-grained multithreading, which means you execute one thread at a time; when the running thread is blocked on a long-latency event like a cache miss, you swap in a new thread. This swap takes place in hardware, so that means you need extra support and extra cycles for flushing the pipe and saving register values, unless the renamed registers remain pinned. What it means is this: suppose you are running a thread and it takes a cache miss. If you look at the register map snapshot at that point, the logical registers of that thread currently have mappings to physical registers. Now what you want to do is put this thread to sleep for some time until the cache miss resolves, and bring in a new thread and run it. So, what are the things you need to do? Well, of course, you have to change the program counter, because fetch will now resume from the program counter of whichever new thread you bring in, and you want to make sure that the registers which were used by the old thread are saved properly, so that when that thread comes back they can be restored. Now, this can be done in two ways. One is that you naively just copy the registers into memory, and then when you bring the thread back, you copy the memory contents back into the registers. That is what the operating system does.
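A minimal sketch (my own illustration, not the lecture's design) of this coarse-grained option: one thread runs until it blocks on a long-latency miss, then hardware swaps in another ready thread, using the naive register handling just described. The structure names and the thread and register counts are assumptions.

```c
/* Coarse-grained multithreading sketch: swap threads on a last-level miss,
 * copying the blocked thread's registers out and the new thread's back in. */
#include <stdbool.h>
#include <string.h>

#define NUM_THREADS   4
#define NUM_ARCH_REGS 32

static long regfile[NUM_ARCH_REGS];          /* the core's architectural registers */

struct hw_thread {
    unsigned pc;                             /* per-thread program counter         */
    long     saved[NUM_ARCH_REGS];           /* register save area for this thread */
    bool     blocked;                        /* waiting on a long-latency miss?    */
};

static struct hw_thread thread[NUM_THREADS];
static int current = 0;

/* Called when the running thread takes a last-level cache miss. */
int swap_on_miss(void) {
    memcpy(thread[current].saved, regfile, sizeof regfile);   /* save old thread   */
    thread[current].blocked = true;
    for (int i = 1; i <= NUM_THREADS; i++) {
        int t = (current + i) % NUM_THREADS;
        if (!thread[t].blocked) {                              /* next ready thread */
            memcpy(regfile, thread[t].saved, sizeof regfile);  /* restore registers */
            current = t;                                       /* fetch from its pc */
            return t;
        }
    }
    return -1;                               /* everyone blocked: the core stalls   */
}

int main(void) {
    (void)swap_on_miss();                    /* thread 0 misses; thread 1 runs next */
    return 0;
}
```

The alternative described next avoids these copies entirely by keeping a rename map per thread and leaving the mapped physical registers pinned.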
The other option is that the logical registers of the thread are already renamed to certain physical registers, and you simply say that those physical registers will remain allocated; they cannot be used by any other thread. That effectively obviates the saving mechanism: you do not have to do anything, because you know that these register contents will not be touched by anybody and will remain as they are. And that is again exactly what today's GPUs exploit. They do not save any registers on a context switch, which is why you find that GPUs have tens of thousands of registers inside; they need them because they want to support thousands of threads. Do we still need to store the logical register map? Yes, you have to save the map, unless, as we will actually see, we also support multiple map tables; we will see that this kind of processor would indeed have multiple map tables. But anyway, saving a map table is a small thing; there is in any case checkpointing of the map table. So, why is it called coarse-grained multithreading? Because we are really not mixing the threads' instructions very finely; we are letting one thread run at a time. In the operating system we also switch threads, but at a much coarser grain: we say that when a system call happens, you switch the thread. Here we are going further and saying that even on a cache miss we are going to switch the thread. So, it is a finer-grained thread switch, but we are still very close to what the operating system does: we are letting one thread run at a time. The next improvement is fine-grained multithreading. Here you fetch, decode, rename, issue, and execute instructions from the threads in a round-robin fashion. What it means is that this cycle I will fetch, decode, rename, issue, and execute instructions from thread one; next cycle from thread two; and so on and so forth, and then I will cycle back to the first thread. Notice that here I am not introducing any new resources; I still have a single pipeline. I am just switching the program counters to decide which instructions get injected into the pipeline: maybe thread one's instructions get injected this cycle, and next cycle thread two's instructions get injected. So, every cycle I will be switching the program counter. Now, what is the problem here? It improves utilization across cycles, no doubt about that, compared to coarse-grained multithreading: there, because of a long-latency event, you may not be able to use the resources given to you for many cycles, and here we solve some of that problem, because across cycles the utilization of resources will improve. But the problem remains within a cycle. Within a single cycle, what may happen is this: say the current cycle is given to thread one, and you find that in this cycle thread one has only two instructions that can be issued, that is, only two instructions are ready, but the hardware has provided you with four issue slots. Fifty percent of the issue slots will go to waste. Also, if a thread gets blocked on a long-latency event, its slots will go wasted for many cycles, because the schedule you are adopting is static.
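A hedged sketch of the fine-grained policy just described: each cycle, all issue slots belong to one thread, chosen round-robin, so any slots that thread cannot fill are simply lost. The widths, thread count, and function name are illustrative assumptions.

```c
/* Fine-grained multithreading sketch: one thread owns the whole issue
 * width in a given cycle, chosen by a static round-robin schedule. */
#include <stdio.h>

#define ISSUE_WIDTH 4
#define NUM_THREADS 2

/* ready[t] = number of ready instructions thread t has this cycle
 * (0 if it is blocked); returns how many instructions actually issue. */
int issue_fine_grained(int cycle, const int ready[NUM_THREADS]) {
    int t = cycle % NUM_THREADS;               /* static round-robin owner */
    int issued = ready[t] < ISSUE_WIDTH ? ready[t] : ISSUE_WIDTH;
    /* ISSUE_WIDTH - issued slots go idle this cycle, even if the other
     * thread had ready instructions: that is the within-cycle waste. */
    return issued;
}

int main(void) {
    int ready[NUM_THREADS] = { 2, 4 };         /* thread 0: 2 ready, thread 1: 4 */
    printf("cycle 0 issues %d of %d slots\n", issue_fine_grained(0, ready), ISSUE_WIDTH);
    printf("cycle 1 issues %d of %d slots\n", issue_fine_grained(1, ready), ISSUE_WIDTH);
    return 0;
}
```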
You are saying that if I have two threads, odd cycles will be given to thread one and even cycles to thread two. So, if thread one is currently blocked on a cache miss, all your odd cycles will go idle, which is not a good solution. So, the third option is called simultaneous multithreading. Here what you do is mix instructions from all threads every cycle. For example, you first fetch from thread one and find that you can fetch only two instructions this cycle, because the third instruction of thread one is a branch, so you do not know where to go next; but suppose you actually have four fetch slots. So, you will fetch one instruction from thread two, if the second instruction of thread two is a branch, and one more from thread three, to fill up the fetch packet. Every cycle, in every stage, I will be doing that: I will try to fill up my resources as much as possible by mixing instructions from all the threads. That gives you maximum utilization of the resources. And notice that in all these cases the pipeline itself is not replicated. So, what does this offer, and what are the problems? It offers a processor that can deliver reasonably good multithreaded performance with fine-grained, fast communication through the caches, because now one thread can run and another thread can share data with it through the caches, so communication becomes extremely fast. Although it is possible to design an SMT processor with a small area increase (for example, we will talk about the Pentium 4 HT, which had only about 5 percent extra area compared to the Pentium 4), for good performance it becomes necessary to rethink the resource allocation policies at various stages of the pipeline; we will talk about some of these policies. Also, verifying an SMT processor is much harder than verifying the underlying single-threaded design, because essentially, as soon as you bring in two threads, you have to look at the cross product of their state spaces. So, it is an exponential blow-up in the state space of the entire design, and it blows up very fast as we add threads. You must think about various deadlock and livelock possibilities, since the threads interact with each other through shared resources on a per-cycle basis. These are the problems that operating system designers normally think about; now the hardware designer has to think about them. Because what may happen is that one thread may hold a resource, and some other thread may wait for that resource while holding another resource, and then there will be a cycle of resource dependences. You have seen all these things; now the hardware will have to resolve them. So, the other option that people started thinking about, in addition to this, was: why not exploit the transistors available today to just replicate existing superscalar cores and design a single-chip multiprocessor? Why do you want to do all these things and invite headaches? You have already designed a nice processor; just replicate a few of these on a single chip and you have a multiprocessor, and you have so many transistors available. That led to the concept of the chip multiprocessor, and that is what you find in all of today's processors. Multi-core is essentially another name for it; that is what they are called today. In the mid 90s, when the first paper on this came out from Stanford, they named it the chip multiprocessor; that is really where the name comes from.
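And a matching sketch of the SMT case, where the same slots are filled from whichever threads have work; again the widths and the function are my own illustration of the slot-filling idea, not a real fetch policy (real SMT designs use smarter heuristics, such as ICOUNT from the original SMT papers, to decide which thread to favor).

```c
/* SMT sketch: every cycle the WIDTH slots are filled with instructions
 * from whichever threads can supply them, so slots one thread cannot use
 * are given to another thread instead of going idle. */
#include <stdio.h>

#define WIDTH 4
#define NUM_THREADS 4

/* ready[t] = instructions thread t can supply this cycle.
 * Returns the total number of instructions placed into the WIDTH slots. */
int issue_smt(int ready[NUM_THREADS]) {
    int filled = 0;
    for (int t = 0; t < NUM_THREADS && filled < WIDTH; t++) {
        while (ready[t] > 0 && filled < WIDTH) {
            ready[t]--;        /* take one instruction from thread t */
            filled++;          /* place it in the next free slot     */
        }
    }
    return filled;             /* close to WIDTH whenever any thread has work */
}

int main(void) {
    int ready[NUM_THREADS] = { 2, 1, 3, 0 };   /* per-thread ready instructions */
    printf("filled %d of %d slots\n", issue_smt(ready), WIDTH);
    return 0;
}
```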
So, here is a list of things that happened in industry; it is in no way exhaustive, just examples. Intel's dual-core Pentium 4, where each core was actually hyper-threaded, so you have multithreaded cores and it just uses the existing cores. Intel's more recent Sandy Bridge and Ivy Bridge have more cores on chip. For the server market, Intel announced the dual-core Itanium 2, code-named Montecito; again, each core is two-way threaded. AMD released a dual-core Opteron in 2005, and recent AMD processors have more cores. IBM released their first dual-core processor, POWER4, in 2001; we will talk about POWER4 in a minute. The next-generation POWER5 also uses two cores, but each core is two-way threaded, so we will also talk about POWER5. Sun's UltraSPARC IV, released in 2004, is a dual-core processor that integrates two UltraSPARC III cores. So, pretty much any chip manufacturer you look at has gone this way. Let us see what the design problems are here. First of all, why do you need this? One thing I have already hinted at: you have the transistors, so the question is how to use them. Today, microprocessor designers can afford to put a lot of transistors on the die; for example, recent chips have more than a billion transistors. So, there are a lot of transistors to use, and the ever-shrinking feature size, with the transistor size going down, keeps this trend going: you can actually pack a lot of transistors. So, what do we do with so many transistors? One obvious solution that industry used for some time was to invest a lot of transistors in caches. What do you do with those transistors? You increase your cache size on chip. But beyond a point, you really do not get the return for what you invested. The natural next choice was to think about a greater level of integration: a few chip designers decided to bring the memory and coherence controllers, along with the router, onto the die. One example was Compaq: the Alpha 21364 processor had everything on a single chip, the memory controller and the router. So, you could just put the chip and the DRAM modules on a motherboard and you are done; you do not need extra support chips. The next obvious choice was to replicate the entire core. It is very simple: just use the existing cores and connect them through a coherent interconnect. The cornerstone for all of this was Moore's law. How many of you have heard of this? What it says is that the number of transistors on a die doubles every 18 to 24 months. That means an exponential growth in the available transistor count. If transistor utilization were constant, this would lead to exponential performance growth, but life is slightly more complicated. Some of you might have seen the statement "performance doubles every 18 to 24 months" presented as Moore's law; that is not Moore's law. That assumes the transistor utilization stays constant, which is actually not true. The problem is that wires do not scale with transistor technology, so wire delay becomes a bottleneck: wire delay remains more or less constant while transistors get faster, so communication becomes more costly. Short wires are good, which dictates a localized logic design; you do not want long wires with distributed logic and all those things.
But superscalar processors exercise centralized control requiring long wires, because pretty much everything is controlled by the issue queue: the issue queue is the central piece which decides which instruction will issue, which instruction will write to the register file, how the ROB will be updated, and so on. And in the pipeline you might also require long wires because of the bypass paths and all of that. However, to utilize the transistors well, we also need to overcome the memory wall problem. We have talked about this problem: it essentially says that your processor spends most of its time waiting for data to come back from memory. To hide memory latency we need to extract more independent instructions, that is, more instruction-level parallelism. That is the only way to hide memory latency, because if you have a cache miss pending for the instruction at the head of the ROB and you want to hide that entire latency, all you have to do is keep fetching useful instructions and executing useful instructions during this entire time. That essentially boils down to extracting more independent instructions. But that requires costly hardware: you need more in-flight instructions, because if you do not have more in-flight instructions, the probability of finding independent instructions is low. For that you need a bigger ROB, which in turn requires a bigger register file; you also need bigger issue queues to be able to find that parallelism. None of these structures scale well, and the main problem, again, is that they all require long wires. So, the best solution should utilize the transistors effectively at a low cost, must not require long wires, and must be able to leverage existing technology. A CMP satisfies these goals exactly: it uses existing processor cores and invests the transistors in having more of them on chip, instead of trying to scale up the existing processor. So, multi-core was actually a natural choice; it did not happen by accident, we were forced to move in this direction because of certain technology constraints. Here is a chart, slightly dated now, of the transistor counts of Intel processors. The point is that the y-axis is in log scale while the x-axis is linear, and the data falls more or less on a straight line; it is not perfectly linear, but more or less, which substantiates Moore's law. Now, assuming you have done this and put multiple processors on a single chip: did I not just make my power consumption n-fold by putting n cores on the die? That is the case if you do not scale down the voltage and frequency. So, usually chip multiprocessors are clocked at a lower frequency; you must have noticed that as the number of cores increases, normally the processor is clocked at a slower frequency, and the reason is exactly this. Voltage scaling comes with smaller process technology, because physics tells you that if I make a transistor smaller, I have to reduce the supply voltage, otherwise the transistor may break down; basic physics tells us that. So, you have to scale down the voltage, you cannot help it. Now, physics tells us one more thing: if you start reducing the supply voltage of a transistor, the electric field goes down, the acceleration of the carriers will be smaller, and the transistor will switch at a slower speed. Which means, if you keep clocking this transistor at the old higher speed, there will be a fundamental problem. What is that?
So, I have a pipeline; these are my pipeline registers, and in between I have some combinational logic; this is the input and this is the output. What I have done is scale down the voltage of the transistors that implement the logic, but I continue clocking the pipeline registers at the higher speed that was used previously. Is there a problem? My clock goes to the latches; the combinational logic has no clock; the supply voltage has gone down, but I have not brought down the clock frequency. And I just said that if you bring down the supply voltage, the electric field goes down and your transistors naturally operate at a slower rate. So, even before the computation completes, I will sample this latch and I will get a wrong value from it. That is the problem. So, you have to bring down the frequency for correctness reasons in any case, but it also keeps your power consumption in check. What happens is this: you know that the switching energy is CV squared, from basic physics, and power is energy per unit time, so dynamic power goes roughly as CV squared times f. Now, voltage and frequency are more or less linearly related over a certain range of operation, so what you get is roughly a cubic dependence of power on voltage, or on frequency. Which is good, because as your transistors shrink and you bring down the supply voltage, you automatically get a cubic saving in power. That actually helps here: it means that if you have n cores, you are gaining something back, although you will still not be able to run them at the same clock rate as a single core; that can be shown. Anyway, the point here is that when you are designing these multi-core processors, you cannot just look at performance; you have to look at power also. So, you change your metric: one metric people often use is performance per watt, which is actually the same as the reciprocal of energy. Why? Because performance is one over time; we say that if the time goes down, performance increases. So performance per watt is one over time, per watt, which is the reciprocal of energy. This metric tells you that less energy is good, which is somewhat problematic, because then all you need to do is optimize energy, which can probably be done by sacrificing performance. A more general metric is performance to the power k plus one, per watt, for k greater than or equal to one. Instead of energy you now talk about energy times time; that is often called the energy-delay product. For example, if you set k equal to one, you get energy multiplied by execution time; if you set k equal to two, you get energy times execution time squared. These weigh the different facets, energy and delay, differently, and you need smarter techniques to improve these metrics. For example, today all processors will dynamically monitor the processor's performance and decide whether you need to run at full voltage, or whether you can scale it down and still have the same level of performance; and when you scale down the voltage, you scale down the frequency as well. We will not really discuss the algorithms that are used for dynamic voltage and frequency scaling; if you are interested I can give you references. But the rough idea is that you monitor the processor's performance, try to establish a correlation between voltage, frequency, and performance over time, actually apply the scaling, and then continue the feedback loop.
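To make these metrics concrete, here is a small back-of-the-envelope computation; the time and power numbers are invented, and the point is only how performance per watt, the energy-delay product, and the ED-squared product relate to energy and execution time.

```c
/* Back-of-the-envelope sketch of the power/performance metrics above;
 * all input numbers are assumptions for illustration. */
#include <stdio.h>

int main(void) {
    double time_s  = 10.0;                 /* execution time (s), assumed   */
    double power_w = 50.0;                 /* average power (W), assumed    */

    double perf     = 1.0 / time_s;        /* performance = 1 / time        */
    double energy_j = power_w * time_s;    /* energy = power * time         */

    /* performance/watt = (1/time)/power = 1/energy                         */
    printf("perf/watt        = %g (= 1/energy = %g)\n",
           perf / power_w, 1.0 / energy_j);
    /* performance^2/watt = 1/(energy * time): the energy-delay product     */
    printf("1/(E*D)          = %g\n", (perf * perf) / power_w);
    /* performance^3/watt = 1/(energy * time^2): the ED^2 product           */
    printf("1/(E*D^2)        = %g\n", (perf * perf * perf) / power_w);
    return 0;
}
```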
For the processors we discuss, we will actually look at one implementation of this, but not much of the algorithm. Okay, so now some basics about the chip multiprocessor. The fundamental question is where to put the interconnect that connects the processors: you have multiple processors and some connection between them, and you do not want to access the interconnect too frequently, because these wires are slow; that was the fundamental reason why we wanted to shift away from long-wire designs. So, it probably does not make sense to have the L1 cache shared among the cores. What I am saying is this: I have these cores here, say 4 cores, where a core contains the pipeline. The first option is to put the interconnect right below the cores, connecting them, and then put the cache hierarchy, L1, maybe an L2, maybe an L3, and so on, on the other side. Essentially, that means I have a shared L1, and every load/store instruction from every core will have to cross the interconnect to be processed, which clearly defeats the whole purpose: we did not want to access the interconnect too often. It also requires very high bandwidth and may necessitate a redesign of the L1 cache and the surrounding load/store unit, which we do not want to do, because you would need as many ports on the L1 cache as there are cores, so that you can grant access to all of them concurrently. So, what people do is settle for private L1 and L2 caches, which means each core gets a private hierarchy that can absorb most of the load/store operations. What people usually do today is put the L1 and L2 with each core, so each core gets private L1 and L2 caches, and then you share the L3 cache on the other side of the interconnect, and you need a coherence protocol at the L3 interface to keep the private L2 caches coherent. Because what may happen is that there is a variable x which is residing in several of the private L2 caches, and if one of the cores writes to x, you have to tell the others that the value has changed. That is what a coherence protocol is for. One may use a high-speed, custom-designed snoopy bus; let me explain what that means. Imagine that you have a variable x in the program which is shared by all the processors. Suppose currently x is residing in the caches of C0 and C3, and the value of the variable x is 2. Now your program says that C2 wants to write to x. C2 will access its L1 and miss, access its L2 and miss, and go to the L3. Let us assume that x is actually present in the L3 as well; x will be there if you have an inclusive cache hierarchy, otherwise it may not be. So, C2 gets x, caches it in its L2 and L1 (let us assume an inclusive hierarchy), and generates a new value for x, which is 3. Now another core may want to write to x: it generates a store which misses in its L1, hits on its stale copy in L2, and generates a new value, say 5. I can keep going like this, and eventually I can show you that all the processors will have the variable x with different values. Now your entire system is inconsistent: for a single variable x, everybody sees a different value, and your program is going to do something very wrong. So, you need hardware support to make sure that when such a write happens, the other cores that are caching x are told that a new value is being generated for x and they should be aware of it. There are two ways of doing that; one is called the snoopy protocol.
What the snoopy protocol does is this: whenever a core tries to access something from the shared L3 cache, it launches the address of the block on the bus; everybody else snoops the bus, picks up the address, and makes a comparison of the address against its own cache. Essentially, not a comparison against all addresses: each core picks up the address, indexes into its L2 cache, and sees whether that particular cache block is present, and if it is present and the launched command was a write, it invalidates the block from its hierarchy. So, in this case the two sharers, C0 and C3, will get a match and will invalidate x from their caches, so that there is always a unique modified value of x. That guarantees part of the problem, but it does not solve the whole problem: subsequently, C0 may want to access x, and if it goes and gets x from the L3, the value there is still 2, because we are assuming a write-back cache, and the new value will reach the L3 only when C2 evicts the block from its hierarchy. So, you also have to make sure that whenever somebody tries to access a variable, the request gets directed to the current owner of that block. In this case, if C0 accesses x, it launches the address on the bus, everybody snoops, C2 gets a match, and C2 supplies the new value on the bus, which is picked up by C0. That is what a snooping protocol does: it makes sure that the values everyone sees are always correct. Now, the problem here is that if you have a large number of processors, whenever a write is generated, you have to tell everybody that somebody is writing the variable. That requires a broadcast, which will flood your network and generate a lot of messages. But what we really want to do is tell only these two sharers, because the others do not care; they do not have this variable in their caches. So, what you could do is maintain a directory somewhere in the system, which the writer will go and look up to figure out that only C0 and C3 are the processors currently caching this data, and it sends two point-to-point messages to those two, saying invalidate the block. That is called a directory protocol. I am not going into much more detail about these protocols; there are many things you can do to optimize them, but this is the basic thing you require, and it is called a coherence protocol. An entirely different design choice is not to share the cache hierarchy at all. You could pull the L3 into each core as well: nothing stops you from designing an architecture where each processor also has a private L3 and the interconnect sits below that, so nothing is shared. Early dual-core AMD and Intel parts actually did this. It relieves you of the on-chip coherence protocol, because you really do not need one on chip, but you lose the fast communication: previously you could share data through the L3 cache very quickly, C0 could write something and write it back to the L3, and C3 could consume it through the L3, whereas now you have to go over the interconnect and all the way around to get the data back. Anyway, today you will probably not find any such architecture with a completely private hierarchy; the last level of the cache is essentially always shared. (Is that interconnect on chip? No, it is not on chip, exactly; normally that interconnect will be outside the chip, and the protocol that keeps the L3s coherent runs there.) So, if you do have a shared last-level cache, it has to be a banked cache, because otherwise the access latency increases.
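As a toy illustration of the difference between the two protocols (my own simplification, not a real protocol implementation), the sketch below shows how the set of caches to invalidate on a write is found in each case, following the x example above: core 2 writes while cores 0 and 3 hold stale copies.

```c
/* Toy model contrasting snooping and directory-based invalidation. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_CORES 4

/* present[c] = true if core c's private L2 currently holds the block x. */
static bool present[NUM_CORES] = { true, false, true, true };

/* Snoopy: the write address goes on the bus; every other core snoops it,
 * compares against its own tags, and invalidates on a match.  The
 * broadcast reaches all cores whether or not they cache the block. */
static void write_snoopy(int writer) {
    for (int c = 0; c < NUM_CORES; c++) {
        if (c == writer) continue;
        printf("core %d snoops the bus\n", c);   /* everyone sees the write */
        if (present[c]) present[c] = false;      /* invalidate on a match   */
    }
}

/* Directory: a sharer bit-vector kept at the block's home in the shared L3
 * records exactly which cores hold it, so only they receive messages. */
static void write_directory(int writer, unsigned sharers) {
    for (int c = 0; c < NUM_CORES; c++) {
        if (c == writer || !(sharers & (1u << c))) continue;
        printf("invalidation sent to core %d\n", c);
        present[c] = false;
    }
}

int main(void) {
    write_snoopy(2);                         /* core 2 writes x under snooping     */
    present[0] = true; present[3] = true;    /* reset the toy state                */
    write_directory(2, 0x9);                 /* same write; bits 1001b = cores 0,3 */
    return 0;
}
```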
Then you have to think about how many coherence engines to have per bank, because each bank can receive a large number of coherence requests, and having more engines may actually speed up your coherence processing. Now, the question is: if you have a banked architecture, let me make the L3 shared again and suppose it is a banked L3 cache. When you put an address on the bus, it may happen that no private cache responds, because nobody has the cache block; then one of the L3 banks may actually have the block, and it will have to respond. The particular bank that holds a cache block is called the home bank of that block, and that is normally a static mapping: a simple hash function takes the address and figures out the home bank, so it is just a partitioning of your addresses across the banks. A miss in the home bank means the block is not inside the chip at all, because of inclusion, so you have to go to memory to fetch the block. This part we have already talked about, so let us go directly to the trade-offs; the cost on the L3 side is the storage needed to remember who has each block. Now, is there any incentive to change the home bank at run time? I just said it is normally a static mapping; is there any reason why I might want to change the home bank of a cache block? Yes: the blocks homed in a particular bank may have one processor as their heavy user, and the latency of accessing that bank from that processor may be large, depending on how the interconnect is laid out. Let me show you one example. Suppose you have a ring; I am just showing a logical diagram, of course the ring stops will be distributed around the ring, and on the other side also there will be ring stops. Assume that each ring stop has one bank of the L3, so you have a 4-bank L3 and the banks are distributed around the ring. Now C0 finds that it needs a particular cache block whose home is on the far side of the ring, and suppose C0 is unable to cache this block very efficiently: it misses its L1 and L2 very frequently on this particular block. Then it actually makes sense to change the home bank from the far bank to the bank next to C0, but this can only be done dynamically, after monitoring the access pattern. There are proposals which actually do this: you can do dynamic migration at the grain of a cache block, and if you can figure out that an entire page is seeing the same access pattern, which often happens, you can move the entire page, that is, change the home of all the cache blocks in that page from one bank to another.
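A minimal sketch of such a static home-bank mapping, assuming 64-byte blocks and a 4-bank L3 (both numbers are assumptions for illustration): consecutive blocks are simply interleaved across the banks.

```c
/* Static home-bank hash: a simple function of the block address decides
 * which shared-L3 bank is the home for that block. */
#include <stdio.h>

#define BLOCK_BITS 6            /* 64-byte cache blocks (assumed)          */
#define NUM_BANKS  4            /* 4-bank shared L3 (assumed)              */

/* All addresses within the same block share a home bank; consecutive
 * blocks are interleaved across banks. */
static unsigned home_bank(unsigned long addr) {
    return (addr >> BLOCK_BITS) % NUM_BANKS;
}

int main(void) {
    unsigned long x = 0x1f40;   /* some physical address                   */
    printf("home bank of %#lx is %u\n", x, home_bank(x));
    /* By inclusion, a miss in the home bank means the block is not on
     * chip at all, so the request goes to memory.  Dynamic schemes instead
     * remap hot blocks or whole pages to a bank nearer their heaviest user. */
    return 0;
}
```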
Now, SMT and CMP add a couple more levels to the hierarchical multiprocessor design. What may happen now is that you take this architecture and say that each core has several threads running, so these are all SMT cores. If you just have a single SMT processor, that is, just one core with its L1 and L2, then among the threads you can already do shared-memory multiprocessing, with possibly the fastest communication. Why? Because one thread writes something to the L1 and another thread may consume it directly from the L1, so it is very fast communication between threads. You can connect SMT processors to build an SMP over a snoopy bus; so, for example, here I have connected 4 SMT cores over an interconnect, here it is a ring, but it could be a bus also, and you can connect these SMT nodes. Nothing stops me from taking this whole thing and connecting a few of these nodes over yet another interconnect, so I get a pretty large machine: for example, if one single node gives me 16 threads, then by connecting just 4 of these I get 64 threads. You could do the same thing with a CMP; the only difference is that you need to design the on-chip coherence logic, which is not automatically enforced as it is in SMT. Notice that in SMT I do not need any coherence protocol, because the first level of the hierarchy itself is shared. If you have a CMP where each core is an SMT, like this, then you really have a tall hierarchy of shared memory, and communication becomes costlier as you go up the hierarchy; as you go up, it takes more and more time. This is where your compiler can do a very good job in producing good locality; the compiler can play a big role in optimizing the program, figuring out in which phases you need what, and so on, and even generating prefetch instructions in advance. Most designs today support SMT cores on a multi-core die like this. We will talk about a few of these; I have listed the others also. We will talk about two SMT case studies, that is, the Intel Pentium 4 Hyper-Threading and the Compaq Alpha 21464 (we talked about the 21264 earlier; this particular processor got cancelled, because that is exactly when Compaq got bought by HP and the Alpha technology went to Intel). And we will discuss the CMPs; I have IBM POWER4 also. So, I am going to stop here today.