Talking about benchmark applications, we also talked about some of the metrics that we use to measure performance, and we discussed the benchmark applications for several market segments including desktops, servers and embedded processors. So, today we will start with an example: how do you compare processors A, B and C with two benchmark applications P1 and P2? A executes P1 in 1 second and P2 in 1000 seconds, B executes P1 in 10 seconds and P2 in 100 seconds, and C executes P1 in 20 seconds and P2 in 20 seconds. So, with this data, which of A, B and C is the best one? You can see that A is the fastest on P1 but the slowest on P2, C is the opposite, and in both cases B is in the middle. And remember that in practice you would be dealing with hundreds of benchmark applications. What you need is a meaningful summarization of this data, and by meaningful I mean it should not be misleading; it should not lead you to a machine which is actually not a good one.

So, let us take a look at the possible ways. One is to report the total time to execute P1 and P2. Similarly, along the same lines, it is possible to report the arithmetic mean of the execution times; it is the same thing up to a factor of two. So, if you compute that, what you get is that A takes 500.5 seconds on average, B takes 55 seconds and C takes 20 seconds to execute P1 and P2. So, according to the arithmetic mean you would conclude that C is the best and say: well, I would go and buy C, right? Do you see any problem with that? P1 may be the more popular program, run much more often than P2. To alleviate this problem, the weighted execution time assigns a weight to each benchmark; as somebody has already suggested, you would want the weights to reflect how frequently each program is actually used, which is often very difficult to determine. Another option, although it probably does not have much meaning as such, is to assign equal time weights with respect to a particular machine. What do I mean by that? If I take machine B, I assign the weights so that the two programs contribute equal amounts of time on B. Equal time weights with respect to B would be P1 getting a weight of 0.909 and P2 getting a weight of 0.091: the weight of 0.909 on P1 gives 0.909 x 10, about 9.1 seconds, and the weight of 0.091 on P2 gives 0.091 x 100, again about 9.1 seconds, so on B the two programs contribute equally. (On A, by contrast, P2 alone contributes 0.091 x 1000, about 90.9 seconds, under these weights.) So, you might wonder why anybody should do that; I mean, why am I taking machine B and equalizing with respect to B? Well, there may not be any good reason for that, actually. The point is that once you equalize with respect to a particular machine you can report weighted summaries, but it is easy to foul play by taking a convenient baseline machine.

The other option is to report normalized performance, and that is what industry and practitioners normally do: normalize execution times to some baseline processor. The point is that when you are designing a new processor, you already have one and you are trying to improve upon it. So, it makes sense to specify your performance improvements relative to the current processor, which is often called the baseline processor. So, now we are really talking about ratios. For example, if my baseline is A, I will report the performance of B and C with respect to A: on P1, B is 10 times slower than A, whereas on P2, B is 10 times faster than A. So, talking about ratios, how do you summarize ratios? The same question arises again, and again you have options: you can take the arithmetic mean, you can take the geometric mean. Which one makes sense? Notice that if you take the arithmetic mean of times it is still a time, but if you take the geometric mean of times it is no longer a time; it is just a number.
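To make these summaries concrete, here is a minimal C sketch (my own, not from the slides) that computes the total time, the arithmetic mean, the B-equalized weighted mean, and the geometric mean of the ratios versus baseline A:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* Execution times in seconds: rows are machines A, B, C;
       columns are benchmarks P1, P2. */
    const char *name[3] = {"A", "B", "C"};
    double t[3][2] = {{1, 1000}, {10, 100}, {20, 20}};

    /* Equal-time weights with respect to B: w1 * 10 == w2 * 100. */
    double w1 = 10.0 / 11.0, w2 = 1.0 / 11.0;  /* ~0.909 and ~0.091 */

    for (int m = 0; m < 3; m++) {
        double total = t[m][0] + t[m][1];
        double amean = total / 2.0;
        double wmean = w1 * t[m][0] + w2 * t[m][1];
        /* Per-benchmark speedup ratios versus baseline A,
           then their geometric mean. */
        double r1 = t[0][0] / t[m][0], r2 = t[0][1] / t[m][1];
        double gmean = sqrt(r1 * r2);
        printf("%s: total=%6.1f amean=%6.1f wmean(B)=%6.2f gmean-vs-A=%5.2f\n",
               name[m], total, amean, wmean, gmean);
    }
    return 0;
}
```

Running it shows the trap just mentioned: under B's equal-time weights, B (about 18.2 seconds) suddenly looks better than C (20 seconds), even though the plain totals favored C.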
However, the geometric mean of ratios has many good properties. We will come back to that; before that, here is an example which shows how the geometric mean behaves. Suppose the designer of A does an optimization that brings the time to execute P2 down to 500 seconds (previously it was 1000 seconds), and the designer of B does an optimization that brings the time to execute P1 down to 5 seconds (previously it was 10 seconds). The geometric mean will continue to show these two processors at the same level, because the ratios stay balanced: what the designer of B did is halve the execution time of P1, and what the designer of A did is halve the execution time of P2. Suppose I am comparing A and B. I have two ratios: for P1 the ratio is 1/10 and for P2 the ratio is 10, and I take their geometric mean, which is 1. Then I take the newly optimized A and B and do the same thing: the ratios are now 1/5 and 5, and the geometric mean is still 1. So, the geometric mean is oblivious to absolute savings; that is the property. The designer of A might claim to be much smarter than the designer of B, because the designer of A saved 500 seconds while the designer of B managed to save only 5 seconds, but the geometric mean does not see that. So, somebody may report the geometric mean or the harmonic mean of performance with respect to a baseline; the right type of mean really depends on the metric, and the most important point is: do not cheat.

Now, the geometric mean is often used when you report ratios, and one reason is that the geometric mean of ratios is the same as the ratio of the geometric means of the numerators and the denominators, so it often becomes easier to evaluate. People often use the harmonic mean also. So, how are the arithmetic mean, the geometric mean and the harmonic mean ordered? Does anybody remember which one is the largest? The arithmetic mean is the largest and the harmonic mean is the smallest. The harmonic mean has the property that it usually dwarfs your performance improvements, because it is the lowest of the three means. So, if you want to be pessimistic in reporting your achievements, use the harmonic mean; that way you are on the safe side, not being too arrogant or risking overstating your results. Whereas if you really want to magnify your achievement, you might use the arithmetic mean. As we go along in the course, I will also mention which mean to use for which particular type of measurement; there are certain things that need to be taken care of there. Any question on this? So, to summarize the summarization: people usually use means, the harmonic mean being the most popular one, and sometimes the geometric mean. These are the statistics that you normally see; one does not usually go for higher order statistics when summarizing results. Any question?
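Before moving on, here is an equally small sketch of the geometric mean invariance from the example above; the times are the ones we used (seconds on P1 and P2):

```c
#include <stdio.h>
#include <math.h>

/* Geometric mean of the two per-benchmark ratios t_A / t_B. */
static double gmean_ratio(double a_p1, double b_p1, double a_p2, double b_p2) {
    return sqrt((a_p1 / b_p1) * (a_p2 / b_p2));
}

int main(void) {
    /* Before: A takes (1 s, 1000 s) and B takes (10 s, 100 s) on (P1, P2). */
    printf("before the optimizations: %.3f\n", gmean_ratio(1, 10, 1000, 100));
    /* After: A halves P2 (1000 -> 500 s), B halves P1 (10 -> 5 s). */
    printf("after the optimizations : %.3f\n", gmean_ratio(1, 5, 500, 100));
    /* Both print 1.000: the 500-second saving and the 5-second saving
       look exactly the same to the geometric mean. */
    return 0;
}
```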
So, now we will talk about Amdahl's law, which is actually common sense. Gene Amdahl was a PhD student in theoretical physics at the University of Wisconsin-Madison in the early 50s; transistors were just coming up. Amdahl developed certain theories in physics with equations that he needed to solve but could not solve analytically, so numerical solutions were required. So, Amdahl built a machine, a computer, which is still available in Wisconsin's computer museum. From then on, Amdahl essentially stopped doing physics: he was offered a job at IBM and became a computer engineer. In 1967 he came up with this law, which is essentially common sense. It says: make the common case fast. Simple as that.

It is quite intuitive. It essentially says that there is no point investing time and money to optimize a part of a system that gets invoked, for example, one percent of the time. We should really be optimizing the parts of a program that account for the maximum amount of execution time. So, for example, if you have a function which accounts for 90 percent of the time, you better invest your time and money to optimize that function, and forget about the remaining portion of the program. That is actually Amdahl's law.

To put it mathematically, suppose a particular section of the program can be enhanced by some optimization in the processor, and suppose an x fraction of the entire execution time is spent in this particular section. So, you are saying that you have a program and a certain part of the program can be optimized, by software or by hardware; it does not really matter how you optimize it. The fraction x is of the original execution time, and the optimization can speed up execution of the section by y times. So, what does it mean? If t was the previous execution time, the new execution time is t(1 - x) + tx/y, because the tx part now gets replaced by tx/y once that portion is sped up by y times. So, the speedup you get is t divided by the new time, which works out to 1 / ((1 - x) + x/y).

If you look at this particular equation, you may ask several questions, right? Suppose I say that you can get an unbounded amount of speedup in the section, that is, y is infinite. What will happen in that case? The speedup is limited by 1 / (1 - x). That is the upper bound you can get, whatever you do. So, that is a very important point: the portion of the program that you cannot improve will ultimately become the bottleneck, which is again common sense. Also, the fraction x is measured before the enhancement is applied. That is very important; it is not measured after the enhancement, because the enhancement is going to change the fractions. Is it clear?

Amdahl's law applies in various scenarios. Here, of course, the assumption is that x is large; that is why you spend your time optimizing there, because you can see that if x is actually not very large, you ultimately get limited by 1 / (1 - x). So, you look for portions of the program that take the maximum amount of time to execute, and you allocate your design time and resources in proportion to the execution time. As x increases, the achievable speedup goes up for a given y; as y increases, the speedup increases. That is what we mentioned in the last slide.
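As a quick sanity check on the bound, here is a tiny C sketch (the function name is mine) that evaluates the speedup formula and shows the saturation at 1/(1 - x):

```c
#include <stdio.h>

/* Amdahl's law: x is the fraction of the original time that is enhanced,
   y is the factor by which that fraction is sped up. */
static double speedup(double x, double y) {
    return 1.0 / ((1.0 - x) + x / y);
}

int main(void) {
    /* For x = 0.9, the speedup saturates at 1 / (1 - x) = 10 as y grows. */
    for (double y = 2.0; y <= 1.0e7; y *= 100.0)
        printf("x=0.9, y=%9.0f -> speedup %.4f\n", y, speedup(0.9, y));
    printf("upper bound 1/(1-x) = %.4f\n", 1.0 / (1.0 - 0.9));
    return 0;
}
```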
Amdahl's law is usually used to compare design alternatives, that is, to decide which design would bring more performance. So, let us take a very simple example. Floating point square root is critical in graphics applications, and two design choices exist: you can add a floating point square root unit to improve square root execution by ten times (that is one design choice), or you can improve all floating point instructions by two times. Suppose you have these two choices; which way would you go? Suppose floating point square root takes 20 percent of the execution time (that is the portion the first choice improves), while 50 percent of the time is spent in all floating point instructions in the current processor.

So, which design choice is better? Let us work this out. In choice one, let t1 be the new execution time. Here we are focusing on 20 percent of the execution time, the floating point square root, and we want to speed that up by 10 times. So, 0.8 remains unchanged and the remaining 0.2 gets scaled down by 10, which gives t1 = 0.8 + 0.2/10 = 0.82; I am assuming the original execution time was 1. Is it clear to everybody? That is choice one. What is choice two? Now we are looking at the 50 percent of time that we can enhance. The other 50 percent remains unchanged, and the enhanced half becomes 0.5/2 = 0.25, so t2 = 0.5 + 0.25 = 0.75. So, the second choice is clearly better, and I should actually go ahead and try to improve all floating point instructions.

Of course, in this case what happened was that the bigger chunk won: 50 percent of the time was spent in floating point operations, and since I focused on that, I got better performance. But the outcome also depends on the two improvement factors, the 10 times and the 2 times; you can easily figure out the value of the second factor at which the two choices get balanced. So, it is not that you can always pick the major chunk of the time, improve it by some factor, and always win. No. For example, if the second factor were 1.5 instead of 2, the two choices would be very close to each other (0.5 + 0.5/1.5 is about 0.83, versus 0.82), and if I go below that, say to 1.2, you will actually begin to see that option one starts to win. You can also plug these numbers directly into the Amdahl's law speedup formula and get the answer without going through this calculation. Clear? Questions?

Amdahl's law can also be used to derive an upper bound on the achievable speedup in parallel computing. This is used very often, given a piece of program, to find out the achievable speedup of that program. So, let us see how you do that. Suppose a sequential program takes time t to run on a single processor. A profiler (for those who do not know, a profiler is essentially a piece of software which takes a program as input and tells you where the time is spent, how much time is spent in which functions, and so on) shows that a fraction s of this time is spent in executing inherently sequential portions of the program. So, that s fraction has no chance of getting parallelized; it is inherently sequential. The remaining time can be perfectly parallelized over an arbitrary number of processors. Suppose you figure that out. So, what does it mean? I can now deduce the maximum achievable speedup on a machine with p processors. How do I do that? The speedup is the sequential time, t, divided by the parallel time. What is my parallel time? The st part remains sequential and the (1 - s)t part gets divided by p, because it gets perfectly parallelized. So, that gives me a speedup of 1 / (s + (1 - s)/p) on p processors. I could actually get this directly by plugging x = 1 - s and y = p into the Amdahl formula as well; I am essentially saying that I speed up that part of my execution by p times, leaving the rest unchanged. So, what does it tell you? Something interesting: in the limit, the speedup gets capped at 1/s. This is the most important part of this analysis. It essentially tells you that even if you have an infinite number of processors, you cannot get a speedup greater than 1/s; it is impossible.
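The same little function settles both comparisons above; the design-choice fractions and the s = 0.05 sequential fraction used below are the ones from the lecture examples:

```c
#include <stdio.h>

static double speedup(double x, double y) {  /* Amdahl's law again */
    return 1.0 / ((1.0 - x) + x / y);
}

int main(void) {
    /* Design choice 1: 20% of time (FP square root) sped up 10 times.
       Design choice 2: 50% of time (all FP instructions) sped up 2 times. */
    printf("choice 1: %.3f\n", speedup(0.20, 10.0)); /* 1/0.82 ~ 1.220 */
    printf("choice 2: %.3f\n", speedup(0.50, 2.0));  /* 1/0.75 ~ 1.333 */

    /* Parallel bound: s is the sequential fraction, the rest runs
       perfectly parallel on p processors; the cap is 1/s. */
    double s = 0.05;
    for (int p = 8; p <= 512; p *= 4)
        printf("s=0.05, p=%3d -> speedup %5.2f (cap %.0f)\n",
               p, speedup(1.0 - s, (double)p), 1.0 / s);
    return 0;
}
```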
Now, if you plug in some values, you may get surprised. Suppose s is 0.05, that is, the program is 95 percent parallelizable. Then the speedup cannot be more than 20. So, you might be surprised: I am running this program on a 32 node machine, it is 95 percent parallelized, and still you will get a speedup of at most 20; you cannot go over that. Of course, this ignores communication overhead and many other costs, so in practice you are not even going to get 20. So, essentially this particular analysis tells you that you need almost embarrassingly parallel algorithms with near-zero communication to fully utilize even a medium scale parallel machine; by medium scale I mean, you know, a number of nodes below 50. Even if you have a 99 percent parallelized program, your speedup is capped at 100. So, you need almost completely parallel algorithms to exploit large scale or even mid-range machines. That is a very, very important implication of this law. I wanted to mention it although we will probably not use this particular thing in this course; it is normally used in courses on parallel programming or parallel computer architecture. But keep this in mind. Any question?

So, the next thing that is important for our performance measurement is something called the CPI equation. Again, this is common sense. So, we have seen how to compare two processors, and we know how to decide between possible optimizations by applying Amdahl's law. The next question is: how do we really measure execution time? Because in the earlier examples we simply knew that some percentage of execution time could be enhanced and some could not, but we need to measure that somehow. So, what are the determining factors? Assume that we want to calculate the execution time of a program. Execution time is the number of clock cycles taken to execute the program multiplied by the cycle time: if you take 100 cycles and your cycle time is 1 nanosecond, you require 100 nanoseconds to execute the program. Now, if I expand the clock cycle count, it is equal to the number of executed instructions multiplied by the average cycles per instruction. So, putting this in, execution time becomes instruction count, multiplied by cycles per instruction (that is, CPI), multiplied by cycle time. This is the CPI equation.

The interesting part of this is that the first term is determined by your compiler; the processor has almost nothing to do with it. Your compiler generates your binary and that is what executes on your processor; that determines how many instructions execute. Of course, it also depends on your input to the program, because the input determines which parts of the program get executed. Cycle time is mostly determined by your underlying semiconductor technology; however, it is often influenced by your architecture also. The middle term, cycles per instruction, is influenced mainly by your architecture, and that is where an architect holds power: you can do something to improve your CPI, and that is where we are going to invest most of our effort. We will progressively go through phases of improvements to see how to improve CPI, and we will also mention some of the things that improve cycle time. Often you will find that these two are interrelated: often you try to improve CPI and you end up sacrificing cycle time. So, you have to keep both of these things in mind. Cycle time is also the same as the reciprocal of frequency, in appropriate units; for example, if the frequency is 1 gigahertz, my cycle time is 1 nanosecond. Note that execution time depends equally on all three components; it is a product, so they carry equal weight, as you can see.
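Here is a minimal sketch of the CPI equation in action; the instruction count, CPI and clock values are illustrative assumptions, not measurements of any real processor:

```c
#include <stdio.h>

int main(void) {
    /* Execution time = instruction count x CPI x cycle time. */
    double icount   = 2.0e9;  /* executed instructions (compiler + input) */
    double cpi      = 1.5;    /* average cycles per instruction (architecture) */
    double cycle_ns = 1.0;    /* cycle time of 1 ns, i.e. a 1 GHz clock */

    double time_s = icount * cpi * cycle_ns * 1.0e-9;
    printf("execution time = %.1f s\n", time_s); /* 2e9 * 1.5 * 1 ns = 3 s */
    return 0;
}
```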
Each component can be improved to reduce the execution time. Reducing the instruction count of a program normally depends on the instruction set of the processor and on the smartness of the compiler. Let me take these one at a time and first try to explain the first part, that it depends on the instruction set of the processor; that needs some explanation, and we will talk about instruction sets in detail very soon. What it means is that the processor supports a certain set of instructions, and they may be very complicated or they may be simple. Now, to do a very complicated operation, one processor may have a single instruction. For example, you can think about an operation like copying a stream, meaning a stream of bytes, from one part of the memory to another part. There may be just one instruction for doing that, even though a lot of copy operations happen within this one instruction. On the other hand, another processor may expose this whole thing to the compiler, saying: well, I do not allow you to copy more than four bytes at a time; you only have a four-byte copy instruction, which copies four bytes of data from here to there. So, now a stream copy essentially gets translated into a loop which copies four bytes at a time until the whole operation completes. So, there is a trade-off here. In the first case, the CPI is probably going to be very high, even though the instruction count is going to be low. In the second case, the instruction count is going to be large, but the CPI is likely to be close to one, or maybe even lower.

Here is another example. Certain processors have separate instructions for doing a comparison, followed by checking the comparison outcome and taking a branch. So, suppose you have a piece of C code with a condition like x <= 2. There are many ways of compiling this. One way could be that you first carry out the comparison x <= 2 in one instruction; that generates some outcome, true or false, recorded in some flag somewhere. The next instruction goes and checks that flag and decides whether to jump or not: if x is indeed less than or equal to 2, you do not jump and simply fall through; otherwise you jump past the body, to the else part if there is one. So, these are essentially a comparison instruction followed by a branch instruction. There are other processors which would actually fuse these two operations together into a single instruction, say branch-if-greater-than. That is exactly what is meant by saying that separate comparison and branch instructions can be fused into one instruction such as BNE or BEQ, that is, branch-if-not-equal or branch-if-equal, depending on the nature of the comparison (here I am showing less-than-or-equal; there we would be talking about x == 2 or x != 2). So, in these two cases, the instruction count of the program will differ, and fusing was possible in the second case only because the processor actually implements one such instruction. That is where this support from the instruction set comes into play: if the instruction set has such instructions, the compiler will be happy to generate them; otherwise it cannot, and it will break the operation down into simpler instructions.
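To make the stream copy trade-off concrete, here is a small C sketch of the second design; the mention of x86's rep movs as the one-instruction alternative is my example, not something from the slides:

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Second design: the ISA only offers 4-byte copies, so the compiler turns
   a stream copy into a loop. Instruction count is high, but each iteration
   is made of simple instructions with CPI near one. */
static void stream_copy(uint32_t *dst, const uint32_t *src, size_t nwords) {
    for (size_t i = 0; i < nwords; i++)
        dst[i] = src[i];  /* one 4-byte load and one 4-byte store */
}

/* First design: an ISA with one complex string-move instruction (x86's
   rep movs is the classic case) does the whole copy with an instruction
   count of one, but with a very high CPI for that one instruction. */

int main(void) {
    uint32_t src[4] = {1, 2, 3, 4}, dst[4];
    stream_copy(dst, src, 4);
    printf("%u %u %u %u\n", dst[0], dst[1], dst[2], dst[3]);
    return 0;
}
```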
Similarly, you can identify simple optimizations that are purely about the compiler, such as ANDing with a mask and checking, instead of shifting, ANDing and checking. So, we are talking about something like this: I want to check whether, let us say, the k-th bit of x is one. There are two ways of doing it. One is to AND x with a mask. So, what is that mask? Can somebody guess? Sorry? 2 to the power k. Does everybody see that? What I want in the mask is a 1 in the k-th position; everything else should be 0. So, I AND with this mask, and now it is enough to check whether the result is zero or non-zero; that serves my purpose. The other way of doing it is that I shift x so that the k-th bit comes to the last position, do an AND with 0x1, and then check. This is more expensive than the first way: it is essentially two instructions, a shift and an AND, followed by a comparison, whereas the first way is just an AND followed by a comparison. So, it purely depends on your compiler's smartness, on whether the compiler can figure out that it should actually emit this sequence and not that one. So, that is about your instruction count. We will talk about some of these things in the first category, that is, how to design an instruction set without increasing your CPI too much, because there is a clear balance between instruction count and CPI. I can make instructions so complicated that the entire execution of this if statement becomes one instruction; that is possible, I can actually do that, but then the CPI is going to be very high. So, there has to be a balance.

The second component is your CPI, and the goal is to minimize CPI. Reducing CPI depends on the processor architecture, on how much parallelism it can extract, and that is exactly what the major portion of this course is about. The frequency of a processor depends on semiconductor technology as well as on processor architecture; that determines your cycle time, the third component. Architectural enhancements such as deep pipelines improve frequency at the cost of CPI; this essentially shows how the last two terms are interrelated. I can reduce my cycle time by designing a very deep pipeline, so that each pipeline stage does a very small amount of work and I can run my clock very fast. But this may increase CPI in many ways; we will talk about that, and I cannot really explain it right now, before we see what a pipeline does. Yes? The number of bytes you can copy at a time, like the four bytes you mentioned, is a property of the processor, so the instruction count also depends on the processor? Yes, that is what I mentioned: it depends on the instructions supported by the processor, so it depends on both. But once your instruction set is fixed, how many instructions a given program executes depends on your compiler; whether the compiler is able to generate the optimal number of instructions depends on the compiler.
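Here is a small sketch of the two bit-test sequences discussed above, counting bit positions from zero:

```c
#include <stdio.h>

/* Two ways to test whether bit k of x is set, counting bits from 0.
   Mask-and-check: one AND, then a comparison.
   Shift-and-check: a shift and an AND, then a comparison. */
static int bit_set_mask(unsigned x, unsigned k)  { return (x & (1u << k)) != 0; }
static int bit_set_shift(unsigned x, unsigned k) { return (x >> k) & 0x1u; }

int main(void) {
    unsigned x = 0xA;  /* binary 1010: bits 1 and 3 are set */
    printf("%d %d\n", bit_set_mask(x, 1), bit_set_shift(x, 1)); /* 1 1 */
    printf("%d %d\n", bit_set_mask(x, 2), bit_set_shift(x, 2)); /* 0 0 */
    return 0;
}
```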
So, here is an example; let us take the same example as before, with some additional data. The current CPU does not support a floating point square root instruction, so floating point square root is emulated in software by implementing some square root algorithm. Does anybody know of an algorithm for evaluating square root? Newton's formula. Yes, Newton's formula; Newton-Raphson. Can you be more elaborate? Who said Newton-Raphson, you? It is basically an approximation algorithm: we start with an initial guess and refine it in further iterations. Can you just tell me how to set up the Newton-Raphson here? How do I set this up? Yeah. So, basically you define it as a polynomial. What is the polynomial in this case? x squared minus c, right? I am trying to evaluate the square root of some c, so I take f(x) = x^2 - c = 0. Newton-Raphson's method is one of the simplest ways of doing this, and there are many other smart algorithms for evaluating square root; we are talking about implementing one of those in software. Clearly, that takes a large number of instructions, which means that what looks like a single floating point square root operation really executes a large number of instructions.

Here is the data. The frequency of floating point instructions is 25 percent, and the average CPI of these instructions is 4. The average CPI of non-floating-point instructions is 1.33. The frequency of floating point square root operations is 2 percent, and the CPI of a floating point square root operation is 20, that is, 5 times the average over all floating point instructions. So, one design alternative was to reduce the CPI of floating point square root by 10 times, that is, bring it down to 2. The other alternative was to reduce the CPI of all floating point operations by half, from 4 to 2. So, which one is better?

The question that still remains is: how do you really obtain these three parameters? We require the number of instructions, the CPI and the cycle time to measure execution time. CPU designers normally use detailed simulators to get the exact behavior of program execution. Simulations can be done at different levels of accuracy; here are a few options. One is called trace driven simulation, where you obtain a trace of the executed instructions and feed the trace into a simulator which essentially simulates the processor. So, the trace goes into the simulator, and what comes out is, of course, the number of instructions (which you can already get from the trace itself, since you have the trace), but you also get the cycles per instruction, and you know the cycle time of the processor. The problem with trace driven simulation is that complex interactions with the pipeline are not possible to model, because you obtained the trace by running the program on some machine. The processor that you are trying to simulate may be different from that machine, but there is no way to model those interactions: the trace of instructions was already fixed by that execution on that machine. We will come back to exactly what you cannot model; the important thing is that trace driven simulation fails to model it, but otherwise it is okay. The best possible option is to do an execution driven simulation. What is that? An accurate model of the processor and the memory system is designed in software, and programs are run on this simulator. Essentially, the simulator can take the binary of the program as input and interpret the binary, meaning decode the binary and actually execute it, just like a machine. So, this is the most accurate and also the most time consuming option. Alternatively, a user can exploit performance counters to get a rough estimate of the time spent in certain code segments and the number of instructions in those segments; the frequency is already known. Today's machines already offer a large number of performance counters to measure performance, and from those you can get an estimate of the time spent in certain pieces of code, among other things.
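Coming back to the Newton-Raphson square root for a moment, here is a minimal C sketch of the iteration we set up, x_{n+1} = (x_n + c/x_n)/2. Each call executes dozens of dynamic instructions, which is exactly why an emulated square root is so much more expensive than a single hardware instruction:

```c
#include <stdio.h>

/* Newton-Raphson square root of c: solve f(x) = x^2 - c = 0.
   The update is x_{n+1} = x_n - f(x_n)/f'(x_n) = (x_n + c/x_n) / 2. */
static double nr_sqrt(double c, int iters) {
    double x = c > 1.0 ? c / 2.0 : 1.0;  /* initial guess */
    for (int i = 0; i < iters; i++)
        x = 0.5 * (x + c / x);           /* one refinement step */
    return x;
}

int main(void) {
    printf("sqrt(2)  ~ %.10f\n", nr_sqrt(2.0, 6));
    printf("sqrt(10) ~ %.10f\n", nr_sqrt(10.0, 8));
    return 0;
}
```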
Static profiling of the program is another important option. You can profile your program to know what types of instructions you have and in what percentages; for example, you can find out information like: I have 20 percent floating point instructions, I have 60 percent load-store instructions, and so on.

Finally, a few principles that you should keep in mind. One is the principle of locality; we will elaborate on that soon. I have already mentioned in the last class that programs are not random pieces of code. It turns out that 90 percent of the time is spent in 10 percent of the code. It is a rule of thumb, an average estimate, and of course there are exceptions, but in general that is what you will find. The simple reason is that most interesting programs have some kind of repetitive structure, either in the form of loops or in the form of recursion. If you do not have any of these, it is essentially a straight-line piece of program doing pretty much nothing interesting. So, any interesting piece of code has loops or recursion, and you probably spend a lot of time executing the loop or the recursive structure; that is how a small piece of code ends up accounting for a large amount of the execution time. This is locality in terms of code: you spend a lot of time in one locale of the code. The same locality principle applies to data accesses also. So, there are two types of locality, whether we are talking about code or data. One is called spatial locality, which means that closely spaced data are accessed close together in time. The other is temporal locality, which says that currently accessed data are likely to be accessed again in the near future. Caches try to exploit temporal locality: what you are touching now gets cached, hoping that in the near future you will be touching it again. Prefetching, on the other hand, exploits spatial locality: what prefetching does is that if you are touching data at address x, it will also prefetch, say, x + 1, x + 2, x + 3, x + 4, betting that since you are touching x, you will soon be touching nearby data also.
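To close, a tiny C sketch showing both kinds of locality in one loop nest (illustrative only):

```c
#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N][N];  /* static, so it is zero-initialized */
    long sum = 0;
    /* Row-major traversal: a[i][j+1] sits right next to a[i][j] in memory
       (spatial locality), while sum, i and j are reused on every iteration
       (temporal locality). Traversing column-first, i.e. a[j][i], would
       destroy the spatial locality and typically run much slower. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    printf("%ld\n", sum);
    return 0;
}
```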