We will talk about MIPS instructions; MIPS actually has a fused compare-and-jump instruction. So, quickly to recap what we have been discussing. We talked about performance measurement, metrics, benchmark applications and a little bit on performance comparison. We talked about the arithmetic mean, geometric mean and harmonic mean. We talked about Amdahl's law and the CPI equation, and looked at a few numerical examples as well. And we finished last time talking about the principle of locality. So, this is one of the things that we will visit over and over, not just in the context of caches, but in many other contexts: the principle of locality. Here, do not think that this data always refers to your memory data; it can be any data for that matter. For example, if you are looking at a sequence of data values, any data values, locality principles may still apply there. So, we will look at such things also. The other principle that will be applied, or at least that we will try to apply, at many places is exploiting parallelism. It is pretty much the mantra of today's computer systems, and you will see parallelism at different levels in these machines. Here are some examples, the simple ones. You may want to have more disks to improve your I/O throughput, you may want to have more memory banks to support parallel data access, you may want to process multiple instructions in parallel, and digital circuits are inherently parallel systems because individual bits get operated on in parallel. You may want to have more ALUs to carry out parallel additions, and finally, speculation is the ultimate solution for extracting parallelism. Here the main idea is to do multiple possible operations in parallel without actually knowing which one is the correct one, while the correctness check proceeds in parallel. So, essentially what it says is: suppose you have this piece of code, with some condition check here. There are three parts to the computation.
One is computing the condition, one is the if block, and one is the else block. So, there are three pieces of computation here. What I can do is run all three pieces in parallel; eventually the condition computation will finish and I will cancel one of the branches. So, effectively what I have done is speed up the whole execution. That is exactly what this is trying to say: I am essentially speculating about which one is going to be correct. I will run both of them and cancel the one that is not. If you are smart, then you will actually figure out which one has the higher probability of being correct, and then you can suppress the other one right there. I will say: I know that with very high chance this is going to be the correct path of execution, so I will not even try to execute the other one. We will learn about these things as well. So, keep in mind that parallelism is pretty much everywhere, and we will look at even more complicated ways of exploiting parallelism inside the processor. Questions? So, here are three things that you should keep in mind throughout this course. Make the common case fast, that is number one: do not spend your time and effort on things which are rare. Smaller is faster, that is a very important lesson. If you have a small cache, it is going to be very fast; as the cache size increases, it will become slower and slower. The same thing applies to logic circuits: small logic is fast, large logic is slow. 90 percent of time is spent in 10 percent of the code, that is essentially your code locality. So before I move on to the next topic, I just wanted to touch upon the research side of this particular field. Recall that we talked about this in the first lecture: computer architecture is about designing online algorithms to exploit application behavior, and that behavior may be static or may be run time.
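The speculative if/else execution described earlier can be sketched in software. This is only a toy analogy of what the hardware does, using Python threads; the function name `speculative_if` and the thread-pool framing are illustrative assumptions, not anything from the lecture.

```python
from concurrent.futures import ThreadPoolExecutor

def speculative_if(cond_fn, then_fn, else_fn):
    """Start the condition check and both branches in parallel,
    then keep only the result of the branch the condition selects.
    The losing branch's result is simply discarded (a software
    stand-in for the hardware 'cancel')."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        cond = pool.submit(cond_fn)
        then_res = pool.submit(then_fn)
        else_res = pool.submit(else_fn)
        # The correctness check runs concurrently with both branches;
        # once it resolves, one result survives.
        return then_res.result() if cond.result() else else_res.result()

# Toy usage: both branches run, only one result survives.
print(speculative_if(lambda: 3 > 1, lambda: "then", lambda: "else"))  # then
```

If one branch is known to be far more likely, you would submit only that branch and fall back to the other on a misprediction, which is the "suppress the other one" optimization mentioned above.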
So, if you know that an application is statically heavy on memory, then you would probably do something in your processor to optimize the memory behavior of that application. Or there may be something that you really cannot tell statically; just by looking at the application you cannot say, and only when the application runs do you discover it. And your architecture should be able to adapt when it sees that behavior, and that is where the online flavor of the algorithms comes into the picture. You have to react immediately when you see certain things happen. So, computer architecture innovations are always driven by the needs of emerging software. That is always the driving force: some new software comes up with some new problem, the architects study it, and they propose solutions so that the software can be sped up. This is guided by a thorough study of benchmark applications and sometimes the underlying algorithms. Computer architects spend a lot of time understanding applications and their behavior, profiling applications, and understanding what they do by understanding their underlying algorithms. Once the application bottlenecks on the current architecture are understood, in terms of performance, power consumption, communication traffic, hardware complexity, and so on, the architect tries to answer a few key questions. So, first of course what you do is take these applications and study them on the current architecture, in terms of performance, power consumption, communication traffic, hardware complexity and so on. And then you ask the following three basic questions. With what minimal hardware changes can I eliminate the bottlenecks? That is the basic question, and this minimal part is very important. You have to be thrifty when you propose something; you cannot just propose something gigantic for solving one problem, that will not be acceptable.
And of course, what property and behavior can I exploit? Because we talked about this: your solution usually exploits certain application behavior. So, what is the property or behavior that I can exploit to solve this problem? Can the compiler help in simplifying the solution? Of course, because not everything can be solved purely in hardware. In certain situations you will probably require help from the software also. The compiler has the advantage of seeing the whole program, which the processor cannot. The processor can only see a certain window of the execution that is currently going on inside it. It cannot see what is coming in the future, but the compiler can see the full code. Of course, it cannot see the whole data, but it at least knows what code is coming in the future. Can I get any help from the operating system? The operating system schedules the jobs on the processors. So, can I get some hint about what is coming next, or can I get some behavioral hints from the operating system about a certain process? Usually these three questions pretty much encapsulate the basic research theme in computer architecture. So, how do you carry this out? The research in this area is empirical in nature. The reason is that any theoretical model amenable to mathematical treatment needs to make unreasonable assumptions, as the underlying interactions are fairly complicated. Coming up with an analytical model for an architecture is extremely difficult without making certain unreasonable assumptions. Which is why experiments are done on a detailed, carefully designed processor simulator, usually written in a high-level language. Although you can go a little lower than a high-level language, to Verilog, which is slightly lower than your C/C++; that is called an RTL (register transfer level) language. So, the simulator tries to model the processor that you are trying to design as accurately as possible.
So, this simulator is essentially a piece of software that models your hardware processor. And your benchmark applications are run on the simulator to gather execution statistics. So, you run the benchmark application on the simulator and you pretty much get to know whatever you want to know, because it is a piece of software designed by you; it is under your control, and if you tell it to report something, it will report it. Often, though, simulating a full application takes an unreasonably large amount of time. You will probably not believe it, but pretty much all benchmark applications that are important will take months to run on any good simulator. If you have to wait for so much time, then of course no innovation will get done in time. So, you need some other, scientific ways of speeding up simulation. What people usually do is find representative regions of the applications, and only those are selected for simulation. So, any suggestion how to pick representative regions of an application? The slide mentions something called SimPoint; you do not have to know about that yet, I will tell you. But any suggestion? Can you think of something? I give you an application, some program, and I ask you: tell me the representative regions of this program. How do you do that? You mentioned that 90 percent of the time, 10 percent of the code runs. So, try to look at which 10 percent of the code runs for the maximum time. I mean, you would run the application on small data, figure out that region of the application, and then simulate that region on realistic data. But that would not be realistic anymore, right? If you run it on small data, everything will fit in my memory.
I will not be able to model page faults. If you run it on even smaller data, everything will fit in my cache; I will not be able to model cache misses. So, yes, you finished your simulation in two minutes, but it is useless. Another suggestion: try to figure out the regions in the code which have maximum complexity. But how do you define complexity? If complexity means the region which takes the maximum amount of time, then that is what was mentioned before, right? Yeah, but you do not have to run the code, you can just see it? Oh, you can do that? How? Just by eyeing the code, you can tell me which part takes longer? To some extent you can: there will be some loops and some calls. But there may be many of those, actually. How do you know which one is going to take longer? There may be 20 loops in your program. Which one is the important loop? Are you saying that the loop that is larger in size will be more complex? It may be sometimes, but not always. And I need a solution which is accurate all the time, right? I cannot say: hey, I am publishing this paper, and ten percent of the paper's results are correct and the others may be wrong. Nobody is going to read my paper, right? Tell me something that is a little more scientific. Another suggestion: every program has paths; we have a graph, so we can try to ensure that each path is covered. We can have data sets such that each path is covered at least once. Right, so usually each benchmark application comes with multiple data inputs, and that is how the data inputs are actually designed, so that they give you more or less 100 percent coverage of the program. But so what? Each of those data inputs will still take probably months to run. I don't know if you are understanding the question that I am asking.
I give you a program and I am asking you to give me a representative region of its execution, on realistic data. So you can think of an execution as what? A dynamic sequence of instructions, right? That's an execution. It may run for months, it may run for a minute, whatever it is. So it's a string of instructions that is given to you. I want representative portions of this, so that I can run a few such portions and tell you that whatever behavior you get from the processor by running these portions is representative of running the full thing. Does this problem sound familiar to anybody? Similar regions in a sequence, no? Nobody has seen such problems? Sorry? Common substrings in a string? It's not exactly common substrings; I'm looking for similar regions. Equivalence classes? I mean, you are kind of saying we make classes of them and then only one member of each class is executed. But what are the equivalence classes here? How do I know which regions are equivalent? You want to find phases in a running program? Right, exactly. I want to find phases in an execution. So he has used a new term; I don't know if you have heard of it. Phases are program regions that usually show similar behavior. So one phase may come here, and then the same phase may repeat somewhere there again, right? I want to pick up all the distinct phases of an execution; then I have pretty much covered everything, right? All possible behaviors. Yes? So you search for similar strings; each phase becomes a string. Yes, it's a string, a substring of this whole string. Maximum repeating substrings, or repeating substrings? I want all such substrings: all repeating substrings, OK? So that's pretty much what the problem is. There is one tool available which you can actually download and read about; you can go and search in Google. It's called SimPoint. I'll tell you what it does. It's a very simple thing.
Who does not know about basic blocks? Raise hands. OK, all right. So basic blocks are regions of code that have exactly one entry point and exactly one exit point. A basic block is essentially a straight-line piece of code: you enter here, you exit there. So what SimPoint does is take this whole dynamic instruction sequence. Let's suppose, just for the sake of example, its length is 100 billion: 100 billion instructions in the whole execution of the program. What it does is ask you: tell me, what is the size of the phase that you want? Suppose you tell it that you want phases of length 1 billion, so each phase should be 1 billion instructions long. Then it chops the sequence up into 100 different parts: 100 billion instructions give you 100 substrings of 1 billion instructions each, OK? Now it looks at each of these 1-billion-instruction substrings and encodes each substring as a basic block vector. So let's suppose this is my 1-billion-instruction substring, all right? If I take the first instruction, it belongs to some basic block in the program, right? Let's suppose it belongs to basic block x. The next instruction probably belongs to basic block y. This one belongs to x again. This one may belong to some other block z, and so on. So if you scan through this, what do you get? For each basic block, you get a count. For example, here I have shown a count of 2 for basic block x. So at the end, you can answer the following question: in this 1-billion-instruction string, how many times did I visit basic block x? That is a number, right? So SimPoint builds a vector from this. The length of the vector is equal to the number of basic blocks you have in your program, and entry i tells you the number of times you visited basic block i in this instruction sequence, all right?
So let's suppose I have n basic blocks in my program. Then each substring becomes an n-dimensional integer vector, OK? Clear till now? So I'm going to get 100 such n-dimensional vectors from my 100-billion-instruction sequence. Now it boils down to clustering these 100 vectors in n-dimensional space. Once I have clustered them, each cluster will contain similar vectors, and I'll pick the vector closest to the center of each cluster. Those will be my representative 1-billion-instruction phases. Is it clear? You may not know how the clustering is done, but clustering essentially gathers the similar vectors together, OK? Is it clear? Now, if you just want to run one representative region of 1 billion instructions, what can you do? Can anybody suggest? Just one. Suppose somebody tells you that my simulator is so slow that I can only run 1 billion instructions, not more than that. What can I do? I have to pick one of these 100, right? Which one should I pick? The most frequent one? What do you mean by most frequent? These are all distinct, right? I have a 100-billion-instruction sequence and I've chopped it up into 100 parts. The one in the center of these vectors? Yes: the one that is closest to the center of all these vectors, OK? Sorry? How does this model the memory behavior? Ah, OK. So yes, here the assumption is that you are what you execute. The piece of code that you're executing should manifest your behavior; that's the assumption here. And it actually holds pretty accurately in most cases: the set of instructions that you're executing roughly tells you your computation behavior, your memory behavior, and many other things. It's a very simple algorithm. You can, of course, improve it significantly, but as a starting point it's a good one. So, representative regions. Now, one thing that we assumed here is a parameter, right?
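The basic-block-vector idea above can be sketched in a few lines. This is a simplified, hypothetical rendition of what SimPoint does, not its actual implementation: real SimPoint normalizes and projects the vectors and uses k-means clustering, whereas this sketch shows only the single-representative variant discussed in class (pick the interval closest to the centroid of all vectors).

```python
from collections import Counter

def basic_block_vectors(trace, interval_len, num_blocks):
    """Chop a dynamic trace of basic-block ids into fixed-length
    intervals and encode each as a basic block vector (BBV):
    entry i counts how often block i was visited in that interval."""
    vectors = []
    for start in range(0, len(trace) - interval_len + 1, interval_len):
        counts = Counter(trace[start:start + interval_len])
        vectors.append([counts.get(b, 0) for b in range(num_blocks)])
    return vectors

def representative_interval(vectors):
    """Single-representative variant: return the index of the BBV
    closest (in Euclidean distance) to the centroid of all BBVs."""
    n = len(vectors)
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / n for i in range(dim)]
    def dist2(v):
        return sum((a - c) ** 2 for a, c in zip(v, centroid))
    return min(range(n), key=lambda i: dist2(vectors[i]))

# Toy trace over 3 basic blocks, intervals of length 4.
trace = [0, 1, 0, 1,  2, 2, 2, 2,  0, 1, 0, 1]
bbvs = basic_block_vectors(trace, 4, 3)
print(bbvs)                       # [[2, 2, 0], [0, 0, 4], [2, 2, 0]]
print(representative_interval(bbvs))  # 0
```

In the toy trace, two of the three intervals have the same BBV, so the representative is one of those two; the all-block-2 interval is the outlier phase that full clustering would also capture.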
The phase length of 1 billion, which is what we told SimPoint when it asked how big a phase you want. Here you should be careful. These regions should be large enough to capture the different phase behaviors of an application, so they should not be too small, OK? And if they are too large, then of course you lose the whole benefit of doing this: you are now approaching pretty much simulating the whole application. Any question? If you want to read up on this, you can just go and search for SimPoint in Google. So, how does a simulator look? Here is a possible structure of a sequential simulator. This is the main function of the simulator. It would first initialize a bunch of state in your simulator. By the way, this is oversimplified; usually a simulator is several hundreds of thousands of lines of code. It cannot fit on a slide, that is out of the question. But this is just a skeleton. So what you do then is run a while loop, which says while not exit, where the exit flag is usually set by the system call emulation layer when your program exits. When the program makes the exit call, this exit flag is set; until then, you keep on running. So what do you do? For each core, and here I'm assuming that you have a multiprocessor, you'd go and do something in each cycle. What do you do? This is the pipeline of a processor, which we haven't talked about; we'll soon get there. You'd first retire, where retirement means completion of an instruction: you'd complete any instruction that is pending completion in this cycle. Then you'd go and execute any instruction that should be executed in this cycle. You'd find the instructions that are waiting to read registers and read those registers. You'd issue the instructions that are waiting to be issued in this cycle. You'd decode and rename instructions; we'll talk about this particular phase of the pipeline soon. Decode should be easy to understand. What is rename?
We'll talk about it. And you'll fetch any instruction that should be fetched in this cycle. These are distinct pipeline phases. You do this for each core, and then you increment the cycle count. That's one cycle of your simulator. And when the exit flag is set, you have essentially simulated your processor pipeline for this program, and you exit. We'll talk about these pipeline stages very soon; for now, just think about them as some phases of execution. An instruction starts at fetch and leaves the pipeline at retire. But you go upside down; shouldn't it be fetch, decode, issue? So, I was actually trying to avoid this question right now, but since he has asked: what he's pointing out is that an instruction really starts here, right? It is fetched, then decoded, then issued, then it reads its register operands, executes, and then completes. Why am I putting the stages in the opposite order? What would have happened if I put them in the other order? One answer: sir, while retiring, you're creating space for the instruction that will get executed, which will fill up the gap. So basically, I'm draining the pipeline from the end and shifting everything along by one. OK. So if you do it in the other order, if you do the fetch first, then there's already some instruction blocking the resource that the fetched instruction needs; you can't overlap the two. So you're talking about freeing up the space: you're assuming that there is some space, some resource, allocated for the instruction to be fetched, and you have to first empty it and then put the new instruction in. OK, I could accept that, but there is a more important reason why I don't do it. If I put fetch first, this whole thing upside down, what will happen? I fetch an instruction, and when I invoke decode, I'll decode that same instruction in the same cycle. But that's wrong. I shouldn't be decoding it in this cycle; I should be decoding it in the next cycle.
By invoking the pipeline from the end, I actually avoid that problem. I complete the instructions that have to be completed in this cycle, then I find out what is to be executed in this cycle, I execute them, and so on. And at the end, I fetch. This makes sure that an instruction will not be fetched in a cycle and also get completed in the same cycle. That's not how a pipeline works. We'll come back to this structure again later. There are parallel simulators also. Here, what happens is that each pipeline stage actually runs on a different thread of execution; that's how you get the parallelism. For each core, each pipeline stage will be assigned to a different thread, and the threads will simulate those stages. A clock signal will synchronize the threads, et cetera. I'm going to skip over these things because there are details which you will probably not even follow right now; we'll come back to this later. So, we'll start something new. Any questions before that on the material in the first lecture? You should be reading the text, the first chapter, which has more than what we have discussed in class. It talks about the industry trends and all these things which we have skipped over in the interest of time. This next topic is, again, from your book; it is the first appendix. And it's in the appendix because the assumption is that you know these things already. So, the instruction set architecture is the portion of the architecture visible to the programmer and the compiler, and it consists of the definition of the instructions supported by the machine. Whenever you look at a processor reference manual, this is what that manual usually contains: the instruction set architecture. And it necessarily includes the components that determine the instruction set, that is, the number and types of registers, the data types, the memory addressing techniques, and so on.
At the time of design, different ISAs are compared based on simulations that measure different metrics. At every stage of the design, you'll find that a simulator is used; it is an integral part of the design cycle. Almost 50% to 60% of your design cycle is spent on simulation before you send a design to the factory, because you have to make sure that your design makes sense before you send it. Manufacturing is a very, very expensive matter. You cannot just send a design to a factory for fun and then say, oh, I made a mistake. That could cost your company a billion dollars, actually. So it cannot be done. You spend a lot of time simulating to make sure that what you're proposing makes sense, is correct, and probably brings performance. At this stage, we're talking about instruction set architectures, and of course, we'll see that there are many options. We have already discussed a couple of options in the last lecture regarding branch instructions. So you have to figure out which way to go, and the only way to figure that out is to use simulation. You take your benchmark applications, compile them for multiple different ISAs, run each executable on the simulator, and find out which one runs faster. And it's not just about running faster; it also depends on how complicated your design is. So possible metrics could be code size, that is, when you compile a program for a particular ISA, how big is my binary? That's a very important thing. Complexity of the design: I may have an instruction set architecture that is very compact, but each instruction is overly complicated. It gives you an extremely compact binary, very small, but my design is lousy, gigantic, and complicated. Not done. Backward compatibility, that's very important for business, if you are in the industry designing a new instruction set architecture.
It is important that users running binaries on previous versions of your machine should be able to continue to run those binaries on the new machine. That's backward compatibility. Power consumption: how big an electric bill would my new ISA actually make my user pay? That's another important point. And of course, performance on the benchmarks. So there are various things that you have to balance when choosing an instruction set architecture. And this is a very important decision, because remember that this is the only thing that will ultimately get exposed to the compiler designer and the programmer. If you make a gross mistake here, then however smart a design you have inside, nothing matters, actually. The instruction set architecture is very important. Recall that string copy instruction that we discussed last time: it would require very complicated hardware, whereas offering just one instruction for a byte copy and allowing the programmer to orchestrate the string copy by running a loop of byte copy instructions would, of course, inflate your code size a little bit, but your design is going to be much simpler. So, just to remind you, we have to keep in mind that there are three market sectors which we have to cater to. In the desktop segment, performance is very important; that's all that a user actually cares about in the desktop segment, both integer and floating point performance. Power consumption is equally important: if I install a desktop in my home, my electric bill cannot double; that would be a no go. Code size is of little or no importance; I can buy DRAM cards and install them, so code size is not that important. Also, we don't really run very complicated codes on a desktop as such; that's why the code size is not that large. Backward compatibility is very important in this segment, because I might have bought certain software for my previous computer.
When I get a new computer, that software should continue to run; I shouldn't have to get new licenses, which is a very costly matter. In the server market, for databases and web services, integer performance is much more important; in fact, there is very little floating point computation in such applications. For supercomputing, on the other hand, floating point performance is more important. In the embedded market, cost, power, and code size are all very important, because here we're talking about things like handhelds, your smartphones and so on, where the on-board memory is very small. So you cannot have an ISA that compiles into a large binary; that's not done. You need smart, compact binaries. You need low power so that the battery lasts long, and of course, the design should be low cost. Floating point performance is less important, depending of course on the application; if you're running, for example, various image processing applications on your handheld, floating point performance will of course be important. So keep this in mind when designing your instruction set architecture, because when you design a processor, usually what happens today is that you keep your instruction set more or less unchanged and cater to all three of these markets. So you have to find a common ground for all of them. And then there are digital signal processors and media applications, where real-time performance is important: the worst case performance must meet real-time deadlines. For example, if you're running a codec, it should be real-time; it should not take longer, otherwise the user will get annoyed. Heavily optimized hardware for a small number of frequently used kernels, like the fast Fourier transform or convolution filters, is occasionally found in such processors, and their instruction set architectures often include special instructions to exploit this hardware.
For example, you may often find an instruction for invoking the FFT unit: just one instruction would actually do an FFT for you. Kernels are provided by the manufacturer in the form of a hand-coded library that makes use of these instructions. So here, there is less reliance on compilers, though that is improving. The simple reason is that, given a piece of code, the compiler will have a very hard time figuring out that, oh, this is an FFT. That's very difficult, actually. It's an unstructured piece of code that somebody has written; how does the compiler figure out that it's an FFT? That's why the vendors provide the libraries for you. You just have to invoke the library function, which has already been compiled to use this particular FFT instruction. This is a big difference from your desktop and server segments, where we rely very much on the compiler and seldom write hand-coded assembly programs. And today, most desktop processors try to support DSP and media processing instructions; in fact, this has become an integral part of your processors. You must know about MMX and all these things, right? The multimedia extensions in your Intel processors. They all have these instructions. So, we'll start with a classification of instruction set architectures. A stack architecture is essentially an architecture that has arithmetic operations and two other operations to operate on the stack, that is, push and pop. Look at this sequence of instructions here: push A, push B, add, and pop C. Essentially, I want to do an add operation: add A and B, and store the result in C. That gets compiled into this sequence. I push my first operand, I push my second operand, and then I do add. The add instruction has two implicit arguments. What are they? The top two elements of the stack, all right? And the result goes on the top of the stack. Then I pop it, and it goes back to C.
Is the stack architecture clear to everybody? OK? So it only has push, pop, and arithmetic operations with implicit arguments. The next one is the accumulator architecture, where you have an implicit accumulator operand. The same operation, C = A + B, gets compiled as follows. You first do load A; that brings your operand A into the accumulator. Then you do add B, where one operand comes from memory, where B is stored, and the other operand comes from the accumulator, which holds A. The result remains in the accumulator. Then finally you say store C, and the operand is picked up from the accumulator and sent to memory location C. So, in the stack architecture the implicit operand was always the top of the stack; here the implicit operand is always the accumulator, OK? The third one is the register memory architecture, where your operations can take either a register operand, a memory operand, or both, OK? So load R1, A: you load the value of variable A into register R1. Here A is a memory address and R1 is a register. Then you say add R2, R1, B. We always write the destination first and the two sources next. So this adds R1 to the memory operand B and puts the result in R2. This is a combined register memory operation: one register argument, one memory argument, and the result goes in a register. And then finally I say store R2, C, which picks up the source from R2 and puts the result back to a memory location, all right? So here we have explicit memory operands in ALU instructions, OK? And finally the register register architecture, where the only difference from the previous one is that no ALU operation can operate on memory. Otherwise it's exactly the same. So here you also have load R1, A, then load R2, B, and then add R3, R1, R2. The add operation here cannot take any memory operand.
They all have to be registers. And then you say store R3, C. The result is always stored in a register, and then you have to pick it up from there with the store operation. So here you have a uniform format for all ALU operations; this is also known as the load-store register architecture. The basic distinction is the type of internal storage accessible to the ALU. If you look at all these four, this is really the basic thing that differs. What does the ALU access when it does an operation? Here it accesses the stack top, the top two or the topmost. Here it accesses the accumulator. Here it accesses the register and the memory. And here it accesses only registers. And that's where the name comes from: what type of storage the ALU has access to. Any question on this classification? All right. So today your x86 ISA belongs to this category, register-memory. It has a lot of register-memory operations. We'll look at something called a RISC architecture that only supports this register-register style. And to contrast with that, we'll talk about CISC also, which is your x86. So we'll come to that description soon. So the load-store register ISA, or the register-register ISA, has become very popular. And the simple reason is that registers are faster than memory. So you can do very fast ALU operations. And as a result, there is a push to provide more general-purpose registers so that I can do more ALU operations on registers. Registers are easy to handle for the compiler compared to, say, a stack, for example. So you should go back home and try to generate stack code for this particular calculation. Try to see what kind of code you get. Intermediate values can be held in registers for fast access. That's a big advantage, actually. Whereas in a stack architecture, you have to keep on manipulating the top of the stack to make sure that your intermediate values are not overwritten. Registers reduce memory traffic. The same type of ALU instructions take an equal amount of time to execute.
So it's easier to pipeline and optimize for the compiler. We'll come back to that later. So that's why compiler designers just love the register-register architecture for this reason. And there is a rule of thumb for compiler designers: don't give too many options to the compiler, because the compiler will get confused about which one to pick. So here it's easy. Everything is the same. All ALU instructions will take an equal amount of time. So the compiler doesn't have to debate about, oh, I can do this particular operation in two ways, which one to pick? Well, you pick any one. It's the same, right? The same time. So in this architecture, one important thing to figure out is how many registers to support. And it depends very much on how good the compiler is. So a smart compiler will probably be happy with a small number of registers because it can allocate the registers in a smart way. Whereas a lousy compiler will keep on complaining. You give it 100 registers, it will say, oh, I need more. So it's a very tightly coupled thing, where you sit down with the compiler designer team and decide how many registers are actually needed. Most compilers reserve some registers for parameter passing to functions, for the procedure return address, the global memory pointer, et cetera. So we'll elaborate on this when we talk about MIPS instructions. So here is some old data, very old actually. It dates back to the 2004-2005 time frame when the AMD Opteron was coming out. So AMD Opteron made a change to their number of registers compared to their previous Athlon processor. So they had to figure out how much to increase. So here what you can see is, here I have several benchmark applications. These are from the old SPEC benchmark suites. And the Y axis shows the percentage of functions that require n registers. So for example, in GCC, if you look at this portion of this bar, it says that about 70% of functions are happy with less than eight registers.
That's quite a lot, all right? So you can see that you seldom require more than 32 registers, very few functions actually, all right? And 16 to 32 is also very small. So this is the kind of data that you normally require to decide how many registers to support in a machine. So essentially what they did was that they simulated a processor with different numbers of registers, ran these applications, and generated this data. So now it becomes very clear what you should do. You can do a cost-benefit analysis based on this data, right? So in most cases I would be pretty much happy to have 15 registers, right? 15 registers give very good coverage. So continuing with the classification, the number of operands in the ALU instructions is also important; it can be two or three. Some of them could be register operands while the rest are memory operands. So this is not just about your register-register architecture; the register-memory architecture is also included here. So if you have zero memory operands and three register operands, in that case you would be having two sources and one destination. This is the most popular register-register ISA. So here I am again classifying your ISA based on the number of ALU operands and their types, all right? The first classification is that you have zero memory operands, three register operands, out of which two are sources, one is the destination. So this is the most popular register-register ISA, found in Alpha, MIPS, ARM, SPARC, PowerPC. So you might be surprised to see that x86 is not there in this list, right? Because it's actually not a register-register ISA. One memory, one register: register-memory, found in x86, Motorola 68K, and one of these two operands is both source and destination, right? So it's a shared source-destination. Two memory, zero register: these are memory operands, found in VAX, an old machine. Three memory, zero register: memory-memory, found in VAX, that's again old.
Advantages and disadvantages. So this has implications on your instruction size, code density, ease of pipelining, execution time, and the number of bits for register encoding. So let's go one by one. Remember that when you are designing an instruction, if you have a memory operand in the instruction, there has to be a way to encode the address of the memory operand in the instruction. That requires a lot of bits. For example, if you want to address four gigabytes of memory, you require 32 bits just for that, right? So clearly your instructions that require memory operands will be large. They will require more space than your register-register instructions. So that's what determines your instruction size, the size of one instruction. And if you have large instructions to do the same operation, you'll have a larger code size also. That's obvious, if we have the same instruction count. And then if you have memory operands inside your instruction, pipelining becomes difficult. We'll talk about this soon. And that will impact your execution time, whereas register-register instructions are easier to pipeline. Of course, the number of bits required to specify a register in the instruction depends on how many registers you have. For example, if you have 32 registers, you will require five bits to specify a register. So that's the number of bits for register encoding. It's essentially the log of the number of registers, which is usually much smaller than your memory address. So we'll continue from here. So, the question is whether there would be a difference between the number of instructions required in the register-register ISA and the register-memory ISA. Yes, yes, exactly. So the register-memory ISA will require a smaller number of instructions because it would actually be encapsulating an ALU operation and a store operation, or a load operation and an ALU operation. So yes, that's right.
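The bit counts above are easy to check. A minimal sketch, assuming 4 GiB of byte-addressable memory, three-operand ALU instructions, and an arbitrarily chosen 6-bit opcode (the opcode width is an assumption, not from the lecture):

```python
import math

def reg_bits(n_regs):
    # bits needed to name one register: ceil(log2 of register count)
    return math.ceil(math.log2(n_regs))

mem_addr_bits = 32     # 4 GiB byte-addressable memory => 32-bit addresses
opcode_bits = 6        # illustrative only

# Register-register ALU instruction: opcode + 3 register specifiers (32 regs)
rr = opcode_bits + 3 * reg_bits(32)                  # 6 + 15 = 21 bits
# Register-memory ALU instruction: opcode + 2 registers + 1 memory address
rm = opcode_bits + 2 * reg_bits(32) + mem_addr_bits  # 6 + 10 + 32 = 48 bits
print(reg_bits(32), rr, rm)
```

So the register-register ALU instruction fits in 21 bits, while the register-memory one needs 48, and the 32-bit memory address dominates that difference, exactly as argued above.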
So your code density will definitely get affected by that. You probably have smaller code, but there is a trade-off. Per instruction, you'll require larger space, but you may have a smaller number of instructions. It's not very clear which way it will go. So, the question is whether the ISA is the starting point. Yes, the ISA is the starting point. From there, you start designing your processor. Yes, yes. But it's usually a very tightly coupled loop, you know.
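The code-density trade-off can be put in rough numbers with a toy variable-length encoding (all widths are assumptions: 6-bit opcode, 5-bit register specifiers, 32-bit memory addresses; a real fixed-width ISA would come out differently):

```python
opcode, reg, mem = 6, 5, 32

# Register-memory ISA: C = A + B in three instructions
#   load R1, A ; add R2, R1, B ; store R2, C
rm_code = (opcode + reg + mem) + (opcode + 2 * reg + mem) + (opcode + reg + mem)

# Register-register ISA: four instructions, but a compact ALU instruction
#   load R1, A ; load R2, B ; add R3, R1, R2 ; store R3, C
rr_code = 2 * (opcode + reg + mem) + (opcode + 3 * reg) + (opcode + reg + mem)

print(rm_code, rr_code)   # 134 vs. 150 bits under these assumed widths
```

In this particular toy encoding the register-memory code comes out smaller, since its fewer instructions outweigh its wider ALU instruction; with different opcode widths or a fixed instruction size the comparison can flip, which is the "not very clear which way it will go" point.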