Good afternoon, everybody. After yesterday's lecture, I got some feedback that the second half was not super clear. One potential reason was that I used the field names RS, RT, and RD a lot, and people got lost there. So what we will do today is review the second part of yesterday, but in a more graphical way. I will try as much as possible to limit my usage of RS, RT, and RD, and hopefully, at the end of this lecture, you should be able to understand, or at least tell, why we have certain control signals, how every instruction is actually decoded and executed, and how you can very easily add new instructions to the current architecture. Towards the end of the lecture, we will also see how to evaluate the performance of any processor, especially a single-cycle one. And with that, we will finish the single-cycle architecture today. From next week on, it's going to be the more advanced multi-cycle architecture, pipelining, memory systems, out-of-order execution, and so on, and probably, at some point, I/O interfacing and more advanced processor design. So let's get started. A quick review of the main state elements that are required in MIPS. We started off with the program counter, a clocked 32-bit register that holds the address of the next instruction to be executed. You have the instruction memory, where you load your programs as a sequence of instructions. You have the register file, which contains 32 registers, each 32 bits wide, and this holds what we call the architectural state. What is important here is that it has two read ports, so you are able to read two registers out simultaneously, and one write port, so you are able to write into one. This configuration is there for R-type instructions, where we have three register operands: you read simultaneously from two registers and write the result back into one register.
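That register file, with its two read ports and one write-enable-gated write port, can be sketched behaviorally. Here is a minimal Python model for illustration only; the class and method names are mine, not part of any real simulator, and it ignores timing entirely:

```python
# Behavioral sketch of the MIPS register file: 32 registers of 32 bits,
# two simultaneous read ports, one write port gated by RegWrite.
class RegisterFile:
    def __init__(self):
        self.regs = [0] * 32

    def read(self, a1, a2):
        # Two read ports: both source registers come out at the same time.
        return self.regs[a1], self.regs[a2]

    def write(self, a3, data, reg_write):
        # One write port. Register 0 is hardwired to zero in MIPS,
        # and nothing is written unless RegWrite is asserted.
        if reg_write and a3 != 0:
            self.regs[a3] = data & 0xFFFFFFFF

rf = RegisterFile()
rf.write(17, 42, reg_write=1)   # write $s1
rd1, rd2 = rf.read(17, 0)       # read $s1 and $zero simultaneously
```

Note how a write with `reg_write=0` leaves the register untouched; that is exactly the point of the control signal, as we will see for store word.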
And finally, you have the data memory, which is used to store data that cannot fit into the register file. With these state elements, we will start building graphically whatever we discussed yesterday and see how it actually looks in hardware. We will start slowly with the load word instruction; I will roughly follow the book's sequence now. We start with load word and append whatever we need to these state elements, then we go on to store word and see what changes have to be made to the hardware, and then we slowly work our way up to implementing a jump instruction. By that time, hopefully whatever was not clear yesterday will have become much clearer. And please stop me at any time and ask questions. This is important. So let's start with load word, with the standard syntax: you read memory word 1 and write it into the register $s3. The field layout for an I-type instruction is six bits of opcode, then RS, RT, and the immediate. So now comes the fun part of writing this on the board. We have op, RS, RT, immediate. This end is the LSB, bit zero. The immediate occupies bits 0 to 15, RT is bits 16 to 20, RS is bits 21 to 25, and the opcode is bits 26 to 31. I just want to make sure that all of us know where each field starts and ends, because this is the instruction that is going to come out. Now, the first step once you are going to execute this load word instruction is that the PC sends out the address of this particular instruction, and what comes out is the instruction itself in machine code. That is, you get exactly these four bytes of data out as the instruction. Now, what is the next step that we have to do in load word?
So we have load word here, and the first step is to figure out the effective address in memory that we have to read from. How do we do it? Look at how the assembly maps to the fields. The offset is an immediate value, so it maps to the immediate field. The destination register goes into RT. The base address is determined by the register in RS. And the immediate is a 16-bit two's complement value, present in bits 0 to 15 of whatever you read out. So what is the step to compute the address? You have a base address, and you have an offset. Fantastic. Where is the base address now? It's in the RS field, so this has to go somewhere, and the offset has to go somewhere too; we have to extract both of them. First, consider the base address. The register address is in bits 21 to 25, which means we can simply route those five bits from the instruction directly to the register file's read address port. That's exactly what I do. Simple, right? Now the register file knows which register to read, and the output, the base address, will be ready for you. Now we have to give this base address to someplace, and the immediate is also coming through. So let me mark it again: you have the base address, and you have an immediate. What do we have to do now? You have the output from RD1, and you have the immediate value coming directly from the 16 bits. What do you do? Add, which means you give both to the ALU. Right? Can we do this? Is it correct what I'm doing? Yes? Sometimes when I ask a question, it also means the answer is yes.
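The bit positions we just marked on the board can be checked with a few shifts and masks. A sketch; the helper name `decode_itype` is mine, and the encoded word below corresponds to `lw $s3, 4($s1)`:

```python
# Extract the I-type fields from a 32-bit MIPS instruction word, using
# the bit positions from the board: imm = bits 0-15, rt = bits 16-20,
# rs = bits 21-25, op = bits 26-31.
def decode_itype(instr):
    imm = instr & 0xFFFF          # bits 0-15
    rt  = (instr >> 16) & 0x1F    # bits 16-20
    rs  = (instr >> 21) & 0x1F    # bits 21-25
    op  = (instr >> 26) & 0x3F    # bits 26-31
    return op, rs, rt, imm

# lw $s3, 4($s1): op = 0x23 (lw), rs = 17 ($s1), rt = 19 ($s3), imm = 4
op, rs, rt, imm = decode_itype(0x8E330004)
```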
It's just not always a no. Is it correct? I have to hear an answer. Can you do this? The answer is no. Why? The reason is simple: you have a 32-bit value coming out of the register file, and the immediate here is only 16 bits wide. It's not going to work. What you need is something called sign extension. How many of you actually did the first lecture's reading assignment? Wow. So I don't have to explain sign extension; you all know what it is, right? OK, mixed answers, I'll go through it again. Remember, this immediate is a two's complement, 16-bit number. Let me keep it simple to show why sign extension is needed. Say you have a 4-bit two's complement number; let's start with a positive value, it's easier, say 0110. In two's complement, whenever you look at the MSB, the most significant bit, if it's a 0, it's a positive number; if it's a 1, it's a negative number. Now, what do you do to convert this 4-bit number into an 8-bit number? You copy the 4-bit number exactly as it is and simply repeat the MSB value four times in the upper bits. That's it: 0110 becomes 0000 0110. Why does this work? Convert both to decimal: 0110 is 6, and 0000 0110 is also 6. The value is unchanged.
And when you have a negative number, with a 1 in the MSB, you do the same thing: repeat the MSB. Take 1010, which is a negative number. To see its value, invert all the bits to get 0101 and add 1 to get 0110, which is 6, so 1010 represents minus 6. Now extend it to 8 bits by repeating the MSB: 1111 1010. Convert that to decimal the same way and you again get minus 6. You keep the value and you keep the sign. You can try it out. So what we are doing here is simple: the sign-extended immediate takes bits 15 to 0, and those 16 bits are exactly the 16 bits of the immediate value; nothing changes in the LSBs. But the MSB is repeated 16 times for all the upper bits. That's it. And then it goes straight into the ALU: you have two sources, one coming from the immediate and one coming from the register file, and you add them. The ALU control is set for an addition operation, because as we saw, the ALU has a lot of functionality: addition, subtraction, shifting, logic. The encoding 010 means add, and you get the ALU result out. And what is this ALU result? Is it the final result? No, it's the memory address. So it is fed directly into the address port of the data memory. Simple. You have a 0 on one control input here; you will see why it is there when we come to branch equal. All good so far? What else do we have to do? We have the memory address, and the data will come out of the memory. You simply read the data out and feed it back into the register file. Why the register file? Because this is exactly what load word does: you read data from this particular memory address and load it into $s3, through the register file's write data port. And how do we know which register to write? What do we use for that? RT. RT holds the address of $s3, the destination.
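Sign extension in those terms, copy the 16 bits and repeat the MSB into the upper half, is easy to write down. A sketch; the function name is mine:

```python
# Sign-extend a 16-bit two's complement value to 32 bits: keep the low
# 16 bits unchanged and repeat the MSB in all 16 upper bits.
def sign_extend16(imm):
    if imm & 0x8000:              # MSB is 1: negative number
        return imm | 0xFFFF0000   # fill the upper 16 bits with ones
    return imm                    # MSB is 0: upper bits stay zero

assert sign_extend16(0x0004) == 0x00000004    # +4 stays +4
assert sign_extend16(0xFFFA) == 0xFFFFFFFA    # -6 stays -6 in 32 bits
```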
So we simply use bits 16 to 20 of the 32-bit instruction as the address for the write port, exactly as shown here. And since you are going to write data into the register file, you need a control signal, RegWrite, which indicates that you are doing a write operation, not a read: whatever is on this address and this data, you write the data into that particular register. That's it. Understood? Much better than yesterday? Yes. OK, that's the correct answer. But we are still not done yet. We need to increment the PC. So we simply feed the PC output to another adder, add four, because you have to fetch the next instruction, and feed it back as the next state. Pretty straightforward, nothing extravagant. Good? This is all it takes to execute a load word instruction in MIPS. That's it. Now, what do we need to do to modify this datapath to implement the store word instruction? Pretty much everything remains the same, except that now the value to be stored into the memory address is present in RT, which means you are going to read that data out of the register file and write it into a memory address that you compute. So the only change here is that RT is no longer the write-back destination in the register file; it is the second register that you have to read. You see here, it's exactly the same bits 16 to 20 that are routed, now to the second read address port, and then you have the data out. Remember that RegWrite is 0 for this operation, which indicates that no matter what address or write data appears at the register file, the register file is not going to be written. The data on these wires keeps changing in all these operations; we just control what is relevant and what is not using the control signals, right? So we get the data out of $t7; it's present on RD2, and now we have to write this into the memory.
Everything else remains the same: the computation of the offset and base address, the sign extension, the ALU, sending the ALU result directly to the address port of the data memory, exactly the same as load word. The only difference is that you connect RD2 to the memory's write data port, and you also tell the data memory, through a control signal, that it's a write operation. For a load word, this signal is 0, which tells the memory it's a read. That's it; you have implemented store word. Clear? Excellent. Now we take one more step: we are going to implement R-type instructions. Yesterday, we did exactly the opposite: we started with R-type and finished with branch equal and jump. Today, we started with load word and store word, both of which are I-type, and now we will see how to modify the hardware we currently have so that the system can execute R-type instructions. Don't worry about how many changes there are; we will walk through each of them. Take add as an example R-type instruction. Here you have three register operands. The opcode is 0 for an R-type, as you know, and everything happens based on the function field. All of these R-type instructions require two registers from the register file, so you have to read out two registers: if you remember, RS and RT are both source registers. And then you write the result back into the register file. So the difference between an I-type instruction and an R-type instruction comes down to a couple of modifications to how the ALU gets its data and how the register file gets its data. That's it. Let's start with the ALU; it's easier. In the I-type instructions, one of the ALU's sources was always coming from the immediate.
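Putting the load word and store word steps together behaviorally: effective address from base plus sign-extended offset, then a memory read for lw or a memory write for sw. A sketch; the register numbers and contents below are made up for illustration:

```python
def sign_extend16(imm):
    # Interpret the 16-bit immediate as a signed Python integer.
    return imm - 0x10000 if imm & 0x8000 else imm

regs = {9: 0x1000, 10: 0}        # pretend $t1 (reg 9) holds base 0x1000
mem  = {0x1004: 0xDEADBEEF}      # one word of pretend data memory

# lw $t2, 4($t1): address = regs[rs] + signext(imm); RegWrite = 1,
# so the memory output lands in the register file.
addr = regs[9] + sign_extend16(0x0004)
regs[10] = mem[addr]

# sw $t2, 8($t1): same address computation, but RD2 goes to the memory's
# write data port instead; MemWrite = 1, RegWrite = 0.
addr = regs[9] + sign_extend16(0x0008)
mem[addr] = regs[10]
```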
But now we have to modify it so that the ALU is capable of receiving its second source from the second register that we read. Because in I-type, you don't have two registers to read; in R-type, we need two registers to read. So you need a multiplexer there, with one input coming from the sign-extended immediate, so that I-type instructions can be implemented, and the other input coming from the register file's second read port, so that R-type instructions can be implemented. We call this control signal ALUSrc: if it is zero, you are going to execute an R-type instruction, and the appropriate value goes into the ALU. Then you have source A and source B, the ALU control depending on what your operation is, and you get the ALU result out. However, there is no writing to memory here. The I-type store word wrote directly into the memory, and even load word needed the data memory to be accessed to read the data. But in an R-type instruction there is no memory access at all, so you can simply bypass the data memory completely. However, what goes back into the register file needs to be controlled. Because in the load word instruction, what happened? The data written into the register file always came from the data memory. There was no other option, because you were doing a load word: the memory output went straight into the register file with RegWrite set to one. But now you need to be able to write the ALU result back into the register file as well, into the target register, which is RD, the destination register. So what happens?
You introduce another multiplexer and control it with MemtoReg, which simply asks: am I going to transfer data from the memory to the register file or not? For an R-type, we don't transfer data from memory to a register, so it's a zero, which means the value coming out of the ALU bypasses the data memory; the datapath extends straight to the write data port of the register file. Now, how do you know which register to write to? It's present in RD, right? So you feed this back into the register file. Again, this write address can come from two sources. In the I-type instruction, you remember, for a load word, you were using RT as the destination register. But in an R-type, RD is the destination, which means bits 11 to 15 of the instruction also need to be multiplexed in, depending on what type of instruction it is. This is controlled by the control signal RegDst, which, if it is one, indicates an R-type instruction. That's it. Forget about all the other options: RegDst is one if it is an R-type instruction. And hence you get the correct address value straight into A3. With this, you have pretty much implemented most ALU operations, assuming we have all the actual operations implemented in the ALU. Any questions? So we now have load word, store word, and R-type instructions good to go in our MIPS architecture. Can I move on? Let's go to branch equal, OK? Now we go one step further. What is branch equal? What type of instruction is it? There are only three types, R-type, I-type, and J-type, so you have to be right at least a third of the time. It's not J-type, so you have a 50% chance now.
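The three multiplexers introduced so far, ALUSrc, MemtoReg, and RegDst, can be written as plain conditionals. The control values below follow the R-type settings from the lecture; the operand values are illustrative only:

```python
def mux(sel, in0, in1):
    # Two-input multiplexer: sel = 0 picks in0, sel = 1 picks in1.
    return in1 if sel else in0

rd1, rd2, signimm = 5, 7, 100     # register outputs and an immediate

# ALUSrc = 0: second ALU source is RD2 (R-type), not the immediate.
src_b = mux(0, rd2, signimm)
alu_result = rd1 + src_b          # e.g. an add

# MemtoReg = 0: the ALU result bypasses the data memory entirely.
mem_read_data = None              # no memory access for an R-type
write_data = mux(0, alu_result, mem_read_data)

# RegDst = 1: the write address is the RD field (bits 11-15), not RT.
rt_field, rd_field = 8, 9
write_addr = mux(1, rt_field, rd_field)
```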
It's an I-type instruction. OK, let's bring our I-type format back. Branch on equal is an I-type instruction, following exactly the same layout as load word and store word. What happens in branch on equal is that these two registers, RS and RT, are compared, and if RS equals RT, you take the branch. Where do you want to jump? That is specified in the immediate. It's exactly the same 16-bit two's complement value that we talked about. And if you recall, for branch on equal the immediate value does not represent an address; it simply represents the number of instructions the PC has to jump. So if I have to branch to an instruction that is four instructions down, the value is going to be four in two's complement. This is not the final address; what the compiler or the assembler generates is the number of instructions to skip. And it can be negative as well, so you can also go backwards. That's the whole point, right? So how do you get the effective target address? Simple. You take the signed immediate and do exactly the same thing we did for load word and store word: you sign-extend it. Then you multiply it by four, because the immediate is the number of instructions to skip, each instruction is four bytes, and the PC always advances by four. Then this offset has to be added to the program counter; to be precise, to the program counter plus four. You are not jumping relative to the branch instruction itself, but relative to the instruction after it, the current state of the program counter. So you add that offset to the PC-plus-four value and feed the output back. But now we need another multiplexer. Why?
Because the input to the program counter was always coming from the PC-plus-four adder. We never implemented any kind of branch until now, no jump, so the next instruction to execute was always four bytes ahead; you always simply did PC plus four, and that is why this was the only input. But now we need to implement the branch function, which means we need to somehow feed this branch target, PCBranch, back into the program counter, which means we need another multiplexer, which we control using a signal called PCSrc. But PCSrc itself has to be generated, because it depends on two conditions. First, you need the zero output of the ALU to be high: remember, to test RS equals RT you do a subtraction, and if RS equals RT the result is zero, so the zero flag is raised. Second, we introduce a control signal called Branch, which is derived from the opcode and so on; we will come to that when we look at the control unit. The AND of the Branch control signal and the zero flag generates this PCSrc sub-control signal, which selects between plain PC plus four and the newly computed branch address. That's it, and we have branch on equal done. So now you have load word, store word, R-type instructions, and branch on equal. All the control signals that you see here are generated by one single unit, which we call the control unit, and this is the entire architecture that we just designed; I just pushed all the signal generation into one unit. And if you think about it, the only trigger inputs for the control unit are the opcode and the function field, right?
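The whole branch-equal path can be put in one function: sign-extend, shift left by two, add to PC plus four, and select with PCSrc = Branch AND Zero. A behavioral sketch with made-up values; the function name is mine:

```python
def sign_extend16(imm):
    return imm - 0x10000 if imm & 0x8000 else imm

def next_pc(pc, rs_val, rt_val, imm, branch):
    pc_plus4  = pc + 4
    # The offset counts instructions, so multiply by four (shift left by 2),
    # and it is relative to PC + 4, not to the branch itself.
    pc_branch = pc_plus4 + (sign_extend16(imm) << 2)
    zero      = (rs_val - rt_val) == 0     # ALU subtracts; zero flag rises
    pc_src    = branch and zero            # the AND gate
    return pc_branch if pc_src else pc_plus4

taken     = next_pc(0x100, 3, 3, 2, branch=1)   # equal: branch taken
not_taken = next_pc(0x100, 3, 4, 2, branch=1)   # not equal: fall through
```

A negative immediate makes the same path branch backwards, which is how loops are built.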
Whether it's R-type, load word, store word, or a jump instruction, depending on the opcode and the function field you can generate all of these control signals. You know what the cool part of such an architecture is? If I want to implement a new instruction, typically all you need is one additional, admittedly very long, line in the control unit, which generates the appropriate control signals for whatever new instruction you want to implement. We will actually see how you simply write one additional line, or add one control signal, and implement a totally new instruction. Of course, there are some instructions where you need changes to the datapath as well; one example where you don't is add immediate, and we will see that after we go deeper into the control unit, OK? So let's see what's inside this control unit. Most of the control information, as I said, comes from the opcode, but for R-type instructions we also need the function field. So we split the entire control unit into a main decoder, which depends on the opcode, and sometimes you also need the ALU decoder, because for R-type instructions like add and subtract you need to send the appropriate control signals to the ALU to ask it to perform a logical or an arithmetic operation, right? In some sense, this is the logic table that one can implement. If you look at it: if ALUOp is 00 or 01, it's directly an add or a subtract that goes out to the ALU control; if it is 10, the ALU decoder has to look deeper into the function field to identify which operation to execute. Why are these first two directly add and subtract? Why don't we have to look into the function field?
Can you think of a scenario where you don't have to look at the function field but you still have to do an add or a subtract? Yeah? Exactly: load word, store word, and branch on equal, because you have to do an address addition or a comparison subtraction. For all of these, you don't need to look at the function field; you directly know whether the ALU operation is an addition or a subtraction. ALUOp 10 means you are executing an R-type instruction, and 11 is never used, so you don't really need to worry about it. Great. We have seen this slide before: depending on the function field, the ALU decoder produces the appropriate controls for the instructions we implement, like add, subtract, AND, and OR. For all of these, the top two bits of the function field are the same, so for these instructions the decoder effectively only needs the lower bits of F. So you have these two control tables: one generates the ALUOp signal, and the other says that if ALUOp is 10, the ALU decoder starts looking into the function field. You don't have to memorize these tables, OK? The main thing is that ALUOp is 00 or 01 for load word, store word, and branch on equal, and 10 for an R-type instruction. When it's 10, the LSB is a don't-care, and all you do is look into the function field, which tells you add, subtract, shift, or a logical operation, and the ALU control signals are generated appropriately. That's it; this is all you need. Don't worry about too many numbers and digits flying around. It's always a matrix, don't worry.
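The two-level decoding just described can be sketched as two lookup tables: the main decoder maps the opcode to ALUOp, and the ALU decoder either uses ALUOp directly or, for 10, reads the funct field. The control encodings follow the lecture (010 = add, 110 = subtract); the opcode and funct values are the standard MIPS ones, and the table names are mine:

```python
# Main decoder: opcode -> 2-bit ALUOp (00 = add, 01 = subtract, 10 = R-type).
ALU_OP = {0x23: '00',   # lw : address computation is an add
          0x2B: '00',   # sw : address computation is an add
          0x04: '01',   # beq: comparison via subtract
          0x00: '10'}   # R-type: look at the funct field

# ALU decoder for R-type: funct field -> 3-bit ALU control.
FUNCT = {0x20: '010',   # add
         0x22: '110',   # sub
         0x24: '000',   # and
         0x25: '001',   # or
         0x2A: '111'}   # slt

def alu_control(opcode, funct):
    alu_op = ALU_OP[opcode]
    if alu_op == '00':
        return '010'            # lw/sw: always add
    if alu_op == '01':
        return '110'            # beq: always subtract
    return FUNCT[funct]         # R-type: the funct field decides
```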
Good. And before we go for a break: you remember this control unit table from yesterday, where people were like, oh my God, this is so confusing? I hope it's much clearer now. You now actually know, for an R-type instruction, why certain signal values go high, what the appropriate ALU operation is, and so on. So we have seen R-type, load word, store word, and branch on equal. OK, I'll still continue. Let's now look at the big picture and trace, for example, what the datapath looks like for a logical operation like OR. It's a logical operation, and it's an R-type instruction. The data flow starts from the output of the PC: the address from which the instruction is being read. You read the instruction; the 32-bit R-type instruction comes out. You send in the addresses for both source register operands, so the data travels out through both read ports. It's not an I-type, so you don't have to worry about the immediate as a source: ALUSrc is zero, and the ALU performs the operation only on the two source register values. The ALU result comes out. Again, it's an OR operation; there is no memory read or memory write, so the data does not go through the memory. You bypass the data memory via this multiplexer, whose select value is zero, because MemtoReg is zero, and MemtoReg zero means there is no transfer of data from memory to a register.
So you choose that value, and it goes all the way back to the register file's write port, because in an R-type you have a destination register; you work on three registers, and the write address is taken from RD. Remember, this is the RD field of the instruction, so you set that multiplexer accordingly and simply use RD as the address back into the register file. That's it. This is a sample, yes? It's for illustrative purposes. Sometimes you can reuse the address; you will actually see how you can reuse the entire thing when we come to multi-cycle. We'll go there, but this way is clearer: unless you want to use both, you don't need two addresses. It's a logical way of thinking about it. So this is the datapath. There is also another data flow here, for sure, which is just the PC plus four; since it's not a branch on equal operation, you don't need to worry about the branch part. So the only other data flow is this one, and pretty much that's it. You have now walked through how an OR operation works. When we come back from the break, we will see how you can, without modifying any of this datapath, simply make some changes to the control unit and implement the add immediate instruction. And we will also see one example, the jump instruction, where you need to add some additional hardware to the datapath. Then we will finish off with performance analysis for the day, OK? Let's take the break now and meet at 2:10. Thank you. Can we start now? Good, OK. So we looked at the datapath for an OR instruction, and now what we will try to see is how you can use exactly the same hardware to implement a new instruction, and what it takes to implement one. For add immediate, there is absolutely no change to the datapath.
So you really don't have to modify or add any kind of hardware to the datapath; you can implement add immediate simply by writing one additional, very long line in the control unit, which takes in the add immediate opcode and generates the appropriate control signals. So let's walk through it. Add immediate is an I-type instruction; the opcode is 001000, it comes in, fine. Now let's walk through each of these control signals and what you think the value has to be. What should RegWrite be? What is add immediate? You have a destination register followed by a source register and the immediate value, so it's an I-type instruction. What do you think RegWrite should be, zero or one? One. Brilliant. It should be one because you are going to write back into the register file. What should the RegDst value be? This chooses whether you pick the write address from bits 16 to 20 or from bits 11 to 15 of the instruction. Remember, bits 11 to 15 hold the destination for R-type instructions. Add immediate is an I-type instruction, which means you have to choose bits 16 to 20 of the instruction, which means your RegDst should be zero. Right? Then ALUSrc: what do you think it should be? Should the second operand come from the immediate or from the register file? It should be one, because it comes from the immediate; it's an add immediate, so you have to feed an immediate value into the ALU. What should the Branch value be, zero or one? Are we going to do branching? No, so it's going to be zero. What should the value of MemWrite be? MemWrite goes down to the data memory. Are we going to do any kind of writing into the data memory? No, zero.
MemtoReg: are we going to transfer data from the data memory into the register file? No. And finally you have ALUOp, which is somewhere here, and it has to be 00. Why 00? Why are we not looking into the function field? This is exactly what you told me earlier; potentially if somebody has opened the slide, they read it out. So why are we not looking into the function field to figure out that it's an add operation? A simpler answer? Exactly: because it's not an R-type instruction. You look into the function field only for an R-type, yes? So you have 00, because you do an addition for add immediate, and that's it. You write a control unit line that, with this opcode as input, generates these control values, and you have add immediate implemented in your MIPS architecture. However, it's not always like this. Suppose you want to implement a jump instruction; then you have to modify the datapath. Remember this slide, the super confusing one? No? Brilliant. OK, so for a J-type instruction, the instruction looks like this: you have a six-bit opcode as always, and then a 26-bit address field. And how do you calculate the target address? You take the upper four bits from the incremented PC at the front, append two zeros at the end, and put the 26-bit field in between. This is pretty much what you have to do to generate the target address for the PC, and this you cannot do currently, because we don't have the hardware for it. So what we have to do is include additional hardware that does exactly that. You read the instruction out exactly as before, but instead you take the 26 bits of the address field and shift them left two times, which means you multiply by four to get the real byte address for the PC. Then you simply pass it on as PC-prime, because remember, for a jump address there is no ALU involved. You just concatenate all the values.
It's all a concatenation: the four MSBs of the incremented PC, then the 26-bit address that is present here, and the last two bits are zero. You concatenate all these and you get the new address. And this is pretty much why you don't see any ALU or adder here; you just feed it back into PC′. You also need a new control signal, because you now have to choose between the address coming in from the branch path, the address you get by simply incrementing the PC by four, and this new PC address coming from the direct 26-bit jump address. And with this we have implemented in MIPS the functionality for jump, add immediate, load word, store word and all the R-type instructions. You can do wonders with it. And for the jump instruction, you don't care about most of the control signals. You introduce the new control signal, which is going to be one. The only two things you have to make sure of are that MemWrite and RegWrite are zero, because you're not going to write anything into the data memory or the register file. What is also important: whenever I highlight a datapath for some operation, for example the datapath for an R-type operation, it's not the only part where data changes. In all the other parts of the datapath that you see, there are also going to be changes. It's just that, based on the control signals, we control which data propagates and becomes relevant. If you put a probe on some random wire here, you will see, in the testbenches you write in a couple of weeks, that all the other signals are also jumping up and down. It's just that they're no longer relevant, because you are controlling what is relevant for you using these control signals. So don't think that all the other values are going to be static and that there should not be any change.
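The jump-target concatenation just described can be sketched in a few lines. This is an illustrative sketch following the standard MIPS J-type encoding; the function name and arguments are my own, not from the lecture.

```python
def jump_target(pc_plus4, instruction):
    """New PC for a J-type instruction: the top 4 bits of PC+4,
    the 26-bit address field shifted left by 2, and two zero bits."""
    addr26 = instruction & 0x03FF_FFFF       # low 26 bits: the address field
    return (pc_plus4 & 0xF000_0000) | (addr26 << 2)

# e.g. with PC+4 = 0x00400004 and an address field of 0x00100000,
# the jump lands at 0x00400000 -- no ALU or adder needed, just wiring.
```

The shift by two is the "multiply by four" from the lecture: instructions are word-aligned, so the two lowest address bits are always zero and don't need to be stored in the instruction.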
No, these will all be changing; it's just that they are not going to propagate, that's it. So with that we finish the MIPS single-cycle architecture, and what we will see in the next couple of minutes is how you typically evaluate the performance of a processor. Any questions so far with all this? So you now have the entire single-cycle architecture ready to go, good? So how do you define a processor's performance? You want to know how fast your program runs, right? An interesting story here: if you bought a computer 10 years ago with either an Intel or an AMD processor, Intel processors were sold at, say, 2.4 gigahertz, while AMD was selling their processors with a clock rate of one and a half gigahertz. So people typically thought Intel was better than AMD. But for some operations AMD was much better than Intel, even at a much lower clock rate. So it's very difficult to judge the performance of a processor based simply on the clock speed or any other single number. What you really have to do is run the program on the system to figure out which one performs better. Because you never know: some architectures use bigger instructions, so the clock cycle is slow, but within that clock cycle they do the work of a whole bunch of instructions. Other architectures push the clock frequency very high, but then implement only one tiny operation within that tiny cycle. So you never know what the end performance of the processor is. So typically what is done, or how one should do it, is to use what we call benchmarks. If you're playing a lot of games, for example, you'll have these gaming benchmarks that represent how your gaming software is going to run on your processor.
And you simply run it, you test which one is faster, and then you know which one is the better buy. So you really cannot quantify the performance of a processor so easily. But in general, what you need to understand is how fast my program is; that's it, right? Each program is a set of instructions, and every instruction, as you saw with add immediate or load word, is implemented in hardware, which means the time it takes for these instructions to execute depends on several factors, starting from how your hardware is implemented and what technology you use, CMOS or otherwise. Transistor technology has grown so much that today you can pack billions of transistors into a very tiny area. All these things add up to improve the performance of the hardware, and you will see how each of them affects the end result when you calculate it. Sometimes instructions can take more than one cycle, but for us, in the single-cycle architecture, the cycles per instruction will always be one. When you go to multi-cycle and pipelining, you will see how this changes, but for now, this is always going to be one. And how much time is one clock cycle? You say one cycle per instruction, but we don't yet know what this one cycle is as an effective time duration. What determines this one clock cycle is how long the instruction takes to actually execute, and this is typically determined by the critical path. What is a critical path? A critical path is the longest path the data has to travel from one state register, where you have a clock, to another state register, where your signal has to be ready. For example, if you go back to, let's say, the R-type functionality, you have all these datapaths along which the data is going to travel.
And the critical path here would be something like this: you start when the clock signal goes up, you have the delay between the clock edge and the PC value coming out, and then the data has to pass through all these blocks. Of course, some of it passes through other paths as well, but the main path would be this one: going through here, through here, because you have an additional multiplexer here, all the way to here, ready before the next clock edge comes, such that you have the data written into the register file. If the data along this whole path arrives after the next clock edge, the system is not going to work. And this delay in time units depends on a lot of aspects. Let me just go back to this slide. So what happens here is that you have a state register, a clocked element, and you can see this pattern in almost all circuits that you're going to design from now on. Then you have combinational logic, probably followed by another block of combinational logic; you probably have another path with combinational logic feeding into yet another block, and then you finally finish in another register. You can see this pattern in almost every design that you make. In the R-type datapath, you started with the PC and you slowly ended up at the register file's write port. You will see that this pattern repeats everywhere. And the critical path for this kind of circuit would potentially be this one. So, if you want to know what affects the critical path of such a circuit: it depends on the wire delays, so how your wires are implemented.
It's going to depend on the combinational logic gates: the technology you use to implement them, how fast it is, how many transistors and how many gates you are using. So there are a lot of factors here. And on top of that, there are some extra delays. For example, you need the data to be here a couple of time units before the clock signal rises. Let's say this is my clock, this is a rising edge, and my data D arrives at exactly the same time as the clock edge; it might not work, because you need the logic circuit to stabilize, which basically means the data has to arrive a little bit earlier than the next clock edge. This is typically what we call the setup time of a register: there is a requirement on how much earlier than the next clock edge your data has to arrive. Similarly, there is also a delay from the moment the clock edge kicks in to the moment the data comes out. For example, if you consider this to be the program counter's output, from the rising edge of the clock to the actual address being available on this data line, there is going to be a delay. All these delays add up and make the critical path delay, okay? So now you know how important the design of the circuit itself is, and based on this, you will be able to determine the processor's performance. Basically, if you have a set of instructions, how fast your program executes depends on this formula: you have N, the number of instructions in the program; you have CPI, the cycles per instruction, which is one for a single-cycle architecture; and you have one over f, where f is the frequency of the clock, which you can typically find from the clock period.
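The timing constraint just described, where register outputs change a clock-to-Q delay after the edge and inputs must settle a setup time before the next edge, can be written as a simple inequality. This is a minimal sketch with made-up numbers, not real component data:

```python
def cycle_works(t_clk_ps, t_clk2q_ps, t_logic_ps, t_setup_ps):
    """A clock period is feasible only if the register's clock-to-Q delay,
    the combinational logic delay, and the setup time all fit inside it."""
    return t_clk2q_ps + t_logic_ps + t_setup_ps <= t_clk_ps

print(cycle_works(1000, 30, 875, 20))  # True: 925 ps of delay fits in a 1 ns cycle
print(cycle_works(900, 30, 875, 20))   # False: the data misses the setup window
```

This is exactly why you cannot clock the single-cycle processor arbitrarily fast: the sum of those three terms along the worst path sets the minimum clock period.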
So the simple inverse of the period gives you the frequency, and this period typically depends on your critical path delay, okay? You can calculate this time from the critical path, and this is pretty much what you have to calculate if you want to know how fast your program is going to execute. Now, how can I make the program run faster? Reduce the number of instructions, of course. And how do you do that? Potentially by implementing a different processor with much more complex instructions. For example, as I said a couple of weeks ago, Intel has an instruction to simply move strings: you can say move string, and it will copy a whole bunch of memory locations to another set of memory locations. If you wanted to implement such an instruction using this MIPS architecture, you would need something like a hundred instructions, I'm just making up a number, while the Intel processor does it in one instruction. So that's one way of making the program run faster. Of course, you can also use better compilers to optimize the number of instructions you generate. Then: fewer cycles per instruction. You will not be able to appreciate this now, but when you go to multi-cycle and pipelining, you'll be able to appreciate it much more when we talk about parallel ALU units and so on. And of course, the last option is to increase the clock frequency. Again, increasing the clock frequency means you have to redesign the critical components in the critical path such that you can optimize and actually increase the frequency without introducing timing violations. Easy? So let's look at one simple example, load word, and figure out what its critical path is, right?
Because load word is a classic example of potentially one of the longest paths: you're going to read from the instruction memory, read from the register file, go through the ALU, read from the data memory, and write back into the register file, okay? So we start here. The clock signal goes high. There is a time delay between the clock edge and the PC value coming out. You pass through the instruction memory; there is some delay in reading from the instruction memory. You pass through the register file; again, you have a delay in reading the correct values from the register file. Then you pass through the ALU; you have the delay of the adder itself. Then you have to go to the corresponding memory address, so you have a delay in reading from the data memory. You pass through a multiplexer here, and then you go all the way back and write into the register file. Of course, you will also find a datapath from here to here, but that's not critical; it's much faster than this whole path. So this is what you should say when somebody asks you what the critical path of a load word instruction in this architecture is. All the other small paths, like from the instruction through the sign extension and this multiplexer to the ALU, are much faster than passing through the register file. Something you should know is that the limiting delays are almost always the memories, the ALU, and the register file; these have the maximum delays. Now, don't get scared about this formula; I'll walk you through it again. In the critical path, you have t_pcq, which is basically the clock-to-Q delay of the PC register: the delay between the clock going high and the actual PC value coming out.
So that's this time here. Then you pass through the instruction memory, so you add the time it takes for the instruction to come out of the memory. Then you have a max function, which takes the maximum of the register-file read time and the sign-extension time plus the multiplexer time; depending on which of the two paths is slower, the max function picks it, but typically the register-file read is the slower one. Then you have the ALU delay itself. Then you have another memory access, the data memory, so another t_mem. Then the data passes through the multiplexer, so you have a t_mux delay. And finally you have the register-file setup time, which I was talking about: this entire result comes back, and the data has to arrive a couple of time units, typically a couple of nanoseconds, before the clock goes up. That's the setup time, as I was saying before, okay? And this is the critical path delay of your circuit. Of course, you can simplify it by combining the two t_mem terms, and you get something like this. Typically the register-file read has the larger delay, so we don't worry about the other option in the max. And this is what you will have to calculate. So assume these are the values; there might be a question like this in the exam. You will be given some time delays and some circuitry, and you will have to figure out what the critical path is, calculate the time, and then answer how fast a clock you can run the circuit at. For example, here you have all these values in picoseconds, exactly the parameters we saw on the previous slide. When you substitute them, you end up with a critical path of 925 picoseconds. This is a time delay. And in order to get the clock frequency, you simply invert this value.
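Putting the whole load-word critical path together: the picosecond values below are illustrative assumptions chosen in the style of the slide's example (they reproduce the 925 ps result), not measurements of real hardware.

```python
# Illustrative component delays in picoseconds; treat the numbers as assumptions.
t_pcq     = 30    # PC register clock-to-Q delay
t_mem     = 250   # one memory read (instruction or data memory)
t_rfread  = 150   # register-file read
t_sext    = 50    # sign extension (assumed; loses the max() below)
t_alu     = 200   # ALU delay
t_mux     = 25    # multiplexer delay
t_rfsetup = 20    # register-file setup time

# lw critical path: PC -> instruction memory -> max(RF read, sign-ext + mux)
#                   -> ALU -> data memory -> result mux -> RF setup
t_critical = (t_pcq + t_mem + max(t_rfread, t_sext + t_mux)
              + t_alu + t_mem + t_mux + t_rfsetup)
print(t_critical)                 # 925 ps

f_clk = 1 / (t_critical * 1e-12)  # invert the period to get the frequency
print(round(f_clk / 1e9, 2))      # 1.08 (GHz), the maximum clock frequency

# Program-level result: 100 billion instructions at CPI = 1
exec_time = 100e9 * 1 * (t_critical * 1e-12)
print(exec_time)                  # 92.5 (seconds)
```

The same three lines at the end are the N × CPI × (1/f) formula from the lecture, with the clock period taken directly as the critical path delay.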
So one over 925 picoseconds gives you the approximate maximum frequency at which your clock can run. Easy? Good? So let's walk through a simple example: 100 billion instructions executing on a single-cycle processor. The execution time is the number of instructions multiplied by the CPI and the critical path time. When you substitute everything in, 100 billion instructions, a CPI of one because it's a single-cycle architecture, and a critical path delay of 925 picoseconds, you get an execution time of about 92.5 seconds. And you will typically be asked to calculate this execution time in addition to the clock frequency. It will sound a bit ridiculous now, but a lot of people were not able to calculate the maximum clock frequency at which the circuit can be run. They calculated everything else perfectly fine and got the 92.5 seconds, but somehow, for some reason, the clock frequency was a problem. It's very simple: you just do one over 925 picoseconds, which gives you something in the order of a gigahertz, or probably hundreds of megahertz, and that's your clock frequency. So it's an easy two points, okay? And that actually brings me to the finish. We learned how to determine the performance of the processor from the instruction count, the CPI, and the clock speed. We slowly added instructions to the single-cycle architecture: we started with load word, moved to store word, then we did the R-type instructions, branch equal, jump, and add immediate, and we eventually have a processor that you can use. With that, from next time it will be Professor Onur Mutlu and Professor Srdjan Capkun, but that's it from my side. Thank you very much, see you.