So last time we were discussing a pipeline model where we have a bunch of ready instructions, we make a selection out of them, and we execute them. This is the pipeline that we came up with. Every cycle you fetch instructions, you decode them in the order that the compiler has presented to you, and you allocate them in that order in an issue queue. An instruction may then sit in the issue queue waiting to be issued; when the instruction becomes ready it takes part in a selection algorithm, and the selection algorithm may or may not select a ready instruction in each cycle. Eventually the instruction will issue, read the register file or the data from the queue slots of its parents, execute, go through the memory stage if it needs to, and put the result back in its queue slot. Finally, when the instruction comes to the head of the queue, it will write the result back to the register file and also wake up any dependents. We discussed why we need one wake-up cycle there; of course that is just one solution for resolving the races, and there may be many other solutions.

Before moving on, I just wanted to touch upon a few points. Do you have any questions? You will often find that these two stages are not mentioned in the pipeline of an instruction. The simple reason is that the number of cycles an instruction may have to wait in the issue queue is non-deterministic, because it depends on when its immediate predecessors actually execute; until then the instruction is not ready. So it is not as if there are exactly two cycles of waiting here, which is why these two cycles are often not mentioned in the pipeline. The understanding is that there will be a non-deterministic amount of delay between allocate and select, and that is implicitly understood. Similarly, this wake-up cycle is often not mentioned; the reason is the same, namely that this particular wake-up cycle essentially gets clubbed into however many cycles the instruction has to wait in the issue queue. So that is one point: there will be a non-deterministic amount of delay from the time the instruction enters the queue to the time it gets selected, and it depends on the selection procedure and on when the predecessors execute.

The second point is about when the wake-up happens. The way the wake-up works is that an instruction communicates its slot id to everybody in the queue, and they compare it against the slots they depend on; if a comparison passes, they know that the result is ready. Last time we said that this happens when the instruction completes execution, but that is not really needed. You can do it much earlier, so I will remove this, since we do not worry about those waiting cycles anyway. As soon as an instruction gets selected, it issues, and at the same time it sends its slot id to everybody. The point is this: suppose there is an instruction that reads register R, and some instruction before it that produces register R; one writes R and the other reads R. Let us say the queue slot of the producer is Q1 and the queue slot of the consumer is Q2. Q1 will be broadcast to everybody; Q2 has Q1 marked as the producer of R, so the comparison matches, and at that point you get to know that Q2 is ready. Previously Q1 would send its slot id only after it completed execution, and that is not needed. What you actually do is that as soon as instruction Q1 issues, it sends its slot id, and in the very next cycle instruction Q2 may issue. So if Q1 issues here, Q2 can issue in the very next cycle; that is the earliest it can issue. It will read the register file here, read a stale value, and pick up the correct value from the bypass from Q1. Last time we were not making use of the bypass network: we were saying that you wait until the execution is done and only then broadcast, so that the dependent instruction reads the value from the queue slot itself. That is not really needed; you do not have to wait so long. The dependent can issue in the next cycle and the value will be picked up from the bypass.
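To make that concrete, here is a minimal sketch of the select-time wake-up broadcast, assuming a hypothetical issue-queue entry layout (names like `parentSlot` and `broadcastWakeup` are illustrative, not taken from any particular processor).

```cpp
#include <array>
#include <cstdint>

// Hypothetical issue-queue entry: two parent slot ids and their ready bits.
struct IQEntry {
    bool    valid = false;
    int16_t parentSlot[2] = {-1, -1};   // -1 means "no producer in the queue"
    bool    parentReady[2] = {true, true};
    bool isReady() const { return valid && parentReady[0] && parentReady[1]; }
};

constexpr int QUEUE_SIZE = 32;

// Wake-up broadcast performed in the same cycle the producer is selected, so a
// dependent can issue in the very next cycle and take the value off the bypass
// network instead of waiting for the producer to finish executing.
void broadcastWakeup(std::array<IQEntry, QUEUE_SIZE>& queue, int producerSlot) {
    for (auto& e : queue) {                  // one comparator pair per entry
        if (!e.valid) continue;
        for (int s = 0; s < 2; ++s) {
            if (e.parentSlot[s] == producerSlot)
                e.parentReady[s] = true;     // source will arrive via the bypass
        }
    }
}
```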
So these are a few points that we left out last time, and from here we will move on. The purpose of the write-back broadcast is to make sure that no instruction waits indefinitely: once the producer has written back, its dependents need to know that they should read the data from the register file instead, which is why the read-from bit is toggled so that the data will now be read from the register file itself. You could possibly avoid that cycle with some extra machinery, but the complexity would pile up.

So this is the summary of what we discussed earlier. Each instruction goes through eight or nine stages, though not necessarily with that many cycles of latency, because there may be a bunch of cycles here which are not shown; that is what is implicitly understood. There will also be a bunch of cycles from the time an instruction completes execution to the time it writes back, because it may have to wait in the queue for a long time until it comes to the head of the queue. The stages are fetch, decode, allocate, select, issue, register fetch, execute, memory, write back. We also talked about a map table which maintains the register to issue queue slot mapping; it tells you which register is currently owned by which issue queue slot, that is, which slot, and hence which instruction, is producing which register. So it does an implicit renaming: it renames the same architectural register to different queue slots depending on which instruction produces the register. We also talked about how to handle races between allocate and wake-up, and how to handle races between allocate and write back.

We also talked about memory renaming through issue queue slots. Just to remind you of that process: this is my issue queue, and you may have two store instructions, store 1 and store 2. The way stores execute, we said, is that a store is issued when the value to be stored is ready and also the register required to calculate the address is ready; it requires two operands, and when both are ready we issue the store. What the store instruction does is read the value to be stored from the register file as usual, or from another queue slot, or from the bypass, and it also calculates the address in the execute stage, but it does not actually access memory in the memory stage of the pipeline. Instead it stores the address here in its queue slot and stores the value there as well, and the second store does the same thing in its own slot.
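A minimal sketch of that store behaviour, under the assumption that each queue slot carries an address field, a value field, and a done bit (the names and the flat `memory` map below are purely illustrative):

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical fields of a queue slot used by a store: address and value are
// captured at issue time, but the memory write itself is deferred.
struct StoreSlot {
    uint64_t address = 0;
    uint64_t value   = 0;
    bool     done    = false;   // address and value latched, not yet in memory
};

// At issue: the store reads its source value (register file, bypass, or a
// parent slot), computes the effective address, and records both in its slot.
void issueStore(StoreSlot& slot, uint64_t base, int64_t offset, uint64_t value) {
    slot.address = base + offset;
    slot.value   = value;
    slot.done    = true;        // this slot now holds a renamed copy of the location
}

// At commit (head of the queue): only now does the value reach memory, so two
// stores to the same address still update memory in program order.
void commitStore(const StoreSlot& slot, std::unordered_map<uint64_t, uint64_t>& memory) {
    memory[slot.address] = slot.value;
}
```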
So even though these two addresses are identical, these two stores can execute out of order; in fact they can execute concurrently without any problem. Both will compute the address and both will put the value in their own queue slots. However, the writes to memory will happen in order, when each of them reaches the head of the queue, and that preserves correctness. So this is implicitly doing a memory renaming: this instruction is renaming this particular address to its queue slot, and the other instruction is renaming the same address to another queue slot. That buys you concurrency which was not possible before. Now suppose we have a load instruction, let us say somewhere here, and the load requires just one register, the one needed to compute the address. What may happen is that the load's address turns out to be exactly the same as the store's address. So there is a danger that this load, if it is allowed to execute without any regard for this particular store, may get a wrong value from memory, because the store's value is not yet in memory; it is still sitting here in the queue slot. So what we said was that the load execution is decoupled into two parts. When it is ready, it issues, reads its operand from the register file, or picks it up from the bypass, or picks it up from a queue slot, and in this stage it computes the address. Before it accesses memory, it takes this address and compares it with the addresses of all previous stores. If there is a match, something special has to be done; in this case, for example, the value can be taken directly from the store's queue slot instead of going to memory, because that is the value. So memory renaming is also done through issue queue slots.

At the end, what you really want is a very large issue queue to expose instruction level parallelism, because if you have a very large queue, the probability of finding independent instructions increases. But the problem is that large searchable structures are usually slow. That is a well-known fact; I cannot explain it here at this point because it requires many other prerequisites, but it is a big problem. And it is a searchable structure, because you have to search this queue for many purposes: one is figuring out register dependences, another is figuring out load-store dependences, and so on. The way it is usually implemented is with comparisons: you put as many comparators as there are queue slots, so all the comparisons happen in parallel and time is not the main concern. The real problem is the energy consumption; that is the biggest issue, because you are essentially accessing all queue slots and switching that many comparators. So what people do today is distribute the issue queue across the respective functional units. Essentially you say: I will have an issue queue attached to my load-store unit only, a small issue queue which holds only load and store instructions; I will have an issue queue attached to my integer ALU which holds only integer ALU instructions. You are distributing the queue into several parts so that the complexity of the design goes down, and also a load instruction need not be compared against all instructions; you only compare against the instructions that are relevant. So essentially you distribute the search over multiple smaller queues.
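As a rough illustration of that distribution, assuming a simple classification of instructions into a few unit types (the classes and the routing below are illustrative only):

```cpp
#include <array>
#include <vector>

// Hypothetical split of one large issue queue into small per-unit queues.
enum class Unit { IntALU = 0, LoadStore, Branch, Count };

struct DistributedIQ {
    std::array<std::vector<int>, static_cast<size_t>(Unit::Count)> queues; // slot ids

    // Allocation routes each instruction to the small queue of its unit, so a
    // load only ever searches the load/store queue, not the whole window.
    void allocate(Unit unit, int slotId) {
        queues[static_cast<size_t>(unit)].push_back(slotId);
    }
};
```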
So that is one implementation tweak that is done in today's processors, as opposed to a single monolithic issue queue. Let us now try to summarize what fields we need in each issue queue entry; that might also help you remember what exactly needs to be done. Each instruction needs to carry the functional unit id and the decoded output; that has to go in the issue queue entry because when the instruction issues it has to figure out what to do and where to do it. The source register ids and the immediate operand are needed for reading the register file and for computing with the immediate. The destination register id is required for write back. Two parent queue slot ids record which instructions are the immediate predecessors as far as the two sources are concerned, that is, which instructions produce these two source registers before me. Two parent-ready bits tell you whether those two slots are ready; if both are ready, the instruction is ready to issue. There is a bit per source saying whether you should read from the register file or from the parent queue slot, which we also discussed last time: once an instruction has written back, you read the value from the register file as opposed to reading it from the parent slot. You need space for storing the computed value, because the value goes back to the register file only at write back; before that it lives here. The same field is used for storing the store value, since store instructions do not compute anything, and it is also used for storing the predicted branch target, because you have to figure out whether the prediction was correct or not. There is a done bit which tells you the instruction has completed execution. You also have to store the load or store address, for the purpose we just discussed: when a load issues it has to compare this address with every store before it. The computed branch target is stored in the same field. And you need a valid bit for the entry and an exception vector to record whether the instruction has taken an exception. So that is the summary of one issue queue entry; you may require many other small bits here and there depending on the implementation, but these are the main things you need to execute an instruction, wake up its dependents, and so on. Any question?

For the fields in one register map table entry, you need a valid bit, you need a value-ready bit, which we discussed last time for resolving a race, and you need one queue slot id which tells me which slot currently owns this particular register. An instruction is eligible for selection when it is ready. And what does an issued instruction do? It resets its parent-ready bits first of all, because this instruction has already been selected and issued, so you do not need these bits anymore. It wakes up its dependents according to the wake-up protocol: it sets the ready bit in the map table if that entry still matches its slot id, and it compares its slot id with the parent slot ids of all the queue entries. This is what we discussed last time about how to wake up instructions. There may be additional stalls even after your parents become ready, because of interlocks; you may have to implement those, and that is what we talked about earlier, the load interlocks and so on.
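Here is a small sketch that simply collects the issue queue entry fields listed above into one structure; the names and widths are assumptions made for illustration, not any particular processor's format.

```cpp
#include <cstdint>

// Illustrative layout of one issue-queue entry, gathering the fields described
// above; field names and sizes are assumptions of this sketch.
struct IssueQueueEntry {
    uint8_t  functionalUnit;       // which unit executes this instruction
    uint32_t decodedOp;            // decoded opcode / control information
    uint8_t  srcReg[2];            // source register ids
    int32_t  immediate;            // immediate operand, if any
    uint8_t  destReg;              // destination register id, used at write back
    int16_t  parentSlot[2];        // queue slots producing the two sources
    bool     parentReady[2];       // both true => eligible for selection
    bool     readFromRegFile[2];   // per source: register file vs parent slot
    uint64_t resultOrStoreValue;   // computed value, store value, or predicted branch target
    uint64_t memOrBranchAddress;   // load/store address, or computed branch target
    bool     done;                 // execution complete
    bool     valid;                // entry holds a live instruction
    uint16_t exceptionVector;      // nonzero if the instruction took an exception
};
```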
The point about interlocks is this: suppose you are issuing a load instruction in this cycle and the next instruction depends on the load. If that instruction issues in the next cycle and reads the register file, it cannot really get the data at that point, because the data is produced only later. So there may be additional interlock cycles that the selection hardware has to implement; that is what is meant here. As soon as the parents become ready, it may not be enough to issue the instruction; there may be other scheduling constraints.

As an instruction issues, it reads operands from its parent slots and/or the register file, as indicated by the read-from bits, and proceeds to its functional unit. An instruction that completes execution sets the done bit and stores the computed value in its queue entry. A control transfer instruction stores the computed target, because that will be needed for checking the prediction; it also invokes misprediction recovery at this point if the computed target does not match the predicted target. An issuing store instruction only reads the value to be stored and computes the address; these are stored in its queue entry and the done bit is set, which is what we discussed above. The actual store, that is, the access to memory, happens only when the instruction reaches the head of the queue and commits, not at this point.

The execution of a load is more involved, due to memory dependences; I just explained that, and here is a summary. A load selected for issue checks whether all stores before it have their done bit set. If not, it does not issue, and depending on the issue protocol it may keep trying to issue in subsequent cycles. So what this is saying is that if, when the load becomes ready, some store before it is not yet ready to execute, the load has to wait. If yes, meaning all stores before it have already executed and computed their addresses, then the load issues, computes its address, and compares the address and size with those of each store before it. A full match means the starting address and the size are identical: a load reads a bunch of bytes, so there is a starting address and a size, and a full match happens when the load finds a store before it which has the exact same size and the same starting address. That is the easy case, because the load simply picks up the value from the store; it just consumes that value. Ties are broken in favor of the youngest store before the load, which makes sense, because the load must be getting the value from its immediate predecessor. This is called load forwarding. Is it clear? Any question? If there is a partial match, we are talking about a case where the load overlaps only partially with the store: this is the store's address and size, and the load's range covers part of it. This is a problematic situation. Theoretically the load could access memory and merge the values; it could read some bytes from memory and pick up the remaining bytes from the store to prepare the final loaded value. That is possible, but an easier solution is to stall the load and issue it only when all the stores before it have written back. That is lower performance but much simpler, because the merging hardware, which may sound easy to implement, is actually not: I am talking about one store here, but there may be multiple overlapping stores, so how many things will you merge? The merging hardware gets more and more complicated as you think through the cases. So this is often what processors do: they just let the load wait until all the stores before it complete and then read from memory. Finally, if there is no match at all, the load simply proceeds to access memory. That is the usual case: no store intersects the load, so it can read from memory. Is it clear?
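A minimal sketch of that full/partial/no-match decision, scanning prior stores from youngest to oldest; the structure and function names are illustrative, and the "stall on any partial overlap" policy is the simple one described above.

```cpp
#include <cstdint>
#include <vector>

struct StoreInfo { uint64_t addr; uint32_t size; uint64_t value; };

enum class LoadAction { Forward, Stall, AccessMemory };

// Scan prior stores from youngest to oldest (so ties go to the youngest store
// before the load). An exact address+size match forwards the store's value;
// any partial overlap stalls the load until the stores have written back;
// otherwise the load may read memory directly.
LoadAction resolveLoad(uint64_t ldAddr, uint32_t ldSize,
                       const std::vector<StoreInfo>& priorStores, // oldest..youngest
                       uint64_t& forwardedValue) {
    for (auto it = priorStores.rbegin(); it != priorStores.rend(); ++it) {
        if (it->addr == ldAddr && it->size == ldSize) {
            forwardedValue = it->value;            // load forwarding
            return LoadAction::Forward;
        }
        bool overlap = ldAddr < it->addr + it->size && it->addr < ldAddr + ldSize;
        if (overlap) return LoadAction::Stall;     // partial match: wait
    }
    return LoadAction::AccessMemory;               // no match: usual case
}
```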
Finally, when an instruction reaches the head of the issue queue and its done bit is set, it can commit; that is the technical term, an instruction committing, and that is essentially when this instruction writes back. So what happens here? The first thing you do is check its exception vector. If it is set, that means the instruction took an exception somewhere inside the pipeline. What you do then is mark all issue queue entries invalid by making the tail pointer equal to the head pointer, and the fetcher is directed to fetch a special trap instruction that will transfer control to the operating system. So essentially you are removing all instructions after the excepting instruction from the processor and preparing the processor to handle the exception.

The big question is: how do you now fix the register map table? The register map table has been modified at the allocate stage, and its state corresponds to the last instruction allocated, not to this one. How do you fix the table now? It has to go back to the state at this particular instruction, whatever the state was at that point. This is the state of the issue queue; this much data you have. Can I recover the map table from this? That is the problem. What am I talking about? An instruction is taking an exception; the map table has run much farther ahead. Can I recover it from this state? For each instruction I know all these fields. And this is my map table: each entry has three things. For a register, say register R, it tells me which slot was the last one to produce this register, that is, the last instruction to produce this register sits in this particular slot. This field tells me whether that instruction has already executed, so that the value is ready in the slot. And this one tells me whether this particular mapping is valid at all. And to remind you, this valid bit is turned off when the slot writes back, provided the slot is still in this register's map. So how do I recover the entire map table to the state at the excepting instruction, that is, the state when the excepting instruction was allocated? Can I do one thing? This ready bit here is zero when the value is not ready in the slot, and it is set when the producer executes. So can I just set all these bits to one? Think about it: the only instruction currently at the head of the queue is the excepting instruction; everything after it will be killed and fetched again later. What should the state of these columns be after we recover? Forget about the slot id; I am just asking about the ready bit and the valid bit. Both zero? Which one is zero? The ready bit should be zero. What about the valid bit? Zero? Why? The point is that when the exception handler starts, all the values produced by the program up to this point are in the register file; there is nothing left pending in the queue. So this map table essentially has no meaning. I can mark everything to zero, all the valid bits, and I do not have to worry about anything else in this table; no mapping is valid. Is it clear? I can just clear this valid-bit column and I am done.
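In sketch form, again with illustrative names, the whole exception-time recovery of the map table is just clearing the valid column:

```cpp
#include <array>

// Hypothetical map-table row: slot id of the last producer, whether its value
// is already available in that slot, and whether the mapping is valid at all.
struct MapEntry {
    int  slot  = -1;
    bool ready = false;
    bool valid = false;
};

constexpr int NUM_REGS = 32;

// Once the queue has been emptied for the exception, every architectural value
// already lives in the register file, so the mappings are meaningless:
// clearing the valid column is the entire recovery.
void recoverMapTableOnException(std::array<MapEntry, NUM_REGS>& mapTable) {
    for (auto& row : mapTable)
        row.valid = false;    // later readers fall back to the register file
}
```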
When the exception handler starts, it will re-establish all the mappings. The purpose of the map table was that when an instruction comes along which needs to read register R, it knows who its parent is; now there is no parent, because everything has been written back to the register file. It is like the in-order pipeline that we discussed earlier. So I can just clear this bit column and we are done; we do not have to do anything else. Fixing the register map table is easy in this case. That is the exception check.

Assuming there is no exception, a store instruction is sent to memory at this time, when the instruction commits. A control transfer instruction updates the relevant predictors; this is the time to update them because you know the correct outcome and you know that the instruction is about to commit. Value-producing instructions write their results back to the destination registers. The committing instruction also resets the valid bit in the map table if its slot id matches the entry of its destination register, and it compares its slot id with the parent slot ids of all queue entries and toggles the read-from bits accordingly. This is the wake-up we talked about, clubbed with the write back. So that is what happens at the commit stage.

How complex is this implementation? The number of comparators depends on the size of the issue queue, the issue width, and the commit width. Why is that? The issue width is the number of instructions that can issue in a cycle, and each issuing instruction sends its queue slot id to everybody for comparison; everybody makes two comparisons because they may have two parents. So the number of comparisons in the issue-time wake-up is two times the issue width times the length of the queue. When an instruction commits, it has to do one more wake-up, and that depends on how many instructions can commit in a cycle, so that is twice the commit width times the number of queue slots. So you need two sets of comparators, one enabled during issue and one during commit, which is expensive. You would like a better solution that can eliminate one of these. Why do we need the two? It arises because there are two possible places to find a value: a value may be in the register file or it may be in the issue queue entry, depending on the state of the instruction that produces it. You need to merge these two using some protocol, and that is what today's processors do; this is how register renaming is implemented in today's processors, and that is what we will discuss very soon, how exactly you merge these two things into a single structure. Then you can get rid of one set of comparators, in particular the commit-time one, because at the time of commit there is then no wake-up to do. What dictates the issue width? The issue width is limited by the number and mix of functional units, the register file read ports, and the data memory read ports; we discussed this last time. The commit width is limited by the register file write ports and the data memory write ports.
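As a quick worked example of those counts, with assumed parameters (a 32-entry queue, issue width 4, commit width 4; none of these numbers come from the lecture):

```cpp
#include <cstdio>

// Worked example of the comparator counts described above, using assumed
// parameters: 32 queue slots, issue width 4, commit width 4.
int main() {
    const int queueSlots  = 32;
    const int issueWidth  = 4;
    const int commitWidth = 4;

    // Each issuing instruction broadcasts its slot id, and every entry compares
    // it against its two parent slot ids.
    const int issueComparisons  = 2 * issueWidth  * queueSlots;   // 256
    // The commit-time broadcast (toggling read-from bits) needs its own set.
    const int commitComparisons = 2 * commitWidth * queueSlots;   // 256

    std::printf("issue: %d, commit: %d, total: %d\n",
                issueComparisons, commitComparisons,
                issueComparisons + commitComparisons);
    return 0;
}
```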
One small thing that is left in this processor is how you recover from branch mispredictions; that is the one remaining piece we need to discuss. Just to remind you, we said that misprediction recovery is invoked when a branch instruction completes execution and the computed target does not match the predicted target. When a branch completes execution you have both targets, and if they do not match you know that something has gone wrong.

So how do you recover? The easy option is to handle it at commit. You can ask why you have to recover immediately; let us just wait until the branch commits. At that point it looks exactly like an exception: you pretend that the branch instruction has taken an exception, and you do the same thing as for handling exceptions. The problem is that this has low performance, because the commit of a branch may be delayed by other long-latency instructions ahead of it. Remember that a branch can commit only when it comes to the head of the queue, and there may be other instructions sitting before the branch which take a long time to complete and which are completely unrelated. If you delay that long, the processor continues to proceed along the wrong path with all those wrong-path instructions. So this is not really done; it is not acceptable at all. The main point is that for exceptions this delay is acceptable because exceptions are rare, but mispredictions may not be rare: it is not really a property of the application, it is a property of the branch predictor and how smart it is. If I do not have a good branch predictor, mispredictions are going to be very frequent. So we would like to handle a misprediction as soon as it is discovered.

The thing to observe is this. We are talking about a situation where, let us say, we have a branch instruction here which has been issued and has executed, and you find that this branch is mispredicted: the computed target does not match the predicted target. The first thing to observe is that several instructions after it may have completed execution but still not committed; of course they cannot commit before the branch. And several instructions before it may not have issued yet; they may be waiting on, or dependent on, some other instruction. Given this particular state of the issue queue, you would like to recover from the misprediction. The first thing to notice is that we are not worried about the instructions before the branch, because they are correct. We are worried about the instructions after the branch, because they are on the wrongly predicted path. So what you do is invalidate all instructions after the branch by bringing the tail pointer of the queue back to just after the branch, so that the next instruction will be allocated there and those entries get overwritten; they will never commit. You redirect the fetcher to the correct PC, and you fix the register map from a checkpoint. This last step is very different from the way we recovered the register map for an exception; the rest should be clear. Here it is not going to be easy, because now I cannot just flash-clear the valid bits and be done. I have to bring the register map back to what it was when the branch was allocated, and there is no easy way of doing that. One solution is that for each branch instruction, whenever you allocate it, you make a copy of the map table. That requires extra storage, and it puts a limit on how many branch instructions I can keep in the queue, because every branch instruction now needs a checkpoint, since there is a chance that it may mispredict.
So processors often specify a maximum number of branches that can remain unresolved. For example, if you say that I can have only 20 branches outstanding in the queue, that means whenever the decoder decodes the 21st branch, fetching is going to stall; it cannot proceed any further because there is no space for storing the checkpoint for that branch instruction. The checkpoint contains a copy of this map table, that is all, and whenever a branch mispredicts I take that copy and copy the whole thing back, so that I get back the map that I needed.

Is there any other way of recovering which does not require a checkpoint? Consider what I have in the queue entries. Currently my tail pointer is here; I have allocated up to this point, which means my map table currently reflects this state, and I need to bring it back to the branch. Can I recover it from the state of the instructions? Let us take the last allocated instruction. It tells me that it writes to a certain destination register id that is stored in the queue entry, say register 20. That means if I go and look at the 20th row of this table, I must find this particular queue slot id there. To make progress in rolling back, I need to know which was the previous entry that wrote to register 20; then I can change the 20th row to that one and incrementally move backwards. I can do that, but unfortunately it requires a search: for register 20 I have to walk back and find which was the previous instruction that wrote to register 20. Can I improve that? Yes, exactly: when this instruction was allocated, what was the content of the 20th row? It was the previous slot id. I just need to remember that old mapping here in this queue entry, and then I can recover quickly. But it still takes time, because you have to sequentially walk the queue from the tail back to the branch, undoing one entry at a time. It cannot be as fast as recovering from a checkpoint of the table, but it frees you of the checkpoint completely: you just recover from the queue state and you do not need any extra storage. So there are two mechanisms, but today most processors actually do the checkpointing, because of the speed. Branch misprediction is often very frequent, depending on the nature of the branch, and if the mispredicted branch is toward the head of the queue you may have to walk the queue for a long time, because a large number of instructions will already have been allocated after it.
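A minimal sketch of the two recovery mechanisms just described; the `AllocRecord` below, which remembers the mapping a destination register had when the instruction was allocated, is an assumption of this sketch corresponding to "remember it here in this entry."

```cpp
#include <array>
#include <vector>

struct MapRow  { int slot = -1; bool ready = false; bool valid = false; };
using MapTable = std::array<MapRow, 32>;

// Hypothetical per-entry rollback record: the destination register and the
// map-table row that register held when this instruction was allocated.
struct AllocRecord { int destReg; MapRow previousMapping; };

// Option 1: restore the checkpoint taken when the branch was allocated.
void recoverFromCheckpoint(MapTable& mapTable, const MapTable& checkpoint) {
    mapTable = checkpoint;                        // one bulk copy, fast
}

// Option 2: walk the queue backwards from the tail to the mispredicted branch,
// undoing one allocation at a time. Slower, but needs no checkpoint storage.
void recoverByWalking(MapTable& mapTable,
                      const std::vector<AllocRecord>& wrongPath) { // branch+1 .. tail
    for (auto it = wrongPath.rbegin(); it != wrongPath.rend(); ++it)
        mapTable[it->destReg] = it->previousMapping;
}
```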
That is about our single issue queue processor. It essentially does out-of-order execution: it issues instructions out of the program order, and it gives you concurrency by allowing you to execute multiple instructions in a cycle. Any question on this? What we will do now is track back in history a little bit and see what people used to do in earlier years, when this sophisticated hardware was not there. The earliest implementation of this idea was in terms of a scoreboard, first introduced in the CDC 6600, a machine from Control Data Corporation. They had a structure similar to our issue queue which they called a scoreboard, and the name comes from the fact that each instruction essentially gets a score which signifies whether the instruction is ready to go. It handles RAW hazards dynamically just the way we have discussed, exactly the same actually. It keeps track of the dependences. It stalls on WAR and WAW hazards: the decoder can figure out whether there is a possibility of any of these happening, and if there is a chance, the decoder introduces interlock stalls. Issue was still in order; you could not jumble the instruction order. In our single issue queue model we have already discussed that you can issue any ready instruction, and we enforce the order only at the time of commit; here even issue was in order. The scoreboard determines when an instruction can execute based on operand availability. WAW hazards stall the issue. WAR hazards are detected during write back, and the completing instructions are delayed; previously we saw that WAR hazards can be handled in the decoder, but here they are delayed until write back. The instructions are allowed to go, and when they come to the write back stage they detect the WAR hazard, and the completing instructions are delayed; there is a buffer where the results can be held until the hazard is resolved. The book talks through an example in detail, so you can look at how the scoreboard actually works; I do not want to spend much time on this because it is not really what is done in any processor today. You can read the book; it gives an example of how the scoreboard works.

What we have discussed in our single issue queue model is essentially Tomasulo's algorithm, which we also mentioned last time. Robert Tomasulo was an IBM engineer, and he came up with this algorithm while designing an IBM 360 machine. Again, the first incarnation of Tomasulo's algorithm was quite different from what we have discussed here, and the book goes into great detail discussing what that algorithm was, along with examples. Here is a summary; again I will not go into detail because this is not exactly what is done in any processor today. It distributes the scoreboard to the respective functional units; these distributed sections are known as reservation stations. So it already does the distribution, as opposed to having a single issue queue. An instruction goes and occupies a reservation station, waiting for its operands to become available. It resolves the name dependencies by using the reservation station entries, just like our queue slot entries: after an instruction is registered in a reservation station, all dependences generated by this instruction are expressed in terms of reservation station ids, just like our queue slot ids. Write back to the register file and cache must still be in order, just like what we have discussed. Pending results can be held in the reservation station entries or in a future file; this is just like what we do in our issue queue. The bypass network takes the form of a common data bus; that is specifically an implementation detail of the bypass network, and every arriving result must be compared against all pending instruction sources, which is exactly what we have discussed as well. The retirement register file of the P6 microarchitecture is very similar, and we will discuss P6 very soon; we are almost ready to discuss it. Essentially, P6 had something very similar to our single issue queue, but they did not issue from this queue; they had a separate structure for that.
They used this queue only for holding the results of instructions which had not yet written to the register file; that was called the retirement register file. They had two register files: one was the main file and the other was the RRF, and RRF values would be transferred to the main file when the instruction was done. We will discuss P6 in more detail very soon; I just wanted to mention it here because it is very similar.

One more thing before we get to that. If you look at Tomasulo's algorithm carefully, or at whatever we have done so far, what we are doing essentially is looking for instructions that are independent: independent of their predecessors, and themselves independent of each other, so that they can execute concurrently. That is essentially what we are trying to do. The question is, can we apply the same idea to resolving branches? Let me try to explain what I mean. Take an if-else construct; it will translate to a branch instruction. Execution of the if part and the else part is control dependent on this branch, but the code after them is not. So it is possible to skip over these control dependent instructions and start fetching from that point; later we will fill up the gap, once the branch resolves, with instructions either from the if part or from the else part. That is essentially similar in spirit to Tomasulo's algorithm, but for control independence. This is just something for you to think about, and here are some of the issues I want to raise. The first question is: how do you figure out which instructions are control independent of a branch? You have a branch, and this region is control independent; is that a static thing, or does it change over time? That is, will the control independent set of instructions for a branch change over time, or is it a fixed set? If it is a fixed set, that is good, because once I have detected the set I am done: whenever I hit the branch I know where to fetch from again. What do you think? We will come to the values later; do not worry about the values yet, we are just asking about the instructions. Is it a fixed set or a dynamic set? It is static: for a particular branch, the control independent set is a static set of instructions, which means I can remember it once I have discovered it. Discovering it is also easy, depending on how the compiler behaves. For example, in MIPS what will happen is that this branch will be a taken branch if the else part has to be executed, and at the end of the if part there will be an unconditional jump which takes you to the point after both parts; so you can just look for these instructions and easily find out where the control independent part begins. That also answers the second question, how to find the reconvergence point: it is usually this point, and you can apply the same reasoning to other branches as well. The third question is: do we fetch all the instructions and then look for the control independent ones, or do we not even fetch the control dependent instructions until the branch outcome is known? What do we actually do?
There are two options. We can fetch everything sequentially but do nothing with the control dependent part: start executing from the reconvergence point, wait until the branch outcome is known, at which point we know which part will execute and can cancel the other part. The other option is that I skip over: I do not fetch the control dependent instructions at all, I fetch from the reconvergence point, and later I fill in either the if part or the else part once the branch outcome is known. Which one is easier, do you think? Should we follow the first option, and why? Note that I am still not executing any of these instructions, because I do not know which part will actually execute; I am only talking about the fetching part now. What should the fetcher do? The first option is easier; why is the second one harder? How do I even find where to fetch from? Actually I can find out: I look at the branch. Suppose it is an if-else construct. I know the taken target, which I can compute because it is part of the instruction. I move one instruction up from that target; if it is an unconditional jump, then the target of that jump is the reconvergence point. I do not have to fetch anything for this; I can just inspect a couple of instructions and I am done. I compute the target from the instruction, go to the target, move one instruction up; if that is an unconditional jump, then this is an if-else construct and the target of the jump is essentially my reconvergence point. The real difficulty with the second option is filling instructions in later: you have to create space somewhere in the middle of the queue, and the order is important, because the map table has to maintain this order and allocation has to happen in this order. That is very difficult. It is easy to remove instructions from the queue, but filling in something in the middle is not. So the first option is easier, but of course it wastes space in your queue: you may fill up the queue with many unnecessary instructions. Now, about the values: yes, the instructions at the reconvergence point need source values, and a value may be produced both here in the if part and here in the else part, so which one should I take, given that I have not executed either of them yet? What happens to such an instruction? The point is that the easy way to handle this is that you do not execute the instructions at the reconvergence point which are data dependent on anything inside the control dependent region. You hope to find something that is independent. So essentially we are now talking not only about control independence: of the instructions that are control independent, only the ones that are also data independent are the ones we can execute early. This is just something that people have looked at as a research problem, and it often works pretty well. And yes, exactly: if you use the second strategy, where you skip over fetching all of these, then figuring out the data dependences is going to be very difficult; in fact it may not be possible.
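To make the reconvergence-point detection described above concrete, here is a small sketch under the assumption of a fixed-width, MIPS-like encoding; the instruction representation and function name are purely illustrative.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical decoded-instruction view, just enough for this sketch.
struct Instr {
    bool     isUncondJump = false;
    uint64_t target       = 0;     // target encoded in the jump/branch
};

// Find the reconvergence point of an if-else shaped branch: go to the branch's
// taken target, look one instruction back; if that is an unconditional jump
// (the end of the if part), its target is the reconvergence point.
// Returns 0 when the pattern is not recognised.
uint64_t findReconvergence(uint64_t branchTakenTarget,
                           const std::unordered_map<uint64_t, Instr>& program,
                           uint64_t instrBytes = 4) {   // fixed-width ISA assumed
    auto it = program.find(branchTakenTarget - instrBytes);
    if (it != program.end() && it->second.isUncondJump)
        return it->second.target;
    return 0;   // not an if-else shaped region; fall back to normal fetch
}
```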