Welcome to today's lecture on distributed memory multiprocessors. You may recall that we have two basic models. One is known as uniform memory access (UMA); this is also used in symmetric multiprocessors, where a shared memory is connected to the bus and each processor has equal access time. The other model is known as the NUMA model; NUMA stands for non-uniform memory access. In the NUMA model you can see that you have separate processors, and the memory is not shared or common; it is distributed. That is the reason why it is called a distributed memory multiprocessor: each processor has a part of the main memory associated with it, connected through a bus, and the nodes are in turn connected through a network. Now, we have seen the limitations of symmetric multiprocessors in my last class: a centralized resource in the system becomes a bottleneck. Whenever there is a centralized resource which is used by all the processors, it becomes a bottleneck. In the case of a symmetric multiprocessor we have seen that the shared bus is the bottleneck, because as the speed of the processors increases, or as the number of processors increases, the traffic on the bus increases, and as a consequence it puts a limit on the number of processors that can be connected to the bus. There are two types of traffic that the bus must support. One is normal traffic: when there is a cache miss, you have to transfer data from the main memory to the cache. The other is coherence traffic, generated to maintain cache coherence. As the speed of processors increases, the number of processors that can be supported reduces, as I have already mentioned, because the bus has a limited bandwidth.
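To make the bus-saturation argument concrete, here is a back-of-envelope sketch. All the numbers in it are assumed purely for illustration; the lecture gives no specific bandwidth figures.

```python
# Illustrative sketch (assumed numbers, not from the lecture): a shared
# bus saturates once the total traffic from all processors reaches the
# bus bandwidth, which caps the processor count.

def max_processors(bus_bw_gbps, per_cpu_traffic_gbps):
    """Largest processor count whose combined traffic fits on the bus."""
    return int(bus_bw_gbps // per_cpu_traffic_gbps)

# Assume a 12.8 GB/s bus; each processor generates 0.8 GB/s of normal
# (miss) traffic plus 0.2 GB/s of coherence traffic.
traffic_per_cpu = 0.8 + 0.2
print(max_processors(12.8, traffic_per_cpu))  # -> 12
```

Note how doubling processor speed roughly doubles `traffic_per_cpu`, halving the supportable processor count, which is exactly the limitation described above.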
So, with limited bandwidth, the number of processors that can be supported is restricted. Now the question arises: how can the designer increase the memory bandwidth? If the memory bandwidth can be increased, then the number of processors in the system can be increased. How can it be done? One possible approach is to use multiple buses. Whenever we use multiple buses, the traffic on each bus obviously reduces, and that may help you to increase the number of processors. Another alternative is to use multiple physical memory banks; we have already discussed the use of multiple physical banks, and when you have them, reads from different banks can proceed separately, which will also help you to increase the number of processors. So, these are the limitations, and in search of a better alternative a new technique has emerged, known as the directory-based protocol, where you have distributed memory. In a distributed memory system we shall use this special type of protocol. We have seen that in a symmetric multiprocessor, since the processors share a bus, all the protocols are bus based: by monitoring the bus, any transaction that has taken place can be recognized by a processor, and an appropriate action is taken. We shall see that the directory-based protocol is somewhat different; it is an alternative to the snoop-based coherence protocol. In any case we have to maintain cache coherence, and using this protocol we shall be able to do that. The basic motivation behind the directory-based protocol is that it reduces the bandwidth demands compared to a centralized memory machine. A directory keeps the state of every block that may be cached. What do we really mean by a directory?
A directory is a kind of database which keeps the state of every block that may be cached, that is, that may be transferred from the main memory to a cache, along with information like which caches have copies of the block, whether a particular copy is dirty, whether it is shared by more than one processor, and so on. All this information has to be stored in the directory for proper management, to enforce cache coherence. The amount of information is proportional to the product of the number of memory blocks and the number of processors: the memory has a large number of blocks, for each block we have to maintain some information, and for each block we have to maintain something corresponding to each processor. That is why the amount of information is proportional to the number of blocks times the number of processors in the system. And to prevent the directory from becoming a bottleneck, the directory is distributed along with the memory. Now, as the size of the memory increases, in other words as the number of blocks increases, and as the number of processors increases (particularly in a distributed memory architecture the number of processors can be in the hundreds, one hundred, two hundred or more, contrary to symmetric multiprocessors where you can have at most a few dozen processors), the size of the directory becomes pretty big. So where do you store that directory? What can be done is this: just as we have distributed the memory, we can also distribute the directory, and that is what is being done; to prevent the directory from becoming a bottleneck, it is distributed along with the memory. This has another advantage: different directory accesses go to different directories, since the directory is distributed.
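The "blocks times processors" growth can be checked with a short calculation. The memory size, block size, processor count and per-entry state field below are assumed for illustration; the lecture only states the proportionality.

```python
# Illustrative sketch (numbers assumed, not from the lecture): directory
# storage grows as (number of blocks) x (number of processors), since a
# full bit-vector entry keeps one presence bit per processor per block,
# plus a small state field.

def directory_bits(mem_bytes, block_bytes, n_procs, state_bits=2):
    blocks = mem_bytes // block_bytes
    return blocks * (n_procs + state_bits)

# 4 GiB of memory, 64-byte blocks, 256 processors:
bits = directory_bits(4 * 2**30, 64, 256)
print(bits / 8 / 2**20, "MiB of directory state")  # -> 2064.0 MiB of directory state
```

Two gigabytes of bookkeeping for four gigabytes of memory shows why a single centralized directory is untenable, and why it is distributed along with the memory.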
So, a reference to a particular directory will go to a particular place, and another reference will go to another place; this helps reduce the traffic on any one path. We shall see this in detail as we discuss the protocol. So, this is the basic model: a NUMA (non-uniform memory access) computer, where we can use a directory-based solution. You can see that each processor, along with its cache memory and I/O, is an autonomous system. By autonomous I mean each of them can function independently; each of them may have its own operating system, which is one advantage here. And then we have a directory: the directory maintains information about the different things that I have mentioned, and each node is connected through an interconnection network. Through the interconnection network the nodes are connected, and the memory is shared in the sense that there is a single logical address space, although of course there are multiple physical memories. Since the memory is distributed among the processors, what can be done is, for example, to use the higher-order bits of the address to signify the memory associated with a particular processor. If you have 4 processors, 2 bits will be sufficient: 00, 01, 10 and 11. If you have more, and in the NUMA model we have seen the number can be a hundred, then the number of bits required will depend on the number of processors, and the remaining bits can be used to reference a location within the memory associated with that processor. That is how the memory is distributed in this particular case.
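The address split just described can be sketched in a few lines. The field widths here are assumptions for illustration (2 node bits for 4 nodes, a 2^30-byte memory per node); the lecture only fixes the idea, not the sizes.

```python
# Sketch of the address decoding described above (field widths assumed):
# the high-order bits select the node that owns the address, and the
# remaining bits are the offset within that node's local memory.

NODE_BITS = 2          # enough to distinguish 4 processors/nodes
OFFSET_BITS = 30       # assumed per-node memory of 2**30 bytes

def decode(addr):
    node = addr >> OFFSET_BITS
    offset = addr & ((1 << OFFSET_BITS) - 1)
    return node, offset

# An address whose top bits are 10 (binary) lives on node 2:
node, offset = decode((2 << OFFSET_BITS) | 5)
print(node, offset)  # -> 2 5
```

With a hundred nodes, `NODE_BITS` would grow to 7, exactly as the lecture says: the bit count depends only on the number of processors.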
Now, in NUMA computers the messages have long latency, because the information has to go through the interconnection network, and broadcast is inefficient because all messages must have explicit responses. The main memory controller has to keep track of which processors have cached copies of which memory locations, and so on. On a write, we only need to inform the sharers, not everyone: wherever the writing is taking place, the users of that particular block have to be informed, but nobody else. Similarly, on a dirty read, whenever a read occurs on data that has already been modified by some processor, the request has to be forwarded to the owner, that is, the processor holding the modified copy. I shall go into the details of this; the directory-based protocol that will be used is very similar to the snoopy protocol. Just as we have seen there, there are three states with the help of which you can explain how reads, writes and other accesses take place. In this case the three states are shared, uncached and exclusive. A particular cache block will be in the shared state if one or more processors have the data and memory is up to date. That means one or more processors have copied it, and the same copy is present not only in the memory but also in the cache of each sharer, and all the copies are identical. Then uncached: no processor has the data; it is not valid in any cache. In the uncached case the data is present only in the main memory, no processor has cached it, and as a consequence whenever a processor has to read it, it has to read it from the main memory.
Then the third state is exclusive, where exactly one processor has the data. In such a case the block is transferred from the main memory to the cache of a particular processor, and when that processor modifies it, it becomes the exclusive owner of that particular block of data. So, the exclusive state corresponds to one processor being the owner of the data, and memory may be out of date; out of date means the modification that has taken place in the cache memory has not yet been written back to the main memory. We must keep the protocol simple, but it must track which processors have the data when a block is in the shared state. This is usually implemented using a bit vector. Suppose you have 8 processors; corresponding to the 8 processors you can have 8 bits, one per processor from the 0th to the 7th. Whenever a particular processor copies the data, its bit is set to 1. In this way, if a 1 is present in more than one bit position, it signifies that the same copy of the data is held by more than one processor; consequently the block is in the shared state, and the bit vector tells you which processors have copies. Now, writes to non-exclusive data: whenever a particular processor tries to write data that is not in the exclusive state, a write miss will occur, and after the modification is done the state will change to exclusive. Non-exclusive means the block can be in the shared or uncached state; if the processor tries to write, a write miss message is generated, the state changes to exclusive, and the processor blocks until the access completes. Note that here it is not a shared bus.
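The sharer bit vector described above can be sketched as a small class. This is a minimal illustration of the data structure only, not the full protocol; the method names are my own.

```python
# Sketch of the per-block sharer bit vector: one bit per processor
# records which caches currently hold a copy of the block.

class SharerVector:
    def __init__(self, n_procs):
        self.bits = 0          # bit p set => processor p has a copy
        self.n = n_procs

    def add(self, p):          # processor p caches the block
        self.bits |= 1 << p

    def clear(self):           # e.g. after all copies are invalidated
        self.bits = 0

    def sharers(self):
        return [p for p in range(self.n) if self.bits >> p & 1]

    def is_shared(self):       # more than one copy outstanding
        return bin(self.bits).count("1") > 1

v = SharerVector(8)
v.add(1); v.add(5)
print(v.sharers(), v.is_shared())  # -> [1, 5] True
```

With 8 processors this is one byte per block; with the hundreds of processors mentioned earlier, the vector itself is what makes the directory grow with the processor count.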
So, some kind of serialization is necessary, because one operation has to be completed before another begins. That is why the processor blocks until the access completes, meaning other processors are not allowed to access the block in the meantime, and we assume messages are received and acted upon in the order sent. That means when a particular message is sent, the appropriate action for that message is performed, and only then are other messages received and acted upon. This kind of serialization is necessary, and it is important because the bus is not shared and the latency is higher. There is no bus and we do not want broadcast; obviously, since there is no shared bus, there is no possibility of broadcast, and the interconnect is no longer a single arbitration point. Since it is not a bus, bus-style arbitration cannot be done; the interconnect can be a point-to-point network or an interconnection network having multiple paths, so a single arbitration point does not exist, and all messages must have explicit responses. In a bus-based system a transaction is a kind of broadcast on the bus, and as a consequence all the processors listen to it; in this particular case that is not so, so each and every message must have an explicit response. Now, in this directory-based protocol we will find that there exist three different types of nodes. One is the local node, where a request originates; the local node generates a request for a read or write. Then there is the home node, where the memory location and the directory entry for the address reside, and from where data has to be read. And there is the remote node, which has a copy of the cache block, whether exclusive or shared.
So, that means the remote node is the place where a copy of the block is already present, and it may be in either the exclusive or the shared state, and you can have a number of processors. P specifies the processor number, and whenever a transaction takes place the memory address has to be specified; that address is A. So, in the messages you will see P and A, the processor number and the address, mentioned for different purposes. This is the list of possible messages. For example, read miss means the data is not present in the cache memory and the processor is trying to read; the message is generated by the local cache, the destination is the home directory, and the message contains the processor number and the address. Then write miss: again the source of the message is the local cache and the destination is the home directory. By home directory we mean, as we have seen in the diagram, the directory at the node that owns the address; the directories at other nodes are remote directories with respect to this processor. Then invalidate: this one is generated by the home directory and its destination is a remote cache. Whenever data is modified and a remote cache was holding a copy, that copy has to be invalidated; the message content is the address, meaning invalidate the shared copy of the data at address A. Then fetch: again the source is the home directory, because the up-to-date data is not present in the memory.
So, the block has to be fetched: it has to come from the remote cache at address A, the data is sent to the home directory, and the status of the block in the remote cache is changed to shared. Then fetch/invalidate: again the source is the home directory and the destination is a remote cache; the message contains the address A, the data is sent to the home directory, and the action is to invalidate the block. Then data value reply: the data value is copied into the local cache. In this case the source is the home directory, the destination is the local cache, and the message contains the data which will be written into the local cache; it returns the data value from the home directory. And data write back: the source is the remote cache in this particular case, the destination is the home directory, the message contains the address and the data, and the function is to write back the data value for address A. So, these are the possible messages that will be generated in this directory-based protocol. Now let us see what the different states are. The states are identical to the snoopy case and the transactions are very similar; we have already discussed the snoopy-based protocol, and we shall see the states are identical and the transactions are similar, though a little different. Transitions are caused by read misses, write misses, invalidates and data fetch requests. So, transitions will occur from one state to another because of these messages; the cache will generate read miss and write miss messages to the home directory, and the write misses that were broadcast on the bus for snooping will not happen here; explicit invalidate and data fetch requests are used instead in the directory-based protocol. Note that on a write, a cache block is bigger than a word, so the full cache block needs to be read.
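The message vocabulary just listed can be written down compactly. The field layout of the `Message` record is an assumption for illustration; the lecture only names the message types, their sources and destinations.

```python
# A compact sketch of the directory-protocol message vocabulary listed
# above; comments record who sends each message and what it carries.

from enum import Enum, auto
from collections import namedtuple

class Msg(Enum):
    READ_MISS = auto()         # local cache -> home directory, carries (P, A)
    WRITE_MISS = auto()        # local cache -> home directory, carries (P, A)
    INVALIDATE = auto()        # home directory -> remote cache, carries A
    FETCH = auto()             # home directory -> remote cache, carries A
    FETCH_INVALIDATE = auto()  # home directory -> remote cache, carries A
    DATA_VALUE_REPLY = auto()  # home directory -> local cache, carries data
    DATA_WRITE_BACK = auto()   # remote cache -> home directory, carries (A, data)

Message = namedtuple("Message", "kind src dst payload")

m = Message(Msg.READ_MISS, src="P1", dst="home", payload=("P1", 0x1A0))
print(m.kind.name)  # -> READ_MISS
```

Every one of these messages gets an explicit response, which is the point made earlier: with no shared bus, nothing is learned by snooping.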
So, that means whenever you are performing a write, although you will be writing one word, a block consists of several words; the entire block, all the words, has to be transferred and read before the modification can be done. Now, this is the state transition diagram: the state machine for CPU requests for each memory block, assuming that initially the block is in the invalid state. Let us gradually build it up and see how it exactly happens. Initially, the block is in the invalid state. Invalid state means the data is not present in the cache, and it is not in the shared state either. The CPU can generate a CPU read; whenever a CPU read is generated, the cache will send a read miss message to the directory. In response to this, the reading of the data will take place, and obviously the block will change state from invalid to shared, in read-only mode. Now, before moving to the shared state, let us first complete the invalid state. From the invalid state the CPU can also perform a write. Whenever it tries a CPU write, the cache will send a write miss to the home directory, and the state of the cache block will change to exclusive, because after it has performed the write it becomes the exclusive owner. So, from the invalid state there are two transitions: one because of CPU read and another because of CPU write. Whenever a CPU read occurs the block goes to the shared state; whenever a CPU write occurs it goes to the exclusive state. Now, when it is in the shared state, if the CPU keeps reading, nothing changes.
So, on a CPU read hit it will remain in the same state; after the data is brought into the cache memory, the CPU can continue to read without changing the state. Similarly, whenever a CPU read miss occurs in the shared state, the cache will send a read miss message and the block will remain in the shared state. Now, from this state, whenever the CPU performs a write, the block again goes to the exclusive state: the cache will send a write miss message to the home directory and the state will change to exclusive. So, from the shared state these are the possible transitions: CPU read hit, CPU read miss and CPU write. Now, whenever the block is in the exclusive state, meaning the block is present only in that particular processor's cache, the CPU can perform a read hit or a write hit; there will be no change of state, because the block is already exclusive, and the CPU is free to read the data or write into the data. Now, what happens if a CPU write miss occurs while the block is in the exclusive state? The cache will send a data write back and a write miss to the home directory, and it will remain in the exclusive state. CPU write miss means the requested data is not here, but the block that is here was in the exclusive state; so the modified data has to be written back, the write miss is sent to the home directory, the new data is transferred to the cache and modified by the processor, and the block remains in the exclusive state. Now come two more transitions out of the exclusive state.
So, in the case of a fetch, the cache will send a data write back to the home directory and then the block will go to the shared state. And the last one is CPU read miss: whenever a CPU read miss occurs, the block will go from the exclusive state to the shared state, generating a data write back and a read miss message to the home directory, and it will switch from exclusive to shared. So, these are the different transitions shown in this particular diagram; this is the state machine for CPU requests for each memory block, and I have discussed the state transitions one after the other. Now, let us consider the state machine of the directory. The directory also maintains a finite state machine; its states and transitions are the same as for an individual cache, and the memory controller performs the following actions: update the directory state, send messages to satisfy requests, and track all copies of each memory block. Each arc indicates an action that updates the sharing state as well as a message being sent. So, let us see the various situations. In this case you can see the directory can receive three messages: read miss, write miss, and data write back. These are the three messages received by the directory state machine, and the directory state machine again has three different states: one is uncached, where the data has not yet been cached; it can be shared (read only); and the third state is exclusive. These are the three states. Now, suppose a block is in the uncached state, meaning it has not yet been transferred to any cache memory, and there is a read miss.
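The cache-side transitions just summarized can be captured in a small table-driven sketch. This is a simplification: it records only states and the message sent to the home directory, not the actual data movement, and the event names are my own shorthand.

```python
# Table-driven sketch of the cache-controller state machine described
# above: (state, cpu_event) -> (next_state, message_to_home_directory).

CACHE_FSM = {
    ("invalid",   "read"):       ("shared",    "read_miss"),
    ("invalid",   "write"):      ("exclusive", "write_miss"),
    ("shared",    "read_hit"):   ("shared",    None),
    ("shared",    "read_miss"):  ("shared",    "read_miss"),
    ("shared",    "write"):      ("exclusive", "write_miss"),
    ("exclusive", "read_hit"):   ("exclusive", None),
    ("exclusive", "write_hit"):  ("exclusive", None),
    ("exclusive", "write_miss"): ("exclusive", "data_write_back+write_miss"),
    ("exclusive", "read_miss"):  ("shared",    "data_write_back+read_miss"),
}

# Replay a short sequence of CPU events on one block:
state = "invalid"
for event in ["read", "write", "read_miss"]:
    state, msg = CACHE_FSM[(state, event)]
    print(event, "->", state, "sending", msg)
```

Note that hits send no message at all; only misses and writes to non-exclusive blocks generate traffic to the home directory, which is the bandwidth saving the protocol is after.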
A read miss occurs, and whenever a read miss occurs the directory will send a data value reply; the data is sent, and the block switches to the shared state. In such a case you have to maintain a sharers list: the sharers will include this processor, the processor to which the data has been sent. So, sharers will be equal to {P}, and the block goes to the shared state. From the uncached state the block can also go to the exclusive state on a write miss; in this case also the directory will send the data value reply message, and sharers will again be {P}, that processor being entered in the sharers list. So, these are the two transitions that occur when the block is in the uncached state. Then let us see when it is in the shared state. The directory can receive a read miss message; when a read miss occurs, the directory will send the data value reply and the sharers list will be updated to sharers + {P}. This is what happens in case of a read miss. Now let us consider a write miss: the directory will send an invalidate message to all the processors in the sharers list, then the sharers list will have only P, because the block comes under the exclusive control of P, and the directory will also send the data value reply message, with the block going to the exclusive state. Then, whenever the block is in the exclusive state, it can go from exclusive to shared on a read miss; in that case the sharers list will add the processor P which is trying to read, the directory will send a fetch to the current owner, and the directory will send the data value reply message to the requesting cache.
So, this is the case when the write-back policy is being used. The next one is a write miss: when a write miss occurs in the exclusive state, the block remains exclusive; the sharers list becomes {P}, and the directory sends a fetch/invalidate to the old owner and a data value reply to the requester. And finally, there is one more: when the block is in the exclusive state and a data write back arrives, the sharers list becomes empty, the write back of the block is performed, and the block becomes uncached. So, these are the various transitions that occur in the directory state machine in response to the different messages, read miss, write miss and data write back, coming from the processors. Now, let us consider an example. A message sent to the directory causes two actions: update the directory state and send messages to satisfy the request. We have already discussed these things in detail: what happens when a read miss or a write miss occurs while the block is uncached or shared, and what is done in the exclusive state when a read miss, a write miss or a data write back occurs. Now let us consider an example trace, where different actions are taken, and see what is done by the different processors, what appears on the interconnect, what is done in the directory, and what happens to the memory value. The different steps are mentioned here. First, P1 writes 10 to address A1.
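The directory-side transitions summarized above can also be written as a table-driven sketch. Again this is a simplification: it tracks only state, the sharers set and the names of the actions taken, with no real data; the action names are my own shorthand.

```python
# Sketch of the directory state machine: one step processes one incoming
# message and returns the new state, new sharers set, and actions taken.

def directory_step(state, sharers, msg, p):
    if state == "uncached" and msg == "read_miss":
        return "shared", {p}, ["data_value_reply"]
    if state == "uncached" and msg == "write_miss":
        return "exclusive", {p}, ["data_value_reply"]
    if state == "shared" and msg == "read_miss":
        return "shared", sharers | {p}, ["data_value_reply"]
    if state == "shared" and msg == "write_miss":
        return "exclusive", {p}, ["invalidate_sharers", "data_value_reply"]
    if state == "exclusive" and msg == "read_miss":
        return "shared", sharers | {p}, ["fetch_from_owner", "data_value_reply"]
    if state == "exclusive" and msg == "write_miss":
        return "exclusive", {p}, ["fetch_invalidate", "data_value_reply"]
    if state == "exclusive" and msg == "data_write_back":
        return "uncached", set(), ["write_back_block"]
    raise ValueError((state, msg))

# A short sequence: P1 takes the block exclusively, then P2 reads it.
s, sh = "uncached", set()
for msg, p in [("write_miss", "P1"), ("read_miss", "P2")]:
    s, sh, acts = directory_step(s, sh, msg, p)
    print(msg, "->", s, sorted(sh), acts)
```

The two-step sequence at the bottom already previews the trace the lecture walks through next: exclusive at P1, then shared between P1 and P2 after a fetch.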
So, processor P1 is writing some data, and the corresponding address is A1. As far as processor P1 is concerned, the state becomes exclusive, because it has written the data, and at address A1 the value 10 will be written. When this is being done, the corresponding bus transaction on the interconnect is a write miss message sent by P1 for address A1, followed by a data value reply for P1 and A1, and the directory will record, for address A1, the state exclusive with the corresponding processor P1. So, this is the sequence of events that takes place in response to processor P1 writing 10 into A1. Now, when the same processor reads the data at A1, what happens? In this particular case, as you can see, there is no change in state, because the block was already in the exclusive state; the processor will simply generate the address and read the data from its cache memory. Then, when another processor, P2, tries to read from the same address, what will happen? Let us see. In this case the block was earlier in the exclusive state at processor P1. Now it will switch from exclusive to shared, because another processor is reading it, and the block will also be present in the cache memory of processor P2. The state will change to shared; correspondingly, processor P1 also changes the state from exclusive to shared, and the value present is 10, the same for P1 and P2, at the same address. On the interconnect, the read miss message sent is from P2 for address A1, the block is fetched from processor P1 at address A1 with value 10, and the data value reply that occurs is P2, A1, 10.
So, these are the messages on the interconnect, and the directory will respond with A1 in the shared state, with P1 and P2 holding the copy; it has to maintain the list of sharers, which is here P1 and P2, and the memory value is 10. Now, P2 writes to the same address, some other value, say 20. In this case P2 will write, the block will come under the exclusive control of P2, and the value is changed; there will be corresponding messages on the interconnect, and the directory will change correspondingly. Earlier address A1 was in the shared state with P1 and P2 in the list of sharers, but now the state is exclusive, the block is present only in P2, and the memory value remains 10, because the new value has not yet been written into the memory; it is only present in the cache memory of P2. The copy will be invalidated in processor P1, so processor P1 will have the state invalid, and for P2 it is exclusive, and so on. Now, P2 writes 40 to A2. Another write is taking place by P2; let us see what happens in this case. The interconnect carries the invalidate for P1, A1, then the write miss message P2, A2, and the write back P2, A1, 20. The block in P2's cache goes to the exclusive state for the new address; it was already in the exclusive state, but the address and data values are changing, so it remains exclusive. In the directory the same thing happens: A2 will be in the exclusive state, the processor in the sharers list is P2, and the memory value of A2 is still 0, while the write back updates the memory value of A1 to 20. Here it has been assumed that A1 and A2 map to the same cache block.
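The example trace can be replayed with a tiny self-contained model. This is a sketch of one block only, tracking state, sharers and the memory value; the invalidation and fetch messages are implied by the bookkeeping rather than modeled explicitly.

```python
# Compact replay of the trace above for a single block (A1): a directory
# entry plus per-processor cache states, no real interconnect.

dir_entry = {"state": "uncached", "sharers": set(), "mem": 0}
caches = {}   # processor -> (state, value)

def write(p, value):
    for q in list(dir_entry["sharers"]):
        if q != p:
            caches[q] = ("invalid", None)      # invalidate other copies
    dir_entry.update(state="exclusive", sharers={p})
    caches[p] = ("exclusive", value)

def read(p):
    if dir_entry["state"] == "exclusive":
        owner = next(iter(dir_entry["sharers"]))
        _, v = caches[owner]
        dir_entry["mem"] = v                   # fetch: owner writes back
        caches[owner] = ("shared", v)
    dir_entry["state"] = "shared"
    dir_entry["sharers"].add(p)
    caches[p] = ("shared", dir_entry["mem"])

write("P1", 10)   # P1 writes 10 to A1 -> exclusive at P1
read("P2")        # P2 reads A1        -> shared {P1, P2}, memory = 10
write("P2", 20)   # P2 writes 20       -> exclusive at P2, P1 invalidated

print(dir_entry["state"], sorted(dir_entry["sharers"]), dir_entry["mem"])  # -> exclusive ['P2'] 10
print(caches["P1"])  # -> ('invalid', None)
```

Notice that after the final step the memory value is still 10: exactly as in the lecture, the value 20 lives only in P2's cache until a write back occurs.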
It has been assumed, that is, that the values 10 and 20 at A1 and the value 40 at A2 involve addresses A1 and A2 mapping to the same cache block; that is why writing to A2 displaces the block holding A1 and forces its write back, with the state of the block changing from exclusive to shared, invalid and so on as we have traced. Now, we shall discuss some of the implementation issues in directory-based protocols. When the number of processors is large, the directory becomes a bottleneck, as I have told you, because the size of the directory is proportional to the number of blocks times the number of processors. So, the directories can be distributed among different memory modules. Earlier we have seen the directory associated with a single memory; now what can be done is to divide it into separate memory modules, so that different directory accesses go to different locations, leading to less traffic on any one path to memory. Here are two examples illustrating the kind of communication overhead that occurs. Let us assume that an application is running on a 32-node multiprocessor, and it incurs a latency of 400 nanoseconds to handle a reference (read or write) to memory. That means 400 nanoseconds are necessary to read from the memory. The processor clock rate is 1 gigahertz and the instructions per cycle (IPC) is 2. That means an instruction needs only half a nanosecond when everything is in the cache, but whenever you have to read from the main memory the time needed is 400 nanoseconds, which is much longer. Now, how much faster will the computation be if there is no communication, meaning no reference to memory, versus if 0.2 percent of instructions involve a reference to a memory location? You have to compute the execution time for the two different situations and compare them.
So, in this case the base CPI is 0.5, because the instructions per cycle is 2, and CPI is the reciprocal: 1/2 = 0.5. Now consider the effective CPI with 0.2 percent remote references. Whenever you are reading from the memory there will be remote references, so the effective CPI becomes base CPI plus memory request rate times memory request cost. The effective CPI with 0.2 percent references is 0.5 plus 0.002 times 400 cycles, because 0.2 percent of instructions reference memory and 400 nanoseconds at 1 gigahertz is 400 cycles. So this gives 0.5 plus 0.002 times 400 = 0.5 + 0.8 = 1.3. So, a program having no memory references will be 1.3/0.5 = 2.6 times faster. Whenever 0.2 percent of instructions access memory, execution becomes considerably slower, as you can see, with the CPI changing from 0.5 to 1.3. So, this is the situation. Now, consider another example: again an application is running on a 32-node distributed shared memory machine. In this case it incurs a latency of 400 microseconds to handle a reference to a remote memory; the processor clock rate is 1 gigahertz and the IPC is 2. Now, how much faster will the computation be on the multiprocessor system of the previous example, compared to this distributed shared memory machine, if 0.2 percent of the instructions involve a reference to a remote memory, with local memory references behaving as before? In this case the effective CPI with 0.2 percent remote references equals base CPI plus remote request rate times remote request cost. So the effective CPI becomes 0.5 plus 0.002 times 400,000 cycles, because 400 microseconds is 400,000 nanoseconds, and at 1 gigahertz that is 400,000 cycles; this is the remote request cost.
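The effective-CPI arithmetic of both examples can be checked with a short script, using the figures as given: base CPI 0.5, a 0.2 percent reference rate, a 400-cycle local memory cost, and a 400,000-cycle remote reference cost.

```python
# Verify the effective-CPI formula used above:
# effective CPI = base CPI + (request rate) x (request cost in cycles).

def effective_cpi(base_cpi, ref_rate, cost_cycles):
    return base_cpi + ref_rate * cost_cycles

local = effective_cpi(0.5, 0.002, 400)        # 400 ns memory at 1 GHz
remote = effective_cpi(0.5, 0.002, 400_000)   # 400 us remote reference

print(local)                     # -> 1.3
print(remote)                    # -> 800.5
print(round(remote / local, 1))  # -> 615.8
```

The two-order-of-magnitude jump in request cost dominates everything else: the base CPI of 0.5 becomes irrelevant next to the 800-cycle-per-instruction remote penalty.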
So, the remote request cost is substantially more than when the reference is served from the local memory, as was the situation in the previous case. The effective CPI becomes 0.5 + 800 = 800.5, and so the multiprocessor with local references would be 800.5/1.3, approximately 615.8 times faster. In the previous case we have seen the CPI was 1.3 when reads come from the local memory, but in this case remote references lead to a CPI of 800.5; the previous configuration is much faster. And the performance figures of NUMA may be even worse if we take data dependency and synchronization aspects into consideration. Without taking data dependency and synchronization into consideration this is the situation, and if you take them into account the overhead will be more. So, with this we have come to the end of today's lecture. In today's lecture we have discussed the distributed shared memory architecture, and we have discussed the directory-based protocol for overcoming the cache coherence problem that arises whenever shared memory is used in a distributed memory environment. Thank you.