Welcome to today's lecture on symmetric multiprocessors. Strictly speaking, the title should read symmetric shared-memory multiprocessor; the words "shared memory" are often dropped, although a symmetric multiprocessor always uses shared memory. If you look at the different types of multiprocessor systems, they can be broadly categorized into two types. In the first case, we have a number of processors along with their private cache memories, and there is a shared main memory through which the processors exchange information; this is known as a symmetric multiprocessor or symmetric shared-memory multiprocessor. In the other model, the processors have on-chip cache memories and their own private main memories, and they are connected through a network; this is known as a distributed-memory multiprocessor, and it goes by other names as well. The symmetric multiprocessor is also called the uniform memory access (UMA) model because, for each of the processors, the access time to main memory is the same. It does not change, because the processors are connected through a bus, so irrespective of which processor accesses main memory, the access time remains the same. The distributed-memory multiprocessor, on the other hand, is known as the non-uniform memory access (NUMA) model: since each processor has its own private main memory, or in other words the main memory is distributed among all the processors, a processor P1 accessing its own main memory sees a smaller access time than when it accesses the main memory attached to another processor P2. That is why it is called the non-uniform memory access model, or the distributed-memory multiprocessor.

In this lecture we shall focus on the symmetric multiprocessor based on the UMA model: its various aspects and the various challenges it faces. Symmetric multiprocessors are a very popular shared-memory multiprocessor architecture; the processors share the main memory, and the various input/output devices used in the system are also shared. As I have already mentioned, the words "shared memory" are usually omitted from the name, but all access is normally through a bus, and since the access is bus based, the access time for all memory locations is equal; that is the reason for the name symmetric MP, symmetric multiprocessor.

Over the years symmetric multiprocessor systems have evolved. In the 1980s, the processor and cache memories were on separate extension boards: on a single printed circuit board the processor and its cache memories were put together, and the boards were plugged into a backplane that linked them through a common bus to build a complete system. In the 1990s they were integrated on the main board, with 4 or 6 processors placed per board; of course, if you want to go beyond 4 or 6 processors, then again you have to use the backplane concept so that multiple boards can be interfaced. Then in the 2000s, in this century, they were integrated on the same chip, which is the multi-core processor. Now it is not on different boards, not even on a single board, but on a single chip.
So, you can say a single die or chip on which multiple processors are fabricated; we have dual-core and quad-core parts, and the number of cores keeps increasing. Multi-core processors have been manufactured by IBM, Intel, AMD and others; all the major chip manufacturers have joined the race of producing multi-core processors.

Now, let us look at the advantages and disadvantages of symmetric multiprocessor systems. The main advantage is ease of programming, especially when communication patterns are complex or vary dynamically during execution; in such situations programming is relatively easy on a symmetric multiprocessor. Of course, there are disadvantages as well. As the number of processors increases, the contention on the bus increases: since all exchange of information takes place through a shared bus, the traffic on the bus keeps growing as you keep adding processors, and as a consequence the scalability of the symmetric multiprocessor model is restricted. You cannot have a very large number of processors in a single system. You can try to overcome the problem by replacing the bus with switches: one way may be to use a switch, say a 16-port switch, so that any pair of processors can be connected and parallel exchange of information can take place, or you can use a multi-stage interconnection network instead of a bus. However, a switch sets up parallel point-to-point connections, as shown in the diagram, and that complicates the situation: with a shared memory and parallel point-to-point connections, other problems arise. So switches are not without disadvantages either; they make the implementation of cache coherence difficult. As we shall see, even on a simple bus-based system the cache coherence problem is not easy to handle; it can be solved, but it is quite complex, and going for a switch or a multi-stage interconnection network only adds to that complexity, which is clearly a disadvantage.

Interestingly, even programs that do not use multi-threading experience a performance increase on a symmetric multiprocessor. That is, even for a single-threaded application, moving to a multiprocessor system improves performance. The reason is that kernel routines, interrupt handling and so on can run on a separate processor: the application program runs on one processor and the operating system kernel routines run on another, and in that way performance improves. Multi-core processors are now commonplace, as I have already told you: the Pentium 4 Extreme Edition, Xeon, Athlon 64, DEC Alpha and UltraSPARC lines all have multi-core versions, and there has been a forecast that the number of cores per chip is likely to double every year. So again we are trying to increase the number of processors every year.
However, programming is a big constraint, so this forecast may not come true; the number of cores may not actually double every year, but in any case it will keep on increasing.

Now, the question naturally arises: why have we gone for multi-core instead of increasing the complexity of a single processor, putting in more and more transistors, making it more and more powerful, increasing the clock frequency and so on? As we know, performance is related to the clock frequency and to the complex functions you put on the chip. So why are we going for multi-core instead of trying to improve single-processor performance? The reasons are clock skew and temperature. As you increase the clock frequency there is a problem known as clock skew: because of delays in different parts of the circuit, the clock does not reach all points of the chip at the same instant, and that is the clock skew problem. The other issue is temperature, because power dissipation keeps increasing as the frequency increases and as you put in more and more transistors. As we know, the dynamic power dissipation is proportional to C_L × V_dd² × f: the load capacitance C_L keeps increasing as you increase the number of transistors, and the power also grows with the frequency f. So both factors push power dissipation up if you increase the frequency and the number of transistors. As a consequence, any additional transistors are better spent on additional cores: using ever more complex techniques to improve single-thread performance gives very limited returns, and it has been found that beyond 2002 the gap between the number of transistors and the real performance delivered has kept widening; the transistor count may grow steadily, but the performance does not grow with it.

Now, if you look at this particular table, we find that instead of increasing the clock frequency of a single core, you can increase the number of cores and reduce the clock frequency while maintaining the same performance; here we are comparing configurations for the same performance. Assuming the task is evenly distributed across the cores, parallel computation takes place, and to achieve the same performance the clock frequency can be reduced correspondingly: with one core the clock frequency is 200 MHz, with two cores 100 MHz, and so on. As the clock frequency is reduced, it is also possible to reduce the supply voltage, because in a VLSI circuit the achievable frequency, that is the delay, is related to the supply voltage; if you reduce the frequency, you can operate the circuit at a lower voltage. That has led to the concept of dynamic voltage and frequency scaling (DVFS): voltage and frequency can be reduced dynamically. In this particular case, by reducing both the frequency and the voltage, the power dissipation for the same performance can be brought down as we go from one core to eight cores.
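To make this power argument concrete, here is a minimal C sketch, assuming the relation P ∝ C_L × V_dd² × f from above, the 200 MHz baseline from the lecture's table, and the idealized assumption that the supply voltage can be scaled in proportion to the frequency; the numbers are only illustrative, not measurements of any real processor.

```c
#include <stdio.h>

/* Relative dynamic power versus a single 200 MHz core, under the idealized
 * model P ~ n_cores * C_L * Vdd^2 * f, with Vdd scaled in proportion to f.
 * The 200 MHz baseline is the figure from the lecture's table; everything
 * else is illustrative. */
static double relative_power(int n_cores, double base_freq)
{
    double f   = base_freq / n_cores;   /* each core runs at a lower clock  */
    double vdd = f / base_freq;         /* idealized: voltage scales with f */
    double single = base_freq * 1.0 * 1.0;      /* 1 core, normalized Vdd   */
    double multi  = n_cores * f * vdd * vdd;    /* n slower, lower-V cores  */
    return multi / single;
}

int main(void)
{
    int cores[] = { 1, 2, 4, 8 };
    for (int i = 0; i < 4; i++)
        printf("%d core(s): relative dynamic power = %.4f\n",
               cores[i], relative_power(cores[i], 200.0));
    return 0;
}
```

Under this idealized scaling the power falls as 1/n² of the single-core figure; in practice the voltage cannot be lowered that aggressively, so the real saving is smaller, which is exactly the caveat made next.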
Of course, this is an extreme, idealized case; in a real situation the saving will be somewhat less, but you will definitely get lower power dissipation for the same performance if you go for a multi-core circuit. This is the basis for the present-day commercial multi-core processors introduced by Intel, AMD and others. Another point is that thread-level parallelism is exploited in a multi-core architecture to increase the throughput of the processors; we have already discussed at length in my last lecture how different types of thread-level parallelism can be used. With multiple cores in the same physical package, you can execute different threads on different cores. Now, the number of threads depends on the application: an application can be considered as a process, a process will have some number of threads, and that number cannot be increased beyond a certain limit; it depends on the process. So whenever the number of threads is small, you can switch off some of the cores; in other words, the number of threads decides how many cores remain active. Another alternative is a kind of load balancing: a thread can be switched from one core to another so that the work is evenly distributed among the cores. All of these are possible, and as I have mentioned, the number of cores keeps increasing; a minimal sketch of how threads can be mapped onto particular cores is given after this paragraph.

Now, let us focus on cache organization for multi-cores: what kind of cache hierarchy is suitable in a multi-core architecture? One possibility is to have two levels of cache: the L1 cache is always with the core, so we can say it is private to each of the processors, and each core can also have its own private L2 cache; a bus then connects them to the main memory. Another possibility is that each processor has its own private L1 cache while the L2 cache is shared. So the L2 cache can be either private or shared. The question is, in a multi-core, symmetric multiprocessing situation, which configuration is most suitable? It has been found that in any multiprocessor, main memory access is the bottleneck; a multi-level cache reduces the memory demand of each processor, which is why multi-level caches are used, and it is the multi-level cache that makes it possible for more than one processor to meaningfully share the memory bus. Moreover, it has been found that instead of a private L2 cache, it is profitable to have a shared L2 cache in a multi-core architecture. The reasons are efficient dynamic use of the space by each core, and the fact that data shared by multiple cores is not replicated: once it is in the shared L2 cache, it is not necessary to keep a separate copy per core.
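As promised above, here is a minimal, Linux-specific sketch of running one thread per core, using POSIX threads together with the GNU extension pthread_attr_setaffinity_np to pin each thread to a particular core; the worker function and the choice of four threads are placeholders for illustration only.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NUM_THREADS 4           /* assumes a machine with at least 4 cores */

static void *worker(void *arg)
{
    long id = (long)arg;
    /* The real work of this application thread would go here. */
    printf("thread %ld running on CPU %d\n", id, sched_getcpu());
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_THREADS];

    for (long i = 0; i < NUM_THREADS; i++) {
        pthread_attr_t attr;
        cpu_set_t set;
        pthread_attr_init(&attr);
        CPU_ZERO(&set);
        CPU_SET((int)i, &set);
        /* Pin thread i to core i before it starts running. */
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
        pthread_create(&tid[i], &attr, worker, (void *)i);
        pthread_attr_destroy(&attr);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```

Compiled with -pthread and run on a four-core Linux machine, each thread should report a different CPU; switching cores off entirely is an operating-system and power-management matter outside the scope of this sketch.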
Earlier we saw that if a data item is shared and the L2 caches are private, it has to be replicated in each of the L2 caches; if the L2 cache is shared, replication is unnecessary and space is saved. Also, every block has a fixed home, so it is easy to find the latest copy; in other words, cache management becomes easier. These are the advantages of a shared L2 cache. Of course, a private L2 cache also has some advantages: quick access to the private L2 over a private bus, and less contention. These are there, but it has been found that a shared L2 cache is preferable to a private L2 cache in spite of those advantages.

Now we come to the main problem we encounter in shared-memory multiprocessor systems, which is known as the cache coherence problem. When shared data are cached, that is, when a data item used by multiple processors is brought from main memory into the caches, caching allows migration and replication of the data in multiple caches; it reduces the latency of accessing the shared data, since once it is in the cache the access is fast, and it also reduces the bandwidth demand on the shared main memory, that is, the traffic on the memory bus. For these reasons it is desirable to bring shared data into the caches. However, the copies of the data in the caches of different processors may become inconsistent. With a write-back policy, we know that when the data is modified in the cache it is not immediately modified in main memory. So if a shared data item has been brought from main memory into a processor's cache and that processor modifies it, a problem arises, because main memory is not updated; with a write-through policy memory would be updated on every write, but with write-back the main memory copy is updated only when the block is explicitly written back. So how do we enforce cache coherence; how does a processor come to know about changes made in the caches of other processors? Let me show you a multi-core architecture where each processor has its own cache. Suppose a data item has been transferred from main memory to a cache and then modified there, so that the cached value and the main memory value differ; if that item is now read from main memory by another processor, it obviously will not get the most recent value. This inconsistency is what is known as the cache coherence problem.

So let me consider a multi-core chip having four cores, where, as you can see, there is a main memory that is shared and connected through a shared bus. Since we have private caches, the question is how to keep the data consistent across the cache memories: each core should perceive the memory as a monolithic array shared by all cores. First we shall see how the cache coherence problem arises, and then we shall discuss how it can be taken care of. Suppose the main memory contains a variable x with the value 45123, and a particular processor reads it.
So core 1 reads x = 45123, and the value is transferred from the main memory into its cache; then another core, core 2, also reads x = 45123. Now suppose core 1 writes to x, setting it to 12345. Its cache is modified, and assuming a write-through cache, main memory is also updated. So the content of core 1's cache and the content of main memory for the variable x are the same, because of the write-through policy being used, but the value in core 2's cache is different. Across the caches, the variable x no longer has the same value everywhere, and this is the cache coherence problem.

How can we overcome this? There are basically two possible approaches: one based on hardware, the other based on software. First let us see how cache coherence can be maintained using software. A software solution obviously does not require any additional hardware; it relies on the compiler and the operating system to deal with the problem. That means the cache coherence problem is handled by the operating system and the compiler, which obviously increases the compile-time overhead: the compiler is no longer a simple, straightforward compiler; it has to analyse the program, take the cache coherence problem into consideration and devise a mechanism to overcome it. How? The compiler performs analysis on the code to detect which data items may become unsafe for caching; it has to identify which data items are vulnerable, that is, which ones may lead to a cache coherence problem. It then marks those items as non-cacheable: data that is shared by multiple cores or processors is simply not brought into the caches. It is somewhat like saying that since playing football may cause injury, do not play football. This certainly avoids the problem, but it is not very attractive. Being conservative, this approach does not lead to effective use of the cache memories: the cache exists to improve performance, and if shared data is forbidden from being cached, the cache does not serve its full purpose; moreover, the compiler becomes more complicated. So this is not a very good solution, although we have discussed it for the sake of completeness.

Now we shall focus on the hardware solutions. These hardware solutions are referred to as cache coherence protocols; they address the classical cache coherence problem. A hardware protocol allows dynamic recognition of potential inconsistencies at run time: since it is done by hardware, it is handled neither at compile time nor by the operating system; coherence is maintained at run time while the programs are running, which leads to more effective use of the caches and better performance than the software-based approaches.
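To make the x = 45123 scenario concrete, here is a small, single-threaded C sketch that merely models two private caches and a write-through main memory with no coherence mechanism at all; the structures and values are purely illustrative, and on real hardware this is the job of the cache controllers, not software.

```c
#include <stdio.h>

/* A toy model: one shared memory word and one private cached copy per core. */
struct cache { int valid; int value; };

static int  memory_x = 45123;          /* main memory copy of x             */
static struct cache c[2];              /* private caches of core 0, core 1  */

static int core_read(int core)
{
    if (!c[core].valid) {              /* cache miss: fetch from memory     */
        c[core].value = memory_x;
        c[core].valid = 1;
    }
    return c[core].value;              /* cache hit: may be stale!          */
}

static void core_write(int core, int v)
{
    c[core].value = v;                 /* update own cache                  */
    memory_x = v;                      /* write-through to main memory      */
    /* No invalidation/update of the other core's cache -> incoherence.     */
}

int main(void)
{
    printf("core 0 reads x = %d\n", core_read(0));            /* 45123      */
    printf("core 1 reads x = %d\n", core_read(1));            /* 45123      */
    core_write(0, 12345);                                     /* core 0 writes */
    printf("memory x       = %d\n", memory_x);                /* 12345      */
    printf("core 1 reads x = %d  <-- stale\n", core_read(1)); /* 45123      */
    return 0;
}
```

The last line is exactly the inconsistency being described: core 1 keeps returning 45123 from its cache even though core 0 and main memory already hold 12345; the hardware protocols discussed next exist precisely to prevent this.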
Coming back to the hardware approach, it also reduces the software development burden: the programmers who write compilers, application programs and operating systems have less to worry about, so software designers are very happy about it. Now, there are two basic hardware-based approaches: one is known as the snoopy (snooping) protocols and the other as the directory-based protocols; today we shall focus on snoopy protocols. How do they differ? The two techniques differ in several aspects: first, where the state information about the cached data lines is held; you have to store state information so that you can detect a potential coherence violation, and the question is where that information is stored and how it is organized; then, where coherence is enforced, at what level; and finally, what the enforcement mechanism is. These are the aspects in which the two classes of protocol differ, but the objective is the same. The key to maintaining cache coherence is to track the state of sharing of every data block; based on this idea, the overall solution is to dynamically recognize any potential inconsistency at run time and carry out a preventive action. That is the basic philosophy used by these cache coherence protocols. Since it is done in hardware, the advantage is that consistency maintenance becomes transparent to the programmers, the compilers and the operating system; the cost, of course, is an increase in hardware complexity.

As I have already mentioned, there are two basic approaches. In the snooping protocol, each cache controller snoops on the bus to find out which data is being used by whom. The idea is somewhat like a watchdog that waits at the gate, keeps observing who comes in and who goes out, and barks when a stranger enters the house: each cache controller keeps monitoring the bus, and only when it finds that a data item it holds is involved in somebody else's transaction does it take the appropriate action. In the directory-based protocol, on the other hand, the sharing state of each data block is kept in a directory, and that is why it is called the directory-based protocol; the directory is a centralized record covering all memory blocks, so you may call this a centralized enforcement scheme, whereas the snooping protocol is a distributed enforcement scheme, because each cache controller takes care of coherence locally. The directory allows the coherence protocol to avoid broadcasts; the snooping protocol, in contrast, requires broadcasts, but on a shared bus that costs nothing extra, since every transmission on the bus is essentially a broadcast anyway, and it is the shared bus that makes this a shared-memory multiprocessor in the first place. So the snooping protocol can meaningfully be implemented only when there is a shared bus, and that is the protocol we shall focus on today.
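Although the rest of the lecture is about snooping, a one-structure sketch may help fix the idea of a directory; assuming, purely for illustration, a machine with up to 32 processors, a per-block directory entry could look like the following (the field and function names are mine, not from any particular machine).

```c
#include <stdint.h>
#include <stdio.h>

#define NPROC 32   /* illustrative upper bound on the number of processors */

/* Per-memory-block directory entry (illustrative only).  The directory is
 * a centralized table holding one such entry per block of main memory.    */
enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED };

struct dir_entry {
    enum dir_state state;   /* sharing state of the block                  */
    uint32_t sharers;       /* bit i set => processor i holds a copy       */
};

/* On a write by processor 'writer', invalidate only the current sharers
 * (point-to-point messages), instead of broadcasting on a bus.            */
static void handle_write(struct dir_entry *e, int writer)
{
    for (int p = 0; p < NPROC; p++)
        if (((e->sharers >> p) & 1u) && p != writer)
            printf("send invalidate to processor %d\n", p);
    e->sharers = 1u << writer;   /* only the writer keeps a copy           */
    e->state   = DIR_MODIFIED;
}

int main(void)
{
    struct dir_entry x = { DIR_SHARED, (1u << 0) | (1u << 1) }; /* P0, P1 share */
    handle_write(&x, 0);   /* P0 writes: only P1 receives an invalidation   */
    return 0;
}
```

This is only to contrast the two approaches: in the snooping protocol discussed below there is no such table at all; every controller simply watches the shared bus.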
So, as I mentioned, the snooping protocol cannot function without a shared bus, and even with a shared bus, scalability is a problem, as I have already pointed out: as the number of processors on the bus increases, the traffic on the bus increases and performance degrades. In a single shared-bus system the number of processors you can have is limited, maybe a dozen or two, at most a few dozen; you cannot build massively parallel computers with a thousand or two thousand processors on a shared-memory bus. One alternative is to have more than one bus, and some work has been done in that direction: the Sun Enterprise server has up to 4 buses, so you can increase the number of buses to reduce the traffic on each bus. However, it has been found that this is not a very good approach; it has been tried, but it has not been very successful commercially.

Now let us go into the details of the snooping protocol. As soon as a request for a data block is put on the bus by one processor, the other processors, meaning the cache controllers residing in each processor, snoop to check whether they have a copy of that block, and respond accordingly. This works well with a bus interconnection: all transmissions on a bus are essentially broadcasts, so snooping is effortless; you do not have to do anything extra, and that is why snooping dominates almost all small-scale machines, that is, systems with a limited number of processors.

Now let us look at the different categories of snooping protocols. The first is the write invalidate protocol and the second is the write broadcast (write update) protocol. In the write invalidate protocol, when one processor writes to its cache, all other processors having a copy of that data block invalidate their copies; it is pretty simple and can be implemented easily. In the write broadcast protocol, on the other hand, when one processor writes to its cache, all other processors having a copy of that data block update their copies with the newly written value. First we shall focus on write invalidate, and then compare write invalidate with write update. Incidentally, coherence is the fourth C: in addition to the three C's of cache misses mentioned earlier (compulsory, capacity and conflict), coherence misses form a fourth category. In the write invalidate protocol, a write to shared data is handled by sending an invalidate command on the bus; all caches snoop and invalidate any copies they have. When a read miss occurs, with a write-through policy the memory is always up to date, and with a write-back policy snooping is used to find the most recent copy.

Let us illustrate this with an example. Here you have a multi-core chip, and if a core writes to a data item, all other copies of this data item in the other caches are invalidated. We start with the scenario we saw before: in the main memory we have a variable x = 45123, and we have copies with the same value in the two caches corresponding to core 1 and core 2, so right now the data is consistent. Now suppose core 1 writes to x, setting it to 12345. Core 1 has modified the value, and when the data is written, an invalidation request is also sent over the bus that connects the cores.
Note that only an invalidation request corresponding to the variable x is sent. What happens then is that the other core holding this variable x simply invalidates its copy. In the main memory, because of the write-through policy being used here, the value is updated: earlier it was 45123 both in core 1's cache and in main memory, and now both have been changed to the new value, since a write into the cache also writes into memory. And when this write takes place, the invalidation request is propagated on the bus, and as a consequence the corresponding entry in core 2's cache is invalidated. What does invalidation really mean? The entry is simply removed from that cache; it is as if a copy of that variable is no longer present there. So after invalidation, core 2's cache no longer holds the variable x. That is the invalidation technique. Of course, core 2 may later read x; when it performs that read it will obviously encounter a cache miss, because the variable is no longer present in its cache, and through that miss it will fetch a fresh copy, so the data becomes consistent again. This is how the invalidation protocol works, and it is very simple to implement: on a write to shared data, the invalidation is broadcast on the bus, and the processors snoop and invalidate any copies; on a read miss, with write-through the memory is always up to date, so there is no problem. And when there are concurrent writes, write serialization is automatically achieved, since the bus serializes the requests: the bus arbitration decides whose transaction goes first, so the bus provides the basic ordering support.

The alternative to the invalidate protocol is the update protocol. In the update protocol, instead of an invalidate signal, the updated value is broadcast on the bus, and wherever that variable is present in a cache, the copy is updated; this is how consistency is maintained in that scheme. Comparing the two: the invalidate protocol exploits the locality of writes, since only one bus transaction is needed for any number of subsequent writes to the same location, which makes it more efficient in bus traffic, while the broadcast (update) protocol has lower latency between a write and the subsequent reads on other processors. On balance, write invalidate is the winner because of its bandwidth advantage, and as a consequence it has been adopted in processors such as the Pentium 4 and the PowerPC.

Now let us look at a concrete snoopy protocol: an invalidation protocol with write-back caches. Here we assume that invalidation is used and that the cache is write-back, not write-through. Under this assumption, each cached block of memory is in one of the following states: modified, exclusive, shared or invalid. Remember that, as you know, cache accesses take place in terms of blocks.
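Before going through the state machines, here is a minimal sketch in C of how a snooping cache controller might encode these four states per cache line and react to a remote write seen on the bus (the invalidate case just described); this is only an illustrative skeleton under my own naming, not the full protocol.

```c
#include <stdio.h>

/* Four possible states of a cache line in a MESI-style snoopy protocol. */
enum line_state { INVALID, SHARED, EXCLUSIVE, MODIFIED };

struct cache_line {
    enum line_state state;
    unsigned long   tag;     /* which memory block this line holds        */
    int             data;    /* block payload (one word, for simplicity)  */
};

/* What the snooping controller does when it sees another core's write
 * (an invalidate / write-miss transaction) for a block it may hold.      */
static void snoop_remote_write(struct cache_line *line, unsigned long tag)
{
    if (line->state != INVALID && line->tag == tag) {
        if (line->state == MODIFIED)
            printf("write back dirty block 0x%lx first\n", tag);
        line->state = INVALID;            /* drop our copy */
        printf("block 0x%lx invalidated\n", tag);
    }
}

int main(void)
{
    struct cache_line l = { SHARED, 0x40, 45123 };
    snoop_remote_write(&l, 0x40);   /* another core writes block 0x40 */
    return 0;
}
```

The full protocol, of course, also has to handle the local CPU's own reads and writes and the other bus transactions; that is exactly what the two state machines discussed next specify.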
So we think in terms of blocks; each block can be in one of these states, and for each block the state is maintained with the help of flag bits. Modified means the line in the cache has been modified; exclusive means this cache has the only copy, the line is writable and dirty, that is, it is in the exclusive use of a particular core; shared means the line is clean, copies may exist in other caches, and it is up to date in memory, so the block can be read without any problem; and invalid means the data present in this block is obsolete and cannot be used. With the help of these four states the snoopy protocol can be implemented. Summarizing the table: the cache line is valid in the modified, exclusive and shared states and not valid in the invalid state; the memory copy is out of date for a modified line and valid for an exclusive or shared line; copies do not exist in other caches for a modified or exclusive line, while for a shared line they may or may not exist; and a write to the line does not go to the bus in the modified and exclusive states, goes to the bus and updates the cache in the shared state, and goes directly to the bus when the line is invalid.

The cache controller at every processor that implements this protocol has to perform a specific action when the local processor makes a request, and also when certain addresses appear on the bus; the exact action depends on the state of the cache block. So you require two finite state machines that show the different actions to be performed by the controller. The table lists the possible requests, the states of the cache block (invalid, shared, modified and so on), the types of cache action that can occur (normal hit, normal miss and so on), and the function to be performed in each situation: on a read hit, read the data from the cache; on a read miss, place a read miss on the bus; and so on. I think it will be better to discuss this with the help of the finite state machines.

The first state machine considers only the CPU requests for a cache block. Three states are shown here: invalid, shared and exclusive; the modified state is not required in this case. If the block is invalid, meaning the data is not present in the cache, a CPU read leads to a read miss: a read miss is placed on the bus and the block goes from the invalid state to the shared state. On a CPU read hit it remains in the shared state and keeps being read; on a read miss while shared, the controller again places a read miss on the bus and the block remains in the shared state. When the CPU writes to the block, it goes from the shared state to the exclusive state, because a modification has now been made and only this particular cache, this particular core, holds the up-to-date copy; the other caches may be holding old copies.
In the exclusive state, CPU read hits and write hits keep the block in the exclusive state; as long as read and write hits occur it stays there. On a CPU write miss, the controller writes back the cache block and places a write miss on the bus, and the line remains in the exclusive state. When a CPU read miss occurs, the block is written back, a read miss is placed on the bus, and the line goes back from exclusive to shared. And finally, when a CPU write occurs to an invalid block, a write miss is placed on the bus and the block goes from the invalid state to the exclusive state. So all the possible cases of CPU reads and writes have been covered by this state machine for CPU requests on a cache block.

Next are the various requests that can arrive from the bus, and the action depends on the state of the local cache block. If a read miss appears on the bus and the block is in the shared state, no action is needed; the memory will service the read miss. If a read miss appears and the block is in the modified (exclusive) state, there is a coherence conflict: the controller places the cache block on the bus, writes it back, and changes the state to shared. If a write miss appears on the bus and the block is in the shared state, again there is a coherence conflict, and the controller invalidates its cache block. If a write miss appears and the block is in the modified state, the controller writes back the cache block and makes its state invalid. This, too, can be explained with a state transition diagram: the state machine considering only the bus requests for each cache block. From the exclusive state the block goes to the shared state when a read miss for this block appears on the bus; the block is written back and the memory access is aborted. From the shared state it goes to the invalid state when a write miss for this block appears. And when a write miss appears for a block that is in the exclusive state, the block is written back, the memory access is aborted, and the state again becomes invalid. Combining the two, we get the complete snoopy cache state machine, considering both the CPU requests and the bus requests for each cache block; the states are the same, invalid, shared and exclusive, and the transitions of the two machines are combined into one diagram. Since I have already discussed the two finite state machines separately, I shall not discuss the combined one any further; instead, let me illustrate the protocol with an example.

We have two processors, P1 and P2. For each processor the table records the state, the address and the value of its cache block; the bus section records the action, the processor involved, the address and the value; and for the memory, the address and the value are recorded. Step one: P1 writes 10 to A1. Assuming the block was in the invalid state, it goes to the exclusive state in P1's cache, with address A1 and value 10.
On the bus, the corresponding action is a write miss: the processor involved is P1 and the address is A1, and that is what is visible on the bus. Next, P1 reads A1; since the block is already in the exclusive state, the read is simply served from P1's cache and nothing happens on the bus. Now P2 reads A1; this is a read miss for P2, and its block goes to the shared state, so on the bus we see a read miss originated by P2 for address A1. In response, P1 switches from the exclusive to the shared state and writes the block back, so a write-back of A1 with value 10 by P1 appears on the bus, and the memory copy of A1 becomes 10. You can see that the state of both P1's and P2's cache blocks has changed; both are now shared, holding A1 = 10.

Then P2 writes 20 to A1. Since P2 is writing to that location, P1 invalidates its block, and P2's block goes to the exclusive state with value 20; on the bus a write miss originated by P2 for address A1 appears. Note that because write-back caching is used, memory still holds A1 = 10 at this point; the new value 20 exists only in P2's cache. Next, P2 writes 40 to A2, where A1 and A2 map to the same cache block (but A1 is not A2). So far as P1 is concerned, nothing changes; its block remains invalid. P2's block stays in the exclusive state, but now for address A2 with value 40. On the bus, a write miss by P2 for A2 appears, and since the block being replaced held the dirty value A1 = 20, a write-back of A1 = 20 by P2 also takes place; only now does the memory copy of A1 become 20. No value for A2 is recorded in memory, because P2's write of 40 stays in its cache in the exclusive state and does not go to the bus.

These are the commercial implementations: Intel's Pentium-class processors such as the Xeon, Pentium III and Pentium 4 are cache-coherent multiprocessors that implement a snooping protocol; they use larger on-chip caches to reduce bus contention, and the chipset contains an external memory controller that connects the shared processor-memory bus with the memory chips. With this we have come to the end of our discussion on shared-memory, or symmetric, multiprocessor systems; in my next lecture we shall focus on distributed-memory multiprocessor systems. Thank you.