and welcome to today's lecture on multi-threading and multiprocessing. Here we shall discuss thread-level parallelism, and we shall also see how multiple processors can be used for multi-threaded processing and multiprocessing. We have seen that initial computer performance improvements came from innovative manufacturing techniques and the advancement of VLSI technology. If you go back to the early history of computers, you will see that many innovative technologies were incorporated to improve performance. Then, when VLSI technology became available, its advancement reduced the size of the devices and increased the speed of processing. So, as a consequence, computer performance improvement took place based on these two factors: innovative manufacturing techniques and the advancement of VLSI technology. Subsequently, in later years, most improvements came from the exploitation of instruction-level parallelism, which we have discussed in detail. We have seen how hardware and software can be used to achieve instruction-level parallelism: pipelining, dynamic instruction scheduling, out-of-order execution, VLIW, and vector processing, which I shall discuss a little later. All of these essentially utilize the instruction-level parallelism available, and we have seen how different processors have been implemented using it; particularly in the last three lectures, we discussed the Intel series of processors and how instruction-level parallelism has been incorporated, starting with pipelining, leading to performance improvement. But it has now been recognized that instruction-level parallelism is essentially fully exploited. That means whatever performance gain is possible by exploiting instruction-level parallelism has already been achieved, and modern processors have become incredibly complex. We have seen that superscalar processors, which issue multiple instructions per cycle, require very large silicon real estate and are very complex, and processor performance improvement through increasing complexity, increasing silicon, and increasing power seems to be diminishing. That means we have more or less reached a point of diminishing returns: even after providing a lot of hardware or using more sophisticated software, the performance gain now achieved by exploiting instruction-level parallelism is minimal. So what is the way out? The way out is to look for something else, and of late the focus has been on achieving higher performance by exploiting thread-level and process-level parallelism, that is, parallelism existing across multiple processes and threads. So now we are looking at parallelism at a slightly higher level, not at the instruction level but at the process level and thread level, and it has been found that the gains achieved this way cannot be obtained through instruction-level parallelism. For example, consider banking applications. Nowadays centralized banking has become very popular: many users communicate with a bank through the internet, each user accesses the bank and performs transactions, and so multiple transactions initiated by different users take place concurrently.
These individual transactions can be executed in parallel. So, in today's core banking or centralized banking systems, the transactions initiated by multiple users can all be executed in parallel; here the parallelism is more or less inherent, built into the application itself. Now, let us consider the difference between a process and a thread. As we know, a process is a program in execution. You run different programs for different applications, like word processing and various others, and whenever a program is running we call it a process. An application normally consists of multiple processes: a single application is broken down into multiple processes, and these processes can run concurrently even within a single application. A process, in turn, consists of one or more threads. So this is the relationship between process and thread: an application can be broken down into multiple processes, and a process can be broken down into multiple threads. Now, in what way do a process and a thread differ? Threads belonging to the same process share data and code space. Consider first a single-threaded process. It requires code, data, and files; to execute the program, it requires registers to store intermediate results, and a stack, because whenever you make a subroutine call you have to store parameters on the stack, and you also need the stack during context switching. So you have code, data, files, registers, and a stack, with a single thread of execution running through them; this is a single-threaded process. Now consider a multi-threaded process. Here the code, data, and files are shared by multiple threads; however, if you create, say, three threads, each thread requires its own separate set of registers and its own separate stack, while the code, data, and files are shared by all of them. So in a multi-threaded process you share the same code space, data, and files, but each thread has its own registers and stack. I believe you will learn about processes and threads in more detail in your operating systems course. The question naturally arises: how do you create threads? Fortunately, there are popular thread libraries provided, like POSIX Pthreads, Win32 threads, and Java threads, which facilitate creating threads; with the help of these libraries, you can create threads.
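To make thread creation with such a library concrete, here is a minimal sketch using POSIX Pthreads, one of the libraries mentioned above; the worker function and the choice of three threads are illustrative assumptions of mine, not something from the lecture. Compile with -pthread.

```c
/* Minimal Pthreads sketch: three threads sharing the process's
 * code and data, each with its own private stack and registers.
 * The worker function and thread count are illustrative. */
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    int id = *(int *)arg;              /* per-thread argument */
    printf("thread %d running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t tid[3];
    int ids[3] = {0, 1, 2};

    for (int i = 0; i < 3; i++)
        pthread_create(&tid[i], NULL, worker, &ids[i]);

    for (int i = 0; i < 3; i++)
        pthread_join(tid[i], NULL);    /* wait for each thread */

    return 0;
}
```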
Now, threads can be of two types: user threads and kernel threads. What is the difference between them? First, let us focus on user threads. In the case of user threads, thread management is done in user space. We have already discussed virtual memory management, where we mentioned kernel space and user space. User threads are supported and managed without kernel support, and since this is done without kernel support, user threads are invisible to the kernel; if one thread gets blocked, the entire process gets blocked. As a consequence of this limitation, user threads provide only limited benefits of threading. Kernel threads, on the other hand, are supported and managed directly by the operating system. The operating system takes full control of kernel threads, and the kernel creates lightweight processes; indeed, threads are sometimes called lightweight processes. Modern operating systems such as Windows XP/2000, Solaris, Linux, and Mac OS all support kernel threads. So we have now defined what a thread is and discussed the relationship between process and thread; let us look at the benefits of threading. First of all, responsiveness. We have seen that threads share code and data, as in the diagram above where the code, data, and files are shared; thread creation and switching is therefore much more efficient than that of processes. Whenever you switch from one process to another, even on a uniprocessor system running multiple processes, a lot of overhead is involved: you have to save and restore various information, and the code, data, and files are separate for each process. But since threads share the code, data, and files, thread creation and switching is much cheaper. As an example, for the threads supported by the Solaris operating system, it has been found that creating a thread is about 30 times less costly than creating a process, and context switching between threads is about 5 times faster than between processes. That means creation of threads is much more efficient, and switching from one thread to another is also much more efficient. This shows the benefit of multi-threading compared to multiprocessing.
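The relative costs quoted above can be seen qualitatively with a rough sketch of my own: the first loop below creates and joins many threads, the second creates and reaps many child processes, and the elapsed times are printed. The counts and the timing method are my assumptions, and the exact ratio will vary by system. Compile with -pthread on a POSIX system.

```c
/* Rough comparison of thread creation versus process creation. */
#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static void *noop(void *arg) { return NULL; }

static double elapsed(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    enum { N = 1000 };
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) {
        pthread_t t;
        pthread_create(&t, NULL, noop, NULL);
        pthread_join(t, NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("%d threads:   %.3f s\n", N, elapsed(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) {
        pid_t p = fork();
        if (p == 0) _exit(0);          /* child does nothing */
        waitpid(p, NULL, 0);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("%d processes: %.3f s\n", N, elapsed(t0, t1));
    return 0;
}
```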
Multi-threading also gives you truly concurrent execution, provided there is enough support from the hardware: if you want multiple threads to run concurrently, there must be some hardware support, without which you cannot really do it. What kinds of support are available in present-day processors? Support is available in the form of SMP, symmetric multiprocessing systems, which I shall go into in detail later on, where you have multiple processors; and you can also have multi-core processors. In symmetric multiprocessors, the processors may be on different chips, but in a multi-core processor you have multiple cores, multiple processors, on a single die. You will also see simultaneous multi-threading (SMT), which I shall discuss in more detail and which is also known as Hyper-Threading. With the help of these, you can have concurrent execution of multiple threads. Now, let us consider the case for processor support for thread-level parallelism. Using pure instruction-level parallelism, execution unit utilization is only about 20 to 25 percent. We have seen that multiple execution units are present in all modern processors; in the Pentium series of processors, there are seven to nine execution units of different types. How much are these execution units utilized? You have provided multiple execution units in the hardware; will they be fruitfully, meaningfully utilized, or will they remain idle in most situations? It has been found that using pure instruction-level parallelism, execution unit utilization is only about 20 to 25 percent in modern processors with, say, eight or nine execution units. Utilization is limited by control dependencies, cache misses during memory access, and so on. We have discussed the different types of hazards encountered in pipelining, and the different types of dependencies: data dependency, control dependency, structural dependency. By providing multiple execution units you can overcome structural dependency, so structural hazards will not occur, but control dependencies will still arise whenever you execute loops or take decisions, generating control hazards. Because of this, the utilization achievable through instruction-level parallelism is very limited; it is rare for the units to be even reasonably busy on average. Moreover, with pure instruction-level parallelism, only one thread is under execution at any point of time. This demonstrates that it is very much essential to go for multi-threading, to utilize thread-level parallelism. Now, how can the utilization of the execution units be improved? You can have several threads under execution; for example, in the Pentium-3 these are called active threads. Whenever you have multiple threads under execution, whatever name you give them, concurrent execution of threads is taking place, and the processor executes several threads at the same time by using, as I have told you, SMP, SMT, and multi-core processors. This clearly shows the need for thread-level parallelism.
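As a small aside of my own, on a POSIX system you can ask the operating system how much hardware parallelism is available for running threads concurrently; note that on an SMT (Hyper-Threading) machine this count includes logical processors, not just physical cores.

```c
/* Query the number of online processors (logical, on SMT systems). */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online processors: %ld\n", n);
    return 0;
}
```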
Now, where are threads widely used in applications? Threads are natural to a wide-ranging set of applications whenever the applications are more or less independent of one another. Completely independent applications can run naturally multi-threaded; there may be some data sharing among them to some extent, but that can be provided without much difficulty. So, with a limited amount of data sharing, these multiple applications can run independently. Sometimes there is also a need for synchronization among them, and for synchronization there are techniques available with whose help process-level and thread-level synchronization can be achieved. Here are a few thread examples; independent threads occur naturally in several applications, as I was saying, and these are some applications where thread-level parallelism occurs and you can utilize it. Number one is a web server, where different HTTP requests are threads. A web server serves a large number of people, and each HTTP request can be considered a separate thread (a small sketch of this thread-per-request idea is given after these examples). Similarly, a file server serves a large number of users: their data is stored on the server and accessed by multiple users, locally or remotely, and each request can be considered a separate thread. Similarly, whenever you implement an internet service you require a name server; name servers also receive multiple requests from different sources, and each of these requests can be served by a separate thread. As I have already said, in banking applications it is possible to have independent transactions, and these can be taken care of by independent threads. And not only in servers: in desktop applications also it is possible to have independent threads, because even in a desktop application different functions are being performed, like file loading, display of data on the screen, computation, and so on, which can be different threads. So even on a simple desktop you can have multiple threads. So we are now convinced that threading is inherent to any server application, and threads are also easily identifiable in traditional applications like banking, scientific computation, and so on.
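Here is the promised sketch of the thread-per-request structure used by web and file servers. A real server accepts network connections; here the "requests" are simulated with an array of strings, which is my own simplification, and the handler function is hypothetical. Compile with -pthread.

```c
/* Toy thread-per-request server: each independent request is
 * served by its own thread, so requests run in parallel. */
#include <pthread.h>
#include <stdio.h>

static void *handle_request(void *arg)
{
    const char *url = arg;             /* the simulated request */
    printf("serving %s\n", url);
    return NULL;
}

int main(void)
{
    const char *requests[] = { "/index.html", "/logo.png", "/data.csv" };
    pthread_t tid[3];

    for (int i = 0; i < 3; i++)
        pthread_create(&tid[i], NULL, handle_request, (void *)requests[i]);
    for (int i = 0; i < 3; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```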
Now, we have discussed instruction-level parallelism and we have briefly discussed multi-threading; can one support the other? Instruction-level parallelism support can be used to exploit thread-level parallelism, and you can configure processors accordingly. There are four possible configurations. The first is a superscalar processor with no multi-threading support, where you use only instruction-level parallelism and no thread-level parallelism. The second is a superscalar processor with coarse-grained multi-threading; multi-threading can be of two types, fine-grained and coarse-grained, which we shall discuss. The third is a superscalar processor with fine-grained multi-threading, and the fourth is a superscalar processor with simultaneous multi-threading. These techniques, fine-grained multi-threading, coarse-grained multi-threading, and simultaneous multi-threading, we shall discuss later on, and in a modern superscalar processor they can be supported quite easily. So far we have discussed the benefits of multi-threading: we have seen various advantages, different applications that inherently support multi-threading, and how the execution units can be utilized more efficiently. But in this world nothing is one-sided; obviously there will be some disadvantages, and some of them are highlighted here. Threads have to be identified by the programmer, and unfortunately no rules exist as to what can be a meaningful thread. A programmer has to use experience and intuition to create threads; there is no generalized rule of the form "follow this algorithm and you will obtain threads"; that kind of thing unfortunately does not exist. Threads also cannot be identified by any automatic static or dynamic analysis of code: such analysis does not really lead to the identification of meaningful threads. In a nutshell, multi-threading puts a burden on the programmer. It requires careful thinking and programming, and, as I have already said, experience and intuition play an important role here. Moreover, an application program may have severe dependencies among its threads, and severe dependencies may make multi-threading an exercise in futility: you may not really obtain useful threads. As a consequence, thread-level parallelism is not as programmer-friendly as ILP. With instruction-level parallelism, the programmer is not much bothered; it is taken care of by the hardware in most situations, so it is very programmer-friendly, but that is not the case with thread-level parallelism. As I have already discussed, threads are lightweight and fine-grained; threads share address space, data, and files. Even when the extent of data sharing and synchronization is low, exploitation of thread-level parallelism is meaningful only when communication latency is low. Communication is another very important parameter: different processes and threads have to communicate with each other, and that communication overhead has to be taken into consideration. It all depends on how the different processors are connected: are they sharing a common bus, are they connected through a switch, are they connected through the internet? The communication costs depend on this interconnection and must be taken into account.
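To make the data sharing and synchronization mentioned above concrete, here is a minimal sketch in which two threads update one shared counter under a mutex; the counter and the iteration count are my own illustrative choices. Compile with -pthread.

```c
/* Two threads share one counter (shared data space); a mutex
 * makes their concurrent updates safe. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                      /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *add(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);            /* enter critical section */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, add, NULL);
    pthread_create(&b, NULL, add, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld\n", counter);       /* always 200000 */
    return 0;
}
```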
Because of these communication costs, shared memory architectures are a popular way to exploit thread-level parallelism; these are known as uniform memory access architectures. In a shared memory architecture you have a common memory which is shared by all the processors, usually through a bus. That means thread-level parallelism can be meaningfully utilized in a situation where the communication overhead is small, and that is what happens in uniform memory access systems, in UMA systems. Processes, on the other hand, can be considered coarse-grained: their communication-to-computation requirement is lower, and for them distributed shared memory (DSM), clusters, grids, and so on are meaningful. That means you can use process-level, coarse-grained parallelism in distributed shared memory architectures, while in UMA systems thread-level parallelism is more meaningful. Now we shall focus on the different types of parallel architectures available; this can be considered the taxonomy of parallel architectures. Back in 1966, Flynn outlined four classes of organization of high-performance computers based on instruction and data streams. In a processor there is an instruction stream and a data stream, and the classification into four types is based on how these streams flow. The first is SISD, single instruction, single data. The second is SIMD, single instruction, multiple data. The third is MISD, multiple instruction, single data. And the fourth is MIMD, multiple instruction, multiple data. So you have the four classifications of computers proposed by Flynn back in 1966, and even today they are used for classifying computers. I shall briefly discuss these four classifications. First, consider SISD. You require a control unit (CU), a processing unit (PU), and a memory unit; these are the three hardware resources you require in computation. Normally the control unit and processing unit together are available in the form of a CPU, the central processing unit. In SISD, single instruction, single data, a single instruction stream flows from the control unit to the processing unit, and a single data stream flows between the processing unit and the memory unit. This is the typical SISD architecture: a single stream of instructions from the control unit to the processing unit, and a single stream of data between the processing unit and the memory unit. Then, coming to SIMD, single instruction, multiple data: here the control unit provides a single instruction stream to multiple processing units, processing unit 1, processing unit 2, up to processing unit n, and there are separate data streams between the processing units and the memory units, memory unit 1, memory unit 2, up to memory unit n. As you can see, the memory here is not shared; you can say it is distributed memory. This type of processing is what you do in array processing and vector processing. So this is the second classification, single instruction, multiple data.
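To give a flavour of the SISD versus SIMD distinction in code, here is a small sketch of my own. It uses the GCC/Clang vector extension as an illustrative stand-in for real SIMD hardware: the scalar loop applies one instruction to one data element at a time, while the vector addition applies one operation to four elements at once.

```c
/* SISD-style scalar loop versus SIMD-style vector operation.
 * The v4si type uses a GCC/Clang extension (illustrative only). */
#include <stdio.h>

typedef int v4si __attribute__((vector_size(16)));  /* 4 ints */

int main(void)
{
    int a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];

    /* SISD: one instruction stream, one data item per step */
    for (int i = 0; i < 4; i++)
        c[i] = a[i] + b[i];

    /* SIMD: one add operates on all four elements together */
    v4si va = {1, 2, 3, 4}, vb = {10, 20, 30, 40};
    v4si vc = va + vb;

    for (int i = 0; i < 4; i++)
        printf("%d %d\n", c[i], vc[i]);   /* identical results */
    return 0;
}
```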
Then, coming to MIMD: in MIMD you have multiple control units connected to multiple processing units, each of them connected to a separate memory unit. In this case they can be connected through an interconnection network: each processor has its own local memory, and the processors communicate through the interconnection network. This is one kind of MIMD architecture, connected through an interconnection network. You can have MIMD with shared memory as well, where multiple control units, control unit 1, control unit 2, up to control unit n, are connected to multiple processing units, processing unit 1, processing unit 2, up to processing unit n, and these in turn are connected to a shared memory. To summarize what I have already discussed: SISD is the typical uniprocessor system; SIMD is the classic form of array processors and vector processors; and in MISD, multiple instructions process a single data stream, with different instructions coming from different control units while a single data stream flows through, as happens in a systolic array, with results generated by multiple execution units. The systolic architecture was proposed in the 1970s, but there was no commercial implementation of it. Finally, MIMD is general-purpose and commercially important; nowadays MIMD processors, multiple instruction, multiple data processors, are becoming increasingly popular and widely used. So we shall devote some time to the classification of MIMD computers. MIMD computers can have shared memory, where the processors communicate through a shared memory, as I have already discussed; typically the processors are connected to each other through a bus. That means each processing element is connected to a bus, and the shared memory sits on that same bus; this is the typical shared memory organization. Distributed memory processors, on the other hand, do not share any physical memory; each has only its local memory, as I have already shown, and the processors are connected to each other through a network, usually an interconnection network. They do their processing using their local memory units, and through the interconnection network they exchange information; they communicate with each other by passing messages.
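As a concrete illustration of processors with no shared memory communicating by exchanging messages, here is a minimal sketch of my own in which two processes pass data through a pipe; the pipe stands in for the interconnection network, and the message text is illustrative.

```c
/* Two processes with no shared memory exchange data through a
 * pipe, the way distributed-memory machines exchange messages. */
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd[2];
    pipe(fd);                            /* the "network" link */

    if (fork() == 0) {                   /* child: one "processor" */
        char buf[64];
        ssize_t n = read(fd[0], buf, sizeof buf - 1);
        if (n < 0) _exit(1);
        buf[n] = '\0';
        printf("received: %s\n", buf);
        _exit(0);
    }

    const char *msg = "partial result";  /* parent: another "processor" */
    write(fd[1], msg, strlen(msg));
    wait(NULL);
    return 0;
}
```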
Now, a further classification in the context of MIMD: there are two terms used here, loose coupling and tight coupling. Physically, in a loosely coupled system, communication is done at the I/O level, through I/O, while in a tightly coupled system, as we have seen, memory is shared, so communication happens at the memory access level. In loosely coupled systems there is no shared primary memory. Let me remind you what primary memory means: in a computer system you have different types of memory, primary and secondary; secondary memory is disk storage, while primary memory is where instructions are fetched from when the processor executes a program, which is why it is given the special designation primary memory. A tightly coupled system, by contrast, has shared primary memory and overlapping of address space. What do we really mean by this? When we discussed virtual memory, we saw that memory is organized in pages, and there are flag bits that can be set to allow a page to be shared by other users, or to disallow such sharing. Whenever a page is allowed to be shared by other users, that essentially leads to overlapping of address space: the same page can be accessed by multiple programs or multiple processors. In a loosely coupled system there is no shared address space and no overlapping of address space: the memory pages of one process cannot be accessed by others. That is the distinction from the viewpoint of physical interconnection. Logically, in a loosely coupled system there is autonomy of processes, a kind of cooperative effort, whereas in a tightly coupled system there is some kind of master-slave relationship between processes, and word-by-word interaction is possible. So the distinction between loosely coupled systems and tightly coupled systems in the context of MIMD has been discussed. Very briefly, let me now talk about shared memory and distributed memory. Shared memory, located at a centralized location, may consist of several interleaved modules, all at the same distance from every processor; that is the reason it is called the uniform memory access model, the UMA model: each processor takes the same time to access it. In UMA you have multiple processors sharing memory through a bus, so each of them takes the same amount of time, and that is why the term uniform memory access model; this model is very popular.
On the other hand, when memory is distributed to each processor, it improves scalability and gives lower latency for access to local memory, but higher latency for access to remote memory. That means when a processor accesses its own local memory, the latency is very small; but if that processor is allowed to access another processor's memory, the access must go through the interconnection network and will obviously take longer. So the access time is not uniform in such a situation, and that has led to the model known as non-uniform memory access, NUMA. With distributed memory we can use a message-passing architecture, where no processor can directly access another processor's memory and communication is done by passing messages; or we can use distributed shared memory, where the memory is distributed but the address space is shared, which, as I have already mentioned, can be done with the help of the virtual memory concept. The typical diagrams of the two models are as follows. In the uniform memory access model, there is a bus, and sharing occurs through the main memory, the primary memory; each processor has its own cache memory, but the memory that is shared is the main memory, and all processors are connected to it through the bus, so the access time is uniform for every processor. In the non-uniform memory access model, on the other hand, each processor with its built-in cache is connected to its own main memory, and these are connected through an interconnection network; when sharing is done, a processor takes much less time to access its own memory than another processor takes to access that same memory through the network, and in general the access time differs from processor to processor and from memory module to memory module. With this, we have come to the end of today's lecture. We have discussed what a process is, what a thread is, and the relationship between process and thread, and we have also discussed the different processor models like SISD, SIMD, and MIMD. With this we conclude today's lecture. Thank you.