Dear viewers, I welcome you to this lecture series on high performance computer architecture. In 40 lectures, I shall try to provide an overview of computer architecture at different levels and of various aspects of advanced computer architecture. Today we start with the first lecture, titled Introduction and Course Outline. In this lecture, I shall introduce the course and give an outline of the different topics that I shall cover.

Here is the outline of today's lecture. First, I shall give a historical background. Today we find an array of powerful computers surrounding us, serving different purposes and meeting the needs of our daily life. This did not happen in a single day; it has taken many years, perhaps more than a hundred years, to reach this stage, and I shall trace the evolution that has brought us here. In this respect, I shall discuss the five generations of computers. Then I shall talk about the elements of modern computers, the different components that build a modern computer. Then I shall introduce the instruction set architecture, which is essentially the programmer's view of the processor, and the instruction set processor, which represents the way the processor is realized. Then I shall talk about some related topics such as Moore's law, which has been the driving force behind this gradual evolution of computers. Then I shall discuss parallelism at different levels; as we shall see, parallelism is the key idea for reaching higher performance, and I shall discuss how parallelism has been incorporated at different levels. Finally, I shall state the objective of the course and give an outline of the course.

If you look at the history of computers, as I mentioned, it encompasses more than a hundred years; computing equipment has been available for hundreds of years. The period can be divided into two eras: the mechanical era, prior to 1945, and the electronic era, after 1945. In the mechanical era, computers were built using mechanical components such as wheels and gears. For example, the abacus, developed in China, dates back to about 500 BC and was used for calculation with numbers. A mechanical adder-subtractor machine was built by Blaise Pascal in France in 1642; this again belongs to the mechanical era because no electronic component was used in building it. Similarly, the difference engine for polynomial evaluation was designed by Charles Babbage in England in 1822. A binary mechanical computer was developed by Konrad Zuse in Germany in 1941, and an electromechanical decimal computer, the Harvard Mark I, was developed by Howard Aiken at Harvard in 1944, built in collaboration with IBM. So much for the mechanical era.

Coming to the electronic era, we can divide it into five generations. The first generation came before the invention of the transistor; the electronic building block used was the vacuum tube, and relay memories were also used. So what is a vacuum tube?
A vacuum tube is a small tube with a filament that emits electrons; by controlling the flow of electrons, that is, by controlling the flow of current, you can realize the two states, on and off, and that is how you can realize a digital computer. So vacuum tubes were used, along with relay elements, which are essentially switches, to store information, that is, as memory. Built from vacuum tubes, these computers were very large in size, usually occupying entire rooms, and they dissipated a lot of power, although their computing power was much smaller by today's standard, perhaps a thousand times slower and less powerful than the PCs we use nowadays. These computers were essentially single-user systems: the operating system could allow only one user to use the computer at a time. As far as programming languages were concerned, things were at a very primitive stage; you could use only machine or assembly language. Writing a program in machine or assembly language is very painful, so programming was a big challenge, and these computers were not very user friendly. Examples of this type of computer are the ENIAC, the Princeton IAS machine, and the IBM 701. These first-generation computers were primarily developed to meet the requirements of the Department of Defense of the USA: they wanted to calculate the trajectory of a shell launched from a warship and where it would fall, and to do that calculation you need a computer. So these computers were primarily developed for that purpose, in the 1940s.

Then came the second generation of computers, between 1955 and 1964. In this decade computers were built using transistors: the transistor had been invented and was used, along with diodes, for the realization of computers, and as far as memory is concerned, magnetic ferrite cores were used to store information. High-level languages came into use, with compilers; that means you could now write programs in a high-level language, which was a very big step for the users. These machines also allowed batch processing: whereas earlier computers served a single user, now a computer could be used by different users one after the other. Computers like the IBM 7090, CDC 1604, and Univac LARC are examples of second-generation computers built using transistors and diodes.

Then came the invention of integrated circuits. By an integrated circuit we mean that more than one electronic component, such as transistors and diodes, can be put on a single silicon wafer. Depending on the number of active devices you can put on a single silicon wafer, integrated circuits are categorized into different types: small scale integration (SSI), where you can put one to ten active devices; medium scale integration (MSI), where you can put tens to hundreds of devices such as transistors and diodes; and large scale integration (LSI), where you can put thousands of devices.
Nowadays we are in the era of VLSI, where you can put millions, even billions, of transistors or active devices on a single IC. This evolution of integrated circuits led to the development of powerful third-generation computers, although in the third generation only SSI and MSI devices were used. As far as operating systems are concerned, multiprogramming and time-sharing operating systems were used, so a single computer could be used by a large number of users, each of them having the feeling of being the sole user of the computer. In this category were the IBM 360 and 370 mainframe computers, which became very popular and were widely used throughout the world. In addition, the CDC 6600, the Texas Instruments ASC, and the PDP-8 are third-generation computers developed by different manufacturers and widely used.

Then came the fourth generation. As I said, with the advancement of VLSI technology, fourth-generation computers used VLSI circuits and as a consequence were very powerful. As far as operating systems are concerned, multiprocessor operating systems were used: with these fourth-generation machines it was possible to realize multiprocessor chips, a single IC containing multiple processors. Various high-level languages flourished, and parallel processing became popular on these fourth-generation computers; I shall discuss the different types of parallel processing possible in the course of this lecture. Representative fourth-generation computers are the IBM 3090, VAX 9000, and Cray X-MP.

Then came the fifth generation, which started around 1991 and continues to the present day. These machines use ULSI (ultra large scale integration) circuits and VHSIC (very high speed integrated circuit) technology to realize massively parallel processing. Earlier, as far as multiprocessor parallel processing was concerned, all the processors were homogeneous, that is, the same type of processor was used throughout; but in the fifth generation it became possible to have heterogeneous processors, where one processor may perform graphics processing, another integer processing, and so on, and different types of heterogeneous processors are combined to realize a computer system. Examples of fifth-generation computers are the Intel Paragon, Fujitsu VPP 500, and Cray MPP; obviously this list is not complete, and there are many more.

Coming to the elements of modern computers: a modern computer is an integrated system consisting of machine hardware, system software, and application programs. These are the three important components that make a computer usable: number one, the hardware; second, the system software; and third, the various application programs. This can be represented with the help of nested circles: hardware is at the core or center of the three circles, and sitting between the application software and the hardware is the system software.
The system software is essentially the operating system, and different types of operating systems have evolved over the years. The system software, or operating system, allows the application software to run on the hardware. To the system software, and to the user or programmer, the functionality of the processor is characterized by its instruction set: a processor can execute a set of instructions, which can be used to write programs. A program can be considered a sequence of instructions, so you pick instructions from the instruction set and realize a program. The programmer's view of the processor is essentially the instruction set, and that is why it is called the instruction set architecture. The instruction set architecture, ISA in short, plays a very important role; it is a kind of abstraction that all programmers can work with. This predefined instruction set is called the instruction set architecture.

Now, this ISA, or instruction set architecture, serves as an interface between the hardware and the software, as I have told you. On one side you have the hardware, and on the other you have the different types of software, the system software and the application software, and the instruction set architecture serves as the interface between them. In terms of processor design methodology, an ISA can be considered the specification of a design: whenever you go about designing a system you have to provide a specification, and the ISA can be considered that specification. It is a behavioral description, of what the processor can do. The synthesis step then attempts to find an implementation based on that specification, and the processor is an implementation of the design. How the processor is constructed is the concern of instruction set processor design, and this is also referred to as the microarchitecture. The realization of an implementation, a specific physical embodiment of a design, is done nowadays using VLSI technology. In this course we shall discuss how this is done, but first, as we shall see, we shall introduce the instruction set architecture.
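To make this concrete, here is a minimal C sketch, not taken from the lecture, of the view just described: the ISA is a predefined set of instructions, and a program is simply a sequence of instructions drawn from that set. The opcode names and the three-instruction program are made up for illustration and do not correspond to any real ISA.

#include <stdio.h>

/* A toy, hypothetical instruction set: the ISA is just the list of
   operations the programmer is allowed to use. */
typedef enum { LOAD, ADD, STORE, HALT } Opcode;

typedef struct { Opcode op; int arg; } Instruction;

int main(void) {
    int acc = 0;
    int memory[4] = {5, 7, 0, 0};

    /* A program is simply a sequence of instructions drawn from the
       ISA: load memory[0], add memory[1], store the sum in memory[2]. */
    Instruction program[] = { {LOAD, 0}, {ADD, 1}, {STORE, 2}, {HALT, 0} };

    /* Any hardware (or interpreter) that gives these instructions the
       same meaning implements the same ISA, whatever its internal
       organization looks like. */
    for (int pc = 0; program[pc].op != HALT; pc++) {
        switch (program[pc].op) {
        case LOAD:  acc = memory[program[pc].arg]; break;
        case ADD:   acc += memory[program[pc].arg]; break;
        case STORE: memory[program[pc].arg] = acc;  break;
        default: break;
        }
    }
    printf("memory[2] = %d\n", memory[2]);  /* prints 12 */
    return 0;
}

Any two machines that interpret this same instruction sequence identically are, at this toy level, binary compatible, however differently their internals are built; that is exactly the distinction we turn to next.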
Before I proceed further, it is essential to distinguish between two terms which you will encounter quite often: architecture and organization. What is the difference between the two? In computer architecture, the word architecture is a short form of instruction set architecture; as I mentioned, this is the programmer's view of the processor. It specifies the different instructions the processor can perform, that is, data transfer and data manipulation: addition, subtraction, multiplication, and division, and move instructions to transfer data between processor and memory or between processor and I/O, loads and stores, and so on. It also gives an idea of the different registers which you can use for storing information temporarily while executing a program; registers provide intermediate storage. Then you have the various addressing modes, which allow you to access operands in various ways from registers and memory, and also from I/O. So the architecture is essentially the programmer's view of the processor.

The organization, on the other hand, represents the high-level design, that is, the way the processor is implemented: how many cache memories it has, how many arithmetic and logic units it has, what type of pipelining is used, what type of control design is used, whether the control unit is hardwired or microprogrammed, whether the processor is single-cycle, multi-cycle, or pipelined, and so on. These are all decided as part of the organization. This is also called the microarchitecture.

So the structure of a computer that a machine language programmer must understand, to be able to write a correct program for that machine, is the architecture, and a family of computers of the same architecture should be able to run the same programs. The same instruction set architecture can be realized by a series of machines or processors, implemented in different ways: one may be realized using transistor circuits, another using integrated VLSI circuits. But as long as they execute the same instruction set architecture, there is a kind of binary compatibility, meaning a program written for one machine can run on another. This is a very important concept, and binary compatibility plays a very important role whenever we design computers and move from one generation of processors to the next: you have to take binary compatibility into consideration when realizing the next-generation processor, so that the programs developed for the earlier generation need not be thrown away and can still be used.

Now, coming to one very important concept, Moore's law. Gordon Moore, one of the co-founders of Intel, proposed a law based on his observations. Computer performance has been increasing phenomenally over the last five decades, and this performance improvement is, you can say, an outcome of Moore's law. What does Moore's law state? It states that the number of transistors per square inch roughly doubles every 18 months. Moore's law is not exactly a law, but this rule of thumb has held good for nearly 50 years. Gordon Moore stated it back in 1965 in a famous paper he wrote for Electronics magazine: he was asked to write an article predicting the future of electronic circuits, and the title of the article was "Cramming More Components onto Integrated Circuits", published in the April 1965 issue of Electronics magazine. In that article he predicted that the transistor density of minimum-cost semiconductor chips would double roughly every 18 months, and obviously transistor density is correlated to processor speed, as we shall see.
This figure shows how Moore's law has remained valid for about 50 years. Here only the Intel processors are shown; on the y-axis you have the number of transistors, and on the x-axis the years. As we go from 1970 to 2010, the number of transistors grows from about a thousand to many millions: starting from the simple Intel 4004, which had a few thousand transistors, to present-day dual-core and multi-core processors requiring billions of transistors. This shows how Moore's law has really influenced the growth of computers.

Moore's law is not just about the density of transistors that can be achieved on a chip, but about the density of transistors at which the cost per transistor is lowest. That means it speaks not only of how many transistors you can fabricate but of how economically you can do the fabrication: you have to realize ICs with more transistors in a cost-effective way. As more transistors are put on a chip, the cost to make each transistor reduces, but the chance that the chip will not work due to a defect rises; this problem is there, and it has to be taken care of by reliable processing technology. Moore observed in 1965 that there is a transistor density, or complexity, at which a minimum cost is achieved, and based on this he proposed the law which has become famous.

We can say that the initial computer performance improvements came from innovative manufacturing techniques, the advancement of VLSI technology, based on Moore's law you can say. Improvements due to innovations in manufacturing technologies have, however, slowed down since the 1980s, for two reasons: number one, smaller feature sizes give rise to increased resistance, and second, larger power dissipation. As we shall see, a Pentium processor dissipates about 100 watts, and 100 watts from a single IC is a very large power dissipation; this power has to be removed with the help of suitable packaging and cooling techniques, and as you put more and more transistors on a chip, the power dissipation increases and the cost of packaging and cooling increases. This is one factor that limits the increase in the number of transistors on a chip.
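As a small back-of-the-envelope check of the 18-month doubling figure against the graph just described, here is a C sketch. The starting point of about 2,300 transistors for the Intel 4004 in 1971 is a known figure; consistent with the slowdown just mentioned, this simple extrapolation overshoots the actual counts in the later decades.

#include <stdio.h>
#include <math.h>

/* Moore's law as stated in the lecture: transistor count roughly
   doubles every 18 months, so count(t) = count0 * 2^(months / 18). */
int main(void) {
    double count = 2300.0;  /* Intel 4004, 1971 */
    for (int year = 1971; year <= 2011; year += 10) {
        printf("%d: about %.3g transistors\n", year, count);
        count *= pow(2.0, 120.0 / 18.0);  /* one decade = 120 months */
    }
    return 0;
}

By 2001 this extrapolation already gives a couple of billion transistors, whereas real chips reached that level only around 2010, which is exactly the slowdown mentioned above.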
As I mentioned, a decade ago chips were built using 500 nanometer technology, 0.5 micron technology you can say; in 1971 a 10 micron process was used, and most processors are currently fabricated on a 65 nanometer or smaller process. Nowadays, because of the advancement of VLSI technology, thanks to Moore's law, the feature size has reduced progressively from 500 nanometers down to deep submicron dimensions, maybe 45 or 32 nanometers. Intel demonstrated a working 45 nanometer chip in January 2007 and started mass production based on the 45 nanometer process in late 2007. It is very easy to say 45 nanometers or 32 nanometers, but think about the diameter of an atom: the diameter of an atom is of the order of 0.1 nanometer, so we are getting very close to atomic dimensions. The precision at which VLSI fabrication takes place nowadays is not far from the diameter of an atom.

This table shows the amazing decades of microprocessor evolution. From 1970 to 1980 the transistor count went from about 2K to 100K, the clock frequency from 0.1 to 3 megahertz, and the number of instructions per cycle was about 0.1, which means roughly 10 cycles were required to execute an instruction. From 1980 to 1990 the number of transistors increased from 100K to 1 million, the clock frequency from 3 to 30 megahertz, and the number of instructions per cycle rose from 0.1 to about 0.9, so you could more or less execute one instruction per cycle. From 1990 to 2000 the number of transistors increased from 1 million to 100 million, the clock frequency from 30 megahertz to 1 gigahertz, and the number of instructions per cycle reached 0.9 to 1.9. You may be asking how it is possible to execute more than one instruction per cycle; that can be done using superscalar architecture, and later on we shall discuss it in detail. Then from 2000 to the present day you can have 100 million to 2 billion transistors on a single chip, clock frequencies from 1 to 15 gigahertz (1 gigahertz means 10 to the power 9 hertz), and the number of instructions per cycle between 1.9 and 2.9, and it can be still more. So processor performance has become twice as fast roughly every 2 years, and memory capacity has doubled roughly every 18 months, following Moore's law.

I should also mention two other names, Mead and Conway. They described a method for creating hardware designs by writing software. Whenever the number of transistors in a processor increases, it is no longer possible to do the design manually, so you require automated techniques: using a hardware description language, you can automate the design of a processor. Whenever you design processors with millions of transistors you have to use CAD tools, computer-aided design tools, and Mead and Conway pioneered this approach of creating hardware designs by writing software. That was an important step.

Now the question arises: how can we improve performance further? Initially, as we have seen, the improvement came from the advancement of VLSI technology, but in subsequent years the performance improvement came from the exploitation of some form of parallelism. The first form is instruction level parallelism. What do we mean by instruction level parallelism? Normally the execution of instructions takes place serially: you fetch an instruction and then execute it, so the instruction cycle can be divided broadly into two parts, instruction fetch plus instruction execution.
First you fetch one instruction and then execute it; then you fetch the second instruction and execute it; and in this way execution proceeds serially, with no parallelism. Subsequently, techniques were developed for instruction level parallelism, and pipelining is the first technique to exploit it. In pipelining, as we shall discuss in detail later, the instructions are executed in an overlapped manner: while the execution of instruction one is going on, you can perform the fetch of instruction two; while the execution of instruction two is going on, you can perform the fetch of instruction three; and in this way it goes on. Later on I shall discuss this in more detail. There are also other techniques, such as dynamic instruction scheduling, in which multiple instructions are scheduled dynamically with the help of hardware, particularly where you have multiple execution units, and out-of-order execution can be performed; then the superscalar processor, where you have multiple processing elements within a single processor; and then the VLIW architecture, in which the compiler packs several operations into a single long instruction word, which can be fetched as one instruction and executed in parallel. These are the instruction level parallelism techniques, and I shall discuss them in detail in my lectures.

Then there is a second level of parallelism, thread level parallelism, which you can say is medium-grained: different threads are executed in parallel on a single processor or on multiple processors. You are familiar with what a process is: a process is nothing but a program in execution. A single process can have multiple threads, which can be executed in parallel if you have multiple processing units. For example, in a superscalar architecture you have multiple processing elements, so different threads can be executed on different processing elements. Suppose you are executing a loop: different iterations of the loop can be considered different threads, which can be executed on different processing elements of a single processor, so a loop can have multiple threads. Thread level parallelism is thus medium-grained, compared to instruction level parallelism, which is fine-grained. So we can categorize parallelism into three types: first, instruction level parallelism, or ILP in short, which is fine-grained; second, thread level parallelism, which is medium-grained; and third, as we shall see later, process level parallelism, which is coarse-grained. Simultaneous multithreading is a technique for improving the overall efficiency of superscalar CPUs using hardware multithreading: in a single processor with multiple functional units, as in a superscalar processor, you can have hardware multithreading.
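As a minimal sketch of this loop-iterations-as-threads idea, assuming a POSIX system with pthreads, the following C program splits the iterations of a summation loop across two threads; the function and variable names are made up for illustration.

#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 2

static long data[N];
static long partial[NTHREADS];

/* Each thread executes its own share of the loop iterations:
   thread-level parallelism within a single process. */
static void *sum_part(void *arg) {
    int id = *(int *)arg;
    long s = 0;
    for (int i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    int ids[NTHREADS];

    for (int i = 0; i < N; i++) data[i] = 1;

    /* Two threads of one process, runnable in parallel on
       different processing elements or cores. */
    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&t[i], NULL, sum_part, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    printf("sum = %ld\n", partial[0] + partial[1]);  /* prints 1000000 */
    return 0;
}

On a superscalar or multi-core processor the two threads can run on different processing elements; compile with -lpthread on a typical Linux system.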
That means thread level parallelism is exploited with the help of hardware, and this is known as SMT, or simultaneous multithreading. Then, of course, you can have software-level multithreading on multiple processors or cores, which is coarse-grained multithreading. This brings us to the symmetric multiprocessor, or SMP, a very popular shared-memory multiprocessor architecture. Here, as you can see, you have multiple processors, each with a private cache, and all of them are connected through a bus to the main memory and the I/O. The main memory and I/O are shared, and the sharing is done through a common bus. It is called symmetric multiprocessing because each of the processors takes the same time to access the main memory and the I/O: the access time for main memory and I/O devices is symmetric, or uniform, for all the processors. So it is also called a uniform memory access (UMA) architecture.

At the process level, as I mentioned, parallelism is coarse-grained: different processes can be executed in parallel on multiple processors. You can use symmetric multiprocessors, as I have already shown, or you can have distributed memory multiprocessors, as you can see in this diagram. In the model I have already shown, each processor has a private cache and there is a shared bus through which the main, shared memory is accessed; this is called uniform memory access because each processor accesses memory in a uniform manner, and it is also called a symmetric multiprocessor, as I have told you. Then you have non-uniform memory access (NUMA). Here the main memory is distributed: when a processor accesses its own local memory, the access time is obviously small, but when, say, processor 2 tries to access the memory attached to processor 1, it has to do so through an interconnection network, and the access time will be longer. This access time can also be variable, depending on the type of network being used and its availability. So this is the distributed memory multiprocessor, where you have multiple processors and the memories attached to the different processors are accessed through an interconnection network, which can also be a local area network.

Now, what is the objective of this course? I have broadly given an idea of the different types of processors, and we have discussed the different types of parallelism possible, namely instruction level parallelism, thread level parallelism, and process level parallelism, and the different processor architectures. What, then, is the objective of this advanced computer architecture course? We shall see that modern processors such as the Intel Pentium, AMD Athlon, and so on use many architectural and organizational innovations that are not covered in a first-level course. Every student attending this course must have attended a first-level course on computer architecture and organization; in that first-level course these advanced topics were not covered, and it is very essential for computer science students to learn the details of these processors.
In particular, we shall study the various innovations used in implementing these processors, as well as innovations in memory, bus, and storage design. As we shall see, we shall discuss hierarchical memory organization and the way the performance gap between memory and processor can be bridged. We shall also see how multiprocessors can be realized and how clusters can be implemented. In a single sentence, the objective of this course is to study the architectural and organizational innovations used in modern computers.

Now let me give you an outline of the course. The course has been divided into 5 modules. In module 1, I shall present a review of basic organization and architectural techniques: the fundamentals of the different processor types, what a RISC processor is, what a CISC processor is, the characteristics of RISC processors and the differences between RISC and CISC processors, and the classification of instruction set architectures. We shall also discuss the way the performance of processors is measured: whenever we say high performance, the question naturally arises how you really measure performance, so we shall review the techniques by which performance can be specified and measured. Then I shall discuss the basic parallel processing techniques, as I have already told you, at the instruction level, thread level, and process level, and I shall also discuss the classification of parallel architectures, the various parallel processor architectures that are possible.

Coming to module 2, I shall focus on instruction level parallelism. As I mentioned, the first approach that exploits instruction level parallelism is pipelining, so the basic concept of pipelining will be introduced, and based on that concept we shall discuss arithmetic pipelines, that is, how different arithmetic operations such as floating point addition and integer multiplication can be pipelined. Most important, however, are the instruction pipelines used in all modern processors, so I shall discuss instruction pipelining in greater detail. In particular, whenever you go for instruction pipelining you want to utilize each and every processor cycle, but unfortunately, because of various types of dependences, such as data dependence, control dependence, and structural dependence, hazards arise that cannot always be avoided. I shall discuss the three different types of hazard, namely structural hazards, data hazards, and control hazards, and also the various hazard resolution techniques that can be used, as quantified in the sketch below. Then, as I mentioned, I shall discuss dynamic instruction scheduling, which is important in the context of superscalar architecture, and also branch prediction techniques, which relate to control hazards: how we can predict branches and minimize the effect of control hazards.
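Before moving on, here is a small back-of-the-envelope C calculation, using the standard textbook cycle counts rather than anything specific from the lecture, of what ideal pipelining buys and what hazard stalls cost; the values of n, k, and the stall count are made-up examples.

#include <stdio.h>

/* Classic textbook cycle counts: n instructions on a k-stage pipeline
   complete in k + (n - 1) cycles if nothing stalls the pipe; every
   stall cycle caused by a hazard adds directly to the total. */
int main(void) {
    int n = 100;     /* instructions (made-up example)   */
    int k = 5;       /* pipeline stages                  */
    int stalls = 20; /* cycles lost to hazards (made up) */

    int serial    = n * k;        /* one instruction at a time  */
    int pipelined = k + (n - 1);  /* ideal overlapped execution */

    printf("serial execution : %d cycles\n", serial);
    printf("ideal pipeline   : %d cycles\n", pipelined);
    printf("with stalls      : %d cycles\n", pipelined + stalls);
    printf("ideal speedup    : %.2f\n", (double)serial / pipelined);
    return 0;
}

The ideal speedup approaches the number of stages k for large n, and every stall cycle eats directly into it, which is why the hazard resolution techniques of module 2 matter so much.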
Then we shall discuss instruction level parallelism using software approaches, and I shall discuss superscalar techniques and speculative execution and highlight how these various techniques have been implemented in modern processors; I shall review some of the modern processors.

Coming to module 3, I shall discuss memory hierarchies. As I mentioned, the speed of processors has been increasing, and later on we shall see that the speed of memory is not increasing at the same rate. So how do you bridge the gap? To bridge this performance gap, one important approach is hierarchical memory organization, where memory is organized in a hierarchy in terms of speed: the memory closest to the processor is the cache memory, then you have the main memory, and the third type is the secondary memory. I shall discuss these different types of memories: main memory, cache memory design and implementation, virtual memory design and implementation, and secondary memory technology. I shall also discuss RAID, redundant arrays of independent disks, which is used in the context of secondary memory, and how it is used to improve reliability as well as performance.

Coming to module 4, I shall discuss thread level parallelism, and in this context we shall discuss centralized versus distributed shared memory architectures, various interconnection topologies, multiprocessor architectures, and symmetric multiprocessors. In this context there arises a problem known as the cache coherence problem: whenever you have private caches and shared memory, inconsistency can arise in the information stored in memory, and I shall discuss how it is overcome. Then there is multi-core architecture, which is essentially an extension of multiprocessors in which the different processors are implemented on a single chip, and I shall give a review of modern multiprocessors.

Coming to the fifth module, where process level parallelism will be considered, I shall discuss distributed memory computers, their different types and alternatives, and three types of computing which are becoming increasingly popular: cluster computing, grid computing, and cloud computing. These I shall cover under process level parallelism. With this we have come to the end of this lecture; in the next lecture we shall start with the instruction set architecture. Thank you.