The hardware of a computer system can usefully be described as composed of three basic parts. First, you have the CPU, the central processing unit; second, the system memory; and third, what's called IO, short for input-output, which comprises everything else: your hard drive, your display monitor, your keyboard, your mouse, everything other than the CPU and the memory. All of these components, of course, need to get hooked together, so we plug them into what's called a main board, or sometimes a motherboard. The main board provides power to the components and communication pathways between them.

This reductive description, CPU, memory, and IO, really does apply to any computer, though we sometimes make distinctions between different classes of systems. For example, what's called a client, or sometimes a workstation, is a single-user system, like a desktop or laptop. What's called a server often has no direct human-interface devices, like a keyboard or monitor, because servers mainly communicate with other systems over the network. There is truthfully no hard and fast distinction between client and server systems: you can stick your laptop in the back room and use it as your server. However, clients and servers typically have different performance needs: whereas a desktop might have a fancy graphics processor to play games, a server might have a large amount of memory to better store a large database. Hand-held systems like smartphones and embedded systems like your car's computer are also composed of a CPU, memory, and IO devices. Again, the distinction is mainly in the form factor, the size and shape of the casing, and in the performance needs. For example, your toaster doesn't need much processing power, and your smartphone needs power efficiency much more than your desktop PC does. The term mainframe today is at best nebulous and at worst totally obsolete.
The large majority of mainframes sold today are sold by IBM, and typically these systems are about the size of a large cabinet. Unlike most servers in use today, which have a standard PC architecture, mainframes have custom architectures optimized to handle a lot of network traffic and data processing. Mainframes have always been used only by larger organizations, but even large organizations have increasingly moved away from mainframes in the last two decades because they can meet the same needs using an array of cheap commodity server PCs.

The term supercomputer is also not as meaningful as it once was. A supercomputer is simply a computer capable of doing a whole lot of computation, most commonly for the purpose of scientific simulations, like, say, modeling weather patterns. A few decades ago, supercomputers were always very specialized hardware, but in the last two decades, the trend has been to build supercomputers out of many off-the-shelf processors thrown together into one system. Now, getting a bunch of processors to work together in one system efficiently is not a trivial matter, but it's considerably easier than it once was. In fact, you can build a passable supercomputer today by networking a bunch of commodity PCs together to form what's called a cluster.

Of all the components in the system, the RAM, the system memory, is the simplest. RAM is really just a big bucket for storing bits, and as far as the CPU can see, these bits are organized into bytes, each with its own address, a numeric value that uniquely identifies that byte. The first byte has the address zero, the second byte has the address one, the third has address two, and so on, all the way up to the last byte. So if the CPU wants to read or write a byte in RAM, it specifies the byte by its numeric address. A notable characteristic of RAM is that it's volatile, meaning that as soon as a RAM chip loses its steady stream of power, the content of the RAM gets scrambled.
Without power, none of the bits can be relied upon anymore, because they get flipped unpredictably. What this means in practice is that when you turn off your computer, the contents of RAM are lost nearly instantly. The next time the computer is turned on, the contents of memory start off in a total jumble, as a bunch of random garbage. This explains why the operating system must be loaded from disk every time you power on the system. Despite this annoying characteristic of volatility, we use RAM chips for system memory instead of, say, a hard drive because RAM chips are much, much faster to read and write. As code runs, we want the CPU to access the instructions as fast as possible, so usually we copy all of a program's code into memory off of slower storage mediums like hard drives before running the program. RAM can also be used to store any data created by a program as it runs, but of course, anything which the program wants to store permanently must be copied to a non-volatile storage device like a hard drive.

If we wanted to fully understand how CPUs work, we would have to get into a lot about circuitry and electricity, as well as material science to understand how CPUs are manufactured. As programmers, though, we don't really need to know how CPUs work as long as we understand what they do. What programmers care about in regards to a CPU is its so-called programming model, the abstraction presented to programmers that papers over the messy details of circuitry and voltages and so forth. So first of all, what a CPU does from the programmer's perspective is execute binary instructions, which are sequences of bits, typically around 8 bits in size on the low end and around 256 bits in size on the high end, with most somewhere in between.
The way to think of these instructions is that the CPU is hardwired to read one instruction after another and to act upon each instruction differently. For example, if the binary sequence 10110011 denotes the start of an addition instruction, the CPU is hardwired to perform an addition operation when it reads an instruction starting with that sequence. Again, how exactly this works in circuitry is not something we'll concern ourselves with here. The binary sequence which denotes any particular instruction is largely arbitrary, and so different CPUs understand different sets of instructions. For example, the binary sequence denoting an addition instruction on one CPU may not be a valid instruction on another CPU. In fact, one CPU may have instructions which another CPU does not have at all. For instance, some simpler CPUs have no instruction for performing multiplication, so code on those processors must perform multiple additions to get the same effect as a multiplication operation.

Still, every CPU needs instructions for a few essential tasks. First, every CPU needs one or more instructions for copying bytes from one location to another, mainly from one part of memory to another. Second, every CPU needs instructions for doing basic arithmetic, at the very least addition and negation. Third, every CPU needs instructions for performing logical operations, namely NOT, AND, OR, and XOR (exclusive or), which we'll explain in a later unit. The gist of it is that with these instructions, we can arbitrarily manipulate the individual bits of bytes, such as setting a specific bit to zero or one. Fourth, every CPU needs instructions for performing jumps. As we described, the CPU is hardwired to execute instructions, and it does so one after another, starting from some place in memory and reading the instructions there sequentially.
A jump instruction orders the CPU to execute instructions at an address specified in the instruction, effectively jumping from one place in code to another. Crucially, CPUs need jump instructions which are conditional, meaning that they perform the jump only if a certain condition is true. For example, a conditional jump instruction might perform its jump only if all the bits in a byte of memory are zero. What conditional jumps effectively enable programs to do is make decisions, to branch down one path of code or another based upon the state of data.

A register is a small, volatile data storage area inside the CPU. The CPU's registers can be categorized into two kinds, status registers and general purpose registers. A status register stores data that affects the operation of the CPU. For example, some CPUs operate in different modes, and so such a CPU will typically have a status register in which the bits designate the current mode. For another example, every CPU needs to keep track of the memory address of the next instruction to read, and it does so in a status register called the program counter. In fact, what a jump instruction really does is modify the program counter, thereby causing execution to jump to the new address found there. In contrast to a status register, a general purpose register is for storing any data. Again though, these registers are typically very small, around 16 to 128 bits in size, and even high-end CPUs contain only up to a few hundred general purpose registers, while low-end CPUs may contain only three or four. Today's x86 processors contain a few dozen registers, which is on the low end for high-performance processors. So if these CPU registers can't store much data, why do we have registers? Well, most CPUs can only perform operations on data in the registers, not in memory. Addition instructions, for instance, generally can only add numbers in the registers, not numbers out in the system memory.
So to add two numbers in memory, we must first copy them to separate registers, and the result of the addition operation gets written to a third register, from which we may then copy the result out to memory if we wish. This then is the general pattern of code: to do work, we bring data from memory into the registers, perform an operation, then copy the result of the operation out to memory as needed. In fact, most CPUs don't even allow us to directly copy bytes from one part of memory to another. Instead, we must copy the bytes from memory to registers, then copy those registers to the desired destination in memory.

So a CPU's programming model primarily consists of its instruction set, the precise set of instructions which the CPU is hardwired to understand, and its set of registers, their number, their sizes, and their purposes. Together, these two facets of a CPU are often called its ISA, its instruction set architecture. If you have two processors which both support the same ISA, then they both can run the same code, even if the processors were made by different manufacturers. For example, the PC platform is built around Intel's x86 instruction set architecture, and for a long time, up until about the mid 1990s, the only x86 processors you could buy were made by Intel itself, but then AMD came along and started producing x86 processors as well. So x86 is one very dominant ISA today, but there's also ARM, which is very popular today in mobile devices. Some other successful ISAs include MIPS and Motorola 68K. An instruction set architecture tends to develop and grow over time. The x86 ISA, for instance, started in the 1970s, but has evolved since as Intel and AMD have released new processors. Their newer processors support the instructions of their older ones, and they have all the same registers, but they also support additional instructions and include additional registers.
So x86 is more accurately an umbrella term that covers a family of backward-compatible ISAs. To illustrate, when Intel released the 386 processor, they designated that upgrade of the x86 ISA as IA-32, as in Intel Architecture 32. And when AMD released the first 64-bit x86 processor in the early 2000s, the new upgraded ISA became commonly known as x86-64, and Intel soon adopted it as well. But again, these upgrades preserve the old instructions and registers, so if you get a new Intel 64-bit processor today, it'll still run code written for IA-32.

You're probably familiar with the Jonathan Swift book Gulliver's Travels, or at least you've probably seen the Disney cartoon. A part in the book not depicted in the cartoon is that in the land of Lilliput, the Big-Endians are at war with the Little-Endians over whether to crack eggs from the big end or from the little end, the joke being that the choice is totally arbitrary. CPU designers have a similarly arbitrary choice to make concerning how the bytes in a register get copied to and from memory. Say we have a 32-bit register with the bytes, in hex, 0A, 0B, 0C, 0D. As a binary number, 0A is the most significant byte, the byte representing the most significant digits. The question is, when we copy the register contents to some address N in memory, do we copy the most significant byte 0A to N, 0B to N plus 1, 0C to N plus 2, and 0D to N plus 3, or do we copy in the opposite order, copying the least significant byte 0D to N, 0C to N plus 1, 0B to N plus 2, and 0A to N plus 3? A CPU that starts with the most significant byte uses the big-endian scheme, and a CPU that starts with the least significant byte uses the little-endian scheme. In both cases, the order is maintained when copying from memory to a register, so for example in the big-endian scheme, copying the 4 bytes at address N to a 32-bit register copies the byte at N to the most significant byte of the register.
So if we copy a register's contents to address N, and then copy from address N back to the register, the register contents remain unchanged. Now, many sources you might read insist that the choice between big-endian and little-endian is completely arbitrary, that they make equal sense. But don't listen to those sources. The arbitrariness holds if we imagine memory as a vertical array of bytes, because who's to say whether the digits of a number written vertically should go up or down, and who's to say whether the addresses should increase numerically up or down. But if we imagine memory as a horizontal array of bytes, the addresses must increase left to right unless we wish to go against all western convention. Likewise, all western convention tells us to write numbers with the most significant digits on the left, so if we think of memory horizontally, it makes no sense whatsoever to use little-endian. Unfortunately, for historical reasons relating to performance, Intel and some other CPU makers chose the little-endian scheme. Those reasons no longer make sense with modern hardware, but the x86 architecture is still stuck with little-endian byte ordering.

The various storage locations on a computer can be pictured as forming a hierarchy. As we go up the hierarchy, the speed goes up, but so does the cost per byte, and so we also need slower, cheaper storage for larger capacities. For example, if an entire program could fit in the processor registers, we wouldn't need system memory, but even the smallest programs won't fit in just the registers. And of course, both CPU registers and system memory are volatile, so we need hard drives for non-volatile storage. Even though RAM is much faster than hard drives and other non-volatile storage, RAM is still relatively slow compared to the operations of the CPU. Therefore, another level of storage is used in between the processor registers and RAM, called the CPU cache.
The CPU cache uses a form of memory called SRAM, short for static RAM, as opposed to the DRAM, dynamic RAM, used for system memory. Unlike DRAM, SRAM doesn't require a refresh cycle to keep its content, allowing SRAM to be read and written significantly faster. Moreover, the processor cache is often placed on the CPU die itself, allowing for faster access by the CPU. On the other hand, SRAM is significantly more expensive per byte, and so the typical x86 CPUs of 2013 have around 8 to 16 megabytes of cache, much smaller than the 4 or 8 gigabytes of RAM typically found in the same systems.

The unique thing about processor caches is that programmers typically have little to no direct control over them. Instead, the content of the cache is managed by the hardware, transparently to the programmer. When your code reads an address of system memory, the CPU first checks if an up-to-date copy of that byte currently sits in the cache. If so, the CPU can read the byte directly from the cache without reading from slower system memory. If an up-to-date copy of the byte does not already sit in the cache, the byte is copied from memory to the cache before it is read, such that it might still be there the next time the CPU wants to read that address. Because the cache is much smaller than system memory, only copies of small parts of system memory can fit in the cache at any moment. Consequently, when data from a memory address is copied to the cache, the cached copy of some other memory address must get overwritten. It's up to the hardware to automatically decide what to overwrite. Now, the usual pattern with code is that when we read data at an address of memory, we're very likely to read nearby addresses as well. For example, when reading 100 bytes at address n, you start with byte n, but then read byte n plus 1, then n plus 2, then n plus 3, and so on.
Because this pattern of locality is so common, most caches are designed to over-eagerly copy bytes from memory, meaning that when the CPU reads an address from memory, the cache system will copy not just the data at that address to the cache, but also the data of the surrounding bytes, say 1000 of them, or 2000, or maybe more. This way, memory access is optimized, because it's generally faster to get a chunk of bytes from memory in one read than to get each byte one at a time. Of course, if the running code doesn't need the other bytes any time soon, the extra work is a waste, but the strategy does usually pay off. On systems with this cache behavior, a very effective optimization strategy is to maximize locality, to keep the bytes of your code and data as close to each other as possible. If the bytes of memory needed by a processing-heavy part of your code are scattered far away from each other, they're less likely to all be in the cache at once, meaning that the CPU is more likely to have to wait on reads of memory as it does the work.

For two electronic devices to communicate, there must be some common storage area which they can both read and write. When the CPU and an input-output device communicate, they do so by both reading and writing registers in the device. The relationship, though, is one way: the CPU is in control, reading and writing registers of the device as it pleases, but the device cannot read and write the registers of the CPU. So when a device wishes to send a message to the CPU, it writes to its own registers with the expectation that the CPU will read the data at some point. In many cases, input-output devices communicate indirectly with the CPU through a controller device. USB devices, for example, talk directly to a chip called the USB controller on your main board, which in turn talks directly to the CPU. So what are the CPU instructions for reading and writing device registers?
Well, in CPUs which use what's called memory mapped IO, some memory addresses specify device registers instead of bytes of system memory. Imagine, for example, a system in which the memory addresses from 0 to FFFF are mapped to device registers instead of bytes of RAM, such that the first byte of RAM actually starts at address 10000. Also notice that a system very well may have more memory address space than it has bytes of RAM, and so some memory addresses may not map to anything. Anyway, with memory mapped IO, we can read and write the device registers using the very same copy instructions used to read and write bytes of system memory. In this example, any copy instruction reading or writing an address in the range 0 to FFFF reads or writes some device register rather than some byte of RAM. Other systems, however, use port mapped IO, in which a separate address space of so-called ports is used for device registers. In this arrangement, reading and writing the device registers requires distinct input and output instructions. For example, an instruction for writing to a device register might look something like: output register 2 to port 4498.

Whether a system uses port mapped or memory mapped IO, or even a combination of the two as x86 systems do, the next question is how programmers know which addresses and ports are mapped to what. On some systems, these mappings are all hardwired and thus documented by the hardware makers, but on other systems, including PCs, many mappings are dynamically configured at system startup, such that certain devices at fixed ports and addresses are used at system startup to discover the ports and addresses of the remaining devices. Either way, most programmers don't really have to worry about device ports and addresses because, as we'll discuss later, direct communication with devices is handled by the operating system. As an IO device operates, it may periodically want attention from code running on the CPU.
The simplest strategy for providing this attention is called polling, in which code on the CPU periodically checks the device registers to see if the device wants attention. The obvious problem with polling is that these periodic checks waste the CPU's time when the device needs no attention, so it's far better if the device can directly notify the CPU when it needs attention, which is the idea behind interrupts. An interrupt line is a circuit path running from the device to the CPU over which the device can signal to the CPU that it wants attention. When receiving this signal, the CPU temporarily sets aside what it's doing to run the interrupt handler, a piece of operating system code associated with that interrupt line. The operating system stores a list of addresses for these handlers in the interrupt table, and the CPU keeps the location of this table in a status register. When an interrupt signal is received, the CPU is hardwired to copy the current program counter to memory so that it can later pick up where it left off. The CPU then looks in the interrupt table for the handler address associated with the interrupt line, e.g. line 0 corresponds to the first handler address. The CPU then jumps execution to this address, and the handler does its business to service the device. When finished, the handler is supposed to restore the CPU to its state before the interrupt. For example, if a handler uses a general purpose register, it first copies the content to memory, then copies that content back when finished, so that the interrupted code can pick up where it left off.

In the course of execution, the CPU may sometimes detect an aberrant condition that needs attention from the operating system. These aberrant conditions are called hardware exceptions, and they are handled much like interrupts. When the CPU detects an exception, it jumps execution to a piece of operating system code called an exception handler by finding the address for the handler in the hardware exception table.
For example, division by zero is an illegal operation, and so when a program performs a division instruction with zero as the denominator, the CPU may trigger a hardware exception, jumping execution to the handler for that type of exception. By this mechanism, the exception handler of the operating system can then decide what to do about the situation.

When a system powers up, the first code that runs is the boot firmware. The firmware code usually resides in a small IO device, a memory chip which retains its content thanks to a small battery on the main board. This device resides at a fixed address or port, and when the CPU powers on, it is hardwired to start executing code at that address. Older PCs had a boot firmware device called the BIOS, short for Basic Input Output System, but more recent PCs have replaced the BIOS with the newer standard UEFI, the Unified Extensible Firmware Interface. In either case, the main task of the BIOS or UEFI is to jump execution to an operating system loader on one of the system drives, most commonly a hard drive. From there, the operating system code is responsible for managing the system and providing an environment in which to run other programs.

We'll end our discussion of hardware by summarizing the key qualities of a CPU which concern programmers. First off, to program a CPU, the coder must know the ISA, the set of instructions and registers. The byte size in a system refers to how many bits make up each byte of addressable memory. In practice, this is not a real concern because all modern systems use 8-bit bytes, but back in the 1970s and earlier, some machines used sizes other than 8, such as 6 or 7. The industry settled upon 8 bits per byte, most likely because 8 is a convenient power of 2 and not too large. The term word is used to denote the number of bits which a CPU can handle most naturally and efficiently. The word size generally corresponds to the size of the general purpose registers, e.g.
a processor with 32-bit registers generally works most efficiently when copying data in 32-bit chunks. Some processors, though, actually have registers of varied size and so may have no single proper word size. The address size refers to the number of bits used for each address, effectively determining the size of the memory address space. A processor that uses 32-bit addresses, for example, can address 2 to the 32nd, which is over 4 billion, unique addresses. Newer x86 processors typically use 48-bit addresses, which allows for over 65,000 times as many addresses, orders of magnitude more than the actual number of bytes of memory in today's systems. In fact, we may never need the full 48-bit address space: even if we can produce memory chips with much larger capacities, it's not clear we would ever have the need for that much memory. Another possible concern with a CPU is the speed and size of its caches, which a programmer may wish to keep in mind when attempting to heavily optimize code. Programmers also need to be mindful of how a system treats byte order when copying data between memory and the CPU registers; getting this wrong can produce junk data and serious bugs. When doing low-level coding that interfaces directly with hardware, such as when writing device drivers or an operating system, a programmer would need to know how the devices are mapped, whether to memory addresses or ports. For most programming, though, such as when writing applications, programmers can let the operating system and its drivers handle this concern. Lastly, a CPU core is the part of the CPU which performs the core work of executing instructions. Each core executes instructions one by one, but if our system has more than one core, it can execute multiple instructions simultaneously, meaning that two programs can run simultaneously on two separate cores.
Most PCs before 2005 had only one core, but some PCs, particularly those aimed at the server market, had two or more processors, each with its own core. In 2005, Intel introduced the first multi-core x86 processor, a processor with multiple cores contained in the same ceramic package. The advantage of stuffing multiple cores into one package, rather than having multiple packages, is that it's cheaper and better for power management and system cooling. Most new consumer PCs today contain one processor with two or more cores, while some expensive systems, namely servers, still use multiple separate processors, and those processors now typically have eight or more cores each. While it is possible for a single program to accelerate its performance by utilizing more than one core, doing so effectively can be very tricky, as we'll discuss later when we talk about concurrency.