All right, everybody, welcome back. I hope you got at least a little bit of a spring break, and we'll continue where we left off. As we go, let's start by remembering where we were. We were talking about IO devices, and among other things we put up this mental model of how a processor might talk to a device. So for instance, there's always a memory bus that's often directly off the chip, but then there's typically a set of bus adapters and an interrupt controller. We were talking about what a typical device controller is: the device controller is the part that interfaces between the main system and the device, and the CPU interacts with the controller to control the device. Typically a device controller has a couple of possible interfaces, one of which is a set of registers that you can read and write to control the device. On x86 at least, those registers are read and written using special IO instructions. Or we might actually have memory mapped regions where we just read and write actual addresses and the control goes directly out to the device. So the controller may contain memory for request queues, or maybe bitmap image memory, et cetera. Every device is a little different. But no matter how complicated things are, there are typically two ways of accessing things. One is with IO instructions. IO instructions typically look like this, where you might have an "out" instruction to 0x21 with register AL. What this says is that whatever contents are in register AL get sent out to port 0x21, and that goes over a special IO bus and might end up reading or writing some control register. Alternatively, which is much more common, we have memory mapped IO, where just reading and writing through load and store instructions goes directly to the hardware and causes IO to happen.
So just to give an example of memory mapped IO, we said: here might be an example of a display whose IO addresses are physical addresses. If I read and write in certain ranges, I might update what command I want to run, or I might read the status. Or if I write to a certain set of addresses, that might put bits on the screen, or perhaps put commands into a graphics queue to draw triangles. In general, what memory mapping means is the hardware maps control registers, display memory, and so on to actual addresses. In the old days those addresses were set by hardware jumpers at boot time, although these days you plug a device in and the addresses are typically picked automatically so they don't conflict with other devices. And just writing to memory with a store instruction actually causes something to happen. So here we're writing to the frame buffer; we might write graphics commands to the command queue, et cetera. We might write to the command register, and the result of writing to it causes the device to act on what we've said. So this is just a simple example of a memory mapped device. Memory mapped devices are very common these days because they're very simple to interface with and you don't need special processor support like you do with IO instructions. And depending on what part of the physical address space these are mapped to, you can also typically protect them with address translation, in a way where you can even give full control of a device to a user by setting up their page tables to map a certain part of the physical space. Okay, that's kind of where we were last time. What we didn't get to was the process of transferring data to and from the actual controller itself.
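To make the idea of address decoding concrete, here is a toy Python sketch. Everything in it is invented for illustration: the `ToyDisplay` class, the address map (`CMD_REG`, `STATUS_REG`, `FB_BASE`), and the behavior are not any real device, just a model of how loads and stores to certain physical addresses can have device side effects.

```python
# Hypothetical address map for a toy memory-mapped display controller.
CMD_REG    = 0x8000_0000   # a store here triggers the device (side effect)
STATUS_REG = 0x8000_0004   # a load here returns device status
FB_BASE    = 0x8001_0000   # framebuffer: stores here put bits "on screen"
FB_SIZE    = 0x1000

class ToyDisplay:
    def __init__(self):
        self.framebuffer = bytearray(FB_SIZE)
        self.status = 0            # 0 = idle, 1 = busy
        self.last_command = None

    def store(self, addr, value):
        """Model a store instruction whose physical address decodes to this device."""
        if addr == CMD_REG:
            # Writing the command register has a side effect:
            # the device starts acting on the command.
            self.last_command = value
            self.status = 1
        elif FB_BASE <= addr < FB_BASE + FB_SIZE:
            self.framebuffer[addr - FB_BASE] = value & 0xFF

    def load(self, addr):
        """Model a load instruction; on real hardware loads can have side effects too."""
        if addr == STATUS_REG:
            return self.status
        if FB_BASE <= addr < FB_BASE + FB_SIZE:
            return self.framebuffer[addr - FB_BASE]
        return 0

display = ToyDisplay()
display.store(FB_BASE + 10, 0xFF)   # an ordinary-looking store puts a pixel up
display.store(CMD_REG, 0x2)         # this store makes the device do something
```

The point is only that the same load/store instructions reach either DRAM or the device, depending purely on which physical address range they decode into.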
If you look, for instance, programmed IO is one option, where each byte is transferred by the processor, either by using in and out instructions or by loads and stores. This is very processor heavy: the processor is in a loop, and if you're gonna transfer four kilobytes of data, it's in a loop pulling each byte in. Maybe it's doing it four bytes at a time by loading a 32 bit word, but it's still extremely processor intensive. The pros are that it's very simple hardware and it's easy to program. The cons are that it consumes processor cycles in proportion to the amount of data moved. Now we have a question here about memory mapped devices: how do you tell the processor that those regions are memory mapped, so that reads may have side effects? Is it set at hardware design time, or is there configuration? That's a good question. Typically there are parts of the physical address space, outside of where the DRAM is, that are known by the system itself to be IO addresses. And when you plug a device into, say, a PCI slot, there's a configuration process whereby certain reads and writes to physical addresses end up going to that card instead of going to DRAM. So it's really a combination: certain parts of the address space are reserved for IO at design time, and when you plug a card in, it's configured to use part of that IO space. I don't know if that answered the question. Going back to this slide, by the way, what you see here is that certain addresses going over this processor memory bus may go to DRAM, regular memory, or they may go over bus adapters and end up at memory mapped devices. So I'm hoping that answered the question.
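A minimal sketch of what that programmed-IO loop looks like, using an invented `ToyDevice` whose `read_data_reg` method stands in for one "in" instruction or one load from the device's data register:

```python
class ToyDevice:
    """Hypothetical device exposing a one-byte data register."""
    def __init__(self, data):
        self._data = list(data)

    def read_data_reg(self):
        # Each call models one "in" instruction / one load from the data register.
        return self._data.pop(0)

    def has_data(self):
        return bool(self._data)

def programmed_io_read(dev):
    buf = bytearray()
    # The CPU burns one trip through this loop for every single byte
    # transferred: simple hardware, easy to program, but very
    # processor-intensive for large transfers.
    while dev.has_data():
        buf.append(dev.read_data_reg())
    return bytes(buf)
```

For a 4 KB transfer this loop runs 4096 times on the CPU, which is exactly the cost DMA is designed to remove.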
So the alternative is what we call direct memory access, or DMA. The idea, as the alternative to programmed IO, is that the processor just sets up the transfer and then something else goes through the loop and transfers things. We're gonna give a controller access to the memory bus and ask it to transfer the blocks by itself. The good thing about this is that the processor is not involved in transferring every byte; instead it gets a signal, like an interrupt, when the transfer is complete. So here's an example from one of the books you have access to, which shows what happens if we use DMA to pull something off of a disk. First, the CPU goes into the kernel and tells the device driver that it wants to transfer data from a certain part of the disk. The driver then tells the disk controller, over the memory bus, that a transfer needs to happen. The disk controller in this instance might reach up to a DMA controller, which is on the bus, programming it to say: for every byte, or every 32 bits, you get, transfer it to the next slot of memory. Part of that programming is the starting address where the data gets transferred. Then the disk controller starts sending data through the DMA controller, the DMA controller writes memory, and when it's done, the DMA controller interrupts. So the key thing here is that there's this other piece of hardware involved in doing the transfers, rather than having the CPU do them. And there are many variants of how DMA works. Certain buses, like USB, actually allow the devices themselves to be bus masters and write directly into memory, so under some circumstances there might essentially be a DMA controller on every device. Okay, are there any questions on that?
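The sequence just described can be sketched in a few lines. The `ToyDMAController` below is invented for illustration: the CPU's only job is the `program` call (source, destination address, and a completion callback standing in for the interrupt); the controller then moves the data on its own.

```python
class ToyDMAController:
    """Hypothetical DMA engine: programmed once, then transfers by itself."""

    def program(self, src, memory, dst_addr, on_complete):
        # This is all the CPU does: set up the transfer parameters.
        self.src, self.memory = src, memory
        self.dst_addr, self.on_complete = dst_addr, on_complete

    def run(self):
        # In real hardware this loop runs in the controller, not on the CPU;
        # here we just model its effect on memory.
        for i, b in enumerate(self.src):
            self.memory[self.dst_addr + i] = b
        self.on_complete()   # the "interrupt": transfer complete

memory = bytearray(64)       # stands in for DRAM
interrupts = []
dma = ToyDMAController()
dma.program(b"disk block", memory, dst_addr=16,
            on_complete=lambda: interrupts.append("done"))
dma.run()
```

Notice the division of labor: the CPU touches the controller once, and the next thing it hears is the completion interrupt.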
So the question is whether normal CPU software could offload accesses to DMA, as opposed to an external device doing so. It depends on the device. If I understand the question: with a memory mapped device, the CPU might program an independent DMA controller, and then the DMA controller reads from one address and writes to memory, or reads from memory and writes to an IO address. There are many variants of DMA out there, some on cards, some on buses, et cetera. So now, how does a device notify the OS that a transfer is done, or more generally that it needs service? Reasons this might need to happen are, for instance, that the device has completed a DMA operation, or there's an error, or there's a packet coming in off the network. We've talked a lot about interrupts in the early part of the class, and this is a case where the device generates an interrupt whenever it needs service; the CPU then goes into an interrupt handler and starts doing something. So typically the bottom half of the device driver is entirely interrupt driven, and it gets entered when an interrupt occurs. The pros of this are that the CPU doesn't have to know how long the transfer will take to finish. It just waits for the interrupt while doing something completely different, and when the interrupt comes, the kernel handles things. An alternative, though, is what's called polling. This is a case where the OS periodically checks a device specific register to see whether it's got a bit set saying that the transfer is done.
Now, if you remember, the downside of an interrupt is that you have to save all of the state of whatever was running before, set up the interrupt handler, get a new stack frame and so on, run, and then restore the state. So there's a non-trivial handler cost to an interrupt. With polling, you can potentially just check the register every now and then by reading a bit out of memory mapped IO space, and as a result there's a much lower overhead to recognizing that there's service needed. Now there's a question about how you maintain coherency with DMA. That's a really great question. Going back to the DMA picture, the issue might be: if part of memory is in your cache and the DMA is overwriting it, what happens? That depends a lot on the system. Some systems automatically invalidate cached lines when the DMA controller writes the corresponding memory. In others, the CPU has to flush the cache before it starts a DMA operation to get coherence. Now, going back to IO interrupts and polling: actual devices combine both polling and interrupts, because, for instance, if you've got a really high bandwidth network device, like a 10 gigabit or 100 gigabit per second device, and you took an interrupt every time a packet came in, you would spend all of your time saving and restoring registers. Fortunately, IO tends to come in bursts, so what typically happens with a high performance driver is that the interrupt takes it into the driver, and then the device driver keeps emptying packets out of the network controller until there aren't any left. It's doing polling to see if there are any packets left, and eventually it re-enables interrupts and exits back to user code. So this is a way of handling really high bandwidth devices with a combination of interrupts and polling.
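The hybrid pattern just described (Linux drivers do something similar under the name NAPI) can be sketched like this. The `ToyNIC` class and its methods are invented for illustration: take one interrupt, disable further interrupts, poll until the receive queue is drained, then re-enable interrupts.

```python
class ToyNIC:
    """Hypothetical network controller with a receive queue and an interrupt mask."""
    def __init__(self):
        self.rx_queue = []
        self.interrupts_enabled = True

    def disable_interrupts(self):
        self.interrupts_enabled = False

    def enable_interrupts(self):
        self.interrupts_enabled = True

def rx_interrupt_handler(nic, deliver):
    nic.disable_interrupts()        # stop taking one interrupt per packet
    while nic.rx_queue:             # poll: drain everything that arrived in the burst
        deliver(nic.rx_queue.pop(0))
    nic.enable_interrupts()         # queue empty -> go back to interrupt mode

nic = ToyNIC()
nic.rx_queue.extend(["pkt1", "pkt2", "pkt3"])   # a burst arrives, one interrupt fires
received = []
rx_interrupt_handler(nic, received.append)
```

One interrupt's worth of save/restore overhead gets amortized over the whole burst, which is the whole trick.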
Okay, so are there any questions on that? The other time you use polling, by the way: there's a question here about how you do IO in real-time situations. One of the issues with real-time is that typically you don't wanna interrupt your running processes, because they're carefully timed, right? You've got exact deadlines and you know exactly what the time is. So that's an ideal place for a second CPU, one that's not running your real-time tasks, to watch for IO. In that case, if you have a spare CPU, what'll often happen is you'll do polling: that CPU sits in a very tight loop just checking registers for waiting IO. So the real answer to how you deal with real-time and somewhat unpredictable reality is you separate processors, one doing the real-time processing and the other checking the IO. Polling is often used in those situations, if you can afford to burn a CPU spinning in a loop looking for IO. Now, we talked earlier in the term about the fact that device drivers are the thing that allows the kernel to deal with a wide variety of different devices. A driver is device specific code in the kernel that interacts with the device and, as a result, provides a standardized interface up into the kernel. Because of that standardized interface, the same parts of the kernel IO subsystem can interact easily with different devices. And that standardized interface has things like read, write, open, close, et cetera. In addition to the standard IO, by the way, there's the ioctl system call for special device specific configuration. So you might open a device, and it has read and write system calls, but it also might have an ioctl call, because different devices might have different options that you need to program.
An example might be a device that displays something, which might have different resolutions you could set; or a network card or a serial device might have different speeds. It's possible you would program that with an ioctl system call. Device drivers have the two halves we've been talking about. The top half is what your user code reaches when it comes in: you get a call path from system calls, and that's where the standard open, read, write, and ioctl calls are. This is the kernel's interface to the device driver, and it's also the place that will start IO and maybe put the thread or process to sleep if necessary. The bottom half is the part that runs as an interrupt routine when the data's back. Okay, I showed you this earlier, but I wanted to talk you through it again. This user program might have IO it wants to do, so it does a read system call, and that system call crosses into the kernel via the system call boundary. At that point the kernel might ask: can I satisfy this read already? A good example of that might be a file system, where I try to do a read and the kernel has a cache of blocks off the disk; we'll talk a lot about that in the next lecture or so. If the answer is yes, then it can potentially transfer the data into the user's buffer and return from the system call very quickly. On the other hand, if that can't happen, we might have to send a request down to the device driver. And this is the point where device control comes into play. We enter the top half of the device driver, and it might say: well, I know what block is necessary. In that case it issues the commands to the actual controller, telling the disk, for instance, to seek to a certain track and then read a certain sector.
Then I might actually have to put the process to sleep, because at that point there's nothing else I can do. So I'll put the process to sleep on a sleep queue, and of course the scheduler will take over and wake somebody else up; we've talked about that earlier. Then what happens? Well, we've sent a command down to the hardware. In the case of a disk, which we're gonna talk about later in this lecture, it might just start doing the access. Eventually that access will complete and generate an interrupt, and that's when the bottom half of the device driver takes over. The bottom half will receive the interrupt, store the data in the device driver's buffer, and then signal to unblock the driver. At that point, after we've done the transfer, we'll figure out who asked for the IO and which process to wake up. We'll copy the data into, say, the file system cache in that case, wake up our process, transfer the data from the file system into the user's buffer, and then return from the system call, possibly a lot later. So this long path could involve milliseconds, or even seconds for very slow IO, between the original system call and having woken things up and returned from the read call. And so this is the blocking read call path, all right? Questions? So the question here is: why does Windows seem to have many more issues with device drivers than Mac OS or Linux? I think the real answer would have to be the wider variety of possible devices out there. Apple basically had a restricted set of devices, so they had much higher control over their device drivers. Linux, while it does support a wide range of devices, tends to have a lot of people finding bugs.
And so it tends to have more stable device driver code, while Windows has lots of devices and lots of third parties writing code, which tends to lead to more failures. I will point out, though, that when device driver bugs happen they do cause major problems with the kernel, and there are device driver problems in Mac OS and Linux as well. It's possibly true that Windows has more, but I think that's partially because there are more devices available and less control over who's writing the drivers. Now, let's take a brief stop here and talk a little bit about some performance concepts, because we're getting close to the device interface, and we might ask ourselves: if we're trying to figure out whether a device is performing well, what might we care about? One option is response time, or latency: the time to perform an actual operation. Another is bandwidth, or throughput: the rate of operations per unit time. So latency is the time for a single operation; the rate at which operations are performed is a bandwidth or throughput question. And remember, when we were doing scheduling, response time and bandwidth were sort of opposite sides of the coin, and optimizing for one didn't always optimize for the other. Bandwidth or throughput is typically measured in things like megabytes per second for files, megabits per second for networks, or gigaflops per second for arithmetic, if you were talking to an NVIDIA graphics card or something like that. Another important performance quantity is startup, or overhead: the time to initiate an operation. All three of these are present in typical devices. There's a certain startup time, and there's a certain throughput of bytes per second that you get out.
Together, the startup time and the throughput determine the latency for operations. So we can come up with a basic model that's not too bad, which says the latency to transfer some total number of bytes, where the size might be a file size or a network block size, is the overhead plus the bytes divided by the transfer speed, or capacity. The overhead is a guaranteed-not-to-exceed base latency: as you can see from this essentially linear relationship, as the size of the operation goes toward zero, we converge to the overhead. Conversely, a big transfer has a tendency to swamp the overhead; once the transfer is big enough, you can ignore the overhead. A good example of this might be a fast network, say a gigabit per second link, which by the way is 125 megabytes per second. And let's say there's a startup cost of a millisecond for getting into the controller to transfer something. Then we get a graph like this, where I have two separate units on the same graph just for convenience: the blue is the latency of an operation, the red is the bandwidth, and the x axis is the length of, say, the packet I'm sending. What's interesting is that as the length gets bigger, there's a linear time to transfer, but we always have this overhead. The initial overhead of a millisecond we can't get around. So as we grow the length, the linear term increases and the overhead becomes less and less important. There's a question here about why network stuff is always in bits. The answer is that networks are usually serial communication devices, so the natural unit is a bit per unit time: bits per second relate to the fact that you're sending one bit at a time.
This is as opposed to a parallel device that might send bytes at a time. Now there's a question: is bandwidth the same as transfer capacity? Yes, in the previous slide it is. And there's a question about the shape of this graph; let me finish explaining it. Notice that for latency we have our overhead S plus the number of bytes we're trying to send divided by the bandwidth in megabytes per second; notice I've converted my units. Hopefully this makes sense to everybody: with a bigger and bigger packet, we can transfer at full speed once we've paid the overhead of getting into the controller. The bandwidth is this red curve, and the way to think about it, this is an effective bandwidth, is that it's the size of what I'm transferring divided by the amount of time it takes to transfer. As I get down towards zero, I have this overhead, which is basically wasted time. So it doesn't matter how fast the network is; as I get down to smaller and smaller packets, I have a very low effective bandwidth. But as I increase the packet size, the overhead means less and less, and eventually I get closer and closer to the bandwidth of my network. Notice that my network is 125 megabytes per second, and this curve is not quite crossing a hundred; as the packet size got bigger and bigger, I would get closer to 125, okay? In fact, we can continue this a little further and ask about the half-power bandwidth, which is the point at which my effective bandwidth reaches at least half of my network bandwidth.
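The model behind the graph can be written down in a few lines. Setting effective bandwidth n / (S + n/B) equal to B/2 and solving gives the half-power point at n = S * B; the numbers below are the lecture's 1 ms overhead and 125 MB/s link.

```python
def latency(n, S, B):
    """Seconds to move n bytes: startup overhead S (s) plus n / peak rate B (bytes/s)."""
    return S + n / B

def effective_bandwidth(n, S, B):
    """Bytes actually delivered per second, counting the overhead."""
    return n / latency(n, S, B)

S = 1e-3      # 1 ms startup cost
B = 125e6     # 1 Gb/s link = 125 MB/s

# Half-power point: effective bandwidth reaches B/2 when n / (S + n/B) = B/2,
# i.e. n = S * B. Here that's 125,000 bytes.
half_power = S * B

# With a 10 ms, disk-like overhead, the half-power point grows to 1.25 MB.
half_power_disk = 10e-3 * B
```

So on this link a packet needs to be about 125 KB before the channel delivers even half its rated speed, which is the lesson of the slide.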
And if we do that computation, we see that I have to have 125,000 bytes of data in the packet before I get to half of the full bandwidth of the network. So what's the lesson? You may have a really fast communication channel, but because of the overheads of using it, you may not get all those bytes; you're spending a lot of time on per-packet overhead. The bigger the packets, typically, the closer you get to the full bandwidth. All right. Now, what's interesting, just to show the importance of overhead: if instead of a one millisecond overhead we go to a 10 millisecond startup, which is more like a disk, what you see is that with 10 milliseconds, that's 10,000 microseconds, and the same gigabit transfer speed, the half-power point is 1.25 megabytes before we even get to using half of our bandwidth, and that's because the overhead is so high. So what you can see from this is that on a disk you're gonna waste most of the speed of the disk unless you can somehow make the overhead go away, and the biggest way we're gonna do that is with our file system: we're going to try to avoid seek time, which is the thing that's in milliseconds, and try to mostly read things sequentially off the disk. All right. Now, did I answer that question, Sebastian? I'll assume the answer is yes. Okay, great. So what determines the peak bandwidth? I said peak bandwidth might be a gigabit per second for a link; well, that's the speed of the bus. You could look at a bunch of buses: something like PCI-X might do 1,000 megabytes per second, because there are many lanes running at a reasonable speed. There's Ultra-Wide SCSI at 40 megabytes per second. And things that are kind of interesting: USB 3.0 is more like five gigabits per second.
Thunderbolt, which uses the USB-C connector, is 40 gigabits per second, so these have been growing quite a bit. I also put SAS in here: if you buy a serial attached drive and plug it into a SAS-3 controller, you can actually get 12 gigabits per second coming off of a disk drive, which is pretty fast. So bus speeds are clearly gonna determine the peak bandwidth. The other thing that determines how fast we can go is the device itself. The bus might be really fast, but if the device is slow, it doesn't help you much. For instance, the rotational speed of the disk: if you've got a really high speed disk because you're in the cloud, you might have 15,000 revolutions per minute; if you've got a low power device in your laptop, you might have 3,600 to save battery power. Or things like the read/write rate of NAND flash might matter, or the signaling rate of a network link. These things can impact what you're gonna get, and it may not just be the bus: whatever is the bottleneck in the path is the thing that slows everything down. Okay, so let's talk a little bit about storage devices, because that'll be our first canonical device that we examine in more detail. There are at least two types that you probably use every day. One is the magnetic disk. The magnetic disk is very reliable storage; it very rarely becomes corrupted, and it has very large capacity at low cost. Buying four terabyte drives these days is almost a no-brainer; 16 terabytes, whatever, not a big deal. It's block level random access, except for shingled magnetic recording, which we'll talk about a little later. So you can pretty much get any block anywhere on the disk; it's just very slow, because you've got to seek and rotate to get to it. So the performance for random access is very slow, with much better performance for sequential access.
These properties are going to greatly impact the way that file systems are designed to operate on disks, and we'll talk about how file systems have evolved over the years to adapt to these constraints, the fact that you basically want to do sequential access pretty much all the time. Flash memory, which has become much more common these days, is also very reliable; it wasn't originally, but it's much more reliable now. Its capacity is not as cheap as magnetic disk, although that keeps getting better. It's block level random access, just like disk, with really good performance for reads and worse for writes: writes actually take time and power to change the charge levels in the cells. You also have to erase in large blocks: you can only write a page once, and then you have to erase the block before you can write it again. That actually causes some issues for file systems. The other thing with flash memory is that it wears out. If you write a given block too many times, it gets to where it doesn't store data anymore, and then that block is dead. So that's a downside of flash memory. The upside is that it's much faster in general than disk, the random access is great, and overall it's a pretty low power solution. Now, I don't know how many of you have ever opened up a disk drive or looked inside, but it's pretty fascinating technology. It's a series of platters, and the data is stored in concentric tracks. The question here, by the way, is: if a block dies, does the storage device know this and avoid storing stuff in that block in the future? The answer is that the device actually can tell that a block is starting to fail. In fact, there's something called wear leveling that explicitly tries to spread the writes over all the blocks to make them fail less frequently. But yes, there are error-detecting codes on the blocks, both on disk and on flash, to notice that things are failing.
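The erase-before-rewrite constraint and wear-out can be captured in a toy model. The `ToyFlashBlock` class, page count, and erase limit below are all invented for illustration; real NAND geometry and endurance numbers vary widely.

```python
PAGES_PER_BLOCK = 4

class ToyFlashBlock:
    """Hypothetical NAND block: pages write once between whole-block erases,
    and each erase wears the block out a little."""

    def __init__(self, erase_limit=3):
        self.pages = [None] * PAGES_PER_BLOCK
        self.written = [False] * PAGES_PER_BLOCK
        self.erase_count = 0
        self.erase_limit = erase_limit   # beyond this, the block is dead

    def write(self, page, data):
        if self.written[page]:
            # Can't overwrite in place: the whole block must be erased first.
            raise ValueError("page already written; erase the block first")
        self.pages[page] = data
        self.written[page] = True

    def erase(self):
        if self.erase_count >= self.erase_limit:
            raise ValueError("block worn out")
        self.pages = [None] * PAGES_PER_BLOCK
        self.written = [False] * PAGES_PER_BLOCK
        self.erase_count += 1

blk = ToyFlashBlock()
blk.write(0, b"old")
blk.erase()              # updating page 0 costs a whole-block erase...
blk.write(0, b"new")     # ...and one tick of wear
```

This is exactly why flash file systems and FTLs prefer to write updates to fresh pages elsewhere (with wear leveling) rather than erase in place.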
So a hard disk drive is spinning storage, and these heads are extremely sophisticated. The tip of the head actually requires the same kind of patterning that they do on chips themselves. For a long time, only IBM was capable of making these things. It's interesting, when you look at the original IBM personal computer AT in 1986, a 30 megabyte hard disk was 500 bucks. Thirty megabytes! It had a 30 to 40 millisecond seek time and could get about a megabyte per second off of the spindle. Things are a lot faster now; I'll show you some up to date timings in a second. I also wanted to show you these cool little devices, which I actually had some of. They fit into cameras that took this size of flash memory, but they were little spinning disk drives inside. There was a single platter, with two heads, one on either side of the disk. For the time, these actually held more than what you could get in flash. So this is an actual four gigabyte spinning device inside what looks like a flash card, believe it or not. Pretty cool. The other thing I wanted to mention is that these drives have a read/write head on both sides of each platter, so we read both sides of the disk. By the way, these microdrives were made by both IBM and Hitachi at the time. They were pretty amazing, but they only lasted for a short period, because flash densities caught up with them very quickly and they became impractical and not cost effective anymore. So what about disks? Here's another version of a disk to look at. There's a platter, that's this; there are two surfaces for every platter; there's a series of platters; and there's a head for each surface.
And then there's this arm, which as a unit seeks to a certain spot on the platters. So you have a series of platters and a series of heads, and they're all tied together: when you move from the outside in to a particular track, it's only possible to move all the heads at once. Now, the unit of transfer here is a sector. That's the smallest unit that can come off a disk. As I mentioned in one of the Piazza posts, the sector size is kind of irrelevant to today's operating systems because it's relatively small, like 512 bytes, and you never want to transfer only 512 bytes. So typically these sectors are put together into a block, and that's where a 4K block comes from. Okay: a ring is a track, and if you look at all the tracks stacked above each other, that's called a cylinder. It's kind of like taking a tin can and going straight through all of the platters; you would find all of the tracks in a cylinder. So you position on a cylinder, and then a particular head is activated to read or write at that point. Now, disk tracks are very narrow, microns wide or less; for comparison, the wavelength of visible light is about 0.5 micrometers, so these tracks are extremely small. There's a question here: does the arm zigzag back and forth to read a single sector? No, and that's a very good question, because this is a bad figure. What happens is this whole assembly is spinning. So all that has to happen is the arm goes to a particular cylinder, and then all of the data passes under the head as the platters spin. The platters spin, the whole track traces out underneath, and the head basically gets to read or write potentially the whole track.
Or if it's interested in a particular sector, what'll happen is the arm goes in, and then you have to wait until the sector goes under the head, and then you can read or write it, okay? So there are guard regions on either side of each track to help avoid corrupting data on neighboring tracks during writes. Although it's a little different these days; I'll talk about shingled magnetic recording in a moment. Now, the track length varies across the disk. And if you think about this a little bit, if the arm is out at the outside of the disk, there's a lot more disk surface that goes by than on the inside. That's just the circumference that's acting here. And so what does that mean? Well, the sectors are all stored at a given bit density on the disk. And so that means that when the arms are on the outside, the sectors are going by a lot faster than when they're on the inside. And so the data rate actually varies on the outside versus the inside, to make sure that the data is stored at a constant areal density. And the other thing that's kind of interesting here is that regular disks are getting so big that the time to read a whole disk is becoming too long to even back up. And so oftentimes these days, companies like Google will use part of the disk for active data, and the rest for archival storage that's almost never touched, okay? So they actually split the disk into two pieces, one that's archival and one that's active. And that's just because if the active data filled the whole disk, you couldn't read it off fast enough to back it up somewhere and make sure it's safe, all right?
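As a quick sanity check on that claim, here's a rough back-of-the-envelope estimate; the 14 terabyte capacity and 250 megabyte-per-second sustained rate are assumptions borrowed from typical modern drives, not from any particular product:

```python
# Rough estimate of how long it takes to read an entire modern disk
# end to end. Capacity and sustained transfer rate are illustrative
# assumptions (a 14 TB drive at ~250 MB/s).
DISK_BYTES = 14e12           # 14 TB
TRANSFER_RATE = 250e6        # 250 MB/s sustained sequential read

seconds = DISK_BYTES / TRANSFER_RATE
hours = seconds / 3600
print(f"Full-disk read: {seconds:,.0f} s = {hours:.1f} hours")  # ~15.6 hours
```

More than half a day of nothing but sequential reading, which is why only a fraction of such a drive can realistically hold data that needs regular backup.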
Now, the other thing I wanted to mention is that for really, really high density, what they do today is what's called shingled magnetic recording, and these are disks that are entirely for sequential writes. These do really poorly at random writes, and I'll show you why. If you look at the track, the head that's writing actually writes a wider swath than the track itself. And so what happens here is you write one track, and then the next time around, on the next track, you're actually overwriting the previous track somewhat. This is shingled: it looks like shingles on the roof of a house. And so you might ask yourself, how the heck do you ever read the data afterwards? And the answer is really good digital signal processing, okay? And so the first versions of, for instance, Seagate 8 terabyte and Hitachi 10 terabyte drives actually used shingled magnetic recording to get the high density. Density improvements have advanced enough these days that you can actually get 8 terabyte drives that don't do shingling. But you can imagine that if you have shingling, you have to be very careful how you use it. So this is more for either archival storage, where you're not writing very often, or for something like your TiVo, where you're writing a large video, and so you essentially just go around many times for the video and you don't tend to write randomly. You can basically write over big chunks of tracks at a time for a particular video. So now we can briefly do a performance model. Let's sketch this out a little bit. So the heads are tied together, and there's a head on the top and the bottom, and tracks are a ring on a particular surface. Sectors are the minimum thing that you can read and write, and a cylinder is all of the tracks across the platters that are stacked on top of each other. Okay, so basically when we wanna read data... is there always one head per platter?
Good question, pretty much yes, one per surface, so two per platter. The other question, which nobody has asked yet but I'll ask for you anyway, is why are these heads all tied together? It seems like you'd wanna move them independently so that you could read and write off of different heads or different surfaces at the same time. Can anybody think of why that might be? So we have: it seems very hard, slash, space-consuming, yes. Physical limitations, yes. You guys are kind of heading toward the right answer there. More moving parts is error-prone, yep. So all of what you say here is correct, but that isn't the reason they don't do this; they could. The answer is that these are commodity items. Disk drives are so commoditized, and the heads are such complicated, expensive positioning mechanisms, that it would just be too expensive and nobody would buy them. And when would you read multiple heads at once? Well, if you've got multiple processes running and they wanna read different things, you could imagine wanting to read different parts of the disk at the same time. They just don't allow that because it's too expensive. So it's really a cost reason. And all of the things people said here about physical limitations and more moving parts and so on, those all relate to that bigger cost. Now, our three-stage process here is seek time, which is the time it takes to move the head to the right cylinder; rotational latency, which is then the time it takes for the sector you want to rotate under the head; and transfer time, which is the time, once you've gotten the sector in the right place, to read all the data. And the seek time of modern disks is like 4 to 8 milliseconds, somewhere in that range, maybe three in some cases. The rotational time for one rotation is typically eight to 16 milliseconds, all right. And that's because we've got about 3,600 to 7,200 revolutions per minute.
That's a pretty common type of device you might have in your laptop, all right. And once you've got these pieces, then you could say the following: the latency of the disk is what you see in this diagram here. So the request comes in from the user and it's queued. This might be a software queue in the operating system, or a queue in the device driver itself, which is part of the lower part of the operating system. That queuing time is an interesting aspect we'll have to talk about in a moment. And then there's the time to get into the controller, okay. And then there's the time at the disk. So these two components, the queue and the controller, are totally independent of the actual device itself. And once you get to the device, it's seek time plus rotational time plus transfer time. And by the way, if you think of this probabilistically: what's the average time it takes to get to the sector we want once we've gotten to the right track, anybody guess? So, repeating the question: seek time is the average time to get to the right cylinder, okay, and you could imagine coming up with some average for that. Rotational latency is, once we get to the cylinder, how long does it take for the sector we're interested in to come around? And transfer time is, once we've gotten to the sector, how long does the transfer take? And that transfer time has something to do with where we are on the disk, because it depends on how fast the media is flying by the head. So we could figure out that transfer time. What about this rotational latency? Can anybody guess how we might compute it? We know the speed of rotation, yeah. Yeah, so it's the time for the disk to go halfway around, that's right. So whatever the time per revolution is, we take half of that, and that's what we'd plug in for rotational latency.
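That half-rotation rule can be written out as a small calculation; the RPM values below are the ones mentioned in the lecture:

```python
# Rotational latency from RPM: the time for one rotation is
# 60,000 ms-per-minute divided by revolutions-per-minute, and on
# average you wait half a rotation for your sector to come around.
def rotation_time_ms(rpm):
    return 60_000 / rpm

def avg_rotational_latency_ms(rpm):
    return rotation_time_ms(rpm) / 2   # on average, wait half a turn

for rpm in (3600, 7200, 15000):
    print(f"{rpm:>5} RPM: {rotation_time_ms(rpm):5.2f} ms/rotation, "
          f"avg latency {avg_rotational_latency_ms(rpm):5.2f} ms")
```

At 3,600 RPM that's about 16.7 ms per rotation; at 7,200 RPM, about 8.3 ms; so the average latencies are roughly 8.3 and 4.2 ms, matching the "eight to four milliseconds" range below.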
How does the head know it's hit the right sector? Boy, you guys are asking some good questions. So the answer is that each sector has a header and some data, and that header is self-synchronizing, so that when the head comes by, it reads the address of the sector in the header, and then it knows that the data coming by, or the data it's about to write, belongs to that sector. That's part of the process of formatting the disk. All right, and then the other question was: in the device driver, does scheduling really matter? And the answer is yes. Scheduling of which blocks we read, and when, really matters, and it greatly impacts how fast the disk drive works. And if you look at the complexity of file systems, which we'll get to next time, a lot of that complexity has to do with this weird device, where we have to make sure that we almost never move the head in and out, because that's really expensive; we'd prefer to basically put the head there and spin around and grab a whole file, because that's the fastest thing to do, all right. Now here are some typical numbers. A 14 terabyte Seagate drive, that's a pretty common thing now, easy to get if you're a cloud service provider: typically eight platters in a three-and-a-half-inch form factor, okay, and greater than a terabit per square inch on the platter. That's just crazy. They also suck out the air in there and replace it with helium to reduce the resistance to spinning, okay. So they're trying to reduce some of the energy lost to the actual spinning of these devices. The average seek time is in the four to six millisecond range. Depending on where you are and where you're going, it could be 25 to 33% of this number. So this is where scheduling by the device driver, or rather the file system, really matters: instead of four to six milliseconds, you'd like to get a much lower amount. We schedule things so we don't move very much, all right.
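As a toy illustration of why that scheduling matters, here's a sketch of servicing requests in arrival order versus one sorted sweep across the disk (an elevator-style trick; the lecture doesn't name a specific algorithm, and the cylinder numbers below are made up):

```python
# Toy illustration: servicing pending requests in cylinder order
# (one sweep) moves the head far less than servicing them in the
# order they arrived. Cylinder numbers are made-up values.
def head_travel(requests, start=0):
    """Total cylinders traversed servicing requests in the given order."""
    travel, pos = 0, start
    for cyl in requests:
        travel += abs(cyl - pos)
        pos = cyl
    return travel

pending = [980, 12, 700, 55, 850, 30]       # arrival order
fifo = head_travel(pending)                 # service as they arrived
swept = head_travel(sorted(pending))        # one sweep across the disk
print(fifo, swept)                          # 4896 vs 980 cylinders
```

The sweep covers the same requests with a fifth of the head movement, which is exactly the kind of win a file system or driver scheduler is chasing.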
Average rotational latency: I told you that 3,600 to 7,200 RPM gives us somewhere between 16 and eight milliseconds per rotation; server disks, however, get up to 15,000 RPM, so those are very fast. The average rotational latency for the regular disks we talked about is half of that 16 to eight milliseconds, so it's eight to four milliseconds. And then transfer time: 250 megabytes per second is common, right. Compare that in your mind to one megabyte per second in the original IBM disk; quite a bit faster. Okay, now let's see. Let's take a brief break so people can run off for a second, and I'll be back momentarily and we'll continue. All right, so we're gonna talk about SSDs, yes. In just a moment; that was a question during the break, so let's finish up with disks here. So I wanted to give you an example. We're gonna ignore the queuing and controller time for the moment, and assume an average seek time of five milliseconds and 7,200 RPM. So how do we deal with that? There is a question on the channel, by the way, saying they'd love to hear about SSD file systems. Yes, we'll see if we can talk a little bit about that next time. But if we have a 7,200 RPM disk, then the time for a rotation is, and this is where units matter, so hopefully you remember from your high school chemistry the importance of units: 60,000 milliseconds per minute divided by 7,200 revolutions per minute gives me about eight milliseconds. So you can do that computation. Okay, and you can figure out where the 60,000 came from, right: that's 60 seconds per minute and 1,000 milliseconds per second. Then take a transfer rate of, say, 50 megabytes per second and a block size of four kilobytes. So notice what I'm doing in this example: I'm putting a bunch of sectors, which might be 512 bytes each, together into a four-kilobyte chunk, which I'm gonna assume is along the same track and contiguous.
And so basically, once we've positioned ourselves and rotated to the right spot, we can just read all of them at full speed. And so how long does it take to transfer four kilobytes? Well, that's 4,096 bytes, okay. And remember that for data like this, it's really a kibibyte, right. So it's 4,096 bytes divided by 50 times 10 to the sixth bytes per second, because this is a bandwidth, so it's not in mebibytes. And we compute that out and we get about 0.082 milliseconds to get our block, okay. And the question here is, do seek time and rotation time overlap? The answer is no. Now, can you figure out why we can't overlap seek and rotation, given what I said earlier? It's a great question. Anybody have an answer for that one? Nope, it's not about dollars. Yeah, great: we need to find the header. That's right, so only after you seek, as the disk is spinning, are you looking at each header on the track to decide when you're at the right spot. So you can't actually start looking for the sector until you've moved in, all right. And so now we read a block from a random place on the disk. We have the seek of five milliseconds. The rotational delay is four milliseconds. Why four? Because that's half of the eight that we computed. Transfer is 0.082 milliseconds. And that gives us 9.082 milliseconds total. So you could say it's approximately nine milliseconds to fetch a block, and the effective bandwidth is 4,096 bytes divided by that time. And so what we're really getting in this particular situation is 451 kilobytes per second. So notice what I've computed here: assuming that we're randomly reading a 4K block from anywhere on the disk, the best we could do is 451 kilobytes per second, even though the transfer rate off of the head is high. So look at the difference: 451 kilobytes per second versus 50 megabytes per second. This is showing us how bad it is to keep reading random blocks off the disk.
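Here's that worked example as a short calculation, using the lecture's rounded 8 millisecond rotation time:

```python
# The lecture's worked example: a random 4 KB read on a 7200 RPM disk
# with a 5 ms average seek and a 50 MB/s transfer rate. The 8.33 ms
# rotation time is rounded to 8 ms, as in the lecture.
seek_ms = 5.0
rotation_ms = 8.0 / 2                 # average = half a rotation
transfer_ms = 4096 / 50e6 * 1000      # 4 KB at 50 MB/s -> ~0.082 ms

total_ms = seek_ms + rotation_ms + transfer_ms     # ~9.082 ms
bandwidth_kb_s = 4096 / (total_ms / 1000) / 1000   # bytes/s -> KB/s
print(f"{total_ms:.3f} ms per block, ~{bandwidth_kb_s:.0f} KB/s")
```

The seek and rotation dominate so thoroughly that the 50 MB/s head rate barely shows up in the result.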
So on the other hand, if we don't have to seek and we're just gonna read from a random place in the same cylinder, then the only thing we have to wait for is the block to show up by rotating. And so the rotational delay is four milliseconds, the transfer time is 0.082 milliseconds, which gives us 4.082 milliseconds to do that read. If we keep doing that over and over again, that's 1.03 megabytes per second. Okay, so there was a question of why four milliseconds instead of eight. And again, eight is the time for a complete rotation, but we're gonna assume that probabilistically, on average, when we pop onto a track, we only have to wait half a rotation to get what we're going for. So on average, it's four milliseconds. Okay, and if we wanna read the next block on the same track, then we don't have to do any seek or rotational time; it's just the transfer time, and that's our 50 megabytes per second. So you can see this progression: 451 kilobytes, one megabyte, 50 megabytes. There's a significant advantage to locality. All right, so the key to using the disk effectively is to minimize seek and rotational delays. And when we get into file systems for disks, we're gonna have to talk about that. Okay, any questions before we go on? Okey-dokey. So, there's a lot of intelligence in the controller. Sectors have all sorts of sophisticated error-correcting codes, and so the disk is basically able to correct all sorts of errors automatically. The write field is wider than the track, so when you're writing, you're sort of messing up bits on either side, and there's complex DSP and error correction to fix that. Sector sparing: the controller automatically figures out bad sectors and does replacements. And so when the operating system asks for particular sectors these days, they're typically asked for in a virtual sense.
And the controller might actually be replacing the sector you thought you were getting with a different one because of errors. There's also remapping of whole ranges of sectors to preserve sequential behavior. Track skewing: the sector numbers are offset from one track to the next, to maintain speed as you rotate and seek between adjacent tracks. So there's a lot of interesting intelligence that's been built up over the years. And so it's not the case that you typically say, well, I know exactly what track and sector I need to go to, and optimize exactly for that, because in many cases the controller has a different view of how sectors are numbered. It is interesting to see that disk prices have basically been falling at kind of a Moore's law rate, although they've fallen off a little bit over the last couple of years, but they're still pretty dense. Part of the problem is, it used to be that there was a big issue of people worrying about the bit storage on the disk getting so small that mere heat would cause the data to go away. But there's been a variety of new techniques, like storing vertical magnetic domains perpendicular to the surface, to really take care of that. And so some of what has tailed off these days is really that industry can make huge disks that people can't necessarily even use because they're so big. So I just wanted to give you a couple of examples here. The Seagate Exos X14 I mentioned earlier is a 14 terabyte hard disk: eight platters, 16 heads in this little tiny device. It's helium-filled to reduce friction. It's got a 4.16 millisecond seek time. One of the trends that's happened over the last five years or so is that 512 byte sectors, which were pretty much the norm for decades, have now started to get bigger because the disks are so big.
And so actually, on some of these newer drives, you can't even write a 512 byte sector anymore; it's just a 4K sector. And as I mentioned earlier, nobody was using the 512 byte granularity anyway. They have typical high-speed interfaces, like six gigabits per second or 12 gigabits per second; that's SAS-2 or SAS-3. And the price might be about $615, which works out to about $0.05 per gigabyte, as opposed to the old IBM PC disk, which was basically $17,000 per gigabyte. So you can see there's a big difference there; obviously disks are a lot cheaper. So let's talk about solid state disks. Solid state disks are basically made out of flash memory, oftentimes put into a form factor that you can plug in to the same interfaces as a regular disk. Back in 1995, they started replacing disks with battery-backed DRAM cells, but then around the late 2000s, NAND flash became very dense, and so they started using flash. The good thing about flash is that there are no moving parts: it eliminates the seek and rotational delay, and it's low power. The downside is limited write cycles. There have been rapid advances in capacity; in fact, I have a really fun flash drive I'll show you in a moment. The basic architecture inside the flash drive is a bunch of these NAND flash devices and a memory controller that is busy figuring out how to do what's called wear leveling, so that we don't overwrite individual cells too often, because if we do, we wear them out. And so the flash controller takes the file system's requests for data, puts a virtual layer on them, and actually writes to its own notion of where things go. There's a translation table inside the SSD to decide which cells get used, and that's all done in a way that's pretty much transparent to the computer you plug it into. And typically it'll read a 4K byte page in about 25 microseconds. So that's pretty fast: no seek, no rotational latency.
The transfer time for a 4K page might be about 10 microseconds. And notice also that these 4K pages are something you might find a little surprising: we're still reading and writing in four-kilobyte chunks, even though you could say, well, these are just individual bits. That page-at-a-time interface is used because the devices themselves are organized a page at a time. Okay, the latency here for this device is gonna be queuing time plus controller time, just like with the disk, plus transfer time, which is different. And random access is gonna be fine with these, right? It's not a big deal to read one part as opposed to another, because there's really no seek or rotational latency. So there's no advantage to locality other than within a page. So the highest bandwidth on this device is sequential or random, either way. Okay. Now, the question here is: if flash memory controllers also have to worry about balancing writes, are they notably slower than disk controllers? No; the thing that's really limiting all of these is the speed of reading and writing the individual bits themselves, so the controller is not the limiting piece in most cases. Now, what it does do is occasionally transfer data around to rebalance things, and in those circumstances you can run into a situation where you wanna read or write and you're being held up because the controller itself is reading or writing. So that is one possible cause for things to be a little slower than you might expect, but it's still a heck of a lot faster than disks. So writing data is pretty complicated, okay? 'Cause you can only write to empty pages in a block. So yes, here we have a 4K page; we can only write 4K at a time, but then we have to erase a whole block of pages at a time, say 256 kilobytes at a time. And once we have erased a bunch of blocks that are co-located in the physical device, then we can start writing them.
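Putting those SSD read numbers next to the roughly 9 millisecond random disk read from earlier gives a sense of the gap; queuing and controller time are ignored in both cases, purely as an illustration:

```python
# Random 4 KB read: SSD numbers from the lecture (~25 us page read,
# ~10 us transfer, no seek or rotation) vs. the ~9.082 ms random
# disk read computed earlier. Queuing/controller time ignored.
ssd_us = 25 + 10        # page read + transfer, microseconds
hdd_us = 9082           # ~9.082 ms random disk read, in microseconds

print(f"SSD: {ssd_us} us, HDD: {hdd_us} us, "
      f"speedup ~{hdd_us / ssd_us:.0f}x")
```

A random 4K read is a couple of hundred times faster on the SSD, which is why locality stops mattering the way it does on a spinning disk.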
And so part of the process that we have to do in order to have a solid state drive is to keep some number of groups of erased pages that are ready to go. The controller is busy making sure that it has a free list of chunks of erased pages, and when those chunks run out, it's gotta erase some blocks to get more. And so part of the interface between the file system and the SSD has to be to tell the SSD controller which blocks are no longer in use, so it can put them on the free list, gather them up into groups of empty 256K blocks, for instance, and then go through an erase process before reusing those blocks for something else. I'll say a little bit about how the SSD physically works in just a slide, so hold that off for a sec. So erasing a whole block here takes about 1.5 milliseconds; writing is faster. The controller has to maintain a pool of empty blocks. Writes are about 10x the time for reads, and erasure is about 10x the time for writes. So erasing things is definitely a slow process, but it's still a lot faster than going all the way out to a spinning disk. I guess I don't have a figure for what's going on here, but the question is how SSDs physically work, and the answer is that the cells are basically like NAND transistors, with the exception that there are two plates that are separate from each other with an insulator between them. The write process traps electrons on one of them, and then you can sense that they're trapped there, and erasing raises a voltage high enough to drive those electrons back off. And so really the reason the write time is so high is that you're basically shoving electrons across an insulator to get them trapped, and thereby indicate a one or a zero. I guess I don't actually have a picture of this...
So basically, are you charging and discharging capacitors? Not quite; it looks like a capacitor, except that you're shoving charge onto a plate through an insulator. So you basically have to get the voltage high enough to drive the electrons across something that's not actually a conductor. So it's a little different than a capacitor. I wanted to show you: here's a typical SSD drive that you can go and buy from Amazon without too much trouble. It's, say, 15 terabytes, so it might be a $6,000 drive, about $0.41 per gigabyte. That's not too bad. This is the one I wanted to show you. This one you still can't quite buy, but this is the Nimbus. It's a hundred terabytes. It's got dual 12-gigabit-per-second interfaces and can write very rapidly. And they guarantee you unlimited writes for five years. Can anybody figure out why a company could offer unlimited writes for five years, even though flash wears out as you write it? Yeah, so there are two comments here that are essentially correct. Yes, they know they have their wear leveling working well. And if you think about it, there are so many blocks in here that you could write continuously for five years and not wear out every block. And so the company can say for a fact that it doesn't matter how hard you write; their wear leveling is just gonna keep redirecting you in a way that it won't wear out. And I wanted to point out that, by the way, I tried last year when I was teaching this class to find out what the price of this thing was gonna be, and there was speculation of about 50K for this guy, but who knows; it's still not available. The question is: what's the difference between a SATA SSD and a PCI SSD? The difference is that the SATA SSD looks just like a disk drive and runs the disk drive interface, whereas the PCI SSD is a slightly different interface that looks more like memory.
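The wear-leveling and erase-before-write bookkeeping described a moment ago can be sketched as a toy flash translation layer; the class, its method names, and the tiny sizes are all made up for illustration, and a real FTL is vastly more complex:

```python
# Toy sketch of SSD bookkeeping: the controller keeps a pool of
# already-erased pages, rewrites go to fresh pages from that pool,
# and a translation table remaps logical pages so the file system
# never sees the shuffling. Names and sizes are made up.
class ToyFTL:
    def __init__(self, num_pages):
        self.erased = list(range(num_pages))  # pool of pre-erased pages
        self.table = {}                       # logical page -> physical page

    def write(self, logical_page):
        if not self.erased:
            raise RuntimeError("must erase a block to refill the pool")
        old = self.table.get(logical_page)    # old copy becomes garbage
        self.table[logical_page] = self.erased.pop()
        return old                            # to be reclaimed by erasure

    def trim(self, logical_page):
        """File system says this page is free; reclaim it (after erase)."""
        phys = self.table.pop(logical_page, None)
        if phys is not None:
            self.erased.append(phys)          # really: erase, then reuse

ftl = ToyFTL(num_pages=4)
ftl.write(0)
ftl.write(0)    # rewriting logical page 0 lands on a fresh physical cell
print(ftl.table, ftl.erased)
```

Note how the `trim` call models the interface mentioned above, where the file system tells the controller which blocks are no longer in use so they can go back on the free list.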
Okay, let's see. SSD prices keep dropping, and so it looks like at some point SSDs might cross over hard disks in terms of price per byte, but hard disks still somehow keep moving forward. And this 100 terabyte device is pretty cool, and it's a very small form factor, but it's still really, really expensive. All right, I wanted to show you an amusing calculation. So the question here involves Kindles, which are, believe it or not, specialized reading devices. I don't know if you guys have even seen those anymore, but I happen to love them because you can sit in the sun and read. You might ask the question: is a full Kindle heavier than an empty one? So here's the experiment: you get a Kindle from Amazon with no books on it, then you add books until you fill it up, and the question is, is it heavier when you're done? And the answer is actually yes. Okay, now what was funny was that somebody from the New York Times forwarded this question to me once upon a time for one of their little science columns, and the answer is yes, but you gotta be very careful about what you mean when you say this. So flash works by trapping electrons. And when you trap electrons, you're basically raising the energy of the transistor, because that's a higher energy state. And so the erased state has lower energy than the written state. And assuming that at the time you could get your Kindle with four gigabytes of flash in it, half of all the bits in a full Kindle are in the high energy state; this is just saying that when you have books, it's a random set of ones and zeros. You do a little bit of a calculation, you use E equals MC squared for the energy-to-mass conversion, and what you come up with is that a full Kindle is about one attogram heavier than an empty one. And what's an attogram? That's 10 to the minus 18th of a gram.
The most sensitive scale you can find out there measures about 10 to the minus ninth grams, so you can decide whether this counts as heavier or not. There are also a lot of caveats to this. Like, for instance, if the thing is warmer, that will add more weight than this 10 to the minus 18 grams. Or if you wear the battery down, the energy lost from the battery will make it lighter by much more than what you gained in weight. So the only way you can really even do this experiment is to take the Kindle, fill it with books, then cool it back down and recharge it, so it looks exactly the same as it did when you started the process. And then you do that measurement, which you can't actually do because the difference is too light. But anyway, this is an amusing calculation. You can actually look it up; it ran on October 24th, 2011. And I did confirm this with a number of my colleagues. What's amusing about this situation is that right after we published this little calculation, suddenly everybody in the world was talking about how heavy the internet was. And so somebody came up with a calculation that the internet was the weight of a strawberry, and he had a whole video that he posted, which was rather amusing. You guys should check that out; some of that may still be up. And by the way, this calculation is only doable because we're extraordinarily careful about what we say and we set up a careful experiment. None of the claims that the internet is the weight of a strawberry make any sense. So the summary for SSDs: the pros versus hard drives are very low latency, high throughput, no seek or rotational delay, no moving parts, very lightweight, low power, silent, and shock insensitive. You can essentially read at close to memory speeds. You could imagine that the file system you set up for an SSD might be completely different than the file system you set up for a disk drive. We'll talk a little bit about that.
Some of the cons are that the storage is relatively small and expensive compared to disks, but it's catching up. I don't remember the last time I bought a laptop with actual spinning storage in it; I always buy SSDs now, because they're big enough for what I need in a laptop and they're much more reliable and lower power. Spinning storage is still used a lot in cloud computing and so on, but for portable devices, SSDs are definitely it. And certainly SSDs are no longer small, okay? So that's important. One of the cons is asymmetric block write performance: writes are more expensive than reads, and, by the way, you have to keep a bunch of spare blocks that you've erased, so that's a little different. There's a limited drive lifetime, because if you write too much to a given cell, you wear it out. So the average failure rate is about six years, and life expectancy might be nine to 11 years, but all of this stuff keeps changing. Okay. I did wanna point out, by the way, that there are a lot of really cool alternatives. The thing about flash is that it's non-volatile: when you turn the power off, you don't lose any data, just like with a disk drive, but it's kind of more like memory, okay? However, what's very interesting is that you can do better, and there are a lot of interesting alternatives. One I think is pretty cool is from a company called Nantero, which has nanotube memory. So what you see here is a carbon nanotube; it looks like this, with carbon atoms at all of the different dots. And they can basically set up a crosshatched, three-dimensional set of nanotube cells where there's a difference between one state where electricity goes through fairly easily and another state where the path is broken up a bit. That's the difference between a one and a zero. And you can rewrite it over and over again, there's no wear-out, and it's potentially as fast as DRAM.
And potentially denser. So we're actually talking about maybe getting more data per physical chip than DRAM. So this is kind of exciting, exciting enough that if this, or a couple of other technologies like it, were ever to take off, we might not have DRAM anymore and just have non-volatile RAM. And then pretty much none of your data would ever go away, even in memory. And that's gonna change the way people build operating systems and storage systems, if all of your memory everywhere is always non-volatile. Okay. So there was a question on here about what the other technologies are. There are other ones, by the way, that use actual magnetic domains. I don't know if you remember, at the beginning of the term I talked about core memory, which was these little Life Saver-like rings that would be magnetized one way or the other, and that would give you a one or a zero. So there's a version of that, shrunk down to chip scale, that's being looked at. There's the nanotube memory. There's phase-change memory, where the material has two phases, a crystalline one and an amorphous one, and that's another way to get ones and zeros. So I think there's a lot of exciting things on the way. All right, so next time we're gonna talk about performance a little bit more, but one of the things we're gonna have to worry about here is, in this user thread versus queue versus controller versus IO path, what are the most important things? And as we get through that, we're gonna have to confront this curve, which is the queuing curve. We're gonna figure out a little bit of where that comes from and talk about how to confront it, okay?
So it's not just the device itself or the controller; the queue itself is gonna be an important part of our response time, and we're gonna wanna make sure that we're not at the point where the curve really starts going up rapidly, but rather down in the linear region. Okay, so in conclusion: we've been talking a lot about disks. Latency is queuing time plus controller time plus seek plus rotational latency plus transfer time, with the rotational latency being half of a rotation, and the transfer time based on how fast things are rotating and the bit density of the disk. We talked about how devices have complex interactions and performance characteristics, where the response time is queuing plus overhead plus transfer time; for hard disks, it's queuing plus controller plus seek plus rotation plus transfer. We talked about SSDs, where you don't have the seek and rotation, but you do have to worry about erasure and wear-out. File systems are gonna be designed to optimize for this, and so next time we're gonna talk more about that queuing component and then dive into some file systems: how are file systems designed to deal with these response-time characteristics, and fundamentally, what's different about an SSD file system? So that'll come up. And bursts and high utilization are also gonna introduce queuing delays that we're gonna have to confront. One of the things I didn't talk about was midterm two; we're gonna give you more information about that on Piazza, so watch for that. I hope you guys have a great couple of days, and we'll see you in Thursday's lecture. All right, talk to you later.