All right, let's get started. So, you know, Tuesday's lecture was, as I had said at the time, kind of atmospheric, a little bit soft. Today I want to get into some of the details of these out-of-core algorithms that I alluded to at the end of the last lecture. And the two canonical techniques, and we're going to see how they're related today, are out-of-core sorting and out-of-core hashing. And I will tell you that next week you'll get a chance to implement out-of-core hashing as Homework 2. So, good day to pay attention if you'd like to do well on your homework. But we'll be reinforcing this; you'll get your hands on these algorithms personally. All right, so we'll start with sorting. And you know, sorting seems like such a basic thing, but what are we going to use it for? Why spend time learning about sorting? Well, the first reason to do sorting is because it brings similar things together. It allows us to do this thing I was calling rendezvous last time: getting multiple things together in the same place at the same time. And when I say at the same time, I mean in RAM at the same time, typically. For high performance, you're also getting it in cache at the same time on the processor, right? So really, we're trying to get locality, locality in space and time, for multiple items. And when do we want to do that? Well, for instance, if you're trying to eliminate duplicates, you'll often sort, and then you can go through the sorted list and the duplicates will be right next to each other. Okay? Or if you're summarizing groups of items: suppose I want to find the average GPA grouped by department at Berkeley. I need to get all the computer science majors together and compute the average for those, and get all the cooking science students together and get the average for those, and whatever is alphabetically after that. Okay. Sometimes people want to see the data in order and they ask for it that way. So sometimes the output's just gotta be ordered. A typical example would be web search: please order these by decreasing likelihood that I'm gonna click on them, right? So you compute a probability distribution of how likely you are to click on stuff and you might need to do some sorting along the way, okay? And then sorting is fundamental to some upcoming things we'll be talking about, like the sort-merge join algorithm, which involves sorting, not surprisingly. And as we start building indexes, if you're gonna take a big set of data and build an index on it, you'll typically sort it first and then build a tree over that sorted set of data. So that's another reason we do sorting, and so on. So here's a simple example of the problem. I give you 100 gigabytes of data and 1 gigabyte of RAM, or I give you 100 terabytes of data and 1 terabyte of RAM. How do you sort that data? Well, you might say, well, memory map that data into virtual memory and then call quicksort. What's wrong with that? Anybody? Anybody who took CS 162, what would be bad in this scenario about allocating a 100 gigabyte file in virtual memory? Yes? Yeah, you page fault. When you're doing quicksort, if you think about it, half the data goes here, half the data goes there, and then it recurses, and there's not a lot of locality. So you're doing a lot of random access into virtual memory, which results in a lot of page faults to go get this virtual memory.
Typically, you're looking at like one small little bit of a big disk page and paying all the cost to bring it off disk into memory, and then just looking at one memory address and then dropping it again. So you do a lot of IO, and as anybody who's over-allocated memory on your machine knows, it's incredibly slow. Things just become untenably slow because it's orders of magnitude more costly to do those IOs, right? So we're gonna need to be much more intelligent about how we access data on disk than randomly looking at memory cells that may have been paged off to disk, right? So the virtual memory abstraction, while it's very nice for programmers, is lousy for performance when you're doing these large data tasks. And instead, we'll be more methodical about how we design our algorithms to take into account that some of our memory is slow, that is to say disk drives or SSDs, and some of our memory is relatively fast, let's say RAM, and we wanna take advantage of that. And when you micro-tune these things, you might also worry that some of my memory is even faster, like L1 cache or L2 cache and so on, okay. Before I teach you a sorting algorithm, it is a little bit useful to know a few things about disks and disk drives and where they are technologically today. So I spent a little time on the internet last night looking up prices and speeds and things for disk drives that you can buy today, so we get a snapshot in time as to where these things are going. So, a lot of databases actually still use magnetic disks. And the crazy thing about magnetic disks is that they're mechanical devices, right? They spin around physically and they have these disk arms that move back and forth. They're really like kind of 1950s technology that's just been refined and refined and refined and refined. And the technology that gets better is the magnetics and electronics of the medium, the stuff they spray on the disk, right, the stuff you're storing. But the physics of spinning the disk and moving the disk arm, that's just what it was a long time ago. So they're really an anachronism. If you're using magnetic disks, you're using these physical-world devices and they're very slow. So this has major implications. There's no dereferencing pointers like you do in memory. There's not even an API on a disk drive, whether it's an SSD or a magnetic disk, to say I would like to allocate an object. The API says I would like to read a block of bytes, right? And that block of bytes is typically larger than the kinds of variables you have in your program, right? So the API is: read a fixed amount of data, usually a transfer unit from the disk drive manufacturer, something like 8K or 64K of data. You can read that from disk into RAM, or you can write it from RAM back to the disk, all right? And that's your API. So we're gonna have to work with that API in our algorithms, all right? We will not get to say allocate Java objects and never think about it again. We will have to say: put Java object into page, write page to disk; read page from disk, go get Java object back off. And actually, let's hope we're not doing it in Java, cuz it's really not a very efficient language. All right, those API calls are wildly expensive, especially with magnetic disks, and I'll show you what I mean by that quantitatively in a slide or two. So we're gonna wanna plan these IOs carefully. You don't just wanna do them all the time. And if you're gonna pay for an IO, you better get some goodness out of that, right?
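(Just to make that block-level interface concrete, here's a minimal sketch in Python of what it might look like. The names, BLOCK_SIZE, read_block, write_block, are hypothetical, my own for illustration, not any real disk driver's API; the point is just the shape of it: fixed-size transfers addressed by block number.)

```python
import os

BLOCK_SIZE = 64 * 1024  # a typical transfer unit, e.g. 64K (assumed for illustration)

class BlockFile:
    """All the storage gives you: read or write one fixed-size block by number."""
    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT)

    def read_block(self, block_num):
        # one IO: pull a whole block from "disk" into RAM
        return os.pread(self.fd, BLOCK_SIZE, block_num * BLOCK_SIZE)

    def write_block(self, block_num, data):
        # one IO: push a whole block from RAM back to "disk"
        assert len(data) == BLOCK_SIZE, "no partial transfers at this interface"
        os.pwrite(self.fd, data, block_num * BLOCK_SIZE)
```

Everything we build this semester, packing records into pages, the sorting and hashing algorithms coming up, has to sit on top of an interface shaped like that.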
You wanna get the most out of every IO that you do. Now, this sounds like our hands are kind of tied, right? It sure would be nice to have an abstract data structure representation rather than think about blocks of bits. But the truth is that sometimes having an explicit API is a good thing. And there's been a persistent desire, persistent in two ways, over the years to have programming languages that store your objects on disk for you; they're called persistent programming languages. And this has been sort of a thread in language design and system design for many years, and it's never quite succeeded. Why can't we just have persistent Java objects? Or C++ objects or what not? And part of the answer is because it's hard to make that perform well. And then part of the answer is, think about this: you've got your data on disk. It's gonna be there maybe for a day or a year or longer. So any bugs that are in your program are now being persisted forever. One of the nice things about software is it runs and then you turn it off and then you start it up again, and anything you screwed up before maybe nobody noticed, right? But when you put it down on disk and you try to persist it, all the kinds of programming errors that you get will become persistent errors. And so a lot of times, by having this block interface, you're a little bit more careful about the stuff you put into these blocks and write to disk than you are about the rest of your program. And so there's a bit of a discipline in getting the software right, given that you're gonna pay the cost of doing the IO. You tend to get things a little more right and you don't do it for data that isn't really important. So I guess what I'm trying to say is it's not all bad that we're dealing with this low level interface, even from a software engineering perspective. There's some good to it. Okay, it's certainly good for performance. So I went on the newegg.com website last night just to get some prices: what do you get for a thousand bucks at Newegg? So you can get like an eighth of a terabyte of RAM for a thousand bucks, which is actually kind of crazy. You can get 2.65 terabytes of solid state disk, which is flash, okay? Or you can get 26.5 terabytes of magnetic disk. So you can see there's like orders of magnitude here, right? Magnetic disk is still an order of magnitude cheaper than flash. That said, a thousand bucks is not very much money, and you get a lot of flash for a thousand bucks. So from an absolute cost perspective, a lot of problems will fit in flash or even in RAM, and it's well worth solving many, many problems over in this domain over here on the left, right? And disk drives are really for very big problems these days or for long term storage; they're kind of the tape drives of our time, to some degree. So when we talk about disk drives, magnetic disks, in this class, it's mostly for these very, very large workloads that they'll matter. And a lot of the time we'll be more interested in performance on SSD, which is good because SSDs are a little more predictable than magnetic disk drives from a performance perspective. We'll go through that next, okay? But we will make you think about magnetic disk drives a little bit. So, you probably have seen some version of this in every class. I actually wandered by Professor Demmel's class this morning, and he was talking about memory hierarchies, I assume, in a scientific computing course.
And he was more refined than this actually, because they're more interested in the top of the hierarchy. But disks are slow, registers are fast; disks are big, registers are small, right? And you make trade-offs when you write programs, and when you build architectures, moving data up and down this hierarchy, okay? From our perspective in this class, we're mostly gonna focus on whether things are in RAM or not. That's gonna be the level of the hierarchy we're gonna be most interested in in this class. However, some of the things that we talk about in this class can actually be translated up the hierarchy into finer grain implementations where you're trying to put things into, say, L2 cache from RAM and back. Okay, so some of these lessons about managing these two levels of the memory hierarchy, you can pop them up. And I think as time goes on, in the next five and ten years of your career, some of these database ideas will become increasingly prevalent in in-memory algorithms in terms of managing cache. Okay, typically today you have a lot of RAM in a database server, because you wanna avoid going to disk whenever you can, because it's very, very expensive, as we'll see. Your disk is for the database, okay? And the backups or logs of that database if you use a logging technique. So you can think of disks as secondary storage, meaning they're not memory, they're the database. You can also think of them as backup, as tertiary storage. It used to be that tertiary storage was a special word for tape, but people essentially don't use tapes at all anymore, they just put things on disks. So disks play both those roles today. And flash, it varies how it gets used by deployment. Some databases are deployed with flash as the main database. Sometimes it's deployed as a caching tier in between RAM and the database, and the economics of exactly how to use SSD in the stack are evolving. I'm pretty certain what's gonna happen is it will replace disks over the next five years, but right now it's in kind of a funny spot, because it is still an order of magnitude more expensive than disk drives. Well, and it's faster, quite a bit faster, but still, it's a little bit more expensive. All right, so this is a picture that was in a paper from like the 90s by one of the founders of the database field, Jim Gray. And it's still a good picture even though the numbers are probably wrong. But this gives you sort of a sense of the orders of magnitude between registers, on-chip cache, on-board cache, memory, disk, tape robot, right? So if you imagine that going to a register on chip is like going to your head to look something up, well then going to the on-chip cache is like it's somewhere in the room. Going to the on-board cache, another order of magnitude, is like it's somewhere in the building, okay? Another order of magnitude to get to RAM, that's like going to Sacramento in distance. Well, okay, that feels like a lot, right? You wouldn't go to Sacramento very often to go get something, right? And if you've got a little piece of information you want to remember, you're unlikely to store it in Sacramento, since you'd have to go back and forth every time to get it. So you should think pretty hard before you go to Sacramento. And my god, disk drives are like Pluto, okay? Because it's four more zeros. That's crazy, okay? So magnetic disks are super slow.
And then things like, you know, only crazy sort of NSA-style data sets need to be on tape robots anymore. But that's, you know, that's insanity. That's like another galaxy, right? But this really gives you a little bit of a more visceral sense of what these order of magnitude performance differences mean. Here's a picture of a disk drive. We're gonna go through this quick, because it is kind of old technology. Disk drives are like LP players. They spin for real, and they spin slow: like 7,200, maybe 10,000 rotations per minute. They're still reported in RPMs, like a turntable's 33 and a third, right, for LPs. These are 7,200; it's faster, but not that much faster. Okay. There's this arm assembly, usually on a pivot, that moves in and out on the disk drive, like a turntable tone arm. And it has to position itself on one of these concentric tracks, all right, on the disk drive. And actually, on some disk drives it's a spiral and on some disk drives they're concentric. You might as well just think of them as being concentric, that's fine. And if there are multiple platters on the spindle, which in some disk drives there are, still two or three platters, then you can imagine that moving the disk arm gets a bunch of different disk heads on top of a bunch of different tracks on different platters. And all those tracks you get at the same time from a physical point of view: the arm is now positioned over this track, this track and this track. So you can think of that almost as a cylinder instead of just a circle. So they call them cylinders, in fact, or cylinder groups. Okay, and those are all the tracks that are lined up so that you don't need to seek between them. You don't need to move the disk arm, which is called seeking, in order to transfer from this track to this track. It's just electronic to transfer between those; you're reading from a different read head. Okay, so each one of these arms has a head, and the tracks under the heads make a cylinder. Typically only one of these heads is reading and writing at any one time. And typically they make the density of these tracks the same even though the ones on the inside are smaller than the ones on the outside; they kind of spread out the data on the outside. Not always true, but it's a nice, satisfying assumption, and on some devices it is true. And so there's a fixed number of what are called sectors, which are the little pizza slices on each of these tracks. And then a block, which is the thing we're gonna transfer to and from the disk, is a fixed number of sectors. And that's actually set by the operating system when you configure the drive. So the important thing to remember is this disk arm movement, which is called seeking; there's gonna be cost for that. There's rotation, and then there's a block transfer. And we're gonna go over that on the next slide, but it's useful to see a picture. These things used to be really big, right? And now they're little; they can be really small. All right, so to access a disk page, here's what you do. And here are some times that I got off recent benchmarks on the web last night. So, time to access a magnetic disk block on a typical desktop magnetic disk, so a cheap disk, not enterprise grade disk, sort of commodity disk. Seek time, the time to move the disk arm to the right place, is like two to four milliseconds on average, all right? And it depends how far you have to move it, right?
So if you're moving from the very outside to the very inside, that's the most, that's like eight milliseconds. If you move it just one track, it's very little, and it scales up sublinearly, but that's not important: it's two to four milliseconds on average to seek to another track. And then once you've seeked to the other track, you have to wait for the data to spin under the head, right? And that, again, usually takes about two to four milliseconds. So the cost of seeking and then the cost of rotational delay both add up to the cost of a random IO, all right? And then there's the transfer time, which is a good order of magnitude less, which is just: get the data once the physical stuff's in position, just read the data off the medium, all right? That's an order of magnitude faster. But you can see that with magnetic disks you really care about getting IOs that are all close to each other, and trying not to seek, all right, and trying not to induce rotational delay. So the key to lower IO cost is you try to come up with software solutions that will reduce your seek and rotation delays. Or you can come up with hardware solutions too, like get rid of magnetic disks and use flash, right? But that took a while before we could do that. Our algorithms by now already know how to avoid these seek and rotation delays, and you're gonna learn algorithms, some of which do more random IOs, some of which do less, okay? And we'll try to flag that as we go. So the seek costs and the rotational delay costs of random IOs will be something we may be a little sensitive to in this course. Whereas sequential IOs, because the disk arm doesn't have to move, are gonna go faster, all right? And to that end, we're gonna arrange pages on disk with some logical notion of things being sequential, of there being a next block after the current block. So you can think about it: the disk is spinning, the head is here, this is block number one. As it rotates, the next block that'll come under the head, think of that as block number two, okay? So those things are very close if they're next to each other on the same track. If they're next to each other but on different tracks of the same cylinder, that's just an electronic switch between reading from this head and reading from that head, right? So that's fast too, so that's no problem. And if they're on adjacent cylinders, you're gonna have to move the disk arm with a small seek, okay? And so that's closer than a random IO, but it's more expensive than the same cylinder. So there's this notion of nextness that comes from this proximity metric, right? And what we'd like to do is take a big file of data and arrange it sequentially according to this notion of next, so that when you do a full scan of that data, the disk arm kind of goes like that, and the data is just rotating under the heads as the disk arm moves, right? The way you play an LP record, in fact. That's what you wanna do with your data, right? You just want it to spin and feed data out as fast as it can, right? Random IOs are gonna do this, and remember, there's like four milliseconds every time you move that disk arm and it makes that horrible noise, all right? The other thing you can do with sequential scan: sometimes this thing is reading and the processor falls behind because it's off computing something about the data that it's getting.
So you might want something else, maybe a controller on the disk or something, to actually be reading ahead and buffering data, so that if the processor suddenly gets busy for a minute, when it wakes up again and comes back, it can grab the data that was read under the head. What you hate to have happen is: you wanted to read some data that was spinning under the head, the processor goes off to do something else, and by the time it wakes up and wants the data, you missed it and you have to wait for it to come all the way back around, right? So there's usually some kind of track cache or something like that in the disk drive to hold the contents of a full spin, so that the processor, if it does a thread switch or whatever, will still get that full track when it comes back. Okay, so much for magnetic disks. I'm not gonna tell you a lot more. That information's obviously in the book as well, with very old numbers. So here are the numbers for SSD, and we can flip back and forth. SSDs, actually, there are various different technologies. There's NAND flash and NOR flash. Flash drives can be raw or they can be packaged into these SSD devices which emulate magnetic disks, right? And there's a whole bunch of trade-offs around this, but what I'm telling you is what's become pretty much commodity. So this is, again, statistics about the stuff you typically buy and plug into a computer these days. But what I'll warn you is that this changes like every year or two. It's a very evolving technology, and what I tell you today may not be true next year and you may need to rethink this. The general trend is they're getting way better, which is good. All right, reads tend to be smallish. Many of these devices do 4K reads, right? Some of them will do a 64K read, but not always. But the point is you can do small reads efficiently on a flash drive. So you can get down to 4K on many of these flash drives efficiently, and the access time is like 0.03 milliseconds. So remind yourself what it took to do a seek: two to four milliseconds. That's two orders of magnitude more if you do a random IO on a magnetic disk. It's a hundred times slower than doing a random IO on a flash drive. So they can be a hundred times faster for these random IO workloads, which is a huge deal. And that gives you 4K random read bandwidth, this is again coming from this benchmark I found on the web last night, of about 500 megabytes a second. If you do sequential reads, excuse me for my typo. Wow, it's really embarrassing. C and X are right next to each other on the keyboard, in my defense. And of course, because my computer's not working, I can't fix that. So we're just gonna leave it up on the screen. But we're all adults. These things happen. That's silly. Okay. That's pretty fun. I think it was the Apple auto spell, right? S-E-C is not a word. All right, 500, but you can see sequential reads are essentially no faster than random reads, or to say it differently, random reads are just as fast as sequential reads. They're just as good. So there is no difference on an SSD these days for reads between sequential and random. However, writes. So it used to be that writes were slow on SSDs. It was like reads were fast and writes weren't that fast. It's like, oh man, now I gotta think about all my algorithms and worry about how writes are more expensive than reads. But then three years later, that's not true anymore. So now a single write is about the same as a single read.
Random write bandwidth though, look at that, is still about a factor of three or four slower than reads. So writes are still a little bit slower than reads. Sequential writes, on the other hand, are fast. And this has to do with all sorts of crazy things: under the covers, they're moving all your blocks around because they don't wanna overwrite the same flash cell too many times or it'll burn out. So there's a little microcontroller that's saying, okay, block number two isn't there anymore, it's over there. And they keep doing that under the covers for you in an ASIC. But the net effect from our perspective is just that random writes are slower than sequential writes. So there is still some worry about random versus sequential even in SSD land. Pain in the butt, right? So if you take a worst case kind of view, you say, fine, SSDs are kind of like magnetic disks still: there's still a cost for randomness, and we'll try to remember to avoid being random when we can. There is this concern about write endurance. So you can't write the same flash cell more than a few thousand times, which is why those ASICs keep moving stuff around to try to avoid hotspots on the disk. And as I started reading some of the latest numbers, it's sort of considered reasonable in many organizations to replace your SSDs every six to 12 months. Which from a cost perspective means your storage budget just went up a little bit more. So SSDs are actually not only more expensive, but they're way faster than magnetic disks. So there's some expense involved. And I will point out, if you're Google or a very big internet service, the expense of maintenance, the human cost of maintenance and downtime and all that, is something you have to account for as well. You have to go in there, pull out the drive, put in a new one. The software has to auto recover the drive, et cetera. So all that's been optimized quite well in these data centers these days. It's not like there's a guy in the back who might trip over the power plug and screw it up, but it's still an expense. All right, so some pragmatics, and I've been telling you this, but here are some numbers. Many significant sources of data are really not that big. The daily weather around the globe over the course of 80 years: 20 gigabytes. The US census: 200 gigabytes. The English Wikipedia from 2009: 14 gigabytes. So a lot of problems are like, I'm gonna build a knowledge base of all of mankind's knowledge, it shall be like Wikipedia, and it fits on your phone. So a lot of data is not that big, and when you have a data problem that doesn't mean you need to build a giant MapReduce cluster. And I think we should all be very pragmatic about that. I'm gonna teach you a lot of things in this class that are great for big problems and may or may not be good for small problems. And you should choose wisely what tools to deploy when you're implementing things. Yeah. That said, data is still growing faster than Moore's Law. So if you're sort of in the industry that's dealing with data streaming off of software logs, streaming off of devices, you may very well have problems that are bigger than your compute can handle. So these two things are happening at the same time, so to speak. Some data that's really interesting is kind of constant size, and so it's just kind of getting swallowed up by the hardware technology.
But the hardware technology is also generating data, which means that there are always gonna be these kind of big, high-end problems in computing. And the funny thing when you think about it: what would you do if I gave you all the compute power in the universe? If I was God and I said you could compute as much as you want, it's free. Or if I was like Jeff Bezos. What would you do with all those computers? You'd be like, well, I gotta keep them busy. I could search for extraterrestrial life. I could compute digits of pi. Or I could get me a whole lot of data. And that's what I would use those computers for, right? So if you believe there are cloud scale computing problems, you probably believe there are cloud scale data problems. That's the only way to keep those computers busy. So if you're in the computer science business, you're kind of in the big data business. Think about that. But if you're in the science business and you have a really interesting data set, it might fit on your laptop. All right. So the bottom line for now: very large databases are still deployed in a relatively traditional fashion. Disk is still the best cost deal, but SSDs improve not only performance, but the variance in performance. So they're very predictable. Their reads and their writes aren't that different in cost. Your algorithms don't have to think quite as hard about that. And also in deployments, and we've studied this actually in my research group with workloads from LinkedIn, some of LinkedIn's servers were flash-based and some weren't. The ones that weren't flash-based would do things like drop connections a lot, because every once in a while they'd do a random IO or get bad behaviors from their magnetic disk drives, and it would screw up their performance profiles. And so they'd have to write code that would defend against failure, where failure just meant we didn't respond to a message in time because the disk was seeking. Whereas on their SSDs, everything was very smooth and predictable, and the software usually did what you expected it to do. And so a lot of the edge cases didn't get exercised nearly as much. And it led to fewer crazy things like data replicas being out of sync and stuff like that, simply because performance on all the nodes was kind of the same all the time. So low variance in performance, when you get to really big systems, is a nice thing. Okay, that's another good thing about SSDs to keep in mind. As we said, the smaller database story is changing quickly, because flash is not that expensive in absolute numbers, and many interesting databases fit in RAM. This is gonna continue to evolve much faster than it used to. Disk drives were growing in a very predictable way for like 40 years, but now we're really in a time of flux. So it's interesting. Right, and we talked about this: data seems like it's huge, which it is, there's a lot of data that's huge, but a lot of really interesting data sets are small. So we talked about that, which is kind of thought provoking. And I guess the bottom line is some people, well, I shouldn't say many people, but many bytes will continue to be on magnetic disk for some time. So if you wanna work on a lot of the data, a lot of the data will be on magnetic disk still. It's just that that data may only be in the hands of a small number of people, because who's got problems that big? So you may just decide that magnetic disks just don't interest you and you'll always work with SSDs, and in a few years that may be a good assumption.
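(To make those access-time numbers concrete, here's a quick back-of-the-envelope calculation. The milliseconds are just the rough figures quoted above for a commodity magnetic disk and a commodity SSD, not a spec for any particular device.)

```python
# Rough per-access costs in milliseconds, using the ballpark numbers from above.
seek = 3.0            # moving the disk arm: ~2-4 ms on average
rotate = 3.0          # waiting for the data to spin under the head: ~2-4 ms
transfer = 0.3        # actually reading the block: an order of magnitude less
ssd_random = 0.03     # random read on a commodity SSD

disk_random = seek + rotate + transfer   # ~6 ms for one random block on magnetic disk
disk_sequential = transfer               # ~0.3 ms when the arm is already in place

print(disk_random / disk_sequential)     # ~20x: why we arrange data for sequential scans
print(disk_random / ssd_random)          # roughly two orders of magnitude: the SSD win
```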
Okay, meanwhile, back in the land of out-of-core algorithms, with all that kind of hardware background, remember this slide from last time; you're implementing this for Homework 1. Okay, so this was the simple case of map. We wanted to apply the function f to every x on the disk. And so we were gonna read in a page, let's say a 64K page, off the disk, put it in a buffer in memory, then take a tuple off of that, apply f to that x, and put the output in this output buffer. Whenever the input buffer is empty, you read another page load in. Whenever the output buffer is full, you write a page load to the output, right? That's our streaming computation model. We're gonna use this as a subroutine in everything, okay? Here's a slightly better version of that. It's called double buffering. And it's worth knowing as a performance optimization; I'm gonna mention it today, and we'll never mention it again. But if you're like a high-performance sort of person, you wanna do double and triple buffering. The basic idea is you get two threads going, because when you issue an IO, it takes a while for it to come back, even with an SSD, and your processor's just sitting there, right? So you want your processor to do something. So what you do is you fire up two or more threads to issue IOs. One of them is doing what we described before: it's filling that input buffer, or it's draining that output buffer. But the other one is reading the next input buffer in advance. Or maybe, if this guy's getting behind, it's there to write an output buffer that you're not using anymore. So basically the way this works is the main thread runs f of x on one pair of buffers, let's say those are the blue ones, all right? Right now f of x is being run on the blue buffers. But there's another IO thread that's filling up the other input buffer and maybe draining the other output buffer. And then when the main thread is ready to, say, read another block or write a block, well, there's a block ready to go, because the other thread either filled it or drained it. And so you just swap the blues and the browns, and you didn't have to wait for the disk drives. The thread that's doing the computation is kind of always computing, and the other thread is doing the IO in the background. And so in any of these discussions that we have for the rest of the semester where we talk about that streaming model, you should assume that it goes real fast. And you might want to use a double buffering substrate under your code, with a couple of threads, to make that IO pipeline not stall. Okay? I'm not gonna bring it up again. And of course it costs you a little bit of RAM. And in a lot of our analysis, it'll be like, oh, there are B minus one buffers because we used one buffer for output. And in the back of your head: well, it could be two buffers for output if we do double buffering. But I won't make you remember this from now on. It's just a useful technique to keep in the back of your head. All our analysis from now on, we will not talk about double buffering. But just so you know.
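(Here's a minimal sketch of that double-buffered streaming loop, assuming hypothetical read_block and write_block helpers and a per-record function f; it's illustrative Python, not the homework framework. One thread prefetches the next input block while the main thread computes on the current one.)

```python
import threading, queue

def double_buffered_map(f, read_block, write_block, num_blocks):
    prefetch = queue.Queue(maxsize=1)      # holds the "other" input buffer

    def io_thread():
        for i in range(num_blocks):
            prefetch.put(read_block(i))    # fill the next input buffer in the background
        prefetch.put(None)                 # sentinel: no more input

    threading.Thread(target=io_thread, daemon=True).start()

    out = 0
    while True:
        block = prefetch.get()             # swap buffers: this one was filled while we computed
        if block is None:
            break
        write_block(out, [f(x) for x in block])
        out += 1
```

A fancier version would double-buffer the writes too, with another thread draining finished output buffers, which is where the extra output buffer mentioned above would come from.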
All right. Last time I gave this lecture, I was like, we're gonna sort some stuff. And then I gave the lecture, and afterwards people were like, so what did you mean, we're gonna sort some stuff, really? So let me write down really what I mean. All right, so you're given a file F. F contains a multi-set of records R. A multi-set meaning there could be replicated records and every one of them counts, okay? So there could be replicas in this set. Let's say that file consumes N blocks of storage, where a block is a fixed unit of storage, like 64 kilobytes; it's the unit of transfer from the disk. Okay, I'm gonna give you two disks to do this with. Each of them has plenty of storage on it, more than N free blocks, so we don't have to worry about whether it fits on these two disks. We've got two nice empty disks to work with. And we're gonna have a fixed amount of space in RAM, which is gonna be smaller than N. Let's call it B. It's gonna be measured, again, in blocks, that is, the unit of transfer from the disk. 64K is again a good number to keep in the back of your head as the unit of transfer; it's typical for today's disks. So for sorting, here's our goal. We're gonna be given an input file F. We're gonna produce an output file, a new file, F sub S, okay? And this new file is gonna have the contents R, that multi-set of records, sorted in some order, where the order is some arbitrary sorting criterion based on fields in the records or something like that, okay? So we're gonna generate a new output file. For hashing, here's what we're gonna do. We're gonna produce an output file called F sub H. But here, the contents of R will all be there, but they're gonna be arranged in such a way that if two things are the same, they're next to each other. Which is to say, if two things are the same, there's nothing that's different in between them, right? Basically, matching records are always stored consecutively in F sub H, right? It doesn't matter what order the things are in. It could be all the twos and then all the 11s and then all the fours and then all the 100s, but it'll be all the fours together. There won't be like an 11 with a four on either side of it, all right? Here we go, let's do sorting. We're gonna do it kind of incrementally. I'll teach you sort of a naive algorithm and then we'll generalize it until we get one we like, all right? So here's the naive out-of-core sorting algorithm. It's called two-way sorting. And it's gonna be not a divide and conquer algorithm; it's gonna be the opposite, a conquer and merge algorithm. And here's how it's gonna work. We're gonna read in a page into our IO buffer. I animated this, so I have to stand at the keyboard. We'll read in a page, we will sort the page, and then we will write the page. And then we'll get the next page and we'll do it again, okay? So that's pass zero. So we're gonna conquer each page. Each page will be a beautifully sorted little piece of data. So there'll be some records from the file there and they will be sorted, right? So we've conquered little pieces of the problem. Now what we need to do is merge these conquered pieces together, which is like zipping them, essentially. So pass one, and then passes two, three, et cetera, will be: take two runs, each of which is sorted, and stream them through memory. Right now each run is only one block long, so there's no streaming involved. You just read one, you read the next, and you stream the output, which will be two blocks long, to the disk, merging the records so they stay in sorted order. So at the end of this second pass, we will have, sorry, runs of data that are two blocks long and sorted. So we take the one-block-long sorted runs and write two-block-long sorted runs, and then we recurse back to the other disk.
We will take the two-block sorted runs, merge two of those together, streaming them through memory, to get four-block sorted runs. And then we will take the four-block sorted runs and stream them through memory to get eight-block sorted runs, et cetera. And you can kind of imagine a diagram that looks like a tree. You start out by sorting the individual blocks, that's called pass zero, and then the remaining passes, one through n, are merging pairs of runs. So here's an example of data that came in; it's two values per block, it's teeny. All right, and you can see we sort it in pass zero to get that second row of blocks. In pass one, we're merging blocks of size one into sorted runs of size two: two and three, four and six, four and seven, eight and nine, et cetera. In pass two, we are merging runs of size two into sorted runs of size four, and by the way, if you don't have an even number of pages, they just carry around some empty stuff. And then in pass three, we get one big run of size N, right? So we conquer these things and then we start merging them. And the merge is done in this recursive fashion where the runs get doubly long each time. All right, so let's think about the cost of this algorithm. There are N pages in the file, right? So the number of passes is what? How many passes, as defined here on this slide, will we take on this file? Shout it out. Log N, log base two, right? This is just a tree that we're essentially building, right? It's just that in this tree of behavior, every time you do a pass, you take a full read of the whole data set and you write the whole data set, right? You appreciate that these passes are expensive. They're full passes of the data, and each involves two IOs per block: one for the read, one for the write. And then you do it again, one read, one write, and then you do it again. We'll do that log base two of N times, okay? So the total number of passes over the data is log base two of N, plus one for pass zero, all right? Just to sort those initial runs, you have to read the file. So it's log base two of N plus one. And the total cost of this is, well, every pass reads N blocks and writes N blocks. So it's two N IOs, if you're measuring this in block IOs, times log base two of N plus one. That's the total cost measured in disk transfers, right? We're not accounting for the fact that maybe with double buffering, reads and writes are happening in parallel. We're just measuring how many disk transfers we do. That's gonna be a good coarse-grained measurement of how expensive this thing is. It's kind of an N log N algorithm. Not surprising, it's sorting, okay? But we can actually do better here. How can we do better? Any intuition? There's some, yeah. Yeah, surely we have more than 128K of RAM to work with, right? So let's get some more buffers going in RAM. Maybe we can sort bigger things and merge more things, right? So let's maximize that idea. Actually, the slides do it stepwise, but let's maximize it; we'll do it right here. So let's say you have more than three buffers, which is typical, all right? Here's what we're gonna do. One buffer is for output. Oh, this is the merging phase, sorry. The picture's for the merging phase. But first we have to do pass zero. In pass zero, what we're gonna do is we're gonna fill up memory, which is B big. We're gonna read in B blocks of our file. We're gonna sort them in memory using quicksort or whatever you like. We're gonna write that run to disk. Then we'll take the next B blocks of the file, sort them in memory and write them to disk.
It's just like pass zero before, right? But we're doing it on B blocks at a time. And at the end of that, we will have N over B sorted runs on disk, each of which is B big. Yep, so that's pass zero. And then the remaining passes are merging, but now it's a streaming algorithm to merge the data, because each one of these runs is B blocks big and we're only gonna have one input buffer per run. So we have N over B sorted runs, we have B minus one input buffers and one output buffer. And what we're gonna do is take B minus one runs and merge them together to make runs that are B minus one times as long, right? So that's the general flavor. So now we have to figure out the specifics of how much this costs, okay? So how many passes do we take over the data in this context? We take one for pass zero. And then how many passes do we take after that, reading and writing? Yeah? Log base B of N. Log base B of N, actually log base B minus one, okay? Because there are only B minus one of those buffers. But in big O notation, log base B would do just fine. Okay, so that's right. So it's log base B minus one of N over B, because the runs have already been broken down in pass zero, so there are only N over B runs, right? So pass zero generated N over B runs to be merged, and we got a head start: we only have N over B runs to build that tree out of. Make sense? Okay, so that's the number of passes. The cost, as we know, is read the whole file, write the whole file, per pass. So the cost is two N times that. So that stays the same. And just to give you a concrete feel for this, if you have five buffer pages and you have a 108-page file, this is from the book, I don't know why he picked these numbers. But bottom line, you get this done in four passes, including pass zero, okay? And we could go through the numbers. I won't do this in class. It might be useful to do it in your head or in section just to make sure you're not missing something. Figure out, wait, we're merging this and that, and we get how many? It's worth working through. Okay. And you can check this, and, you know, the formula matches this intuition: it indeed is four passes. The formula predicts the behavior that we actually animate, so to speak. All right, this gives you a little better sense. You know, if you have a sizable number of buffers, even for a very big file, you will have a small number of passes. And the truth is, for most workloads, you can do it in two passes: pass zero and one merge, right? For most workloads, that's true. But we might ask, what do I mean by most workloads? Okay, so how much memory do we need to sort stuff in two passes? Well, we know that after pass zero, for each sorted run, we read in B blocks, we sort them, we write them to disk. So after pass zero, each run is B long, right? And we know in pass one we're gonna merge B minus one things together. So if we're gonna finish, then the whole file could only have been B minus one times B blocks long to do it in two passes. Make sense? All right, well, actually that works out kinda handy, because what are we saying? We're saying that if you wanna sort N pages of data, you need about square root of N memory. Square roots are good, okay? Like, you can have a lot of data, but its square root won't be that big. A ratio of a square root between your storage and your memory is pretty healthy. And this is why, very often, if you've got a reasonably well provisioned system, you can do sorting in two passes: pass zero, pass one. Got it?
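(Here's that arithmetic as a tiny sketch you can check numbers against; the B equals 5, N equals 108 case is the book's example from above.)

```python
from math import ceil

def num_passes(N, B):
    # Pass zero: sort B blocks at a time, leaving ceil(N / B) sorted runs.
    # Every later pass merges B - 1 runs at a time until one run remains.
    runs = ceil(N / B)
    passes = 1
    while runs > 1:
        runs = ceil(runs / (B - 1))
        passes += 1
    return passes

def total_ios(N, B):
    # every pass reads and writes the whole file: 2 * N block IOs per pass
    return 2 * N * num_passes(N, B)

print(num_passes(108, 5))        # 4 passes, matching the book's example
print(total_ios(108, 5))         # 864 block IOs
print(num_passes(999_000, 1000)) # 2 passes: anything up to B * (B - 1) blocks fits in two
```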
All right, one more little detail, and this is always entertaining, although it's a little bit into the arcana, and you shouldn't, well, it's fun. Your in-memory sorting algorithm actually matters a lot, because you are gonna sort B blocks of data at once, which is probably about as much as you can fit in memory. And so the compute cost for this darn thing is actually pretty expensive. If you really wanna be a sorting god, which there are people who compete for, there's like an award every year, there are multiple awards for how fast you can sort, or how much you can sort with a small amount of energy for, like, you know, mobile, you gotta tune your in-memory sorting algorithm. So here's a curious thing about in-memory sorting, which you could go find out by reading, you know, Knuth's sorting and searching book, which is like that thick. Quicksort has actually been measured to be a really fast way to sort. It's pretty good. If you wanna sort fast, quicksort's good. But heapsort, sometimes called tournament sort, has some interesting properties for our purposes that are worth looking at, okay? So how many people have done heapsort before, or know how to do heapsort? Raise them high. Do you do that in like CS61A or something? CS61B, all right, good, all right. So heapsort's kind of funny when you're doing it on a big file. Let's watch how this works. We're actually gonna have two heaps in memory: the blue one, which we'll call H1, and the brown one, which we'll call H2. And what we're gonna do is we're gonna start by reading in B minus two blocks and inserting all the records into H1. So we're gonna build a big old H1 heap in memory, okay? We're gonna have an input buffer and an output buffer, which is why it's minus two. But here's what's gonna happen. Once memory is full, we're gonna read in the next page. We're gonna remove the minimum value from the heap, which is the lowest value in the file so far, right? We're gonna write it to disk. And then we're gonna have room for something new in memory, which is cool. But if we get something new from the input buffer, and it's smaller than the thing we just wrote to disk, it can't go in this run on the file, because it's smaller than the smallest thing in this run. It's gotta go into the second run of the output. Does that make sense? So we just wrote this first remove-min to an output buffer. It's gonna be the first thing, the lowest thing, on that run on disk. If the next thing you read is even lower than that, it can't go into this blue heap because it's too low. So anything that's smaller than stuff we already wrote out is gonna go into the brown heap. So as the blue heap shrinks, the brown heap will start to grow, right? And this is a kind of stochastic thing. So what are the odds that the thing you read in is smaller than the things you've written out? Well, it kinda depends on the order the data arrives in, but if it's all random, you can do an analysis of it, which I'll give you some intuition about in a minute. But whenever something comes in that's smaller than stuff you already wrote out, it goes in the brown heap. Eventually, the blue heap gets smaller and smaller, because most of the stuff you're reading is smaller than the stuff you already wrote, until there's only one thing left in the blue heap; you write it to disk and it's empty. And at that point, everything's brown.
And when everything's brown, just pretend it's blue and start over again, okay? So here's the interesting thing. How long are the runs in expectation? And how long are the runs in the best case, and how long are the runs in the worst case? So let's talk about the best case. What's the best case for this algorithm? How long of a run do you get? Okay, if the data's already sorted, that means the next thing you read is always bigger than anything you've ever written out, which means it always goes into the blue heap, which means that the first run you write is how long? It's everything, right? You win, you win, you win, you win, you win, and you're done. And it's all blue and it's all just one run. So if the data comes in already sorted, you actually only take one pass. That's cool. So that's a winner. Quicksort wouldn't work like that. We'd read in a bunch of stuff into memory, we'd quicksort it, we'd write it to a run. It would be size B, size B, size B. That's it, period. So quicksort can't win like that. Heapsort can get lucky; there's a case where it does well. What's the worst case scenario for heapsort? Louder? Yeah, if things are in reverse order, then everything you read is smaller than something you wrote out. Which means that you fill up the blue thing, that's B minus two big, and that's as big as the things are gonna be on the output. So you keep getting runs of size B minus two, which is kind of the same as quicksort. So in the worst case, you get runs about the length of quicksort's, but in the best case, you get runs that are as big as the data file. And in expectation, what do you get? Well, this is kind of fun. So there's this thing in the Knuth book, which I will not explain in detail, but it's quite interesting, called the snowplow analogy. And the idea is it's snowing, and the snow is falling down randomly on this track, and there's a Zamboni that's going around the track. You could ask yourself, how much snow does the Zamboni move in one pass of the track? Well, in steady state, the amount of snow in front of the Zamboni is whatever it is, and it tapers linearly in height behind the Zamboni. Right behind the Zamboni, it's at zero height, and as you go around the track, it tapers linearly up to the height it is right in front of the Zamboni. So in steady state, the Zamboni is essentially on a ramp, if you unwrap the track. And that ramp, you can call its size B. And this is kind of why you get two B worth of snow moved in one pass of the track. That's the analogy, all right? Because by the time the Zamboni's halfway around the track, the snow in front is back to the same height; it's always the same height right in front of the Zamboni. So it kind of moves that whole rectangle, even though behind it there's only a triangle. That's the two-B-ish reasoning. And then there's obviously a real proof, but that's the intuition that Knuth gives, which I always kind of liked. So the fact is the average length of a run, which I didn't tell you, is two B, or two times B minus two here. If you have randomly arriving data values, that is, the permutation of the data is random, then your runs will be twice as long as they are with quicksort, which means you'll have half as many runs, which means maybe, if you're lucky and the logs work out, you'll do one less pass with heapsort on average than you do with quicksort. We talked about best case and worst case.
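(Here's a small sketch of that two-heap run generation, often called replacement selection. It's illustrative Python: records and runs are plain lists standing in for blocks and output files, and memory_size plays the role of B minus two.)

```python
import heapq

def replacement_selection(records, memory_size):
    """Yield sorted runs; on randomly ordered input, runs average about 2x memory_size."""
    it = iter(records)
    current = []                            # the "blue" heap: can still go into this run
    for rec in it:
        current.append(rec)
        if len(current) == memory_size:
            break
    heapq.heapify(current)
    pending = []                            # the "brown" heap: must wait for the next run
    run = []
    for rec in it:
        smallest = heapq.heappop(current)
        run.append(smallest)                # write the current minimum to the output run
        if rec >= smallest:
            heapq.heappush(current, rec)    # still fits in the current run
        else:
            pending.append(rec)             # smaller than what we've written: next run
        if not current:                     # blue heap empty: this run is finished
            yield run
            run, current, pending = [], pending, []
            heapq.heapify(current)
    run.extend(sorted(current))             # input exhausted: flush both heaps
    if run:
        yield run
    if pending:
        yield sorted(pending)

# Already-sorted input gives one long run; reverse-sorted input gives runs of memory_size.
print(list(replacement_selection([1, 2, 3, 4, 5, 6], 2)))   # [[1, 2, 3, 4, 5, 6]]
print(list(replacement_selection([6, 5, 4, 3, 2, 1], 2)))   # [[5, 6], [3, 4], [1, 2]]
```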
Quicksort has better cache locality, it turns out, because walking down those tournament trees blows away your L1 and L2 caches; tournament trees are bad for cache locality. So quicksort actually goes faster in memory, but longer runs could mean fewer passes. So there are cases where heapsort will win. That's all very entertaining, but most people implement quicksort or radix sort or something like that. All right, but this is always fun, and I feel like it's worth throwing into lectures, so I do. So there. Okay, why don't we take a stretch? Everybody take a minute, get off your seat, get on your feet, and we'll stretch for a minute, and we'll move on to hashing. All right, don't get too excited, don't get too comfortable. Hopefully you got your pulse rate up a little bit, your adrenaline flowing a little bit, ready to concentrate. Got a chance to check your email while there wasn't a lecture going on. All right, let's talk about hashing. Here's the idea, just intuitively: if all we're trying to do is something like removing duplicates, or making sure all the CS students are together so we can compute their average, we don't actually want the file fully ordered. It seems like we're doing too much work if we're really sorting the file. All we want to do is bundle together the things that go together, which is kind of what hashing does, because all we're trying to do is rendezvous the things that are the same, the things that match. So for straight-up rendezvous, you don't need ordering, you just need equality matching, which is what hashing does, and maybe it's cheaper than sorting. So we'll look at how expensive hashing is after we're done with this, all right, at least in terms of disk IOs. Hashing in memory is often cheaper than sorting, although you have to be careful; like, a naive Java hash package is often pretty slow, so be careful. But how are we gonna do hashing out of core? Like, what does it mean to build an on-disk hash table? It doesn't actually sound easy at all. Well, here's the deal. We're actually not gonna build an on-disk hash table, and that's how it's gonna work; I'll show you how you do it. It's a divide and conquer algorithm. We're not gonna build hash tables on disk, we're just gonna divide the data, and then we'll conquer it with one hash table at a time on these partitions. So what we're gonna do is a streaming algorithm to partition the data into super-duper coarse-grained partitioned hash buckets. So here's the idea. We'll use a hash function, h sub p, P for partitioning, okay? So a record's gonna come in, and let's say we're looking at students, and major is the thing we wanna hash on. So we'll look at the major field in every student record, and we'll hash it with h sub p. And h sub p is gonna tell us a partition number to put it in, and there are not gonna be that many partition numbers. So it's very coarse-grained. It's gonna be modulo some not-that-big number. And all the things whose h sub p hash modulo that number matches are gonna go into a partition on disk that's not ordered, not hashed, not anything. It's just that all the things where h sub p equals, say, two will be stored together. It's a very coarse-grained hash function. And we're gonna use a streaming algorithm: you're gonna read in records and write them out to the appropriate hash partition as they stream in, in one pass of the file. And we'll call these spill partitions if you want.
And then in the second phase, we're gonna read one of those partitions in, and hopefully it's small now and it fits in memory, and we'll build a hash table on that, in memory. And then we're good. Now we'll just walk the hash table and pick up each hash bucket, and look at all the values that are the same in that hash bucket, and off we go. Okay, makes sense? First partition, sorry, first phase, we partition into things that we know: we know the apples went this way, the oranges went that way, no problem. But there may be strawberries with the apples; that's okay, that's cool, but all the strawberries are with all the apples. Then we put the apples and strawberries into memory, and we separate them in RAM. The oranges are over here, all of them. We good? All right, we'll look at a picture of it. So here's the first phase. The original relation, the original data set, is some number of blocks in some order, we don't know what. We use one input buffer to read those blocks off the disk, 64K at a time. For each tuple in that input buffer, we have a hash function, which is doing that streaming thing. And for each tuple, it decides, based on h sub p, which of the B minus one output buffers, all the rest of memory, it belongs to. So we're gonna take its hash, mod the number of buffers we have for this algorithm minus one: one for the input buffer and B minus one for the output buffers. And that's gonna generate B minus one partitions on disk. And hey, if each one of those partitions is only B big, then we're done. So in pass two, we'll read in the partitions one at a time and conquer them. So we'll read in a partition and we'll run an in-memory hash function, a fine-grained hash function per record, call it h sub r. And we'll build an in-memory hash table, which is gonna be as big as the memory we've got, which is B blocks. But it's just an in-memory hash table. It's your favorite hash map data structure. And then you can walk that thing, and all the apples will be in one hash bucket in memory and you can do the apple thing, and then all the strawberries will be in one bucket in memory and you can do the strawberry thing. And that's great if those hash partitions from phase one are smaller than B. So what does this cost? Well, a little schematic this time. We're gonna read the file from disk and divide it into partitions. So that's read N, write N. And then we're gonna read these things one at a time from disk into memory, building hash tables: read N, write N. First phase is streaming partitioning, that's like divide. Second phase, you take a partition and you conquer it with an in-memory hash table. Good, okay. So the cost is four N IOs, two passes. Read, write, read, write. What's the memory requirement? Well, let's think about this. How big of a table can we hash in two passes? Anybody see it? Each partition is how long? In the first pass, we partition into B minus one buckets. So each partition is, well, there are N blocks to begin with and we split them into B minus one pieces, so it's N over B minus one blocks long per partition, right? And then we said in the second pass, we'd like them to fit in memory, so they can't be bigger than B. So we have B minus one partitions, each one of which should be no more than B big, so that in the second pass it fits in memory. So we can hash a table of about B times B minus one in two passes. Partition it into things that are no bigger than B, and then read in those B-or-smaller things one at a time.
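(Here's a minimal sketch of those two phases put together. It's illustrative Python with hypothetical helpers, read_block and key; the partition lists are stand-ins for the spill files you'd actually write to disk through B minus one output buffers.)

```python
from collections import defaultdict

def external_hash(read_block, num_blocks, num_partitions, key):
    # Phase 1 (divide): stream the input once, routing each record by the
    # coarse partitioning hash h_p into one of B - 1 spill partitions.
    partitions = [[] for _ in range(num_partitions)]
    for i in range(num_blocks):
        for record in read_block(i):
            p = hash(key(record)) % num_partitions          # h_p
            partitions[p].append(record)

    # Phase 2 (conquer): read each partition back, one at a time, and build an
    # in-memory hash table on it with a finer-grained hash function (h_r).
    for part in partitions:
        table = defaultdict(list)
        for record in part:
            table[key(record)].append(record)               # h_r is Python's dict hashing here
        for group in table.values():
            yield group                                     # all matching records, together
```

If one of those partitions comes back bigger than memory, that's where you'd recurse on it with a different hash function, which is the recursive partitioning idea coming up in a second.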
So what does this cost? Well, a little schematic this time. We're gonna read the file from disk and divide it into partitions, so that's read N, write N. And then we're gonna read these things one at a time from disk into memory, building hash tables: read N, write N. The first phase is streaming partitioning; that's the divide. In the second phase, you take a partition and you conquer it with an in-memory hash table. Good, okay. So the cost is 4N I/Os, two passes. Read, write, read, write. What's the memory requirement? Well, let's think about this. How big a table can we hash in two passes? Anybody see it? Each partition is how long? In the first pass, we partition into B minus 1 buckets. There's N blocks to begin with, and we split it into B minus 1 pieces, so it's N over (B minus 1) blocks long per partition, right? And then we said in the second pass, we'd like them to fit in memory, so they can't be bigger than B. So we have B minus 1 partitions, each one of which should be no more than B big so that in the second pass it fits in memory. So we can hash a table of about B times (B minus 1) blocks in two passes. Partition it into things that are no bigger than B and then read in the B-or-smaller things one at a time. And again, that means that you can hash a table of size N in about square root of N space, which sounds awfully familiar from sorting, right? So the memory requirements of hashing in this context are pretty much the same as the memory requirements of sorting. This does assume that all the partitions are the same length, all right? So here's a bad case. Suppose I give you a file and it's hashed by gender, all right? And I say you've got B minus 1 output buffers, but there are only two values: there's male and there's female. So one partition's gonna be half the data, and the other partition's gonna be the other half of the data. And in fact, it'll never break down any smaller than that, and hashing is just not gonna work, all right? Or here's another one: I've got a Twitter stream and I'm looking at mentions of people's names. The Justin Bieber bucket's gonna be bigger than the Joe Hellerstein bucket, right? So you may just get skew in your data, and your hash function can't do anything about that, because all the Justin Biebers go to the same place, and hashing won't tear them apart. Okay, so there is a big assumption being baked in here, that your hash function is gonna be able to distribute your data evenly amongst your hash partitions. Now, if you have a partition that's too big, you can just recurse this algorithm, actually. So at the end of the divide phase, suppose that one of these rectangles in the middle here is bigger than B. Pretend it's the whole table and start over. And just do it again, okay? Two things you have to keep in mind. You have to use a different hash function this time; otherwise it'll all just go back into one partition again. So the next time around you have to generate a new hash function. Oh, and by the way, if it really is skew, if that entire rectangle is bigger than memory and it's all Justin Bieber, you can hash till the cows come home and it won't tear them apart, right? But if it's too big just because there's a whole bunch of different values and it's just too big, you just recurse on each one of those rectangles and run the algorithm over again from the beginning. Okay? So that's called recursive partitioning. You just keep partitioning till you get things of size B, and then you can do the conquer part. So you just need to keep dividing recursively till you get partitions that fit in memory, and then you can conquer them.
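Here's a rough sketch of that recursion, under the same toy assumptions as before. The salted hash stands in for "pick a new hash function each time around"; the depth cap is just a guard I've added so that pure skew, where a partition is all one value, doesn't recurse forever.

```python
# Sketch of recursive partitioning: if a spill partition is still bigger than
# memory, treat it as the whole input and partition it again with a *new*
# hash function (here, salted by recursion depth). Pure skew -- a partition
# that's all one value -- can't be broken up, so we cap the depth.

def h_p_salted(key, num_partitions, depth):
    return hash((depth, key)) % num_partitions    # different function per level

def recursive_partition(records, num_partitions, fits_in_memory, depth=0, max_depth=8):
    buckets = [[] for _ in range(num_partitions)]
    for key, payload in records:
        buckets[h_p_salted(key, num_partitions, depth)].append((key, payload))
    result = []
    for part in buckets:
        if len(part) > fits_in_memory and depth < max_depth:
            result.extend(recursive_partition(part, num_partitions,
                                              fits_in_memory, depth + 1, max_depth))
        else:
            result.append(part)               # small enough to conquer in memory
    return result
```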
So let's think about how this compares with sorting. We saw hashing, we saw sorting. This is kind of cosmic, and actually I don't think this was well understood for a while. Here's the external hashing picture we just looked at. We said it was 4N I/Os. Two passes, right? Read, write, read, write. Well, sorting was two passes too. What does the cost of sorting look like in this picture? Sorting actually looks like that: you read in B things, you sort them, you write them. You read in B things, you sort them, you write them. And then you merge B minus 1 things together and write them to the output. Sorting and hashing, from an I/O perspective, are complete duals of each other. One is divide and conquer, the other is conquer and merge. And actually, if you think about the I/O behavior on a magnetic disk, they're duals in terms of when you do sequential I/O and when you do random I/O. Think about this. In external hashing, you read in a block and you write it to a random output buffer, because it's a hash function. So which output buffer is written out first? The one at the top, the one in the middle or the one on the bottom? I don't know, it depends, right? It depends on what values come in. And in fact, the I/Os to those partitions will be random, because you can't control which partition's gonna fill when. So it's all random I/Os writing that thing. It's all sequential I/Os reading that thing, because you read the entire partition into memory and process it. And of course it's sequential I/Os at both of the ends. So in the middle, it's random going in and sequential coming out for external hashing. Now let's look at sorting. And, as you might expect, because it's the dual: read B pages into memory, write out B pages to disk sorted. That's all sequential I/O. Read in B pages again, write out B pages sorted, all sequential I/O. So the first phase here is doing sequential writes, but when you merge, which one of these input buffers do you drain first when you're zippering things together? I don't know, right? Could be any one of them. It could be that the lowest value is in the second thing or the lowest value is in the third thing or the lowest value is in the first thing. You don't know which one of these is gonna drain first, so the I/Os to read it in are random. So it's totally dual. It's just kinda cool. So from an I/O perspective, they're kind of the same. They just move around the costs of when you do random I/Os, when you do sequential I/Os and how many of them you do. And so really the performance differences come down to two things. I think, is this on the slides? Maybe not. Hashing is sensitive to skew, so you have to worry about that sometimes. Some of your partitions might be longer than others; you might not even get it done in two passes because some partitions are too long. Sorting is not sensitive to skew. Tournament sort actually gets some benefit from skew, if you like, if you have a skewed ordering or a bias in the ordering: tournament sort can do even better than expected, but it won't do worse. Hashing can do worse than expected. And then the other thing to keep in mind is the in-memory performance of your hash table versus your sorting. Okay? And those are things that you could tune up in your system depending. Hashing has some other attractions though, which we'll see in the homework. Maybe we won't see it in the slides; I can't remember if I tip you off in the slides. We'll see. I looked at them last night but I forget. Okay, parallelizing these things is a piece of cake. So you wanna run this on a thousand nodes in Hadoop? It's really easy, actually. You just add one more partitioning phase for hashing up front that decides which machine is responsible for which partition of the data. So we have super-partitions per machine using some other hash function, call it h sub n, n for node. It's even coarser-grained than h sub p, which was for the partitions. So every node reads its data, decides which machine to send it to based on h sub n, and sends the tuple in streaming I/O fashion over the network to the right machine. So every machine is talking to every other machine. It's called shuffle sometimes in the MapReduce literature. And then once data arrives at a machine, you know that all the apples are on the same machine. All the oranges are on some other machine. All the grapes are on some machine. So we're good. And now we've got partitions of the problem we can solve on a single node, and we run the algorithm you just learned. So that's super easy, okay? And in fact, if you're smart, you're pipelining your I/Os here and your communication with the beginning of this phase. So as this data streams off the disk, it streams through the network, it streams into these buffers, and it streams onto here. And that all happens without any staging of the data. Unfortunately, that's not what they teach you in MapReduce, which is a damn shame. MapReduce is actually not engineered for performance. It's engineered for fault tolerance, which is not important to you unless you're really big, like Google. But what MapReduce says is every time you shuffle, you have to store the output before you read it. So in MapReduce, there'd be a big disk in front of each one of these boxes, which is just pointless. So instead, you can just stream these things right through the memory of these nodes, and the first writes that happen on the individual nodes are here, okay? And unless you're running on nodes where you expect them to die in the middle of your hashing, there's no point staging the things onto disk, all right? We'll talk about MapReduce more later in the semester, but just point out there's no I/O on receipt of the data in this picture.
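A toy sketch of that shuffle step, with the same invented naming as before: a node-level hash function h_n routes each record to one machine, so matching keys never end up on different nodes. Lists stand in for the network streams; each node would then run the single-node partition-and-conquer code on its own share.

```python
# Sketch of the parallel-hashing shuffle: h_n assigns every record to a node,
# so all records with the same key land on the same machine. Lists stand in
# for network streams; real code would pipeline this rather than stage it.

def h_n(key, num_nodes):
    return hash(("node", key)) % num_nodes    # even coarser-grained than h_p

def shuffle(records, num_nodes):
    per_node = [[] for _ in range(num_nodes)]
    for key, payload in records:
        per_node[h_n(key, num_nodes)].append((key, payload))   # "send" to that node
    return per_node

# After shuffle(students, num_nodes=4), each node can run partition_pass()
# and conquer() on its slice, because matching keys never straddle nodes.
```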
Sorting is similarly easy, but actually not quite. So here's the deal. We wanna make sure that we're gonna stream data off everybody's disk and send it to the right node, and that node's gonna sort a subset of the data. So we're gonna partition in the first phase of this thing. But we have to partition sort-wise, which means negative infinity through something goes to the first node, something through something-plus-more goes to the second node, something-plus-more through more than that, right, et cetera. So you have to partition the value domain across nodes. It's not hashing anymore. And getting that partitioning right might be hard, because you don't actually know the distribution of the data. So if you think about it, suppose you plot your data values, the different values your data can take on, and there's, I don't know, a lot of data at this value and then a lot of data at that value. And you haven't seen the data yet, but that happens to be where the records are. There's like a million records with this value and there's only 100 records with that value and there's 20 million records with that, I don't know, these numbers are wrong, 20,000 records with that value, whatever. The point is you won't know how to split up the value domain to avoid skew. Does that make sense? We need to carve this thing up so that the area under the curve per node is the same, and that's pretty hard to do unless you actually know this distribution, right? Because we want every node to get the same number of records, which means the same area under the curve per node, but we don't know what the curve is. So usually what people do is they'll read a little bit of the data randomly, get a sample, try to draw that curve based on the sample, and then carve up the curve based on the sample of the data and build their partitioning that way. Okay, so you need some sort of master node to do some sampling and make some decisions about where the bucket boundaries are before everybody gets to work. And there you can get skew, because the sample might get the boundaries wrong, and then some nodes will get more data than other nodes. All right, so actually parallel sorting is kind of a bummer in that respect. Okay, but the actual parallelization of it is just partitioning it by value range.
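Here's a rough sketch of that sampling trick, again with invented names. A coordinator pulls a small random sample, sorts it, and picks evenly spaced sample values as splitters; each node then owns one contiguous key range. If the sample misrepresents the distribution, you still get skew.

```python
# Sketch of sample-based range partitioning for parallel sort: sort a small
# random sample and take evenly spaced sample values as splitters, so each
# node should get roughly equal area under the (unknown) value distribution.

import random

def choose_splitters(records, num_nodes, sample_size=1000):
    sample = sorted(key for key, _ in
                    random.sample(records, min(sample_size, len(records))))
    # num_nodes - 1 boundaries, one between each pair of adjacent nodes
    return [sample[(i * len(sample)) // num_nodes] for i in range(1, num_nodes)]

def node_for(key, splitters):
    # first node whose upper boundary covers this key; the last node takes the tail
    for i, boundary in enumerate(splitters):
        if key <= boundary:
            return i
    return len(splitters)
```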
Okay, so which is better? Well, they're the same. But let's keep a couple of things in mind; they're not totally the same. So, pros of sorting: we saw with heapsort that if the data's already sorted, heapsort is a one-pass algorithm. Neither hashing nor quicksort-based merge sort is. Another reason sorting might be good is that you had to sort anyway, because somebody asked for the output in order. You wouldn't want your Google search results hashed; you want the most likely thing at the top, not a random group of equally likely things at the top. So if the data has to be sorted anyway, then you'll use sort for everything. And as we said, sort, at least on a single node, is not sensitive to data skew or bad hash functions or anything like that. Hashing, here we go. So this is a tip-off for your homework. We're gonna see in your homework, sometimes you have a big file, maybe you've got a terabyte of data, but actually all you want to do is group it and, while you're grouping it, say, count it. So let's suppose you have a terabyte of data, but you're just trying to count up the number of males in this data and the number of females. All you need are two counters in memory, one for males and one for females. And as the data streams by you say: is it a male? Bump that counter. Is it a female? Bump that counter. Which is essentially a hash function: one hash value for males, one hash value for females. You can do that with two hash cells, for male and female, right? And one pass of the data. Even though the data's a terabyte big, you don't actually have to spill it at all. You just stream it in, partition it into however many distinct values there are, and compute that count, let's say, if that's what you're trying to do per partition, and you're good. So for certain tasks, like duplicate elimination or counting, as we're gonna see in our homework, hashing scales with the number of distinct values, not the number of rows or items, which can be a huge difference. Okay, so that can be a big win. And therefore you can sometimes simply conquer with hashing and you don't have to partition or divide at all. Okay.
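A tiny sketch of that one-pass trick, with hypothetical names: when the number of distinct grouping values is small, a hash table with one cell per value is the whole algorithm, no spilling required, no matter how big the input stream is.

```python
# Sketch of streaming hash aggregation: one pass, one counter cell per distinct
# key. Memory scales with the number of distinct values, not the number of rows.

from collections import Counter

def streaming_group_count(records):
    counts = Counter()
    for key, _ in records:      # stream the (possibly huge) input once
        counts[key] += 1        # "bump that counter"
    return counts

# streaming_group_count([("M", 1), ("F", 2), ("M", 3)]) -> Counter({"M": 2, "F": 1})
```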
All right, we're gonna finish a little early today, which is not a bad thing. So, summary: you should understand not just the sorting and hashing algorithms, but if you really understand that duality and you can explain it, then you probably understand the I/O patterns, which means you really understand what's going on. You should also be able to understand and generate those cost formulas and the memory requirement formulas for two-pass, three-pass, et cetera. All right, one more point: sorting is overkill for rendezvous, but sometimes it's faster than hashing anyway. Sorting is sensitive to your internal sort algorithm. We looked at quicksort versus heapsort. In practice, quicksort is fast enough that even though you get fewer, longer runs out of heapsort, it's probably not worth it, probably. And then in general, keep in the back of your head this double-buffering trick for streaming algorithms. It's a very handy thing to make your I/Os go faster. All right, that is it. Let me make an administrative announcement before we go, since we have a minute, and maybe I can answer some questions too. So everybody, shh, this actually is gonna matter to you logistically even if it doesn't matter to you intellectually. All right, so homework zero. Derek, where'd Derek go? There he is. You wanna make an announcement about when homework zero is now officially due? Yeah, homework zero is now due one day later than it used to be. It used to be due today at midnight; it is now due tomorrow at midnight, so Friday midnight. So you have an extra day to do it. Yeah, and homework zero is not hard, mind you, but there were logistics errors, so you get the extra day. However, here's the less good news: homework one is still due on Tuesday. So that's the real work. Again, it's not all that hard, it's pretty straightforward, but you gotta kind of muck around in the data. It takes some time. Get used to the tool chain, get started on it, okay? That is due Tuesday at midnight, or 11:59. Any other announcements? Discussion sections are meeting today and tomorrow. If you wanna go to discussion section, you can go. Oh, are there more? Yeah.