Okay. Hello, and happy Thursday. It is definitely Thursday, I checked. We're good, I've got that locked down. I know the day of the week. All right, have a seat. So I want to pick up where we left off last time. I'll do a little review to get us back in the mood, a quick tour of some of the greatest hits of Tuesday. So remember, we're organizing database records, which are logical, conceptual things, and we're storing them as, you know, little atoms on magnetic disks or in flash drives, right? So how do we get from here to there? We'll take you part way down the stack. I won't take you down to the atoms, but we'll take you part way down. All right. We're going to store records in files. Files will consist of blocks, and blocks will contain records. So we'll map the blocks into and out of memory, and we'll look at the records once they're in memory. As a little review from last time, one reasonable organization for files was this page directory picture. We liked that one. We had some blocks of our disk set aside to hold entries pointing to other blocks on the disk where the data is, right? The page directory would have a first page, the header page, and then a linked list of directory pages coming off of it. All of the directory pages contain entries of the form (disk address, free space), okay? So you can walk through this linked list of headers and quickly find, hopefully by only looking at a page or two, some pages with the free space you might need, say, for an insertion. And of course, if you're going to scan through the file, you can walk the header pages, pick up the blocks one by one, and scan through the data, okay? So that's a reasonable, not particularly sophisticated, but certainly workable way to lay out blocks in a file on the disk. Notice, by the way, that if you make your block IDs device independent, that is, maybe they're a (device ID, block ID) pair, then these files can span multiple drives, right? There's nothing stopping you from having these data pages stored on multiple drives other than how you represent block IDs. Right, so that's good. These things scale up. Okay, fair enough. The next thing we looked at was reasonable representations for storing records on blocks or pages. For fixed-length records, we were just going to pack them in, right? If all the fields are fixed-length, you can just pack them in and get to the different fields within the record by offset, knowing how wide each field is. So if you know the first field is eight bytes wide, you go eight bytes from the start of the record to get to the second field, and so on. If you have variable-length fields, the picture at the bottom of the screen is a pretty reasonable representation, where you keep an array of fixed-width offsets at the front of the record. Each one of those entries is, let's say, one byte, and the offset stored in it tells you where to go to get to the start of the field. So you can follow that logical pointer, which is really an offset from the beginning of the record, to get to the beginning of the field. You know the length of a field from where the subsequent offset starts. And for the last field in the row, you know the end of the record from length information stored elsewhere, as we'll see in a moment. Okay.
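As a minimal sketch of that layout (the two-byte offsets, the extra end-of-record offset, and the little-endian packing here are illustrative assumptions, not a prescribed format), here is what decoding such a record could look like:

```python
import struct

def read_fields(record: bytes, num_fields: int) -> list[bytes]:
    """Decode a record whose header is an array of 2-byte offsets.

    Layout (illustrative): num_fields offsets, plus one extra offset
    marking the end of the record, then the field bytes. Field i spans
    [offsets[i], offsets[i+1]), so its length comes from where the
    subsequent offset starts, just as described above.
    """
    header = struct.unpack(f"<{num_fields + 1}H", record[:2 * (num_fields + 1)])
    return [record[header[i]:header[i + 1]] for i in range(num_fields)]

# Build a record with three variable-length fields. A NULL could be
# encoded as two equal adjacent offsets, i.e. a zero-length field.
fields = [b"Smith", b"Alice", b"3.9"]
offsets = [2 * (len(fields) + 1)]
for f in fields:
    offsets.append(offsets[-1] + len(f))
record = struct.pack(f"<{len(fields) + 1}H", *offsets) + b"".join(fields)
assert read_fields(record, 3) == fields
```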
So that's a reasonable way to store variable-length tuples, and if you're storing fixed-length tuples but you're going to admit null values, you might also choose this representation, because then the null values collapse down to very small amounts of data. Okay. So that's one reasonable thing here, the field-offset kind of approach. Oops, sorry, we're here. And then, last but not least, we've got to put these records onto pages. We'll do that, for fixed-width records, probably with the thing on the right-hand side, which is an unpacked fixed-width representation. And remember the way this works: at the end of the page you have a bitmap that tells you all the possible places one of these fixed-width tuples could be, and which of those slots are free right now. We liked this better than keeping it packed, because if we keep it unpacked, then the IDs of the records, which is to say what position they are on the page, can remain the same even in the face of deletion of other records. If you start compacting things after deletion, then the slot numbers change for where the tuples are stored, and the ID of a tuple is a (page ID, slot number) pair. So essentially all the unpacked representation is doing is putting in a little level of indirection, where we always know the slot is in a given place, but we have to check that the slot is live by looking at its bit. And if we go looking for free space, the way we find it is by looking for the zeros in the bitmap. Finally, for variable-length records, we had this picture, the slotted-page picture. This is the more general version. In the slotted-page version, we can have variable-length records. We have, again, a slot directory at the end of the page that points, via offsets again, to where in the page each record begins, and the lengths of the records are stored inside the slot directory too. This is a picture from the book; I double-checked, so I kept it consistent with the book. Each slot holds a pointer and a length. The pointer in the picture is just an arrow, but obviously, in a computer, you don't have pictures of pointers. You have something that looks more like this: an array of bytes, where the slot directory at the end tells you where the free space starts and how many slots there are, and each entry in the slot directory is two numbers, a beginning and a length. So that's a review of the basic storage formats we looked at last time, hopefully all familiar.
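Here's a minimal sketch of that byte-level picture, with made-up layout details: a 4-byte slot count and free-space pointer in the page footer, and (offset, length) slot entries growing backward from the end. None of this is any particular system's format; it just makes the slot-directory idea concrete:

```python
import struct

PAGE_SIZE = 4096
FOOTER = struct.Struct("<ii")  # (slot count, free-space pointer), illustrative
SLOT = struct.Struct("<ii")    # (record offset, record length)

def insert_record(page: bytearray, record: bytes) -> int:
    """Append a record to a slotted page and return its slot number.

    Records grow from the front of the page; the slot directory grows
    backward from just before the footer. A record keeps its
    (page ID, slot number) identity even as neighbors come and go.
    """
    nslots, free = FOOTER.unpack_from(page, PAGE_SIZE - FOOTER.size)
    dir_start = PAGE_SIZE - FOOTER.size - nslots * SLOT.size
    if free + len(record) > dir_start - SLOT.size:
        raise ValueError("page full")
    page[free:free + len(record)] = record
    SLOT.pack_into(page, dir_start - SLOT.size, free, len(record))
    FOOTER.pack_into(page, PAGE_SIZE - FOOTER.size, nslots + 1, free + len(record))
    return nslots

page = bytearray(PAGE_SIZE)
FOOTER.pack_into(page, PAGE_SIZE - FOOTER.size, 0, 0)  # an empty page
slot = insert_record(page, b"hello tuple")              # slot 0
```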
All right. With that, let's jump ahead a little bit to where we started talking about page replacement. Here we go. The last thing we talked about was the catalog, which is straightforward, and we'll skip it. So then we started talking about buffer management and page replacement. So let's review. We have a buffer pool over the disk, right? The database is stored on disk; fixed-size blocks of the disk are brought into memory and stored in what are called frames of the buffer pool. The buffer pool is nothing more than an array of RAM that we allocate when we boot the database system, right? So it's virtual memory. And we just pretend that that array of RAM is sliced into disk-block-sized frames, and then we're going to copy disk blocks into those frames and replace them as needed, okay? That's the idea of the buffer pool. Page requests come in from higher levels of the system: I would say, I would like block 47 from this file. That's caught by the buffer pool, which checks to see if it already has it. If it doesn't have it, it issues a request to the disk, and then it's probably going to have to replace some other cached page in the buffer pool with the page that was requested, right? Don't forget about this notion of pinning pages. So remember, the state of the buffer pool is captured in a little, if you like, main-memory table with one, two, three, four fields: for each frame number, what's the page ID currently in that frame? What is the pin count of the page in that frame, that is, how many concurrent threads currently want us to keep it in memory? And is it dirty: has anybody changed the values on that page since it was read from the disk? Dirty means it's different from what's stored on the disk. So if the requested page is not in the pool, we choose a frame for replacement. It has to have pin count equal to zero; it has to be unpinned. If it's dirty, we write it to the disk, and then we read the requested page into that frame. And that's pretty much the job of the buffer manager, to do that all day long. When the buffer manager gets a page into memory, it pins the page automatically; that's part of the API. The caller requested the page presumably because they want to do something with it, like look at it, so they get that pin for free. But the caller must unpin it, and soon, or you're going to have no place to put stuff in the buffer pool. The requester of the page must eventually unpin it, and set the dirty bit correctly before they unpin it. That is it. All right, which gets us to the question of what's a good buffer replacement policy, which is where we left off last time. And goodness depends on access patterns: you can always construct an access pattern that makes a given buffer replacement policy look bad. So let's start with the default, which is the least-recently-used policy. Now remember, in this context of pinned pages, our definition of "used" is unpinned. When the pin count goes to zero, that's when the use has ended, and we timestamp that moment. So you track the time each frame was last unpinned, and you replace the frame with the earliest unpin time. Perfectly reasonable policy. It tends to work well for workloads where things are popular, and if something got referenced recently, that means it'll be referenced again soon. Another way to talk about that is temporal correlation between requests, temporal meaning in time. Requests to an item are correlated in time: if it's hot now, it'll be hot tomorrow, right? Okay, the problem with LRU is sequential flooding, and this is where we left off last time. So imagine you have a file and you're scanning it sequentially from start to finish. Say the file is a thousand pages long, and we're just going to scan it over and over and over, which seems like a brainless thing to do. So this is our file, and here's our buffer pool, which is 100 pages big, so it can only hold a tenth of that file. In this access pattern, how bad is LRU? How would you describe the performance of LRU?
Somebody shout out a word to describe the performance of LRU in this access pattern. Useless, I like that word. Pessimal would be another one. It's the worst possible, okay? It's as bad as you could be. It is always wrong. All right, let's just illustrate that so you get a gut sense of it real quick. It's funny how wrong things can be sometimes. We're going to do a little baby example. Our file is going to have five pages, A, B, C, D, and E, and our buffer pool is going to have, I don't know, three frames, numbers one, two, and three. All right, would somebody like to play buffer pool manager, please? Yes, thank you. You look like a manager. You're promoted to manager. So we're going to issue accesses to the buffer manager, and he is going to do LRU page replacement. We're scanning this thing in order, so the first page we will access is page A. Will you please put up A? Thank you. Now we would like page B. And now we would like page C. All right, the one thing we aren't doing is metrics. Whenever you build a high-performance system, you have to keep metrics on its behavior so you can evaluate how well it's performing. So what should we be measuring as we're doing this? Hit rate in the buffer pool. So how are we doing on hit rate so far? Zero for what? Zero for three. So we're going to need two numbers: hits, and attempts. All right, so we're zero out of three. What page are we reading next? D, Mr. Buffer Manager. What are you replacing? Yes, that is the least recently used page. We're assuming instant unpinning of these pages. And next, E, which replaces, yes, very good. And next, A. Where's A? Oh, it's gone. Too bad. Okay, so you guys get the idea. Thank you, buffer manager. Excellent. It's slightly more heart-wrenching when you make the pool just one frame too small, right? Because then you're always one page off from what you want. I should have done that; it makes it more wistful. So this is clearly a bad thing. What might be better? Actually, why don't we do this: I would like another volunteer. This buffer manager won't be a robot; it'll be a creative person. Who would like to be the artificially intelligent buffer manager? Actually, this is cheating. If I let you do whatever you want, you'll be really smart. Yeah, let's not even do that. Okay, I don't know how to make that fun, actually. You'll just win. So actually, maybe that's fun. All right, who wants to be a smart buffer manager? You get to do whatever you want as long as it gives the right answer. So you've got to serve up our pages. All right. I'm not watching. Is he doing anything? He might be prefetching for all I know. No, he's not prefetching. He's not that smart. Okay, go ahead. I haven't requested anything, so clearly the hit rate is zero for zero. Awesome. All right. What page shall we request first? Page A. Hey, nice work, man. All right, page B. And just for fun, okay, page C. Okay. Dude, very good. All right, so we've seen the benefits of prefetching. Page D. Got him. All right, thank you very much. Page E. We got him again. Page A. Yes, page A is next, right? Oh, interesting. We got him again, though. We got him. Not doing very well. Page B. We got him again. All right, this is really fun. Page C. Oh my goodness, nice work. All right, four for eight. Okay, let's try one more: page D. Very interesting. All right, would anybody like to guess what his protocol is? What'd you call it? Hashed? You said hashed. Interesting.
So you were associating file pages with buffer frames. Interesting theory. And what was your theory? Sorry, it was somebody else. Your theory. He's a C lover. He's just going to hold on to C. Interesting. So what was your protocol? Oh my gosh, it's a two-way associative cache. That sounds awesome. Give this man a microphone. Yes. Oh, you don't want to explain that? It was a really good policy. I'm going to go with C lover, though, because it's easier for me to understand. So suppose we just kept C there. We just like C. What's going to be our eventual hit rate, like, in the limit? Right? We'll hit on C and miss on everything else: one out of five. Could we do better? What do you think you can get it up to? I'd like to get it up to two out of five. I could be a C-D lover. That sounds weird; I was trying to come up with something funny. I don't know. I'll be an A-B lover. I'm always going to keep A and B in there, and I'll just rotate everything else through the third frame. Now what's my hit rate? Two out of five. That's pretty good. I actually don't think you can do better than that on this particular workload. Yes. Wouldn't that be nice? So let's keep A, B, and C. I like that idea. We've got three buffers; why can't we get three out of five? So I'm keeping A, B, C. Now where do I put D when I want to read it? Don't worry about it. Don't worry about it. Awesome. So that's an excellent example of the sort of stump-the-chumps thing that comes up when people are presenting research ideas, particularly around reliability, or when people are trying to enforce a guarantee on their system. Our guarantee was going to be a really high hit rate, and a lot of times the best way to achieve a guarantee is through unavailability. Your database will never break if you don't accept any requests. Your buffer manager will always have a 100% hit rate if you quit while you're ahead. I don't know how you make this useful in this particular example, but this can sometimes be the kernel of a good idea: that the way to achieve some goal is to withhold service for some period of time or for some set of purposes. Another example of this is admission control to a system. One way to avoid thrashing is to simply not let too many people use the system at once. You appear to be unavailable to some people, but for the people who are in, things work, and that avoids thrashing. So there are lots of places where this idea would be a good one. I'm not so sure this is one of them, but it's not totally crazy. Here, though, we need to keep the system responsive, so we can get two out of five. Most recently used, just out of curiosity: let's see if it works. Somebody want to be Mr. or Ms. Most Recently Used? It's a lot of fun. Yes, you're really far away, but come on down. I like it. All right, we're going to go really fast, so you're going to be a really fast processor. All right, no prefetching. Erase the hit rate; we'll start from a cold cache. All right: A, most recently used. B. C. All right, what's our hit rate so far? Zero for three. We're zero for three so far. Yeah, yeah, yeah, you've been fetching pages. All right: D, replace the most recently used. Nice. You had to fetch D; you're zero for four. You really want to count those initial reads, but I'm sorry, you don't get them. E, replace the most recently used. Ah. F? No, there's no F. A again; you're about to be one for six, because I said A. Yes, good. B, nice. C, yes, replace B, very good. All right, so this is not the same as "I love C."
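Here's a minimal simulation sketch of the game we just played: repeated sequential scans of a five-page file through a three-frame pool, with pluggable eviction and instant unpinning assumed, as in the demo. The function and variable names are just for illustration:

```python
def hit_rate(num_pages, num_frames, policy, rounds=1000):
    """Repeatedly scan pages 0..num_pages-1 and report the hit fraction.

    policy picks the victim page. "Used" here means the moment of
    access, since we assume every page is unpinned immediately.
    """
    pool, last_used, hits, total = set(), {}, 0, 0
    for t, p in enumerate(list(range(num_pages)) * rounds):
        total += 1
        if p in pool:
            hits += 1
        elif len(pool) < num_frames:
            pool.add(p)                            # cold frame, just admit
        else:
            pool.discard(policy(pool, last_used))  # evict, then admit
            pool.add(p)
        last_used[p] = t
    return hits / total

lru = lambda pool, lu: min(pool, key=lambda p: lu[p])  # least recently used
mru = lambda pool, lu: max(pool, key=lambda p: lu[p])  # most recently used

print(hit_rate(5, 3, lru))  # 0.0: LRU is always wrong on this pattern
print(hit_rate(5, 3, mru))  # ~0.4: MRU approaches 2 out of 5 in the limit
```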
Things are going to shift around a little bit, because the cache hits are then going to get replaced. Okay, but it's close, and it will get two out of five in the limit, right? So, thank you. All right, kind of interesting. The nice thing about MRU is it doesn't quite play out the way you might have expected. It doesn't look like that "pin two things in the buffer pool" trick; it actually behaves a little differently. So there's some subtlety here, right? There's enough subtlety that people in the 70s and 80s wrote lots of research papers about this stuff. It's a nice little computer science problem, kind of narrow but kind of fun. It's super important for the performance of your system, though, so it's worth optimizing. Besides LRU being really bad for sequential flooding, it has another problem. Can anybody think what the problem is? Maybe you know from CS162? Yeah: it's pretty expensive. Right? So think about this. You have, I don't know, 10 to 100 gigabytes of memory in your machine. That's a lot of pages. So you're going to keep the recent access history for all of those frames, right? How do you find the least recently used thing? Well, you probably don't want to sort the list every time; that's expensive. What would you do instead? It's an in-memory data structure keeping the unpin times of these frames, and you always want to find the least thing in it. So what's a data structure in memory where you can find the least thing? A heap, right? A priority heap supports insert and delete-min in log time. So you could use a heap data structure, but it's still going to cost you log in the number of frames to do your lookup. That seems like a bummer. It would be much nicer if it were a constant cost. This is at the very, very inner loop of all your work in the database system. So I want a constant-cost algorithm for replacing these pages. I don't want to pay a log every time, even though it's all in memory, okay? And that's what the clock algorithm does: it's a constant-time approximation-ish of LRU. It's going to achieve roughly what LRU would achieve, but with a slightly different replacement pattern, and in constant time. So here's the idea. We're going to take the frames of the buffer pool: A, B, C, and D. Unfortunately, on this slide, frames have letters; it's exactly the opposite of what we drew over here. I apologize. So A, B, C, and D are frames. We're going to arrange them into a logical cycle, as if they were the numbers on a clock, even though they're letters. And for each frame, we will store one bit, called the reference bit. You can think of it as the second-chance bit; that's another way to think about it, okay? And here's how this is going to work. You've got to keep all these things straight: the pin count is about "this thing cannot be replaced" and is not involved in any of this, right? While the pin count is greater than zero, we just skip the frame when we see it in the clock sweep; it's not a candidate. But the minute the pin count goes to zero, that frame becomes available for replacement. So when the pin count goes to zero, we turn on that second-chance bit, that reference bit. Having the bit on is good for you.
So the minute you're unpinned, you get your bit turned on for free, because you were recently used and we don't want to replace you for a while. Okay. And then when we need to do a replacement, here's the protocol. We're going to start marching the clock hand around the cycle from wherever it is now, right? So you look at wherever it is now, and we do this little loop going around the cycle until we break out of it. Right now the clock hand's pointing to B, so let's start there. If the pin count is greater than zero, we skip this frame entirely; it's not available for replacement. If the pin count is zero, then it's a candidate. If its reference bit is on, that means it should get a second chance, so we just turn off the reference bit. We're not going to replace you this time, but we turned off your reference bit, so you're now at risk, okay? If the pin count is zero and the reference bit is already off, meaning we already turned your bit off once and you haven't been referenced since, then you get replaced. Okay? So if the clock hand comes around past you twice, once to turn off your reference bit, and again without you being referenced in between, then we choose you for replacement. And we do this until a page is chosen. We keep going around with the clock hand, checking these conditions, until we find somebody with pin count zero and reference bit off. In the worst case, assuming something has pin count zero, we'll go all the way around, turning everybody's reference bits off, and then we'll grab somebody. Okay? That's it. It's pretty simple. Does anybody have a sense of what the reference bit thing is about? Why do things need a second chance? Yup. Exactly right, so I'm going to rephrase what you said. If the time between requests for a page is shorter than the time it takes the clock hand to sweep around the pool, the hand won't come around before you get referenced again. And notice what happens when you get referenced again: you get pinned, you get unpinned, and your reference bit turns back on, right? So if you get referenced within one sweep of the clock hand, you won't get replaced. So it's capturing frequency. And it also works against the odds that you might get unlucky, with the clock hand pointing just before the thing you just unpinned. What if you got unlucky? You just unpinned this page, it's the most recently used thing, and the clock hand says, I'm going to replace you. That's clearly not LRU, or anything like it, right? The reference bit is a way of saying: this frame has to be compared against everything else in the cycle before it gets replaced. It forces that frame to the back of the line, relative to all the things the hand hasn't looked at yet. It's not exactly LRU, because other things might also have been re-referenced and had their reference bits set to one. But you get the sense that the hand has to go all the way around and look at everybody else before it replaces something that just got unpinned. It's a little fuzzy, but there's some intuition there. It smells, in fact, a little bit like the snowplow analogy, but I hate to push that because I'm sure it's not true. All right.
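Here's a minimal sketch of that protocol, using the frame metadata from the frame table described earlier (page, pin count, reference bit); the class and method names are illustrative, not any particular system's API:

```python
class ClockBuffer:
    """A constant-time-ish approximation of LRU: the clock policy."""

    def __init__(self, num_frames: int):
        self.page = [None] * num_frames      # which page is in each frame
        self.pin_count = [0] * num_frames
        self.ref_bit = [False] * num_frames  # the "second chance" bit
        self.hand = 0

    def pin(self, frame: int):
        self.pin_count[frame] += 1           # pinned frames are untouchable

    def unpin(self, frame: int):
        self.pin_count[frame] -= 1
        if self.pin_count[frame] == 0:
            self.ref_bit[frame] = True       # recently used: one free sweep

    def choose_victim(self) -> int:
        """March the hand until a frame has pin count 0 and ref bit off.

        Note: this loops forever if every frame is pinned; a real
        system would give up and report that the pool is exhausted.
        """
        while True:
            f = self.hand
            self.hand = (self.hand + 1) % len(self.page)
            if self.pin_count[f] > 0:
                continue                     # pinned: not a candidate
            if self.ref_bit[f]:
                self.ref_bit[f] = False      # spend the second chance...
                continue                     # ...but skip it this sweep
            return f                         # unpinned, no chance left
```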
Here's a question I think some of you who've seen this stuff before have asked: why didn't I already learn this in 162, and why are you teaching it to me again? What's different between what happens in the database and the operating system? Well, a couple of details, and it's just details at some level, but they're useful and you should certainly understand them. The first is this notion of pinning, which is not something the operating system usually provides as an API. We need pinning to let our queries manipulate pages safely. If you're an OS person, one way to think about it is that it gives a critical section around the page. It's not a general critical-section interface; it's particular to pages. We can pin these pages; the OS doesn't have that API. Another thing is that our buffer manager talks to the disk storage manager of the database. When it says to push something to the disk, the disk storage manager will respect that request and really, really put it on the disk. There's going to be logic later on for durability, for making sure the database really contains what it says it contains. We're going to want to manage the whole stack: the cache in memory, which is the buffer pool, and the explicit writes to the storage medium, through the disk manager. That coupling is explicit in the database software: this cache talks to the disk manager, and we know what it's doing. That matters both for forcing pages to disk when we want them forced and, as a side effect of that, for being able to order our writes. The database may care that page A is written before page B, and it can enforce that. The operating system has a tendency to lie about that stuff to give better performance, but with uncertain guarantees, and we're not going to be comfortable with that. We'll learn more about it when we learn about concurrency control. Another difference is that maybe we change the replacement policy over time, or have fancier things than LRU. LRU doesn't work well for sequential flooding, and clock doesn't either, so we might come up with something a little different. Many database systems do use a somewhat more sophisticated, or just really different, replacement policy, whereas operating systems use LRU or clock, actually. If you're curious, I won't teach it in this class, but it's easy and kind of fun: there's a policy called 2Q, which you can Google. It's on Wikipedia, I'm pretty sure, and it'll come up in search hits from previous editions of this class. The 2Q replacement policy has two queues, as it happens. One of them behaves like LRU, and the other works well for repeated sequential scans, and it sort of bridges the difference. You can go read about the 2Q replacement policy if you're curious. All right. So we're avoiding that OS file cache, and we're controlling our write timing, our prefetching, all of that by hand. So the summary of last lecture, of this slide deck: we talked about disks and tapes, and we talked to some degree about flash. We remember that, for magnetic disks at least, we need to minimize seek and rotation delays. We talked about why the database is not the operating system, and we went through these various structures for storing things on disk. And with that, any questions before we go on to the next topic? Okay. Mental break. I heard a stupid joke. Let's see, am I going to get this right? What did the physicist say when the atom said it was missing an electron? Are you positive? That was on my daughter's seventh-grade math homework.
That's where I get my jokes these days. All right. With that, let's... we did that slide deck already. Time traveling into the past. Time travel, it works. It's great. Here we go, going on now with the present. The next thing we're going to talk about is file organizations and indexing. When the authors of our textbook were putting together the book, I was helping them and developing this class, so I was involved with a bunch of the slides and a bunch of the chapters. But the most fun we had was picking quotes for things, and this quote is just so excellent. Sears, Roebuck was the catalog company, the amazon.com of its time. They sent out these big catalogs, and the catalog had these instructions: if you don't find it in the index, look very carefully through the entire catalogue. Things like this. Awesome. We don't want to do that. Sequential scan is bad. All right. How can we organize files to do better than sequential scan? We're now at the layer of our database which is going to connect up what we did before on query processing with what we've been doing on storage. It's a full, single-user stack, up to at least the iterator model of programming. By the way, systems like Spark and MapReduce provide an iterator-style programming interface: Map and Reduce are two different kinds of iterators, and Spark supports similar things. So at some level, what we'll complete today is a full stack for a developer API to a query processing system, everything from the programming model of iterators down to storage, and all the algorithms in between. So that's kind of nice. We're getting there; we're putting things together. So the goal for today, in this deck anyway, is to get some kind of coarse-grained sense of things. Next lecture, and maybe the lecture after, we'll learn some particular data structures. Today we're going to get a coarse-grained, wet-finger sense of how expensive data access is in different file organizations. So we're really going to simplify things a bit. And we're doing this to focus in on what matters the most, a sort of move-the-big-rocks-first philosophy. But along the way we will have a little discipline. There are going to be a bunch of assumptions, and I will state them very carefully, and they will mostly be wrong. But they won't be that wrong. They'll be a little wrong, not wrong enough that we won't learn from this exercise. This is sometimes how you get your head around things: state some assumptions, get your head in the game, and later on you can refine. So we're going to identify these assumptions, and we're going to estimate costs of operations based on them. The estimates we'll get will be concrete, little miniature formulas that we can stare at. And in fact, what we're doing is getting a taste of what a query optimizer does; we'll see this later in the semester. A query optimizer looks at a SQL query that you write, which doesn't talk about storage or anything, just these abstract notions of tuples and tables, and it tries to guess how expensive it would be to run that query using one plan versus another. Its guesses at how expensive things are will be based on stuff like what we're going to talk about today. So this is, in some ways, planting the seeds of query optimization in your brain.
There are lots of ways you can organize a file, and these are ideas that go back probably to the 1950s, when accessing storage devices like drums was really expensive. And, like everything in databases, it depends: these different organizations are going to be good at different things, and we'll want to implement more than one of them in our systems. So, heap files are unordered files. We talked about these last time; they're the files we saw last time. When I say heap file in this class, by the way, all I mean is unordered. Not a heap data structure with find-min. This is an annoyance in the literature, that these are both called heaps. When a database person talks about a heap file, all they mean is a whole bunch of tuples; you can't necessarily do find-min on them. Just a whole bunch of tuples in a heap. Data structures people, for whatever reason, invented the priority heap, and that's the find-min thing. That's different. A heap file is just an arbitrary collection of records. A sorted file is a file that is sorted. We'll have to say how it's sorted, and because we're in a relational context for most of this class, the way we'll talk about how it's sorted is in terms of a search key that you can search on. Think of your table as having many columns, which have names: student ID, last name, first name, GPA, et cetera. The search key will be some concatenation of fields. The search key might be (last name, first name), say. If you had a JSON file, by the way, and it was nested, you could still define a search key; it would just be a path in the JSON file. There's nothing special about it being a flat relational thing, but relational is a simple way to think about it, so we'll stick with that for now. Sorted files will support lookups based on the search key; we'll see how that works. Later in this lecture, and then more next time, we'll talk about clustered files, where an index sits on top and records with similar keys are collected together into blocks, but not necessarily kept fully sorted. Here's our cost model for the analysis. The assumptions have begun; these are the only things we're going to measure. B is the number of data blocks in the file, the size of the file in blocks. R is the number of records per block, which is, I think, a different notation than the book used earlier; R is records per block for today. D is the average time to read or write a disk block. We're just going to look at the average; we're not distinguishing seeks from scans. All I/Os are the same: they all take an average amount of time. It's a very coarse-grained measurement. We're not taking into account the proximity of blocks to each other, and in practice this matters a lot, but for today we're going to ignore it; we'll set it aside temporarily. We're also going to do average-case analysis for uniform random workloads. We're going to assume that pages are accessed at random: if we're asking for individual pages, they will be completely random individual pages, every request independent of the previous ones. Totally random accesses to pages, and we ask what the average performance is on that totally random workload. That is, admittedly, also a crazy assumption, but we're just going to make it for now.
We're ignoring sequential versus random I/O, we're ignoring prefetching optimizations, and we're ignoring any in-memory costs, so CPUs are infinitely fast and free in this assessment. But this is still going to be good enough to give us a sense of things, and you can definitely come back and revisit all of this later. Some more assumptions. When we talk about our workloads, what are we going to assume? For insertions and deletions, let's focus on the case where we're inserting or deleting just a single record. In lots of cases you'll insert tons of stuff, but not today. When we talk about equality lookups, we're going to assume that there's exactly one record that matches, which isn't necessarily true. For example, if I look things up by last name, I might well find lots of people with the same last name. But we're going to assume, abstractly, that when we do a lookup, we get exactly one thing back. Not always true, but we'll assume it for this analysis. For heap files, we're going to assume an organization where inserts always append to the end of the file. That's it: the insert always appends to the end of the file. Keep that in mind; we're not going to fish around trying to find space for things, we just go to the end of the file. For sorted files, and this is ridiculous, but we're going to do it today because it keeps us focused: if you delete something from a sorted file, you have to compact the file. No one would actually build that; it's too expensive, and we'll see that when we do our analysis. We'll also assume in the sorted-file analysis that when we do lookups, we do them on the key the file is sorted by. For instance, if we're doing lookups on (last name, first name), then we'll assume the file is sorted by (last name, first name); otherwise, it's not really sorted in any interesting way. Obviously, you'll want to question all these assumptions, both about the metrics and the workload, and rethink this. When you're studying with your friends for the midterm, this is where you can start to question the assumptions, because it'll exercise your understanding of what you're doing. If you can change the assumptions and still write down the formulas, then you probably know what's going on. Okay, so I wrote the cost of operations with the variables redefined at the top, so you don't have to keep them in your head. We're going to have this grid that we'll fill in for the next little while: the costs of the operations on the y-axis against the different file organizations on the x-axis. And why don't we start pretty simply, with the heap file. We're going to scan all the records in this heap file. The file has B pages, it's got R records per page, and it takes D time to read or write a disk page. What's the equation for how long it takes to scan all the records in a heap file? Well, how many pages will we look at if we scan all of the records? We will look at B, as in boy. And how much does each page access cost? D. So the total cost of scanning all records in a heap file is B times D. What about the sorted file? How long does it take to go through all the records in a sorted file? B times D again. And a clustered file we haven't really figured out yet, I haven't shown you what it looks like, so let's not worry about that column yet. But those first two entries should be really easy: BD and BD. What I want to do is just these first two columns. So let's think about equality search.
In a heap file, I'm looking for, I don't know, a particular ID. Let's say the lookup key is ID, and I'm looking for ID number 4752. How many records do I have to go through, on average, to find that particular ID? What fraction of the records? 50%. If I'm looking up random IDs in an unsorted file, the match will be in a random place, but on average I'll look at half of the records. So how long does it take to look at half of the records? BD over 2: half of a full scan. What might change if I said there were duplicates? Suppose it's not a unique key, and there may be more than one record with that value in the file. Then how much do I have to look at? Everything, because I don't know when I'm done. I found one person with that ID, but maybe there are more. You keep saying "maybe there are more" until you've looked at everything. So BD over 2 only makes sense because of our assumption that there's exactly one matching record. This, by the way, dumb as it seems, turns out to be the kernel of a kind of important idea: knowing what you know, knowing what you don't know, and having to pay for not knowing things. A lot of the time, the reason you have an expense in an algorithm or a protocol is that you have to pay the cost of determining something you don't know by default and can't be sure about without asking. A classic example from distributed systems is knowing whether a machine is up or down. There's no way to know whether someone actually heard a question you asked them until they write back to you, and that may not happen for a very long time. There's actually a theorem that says you can't distinguish delay from failure. It takes some work. So in this case, anyway, getting back to the point: if we didn't have that exactly-one assumption, we'd have to scan everything. For the sorted file now, how long should equality search take? What technique would we use to find it efficiently? Binary search seems like a good plan. We have a sorted list of blocks, and in each block we have a sorted list of records. So we're going to do a binary search on the blocks to find the record we want. On expectation, how much time is that going to take? Log base 2 of what? Of B. That's the number of I/Os. So log base 2 of B, times D, the cost of a block access. Very good. All right, we're in business. This is good. We've finally seen the benefit of sorting our file: there it is, that log factor. Very nice. Let's go down to range search. I want to find all people whose names begin with a W, so that's all names between W and X, more or less. In a heap file, how many blocks do I look at? Everything, so it's BD again. In a sorted file, the strategy is binary search to the beginning of the range, right? And then we scan forward to the end of the range. Agreed? So the cost of the binary search is log base 2 of B, times D, and the cost of the scan depends on how many records match. I don't know what that is; it depends. So we'll just say: however many pages match, times D. We'll leave that as a variable. Good? We're still getting our benefits of sorting. Range search is a thing you can do with sorted files; that's nice. All right, insertion. What's the cost of inserting into a heap file? We just stick it at the end; suppose you have a pointer to the end. Two. You have to read the last page and write it back. Okay?
Oh, yeah. Two I/Os. Eventually. Now, typically, as we'll learn when we do concurrency control and so on, you don't write it right away: you leave it in the buffer pool, and eventually it gets written out when the buffer manager replaces it. But over time, the cost is two I/Os, one to read the last page and one to write it. Okay. Oh, question before I go on. Why do you have to read it? Well, because for an insert, we're going to put that tuple inside a disk block's binary image. I don't know what else is on that page, so I can't manufacture the page out of nothing. I need to read it, see the other tuples there, copy my tuple in, and then take that binary image and write it back down. It's a great question, because you might think you could just write the tuple, but you can't write things in units of tuples; you have to write things in units of pages. Thanks for asking. Great. Deletion in the heap file: we're going to locate the item, erase it from the in-memory copy of the page, and write the page back. How much? Alas, we have to find it first. So it's the cost of an equality search plus one write: BD over 2 to find the thing you want to delete, assuming it's unique (if it's not unique, the search costs BD), plus D for the write. Agreed? That's deletion. What about deletion in the sorted file? Now, remember, I said we have to keep the darn thing compacted. Let's draw a picture; I like pictures. We're going to find the record by binary search. We found it. Then we delete it, and here's the page it's on. Imagine this is the file, and here are all the other pages. What do we have to do after we delete it? Oh, man, we have to move everything after it over by one tuple. All right, how many blocks are we going to read? For a random tuple, on average, we'll read half the file, which is B over 2 pages. And for each one of those pages, we read it in, slosh everything one slot to the left, and eventually write it back out. Agreed? Now, the sloshing might go across page boundaries, so you actually have to have two pages in memory at a time; that's fine. But each page gets read once and written once, so it's 2 times B over 2, which is B page I/Os. Total: the cost of the lookup, which is log base 2 of B times D, plus B times D for reading and writing half the file. Good. And then, did I skip the insert? I'm so sorry. Insert has the same problem for sorted files as delete: on insert, we also have to move everything over by one, to the right this time, and it costs the same. So: deletion in a heap file, we find it, then we write the page. In a sorted file, we find it, then we slosh everything to the left; that's this box right here. And insertion, which is the box one above it, which I will not be able to reach by jumping, that one, we slosh everything to the right, same cost. Cool. These are pointers, and no fun. It's much more fun to throw things and jump around. All right. Good. So be it.
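To keep the whole grid in one place, here's a small sketch that evaluates today's formulas. B and D are as defined above, `match_pages` stands in for however many pages a range search matches, and everything inherits today's assumptions, including single-record equality lookups and compacting sorted files:

```python
from math import log2

def costs(B: float, D: float, match_pages: float = 1.0) -> dict:
    """I/O time estimates for heap vs. sorted files, per today's model."""
    lookup = log2(B) * D  # binary search over B blocks
    return {
        "scan all": {"heap": B * D,          "sorted": B * D},
        "equality": {"heap": B * D / 2,      "sorted": lookup},
        "range":    {"heap": B * D,          "sorted": lookup + match_pages * D},
        "insert":   {"heap": 2 * D,          # read the last page, write it back
                     "sorted": lookup + B * D},  # shift half: B/2 reads, B/2 writes
        "delete":   {"heap": B * D / 2 + D,  # equality search, then one write
                     "sorted": lookup + B * D},
    }

# For example, a 1000-page file at 10 ms per I/O:
for op, cost in costs(B=1000, D=0.010).items():
    print(f"{op:9s} heap={cost['heap']:8.2f}s sorted={cost['sorted']:8.2f}s")
```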
So now let's talk about clustered files. Did you have any questions before I go on? Yes? Yes. Let's do this carefully. Good. So we're back in the sorted file, and let's do deletion again, so I'm consistent. All right, so what's going to happen? Here's page number B over 2, or B over 2 plus 1, depending on whether we index from one. It's a particular page, roughly halfway through the file. On this page, if we zoom in, there's some number of tuples. Let's assume they're fixed-width, so they're stored compactly. And we're going to delete this one. All right, what does that mean? It means that in RAM, we're going to scoot all the tuples after it over by one, and we have to borrow a tuple from the next page to the right to fill the last slot here. So we do a single I/O to get the next page, put it in memory right next to this one, take its first tuple, move it into our last slot, and move all of its tuples over by one. And now that page has to borrow from the next page. So we find ourselves scooting along one page at a time, borrowing, moving things left. And, oh, by the way, after we change these pages, they all have to get written back to disk. So we read, on average, B over 2 pages, because on average the deleted record is halfway through the file, and then we write them all back again. You're welcome. Okay, clustered files. Well, let's talk about indexes generally first. Clustered files are going to be based on indexes, and this is a very high-level view of indexes; next time we'll take a somewhat more detailed view and hopefully talk about B+ trees. But today, let's just get the concept down, because there are actually a bunch of design decisions you can make about your index that are independent of which data structure you use. That's what we're going to talk about right now. What is an index? It's a data structure, in our case on disk. So it's a disk-based data structure, or a paginated data structure, if you like: it's broken into pages. And it allows record retrieval by value, in one or more fields. For example, you might have an index that lets you answer questions like: find all students in the CS department. That would be an index on the major column of the table. You might say: find all students with a GPA greater than 3. That would be an index on the GPA column, and note that it's a range query, not just an equality query. You might want to find students with first name Bob and last name Knob. That's a query on two columns, and the index would then have to be defined on both of those columns, first name and last name. So you can do that; you can have indexes that span multiple columns. We'll see more examples of this. So an index is a disk-based data structure for fast lookup by value. We're going to use the term search key for the subset of columns in the relation that the index allows queries on. In the third example, the search key was (first name, last name). In the second example, the search key was GPA. The search key need not be what's called a key of the relation. That's a separate term that we're going to spend a lot of time with later in the semester: the key of a relation is its unique identifier. So some relations will have a column called ID, and we'll promise that it's unique, and that would be the key. Some relations might have a different key. For example, maybe you don't allow two people with the same name into your class, and so (first name, last name) can be a key, and the database would enforce that: it will not allow you to insert two people with the same (first name, last name) pair.
So the key can be any subset of columns that enforces uniqueness. But you can search on things that aren't keys. You can search on GPA, even though two people can have the same GPA; that's fine. So don't confuse the notion of a search key with this other notion, the key of a table. What is an index, then? Really, it's a data structure that supports efficient lookup by search key, and it contains a collection of things we'll call data entries. The canonical representation of a data entry is a key value k, plus some set of items that match that key value. We'll have various versions of what the k part looks like, and some variation on how big that set of items can be, whether it's exactly one item or can be many. We'll see this in a minute; these data entries come in various forms. The first question you should ask when somebody tries to sell you an index, when they describe an index data structure to you, is: what kinds of lookups do you support? What kinds of selections? What they always support is some selection of the form "key operator constant." But what kinds of operators are allowed, and what types of constants and keys? What data types? A typical index will support equality selections on some data domain, or maybe on any data domain; that's kind of akin to hashing, you can do equality lookups. Many indexes will support one-dimensional range selections, and in that case the operator could be less-than, greater-than, or BETWEEN, which is a range. Certain regular expressions translate to ranges too, like "starts with a W." But there are also indexes for more exotic selections. Most typically, you'll see indexes for multi-dimensional spaces; maps, for example, are very often two-dimensional indexes. So you can support queries like a two-dimensional range: anything east of Berkeley and west of Truckee, north of Fresno and south of Eureka. That defines a rectangle aligned with the coordinates of the map, a range query in two dimensions, x and y. Another type of query you might want, and that certain indexes support, is two dimensions with a radius: find all things within two miles of Soda Hall. That's a circle query in 2D. Some indexes support ranking efficiently. They'll say: find the 10 restaurants closest to Berkeley. That isn't a specific circle; it determines a circle, but in dense parts of the map the circle will be small, and if you go to, like, Boise, the circle might have to grow before it captures 10 good restaurants. No knock on Boise; I'm sure they have wonderful restaurants. Hello, internet. The point is, that's a query where the data determines how much space you cover in 2D, not a fixed-size region of the embedding domain. You can imagine indexes to support regular expressions; you can imagine indexes to support genome alignments; you can imagine all kinds of indexes, right? Different data structures have been invented for these different kinds of problems, and they give different bounds on performance. This used to be a really popular topic to do research on; I did a bunch of research on it earlier in my career. A little less popular now, a lot of it's been picked over. One common multi-dimensional index that you see in databases is called an R-tree. It's a sort of multi-dimensional version of what we'll learn next, the B+ tree.
And there's a data structure we invented in my group called the generalized search tree, which you can read about on Wikipedia; look up GiST if you're curious. It captures all these variations in a single software framework. I wanted you to know about this stuff because if you go work at Google Maps or something, they'll have kinds of indexes that we won't cover in this class; but structurally, it's all kind of the same, which is why we're able to generalize it. Okay. So let's go through, independent of what kinds of selections you support, what some typical indexes look like. How are they structured? Here are some of the decisions you can make when you define an index. First: what representation of data entries, the things the index looks up, will we have? There are three alternatives here that we'll go through. So what does the index actually store? People tend to get a little confused about this, but it's pretty straightforward. An index can store actual records from the database, but it could also store things like pointers to records of the database, or pieces of records. So we'll get choices like that. Second: indexes can be what's called clustered or unclustered, which affects performance substantially, in terms of sequential versus random I/Os, and also in terms of how many useful things you get in a single I/O. Third: there's single-key versus composite-key indexes, which we've already talked about a little bit; that's simple. And then fourth: what is the actual data structure being used? Is it a tree? Is it a hash index? And so on. In this class, I'm only going to teach you one: B+ trees, which are the workhorse. There are neat techniques for building hash indexes on disk, something called linear hashing; it's just that it doesn't really get used that much, and time is short. But I would encourage you to read about linear hashing if you're curious about hash indexes. It's in the book, and it's really a very nice data structure, a fun hash-based index. We won't learn it in this class, though. And again, read about generalized search trees if you want to learn about more exotic stuff. All right. Here are the three alternatives for the data entries, the entries in the index. For a data entry k, I'm going to put a little star by it when we talk about it in our pictures; the star says it could be any one of these three things. Alternative one: the data entry is the actual data record. So maybe the data from your table is really stored in the index itself. It's not stored anywhere else, it's not in a heap file; the index is where it lives. It's the primary copy. That's fine. We'll call that alternative one. And sadly, the book refers to these as alternatives one, two, and three; those are their names. Sorry about that. We'll try not to make you remember the numbering on exams, but it may come up in conversation. Alternative two is a (key, record ID) pair. So it's a key and a pointer: the tuple is actually stored somewhere else, maybe in a heap file, maybe in another index, and what we've got is a copy of the key, so we can look it up, and a pointer to where the actual record is. And the third option, which may be more compact than the second if you have lots of duplicates, is a key and a set, a list, of record IDs of the records that match it.
So suppose, for some reason, you wanted to build an index on the students table on major. There are lots of computer science students. We could have the key in the index just once, CS, followed by the full list of matching students. There are some implementation trade-offs between alternatives two and three that maybe aren't that interesting, but you should know about them. The key difference is between alternative one and alternatives two and three: in two and three, there's a level of indirection, with pointers to the actual data records, which live elsewhere. Now, this choice of alternative has nothing to do with the data structure you use. All of these will work with B-trees or hashing or R-trees or GiST; you can choose whichever alternative you want. The other thing to note is that it's typical, not just possible but typical, to have more than one index on a table. The file might be sorted by age in a sorted file, and you might also have a hash index on salary; that's perfectly reasonable. You can have as many indexes as you like. They will cost you in storage, and they will cost you on inserts and deletes, because every time you change a tuple, you potentially have to update all the indexes. But they will make lookups on all those different search keys go faster. So you can do that. Let's go through these alternatives a little more. Alternative one: we store the actual data records, so the index is actually the file organization for the records. This is what that clustered file I was talking about was: if you have an alternative-one index, it is a storage organization for the relation. All the tuples of the relation are stored in it, so we can compare it directly to heap files and sorted files. Now, obviously... well, not obviously: we're only going to allow you one alternative-one index per relation, because we're only going to store each full tuple once in the database. That's not strictly required, actually. You're perfectly free to keep a copy of your table and store it somewhere else too. But then you have to do replica consistency, keeping both copies in sync on updates, and it's kind of annoying. So we'll just assume there's one true copy of each record in the database, which means there can be only one alternative-one index per table. So you've got to choose the alternative-one index you like best. It should probably be on the search key that gets the most access in queries. If you look things up by last name a lot, maybe you store the table in an alternative-one index on last name. Okay. And the benefit is that when you do a lookup in the index, you actually get the tuple. You don't get a pointer to the tuple that you then have to do another I/O to follow. So lookups are a little more efficient in an alternative-one index than in an alternative two or three index. As for alternatives two and three: we actually need these to be able to have more than one index on a table. If the one alternative-one index is the primary ordering, then any secondary orderings have to live somewhere else and point into the primary copy. Suppose, for instance, that we store our tuples in a heap file; then we can have alternative two and three indexes pointing to the tuples in that heap file. All right.
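As a sketch of what those three data-entry shapes look like, with record IDs as (page, slot) pairs per the earlier discussion (the type names here are just for illustration):

```python
from dataclasses import dataclass
from typing import NamedTuple

class RecordId(NamedTuple):
    page: int  # which page of the file
    slot: int  # which slot on that page

@dataclass
class Alt1Entry:
    """Alternative 1: the data entry is the full record itself."""
    key: str
    record: tuple            # the primary copy lives in the index

@dataclass
class Alt2Entry:
    """Alternative 2: (key, rid), one entry per matching record."""
    key: str
    rid: RecordId

@dataclass
class Alt3Entry:
    """Alternative 3: (key, [rids]), the key stored exactly once."""
    key: str
    rids: list[RecordId]

# The majors example from above: "CS" repeats per student under
# alternative 2, but appears once with a (possibly long) list under 3.
a2 = [Alt2Entry("CS", RecordId(7, 2)), Alt2Entry("CS", RecordId(9, 5))]
a3 = Alt3Entry("CS", [RecordId(7, 2), RecordId(9, 5)])
```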
Alternative three takes up a little less space than alternative two, because you store each key exactly once. In alternative two, you store each key as many times as there are records with that key. Right? So if the last name Woo is very popular, you'd have it copied over and over in alternative two, whereas in alternative three, you'd have Woo comma the list of record IDs associated with it. All right. On the flip side, if you have a really popular value... so let's say we built an index on keywords in web pages, something, I don't know, Google might want to do. In this index, you might have a really popular word like Katie. And if Katie is a really popular word, then the list of things that match Katie might be really, really, really long, and it might not even fit on a disk page. So you have to think about how you're going to store that big overflow list of matches. If you do alternative two, you don't have to think about that, but you get a really wide index: Katie, Katie, Katie, Katie, Katie, with a little record ID after each. All right. So it's a design decision.
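To put rough numbers on that space trade-off, here's a back-of-the-envelope calculation. The sizes and counts are invented purely for illustration:

```python
# Space for one very popular key, say "Katie", under each alternative.
key_bytes, rid_bytes = 20, 8   # assumed sizes, purely illustrative
matches = 1_000_000            # tuples containing this key

alt2 = matches * (key_bytes + rid_bytes)  # key repeated per match
alt3 = key_bytes + matches * rid_bytes    # key once, then the rid list

print(alt2)  # 28000000 bytes: a really wide index
print(alt3)  # 8000020 bytes: compact, but the rid list spans many pages
```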
Okay. Clustered versus unclustered indexes. This is pretty important, and there are some pictures on the next slide to help reinforce it, but here's the idea. A clustered index is for when your data is stored in some sorted order, maybe in another index, maybe in a sorted file. Or maybe it's just sorted by arrival time. So maybe data comes in and you're always appending it to the end of a heap file, and I want to build an index on timestamp. Well, that index is going to have its entries in almost exactly the order of the data that got inserted into that heap file. Even though the heap file doesn't promise you that it's ordered by time, it's probably ordered by time. So a clustered index is an index whose search key order is the same order the data is probably in. Okay. And I'll show you a picture of this in a minute. Actually, maybe it's worth just showing the picture and then going back to the bullets on the slide.

All right. Here's a picture on the left of a clustered index and a picture on the right of an unclustered index. The index itself is some kind of tree; let's not worry about how it works. There's some kind of, you know, multi-way search triangle that you walk down until you get to the bottom of the tree. And at the bottom of the tree are the data entries, the things that the index holds. This is alternative two or alternative three, so there are pointers out of the data entries to the actual data. In a clustered index, the actual data is more or less in the same order as the leaves of the tree. So this might be, for example, the order that things were inserted into the file, and the index entries are truly sorted by timestamp. So you can always do lookups by timestamp, and probably these things are in order by timestamp too. Every once in a while, maybe something had a weird timestamp on it, or had the time zone wrong, so it might be out of order. Or there's a high value in the middle of one of these blocks. But mostly this is sorted. Which means that if you stare at the bottom of this tree, a range search from, like, the beginning of time through some point will mostly access sequential blocks in the associated file. So when you scan the bottom of the tree, you scan the other data structure mostly in order too. A couple of benefits come from that.

One is that every time you fetch one of these blocks, you get a whole bunch of goodness out of it, right? Your buffer manager fetches the page to find the first tuple, and then when you ask for the second tuple, it's going to be in cache. And the third tuple, still in cache. You get a whole page full of goodness out of one page IO. And because of the index ordering, it's also very likely these blocks are sequential on disk, so those IOs will not be random, they'll be sequential. That's the benefit of the clustered ordering. Contrast that with the unclustered index, where you have spaghetti pointers between the index, which is ordered one way, and the file, which is ordered some completely different, uncorrelated way. Here, every IO potentially goes to a different page. In this tiny little picture, we covered almost all the pages with just four pointers per index page. But in a really big file, every tuple you look up in the index points to a different data page in your heap file. And that's a misery, right? Now you're doing a full IO to get one tuple. This could be as bad as the number of tuples per page times the cost of the clustered case in raw IOs, because the clustered side is getting, what do we call it, R records per IO, while the unclustered side is getting one record per IO. So unclustered indexes are really lousy in that sense. You get bad payoff on your page accesses, and, by the way, if you follow the pointers in order, you jump around a lot, so you also do a lot of random IO. So clustered versus unclustered pretty much determines whether your index is efficient or not.

But remember, this data file at the bottom, maybe a heap file, maybe another index, can't be sorted more than one way. So whatever order it's in, there's probably only one index that's clustered the same way. Some of your indexes will be unclustered, and that's just too bad for you if you do range searches. Some of your indexes will be clustered, and that's great. So, to go back and talk this through: clearly the file can only be clustered one way, and the cost of retrieving data records through the index varies greatly depending on whether it's clustered or not. Obviously, if you're alternative one, that is, the records are actually stored in the index, you're clustered: the leaves of the index are the data, so they're sorted the same way. The other way around is not true. A clustered index does not have to be alternative one. I just showed you a picture of a clustered index on the left-hand side. It's alternative two. It's pointing to things, and those things are sorted the same way. All right. And don't confuse this with the machine-learning sense of the word. If you've learned a clustering algorithm like k-means, that's a multi-dimensional proximity kind of deal. It's totally different from this clustering. This is unidimensional, mostly sorted. If I had to rename it these days, I'd call it a mostly-sorted index or something, but it's called clustered. It has nothing to do with clustering algorithms. Yeah. Okay.
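Here's a rough cost model of that clustered-versus-unclustered difference for a range search, under the usual simplifying assumptions (uniform pages, cold cache); the function name and parameters are mine:

```python
def range_search_heap_ios(matching_tuples, tuples_per_page, clustered):
    """Approximate heap-file IOs to fetch all tuples in a range.

    Clustered: matches sit on consecutive pages, so roughly one IO
    per page of matches. Unclustered: in the worst case every data
    entry points to a different page, so roughly one IO per tuple.
    """
    if clustered:
        return -(-matching_tuples // tuples_per_page)  # ceiling division
    return matching_tuples

R = 100  # tuples per page
print(range_search_heap_ios(10_000, R, clustered=True))   # ~100 IOs
print(range_search_heap_ios(10_000, R, clustered=False))  # ~10000 IOs
```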
Here's the way this usually works. Suppose you want a table with a clustered index, say the student table clustered by last name. We're going to sort the heap file the first time, so it's nice and ordered, and there's an index on top of it, and it's nice and ordered too. As time goes on, we're going to insert things into this heap file, and they may have to go at the end, okay? Or maybe they plug some holes in earlier pages where there's free space. But whatever happens, things come in and get put somewhere random, because it's a heap file, or they get put at the end. So these pointers will start to get weird over time.

Suppose this is the leaf level of the index, so it's the data entries, and here's your data file, and everything was all pretty when you started. And suppose this is on a numeric field, so the pages are like 1 through 4, then 5 through 9, then 10 through 21, then 22 through 100. And then we go along and we insert 3. Well, where's 3 going to go? It's going to go at the end of the heap file. So all of a sudden, after the insert of 3, there'll be a pointer on this index page that crisscrosses everybody, and it'll be a random IO when you search for 1 through 4. And the more inserts you do, the more of these weird things start pointing to the end from all over the place, because the end is just random ordering now. So periodically we'll say, gee, my clustered file isn't working very well, let me re-sort it. We'll go ahead and take this thing and re-sort it, and then it'll be clustered again. So it's a very lazy, very manual way to keep an association between an index and a loose ordering in a heap file. It's not supposed to be a big idea; it just took a while to say for some reason. All right.

Clustered files are good because range searches go fast. Let's be very clear about that. The only reason you need them is because you're fetching a bunch of things that are co-located at the bottom of the tree, and therefore you want them co-located in your heap file. If you're only doing single-point lookups, it doesn't matter whether it's clustered or not. And there are two benefits. One is that you get a page full of good stuff, and the other is locality: the pages in the heap file, sorry, not the leaf pages of the index, are probably close to each other on the disk. Right? So you don't do seeks, you do scans. And that can allow you to do better disk scheduling and prefetching and all kinds of good stuff. Also, if you know that things are sorted, you can do certain kinds of compression. We'll talk about that later on, not now. Cons of clustering: well, obviously, you have to maintain it. So periodically you have to go around and re-sort the thing if you want it to stay clustered. You can do it on the fly, or maybe you have a background reorganization process. Generally speaking, you leave some space in the heap file to soak up inserts, and on average, with a typical workload of inserts and deletes, you'll find that heap files are about two thirds full. This is something that's been studied in the literature; it's one of those statistical facts. I can't remember the reasoning for it. It's a wet-finger assumption that your heap file is about two thirds packed, one third free space. All right.

With all that, we can now talk about the last column of our table, the clustered file column. Okay. I'm filling it in all at once, so we'll just have to figure it out out loud. Why is it 1.5 BD to scan all the records in this clustered file? Ignoring the index, just scanning the file. Where does that extra 50% come from? This is one of those have-you-been-listening-carefully questions. What did I just say one slide ago about how full this thing is? Each page is two thirds full on average, right? Which means the whole file is about 50% bigger than a packed file. That's why it takes 1.5 BD to scan it: each page is only two thirds full. All right. And that's why the equality search is log base F of 1.5 B, plus one. What's F? Any guesses? We didn't really see F yet; it was hidden inside of a triangle. What triangle did we have on the screen before? It was our index. Our index, if it's a tree-shaped index, is going to have some fan-out. It'll be a balanced tree with fan-out F, and walking down that tree takes log base F many levels. That's where the F comes from: the fan-out of the nodes in our search tree. We'll learn about this in detail next time. So we're going to walk down that tree, and then we're going to follow the pointer, alternative two, to the data record. And that's our log base F of 1.5 B, plus 1. Range search, similarly: we walk down that tree, get to the heap file, and then scan. Okay. Insert is actually going to be nice, because we have free space. So on average, you find the page, you insert the thing, and you write it back. Read, write. That's the plus 2. Same with delete. For deletion, we didn't say anything about compaction; you don't have to compact on delete. So if we have a nice workload of inserts and deletes and things don't tend to overflow, this will all work out just fine. All right.
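Putting those clustered-file costs into code, with B as the number of pages the data would fill if packed, D the time per IO, and F the tree fan-out. This is a sketch of the cost model exactly as stated on the slide, not of any real system; the function name and the ceiling on the tree height are my choices:

```python
from math import ceil, log

def clustered_file_costs(B, D, F, match_pages=1):
    """IO cost estimates for a clustered file at ~2/3 occupancy,
    so the data spreads over about 1.5 * B pages."""
    pages = 1.5 * B
    height = ceil(log(pages, F))  # levels walked down the tree
    return {
        "scan": pages * D,
        "equality_search": (height + 1) * D,         # tree walk + 1 heap IO
        "range_search": (height + match_pages) * D,  # tree walk + scan
        "insert": (height + 2) * D,                  # find page, read + write
        "delete": (height + 2) * D,                  # same, no compaction
    }

print(clustered_file_costs(B=100, D=1, F=100))
```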
Okay. One last detail, I think, and then we're going to wrap up. Let me just double check that this is the end. This is the end. That's good. All right. Composite search keys, so there's no confusion. Suppose you want to build an index on more than one field, like last name comma first name. You can do that. So here's an example of an index on age comma salary. What can you do with an index on age, salary? Well, you can do an equality query. You can say age equals 20. You can say age equals 20 and salary equals 75. Can you ask for just salary equals 75? The answer is no. And this is because this thing is going to be lexicographically ordered at the bottom of the index. Asking for salary equals 75 is kind of like going to the dictionary and saying, find me all words where the second letter is E. Where are all the words where the second letter is E? Well, there are some in the A's and some in the B's and some in the C's and some in the D's, right? What a bummer. Same deal here. If it's sorted by age, and then within each age it's sorted by salary, the people with salary equals 75 are scattered across every single age. Make sense? So you can only do lookups on a prefix of the fields in the search key, right? Age, then age and salary. If the search key were age, salary, and GPA, you could do queries on age, on age and salary, or on age and salary and GPA.

All right. Range queries: you can do a range query on the first field, age greater than 20, or you can do an equality query on the first field and a range query on the second field. I can say I want all words that start with W where the second letter is bigger than C. That's just a scan through part of the dictionary, right? But I can't usefully do it the other way around. Well, I could, but it wouldn't help me much. Once I say I want all words that start with a letter greater than C, it doesn't really matter that I also want the second letter to be W; we're going to scan through the dictionary from C onwards anyway. Right? So after a range predicate on a prefix, you can't really do anything interesting with the rest of the keys. So you can do age greater than 20, or age equals 20 and salary greater than 10. Just keep this dictionary analogy in mind. It will guide you as to when an index can be used for which query. It's all about lexicographic ordering.
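Here's a tiny sketch of that prefix rule, just to state it precisely. The function and its calling convention are mine, not from any real query planner:

```python
def index_supports(search_key, equality_cols, range_col=None):
    """A composite index sorted lexicographically on search_key can
    serve a query iff the equality columns form a prefix of the
    search key, optionally followed by one range column."""
    needed = list(equality_cols) + ([range_col] if range_col else [])
    return needed == list(search_key[:len(needed)])

key = ("age", "salary")
print(index_supports(key, ["age"]))            # True:  age = 20
print(index_supports(key, ["age", "salary"]))  # True:  age = 20 AND salary = 75
print(index_supports(key, ["age"], "salary"))  # True:  age = 20 AND salary > 10
print(index_supports(key, [], "age"))          # True:  age > 20
print(index_supports(key, ["salary"]))         # False: like "second letter is E"
```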
So, just to make this painfully obvious, here's a name, age, salary table sorted by name. In the middle is the table sorted by name: Bob, Cal, Joe, Sue. On one side is an index on age comma salary. You can see it's not clustered: the pointers cross each other, because the file is sorted by name, not by age comma salary. Then here's one on salary comma age, here's one on salary, and here's one on age. I encourage you, in your spare time, for your entertainment, to go through the examples on the left and figure out which of these indexes is good for what. It'll give you a sense of the benefits of these indexes, in terms of which index supports which query.

That's it for today. So, to summarize the file layer: it manages access to records and pages. There are these different file organizations, and you want to know their costs; be able to recreate that table we went through. For selection queries you can use sorting, so you could keep a sorted file, but sorted files are expensive to maintain. So typically we'll have an index instead, like a tree-based index. An index is really two things: the data entries, and a way to walk down a tree to get to them. We'll see this next time. All right, good luck with your homework. I hope it's a productive weekend. I will see you on Tuesday.