All right. Hey everybody. Happy Tuesday. It is Tuesday, right? These weeks go by fast; sometimes I get confused. So as you will recall, we are in the middle of query processing. You have a homework on query processing, and we're about to wrap it up. But some questions have come up about hashing that I wanted to pause and clarify. And then the other thing was that poll on Piazza about the course speed, where a good fraction of you said things were going pretty fast. So I thought, okay, let's slow down a little bit, particularly because you're implementing this for your homework. We'll spend a little time clarifying hashing, and in particular recursive partitioning, which went by pretty quickly. Okay. So let's review the first pass of out-of-core hashing, external hashing. Data streams in from the left. Maybe it's coming from a file scan; maybe it's coming from a query operator below us. Either way, data is streaming in, and we're going to fill an I/O buffer with the rows that come in. If they're coming off the disk, we take a disk block full of tuples and put it right into memory. If they're streaming from another operator, we fill the buffer up as we go. So we'll have one buffer in memory, the size of a disk block. Remember, a disk block is something like 64K; it's the unit of transfer from our disk. That one buffer is for the input flow. The rest of the memory we've allocated to this task, assuming there are B pages of RAM allocated to hashing, the B minus one remaining pages, are going to be output buffers. And remember the usual streaming thing: the input buffer fills up, and each row gets routed via a hash function to one of the output buffers. Whenever an output buffer is full, we write it to the disk, right?
And that makes sure that our I/Os to the disk are one full block at a time, right? These output buffers let us append to the disk partitions a full block at a time; every time we go to the disk I/O interface, we want to write a full block. That's why we need a buffer for each of the partitions in our hash partitioning. Okay. So we've taken our input stream, we've used the hash function h sub p, and we've partitioned it into these different partitions; h sub p tells us which partition each row goes to. Now, when we're done streaming everything through and writing it out one block at a time to these partitions, what have we got? We've got B minus one partitions on the disk that capture all the rows that came in. Every row that came in is in one of these partitions. And if two rows have the same key, we know they're in the same partition, because they hash to the same value. So that's good: we've divided the problem into subproblems. There's no way that two things that match could be in two different partitions, so each partition can now be handled independently of the others. That's the divide piece. And in pass two, we're going to conquer these independently. Good? So in pass two, here's the thing. We have B minus one of these partitions and B pages of RAM, and we'll deal with the partitions one at a time. How big is each partition, assuming the input is N pages worth of data? In the ideal case, each one is N divided by B minus one, right? We partitioned the N pages into B minus one pieces. But there may be some skew in your hash function, or in the duplicates on your hashing key; it could be that we're hashing on something with a lot of duplicates.
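Before moving on, the pass-one loop just described can be sketched in a few lines of Python. This is a toy sketch, not real disk I/O: a "page" is a Python list holding PAGE_SIZE tuples, the "disk" partitions are just lists of pages, and h_p stands in for the hash function h sub p using Python's built-in hash. All the names and constants here are mine, for illustration only.

```python
# Toy sketch of pass one of external hashing (not real disk I/O).
PAGE_SIZE = 2   # tuples per "page", kept tiny so the example is visible
B = 4           # pages of RAM: one input buffer + B-1 output buffers

def h_p(key):
    return hash(key)   # stand-in for the partitioning hash function

def pass_one(tuples, num_partitions=B - 1):
    out_buffers = [[] for _ in range(num_partitions)]   # B-1 in-memory pages
    partitions = [[] for _ in range(num_partitions)]    # pages appended to "disk"
    for t in tuples:
        i = h_p(t) % num_partitions          # which partition this tuple goes to
        out_buffers[i].append(t)
        if len(out_buffers[i]) == PAGE_SIZE: # buffer full: write one full block
            partitions[i].append(out_buffers[i])
            out_buffers[i] = []
    for i, buf in enumerate(out_buffers):    # flush any partly full buffers
        if buf:
            partitions[i].append(buf)
    return partitions

parts = pass_one(["woo", "woo", "kim", "lee", "park", "kim"])
# Matching keys always land in the same partition: every copy of "woo"
# ends up in the single partition chosen by h_p("woo").
```

The key invariant the lecture relies on is visible here: two tuples with the same key get the same partition index, so no pair of matches can be split across partitions.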
So some of these partitions may be bigger than others. There are multiple reasons it could happen at the end of pass one that one of these things is bigger than B pages: either skew, or maybe the input was just really big, much bigger than B. Both can happen. One partition can be big because of skew, or all the partitions can be big because you just had a lot of data. But it's divide and conquer, so we treat each partition independently. For the ones that are smaller than B, we go to pass two. So take this first one, and let's assume it's less than B pages big. What do we do in pass two? We read it, all less than B pages of it, into memory, and we build a big main-memory hash table with a finer-grained hash function on the hash key. That's the hash table we needed in order to do the hashing. Then if we're doing a group-by aggregation, we aggregate in these hash buckets. If we're doing duplicate elimination, we eliminate duplicates in these hash buckets. If we're doing a join, we put the tuples into these hash buckets and then stream another table by later. So that's fine if the partition is smaller than B. What happens if the partition is bigger than B? It's not going to fit in a main-memory hash table. So what do we do? We recurse on this whole procedure for that one partition, the one I'll draw in red. We run external hashing on just that red partition, from the beginning, because it's too big. So we start over. We've got this red partition, it's more than B blocks big, and we pretend recursively that we're in pass one of partitioning on just this guy. It's a one-partition operation. Let's call it partition P. So we're now going to read P off of the disk.
So actually, why don't we do it in this direction, since it was over here. Here's partition P; it is greater than B big. We're going to read it into memory right to left this time, for no particular reason. Here's our input buffer; here are our output buffers, B minus one of them. And we use a different hash function. We can't use the one we used up there, because it sent everything into this one partition. So we use h sub p2, a variant hash function that will split the data up differently this time. We recursively split this guy into a bunch of little guys, each smaller than this guy, and hopefully those will all be less than B, and we can go to pass two on them recursively, and then we're done with this partition. If they're all less than B, we do pass two on these guys: we bring each one back into a pass-two hash table. This is pass one, recursive; this is pass two, recursive. We do our thing in the hash table, remove duplicates or compute aggregates or whatever, send it to the output, pop out of the recursion, and we are done with this partition. Finished. We don't ever do pass two again on the whole partition, because that doesn't mean anything. All we did, instead of doing pass two on this partition, was recurse and do the whole algorithm on it. And we go partition by partition; if one of them is bigger than B, we just recurse. Any questions? Yeah. Let me repeat the question so people can hear it; I don't know if I understood it. So this first hash function h sub p, it is modulo B minus one, so it's going to generate B minus one output partitions. It's some hash function of our choosing, but it's mod B minus one. Did that answer your question? Good question. So the question was, what happens to the output inside the recursion of the second phase on P? That is to say, here, this went a little fast.
So let me be a little more disciplined here. This is the recursive pass one. Let me just erase this. At the end of recursive pass one, we've got these little sub-partitions of partition P. And here's what we're going to do in recursive pass two. Let me redraw this: here are our little sub-partitions, just a copy of this picture. Each one of these, hopefully, is less than B; if it's not, this whole argument recurses again, but let's say it is. Now, this is pass two, recursive. You read a sub-partition into memory in a hash table. And then the question is, what do you do with this? Let's say we're doing duplicate elimination, to make it concrete: we're doing this hashing because we want to get rid of duplicates. You read one of these guys into memory. All the duplicate values will be together in this sub-partition, because they hash together. You remove all the duplicates, you get one value per bucket, and you send it to the output, the same way you would have sent it to the output in the non-recursive case. Plain old pass two also would have gone to the output. So the recursive pass two doesn't pop the stack and then do pass two again; it just sends things to the output. And when you pop the stack back out of the recursive part, there is no pass two up here, because this partition is done. Another way to say this: if you think about the data flow diagrams that we drew, we had a file scan feeding hash-based duplicate elimination. When I say "out," we're actually returning from a next call inside the recursion of this partitioning. We're returning tuples from this nested call to pass two, up to whatever is above us, say main. So main says get next, this operator needs to know that it's within the recursion of pass two on partition P sub one, and when main says get next, it'll actually get a tuple out of this hash table. Yes, I would think of it that way. That's right.
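Putting the two passes together, the whole divide-and-conquer, including the recursive case just walked through, can be sketched for duplicate elimination. Again this is a toy sketch with made-up names: a partition is a list of pages, "fits in memory" is approximated by counting pages against B, and the recursion level salts the hash so each level gets a variant hash function, playing the role of h sub p2. There is deliberately no guard here against a partition that cannot shrink; a real implementation needs one.

```python
# Sketch of full external hashing for duplicate elimination.
PAGE_SIZE = 2
B = 4

def partition(pages, level):
    # Pass one (possibly recursive): split all tuples into B-1 buckets.
    buckets = [[] for _ in range(B - 1)]
    for page in pages:
        for t in page:
            # The level acts as the hash "variant": each recursion level
            # splits the data differently, like h_p vs. h_p2.
            buckets[hash((level, t)) % (B - 1)].append(t)
    # Re-page each bucket so it looks like a partition on disk.
    return [[b[j:j + PAGE_SIZE] for j in range(0, len(b), PAGE_SIZE)]
            for b in buckets]

def dedup(pages, level=0, out=None):
    if out is None:
        out = []
    if len(pages) <= B:                  # pass two: partition fits in memory
        table = set()                    # the in-memory hash table
        for page in pages:
            for t in page:
                if t not in table:       # a second "woo" is dropped here
                    table.add(t)
                    out.append(t)        # unique values go straight to output
        return out
    for sub in partition(pages, level):  # too big: recurse on each sub-partition
        dedup(sub, level + 1, out)       # recursive pass two emits directly
    return out
```

Note that the recursive calls append straight to the shared output; popping out of the recursion does nothing further, exactly as described above.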
Yeah, pass one partitions the data into these independent tasks, and then pass two conquers them by doing duplicate elimination or what have you. If there are no duplicates, then the whole algorithm was a waste of time. If you knew there were no duplicates, you wouldn't have done this, but you don't know. So you can think of this whole algorithm as the search for duplicates, and the assurance that there aren't any. Yeah, exactly right. So when we recurse, what happens? This partition P is roughly one over B minus one of the original input, so roughly N over B minus one pages. That data is going to get read, written, and read again. The first read was going to happen anyway, because this was pass two. It's the extra write and that second read that are the added cost. Yes, is there a pass zero? Annoyingly, sorting has a pass zero, but hashing has no pass zero. That's just the way it's described in the book. Yes, yes. So for the case of duplicate elimination, think of it as a Python dict or a hash map in your favorite language. It's a hash table, okay? And the keys are going to be, let's make this super specific: the query we're running is select distinct last name from students. As one example, there may be two people with the last name Woo. I admitted two people to graduate school this month with the last name Woo, so that's a perfectly real example. This hash table is going to have an entry in it which says Woo, and for duplicate elimination it basically doesn't need anything else; it just needs the word Woo as the key. So it can just say: yes, I've seen one of those. And when I read the second Woo in, I don't do anything here. I just drop it on the floor.
At the end, when I'm taking things out of this hash table, I'll take out the name Woo exactly once. Okay? Does that make sense? Now if we were doing aggregation, if we were counting the Woos, then the first time I saw one, I'd insert it into the hash table with a count of one, and each subsequent time I'd increment. Let's do that one: select last name, count star from students, group by last name. For that algorithm, we would build a hash table with last name as the key and a running count as the value. Okay? Yeah. Great question. So suppose that instead of grouping by last name, we were grouping by gender. This is the classic example. And you have a really big input file. You're going to get a male bucket and a female bucket, and then when you try to recursively partition the male bucket, it'll all land in one partition again. It doesn't matter what hash function you pick: repartition it, and it's still one partition. And if you don't have any exception handling, this algorithm will just run forever, recursing and recursing and making no progress. So at some point you're supposed to detect that, say uh-oh, and switch to sorting or something. It's a very unusual case. Usually the query optimizer will know something about how many distinct values there are in a column and will choose the algorithm appropriately. But in the worst case, your implementation of hashing should at least detect this case, "I'm unable to subpartition this anymore," and then maybe call sorting on that partition. Okay? Yeah. Right. What if the partition is exactly of size B? In pass two, we can actually fit B pages here, so we're okay if it's B.
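The group-by count and the "can't subpartition anymore" check just described both come down to a few lines. This is a hedged sketch with made-up names: the dict plays the in-memory pass-two hash table, and the guard is one plausible way to detect the group-by-gender pathology, not the way any particular system does it.

```python
# Pass two for: SELECT last_name, COUNT(*) FROM students GROUP BY last_name,
# run on one partition that already fits in memory.
def count_group_by(last_names):
    counts = {}                                  # last_name -> running count
    for name in last_names:
        counts[name] = counts.get(name, 0) + 1   # insert with 1, then increment
    return counts

# Guard against the group-by-gender pathology: if repartitioning leaves some
# child as big as the parent, no hash function will help; the caller should
# fall back to sorting on that partition instead of recursing forever.
def made_progress(parent_pages, child_page_counts):
    return max(child_page_counts) < parent_pages
```

So count_group_by(["woo", "woo", "kim"]) gives Woo a count of two and Kim a count of one, and a recursion that calls made_progress before descending can never loop forever on a single-value partition.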
If we've got all the memory for this hash table, we can actually use the I/O buffers as the hash table fills up, and when we get to the end, we can play a little trick and use that last I/O buffer as part of the hash table. If it's B plus one, though, you're out of luck; you have to recurse. Yeah. And that's what the hybrid hashing thing was about: trying not to have a step function there. I don't want to review hybrid hashing, because it's not as important; this is quite important. We good? If you're still working through the details on this, because this has all been pictorial, I'd encourage you to write it in pseudocode to make sure you understand it, or even write it in Python or perhaps Scala. You're going to end up writing it for your homework. We're not going to make you do recursive partitioning, but you could do that as an exercise to make sure you really understand it. It can be awfully helpful sometimes to break it down into pseudocode or real code. And feel free to ask your TAs about this stuff, or come to office hours. Almost nobody came to my office hours today; like one person came. Anyway, I'm there for you. All right, so that's all we're going to lecture on for query processing. What we're going to do now is dive down the stack a little bit. So remember this block diagram, this architecture diagram, of a database system. It's a traditional relational database system, so it has a lot of componentry, some of which you don't need for other systems. And what's missing from this picture is distribution and parallelism, which you can do in a relational database; we've seen it in our query processing algorithms, but it's not in this picture. So just keep that in mind: if you want to go web scale, you have to think about things like replicas and parallelism.
We'll talk about that during the course of the semester also. But in this picture, that's the only thing missing; everything else we could think of is probably in here for a traditional DBMS. Where we've been is the relational operators. Where we're going now is to the bottom, to the disk subsystem and the buffers above it. Then we're going to connect them up with files and access methods, so that by the end of next week, I think, we'll have done everything but query optimization. We've done query execution and relational operators; we're now doing buffer management and disk space management, and we'll close the sandwich over the course of these two weeks. Okay, so a brief note on terminology. This is an irritation, and I wanted to make sure it was written down somewhere: the words block and page are used interchangeably in this class. Some people will tell you blocks are things on disk and pages are same-sized things in memory, but the book is not disciplined about that and neither am I. So rather than pretend I'm always going to get that right, I'm just going to say a block is a page is a block. It's a fixed-size unit of memory, and it is the unit of transfer in the I/O interface to the disk. Typically the book will say that's four kilobytes, because the book is kind of old; even when the book came out, it probably should have been eight. So the arithmetic in the formulas on the slides and in the book will assume block size equals 4K. Out in the real world, the numbers are more like 64 or 128K; that's a good unit of transfer to configure for your disk blocks. It's typically configured in your OS, or you can configure it in the database as a multiple of the OS transfer unit. But we'll treat it as an abstract number unless we have reason to do arithmetic.
Okay. You can't read anything smaller than a block from the disk; you have to do a full block I/O to get data off the disk. That's why we care about it. And then some more terminology: when I say relation or table, those are interchangeable; that's just tradition. When I say tuple or row or record, those are interchangeable too. And when people say attribute or column or field, for our purposes, at least when we're talking about relational databases, those are also synonyms. There have been many kinds of database systems, and they have their different terminologies. I'll try not to confuse you; these aren't hard distinctions, but they're annoying, so I'll try to be uniform. All right, disks and files. We've talked about this a bit in a previous lecture: databases store information on disks, and a lot of them are magnetic disks. They are mechanical anachronisms; there aren't many devices we use in computing that are mechanical anymore. And this has implications: we have to think about read and write costs, and these are both orders of magnitude more expensive than RAM access. We've seen this exact slide before, as well as this slide, which is a picture of a disk drive. Recall that these things spin at some physical speed, and there's the cost of moving the disk arm in and out, and then the cost of waiting for the platter to rotate under the disk head. There may be multiple platters stacked up; transferring from one platter to another is quick, since that's electronics. And the page size, this thing that the operating system is forcing us to transfer, comes initially from some unit of transfer of the physical device. Typically the physical device has something called sectors, and the page size is a multiple of sectors.
And then the operating system might give you a page size that's actually bigger than what the device can do, to amortize the cost of I/Os. So there are layers upon layers here, but it's definitely not like memory: you can't access a byte on this thing; you have to get a chunk of data. From our perspective in this class, just assume it's a block of data, or a page of data, same word, and that it's always fixed-size. The other things to recall are seek time and rotational delay: seek time to move the arm, rotational delay to let the data come around under the head, and transfer time for the electronics to read out the magnetic media and bring the bits up into your computer. We saw the numbers; there are orders of magnitude differences between them. So we want to do things to reduce seek and rotational delays, or simply to not go to the disk drive at all, by caching things in memory or by prefetching things into memory before we need them. And lastly, for magnetic disks, sequential I/Os and random I/Os have very different costs, because of the cost of seeks and rotational delays. Therefore we want to think about pages that are near each other on the disk versus pages that are far from each other. We can abstract the disk as kind of like an orange that you peel: there are tracks, and we can read them and then move concentrically into the disk drive, and think of that as one big spool, a linear ordering. If you actually read the data like that, concentrically or in a spiral from the outside to the inside, it's pretty much sequential I/O the whole way: you pay very small seek costs and never pay rotational delays. In fact, the disk drives, with their track buffers and so on, will make that feel like you're never paying seeks at all. So there is a notion of a sequential scan of the disk, which is way faster than random I/O, where the disk arm is waving back and forth.
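A back-of-the-envelope model makes the sequential-versus-random gap concrete. The constants below are ballpark figures I'm assuming for illustration (roughly a 10 ms average seek, 4 ms rotational delay, 0.1 ms to transfer one block); they are not the exact numbers from the slides.

```python
# Rough disk cost model; the constants are assumed ballpark figures.
SEEK_MS = 10.0       # average seek time
ROTATE_MS = 4.0      # average rotational delay (about half a revolution)
TRANSFER_MS = 0.1    # transfer time for one block

def random_read_ms(num_blocks):
    # every block pays a full seek + rotation + transfer
    return num_blocks * (SEEK_MS + ROTATE_MS + TRANSFER_MS)

def sequential_read_ms(num_blocks):
    # one seek and one rotational delay up front, then pure transfer
    return SEEK_MS + ROTATE_MS + num_blocks * TRANSFER_MS

# Reading 1000 blocks: roughly 14 seconds randomly versus roughly a tenth
# of a second sequentially, about two orders of magnitude apart.
```

Under these assumptions, almost the entire random-I/O cost is seek plus rotation, which is exactly why caching, prefetching, and laying pages out near each other pay off.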
So we do have a notion of what's the next block on the disk, in some sort of logical linear order. And when the system knows it's going to do a sequential scan, as I talked about last time, it probably has a thread somewhere that's going to read ahead for you and make that data appear in memory before you even ask for it. You'll start reading in order, and the system will go, aha, I see what you're doing, and read ahead for you. So you can expect that sequential scans are much, much faster than random I/Os; in fact, in a balanced architecture, they should be as fast as the bandwidth of your machine. Okay, so with that background and review, let's talk about how we actually store data on the disk. Let's get down and dirty. I'm going to spend a bit of time on this now. This layer of the system, the disk space manager, is the lowest layer of the system. In most implementations, it's responsible for the actual physical layout of bytes on the blocks of the file that the data is stored in. And I think we talked about last time that, in the ideal case, that file is actually the device. In Unix, for example, if you go to /dev/whatever, those are not really files; they look like file I/O to you, but they're actually the devices, and you can implement code on top of them to access blocks on those devices by hand. So for maximum control of performance, you might implement this layer over the raw device files, and then the database system really is managing the blocks of the disk. In practice, that's usually more trouble than it's worth, because those device drivers change, and then suddenly you're responsible for every device driver ever written, in the database system, and that's annoying. You'd like the file system to give you the abstraction of those devices. So what you do is allocate a file on an empty drive that's about the size of the drive, and most file systems will be nice
and allocate that file sequentially on the physical drive, so that when you say file block four hundred, file block four hundred one, file block four hundred two, those are physical disk blocks that are actually next to each other sequentially. All right, we won't talk much about this after today; just assume that we have management of the space on the disk, and that it gives us pretty much the abstraction of the disk as a sequentially scannable drive, the way you'd expect. So the higher levels of the database system are going to assume that this disk space manager has a very narrow API: it can allocate or deallocate pages from one of these files, and it can read and write pages of these files, and that's about all it does. It's a very low-level API. And if you request a sequence of pages from this API, the higher levels would like that sequential access to actually have sequential performance from the lower level. So the lower level is responsible for organizing the blocks of the file to achieve sequentiality when possible. The higher levels don't actually get any guarantees about this, though; the physical details of which block is where are hidden behind the API of the disk space manager. But the higher levels may make the performance assumption that if they ask for things in sequential order, the disk space manager will serve them quickly. So it's a performance expectation rather than a physical guarantee between the disk space manager and the higher levels. Okay, so the basic abstraction the database storage manager gives is a file abstraction. We can ask for pages, but those pages live in files, and the higher levels of the system are going to want to think about records. So we're going to need to map from files to blocks to records in some way, and we'll see how this is done. So what is a file? In database terms, a file is going to be a
collection of pages or blocks, each containing a collection of records. And we're going to use the word collection fairly abstractly here; I didn't necessarily require them to be sequential. How is this different, already, from the Unix notion of a file? What does Unix say a file abstraction is? Anybody know? It's a stream of bytes. The Unix abstraction is sequential: it's a stream of bytes. We didn't say anything about streams or orders here; we just said it's a collection. Now, we may expect that if we ask for page 2 and then page 3, page 3 will come fast, but we won't make any assumptions about where they actually are on the disk or what meaning that ordering has. These collections are more abstract than the traditional Unix stream abstraction. It's just a collection of pages, each containing a collection of records. And the API it's going to support: insert a record into this file; delete a record from this file; modify this particular record in the file; fetch this particular record from the file, using some record ID. I'll show you how we're going to represent record IDs; it's fairly straightforward. And then there's usually an API to open an iterator on all the records: I want to scan through all the records, get next, get next, get next. You might also be able to push some simple where-clause conditions on those records down into the storage level, so that records that fail them don't percolate up through higher levels of the system. And this is often implemented, by the way, in not just one operating system file but maybe many of them, because a database file may span lots of disks. You may have the table of, say, all the clicks that have ever happened on your website; that could span lots and lots of disks, and the abstraction from the database system's perspective is that it's one "file," in quotes.
Okay, so it's typically implemented as multiple OS files, or you can implement it over multiple raw disk devices. The typical file structure is what's called an unordered file, or heap file. It's a collection of records, and the system guarantees nothing about the order of those records. As the file shrinks and grows, disk pages are allocated or deallocated to the file. And to support the record-level operations, fetch a record, insert a record, delete a record, modify a record, we're going to need to keep track of: what are the pages in the file; where is the free space on those pages, so that we can insert records; and where are the records on those pages. There are a bunch of ways you could structure this data on the disk, and we'll go through a few of them and see which ones we like; there are trade-offs, as always. We're going to look at two file structures, and then we'll look at page structures for them as well. So here's a heap file: no particular order, implemented really as two lists. There's going to be some header page, which is where you go when you're interested in this file; it's not at a known location on the disk. Hanging off of it are two doubly linked lists: one of full pages, which we won't look at when we're considering inserting stuff, and one of not-full pages, which are the places where we might want to insert stuff. The location of the header page will be stored somewhere else, in something called the database catalog, which is where we keep all this metadata. We'll worry about how we represent the database catalog later; it's kind of a bootstrapping problem. But somewhere, written down on the side, is the location of that header page, and then the pointers let us walk these linked lists of data pages. What's a pointer on the disk? It's an address. What kind of an address? It's not a memory address, since we're on a disk. Yeah, okay, good.
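Stepping back for a moment, the record-level API described above (insert, delete, fetch by record ID, scan) can be mocked up entirely in memory. This is a toy stand-in, not a real implementation: pages are Python lists, a record ID is a (page number, slot number) pair, and the names are mine for illustration.

```python
# Toy in-memory heap file: pages are lists; a rid is (page_no, slot_no).
class InMemoryHeapFile:
    PAGE_CAPACITY = 4   # records per page, tiny for illustration

    def __init__(self):
        self.pages = [[]]

    def insert(self, record):
        # Find any page with free space (like walking the not-full list).
        for pno, page in enumerate(self.pages):
            if len(page) < self.PAGE_CAPACITY:
                page.append(record)
                return (pno, len(page) - 1)
        self.pages.append([record])          # no room anywhere: new page
        return (len(self.pages) - 1, 0)

    def fetch(self, rid):
        pno, slot = rid
        return self.pages[pno][slot]

    def delete(self, rid):
        pno, slot = rid
        self.pages[pno][slot] = None         # tombstone the slot

    def scan(self):
        # Iterator over all live records, in arbitrary order: get next, get next...
        for page in self.pages:
            for rec in page:
                if rec is not None:
                    yield rec
```

The (page, slot) pair anticipates the record-ID representation mentioned above, and the insert path's scan over pages mirrors the "walk the list looking for room" behavior we're about to critique.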
It could be super physical, like the sector of the track of the platter of the device, but the operating system is going to give us an abstraction of just block numbers. So each one of these pointers is a block number in a file, in a Unix file, and that tells us how to go ask the operating system for that block. So when I draw a pointer here, it's really a disk block ID. You can imagine that this has some pros and some cons, and before I give it all away: what might be bad about this? It's kind of a straw man, to be honest; it's not a very good organization. What operation might be kind of slow here? Finding a record? Okay, we haven't talked about how you might find a record. Certainly finding a record by value: suppose we wanted to find all the people named Woo. You'd scan through all the data, but you'd have to do that no matter how you organized it, unless you had some kind of value-based index, and we're not doing that for another two lectures. So regardless of how you organize the blocks, you'd have to scan them all to find all the Woos. Yeah? Okay, it would be hard to know how much free space you have; I would agree with that. We might know how long the lists are, by maintaining a little metadata at the head of the list as we do insertions and deletions, but we might not know how much free space there is in these pages, and more particularly, which pages have how much free space. So if you come along and say, I want to insert a record and it's two-thirds of a page big, which one of these pages has that much space? As of now, the only way to find out is to walk through the linked list at the bottom until you find one that's got enough free space. Yeah, why don't we do something better? That's the next slide; clearly we
want to have some meta-information about this. Now, you could keep that information in memory, which might have been the suggestion up front: keep a hash table in memory that tracks the state of this disk, and if the system crashes and comes back up, repopulate it in the background or as you go. That's not bad. But we can also organize the disk itself a little better, and we can do both of these things: you can always keep cached information about the files in memory, but we can do a better job organizing this thing just a little hierarchically. So instead of having one header page and a linked list, why don't we have a linked list that forms a sort of directory? This is a fairly typical thing: you store, up at the front of the file, a bunch of header pages linked together, and the header pages are full of little entries, each with a pointer to a data page, that is, a block ID, and a free-space count. And you keep them on the disk, so you don't have to cache this in memory and recompute it if you crash; you actually put it in these disk pages. These directory entries let you know which pages have how much free space. So in the worst case, you scan the whole directory, not the whole table, and since these directory entries are just a few bytes each, you can store lots of them on a single disk page, so the directory should be pretty small. So that's a simple scheme you could use that would be much more efficient than what we just saw. If you wanted to do more than this, you'd start building hierarchies: a directory of directories, organizing the pages into collections that have more or less free space. But it's usually too much trouble; this is usually enough depth of hierarchy for a single disk drive, because you can pack
these directory pages pretty densely. And remember, you're probably going to read a track's worth of these pages anyway when you do an IO — they're all close to each other — so you can get a lot of directory entries with one sequential IO, at essentially the cost of one random IO. So this is probably good enough for most cases we'll be interested in. But the idea of caching some of this in memory to make it even faster is not a bad one; I like that. Okay, so that's a simple page directory organization. Now, a sneak preview of what we'll talk about either next time or the time after: there are going to be more interesting file structures for doing more than scans. In particular, we're going to build indexes that let us fetch records by value. A heap file lets us retrieve records by specifying a record ID, or by scanning all records sequentially. What we haven't done yet is say what a record ID is — I'll get to that on the next slide, so hold that thought. Heap files let us fetch individual rows by record ID, and they let us scan through all the rows in arbitrary order. But it's nice to be able to fetch records by value: find all the students named Wu, find all the students in CS, find all the students whose GPA is greater than 3.0 and who have blue hair. Indexes are file organizations that let us do these kinds of value-based queries, and we're going to learn about B+ trees, the standard index for disk-based databases, probably in two lectures. These are alternatives to heap files: you don't have to store your data in a heap file; you could also store it in indexes. Okay — we're bouncing around a little, so let me recap.
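The page-directory scheme from a moment ago — header pages full of (page pointer, free-space count) entries — can be sketched in a few lines. This is a hypothetical toy structure, not a real DBMS API: the point is that finding room for an insert means scanning the small directory, never the data pages themselves.

```python
# Sketch of a heap-file page directory: a chain of header pages, each a list
# of (page_id, free_bytes) entries. Names and structure here are illustrative.

def find_page_with_space(directory, needed):
    """Scan directory entries; return the first page_id with enough free space."""
    for header in directory:             # header pages are linked together
        for page_id, free_bytes in header:
            if free_bytes >= needed:
                return page_id
    return None                          # caller would allocate a fresh page

# Two header pages describing three data pages:
directory = [[(0, 100), (1, 3000)], [(2, 60000)]]
print(find_page_with_space(directory, 4000))   # -> 2
print(find_page_with_space(directory, 70000))  # -> None (no page fits)
```

Note that in the worst case we touch every directory entry, but since each entry is a few bytes, that's far fewer IOs than walking the linked list of data pages.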
We looked at how you store blocks in a file: here's a heap file, and we talked about directories and data pages — that's the picture on the previous slide. If we zoom in on this, we need to talk about what the contents of a page look like; there will be a picture of the layout of a page in a little while. But inside a page there's a whole bunch of rows of a table, so before we look at the structure of a page, let's look at how we lay out an actual row, or tuple. We're zooming in one level deeper — how do we store a row? — and then we'll pop back up to how you store a page. So here's one way to store a record — a row, a tuple, call it what you will. This is nice if all the fields of your table are the same width: fixed-width fields like integers or floats. Then all you have to do is copy that tuple, as a sequence of bytes, into the bytes of the disk page. You have a base address, the start of the tuple, and you know the lengths of all the fields, because in your table they're all the same. In this example they're all integers, so every field is, say, four bytes: column one is type integer, so it's four bytes; column two is type integer, so it's four bytes. So you can compute the location of each field within the tuple as the base address plus the lengths of the preceding fields: if you want to get to field number three, you start at B and add the length of field one and the length of field two. And in a relational database, all these tuples have the same format — the same columns with the same types — so all the rows in this table will look like that, and you don't have to store the field information with the
table; you can store it separately in the system catalog. So if this is, say, the statistics table, we know it has four columns of type integer, we store that on the side, and all this arithmetic can be done for the whole table without carrying the metadata around in the storage. This is quite different from something like an XML text format, where for every field you put a little tag around it — "this is field f1," "end of field f1," "field f2." This is a very compact representation, so it's quite efficient, and when you stream this stuff through memory it goes right through the processor caches and into the registers very efficiently. So it's a very nice format, and you want to do this when you can; it really makes things much faster. Unfortunately, the world is not all fixed-length records with fixed-length fields. When it's not fixed-length, there are two standard formats; the first one is not very smart and the second one is better. The first one, in blue, says: fields can be variable length, so at the end of each field we'll put a delimiter. This is roughly what comma-separated-value files do in ASCII: commas are the delimiter — except that if a field contains the comma character, you have to do some weird escaping with quotes and all kinds of stuff, which makes CSVs painful and actually slow to process; you end up executing a lot of exceptional logic as you read them. So you don't really want delimiters with special symbols: they introduce branches into your code, and they're just kind of gross. They also have a storage cost: if dollar sign were your delimiter and you used dollar signs in your data a lot, you'd pay for storing the escape characters for all those dollar signs. So delimiters are generally a crummy idea.
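The fixed-width arithmetic from a moment ago — field location equals base address plus the lengths of the preceding fields, with the lengths coming from the catalog rather than the record — can be sketched like this. The schema list here is a stand-in for what the system catalog would supply:

```python
import struct

# Hypothetical catalog entry: a table of four int32 columns, 4 bytes each.
schema = [4, 4, 4, 4]

def field_offset(schema, i):
    """Offset of field i = sum of the widths of fields 0..i-1."""
    return sum(schema[:i])

# Pack a row into raw bytes, then read the third field straight out by offset:
row = struct.pack("<4i", 10, 20, 30, 40)
off = field_offset(schema, 2)                    # start of field 3
(val,) = struct.unpack_from("<i", row, off)
print(val)                                       # -> 30
```

No per-field tags or delimiters are stored — the whole table shares one schema, so the offsets are computed, not read.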
The better idea is that at the front of every row or tuple you have a fixed-width array of offsets telling you where to find each field in the tuple. Think of them as pointers, but they're just offsets from the front of the tuple — probably a byte or so each — telling you how far to the right to go to find that field. That's a more flexible and typically more compact representation. Yes — f4? Well, you'd have to know before you went and got it. If you're looking at f1, you know where the beginning of f1 is and where the beginning of f2 is, and you keep both in mind as you go get f1. For f4, there's probably a missing thing, which is an end-of-record pointer — so I think you're right; there's a little bit of a bug in the slide. Without an end-of-record pointer, or some higher-level thing that tells you where the records are in this file, you can't figure out where f4 ends. In fact, that end-of-record information is going to live up at the page level: the page directory is going to tell us where the tuples are and how long they are. But somehow, somewhere, you need to know the length of f4, or the length of the whole tuple. Good question. Now suppose field number three is of type text, and I want to change my name from Joe Hellerstein to Joe M.
Hellerstein. So f3 has to get bigger in this tuple, and I'm going to have to rewrite the back half of the tuple. True. We're not going to worry about that too much on a tuple-by-tuple basis, because it all happens in memory anyhow — it's not so bad. We are going to try to make that cheap at the page level, as you'll see on the next slide. So yes: you pay a cost if you want to expand things; you have to shift bytes forward. The alternative would be to have fields floating around loose in the bytes of the page, which makes page space reclamation annoying — now you've got pointers all over the middles of your pages to different cells of different tuples. So we'll keep each tuple contiguous, which lets them stream through memory better, and we'll pay for expanding tuples, which doesn't happen often. Good — you guys are thinking about the right stuff. The other thing to wrap your head around in this picture: these are blocks on the disk, and we read them into memory. Once we start looking inside a page, we're in RAM. And what is this thing in RAM? It's an array of bytes that happens to be 64K big, and everything we're doing here is byte arithmetic in RAM. All these pointers are really byte distances, and the arithmetic we do on them is like C arithmetic on memory addresses. The page has been read into some particular location in RAM, and f2 is, say, 16 bytes to the right of the head of the tuple, so we take the tuple's address plus 16 bytes to get to f2 — all in RAM. If we have to change this tuple, we change it in RAM, manipulate it in memory, and at some point write it back down to disk — but we write the whole block. So remember: manipulating tuples on pages actually happens in memory.
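The offset-array record format just described can be sketched as follows, including the fix for the bug noted above: storing one extra end-of-record offset so the last field's length is recoverable. The one-byte offsets and function names are illustrative choices, not a real system's layout:

```python
# Sketch of a variable-length record: a header of 1-byte offsets (one per
# field, plus an end-of-record offset), then the field bytes back to back.
# All offsets are relative to the start of the record.

def pack_record(fields):
    n = len(fields)
    offsets = [n + 1]                      # data starts right after the header
    for f in fields:
        offsets.append(offsets[-1] + len(f))
    return bytes(offsets) + b"".join(fields)

def get_field(rec, i):
    start, end = rec[i], rec[i + 1]        # two consecutive header bytes
    return rec[start:end]                  # equal offsets => a NULL field

rec = pack_record([b"Joe", b"", b"Hellerstein"])   # middle field is NULL
print(get_field(rec, 2))                           # -> b'Hellerstein'
print(get_field(rec, 1))                           # -> b'' (null)
```

Note how the null representation from the lecture falls out for free: two consecutive offsets pointing at the same byte make the field between them empty.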
The only things that happen on disk are writing entire pages and reading entire pages. One more point on the second scheme — not terribly important, but a lot of the time there are fields where you want to put in a null value, which means "I don't know." It can mean "doesn't make sense," it can mean "can't be computed" — it can mean a whole bunch of things, but it's a kind of don't-know marker. In SQL, as in many languages, every data type has a null value: a special value different from all other values. It's not the same as the empty string for strings — it's different; it's null. Nulls can be represented here just by having two consecutive offsets point to the same place: if f2 and f3 both start in the same place, that means f2 is null. So it's a uniform representation. In the delimited scheme you'd have two consecutive delimiters, which also works, actually. Okay, now let's talk about page formats — I'm done with row formats; how do you lay out a page full of rows? Again, this is the unit of transfer: these pages get read from disk into memory, and on a page there are tuples copied into that 64K byte array. Where are those tuples, and how do you find them in that 64K extent of memory after you read it up from disk? These pictures are from the book, and I'm going to jump forward to this one. This is an array of memory — 24 bytes of memory. In all these pictures, remember that we're really talking about a linear array, but for whatever reason the book draws it as a rectangle, so you have to imagine the rectangle wrapping around: the actual bytes of memory go left to right, left to right, left to right, in a line. The book has taken that
long skinny thing and wrapped it around. So on the left here is a page format for fixed-length records. It's pretty straightforward: you store the first record in the first bytes of the block, the next record right after it, the next record right after that, and so on. Then maybe there's some free space, all at the end of the array, and in the last byte or two of the array there's a count of how many records are on the page. Knowing the number of records, and knowing from the catalog how wide your records are — it's a fixed-width table — you can compute where the free space starts, and you can compute where each existing tuple is as an offset from the top of the page. If all your records are 12 bytes long and n equals 3, slot 1 is at byte offset 0, slot 2 at byte offset 12, slot 3 at byte offset 24, and you just go get the tuples that way. Now, as you delete and add tuples, you have to keep this thing compacted, which is sort of annoying, but not a big deal. The alternative is to choose not to compact free space as you delete tuples, so you may have empty slots on the page — the tuples are still fixed-width; some slots are just empty. Then you need some kind of bitmap at the end saying which slots are empty and which are full. You might also like to know how many slots there are on the page — for fixed-width records you could probably calculate that, but let's assume there's an M there that tells you — and then a bitmap, maybe growing from the end of the page inwards, recording which slots are full. Free space is handled by popping tuples in and out of slots as they're deleted and inserted. The only real advantage of the scheme on the right, for these fixed-width tuples, is that the tuple IDs stay the same.
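The packed fixed-length layout above — records packed from the front, a count in the footer, everything else computed — can be sketched as a toy page. The page size and record width here are made-up small numbers; in the real system the width would come from the catalog:

```python
# Sketch of a packed fixed-length page: records contiguous from byte 0,
# a one-byte record count in the footer. Nothing else needs to be stored.
RECORD_WIDTH = 12                                  # from the catalog (assumed)

def slot_offset(slot):
    """Byte offset of a slot, computed -- not stored -- per the layout above."""
    return slot * RECORD_WIDTH

def read_record(page, slot):
    n = page[-1]                                   # footer: count of records
    assert slot < n, "no record in that slot"
    off = slot_offset(slot)
    return page[off:off + RECORD_WIDTH]

# A 64-byte toy page holding two 12-byte records and 39 free bytes:
page = b"a" * 12 + b"b" * 12 + bytes(39) + bytes([2])
print(read_record(page, 1))                        # -> b'bbbbbbbbbbbb'
```

Free space also falls out by arithmetic: it starts at `n * RECORD_WIDTH`, which is why compaction matters for this scheme.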
This is a good time to talk about what a tuple ID, or record ID, is in a database system. The only way to distinguish two otherwise identical records is where they physically came from on disk, and that's a page ID plus a slot number — a unique physical identifier for the tuple; it's where it got stored. Now, in the left-hand representation, if you keep compacting things, you keep changing slot numbers. If I delete the second record, the third record becomes the second record, which means anybody who remembered the record ID of the third record has to be notified and update their memory — and that's a pain in the butt, because other parts of the database, like indexes, may be pointing at these tuples. If you move them around, you have to fix all the inbound pointers. So there's some motivation to use the scheme on the right: when you allocate a tuple in a slot, that slot is its name, and it stays its name until you do something radical like reorganize the entire file. So it's often nice to have the unpacked representation rather than the packed one, because record IDs remain fixed. Now, this all still assumed fixed-width tuples, which is a strong assumption. When it's true, this is great — very efficient. When it's not, we have to do something else, and you get what's often called a slotted page format. This is the most general format for organizing a page. We're going to have, again, a pointer at the very end of the page telling us where a big contiguous swath of free space starts — so if we need to insert something, we try to find room for it at the start of the free space. Then, going in from the end of the page, we'll have
a count of the number of slots currently on the page, and then for each slot a little directory entry of two bytes each — one, two, three, up to n in this case — which are pointers, really offsets into the page, telling you where to find the start of each tuple. What's not in the picture, but is in the book and will be on the next slide: each slot directory entry will also hold the length of its tuple, which answers the question asked earlier — how do we avoid reading off the end of a tuple? The lengths are stored in the slot directory, so we have the start and length of every tuple on the page. You can think of these pages getting reorganized over time: you can move the tuples around on the page and compact them together to create more free space, and the slots stay the same. There's a level of indirection: somebody pointing from outside says "I want this page, second slot," and the second slot happens to be the thing currently at offset 16. If I reorganize the page and move it to offset 8, that's fine, because the outside pointer still just says "this page, second slot." The slot directory entry is the unique identifier; the location of the tuple on the page can move. That level of indirection between the slot directory and the tuple's location is what allows you to reorganize these pages, which is nice. And because tuples may grow and shrink, we'll probably want to do this kind of compaction periodically — so it's well suited to variable-length tables, where each tuple might be a different length and an update might change it. Details, details — but this is the kind of stuff where, at the bottom of the system, if you get it wrong, things go slow and
interfaces break, so you've got to get it right. The level of indirection from the slot directory also makes these pages kind of nice for fixed-length records, so slotted page organizations sometimes get used there too, just so the slots stay the same. Questions? Let me clarify: this is a linearization — the left-to-right version of a slotted page, 24 bytes long, and I'm assuming we can store these integers one per byte. If you look at the far right, it says free space starts at byte 13. There are two tuples on this page, the orange tuple and the blue tuple, and our slot directory grows right to left, every slot two bytes: a start and a length. Slot one is the orange tuple: it starts at byte zero and is four long. Slot two starts at byte four and is eight long. Now, what's the biggest tuple you can add to this page? You might count the gray boxes that are free and think the answer is six — but it's not; it's only four. Why only four? Because we need two bytes for a new slot directory entry: slot three is the entry for the thing we add, and it takes two bytes to extend the slot directory to the left, leaving four bytes for the tuple itself. It might actually be a good exercise in section to insert a tuple onto this page and look at the after-image. So all the pictures before were a little abstract, with those rectangles — this is really what's going on in memory when you read the thing off disk: it's just an array. Okay, so what have we got? We've got heap files, we've got pages, we've got rows — tuples — on pages. And then we need to be able to find our heap files, and to know the format of the pages and the rows.
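The 24-byte slotted page example can be rebuilt as a toy byte array. The exact footer layout here is my assumption (last byte is the free-space pointer, the byte before it is the slot count, then two-byte slot entries growing right to left), and the byte positions are slightly simplified from the slide — but the punchline is the same: a new tuple must also pay for its slot entry.

```python
# Toy 24-byte slotted page. Footer layout (assumed): byte 23 = free-space
# pointer, byte 22 = slot count, then 2-byte (start, length) slot entries
# growing leftward from the count.
page = bytearray(24)
page[0:4]   = b"\x01" * 4          # orange tuple at offset 0, length 4
page[4:12]  = b"\x02" * 8          # blue tuple at offset 4, length 8
page[18:20] = bytes([0, 4])        # slot 1: (start, length)
page[20:22] = bytes([4, 8])        # slot 2: (start, length)
page[22]    = 2                    # two slots in use
page[23]    = 12                   # free space starts here

def max_insert(page):
    n = page[22]
    slot_dir_start = 22 - 2 * n            # directory grows left from the count
    gap = slot_dir_start - page[23]        # contiguous free bytes in the middle
    return gap - 2                         # minus 2 bytes for the new slot entry

print(max_insert(page))                    # -> 4, not 6
```

Six free bytes sit between the data and the slot directory, but two of them get eaten by slot three's directory entry, so the biggest insertable tuple is four bytes.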
That's often going to be the system catalog. Somewhere, abstractly, outside this picture — let's not say how it's implemented yet — there are the system catalogs. They're oracle-like; I don't mean Oracle the company, I mean oracle like the myth. There's a system catalog out there that tells us how to find the stuff we need to know. In particular, for each relation it has: the name of the relation; the file location, meaning where the header block for that file is, so we can go start reading it — a physical idea of where the file is; the structure of that file, because that determines what code we invoke to read it — is it a heap file, is it a B+ tree file, what is it; and, for that table, all the column names — the attribute names — and their data types, so for the fixed-width ones we can figure out how wide they are, and if they're not fixed-width, we'll know that too. Then, if there are additional indexes on this table — maybe it's the students table, and we built an index on GPA so we can do quick lookups on GPA, and another on last name — there may be many of these, and we keep track of which ones exist. They have names, so we can do things like delete them. And then there may be integrity constraints on the table, which are properties we always want to be true — like "student ID must be unique; that column can't have any duplicates." That's a typical integrity constraint, and we'll talk more about them over the semester. All of that is written down in the system catalogs. For each index we'll know what kind of index it is — in this class we'll probably just study B+ trees, but there are other indexes people use in the field — and what the search keys of the index are:
what can you look up with this index? I can look up on GPA, or on last name. And then there are these things called views. A view is a query that you store and pretend is a table. Imagine a query that computes the number of students in each major; you might pretend that's a table and tell someone else about it — you can't see the individual students at Berkeley, but you can see how many students there are in each major; here's a table with that information. But it's really just a view: the users don't know that under the covers it's a query, and it will run that query to get the answers. So views are logical tables, and what we store for them is the view name and its definition — the query that goes with it. And finally there's a whole bunch of other stuff: statistics about the tables — how many pages they have, how many distinct values are in each column, the distribution of values; authorization information — the names of users, and which users are allowed to look at what; configuration stuff about the database — how many buffer pages you have on your machine, all that kind of thing. So all these things, both physical and logical descriptions of the database, are stored in the catalogs. And the fun thing is that the catalogs are bootstrapped off the database: the catalogs are just relational tables themselves. There's a little bootstrapping when you boot the database so it can figure out how to load the catalog tables, and then internally, a lot of the logic for using the catalogs is the same logic used to query the regular tables. The metadata of the system — the data about the data — is just data. You get nice uniformity, and as a result you can look at these catalogs and query them yourself. The book has a made-up catalog table called attrcat — the catalog table of
attributes. What do you need to know about the attributes of tables? The name of the attribute, what relation it's in, what its type is, and what position it's in — first, second, third, fourth. This is the table of all columns in the database, according to the book. And you'll see that it actually talks about itself: one of the relations in it is attrcat itself, so it describes its own content. It says, "ah, there is a table called attrcat with four columns: attrname, relname, type, and position." It's describing itself. That's made up, but here's what you get in Postgres: there's a thing called pg_attribute, and \d gives you the column names of a table in the psql interface. These are the columns in a real database that describe the attributes in the database. There's a relation ID, attrelid — the ID of the relation this attribute belongs to — there's a name, and then a whole bunch of other stuff the textbook didn't think about. I don't even know what some of these mean, but I'll give you one weird example: collation. If you know anything about internationalization, you'll know that the ordering of strings has been internationally standardized, and there are many different orderings you can choose — different languages alphabetize differently. So there's a collation field that tells you which international ordering standard this column is ordered by: when we talk about less-than and greater-than in this column, what do we mean? There's all sorts of crazy stuff you need to store. So that's the attribute table in Postgres. Maybe I can poke around in Postgres for a sec.
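You can reproduce this "the catalog is just a table" idea from plain Python. Postgres isn't available in a vanilla Python session, so this sketch uses SQLite, where `sqlite_master` plays the role of the table catalog and `PRAGMA table_info` plays the role of pg_attribute — the analogy is mine, but the mechanism is the same:

```python
import sqlite3

# An in-memory database with one user table:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (sid INTEGER, name TEXT, gpa REAL)")

# The catalog of tables is itself queryable with ordinary SQL:
rows = conn.execute(
    "SELECT name, type FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(rows)                        # -> [('students', 'table')]

# Per-column metadata -- position, attribute name, declared type --
# the rough analogue of pg_attribute:
for cid, name, coltype, *_ in conn.execute("PRAGMA table_info(students)"):
    print(cid, name, coltype)
```

Same uniformity as in the lecture: the metadata about the data is just data, queried with the same machinery as everything else.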
sec in in postgres and there's a whole bunch of catalog tables you can do this in your VM so there's a table called PG tables I think it names of all the tables and inside the PG tables there's table names including PG statistic PG type PG off ID PG attribute which is the one we just saw PG product there's a whole bunch of crazy stuff this is just the catalog of the database the table names and PG tables right now are just the catalog table they haven't defined any other tables and it goes on it goes on it goes on so this is the catalog it's 58 tables in postgres and once you say select star from PG tables now you know the names of all the tables you can say select star from sequel languages or whatever okay so there's a whole bunch of stuff in the catalogs you can write queries that join the catalogs to the data if you want to I'm not sure why you'd want to do that but there's all sorts of like tricks you can play because the catalogs are just tables all right so let's take a break of mental break talk a little bit about administrative yeah take a stretch homework one should be done it's been graded and the G lookup insertion is going to happen shortly I think Michelle's going to get that done today tomorrow something like that so you should get your grades back on that soon homework one you had all the tests for so you know how you did presumably a very fussy homework right kind of gross kind of gritty all right one of the reasons I do that homework and this is the second time I've tried it's an experiment so it's be interesting to get your feedback is that it's a very typical task that you have to do in the real world like most times when you get a data set you have to do junk like this and most of the tools out there to work on stuff are about as good as the tools you used so actually a fairly typical task that I gave you not a lot of fun but you know on the other hand like sometimes getting down and dirty is the way you really like learn to live right so we'll 
be building all kinds of abstractions in this class — relations and tables and query languages and high-level designs for databases — but a lot of times what you get is a bunch of ASCII and you have to wrangle it. I wanted to give you that feeling, and I'm sorry if it was unpleasant; I hope you felt it was worthwhile. Homework two is going to be substantially more interesting and substantially more difficult, I would think, so you should get started on it. We had the Scala info session last night — I don't know how many people were there exactly; Vikram? Maybe a hundred-ish, maybe a third of you, which is great. Maybe the other two-thirds of you already know Scala, which is even better, but I bet a bunch of you don't yet, so watch it on video: it's been posted on YouTube and linked from Piazza. Awesome. If you want, you can watch the Scala info session at 2x — twice as fast as your friends. If you don't have a partner for homework two, you have to get in touch with the TAs, like, prontissimo. You're supposed to have a partner by now. We might be a little lenient, but if you call us the day before it's due saying "you know what, I didn't do the partner thing" — that's bad; that could hurt you. A similar comment on understanding git and turning things in properly: Michelle very graciously went and fished around for those of you who didn't correctly follow the submission instructions and use git properly, and graded your homework anyway. For homework two we will not be so lenient: if you can't follow the git instructions for turn-ins, you will lose points. So please pay attention, and leave time for the fussy details at the end. If you got git wrong this time, you haven't heard about it yet. I believe we're going to try to get an autograder going for homework two, so that when
you do your git push, it should run some tests and give you some feedback. That hasn't been deployed yet, so watch Piazza for it. But you will get a note, hopefully, if you used git wrong on homework one, and there's feedback there you should pay close attention to. And then the last comment, which I guess is worth saying every year: if you know the answer to something on the homework, you probably shouldn't post it to Piazza, because that spoils everyone else's fun. If you have a question where you feel like, "I don't know if I can ask this, because it reveals a bunch of the answer," use your judgment — most of you will get it right — and if you're uncertain, you can always make the post private. If it's something we feel shouldn't be private, because everybody should benefit from the answer, we'll make sure everybody knows. So use a little discretion and common sense and we'll be fine. But definitely don't give away answers on Piazza full bore; that spoils other people's learning opportunity. Any other questions? Cool. All right. So we've finished disk space management — you know enough to be dangerous. We'll talk about indexes, and files that are not just heap files, soon, but not yet. First I want to talk about buffer management. This is the interface between the disk management layer and the query operators. The query operators do things like next — scanning over a file — and things like "fetch me this record ID," a page ID and slot ID. Both of those, in theory, go directly down to the disk space management system, but we want to do some caching and prefetching into RAM so those things go faster. So between the file access methods, relational operators, and all that good stuff we've learned
about for query processing, and the disk space manager, we're going to insert a caching layer called the buffer management layer. It's going to do what you'd expect — caching — but as we'll see when we talk about recovering a database from crashes, it has to be very sensitive to the needs of a database system, so it's a little different from the cache manager in a file system. Here's an abstract picture of a buffer manager. Below the blue line is the disk drive — the file manager interface is down here. On that disk, which we're calling DB, there are pages; one of those pages might be A, at some page ID. In memory, we're going to have a bunch of disk-page-sized chunks of memory allocated for our use as a buffer pool, separate from the memory we use for our query processing operators. That operator memory is allocated per query, for a single iterator of a single query; the buffer pool is shared by all queries, and it's allocated when you boot the database, independent of all those queries. It's however many pages big we configure it to be, but we think of it as broken into frames, each a page big, and we do replacement in this thing one page at a time. So the buffer pool is broken into frames, which are page-sized arrays of memory. When we read a disk page off the disk, it goes into the buffer pool first — it has to, because we can't access any of the bytes of a page until it's in RAM. You can't access the bytes of a page while they're on the disk; you can only access them once you've read them somewhere into RAM, and that
somewhere will be the buffer pool. Okay. All right. So page A might currently be cached in the buffer pool along with a page X and a page C. They're not necessarily in any particular order in memory in this buffer pool; we're going to be replacing these pages and putting them into frames completely independently. All right.

So data has to be in RAM for the DBMS to operate on it, and the buffer manager is going to hide this fact. The upper layers are just going to assume that they can talk about disk pages, but there's going to be a little lookup table on the side of this buffer pool that's going to say disk page A is currently in frame 1 of the buffer pool, and when you access this page A, it'll actually point you into memory at frame 1. Okay? So it's just a cache. Pretty simple. There may be free frames; the little lookup table would know which frames are free and which frames contain particular disk pages. Oh, and then the frames will sometimes have to be replaced, as you would expect, right? If the buffer pool is full and you ask for page Y, and page Y is not currently in the buffer pool, we're going to have to evict something from the buffer pool so that we can bring Y in. Right? And we'll talk about what you do during eviction, but you can imagine that if the page has been changed, you have to write it back to the database, and if the page hasn't been changed, you can just ignore it and throw it away, essentially. But we have to choose which page is the best one to replace based on some replacement policy, so we'll talk about these replacement policies as well.

Questions at this high level? How many of you took CS 162? A lot of you. How much of this sounds exactly like what you learned about caching in CS 162? Good. So what I want you folks to do is look for the differences, all right? And raise your hand and ask questions like, "Wait, did you just say that? Because that's not what I learned." And for those of you who haven't seen this before, obviously this is good for you to
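To make the lookup-table idea concrete, here's a minimal sketch in Python. All the names here (`BufferPool`, `page_table`, `free_frames`) are hypothetical, not the course codebase; the point is just that the table on the side maps page IDs to frame numbers, and a miss means the page isn't in any frame yet.

```python
# Minimal sketch of the buffer pool's side lookup table (names hypothetical).
# page_table maps a disk page ID to the frame currently holding it;
# free_frames tracks which frames are empty.

class BufferPool:
    def __init__(self, num_frames, page_size=64 * 1024):
        self.frames = [bytearray(page_size) for _ in range(num_frames)]
        self.page_table = {}                        # page_id -> frame number
        self.free_frames = list(range(num_frames))

    def lookup(self, page_id):
        """Return the frame number holding page_id if cached, else None."""
        return self.page_table.get(page_id)

pool = BufferPool(num_frames=3)
pool.page_table = {"A": 1, "X": 0, "C": 2}   # e.g. page A is in frame 1
print(pool.lookup("A"))   # a hit: points you at frame 1
print(pool.lookup("Y"))   # a miss: would trigger a read, maybe an eviction
```

The design point is that the upper layers only ever name disk pages; the table translation to frames is invisible to them.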
see. But the subtle differences between what you probably learned in 162 and what you see here, I will try to call them out, but you should definitely raise your hand for points of clarification. All right.

When a page is requested, what happens? So the higher-level code, which might be a B-tree package or it might be a file scan iterator, all right, it's going to request a particular page. The buffer pool is going to have this lookup table on the side that associates frame numbers, which are the offsets in this memory array that is the buffer pool, with page IDs. Right? So as we saw... sorry. So, just so you know, my mom is watching these classes now that they're on video. Hi, Mom. And she told me she hopes that I'm actually washing these clothes, because they're the same every lecture. Yes, Mom, it's clean. But it's really hard to see a black microphone on a black shirt, which is what made me think of that. Okay. My mom was a database programmer, so this is genetic. If you have any COBOL questions, I can't answer them, but she can. Okay.

So, oh yeah, this is what I was trying to remember: page A, frame 0, right? Page X, frame 2. Yes. So that's what this little buffer pool information table contains: a frame number to page ID association, currently what page is in what frame. We have a field called a pin count, which is going to be how many tasks have pinned this page in memory, right? And then we're going to have a dirty bit, which is going to say: has this page been changed since it was read? And that's going to be the contents of this side table of information about the buffer pool.

So let's see how this works. If you request a page, you being a higher level of the system, and that page is not currently in the buffer pool, we have to find a frame to put it in. Maybe, if we're lucky, some frames are empty, but in steady state they won't be. Okay. So we have to pick a frame that we want to replace. All right. So maybe it's frame number 5, and then if that
frame, which we're going to look up in this little lookup table at the top, if that frame is marked as dirty, whatever is there right now has to be written back down to the disk, so that all the changes that were made in RAM to that page are reflected on the disk drive. Right? Write that page to the disk, and then read the requested page into that frame.

I skipped the thing in blue, though. The thing in blue says you can't replace a page that is pinned. You can only replace the pages that are unpinned, where unpinned means pin count equals zero: there's nobody asking to pin this page, right? So the only things we'll ever replace are things that are unpinned. There'd better be some things that are unpinned at any given time, right? If you have all of the pages in your buffer pool pinned, your database system has a bug. Okay? That should never happen. It's called a buffer leak. There should always be free pages; the code in the database that's making these pin requests has to make them very short-term pin requests. They're not going to be long-term pin requests; they're going to be pinned and then quickly unpinned. All right. So you can only replace unpinned pages.

Now, if the IO system can predict what pages you might access in the future, it can prefetch them and populate the buffer pool with them. So those requests for pages could come from queries, but they could also come from side processes that are predicting what the queries want and are fetching things in advance of the queries. Okay? So prefetching can happen, and it happens through the same API of requesting pages; it's just happening from another thread. Okay.

A little bit more: the requester of a page must eventually unpin it, and indicate when they unpin it whether they dirtied it. So the reason you pin a page is because you're about to scribble on that memory, and you don't want the buffer manager to replace that page while you're changing it. Okay? And then when you're done scribbling on it, you set it to dirty and you unpin it.
Right. Now, many people may want to pin the same page, many people being threads or queries, okay? Maybe they're reading it and they're in the middle of doing some query processing operation and they can't lose it. So there's a pin count. Every thread that wants to work with this page will bump the pin count, and then decrement the pin count when it is done. So pinning a page is pin count plus plus, and a page is a candidate for replacement if and only if pin count equals zero. That's what unpinned means, okay? So you can have many pins on a page.

Now, the concurrency control and recovery units: remember, they were off to the side of that block diagram of the whole database system. You remember there was the storage layer at the bottom and so on, and off on the right-hand side was concurrency control and recovery. Recovery is the thing that helps your database when it crashes; concurrency control is the thing that keeps these different queries from messing with each other. Well, these guys may do some additional stuff when you replace a page, right? We'll talk about that in excruciating detail later in the semester, all the things that happen to a page before you write it to the database. Don't worry about that for now, okay? But very typically, more stuff happens before replacement than just writing that page to the disk. Log records are generated, in particular, and they have to be flushed to the disk. We'll talk about all that later. But this little bullet at the bottom of the slide is one of the key reasons why the operating system doesn't do the right thing. I need to do some special database magic before my write request is finished, okay? So you can't just write things when I say to write them; you have to wait for me to do some special careful tidying around. And then, when I do say to write something, well, by god, you'd better write it. Okay? And the operating system often lies about that, and that's not okay. We need
our writes to really be writes. Okay.

Buffer replacement policy. All right, and this will feel a lot like CS 162. So there are going to be frames, and we're going to need to choose how to replace them, as a prediction of which frames are going to get used in the future. Right? I don't want to replace something if it's about to get used; I'd like to get the advantage of having it in cache. So we have to predict the future. There's like 50 years of heuristics for this, and they've all been studied on a million different workloads, and good old-fashioned least recently used is still pretty good. All right, most of the time. Except when it's not, okay? And so we'll talk about that a little bit, but least recently used is the most common one. You can do most recently used, which is sometimes the right thing to do. We'll also learn the clock replacement policy. How many of you learned clock in 162? Okay, so we'll do clock again; maybe this description will be slightly different, but it should be the same idea.

This policy can have a huge impact on the performance of your system, and the efficacy of these different replacement policies depends entirely on the pattern of accesses to the disk. Because we're running these big queries... I mean, think about these algorithms like hashing. There's some very specific IO behavior in the scans, right, that we know about, because we know we're going to do a hash of this entire table. So we know some very large-scale facts about the IOs we're going to do in the future, in a way that in an operating system you might not know. Right? We have very stylized IO patterns, particularly in these queries that access tables in the large, queries that do big table scans, analytics workloads. And so the access patterns in a database system can be kind of particular, and we want to tune for them.

All right, let's talk about LRU in the context of this buffer manager, and then we'll call it a day. So here's how it works. We want to replace the least recently used frame. The page
that is in the frame that was least recently used is the one we evict; that's the frame we pick. So if the frame is pinned, it is in use. It's not recently used; it's currently used. So it's not allowed: we do not replace pinned pages. So ignore the pinned frames. When something is unpinned, that means it's done being used, and that's when we take a timestamp. So when the pin count goes to zero, that's when we set the use time; that was the last time of use. And then we just look at the frame which has the earliest unpin time. That's the least recently used, in our notion of use. Okay? So pinning and use are very coupled here. It's not just access; it's actually unpinning down to zero that counts as the end of use. Okay.

It's a very common policy. It's intuitive, it's simple. What's the intuition? The intuition is: if you haven't asked for it in a while, then you probably don't care anymore. Right? So it's okay intuition, I guess. It works well for hot/cold workloads. So if you have a skewed distribution of popularity, the popular pages will get accessed pretty often, so they'll stay hot, so they'll stay in memory. The infrequently used pages will get loaded in infrequently, and then they won't get used for a while, and they'll get paged out. And that's okay, because they're infrequently used. Okay? So it makes sense for these kinds of skewed distributions of popularity. It's kind of a hot-or-not policy. Works pretty well.

But it doesn't work well for databases in a lot of cases, okay? And the usual problem is something called sequential flooding. So imagine that you have a file and you're going to scan it over and over, all right? And we'll stop with this idea, and then next time we'll go through it in detail. But here's a file on the disk. It's like a thousand pages long, and I'm going to scan it, and I'm going to scan it, and I'm going to scan it, over and over and over. And suppose you have a hundred... so this is your file, and this is your buffer manager, and in the buffer manager you have a hundred frames. So your homework for
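The unpin-time bookkeeping above might look like the following. This is a hypothetical sketch layered on the pin-count idea, using a logical counter instead of wall-clock time; the key detail from the lecture is that the timestamp is taken only when the pin count drops to zero, and the victim is the unpinned frame with the earliest such stamp.

```python
import itertools

# Hypothetical LRU bookkeeping: a frame's "use time" is stamped when its
# pin count drops to zero; the victim is the unpinned frame with the
# earliest such stamp.

class LRUFrames:
    def __init__(self, num_frames):
        self.pin_count = [0] * num_frames
        self.unpin_time = [0] * num_frames   # last time pin count hit zero
        self.clock = itertools.count(1)      # logical timestamps

    def pin(self, f):
        self.pin_count[f] += 1

    def unpin(self, f):
        self.pin_count[f] -= 1
        if self.pin_count[f] == 0:           # end of use: snap the time
            self.unpin_time[f] = next(self.clock)

    def victim(self):
        """Pick the unpinned frame with the earliest unpin time."""
        candidates = [f for f, p in enumerate(self.pin_count) if p == 0]
        if not candidates:
            raise RuntimeError("buffer leak: all frames pinned")
        return min(candidates, key=lambda f: self.unpin_time[f])
```

So a frame that was unpinned first is evicted first, and pinned frames are simply never candidates, exactly the coupling of pinning and "use" described above.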
next time is to think about how often you get lucky with LRU in your buffer manager and you get a hit. Okay? We'll come back to this next time. Get started on your homework.
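If you want to check your reasoning afterward, a few lines of simulation will do it. This is a hypothetical setup, pure LRU where every access is a pin-then-unpin of a single page, with the 1000-page file and 100-frame pool from the example; run it and compare against what you worked out by hand.

```python
from collections import OrderedDict

# Tiny simulation of repeated sequential scans under pure LRU
# (hypothetical setup: 1000-page file, 100-frame buffer pool).

def lru_hit_rate(num_pages, num_frames, num_scans):
    cache = OrderedDict()                       # page -> None, oldest first
    hits = accesses = 0
    for _ in range(num_scans):
        for page in range(num_pages):
            accesses += 1
            if page in cache:
                hits += 1
                cache.move_to_end(page)         # mark most recently used
            else:
                if len(cache) >= num_frames:
                    cache.popitem(last=False)   # evict least recently used
                cache[page] = None
    return hits / accesses

print(lru_hit_rate(num_pages=1000, num_frames=100, num_scans=5))
```

Try varying `num_frames` relative to `num_pages` to see where the behavior flips.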