Today's lecture is going to be on buffer pools; this is how we're going to manage memory inside of our database system. Before we jump into the material, there are two upcoming database talks that I think are interesting and that you guys should check out. There's a talk today in Gates from the co-founder of RelationalAI. I actually don't know what they're doing — I know who the person is, I don't know what the new company is doing — but he's a semi-famous dude, so this should be an interesting talk. I think it will be databases meets machine learning and machine learning meets databases. And then Thursday next week we'll have the second session in our seminar series on hardware-accelerated databases with the co-founder of MapD, which is another GPU database system. That'll be on Thursday on the fourth floor of the CIC building: you come out of the elevators, you go right — well, left, depending on the elevator — but there are these glass doors, go inside them and the talk will be in there.

So as we said last class, the last two lectures were all about organizing database files on disk: how we're going to lay out the files, inside of them we have pages, and inside of the pages we lay out our data. And then we punted on this idea of how we're actually going to bring this into memory and manage it, because we said we can't have our database system operate directly on files on disk; everything always has to be brought into memory before we can even read or write it. So that's what today's lecture is about: the second part, how we're going to make decisions about when to bring pages into memory, and how we make decisions to evict them when we run out of space, because the size of our database is most often going to be much larger than the amount of memory that we have, so we can't keep everything in memory.

Another way to think about the problem is as a trade-off between spatial and temporal control of data. Spatial control is what we've talked about so far: it's basically how we're going to write files to disk, or how we're going to lay out our pages on disk. The goal of this, as we said, is that since disks are faster at doing sequential access — definitely for a spinning-disk hard drive, because it's a mechanical arm that has to jump around on the platter and read the tracks, but for SSDs as well — if we can have the data that we know is going to be accessed together in a query physically located next to each other on disk, then we'll have faster access. Again, thinking of the worst-case scenario of the spinning-disk hard drive: if the arm has to jump around all over the platter every single time I need to read a single page, that's going to be really slow, whereas if I can plop the arm down in one location and then read a bunch of pages sequentially, that's going to be much faster. What we're talking about today is the second issue: the temporality of the data we're accessing. This means how we're actually going to control when we read pages into memory, and when we write them back out to disk, if necessary. So we're not discussing where we're going to write things on disk. We'll talk a little bit about where we store pages in memory, but whether the pages are next to each other in memory doesn't actually matter for our purposes here. In real systems it can matter, but for our purposes here, we ignore that.
So it's really about deciding when we actually want to move data back and forth. And the goal here, obviously, since disk is so much slower than DRAM, is that we want to minimize the number of stalls we incur because we have to read data from disk. In some ways it's unavoidable, as I said, because the database is larger than the total amount of memory that we have, but if we're smart about how we stage things and time our accesses and writebacks, we can minimize the impact. The way to think about this is: if I know I'm going to read the same page over and over and over again, then maybe I don't want to write it out to disk, and I always keep it in memory. But if I read something once and I'll never read it again, then maybe I want to go ahead and evict it and throw it away. These are the kinds of things we're going to discuss today.

And just to bring up this overview diagram that I showed in the last class: we've covered the bottom part. We know how to lay out pages in the files, and now it's the part above this, the buffer pool. The buffer pool is the in-memory cache that the database system maintains of the pages it has read from the files on disk. And we said before that this looks a lot like virtual memory from the operating system, but the operating system, as we'll see as we go along, is not going to do as good a job as our database system, because we have a better understanding of exactly what queries want to do, and therefore we can prepare ourselves and read and write data between disk and memory in a way that the operating system can't. The operating system only sees reads and writes at the page granularity. It just sees the reads and the writes on a page; it doesn't understand anything about how the actual queries are accessing those pages, so it treats all pages the same. Whereas, as we'll see as we go along, inside our database system some pages correspond to indexes, some pages correspond to intermediate results of queries, and they are not all the same and should not all be treated the same. The OS doesn't know this, so it can't do as good a job as we can.

So for today's agenda, we're going to first talk about the buffer pool manager. At a high level it's pretty easy to implement the basic concept, but we'll see the more complicated things you can do in the policies you maintain inside of it for replacement and allocation. And then we'll talk briefly about what the other memory pools look like inside the database system.

At a high level, the buffer pool is just a large region of memory that the database system manages in its own address space, and it's going to use this to copy pages that the queries want to access from disk into the buffer pool. So you can just think of the buffer pool as some large memory region, like a giant byte array. It's going to split up that byte array into what are called frames, and these will be the same size as your pages. I think there was a question last time about whether large overflow pages are a different size than regular tuple pages: everything is always going to be the same page size, because pages need to fit into these frames, and that's how the buffer pool figures out where the start and stop point for every page is.
Because otherwise, you'd need an indirection layer to figure out the length of everything if pages were variable length. So we call these open spaces frames: the slot where you can actually write a page in. I'll try to use "frame" as much as possible, but you can see how the same concept at a high level just gets different names in the database system. We have a file, the file has pages, and inside of pages we have slots, and the slots are where we store tuples. Then inside of our buffer pool, what we might otherwise call slots we call frames: the spaces where we can write a page.

So what happens is, when a query wants to read a page, we're just going to go down to the disk and make an exact copy of it in our buffer pool. If a query wants to read page one, there's an exact copy of that page in a frame in our buffer pool. There's no transformation, no marshalling or serialization process in copying the data from disk into the frame; it's an exact byte-for-byte copy. If you bring in compression, that complicates things slightly, but for our purposes here, how a page is stored on disk is exactly how it's stored in memory. And our database system knows how to operate on that structure, right? It knows about slotted page layouts, it knows how to operate on the tuples inside of them. (No, Windows, I do not want to upgrade.) Okay.

So, as in the example I showed before, the execution engine says, I want page number two, and there has to be something that says, well, here's the frame in my buffer pool where you can go find the copy of page two. This is called the page table. You can just think of it as a hash map that maps page IDs to frames, i.e., where the copy of the page is stored. So if I do my lookup and say I want page two, I go to my page table, and it either tells me the page is there and what frame it's in in my buffer pool, or it tells me it's not there, in which case I have to go to my page directory, find the page's location on disk, and then decide what frame I can copy it into.

Now we want to talk a little bit about thread safety here, because if the system were single-threaded, then this hash table wouldn't need to be protected, right? You'd just go in, find the entry you want, and read it. But because modern systems have a lot of CPU cores, we're going to have multiple threads, multiple queries running at the same time, all accessing this page table. What we don't want to happen is this: we go to the page table, find where a page is, and start reading it, and then some other thread comes along, says "I'm taking this page away," and replaces it underneath us. So anytime we want to access a page, we're going to pin it in the page table. Think of this like a simple reference counter: it's just a counter of the number of active queries that are accessing this page right now, okay? And we store this in the page table and not inside the pages themselves, because we don't actually need to make this reference counter durable. When a page gets evicted, the reference counter doesn't need to be written out with it. So we're going to store all of this metadata in the in-memory page table.
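To make that concrete, here's a minimal sketch, in C++ like the course project, of a buffer pool as one big allocation carved into fixed-size frames, with a page table mapping page IDs to frames and the per-page metadata (pin count, dirty flag) kept alongside. All the names and the 4 KB page size are illustrative assumptions, not any particular system's API.

    #include <cstddef>
    #include <cstdint>
    #include <memory>
    #include <unordered_map>

    static const size_t PAGE_SIZE = 4096;  // assumed page size

    struct PageMeta {
      size_t frame_id;        // which frame in the pool holds this page
      int pin_count = 0;      // number of active queries using the page
      bool is_dirty = false;  // page was modified and must be written back
    };

    struct BufferPool {
      explicit BufferPool(size_t num_frames)
          : memory_(new char[num_frames * PAGE_SIZE]) {}

      // A frame is just a fixed-size slice of one big byte array.
      char *Frame(size_t frame_id) { return memory_.get() + frame_id * PAGE_SIZE; }

      std::unique_ptr<char[]> memory_;
      // The in-memory page table: page ID -> frame + metadata. Not durable.
      std::unordered_map<uint64_t, PageMeta> page_table_;
    };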
So when a page is pinned, anybody else can still come along, find page three in this case, and read it; it's just that the buffer pool manager is not allowed to evict it, because it knows somebody is accessing it right now. Now say another thread comes along and wants to read a page that's not in there. It's going to take a latch on a location in the page table: I want to read this page, I know it's not there, I find a frame where I know I want to store it, so I take a latch on that entry, go fetch the data I need, and update the page table entry. That's the pattern: I want to read a page that's not there, so I have to find a place where I can store the reference to that page in my page table, I take a latch on that location, fetch the page in the background, and then update the page table. At that point I release my latch, because I've done what I wanted to do.

If I just had to read it, then I don't need to pin it; if I'm going to modify it, then I want to pin it. Okay, actually, sorry, that's not quite right, let me be clear: if I'm going to read it, I pin it, and if I'm going to write to it, I pin it too, but I don't need to maintain any additional information if I'm just reading. As soon as I'm done reading, I can throw the pin away. If I'm writing to it, I also have to maintain a dirty bit, or dirty flag, that says "I modified this thing." And this is to prevent the database system from just throwing the page away without writing it back out to disk: it knows that some query modified the page while it was in memory, and therefore we have to make sure those changes eventually make it back to disk. That's what the dirty flag does for us.

The one thing I need to discuss is this distinction between locks and latches. If you're coming from an operating systems background, you may just refer to everything as locks. In the database world, locks are different from latches. What an OS person would call a lock — think of a mutex — is basically what we call a latch here. Latches are used to protect the internal data structures of the database system. In my example from before, when I want to write to my hash table, I set a latch to prevent anybody else from modifying that location while I write into it. That latch can be implemented as a mutex; it can also be implemented as a spin lock, which is an unfortunate name because it muddies the story here. In the database world, a lock, as we'll see later on, is a higher-level, logical construct that we use to protect database entities. I can take a lock on a tuple, or on a table, or on a whole database, but I can't take a latch on those, because latches protect the underlying data structures. In fact, when I take a lock, I would have to take a latch to update the lock table. So to be clear: locks are for high-level constructs, and we'll see this later on when we talk about concurrency control; latches are what we use to protect the internal data structures. In SQL, I can take locks; I can't take latches. Latches are how the database system itself is actually implemented. Okay? And as I said, if you're coming from an OS background, you would call a latch a lock, but in databases, we use the term latch. Okay?
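Putting the latch and the pin together, a page fetch might look roughly like this sketch, reusing the names from the sketch above. Note this is simplified: a real system releases the latch during the disk read, exactly as described above, rather than holding it across the I/O. PickVictimFrame and ReadPageFromDisk are assumed helpers, not real APIs.

    #include <mutex>

    std::mutex latch_;  // a latch (what an OS person would call a lock)

    size_t PickVictimFrame();                              // runs the replacement policy
    void ReadPageFromDisk(uint64_t page_id, char *frame);  // consults the page directory

    char *FetchPage(BufferPool &bp, uint64_t page_id) {
      std::lock_guard<std::mutex> guard(latch_);
      auto it = bp.page_table_.find(page_id);
      if (it != bp.page_table_.end()) {
        it->second.pin_count++;  // pin it so it cannot be evicted while in use
        return bp.Frame(it->second.frame_id);
      }
      // Miss: find a frame (possibly evicting another page), fetch, and pin.
      size_t frame_id = PickVictimFrame();
      ReadPageFromDisk(page_id, bp.Frame(frame_id));  // simplified: latch held during I/O
      bp.page_table_[page_id] = PageMeta{frame_id, /*pin_count=*/1, /*is_dirty=*/false};
      return bp.Frame(frame_id);
    }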
The other thing we want to talk about is the distinction between the page table and the page directory. The page directory is the on-disk information about where to find the page for a given page ID. As I said, if our database is broken across multiple files, the page directory can say: pages one, two, three are in this file at this offset; pages four, five, six are in another file at that offset. Any changes we make to the page directory when we bring it into memory have to get written back out to disk, because if we restart the system and come back, we need to know where our pages are. The page table is the in-memory data structure that we use to figure out, inside of our buffer pool, how to map page IDs to frames for pages that have been copied in. And we don't need to write this out to disk at all, because if we crash and come back, we don't care what the buffer pool looked like the last time we crashed; we're going to repopulate it as queries start executing and reading data. So: the page directory is on disk, mapping page IDs to locations in files, and it has to be durable. The page table is in memory, mapping page IDs to frames in the buffer pool, and we don't need to make it durable. Okay?

All right. As I said, any time we want to go into the page table and do a lookup, we have to take a latch to make sure that nobody swaps out the thing we're trying to read while we're reading it. But this becomes a contention point, because now you have a lot of threads all trying to access the page table at the same time and taking latches. You can do something really coarse-grained, like one latch for the entire page table, or you can take latches on individual elements or buckets. But regardless, if you have a lot of threads and everyone's trying to do the same thing at the same time, this is going to become a major bottleneck in our system.

So one way to alleviate this problem is to have multiple buffer pools. In my example I showed a single page table and a single buffer pool, but there's no reason the database system couldn't have multiple ones. Basically, what happens is your query comes along and says, I want page 123. You can hash that 123, and that maps you to a buffer pool instance, and that buffer pool instance has its own page table and its own region of memory. This is a way to partition the workload across multiple buffer pool instances so that everyone is not trying to acquire the same latches at the same time. There are a bunch of different ways you can implement this. You can say the database management system instance just has multiple buffer pools, split evenly across the total amount of memory you want to allocate for buffering. You can have a buffer pool per database. You can have a buffer pool per page type, so you can say: here's my buffer pool for index pages, here's my buffer pool for data or tuple pages. A bunch of systems support this: MySQL, DB2, Oracle, Sybase, SQL Server, and Informix. With the exception of MySQL — although MySQL is really good — I would call the others enterprise databases; these are what traditional old companies actually use. MySQL is super common; it's used everywhere now, though.
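As a quick sketch of that partitioning idea (illustrative names again): each instance has its own latch, page table, and memory region, and a page ID is deterministically routed to an instance by hashing.

    #include <cstdint>
    #include <functional>
    #include <vector>

    struct BufferPoolInstance { /* its own latch, page table, and frames */ };

    // The same page ID always hashes to the same buffer pool instance,
    // so threads touching different pages mostly take different latches.
    BufferPoolInstance &GetInstance(std::vector<BufferPoolInstance> &pools,
                                    uint64_t page_id) {
      return pools[std::hash<uint64_t>{}(page_id) % pools.size()];
    }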
So in the case of MySQL, as far as I know, at least for version five — I haven't checked version eight — you can only declare that you have multiple buffer pool instances. The commercial systems have way more knobs that you can change and tune, and they support all these different variations. In the back, yes? So your question is: if I have multiple buffer pool instances, how do I take a page ID and know which instance I should look in? Yeah, the easiest way to do it is just hashing, right? Say I have five buffer pool instances; for a given page ID, I hash it mod five and that tells me which instance I go to. And that mapping is fixed, so no matter what, the same page ID always maps to the same buffer pool instance. You could do something real simple too, like range partitioning — anything with page ID zero to 1,000 goes to this one, 1,001 to 2,000 goes to that one — but I don't think anybody actually does that. Hashing is more common because it handles the case where the database is always growing: as your range of page IDs increases, you don't have to reshuffle things around.

Okay, and again, the reason we want to do this is that we reduce contention on the latches as we go into the page table, and we can actually improve locality. If we see that certain queries have certain access patterns, then having separate buffer pool instances, say for indexes versus tuples, lets us give each instance different replacement or allocation policies tuned for each access pattern. We'll see examples of this in a second. Again, these are the kinds of things the operating system can't do for you, because it doesn't know what you're actually trying to do. Because we know what our pages actually mean — whether they're index versus data — we can allocate different buffer pool instances like this.

Another thing we can do that the OS can't always do correctly is prefetching. Prefetching basically says: given what my query is trying to do, I know I'm going to read certain subsequent pages. So rather than waiting until I try to access a page in the page table, see that there's no entry, and take a fault where I have to go fetch the page I need, I can start prefetching pages ahead of time, because I know my query is going to need them in the future. Say I have a simple query Q1 that's doing a sequential scan across the entire table, my pages are zero to five, and my buffer pool is empty. In the first step, I read page zero; it's not there, so I go fetch it in, and I do this as I scan along. But if I recognize that my query is essentially going to read every page, then at some point I recognize: oh, I know I'm going to read pages two and three. So while I'm retrieving page one and putting it in my buffer pool, let me go ahead and prefetch two and three and put those into my buffer pool too. That way, when my query keeps going, I don't have to stall, because the next pages I need are already there. Now, the operating system can do this in some cases, right?
It can recognize when you're doing a sequential scan across a large file, and it'll essentially do the same thing: it'll start trying to prefetch your pages. In this example the operating system probably could do this, because we're accessing a file sequentially. But not every access pattern in a database system is sequential like this. A very common case is index scans, or index range scans. Say we have an index and we want to run a range scan where we traverse the tree — it doesn't matter whether it's a B+tree or whatever data structure, it's a tree — get to some leaf page, and then scan along the leaf pages to find all the data that we need. The very first thing we have to do when our query starts is read the root index page, because we have to know our entry point into the index. So we fetch index page zero and bring it into our buffer pool. Then we jump down, maybe to this side of the tree, and fetch index page one. But now we're going to go down the other side of the tree and start scanning across the leaf pages. We're going to read page three and then page five, but the OS may think: oh, I bet you're going to read page two next, let me prefetch that for you — which is not actually what we want. We want pages three and five. The database system can recognize: I didn't go to page two, I want page three, and the next leaf page I'm going to scan across is page five, not page four, so let me go ahead and prefetch those two. The OS can't do this, because it doesn't know what's inside these pages. We know, in the database system, because we were the ones who built this data structure: we built the tree, we know what pages are in it, and we know what the query is trying to do — we have the SQL, we have the query plan — so we can infer from that what pages we think it's going to need. Now, this does mean we have to do some extra work, possibly maintaining extra metadata in the index to know what our neighbor pages are, so that when our rules recognize we're doing a scan along the leaf pages, we see that as an opportunity to prefetch. It's more work, but the benefit can be quite substantial, and range scans along an index like this are a very common access pattern.

The next kind of optimization we can do is called scan sharing. With scan sharing, the basic idea is that if we have queries running at the same time that are accessing the same data, then rather than treating every query as its own island and making decisions about what pages to fetch separately for each individual query, the database system can recognize that the queries are trying to do basically the same thing, or at least read the same data. It may not be the exact same query, but they're going to read the same data. Then we can have the queries piggyback off each other and reuse the data we've already pulled into our buffer pool, to avoid swapping things out unnecessarily and incurring more disk reads.
Right, so the way to think about this is: if I'm doing sequential scans, one query starts, and a second query comes along behind it and starts a little bit later, maybe I want the second one to pick up where the first one is already running and reuse the data it has already pulled in, rather than starting everything from scratch. The way we implement these scans — and we'll see this when we talk about query processing — is essentially a cursor for each query that keeps track of what pages it has read as it goes along the table scan, so it knows how far it has to go and what pages it's missing. This is supported, as far as I know, only in the major commercial systems. And this will be a recurring theme throughout the semester: we'll see that the very expensive enterprise databases do way more sophisticated things than the open source ones, just because they've had a lot of money and a lot of people spending time to make these things work as well as possible. As far as I know, scan sharing, which we'll see on the next slide, is fully supported in IBM DB2 and SQL Server. Oracle only supports what it calls cursor sharing, where you have to run the exact same query at the exact same time; if there's any minor deviation in the WHERE clause, or even in the SELECT output, it doesn't think the queries can read the same data and it treats them as separate beasts.

So to do scan sharing: the first query starts and has its cursor scanning through, and then the second query's cursor recognizes, I can run along with the first one, and it keeps track of what data it missed by not starting at the beginning, so that it can loop back around and pick up everything it missed. A simple example looks like this: say I have query one doing a sequential scan, and it's going to read every page to compute the sum of a value. When it starts, it's at the beginning and there's nothing in our buffer pool, so we fetch page zero, and we keep going down and get the next pages. Now we want to read page three, but we only have three frames in our buffer pool, so we have to evict a page. Assuming we're doing a simple replacement policy, we pick the page that was least recently used; that would be page zero. So we throw that away so that we can put in page three, the next page that we need. But during this time, query two starts, and it wants to execute a query that's not the exact same — it's computing an average instead of a sum — but it's still going to read all the same pages. If we started it from the beginning, what's the problem? The first page this query needs is page zero, but that's exactly the page we just evicted. So now we'd have to go back to our buffer pool, make space, and evict another page. And it sucks, because we just fetched that page, we just had it in memory, and we lost it, so now we have to pay for another disk read. Instead, with scan sharing, we can recognize that these two queries need to read the same data in the same pages. So Q2 just hops along with Q1 and reads the same pages: one disk read brings a page into the buffer pool, and there's a notification mechanism that says, all right, you two both need this page, it's now ready.
So then Q1 finishes because it has read all the pages, and Q2 recognizes: oh, I need to loop back around and get the pages that I missed at the beginning. Again, by being smarter about what we know queries want to do, we can avoid unnecessary disk reads in a way that would be difficult if you didn't know anything about what the queries are actually trying to do.

Another optimization is what's called buffer pool bypass. We'll see this in a second, but sequential scans are problematic because they pollute our cache with data that we may not need again in the future. In my example on the last slide, I was doing a sequential scan on the table: as I read a page, I brought it into my buffer pool, did whatever query processing I wanted on it, and then went on to the next page. But that page I just read is probably not going to be needed again in the future, and yet it's the last thing I just put in my buffer pool, and it may be evicting data that I actually do want to keep in there. I had to do that, because I had to make space for all the pages I'm reading as I do the scan. With buffer pool bypass, the basic idea is that the database system can say: I'm doing a sequential scan, and I know this data is probably not going to be needed by anybody else, so rather than paying the overhead of updating the page table and putting the page in a frame in the buffer pool, I'll just keep a local copy in private memory for my query, and when I'm done with it, I throw it away. This avoids having to update the page table and the buffer pool; I just update my own local copy. In Informix, these are called light scans. Postgres, as we'll see in a second, basically keeps its own private buffer per query as well if you're doing a sequential scan. I actually don't know whether it still goes through the buffer pool manager; I just know that it maintains its own little private buffer for every query.

Another thing we need to talk about now is the OS page cache. The operating system maintains, for the file system, its own cache of the pages it reads. If you ever run htop on Linux, you'll see how much memory your processes are using on the heap, but then there's also the OS page cache: copies of pages the OS has read from the file system. If you don't do anything special, when you open up a file in your program and read a bunch of data in, the OS makes copies of that data in its own file system page cache. (This is different from virtual memory.) Now the problem is, if we do this, every time I read a page into my buffer pool in my database system, the OS also keeps a copy of it in its page cache, so I'm using double the amount of memory for the same data. So most database systems turn this page cache off. The way you do it is you pass the O_DIRECT flag when you open a file, which says: I don't want this file to be backed by the file system cache, I'm going to manage it myself. This avoids the redundant copies of pages, and it gives you full control over the policies for how things get evicted out of memory. As far as I know, every major database system except for one tells you to turn off the page cache in the operating system, because they want to manage memory themselves.
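The flag itself is real; here's a minimal sketch of opening a data file on Linux with the OS page cache bypassed. One catch with O_DIRECT is that reads must use properly aligned buffers, sizes, and offsets, which is why the sketch allocates with posix_memalign (error handling omitted for brevity).

    #define _GNU_SOURCE  /* O_DIRECT is Linux-specific */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
      /* Tell the OS not to cache this file's pages; we manage them ourselves. */
      int fd = open("data.db", O_RDWR | O_DIRECT);
      void *buf;
      posix_memalign(&buf, 4096, 4096);  /* O_DIRECT requires aligned buffers */
      pread(fd, buf, 4096, 0);           /* read page 0 straight from disk */
      free(buf);
      close(fd);
      return 0;
    }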
The major system that I know of that does use it is actually Postgres. Postgres uses the OS page cache: it still has its own buffer pool, but it also relies on the OS page cache, and they argue that this is not a significant problem for them, because having that extra layer is one less thing they have to manage, and it has minimal impact on performance. So let's see how this actually works. All right, this is Postgres running on the machine back in my office; let me connect to it here. The first thing I want to show you is what the page cache actually looks like. If you run htop — it's kind of hard to see, I don't have a laser pointer — this green bar here is the resident memory of the processes running on the system, and all of this yellow stuff here is the OS file system cache. This machine has 32 gigs of RAM; the resident set size of the processes running on my machine is only five gigs, and the rest is being used by the OS page cache. We can flush this: if you run sync and then echo 3 into the special location in the proc file system (sync; echo 3 > /proc/sys/vm/drop_caches), it flushes all the page caches for the system. So now if I go back and look at htop again, my resident set size is still five gigs, but the total amount of memory my system is using is now exactly five gigs, because the OS file system cache is gone.

So let's go into Postgres. We're going to use the same two tables that I used last time to show you the difference between decimals and reals; I think it's like 10 million entries. First we're going to turn off parallel workers, so there's only one query running at a time. Then we're going to execute a query that does the same summation we had before. I'm going to use EXPLAIN ANALYZE, which actually runs the query and shows you the query plan, but I'm going to pass in this extra flag called BUFFERS, which shows you, of the pages we read to execute this query, how many were in the buffer pool cache and how many had to be read from disk. Actually, let me restart Postgres first, so there's nothing in the file system cache — sorry, nothing in its buffer pool cache. So at this point, I've flushed the file system buffers and restarted Postgres, so there's nothing in memory that it can reuse; everything is going to have to go to disk. So now, execute this query. Oh, it did a parallel scan, but that's okay; let me restart this, blow that away, come back here, turn off parallel workers, good, okay. Same query. So this part here tells you that to do this sequential scan on this table, it had to read 44,000 pages from disk, meaning there was nothing in the cache it was able to reuse. Now if I come back and execute the same query again — can I highlight this? Yeah, here — now it says hit 32. I'll explain what that is in a second: it's actually the private cache for this query, and it's always going to be 32 pages. The number of pages we read from disk went down by 32, because we had those 32 in our buffer pool. Now, we can also force the database system to fetch everything into its buffer pool: Postgres has this thing called pg_prewarm.
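Since it's hard to follow the demo without seeing the screen, here's roughly what the commands look like typed out. The table and column names are placeholders, and the example output line is illustrative, but the settings, the BUFFERS option, and the pg_prewarm extension are real Postgres features.

    -- Run with no parallel workers, so there's a single scan:
    SET max_parallel_workers_per_gather = 0;

    -- Run the query and report buffer pool hits vs. reads:
    EXPLAIN (ANALYZE, BUFFERS) SELECT SUM(val) FROM tbl;
    --   Buffers: shared hit=32 read=44252
    --   ("hit" = found in Postgres's buffer pool, "read" = fetched from outside it)

    -- Force the whole table into the buffer pool:
    CREATE EXTENSION IF NOT EXISTS pg_prewarm;
    SELECT pg_prewarm('tbl');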
pg_prewarm is a special extension that comes with Postgres when you install it. It basically says: pre-warm my buffer pool by reading everything in. So we execute that, and it comes back and tells us that it read 44,000 pages into our buffer pool. Now when I go back and run that same query, my hit ratio went up — the number of hits went up — but I still had to read some data from disk. Take a guess: what's the issue here? Why wasn't everything in the buffer pool? It's not large enough, exactly, right? When you set the database system up, you say how much memory you want it to use for the buffer pool; the database system has to be told this, because otherwise it would just take everything. That's why not all of the pages we wanted to access were in our buffer pool. So what we can do now is go back to the Postgres configuration. Postgres has a setting called shared_buffers, and we want to set it to the right size. We know we had to read 44,284 pages, and Postgres pages are eight kilobytes, so we want to set our buffer pool size to eight kilobytes times 44,284, which comes out to about 355 megabytes; let's round up and set shared_buffers to 356 megabytes. We've got to restart Postgres for that to take effect. Then we go back here, reconnect to our database, pre-warm the table, and now when I run my query — there you go; sorry, I don't know where my cursor is, there it is, right there, see? — the hits went to exactly what we wanted. We didn't have to read anything from disk; everything was in our buffer pool. The reason I like using Postgres for this is that it's a textbook implementation of a database system.

Yes? So you're asking: I used htop and showed how the memory was partitioned between the applications and the file system cache, so which part is the buffer pool? Okay, yeah, so his question is: I showed this view of htop before, and I said the green bars were the resident memory of the processes. Where's the buffer pool? Is it in the resident memory, or is it in the file system cache? It is in the resident memory. Postgres is literally calling malloc inside the process, and that shows up as part of the green bars here. It doesn't have any control over the file system cache; that's the other bars, and it's all managed by the operating system. The operating system decides how it wants to split up the memory between the file system cache and the processes; Postgres doesn't have any control over that, the OS just does whatever it wants to do.

So now I want to show you how we know that the database system is hitting the file system cache, in the case of Postgres. Let me go back here and look at how fast this query runs.
So let's just run it: it takes about 700 milliseconds, and if I run it multiple times it always takes roughly the same amount — now it's actually getting faster; all right, there it goes — it fluctuates. Now, if I restart Postgres, this blows away the buffer pool in the process. I go back here, turn timing back on, make sure I'm connected, turn off parallel workers. At this point I've restarted the database, so the buffer pool is empty; there's nothing in it. I'm not going to run pg_prewarm, I'm just going to run the query and see how fast it is. If it's the same speed as that 740 milliseconds we saw before, then we know it's hitting the file system cache. And it's only a little bit slower, right? Remember, when I blew away the file system cache — I actually did this at the beginning — it was roughly 1.2 seconds; now it's around 700 milliseconds. There was nothing in the buffer pool, but the operating system had the pages cached, so the database system could go ahead and read them from there. Again, this is specific to Postgres: most database systems do not rely on the page cache of the operating system, because they want to manage all memory themselves, okay? And again, we can use EXPLAIN ANALYZE with BUFFERS, and it tells us what fraction of the page accesses hit the buffer pool. In this case, since I didn't run pg_prewarm, it still had to read a lot from the file system. Actually, that execution time is pretty slow — maybe because we're running ANALYZE; we'll look at that later, okay. All right. These demos are just meant to show you that the things I'm talking about here are behaviors you can actually observe, without reading the Postgres code, and that they match up with the policies I've described.

All right, so the next thing we want to talk about is how we actually make decisions about which pages to remove from our buffer pool. It's super easy when we have enough memory, because we just fetch pages in and never worry about making space. But obviously, again, our database may be larger than the amount of memory we have, so we have to make decisions about how to make room for new pages by evicting pages from the buffer pool. There are a couple of trade-offs, or things we care about, in a replacement policy. One is obviously correctness, and by correctness I mean that we don't want to throw away data that a transaction or query just updated before it actually makes it out to disk, because then we lose those changes. We also want our replacement policy to be accurate: we don't want to do the dumbest thing and evict the most important page, the one we're going to need right away; we want a good estimate of which pages we're not going to need, so we can throw those away. We want the policy to be fast, because, as I said, when I do a lookup in the page table and my page is not there, I have to decide which page to remove and take a latch there. So I'm holding a latch while I'm running this replacement policy, and if my replacement policy is NP-complete and takes days to run, that's bad, because I'm holding that latch the entire time.
So you want this thing to be as fast as possible, and of course we want low metadata overhead, because we don't want to store a ton of extra data, possibly larger than the page itself, just to figure out which page to remove. This cache replacement, or buffer replacement, policy is one of the oldest problems in CS; the literature goes back to the 1950s and 1960s. There are some smarter things we can apply from newer work — and newer here means like the 1990s — than the general approach everyone uses, and we'll go over those better ways of implementing this at a high level.

The most commonly used policy is LRU, or least recently used. All we really do here is maintain a timestamp of when each page was last accessed by a query, and when we have to select a page to evict from our buffer pool, we choose the one with the oldest timestamp. To speed things up, we can keep the pages sorted by timestamp in a list: every time a page is accessed, we pop it out and move it to the tail, so that when it comes time to evict something, we just look at the head of that list, and those are the pages we want to remove.

Another approach that's commonly used is called clock. You can think of clock as an approximation of LRU; I know it's used in a couple of systems, and it's roughly what Linux uses for its page replacement policy. The basic idea is that you maintain a single reference bit for every page that tells you whether it has been accessed since the last time you checked it, and every time the page is accessed, you set that bit to one. Then there's a clock hand that goes around this circular buffer of pages and checks each page to see whether it was accessed since the last time the hand came around. Say we have four pages, one through four. Every page has a reference bit, initially set to zero. When a query accesses a page, we flip its bit to one. The clock hand moves around in a circular fashion, and every time we need to pick a victim, it looks at the next page and checks the bit: if the bit is zero, the page can be evicted; if it's one, the page is kept, but the bit is reset to zero. So at the very beginning, the clock needs to pick something, and it's pointing at the first page. Its bit is set to one, so I know I don't want to throw it away, but I reset the bit to zero. Then I come down to the next page, and its reference bit is zero, so I know it has not been accessed since the last time I came around; it's safe to evict it and replace it with another page. Then I keep going around the clock like this. Say pages three and four have been accessed: I come around, keep page three, keep page four, but page one has not been accessed since the last time I checked, so I go ahead and throw that one away. It's an approximation of LRU: it essentially gives you the same information, but instead of maintaining exact global timestamps of how things were accessed, it just records whether a page was accessed since the last time the clock hand checked it. That's less metadata and less overhead than maintaining the linked list, and a lot of systems do this because it's pretty easy.
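Before moving on, here's a minimal sketch of the list-plus-hash-map bookkeeping described above for LRU (clock replaces all of this with one reference bit per frame). The names are illustrative, not a required design: the hash map stores each page's position in the list so the move-to-back on every access is O(1).

    #include <cstdint>
    #include <iterator>
    #include <list>
    #include <unordered_map>

    struct LRUPolicy {
      std::list<uint64_t> order_;  // front = least recently used, back = most recent
      std::unordered_map<uint64_t, std::list<uint64_t>::iterator> pos_;

      void RecordAccess(uint64_t page_id) {
        auto it = pos_.find(page_id);
        if (it != pos_.end()) order_.erase(it->second);  // unlink old position
        order_.push_back(page_id);                       // now most recently used
        pos_[page_id] = std::prev(order_.end());
      }

      bool Evict(uint64_t *victim) {  // choose the least recently used page
        if (order_.empty()) return false;
        *victim = order_.front();
        order_.pop_front();
        pos_.erase(*victim);
        return true;
      }
    };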
The problem with LRU and clock is that they're susceptible to an issue called sequential flooding. This is the example I was showing before: if we're doing a sequential scan over the entire table, the page I just read as I scan along is actually the most useless page for me to keep. And if I have a bunch of other pages that are accessed a lot by other queries, a sequential scan will go through and blow all those pages away, because it fetches in a bunch of pages and then never needs them again. You're essentially polluting the buffer pool with pages that you need once and then never again. So in this case, LRU is actually the exact opposite of what you want; the policy you really want is most recently used, at least in the case of a sequential scan.

To illustrate this: I have one query that accesses a single tuple, say in page zero. I fetch that page, bring it into my buffer pool, and I'm done; query one goes away. Query two starts executing, does a sequential scan, and fills up the buffer pool with the pages it reads. When I get to page three, I need to decide which page to evict. If I'm using LRU or clock, it'll be page zero, because that's the least recently used one. I go ahead and evict it. But now another query comes along that does the exact same thing the first one did, and it wants page zero, but that just got evicted, so it has to evict something else to bring page zero back in. Again, I have to do an additional disk read for something I just threw away. The way to avoid this is to special-case the sequential scan: recognize that it's doing a sequential scan and is unlikely to read that data again. Now, if another query comes along and does the same sequential scan, that's unavoidable; we'll have to fetch those pages in again. But if I have a skewed workload where I'm always accessing the first page, or some small number of pages, I don't want to evict those.

So three different ways to handle this are LRU-K, localization, and priority hints. LRU-K is basically like LRU, but rather than keeping a single timestamp of when a page was last accessed, I keep a history of its last K accesses. From that history I can compute the intervals between accesses — the page was accessed at one point, and then how long it took before it was accessed again. The idea is that I can use this history to estimate the likelihood that the page is going to be accessed in the near future, and if a page's expected time until its next access is longer than everyone else's, that's the one I actually want to evict. So in my example from the last slide, the first and third queries were both accessing the first page.
So if I keep a history of that, I can recognize: oh, I'm going to access this page over and over again, and I know roughly how long it is between accesses. Then that first page is considered more important and should stay in memory, whereas the other pages were used for the sequential scan, and it'll be a long time before they're accessed again, so I can go ahead and evict those. The K in LRU-K basically says how much history you want to maintain. You could keep an infinite history, but that's a large storage overhead, so typically people keep a small K, like two, and it works reasonably well.

Another way to handle this is called localization, where, rather than letting the sequential scan go to the global buffer pool and pollute it with the pages it reads, you confine the query to a subset of the buffer pool and don't let it pollute the rest. In the example I showed in Postgres, the number of hits was only 32. That's because when you execute a sequential scan, Postgres only gives you a 32-page local buffer to put your sequential scan pages in, and that avoids polluting the entire buffer pool. Some systems, like Informix, use private memory that's not in the buffer pool; I think Postgres actually backs these local buffers with pages in the buffer pool as well. One way is not strictly better than the other: avoiding the page table is faster because you don't have to take latches, but at a high level, the effect on the buffer pool is roughly the same whether the buffer is local or global.

The last approach is to use priority hints, and this is where you tell the buffer pool something about the context of how pages are accessed, or what's actually inside of them. We saw a little bit of this already when we talked about having different buffer pool instances allocated for data pages or index pages. We can go even more fine-grained than that and say: well, it may be an index page, but how is it actually going to be accessed in the future? This allows the buffer pool manager to make decisions about whether one page should be evicted over another. Say we have an index, with index pages for our nodes, and a query that wants to do an insert, where the primary key is auto-incremented or monotonically increasing: the first tuple has ID one, the next has ID two, and so on, always counting up. When we do these inserts, we're going to end up always inserting on this side of the tree, because if placement is based on the ID, all the new entries end up on this side. So we can provide hints to the buffer pool manager and say: we know these pages are the ones that are going to be accessed every time we do a new insert, because they're always on this side of the tree, so they should have higher priority than other pages in the index. Again, since I know I'm always going down that side, I don't want to have to do a page fetch every single time I do an insert. Another thing you can do: say I have queries that are just doing random key lookups anywhere in the tree.
In that case, maybe all the nodes are accessed uniformly, but there's always one node, or a subset of nodes, that I know I'm going to touch every single time I do a lookup — and that's the root, because I can't get into the index any other way. So we can essentially tell the buffer pool manager: for this page, don't even bother recording how often it's accessed; I know it should always be in memory, because every lookup goes through it. You can do even more complicated things: you can recognize that I have five indexes, but everyone's always going to this one, so its pages should have higher priority than all the other indexes'. This is what the commercial systems do: they have all sorts of extra hooks and extra information that they provide to their buffer managers, allowing them to make better decisions than the open source systems. This is part of what separates the expensive enterprise systems from the open source ones. Open source systems like Postgres and MySQL are still very good, but the buffer pool management policies in the commercial ones are much more sophisticated, much more complex.

The next thing I want to talk about, which we've been ignoring this entire time, is how we deal with pages that have been modified — dirty pages. Everything I've shown so far, I just assumed: I need to read a page, I need to make space, I fetch another one and just throw a page away. The page in the buffer pool that I want to replace, I don't need to worry about storing anything; I just drop it and take its frame. But in real systems, obviously, we're going to update data, which means we're updating data in the pages, and that means we have to write them back out to disk. So this is another example of the trade-offs we have to make in a replacement policy. The easiest thing to do would be to only choose pages that are not dirty and drop those. But at some point you're going to have a buffer pool full of dirty pages, and you have to make a decision. Or there may be pages, like the root of an index, that I read all the time: I may not update it much, but I don't want to drop it, because every index traversal goes through it. So there's a trade-off when I evict a page: do I do the fast thing and drop a page that's clean, or do I write back a page that's dirty? The latter is more expensive, but it may be a page I know I'm not going to need in the future. Again, if I choose a page that's dirty, I have to write it out. Now, there's a whole other issue we have to deal with about logging and making sure we write pages at the right time, but we'll cover that later in the semester.

One way to preemptively take care of this problem, so that when we need to evict a page we don't have to immediately write it back, is background writing: we periodically walk through the page table, find any dirty pages, and make a decision about whether to write them out to disk ahead of time. When we do that, we can decide whether to evict the page right then and there, or keep it in memory and just flip its dirty bit to clean, so that when the eviction policy runs again, it now has a clean page that it knows it can throw away.
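Here's a minimal sketch of that background-writing idea, reusing the illustrative names from the earlier sketches and ignoring, for now, the write-ahead-logging constraint that comes up next; FlushPageToDisk is an assumed helper.

    void FlushPageToDisk(uint64_t page_id, const char *frame);  // assumed helper

    // Periodically walk the page table and clean dirty, unpinned pages, so the
    // eviction policy later finds clean pages it can drop without a write.
    void BackgroundWriter(BufferPool &bp) {
      std::lock_guard<std::mutex> guard(latch_);
      for (auto &entry : bp.page_table_) {
        PageMeta &meta = entry.second;
        if (meta.is_dirty && meta.pin_count == 0) {
          FlushPageToDisk(entry.first, bp.Frame(meta.frame_id));
          meta.is_dirty = false;  // now clean; cheap to evict later
        }
      }
    }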
All right, so Postgres does this; I don't know if MySQL does, but a bunch of the commercial systems do the same thing. The issue, which we'll get to when we talk about logging and recovery, is that we can't write a dirty page out to disk until the log record that corresponds to the change made to that page has been written out to disk, and there's a whole process for figuring out which pages are safe to write based on what you've written to the log. And then of course there's a contention issue: I may be trying to write data out to the log as fast as possible, and I don't want the background writer to interfere with that, but I also want my background writer to write out dirty pages, because that frees up frames in memory. How you balance all of these things is really tricky, and there's no magical way to do it. People have default settings and you can tune them a little bit, but it's very application-specific and very difficult to get right.

All right, to finish up, the last few things to talk about are allocation policies and the different types of memory pools you can have. Allocation policies are basically how we make decisions about how much memory to give to different threads or different queries: do we make these decisions globally for the entire system, or do we have localized policies that say, for this given transaction or this given query, you're allowed this amount of memory, and we don't care about anybody else? Again, there's a trade-off in trying to make decisions that maximize overall performance without punishing any one query too much. And again, there's no magic way to do this; these are things to keep in the back of your mind when you make decisions about how to implement your buffer pool manager.

The last piece to talk about is the other memory pools running inside the system. Everything we've talked about today is just: I need to read a tuple, so I go bring its page into my buffer pool. But there's all this other stuff we have to maintain in memory, which may or may not be backed by pages on disk, that we now have to consider as well. When we do query processing — we'll see this when we cover sorting and joins — those operations obviously need memory, and we have to decide where we're actually going to put that memory. Some things will just be allocated on the heap, usually in a memory pool or an arena for a given query or thread, but other things you actually want to back with the buffer pool, so that if your query needs more memory than is actually available, the buffer pool can swap those things out to disk. A lot of times you'll see systems where you run a query and it runs out of memory; that's usually because the intermediate results are not backed by the buffer pool, so the system can't write them out to disk. In Postgres, some things are backed by disk, like hash joins, but there are other cases — I don't have a query to show you right now, but there are ways of, say, concatenating a bunch of strings together — where it just runs out of memory, because that data isn't backed by the buffer pool, it's just stored on the heap. So the main takeaway from this is that different systems do different things.
Sometimes all these different types of memory pools will just be on the heap, and sometimes they're actually backed by the buffer pool. There's no one way that's better than another, because if something is backed by the buffer pool, then you have to update the page table and maintain all that extra bookkeeping. If you just malloc on the heap, that's super fast, but then you don't have full control over the amount of memory you're actually using.
Okay, so the main takeaway from this lecture is that the database system is always going to do a much better job than the operating system at managing memory. You don't want to use mmap, you don't want to rely on virtual memory; you want to do everything yourself, because this allows you to leverage the information you have about the queries you're executing to make better policy decisions about how you move data back and forth, okay? And with the exception of Postgres, most database systems do not rely on the file system cache; they do everything themselves.
Okay, so this lecture is a nice segue into the first project. In the first project, you're going to build a buffer pool, all right? The project is going out today; I'll post it online immediately after class. It has three parts: you have to build a hash table, a replacement policy, and the actual buffer pool manager itself. This year, all of the projects are based on SQLite. We looked at trying to use Postgres; it was just more complicated and we didn't have enough time. Basically, SQLite is going to be the front end to the database engine you're building, through what's called a virtual table. So what happens is you'll start up SQLite and say, create me a table using my virtual table extension, and then any reads and writes will go to the engine that you're building. For the first project, you don't actually need to use any SQLite, and you won't need to look at the internals of SQLite for the entire semester; we're hiding all of that from you. You work in your own C++ code base. The project will be due in two weeks, on September 26th, and you'll submit it on Gradescope. Again, we'll put all of this online later today so you can at least get started.
So the first task is to build an extendible hash table. I'll explain what an extendible hash table is next class, but you're going to end up using it as your page table. It's a hash table that can grow: as you insert new data and create new pages, your hash table can grow with them to keep track of these things. You don't need to support shrinking, because that's actually harder to do, but you need to support growing, and the basic extendible hashing algorithm will do that for you. You're allowed to use some of the things that are in the STL; this is all C++11. You can use std::hash as your hashing function, and you can use std::mutex to protect the table, since it needs to be thread-safe. But you can't just use std::map; you have to build your own hash table, okay? One kid tried to do that last year, don't do that.
The next thing you have to build is the replacement policy. You have to build LRU, but because of the way we abstracted out the APIs, you don't actually have to worry about whether a page is pinned or not. You just know that these are the pages that are eligible, and you have to make a decision about which ones you actually want to evict. Okay?
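As a sketch of what such a replacement policy can look like, here is a minimal thread-safe LRU replacer in C++11. The interface (`Insert`, `Erase`, `Victim`) is just an illustration of the idea, not the actual API from the project handout.

```cpp
#include <cstdint>
#include <list>
#include <mutex>
#include <unordered_map>

// Minimal thread-safe LRU replacer over frame ids. The most recently
// used frame sits at the front of the list; victims come from the back.
class LRUReplacer {
 public:
  // Mark a frame as recently used (insert it or move it to the front).
  void Insert(int32_t frame_id) {
    std::lock_guard<std::mutex> guard(latch_);
    std::unordered_map<int32_t, std::list<int32_t>::iterator>::iterator it =
        index_.find(frame_id);
    if (it != index_.end()) lru_.erase(it->second);
    lru_.push_front(frame_id);
    index_[frame_id] = lru_.begin();
  }

  // Remove a frame from consideration (e.g., because it got pinned).
  void Erase(int32_t frame_id) {
    std::lock_guard<std::mutex> guard(latch_);
    std::unordered_map<int32_t, std::list<int32_t>::iterator>::iterator it =
        index_.find(frame_id);
    if (it == index_.end()) return;
    lru_.erase(it->second);
    index_.erase(it);
  }

  // Pick the least recently used frame to evict; returns false if empty.
  bool Victim(int32_t* frame_id) {
    std::lock_guard<std::mutex> guard(latch_);
    if (lru_.empty()) return false;
    *frame_id = lru_.back();
    index_.erase(lru_.back());
    lru_.pop_back();
    return true;
  }

 private:
  std::mutex latch_;
  std::list<int32_t> lru_;  // front = most recent, back = least recent
  std::unordered_map<int32_t, std::list<int32_t>::iterator> index_;
};
```

Pairing the list with a hash map of iterators is what makes every operation O(1): the map finds a frame's position in the list without scanning it.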
This part should be pretty straightforward. Then you take the LRU replacement policy you implemented, plus your extendible hash table, and build your own buffer pool manager. We will take care of reading and writing data from disk: we provide an API where you say, give me this page, and it knows where on disk to go get it. But you then have to make a copy of the page into one of your frames and map the page ID to it in your page table. The tricky thing about this, which fouls up students, is getting the order of operations correct when you do pinning: if you do a lookup and the page isn't there, and you then go claim a slot, a frame, you have to always set the pin first, because otherwise another thread could come in and take your frame. Yeah, this is the more complicated part of the project, since it's putting everything together. But again, we're only giving you two weeks because it's actually not that bad, in my opinion.
The way to get started: when we post it online today, download the source code. We'll give you a tarball with a bunch of stub files where you implement your actual code. The code should compile as soon as you download it, right? So test that right away. There are APIs and functions you have to implement, with documentation inside them. The tests will fail, because obviously you haven't written any code yet, but you should be able to compile it. So do that right away and check your development environment. We've tested the Andrew machines that are in Gates; everything should compile there, and there are no crazy dependencies to make this all work. We know it works on OS X, and we know it works on Linux. Although I have a Windows laptop, I don't use it for anything other than PowerPoint, so I don't know whether it compiles on Windows 10 with the Ubuntu package. I think a student got it to work last year, but I don't know whether it worked out of the box. If you don't have access to a Linux machine, or a machine you can actually compile this on, email me; we can give you a virtual machine image if you want to run it that way. But everyone should have access to the Andrew machines if you're a student here, and it should work there.
The instructions will lay out exactly which files you need to modify. You should not modify any other files in the tarball we give you, because when you upload to Gradescope, we're going to pick out the six files we asked you to modify and overlay them onto our clean code base. So if there's something you had to modify to make your thing work that isn't in those six files, it will fail when we actually run it, because we're not going to copy those changes in, right? The other thing to say is that the four projects in this class are all cumulative, so you need to get this one working before you can do the subsequent ones, right? You can't build an index unless you have a buffer pool; you can't do concurrency control unless you have an index. Because of this, we will not be providing solutions. In cases like last year, where some students had problems due to medical reasons, we can give you an obfuscated binary that implements some of the files you may be missing, so that you can keep up, right?
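To illustrate the pin-ordering pitfall just described, here is a hedged sketch of a fetch path; the types and helpers (`Frame`, `FindFrame`, `EvictVictim`, `ReadFromDisk`) are hypothetical stand-ins, not the project's actual API. The key point is that the pin count is bumped while the page-table latch is still held, so no other thread can steal the frame in between.

```cpp
#include <cstdint>
#include <mutex>

// Hypothetical frame metadata; illustrative only.
struct Frame {
  int32_t page_id = -1;
  int pin_count = 0;
  bool is_dirty = false;
  char data[4096];
};

// Sketch of the fetch path. The crucial detail: we increment the pin
// count *before* releasing the page-table latch, so another thread
// cannot evict or reuse this frame out from under us.
Frame* FetchPage(int32_t page_id, std::mutex& table_latch,
                 Frame* (*FindFrame)(int32_t),  // lookup in the page table
                 Frame* (*EvictVictim)(),       // run the replacement policy
                 void (*ReadFromDisk)(int32_t, Frame*)) {
  Frame* frame;
  bool needs_read = false;
  {
    std::lock_guard<std::mutex> guard(table_latch);
    frame = FindFrame(page_id);
    if (frame == nullptr) {
      frame = EvictVictim();  // claim a frame (flushed first if it was dirty)
      if (frame == nullptr) return nullptr;  // everything is pinned
      frame->page_id = page_id;
      needs_read = true;
    }
    frame->pin_count++;  // pin while still holding the latch!
  }
  // The slow disk read happens outside the latch; the pin protects the frame.
  // (A real implementation also has to handle two threads fetching the same
  // missing page concurrently; this sketch ignores that for clarity.)
  if (needs_read) ReadFromDisk(page_id, frame);
  return frame;
}
```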
But for the most part, we set these deadlines so that you keep implementing things and have enough time to get started on the next project, okay? So if you have questions about basic things, the high-level ideas, or where to find certain things in the code, post them on Piazza. Do not post on Piazza, "I get a segfault, I don't understand why," right? We're not going to help you debug. We have tutorials online, which we'll send you, on how to use GDB and how to do basic things with C++11, but we're not going to come sit and look at your code with you, okay? The last thing I'll say is: please do not plagiarize. We will run the MOSS plagiarism detector. Please also do not post your code publicly on GitHub, because we don't want other people just taking it, right? You can set up a private repo if you want. I realize a lot of you want to be able to show your employers, here are the projects I've done. That actually doesn't matter, right? They don't care, because you're all implementing the same thing; it's not like you're going to have some magical breakthrough in the little code you're writing for our projects, okay? So don't post things on GitHub, don't cheat, don't copy from each other. If you have questions, email me. And then next class, hash tables, okay? Have a good weekend.