DJ Mushu, thank you as always. I was at Greasy Jane's last night. Greasy Jane? Yeah, I banged my knee. It's alright. Okay, next time you see her, tell that fool she owes me money. Who doesn't owe you money? That's your problem. Okay, look, I thought Greasy was literal. You know? Yeah? Not figurative, but I guess it was both. So that's on me. That's on me. Stop giving people money. It's weird, okay? I mean, I didn't know the IRS didn't take gift cards. I mean, maybe they did. You're too trusting. Okay, alright. That's on you. Alright guys, let's get started. So, quick administrative things. Homework two and project one went out early this week. Homework two will be due on Sunday, September 25th. Project one will be due on October 2nd. Again, as with all projects this semester, I encourage you to get started sooner rather than later. We have a Q&A session — it should be Thursday, not Tuesday. Not this Thursday, but next Thursday, on the 22nd, at 8 p.m. Again, that'll be on Zoom, and it'll be a quick introduction about what the project's about and what you're required to do, and then you can ask questions. The other thing we do in this class is that all the projects are due on Sundays, and we specifically don't have office hours on Sundays, because we don't want them to be used as a crutch — like, oh, I'll just come to office hours the day it's due and try to figure things out. So we have special office hours on the Sunday — or sorry, the Saturday — before the four programming assignments are due, and this will be multiple TAs so that the queue is not super long. So the special office hours on Saturday will be held on October 1st at 3 p.m., and we'll post on Piazza where this will be. This will be on campus. Okay? Any questions? Yes. We know that everyone's going to try to wait until the very end. We're telling you not to. All right? And so, correct, yes. And then we have multiple TAs because we know everyone's going to show up on that one.
And that's why there's... Well, all right, look, what else are you going to do, right? This is actually a holdover from the pandemic. We did this during the pandemic and it seemed to help. Okay, but we're not going to have office hours on Sundays when things are due, okay? That at least forces you to figure something out before the day it's due. All right, so last class — the last couple of classes — we spent a lot of time talking about what the database looks like on disk. And so now we're going to talk about what happens when you bring these pages that are in the database back into memory, and how we're going to facilitate that process. And remember that most database systems can't operate directly on data while it's on disk. There is some hardware that can do that, but as far as I know, no system actually does that right now. Basically, Samsung will sell you an SSD that has ARM cores on it and you can run queries down on those ARM cores, right, directly on disk. Ignore all of that. This is the classic, canonical way you would implement this. Again, because we're a von Neumann architecture, we can't do anything with data while it's on disk; we can only operate on it in memory. So we're going to talk about how to facilitate that movement back and forth. And the ultimate goal we're trying to achieve here is to build a software system — a database system — that can manage a database that exceeds the amount of memory that's available to it. So the two things we need to consider in deciding how we're going to do this movement back and forth are spatial control and temporal control. Spatial control has to do with the actual physical layout of the data on the storage device. And this goes back to the idea we talked about before: we want to maximize the amount of sequential access.
So we want to organize things as they get written out so that pages that are related reside on disk as close together as possible, so that if I need to go fetch a range of data, that data is going to be in contiguous blocks on the storage device rather than a bunch of different random locations. And likewise, going in the other direction, if I need to write it out, sequential writes are always going to be faster than random writes. The other thing we have to consider is temporal control. So this is: at what point in time, while we're trying to do something in the system, should the database read data from disk, and when should it write data out to disk, right? And the tricky thing is going to be — and I said this before about why we don't want the OS to do this for us — that the database needs to be aware of who made a change to a page. Like, what query made a page dirty and modified something, and at what point is it safe for us to go write that data out to disk. This will come up later when we talk about transactions, but it's just reiterating the idea that we don't want the OS to do this; we know what's going on, so we have to take full control — we being the database system. Because we'll get paid lots of money to build this. All right, so this is the overall architecture that we talked about.
And again, the last three classes were down here — the pages on disk — and now we're talking about up in here, inside our database system, inside the process. We have something we're calling the buffer pool, some blocks of memory that we can use, and then other parts of the system that we haven't talked about yet, like the execution engine. They're going to run queries that need to access data, and somehow they're going to say, hey, I know I need page two. That goes to the buffer pool, asks for page two, and the buffer pool has to say, okay, well, I need the page directory because that'll tell me where page two is located on disk, and once I have that in memory, then I can jump to whatever location it is, make the right syscall to the operating system to get that page, and we copy it into memory in our buffer pool. So then what we hand back to the execution engine — again, thinking of the layers that we talked about before — is a pointer to page two. Right, and how exactly we got it into memory, the execution engine doesn't need to know, doesn't need to care. Not entirely true, but for the most part, right? So the key thing you need to understand about what we're going to do here — and this is different than what you might think of with memory-mapped files from the OS — is that if we evict page two because we don't need it right now, we discard it, but if the execution engine comes back and says, hey, I need page two again, the buffer pool is allowed to put it into a different location in memory, right? So again, this is different than the mmap stuff, where you'll get the same virtual address for some page in a file over and over again. In our world, the virtual address can be different, because the database can decide: okay, well, this is what I have in memory right now, here's a frame I can free up by evicting its page because it's not high priority, so therefore let me go put page two there.
It doesn't require you to put it back in the same location over and over again. So in today's class, we're going to talk at a high level about what a buffer pool manager is. And I'll also say that I'm going to call it the buffer pool manager; I think the textbook calls it the buffer manager. Sometimes it's called the buffer cache or page cache. These are all essentially the same thing. It just means that it's the memory manager of the database system that's specifically for moving pages in and out from disk. Then we'll talk about replacement policies, and then we'll talk about other types of memory pools you can maintain in your database system. And we'll finish up today's lecture with a quick intro to what project one will be about. Okay, so we've kind of covered this at a high level already, so let's go down into more detail. So the buffer pool is going to be some memory region in our database system that's just an array of fixed-size pages. Maybe the hardware page size is four kilobytes; the database system can have four-, eight-, or 16-kilobyte pages — any size pages it particularly wants. And in this array, because the entries are fixed length, each offset is a location where we can store a page that we bring in from disk. And so when a page is in memory in our buffer pool — a free slot or free location; I used the word slot, but I don't want to say that — a free entry in our array is what we're going to call a frame. And it's called a frame because we can't use slot (that's taken by slotted pages), we can't use block, we can't use page. We're going to call it a frame. What? Yes? The question is: if it's fixed-size pages, how do we account for variable-length page sizes? You'd have to have a separate buffer pool for each of those different sizes. We'll see examples of this.
In the MySQL example — yeah, in the MySQL example, it would have to have separate pools: here's where I put two-kilobyte pages, here's where I put four-kilobyte pages, and so forth. You allocate space for each of those. All right, so when the database requests a page, we're going to go out to disk, get it — just using a file system operation, whatever you want to use — and then we're going to copy it into our buffer pool in a free frame. And then if we need another page, we do the same thing; we just copy it in. So now, though, we need to expose the ability, or have the capability, to let another part of the system ask for a page by its identifier and then give them back that pointer. The indirection layer we're going to use for that is called the page table. There was a question — we'll get to this later. If something modifies a page that's in the buffer pool, we're not going to immediately write it back, as I said, because we have to wait to make sure that this is safe. So in OS parlance, or caching parlance, this would be called a write-back cache. I'll write to a page, dirty it, but it's not immediately written back to the backing store; I'll do that at some later point. So in my example before, I showed where the execution engine says, hey, give me page two, right? I didn't say how exactly we figured out which frame had page two. And this is going to be handled by what is called the page table. This is going to keep track of all the pages that are currently in memory, and it can route a page ID request to a particular frame. And again, there's no guarantee that the ordering of the pages in our buffer pool frames is going to match how they exist on disk in the file. So there's some additional metadata we have to maintain in our page table to keep track of what's going on with our pages.
So the first thing we can have is a dirty flag that basically says: this page has been modified by a query, and therefore don't write it out to disk just yet unless we know it's safe. And again, you can store this either in the page itself or in the page table. It doesn't matter; different systems do different things. The next thing you're going to have is what is called a pin, or a reference counter. And the idea here is that if my query needs to do something on this page, I can set the pin flag on it, and that prevents the buffer pool manager from evicting it. So say I'm running a query and I need to read this page: I bring it into my buffer pool, I can then pin it, do whatever read I need to do, and then when I'm done with it, I unpin it. And this prevents the buffer pool manager from swapping it out for another page until I'm finished. It's a simple, coarse-grained mechanism. You can have a pin counter that says how many queries have this thing pinned, and increment and decrement it as needed. But a simple Boolean flag is enough. So then, if I come along here — say now I have a query that wants to read, you know, get something out of this page — I want to acquire a latch on the page table, and this is to prevent another thread, or another process, or another worker in my system from stealing this frame or this location in the page table and putting something else in it before I can finish. So, assuming that we're running in a multi-threaded system, we have to make sure that threads don't overwrite each other and cause problems with race conditions. So we use these low-level synchronization methods, like pinning and these latches, to prevent other threads from doing things and clobbering us. Yes. So this is a latch, not a lock. I'll explain what that is in a second. But think of it like a lock. So it's a mutex here that's going to prevent other threads from modifying the entry in this page table.
So then we go copy what we need, update the page table to now point to it, and then we release the latch. The pin prevents the page from being swapped out and evicted — they solve different problems. Yes. So this question is: if the buffer pool is a write-back cache, it's in memory, so it's volatile. So upon a crash, do we lose the contents of what's in memory? Yes. And there are correctness reasons for why we handle it that way. Yes. So this question is: why is page two put here instead of right in between page one and page three? Say it's a hash table — then your location is random. Yeah. Because again, I can have a billion pages. I don't need to go in order, because they're coming in at different times as well. Like, this thing could be page one, page three, and then page a billion, and then this is the only free frame, so I take that one. Yes. His statement is that the page table is essentially a hash map or hash table that stores the metadata about what's in memory. More or less, yes, at a high level. But some of the metadata you could store in the page itself, or you could store it in the hash table. So I want to bring up the thing that he asked when I sort of corrected him. I'm wearing a mask so you can't see me smile, because this happens every year: the difference between a lock and a latch. So if you're coming from an OS background, what I'm going to refer to as a latch is referred to in the OS world as a lock. And in the database world, a lock is a higher-level construct or protection mechanism to synchronize threads or queries running at the same time on the higher-level concepts or objects in the database. So I can take a lock on a tuple. I can take a lock on a table, or a lock on a database. You can take locks on pages and other physical constructs. But the idea here is that I would take a lock during a transaction on, say, a tuple, and I would hold that lock throughout the entire transaction.
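The pieces just described — the dirty flag, the pin count, and the page-table latch — fit together roughly like the sketch below. This is a minimal illustration in C++ (the language the projects use), not BusTub's or any real system's API: the names, the single coarse latch, and the pick-any-unpinned-frame eviction shortcut are all assumptions made for clarity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

struct Frame {
    int32_t page_id = -1;   // which disk page currently lives here (-1 = free)
    bool    is_dirty = false;
    int     pin_count = 0;  // >0 means some query is using it; do not evict
};

class BufferPool {
public:
    explicit BufferPool(size_t num_frames) : frames_(num_frames) {}

    // Returns the frame index holding page_id, pinning it so it
    // cannot be evicted until UnpinPage() is called. Returns -1 if
    // every frame is pinned and nothing can be evicted.
    int FetchPage(int32_t page_id) {
        std::lock_guard<std::mutex> latch(latch_);  // latch, not lock
        auto it = page_table_.find(page_id);
        if (it != page_table_.end()) {
            frames_[it->second].pin_count++;        // hit: already in memory
            return it->second;
        }
        int victim = FindUnpinnedFrame();
        if (victim < 0) return -1;
        // (A real system would write back the victim if it is dirty,
        //  then read page_id's bytes from disk into the frame here.)
        page_table_.erase(frames_[victim].page_id);
        frames_[victim] = Frame{page_id, false, 1};
        page_table_[page_id] = victim;
        return victim;
    }

    void UnpinPage(int32_t page_id, bool dirtied) {
        std::lock_guard<std::mutex> latch(latch_);
        int f = page_table_.at(page_id);
        frames_[f].pin_count--;
        frames_[f].is_dirty = frames_[f].is_dirty || dirtied;  // flush later
    }

private:
    int FindUnpinnedFrame() {
        for (size_t i = 0; i < frames_.size(); i++)
            if (frames_[i].pin_count == 0) return static_cast<int>(i);
        return -1;
    }
    std::mutex latch_;
    std::vector<Frame> frames_;
    std::unordered_map<int32_t, int> page_table_;
};
```

Note that the sketch only tracks metadata; the actual page bytes and the write-back of dirty victims are elided, since that's the part the lecture returns to later with logging.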
A latch, in my world, is what you can consider a mutex: it's a low-level synchronization primitive that protects some critical section of a data structure. And it's something a thread would acquire, do some small update or change, and then immediately release. This will make more sense when we talk about transactions after the midterm. But just understand that there's a low-level synchronization method, like a mutex, that we're going to refer to as a latch — you had to do this for Project 0 anyway. And there won't be any deadlock protection for us when we use latches; it's up to us as programmers to make sure that we don't write crappy code. Whereas if you take locks during queries and transactions, that's dealing with application code, which is usually going to be stupid, and therefore we have to have the protection mechanisms built into the system to prevent deadlocks. So again, all you really need to know at this point is: a latch is a mutex. This is further confusing because we're writing our projects in C++, and if you look at the C++ manual, they have a notion of a latch as well, which is essentially a countdown barrier, right? So when I say latch, it's not the C++ one, it's the database one, okay? All right, the other thing I want to bring up is the distinction between the page table and the page directory. So again, the page directory is the mapping from a page ID to some location in the database files that reside on disk, right? And this has to be persisted, because any time we crash and come back, we need to know where pages are in our files, and so we have to be very careful to make sure this thing is always written out safely. The page table is an ephemeral data structure that we maintain in memory just to map page IDs to frames in our buffer pool.
So in this world, we don't care if we crash and lose the contents of the page table; we'll repopulate it when we come back. Some systems like to be clever: they recognize that instead of having a cold start on your page table cache, you can periodically write out the contents of the page table to disk, so that upon a crash you come back, load in that page table, and then prefetch a bunch of the pages you expect to read when you start running queries again. I know MySQL can do this. Postgres has other things we can talk about too, right? But for our purposes, for this class, it's all in memory. If we crash, we lose it, but that's okay, because the data that we care about is safely in the disk pages. Yes. The question is: when a transaction ends, or commits, or finishes, will we write out the dirty pages? Let's pause that conversation. The answer is no. But we'll get to it after the midterm. You flush the log pages before you flush the dirty pages. But to this point, that's the example I was giving: if I have a query that updates a page — a table page — I don't want to write out that dirty page to disk until I have a log record that says, hey, I made this change. Then, at that point, it's safe for us to write out the table page. When you actually write it out can depend. We'll get to this in a second, but if you know the log record's been saved, you could either have the eviction policy write out the dirty page, or you could have a background writer write out the dirty page. There's a whole bunch of different mechanisms for when you actually want to do it as well. But you always have to do it after the log page has been flushed to disk. But that's, like, after the midterm. All right. So this is what I've described so far, in the first 10 minutes of class or so. That's the basic buffer pool manager. Pretty straightforward.
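The flush-ordering rule just stated — log record first, dirty page second — boils down to a one-line check. The pageLSN/flushedLSN naming below follows the usual write-ahead-logging convention; it's an assumption for illustration, not something defined in this lecture.

```cpp
#include <cassert>
#include <cstdint>

// A dirty page may be flushed only after the log record covering its
// last change is durable: the page's LSN must be <= the LSN up to
// which the log has already been flushed to disk.
bool SafeToFlushPage(uint64_t page_lsn, uint64_t flushed_log_lsn) {
    return page_lsn <= flushed_log_lsn;
}
```

Either the eviction path or a background writer would consult a check like this before writing the page out.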
Match page IDs to frames, and you get back pointers and do whatever you want with them. Of course, the challenge is how to make this actually high performance. How do we make this work really well and have optimizations that the operating system is not going to do, or can't do? Because we're inside the database system, we know all the queries that are running. They're declarative. We know what they actually want to do. So we can start making decisions about how we evict things, when we prefetch them, and when we write things out, that will get us better hit rates for our cache. So first, there are two sorts of policies you could have. You could have a global policy, where you say: here are the procedures I'm going to use for all possible pages that are in my buffer pool. I don't care whether they're from an index or a table, or what table they're from. You just have this set for all things. But then we can have more fine-grained, or local, policies, where we can determine what the query is trying to do, what table it's trying to touch, what the access pattern actually is. And then we can decide how to evict things, or move things in and out, based on what that query is actually trying to do. The most typical way to do this in a production system is a combination of both. You set the policy at the global level that you think is overall the right thing for your application, but then, potentially, for individual database objects — individual indexes or tables — or individual queries, you can have things that are more localized. So you can basically allocate some memory just for a particular query and do certain things with that memory you've allocated, and then maybe not worry about the other queries running at the same time. And we'll see this as we go through this lecture as well.
This is a good example of what differentiates the very expensive enterprise database systems from the very good but not as robust or complex systems like PostgreSQL and MySQL. We'll see DB2 in a second, but in DB2 you can allocate buffer pools on tables, on indexes — you can get very fine-grained and have different policies for each of these buffer pools. Whereas in MySQL you get one buffer pool, and it's good enough; it's all you get for all the queries and all the tables. So the optimizations we're going to talk about are having multiple buffer pools, prefetching, scan sharing, and then the last one will be buffer pool bypass. And then we'll talk about eviction policies — how to decide what to throw out and when to write things out. So the first, most obvious optimization you could have, as I already said, is that instead of having one buffer pool with one page table and one set of frames for the entire system, we could have multiple buffer pool instances. And then you can assign these buffer pool instances per database, per table, per index, and so forth. It's essentially partitioning the memory that you're going to allocate to the database system to the different logical entities in your database, to improve locality and to improve cache efficiency. The other side effect you get from this is that you're also going to reduce latch contention, because if everyone's going to the same page table — and I've got to take latches to go do lookups and read and modify it — if all your threads or all your workers are doing that, because for all your tables it's one giant page table, then that'll become a bottleneck. So again, these are all commercial systems that support this. DB2 and Oracle are still... they all make a lot of money, but DB2 and Oracle are still widely used. Sybase and Informix were hot in the '80s, not so much anymore, but there are still a lot of enterprises running them.
SQL Server is actually super up-to-date, super modern; it's really good. The MySQL one is the most obvious or easiest one to do. We'll see it in two more slides, but basically you tell it how many buffer pool instances you want, and it just hashes on the page ID and jumps to a different one. Yes? All right, so this question is: what do I think is the best database system in the world right now? That is not something we should talk about in the middle of the semester. Let's talk about this at the end. I've already sort of said what I think I would use — what I do use for stuff. Yeah, Postgres. Why? We'll get to this later. I think for this class, actually — and we're going to use it in a demo — it's a great textbook definition of a database system. Does it have some things that I think could be improved? Yes. Is the query optimizer the best? No. SQL Server has the best query optimizer. It's phenomenal. But for 99% of people, Postgres is probably good enough. For some things. All right, so I just want to show you some quick syntax — quick syntax for how to do this in DB2. Not that everyone here is going to run it, but let me just show you. They have SQL commands where I can create a buffer pool, set the size of the buffer pool (how many pages), and set the size of the pages, and then I can create a table space. Think of this as — I think in Postgres the catalog calls this a schema — think of it like a namespace. So I can assign this table space to be backed by, or managed by, this buffer pool, and then I create tables in that table space. So now I would have a custom buffer pool just for this one table, right? So for how you actually handle this in the system itself, you have two approaches. The first is that you have this notion of an object ID, and you use that as a mapping to get to whichever buffer pool has the data that you want, right?
So this is not SQL, but say it's like: get record one-two-three. I can convert that record into some kind of identifier with the object ID, page ID, and slot number. In SQL Server — I think in the third class we showed this — if you decompose the record ID, the row ID, into its subunits, you would see this information. And then we just have some kind of mapping that says, okay, for object ID X, you use this buffer pool versus another one, right? So this is happening before you hit the page table. You don't have to take a latch to figure this out. The other approach is the MySQL one, as I said, where you just take the record ID you want, hash it, mod by the number of buffer pools you've allocated, and that routes you to the location that has what you want, right? So basically, we need to guarantee that one physical page only exists in one buffer pool, right? Because it wouldn't be good if one thread does this mapping and lands in this buffer pool for page one-two-three, and then another thread gets a different mapping to a different buffer pool, and I have two copies in memory. That's bad. We don't want to do that. Yes. So this question is: what's the advantage of having these multiple buffer pools? The advantage is that you can have different policies on how you decide to evict things, right? I can have different page sizes — now I can have larger pages for this table because I'm reading big things in. And I can reduce the amount of contention I have on my page table, because I have to take latches when I go inside it. So, again, DB2 is great — well, DB2's ability to manage buffer pools is fantastic, because I can create these and drop these without having to restart the system. Postgres — I don't know about Oracle — doesn't have multiple buffer pools. MySQL does. But in MySQL, you can't create multiple buffer pool instances without restarting the whole system.
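The MySQL-style routing just described can be sketched in a couple of lines. The hash function and pool count below are arbitrary choices for illustration; the point is only that the same page ID deterministically lands in the same pool, which is what guarantees a physical page can never be cached in two pools at once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

// Route a page to one of num_pools buffer pool instances by hashing
// its page ID and taking the modulo. Deterministic: the same page ID
// always maps to the same pool.
size_t RouteToPool(uint64_t page_id, size_t num_pools) {
    return std::hash<uint64_t>{}(page_id) % num_pools;
}
```

Every thread runs the same computation before touching any page table, so no latch is needed for the routing step itself.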
And that's going to be bad, because then you've got to throw away memory and you've got to read it all back in from disk. Yes. So his statement is: with multiple buffer pools, couldn't that increase the risk of memory fragmentation, where you have unused space in memory? I mean, yeah, if you do something stupid, yes. Like if you have a table that has zero data in it, zero records, and I say, okay, let me assign you a buffer pool that has two gigs of memory, but I never use it — yeah, the database system will do what you ask for, but is it the right thing to do? No. Right? Yes. So his question is: why does having multiple buffer pools reduce latch contention? So it's here, in the page table, right? Depending on how you do your latching protection in your page table — which we'll cover Tuesday next week — you could latch the entire data structure, which is stupid because then I have to protect the whole thing, or you can have more fine-grained latches. And again, whether it's a read or a write latch, if everyone's coming into one data structure and trying to figure things out, and you have to have the latches to protect things, that's always going to create contention. But now it's divide and conquer. If I have multiple page tables, you can go here, I can go here, we don't care about each other, we can acquire the latches very quickly and do what we need to do. Right — if you need page one-two-three, and I need page four-five-six, and four-five-six is in this buffer pool, I can go there and do whatever I need; you have to go over there. His question is: do the buffer pools have latches to themselves, specific to each buffer pool? Yeah, they each have their own page table. Yes. All right, cool. If you latch the page table, do you latch the whole page table? It depends on the implementation.
The easiest is to latch the whole thing, but you can have more fine-grained latches. I think we might cover this — I know we talk about index concurrency control, and I think we talk about latching in hash tables in that lecture. I'll have to see whether we cover it next Tuesday. Again, the easiest thing is to take a latch on the whole thing, but that's going to be a bottleneck, so nobody should do that. All right, so the other thing we can do now is prefetch, based on what the query wants to do. Remember, heap files are unordered, so there's no guarantee that the order in which you're going to read pages matches the order on disk — but let's assume for simplicity, for this first version, that it does. If I have a query that comes along, Q1, and say this is just doing a complete sequential scan on the table — on these disk pages here — well, say there's nothing in our memory when we start. So when I go look at page zero and I ask the page table, do you have this? It's going to say no, so now I've got to go fetch it from disk. Then same thing: I go get page one, it's not in memory, I've got to go fetch it in. But now the database system could try to be clever and say, okay, well, I know what you need to read: you're going to read this entire block of data, this entire segment of pages. So when I go read page one, maybe I'll also go read ahead and bring in pages two and three. And now it knows that, well, I've already read page zero, I'm not going to go back and read that again, so I reuse that space to bring in — prefetch — the page, so that when the query starts running again, there's not a disk stall waiting for the pages it needs, because they're already in memory. And I can do this for all the other ones. You can do this in the OS — you can pass in flags to do prefetching — but the challenge is going to be when you want to start prefetching things that aren't always going to be sequential scans.
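The sequential read-ahead just described can be sketched as a small helper that, on a miss during a scan, returns the requested page plus the next few pages that aren't already cached. The window size of two and the function name are made up for illustration; real systems tune the read-ahead window and issue the extra reads asynchronously.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// On a request for `requested` during a sequential scan, return the
// pages to fetch from disk: the requested page plus up to `window`
// pages ahead of it, skipping anything already in the buffer pool and
// never reading past the table's last page.
std::vector<int32_t> PagesToRead(int32_t requested,
                                 const std::set<int32_t>& cached,
                                 int32_t last_page,
                                 int32_t window = 2) {
    std::vector<int32_t> out;
    for (int32_t p = requested; p <= requested + window && p <= last_page; p++)
        if (cached.count(p) == 0) out.push_back(p);  // skip pages in memory
    return out;
}
```

The index-scan case below needs more than this, because the "next" leaf pages are not the next page numbers on disk — that's exactly the knowledge only the database has.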
So let's say we have a range query like this — select star from A where the value falls in some range — and say for this one we're going to do an index scan. Now, I haven't explained what a B+ tree is yet; we'll get to this next week as well. But just think of it as an order-preserving, ordered tree data structure, and all the values are going to be in the leaf nodes, right, like this. So now, as I start executing this query, it has to look at index page zero, because that's the root. I go get that, bring that into memory, then I jump down and go read index page one. That's not in memory, so I go fetch that. But now the system can recognize that, okay, I'm at the leaf nodes and I'm going to read this page and this page — but those pages aren't contiguous on disk, right; they're at different locations. So the OS is not going to be able to handle this, because it doesn't know what's in these pages. It doesn't know what a B+ tree is; it just sees a bunch of reads and writes on single pages. But because we're the ones managing what's in these pages, we know it's a tree, we know where the leaf nodes are, we know what's going to be ahead of us, so we go ahead and prefetch these guys as needed, right. So this is a really simple example; you can imagine more complex things you can do as well. And again, the commercial systems will be much better at this than the open source ones. The next trick we can do is called scan sharing, and the idea here is that we can take data that we've retrieved for one query and reuse it for other queries that are maybe trying to read the same thing at the same time, right. These are sometimes called synchronized scans, if you look it up — I think Postgres calls them that. So this is different than result caching, where I take the output of a query and reuse it for other queries that are maybe looking for the same thing.
This is really at the low level, at the storage manager portion of the system and the buffer pool, where we're reusing pages that one query has fetched for another query, right? The way this basically works: obviously, even without the scan sharing technique, if my query shows up and reads the same pages you just read, I get that for free, because they'll be in the buffer pool already. But the idea here is that if I'm scanning a lot of data, I want to piggyback off of your cursor so that I read pages in the same order you do, rather than starting from the beginning and having to fetch everything all over again, right? I'll show you an example of what I mean. There are other applications of this too: if you and I are doing exactly the same computation on the data, we can reuse those results, but that's at the upper level in the execution engine and we can ignore it for now. So, like this, right? The most simple approach is, as I said: if a query shows up while another query is already running, and the system can recognize that the other query is reading the same things I'm going to need to read, I can jump on its cursor and read along with it. This is fully supported in IBM DB2, SQL Server, and Postgres. Oracle, as far as I checked last year (I haven't checked this year), only supports the sharing technique if you have the exact same query, literally the same SQL syntax; then you can piggyback off of it. Whereas Postgres, SQL Server, and DB2 can be a bit more clever and say, you're reading the same tables I am, let me jump along with you. In previous years I said Postgres didn't support this, and then on Stack Overflow someone watched the video and complained, and they also said I had bad hygiene, but that's old news. But there is a flag called synchronize_seqscans in Postgres; they've added this.
This is hard to demo because it's like a race condition; it's really hard to get two queries to arrive at exactly the same time, and it's hard to determine that they actually used it without sitting there with GDB attached to the system. Yes? No, no. I made a mistake in previous lectures where I said Postgres didn't support this, and this person called me out and said I was wrong. Okay, so the question is: after the piggybacking on your cursor is done, you've got to go back and read the data you missed. We'll go through this example. We have query one here; it'll do a complete sequential scan on table A and compute a sum. It starts at the beginning, nothing's in our buffer pool, so it just does what it normally would: scan down, fetch pages, and bring them into the buffer pool as needed. Now at this point we need page three, and we don't have any space in our buffer pool because we only have three frames, so it's going to evict page zero, throw that out, bring in page three, and do what it wants to do. But now, say at this point another query shows up and wants to do the exact same thing: compute an aggregation on table A. We can ignore result caching; that comes later. Just think about low-level page accesses. If we had this query do exactly what the first one did, starting at the very beginning and scanning down, well, I just evicted page zero, and it needs page zero. That would be bad: I just threw it out and now I've got to fetch it right back in, so I'd be thrashing the contents of my cache. So instead the database system is going to say: oh, okay, you're just doing a scan on this table, and Q1 is already doing that. Let me hop you on that one too. Go along for the ride; read the same thing that Q1 is reading.
Now at this point Q1 finishes and returns to the application, but as he pointed out, the database system will recognize: well, Q2 skipped a bunch of pages at the beginning of the table because of the scan sharing thing. So let me put Q2 back at the top and scan down the parts it's missing. Right? Another interesting thing about this: say the query was slightly different, like it had a LIMIT 100 clause. Then Q2 could actually just stop at the end of Q1's scan, without having to read everything. But now the result might actually be different if I run this again starting at the top versus where I joined. Under the relational model, that's okay, right? Because tables are unordered. So with the scan sharing technique you may actually get different results if you have something like a LIMIT clause without an ORDER BY. But from the relational model's perspective that's okay, and it makes our lives a lot easier. Okay. So the next technique or optimization we can do is called buffer pool bypass. The idea here is that as our query is running along and accessing data, it may be the case that we don't want to put the data we're bringing in into the buffer pool, because then we'd have to take latches on it, get a frame, copy the data in there, and so forth. Right? So instead the database system can decide to maintain a little private buffer for that query: it copies data into it and immediately evicts it when it needs more memory. The downside is that these private buffers may not be shared across different queries unless you do more bookkeeping. So it may be the case that my query reads a page and your query reads the same page; as long as it's read-only that's okay, but maybe now it's two fetches to disk instead of just one. Right? Again, the high-end systems, and Postgres, actually support this; Informix calls these "light scans".
You can basically use this for data that you know you're not going to need again, or intermediate results that you compute and then immediately throw away. Okay, so the last thing I want to talk about here is sort of specific to Postgres. In addition to the memory-mapped files and virtual memory stuff we talked about before, the operating system also maintains what is called a page cache, below the file system, that buffers pages it reads from disk. Right? And you get this by default in a modern operating system. The way it sort of works is that you have your database system in user space, and down below you have the operating system. So when I do a read on the file system, that's a syscall, and now the OS, which is maintaining a page cache in the kernel, says: let me see whether the page you're looking for is in my page cache; if not, I'll go down to disk and get it. The operating system by default is going to try to use all the free memory it possibly has for its page cache, because it doesn't know what's up above. It doesn't know there's a database system maintaining its own memory; it just tries to make things run fast, so if it has free memory, it uses it for the page cache. So most database systems do not want to use the OS page cache, and turn it off in most situations. The only major system I know of that always uses the OS page cache is Postgres. The way you turn it off is with what is called O_DIRECT, or direct I/O: when you do your reads on the file system, you say, bypass the page cache, don't maintain a copy in there for me, just give me the data directly.
This also matters for writes, because in our world, to make sure we don't lose any data, we want our data to land on disk when we say it does. So even if you do direct I/O, you still have to do an fsync to make sure the hardware flushes things out; we'll cover that later. The main takeaway here is that, except for Postgres, we almost always want to use direct I/O, because otherwise we're giving up control of what's actually in memory. So in most database systems, when you read the manual, assuming you're running on a dedicated machine with no other programs, no web server, no application server, no bitcoin miner, it'll tell you to allocate about 80% of the memory available on the machine to the database system. Except for Postgres: Postgres tells you to allocate, I think the rule of thumb is, like 40%, because you want to leave some memory over for the OS page cache. Postgres does this for historical reasons; I don't think there's been any serious attempt to get off of it. I will say, though, that when Amazon forked Postgres and made Aurora, which is their sort of specialized flavor of Postgres, they got rid of the page cache reliance; Aurora manages all the memory itself. So we can actually test and see quite easily that Postgres is using the OS page cache. Again, the reason I like using Postgres is that it's like the textbook definition of a system, other than this OS page cache thing, and it's really easy to play around with and see what's actually going on. Let me move this; sorry, let me log in. So we're going to use that same table I showed before, with the decimals: a table with ten million entries of real numbers, right, like that. So what I'm going to do is restart Postgres, and that'll force it to blow away its buffer pool,
but I'm also going to run this command here to tell the operating system to drop its own page cache. So again: we blow away the OS page cache, and we restart the database server, so now nothing is in memory. Now I've got to reconnect, turn on timing, and turn off some extra stuff we don't need. I apologize for that sound. What I'm going to do here is run the query we did before, taking column a plus column b and computing the sum, but I'm going to put an EXPLAIN in front of it because I want to see the query plan. Inside the parentheses I'm going to pass flags to Postgres: ANALYZE, meaning show me the query plan but also actually run the query and tell me how long it takes; and the BUFFERS flag, which I think I was trying to show before and couldn't remember the syntax for. That's going to tell us how much of the data was read from disk versus already available in our buffer pool. So here you see the sequential scan, which is just reading the pages one by one, and you see that on this first run it fetched 44248 pages from disk, and Postgres uses 8-kilobyte pages. If I run the same query again, now you see it got hit=32, so 32 pages were in the buffer pool. Postgres is trying to prevent this one query from using all of the buffer pool right away, so it said, okay, I'll keep 32 of the pages from your last run in memory, and it sort of keeps incrementing or doubling that; next time it'll have 64. So it's slowly keeping more pages in memory while evicting the rest right away. Now, in Postgres there's something called pg_prewarm: I can basically tell the database, go read this entire table and put it in memory for me. So I run that, and it tells me it read 44248 pages; and remember, on the cold start it had to read 44248, so that's the number of pages in this
table. So now when I run this again, you see it got hit=16092, but my read-from-disk count is still kind of high: 28156. So even though I told Postgres to pre-warm everything, to put this entire table into the buffer pool, it's still not all in there. Anyone want to take a guess what's going on here? The buffer pool is not big enough. The buffer pool is not big enough, exactly right. I can confirm this in Postgres because there's a parameter called shared_buffers that tells the database how much memory it's allowed to allocate. Because we're not the operating system, we can't just take memory; you as the user have to tell the database how much memory it's allowed to use. It's set to 128 megabytes. So now go look at that hit number up here, the 16092; that's how many pages it could reuse when I ran this query. If we take that number, 16092 pages that were in memory, multiply by 8 kilobytes per page, and divide by 1024, I get about 125 megabytes, and that roughly matches the buffer pool size. Right. So now what we can do in Postgres is go modify our Postgres config file; there's the 128. So what do we need to set this to? We said that when we did pg_prewarm, it read in 44248 pages, so all we need to do now is figure out what that is in megabytes. You can write queries without FROM clauses, so it's like a calculator: 44248 pages, times 8 kilobytes per page, divided by 1024, gives about 345. So we need 345 megabytes to fit this entire table in memory; let's round that up to 375. Right, so now we've got to restart Postgres. So, guys, what is the eviction policy for Postgres?
It's LRU; we'll get to that in a second. All right, so now we restart Postgres, we get disconnected here, we turn off this thing again, and now I call pg_prewarm: it brought in 44248 pages. And now when I run my query, you see the hit number right there. I told Postgres I want the entire table brought in, and now when I run the query, everything is in memory and things are great. The other thing Postgres is relying on is the OS page cache, and that's not hard to see either. This is not scientific, but there's a little blue bar right there; that's the page cache the operating system is maintaining. So if I run this query again, even after restarting Postgres, it's going to be reasonably fast, because the OS is still maintaining that cache for me. But if I blow away the OS cache, then I force it to read everything from disk. We don't have time, but we could measure that and see that it actually did read from disk versus using the OS page cache. Like I said, Postgres is the only one that does this. The reason I don't think it's a good idea is that you lose control, although it does make some things easier: if I restart the database and come back, I don't have to fetch everything back from disk into memory, because the OS page cache may still be there. If the whole machine bounces and restarts, that doesn't help me, but it does make restarting the database system, without restarting the server, faster. But like I said, if you want to know how much memory the database system is actually using, you'd have to go look at the OS page cache and figure out which pages correspond to Postgres to make that determination. Yes? Let's see... yes, but not much. This is not a powerful machine, but it has a fast drive, an M.2 drive, so reading 300 megabytes from disk is not that slow; a larger table would show this problem more clearly. So again, Postgres is the only one that does this; it is what it is. All right, so
now I can get to his question about how we actually decide which frames to clear out, which pages to remove to bring new ones in. This is going to be called a replacement policy in our buffer pool manager. Basically, what happens is: when a query runs, it says, I need a frame to bring a page in, because the thing I need isn't in memory. If all the frames are full, then the database system has to decide which one to evict, and there's a bunch of different criteria we can consider for how we want to evict things. Say we have one page that's dirty and needs to be written to disk before it can be evicted, and another page that is not dirty and can just be dropped. Depending on the access patterns, and depending on what the next query that touches these pages wants to do, it may actually be better to write the dirty one out first and give that one up, rather than dropping the clean one that you're going to need immediately afterwards. So this is another good example of where the enterprise systems are going to be way more sophisticated than the open source ones: they maintain all sorts of statistics about how pages are being used and in what context, they extrapolate from your query what you're actually doing and what you're expected to do next, and they make the best decision to get the best cache efficiency. The open source systems can't compete with the amount of money and engineering time that's gone into these things; the commercial ones are very sophisticated. Yes? So your question is: the statistics I might maintain about the access patterns of pages, or of queries on pages; after I restart the system and blow away my buffer pool, when I come back, are those statistics maintained? In the enterprise ones, I think yes. So your question is: if I have a new query and I run it once, but then I run it multiple times, it will get faster over time; why is that bad? Oh, okay, so he
is getting at a point that I'll rephrase for you. In the database world, oftentimes maximum performance is not the most important thing; stability is actually more important. I don't care if you have a new system that makes 90% of my queries run faster if the other 10% run slower, or are sometimes faster and sometimes slower, because that makes the system very nondeterministic and very difficult to understand what's actually going on. So to your point: if I start doing A/B testing, is this query faster than this other query, and the caching policy is trying to be clever about certain things, isn't that going to potentially give me false results and make me make the wrong decision? Potentially, yes. But at the low level, at the buffer pool manager, I don't think it's a huge problem. I'll think about this; let me take it offline. Okay, so the design decisions you can make when you decide how to build an eviction policy: you care about correctness, obviously; you want to make sure you only evict things that should or can be evicted. Likewise accuracy, that you're doing this reliably. Speed is super important as well: if deciding what to evict is some NP-complete problem and it takes longer to compute which page to evict than to just read from disk, then that's a waste of time. And of course there's the storage overhead of all the extra metadata we have to maintain about the different access patterns. This is an old problem in computer science; everyone has a paper on this kind of stuff, everyone is trying to make something better, and no one has really won. I think it's unsolvable in general, but the enterprise systems do a pretty good job. All right, so the second easiest thing to do is called least recently used, and all you have to do is maintain a timestamp of when each page was last accessed. Then when you need to evict a page, you go find the one that has the oldest timestamp that isn't dirty or isn't
pinned, and you go ahead and swap that one out, flush it out. The way to make this run faster is to maintain the pages in sorted order, so that when you want to find the next page to evict, you just pop it off the front of the queue. All right, so that's about the simplest thing you can do. Another approach that's common is called CLOCK, and this is an approximation of LRU. Instead of maintaining that timestamp and that sorted order for every single page, all you have is a single reference bit per page that you set to one when the page is accessed. The way it works is like this: say you organize your pages in a circular buffer; every page has its reference bit set to zero in the beginning, and when a query accesses a page, we flip it to one. Then we have this clock hand that just sweeps around continuously, and any time it looks at a page: if that bit is set to one, it sets it to zero; if it's set to zero, then it knows the page hasn't been accessed since the last time it swept around, and therefore it can be evicted. Right? So I point at this one first; it's set to one, so I set it to zero. I come here; it's already set to zero, so I know it's okay for me to evict this page, because it hasn't been accessed since the last time I came around, and I can fill a new one in, like that. These elements are set to one; I sweep around setting them to zero, sweep around again, and then I land on this one again, and it's safe for me to go ahead and evict it. Right? So instead of maintaining the exact order in which pages were accessed, as I would in LRU, I just ask: since the last time I looked at you, have you been accessed? If no, then it's safe to evict. CLOCK is used in a bunch of different things outside of database systems; I think Linux uses a multi-hand version of clock for its page eviction algorithms. This technique is very common. Is it expensive to
store the time? Expensive to compute, or expensive to... do what, sorry? So his question is: why do this approximation when keeping the time seems so much easier? But what is that time? Is it system clock time, or is it a counter; what would the time be? If it's the system clock, you've got to go down to the hardware to get it, and then, unless it's UTC, you've got to handle leap years and time zone changes and things like that. Usually it's a logical counter you just increment by one, but now that counter becomes a bottleneck as well. Whereas here you maintain a bitmap, and you can just flip bits very efficiently, versus storing a 32-bit timestamp for every single page. Right. But regardless of whether they actually use LRU or CLOCK, the problem with both of these approaches is that they are susceptible to what is called sequential flooding. That means that when I have a query that does a sequential scan on a table, like in a bunch of the examples I showed before, we're essentially polluting the buffer pool with pages that may not actually be needed again after the query accesses them. I said you could have a little private buffer space in some systems to avoid this pollution, but say you don't have that and you're trying to use this LRU stuff; you could still hit this problem. Because the challenge is that if you're just doing a bunch of sequential scans, LRU is exactly the opposite of what you want. You actually want most recently used: I just read this page, and if I come back and scan the entire table again, the page I just read is the last one I'm going to need again, if we ignore the scan sharing technique I talked about. So there are replacement policies that are specifically designed to handle
this use case. Just to visualize what I'm talking about: say we have a query like SELECT * FROM A WHERE id = 1, and say again we have an index, so we can go get the single record, the single page, that we want, and we fetch it into memory. But now our sequential scan query shows up, and as it goes along it floods our buffer pool with a bunch of pages. At this point it gets to page three, and I need free space, and if I'm doing least recently used, page zero was the least recently used one, so I go evict that. But now the same point query shows up again and needs page zero, the thing I just evicted. Whereas in this case, if I were using something like most recently used, I would probably want to evict page three, or pages two and one, and keep page zero in there. Right? So one solution to this is LRU-K, which is what you'll be implementing in project one. The basic idea here is that instead of a single timestamp, whether it's a logical counter or a system clock, I keep track of the last K times each page was accessed. Now when I go to decide whether to evict a page, instead of just looking for the oldest timestamp, I look at the gap between the oldest and the most recent of those K accesses, and I can use that to figure out which page was brought into memory but then not used again for a long time. In that case, in my sequential scan example from the last slide, page zero would not have been evicted, because I'd have kept track of it being used over and over again by queries that access just that one page. Yes? Right, so the question is: with K equals 2, it's the oldest time and the most recent time; would you want more than K? Yes, because you could then use the intervals to figure out
a pattern? Yes. So he says: why stop at K? Why not maintain a running average of how things are accessed? Again, that's what the enterprise guys do. They maintain all sorts of statistics about what kinds of queries are accessing things, how often, even which users, because some users might have higher priority than other users. You can do some wild things with those systems. So the question is: after keeping track of these K references, how do you decide which page to evict? Again, in LRU-K it's the difference between the most recent timestamp and the oldest of the K timestamps; whichever page has the biggest gap has gone the longest without being re-referenced, so it's the one least likely to be needed versus the other ones. Yes? Again, say you maintain K equals 2; it's like a queue. I access the page, I put a timestamp in; I access it again, I put another timestamp in; I access it again, I push out the oldest one. Yes, so his question is: if K is greater than 2, what actions do you take; do you start prefetching? I don't know whether they use this for prefetching. Assume the page is already in memory; you're using the K timestamps to decide whether to evict something. But to your point, if you kept the history of all the different pages, you'd need more context. You don't want to say, this random page, I think it's going to be accessed soon, when there's no query that actually needs it; you don't want to fetch it. If a query is accessing the table in a particular way, then maybe you could use that to decide, okay, these pages are going to be needed next, but I don't know if they actually do that. Yes, the question is: what happens if K equals something and a page does not have K entries yet; what should the value be for those pages? You set it to infinity. It's not perfect, yes, but it's better than the alternative. Again, there isn't a magic oracle that's
going to say, here's the best caching policy for every possible workload. LRU-K does okay most of the time, and in practice it does better than LRU. There's another one from IBM called ARC; I don't teach it because it's patented, but it has similar protection mechanisms. Yes, the question is: do any systems explicitly use most recently used? The commercial guys might actually use this for an OLAP system. Again, it's more complicated, because it depends on how often tables are being accessed; I don't want to say it's vanilla MRU without any additional accoutrements, it's never that simple, but there are policies that do something like that. Yes, the question is: do you store one actual copy of the page, and then a queue of K timestamps for that page? You could store that metadata in the page itself, or you could store it in the page table, like the metadata I mentioned in the beginning, the reference count, the pin stuff. If you store it in the page itself, then you get durability for free, because it'll get written back out; but there are some things you don't actually want to write out. I don't need the pin flag written out to disk, because who cares. Yes, so his question is: why is a buffer pool even needed for an OLAP system, when you're just doing big scans over cold data anyway? Well, we haven't talked about table designs yet, but one way to design your database is what's called a snowflake schema, which is where Snowflake got its name. You have this single giant fact table; think of Walmart, where every single item anyone has ever bought goes into this giant table with billions and billions of purchases, and things like the name of the item, the size, the price, and so forth are stored in these side dimension tables. So to your point, if you're always just scanning that giant multi-billion-row fact table, you probably don't want to buffer it, even though buffering
always helps in general, right? Reading from disk is expensive. So you don't buffer the fact table, but maybe you have different policies to pin those dimension tables on the side; though if you run out of memory, maybe you evict one of them, because this one is used more than that other one. You definitely want the database system to maintain a cache for those, and everyone does this. Right, so in the interest of time I'm going to skip this; we've already sort of covered it. Priority hints are sort of obvious as well: the database system can flag some pages as more important than others. A really simple example would be: say I have a bunch of insert queries on this table, where I'm inserting primary key values that increment by one each time, so I'm always inserting new data on this side of the tree. Therefore, maybe I want the database system to keep these pages in memory rather than those other ones, because the next insert is always going to go down that side. Likewise, if I have another query here that wants to do a lookup on this index, I actually don't care what the value is; the first thing it always does is access the root page of the B+tree. So we want to tell the database system, don't evict that one, because everyone's always going to need it. It's not like a pin; the pin prevents the system from evicting a page no matter what, whereas this is just saying, hey, it'd be nice not to evict this, but if I need space and the page isn't dirty, you're free to take it. I've talked about this a little bit already, these dirty pages: if we're running out of space in our buffer pool and we need to take over a frame, if a page is not pinned and not dirty, we're free to drop it, and that's super fast; there's nothing to do, just overwrite whatever the contents were. If a page is dirty, then we have to flush it out to disk, to make sure it's durable after the system restarts. If I go need a page, and there isn't a free frame, and
they're all dirty, I have to wait until a dirty page is flushed before I get back a pointer to a location I can write into or read from. So again, beyond the basic LRU case, the system is trying to evaluate: is it better to evict a dirty page and flush it out, or should I just drop a clean page, even though I might need that one in the very near future? So the algorithms can get quite complicated. To prevent the slow-path case, where I'm blocked until a dirty page is written out, I can do what's called background writing, and most database systems have this: there's some thread that wakes up every so often, walks through your buffer pool, finds dirty pages, flushes them out to disk, and then unsets the dirty flag. And again, as we've said before, we have to make sure we don't write dirty pages out before their log records are written. I'm rushing through this quickly because I want to talk about project one. The other memory pools stuff we can skip: there are other memory pools we can maintain for queries; some can be backed by disk, some don't have to be; different systems do different things. All right, so just to summarize: the database system is almost always going to do a better job of managing memory than the operating system. And I realize, I think I prefaced this class by saying caching is one of the hardest and oldest problems in computer science, and I'm trying to cram it into a single lecture, but I want you to get a sense of the flavor of what these systems do when they build these buffer pools. The database system is always going to be in a better position to make decisions than the operating system, because it knows everything that's going on inside of it. It knows how data is moving back and forth, it knows what queries are doing, and there's a bunch of optimizations it can take advantage of. All right, so next class we'll kick off on hash tables, but I want to finish up with project one. Any questions about
Alright, so: project one. We're back on our beloved BusTub system, working on the system itself, not the trie thing on the side. Project one has three parts, and the high-level goal is that you're building your own buffer pool manager. The first thing you have to build is an extendible hash table, a hash table that can grow in size as your system needs. Then you have to implement an LRU-K replacement policy. And then you actually build the buffer pool manager itself: build the page table, run the replacement policy algorithm, and move pages in and out. The reading and writing from disk is taken care of for you.

We will cover what an extendible hash table is on Tuesday next week, but you basically have to build a thread-safe hash table that maps from page IDs to the actual frames themselves. You need to support growing your hash table; you don't have to support shrinking, because that's a bit tricky. You can use the C++ built-in hash function implementation, you don't need to roll your own; that's going to be good enough. And then, I think this might be out of date, but use the reader-writer latch that we provide you to do the protection. The extendible hash table does not need to be backed by disk, right, because if the system crashes you just build it again.

Then for the LRU-K policy, there's a separate class to implement this. You basically have to maintain these timestamps of when things are accessed, and then do a sweep to go clean things up, you know, to evict things. And if no page has been touched, you just evict the lowest one, the lowest page ID. This should be tested. But now you're going to use your LRU-K replacer to actually implement the buffer pool. Again, all the disk manager stuff is written for you, but you have to maintain the actual page table and all the metadata about when pages are pinned, when they're dirty, and so forth, yourself.
I actually don't know about number six, I'll double-check that. But you don't want to change any files outside the ones we tell you to modify, because those will get overwritten when you upload to Gradescope. So if you do some random custom thing that isn't in the files you're supposed to touch, you're not going to have it when you run on Gradescope, and it'll crash. All the projects are cumulative, so if this doesn't work, you're going to have problems the rest of the semester, because next you're going to implement the B+ tree, and if you can't get pages into memory, you can't do anything. We won't be providing solutions, but you will have the ability to keep testing your buffer pool after the project is due. So if later in the semester you think there's a bug or something's coming up, you can always keep running our tests to make sure your thing is working. And as always, we're not going to teach you C++; at this point, if you did project zero, you should be able to do it, but post on Piazza or come to office hours if you have sort of high-level questions.

So, yes: performance? One sec. Yes, so we don't want you to write crappy code, so we're going to force you to follow the Google coding style, and we have checks in place to make sure that you don't have messy code. His question is about performance: yes, you get extra credit if you rank high on the leaderboard. The top 20 implementations in the class will get bonus points, so if you have the fastest buffer pool you basically get 50% extra credit on this, and then we'll have different levels going down. Okay? And then whoever has the most bonus points at the end of the semester will get a BusTub t-shirt. If you want to buy a BusTub t-shirt, we sell them for $89; if you get the high score, you'll get one. Okay, don't plagiarize or I'll f**k you up. Okay, see ya. Hit it.