Okay, let's get started. So today's lecture is about the long-awaited buffer pool topic. I feel like we've been talking about it for weeks now, but it's finally here.

Before we get going, just a few administrative things to discuss. Project number one, which was released on Monday, will be due on Sunday, September 26th at 11:59 p.m. There's going to be a Q&A session to cover any more detailed questions about the project on Thursday, September 16th, which is tomorrow, at 4 p.m., and there's a Zoom link posted on Piazza where you can join. I would encourage everyone who hasn't yet started the project to at least look through all the information on the assignment page and in the code; that way you can ask more detailed questions, and if you attend the Q&A session you'll be better able to absorb the details being discussed.

There was also a question last class about hardware support for fixed-point arithmetic, which was one of the topics we covered last time. I looked around a little bit, and I couldn't find any specific information about modern hardware using specialized components for fixed-point arithmetic. I think there were some early attempts at this, like specialized hardware from IBM in the 1960s, but nothing recent or modern. I have seen FPGAs programmed to handle it specifically, but other than that I haven't seen any work on it.

So, the two problems we've been talking about so far. Problem number one, the topic of the last few lectures, was how the DBMS represents database files on disk. We talked a lot about how database files are stored as pages, how pages are laid out internally, and how tuples are stored within those pages. Now we're transitioning to the second problem: how does the DBMS manage its memory and move data back and forth from disk? Remember, this is one of the important features we want the DBMS itself to manage rather than leaving it up to the operating system. It's going to be an important topic in this lecture and going forward.

You can think about this database storage and memory management problem along two dimensions. The first is spatial control, which is about where to physically write pages on disk: where do we want to store pages for maximum benefit? The goal is to keep pages together, co-located or laid out contiguously on disk, if they're frequently accessed together by our applications. If you have a query that needs to touch multiple disk pages, ideally you'd lay those pages out as close together as possible. The reason, remember, is that sequential access on disk is a lot less expensive than random access. If you can seek to one spot where you know a bunch of the pages your query needs are located, you can read all of them in one shot rather than seeking around to a bunch of different locations on disk. That's the spatial control aspect.

The second aspect we need to consider is temporal control. This means that when we fetch pages from disk into memory, we want the DBMS to do it in a way that minimizes disk stalls.
So what does that mean? If there's a page you need that doesn't currently reside in memory, your query, your application, is going to block while the DBMS reads that page in from disk. There's going to be an I/O stall waiting for the page to come in, and we want to avoid those as much as possible. So what we'd like is for the DBMS to figure out an efficient way to keep pages that are going to be accessed at the same time together in memory at the same time. Those are the two dimensions we need to consider when building our buffer pool, and we'll talk through the different ways the DBMS handles this.

I showed this slide in a previous lecture, and just to hammer home the point: if you think about the different time scales in play here, accessing data in memory, on a normalized scale, is on the order of 100 seconds, while going to a traditional disk is on the order of 16.5 weeks. So it's really important for us as DBMS engineers to maintain this buffer pool efficiently and keep things in memory as much as possible.

So far we've been talking at the disk layer: you have your database file stored on disk, split up into a bunch of pages, and you have a directory page that stores the mappings from page IDs to physical locations or offsets in the file. That's what happens on disk, which is what we've covered so far. The next layer in the stack is the buffer pool, which we've referred to before, and the buffer pool exists in memory. Thinking at a high level, there's an execution engine piece, which we'll cover in later lectures, and the buffer pool serves requests from the execution engine. For example, the execution engine issues a request for page number two. There's no page number two currently in our buffer pool, so the buffer pool loads the file directory into memory first, figures out where page number two physically resides, and then fetches it so it can return a pointer back to the execution engine. That's how the whole buffer pool process works at a high level.

Specifically, the topics we're going to cover in today's lecture are this high-level idea of the buffer pool, and in particular the buffer pool manager: the piece of the software stack responsible for managing the buffer pool. Sometimes it's referred to as a buffer cache; you may see that term. We're going to look at the different algorithms the buffer pool manager employs, including replacement policies for deciding which pages to read in and which pages to evict. And finally, we're going to look at some other types of memory pools that can exist in the DBMS. That's the agenda for the lecture, and I've also allocated some time at the end of class for a high-level introduction and discussion of project number one.

The buffer pool is organized as a memory region that's just an array of fixed-size pages. Each array entry in the buffer pool is called a frame, and it's going to be the size of a disk page so that we can map disk pages into the slots in the buffer pool.
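Just to make that concrete, here's a minimal sketch in C++, not the actual course code; the sizes and names are illustrative assumptions:

```cpp
#include <cstddef>
#include <vector>

// Illustrative constants (assumptions, not from the lecture):
// a 4 KB page and a pool with 128 frames.
constexpr std::size_t PAGE_SIZE = 4096;
constexpr std::size_t POOL_SIZE = 128;

// A frame is a page-sized slot; it holds an exact copy of a disk page.
struct Frame {
  char data[PAGE_SIZE];
};

// The buffer pool itself: one in-memory region that is just an
// array of fixed-size frames.
std::vector<Frame> buffer_pool(POOL_SIZE);
```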
So when the DBMS requests a page, we essentially load an exact copy of that page into one of these frames in our buffer pool. For example, say we need page one: the buffer pool manager loads it and sticks it in one of the frames. Similarly, if we need page three, that can go into the buffer pool too. Now we need one more layer of indirection to access these pages, and that's what's called the page table. The page table keeps track of the mappings for the pages stored in memory. Similar to how the file's page directory keeps track of where pages reside on disk, the page table keeps track of where pages exist in the buffer pool in memory. So here we have mappings from pages one and three to their frames in the buffer pool.

The page table is also responsible for maintaining some additional metadata about each page, and we'll talk later in the lecture about why this is important. For now, just keep in mind that we track things like a dirty flag, a Boolean value that tells us whether the page has been modified since it was loaded into memory. We can't make updates directly on disk; we have to load pages into memory, and if a write happens, we note it by setting the dirty flag. We're also going to keep track of a pin or reference counter. If we want a particular page to remain in the buffer pool and not be evicted, we keep a counter that tells the buffer pool manager that certain queries or applications are using the page, so it doesn't evict it. You can keep all kinds of other metadata about pages, but these are probably the most important ones we'll cover. So if we want to make sure page three stays around, we set a pin on it: basically a counter that says some query somewhere is using it.

Now, if we have a bunch of concurrently running queries and one comes along and wants to fetch another page from disk into memory, we can run into concurrency issues, where multiple threads or concurrent queries are all trying to modify this page table. So what happens is that a thread sets a latch on a position in the page table to prevent concurrent modifications, then asks for the page to be loaded in. Page two gets loaded, we add the mapping from the page table to the buffer pool so we know where it is, and then the thread releases its latch. That's how everything works at a high level, and throughout the lecture we'll go through the different algorithms used to do these evictions and loads from disk.

An important distinction: I said "latch" on the previous slide, so threads are going to put latches on the page table. There's a long-standing feud between people in databases or database research and people in the broader systems community (or, if you're coming from an OS course, you'll have seen this) over what database people mean by latches versus locks, which is different from what systems people mean. I'm going to talk about it the way database people do, because we're in a database class. Typically, when database people talk about locks, they're referring to high-level abstractions that protect the logical contents of the database. That could be whole tables, an index, or tuples: things that are logical abstractions related to pieces of the database. Locks are typically held for the duration of a transaction; we haven't talked about exactly what a transaction is, but you can think of it as the duration of a query. And it's important that if a query acquires locks, it needs to be able to roll back its changes. So locks are a higher-level concept. When we talk about latches in a database management system, we're typically referring to short-lived latching that protects critical sections of an internal data structure or a modification going on inside the DBMS, and we don't need to be able to roll back the changes. A latch maps to something like a mutex.

Again, I think I mentioned this distinction: the page table resides in memory with the buffer pool, and the page directory resides on disk with the file. The page directory is the mapping from page IDs to the physical locations of the pages. It could be an offset in a file; it could be a directory somewhere. The point is it needs to be persisted so that in the future we can track down the individual pages we might need. The page table is just in memory; it's ephemeral. We don't need to persist it, so it gets built up over time as we execute queries, and if the system shuts down or we lose power, it's okay if it gets wiped out. We don't need to keep it around, because when we start up again and issue new queries it's going to look completely different. It's just an in-memory data structure that lets us track page IDs and where they're located in the frames of the buffer pool.

There's a question, yes? So the question is: will we lose updates to pages with the dirty bit set, meaning pages that have been written or modified by a query? Yes. If there are pages in the buffer pool with the dirty bit set, it means they were modified by some query and have not yet been persisted to disk. We haven't really talked about transactions or transactional guarantees yet, but typically, if a query is running in a transactional manner, you don't want to say it's complete, or committed, until all of its changes have been persisted to disk in some way. That could be through a database log, a write-ahead log, which we also haven't covered yet, or by ensuring that the dirty pages in the buffer pool have been flushed to disk. So yes, if the system crashes or loses power before that happens, you'll lose any modifications in those pages. Does that make sense? Any other questions? Great.

Okay, so just to recap the difference: the page table is in memory with the buffer pool, and the page directory is persisted on disk.
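To tie those pieces together, here's a small sketch, under assumed names (this is not the course's actual code), of a page table entry carrying the metadata just discussed, with a latch guarding the table:

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

using page_id_t = int32_t;
using frame_id_t = int32_t;

// Hypothetical per-page metadata, matching the fields discussed above.
struct PageMeta {
  frame_id_t frame_id;    // which buffer pool frame holds the page
  bool is_dirty = false;  // set when the in-memory copy is modified
  int pin_count = 0;      // >0 means some query is using it; don't evict
};

// The page table: an in-memory, ephemeral map from page ids to frames.
// A latch (a plain mutex here) protects it against concurrent threads.
struct PageTable {
  std::mutex latch;
  std::unordered_map<page_id_t, PageMeta> entries;

  // Pin a page so the buffer pool manager won't evict it.
  void Pin(page_id_t pid) {
    std::lock_guard<std::mutex> guard(latch);
    entries[pid].pin_count++;
  }

  // Unpin when the query is done; record whether it dirtied the page.
  void Unpin(page_id_t pid, bool dirtied) {
    std::lock_guard<std::mutex> guard(latch);
    auto &meta = entries[pid];
    meta.pin_count--;
    meta.is_dirty = meta.is_dirty || dirtied;
  }
};
```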
When we're trying to figure out which pages should exist in our buffer pool, there are two ways of doing it, with trade-offs between them; we'll see cases where one might be preferable over the other, but at a high level these are the two competing approaches. On one side you have a global policy for your buffer pool, which makes decisions for all active transactions or queries. It sees all of the queries running concurrently in the system at one time and tries to figure out how to manage the buffer pool globally. On the other side of the spectrum are local policies, where individual transactions or queries allocate, deallocate, load, and evict pages on a per-query basis. They only consider their own execution; they don't think about any of the other concurrently running queries in the system. This doesn't mean we don't share pages: pages are still shared, but when we make decisions about which pages to load or evict, we only consider the needs of individual queries rather than trying to optimize globally across queries. There are trade-offs to each, and we'll go into some of the details.

The buffer pool optimizations we're going to talk about start with the simpler ideas and move to more complicated ones: using multiple buffer pools at the same time rather than just one, using different prefetching techniques, sharing scans across multiple queries, and using a buffer pool bypass mechanism for individual queries.

We'll start with the multiple buffer pools idea. Logically the DBMS has one buffer pool where you load pages from disk into memory, but physically that could be implemented as multiple individual buffer pools with different strategies. For example, you could have multiple buffer pools running on one machine; you could have a per-database buffer pool, so if your system manages multiple databases, each one gets its own buffer pool. You could even have different buffer pools at the level of individual page types: if you have pages for tables versus pages for indexes, those could be handled by completely separate buffer pools. There are a lot of advantages. One is that it helps reduce latch contention, since there's less fighting over a single page table. It also lets you specialize each buffer pool instance for the needs of individual queries, databases, or page types. These are some examples of systems that let you run multiple concurrent buffer pools with different strategies inside each.

So, can anyone think of a problem that might come up if, rather than having one big buffer pool that all of my pages go into, we split it up into a bunch of smaller buffer pools? What might a problem be? Yes? Okay, so the comment is: how do you decide how much space to allocate to each buffer pool? That's a great question. In most of these cases, each system has some kind of knob that the database administrator needs to tune; in theory they know what's going on at the application level, so they decide how to split it up. So yes, that is a problem. Can anyone else think of a different problem? So the comment is: could you end up with copies of the same page in different buffer pools? Yes, that could be a problem. Typically these things are implemented so that each page maps to a single buffer pool. If I rephrase the comment, it brings us to the next slide: if you have a page and you want it to go to one and only one buffer pool, how do you figure out which buffer pool it goes to?

Here we have two buffer pools, and there are two common approaches (there may be others, but these are the most widely used). The first approach is that when you're storing pages, you store some kind of object ID associated with them. By object ID I mean the type of page we're dealing with. It could be a page that stores tuples in a table, a page that stores part of an index data structure, or a page that stores log records: an identifier that tells the DBMS what type of page it has. As an example, let's say Q1 wants to get record 123. Embedded in the 123 is this object ID; say the first digit is the object ID, the second is the page ID, and the third is the slot number for that record. Looking at our object ID here, the system might know that objects with object ID number one map to buffer pool one. At a high level, this lets the system split different types of pages into different buffer pools. The second approach, which is also pretty straightforward, is to take a hash of the record ID: we have record ID 123, we take a hash, and then take it modulo the number of independent buffer pools we have. Again, maybe it tells us to go to buffer pool one, whatever the result of that hash-and-modulo is. Basically, with multiple buffer pools, this gives us a mapping from each page to an individual buffer pool, so the system knows where to look for that page. (There's a small sketch of both approaches below.)
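Here's that sketch; both routing functions are hypothetical stand-ins for what a real system would do:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

using page_id_t = int32_t;

constexpr std::size_t NUM_POOLS = 2;  // assumed number of buffer pools

// Approach 1: an object id embedded in the record id picks the pool.
// The routing here is hypothetical: say object type 1 (table pages)
// maps to pool 0 and everything else (indexes, logs) to pool 1.
std::size_t PoolForObject(uint32_t object_id) {
  return (object_id == 1) ? 0 : 1;
}

// Approach 2: hash the page id and take it modulo the pool count.
std::size_t PoolForPage(page_id_t page_id) {
  return std::hash<page_id_t>{}(page_id) % NUM_POOLS;
}
```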
The next important optimization we're going to talk about is prefetching. The idea is that the DBMS can prefetch pages based on a query plan, before they're actually needed. What we've looked at so far is a query executing and fetching pages as they're needed for processing, but there's nothing stopping the DBMS from fetching pages from disk in advance and loading them into the buffer pool for us. We'll look at two types of plans (there are many others), but these two query operators illustrate the point of having an intelligent prefetching mechanism. The disk pages here are laid out in order; remember, it's a heap, so pages are unordered and might not be in this exact order, but for illustration purposes they are.

So imagine a query comes in and starts accessing pages. Q1 starts executing: it accesses page 0, and we load page 0 from disk into the buffer pool; then it moves on and accesses page 1, and we load page 1 in. Now, if this is a sequential scan, Q1 is going to need page 2 and page 3 next. We've already done 0 and 1, so why not let the DBMS load those pages ahead of time, so we don't have to wait for them? The DBMS can perform prefetching: it puts page 2 in the empty slot at the bottom, and for page 3, well, we already used page 0 and don't need to look at it again, so we evict it and load page 3 into that slot. Now when Q1 moves on to page 2, it finds it's already in the buffer pool, and if we keep this pattern up, Q1 is never blocked waiting on a disk or file I/O stall.

Another situation comes up with this query here, where we're asking for values between 100 and 250. We want all of the tuples from the pages whose values fall in that range, and that's obviously not all of the pages, so we can use a data structure called an index. We haven't really talked about indexes yet; the most widely used index data structures are B-trees or B+trees, which we'll cover in a later lecture, but for now you can think of this as basically a binary tree, a standard tree data structure. At the root we have page 0, and the pages descend from there. If the tree is organized on the value we're searching for, then the values are sorted in this order, and rather than scanning all of the pages in the file, we can just traverse the tree, which tells us which pages we need to look at. We start at the root page, load it from disk into memory, and descend the tree to find just the range we're interested in: we get the root page in, then the next page, and so forth, until we reach the leaf nodes. All of the leaf nodes have pointers connecting them, so we can scan right across consecutive pages for the range we care about. Now, you'll notice that the leaf pages here, page 3 and page 5, are not consecutive, so we need some more intelligent prefetching: the first few fetches were sequential, but now we have to jump to these later pages. The difference from the sequential scan is that the buffer pool manager has to know specifically what the query's access patterns are. If we treated this the same as a sequential scan, we'd end up reading in pages we don't need, like index page 2, whereas if we know this is an index scan, we can prefetch exactly the pages we need.

Yes? So the question is: when we have just index page 1 here, how do we know we'll need index page 3 and then index page 5? The answer is that unlike the sequential scan, where we know all of the pages are contiguous, we know something about the high-level layout of the data structure. We know how the pages in the index are laid out, and based on whether we're traversing left or right in the tree, we know which page we need to fetch next. When you get down to the leaf layer, there are pointers across all of the leaves that let you scan directly across; you don't have to traverse down the other side of the index.

Yes? So the question is: how are the system resources used for prefetching accounted for? That is, how do you know how many resources to allocate to prefetching versus the normal work going on in the system? That's a good question. There has been a lot of work on prefetching in the academic community, and in commercial systems it's a big selling point: a lot of effort goes into better buffer pool managers broadly, and in particular into better prefetching and figuring out exactly how much effort to devote to it. If you spend too many of your resources on prefetching, you hinder the real work going on in the system; if you don't do any, you end up with these I/O stalls. It's a delicate balance to strike between the two. Typically, and we'll see this in some of the other strategies, there are ways to mitigate over-allocating to prefetching. You can think about it in a simple way: if there's idle time or idle resources in the system, you can use them to perform these optimizations in the background. But it's certainly difficult, because you don't want to hinder the work being done by actually executing queries. Does that make sense?

Yes? So your statement is that the database file pages as well as the index pages are both stored in the buffer pool. Yes, they are both stored in the buffer pool. Like I said, in some cases, if you're splitting the buffer pool into smaller pieces, they may be separated: in the object ID case you could have one buffer pool just for the database files and a separate buffer pool for the indexes. But at a high level, yes, they share the same buffer pool. And the follow-up is that there should be prefetching in both dimensions, both for the database file pages and the index pages. Yes. It depends: if you split things into multiple buffer pools, you could have one prefetcher per buffer pool; if you have a shared buffer pool, then the single buffer pool manager needs to figure out how to balance resources between database files and index files. We'll talk more about indexes later in the semester, but in some cases you can answer queries entirely from the index; you don't even need to go to the database file. So there's a balance to strike when keeping both of those in the same buffer pool.
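As a rough illustration of the sequential case, here's a sketch under assumed interfaces; BufferPool, Contains, LoadFromDisk, and PREFETCH_DEPTH are all hypothetical names, not a real system's API:

```cpp
#include <cstdint>
#include <unordered_set>

using page_id_t = int32_t;

// Hypothetical stand-in for the buffer pool: it just tracks which pages
// are resident; LoadFromDisk would issue the actual (ideally async) read.
struct BufferPool {
  std::unordered_set<page_id_t> resident;
  bool Contains(page_id_t pid) const { return resident.count(pid) > 0; }
  void LoadFromDisk(page_id_t pid) { resident.insert(pid); }
};

constexpr int PREFETCH_DEPTH = 2;  // read-ahead window; an assumption

// When a sequential scan touches `current`, read ahead a few pages so
// the query never blocks on a disk stall for the next page it needs.
void OnSequentialRead(BufferPool &pool, page_id_t current,
                      page_id_t last_page) {
  for (page_id_t next = current + 1;
       next <= last_page && next <= current + PREFETCH_DEPTH; ++next) {
    if (!pool.Contains(next)) {
      pool.LoadFromDisk(next);
    }
  }
}
```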
Yes? So the question is: what is the number along the bottom of the tree? In this query here, there's just some field in the table, some attribute called "value." Ah, so you're asking whether the index just stores the value. I understand: the question is whether the index is built exclusively for this query, or whether it depends on the query at all. The answer is no: indexes are data structures built independently of queries. They're maintained by the system, and they can be used to answer many queries. Just think of this for now as a binary tree that sorts these values, between zero and 399 or whatever it is. The values themselves live in the database file, in pages, as parts of tuples. We're building an auxiliary data structure that sorts those values for us, and the system can build that index once and then reuse it to answer as many queries as you have that touch that value. Does that make sense?

Okay, if there are no more questions about prefetching, we can move on to the next optimization, which is called scan sharing. The basic idea is that queries can reuse data retrieved from storage, or potentially operator computations, by other queries. This is also called synchronized scans. It's different from result caching: I'm not talking about issuing a query that computes, say, how many students are in this course, then having another query ask the same question and just reusing the result. I'm talking about the individual data pages. If a query is reading pages from disk and putting them in memory, another query that needs the same pages can reuse them. It basically allows multiple queries to attach to a single cursor that's scanning a table (a cursor is just an iterator over the pages in a table). The queries don't have to be the same; that's not a requirement. But they do need to be accessing the same data, looking at the same pages, for this to make sense; otherwise the optimization wouldn't be possible. So if a query wants to scan a table and another query is already scanning the same table, the DBMS can attach the new query's cursor to the existing cursor.

There are lots of examples of systems that do this, at different levels. For example, in Oracle, the queries need to be identical to do cursor sharing. In previous versions of this lecture, Andy misspoke about this, and some people online were very upset. Fortunately I am much more careful than Andy, so I don't make these sorts of mistakes. On Stack Overflow, someone posted about it after watching one of the previous years' lectures: Andy had said this doesn't exist in Postgres. Ignore the comment about his hygiene; technical disagreements are fine, but I think that's a step too far, personally. If you go look at the Postgres documentation for newer versions, they do support this. It's called synchronized sequential scans, and you can enable or disable it in the system by passing a configuration parameter.
So how does scan sharing work? Imagine we have one really simple query: select the sum of all of the values from some table A. We're going to go get the disk pages we need: query one fetches page zero, page one, page two. When it moves on to page three, we need to evict a page. We'll talk more about how to choose pages for eviction in a minute, but basically we take page zero, throw it out, and replace it with page three. Now imagine that at this point in the execution, when we're already down at page three, another query comes along; call it query two. It's trying to do basically the same thing, except instead of computing a sum it's computing an average. It's the same scan over the table, just a different aggregate, so it has to look at all the same pages. We could start query two up at the top and scan from the beginning, but we've already looked at those pages for query one. The alternative in the scan sharing approach is to attach query two's cursor to query one's cursor and start from the same position in the scan: we start at page three rather than at the beginning. Now they go together through the remainder of the pages, four and five, and then query one is finished; it has looked at all the pages it needed. Query two then restarts from the beginning and scans through only the pages it hasn't seen before: it knows it started at page three, so it goes back and looks at the pages query one had already covered before it attached.

This is a good example of something we've talked about a lot: the relational model and SQL don't assume any kind of ordering; it's a set (or multiset) algebra. There's no relationship between the results and how, or in what order, tuples are stored, and this shows how you can still get the same sum and the same average regardless of where you start in the scan. In some cases, in particular if you have a LIMIT clause on your query, this can cause a problem, because if I issue the same query multiple times I could potentially get different answers. There's no determinism forcing the DBMS to give me the same 100 tuples: I could issue this query twice in a row and get two different results, assuming there are more than 100 tuples. This is something to be careful about if you're assuming ordering logic for a LIMIT. If there's no ORDER BY clause, a LIMIT can give you 100 tuples from any of these pages, and it's equally correct.

Okay, are there any questions about scan sharing? Yes? There are two parts to the question. The first part is: imagine you have multiple queries running, more than just two; you have to figure out which query you want to attach to. Yes, that's a good point, and there are probably different trade-offs. These two queries are scanning the exact same range; I imagine you would want to attach to the query that maximizes the shared range, because you want to get the most queries attached to the longest-running scan, if that makes sense. So there are different trade-offs, absolutely. We're not going to go into the details of more intelligent algorithms for deciding when to attach.
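To make the attach-and-wrap-around behavior concrete, here's a toy single-query sketch; SharedCursor is a hypothetical type, and a real implementation would also coordinate concurrent readers so the cursor advances exactly once per page:

```cpp
#include <cstdint>
#include <cstdio>
#include <mutex>

using page_id_t = int32_t;

// Toy shared cursor over pages 0..num_pages-1 of a table. All attached
// queries consume the page the cursor currently points at.
struct SharedCursor {
  std::mutex latch;
  page_id_t position = 0;
  page_id_t num_pages;
  explicit SharedCursor(page_id_t n) : num_pages(n) {}

  page_id_t Current() {
    std::lock_guard<std::mutex> g(latch);
    return position;
  }
  void Advance() {
    std::lock_guard<std::mutex> g(latch);
    position = (position + 1) % num_pages;  // wrap to serve latecomers
  }
};

// A query remembers where it attached; it is done once the cursor wraps
// back around to that start position, having seen every page exactly once.
void RunAttachedScan(SharedCursor &cursor) {
  page_id_t start = cursor.Current();
  do {
    std::printf("processing page %d\n", cursor.Current());
    cursor.Advance();
  } while (cursor.Current() != start);
}
```

For example, a query attaching when the cursor is at page three of six would process pages 3, 4, 5, then wrap around to 0, 1, 2 and stop, exactly like query two in the example above.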
There was a research project recently where, as I recall, scans would only start from the beginning: you get, I don't know, thousands of these concurrent queries, they all attach to one scan that runs all the way through to the end, and any new queries that come in after that just block, waiting for the loop to get back around to the beginning. So there's a fair amount of logic, or difficulty, that goes into achieving this. Can you remind me of the second part of the question? So the question is: do you have to stall a query while you're attaching to it? The answer is yes; there is some overhead associated with doing this. You have to block one query while you're waiting; there has to be some synchronization mechanism. And the question you have to answer before you decide to do it is whether it's going to be worth the trouble. If you're waiting on a disk stall, then any synchronization you do in memory, in the CPU, is basically irrelevant; it's completely masked by the disk I/O. But maybe, if the existing query is on the very last page it needs, it doesn't make sense to attach to it. So again, there are trade-offs to consider when figuring out when to apply this optimization.

Any other questions? Yes? So the question is: can you attach all queries to the same cursor, the same scan, together? Sure, in theory you could attach as many queries as you want to the same scan. If there are joins or other operators that access pages from other tables, then I suppose you could attach queries to individual scans within a query plan. Imagine they all happen to scan this table A, and then query one scans table B next while query two scans table C: they could attach for the portion of the table A scan they share and then diverge. Sure, you could have arbitrarily complex schemes like that, but again you have to weigh the overhead of figuring out the best scan sharing plan against the advantages it would actually give you. Does that answer the question? Okay.

The next, and I think last, optimization I want to talk about is buffer pool bypass. Basically, the sequential scan operator is not going to persist the fetched pages in the buffer pool, in order to avoid overhead. What this is doing is fetching a page and saying: okay, I'm never going to need this again, so you can overwrite it or evict it right away. Because you're doing a sequential scan, once you've looked at a page you don't need to look at it again. In theory this is pretty straightforward to do, though there are some tricky details around it. I think these are called "light scans" in Informix, and different systems support them to varying degrees.

Okay, so those are all of the optimizations we're going to talk about at the level of the buffer pool. The next piece we need to get to is the different strategies for replacing pages in the buffer pool when it fills up.
Most disk operations we perform go through the APIs provided by the OS; remember, we talked about mmap and alternatives like read and write. Unless you tell it not to, the operating system maintains its own file system cache, usually called the page cache. When you request to read a page from disk, that request goes to the operating system; if the page isn't already loaded, the OS fetches it from disk, loads it into its page cache, and returns a pointer to the page, which you then have to copy from the OS page cache into user space. So you end up with a redundant copy of the page: one in the OS page cache and one in the buffer pool you've implemented. What most DBMSs do is use the O_DIRECT flag to bypass the OS's page cache. You're telling the OS: just read this page in for me, I'll take care of it from here; don't cache it, don't keep it around. If you rely on the OS's caching mechanism, you get all the bad properties we talked about in the previous lecture, so again, most systems use O_DIRECT to prevent the OS from keeping a separate, redundant copy in its page cache.

There are different strategies we can employ to figure out which pages to keep in memory when the buffer pool fills up. When we're out of space, we need to free up a frame in order to bring a new page in: how do we decide which pages to evict from our buffer pool? There are several goals to balance when choosing an algorithm. Correctness: we don't want our data corrupted, for example by throwing away or improperly writing out dirty pages, and we want our answers to be accurate, which is a similar concern. Speed: we want to do the replacement quickly; if you spend all your time computing the optimal page to evict, you might spend more time on that than the benefit you'd get from the more intelligent algorithm. And finally, we need to worry about how much metadata we're storing in order to do the eviction. In choosing an algorithm, we balance all of these concerns.

The most straightforward policy, used in a lot of places across different systems areas, is called least recently used, or LRU. For each page we maintain a single timestamp of when it was last accessed by a query: one field holding the time of the page's last access. When the DBMS needs to evict a page, it's simple: we find the page with the oldest timestamp, meaning it was least recently accessed. The intuition is that pages we've accessed more recently are more likely to be useful. This works if there's skew in your workload and there are a couple of pages you need to access frequently, or with various other access patterns; there's some temporal aspect to the accesses, and the intuition is that if we haven't looked at a page in a while, we can just get rid of it. And of course, you can keep the pages in some kind of sorted order using a data structure, like a heap, to quickly figure out which one is least recently used when you want to evict.
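Here's a minimal sketch of that LRU bookkeeping (the names are assumed, and pin checks are omitted for brevity). The linear scan in Victim is exactly what the heap or sorted structure just mentioned would avoid:

```cpp
#include <cstdint>
#include <unordered_map>

using page_id_t = int32_t;

// Minimal LRU bookkeeping: one logical timestamp per resident page.
struct LruTracker {
  uint64_t clock = 0;  // logical time, bumped on every access
  std::unordered_map<page_id_t, uint64_t> last_access;

  void RecordAccess(page_id_t pid) { last_access[pid] = ++clock; }

  // Evict the page with the oldest timestamp (the least recently used).
  // Returns false if nothing is tracked.
  bool Victim(page_id_t *out) {
    if (last_access.empty()) return false;
    auto oldest = last_access.begin();
    for (auto it = last_access.begin(); it != last_access.end(); ++it) {
      if (it->second < oldest->second) oldest = it;
    }
    *out = oldest->first;
    last_access.erase(oldest);
    return true;
  }
};
```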
A little variation on LRU is the clock strategy, and this also gets used a lot. It's basically an approximation of LRU: rather than storing a complete timestamp per page, we allocate essentially one reference bit per page. When a page is accessed, we set its reference bit to one. The pages sit in a circular buffer, and you can think of a clock hand visiting each of the bits. As the clock hand sweeps past, if a page's bit is set to one, we set it to zero; if the bit is already zero, the page hasn't been touched since the last time we looked at it, so we evict it.

I can show a visual example; this might make it clearer. We have all these pages in our buffer pool, and we set all the reference bits to zero, meaning we haven't looked at them. We access page one, so we update its bit to one: we've looked at it. Now the clock hand sweeps around the pages. It starts at page one, sees the reference bit is set to one, sets it to zero, and moves on. The next page is page two; its reference bit is zero, which means we haven't looked at it since the last time the clock hand passed, so we evict it. We throw it out; the assumption is that it wasn't used recently, so it's not important to keep around. We replace that page with page five, and let's say that in the meantime pages three and four both get accessed, so we update their reference bits. The clock hand keeps moving around, resetting bits, until it gets back to page one where we started. And notice that page one's bit wasn't set in the interim: it's zero, it hasn't been looked at recently, so we can evict it and replace it with a new page. Does this make sense?

Yes, question? So the question is: as the number of pages, the size of the buffer pool, increases, this scan has to grow linearly with it, right? Yes. What this algorithm buys you is that rather than maintaining an entire timestamp, four or eight bytes per page, you allocate essentially one bit per page, so the metadata grows at a much slower rate. The scan itself is the same as with the LRU strategy: you scan forward looking for pages to evict until you find one. Does that make sense?

Yes? The question is: is this replacement strategy distinct from whether or not the page is pinned? Yes. A pinned page is one we want to make sure doesn't get evicted; this strategy just decides whether we think a page will be used in the near future, and so whether we should keep it around. So yes, they are distinct.

Yes? The question is: when you add a page to the buffer pool, should its bit be set to zero or one by default? Maybe this is just the animation here, but when a page is added, the clock hand should already be at that slot. Because the clock hand is already there, we don't need to set the bit to one: we know the page was just used because we loaded it. What we want to know is whether it gets used again in the time it takes the clock hand to work its way all the way back around. That's why the bit is zero when the page is first loaded.

Yes? The question is: does the clock hand always start at the same position? Think of this as an array of bits: you start at the zeroth position and work your way along until you hit the end of the array, then start again at the zeroth position. When you need to find a page to evict, you stop there and do the eviction; when you need another page, you continue scanning forward, looping around like a circular buffer of bits. Yes, the statement is that you start from the last position where you did an eviction and continue forward from there. That's correct.

Yes? So the question is: LRU is on-demand eviction, where you only evict pages when you need a new one; is the clock algorithm the same, or is it some kind of background process? It could be either. You could preemptively evict pages in LRU too: evict the least recently used ones ahead of time to make space. For simplicity we're describing it here as purely on-demand, but if you think about how prefetching works, if you want to go ahead and prefetch some pages and you need space, you could apply this preemptively. The algorithm doesn't depend on whether it runs on demand or as a background process.

Yes? I think your question is: do you append newly added pages to the end of the array, for the clock bits here? Usually we have a fixed-size buffer pool, so we know how much space we want to allocate, and you can preallocate a bit array as large as the number of frames, the slots, in your buffer pool. If you're increasing the size of the buffer pool, then yes, you'd have to grow the bit array. The clock bits are just for the frame slots in the buffer pool, which is a fixed size known at startup; it's not one bit per page on disk, just a fixed array of bits, one for each frame in the buffer pool. Does that make sense? Any other questions? Okay.
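Before moving on, here's a sketch of that sweep, assuming a fixed number of frames and ignoring pins for brevity:

```cpp
#include <cstddef>
#include <vector>

// Clock replacement: one reference bit per frame, swept by a "hand".
struct ClockReplacer {
  std::vector<bool> ref_bit;  // one bit per buffer pool frame
  std::size_t hand = 0;

  explicit ClockReplacer(std::size_t num_frames)
      : ref_bit(num_frames, false) {}

  // Called when the page in `frame` is accessed.
  void RecordAccess(std::size_t frame) { ref_bit[frame] = true; }

  // Sweep: give each recently used frame a second chance (clear its
  // bit); evict the first frame whose bit is already clear.
  std::size_t Victim() {
    while (true) {
      if (ref_bit[hand]) {
        ref_bit[hand] = false;  // second chance
        hand = (hand + 1) % ref_bit.size();
      } else {
        std::size_t victim = hand;  // untouched since the last sweep
        hand = (hand + 1) % ref_bit.size();
        return victim;
      }
    }
  }
};
```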
So, there are some issues with LRU and clock replacement policies: they're susceptible to something called sequential flooding. Like I mentioned, the main assumption you're making is that pages you've recently accessed are going to be used again soon. That's good for skewed access patterns, but it's bad for sequential scans. If a query performs a sequential scan that needs to read every page, you end up polluting the buffer pool with pages that are read once and never looked at again. In some workloads, what ends up happening is that the most recently used page is actually the one you need least: you're not going to look at it again.

I'll show a quick example of how that might look. This query, query one, needs to scan through all of the pages to find where id equals one. First we get page zero and load it in. Then query two comes in at the beginning and does a bunch of scans, and when it gets down to page three, with query one idle, query two needs a new page. The least recently used page was page zero, so we evict it, and page three gets swapped in. Now query three comes online (sorry, that's mislabeled on the slide), and it needs to scan through, but we've just thrown out page zero, which query three now wants, because query two evicted it as the least recently used. So query three runs into this kind of thrashing, where pages that were used least recently are in fact needed again. Are there any questions about that? I want to try to hurry up here; there are a few more things to cover, and I also want to leave some time for the project. Great.

So, some better policies. One example is LRU-K, where rather than tracking just the most recent timestamp at which a page was accessed, you look at the history of the last K references. Assume K is two for now: you keep the two most recent timestamps at which a page was accessed, and then you can compute the interval between subsequent accesses. You can ask: what was the time interval? Was it long or short? If it's long, the page doesn't get accessed that frequently, and we can throw it out; if it's much shorter, we might want to keep it around. You can also do more advanced things by projecting into the future: if the average access interval for a page is, say, every 10 minutes, you may be able to project that 10 minutes from now you'll need it again, and do some more intelligent prefetching there. The basic idea is that rather than keeping only the most recent access timestamp, which in the sequential scan case can be misleading, we keep multiple timestamps to track intervals. And again there's the trade-off: the overhead of the metadata we're storing and of computing these intervals, versus the benefit of the more intelligent strategy.
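Here's a rough sketch of LRU-K bookkeeping with K = 2, under assumed names; pages with fewer than K recorded accesses are treated as the oldest candidates:

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>

using page_id_t = int32_t;
constexpr std::size_t K = 2;  // track the last K access timestamps

// LRU-K sketch: keep the last K access times per page and evict the
// page whose K-th most recent access is furthest in the past.
struct LruKTracker {
  uint64_t clock = 0;
  std::unordered_map<page_id_t, std::deque<uint64_t>> history;

  void RecordAccess(page_id_t pid) {
    auto &h = history[pid];
    h.push_back(++clock);
    if (h.size() > K) h.pop_front();  // keep only the last K timestamps
  }

  bool Victim(page_id_t *out) {
    bool found = false;
    uint64_t best = 0;
    for (auto &[pid, h] : history) {
      // Pages with fewer than K accesses count as "infinitely old".
      uint64_t kth = (h.size() < K) ? 0 : h.front();
      if (!found || kth < best) {
        best = kth;
        found = true;
        *out = pid;
      }
    }
    if (found) history.erase(*out);
    return found;
  }
};
```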
Another alternative is to use some notion of localization, where we make eviction decisions on a per-query or per-transaction basis. This minimizes pollution of the buffer pool, but you run into the struggle between the global optimization problem across multiple queries and making decisions based on just a single query.

Another better policy might be to provide priority hints based on different access patterns. Think back to the index example: the DBMS knows the access patterns for an index versus a sequential scan, so it can provide hints to the buffer pool about which pages are important and which ones we might not care about. For example, imagine we're inserting some contiguous IDs. Since the values are monotonically increasing, they're always going to get inserted on the right side of our tree, so our access path for new insertions is always down that side. Conversely, if you have a scan you want to perform, your access path might be different. But one thing both have in common is that you always start at the root page: you always have to start at the root of the tree in order to traverse down it. So one hint you can provide to the buffer pool manager is to keep this root page around; both of these types of queries are going to need it, so it's an important page you might not want to evict.

Now, the question about dirty pages from back at the beginning of class. There are trade-offs to consider here. It's fast if you have a page in your buffer pool that's not dirty: you can just drop it and overwrite it. We don't need to keep it around, because there have been no changes and it's backed up on disk; we can always get it back if we need it. It's slower if you have to evict dirty pages: the DBMS has to take the page and write it back to disk to ensure the changes are persisted. So there's a trade-off between fast evictions and writing out dirty pages, depending on when things will be read or accessed again in the future. Typically, what you'd prefer is to evict non-dirty pages, because you don't incur any writes; you just overwrite them directly.

One answer is a background writing process: the DBMS can periodically walk through the page table and write dirty pages to disk. This is similar to the earlier question about on-demand versus background eviction; think of it as an optimization the DBMS can do in the background. When a page is safely written to disk, you can unset its dirty flag, since you know it's persisted. But there are two important things to be careful about if you do this. First, you must not write any dirty pages before the log records for the transaction are persisted (we'll talk about write-ahead logging in later lectures). Second, if a page is pinned, if it's currently being worked on by a transaction that hasn't committed, we don't want to evict it.
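Here's a sketch of such a background pass with both caveats encoded; LogRecordsPersistedFor and WriteFrameToDisk are hypothetical stubs standing in for the write-ahead log and the disk manager:

```cpp
#include <cstdint>
#include <unordered_map>

using page_id_t = int32_t;
using frame_id_t = int32_t;

struct PageMeta {
  frame_id_t frame_id;
  bool is_dirty = false;
  int pin_count = 0;
};

// Hypothetical stubs standing in for the WAL and the disk manager.
bool LogRecordsPersistedFor(page_id_t /*pid*/) { return true; }
void WriteFrameToDisk(page_id_t /*pid*/, frame_id_t /*fid*/) {}

// Periodically walk the page table and flush dirty pages, skipping
// pinned pages and never writing a page before its log records.
void BackgroundFlush(std::unordered_map<page_id_t, PageMeta> &page_table) {
  for (auto &[pid, meta] : page_table) {
    if (!meta.is_dirty || meta.pin_count > 0) continue;
    if (!LogRecordsPersistedFor(pid)) continue;  // write-ahead-log rule
    WriteFrameToDisk(pid, meta.frame_id);
    meta.is_dirty = false;  // now safely persisted; cheap to evict later
  }
}
```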
The final topic is other memory pools. The DBMS needs memory for a lot of things other than just tuples and indexes, and those things may or may not always be backed by disk. The buffer pool is responsible for buffering things between disk and memory, but there are other situations where you don't need persistence: intermediate query results, cached results, or other buffering that isn't critical to persist. In these cases you can use memory pools outside of the buffer pool, so you don't have to worry about polluting it with ephemeral in-memory things you only need for query processing.

Just to wrap up, I want to hammer home the point that the DBMS can almost always manage memory better than the OS. We've seen a lot of examples of why we prefer that the OS not get involved with file I/O and memory management, and there are lots of opportunities to leverage semantics about the query plan to make better decisions about evictions, allocations, and prefetching. Next class is going to be about hash tables; it's exciting, I don't need to belabor that.

Now I just want to quickly go through project number one. I mentioned it was released on Monday and is due Sunday, September 26th at 11:59 p.m. You're going to be building the first component of your storage manager, the buffer pool manager, and it has three parts: an LRU replacement policy, a buffer pool manager instance, and a parallel buffer pool manager. You don't have to worry about the disk side; we're providing all of the base code for disk writes, disk reads, file I/O, and the page layouts, so you're just focusing on the buffer pool. I'll quickly go through these three pieces, and again, there's the Q&A session where you can get more detailed answers if you have questions.

First, the LRU replacement policy. You need to build a data structure that tracks the usage of pages using the LRU policy we discussed. It's important to remember to check the pin status of a page: you should never evict pinned pages. And if there are no pages that have been accessed, you just return the lowest page ID.

The next piece: once you have the LRU strategy in place, the buffer pool manager layer uses the LRU replacer you built to manage the allocation of pages. You need to maintain internal data structures to track the pages in your page table. Again, we're providing the file I/O components, and you can use whatever data structures you want; anything from the STL, implemented however you see fit. One important thing to remember here is to get the order of operations correct when you're pinning pages.

The final piece: we talked about the multiple buffer pools optimization, where you can have separate buffer pools for different pages. I mentioned two ways to do it. The first was the object ID for different page types; since right now we only have one type of object in the database system, which is a table, we're not going to use that one. We're going to do the second approach, hashing, except it's really simple: you don't have to apply a real hash function. Just take the page ID, the page number (the identity function, if you want to get technical), and mod it by the number of buffer pool instances you have, to figure out the mapping from page ID to buffer pool.
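That mapping is only a few lines; here's a sketch of the routing idea (this is not the project's actual class layout, just the modulo dispatch described above):

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

using page_id_t = int32_t;

// Stand-in for the single-pool manager from part two of the project.
struct BufferPoolManagerInstance {
  char *FetchPage(page_id_t /*pid*/) {
    return nullptr;  // single-pool logic would go here
  }
};

// The parallel manager routes each page id to one instance using the
// identity "hash" modulo the number of instances.
struct ParallelBufferPoolManager {
  std::vector<std::unique_ptr<BufferPoolManagerInstance>> instances;

  explicit ParallelBufferPoolManager(std::size_t n) {
    for (std::size_t i = 0; i < n; i++) {
      instances.push_back(std::make_unique<BufferPoolManagerInstance>());
    }
  }

  char *FetchPage(page_id_t pid) {
    return instances[static_cast<std::size_t>(pid) % instances.size()]
        ->FetchPage(pid);
  }
};
```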
A few things to note. Please do not change any files other than, I think it's six (you can double-check and correct me if I'm wrong, but I believe it's six files that you hand in). If you change anything else in the system, it may be great, but it's not going to be graded: if you submit any other files, we're just going to wipe them out and consider only the six that need to be changed. Another important point is that the projects are cumulative, and we are not going to be providing solutions to earlier projects. So if you really mess up this buffer pool project, you have later projects that build upon it, and you're going to be in trouble. Please try to stay on top of this and keep up to date with it. If you have any questions, post on Piazza or come to office hours, but we are not going to debug your code for you.

One last thing, sorry, two last things. Extra credit: you're going to submit your code, and since this is a large class, we're going to take the top 20 fastest implementations. If you really optimize it, the fastest implementation is going to get 50 bonus points on the project, the implementations in second through 10th place are going to get 25, and 11th through 20th place are going to get 10 bonus points. And this is very important: the student with the most bonus points at the end of the semester will receive a BusTub t-shirt. You can ask people who were in previous years of the course; this really improved people's dating lives. It's a highly sought-after prize that you really, really want to make sure to get.

So thank you, and I will see you next week.