How was the concert on Saturday? It went well. It went well — we red-lined a bunch of speakers. What does it mean to red-line speakers? I mean, if the speakers can work at like a hundred percent, we were working them at 200 percent. Did you break them? Next time, maybe. Okay, nice. All right. How many people showed up? 200. 200, that's impressive. Okay. And your next show is when? October, like after the fall break. That's off campus though. Okay, we'll post that on Piazza.

All right guys, a lot to cover. Some quick administrative things to go over. Homework 2 has been out. We bumped the due date to October 4th because we didn't want it to line up on the same date that Project 1 is due — the schedule got thrown off with Labor Day and everything. The material in that homework covers things we'll discuss next week, so we postponed it. Project 1 is still on track, due October 2nd, and again, that's on a Sunday. We're having the Q&A session today, Monday, at 6:30 p.m. That'll be on Zoom, and there's a post about it on Piazza. And then, for all the projects again, there won't be any office hours on the Sunday when a project is due, but there'll be a special office hour session on campus the Saturday before it's due, with like four TAs, between 3 and 5 p.m. This is meant to be a forcing function for you guys to actually start working on the project, instead of showing up to office hours on the day it's due saying, you know, "it doesn't compile," right? We want you to start sooner rather than later, okay? I think a couple of people have already gotten 100% on Project 1, even though today's lecture discusses Project 1, so that's impressive.

Yes? How much time should we expect to spend to complete the entire project? The question is, how much time should you expect to spend to complete the entire project? I mean, it depends on your background. Some people rip through C++ with no problems; other people will struggle, right? Project 1 is more work than Project 0 and less work than Project 2. It's reasonable. It's not bad. Any other questions?

All right, so a bunch of other extracurricular things you can do in the class. Today we're having another seminar talk from CMU alum Dana Van Aken, to talk about OtterTune, the startup that I'm involved in. Next week we'll have one of the co-founders of PostgresML. This is a hosted version of PostgreSQL where they put in UDFs so you can make PyTorch calls directly in SQL. They've also been working on a PostgreSQL proxy called PgCat that's written in Rust and is gaining in popularity. And then the week after that — actually not the 18th, it should be October 2nd — we'll have the CTO and co-founder of Weaviate, one of these vector databases, come give a talk on Zoom. Okay, yes? Are there any internship opportunities at any of these companies? The question is, are there any internship opportunities at any of these companies? Absolutely, yes. We can figure out the best way to contact them, but yes — all of these companies are hiring.

So there were two big pieces of database news in the week since we last had class. Do you know what they were? PostgreSQL released version 16. It's not a big game changer in the sense that there aren't amazing new features; it's a lot of refinement and improvements. They're nice to have.
Again, a game-changer improvement, in my opinion, would be something like when they added just-in-time compilation for WHERE clauses, which we'll cover in a few weeks. The de-duplication for B-trees is kind of nice. There's a bunch of I/O stuff that's still not ready for production, which we'll talk about a little bit today, but it's a nice-to-have. PostgreSQL has been putting out releases once a year, and it's been kind of nice. And then Databricks announced they raised a Series I of $500 million at a $43 billion valuation. That's a lot of money. Like someone was saying, yeah, Snowflake didn't raise that much before they went to IPO. But anyway, they're hiring.

All right. So last class we finished up the discussion on the storage aspect of database systems, really focusing on how the database system represents data on disk. We talked about tuple-oriented storage, log-structured storage, and index-organized storage. So now we know what these files look like on disk and how they're broken up into pages. Today's class, and going forward, is really about how we get those pages from disk, bring them into memory, and then do something with them. And of course, the whole goal of this conceptual system we're building is a database system that gives the illusion that we have more memory than we actually have. The database is larger than what fits in memory, but we want to make it look like we can fit everything in memory. So today is really about how we go get the things we need from disk — those pages — bring them into memory, and then make decisions about how to remove them in order to free up space when we want to bring new things in. All right?

So there are two key aspects we have to consider in any decision we make about how we move data back and forth to disk. The first is where we want to write our pages on disk: how can we lay them out in such a way that we maximize the amount of sequential I/O we're doing instead of random I/O? The idea here is that we want to keep pages that are going to be used together close to each other physically on disk, so that when a query or some task is running inside our database system and we have to go fetch a bunch of pages, those reads are sequential when we bring them into memory.

The next thing we have to consider is when we read things into memory — and this is typically done on demand. We'll see prefetching in a second, but the system doesn't just go randomly reading things because it wants to, all right? If you insert a bunch of data and then never read it, the system is not going to read it back into memory just for the hell of it. So it's going to be on demand: we go get things from disk and bring them into memory. Then the question is when we want to evict a page from memory. When was the last time it was accessed? How was it accessed? Was it updated since we brought it into memory, and therefore do we need to write it out? We're not going to talk about how to make sure we can save our changes in case of a crash or failure — we'll talk a little about that at the end, but that will be a major focus of a lecture after the midterm. This is really about: okay, I have to decide what data to evict; how do I make that decision?

So this is the same diagram I showed before, where we have the database file on disk and it's broken up into a bunch of pages.
And now what we're talking about today is this piece here called the buffer pool. I'll call it the buffer pool manager; some systems call it the buffer cache or the cache manager. They all basically mean the same thing: it's the memory that the database system allocates from the operating system and controls on its own.

So your execution engine starts doing something — say it's running a query; it doesn't have to be, but say it is — and at some point it says, I need to get page number two. So we go get the page directory from disk and bring that into memory if it's not already there. We look at our page directory, and that tells us, for page number two, here's the file and the offset where to find that particular page. We fetch it into memory, put it into a free spot in our buffer pool, and then give the execution engine back a pointer to that page sitting in the buffer pool. Now let's say we run some kind of eviction policy or replacement policy and we decide to remove page two — for whatever reason: we need more space, we read a bunch of other data, it doesn't matter. When the execution engine comes back and says, give me page two again, it's not in memory, so you have to go to disk and get it, but this time it might land in a different location in our buffer pool. And the execution engine, and all the other parts of the system above the buffer pool, obviously shouldn't care that it's now in a different location.

This is different than mmap, right? If you mmap a file — a memory-mapped file — when you bring it into your address space, any time you jump to an address in that mapped region you're always going to get the same page; the operating system guarantees that for you. The page may not be in memory when you access it, and you'll get a stall while it fetches it in, but it's always at the same address. In our system, we're not going to do that. The same page can be in different locations every time it's brought in and out of memory, because we need that freedom — who knows what's going to be in memory the next time we go fetch a page. So again, this is much different than mmap, where the OS takes care of all of this for you. Here the database system calls malloc and then manages all this memory itself, because as we'll see as we go along, it's always in a better position to make the best decisions about how to optimize this.

So for today's agenda, we'll talk a little bit about what a buffer pool manager is and what optimizations we can add to it — the order actually got switched around. Then we'll talk about buffer pool replacement policies, then disk I/O scheduling, and then we'll briefly mention that there are other memory pools in our database system that may not always be backed by the buffer pool manager; they might be an ephemeral cache for certain things.

So again, I'm going to call it a buffer pool manager. The textbook calls it the buffer manager. I think Oracle calls it the buffer cache. At a high level, it looks like this. It's just a region of memory that we've allocated from the OS, and we're going to logically chunk it up into fixed-size pages based on the page size of the database system. We said 8 kilobytes; MySQL is 16 kilobytes; with compression it could be a different size. It doesn't matter — we're breaking it up based on those page sizes.
An entry or location in our buffer pool memory that we can use to install a page is what we're going to call a frame. So the system boots up, calls malloc, gets a bunch of memory, and then breaks it up and divides it into frames. Then, as the execution engine or other parts of the database system start requesting pages, we make an exact copy of each page from disk into memory and put it into one of these frames. The reason we have to call them frames is because we're running out of terms. We can't call it a page, we can't call it a block because we already used that, and we can't call it a slot because we have the slot array — so for whatever reason, we call it a frame.

So again, somebody needs page one, so we find a free frame in our buffer pool and we just copy that page from disk into memory. Same thing here: I need page three, I find a free frame, and I copy it into memory there. But again, notice that page one and page three are not contiguous on disk, because there's page two in between them, yet when we bring them into our buffer pool we're free to put them in any location we want. It doesn't matter.

Now, if we modify one of these pages, we're not required to flush the data back to disk right away. And again, we'll cover durability and recovery after the midterm, but this is the key difference between a write-through cache and a write-back cache. With a write-through cache, like in the OS, when you write something to the cache it immediately gets written out to disk as well. With a write-back cache, we write it in memory, but we're not required to write it back right away — we'll do it at some later point, whether through a background thread or as part of the eviction policy. And we won't talk about this today, but there will be a separate log file, the write-ahead log, that keeps track of the changes we made, and we'll make sure that gets flushed to disk before our dirty pages do. You don't have to know that for now. Just be mindful that even though we may update pages in memory, we're not required to write them back to disk right away.

The internal data structure we're going to use to keep track of what's actually in our frames is called the page table. Again, the OS has its own page table; this is the database system's page table. It's better. It's typically a fixed-size hash table that keeps track of all my frames, identified by some frame ID, along with the page that's currently residing in each frame — it could just be a pointer to where that page is actually located. And we're going to protect this page table with a latch, which I'll describe in a second — think of it like a mutex — so that we can have multiple threads or multiple workers accessing the page table at the same time. I don't want to use the term "threads"; it's better to use "workers," because in Postgres and older systems they're not multi-threaded, they're multi-process, but the idea is still the same. We want to make sure that when multiple workers are touching and updating things inside critical sections, we don't break anything. So in addition to keeping track of the pointer to the page in its frame in the buffer pool, we're also going to have additional metadata about how the pages are being used throughout the system.
The first thing we'd have, obviously, is a dirty flag that tells us whether a query has updated the page since we last brought it into memory. We'll also have a pin or reference counter that keeps track of the number of workers that require this page to remain in memory, meaning it can't be evicted when we run our eviction policy. So for each page — say page three here — say there's some query accessing it at this given time; in our page table we have a little counter that says there's at least one worker accessing it. Now say another query comes along looking for another page that's not in our page table. We take a latch on the page table to protect it, go fetch the data we need — say in this case page two — copy it into a free frame in our buffer pool, update the page table to point to that frame, and update any metadata we need to track about it, like who accessed it, when they last accessed it, and so forth. Once this query is done making its updates to the page table, we release the latch, the worker goes back to whatever it was doing, and now any other worker that comes along looking for page two will find it in the page table. Pretty simple, right?

So I used this term "latch" and purposely did not say "lock." Can anybody guess why? Yes? So he's saying the locks and latches we have are different, and that latches have some database magic that's better for our purposes than locks. Not quite. Yes? So he said locks are for user space. I mean, the database system is running in user space, but yes — locks are for, say, logical things in the database. Correct.

This trips people up when they come from more of an OS background. In the database world we have this distinction between locks and latches. A lock protects higher-level concepts or objects in the database: a tuple, a table, a database. We take locks on those things. We haven't discussed what transactions are yet, but think of it as: I want to do multiple updates, spread across multiple round trips of SQL queries, and if I take a lock on something, I want to hold it for the length of that transaction. And because it's the application that is effectively acquiring these locks — or rather, the database system acquires them on behalf of the application — we assume the application is stupid, and therefore we need to make sure it doesn't get deadlocks or other problems. So we have these additional protection mechanisms to make sure the JavaScript programmer doesn't do something they shouldn't be doing.

Latches are the low-level internal primitives we use to protect the critical sections of our database system, and these are what the database system developers use. You had to use one in Project 0 — you had to take a mutex. So a latch is basically a low-level mutex. And because latches are used by the database system developers — meaning us — they're not going to have deadlock detection and the other protection mechanisms, because if we're the ones building the database system, we need to be smart enough to make sure we don't cause deadlocks ourselves. Latches are really meant for quick in-and-out critical sections: do something and release it. And through programmer discipline, we're the ones who have to make sure we don't have deadlocks.
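To make this concrete, here's a minimal sketch of a frame with that metadata and a latch-protected page table. The names (FrameMeta, PageTable, and so on) are hypothetical, just for illustration — this is not BusTub's actual API, and a real implementation would handle misses, eviction, and concurrency much more carefully.

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

using page_id_t  = int32_t;
using frame_id_t = int32_t;

// Per-frame metadata described above (hypothetical layout).
struct FrameMeta {
  page_id_t page_id;    // which disk page currently lives in this frame
  char     *data;       // pointer into the buffer pool's memory region
  bool      is_dirty;   // modified since it was read from disk?
  int       pin_count;  // number of workers currently using this page
};

class PageTable {
 public:
  // Return the frame holding `pid`, pinning it, or nullptr on a miss
  // (in which case the caller has to go to the page directory and disk).
  FrameMeta *LookupAndPin(page_id_t pid) {
    std::scoped_lock guard(latch_);   // the latch protects the table itself
    auto it = map_.find(pid);
    if (it == map_.end()) {
      return nullptr;
    }
    it->second->pin_count++;          // the pin protects the page itself
    return it->second;
  }

  // Record that `pid` now lives in `frame` after it was read from disk.
  void Install(page_id_t pid, FrameMeta *frame) {
    std::scoped_lock guard(latch_);
    frame->page_id = pid;
    frame->is_dirty = false;
    frame->pin_count = 1;
    map_[pid] = frame;
  }

 private:
  std::mutex latch_;  // a latch is just a low-level mutex, held briefly
  std::unordered_map<page_id_t, FrameMeta *> map_;
};
```

The point of the sketch is the division of labor we just discussed: the latch is held only while reading or updating the table, and the pin count is what keeps the page itself from being evicted after the latch is released.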
Now, this is confusing, because the C++ standard library also uses the term latch (std::latch), but that's just a countdown barrier. We don't want that; we don't need that. We're going to roll our own latches. We'll see this more next week when we talk about index concurrency control for B+trees, but for now, just treat a latch as a mutex. We don't want to use the OS mutex, as that has other problems — we'll cover that later.

All right, another important distinction, and I've already said this: between the page directory and the page table. The page directory is a disk-resident mapping between page IDs and their locations on the physical disk, in the actual files themselves. The page table is an ephemeral, in-memory mapping that tells us, for a given page ID, the frame where it's actually located. If a page is not in our page table, we know we have to look in the page directory to find where it is on disk and go bring it in. So most of the time a query is going to be hitting the page table, and it's only when the page table says something isn't there that some mechanism, like the disk scheduler, has to look at the page directory, fetch the page from disk, and then put an entry in the page table.

So what I've shown you so far is a basic page table: a hash table plus some extra metadata. If the page is there, given a page ID you get back a pointer to it; if it's not there, there's some mechanism to go get it from disk and install it into a free frame, and then the page table has the entry, right? But this is going to be a big bottleneck unless we're clever about exploiting the information we know about what is going on inside our database system — to decide who gets what page, at what time, in what location, and how we evict things. If we don't exploit what we know about what our queries want to do, what our data looks like, and what our access patterns are, then we're no better than the OS. The OS doesn't see anything going on inside our database system. So if we just blindly take requests and service them as they come in, we're no better than the OS, right?

So the techniques we'll talk about — these optimizations — again, they motivate why we want to write our own buffer pool manager and not use the OS. And it'll be a combination of policies: some affect all queries running at the same time, and some just help a single query by itself, where we can isolate the decisions we make for that query so they don't affect others. I'm not going to say one of these approaches is better than another, but you'll see as we go along that all of the major database systems use some combination of all of them, or at least most of them. So I can't say which one is the most important to implement first, but we'll see as we go along.

All right, so the things I'll talk about: using multiple buffer pools — actually, that's probably the first one to implement; I take back what I said, multiple buffer pools is obviously the one you'd do first, and we'll see what that looks like — then prefetching, the more complicated scan sharing, and then buffer pool bypass.
All right, so in my toy example at the beginning, we said there was one page table, one set of frames, and that was it for the entire system, right? But because there are multiple workers running at the same time, we have to use these latches to protect the data structure, and with a large number of CPU cores and a large number of workers running at the same time, those latches are going to become a bottleneck. Because the page table is a fixed size, we don't have to have one latch for the entire page table — we can have latches for individual slots or locations in the hash table. But even then, if everybody is trying to get at the same small number of pages, those latches are going to be a bottleneck.

So an easy way to alleviate this contention is just to have multiple buffer pools. I still allocate the same total amount of memory — I tell the database when I boot up, I want 10 gigs of memory for my buffer pool — but I take those 10 gigs and divide them into equal-sized chunks, each with its own separate page table. It also means that for certain access patterns on certain objects in the database, I can have different policies for one buffer pool versus another, based on how I know each object is going to be used.

For example, DB2 probably has the most sophisticated buffer pool management configuration: you can define a table space — something like a namespace — that is backed by a given buffer pool. You can set what the page size should be for that buffer pool, and you can say which tables or indexes will be managed by it. So say you have one table that's primarily used for random access; you can give it eviction policies that are optimal for random access. Then you have another buffer pool for other tables where you're doing sequential scans, with different policies for those, maybe with larger page sizes. As far as I know, DB2 is the only one that lets you go this far; Postgres and MySQL can't do this, and I don't know about the other enterprise ones, but the DB2 mechanism is very sophisticated. Again, this lets you customize the buffer pool management for exactly how an object is going to be used.

So now the question is, how do you figure out at runtime which buffer pool to use? Say I have a really simple example here with two buffer pools. The first thing I could do, as in the DB2 case, is assign a buffer pool to back a given object based on some kind of identifier. Table 123, table whatever, that's buffer pool one, and all other tables go to buffer pool two. Then at runtime, when I have a request — this is obviously not SQL, but say through an index lookup I figured out that I want to look at record 123 — we saw before how we can break the record ID into its individual components, usually a page ID and a slot number, and in the case of SQL Server also a file number or object ID. So I can use the object ID from the record ID, do a lookup, and say, okay, object 456 is managed by buffer pool one, and send the request to that buffer pool. Requests for other objects land in the other buffer pool. I've isolated them so there's less latch contention between the two of them.
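Here's a minimal sketch of that routing decision — an explicit object-to-pool assignment with a hash-based fallback (which is what the simplest approach, described next, does). The names (BufferPoolRouter, Assign, PoolFor) are hypothetical, just to illustrate the idea.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

using object_id_t = uint32_t;

class BufferPool { /* one page table + one set of frames */ };

class BufferPoolRouter {
 public:
  explicit BufferPoolRouter(size_t num_pools) : pools_(num_pools) {}

  // DB2-style: explicitly assign a table or index to a specific pool.
  void Assign(object_id_t oid, size_t pool_idx) { assigned_[oid] = pool_idx; }

  // Pick the pool for a request: use the explicit assignment if there is
  // one, otherwise hash the object id across the pools.
  BufferPool &PoolFor(object_id_t oid) {
    auto it = assigned_.find(oid);
    size_t idx = (it != assigned_.end())
                     ? it->second
                     : std::hash<object_id_t>{}(oid) % pools_.size();
    return pools_[idx];
  }

 private:
  std::vector<BufferPool> pools_;
  std::unordered_map<object_id_t, size_t> assigned_;
};
```

In the hash-only case there's no central mapping at all — you just hash the identifier and you're done, which is part of why it's also the easiest version to implement.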
The simplest approach, then, is to do what MySQL does: you just take the record ID, hash it, modulo the number of buffer pools you have, and that tells you which one to go to. All right? Pretty simple.

Yes? Do you have to statically allocate how much memory each buffer pool has? The question is, do you have to statically allocate how much memory each buffer pool gets? Yes. And in PostgreSQL and MySQL — in most systems — you cannot change the size of the buffer pool without restarting the entire system. I think all of the open-source systems have that limitation. The commercial systems can be more clever: I think in Oracle you can increase the buffer pool size, and it'll allocate the memory and then slowly migrate pages over from one pool to the next. But that's a tricky thing to do when most systems are statically allocated.

So wouldn't that potentially waste a lot of memory? The question is, could this mean you're potentially wasting memory if you partition, say, by table? For example, if I make a new buffer pool, say it has 10 gigs, and I say, you're going to manage this table, but I never put any data in that table — is that wasting space? Yes. But the database system did exactly what you told it to do; the human was the stupid one, right? Yeah, you'd think, okay, only allocate it on demand as needed, but the amount of engineering effort to accommodate people misconfiguring things is probably not worth it in that case. If you're using DB2, in theory you should know what you're doing — if you're calling the create buffer pool command, you kind of should know what you're doing, right?

So again, in my opinion — I take back what I said earlier — this is actually the first optimization you should do to scale out your buffer pool manager, because it's not that much work. The hashing one is probably the easiest to do as well, because there's no central data structure that says, for this object, you go to this pool; you just hash it and you're done.

All right. The next optimization is prefetching. The OS can already do prefetching for simple cases, like when you're doing sequential scans; we'll see one case where it can't. The basic idea is that if you run a query and it starts accessing data in your table, it opens up a cursor that scans through the pages one by one. In our example here, the buffer pool starts off empty. The very first page the scan sees, page zero, is not in the buffer pool, so we go copy that in. Then we scan along; we need page one; page one's not there, so we go copy that in too. But now the database system can be smart and say, okay, you've read page zero and page one — it's very likely you're going to read pages two, three, and so forth. So let me go ahead and prefetch those while the system is processing page one: go get page two and page three and put them in. By the time you've finished processing page one and you come to page two, lo and behold, the next page you need is already there. And we just do this all the way down the line.

We haven't talked about how we execute queries yet, but the typical way it works is that you request a page, and it's going to have a bunch of tuples in it.
You do some kind of computation on the data in those tuples, and when you're done, you go get the next page, right, as you do your scan at the leaf nodes of the query plan tree. So it's not like we're just blindly ripping through the pages saying get, get, get. It's: get a page, then some think time, do some compute, then go get the next page. And that think time is where the database system says, I have some time to go ahead and prefetch the things you're going to need. I don't have it in the diagram here, but this is why we have that pinning mechanism from before: I obviously don't want to fetch page one, have the database system say in a background thread, okay, let me go prefetch pages two and three, and have it swap out page one while you're still accessing it. The pinning mechanism prevents that.

Yes? Is it possible for different buffer pools to hold the same page? Ah, this is a good point, thank you. The question is, is it possible for different buffer pools to hold the same page? No — it's always a one-to-one correspondence; every page can only exist in one buffer pool. Think about what would happen if you allowed that. You could have different page tables pointing at the same page, but then depending on where the metadata lives — the reference counter, the pin marker, things like that — if that's split across separate page tables, one buffer pool might say, okay, no one's accessing this thing, let me swap it out, while the other page table has it pinned, and you'd miss that. So for that reason, it's a one-to-one correspondence. Good point, thank you.

Yes? What's the relationship between the pin counter and the latch? The question is, what's the relationship between a pin counter and a latch? The pin counter says there's some worker accessing this page, but that worker isn't holding the page table while it does it. Go back to my original diagram — going back here. This guy says, give me page two. I take a latch on the page table to go get the pointer to that page. Then, before I hand the pointer back to my execution engine, I pin the page, because now I'm outside the page table: the latch protects the data structure, and once I'm out of the data structure, I still have a reference to the page. So the pin just says someone is actually reading this page. When I'm done with it, I decrement the reference counter, which potentially releases the pin, and now the buffer pool manager can decide: okay, for page two, nobody has it pinned, nobody is referencing it, so I'm free to evict it. So again, the latch protects the data structure; the pin protects the page.

Yes? And just to be clear, is the page table inside the buffer pool, or is it sort of separate? The question is whether the page table is separate from the buffer pool. It's part of the buffer pool — it's just not in this diagram; it's in this box, how about that? These are the frames, and there's some other page table data structure alongside them.

Yes? Regarding prefetching: would the buffer pool manager be able to know which pages are going to be accessed?
So his question is, does the buffer pool manager have access to the query plan? No, right? Because we have these layers. But you can send it hints. BusTub doesn't support those hints, but you can send hints like: I'm accessing this page, and here are likely the next pages I'm going to access as well.

Yes? Then you said something about what you might know statically — like, if I send a... Okay, so the question is, if I have a select query, who is figuring out what to prefetch? Maybe that's what you're really asking. This has to come from the execution engine, right? It's a division of responsibilities. The buffer pool manager shouldn't have to infer, for this query plan on this table, which pages I'm going to need. That all comes down from the execution engine, because, if you think about it, the execution engine knows what pages it needs to read anyway — it has to read them. All that logic is in that part of the system, which we'll cover in two weeks. So the execution engine should say: I'm fetching page one now, but by the way, I'm also going to fetch pages two and three. Think about it: in your select query here you could have a limit 10, right, with no where clause. Say in the first page I only got five tuples; therefore I know I'm going to have to read more pages, so you can send that request ahead of time, or something like that. And that's the beauty of having a declarative language like SQL: you know what you're going to do ahead of time, at least at a high level, with enough information to make these kinds of decisions.

Yes? So the question is, do you never prefetch pages unless you're absolutely certain you're going to need them? Maybe that's what you're asking. Actually, I don't know how aggressive they are. The commercial systems do this much better, and they're obviously closed-source. Sequential scans are pretty easy. For index scans — next slide — it can be trickier, because you may not be able to prefetch without following along the pages; you may not be able to jump ahead. So I think for index scans the prefetching is a bit more conservative, whereas in a sequential scan you can jump a lot farther ahead. And there are a bunch of other factors too. As is always the case in databases, there's multi-versioning: who knows whether the next page I'm going to read has ten tuples but only three are visible. It's complicated.

All right, so going back to the sequential scan: the OS can kind of do this. Assuming your pages are contiguous, the OS read-ahead can kind of figure it out. What it can't do is infer the logical data structure that the pages physically represent and prefetch according to that — because, again, it doesn't know what a B+tree is; it doesn't know what a hash table is at the page level. But we do, because we're the ones actually running it.

So say I have a query like this: select star from A where the value is between 100 and 250, and I can answer it with an index scan. We haven't talked about B+trees yet, but, again, it's a tree data structure that shouldn't be foreign to anyone in this class.
Assume that along the leaf nodes, the values are sorted based on the key. So to run this particular query, I start at the root node — that's page zero — and put that in my buffer pool. Then I traverse down this side of the tree, get page one, and put that in my buffer pool. Now I'm going to jump down to this leaf node here, and because I have a where clause that's going to read that many records, I know I need to go at least to page five. So I'll go get page three, and the system can potentially prefetch page five. But again, the operating system can't know this, because page three and page five are not contiguous. The database system knows, because once it has index page three, it has a sibling pointer — a page ID — so it knows the next page it's going to read along the leaf nodes as it scans across. Therefore it can go ahead and prefetch that page, even though it's not in sequential order relative to the other pages. Again, this is another example of why we want to do this inside the database system and not in the OS.
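Here's a minimal sketch of both kinds of prefetching just described. Everything here is hypothetical (PrefetchPage is an assumed asynchronous interface, not a real BusTub or Postgres call); the point is only the shape of the logic: in a sequential scan you queue up the next few pages by page number, while in an index scan you queue the sibling page IDs you pulled out of the leaf node you just read.

```cpp
#include <cstdint>
#include <vector>

using page_id_t = int32_t;

// Assumed interface: queue an asynchronous read so the page is (hopefully)
// already sitting in a frame by the time the query asks for it.
void PrefetchPage(page_id_t /*pid*/) { /* hand off to a background I/O thread */ }

// Sequential scan: while the engine is "thinking" about page `current`,
// request the next few pages in physical order.
void PrefetchSequential(page_id_t current, page_id_t last_page, int depth = 3) {
  for (page_id_t next = current + 1;
       next <= last_page && next <= current + depth; next++) {
    PrefetchPage(next);
  }
}

// Index scan: the leaf pages are not physically contiguous, so OS read-ahead
// can't help. But the leaf we just read gives us sibling page IDs, so we can
// hand exactly those to the buffer pool.
void PrefetchLeafSiblings(const std::vector<page_id_t> &sibling_page_ids) {
  for (page_id_t pid : sibling_page_ids) {
    PrefetchPage(pid);
  }
}
```

In a real system the background thread would also respect pin counts and available frames, so a prefetch never pushes out a page a query is still using.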
All right. The next optimization we can do is called scan sharing; sometimes it's called synchronized scans in some of the older systems. The basic idea is that a bunch of queries show up that want access to the same table. One of them gets started and begins scanning through the pages. We can recognize that the others need the same data, so we just piggyback off of it: their cursors attach to its cursor, and we read the same pages at the same time. So we remove the redundant I/O.

This is different from result caching, which we won't really talk about this semester. With result caching, a query shows up, I compute some answer, and I save that result in a cache; if the same query shows up again (or potentially a slightly different one, which is harder to do), I don't have to rerun it — I just send the result back. Scan sharing is at the low, physical level of the access method, in how we're actually scanning the pages. We recognize that two queries need to read the same thing, and therefore we can reuse pages as we go along and fetch them, so we don't have competing cursors trying to read the same pages at different times.

So, repeating what I just said: DB2, SQL Server, Teradata, and Postgres support full scan sharing for queries that aren't exactly the same, as long as they're reading the same tables. Oracle only supports cursor sharing for queries that are literally exactly the same, because they're basically hashing the query string and seeing whether there's a match. If you look at the documentation: if you have queries like "select star from employees" and "select star from Employees" with a capital E, or with an extra space before the from clause, these won't match, because when you hash the strings they're not the same. So it has to be literally the exact same query running at the exact same time, and then they can share.

Conceptually it looks like this. Say I have query Q1 that does a summation on the value column of table A. It attaches a cursor to the pages and starts reading them, fetching them into the buffer pool. First it gets page zero — not there — so it puts that in memory, then goes on to the next pages, and so forth. Now we get to page three, and — we haven't talked about the eviction policy yet — page zero is the least recently used, so we go ahead and evict page zero and put in page three. But now Q2 shows up and wants to compute an average instead of a summation, on the same table. The naive thing is to have it start at the beginning, just like the first cursor, and scan down reading the pages on its own. But obviously this is stupid, because in this scenario Q2 needs to read page zero, and Q1 just got that evicted from the buffer pool. So the very first thing we'd do for Q2 is evict page two to put page zero back in — even though we just got rid of page zero. The better thing to do is attach Q2 to Q1 at the lowest level as it scans the table, let Q2 ride along with Q1 and see all the pages that it sees, processing them accordingly. Then Q1 goes away, and Q2 recognizes: oh, there's a bunch of pages at the top of the table that I missed — let me go back and get those, and then finish computing my query, right?

What's the potential problem with this? Yes — is it hard to implement? Not that hard. Yes — so if the aggregate function depends on the ordering of the data — which, again, in the relational model it shouldn't — then yes. You're close to it. Basically, say you had a limit clause: I want to get the first 100 tuples and compute the average from those. The relational model is unordered, so technically it's correct whether Q2 starts at page zero or at page three; either way it gets 100 tuples and computes an average, and both answers are technically correct. But from the application's perspective, this looks fucked up, because now the same query run at two different times shows up with different results. So I was being glib when I said it's not that hard to implement — if there are no ordering constraints, it's easy, but if you need your queries to produce the same results over and over again, this can be tricky. That's part of the reason — and I'm not trying to bash Oracle — the easiest thing to do is: if it's the exact same query, do cursor sharing. The tricky thing is to understand the semantics of what a query actually wants to do, identify when it's safe to attach one cursor to another, and figure out how to go back and get more results as needed.

So here's a sort of extreme version of scan sharing: this idea called continuous scan sharing. I'll say up front that no real system does this, but I like it because it's a different way to think about how to build a system. Going back to his point that it'd be hard to implement scan sharing — potentially, yes. But what if you did the dumbest thing and just had everything do scan sharing, because there's a cursor running all the time? It literally reads one page after another, brings each into your buffer pool, and when it reaches the end it just loops back around and does it all over again. Now when a query shows up, it just hops on wherever the cursor currently is, gets what it needs, and then goes away when it's done. Good idea or bad idea? What's that? I heard bad. Why? I think it depends... So, assuming it's OLAP — assuming every query is doing a full table scan or almost a full table scan — the table is sitting there and we're just scanning through it, page after page.
So what if you have a table — you could say the cursor doesn't get started until a query shows up that actually touches that table. Yeah, but you have to do the scan anyway for a full table scan, because the table doesn't fit in memory anyway. So, yes, it's hard to skip around. In some ways, too, it makes your runtime kind of deterministic, right? You know it's going to be at least n, where n is the number of pages. That's just for scans; joins are a whole other beast we haven't talked about yet. So if it's on-prem and you've already paid for the hardware — ignoring energy costs — then this is kind of okay. But if it's running in the cloud, where you're actually paying for IOPS, then this is terrible, because you end up reading more data than you actually need. The only prototype I know of that did this came out of ETH Zurich and was called Crescando; they built it for a sort of telecom business that needed deterministic query runtimes. But that was a few years ago, and I've only seen one other prototype do the same kind of thing. It's an interesting idea, though — a different, unorthodox way to think about a system. I just like to present it as a different way to think about things. That's why I love databases: there are so many different ways to solve the same problem over and over again, and we can debate what's good and what's bad.

All right, so the last optimization to talk about is called buffer pool bypass. The idea here is that if we have a query running a sequential scan, we have to bring things off disk into memory, but maybe we don't want to put them into our buffer pool. For one, we have to pay the maintenance cost of the buffer pool — the page table: we'd take a latch, update things, evict stuff, and so forth. And also, if we're doing a sequential scan, the data we just read may not actually be useful again — it's not going to be reused by our query, because we're only going to read the table once, usually. So rather than having all these different workers running sequential scans at the same time and polluting the page table, what if we just give every worker its own little piece of memory, a working memory, and any page it reads goes into that worker's local memory? Yes, you could end up with duplicates, and it only works if it's read-only — you can't do writes. It's basically a circular buffer that you keep filling up and wrapping around.

A bunch of systems support this: Oracle, SQL Server, Postgres, and Informix. I think it originated in Informix, where they call it light scans — "light" because, again, you don't touch the heavyweight page table in the buffer pool. The idea is that I avoid polluting my page table because the data I need is kept local to me. Of course, the downside is that you lose the sharing capability: if two workers need the same pages at the same time, or one soon after another, you lose that reuse opportunity. But again, it's another optimization we can do because we control exactly what the queries are executing and touching.
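A minimal sketch of what such a scan-local buffer might look like, under the assumptions above (read-only, duplicates allowed, nothing registered in the shared page table); the names and sizes are made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// Buffer pool bypass ("light scan"): a sequential scan gets its own small,
// private ring of page-sized slots and reads from disk directly into them,
// never touching the shared page table.
constexpr size_t kPageSize  = 8192;  // match the system's page size
constexpr size_t kRingSlots = 16;    // small, per-worker working memory

class ScanLocalBuffer {
 public:
  ScanLocalBuffer() : memory_(kRingSlots * kPageSize) {}

  // Slot to read the next page into. Wraps around and overwrites the oldest
  // page, which the scan has already consumed and will never revisit.
  char *NextSlot() {
    char *slot = memory_.data() + (next_ % kRingSlots) * kPageSize;
    next_++;
    return slot;
  }

 private:
  std::vector<char> memory_;
  size_t next_ = 0;
};
```

Because nothing here is shared, there are no latches to take and nothing for the eviction policy to track; the trade-off is exactly the lost sharing described above.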
All right. So we've been dancing around this idea of evicting pages from our buffer pool; now we have to talk about how we're actually going to do it. You need to know this because it's Project 1. So when the execution engine says, okay, I'm bringing a page into memory and I'm going to put it into a frame — if there are no free frames, the system has to decide what to evict. It's a cache. This should not be groundbreaking.

There are a bunch of different metrics or objectives we have to consider in our eviction policy, and they depend on various aspects of how we implement our database system. Obviously we want our eviction policy to be correct — we don't want to evict a page and then immediately find that it's the most used thing, so that we keep reading and writing it from disk over and over again, because that would be bad. We want our eviction policy to be fast — if we use some NP-complete or exponential algorithm and it takes three seconds to decide what page to evict, we might as well have gone and read the page from disk, because that would have been faster in the first place. And related to this, we also don't want a big cost for maintaining the metadata we need to track how pages are being accessed so that we can decide what to evict. So again, this is the oldest problem in computer science — the other oldest problem in computer science being naming things; we can talk about how BusTub got named — and everyone has a caching paper. I think I have two.

The most obvious, easy thing to do is LRU: least recently used. The basic idea is to maintain a timestamp, or keep a linked list ordered by when pages were last touched, and when it comes time to evict a page, we just go to the tail end of the linked list and pop whatever is there — that one hasn't been accessed in a while, so we evict it. So say in this example Q1 wants to touch page one. Page one is already in our linked list, so we just move it to the front, to the head. Now say another query comes along and wants to access page five, but page five is not in memory, so we go ahead and evict page two, because it's at the end. Right? This should not be news to anyone.

Another way to achieve roughly the same thing without tracking timestamps in a linked list is to use an approximation algorithm called clock. Who here has heard of clock before? Less than five, okay. Clock is used in a bunch of other systems as well — I think Linux uses it for its page cache; they use a multi-hand clock, which we don't need to know about. The basic idea is that instead of keeping track of the exact ordering of pages like LRU, we just give every page a simple reference bit, which we set to one whenever the page is accessed — whenever it's written to or read. Then we have a clock hand that sweeps around and looks at all our pages: if the bit is set to one, we set it to zero; if it's already zero, we go ahead and evict that page. So say we have four pages, each with a reference bit set to zero. Page one gets accessed by a query, so we set its reference bit to one. Now say another query wants a page that's not in the buffer pool, so we have to decide which of these four to evict. The clock hand starts at some position and looks at the reference bit: if it's set to one, we set it to zero and move on to the next one.
Here at page two the reference bit is set to zero, so we know it hasn't been touched since the last time the clock swept around looking for pages to evict. Therefore it's safe to evict, and we go ahead and replace it with another page. Then say pages three and four get accessed; the clock sweeps around, sets their bits to zero, and so forth, until we come back around to page one — it wasn't accessed since the last time we saw it, so we evict that one.

Is this a good idea or a bad idea? Yes? He says it seems good in that it needs a lot less storage, because we only need one bit per page, but the downside is that it seems like it might evict things it shouldn't and we might get fewer hits. So you got the first part right: this is nice because the metadata overhead is low — it's just a bit per page, and you can keep it in a bit vector, which is easy. But on the second part, about evicting things you shouldn't: the clock only runs when you need to evict something. You only start the sweep when you need to evict, so it's not just running in the background all the time — we don't want that.

Yes? So he said, and he's correct, that both clock and LRU have the problem that we're not keeping track of the frequency with which pages are accessed. That makes them susceptible to two problems. The first is what he said: in both clock and LRU we're only tracking when pages were accessed, not how often. And in both cases they're also susceptible to a problem called sequential flooding. This is the problem where, if we're running sequential scans, we have to fetch pages from disk into our buffer pool, but if we're tracking the last time each page was used, then the last page I just got from my sequential scan is the most recently used one — and yet for that sequential scan it's actually the least useful one, because I just read it and I'm not going to go back and read it again. In some cases, for workloads like this, what you'd actually want to evict is the most recently used page. That's a gross approximation, but it's another way to think about it.

So, like this: I have a query, Q1, that does a select star on the table but only wants one record, and say that record is in page zero, so we put page zero in our buffer pool. Then we have our OLAP query that's going to scan the entire table and fetch all of its pages, but when it gets to page three we don't have any more space. The least recently used page is page zero, so I go ahead and evict that and put in page three. But then another query comes along that does the same thing as the first one and wants that same record — that's exactly the page I need, but I just evicted it. This is the worst thing you could do, right?
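To make the clock policy concrete, here's a minimal sketch (a hypothetical structure, not BusTub's actual replacer API). Notice that it only records whether a frame was touched since the hand last passed, never how often or by what kind of query — which is exactly why a big sequential scan can wipe out everything useful.

```cpp
#include <cstddef>
#include <vector>

// Minimal sketch of the clock replacement policy described above.
// (A real replacer would also skip frames that are currently pinned.)
class ClockReplacer {
 public:
  explicit ClockReplacer(size_t num_frames)
      : ref_bit_(num_frames, false), hand_(0) {}

  // Call whenever the page in `frame` is read or written.
  void RecordAccess(size_t frame) { ref_bit_[frame] = true; }

  // Pick a victim: sweep the hand, clearing reference bits, until we find a
  // frame whose bit is already clear (untouched since the hand last passed).
  size_t Evict() {
    while (true) {
      if (!ref_bit_[hand_]) {
        size_t victim = hand_;
        hand_ = (hand_ + 1) % ref_bit_.size();
        return victim;
      }
      ref_bit_[hand_] = false;  // second chance: clear the bit, keep sweeping
      hand_ = (hand_ + 1) % ref_bit_.size();
    }
  }

 private:
  std::vector<bool> ref_bit_;  // one reference bit per frame
  size_t hand_;                // current position of the clock hand
};
```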
So sequential flooding is a problem because, if we're doing mostly point queries and then all of a sudden a sequential scan shows up, it's going to blow away all of the useful information collected in our LRU or clock metadata. The solution to this is called LRU-K. The idea is that you keep track of the last K times each page was accessed, and when it comes time to decide what to evict, you compute the interval between those accesses. Whichever page has the largest interval — the longest time between its most recent access and its K-th most recent access — is the one least likely to be used again in the future, so you go ahead and remove it. You can think of regular LRU as basically LRU-K with K equal to one, right? And you could use two or three or whatever; most systems that do this use two. They just keep track of the last two access times, compute the time between them, and evict the page for which that interval is the largest.

Of course, this is susceptible to another problem: I fetch a page in, and I haven't accessed it twice yet, so its interval is essentially infinity, and it immediately gets evicted. But say it actually is a hot page and I want to keep it in memory — because I keep evicting it, I never build up any history for it. The way to solve this is to maintain a small in-memory hash table that tracks the last several pages you've evicted to disk and the timestamps of when they were accessed, so that when you fetch a page back in soon after it was removed, you at least have a history for it and don't assume its interval is infinity. That means that over time you can gather the information you need to compute the correct interval for pages the first time they're brought back into memory. And it's self-correcting: if I bring something into memory and never fetch it again, it'll eventually fall out of that ephemeral cache, and the next time I need it I just won't have the history. It's a simple solution to a simple problem. Surprisingly, this idea has been around since the '90s, and as far as I can tell only Postgres and SQL Server actually do it. This is why I like open source: there's actually a thread on the Postgres mailing list from 2002 saying, hey, this LRU-K thing seems like a good idea, we should do it — and they implemented it.

MySQL doesn't do exactly LRU-K as I've defined it, but they use a sort of approximation. The way they do it is that they logically divide the LRU linked list of pages into two sections or regions — a young region and an old region — and each region has its own head pointer where new entries get inserted. So say I have a query that wants to touch page one, and it's not in memory, so I have to put it into my buffer pool and add it to my linked list. Because page one is not already in the linked list, I add it to the old region, inserting it at that region's head pointer — which evicts page eight and puts page one there. If page one is never accessed again, it slowly makes its way to the end of the linked list and then gets evicted. But if, say, Q2 comes along and touches page one again, we see that it already exists in the linked list and is in the old region, so we move it to the head of the young region and slide everyone else down.
So again, it's approximate LRU-K: I'm not really keeping track of the intervals between accesses, but just knowing whether a page is on the young or the old side of that boundary tells me whether it was most likely accessed recently, versus only having been seen once when it was added to the list.

I'll also say, going back to LRU-K, that there are a bunch of other optimizations you can do — SQL Server does this, but I don't think Postgres does — where you keep track of who is accessing or referencing a page, and that determines whether an access counts as a distinct reference that should update the interval. For example, if two separate queries within the same transaction access the same page, they're in the same transaction — should those be considered distinct accesses or not? If they're two separate transactions, then it's very likely this page is hot, because a bunch of transactions are accessing it. Think of it like this: you log into Amazon and go update your account information, and say for whatever reason that transaction updates your record twice. That's done within one transaction — is that two accesses or one? SQL Server will count that as one; Postgres will count it as two. Again, you can do a bunch of fancy things because you know how the database system is accessing pages.

All right, there are a bunch of other policies you can use for deciding, for a given query, which pages to evict. This is related to the private-cache idea: in some systems you can say, here's some subset of pages I'm accessing — they're still backed by the buffer pool — but I'm keeping track of which ones I've touched, so I can give the buffer pool manager a hint: if you run out of space, here are pages I've already read and won't need again, so you can go ahead and evict them. Whether the system actually chooses to evict them depends on the implementation. You can also maintain priority hints about the type of page, or what object the page belongs to, and provide that to the buffer pool manager to help it decide whether something should be evicted. The classic example: if I have an index over a bunch of pages, and my queries are always inserting new records with ever-increasing values for the key the index is built on, then I know I'm always going to be hitting the right side of the tree. So maybe I want to keep those pages in memory, and I don't care if page two over here gets evicted, because I'm mostly going to be updating things on page six. Or likewise, if I run a bunch of select queries that go through the index, I know the very first thing they always access is the index's root page, so that should be given higher priority than other pages. Now, these seem like band-aids on top of LRU-K or other mechanisms, and they kind of are. Think of them as a light pin that says: hey, you really probably shouldn't evict this — if you have to, you can, but please don't. Again, it's just additional information beyond the LRU tracking the system is already doing.

The big challenge, though, when it comes to evicting pages is whether they're dirty or not.
The big challenge, though, when it comes to evicting pages is whether they're dirty or not. The easiest case is when the page you want to evict is clean: to remove it from the buffer pool you basically do nothing, you just drop its entry in the page table and you can overwrite whatever is sitting in that frame, because you don't need to flush anything back to disk. If the page you want to evict is dirty, though, you have to write it to disk and make sure it's durable and safe, which we'll cover later in the semester, before you can say, okay, this frame is free, go ahead and reuse it. It's actually more complicated than that, because you have to make sure the log record describing the change is flushed to disk before you can flush the dirty page that log record talks about, but we'll cover that later.

The reason this is tricky is that, given all the things we just talked about, the priority hints and the different policies, it may be the case that the page you want to evict is dirty, which requires a disk flush, but the second-best page you could evict is clean. Should you violate the LRU-K ordering in that case because you're just trying to free a frame as fast as possible? Different systems do different things, and this is one reason the enterprise systems are much better than the open-source systems: they have all this metadata and these sophisticated algorithms to make the decision, because the right answer actually depends on the speed of the hardware. If your disk is super fast, then sure, write the dirty page out right away, because that's a fairly inexpensive operation. But if my disk is really slow, or I have to write over the network to some slow device, then maybe I want to minimize the number of disk writes I have to do at the exact moment I need space. So there's no easy answer: if all your pages are dirty and you need to evict one, you have to write one out.

But what's one way to avoid this problem, so that the write isn't on the critical path when you want to bring a page in? Background writing, right. You can have a separate thread in the background that walks through your page table, figures out what's dirty, makes sure the log is on disk first (again, we'll cover that later, but assume the log record has to be written first), writes those dirty pages out, and then flips the bit to say each page is now clean. So when the eviction algorithm runs and says, okay, I have to evict something now, it has a bunch of clean pages as options that it can just drop without writing anything. But now there's a trade-off: should the system aggressively write out dirty pages, maybe interfering with queries and transactions that are trying to run on behalf of the application, or should it delay that, in which case the problem is that at some point you need free space and all your pages are dirty? It's a super hard problem and there's no easy answer, but all systems have some mechanism to do this kind of background writing. And then we haven't even talked about checkpoints, which are a whole other piece; with checkpoints you flush everything out, but that happens every few minutes, not every second.
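As a rough illustration of that background-writer idea, here is a toy C++ sketch. The frame metadata and the FlushLogUpTo and WritePageToDisk helpers are hypothetical stand-ins for a log manager and disk manager, and a real implementation would not hold one big latch across all of its I/O.

```cpp
// Toy background writer: periodically walk the frames, force the log up to
// each dirty page's LSN (write-ahead logging rule), write the page out, and
// clear its dirty bit so the eviction policy later finds clean victims.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helpers standing in for the log manager and disk manager.
void FlushLogUpTo(uint64_t /*lsn*/) { /* force the WAL up to this LSN */ }
void WritePageToDisk(int /*page_id*/) { /* write the frame's contents out */ }

struct FrameMeta {
  int page_id = -1;        // -1 means the frame is empty
  bool dirty = false;
  uint64_t page_lsn = 0;   // last log record that modified this page
};

void BackgroundWriter(std::vector<FrameMeta> &frames, std::mutex &latch,
                      std::atomic<bool> &shutdown) {
  while (!shutdown.load()) {
    {
      std::lock_guard<std::mutex> guard(latch);
      for (auto &frame : frames) {
        if (frame.page_id < 0 || !frame.dirty) continue;
        FlushLogUpTo(frame.page_lsn);    // log record must be durable first
        WritePageToDisk(frame.page_id);  // now the page contents are durable
        frame.dirty = false;             // eviction can simply drop it later
      }
    }
    // Sleep so we don't compete too aggressively with foreground queries.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
}
```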
Alright, so now let's talk a little bit about how we actually do these disk reads and writes. When we make read and write calls to the file system, assuming we're running on a file system and not raw partitions, there are a bunch of layers below the database system, the OS, the file system, and the hardware, that are going to be clever and try to maximize bandwidth by reordering and batching our I/O requests. Part of the reason modern disk drives, SSDs, and NVMe drives are so fast is that they have long queues and can serve requests in parallel; if you do one read at a time it's going to be super slow, but if you batch things up and make sure you're reading a bunch of contiguous data, things will be really fast. The challenge, though, is that these layers below the database system don't know what the requests actually correspond to. They just see reads and writes of pages at some location or address; they don't know, oh, this is from the background writer, or this is for an index, or this is for the query I need to run right now. You can play some games in Linux by setting the I/O priority, but that's basically a sledgehammer: the only thing you can really do is change the I/O priority at the per-process level, you can't do it per individual request, which is what we really want. If you read the documentation for a bunch of different database systems, they all tell you to get off the default Linux I/O scheduler, the fair scheduler (CFQ), and use either deadline or noop, the simple FIFO queue, because they don't want the operating system doing a bunch of stuff the database system can't control.

This is why most database systems have their own little shim layer right above the OS that's responsible for keeping track of which read and write requests are outstanding from the buffer pool manager and deciding how to put things together to optimize performance. Think of it as determining priorities for the different I/O requests based on a multitude of factors, because again, we know what the queries are trying to do, we know what pages are in the buffer pool, we know what's dirty and what's not, and we know what order the outstanding requests arrived in. So we can keep track of things like: is this sequential I/O or random I/O? Is the request on the critical path, like a query needs it right now, or is it a background job from the background writer that can have a lower priority? Is the data we're accessing for a table, an index, or log records? The log, again, we want to flush as fast as possible. Is it for ephemeral data? A sequential scan over a table often gets lower priority than a random access for an index, because as you traverse your B+tree you're holding latches on the way down, which prevents other threads from running at the same time, whereas if your scan query is a little slower you're potentially not interfering with other queries, although that's not entirely true either, because you can take locks while running queries and transactions. So there's no easy rule for which one should be faster than the other; you can't decide just because it's an index or because it's a table, and again, the OS doesn't even know whether a given page is an index page or a table page. In some systems you can also have SLAs, service level agreements, or service level objectives, like my queries have to finish within a certain deadline or at a certain latency, and the typical way you do that is with different user accounts: you give one user account higher priority than another, like the front-end application gets higher priority than the nightly reporting jobs, and you do that based on user roles.
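Here is a small sketch of what that kind of shim layer might look like: each request gets tagged with what it is for, and the scheduler serves the highest-priority work first. The categories, the numeric priorities, and the names are all invented for illustration; no particular system assigns priorities exactly this way.

```cpp
// Illustrative I/O shim: the database tags each outstanding read/write with
// information the OS could never infer on its own, derives a priority from it,
// and dispatches requests from a priority queue in that order.
#include <cstddef>
#include <queue>
#include <vector>

enum class IoSource { LogFlush, CriticalIndexRead, CriticalTableRead, BackgroundWrite, Prefetch };

struct IoRequest {
  IoSource source;
  bool is_write;
  size_t page_id;
  int priority;  // larger = served sooner
};

int PriorityFor(IoSource source) {
  switch (source) {
    case IoSource::LogFlush:          return 100;  // commit latency depends on this
    case IoSource::CriticalIndexRead: return 80;   // a query is blocked, latches held
    case IoSource::CriticalTableRead: return 60;
    case IoSource::BackgroundWrite:   return 20;   // can wait for an idle moment
    case IoSource::Prefetch:          return 10;
  }
  return 0;
}

struct ByPriority {
  bool operator()(const IoRequest &a, const IoRequest &b) const {
    return a.priority < b.priority;  // max-heap on priority
  }
};

using IoQueue = std::priority_queue<IoRequest, std::vector<IoRequest>, ByPriority>;

// Callers enqueue tagged requests; a dispatcher thread pops the top request
// and hands it to the disk manager.
void Submit(IoQueue &queue, IoSource source, bool is_write, size_t page_id) {
  queue.push(IoRequest{source, is_write, page_id, PriorityFor(source)});
}
```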
So the way we can get better performance and avoid some of this interference from the OS is to use what's called direct I/O. All your disk operations, for the most part, have to go through the OS API, unless you're doing direct kernel bypass to the hardware device, which most systems don't do. The idea here is that we don't want the OS to maintain its own cache, called the page cache; we don't want the OS to buffer our reads and writes, we want to do all of that ourselves, because again, we want full control of the hardware. Normally it works like this: if I do a read against the file system, the OS says, let me maintain my own little buffer pool, global across all the processes running on the machine, and I'll store the data you're asking for in my page cache, so the next time you read it I can serve it from there. Instead, we want to bypass that, go around it, and go directly to the hardware without the OS buffering anything. Most systems will use direct I/O by default. There's only one major system that does not. Take a guess which one. SingleStore? No. MySQL? No. It's Postgres. Postgres, because it's a relic from the 1980s, relies heavily on the OS page cache, so when you allocate the buffer pool in Postgres you set it to something like 30-40% of the memory on the box, leaving the rest for the OS and its page cache. MySQL and every other database system tell you to give the database something like 80% of the memory available on the box.

So what's the problem with relying on the page cache? Well, now I've got redundant copies of my pages: the OS has a copy and my database system has a copy, and the same thing happens for recent writes. The database system has its own policy for deciding which pages to evict, but just because I evict a page from my buffer pool doesn't mean the OS does; it can evict whatever it wants, because it doesn't know what the pages actually represent. And you also lose control over when things actually get flushed out to disk, unless you're careful. So let's see how far you guys can get with these questions, even if you haven't taken the OS class. If I call fwrite, what happens? I open my database file, my buffer pool manager brings a page in, another query updates it so the page is dirty, and I call fwrite to write that page out to disk. Because we're not using direct I/O, the page lands in the OS page cache, since the operating system is trying to be clever and make things fast for you. So is it on disk yet when fwrite returns? Is my data safe? No, because it's sitting in the OS page cache. When is it going to be flushed out to disk? Whenever the OS decides to do it. But f*** that, we want to make sure our data is on disk. So what do we call? fsync. What does it do? It flushes the dirty pages out to disk; you can sort of do ranges, though that doesn't always work, and the call to fsync blocks until the hardware comes back and says your data is persistent. Now, the hardware can play games too, because sometimes the drive has a little battery down there, so it acknowledges your write while the data is still sitting in volatile memory; if there's a power loss, there's just enough battery juice to make sure it gets written out. So it's not always going to be on the NAND flash at that exact moment, but typically that's good enough.
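To make the two paths concrete, here is a minimal Linux-flavored C++ sketch, assuming a hypothetical "test.db" file and a 4 KB page. The buffered path needs an explicit fsync before the data can be considered durable, while the O_DIRECT path (Linux-specific) skips the page cache but demands aligned buffers. Error handling is trimmed to the essentials.

```cpp
// Buffered I/O vs. direct I/O, sketched with POSIX calls.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

constexpr size_t kPageSize = 4096;

int main() {
  // Buffered path: fwrite() goes into the stdio buffer and then the OS page
  // cache; the data is not durable until fsync() on the descriptor succeeds.
  FILE *fp = std::fopen("test.db", "wb");
  char page[kPageSize];
  std::memset(page, 'x', sizeof(page));
  std::fwrite(page, 1, sizeof(page), fp);
  std::fflush(fp);                // push from the stdio buffer into the kernel
  if (fsync(fileno(fp)) != 0) {   // only now is the page (probably) durable
    std::perror("fsync");
    return 1;                     // do not blindly retry -- more on that below
  }
  std::fclose(fp);

  // Direct path: no page cache, but the buffer, size, and offset must be aligned.
  int fd = open("test.db", O_RDWR | O_DIRECT);
  void *aligned = nullptr;
  if (posix_memalign(&aligned, kPageSize, kPageSize) != 0) return 1;
  std::memset(aligned, 'y', kPageSize);
  if (pwrite(fd, aligned, kPageSize, 0) != static_cast<ssize_t>(kPageSize)) {
    std::perror("pwrite");
  }
  free(aligned);
  close(fd);
  return 0;
}
```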
But what happens if that fsync call fails? What if fsync says, I can't do this for you? What does that mean? Say the OS doesn't crash; fsync just returns an error. On Linux, the kernel will mark the dirty pages as clean, these pages are clean, even though the fsync failed. Then you call fsync again on the pages you wanted written for your fwrite, and it comes back and says, yep, got them, they're clean, it's on disk. But it's lying to you, right? Why is it lying to you? Because the kernel developers are worried about someone pulling out a USB stick, the fsync failing, and then having a bunch of dirty pages stuck in the page cache that are never going to get written back, because you're never going to plug that USB stick in again. Is that the right behavior for databases? No, right? We're not running on a USB stick. So again, we need full control of everything to make sure we're doing things right.

Well, it turns out people didn't know fsync was broken in this way for twenty years. In 2018 there was a scandal called fsyncgate, where someone on the Postgres mailing list reported, hey, Postgres lost some of my data, but I never got a kernel panic, I never had a failure. It turned out that all these database systems would call fsync, fsync would return an error, and they'd just put it in a while loop and call it again, and then fsync would come back and say, yep, I got it, because the kernel had marked the dirty pages clean even though they were never written. MySQL had this problem, MongoDB and WiredTiger had this problem, and a bunch of other systems too. So now what do they do? If fsync fails, the system deliberately crashes, and then you have to go figure out what's going on. But for twenty years people didn't know this was an issue. This isn't strictly a page cache issue, but it's an example of why we need full control over what gets read from and written to disk, to make sure it's actually safe and correct, because the OS can lie to us. The OS doesn't care about databases; it's worried about somebody with a USB stick or whatever, because it's trying to be a general-purpose system. So we, as database system developers, need to put the mechanisms in place to make sure we don't get screwed, okay?
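The rule of thumb most systems landed on after fsyncgate is roughly the pattern below; this is a short sketch, not any particular system's actual code.

```cpp
// After a failed fsync the kernel may have already marked the dirty pages
// clean, so a "successful" retry proves nothing. The safe move is to crash
// and let crash recovery rebuild state from the log.
#include <cstdio>
#include <cstdlib>
#include <unistd.h>

void FlushOrDie(int fd) {
  if (fsync(fd) != 0) {
    std::perror("fsync failed; aborting so crash recovery can run");
    std::abort();  // never loop and retry -- that is exactly the fsyncgate bug
  }
}
```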
Alright, so there are multiple buffer pools beyond what we've talked about, things like ephemeral caches for joins and so on; we'll see those later. The point is that we'll always be better than the operating system, despite what the Linux people, the OS people, say, because we know what the query plans are going to do and we know how queries access data, so we can always make better decisions. Hash tables are next class. Let me talk about Project 1 very quickly. You'll be implementing this in BusTub, obviously, and there are three parts: your LRU-K replacement policy, a disk scheduler (a very primitive one), and then the actual buffer pool manager itself. For the first part, there's a separate class you need to implement that basically keeps track of all the pages, and there's an API to implement that says, give me a page to evict. You implement that first, and there are tests to make sure it's working correctly; if none of the pages have been touched, the check is that you always return the lowest page ID, because again, you've got to evict something, and that's how you decide what to evict.

The next thing you implement is the disk scheduler. Basically, a bunch of different requests can come in from different threads running at the same time, and you have a single queue that decides in what order to apply those reads and writes. The way you implement this, the API is a callback mechanism through the C++ promise API, promise structs: it's something you complete once the data the request is waiting for is available. So it's not true asynchronous I/O, because the thread that makes the request will block on that callback, but it's the building block for doing more sophisticated things, so make sure what you build here is thread-safe. The last part is the buffer pool manager itself, built on top of your LRU-K implementation and your disk scheduler. You maintain internal data structures to read and write data through the disk scheduler and keep track of which frames are free and which pages aren't being accessed. The thing that trips people up every year is getting the ordering correct when you're pinning and unpinning pages.

So six might not be the right number, but don't change any other files, because everything gets overwritten when you upload to Gradescope; the projects build on each other, and post your questions on Piazza as you go along. Like in Project 0, make sure you have good code quality: run make format and check clang-tidy, because if you don't and you upload to Gradescope, you'll get a zero. For this project and all the other projects there will be extra credit: there's a leaderboard with additional tasks that go beyond the core requirements, you get ranked when you submit on Gradescope, and the top 20 students get bonus points for the project. This will be available for all four projects, and the student with the highest total of bonus points across all students at the end of the semester gets a limited-edition BusTub hoodie. You have to fill out a tax form because they're limited edition, they cost $5,000, and again, CMU will handle the paperwork for that, but it's highly desirable. And don't look around, don't copy, because again, Gradescope has the automatic plagiarism checker, and we go find the random solutions off GitHub and add them as fake students, so if you copy them you'll get flagged, okay? That's it; thank you, see you next time.