Today's lecture is on databases that are larger than the amount of memory that's available to the database system. So again, still on lockdown here at my house. The Terrier is over there, sort of asking questions as we go along, but it hasn't been doing a very good job so far. People are complaining that the Terrier doesn't ask the questions that they want to ask, but it is what it is. So before we get into today's lecture material, I first want to do a quick overview of what's coming up, for those of you that are enrolled in the course this semester: the upcoming deadlines and dates. On Wednesday this week, I will be releasing the final exam. It was originally due a week from Wednesday, so next week. I've now extended that so the final exam is due on the 13th. Everyone's going to have three weeks to complete it. What's that? Okay. The Terrier's question is: can the class have a practice exam before I release this? No, because you'll see that it's a long-form exam based on the material that we discussed this semester, and now that I'm giving you three weeks to do it, there's no point in releasing a practice exam first. The next thing is that on Wednesday next week, we will hopefully have the guest speaker from Amazon come and give a talk. For this one, because of Amazon restrictions, we're going to have to make it a live lecture that will only be accessible to CMU students. So you'll need to come at the time class is normally scheduled, 12 p.m. Eastern Standard Time, because this will not be recorded. Also, in terms of Project 3, the second round of code reviews will be due on May 4th. Then we'll have our final presentations at our originally scheduled final exam time, on May 5th at 5:30 p.m. Again, we'll just go on the Zoom channel and everyone will present, as we've done with the status updates before. And then the final exam will be due on the 13th. The main thing I'll say about this, for the code reviews: Matt and I will go through this week and give feedback on the first round of code reviews, and that'll prepare you for the second round. The idea is that you want to take all the suggestions the other team made about your project in the first round and actually apply them in the second round. I don't want you to make the same mistakes all over again and have the reviews just repeat themselves. Okay? All right. As I said at the very beginning of the semester, which seems like a long time ago now, back in January: this class is focused on in-memory databases. All of the algorithms, methods, and architectural decisions we've talked about this entire semester have been based on the assumption that the database resides entirely in main memory, meaning we're not writing, say, join algorithms that minimize the amount of disk I/O they incur; we're assuming that any time we follow a pointer to a tuple or some buffer region, it's always going to be in memory. The upside of this, as we've seen throughout the entire semester, is that it allows you to implement things way more efficiently, because you don't have to have all the checks that a disk-oriented system needs to account for the fact that any time you go touch a piece of memory, it might not actually be in memory.
It could be on disk, and you'd have to go through the buffer pool manager to go get it. If we can write our system without that assumption, it's going to go really fast. The downside, however, and what the in-memory database marketplace has shown over the last decade, is that SSDs and spinning disk hard drives still provide properties, in terms of price versus performance, that are desirable even for modern applications. This is because DRAM at its core is expensive. It's expensive to buy relative to SSDs and spinning disk hard drives. Roughly, in 2020, the price per gigabyte for a spinning disk hard drive is around two to three cents. For NAND flash, it's less than a dollar; a few years ago I think it was around 30 cents per gigabyte. Whereas DRAM, although its price has gone down in recent years because of some lawsuits, which we can take offline, is roughly around five to six dollars per gigabyte. And this assumes you're buying in bulk, like a major manufacturer, not going to Amazon and buying DIMMs for your own machines. So it's expensive to buy. The other thing we haven't really talked about is that it's expensive to maintain, meaning when I put it in my computer or server, plug the server in, and actually run it, a sizable portion of the electricity being used to power that machine is going to DRAM. Now, this is ignoring things like Bitcoin mining or neural net training on GPUs; those are definitely bigger power hogs than DRAM and would draw most of the power. In a database server that's not doing work on GPUs, it's going to be roughly about 40%. There was a survey done a few years ago where they actually went and measured how much power was being drawn by the DIMM slots, and they showed that on average it was about 40%. So of all the power you're paying for to run your machine, 40% of it is going to DRAM. That means if I'm running an in-memory database where the database has to fit entirely in DRAM, the larger my database, the more DRAM I have to buy and the more power I'm going to use to maintain it. Essentially what the motherboard is doing is, every so often, I think every couple of seconds, sending a charge to the DIMMs so that they can refresh what they're storing. And of course, this is why, if you pull the power on them, they can't do that refresh and you lose your data. So given that we spent the entire semester talking about how to make a really fast in-memory database system, it'd be nice if we could bring back one of these slower non-volatile storage devices, NAND flash or a spinning disk hard drive, and get the benefit of being able to write data out to those disks, without having to bring back all the architectural components and algorithms that we avoided by going from a disk-oriented architecture to an in-memory system, such that we just end up with the slower disk-oriented architecture we were trying to avoid in the first place. So that's the focus for today.
We'll first talk, as background, about the different choices we have and why we want to support larger-than-memory databases in an in-memory database system. Then we'll talk about how you actually want to go and implement this, and then we'll finish up talking about some real implementations of the various techniques we'll cover today. So at a high level, and I've already said this, but just to repeat it: the goal of what we're discussing today is enabling an in-memory database management system to store and access data that's been written out to disk, but without having to bring back all of the slow parts, in particular the buffer pool manager, that we got rid of when we went to an in-memory architecture. Another engineering goal is that, as we bring back the disk, we don't want to have to go and touch every component in our system to account for the fact that the data it could be accessing is not in memory but on disk. If we bring back a traditional buffer pool manager, this is somewhat the case, because any time we go access a location, we would have to know what page or block it resides in and go through the buffer pool manager to get it for us. So we'd have to modify the entire system so that, instead of just reading a piece of memory as we did before, we go through the buffer pool manager, get a page or block of memory, and access a known offset within it. That's going to be slow. The algorithms designed to minimize disk I/O are really much slower than the random-access ones we've been talking about this semester. So the way we're going to have to do this is to be aware of the fundamental differences between non-volatile storage, like a spinning disk hard drive or NAND flash, and volatile in-memory storage. The core difference is that in-memory storage is tuple-oriented, essentially byte-addressable: I can jump to some memory location, and that's where my tuple resides. In a disk-oriented architecture and disk storage model, storage is block-oriented or page-oriented, meaning I can't jump to a single byte in a page and bring in just that data. I've got to fetch the entire four-kilobyte page that the data I want resides in. Even if I don't want the other parts, even if I only want one kilobyte of that four-kilobyte page, I'm bringing in the entire page. So this is what we have to deal with when we design a capability that allows us to move data in and out of disk in an in-memory architecture. The other important thing to discuss is that, for this entire lecture, we're going to focus on OLTP workloads and OLTP systems. We can do this in the context of an HTAP system, where we're running an OLAP workload and an OLTP workload at the same time, but for the OLAP queries, there really isn't anything special or magical we can do in an in-memory database system to make the disk accesses they're going to incur go faster.
Because at the end of the day, OLAP queries are going to do, for the most part, sequential scans over large segments of the table, or maybe even the entire table. Now, accessing just a subset of the columns in a column store could alleviate the issue of bringing in data that you don't actually need, and we can do that, but that's not anything special because we're in-memory: a disk-oriented database system can still be a column store and still get that same benefit. So again, the main thing I'm trying to point out here is that there's nothing we're going to do, because we're an in-memory system, to make disk I/O for OLAP queries go better. The only thing we can really do is this: say we have a column in our table that we know we want to scan. We can compute a zone map, precomputed aggregates for the column, like the min, the max, all the things you would want to run aggregations on, and keep that in memory at all times, while the rest of the column, the actual data itself, we shove out to disk. We can then use this zone map to figure out whether or not we actually need to access that column on disk, depending on what query we're trying to run. That's just the data-skipping technique we saw before when we talked about compression in a database system. But again, a disk-oriented system can still use zone maps, and there's nothing special here because we're in memory. So for this reason, there's not much we can do to make the OLAP side go faster; all the same optimizations that a disk-oriented system would do in its buffer pool, like scan sharing or buffer pool bypass, we can still apply here. (A sketch of the zone-map idea follows below.)
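To make that zone-map idea concrete, here is a minimal sketch in C++. This is my own illustration under assumed names and types, not the implementation of any particular system:

```cpp
// Sketch: an in-memory zone map for a column whose actual data lives on disk.
#include <cstdint>
#include <limits>

struct ZoneMap {
    int64_t min = std::numeric_limits<int64_t>::max();
    int64_t max = std::numeric_limits<int64_t>::min();
    int64_t sum = 0;     // lets SUM()/AVG() run without touching the disk
    uint64_t count = 0;  // same for COUNT(*)

    // Maintained as tuples are written, before the column data is evicted.
    void Update(int64_t v) {
        if (v < min) min = v;
        if (v > max) max = v;
        sum += v;
        ++count;
    }

    // Data skipping: for "WHERE col > k", if k >= max then nothing in the
    // evicted segment can match, so the scan never has to read it from disk.
    bool CanSkipGreaterThan(int64_t k) const { return k >= max; }
};
```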
The reason we're going to focus on OLTP is that those workloads have this very common pattern where there's a notion of hot data and cold data in the database. The idea is that we want to keep the hot data in memory, because that's the data we're going to be updating or accessing over and over again, and the cold data we shove out to disk. We still keep track of that cold data in memory, so we know it exists and we don't have any false negatives, where we do a lookup for a tuple that's on disk and say, oh, we don't know anything about it. We're going to avoid that, but we're just not going to pay the storage penalty of keeping all that cold data around in memory, even though most of the time we don't need it. The example I always like to use for understanding hot versus cold data is something like Reddit or Hacker News. Most people are going to post comments on the latest articles, the ones posted on Reddit within the last 24 hours or so. Few people go back and post comments on articles that were posted six months ago. Actually, I don't think Reddit even lets you comment on old articles. So again, the idea is that the recent posts we want to keep in memory, because those are the ones everyone's reading and making updates to; the older stuff we shove out to disk. If anybody comes looking for it, we fetch it from disk, bring it into memory, and serve it to them, but most of the time we don't need it. And the likelihood that someone's going to read that same cold article immediately after it was brought back into memory from disk is actually very low, so we can shove it back out to disk later on and save space. Again, some applications, like Reddit, actually prevent you from posting on articles once they reach a certain age. That mechanism is something the application does; it's not something the database management system enforces. As far as I know, no database system actually says: all right, this stuff is old, we won't let you modify it. It can't know that: you're inserting into a table, and the comments are getting inserted into a separate table from where the article itself is stored, so the system doesn't know to forbid it. At least as far as I know, no system does this. So again, the purpose of what we're trying to do today is this piece here: we need the mechanism that lets us identify that we have cold data, what it is and where it is, shove it out to disk, and then, if it's ever needed again, bring it back in. Another way to conceptually think about this is that we're trying to push cold data out of memory onto disk. Contrast this with a disk-oriented system, where you pull hot data from disk into memory. It seems like a semantic difference, and hopefully as we go along it will make more sense, but the system is being architected such that cold data is moved out to disk, whereas in a disk-oriented system, hot data is pulled in. All right, so let's look at a high-level example. We have a database that can now support writing data out to cold data storage. This is some spinning disk hard drive or NAND flash or EBS, whatever you want; it doesn't matter for our purposes right now. In memory, we still have our in-memory index and our in-memory table heap. For now, assume everything is fixed-size and it all exists in this one space here. Say we have a mechanism to look at our tuples and identify which ones are cold. We have a way to identify that these three tuples haven't been accessed recently and are unlikely to be accessed in the future. How exactly we do that we'll cover in a second, but for whatever reason, we think these three are candidates to be evicted. So what we're going to do is pull them out of our table heap and combine them into a page or block that we then write out to disk. This would be our evicted tuple block. There's some header that says, here's what's in here, and then we have our tuple data. For this, assume we're organized as a row store; a column store could essentially work the same way, because if you use a PAX-organized block, the data can be laid out column-oriented within the page, but all the values, all the attributes, for a given tuple can still be found in that one block. So now the first question we have to deal with is: what do we do with the holes we just made in our table heap? We have empty space that we can reuse. When do we actually want to put stuff in there?
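Before we get to that question, here is a minimal sketch of what an evicted tuple block like the one just described might look like. The layout and field names are assumptions for illustration, not H-Store's actual format:

```cpp
#include <cstdint>
#include <vector>

// Header at the front of every block written out to cold storage. It records
// what is inside so a tuple can be located without reading anything else.
struct EvictedBlockHeader {
    uint32_t block_id;    // identifier the in-memory metadata will refer to
    uint32_t num_tuples;  // how many tuples follow the header
    uint32_t tuple_size;  // fixed-size tuples, per the example above
};

// The block itself: header followed by row-oriented tuple data. A
// PAX-organized block would keep the same header but group the bytes
// by column, while still holding every attribute of each tuple.
struct EvictedBlock {
    EvictedBlockHeader header;
    std::vector<uint8_t> tuple_data;  // num_tuples * tuple_size bytes
};
```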
Coming back to that question: the issue is the index. Because everything in our database is in memory, the indexes hold 64-bit pointers to locations in memory to identify tuples, not a page ID, offset, and slot number as you would have in a disk-oriented system. Those pointers in the index are still pointing at these now-empty slots. So if I do a lookup and try to find a tuple that used to be here, I'm going to land in what is now empty space, or somebody could have put another tuple in that space. Now let's say this query comes along and wants to reference one of the tuples we just evicted. It's accessing tuple one, but tuple one is now somewhere out on disk. So the question is: how are we going to find it? How are we going to identify that tuple one isn't in memory anymore and is out on disk? Then, if we can identify that it is on disk, we need to bring it back into memory, and the question is what we do there. Because, as I said, the cold data storage on our disk is block-oriented, I can't just grab tuple one and copy exactly its bytes into my table heap. I've got to bring the whole block in. So now I'm bringing in tuple one, but also tuples three and four, and I didn't ask for three and four; I only wanted one. What do I do? Do I merge all three tuples back in? Do I merge just the one tuple and leave a hole out on disk? These are the questions we're trying to answer today: how we want to orchestrate all of this. The issues we're going to have to deal with are the following, and again, we're focusing on OLTP systems, where transactions or queries could try to not just access but also update or modify tuples that have been shoved out to disk. The first thing we have to deal with, for the runtime operations, is what we're going to do, while the database system runs queries and transactions, to keep track of whether data is hot or cold. How are we going to track it, so that when it's time to evict some data we can identify which tuples or blocks haven't been accessed in a while and therefore should be evicted? The next is the eviction policy. This is when the database recognizes that it's running out of space, running out of memory, and says: let's go ahead and evict some data. The first question there is when it should actually fire off the eviction. The next is what metadata we keep to record that there used to be data here in memory, it's now out on disk, and here's how to find where it's located on disk. We need this to avoid false negatives. We don't want to write tuple one out to disk, and then a query comes along and says, I want tuple one, but we only consult this portion of the table heap, which doesn't have tuple one anymore, so we come back and say: I don't know anything about tuple one, it doesn't exist. That would return an incorrect answer.
So we need a way to identify in memory that there used to be a tuple one here, and here's the information on where to go find it on disk. That's the metadata we're going to keep track of. Then, when we recognize that we actually do need data that's on disk, the questions are how much data we should bring in, what we do with the query or transaction that requested it, and where in memory we actually put the requested data: the table heap, or maybe a private buffer. We'll cover all of these design decisions. They come from a paper we wrote in 2016 with one of my PhD students, where we were looking at exactly this: how do you support larger-than-memory databases in an in-memory database system? What I'll say, though, is that this was based on a system I helped build in grad school called H-Store, which was then commercialized in 2008-2009 as VoltDB. Most of the mechanisms we're talking about here are tuple-oriented: fine-grained identification of individual tuples. But the paper I had you read, LeanStore, which came well after this particular paper, does this at the page level. Some of the things we'll talk about here don't make sense if you're doing it at the page level, because there you have a coarser notion of when things get evicted. But it's important to understand the fine-grained approach first; then you'll hopefully appreciate better why the LeanStore idea is, I think, really good. So again, the first thing we have to do is identify what data is cold in our database. The first choice is what I'll call online identification, meaning that as the database executes queries, we keep track of what tuples and data they're accessing, and we maintain some metadata directly in the tuples or pages themselves that we update as our queries access them. The reason you want to embed it in the tuple itself is that you don't want to consult an auxiliary data structure on every access, finding the tuple's entry in some tracking structure and then updating it. You'd be paying the storage penalty of that extra data structure, plus the computational overhead of updating it every single time you access the tuple. When we did this in H-Store, we stored in each tuple a 64-bit pointer to the next most recently accessed tuple; we were essentially maintaining an LRU chain over all the individual tuples. Sixty-four bits per tuple is not ideal. If the tuple is very large, with a lot of attributes, the overhead is relatively small, but if it only has, say, two 32-bit integers as its attributes, then an additional 64-bit metadata pointer essentially doubles the size of the tuple just to keep track of whether it's hot or not. (A sketch of this follows below.)
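A sketch of what that per-tuple metadata implies, with hypothetical field names (H-Store's actual tuple header differs in its details):

```cpp
#include <cstdint>

// Online identification: every tuple carries a 64-bit link so that all
// tuples can be threaded into a single LRU chain. On a (sampled) access,
// the tuple is unlinked and moved to the head of the chain.
struct TupleHeader {
    uint64_t lru_next;  // address of the next most recently accessed tuple
    // ... the rest of the header: txn metadata, null bitmap, etc. ...
};

// The overhead argument from above: for a tuple whose payload is two 32-bit
// integers (8 bytes), this 8-byte link doubles the size of the tuple.
```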
The second approach is to go offline. What that means is that the database system records something like an in-memory log of all the individual accesses that queries and transactions make while they're running, but this gets updated in a buffer private to the thread doing the access, so there's no contention on a global data structure. We're not storing this metadata in every single tuple and paying the penalty we saw above; it's just some additional tracking with low overhead at runtime as queries access data. Then, periodically, a background thread picks up all this log information from the different threads and computes histograms to figure out the access frequencies of individual tuples. So the online approach is cooperative, with all the threads helping to maintain the metadata, while the offline approach uses a separate background thread to combine the information together. All right, the next thing is how to recognize when the database system is running out of memory. A simple approach is an administrator-defined threshold: when my database system's total memory usage gets to be about 80% or 90% of the memory allocated to it, fire off the eviction policy and start shoving things out to disk. In this case, the database system is responsible for removing the data, because nothing else can control this: we're managing our own memory, and we can't rely on the operating system. Typically, for every database system, whether it's in-memory or not, the system has to be told by a human how much memory it's allowed to use, whether that's the buffer pool size for a disk-oriented system or the heap size for an in-memory system. At the end of the day, someone has to tell us how much memory we're allowed to use. So you set a threshold, say 85% of that total upper bound, and then you start evicting. The other approach is to do this on demand: if the database system recognizes that it doesn't have any more memory to bring in a new piece of data that it needs from disk, it runs its replacement policy, which we'll talk about in a second, but think LRU, to identify which pages are no longer needed or are unlikely to be used in the future, shoves them out to disk, and reclaims that space for the new data it wants to bring in. Again, you could use LRU, or clock to approximate LRU; LeanStore uses an approximation of clock, which they call second chance, and we'll see what that looks like shortly. All right, so now, assuming we have a way to fire off the eviction and a way to identify the cold data we need to move out to disk, the next question is: once it's been written to disk, what do we maintain in memory to keep track of it? There used to be tuples here, data here in memory, but it no longer is, and here's where to go find it. The first approach is to use tombstones. This is where we have a marker, a special kind of tuple in our table, that says: the tuple you're looking for at this address does not exist here, but here's where to find it, here's the block ID and the offset within that block to go get the data you're looking for. Any time you follow a pointer during a scan and come across one of these tombstone tuples, you can then go to the cold data storage layer and say: go get this data that I'm looking for.
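Here is a minimal sketch of the tombstone scheme just described; the helper functions are assumptions for illustration, not a real API:

```cpp
#include <cstdint>

struct Tuple;  // opaque in-memory tuple

// A tombstone records where an evicted tuple now lives on cold storage.
// Indexes that pointed at the evicted tuple are updated to point here.
struct Tombstone {
    uint32_t block_id;  // which evicted block to fetch
    uint32_t offset;    // where inside that block the tuple sits
};

// Assumed helpers, for illustration only.
bool IsTombstone(const void *entry);  // e.g., checks a tag bit on the pointer
const Tuple *FetchFromColdStorage(uint32_t block_id, uint32_t offset);

// A point lookup whose index entry may be a tombstone instead of real data.
const Tuple *Lookup(void *index_entry) {
    if (index_entry == nullptr) return nullptr;  // key truly does not exist
    if (IsTombstone(index_entry)) {
        auto *ts = static_cast<Tombstone *>(index_entry);
        return FetchFromColdStorage(ts->block_id, ts->offset);
    }
    return static_cast<Tuple *>(index_entry);    // still in memory
}
```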
So if you evict a tuple out to disk and replace it with a tombstone, then you have to update every index for that table that was pointing to the old evicted tuple to now point to the tombstone tuple. That way there are no false negatives; your data is always identifiable. The next approach is to use bloom filters. Here you evict all the tuples out to disk, but instead of maintaining a tombstone per evicted tuple, you have a bloom filter, an approximate data structure for set membership, that says whether there could exist a tuple for a given key out in cold data storage. What happens is: if I'm looking for a particular key in a table, I check the real in-memory index. If it says the key is there, it has a pointer to the real tuple, so I'm done; it's in memory. If it says it's not there, I check the bloom filter, and if that says it's not there, then I know the key doesn't exist at all, because a bloom filter never gives you false negatives, though it can give you false positives. If it says the key might exist out in cold data storage, then I fetch an on-disk index, bring it into memory, and find the real location on disk of the tuple I'm looking for. I'll explain these two in the next slide. The other two approaches are database-system-managed memory and OS-managed memory, that is, virtual memory. For the first one, the database system keeps track, at a page or block level, of whether the page actually exists in memory or not. I can't tell you whether an individual tuple exists, but if I know it would exist in this block, and that block has been written to disk, then I know the tuple I want is on disk. OS virtual memory works the same way, except the OS, rather than the database system, tracks what's in memory versus what's on disk. We're going to focus on the first two, because those use the fine-grained tuple identification approach we've talked about. We'll discuss database-managed memory with LeanStore and Umbra, and we'll see a technique from EPFL for OS-managed memory, using VoltDB, in a few more slides. So we'll cover all of these, but first I'm going to show an explicit example of the tombstone and bloom filter approaches. This is the same setup we had before: in-memory index, in-memory table heap, and cold data storage. Assuming we have some way to observe how our tuples are being accessed, we can compute a histogram that gives the access frequency of every individual tuple. We can use this information to identify that these three tuples are the least likely to be accessed again, because they've been accessed the least often since the last time I did this check. So I'm going to write those out to my cold data storage. Now the indexes are still pointing to the old locations. If I'm using tombstones, then for every tuple I removed I'll have a corresponding tombstone tuple that holds the block ID and the offset of where it exists on disk, and I update my in-memory indexes so that, instead of pointing to the old locations in the table heap, they now point to the tombstone tuples.
So if a query comes along and looks up key XYZ, it follows the index and gets one of these pointers. I check a flag that says: you're looking at a tombstone tuple, not a data tuple. Therefore I know its block ID and offset tell me where to fetch the data I'm looking for on disk. In this example I'm showing the tombstone tuples stored in the same data table as the regular in-memory tuples. In actuality, though, you want to store them in separate tables, one per regular table, because the schemas differ. In the regular table, tuples can have any arbitrary number of columns, as defined by the schema when you created the table; tombstone tuples only store two 32-bit values, the block ID and the offset. So I wouldn't want to store them in the same table: if I allocate the full amount of space a regular tuple needs for each tombstone, I get no benefit and I'm just wasting space. For that reason they're stored separately. The reason you want a separate tombstone table per data table, instead of one giant tombstone table for all tables in your database, is sequential scans, where I'm not going through an index but scanning one by one through contiguous regions of memory for my table: after I'm done scanning the regular in-memory tuples in the table heap, I need to scan the tombstone tuples. Now, I can't do any predicate evaluation on the tombstones, because there's no data in them; all the data is out on disk. But there are some queries I could still answer, like count queries, where I just count the number of tuples that exist in the entire table. I could run that kind of query over the tombstones and still get the correct count without going out to disk. Or, if I'm also maintaining in-memory zone maps of what values exist out there, recorded per block of tombstone tuples, I could still scan those and compute answers for some queries. But any time I need to know the exact value of one attribute within one individual tuple, I've got to go out to disk and get it. Another optimization you can do, which we never actually implemented in H-Store, is that even though the tuple is out on disk, you can still maybe use the indexes as covering indexes: if all the attributes I need to answer my query can be found in the index, I don't even have to follow the pointers to the data on disk; the index has everything I need. Another thing we thought about but never did: say I have indexes for a table that together cover all the attributes of the table, like three attributes with a separate index on each. My query does a lookup on one of those attributes and lands on this tombstone tuple, but then, rather than going out to disk, I could maybe do a reverse search in my other indexes and find the corresponding keys that match my value, basically just scanning along the leaf nodes. That might end up being cheaper than fetching a block from disk and bringing it into memory, because it's more than just the disk I/O: I've got to bring it into memory and potentially update indexes as well.
So one idea would be: if I don't have a single index that covers my query, but the combination of them, the intersection of them, would, it might be cheaper to do leaf-node scans, depending on what data structure I'm using, to find all the matches for my given tombstone. But we never actually implemented that. All right, the next one is the bloom filter approach. The idea here is that we have a bloom filter alongside every index we have in memory, and then, on disk, for every block or set of blocks, we have an on-disk index, just another B+tree, that we can use to find, for a given key, where the tuple is located. The point is that we want to reduce the size of the in-memory index: in the tombstone case, I still keep all the keys for tuples that have been evicted, but with the bloom filter, any key whose tuple has been evicted is removed from the index, so the index shrinks. So if a query comes along and asks, does key X exist: I always check the in-memory index first. If it says it's there, I have a pointer to the tuple and I'm done. If it says it's not there, I consult the bloom filter. If the bloom filter says the key doesn't exist, I'm done. If it says it does exist, then I do a lookup on the on-disk index, and that tells me the location, the block ID and the offset, of the tuple I'm looking for, which I then have to copy back in.
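A minimal sketch of that lookup order, with assumed component names; the real systems obviously differ in their interfaces:

```cpp
#include <cstdint>

struct Tuple;
struct Key;

// Assumed components, for illustration only.
struct InMemoryIndex { Tuple *Find(const Key &key); };
struct BloomFilter   { bool MayContain(const Key &key) const; };  // no false negatives
struct OnDiskIndex   { bool Find(const Key &key, uint32_t *block, uint32_t *offset); };
const Tuple *FetchFromColdStorage(uint32_t block_id, uint32_t offset);

const Tuple *Lookup(InMemoryIndex &mem, const BloomFilter &bf,
                    OnDiskIndex &disk, const Key &key) {
    // 1. Hot path: the key is still in the in-memory index.
    if (Tuple *t = mem.Find(key)) return t;
    // 2. Bloom filter: a "no" is definitive, so most misses stop here.
    if (!bf.MayContain(key)) return nullptr;
    // 3. Possible false positive: confirm via the on-disk index, then fetch.
    uint32_t block, offset;
    if (!disk.Find(key, &block, &offset)) return nullptr;
    return FetchFromColdStorage(block, offset);
}
```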
So now, depending on whether we're using the tombstones or the bloom filter approach, or OS-managed or database-system-managed memory, we've got to bring our data back in so that our query can run on it. The question we have to deal with is what to do with the tuples we brought in, because in an OLTP environment it's very likely that, for a given block of data, our query only needs a subset of the tuples in that block. What do we do with all the other ones? The first choice is to merge all the tuples we find in the block we fetched from disk back into our table heap, regardless of whether they're needed. What does that mean? In the case of the bloom filter arrangement or the tombstone tuples, we now have to update the indexes to point to the tuples we just merged in. So if I needed one tuple in my block and my block has a million tuples, now I have to do a million index updates to point to the million tuples I just brought in. The downside is that it's very likely the tuples we just brought in will be evicted again, so we get this ping-pong or thrashing effect, where we merge data from disk back into our indexes and table heap, then run our eviction algorithm, and it gets shoved right back out to disk because it's cold all over again. The other approach is to only merge back the tuples within the block that we actually need. In this case we identify the minimum amount we need, bring that into our table heap, and update those indexes, but now we have a problem with the disk block that held the data. If we don't record that we moved that data back into memory, and that therefore the tuple shouldn't exist in that block anymore, it's still physically out there. We could write out a new block without that tuple in it, but that would be expensive, because now, for every single read, just to access a tuple, I do one disk I/O to read the block in, pull the tuple out, merge it back into the table heap, and then another disk I/O to write the block out and record that the tuple doesn't exist there anymore. So instead you maintain some additional bookkeeping to keep track of these holes, and you probably want to record them in a log rather than in the evicted block itself, with some background process to do compaction or coalescing, combining a bunch of blocks that have a lot of holes into a single block. The reason you have to keep track of these holes is this: if I fetch a block and merge a tuple back in, that block shouldn't have that tuple anymore, but it's still physically there, and I haven't recorded that it shouldn't be. If that tuple gets evicted again and written to another block, I could have the same tuple duplicated across multiple cold data blocks, and if I crash and come back, I don't know which one is the right one to merge in; I'd have to do a bunch of work on the recovery side to deal with that. The second-to-last issue is what threshold to use to determine whether to merge a tuple back in. With LeanStore we don't really have this issue, because everything is controlled by pages that can be swapped in and out of disk; this is only needed when you're doing the fine-grained tuple-based identification and eviction. The easiest approach is what I said before: always merge. Whatever tuples I decide I want to merge, I always put them back in the table heap and update the indexes. The next approach is to only merge on an update. This means that if my query just wants to read a tuple, I fetch the block it wants, put it in a temporary buffer, let the query read it, and then immediately discard the buffer. That way I don't have to update any indexes. If the query is trying to do an update, then I merge the tuple in first and then allow the update. And the last one is that we can be a bit clever and maintain the access frequency of evicted blocks, essentially how often they're being retrieved. If a block's access frequency goes above some threshold, then I decide that, within my current time window, this block is being accessed all the time, so it's probably a good idea to keep it around. In the same way we keep track of how tuples and pages are being accessed while they're in memory, we can also maintain information about how often a block is being fetched from disk. (A sketch of these merge policies follows below.)
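Here is a small sketch of those three merge policies as a single decision function. The policy names and the threshold are made up for illustration:

```cpp
enum class MergePolicy { kAlways, kOnUpdateOnly, kAccessFrequency };

// Decide whether a tuple fetched from an evicted block gets merged back into
// the table heap (which means updating indexes) or is served out of a
// temporary buffer and discarded. The threshold is an assumed tuning knob.
bool ShouldMerge(MergePolicy policy, bool is_update, double block_fetch_rate) {
    constexpr double kHotBlockThreshold = 10.0;  // fetches per time window
    switch (policy) {
        case MergePolicy::kAlways:
            return true;               // simple, but invites thrashing
        case MergePolicy::kOnUpdateOnly:
            return is_update;          // reads use a throwaway buffer
        case MergePolicy::kAccessFrequency:
            return block_fetch_rate > kHotBlockThreshold;
    }
    return false;
}
```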
The last issue we want to talk about is what we do with the query or transaction that accessed cold data. How should we respond to it? The easiest thing to do is to abort the transaction or query and restart it. The idea is that when we go to access a tuple that's not in memory, we record what tuple or data it wanted, and then we abort it. A separate background thread kicks off, fetches the data it wanted, and merges it in based on the policy we just talked about. When the data is actually available, we have a way of either restarting the transaction, if it's running as a stored procedure on the server side, or potentially notifying the client to say: hey, the data you wanted is now available. That last one nobody actually supports; basically you'd abort the transaction and send back an exception over ODBC or JDBC with an error code that says the data you want is not in memory, please retry. There are similar error codes for deadlocks and things like that, so it's not infeasible, but no system actually does this explicitly for evicted data in an in-memory database system. The tricky thing with this approach is that if I have a transaction or query that wants to run with a strong isolation level, like snapshot isolation or serializable isolation, I have to have multi-versioning in order to allow a transaction to be restarted multiple times, or paused and restarted, and still see a consistent snapshot of the database. With this approach, yes, you're aborting and restarting, but when you come back the second time, your transaction ID is still the same. The idea is that this gives you a consistent snapshot of whatever you're trying to access, even though you may be running the query across different invocations. The way to think about this: say I have a query that wants to scan the entire table, but only half of the table fits in memory at a given time. With this approach, the query would start, begin scanning, and get through the first half of the table. Then it would try to access the next half, hit either a tombstone tuple or otherwise identify that the tuple it wants is not there, and get aborted. The background thread fetches the remaining data it needs, evicting the first half of the table to make room, and then my transaction or query gets restarted. But restarting from the beginning, I hit the same issue: the first thing I need to fetch is no longer in memory, and I get aborted again. The first half gets paged in, the second half gets written out, and I can never actually complete. And I need to run with a consistent snapshot, since transactions could be updating the second half while my transaction waits to be restarted. If I do have a consistent snapshot, then I essentially have to pause garbage collection while I flush out the first half and bring in the second half, so that when I come back I still have a consistent view of the table. As I said, nobody that I'm aware of actually does this, for this reason: it makes it tricky to guarantee consistency. The more common approach is synchronous retrieval, which is essentially what a disk-oriented database does: when I try to access something that's not in memory, my transaction gets paused, stalled, while another thread or the disk manager fetches the data I need and brings it into memory. Once it's in memory, I'm allowed to proceed. There are some games you can play with some of these. I could let a query keep running for as long as it can, doing a Jedi mind trick where the query says, I want to read this tuple, and you pretend it actually read it. Some queries can tolerate this and some can't, and it's only when they actually try to return results to the application, or do something with the individual attributes of the evicted data, that you go ahead and pause or abort them. (A sketch of the two retrieval policies follows below.)
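A rough sketch of the two retrieval policies, abort-and-restart versus synchronous retrieval. The exception type and helpers are hypothetical:

```cpp
#include <cstdint>
#include <stdexcept>

struct Tuple;
struct Tombstone { uint32_t block_id; uint32_t offset; };

// Assumed helpers, for illustration only.
void EnqueueBackgroundFetch(uint32_t block_id);  // async fetch-and-merge thread
const Tuple *FetchBlockAndWait(uint32_t block_id, uint32_t offset);  // blocking I/O

// Thrown to abort the transaction; the scheduler restarts it later, reusing
// the same transaction ID / snapshot, once the background fetch completes.
struct TupleEvictedException : std::runtime_error {
    explicit TupleEvictedException(uint32_t b)
        : std::runtime_error("tuple evicted, retry"), block_id(b) {}
    uint32_t block_id;
};

const Tuple *Access(const Tombstone &ts, bool abort_and_restart) {
    if (abort_and_restart) {
        // Abort-and-restart: record what was needed, fetch it asynchronously.
        EnqueueBackgroundFetch(ts.block_id);
        throw TupleEvictedException(ts.block_id);
    }
    // Synchronous retrieval: stall this transaction while the block is read,
    // which is essentially what a disk-oriented system's buffer pool does.
    return FetchBlockAndWait(ts.block_id, ts.offset);
}
```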
The idea with that trick is that you can let the query run as long as possible, identify all the tuples or pages it's going to access, and then at some point say: all right, I've seen enough, let me go fetch the big batch of things you need. That way you don't do this run-stall-run-stall over and over again; you fetch it all in one batch and then let the query run more quickly. All right, so now I want to go through a bunch of different implementations of systems that support larger-than-memory databases. The way this is organized is that the first few all use the fine-grained tuple-based approach we've talked about so far, and then the remaining systems are page-based. At a high level, they all achieve the same goal: an in-memory database system that can maintain and store databases larger than the amount of memory available to the system, without, to some extent, having to rewrite the application. The spoiler is that the page-based approaches, in particular LeanStore and Umbra, are, I think, the right way to go, and the tuple-based ones are a little too fine-grained: the storage and computational overhead you pay to support them is not worth it. We'll go through the issues one by one, but I think the LeanStore/Umbra approach is probably the right way to go. As I said before, the paper covering all these implementation issues and design decisions was based on a system I helped build in grad school called H-Store, which was commercialized as VoltDB. An early prototype of H-Store that supported writing data out to disk was a component we called anti-caching. Again, the idea is that it's the reverse of a cache: instead of fetching hot data from disk and pulling it into memory, we push cold data out from memory onto disk. Given all the design decisions we talked about: H-Store does online identification, maintaining an LRU chain through the header of every single tuple, with sampling so that we're not updating it every single time a transaction runs, to avoid too much overhead, but it's still 64 bits per tuple. It has an administrator-defined threshold to identify when you're running out of space and kick off the eviction policy. It uses tombstone tuples. It uses the abort-and-restart approach, which it can do because all transactions in H-Store have to run as stored procedures. It works at block-level granularity when merging in data, and the original implementation would always merge: it couldn't decide, oh, this piece of data is being updated, so let me merge just that piece or use a side buffer; it would always grab the whole block and update all the indexes as it merged everything in. Needless to say, this had quite a bit of overhead. The other major implementation from around this time that supported larger-than-memory databases was part of the Hekaton project. Hekaton was the in-memory storage engine Microsoft built for SQL Server; I think now it's just called in-memory tables, but if you search for Hekaton, it shows up. This was a side research project called Project Siberia
that was looking at how to extend Hekaton to support databases that exceed the amount of memory available. The main thing about this one is that it's where the bloom filter approach we talked about came from. The idea is that you do offline identification, with the same administrator-defined threshold as H-Store; you remove the keys of evicted tuples from the in-memory indexes, but maintain a bloom filter to keep track of whether a key could exist out on disk. When transactions try to access that cold data, they are paused, stalled, while it's fetched and brought in. They always merge at tuple-level granularity, and, like H-Store, they always merge everything. To the best of my knowledge, all the Siberia stuff from Microsoft never made it to production, never made it into the real Hekaton system. I think part of this was the engineering complexity of maintaining the cold data plus the hot data, and the inconsistent performance you could get: sometimes your tuples are in memory, sometimes they're not, and you don't really know until you actually run. For these reasons Hekaton never got this functionality, and likewise I don't think VoltDB ever got the anti-caching stuff we were doing in H-Store either. Another interesting system came from EPFL, developed by Natassa Ailamaki and her group in Switzerland. Natassa used to be the database professor at CMU before I showed up, and she actually taught the earlier versions of 15-721, 13 years ago. So what do they do? They do offline identification, the same as Siberia, but they let the OS manage and make the decisions on how to evict things out to disk. They're clever about this, though: they only let the OS evict certain portions of the table heap. They maintain a hot portion that never gets written out to disk, move data to the cold portion of the heap, and let the OS page things out as needed. Because they're using mmap and virtual memory, with the OS swapping things out, they do synchronous retrieval, since the database doesn't actually know what's in memory and what isn't. They work at page-level granularity, because that's how mmap works, and they always merge, because the rest of the system doesn't know that a tuple has been evicted: it still has a pointer into the cold portion of the heap, and when it goes to access it, it page-faults and the page gets swapped in. This is the key part that's different from what we've seen before. So let's see how this works. Here's our mmap'd table heap, and here's our cold storage; this is just swap space managed by the operating system. They divide the memory heap into hot tuples and cold tuples, and for the hot portion of the heap they use mlock to pin those pages in memory; mlock prevents the OS from writing that data out to disk. The cold tuple portion is not locked, not pinned, so the operating system can decide to evict that data at any time. (A sketch of this split-heap setup follows below.)
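A minimal sketch of that split-heap setup using the real mmap/mlock system calls; the structure and sizes are my own assumptions:

```cpp
#include <sys/mman.h>
#include <cstddef>

// One anonymous mmap region split into a hot portion, pinned with mlock()
// so the OS must keep it resident, and a cold portion the OS may page out.
struct SplitHeap {
    void *base = nullptr;
    size_t hot_bytes = 0;
    size_t cold_bytes = 0;

    bool Init(size_t hot, size_t cold) {
        hot_bytes = hot;
        cold_bytes = cold;
        base = mmap(nullptr, hot + cold, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) return false;
        // Pin only the hot region; the cold region remains swappable.
        return mlock(base, hot_bytes) == 0;
    }

    void *HotRegion() { return base; }
    void *ColdRegion() { return static_cast<char *>(base) + hot_bytes; }
};
```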
So let's say the offline identification pass that the database system runs identifies that this tuple here is cold. What we're going to do is copy it into the cold tuple region; it's essentially a delete followed by an insert, like an internal transaction doing a delete followed by an insert, and this automatically updates the indexes to point to the new physical location of the tuple. Then we can put whatever new data we want in the old slot. At some later point, if the OS decides it's running out of memory, it can swap out any of the pages in the cold tuple region, but we have no way of determining whether this data is in memory or not, other than asking the operating system, which would be a syscall, usually a bad idea. The tricky thing we have to make sure of, of course, is that all our tuples are page-aligned, because we don't want a tuple split across multiple pages. Then things get tricky: we'd have to make sure those pages get written out in the correct order so we never see one without the other. We'd want that to be atomic, which is not something the operating system can provide for us, because mmap cannot be atomic across multiple pages, which means we'd have to maintain a log somewhere to keep track of what's happening. So I think this is an interesting idea: it's using mmap in a clever way for data that is read-mostly, but to handle the case where there are updates, you need a log to keep track of what's happening. The last tuple-based approach I want to talk about comes from Apache Geode. Apache Geode was originally a database system called GemFire. Long story short: GemFire was developed by a company called GemStone, and then Pivotal bought them, and then VMware bought them, but then VMware decided they didn't want to own database companies, so they divested it and combined it with EMC's Greenplum, and that became Pivotal. Then I guess Pivotal decided they didn't want to actively maintain GemFire, so they dumped it off to the Apache Foundation, and it's now Apache Geode. So what do they do? Online identification, like H-Store; an administrator-defined threshold, just like before; and as far as I know they use tombstones to identify that tuples have been removed from memory and now reside on HDFS (this all assumes you run on HDFS). They do synchronous retrieval: they block queries and bring things in. They work at tuple-based granularity, but the interesting thing is that they only merge tuples back into memory and discard the disk-based blocks on updates. The reason is that HDFS is append-only, so it would be a pain for them to do in-place updates to a block that already exists. You'd have to write a log message saying this old block has been evicted, sorry, has been pulled back into memory and is no longer considered valid or up to date. So in their world, to avoid that penalty, and avoid writing things out to disk again later, they just write a log message saying this thing has been removed, and they only do it on an update. As far as I know, and I'm sure people are using Geode, I have not come across anybody in the wild that's been using these overflow tables in Apache Geode. Okay, so as I've alluded to, everything we've talked about so far, the H-Store, Hekaton/Siberia, VoltDB, and Apache Geode examples, is doing evictions on a per-tuple basis.
That means we have to keep track of how individual tuples are being read and updated, and then we make decisions, on a per-tuple basis, about how to evict them, combining them into a page or block and writing them out to disk. But with the exception of Project Siberia, none of these actually reduce the size of the indexes, and furthermore, none of them can spill the index, or portions of the index, out to disk. Only Project Siberia would pull the keys of evicted tuples out of the index, reducing its size, and replace them with a bloom filter, which is much smaller than the keys would have been had they stayed in the index. Other than that, none of them can say: well, this portion of my indexes isn't being updated or accessed very often, so let me shove it out to disk. And as we saw when we talked about compression, where I showed at the end how to do index compression for some OLTP databases, the size of the indexes can be quite significant relative to the overall size of the database; in some examples we saw, the indexes were up to 60% of the total memory footprint of a database. So all of the techniques here are not actually targeting, for those applications, what the bulk of the memory is being used for. What we really want is a unified approach, a unified model, for evicting cold data out to disk, whether it comes from tables or from indexes. We shouldn't need separate policies and separate mechanisms for each; it should be a single approach used across both types of data. This is how we ended up with the paper I had you read on LeanStore. LeanStore is a prototype in-memory storage manager developed by the Germans at Munich who worked on the HyPer system, though to the best of my knowledge it wasn't actually part of the HyPer project; it was a standalone, separate thing. What's really cool about LeanStore is that it's designed from the ground up to evict pages, and it doesn't know or care whether the pages belong to data tables or to indexes. The way to think about what they're doing is that LeanStore provides a decentralized buffer pool manager based on a page hierarchy: instead of a regular page table, an unordered list of pages, it's organized as a tree structure, and the main rule is that you can't evict a parent page unless all of its child pages have been evicted to disk first. Also, because it's page-based, the overhead of tracking how pages are accessed is much less than doing it on a per-tuple basis as in the other systems we talked about. Now, the way they actually do the tracking is interesting, because most of the time they don't track at all. The way they decide what to evict is that they randomly select some pages, and once a page is selected, they start tracking whether it's being accessed. Then at some point, when we need to free up space, we look at the tracking information from the last round of selected pages, find the ones that haven't been touched, and evict those. So you only turn on the access tracking for a page once it has been randomly selected as an eviction candidate. (A sketch of this follows below.)
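A loose sketch of that sampling-based selection. This simplifies what LeanStore actually does (its cooling stage is more involved), so treat it as the idea, not the system:

```cpp
#include <random>
#include <vector>

// Randomly pick pages and only then start watching them; anything still
// untouched by the next round is evicted.
struct Page {
    bool cooling = false;   // tracking starts only once the page is selected
    bool accessed = false;  // set by any access that hits a cooling page
};

void SelectCandidates(std::vector<Page *> &pages, size_t n, std::mt19937 &rng) {
    if (pages.empty()) return;
    std::uniform_int_distribution<size_t> pick(0, pages.size() - 1);
    for (size_t i = 0; i < n; i++) {
        Page *p = pages[pick(rng)];
        p->cooling = true;   // begin tracking this page
        p->accessed = false;
    }
}

void EvictUntouched(const std::vector<Page *> &pages) {
    for (Page *p : pages) {
        if (p->cooling && !p->accessed) {
            // Not referenced since selection: write it out. (LeanStore would
            // also require all of this page's children to be evicted first.)
            // WritePageToDisk(p);  // assumed helper
        }
        p->cooling = false;
    }
}
```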
We'll go through these policies in more detail, but the key mechanism they use, instead of a tombstone or a Bloom filter, the key thing they use to figure out whether a block or page exists in memory, is a technique called pointer swizzling. I don't think we covered pointer swizzling much in the intro class, so I'm going to cover it now in the advanced class. The idea of pointer swizzling is that we flip the contents of the pointer that one object holds to another object, based on whether we know the object being pointed to is in memory or not. To record this, we only need one bit in the pointer: we'll use the first bit to say whether the target is on disk (1) or in memory (0), and then the remaining 63 bits of our 64-bit pointer are used for the address. That's not entirely true either, because current x86 architectures only use 48 bits of a 64-bit address to locate something in memory, so they can only address up to 256 terabytes. Intel claims they're eventually going to use the whole 64 bits; I was at a talk Intel gave three or four years ago where they said, hey, don't store anything extra in the remaining 16 bits of your pointers, because eventually Intel is going to use them. But as far as I know that hasn't happened yet, it's still just 48 bits, so you can use the upper 16 bits to do whatever you want.

So say we have two blocks, and block B1 has a pointer to block B2. When the pointer is unswizzled, meaning it points to something on disk, the first bit is set to 1 to say it's unswizzled, and then we just have our page ID and offset, say a 32-bit page ID and a 32-bit offset, 64 bits in total. Now, if we go and swizzle that address because we brought the block into memory, we flip that first bit to 0, saying that what you're looking at is a real memory address, not a page ID and offset, and the remaining bits are used to point to the block's memory location. When you pass that memory address to the hardware to do a load, the tag bit is ignored because it's not part of the 48 bits that x86 cares about. So we don't have to do anything special when we dereference these swizzled addresses; the hardware just takes care of it for us.

The reason you want to do this is that it gives us a decentralized way to track whether a page is in memory or not. This is what the page table does in a disk-based system: it's a giant map that says page ID 123 is on disk or in memory, and if it's in memory, here's its address. The mapping table in the Bw-Tree was essentially the same thing. Rather than having that centralized data structure, if we ensure in our database system that only one block ever has a pointer to another given block, then we know there is exactly one location holding a pointer to that block, so we can just do a compare-and-swap on it to flip it between swizzled and unswizzled, without having to worry about atomic updates across multiple pointers. And if you think about a B+tree, if you ignore sibling pointers, you essentially get this property for free, because a B+tree without sibling pointers is an acyclic structure.
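Here's a minimal C++ sketch of that tagging scheme, with hypothetical names (this isn't LeanStore's actual API). It packs either a raw pointer or a (page ID, offset) pair into one 64-bit word, using the top bit as the swizzled/unswizzled flag, and uses compare-and-swap to flip it, which is safe precisely because the design guarantees a single owning pointer per block. Note one detail the example above glosses over: with the tag bit taken, the page ID really only gets 31 bits in this layout:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch of a LeanStore-style "swip": one 64-bit word that is
// either a real pointer (swizzled) or a page ID + offset (unswizzled).
// Top bit = 1 means unswizzled/on disk; 0 means a usable memory address.
class Swip {
  static constexpr uint64_t kUnswizzledBit = 1ull << 63;

 public:
  bool IsSwizzled() const {
    return (word_.load(std::memory_order_acquire) & kUnswizzledBit) == 0;
  }

  // Swizzled form: the word is the address itself. x86 ignores the upper
  // 16 bits when dereferencing, and here the tag bit is 0 anyway.
  void* Pointer() const {
    return reinterpret_cast<void*>(word_.load(std::memory_order_acquire));
  }

  // Valid only when unswizzled: bits 32..62 hold the page ID (31 bits,
  // since bit 63 is the tag), bits 0..31 hold the offset.
  uint32_t PageId() const {
    return static_cast<uint32_t>((word_.load() & ~kUnswizzledBit) >> 32);
  }
  uint32_t Offset() const {
    return static_cast<uint32_t>(word_.load() & 0xFFFFFFFFull);
  }

  // Flip from (pageId, offset) to a memory address once the block is
  // loaded. A single CAS suffices because only this one swip points at it.
  bool Swizzle(uint64_t expected_disk_word, void* addr) {
    uint64_t desired = reinterpret_cast<uint64_t>(addr);  // tag bit is 0
    return word_.compare_exchange_strong(expected_disk_word, desired);
  }

  // Flip back to page ID + offset when cooling or evicting the block.
  // page_id must fit in 31 bits here because of the tag bit.
  bool Unswizzle(void* expected_addr, uint32_t page_id, uint32_t offset) {
    uint64_t expected = reinterpret_cast<uint64_t>(expected_addr);
    uint64_t desired =
        kUnswizzledBit | (static_cast<uint64_t>(page_id) << 32) | offset;
    return word_.compare_exchange_strong(expected, desired);
  }

 private:
  std::atomic<uint64_t> word_{0};
};
```

A query that lands on a swizzled swip just dereferences Pointer() directly; only unswizzled swips pay any extra cost.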
Now, thinking back to LeanStore again: the eviction policy isn't maintaining LRU and isn't running CLOCK; it just randomly picks some blocks for eviction and then figures out whether they're actually still being accessed. What's great about this is that you don't have to maintain any metadata for hot data, as we did in the other approaches: if a block is hot, you don't track any information about it, but if it's potentially cold, you start tracking. So what happens is that I select some blocks as eviction candidates, and I maintain a global hash table that keeps the tracking info for how each of those blocks is being accessed. Then, when it's time to free some memory, I check that hash table, and for any blocks that haven't been accessed, I know it's safe to evict them. So again: I randomly pick some blocks and unswizzle their pointers, meaning they revert back to page ID and offset, but I don't actually evict the blocks; they're still sitting in memory. If a query then comes along and hits an unswizzled pointer with a page ID and offset, I check that global hash table for the data in my cooling stage. If the block is in there, I know it's actually still in memory and I can find out where it is; if it's not in there, then it really isn't in memory at all, and I go out to disk and get it. We'll see an example of what this looks like in a second.
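To make that lookup path concrete, here's a minimal C++ sketch with hypothetical names (CoolingTable, Resolve, and so on are mine, not LeanStore's actual API), leaving out latching and the re-swizzling of the parent pointer:

```cpp
#include <cstdint>
#include <unordered_map>

struct Block;  // opaque in-memory page/block

// Hypothetical sketch of the cooling-stage lookup: page ID -> block that
// was unswizzled but is still resident in memory.
class CoolingTable {
 public:
  // A query followed an unswizzled pointer. Two outcomes:
  //  1. block is cooling -> still in memory; promote it back to hot.
  //  2. block is cold    -> really gone; read it from disk.
  Block* Resolve(uint32_t page_id) {
    auto it = cooling_.find(page_id);
    if (it != cooling_.end()) {
      Block* b = it->second;
      cooling_.erase(it);  // promote: remove from the cooling stage...
      // ...and the caller then re-swizzles the parent's pointer to `b`.
      return b;
    }
    return ReadFromDisk(page_id);  // truly cold: fetch from disk
  }

  void AddToCooling(uint32_t page_id, Block* b) { cooling_[page_id] = b; }

 private:
  Block* ReadFromDisk(uint32_t /*page_id*/) {
    // Elided: free a frame (evict the oldest cooling block) and issue I/O.
    return nullptr;
  }
  std::unordered_map<uint32_t, Block*> cooling_;
};
```

Only the first query that touches a cooling block pays this hash-table lookup; once it promotes the block and re-swizzles the pointer, everyone else is back on the fast path.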
Now, what's important about LeanStore is that with this block hierarchy there is no centralized buffer pool table, so we need a way to ensure that there is never more than one pointer to a given block in the database; that way I can unswizzle a single location and know that every path to the block is covered. So they organize everything as a tree hierarchy: every parent can have pointers to its children, and that parent is the only thing holding a pointer to any given block. Indexes are already essentially managed like this, if you ignore sibling pointers. For table heaps, you have to organize them as a hierarchy too and then scan them with a breadth-first traversal; in LeanStore's case they always used index-organized tables, and in Umbra's case they do the same thing.

Why do you want this? Again, it means I can't evict a parent to disk unless its children have already been evicted, and that prevents the situation where I evict a parent while its children are still in memory. If I did that, the parent would be written out to disk holding in-memory pointers to wherever its children lived at the time, and when the parent gets fetched back in, those memory addresses might point to frames in the buffer pool that now hold different pages, so I'd be pointing at garbage rather than at the things I should be looking at. By ensuring the children get evicted first, I can update the parent to hold unswizzled pointers to its children, so that when the parent gets written out to disk and later brought back in, I can follow those unswizzled pointers to find the pages with the tuples I'm looking for. And in the same way that I can't evict a parent before its children have been evicted, I can't fetch a child until its parent has been fetched, which is guaranteed because everything lives in this tree hierarchy.

All right, so let's look at an example. Say we have a really simple database with only four pages. We break the hierarchy up into three stages: the hot stage, where everything is in memory with swizzled pointers; the cooling stage, where a block still exists in memory but is reached via an unswizzled pointer; and the cold stage, where the block is out on disk with unswizzled pointers. Now say I run my eviction algorithm, it randomly picks some blocks to write out to disk, and it picks B1. I move B1 down into the cooling stage and add it to the hash table that tracks how often the block is being accessed. That hash table basically says: for B1, here's where to find it in the eviction queue, and the eviction queue entry holds the actual memory pointer to the block. I also have to make sure I update the pointer in the parent block to its unswizzled form. Now, if anybody comes along and tries to access B1, they go through B0, which is in memory, hit the unswizzled pointer, and do a lookup in the hash table; they find the entry for B1, follow it to its location in the eviction queue, and that gives them the real memory address. So you pay a bit of a penalty, just as you would with the indirection layer of a page table in a disk-oriented database system, but it's actually not that bad: if this really was hot data, then the first time this happens I bump the block back up to the hot stage, remove it from the eviction queue, and re-swizzle the pointer, so only one query ever pays the penalty. And if the block really was cold, then nobody comes across it, it eventually moves through the hash table and out to disk, and no harm, no foul.

What's also neat is that the eviction queue naturally orders the blocks from oldest to newest by when they were added. So if I need to fetch something from the cold stage on disk and bring it into memory, I just evict whatever is at the front of the queue, shove it out to disk, and use that space for the block I need. And there's only one latch you need in this whole architecture: when you move things from the cooling stage to the cold stage. That latch protects against evicting a block at the same time someone else is trying to fetch the same block back in, so you don't end up with duplicates and get fouled up. So there's only one latch for that concurrent operation, and that case is pretty rare.
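Here's a rough C++ sketch of that eviction queue plus hash table bookkeeping. The names are hypothetical, and I'm leaving out all the concurrency control except the one eviction latch discussed above, so treat this as the shape of the data structures, not the real implementation:

```cpp
#include <cstdint>
#include <iterator>
#include <list>
#include <mutex>
#include <unordered_map>

struct Block;  // opaque in-memory block

// Hypothetical sketch of the cooling stage: a FIFO queue of unswizzled but
// still-resident blocks, plus a hash table from page ID to queue position.
class CoolingStage {
  struct Entry { uint32_t page_id; Block* block; };

 public:
  void Add(uint32_t page_id, Block* b) {
    queue_.push_back({page_id, b});              // newest at the back
    index_[page_id] = std::prev(queue_.end());
  }

  // A query touched a cooling block: pull it back out (promote to hot).
  Block* Promote(uint32_t page_id) {
    auto it = index_.find(page_id);
    if (it == index_.end()) return nullptr;      // not cooling -> truly cold
    Block* b = it->second->block;
    queue_.erase(it->second);
    index_.erase(it);
    return b;
  }

  // Need a free frame: evict the oldest cooling block. This is the one
  // place that takes a latch, so an eviction can't race with a fetch of
  // the same block and create duplicates.
  Block* EvictOldest() {
    std::lock_guard<std::mutex> guard(evict_latch_);
    if (queue_.empty()) return nullptr;
    Entry victim = queue_.front();               // FIFO: oldest first
    queue_.pop_front();
    index_.erase(victim.page_id);
    return victim.block;                         // caller writes it to disk
  }

 private:
  std::list<Entry> queue_;
  std::unordered_map<uint32_t, std::list<Entry>::iterator> index_;
  std::mutex evict_latch_;
};
```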
So like I said, LeanStore was a prototype system developed by the Germans, and I think they're continuing to work on it. It's separate from HyPer, and it's separate from the Umbra project, which we'll talk about next.

So Umbra is a very new system that the Germans have been building; think of it as HyPer 2.0. They do support larger-than-memory databases using the technique LeanStore proposed, the block hierarchy with the randomized eviction algorithm, but one of the interesting things they do differently is that they support variable-size pages. In LeanStore, and actually in all of the architectures we've seen so far, the pages organized in memory and the pages written out to disk were always the same size. In Umbra, they allow you to allocate memory the same way you would in a slab allocator like jemalloc: you can allocate blocks of different sizes, where each size class is exponentially larger than the previous one, and you address the entire block of a given size rather than individual tuples or offsets within a block. The advantage is that for databases with large text fields, strings, varchars, or other internal data structures like compression dictionaries, which often can't fit in a single page, you avoid the overhead of copying those pages around in memory and reassembling them whenever you fetch things from disk.

The Germans make a really interesting design argument for this system. They say that when you want to support larger-than-memory databases, it's better to have a complex buffer pool manager, one that supports variable-length pages and is a more difficult component to engineer, because you only have to implement that part once, and then the rest of the system becomes easier to develop, since you now have the flexibility of the variable-size buffer pool. It's similar to what we said at the beginning about not wanting to bring back the buffer pool manager and then have to re-architect the entire system to account for the fact that data might not be in memory. Here, you don't have to worry about packing data into a fixed page size; you allocate exactly the amount of memory that's needed for whatever you're doing, and the algorithms and data structures built on top of the buffer pool manager don't have to worry about the complexity of how that's managed. I think that's really interesting, and to the best of my knowledge this is the only database system actually doing this, which is pretty exciting.

Just like in LeanStore, they store everything in this hierarchy, and all the tables, all the relations, are stored as index-organized tables, basically like MySQL's InnoDB, where all the tuples are stored in the leaf pages of a B+tree. That gives you the natural hierarchy we needed in LeanStore, where every child has exactly one parent holding a pointer to it.

Here's a high-level overview of what's going on in the variable-size buffer pool. Again, it works like a slab allocator: there's a bunch of frames in the buffer pool for different size classes, with more frames for the smaller size classes and fewer frames for the larger ones, because most of the chunks of memory you allocate are going to be quite small. The smallest size class is 64 kilobytes, and that goes up to 512 kilobytes. Within the buffer frames you have inactive and active pools, and everything is managed using anonymous mmap, which gives you virtual memory. That means I can reserve all the address space I would need in my buffer pool for the total size of the database, but it isn't backed by physical memory until I actually touch it; the OS only backs it when it's actually needed.
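Here's a minimal C++ sketch of those two ideas together, under my own hypothetical names: computing an exponential size class starting at 64 KiB, and reserving a large region with anonymous mmap so the OS only attaches physical memory to the pages you actually touch:

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>

// Hypothetical sketch of an Umbra-style variable-size allocation scheme.
constexpr size_t kBaseSize = 64 * 1024;  // smallest size class: 64 KiB

// Size classes double: class 0 = 64 KiB, class 1 = 128 KiB, class 2 = 256 KiB...
inline unsigned SizeClassFor(size_t bytes) {
  unsigned cls = 0;
  size_t cap = kBaseSize;
  while (cap < bytes) { cap <<= 1; cls++; }
  return cls;
}

inline size_t BytesForClass(unsigned cls) { return kBaseSize << cls; }

// Reserve virtual address space for a pool of frames of one size class.
// MAP_ANONYMOUS | MAP_NORESERVE (Linux) gives us the address range right
// away, but the OS only backs a page with physical memory on first touch.
void* ReservePool(unsigned cls, size_t num_frames) {
  size_t bytes = BytesForClass(cls) * num_frames;
  void* base = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  assert(base != MAP_FAILED);
  return base;
}

// After evicting a block, give its physical pages back to the OS while
// keeping the virtual address range reserved for reuse.
void ReleaseFrame(void* frame, unsigned cls) {
  madvise(frame, BytesForClass(cls), MADV_DONTNEED);
}
```

So, for example, a 100 KiB compression dictionary lands in class 1 and lives in a single contiguous 128 KiB block; nothing needs to be reassembled out of smaller fixed-size pages.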
They also do pointer swizzling, the same way you would in LeanStore or other systems, but what's interesting is the encoding. In the LeanStore case, a swizzled pointer was a memory address and an unswizzled pointer was a page ID plus an offset. Here you don't have offsets into pages anymore, because you have a fine-grained pointer to a page that contains exactly the memory you need; it's the right size. If I only need 64 kilobytes, I have a pointer to those 64 kilobytes, and it's up to whoever follows that pointer to interpret its contents to get the data they're looking for. So I don't need to record an offset in the unswizzled pointer, but I do need to record the size class, so that when I go fetch the block I know which pool it belongs to and how big it actually is out on disk. Because again, the OS with its virtual memory doesn't know anything about these different allocation sizes; it only deals in its own page sizes, 4 kilobytes by default, or 2 megabytes if you're using huge pages. It just knows you have some chunk of memory, so it's up to the database system to know how much data actually needs to be read for a given page, and that's what the size class down here encodes. I think this is really interesting; this paper came out only a few months ago, and as far as I know no other system actually implements this. It remains to be seen whether this is the right approach compared to something like LeanStore, but I definitely think it's promising, and I think it's better than the stuff we were doing years ago with H-Store or with the Project Siberia work. The last thing to point out is that these are still 64-bit pointers: the block ID is now 57 bits, and the size class is just 6 bits.
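As a small follow-on to the earlier swip sketch, here's roughly how that Umbra-style unswizzled encoding might be packed, again hypothetically: 1 tag bit, 6 bits of size class, and 57 bits of block ID, with no offset field at all:

```cpp
#include <cstdint>

// Hypothetical packing of an Umbra-style unswizzled pointer:
//   bit 63      : 1 = unswizzled (on disk), 0 = real memory address
//   bits 57..62 : size class (6 bits)
//   bits 0..56  : block ID (57 bits)
constexpr uint64_t kTagBit = 1ull << 63;

inline uint64_t PackUnswizzled(uint64_t block_id, unsigned size_class) {
  return kTagBit | (static_cast<uint64_t>(size_class & 0x3F) << 57) |
         (block_id & ((1ull << 57) - 1));
}

inline bool IsUnswizzled(uint64_t word) { return (word & kTagBit) != 0; }
inline unsigned SizeClassOf(uint64_t word) { return (word >> 57) & 0x3F; }
inline uint64_t BlockIdOf(uint64_t word) { return word & ((1ull << 57) - 1); }
```

Note that the offset field from the LeanStore encoding is gone; the size class alone tells the fetch path how many bytes to read from disk.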
All right, so the last approach I want to talk about is MemSQL, and I'm wearing my old-school MemSQL shirt for it. Traditionally in MemSQL, and even today, you can declare that you want a table stored as a row store, which resides in memory, or you can declare a table as a column store, which can be backed by disk; but there was no way to declare a single logical table that combined both. When they first came out with the column store, up to around 2017, they were using mmap to manage the pages out on disk, but they were just blindly using mmap, not the anonymous virtual-memory allocation that Umbra does, and they quickly found out this was a bad idea. There's a blog article showing they were essentially abusing mmap and getting bad performance. So then they built their own buffer pool manager, where they take the columns, split them into segments of one million tuples each, and know how to fetch those segments in as needed. But they still had a separate row store for transactional OLTP workloads and a separate column store for analytics.

What they announced in 2019 is what they call their new SingleStore architecture, where, like the HyPer stuff or what we're doing in our own database system, you can run transactions on top of a column store. There's a blog article that makes a bunch of claims, sort of making it seem like they're the first to do this, which is not really true, and they have some optimizations to deal with the large overhead of storing nulls that we saw before when we talked about data types. But the point is that they're getting away from having a separate row store and a separate column store, and moving to a single column store approach that can have pages written out to disk. It's unclear how they decide which pages get written out; I suspect it's a basic LRU or CLOCK approach. Going back to our same taxonomy from before: they don't need any eviction metadata, because they just keep track of which pages are and are not in memory; they do synchronous retrieval; and they always just merge things back in.

Okay, so that was a lot. As I said, what we talked about today was bringing the disk back in, which is block-oriented and slow, and trying to be clever about it to avoid slowing down our in-memory architecture. And as I've said multiple times, I think the block-based or page-based approach used in LeanStore and Umbra is the right way to go, and the fine-grained tuple-based stuff we saw in the other systems is not the right approach. The other interesting thing is that everything I talked about today could eventually be made null and void, obsolete, once we finally achieve cheap and fast byte-addressable non-volatile memory. That means memory that goes in a DIMM slot and that you can read and write as if it's DRAM, but that persists when you pull the power, like an SSD. That hardware actually exists, and we'll cover it next week when we talk about databases running on new hardware. For now, SSDs are all we have for non-volatile storage, but in the future, real persistent memory, real non-volatile memory, will make everything we talked about today, I think, unnecessary.

All right, so next class we'll talk about an additional topic that doesn't fit exactly with what we covered today, but it's another way to get big performance improvements in a database system if you know what you're doing: we'll look at how to improve the performance of UDFs, and we'll see two approaches, one from another set of Germans and one from Microsoft. Okay, all right guys, wash your hands and take care of yourselves. See ya. Thanks for watching.