Today's lecture is going to be on handling databases that exceed the size of memory, which is sort of going against everything we've talked about this semester. But we'll see techniques to bring back the disk without, hopefully, slowing us down entirely and regressing to a disk-oriented database system. So real quick, I want to go over the schedule, what's coming up for you guys for the rest of the semester. The first project is due next week on Wednesday. That same day in class, I will announce what the second project is. The second project is a three-person project where you pursue some aspect of the system for the rest of the semester. So I'll describe what's entailed for the second project, what you have to do at each step. But what I'll also do now, today or tomorrow, is send an email on Piazza with a link to a Google spreadsheet, because some of you have already sort of figured out what groups you're in. So you just list the three members of your group. You don't have to pick a project at this point, but you at least want to tell me what group you're going to be in, and that way, if there are any free agents, we can make sure you get put into a group. We'll also announce the extra credit next week. If you took the intro class, it's the same thing: pick your favorite obscure database system and write a Wikipedia-style article for it. I'll give out the list of what systems are still available. I think we're up to 581 now; I found another one last week, Uber announced they have a GPU graph database called AgentsDB or something like that. All right, there's a bunch of different systems you can choose from. March 6th will be the midterm exam, and that'll be in class. It'll be a combination of long-form questions and multiple choice. And then after spring break, we'll do the project proposals in class, and that's basically: you'll get up here with your group, spend five minutes, and say, we're going to build X. But it should be a bit deeper than just "we hope to do this." You actually want to spend time looking at the code and trying to figure out how it is that you're actually going to do it. And you can meet with me if you want to, because I'll be around during spring break. All right, so any questions about any of these things? OK. And then just a reminder about tomorrow: the Splice Machine guy is coming to give a talk, and that'll be at 12 o'clock over in the CIC with lunch served. There'll be pizza. OK? All right.

So before we jump into what today's lecture is about, I want to spend some time explaining what Bloom filters are, because I think some of you didn't know what they were last class, and we need them for today. We're actually going to need them for a lot of other parts of the system going forward. So Bloom filters are a probabilistic data structure designed to answer set membership queries. A set membership query would be: is this key in my set? That's pretty much all you can do with it. It's a probabilistic data structure because it is not going to be deterministic, or rather, it's not going to be exact. We'll never have false negatives: if we ask it, does something exist in our set, it'll never tell us no when the key really is in the set. But we may get false positives: we may ask, does my key exist in my set, and it tells us yes, tells us true, when it actually doesn't.
And Bloom filters are old; they're from the 1970s. The guy that invented them is actually named Bloom, B-L-O-O-M, so that's why they're called that. So a Bloom filter is essentially just a bitmap; that's all it is. But instead of setting single bits directly, the way we saw with some of the trie data structures, any time we want to look something up or manipulate it, we use a series of hash functions to figure out which bits we should be looking at. So if I want to insert some key x, then I'll have k hash functions, and I'm going to hash x with each of them. I just mod the hash value I get from each function by the number of bits that I have, and that tells me which bits I need to set in my bitmap. Now if I need to do a lookup, I do the same thing: I take the key, hash it the same way I did on insert, and just check whether every bit for all my hash functions is set to one. If yes, then I know it's in there. If at least one bit is set to 0, then I know it's not in my set. So let's look at a really simple example. Say we have an 8-bit Bloom filter, and initially all the bits are set to 0. Let's say the first thing we want to do is insert the key RZA. We'll have two hash functions, k equals 2. The hash function doesn't need to be something like MurmurHash or CityHash; it can be simple bit-shifting arithmetic that computes the hash really fast. It doesn't need to be cryptographically secure. So let's say our hash value is 222. We mod by the number of bits, which is 8, and we get 6, so we go set the 6th bit here. Same for the second hash function: after modding by 8 we get 4, and we set that bit. Now our key is in our Bloom filter; it's in our set. Let's insert GZA. Same thing: the first hash function gives us 3, the second hash function gives us 1. So now let's do a lookup. We look up Raekwon. For this one, the first hash function gives us 5, which is not set; the second hash function gives us 3, which is set. Because at least one bit is set to 0, we know Raekwon is not in our set. So again, a Bloom filter will never give you false negatives; it will never tell you something doesn't exist when it really does exist. So this lookup returns false. Now let's try a lookup on ODB. In this one, the first hash function gives us 3, and that's set; the second gives us 6, and that is set too. So this would come back as true. But we know this key is actually not in our set, because we never inserted it. So this is an example of a false positive: it can tell us something exists when it doesn't. So it's a really simple idea, but it's a really powerful primitive that we can use in a bunch of different places in our database system. We'll see this today in the case of Project Siberia for Hekaton. They're going to use a Bloom filter to figure out whether a key actually exists on disk or not. And again, there are no false negatives, so it will never tell you something doesn't exist on disk when it actually does. But you may get false positives: the Bloom filter may tell you a key exists, you go out to disk, and then you find out it's actually not there. So this is a great trade-off: in exchange for having some false positive rate, you can significantly reduce the amount of I/O or the number of operations you need to do. We'll see Bloom filters in a bunch of other areas in the system; for example, we'll use them for joins.
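Just to make those mechanics concrete, here's a minimal Bloom filter sketch in C++. The class, the 8-bit / 2-hash-function sizing, and the seeded std::hash trick are all illustrative assumptions to mirror the example above, not taken from any of the systems we'll discuss:

```cpp
#include <bitset>
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

template <size_t NumBits, size_t K>
class BloomFilter {
 public:
  void Insert(const std::string &key) {
    for (size_t i = 0; i < K; i++) {
      bits_.set(Hash(key, i) % NumBits);   // set one bit per hash function
    }
  }
  // false => definitely not in the set; true => probably in the set.
  bool MayContain(const std::string &key) const {
    for (size_t i = 0; i < K; i++) {
      if (!bits_.test(Hash(key, i) % NumBits)) return false;
    }
    return true;
  }

 private:
  // Cheap "seeded" hash: mix the seed into std::hash's output. Doesn't need
  // to be cryptographic, just fast and different per hash function.
  static size_t Hash(const std::string &key, size_t seed) {
    return std::hash<std::string>{}(key) ^ (seed * 0x9e3779b97f4a7c15ULL);
  }
  std::bitset<NumBits> bits_;
};

int main() {
  BloomFilter<8, 2> bf;   // 8 bits, 2 hash functions, as in the example
  bf.Insert("RZA");
  bf.Insert("GZA");
  std::cout << bf.MayContain("Raekwon") << "\n";  // 0: definitely absent
  std::cout << bf.MayContain("RZA") << "\n";      // 1: present (or false positive)
  return 0;
}
```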
We'll check a Bloom filter before we do a probe on the hash table. And because, again, these hash functions don't need to be super sophisticated, it only costs a few extra instructions to check a Bloom filter to see whether something exists in our hash table, and that avoids the more expensive work of actually probing the hash table. So, is everyone clear on what a Bloom filter is? All right, the only things you can do are insert and lookup. You can't delete, because how are you going to know whether a given bit was set by just your key or by a bunch of other keys as well? You also can't do range queries on it, because again, you don't have any of the keys or any other information in your bitmap; you just know that a bunch of bits are set. You don't know what the original keys were. But just that simple primitive of "does this thing exist, yes or no" allows us to do more sophisticated things.

Okay, so with that said, let's talk about larger-than-memory databases. It should be sort of obvious why we want to do this: DRAM is expensive, really expensive, compared to the alternatives, like a spinning hard drive or even a NAND flash SSD. For DRAM, the price in 2017, 2018, is roughly $15 per gigabyte. NAND flash would be roughly 30 cents to a dollar per gigabyte; the enterprise drives are in the dollar range, the cheap consumer ones are less. But a spinning hard drive, the old-school rotating disk drives, those are like three cents per gigabyte. So it would be nice if we could put some of our database out on disk, because storing everything in memory is super expensive, because buying DRAM is expensive. And not only is buying DRAM expensive, maintaining it in our system is also expensive, because DRAM can't persist data if you lose power. The motherboard basically sends a little charge to every DRAM DIMM every so many milliseconds to refresh the charge so that it retains its values. There was a study done, maybe 10 or 15 years ago, that calculated that about 40% of the energy cost of a single box is spent just on refreshing DRAM. So again, this is going to be really expensive. You may say, all right, well, can't I go distributed? Wouldn't that solve my problem if I'm hitting the limit of how much memory I can have in a single box? But being distributed causes a bunch of other problems, because you need to be consistent across those different machines, and you're still storing everything in DRAM, so these cost issues still hold. So that's our goal for today: to figure out how we can incorporate cheaper storage into our database while still having all the benefits of being an in-memory database system and the performance advantages we get over a disk-oriented system. So today's agenda: we'll start off talking about the background of what kind of workloads we want to target with this technique. Then we'll talk about the implementation issues you need to be mindful of in order to bring the disk back into an in-memory database. And then we'll go through a bunch of examples of how systems actually implement this, including the LeanStore paper that you guys read. And actually, this shouldn't be here; there's no evaluation section, I pulled it out because we're going to run out of time. I'm going to try to pack fewer slides into my lectures so we're not rushing through them, okay?
All right, so again, the basic goal we're trying to achieve today is to allow our in-memory database to store and access data that's not in memory but is actually on some auxiliary storage, whether it's NAND flash or a spinning disk, it doesn't matter. And we want to be able to do this without bringing back all the stuff we said we wanted to get rid of at the beginning of the semester from a disk-oriented database system. We don't want to bring back the page table. We don't want to bring back a lock table. All that stuff, we showed, was super slow for an in-memory database, so we don't want to bring it back in. That's the goal. Now, to do this, we need to be aware of the access method differences between DRAM and disk. We already covered this at the beginning of the semester, but we should start thinking about it now in the context of trying to solve this problem. The main thing is that in in-memory storage, everything we've focused on so far has been tuple-oriented. All the different components we've talked about building, whether it's the indexes, the concurrency control scheme, the storage manager, all of that has been based on having byte-addressable pointers to tuples. My index points to a tuple, my storage manager points to tuples. We're not going to be able to do that with disk storage, because these are block-oriented devices. If I want a single tuple, I can't just go get that one tuple by itself. I've got to fetch the page, which is, say, four kilobytes, that the tuple resides in, bring it into memory, and then do something with it to pick out the thing that I want. So we're not going to worry about how to incorporate disk into our join algorithms and other aspects of query execution. We're really focusing on the storage layer: how do we bring disk in without having to change all the rest of the system to be mindful of, hey, we have these pages and they might be dirty and we have to keep CLRs in our log records; all of that we want to avoid. The other thing I'll say is that we're going to focus today only on OLTP workloads. The reason is that, to the best of my knowledge, in OLAP workloads there's not that much you can do to make the system run faster when you have to read and write data from disk. If I need to read an entire column, really the only optimization I can do is the zone map stuff we talked about last time, which lets me do data skipping to avoid reading a bunch of other stuff. But at the end of the day, these OLAP queries are going to be doing long sequential scans on tables, and there's no magic we can build into our system to make that go fast. We're always going to be I/O bound, trying to fetch the data we need and bring it into memory. So all the standard tricks you would use in a regular disk-oriented database designed for OLAP workloads still apply here, but there's nothing magical we can do because we're in memory. So instead, we're going to focus on OLTP. And we already talked a little bit about what we can exploit in OLTP workloads when we talked about storage models. Remember I said that in these HTAP systems there's this notion of hot data and cold data. Your hot data is also called your working set.
It's the subset of the entire database that most of your transactions are reading or writing. Just think of your favorite website, like Reddit or whatever. No one's going back and commenting on Reddit posts from six months ago; you're posting comments on what's hot today. That's the hot data we want to keep in memory, and maybe the stuff from six months ago we can shove out to disk. So what we need is a way to move the cold data, once we identify it (and we'll talk about how to do that in a second), out to disk, and then have it appear to the rest of the system as if the data is still in memory. Your join algorithms, all the query execution stuff above the storage layer, they all think this data is in memory. They can still access it and they still know about it. But at the point where they try to touch data that's out on disk, we have a way to go pull that data from disk and bring it back into memory. Another way to think about this is that we want to push cold data out of memory onto disk, rather than, as in a disk-oriented database, pulling hot data from disk into memory. It's sort of a semantic difference, but we'll see how the architecture changes. So let's look at a really simple example. Say we have our database with a single table, and these are the tuples that are in memory, and then out on disk we have our cold data storage, some block-oriented device. Let's go through this example and talk about the questions we have to answer in order to actually make this work in our system. So let's say we have some way to identify that these three tuples are cold. I didn't say how we're going to do that, we'll get to it later, but say these three guys are cold. So what we're going to do is combine them together into a single page and write that out to our block-based cold data storage. This would be about a four-kilobyte page, or some multiple of four kilobytes; the page that gets written out is called a victim block. So the next question we have to deal with is: what do we do with the old slots in our in-memory table heap where these guys used to be? Because the issue is that the indexes are still pointing to them; an index entry is just a block ID and offset for some memory location here, and it's still going to point to this slot. So we have to do something about that. Then the next thing is, say a query comes along and tries to access this thing. So the next question is, what do we do with this query? Assume it can somehow figure out that it wants this tuple, whether through an index or a sequential scan, it doesn't matter. But we can't actually run this query, because the tuple it needs is out on disk. So what do we do with it? Do we abort it, fetch what it needs in the background, and then try to restart it? Do we stall it, like you would in a disk-oriented database, and keep it idle while we try to do other stuff and go fetch the data it needs? It's not obvious what we should do. Then, once we go fetch that data: again, the cold data storage is a block-oriented device, so I can't just go get tuple one. I've got to get the entire page, or the entire block, that tuple one is located in. But now what do I do with that block?
Do I just merge the one tuple back into memory and leave the rest, or do I have to merge everything else too? And do I actually want to put it back into the full table heap and update all my indexes, or should I just put it into a temporary buffer, run the query, and then throw it away? These are the questions we're going to answer today about how we actually want to architect this whole thing.

So, not to go into another diatribe, but I said at the beginning that you should never use mmap, and it's almost a fanatical belief of mine, although we're actually starting work this semester to try to prove scientifically that my, not lifelong ambition, but, what's the word I'm looking for, irrational obsession is right, that mmap is a bad idea. It's a work in progress. But just to understand why mmap is still a bad idea here: the main issue is that since we want to run transactions, and transactions could start updating data, we don't want to use mmap because we wouldn't have control over when pages get written to disk relative to when the log records corresponding to the changes that modified those pages get written out to disk. For that reason, and again we covered this last semester and we'll cover it again next week when we talk about logging, we don't want the OS doing any of this. Now, there'll be one example I'll show from researchers at EPFL where they actually do use mmap, but they use it in a clever way where they sort of quarantine the region that mmap is allowed to flush out and prevent the OS from touching the other region of data, so you don't have this problem. There are other issues too: there's no easy mechanism for asynchronous read-ahead or for writing out multiple pages concurrently. There are flags with madvise you can try to use, or you can implement prefetching separately in other threads. I don't want to get into the details of why this doesn't always work, but suffice to say it doesn't always do what you think it's going to do, and it's very hard to enforce these things; madvise doesn't always do what you want it to do. So again, never use mmap for your database, although we'll look at some examples where it might be okay.

All right, so what are the design issues we want to talk about? There are basically three categories of things we need to implement or think about when we want to add the disk back to our system. The first is runtime operations: what should the database be doing as it executes transactions and queries to keep track of what data is hot versus cold? The second is eviction policies: when it comes time to say we're running out of space, let's evict some data and put it out on disk; when should we actually do this, and what metadata should we keep around in memory to record that we've put some data out on disk, so that we don't lose anything and we don't have any false negatives? The third is retrieval policies: when a query tries to touch data we know has been evicted, we've got to bring it back in. How much data should we bring in? What should we do with the query that requested the cold data? And where should we put the data: should we merge it into the full table heap, or should we put it into a temporary buffer? This outline comes from a paper that we wrote in 2016 in the context of H-Store, which is an in-memory database that I helped build, the academic predecessor of VoltDB.
So in this paper we lay out a bunch of these different policies, but we were also looking at more modern storage devices, or different classes of storage devices, to see how these choices change depending on how fast your disk is. And I don't remember there being any major difference in the findings between, say, SSDs versus spinning disk hard drives; whichever policy turned out to be the right way to go, I forget exactly what it was. For NVM, for non-volatile memory, it's slightly different: you want to treat it basically as DRAM and go as fast as possible. All right, let's go through each of these one by one. So again, the first question is: how are we going to identify what data is cold? One thing I'll say up front: the paper you guys read on LeanStore does page-level identification, whereas I'm going to show you a bunch of techniques that do tuple-level identification. I consider the LeanStore paper, which came out last year, the state of the art, whereas the paper I shared with you that we wrote was from 2016. And LeanStore is the only one I'm aware of that does this on a page-by-page basis; everybody else does it on a tuple-by-tuple basis. So to understand why I think LeanStore's approach is a good idea, let's see how we do it with tuples first, see what's bad about it, and then we'll see how to do it better with LeanStore. All right, so we need to identify which tuples are cold, and the two approaches for this are online and offline. Online means the system, as it runs queries and transactions, tracks every single tuple you access, or even just samples the tuples you access, and records and maintains some metadata in the tuple header itself to say when the last time this thing was accessed was. So one approach is to approximate LRU, where an LRU chain keeps pointers in a linked list. You could also do the clock approach, which I think in the LeanStore paper they call second chance. Same idea: you set a bit whenever you touch the tuple, you clear it whenever you do a sweep, and if you come around again and that bit is still zero, then you know it's safe to evict. These are basically standard buffer pool policies, but the same ideas can be implemented here, except we maintain the metadata about how tuples are accessed inside the tuples themselves. The other approach is to do this offline, where the system, as it runs queries, rather than updating metadata about the access patterns of transactions as you're running them, just writes out a log record that says this transaction touched or accessed this tuple. And then you have a separate background thread that comes along, consumes that log, and computes some kind of access frequencies to figure out which tuples are the coldest and which are the hottest. The idea here is that you don't want the overhead of maintaining the access-pattern metadata in the critical path of executing queries and transactions. All right, so now we want to evict. So now we recognize that we're running out of space, or actually, the first question is: how are we going to recognize that we're running out of space?
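Before we get to that, here's a rough sketch of what the online clock / second-chance style tracking just described could look like, with the reference bit kept in a per-tuple header. The struct, the function names, and the layout are made up for illustration; none of the systems discussed here use exactly this code:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Per-tuple header with a single "referenced" bit, set on every access.
struct TupleHeader {
  std::atomic<bool> referenced{false};
  // ... timestamps, version pointers, etc. would live here too
};

// Called on the critical path whenever a query reads or writes the tuple.
inline void MarkAccessed(TupleHeader &h) {
  h.referenced.store(true, std::memory_order_relaxed);
}

// Background eviction sweep: a tuple whose bit is already clear is a victim;
// otherwise we clear the bit and give it a "second chance" until the next pass.
std::vector<size_t> PickVictims(std::vector<TupleHeader> &table) {
  std::vector<size_t> victims;
  for (size_t i = 0; i < table.size(); i++) {
    if (!table[i].referenced.exchange(false, std::memory_order_relaxed)) {
      victims.push_back(i);  // not touched since the last sweep -> cold
    }
  }
  return victims;
}

int main() {
  std::vector<TupleHeader> table(4);
  MarkAccessed(table[1]);              // tuple 1 was touched this round
  auto victims = PickVictims(table);   // -> {0, 2, 3}
  return victims.size() == 3 ? 0 : 1;
}
```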
And the two basic approaches are these. The first is a simple threshold set by the DBA or administrator: the system tracks how much memory your tables are using, and when you go above some threshold, like getting near 80% of the maximum amount of memory you have, that triggers the eviction process. This is something the database system implements itself. The other approach, if we're using mmap, and again we'll show an example of this, is to just punt this decision to the OS, because it's already tracking how pages are accessed in virtual memory. So we let it figure out when it's time to evict things, because it knows the memory pressure of the overall system. Again, this gets done in the background, but there are other issues, because page faults block your threads; we'll get to that later. All right, so once we decide which tuples to evict and we've evicted them out to disk, the next question is: what metadata do we need to keep around in memory to record that a tuple used to exist here and, if you need it, here's where to actually go find it on disk? The first approach is to use tombstone tuples. This is similar to when we were doing deletes in MVCC: at the end of the version chain we could have a special tombstone tuple to say here's the end of the version chain, or this tuple doesn't exist anymore. So what happens is that every time you evict a tuple, you insert a new tombstone tuple that records the block ID and offset of where the tuple now exists out on disk, and then you go update all your indexes to point to this tombstone tuple. The tombstone tuple obviously should be smaller than the original tuple. It would be awful if, say, your table had a single 16-bit integer column and then you evict a tuple and your tombstone is now two 32-bit integers, because you'd get no benefit; but ignore that case. So all of our indexes now point to this smaller tombstone tuple, which we store separately from the main table heap, and that way if I'm doing a lookup, I follow the pointer in the index, and if I land on one of these tombstone tuples, I recognize, oh, this is not a real tuple, I need to go out to disk and get the data that I'm missing. The next approach is to maintain a Bloom filter in memory for every single index on your table, and just record that for a given key, there is a tuple that exists out on disk. The Bloom filter is not going to tell you where it is on disk; there's an on-disk index you have to go consult to figure that out. But at least the Bloom filter helps you avoid doing a lookup on that on-disk index unnecessarily, because if the key doesn't exist in the Bloom filter, it won't exist on disk. The last approach, again if you're using mmap and everything resides in virtual memory: the OS already tracks which pages have been swapped out to disk, and it knows that any time you access one of those swapped-out pages, you get a page fault and it goes out to disk and fetches it for you. So from the database system's point of view, we don't have to do anything extra; there's no extra metadata we need to maintain when something gets evicted, because the OS does that for us.
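As a rough sketch of the tombstone idea, here's one way an index entry could dispatch between a live in-memory tuple and an evicted one. The layout, names, and numbers are assumptions for illustration, not how H-Store or Hekaton actually encode this:

```cpp
#include <cstdint>
#include <optional>

// Where an evicted tuple now lives in the cold store: which on-disk block,
// and the offset of the tuple within that block.
struct Tombstone {
  uint32_t block_id;
  uint32_t offset;
};

// An index entry either points at a live in-memory tuple or at a tombstone.
struct IndexEntry {
  bool is_tombstone;
  union {
    void *tuple;          // in-memory tuple address
    Tombstone tombstone;  // evicted: location on disk
  };
};

// Hypothetical lookup path: hand back the in-memory tuple if present,
// otherwise tell the caller which cold-store block it must fetch.
std::optional<Tombstone> ResolveOrFetch(const IndexEntry &e, void **out_tuple) {
  if (!e.is_tombstone) {
    *out_tuple = e.tuple;  // hot path: tuple is already in memory
    return std::nullopt;
  }
  return e.tombstone;      // cold path: caller must go to disk
}

int main() {
  int tuple_in_memory = 7;
  IndexEntry live;
  live.is_tombstone = false;
  live.tuple = &tuple_in_memory;

  IndexEntry evicted;
  evicted.is_tombstone = true;
  evicted.tombstone = Tombstone{12, 340};  // block 12, offset 340 (made up)

  void *p = nullptr;
  bool hot_ok = !ResolveOrFetch(live, &p) && p == &tuple_in_memory;
  bool cold_ok = ResolveOrFetch(evicted, &p).has_value();
  return (hot_ok && cold_ok) ? 0 : 1;
}
```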
And again, this is another good example of why mmap and the OS are sort of a seductress for people who want to implement databases, because it's like, oh, it's already doing this, it can do it for me, I don't have to implement it. But it's going to bite you in the ass later on, because it doesn't always work the way we want in databases. All right, so let's look at these two examples, the tombstone tuples and the Bloom filters. So say, with whatever method we're using to track the access frequency of our individual tuples, we identify that these three tuples are the coldest ones, so we write them out to disk. Now our indexes still point to these empty tuple slots, and that would be bad if we did a lookup, landed there, and found garbage or some other tuple because we reclaimed the space and used it for somebody else. So instead, what we'll do is create these tombstone tuples, and now the indexes point to them, and the block ID and offset in each tombstone just tells you where on disk the tuple actually is. I'm showing these as if they're stored contiguously in memory with the regular tuples, but in actuality they would be stored in a separate table heap. So if I'm doing an index scan, that's fine, because I just follow my pointers and I land there, no big deal. If I'm doing a sequential scan, I need to be careful to make sure I scan this part here, the main table heap, but then I also need to scan, if necessary, the tombstone tuples. And of course there are no values in these tombstones; everything's out on disk. So if you're doing a sequential scan here, you're basically reading everything, because you don't know what exists out there. Now, one thing you could also do, we never actually implemented this in our system, is that instead of always having to go fetch the data every time you follow the index: if you have a covering index query, meaning all the data I need to compute my query can be found in my indexes, then it doesn't matter that the tuple has been evicted or that you have these tombstones; you can answer your query from the index alone. We never actually implemented that. Another question would be: if I have, say, three columns and an index on each column, and my query does a lookup on one of those columns, can I figure out which tuple I'm pointing to and then do a reverse lookup on the other two indexes to reconstruct the rest of the tuple? We never implemented that either, but it would have been interesting. The other approach, again, is the Bloom filter, and for every single index on our table we have a separate Bloom filter. The bits in the Bloom filter are set for each key that we've evicted. And then out on the cold data storage we also have to have another index that tells you the block ID and offset of the tuple you're looking for, because the Bloom filter only tells you whether it exists or not; you've got to do a lookup on the on-disk index to figure out which page has the thing you're looking for. Otherwise you'd have to do a sequential scan over every single page, and that would be terrible. So if I want to do a lookup now, does my key X exist: I always go to the in-memory index first; if I find it there, then it's pointing to something that's in memory and I'm done.
If the in-memory index doesn't have it, then I go check the Bloom filter, and the Bloom filter tells me whether I have it or not. If I don't, then I'm done, and I know the key doesn't exist. If the Bloom filter says I do have it, then I go to my on-disk index, fetch that into memory, and figure out where the tuple I actually need is located. All right, is that clear? Yes? So the question is just to confirm that you cannot delete keys from the Bloom filter. Correct, yes: from the Bloom filter I showed you, you cannot. There are extensions, like counting Bloom filters, where instead of each bit being one or zero, you keep track of the number of keys that set it, and that way you can support deletes. But even then you could still have false positives. Most people don't bother with that; if you want to delete something, you just rebuild the whole thing. Another question: so you cannot use the Bloom filter as an indicator of whether a key is present in memory? So his question is: does that mean I can't put a Bloom filter in front of an index just to tell me whether something exists before I actually go through the index? You could do that. Let's think about this: that Bloom filter doesn't need to be super accurate. As long as you don't have false negatives, that's fine. You could have false positives, meaning I delete a key from my index but I haven't deleted it from my Bloom filter; I do the lookup on the Bloom filter first, it tells me it's there, then I do my lookup on the index and it's not there. Ignoring the cold storage part, if this is all in memory, that would work. But for the evicted-data case here, we need to make sure the key doesn't exist in memory and doesn't exist in the Bloom filter, so that we know it doesn't exist on disk and we don't go to disk unnecessarily. I'm sort of complicating the example, but: no false negatives, sometimes false positives. And I forget the exact math for how big a bitmap and how many hash functions you need, but the target false positive rate is something like 1%. So it's not like every other lookup is giving back bogus data; 1% is not bad. That's a fair trade-off. Okay. So again, now we know what metadata we need to record in memory. Let's figure out what happens when we need to go fetch the data and bring it back into memory. What do we do? Since we're storing everything as four-kilobyte pages on disk, it's very likely that the size of one tuple is less than a page, and therefore within one page there will be multiple tuples. That means that for a single disk I/O to fetch the page, we're going to get back a bunch of extra stuff that we don't actually need. So the question is, what do we do with the other tuples? We could take all the tuples on the block, bring them back into memory, and merge them back into the table, even if the others aren't needed. But of course that means we now have to update our indexes for every single tuple, because each one now exists in memory and is no longer on disk. If we have a lot of tuples in our block and a lot of indexes, that could be expensive, when we really only needed to do it for one tuple.
The other crappy thing about this is that the one tuple I needed becomes hot, but all the other tuples on my page are still cold, and therefore the next time I do an eviction pass, they're probably going to get written back out again. So you get this ping-ponging back and forth of reading them in, writing them out, reading them in, writing them out. That's write amplification: it increases the number of I/Os I'm doing on my device, and if it's an SSD, it wears it down more quickly. The other approach is to merge only the tuple that we needed. This is computationally cheaper because I don't have to update all my indexes. The downside is that now I have a page out on disk that still contains a copy of the tuple I just merged back in, but it's not considered the correct copy. Because I could fetch the tuple I needed back into memory, and then an hour later it becomes cold and I evict it back out. So now on disk I have two pages, two evicted blocks, that contain my tuple, and I have to do extra bookkeeping to recognize that the most recent one I evicted is the correct one. So if I ever have to reconstruct the database, or I'm reading those old evicted blocks, I need to make sure I skip the old copy and only take the newer one. And this also wastes space, because I essentially have holes in my pages. Okay, so that's whether we want to merge all the tuples or just one tuple. Now the question is what kind of merge we want to do. The easiest approach is to take all the tuples we want to bring back into memory and just merge them into the full table heap. Again, that means updating all the indexes so they point to the correct locations. An alternative is that we only merge tuples that are part of an update query. If I'm just selecting a tuple, I'll read it and then throw it away right away. Insert doesn't make sense here, but delete is the same story, who cares? So for select queries, instead of merging them into the full table and updating indexes, I put the tuples into a temporary buffer in memory for just that one query. The query that needed this tuple is the only one that can see this buffer and read from it. It does whatever read it needs to do, and then we throw the buffer away, as if we never fetched the tuple at all. The third alternative is a combination of the two: we decide between always merging and merging only on update, but we also keep track of how often each block is being fetched. And when we reach some threshold, we can say, all right, we think this block is hot; we may or may not know which individual tuple in the block is hot, but we keep fetching it, so rather than putting it into a temporary buffer and throwing it away, let's just merge it in right now. So it's like: I don't merge it the first time I see it, I merge it at some later point. The last design decision is: what do we do with the query that accessed the tuple that was not in memory? The first approach is that we recognize the thing we need is not in memory and immediately abort the query or abort the transaction. If it's a stored procedure, we can queue it up on the side and wait for its data to arrive. Otherwise, we send back an exception to the client over JDBC or ODBC and tell it that it has to restart its transaction, we can't run it.
Then in the background, in a separate thread, we go get the data it wanted, fetch it from disk, merge it in based on our merging policy, and then we know it's okay to restart that query, if it's a stored procedure, or somehow we could notify the client, but nobody actually does that. So the idea here is that rather than stalling the thread when it tries to access data that's not in memory, which is what a traditional disk-oriented database would do, we just punt: we say we can't do this, go away, come back later, and we'll have what you want. The problem is, this doesn't work if you have queries that want to access an entire table that doesn't fit in memory. Say we have one table and a single select query that wants to scan the entire table, but only half of it fits in memory. Say the first half is in memory and the second half is on disk. I start scanning, I get through the first half, and then I reach the first tuple that's not in memory, so I get aborted and put aside. Then the system goes and evicts the first half of the table and brings in the second half. I restart my query, it starts off on the in-memory second half, keeps scanning, then tries to get to the first half, and we're back where we started: I get aborted again. So this approach won't work. The way to get around this is to relax consistency: you just let the scan read whatever version existed. Actually, if I'm multi-versioned, this wouldn't be an issue: as long as you record that this guy is waiting and you know what his timestamp was, then when you restart him you let him keep the old timestamp so he can always read the old versions. But if you're doing update-in-place, then this won't work. The second approach is that when a query comes along and tries to access data that doesn't exist in memory, whether you recognize that through the Bloom filter or the tombstone tuple, you just stall it, go fetch the data it needs, and then when it's ready, you resume it. This can be tricky: if I need a bunch of cold tuples in my query, I may not want to stall immediately as soon as I hit the first cold tuple. Maybe I want to keep going as far as I can, collect all the cold data that I need, and then, when the query actually tries to do something with that data, like send it to the client or update it, I stall it, fetch the data it needs, and bring it all into memory. Because otherwise you'll scan, stall, scan, stall, and it'll be f***ing horrible performance.

All right, so let's talk about some implementations now. We're going to talk about six different systems. The first four, H-Store, Hekaton, EPFL's version of VoltDB, and Apache Geode, are all tuple-based, so they can all be described using the taxonomy we just talked about. Now that we have a common language for how these systems are implemented, we can talk about how each of them does it, and again, they're going to be focused on tuples. The other two systems, LeanStore and MemSQL, are not tuple-based: in the case of LeanStore it's block-based, or page-based, and for MemSQL it's table-based. MemSQL isn't really doing much itself; they're just punting to mmap, and we'll talk about that in a second. Whereas in LeanStore it's all implemented in software inside the database system itself, okay? Let's go through each of these.
So again, H-Store was a system I helped work on when I was in graduate school, and it got commercialized as VoltDB back around 2008. We wrote a paper called anti-caching. The idea, again, was that this is the reverse of the regular buffer pool cache in a disk-oriented database: all data starts in memory, and then we find the cold tuples and shove them out to disk. So that's why it's called an anti-cache, a sort of clever name that Stonebraker came up with. I always like to say that when we submitted the paper the first time with anti-caching in the title, the reviewers hated the title, so we took it out; then the paper got accepted, and Stonebraker said, yeah, put that back in. So we did. I think it's a good name. All right, so H-Store does online identification: we would sample and track how transactions access every single tuple. We had the administrator define a threshold, say, if you reach 80% of your maximum memory capacity, that kicks in the eviction process, and you keep evicting until you hit some low watermark. So I might say, if I'm at 80% capacity, I start evicting until I'm down to 60% capacity. We used tombstone tuples, stored in a separate table heap. We did abort-and-restart retrieval, and that's because we were using stored procedures: we would abort the stored procedure, roll back all its changes, put it into a side queue, fetch the data it needs, and then have a callback mechanism say, the data I want is now there, let me go restart this transaction. And we used block-level granularity for the data we retrieved, and in the original implementation we would always merge all the tuples from a fetched block back in. The subsequent paper I mentioned from 2016 is the one where we explore different options for more selective merging. The other most famous system doing larger-than-memory databases was this thing called Project Siberia, at Microsoft, for the Hekaton project. To the best of my knowledge, and I think the LeanStore paper mentions this, this never actually made it into production, into the real commercial version of Hekaton, probably due to its complexity. They were doing offline identification: they would maintain a log that recorded how transactions were accessing tuples, and then a background process would come along and compute histograms based on that log. Same thing, they had an administrator-defined threshold. They were using Bloom filters to approximate whether a tuple exists out on disk. They were doing synchronous retrieval: they would block the thread, or block the query, go fetch the data it needed, and bring it back in. And they did this because they weren't a stored-procedure-based system the way H-Store and VoltDB are. They retrieved data at tuple-level granularity, and because of that, they would always merge, since they were always grabbing a single tuple and merging it in. And like I said, there's a bunch of papers from these guys that discuss these issues, but it never made it into production. All right, so let's talk about a system that actually uses mmap in an interesting way. This is a research extension to VoltDB by Natassa Ailamaki's group at EPFL in Switzerland.
Natassa Ailamaki was the database professor here at CMU before she left for Switzerland, before I showed up; she was basically me here, or I'm her now, right? So again, this never actually got put into production in VoltDB. They just took VoltDB because it's open source and modified it to implement this technique. They do offline identification, using a log to figure out what's cold and what's hot, but then they use the OS virtual memory in a very clever way that I'll show on the next slide. And because it's OS virtual memory, retrieval is always synchronous: when you try to access something that's not in memory, the OS hits a page fault, stalls your thread, and goes and fetches it while you wait. And virtual memory is always based on pages, and they always merge everything, because the OS doesn't know what's actually in your pages; it doesn't know there are tuples in there, it just knows it's a page that it needs to bring in. So let's look at how this works. What they do is have a single in-memory table heap backed by mmap, but they designate different regions of memory as either hot or cold. For the hot tuples, they use mlock to tell the OS: pin this region, never evict these pages, they always have to stay in memory. Whereas for the cold tuple region, the OS is allowed to swap pages out as needed. So they use offline identification to figure out that a tuple in the hot region is cold, but then rather than putting it into a page themselves and writing it out to disk, they just move it down into the cold region. You can think of this as a delete followed by an insert. Then I can reuse that space in the hot region for whatever new tuple I want to put there. Now, down here, again, this region is not pinned; we didn't tell the OS to keep it in memory. So at some later point the OS may decide, oh, I have memory pressure, let me take some page here and write it out to disk. One thing you need to be very careful about here is that, in the same way we have to worry about data alignment and word boundaries when we lay things out in memory, we need to be very careful to lay out our tuples according to the page boundaries of the OS's virtual memory as well. Yes? The question is whether the size of the hot region is fixed. Yes, this region is fixed. Typically for in-memory databases, the way it works is that when you turn the system on, you tell it, you have 10 gigs, and that's fixed. As for how it divides that up, I don't know whether they did something sophisticated to figure out, oh, our working set is really this big; it was probably just fixed. All right, so the thing we've got to be careful about is that, just like we have to worry about word alignment for data in memory, we have to worry about page alignment for the data down here. Because what we don't want is for a tuple to span two pages and then have one of those pages get evicted and written out; the hardware is not going to be able to guarantee that that write is atomic, you can only write one page at a time. And we also don't want the case where, when we do a fetch, we have to do two I/Os to get that one tuple; we want to do one I/O.
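Here's a minimal sketch of what that pin-the-hot-region idea looks like at the syscall level on Linux. The region sizes are made up for illustration, and note that mlock on a large region may fail unless RLIMIT_MEMLOCK is raised; this is not the EPFL code, just the basic mmap/mlock pattern:

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
  const size_t kHotSize = 1UL << 30;    // e.g. 1 GB pinned hot region
  const size_t kColdSize = 4UL << 30;   // e.g. 4 GB swappable cold region

  // One anonymous mapping per region; the database lays tuples out inside
  // these regions itself, aligned to the OS page boundaries.
  void *hot = mmap(nullptr, kHotSize, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  void *cold = mmap(nullptr, kColdSize, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (hot == MAP_FAILED || cold == MAP_FAILED) { perror("mmap"); return 1; }

  // Pin the hot region so the OS never swaps those pages out. The cold
  // region stays unpinned, so the OS is free to page it out under pressure.
  if (mlock(hot, kHotSize) != 0) { perror("mlock"); return 1; }

  // Hint that accesses to the cold region will be random (disables readahead).
  madvise(cold, kColdSize, MADV_RANDOM);

  // ... move cold tuples from the hot region into the cold region here ...

  munlock(hot, kHotSize);
  munmap(hot, kHotSize);
  munmap(cold, kColdSize);
  return 0;
}
```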
So, again, I don't know what happens in their design if you actually update a tuple in the cold region. Do they put it back into the hot space? I think they would have to, because you don't want to modify anything down here: the OS is free to swap it out at any time, and there's a bunch of extra stuff you'd have to do, being super careful with fences and whatnot, to make sure it doesn't write data out to disk before you write out the log record that corresponds to the change. So that essentially makes this region read-only. What I will say is that for any workload doing updates, I think mmap is a bad idea; again, we want to prove that this semester. For read-only data, I think it might be okay. There's a bunch of extra stuff you may want to do, like prefetching, and I don't want to say too much; there are a bunch of things you have to do to make mmap work, but you have to do those anyway if you're managing memory yourself. But if it's read-only, I think it's okay. Okay, the last tuple-based system we want to talk about is called Apache Geode. Apache Geode was originally called, I'm going to get this wrong, I think it was either GemFire or SQLFire, I forget. But basically there was a startup that built an in-memory database, then they got bought by VMware, and then VMware merged their database division with EMC's database division, because EMC had Greenplum, and that became Pivotal. And then Pivotal basically said, we don't want to be in the database business anymore, and they offloaded some of their systems; they still do Greenplum, but Geode, which I think was GemFire, they sent to Apache and renamed. I've actually never met anybody who uses this. Is that true? Yes, correct. I'm sure people do, I just haven't come across them. Oh, that's not true, somebody in Russia, sorry. Okay, so what they do is online identification with an administrator-defined threshold. I think they use tombstones; it's not clear from the documentation. But basically, they write things out to HDFS. Geode is a distributed in-memory database, and you have this sort of shared disk layer with HDFS, and that's what they use to write data out to. And then they bring single tuples back in; I don't get how they actually do that with HDFS, I don't know, but they only merge the tuple on update. And the reason is that since HDFS is append-only, there's actual work you'd have to do to invalidate some page and say this is no longer the primary location of this tuple, it's now in memory. So if you only do it on update, that reduces the amount of work you have to do. Okay, so as I said, these four approaches, H-Store, the EPFL version of VoltDB, Project Siberia from Hekaton, and Apache Geode, are all tuple-based. The two key issues, or limitations, of these approaches are these. First, we have to track metadata about how transactions or queries are accessing tuples on a per-tuple basis, and that can be quite expensive. In the case of H-Store and anti-caching, we had to maintain a pointer inside the tuple header to keep track of where the tuple was in the LRU chain, to figure out what was coldest and what was hottest. That adds an extra 64 bits for every single tuple. Now, it wasn't actually that bad in VoltDB and H-Store because they're not multi-versioned, so it's not like we also had all these additional timestamps.
So the 64-bit pointer wasn't a huge overhead, but it still adds up. The other thing to point out is that nothing I've said this entire lecture has been about indexes; it's all been about tuples. But last class, when we talked about compression at the very end, I showed you that table that said for some workloads, the indexes comprise up to almost 60% of the total amount of memory used by an in-memory database. So we're targeting these tuples to write out to disk, but that's not the only place we should be looking: we should also be trying to figure out how to evict the indexes. Hekaton sort of could do a little bit of this, because they would remove the key from the index and just put it in the Bloom filter, and the Bloom filter is much smaller than the index. But none of the other guys were handling this. So what we want now is a unified approach to managing what memory we shove out to disk that handles the two main contributing factors to the size of an in-memory database: both tables and indexes. Meaning we don't want separate policies for indexes and separate policies for tables; we want everything handled the same way. So this is the problem that LeanStore is trying to solve. LeanStore is an early prototype of a storage manager out of Munich from the same guys who wrote HyPer. They wrote LeanStore, but to the best of my knowledge, LeanStore is not part of the HyPer project, or of the new successor-to-HyPer system they're building now, called Umbra; it's a separate thing. Again, the key thing about LeanStore is that it can evict cold data both in tables and in indexes. That's much different from everything else. The core idea is that they're essentially going to build a buffer pool manager in the same way we'd have one in a disk-oriented system, but it's going to be entirely decentralized, and it's going to have a hierarchical model of keeping track of what data is in memory and what can be written to disk. And rather than tracking every individual tuple or every individual page and recording when it was accessed, they just randomly select pages to evict. Then, once a page has been selected for eviction, they track it to see whether it gets accessed again before they actually evict it. If yes, they put it back into the hot space; if no, they evict it, because no one touched it. This is much different from everything else we've talked about so far. Even in the online and offline approaches, you were still tracking every single tuple that transactions were accessing and using that to figure out what was hot and cold. Here, if something's hot, you just assume it's hot and don't track anything, because that's the data everyone's going to touch, and you don't want to be on the critical path of query execution, slowing down the system while you update metadata. It's only when you say, I think I want to evict this, that you turn on the tracking and figure out whether it actually gets touched. The core idea they use to make this all work is a technique called pointer swizzling, which I don't think we discussed in the intro class, so I'll describe what it is. They're going to use the pointers you already have to store anyway, the ones that keep track of which page a thing points to, to also figure out whether something is in memory or not.
Rather than having the separate page directory, or page table, that you have in a disk-oriented system. All right, so the idea of pointer swizzling is basically that an object has a pointer to some other object in our database; it doesn't necessarily have to be a database object, it could be an internal data structure, it doesn't matter. And you have one type of pointer for when something is on disk and another type of pointer for when something is in memory, with a little identifier to tell them apart. In this case, they use a single bit of the 64-bit pointer to indicate whether this is an in-memory address or a block ID and offset because it's on disk. So actually, just a general question: we have 64-bit pointers on x86 Xeons; how many bits are we actually using in those 64 bits for addresses? Raise your hand if you think all 64 bits. Well, I kind of spoiled it, right? It's not. Yes, 48 bits. So on current Xeon hardware, everything is addressed with 64-bit pointers, but the hardware itself only uses 48 of those bits, so you can only address, what is that, 256 terabytes of memory. So people do this trick all the time where they say, hey, I've got 64 bits but I don't need 16 of them, so I'm going to put a bunch of extra stuff in the part the hardware is never actually going to read, and I can use that in my system to figure out what it is I'm actually looking at. Now, when we talked about Judy arrays, they had those Judy pointers, where they took two 64-bit words and had the full pointer on one side and extra metadata on the other; in that case, I don't think they had enough space to store all the extra stuff they wanted in the bits left over after the 48. In this case here, we only need one bit, so that's fine. Now, I will say that I was at a workshop with the SAP HANA guys a few years ago, and they had a speaker come from Intel, and they were talking about a HANA installation that was a beast of a machine; it maxed out the full 48 bits of addressable memory on a single database instance, some $2 million machine. And the Intel guy basically said it's coming down the road that pointers will be true 64-bit pointers, so don't store extra crap in there. So we avoided doing this in our own system. Now, he told us this was coming two years ago and it still hasn't happened, so I don't know when it's happening. But that's what LeanStore is doing, and that's why they can use this extra bit and not worry about pointing to garbage, because no one's actually going to use it. All right, so we use this one bit to tell us whether the thing is on disk or in memory. If the bit is zero, then we know that what we're looking at is a page ID and offset. But say we bring this data in: now our pointer points to it with that bit set to one, and we know to interpret it as a real memory address. So again, the idea is that this thing has one pointer, but depending on where we move the block it points to, we update that pointer with the correct information. And this only works if there is one and only one pointer to any object at a time. That's what allows us to do our updates latch-free, because if some other block were pointing to this thing too, then I'd need to set latches somewhere to say, all right, now the pointer is this, because otherwise you have race conditions.
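Here's a small sketch of that kind of tagged, swizzlable pointer in C++. The class name and the choice of the top bit as the tag are illustrative assumptions, not LeanStore's actual encoding; the point is just that one 64-bit word can hold either a real address or an on-disk page ID, and flipping it is a single store:

```cpp
#include <cassert>
#include <cstdint>

// A "swizzled" pointer: one 64-bit word that is either a real virtual address
// (hot, in memory) or an on-disk page identifier (cold). We steal the top bit,
// which current x86-64 hardware doesn't use for addressing, as the tag.
class Swip {
 public:
  static constexpr uint64_t kHotBit = 1ULL << 63;

  static Swip FromPointer(void *p) {
    return Swip{reinterpret_cast<uint64_t>(p) | kHotBit};
  }
  static Swip FromPageId(uint64_t page_id) {
    assert((page_id & kHotBit) == 0);
    return Swip{page_id};  // tag bit clear => on disk
  }

  bool IsHot() const { return (raw_ & kHotBit) != 0; }
  void *AsPointer() const {  // only valid when IsHot()
    return reinterpret_cast<void *>(raw_ & ~kHotBit);
  }
  uint64_t AsPageId() const { return raw_; }  // only valid when !IsHot()

 private:
  explicit Swip(uint64_t raw) : raw_(raw) {}
  uint64_t raw_;
};

int main() {
  int page_in_memory = 42;
  Swip hot = Swip::FromPointer(&page_in_memory);  // swizzled
  Swip cold = Swip::FromPageId(1337);             // unswizzled (made-up id)
  assert(hot.IsHot() && *static_cast<int *>(hot.AsPointer()) == 42);
  assert(!cold.IsHot() && cold.AsPageId() == 1337);
  return 0;
}
```

Because, as described above, each page has exactly one such pointer to it, moving a page between memory and disk only ever means updating that one word.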
And that's essentially what the page table is doing in a disk-oriented system, right? It's an indirection layer. Or think of the mapping table in the Bw-Tree: an indirection layer with a single location where you can find the address for the data you need, and it's one compare-and-swap to update it. So instead, they're going to organize memory through a hierarchy where there is only one pointer to anything, and that allows them to do this without a mapping table.

The next interesting thing about LeanStore, which I really like, is that rather than trying to be clever about deciding what's least recently used or whatever, they just roll the dice, randomly pick some pages, and say, you're going to be evicted. Then they put the page into this cooling stage, where they say, we think we're going to evict this thing; let's see whether it actually gets accessed before we evict it. If yes, make it hot again and keep it in memory. If not, then we know nobody needed it and it's safe to write out. So again, the advantage of this over the approaches we talked about before is that we don't need to record any metadata about how tuples, or in this case pages, are being accessed when they're hot, which is the most common case and the thing we want to optimize for. The whole reason data is considered cold is that nobody's accessing it, so it's okay to do actual work to track accesses once it's cold, because that won't happen very often. So when we identify something for eviction, we unswizzle its pointer but still leave it in memory. We put the information about that page in a separate hash table that we can consult to see whether something is really on disk or whether it's actually still in memory in this cooling stage. If we access it again, we remove it from the queue and just swizzle the pointer back. Otherwise, if the page doesn't get accessed and it reaches the front of the queue when someone needs to reclaim space, it gets evicted.

All right, so let's talk about how we're going to organize pages; I'll show an illustration of what this looks like on the next slide. Unlike in a disk-based system, where the buffer pool is just a flat list of pages, what we're going to do instead is organize our blocks into a hierarchy. You have parents, and a parent can have children. One parent can have multiple children, but every child can have only one parent. Because of this hierarchy, we're guaranteed that every single block, every single page, has only one pointer to it: its parent. So this avoids the problem of two things pointing to the same object. Now for indexes, this is easy to understand. Think of a B+tree: it's already a tree structure, and there's already one pointer to each node. How you handle sibling pointers is another story, but the memory can already be laid out in this hierarchy, so that works nicely. For the regular buffer pool, you have upper pages that point down to more pages, and you treat a scan like a breadth-first traversal: this page starts the table, and here are more pages that come after it. It's an unordered heap, so it doesn't matter where the pages sit in the hierarchy; we just know each parent points to its children.
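A rough sketch of that ownership structure, building on the tagging helpers above, might look like the following. The names are made up for illustration; the LeanStore paper calls these single owning references "swips".

```cpp
#include <cstdint>
#include <vector>

// Each child page is reachable through exactly one tagged reference held by
// its parent, so there is no global mapping table to keep consistent.
struct Page {
    Page*                 parent;    // every page has exactly one parent
    std::vector<uint64_t> children;  // tagged refs: swizzled address or page id
    // ... payload: tuples for a heap page, keys/values for an index node ...
};
```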
So now, because we have this tree structure for our buffer pool, we flip a coin, or roll the dice, and pick a random page for eviction. If we end up picking a page whose children have not been evicted, then we cannot evict that parent page. What we don't want is for the parent page to get written out to disk with a swizzled pointer to one of its children, because when it gets brought back into memory, that pointer would be invalid: it would point to a memory location that may not exist anymore, or that may now hold a different page. So to avoid that problem, if I pick a parent that has swizzled children, I pick one of its children to evict instead. That's still effectively random, and that's fine.

So let's look at an example. The hierarchy of storage now has three stages: a cold stage, a cooling stage, and a hot stage. And for this, we distinguish between swizzled and unswizzled pointers. Again, an unswizzled pointer has the flag bit set to zero and holds a page ID and offset; the page is out on disk, and the pointer tells me where to find that block on disk. If it's swizzled, then it's an in-memory address. So let's say we roll the dice and decide we want to evict B1 here. In all the other examples I showed before, once we decided to evict something we would immediately write it out to disk. We're not going to do that here. Instead, we migrate it down into the cooling stage. We add it to the cooling stage's hash table, which maps the page ID and offset, where the page will live on disk, to its entry in an eviction queue, and that eviction queue entry holds the physical memory address where the page is currently located. The parent page now has an unswizzled pointer, which is just the page ID and offset. So if I'm doing a traversal through this hierarchy and I want to do a lookup on B1, all I have is the unswizzled address. I do a lookup first into this hash table. If it's not there, then I know the page is on disk, and I go to disk and get it. If it is there, I follow the hash table entry into the eviction queue, pull the page out of the queue, move it back into the hot stage, and update the pointer.

So again, I really like this idea because it's completely decentralized. There's no single page table keeping track of what's in memory versus not. That information is essentially retained inside the pointers themselves, and each pointer lives in a single location, so we can do a simple compare-and-swap to flip it around. And if we ever need a free block or free page to store something new, we just grab whatever's at the front of the queue, write it out to disk, reclaim that space, and remove it from the hash table. There's one latch I think you have to handle to make sure there isn't a race condition when you're trying to fetch something in and write something out at the same time, but that's not the common case, right? In the common case you're accessing hot data. So again, I really like this paper, I really like this idea; when Viktor explained it to us, it clicked in my head and I thought, yeah, that's a good idea.
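Here is a rough sketch of what the cooling-stage bookkeeping might look like, under the same assumptions as the earlier sketches. CoolingStage, reheat, and the other names are invented for illustration, not anything from the paper.

```cpp
#include <cstdint>
#include <iterator>
#include <list>
#include <unordered_map>

struct Frame { uint64_t pageId; void* memory; };

class CoolingStage {
    std::list<Frame> fifo;  // eviction queue; front = oldest cooling page
    std::unordered_map<uint64_t, std::list<Frame>::iterator> table;  // pageId -> queue entry
public:
    // A randomly chosen hot page is unswizzled and parked here instead of
    // being written out immediately.
    void cool(uint64_t pageId, void* memory) {
        fifo.push_back({pageId, memory});
        table[pageId] = std::prev(fifo.end());
    }

    // A traversal that hits an unswizzled pointer checks here first: a hit
    // means the page is still in memory, so pull it out and make it hot again.
    void* reheat(uint64_t pageId) {
        auto it = table.find(pageId);
        if (it == table.end()) return nullptr;   // not cooling -> really on disk
        void* memory = it->second->memory;
        fifo.erase(it->second);
        table.erase(it);
        return memory;                           // caller re-swizzles the parent's pointer
    }

    // When memory is needed, the oldest cooling page is the victim.
    bool evictOldest(Frame& out) {
        if (fifo.empty()) return false;
        out = fifo.front();
        table.erase(out.pageId);
        fifo.pop_front();
        return true;                             // caller writes out.memory to page out.pageId
    }
};
```

The point of the design is that none of this bookkeeping is touched for hot pages; a traversal only pays for it when it hits an unswizzled pointer, which by construction is rare.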
So any questions about this? No one's as enthusiastic about this paper as I am. Okay, all right. The last technique we'll talk about is MemSQL. MemSQL isn't really doing the automatic eviction stuff we've talked about here. They have the ability to back tables with mmap, and they just let the OS swap things out. MemSQL is an HTAP system, so they have a row store and a column store. The row store always lives in memory: they malloc their own heaps and manage those themselves, and that data never gets evicted. But you can declare a table as a column store table, and then it gets backed by mmap, and the OS is allowed to swap out anything it needs to. Of course, you still want to be able to update the column store, so they have the delta store we talked about when we covered split execution engines: a small delta store that absorbs updates, and background processes then merge those in carefully and try to avoid issues with the OS. So in this case, if you have data in the row store and you want to evict some of it, the only thing you can do is dump the table out and load it back in as a column store table. But from the application's standpoint, those appear as two separate tables, because it doesn't know the column store table is linked to the row store; at least that was the case the last time I read their docs, all right? So there's no eviction metadata, because the OS handles this; they're doing synchronous retrieval, because the OS blocks your thread when you touch data that isn't in memory while it goes and fetches it; and the OS doesn't know what's inside a page, so they always merge everything back, okay?
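For contrast with LeanStore's explicit machinery, this is the general shape of an mmap-backed column where the kernel handles eviction. This is a generic POSIX sketch, not MemSQL code; the file name and sizes are made up, and the column file is assumed to already exist on disk.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical column file: one million 64-bit values already on disk.
    const size_t nValues = 1'000'000;
    const size_t bytes   = nValues * sizeof(int64_t);

    int fd = open("orders_price.col", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    // MAP_SHARED means the kernel may write dirty pages back to the file and
    // drop them under memory pressure, so eviction is handled entirely by the OS.
    auto* column = static_cast<int64_t*>(
        mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    if (column == MAP_FAILED) { perror("mmap"); return 1; }

    // A scan just reads the array; touching a page that was swapped out blocks
    // this thread while the kernel faults it back in ("synchronous retrieval").
    int64_t sum = 0;
    for (size_t i = 0; i < nValues; i++) sum += column[i];
    std::printf("sum = %lld\n", static_cast<long long>(sum));

    munmap(column, bytes);
    close(fd);
    return 0;
}
```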
All right, so to finish up: today was really all about how we can identify cold data and reduce the memory pressure in our system by writing that data out to some auxiliary storage. But then we had to do a bunch of extra work to make sure that if some query comes along and wants data that's out on disk, we have a mechanism to find it, retrieve it, and bring it back in. From the application's standpoint, it doesn't know whether anything is in memory or on disk, and that's the beauty of the relational model: we hide all of that at the logical layer, but we have to make it work efficiently. And none of the techniques except LeanStore actually handle indexes, which we saw were a major contributor to that memory overhead.

Now, we're not going to talk about non-volatile memory this semester. What I will say is that non-volatile memory is a new storage device that Intel is putting out this year. Every time I give this lecture I say it's coming out this year; I could do a montage, because it's been three years in a row of "it's coming out this year." It actually is coming out this year. We have access to it, not physical access, but there's a machine at Intel's labs in Oregon that we can log into and use. My first PhD student, Joy, who was the initial creator of Peloton with me and who's now at Georgia Tech, wrote his whole dissertation on non-volatile memory databases. The way to think about non-volatile memory is that it looks like DRAM, it's byte-addressable and it's fast, but it's persistent like an SSD: you pull the plug and it retains all the data. So that starts to get really interesting, because in my opinion, a lot of what we talked about today goes away, because who cares? It all looks like a single address space; some of it will be in DRAM, some of it in NVM. You can actually run NVM in different modes, depending on whether you want it to be transparent to you or not. But all of the extra work we have to do, we just let the hardware handle, and that's amazing. So I think what we talked about today will largely go away when NVM becomes more prevalent. But we're not there yet.

All right, any questions about larger-than-memory databases? Normally I teach this one at the end of the semester. I forget my reasoning for bringing it up earlier; I think it's something interesting, and I want you guys to focus on it now, because by the end of the semester everyone gets burned out and nobody reads the papers as carefully. I think that was the reason I moved it forward. So next class, next Monday, will be about logging and checkpoints. For that one we need the disk again, because we have to make sure all our changes are durable; if you lose power, DRAM gets wiped. So we'll talk about how we can efficiently write things out to disk and then recover if we have a crash. And if we have a little time, maybe I'll also talk about doing log replication over the network to another machine, okay? All right guys, enjoy your weekend. See ya. Take care.