So we have two more lectures left. For this week we'll be focusing on storage, and what we're going to look at is how to handle databases that exceed the amount of DRAM we have available to our system. For today's class we'll focus on how we can do this with the hardware we have today, like spinning disk hard drives and SSDs. Then in Thursday's lecture we'll talk about the new non-volatile memory devices that are pretty close to actually coming out, and I'll talk more about what they are and which companies are going to be the first to put something out on the market.

Real quick, before we get started on the lecture material, let's go through the upcoming schedule for the rest of the semester. Next Thursday, in this location, in class, we'll have the final exam. It will be at the regular class time, an hour long, and it will be a combination of multiple-choice and short-answer questions covering everything we've talked about in the lectures. In Thursday's class I'll provide a sample final exam to give you an idea of what the questions look like. The way to think about this exam is that it's not going to ask about very specific details of the different topics or papers we discussed. I'm not going to ask you, under multiversion concurrency control, if a transaction does this, what happens. It's more about synthesizing the high-level ideas of the different concepts, techniques, and methods we talked about over the entire semester, and being able to extrapolate them and apply them to different scenarios. If you understand at a high level what's actually going on, rather than what specific algorithms are doing, that should be enough, because that's the important thing here. You can always go back and look at the papers or the textbook to see what an algorithm actually does.

Yes, question? [Student: May 3rd is a Wednesday.] All right, then it'll be May 4th. Thank you, yes, that's an easy fix. I'll fix it now so I don't forget. For the code review, I'll check the website; it might also say May 3rd versus May 4th, we'll double-check that. Any questions about the exam? The exam will be next class, Thursday May 4th, not the 3rd. The second round of code review is due either May 3rd or May 4th; it doesn't really matter which. The idea is that you're going to make your project code available to the other group, even though you haven't finished. It's just a process of getting feedback as you work towards the final project.

Then the slot the university has scheduled for our final exam is May 9th. If the 3rd is a Wednesday and the 4th is a Thursday, then the 9th is a Tuesday. That will be at 5:30 p.m. in Wean Hall, in some room I've never been in before, and that's where you'll do the final presentations for your projects. Every group will have 12 minutes. I'll order pizza, since it'll be dinner time. The idea is a lighthearted session where everyone goes through and talks about what they actually accomplished. And then the code drop, when we actually try to merge all your stuff in, will occur after that point. All right, any questions about this?
Okay, the other news I want to bring up is a correction for something I said last class. I said that the BitWeaving stuff that Wisconsin developed made it into the Apache Quickstep system. That is actually not true. They took all of that out when they made Quickstep open source as an Apache project. So if you download Quickstep from the website now, it doesn't have any of the BitWeaving stuff; that's still only in the Wisconsin-internal system. The other announcement is that they announced the 2017 SIGMOD awards. SIGMOD is the big academic conference in databases; it's going to be next month in Chicago. They announced that Goetz Graefe won the Codd Innovations Award. This is an award they give out to honor someone who has made contributions to a major area of databases over a long period of time and had real impact. We've read a lot of Goetz's stuff in this class: we talked about Cascades, we talked about the Volcano model, we talked about index locking and latching. He's done a lot of work in databases that has had a big impact, so it's a good thing that he finally won this award. That's just an aside from the last week.

All right, so for today's agenda, we're going to talk about how an in-memory database management system can support databases that exceed memory. The basic way to think about this is that DRAM can only get so big, and we will have databases that are bigger than DRAM, but we want to be able to use everything we've talked about the entire semester, this memory-oriented architecture for our database system, and still be able to access things on disk without bringing back all the slow stuff we got rid of at the beginning of the semester. Remember the pie chart I showed at the beginning that said, in a transaction system for OLTP workloads, roughly 20 to 30% of the time is spent in the buffer pool, about 30% in recovery and the lock manager, and only about 12% of the time is the system actually doing useful work. Now we want to bring back disk, but we don't want to go back to that original pie chart. That's what we're trying to accomplish here.

I'll start off with some background material about why this is a tricky problem, then I'll talk about the things you have to be mindful of when you implement support for on-disk data in an in-memory database system, and then I'll walk through some specific examples of how people have implemented this in real database system architectures. The paper you read was the anti-caching work that we did, and that's one of the examples we'll go through. Then we'll finish with an evaluation we did last year that looks at all of these different implementation issues and runs them against different types of storage devices to see which ones are better than others.

All right, so it's fairly obvious why we want to do this. DRAM is expensive compared to an SSD or a spinning disk hard drive. And it's not only more expensive to buy, it's also more expensive to run.
What I mean by that is people have done studies where they measured the power consumption of a rack server to see where the energy actually goes, and they found that about 40% of the energy goes toward refreshing DRAM. Because DRAM is volatile, every so often the hardware has to send a charge into the cells to maintain whatever values are being stored. So 40% of the energy, not CPU time, is being spent just on refresh. It would be really nice if we could get all the benefits of a fast, optimized in-memory database system without paying this high energy cost. Ideally we want to be able to shove things out to disk and only bring them back into DRAM when we actually need them. That's the big idea here.

The major thing we have to be mindful of when we add disk back in is that the way we access storage in a disk-based system is different from an in-memory system. Remember, at the beginning of the semester I talked about how the classic disk-oriented database architectures are inherently block-oriented or page-oriented, because that's the interface the hardware provides: you write four-kilobyte pages, you can't write individual tuples when you write out to an SSD or a disk. So now we need to seamlessly deal with block-oriented storage even though the architecture we've developed for our indexes, our execution engine, our compression schemes, and everything else is inherently tuple-oriented and deals with things at the byte level. That's the challenge in this world.

For this discussion we're really only going to focus on OLTP, because in the case of OLAP there's not much special you can do: in OLAP you're essentially going to access the entire table or do long sequential scans over large segments of it. There's no magic we can build into our database system to make that go faster. Yes, we can do prefetching, we can keep some pre-computed aggregations, and things like that, and this is what standard database systems do now. But there's nothing about being an in-memory architecture that lets us make OLAP queries run faster when we have to read things from disk; the disk is always going to be the bottleneck.

Say we have to do a sequential scan on a column. Typically what you'll see is something like zone maps, where you maintain in memory pre-computed aggregates and other information about, say, the attribute you're scanning on for a particular column, per block of data on disk. You keep this in memory so that when a query comes along and asks, do I need to access anything in this block of data that's out on disk, you can look at the zone map first, do some early filtering, and decide whether there's anything on that block worth looking at. This is probably the only real optimization you can do here, and it's used in a lot of different systems. This is what MemSQL does, but it's also what Vertica does, even though Vertica is a disk-oriented system. So there's nothing in-memory-specific about it; it's a standard optimization technique.
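To make the zone map idea concrete, here is a minimal sketch of keeping a per-block min/max summary in memory so a scan can skip on-disk blocks that cannot possibly satisfy a predicate. The struct and function names are made up for illustration; no specific system's format is implied.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical per-block summary for one column, kept in memory.
struct ZoneMapEntry {
    int64_t min = std::numeric_limits<int64_t>::max();
    int64_t max = std::numeric_limits<int64_t>::min();
};

// Build one entry per on-disk block as the column values are written out.
ZoneMapEntry BuildEntry(const std::vector<int64_t>& values_in_block) {
    ZoneMapEntry e;
    for (int64_t v : values_in_block) {
        e.min = std::min(e.min, v);
        e.max = std::max(e.max, v);
    }
    return e;
}

// For a predicate like "WHERE col > threshold", keep only the blocks whose
// zone map says they might contain a match; everything else is never read
// from disk.
std::vector<size_t> BlocksToScan(const std::vector<ZoneMapEntry>& zone_map,
                                 int64_t threshold) {
    std::vector<size_t> survivors;
    for (size_t i = 0; i < zone_map.size(); i++) {
        if (zone_map[i].max > threshold) {
            survivors.push_back(i);
        }
    }
    return survivors;
}
```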
But again, if we need something out on disk, we have to go fetch the whole block and scan it. So to be clear, we're only going to focus on OLTP, because in that case we can be clever about what data gets moved out, what we keep around in memory when we bring data back in, and how we treat a transaction that tries to access data out on disk.

In the case of OLTP we have this distinction in our database that we've talked about before, and a lot of the techniques we've discussed are based on it: there's always this notion of hot data versus cold data. The hot data is the data that was recently accessed or recently modified by transactions and is therefore more likely to be accessed again in the future. Over time, as data ages, or as the high-level semantics of your application change, some data becomes cold and isn't going to be accessed by transactions anymore. The idea is that we want to recognize the hot data and keep it in memory, and recognize the cold data and move it out to disk. If we ever need the cold data we can still go fetch it, and we'll pay a performance penalty to do that, but in the common case, where most of our transactions are accessing hot data, we'll run just as fast as before.

This is where the anti-caching term in the paper you read comes from. What I'm describing is essentially the reverse of what a disk-oriented system does: with a buffer pool, the system puts the hot pages in memory and leaves the cold data out on disk, and they call that caching. In anti-caching it's flipped around: everything starts out in memory as hot, and we push the cold stuff out to disk over time. So the term is a reversal of what traditional caching does.

What we're going to need in our database system is some mechanism or component that can take things out of memory and put them out on disk, maintain some information about what's being stored there, and, if any transaction or query tries to access that data, know how to go fetch it and bring it back into memory. At a high level it looks like this. For this lecture we're also going to ignore the problem of moving the indexes out to disk. Remember when we talked about compression I said that for some databases the indexes can be 50% of the total size of the database in memory; for everything we talk about here, and I'll say later why this is a problem, we're going to leave the indexes alone. We'll assume the indexes always stay in memory, and we'll only focus on moving tuples out to disk.

So let's say we somehow identify that these three tuples here are cold. As I go through this example, I'll put little question marks to mark the problems we have to solve, or the design choices we have to make, at each of these steps when moving cold data out. The first step is that we've decided these three tuples are cold, so we're going to move them out of memory and put them out on disk in a block.
And again, this is an SSD or a spinning disk hard drive, so it's inherently a block-oriented device. We're going to pack a bunch of these tuples together, with a little header at the top that says what's being stored, and then write them out to disk. This is an evicted-tuple block, because we're evicting data out of memory and putting it on disk.

Now the first question we have to deal with is: what do we keep in memory in place of these tuples? If we just shove this data out to disk, nothing else in the system will know about it and we won't be able to find it. So we may have to maintain some extra metadata in memory that says there used to be a tuple here, it's gone, and here's where to go find it if you actually need it. And then in our index, what should the entries actually point to? Should they point to the original location where these tuples used to exist, or should we have some auxiliary data structure that says here's where to go find the tuples you're looking for?

Now say a query comes along that wants to do a lookup on this table and retrieve tuple 01, and tuple 01 is out on disk. Assume we have some mechanism that knows how to find 01 from the in-memory metadata, and it knows it's out in this page at this offset. Now we have to decide how to move it back in. Should we take the entire page and move all of its tuples back into memory, or should we pull out only the one tuple we need, put that in memory, leave the rest on disk, and maintain a little marker that says there's a hole here because the tuple that used to be here is now in memory?

The other question is what should happen to this query. Assume it's running in a transaction; it issues this query on tuple 01, and we recognize the tuple is not in memory and we have to go fetch it. Should we abort the transaction, put it in a queue, and have a background thread go fetch the data it needs? Or should we stall the thread right there and synchronously go fetch the data? A disk-oriented system will stall the thread: it makes the request to the buffer pool manager, the buffer pool manager recognizes the page is not there, and the thread gets stalled while the system fetches the page into memory. But in a system like H-Store or VoltDB, which has single-threaded execution engines, stalling the thread may be a bad idea because you're now stalling all the other transactions behind it. So this is also tied to what concurrency control model or execution scheme your database system uses.

For this class, what I'm going to do is go through each of these question marks, lay out the different design choices you can make, and not claim that one is better than another; at the end, in our evaluation, we'll see how these choices perform on different types of storage devices. Now you may be thinking, going back to the second lecture when we talked about in-memory database architectures, couldn't we just do this with mmap? Because mmap kind of solves all the problems I'm showing here, right? I said early on that you should never use mmap for your database system, and I still stand by that. If I die, put that on my tombstone.
That's something I believe in very strongly. The reason you don't want to use mmap for this, although we'll see one example where they do use mmap, in a clever way and not the naive way I'm talking about here, is that if we let the OS start managing things for us, we lose fine-grained control. In particular, we have the problem that any time a page gets modified by a transaction, write-ahead logging under a steal policy says we have to make sure the log record corresponding to that update is written out to disk before we're allowed to write the page itself. But if we let the OS decide what to flush out to disk, it may flush the dirty page before it flushes the log record. You can use madvise and mlock to try to make sure that doesn't happen, but then you have other problems: you can't have a separate thread asynchronously read ahead on your on-disk pages and start prefetching things, and you can't flush multiple pages atomically using mmap. You can call msync to flush an individual page and make sure it goes out to disk, but you can't flush multiple pages together as one atomic unit. There have been some papers that extend mmap to solve this, but as I understand it, with the standard implementation of mmap today you can't do it. Again, we'll see one example where they don't have to worry about these two problems and can use mmap, but in general we don't want to use mmap to solve this problem.

All right, so for OLTP, this is the hierarchy of issues we have to deal with for larger-than-memory support in our database system, which I showed in the diagram before. I'll go through each of these categories one by one. The first is runtime operations: what should happen inside the database system to identify which tuples are cold. Then, once you've identified those tuples, you have eviction policies: when should you evict data, and what metadata do you need to maintain in memory to keep track of what you've shoved out. And then, when a transaction tries to access something that's been evicted, there's a bunch of questions about how much data to bring back in and whether to stall the thread or do it in the background. We'll go through each of these.

The first problem we deal with is cold tuple identification. The high-level idea is that we need a mechanism inside the database system that can say: these individual tuples are not likely to be needed by transactions or queries in the future, so if I'm running out of space, it's okay to shove them out. You can do this in an online or an offline manner. In the online approach, the database system itself keeps track of the access patterns of queries on tuples as they run, and when it's time to decide what to move out, you just consult this metadata to find the tuples to evict. In the anti-caching paper, we maintain an LRU chain that's embedded directly inside the tuple headers, so when you access a tuple and need to update the chain pointers you don't take an extra cache line miss.
The idea is that every time a tuple is accessed, you move it to the front of the chain. Then, when you decide you need to move data out, you just look at the back of the chain: those are the coldest tuples, because they haven't been accessed recently, and you can shove them out. The offline manner is used in Hekaton and in Apache Geode, which we'll see in a second. Here you have a separate thread that monitors an access log generated by the engine, which records all the individual tuples being accessed. When it's time to decide what data to move out, which tuples are cold, you consult this log and figure out what hasn't been touched, and that's how you identify things.

I think the first approach is better, because it's guaranteed to stay in sync with what the application is actually doing, whereas with the offline approach the system could be overloaded and the offline thread just can't keep up with the log messages. You do pay a little penalty to update the metadata as you execute transactions, but it's one less moving part, one less background thread you need to maintain. I'll also admit that the way we did this in the H-Store anti-caching approach is not actually the right way to do it; I'll talk a little bit at the end about what I think is a better way. The problem with the linked-list approach in H-Store is that it becomes expensive to maintain, and we ended up having to do sampling so that we're not updating the chain for every single transaction, so it's not 100% accurate anyway. And you're paying, I think it's eight bytes in the header per tuple, just to maintain the pointers between tuples. If your tuples aren't really wide, with a lot of attributes, you're paying a big relative penalty to maintain this metadata. But I still think it's better than the offline approach.

All right, so now we need a way for the database system to decide that we're running out of space and it's the right time to start evicting data out to disk. There are two ways to do this. One is an administrator-defined threshold, where you say the size of my database can't grow larger than, say, 80% of the memory my database system is allocated, and when the database grows above that threshold, that triggers the process of grabbing tuples and moving them out to disk. In this approach it's up to the database system to decide how to move the data out, meaning it has to spawn a thread that goes into the in-memory database, extracts tuples, and writes the blocks out to disk. The alternative is to use virtual memory, essentially mmap, where you don't set a threshold at all: the operating system knows how much physical memory it has, and the OS decides which pages or blocks of the memory you've allocated get written out to disk. There's nothing the database system needs to do, which is what makes mmap attractive, but as I said, you then have these other problems where you can't always guarantee the correctness of your database when transactions are updating things.
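Going back to the online approach for a second, here is a rough sketch of an LRU chain embedded in the tuple headers plus an administrator-defined eviction threshold. This is not H-Store's actual code; the field layout, the slot-id representation, and the 80% cutoff are illustrative assumptions.

```cpp
#include <cstdint>
#include <vector>

constexpr uint32_t kInvalidSlot = UINT32_MAX;

// Illustrative tuple header with the LRU chain embedded directly in it.
// The two 32-bit slot ids are roughly the eight bytes of per-tuple overhead
// mentioned above.
struct TupleHeader {
    uint32_t lru_prev = kInvalidSlot;
    uint32_t lru_next = kInvalidSlot;
    // ... flags, txn metadata, and the attribute data would follow
};

struct LRUChain {
    uint32_t head = kInvalidSlot;  // most recently accessed
    uint32_t tail = kInvalidSlot;  // coldest tuples; eviction starts here
};

// On every access (possibly only a sample of accesses, to cut the maintenance
// cost), unlink the tuple and splice it back in at the head of the chain.
void TouchTuple(LRUChain& c, std::vector<TupleHeader>& t, uint32_t slot) {
    if (c.head == slot) return;  // already the hottest tuple
    // Unlink from the current position.
    if (t[slot].lru_prev != kInvalidSlot) t[t[slot].lru_prev].lru_next = t[slot].lru_next;
    if (t[slot].lru_next != kInvalidSlot) t[t[slot].lru_next].lru_prev = t[slot].lru_prev;
    if (c.tail == slot) c.tail = t[slot].lru_prev;
    // Splice in at the head.
    t[slot].lru_prev = kInvalidSlot;
    t[slot].lru_next = c.head;
    if (c.head != kInvalidSlot) t[c.head].lru_prev = slot;
    c.head = slot;
    if (c.tail == kInvalidSlot) c.tail = slot;
}

// Eviction trigger: administrator-defined threshold, e.g. start evicting from
// the tail of the chain once the heap exceeds 80% of the memory budget.
bool ShouldEvict(uint64_t heap_bytes, uint64_t budget_bytes) {
    return heap_bytes > (budget_bytes * 8) / 10;
}
```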
The mmap approach is kind of nice, though, because you get the eviction decision essentially for free. Doing it inside the database system should give better performance and better control over everything, but with mmap you don't have to build it yourself. And we'll see one example, a research system built on VoltDB, where they actually do take this approach.

All right, so the first step was deciding how to keep track of which tuples are cold, and then we decided how we're actually going to move data out. Now the question is: what do we need to maintain in memory to keep track of the data we've shoved out? The reason you have to do this is that you don't want any false negatives. You don't want to do an index lookup and not see a key that you know should be there just because the tuple was shoved out to disk.

There are essentially three ways to do this. I'll skip ahead and say the last one is, again, mmap or OS virtual memory: the OS keeps track inside the kernel of which pages have been written out to disk, so when you try to read an address in your memory-mapped file, it knows that address is in page 123 and that page 123 is not in memory; you get a page fault, and it goes and fetches the thing you're looking for. If you want to maintain this metadata yourself inside the database system, there are, at a high level, two ways. The first is what we use in the anti-caching approach, which I'll call tombstone markers or tombstone tuples. This is a marker, a special tuple in the table space of the table you've evicted data from, with a little flag in the header that says: this is not a real tuple, there's no real data here; what's stored here is a pointer to where the real tuple exists out on disk, at what block id and what offset. So you follow the index, you land on this tombstone, you check the header to see whether it's a tombstone or not, and if it is, the data stored in the tuple tells you where to go find the real thing. I'll show what that looks like in a second. What's really nice about this approach is that you still maintain the in-memory index with just 64-bit pointers to tuples: some index entries point to real data, some point to tombstones, and when you land on a tuple you check which kind you're looking at.

The other approach is to use an approximate data structure like a Bloom filter that sits above the table: when you do a lookup on an index, you check both the real index and the Bloom filter to see whether the tuple you're looking for is there. There are no tombstones, no extra metadata kept in the table heap; you just have two separate lookup structures. They use a Bloom filter because it's approximate, so it's much smaller than a regular B+tree. You may get false positives, meaning it may tell you a tuple exists on disk when it doesn't actually exist at all, but you'll never get false negatives, meaning it will never tell you a tuple doesn't exist when it actually does.
When the Bloom filter does give you a false positive, you waste an IO going to look up data that isn't actually there, but that's a pretty good trade-off for how little space the filter takes. So again, these first two you have to implement inside your database system; the OS approach gives it to you for free. Let's look at the two database-internal approaches and see how they work.

Again, our index is always in memory, we have an in-memory table heap, and we're going to shove some data out to disk. The first thing we need is some way to keep track of which tuples are accessed the most, so we can identify the cold ones and shove them out. In this example we can use a histogram; we'll see later that you could also use something like a count-min sketch or another approximate data structure to do essentially the same thing.

Say we now identify that tuples 1, 3, and 4 are the coldest, so we're going to write them out to disk. The index would normally point to where those tuples exist, but now we want it to point to something else, because we're going to end up reusing those slots for other tuples later on, and we don't want the index pointing at a slot holding a tuple it doesn't expect. If you're using the tombstone approach, like in anti-caching, you have a separate space in memory for these tombstone pointers or tombstone tuples, because each one is just something like a 32-bit block id and, I think, a 16-bit offset. These should in theory be much smaller than your regular table tuples, so you want a separate space in memory for them so you can reuse those slots as things move in and out. When you land on one of these, it tells you which block to go to and what offset to jump to in order to find the tuple you're looking for. And as I was saying, this only works if the tombstones are smaller than your regular tuples: if your regular tuple has only a one-byte field and you shove it out to disk, and your tombstone is 16 bytes, you're just wasting memory. It only works if the real tuples are wider than the tombstones.

The alternative approach, again, is the Bloom filter. Say I want to find tuple 1 in my table. I do my lookup on my in-memory index, and if there's an entry there, I can go fetch the thing I want right away. If it's not there, I consult the Bloom filter, and if the filter tells me there might be something there, I then go to an on-disk index that tells me where to actually find what I'm looking for. The Bloom filter only tells you whether something exists; it doesn't tell you where to find it, so you have to maintain a separate index for the location of the evicted data. And if you get a false positive, you do the lookup in that index, find nothing, and just stop there. Depending on what kind of index you're using, you may or may not always need to consult the filter. For example, if it's a unique key or primary key index, and I do my lookup in the regular index and find the thing I'm looking for, I know I never need to check the filter, because there can only be one entry.
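Here is a minimal illustration of what one of those tombstone entries might look like, packing the block id and offset just quoted into something much smaller than a real tuple. The exact field widths and the flag bit are assumptions for the sketch, not the actual H-Store layout.

```cpp
#include <cstdint>

// Illustrative evicted-tuple tombstone: instead of attribute data, the slot
// holds a flag plus the location of the real tuple out on disk.
struct Tombstone {
    uint32_t block_id;  // which evicted-tuple block on disk
    uint16_t offset;    // which slot inside that block
    uint16_t flags;     // low bit set = "this is a tombstone, not real data"
};
static_assert(sizeof(Tombstone) == 8,
              "the tombstone has to stay much smaller than a real tuple");

// An index lookup lands on the slot and branches on the flag: if it is a
// tombstone, the caller schedules a fetch of (block_id, offset) instead of
// reading attribute data from the slot.
inline bool IsEvicted(const Tombstone& t) { return (t.flags & 0x1) != 0; }
```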
If I do a lookup on the primary key and the key is not in the in-memory index, I always have to check the Bloom filter. If it's a secondary or non-unique index, I always have to check both, because for a particular key I may get entries that are in memory and also entries that are out on disk. So for a primary key it's one or the other, or sometimes both if there's no match in the first one; for a secondary index it's always both.

So what's another problem with this architecture, if you use the Bloom filter? Yes? [Student asks about the on-disk index.] So I think she said that because that second index is on disk, if you do lookups multiple times you're always hitting disk. In practice a lot of that index will be paged in memory, so you can cache it and it won't be that bad. But I'm looking for something more obvious. What access method did I show in all these examples? How did I get to the tuples? I went through the index, right? What if I do a sequential scan, right on the table itself? Would this approach work? No, because there's nothing in the heap that says here are the tuples you're missing. So this doesn't work for scans, and this is part of why we're focusing on OLTP.

In the case of anti-caching, the way we handle a scan is that you essentially have to do two sequential scans: one over the regular table heap, and then one over the tombstone portion. Now, because all the indexes are in memory, and a lot of OLTP applications build a lot of indexes, it may be the case that we can reconstruct the entire tuple just from the index keys, or, if the query can be answered by a covering index, you may get all the information you need from the index itself even though the tuple is out on disk. But for true sequential scans, which is what I was saying about OLAP, this won't work, because a sequential scan has to look at everything, and there's no metadata in the heap that tells you what's actually out on disk, so you'd always have to go fetch it. In the Hekaton approach with the Bloom filter, you'd have to scan all the evicted blocks anyway; in fact I think Hekaton won't even let you do a sequential scan once you've pushed things out to disk. In our case we support it, but we have to go fetch things.

Okay. So now let's say that, regardless of what mechanism we use for identifying cold data and what metadata we maintain in memory to record what's on disk, a transaction comes along, touches data, and we determine that something it needs is on disk. We go get the evicted-tuple page we want and fetch it into memory. Now the question is: what are we actually going to merge back into memory? At the very least, in the first design choice, you know you need the tuple you went to fetch, so you have to merge that in no matter what. But what should you do with the other tuples that weren't required by your query but got pulled in anyway, because we can only fetch things in four-kilobyte blocks? What should we do with them?
The first approach is essentially to merge back only the tuples needed by the query, or the set of queries, that requested that block, and throw away everything else. The catch is that we now need some extra bookkeeping metadata to record that the tuple we excised out of the block should no longer be looked for in that evicted-tuple block; if you need it, you should find it somewhere else. Think about what happens after a crash: we load the checkpoint back into memory, and we don't have to reload the evicted tuples because they were already safe out on disk. So if I pulled tuple 01 out of its evicted-tuple block earlier, I need to have marked a hole in that block so that the system knows tuple 01 is not in that evicted page anymore and finds the correct version. We've now broken the property we had before, that in an in-memory database there's only one copy of a tuple at a time; now there can be multiple copies, and we have to know which one is correct. So if we only want to grab the tuple we need, we have to do this extra work.

The easier approach is to just take all the tuples in the block and merge them all back into memory. The idea is that because we maintain metadata about the access patterns of queries on tuples over time, the tuples that got incidentally merged back in when we fetched the block won't be accessed again, so they'll be marked cold the next time we have to do an eviction and get shoved out to disk all over again. The problem is that we're essentially thrashing: we fetch a four-kilobyte block just to grab one tuple, then we write those same tuples back out again to make space, bringing them back and forth over and over.

And merging all the tuples in a block back into memory is not free, because we have to update all of our indexes to point to those tuples. Regardless of whether you're using the Bloom filter or the tombstones, we always have to update the in-memory indexes to point to the new tuples we've added back into the in-memory table heap. If our tuples are small and we can fit, say, a hundred of them in a block, we have to update the index a hundred times to merge them all back in. And, as I said, there are going to be a lot of indexes on the tables in a modern OLTP application, so multiply that hundred by the number of indexes on the table: we have to update all of them, and that's not cheap. You also need to do this within the context of the system's concurrency control scheme, because you don't want phantoms and other weird problems. In the case of H-Store, with its single-threaded execution engine, we had to block all our transactions while we merged these tuples back into the table heap. In other systems with multi-threaded concurrency control, you have to make sure that adding these tuples in doesn't foul up other transactions running at the same time.
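To see why always-merge is not free, here is a bare-bones sketch of the work involved: every tuple pulled back in from the evicted block has to be copied into the heap and re-inserted into every index on the table. The types here are stand-ins, not any system's real heap or index classes.

```cpp
#include <cstdint>
#include <deque>
#include <map>
#include <vector>

// Stand-in types for the sketch.
struct Tuple { uint64_t key; /* attribute data would follow */ };
struct Index {
    std::multimap<uint64_t, const Tuple*> entries;
    void Insert(uint64_t key, const Tuple* loc) { entries.emplace(key, loc); }
};

// "Merge all": copy every tuple from the fetched block back into the
// in-memory heap and repoint every index at the new copies. With ~100 tuples
// per block and several indexes per table, that is (tuples x indexes) index
// updates per fetched block, all done under the system's concurrency control.
void MergeAll(const std::vector<Tuple>& evicted_block,
              std::deque<Tuple>& table_heap,   // deque: stable element addresses
              std::vector<Index>& indexes) {
    for (const Tuple& t : evicted_block) {
        table_heap.push_back(t);
        const Tuple* new_loc = &table_heap.back();
        for (Index& idx : indexes) {
            idx.Insert(t.key, new_loc);
        }
    }
}
```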
So the merge-everything approach is easier to implement in terms of the metadata you need to maintain, but you have more CPU overhead and the tuples basically get written out over and over again. Merging only what you need is faster, but you have to do the extra bookkeeping to make sure you don't pick up an incorrect version of a tuple after a restart.

All right, now we have to deal with what should happen to a transaction when it accesses one of these tuples out on disk. As I said, one approach is to abort the transaction, roll back all its changes, put it into a side queue, and have a separate thread go fetch the data it needs; once we know the block is in memory, we restart the transaction and it runs just as before, and this time it won't get aborted because the data it needs is in memory. The problem is that the transaction may actually need data from multiple evicted blocks on disk. It gets to the first one, finds the data isn't there, so it aborts and restarts while we fetch the first block; then it restarts, touches the next thing, which is also out on disk, and has to restart again. So you can have multiple restarts, with the transaction stalling every time it touches data that isn't in memory.

The way we got around this in H-Store with anti-caching is that when a transaction tried to touch data that wasn't in memory, we just pretended it was there and let the transaction keep running, and only when it actually tried to modify the tuple, or use the tuple as input to another query, or return the result to the application, would we abort and restart it. The idea is to let the transaction run as far as it can with the data we actually have in memory, figure out everything it's going to need from disk ahead of time, then stop it, restart it, fetch all the data it needs at once, and bring it back in. This doesn't always work: if the output of one query is used, say, as an if-branch for the next query, if this then do that, otherwise do this other query, we have to block right then and there.

The other issue is that if the total working set size of the query exceeds the amount of DRAM you have, we can't guarantee the transaction gets a consistent view of the database. What I mean by that is, say you have only one gigabyte of space in DRAM but you need to access two gigabytes of data from your table. The first time you run the query, you get through the first gigabyte, run out of space, and abort and restart. Then you fetch the second gigabyte, but that swaps out the first. When you restart, you get through the second part, then touch the first gigabyte again, and you get aborted and restarted. You keep restarting over and over; you can't guarantee that everything will ever fit. This is another example of why this technique doesn't work for OLAP queries: in those cases, the data you need is almost certainly much bigger than DRAM, and under this model that just wouldn't work.
The way to get around that would be something like this: say I need to compute a count on a table. I run it the first time, it counts all the tuples I have in memory, then I abort the transaction but remember where I left off in my count while the rest of the data I need is fetched back in. During that time, other transactions could be updating things, because I've been put off to the side, and when my query or transaction restarts, it continues the count over the rest of the data that just got fetched in. That data might have been modified by other transactions in the meantime, but for OLAP queries that might be good enough. That's the big issue with this approach, and another reason why we can't really do OLAP with it.

The other approach is to do essentially what a disk-based system does: as soon as you try to access something that's been evicted and isn't in memory, you pause the transaction right there, go fetch the data you need, bring it back into memory, merge it into the indexes, and then resume the query right where it left off. Again, depending on what concurrency control scheme you use, this may or may not be a good idea, and it also depends on what storage device you're using: if you have a fast storage device this may not be a big deal, but if you have to go fetch something from Amazon S3, that's going to be tens of milliseconds, and that could be a bad idea.

Yes? [Student: can you go back to the previous slide?] Yes. [Student: for the first approach, you grab a block into memory, merge the tuples you want, and discard the rest?] Yes, we just discard that block right there. [Student: but then you have holes in the block, so eventually you have to do compaction to save space on disk. Why not grab some other cold tuples to fill in the hole and write a new block back to disk, so you don't have to do compaction later?] So let me restate what we do and what he's proposing, because his idea actually might make sense. What we do is fetch the block into memory, grab the one tuple we need, and discard the rest; we don't write anything back. What he's proposing is: fetch the block in, pull out the one tuple I need, but then insert in its place another tuple that I was going to evict at some later point, to fill in the hole, and write that block back out. That would work. The downside is that if you do this while the transaction is blocked, you're paying for the read and also for the write; you could put the block in a side buffer and flush it out asynchronously later. But yeah, we didn't consider that, and it might work.

Another thing I didn't really talk about here is that you could also try to pack into each block the tuples that are related to each other based on foreign keys. For example, say I have the customer table, the orders table, and the order items table: I'd make sure that when I move out Andy's customer record, I also move out all of his orders and order items. Then when you go fetch a block, you're not fetching random tuples from the same table; you're fetching data that's related and more likely to be used together, so merging everything in isn't merging stuff you're never going to need.
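Here is a rough sketch of that foreign-key grouping idea, packing a customer together with its orders and order items so one fetch brings back data that is likely to be used together. The schema, the grouping key, and the fixed tuples-per-block limit are all hypothetical.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical cold tuples that share a foreign key back to the customer.
struct EvictableTuple {
    uint64_t customer_id;  // grouping key (foreign key to CUSTOMER)
    uint32_t table_id;     // CUSTOMER, ORDERS, ORDER_ITEMS, ...
    uint32_t slot;         // where the tuple currently lives in memory
};

// Instead of filling eviction blocks with random cold tuples from one table,
// group the cold set by customer id so each block holds a customer plus its
// orders and order items. (A group bigger than a block just overflows one
// block in this sketch.)
std::vector<std::vector<EvictableTuple>> PackByForeignKey(
        const std::vector<EvictableTuple>& cold, size_t tuples_per_block) {
    std::map<uint64_t, std::vector<EvictableTuple>> by_customer;
    for (const auto& t : cold) by_customer[t.customer_id].push_back(t);

    std::vector<std::vector<EvictableTuple>> blocks(1);
    for (auto& [cid, group] : by_customer) {
        // Start a new block if this group would spill past the block boundary.
        if (!blocks.back().empty() &&
            blocks.back().size() + group.size() > tuples_per_block) {
            blocks.emplace_back();
        }
        for (const auto& t : group) blocks.back().push_back(t);
    }
    return blocks;
}
```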
Packing related tuples together like that would be another way to address this problem, and filling in the hole, as was suggested, would be a good one too.

Okay, so the last thing we have to deal with is which tuples we want to merge back into memory. When I say merge, I mean: should the tuple go back into the main database, visible to any transaction, or should it only be put in a side buffer for the one query or transaction that needed it, and thrown away when that transaction is done with it? The idea is that we want to avoid having to update all our indexes every single time we merge a tuple back in. So this is orthogonal to the earlier question of what data we bring back into memory: after deciding which tuples to bring in, this is about which of them actually get merged into the table with index updates. The easiest thing is to always merge everything. An alternative is to only merge the tuples that are actually modified by a query and only update the indexes for those; if a tuple is only being read, the transaction reads the data it needs out of the buffer, and the buffer gets thrown away when the transaction commits. Of course, if the next transaction comes along and wants to read that tuple again, it's not going to see that buffer, so we have to go back and fetch the data from disk again. The last approach is to be a little smarter and maintain some extra metadata about how often each block is being retrieved: when we find ourselves fetching the same block over and over, we go ahead and merge everything in, to avoid that back-and-forth to disk for the same data.

All right, so now I'm going to walk through a bunch of different implementations and describe, using the taxonomy we just went through, what these different systems actually do. We'll start with H-Store, from the paper you read, then talk about Hekaton, then a variant of VoltDB developed at EPFL, and then two systems that are actually commercial or used in the real world, Apache Geode and MemSQL.

In the anti-caching paper you read, we were doing online identification: we maintain a linked list, an LRU chain, recording how tuples are being accessed. We used an administrator-defined threshold, so when the size of your database relative to the amount of DRAM you have, the resident set size of the process, got above some threshold, we decided what data to evict. We maintained tombstone tuples when we moved data out. When a transaction accessed data that was on disk, we aborted and restarted it, fetched the data it needed in a separate background thread, and then restarted it at a later point. We moved data at block-level granularity, and in this version we would always merge everything. We'll see later that we built a newer version of the anti-caching stuff in H-Store where we could tune all these different knobs and see which ones actually worked, but in the version you read from 2013 we were naively merging everything.

For Microsoft Hekaton, they had a side project called Project Siberia.
As far as I know, this was never actually implemented in, or at least never released in, the commercial version of Hekaton. It was a research project they built but never shipped, partly because of the problems I talked about before, like not being able to do sequential scans. They were doing offline identification: a separate thread reads the log and identifies which tuples are cold based on that. Just like in H-Store, they had an administrator-defined threshold to decide when it was time to start evicting things. They didn't use tombstones; instead they used the Bloom filter to keep track of what data was on disk. They would block threads synchronously when retrieving data, they moved things in and out at tuple-level granularity, and they would always merge everything. Like I said, they have a couple of papers on this project, but it never made it into the real system.

In 2013, researchers at EPFL took VoltDB, which is the commercial variant of our H-Store system, and modified it to do something kind of cool. They were doing offline identification, just like in Hekaton, but they were actually going to rely on mmap. I'll show in the next slide how this basically worked, but the high-level idea was to use malloc for hot data and mmap for cold data: when the background thread identified cold tuples, you'd move them out of the malloc-allocated portion of the heap and into the mmap side, and eventually the OS would decide it needed to make space, because the resident set size of the process was too big, and it would be free to move the cold side out to disk. Actually, I take that back: the whole thing was mmap. They didn't use malloc for the hot stuff; they used madvise to tell the OS to keep those pages pinned in memory and not swap them out, while the higher addresses in the memory space were not pinned, and the OS was free to move them in and out as much as it wanted.

And because you're using mmap, you're forced into synchronous retrieval: if you try to access a memory region that was swapped out to disk, you get a page fault, the kernel suspends your thread and fetches the page you're looking for, and only when the data has been pulled back into memory is your thread allowed to run again. That's the same thing as synchronous retrieval. They were doing this at page-level granularity, because that's the unit the OS swaps with mmap. And of course they always merge everything, because the OS is doing the swapping and they have no control over it. The nice part is that you don't need tombstones; the index still points to the same memory address, but that memory region might be out on disk. Always-merge was bad in the H-Store case and in the Hekaton case because you had to go update your indexes; in their case they don't have to update indexes at all.

So it looks like this: say the top portion of your memory heap is for the hot tuples, and these are all pinned in memory so the OS can't swap them, and then down below is the space where you put your cold tuples.
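Here is a bare-bones sketch of that layout: one big mapping where the hot region is pinned so the kernel never swaps it, and the cold region is left pageable. The EPFL prototype's exact calls may differ (the lecture mentions madvise; mlock is the portable way to pin, so that is what the sketch uses), and whether the cold region is backed by a file or by swap space is a detail glossed over here.

```cpp
#include <sys/mman.h>
#include <cstddef>

// Carve one anonymous mapping into a pinned hot region and a pageable cold
// region. Touching a swapped-out cold page later causes a page fault, and the
// faulting thread is blocked while the kernel fetches it: synchronous retrieval.
struct HeapRegions {
    std::byte* hot;
    std::byte* cold;
    std::size_t hot_bytes;
    std::size_t cold_bytes;
};

HeapRegions CreateHeap(std::size_t hot_bytes, std::size_t cold_bytes) {
    std::size_t total = hot_bytes + cold_bytes;
    void* base = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return {nullptr, nullptr, 0, 0};
    auto* bytes = static_cast<std::byte*>(base);
    mlock(bytes, hot_bytes);                              // pin the hot tuples
    madvise(bytes + hot_bytes, cold_bytes, MADV_RANDOM);  // hint for the cold side
    return {bytes, bytes + hot_bytes, hot_bytes, cold_bytes};
}
```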
You identify which tuples are cold, copy them down into the cold region, and at that point you do have to update the index to point to the new location. But then, when the OS decides it wants to swap that region out, it just does that on its own, in the background, without suspending your threads. Later, when you go to access it, you get a page fault, and the OS fetches it and brings it back in.

In addition to the synchronization problem we talked about before, what's another problem you can have if you let the OS decide what data to move out? It has to do with the alignment issues we've talked about before. One big problem is that you don't have complete control over where the page boundaries fall, and you need to be careful to align things correctly. The OS's page boundaries are wherever they are in memory, so a single tuple may end up getting split across two pages. Those two pages get swapped out, you follow your index to the tuple's starting position, but when you start reading the rest of its data, you have to go fetch the second page. Before, when the database system had complete control over how it moved data out onto disk, we could make sure a single tuple always fit within one page; here, if we let the OS do everything, it may split it like this, so what should have been one page, one IO, to fetch the tuple is now two, plus stitching it back together. That's something you have to be careful about.

All right, the next system is Apache Geode. I'm putting question marks here because I'm not a hundred percent sure this is correct, and it's confusing: Apache Geode came out of Pivotal and used to be called GemFire, but there's also GemFire XD, which is different; GemFire XD uses Apache Geode but it's not the same thing as GemFire; and before that, Pivotal came out of VMware, and VMware had a system called SQLFire, which is related to GemFire but not GemFire XD. So it's all kind of confusing, and when you read the documentation they don't come out and say exactly what they're doing, but to the best of my knowledge this is it. They do online identification, maintaining metadata about the access patterns on tuples; they have an administrator-defined threshold, so when you reach some memory capacity you start writing things out; I think they maintain tombstones to keep track of what's been moved out; and when you try to access evicted data, they block your thread and fetch it back in, but at tuple-level granularity. What's interesting is that they only merge things back into the regular table space when they're being updated. One thing Geode can do is target HDFS as its cold-data storage: the database system writes blocks out to HDFS and fetches them back in when you actually need them.
So again, to the best of my knowledge, that's what they're doing. The last one we'll talk about is MemSQL, which does something different from everyone else here. MemSQL is a hybrid system: you have a row store and a column store. The row store is always malloc-ed memory, so it can never be written out to disk. For the column store, they essentially use mmap as the buffer manager, but they have a staging area where they put all the deltas, the changes to the column store data, in a transient row store. Think of it like a log-structured merge tree, where that staging area is the MemTable portion. When they merge things from the delta store into the columns, they just use mmap and let the OS decide when to write those things out. Because it's mmap, you don't need any extra metadata to keep track of what's been flushed out; as a by-product of page faults blocking your threads, you're doing synchronous retrieval; and they always just merge everything back in. When I say manual identification, what I mean is that you, as the administrator, have to declare that you want a table to be a column store, and therefore it uses mmap. They talk about how you could have a row store and copy things over into a column store over time if you want this functionality. I think this is okay, even though they're using mmap: you get the eviction for free without having to build it yourself, and it's a good first implementation. With more fine-grained control, letting the database system do everything, you could get better performance, but for them it's good enough.

All right, so in the last few minutes I want to go through the evaluation that covers all of these different design decisions running on different storage devices. This is a paper we published last year at DaMoN, where we took the anti-caching stuff that we built in H-Store and extended it to support all the different techniques we've talked about here. What was really cool is that we ran this on a machine at Intel that had five different storage devices. We had the classic spinning disk hard drive and NAND flash SSDs. We also had a newer type of drive from Seagate, shingled magnetic recording (SMR) drives. If you don't know what a shingled disk is, the basic idea is that it's like a spinning disk hard drive, but they pack the tracks closer together, so when you write to a position on one track you end up overwriting the adjacent tracks, and the drive has a mechanism to correct for those overwrites; essentially you can pack more data into the same storage medium than you normally get in a spinning disk hard drive. What's really cool, and will segue into Thursday's lecture, is that we had a hardware emulator that could emulate 3D XPoint, the new non-volatile memory storage device that Intel just put out a few weeks ago. This was a year ago, so it was an emulator, not the real thing; I think they sell it now under the name Optane, so if you've seen those articles, that's what it is. And then we had NVRAM, which, and we'll talk about this more on Thursday, you can think of as DRAM in a DIMM form factor but that's
So in the case of 3D XPoint, you're still doing the block writes that you would do with a regular SSD, but when it's an NVDIMM, or NVRAM, you can do byte-addressable loads and stores as if it were DRAM, but you can do flushes to make sure it's actually persistent. Again, I'll explain more about these different technologies next class, but this is a good evaluation for seeing how all the different approaches we talked about today match up with these new storage devices.

So the first basic micro-benchmark we want to do is just to see how fast these different storage devices can do writes with different block sizes, and again, think of this as the eviction and retrieval speeds that the systems can achieve. In this case, lower is better because we're measuring latency. The first thing you see is that for the spinning disk hard drive and the shingled disk drives, the smaller block sizes, the one kilobyte blocks, get you better performance, because when you go to the 64 kilobyte reads, you have to read multiple blocks and fetch them in. In the case of the NVRAM and DRAM, you see that as you increase the block size you get much worse performance. This is expected because these are doing single cache-line reads, and that's the best case for these devices, so going beyond that you don't get much better performance. Another thing to point out is that on these anticipated NVRAM devices, the difference between reads and writes is going to be asymmetric: reads are going to be much faster than writes. With DRAM, writes are a little bit slower, but not as slow, so you see a much greater difference between a read and a write on NVRAM than on DRAM.

So now we want to see what the merge threshold should be for our system. For this, along the x-axis I'm showing the different storage devices, but in the case of the spinning disk hard drive and the shingled magnetic disk, I'm also going to compare against asynchronous retrieval, where again you abort the transaction and restart it and a set of background threads fetches the data, and synchronous retrieval, where you actually stop it and go fetch the data that you need. In this case this is YCSB, so we're only going to be reading one one-kilobyte tuple at a time per transaction. What you see here is that in the case of the spinning disk hard drive and the shingled magnetic disk, the abort-and-restart approach actually ends up being slower; we're measuring throughput here, so higher is better. This is because the software overhead of aborting the transaction, rolling back any changes that it made, putting it in another queue, and then restarting the transaction turns out to be the major bottleneck, versus just blocking and getting the thing that you need. The reason throughput is higher when you merge everything is that you read in the block and it's more likely that future transactions will need that data, and they won't have to block and restart because it would already be merged in. In the case of the solid state devices, we were always running with synchronous retrieval, and the gap between merge-all and the other thresholds is because of the overhead of updating the indexes: if you have a lot of tuples, then when you fetch the page you have to update all of them to change their pointers, and that's why merge-all is actually slower there.
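Here is a toy sketch of those two merge policies. The struct and helper names are hypothetical stand-ins for the real table heap and index, not H-Store's actual code; the point is simply that merging the whole fetched block pays one index-pointer update per tuple, while merging only the requested tuple pays one.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TUPLE_SIZE        512
#define TUPLES_PER_BLOCK  64

typedef struct {
    uint64_t keys[TUPLES_PER_BLOCK];
    char     tuples[TUPLES_PER_BLOCK][TUPLE_SIZE];
    int      num_tuples;
} cold_block_t;

/* Stand-ins for the real table heap and index. */
static void *copy_into_hot_table(const void *tuple) {
    void *loc = malloc(TUPLE_SIZE);
    memcpy(loc, tuple, TUPLE_SIZE);
    return loc;
}
static void index_update(uint64_t key, void *new_location) {
    (void)key; (void)new_location;   /* a real index would swing its pointer here */
}

/* "Merge all": every tuple in the fetched block goes back into the hot table,
 * which costs one index pointer update per tuple. */
void merge_all(const cold_block_t *blk) {
    for (int i = 0; i < blk->num_tuples; i++)
        index_update(blk->keys[i], copy_into_hot_table(blk->tuples[i]));
}

/* "Merge only": reinstall just the tuple the transaction asked for and discard
 * the rest of the block. */
void merge_only(const cold_block_t *blk, uint64_t wanted_key) {
    for (int i = 0; i < blk->num_tuples; i++) {
        if (blk->keys[i] == wanted_key) {
            index_update(wanted_key, copy_into_hot_table(blk->tuples[i]));
            return;
        }
    }
}
```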
And again, you can compare this against DRAM; this is what you would get when everything fits in memory, and as you can see, the NVRAM is not far off. I'm not sure why this bar is actually slightly higher than that; that was a weird bug with the emulator.

So now we want to compare a generic configuration, essentially the same standard configuration that we had in the original anti-caching paper that you guys read, versus an optimized configuration for every single storage device where we think we can get the best performance. What we'll do here is use synchronous retrieval and the merge-everything threshold, since we found those are actually better, and then we'll vary the block sizes for the spinning disk drives versus the solid state drives. With this we're going to use two different benchmarks. The first one is this thing called TATP; sometimes it shows up in the literature as TM1. This is a benchmark that I think Ericsson developed that mimics cell phone tower applications: if you have a car driving down the highway, it has to update the tower to say, here's where to go find me if someone calls my number. That's what this workload represents. For this one, what you see is that the generic configuration is always slower than the optimized one. This is showing that if we tailor how we do eviction and retrieval of data out on disk in our database system to the actual storage device that we have, we're going to get much better performance, and in the case of the NVRAM we're almost getting what DRAM can do, which is pretty impressive. For this application there are a lot of updates and a lot of reads: you're updating the cell phone tower, you're updating the caller list, and you're going and fetching it.

For another workload that is actually insert-heavy, you see that the optimized configuration doesn't perform as well. This is the Voter benchmark, a benchmark that the VoltDB guys developed that models the American Idol TV show, where people call in on the phone and vote for a contestant. They actually originally developed it for the Japanese version of American Idol, which was later picked up by the Canadian version; I think it's just called Canadian Idol. This one is insert-heavy because you have everyone calling all of a sudden in a burst, and for this it turns out that the generic configuration for these other devices, and especially the SMR drive, actually performs better, because you're not paying the penalty of all the extra metadata to keep track of what data is cold; you just write everything out as fast as possible, and that turns out to be the best. In the case of the NVRAM, it's so fast that you don't see a real difference.
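As a rough illustration of the generic-versus-optimized idea, you can think of the optimized setup as a per-device policy table like the sketch below. The block sizes and policy choices here are placeholders for illustration only, not the actual values from the paper.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { DEV_HDD, DEV_SMR, DEV_SSD, DEV_3DXPOINT, DEV_NVRAM } device_t;

typedef struct {
    size_t block_size;          /* eviction/retrieval block size in bytes */
    bool   abort_and_restart;   /* asynchronous retrieval vs. blocking the txn */
    bool   merge_all;           /* merge the whole fetched block vs. one tuple */
} storage_config_t;

/* Pick the eviction/retrieval policy based on the storage device instead of
 * using one generic configuration everywhere. Values are illustrative. */
storage_config_t pick_config(device_t dev) {
    switch (dev) {
        case DEV_HDD:
        case DEV_SMR:
            return (storage_config_t){ .block_size = 1024,
                                       .abort_and_restart = false,
                                       .merge_all = true };
        case DEV_SSD:
            return (storage_config_t){ .block_size = 4096,
                                       .abort_and_restart = false,
                                       .merge_all = true };
        default:   /* 3D XPoint, NVRAM: latency is low enough that small blocks suffice */
            return (storage_config_t){ .block_size = 1024,
                                       .abort_and_restart = false,
                                       .merge_all = true };
    }
}
```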
All right, so what are my parting thoughts about this? Today was really about how to add additional components inside an in-memory database system so that it can use block-oriented storage to extend the amount of space we can maintain for our database, but without getting all the slowdown that we had from a slower disk. In this graph here, it's less than about 10% slower than DRAM; in this case here, it's a lot more, and there's not really an easy way to get around that unless we have the NVRAM stuff that we'll talk about next class.

The other big thing about all of this is that I didn't show any technique for actually writing the indexes out to disk; we assume all of those are going to fit in memory. Now, you might say this is okay, because if we had the right indexes we could still do covering index queries and therefore never even have to look at the stuff that's out on disk. But I would say that in practice, just as there's cold data in your table that you don't really need in transactions, there's going to be cold data in your index. Say you have an auto-increment key and you're always just inserting into the right side of the index; the left side is not going to be accessed at all, so maybe you want to write those pages out, and none of the techniques that I showed here can actually solve that. To the best of my knowledge, the only people that I think handle this in an intelligent way are the HyPer guys from Germany, but unfortunately they don't have a paper yet on what they actually do. I know what they do and I think it's a good idea, but they haven't published it yet, so I can't talk about it. There's another paper on the optional reading list from HP Labs, also from Goetz Graefe, where they do something similar, but they're starting with a disk-based system and trying to make it in-memory oriented, whereas the HyPer guys start with an in-memory system and add the disk to it from the other direction, and they can handle both indexes and tuples.

And so with all of this, I would argue that everything we talked about in this class for solving this problem will go away and be completely unnecessary if we have byte-addressable non-volatile memory. How far away we are from that depends on who you talk to, but in the next class we're going to spend all our time seeing how you actually build a database system to use it, because then you don't have to worry about disk at all. What will happen is that you just allocate a big amount of address space, some of it in NVM and some of it in DRAM, you still use all the pointers you had before, and whenever you access something that's in NVM it's just a little bit slower, maybe 2x to 4x slower, but in the grand scheme of things that's not much compared to going out to disk and fetching things. So I would argue that for a lot of cases, maybe 10 years from now, everything we talked about here today may not be necessary at all. We'll see that on Thursday.

All right, so again, Thursday will be about NVM, as I said, and I'll also provide the sample final exam and be able to answer any questions; we'll go over it at the beginning to give you a high-level idea of what the final exam will be about. I'll also say that on Tuesday next week we'll have Marcel Kornacker from Cloudera come and talk about the Impala system, so we'll have a guest lecture, and that's when I'll do a more comprehensive final review. Okay, any questions? All right, I'll see you guys on Thursday.