Okay, let's get started. Low turnout today because everyone's hung over from the weekend, but we'll go into it.

The first thing we'll discuss is some administrative items as we finish up the semester. The first is a reminder that we're having the final exam in class next week on Wednesday. The basic layout of the exam will be three short-answer, short-essay questions. It's not going to be a multiple choice thing, like "does MVCC do this, does OCC do that?" I'm looking for you to synthesize a bunch of the different lectures and topics we talked about in the class, and I'm asking more higher-minded questions: how do you combine these things, and how do the different pieces of what we covered in the class interoperate? I realize this may be different from what you're used to in most classes, so later this week I'll provide two sample questions showing what I'm going for on the exam, and that'll give you an idea of how to prepare and how to study. The basic idea is that if you followed along with the lectures and you're mostly familiar with the papers we talked about and the assigned readings, then you should be able to answer these questions. Again, it's not going to be that if you missed one lecture you can't answer a question; the questions are designed to encompass topics across a lot of the things we talked about. So again, I'll provide the samples later in the week.

The second thing is that during the final exam period we will do the "final" presentations of your projects. I'm putting "final" in quotes because the final exam period is on May 6th, but your actual code drop for the projects isn't due until May 11th, so it doesn't necessarily have to be that you have everything done on that date. It'll be ten minutes per group, the same thing we did last class. And by the way, I felt the presentations you did last class were excellent, the demos were really awesome, and I'm really excited to see how everyone is progressing. So whereas last time it was five minutes, this time we'll have ten minutes per group, and we'll let Prashant go first because he always gets cut off at the end. There will be pizza for everyone, so there'll be food, and I'm getting donations from the different database companies so there'll be prizes for everyone: a grand prize for whoever we vote to be the best project of the class, and then everyone else gets a consolation prize, t-shirts and shit like that. We have items donated from Snowflake, VoltDB, NuoDB, MongoDB, Splice Machine, and I think one other whose name escapes me. So there'll be stuff for everyone.

The other thing I'll point out is that I've moved the due date for the code reviews to May 8th, after the final presentations. Prior to this they were due on May 1st, but I realized everyone's going to be rushing at the end to write a bunch of code and would prefer this. So in between the final presentation and when you actually have to provide me the source code, you guys are required to do the code review.
You can do it beforehand; it just has to be done by that date. I'll talk about what's expected and what the guidelines are for doing the code review next class, and then I'll announce the pairings for the groups. Each group will be assigned to another group, and it's your job to look at their code and provide suggestions. This is the same kind of thing you would do when you go out into the real world and work at a large software company. All right, any questions about all these things?

Okay, there's also the extra credit stuff as well. The website is almost done, so hopefully we'll have that ready for you this week too. It's not meant to be very taxing or stressful; it's really about enhancing your own education about what's out there in the real world.

The other thing I want to point out: last class I made a comment that Pivotal bought the Quickstep system from Wisconsin, and I said I didn't really know what they're doing with it; there hadn't been any announcement, and I didn't know what their plans were for it. It turns out that just a few weeks ago Pivotal announced they're making Quickstep open source and they've submitted it as an Apache incubator project. The idea is that now they're actually going to try to build something that people can use as part of the Apache ecosystem. This is the link to the proposal, and in the comments they also talk about how they're building a distributed version of the Quickstep system as well, which I thought was kind of interesting. I actually don't know what they intend to do with this, because Pivotal obviously also owns Greenplum, which is their massively parallel data warehouse, and Quickstep is meant to be used for a data warehouse. So I don't really see how the two are going to be used together, or whether they're conflicting, or what the deal is. That's just something to note; somebody on Twitter pointed it out and I wanted to correct it.

All right. So for today's class, what we're going to talk about is dealing with databases that exceed the amount of memory that's available to the system. For pretty much the entire semester we've been assuming that our database is going to fit entirely in DRAM, and yes, we have to be careful about how we pull things into our CPU caches and where we place the data on our different DIMMs, but now we're going to deal with databases that exceed the amount of DRAM that's available. We'll talk about how we can bring back spinning-disk hard drives, SSDs, and other devices without slowing down all the good stuff that we've achieved with our memory-optimized execution engines and systems.

So first the background of why this is hard and what challenges we're going to face. Then I'm going to spend time talking about the different things you have to consider and solve when you actually implement this in an in-memory database system. Then I'll talk about some real-world examples that actually do some of these things, and we'll finish off with an evaluation that takes these implementation issues and sees what the trade-offs are for different workloads. All right, so the motivation for why we want to do this is kind of obvious, right?
DRAM is very fast compared to a spinning-disk hard drive or an SSD, but it's also very expensive, and its capacity is quite limited. I think the best you can do right now is a 32-gigabyte DIMM, so one stick of DDR4 RAM holds up to 32 gigabytes. That means that in order to have a really large database, you'd have to have a motherboard with a lot of DIMM slots in order to scale up to that size. And obviously DRAM is also more expensive than a spinning-disk hard drive or an SSD, not just in terms of buying the hardware itself, but also when you actually run it in your system. There have been reports saying that of the total amount of energy used to power a server, 40% of the energy goes to refreshing DRAM, because every so often it has to cycle through and apply power in order to maintain the charge in the DRAM cells. So 40% of the power is just being spent refreshing DRAM over and over again. It would be really nice if our database system could use the cheaper storage devices that are out there, but without slowing down all the in-memory stuff that we can do. That's what today's lecture is about.

The basic idea, the high-level goal, is that we want to be able to store and access data on disk, but without bringing back all of the disk-oriented machinery that I disparaged and said was slow at the beginning of the semester, when I showed that pie chart that said roughly 30% of the time was spent in the buffer manager, 30% in the log manager, 30% in concurrency control, leaving only around 12% of CPU cycles available to the database system to actually execute queries. We want to bring back disk storage without killing ourselves and slowing ourselves down the way the disk-oriented systems do.

To do this we need to be very careful about how we access this secondary storage, and we need to be mindful of the access method or API we're going to use. When we had an in-memory database system, when all the tuples were in memory, we had a tuple-oriented storage architecture: our indexes point to the exact addresses in memory where the tuples are located, our concurrency control scheme is based on this as well, and we can have pointers from one tuple to the next when we do version chaining for multi-version concurrency control. When everything's in memory we can be completely tuple-oriented. But when we bring back these slower disk devices, they don't allow that same fine-grained access. Instead they have block-based APIs where you read and write, say, four kilobytes at a time. You can still read a single tuple, but you're not going to grab just the one tuple you want; you have to go get the four-kilobyte block that the tuple is located in. So in our architecture we need to be mindful of this and make sure that the decisions we make as we bring back disk account for it, because otherwise we can end up with a lot of problems.

For this class we're mostly going to talk about OLTP, but I'm going to talk a little bit about how to bring back disk for an OLAP system.
For OLAP there's not much we can actually do or exploit in the workload or application to get around the slowness of disk. This is because when you think of these OLAP queries, they're doing full table scans, or maybe scanning a subset of the attributes that you have, but you're basically just scanning from beginning to end. There's not really a lot of optimization you can do to get around that; you're going at the speed of reading things from disk. You can still do all the vectorization and query compilation stuff that we talked about before, and there are other optimizations like compression, but you're still going to have to fetch those blocks and bring them in.

The only technique I know of that people use is to maintain some metadata in memory about what is actually on disk, and the idea is that hopefully you can have queries that can be processed using just the metadata that's in memory, without having to go get the actual items on disk. What I mean by that: say you have a table that has a column A, and that's just the logical schema. The way we would lay that out physically is that in memory we would have pages that correspond to pre-computed aggregations on this column (the min, the max, the average, the count, the standard deviation, all the things you would normally compute in a query) for all the data within a single block, and then out on disk we maintain the block itself. The idea is that if we have a SELECT COUNT(*) query on this column, rather than going and getting the actual page, reading it, and doing the count, we just do a lookup here in memory and look at the pre-computed aggregation, and we avoid having to read the block from disk. This is what MemSQL does, but I would say this is not actually unique to in-memory systems; Vertica does this as well, they just keep the pre-computed stuff in the header of the disk page, so you read the header and maybe you don't have to read the rest of the data. There's nothing precluding a disk-based system from making the same optimization, but it is something we can apply in an in-memory system.

Yes, sorry. The question is: has this been considered, since it's cheap, to have two copies of the data, one stored in its proper place and the other copy where the entire table is a log, so if you want to scan all of it you just plow through the log sequentially? So the statement is: what if you had the table on disk stored both as a log and in its proper place? A lot of times in the OLAP systems, even the disk-oriented ones, the actual table itself is clustered, so it's already stored in sorted order. MemSQL does this, Vertica does this with projections, or you use a clustered index to get the same thing. So there's not really any benefit to having the two copies; you still have to do a sequential scan on disk. If it's OLTP, yes, you want to avoid that, and in the case of MemSQL, which we'll talk about later on, they keep the column store out on disk with that pre-computed metadata in memory, but the row store, the hot stuff that your transactions are accessing, is always in memory.
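To make the pre-computed block metadata idea concrete, here's a minimal sketch in Python. It's my own toy example; the class names and the set of summary fields are assumptions for illustration, not MemSQL's or Vertica's actual format. Each on-disk block gets a small in-memory summary, and simple aggregates over the column can be answered from the summaries without touching the blocks:

```python
# Toy sketch: per-block summaries kept in memory so that simple aggregates on
# a column can be answered without ever touching the on-disk blocks.
class BlockSummary:
    def __init__(self, values):
        self.count = len(values)
        self.min = min(values)
        self.max = max(values)
        self.sum = sum(values)

class ColdColumn:
    def __init__(self):
        self.summaries = []        # always stays in memory
        self.blocks_on_disk = []   # stand-in for the real block storage

    def append_block(self, values):
        self.summaries.append(BlockSummary(values))
        self.blocks_on_disk.append(list(values))  # pretend this is a disk write

    # These aggregates are computed purely from the in-memory metadata.
    def count(self):
        return sum(s.count for s in self.summaries)

    def minimum(self):
        return min(s.min for s in self.summaries)

    def total(self):
        return sum(s.sum for s in self.summaries)

col = ColdColumn()
col.append_block([3, 1, 4, 1, 5])
col.append_block([9, 2, 6])
print(col.count(), col.minimum(), col.total())   # -> 8 1 31
```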
The bottom line is that even in a disk-oriented system, if you ever have to go to disk while you're executing a transaction, you're screwed; it's always going to be slow. Usually what happens is you can use a system like MySQL and just give it enough RAM that the working set size fits in DRAM, but again you still pay the penalty of having the buffer manager and all of that disk-oriented architecture. What I'm talking about here is that we can go get stuff on disk as needed, but we're going to try to avoid the buffer pool and all that. We're going to assume that most transactions will access everything in DRAM, and if we ever have to go to disk we can go get it; we'll talk about smart ways of doing that later on. For OLAP, it's always just doing a sequential scan anyway.

Okay, so for OLTP it's different, and it's exactly what I just said: we can exploit the fact that most OLTP workloads bifurcate between hot data and cold data. Most of the transactions are going to be doing their reads, writes, and modifications on the hot data, and we want to keep that in memory. Then as data ages or becomes colder, we evict it and push it out to our secondary storage, but we still maintain information in memory about what that data is and where it's located, so that if a transaction comes along and tries to access it, it doesn't get a false negative, where we say we don't have the data when we actually do.

The example I've used before and that I always really like is eBay. At eBay, when an auction starts, say it lasts for seven days. During that period you get a lot of transactions that are updating it, adding bids, the seller might be modifying the auction itself; there's a lot of traffic and the data is hot, so you want to keep it in memory. But after those seven days the auction ends, and the only people coming back to look at it, for the most part, are the seller and the bidder, and after 90 days eBay purges it and puts it in the back-end data warehouse. So if we know, maybe before that 90-day timeout, we could move the data out to secondary storage and not pay this huge penalty of keeping everything in DRAM. That's the basic idea of what we're trying to do.

So I want to go through an example now, at a high level, of what we're trying to do, and as we go along I'll show all the different questions and things we have to figure out in our implementation to make this work. Then afterwards we'll go through the different ways we can handle each of those questions. Let's say we have some mechanism in our database system to identify what data is cold, meaning what data we don't think is going to be needed by future transactions, and therefore we want to move it out to our cold data storage. For this we can just assume we have some sort of disk-resident hash table.
It doesn't matter whether that's an SSD or a spinning disk drive; we'll talk about those issues later, but there's some place where we can store blocks of data. So say what we want to do is take these cold tuples and move them out into a block. The evicted-tuple block essentially looks like the slotted pages we talked about before: you have a header that says this is the table the data came from and this is the checksum for it, and then the tuples are laid out in sequential order. If your tuples have variable-length fields, like a VARCHAR or a blob, then obviously we need to take that data from the variable-length pool and inject it into the block as well, so we take all the data we have, not just whatever is stored inline in the heap.

Now the question we have to deal with is what we're actually going to keep in memory. Do we need to maintain some metadata that says we used to have tuples 0, 1, 3, and 4, they used to be in memory but now they're out on disk, and here's where they're located if you ever need to access them again? And then what is our index actually going to point to? Before, it just pointed to the tuples when they were in memory, but now they've been moved out. Should it point to the tuples themselves? Should it point to an auxiliary data structure that says here's where they're located? Should we just completely remove them and maybe have another data structure that lists our evicted tuples?

Then say a query comes along doing a lookup on the ID, and it wants to find tuples 0 and 1. Since we know those are out on disk, using whatever mechanism we're keeping track of these things with, we now want to copy them back into memory, because we can't have the transaction operate on anything that's not in memory. So now the question is: how much data are we going to copy? Are we going to copy all the tuples in this block even though we only needed one of them? And when we copy it back in, do we always merge it back into the table heap, or do we put it into a temporary buffer, let this transaction operate on it, and then throw it away when it's done? It's not as simple as "I'm going to copy things in and put them back"; there are all these other things you have to deal with to make this work efficiently, because we're trying to avoid having a buffer pool.

And then of course the question is what happens to the query that accessed the tuple. Should we just pause it, stall it, while we go fetch the thing that we need, or should we abort it, have a separate background thread go fetch the data it needs, and then re-queue and re-execute it when that data is ready? In a disk-oriented system you would stall it, but in some of the main-memory systems like H-Store you may actually want to abort it, because it's using a coarse-grained concurrency control scheme, and stalling would cause all the other transactions to wait while you're fetching the data. All right, so these are all the things that we're going to have to deal with.
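Before we get into those questions, here's a minimal sketch of what packing cold tuples into an eviction block might look like: a small header (table ID, tuple count, checksum) followed by the tuples with their variable-length values serialized inline. The exact header fields and encoding here are made up for illustration; they are not H-Store's or anyone else's actual format.

```python
import struct
import zlib

# Sketch: serialize a batch of evicted tuples into one block.
# Assumed header layout: table id, tuple count, CRC32 checksum of the payload.
def build_eviction_block(table_id, tuples):
    payload = bytearray()
    for key, name in tuples:                       # (int key, str varchar field)
        name_bytes = name.encode("utf-8")
        # Fixed-width key, then a length-prefixed variable-length value, so the
        # varlen data travels with the tuple instead of staying in the heap.
        payload += struct.pack("<qI", key, len(name_bytes)) + name_bytes
    header = struct.pack("<III", table_id, len(tuples), zlib.crc32(payload))
    return bytes(header) + bytes(payload)

def read_eviction_block(block):
    table_id, count, checksum = struct.unpack_from("<III", block, 0)
    payload = block[12:]
    assert zlib.crc32(payload) == checksum, "corrupted block"
    tuples, offset = [], 0
    for _ in range(count):
        key, length = struct.unpack_from("<qI", payload, offset)
        offset += 12
        tuples.append((key, payload[offset:offset + length].decode("utf-8")))
        offset += length
    return table_id, tuples

block = build_eviction_block(7, [(0, "alice"), (1, "bob"), (3, "carol")])
print(read_eviction_block(block))   # -> (7, [(0, 'alice'), (1, 'bob'), (3, 'carol')])
```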
I'm breaking these issues up into three categories. The first I call runtime operations: this is the mechanism we're going to use, as transactions and queries access data, to keep track of what data they're using, which lets us identify later on, when we run out of memory, which tuples are cold so we can move them out to disk. Then, when we go ahead and do this, we'll talk about when we actually move the data out, and what metadata we're going to keep in memory to keep track of which tuples have been evicted. And then, when a transaction tries to touch evicted data, we'll talk about the policies we're going to use to bring it back in: how much data from the block we're going to bring in, what mechanism we're going to use for stopping or stalling the transaction while it gets the data it needs, and whether we can be smarter about how we decide to bring data back into memory; maybe we don't bring in the entire block every time, and maybe we don't want to bring it back in at all unless we know the tuple is actually hot again. We'll go through each of these one by one, and then we'll finish off talking about how real-world systems actually implement these different policies.

All right, so the first one is cold tuple identification. The first approach is what I'll call online identification, where the database system monitors transactions' access patterns as they run and maintains some kind of data structure that allows it to identify the hot tuples and the cold tuples. The reason we have to do this is, again, that we don't have a buffer pool that would be tracking this kind of information for us. In a disk-oriented system, the buffer pool maintains some kind of LRU or ARC or other tracking policy in order to decide that a block is now cold, so that when it needs to free up a frame it can grab that block, evict it, and throw it out to make space. When everything was in main memory we didn't have to worry about this, because we didn't have to worry about running out of memory; but now that we want to do this, we have to add the same kind of thing back in, and since we don't have a buffer pool anymore, we have to maintain this stuff ourselves.

The way you can do this is to maintain your own LRU chain between tuples by embedding pointers in the actual tuple headers themselves. Every single time a transaction accesses a tuple, whether it's a read or a write, you extract the tuple from the LRU chain and put it at the front; that way, when it comes time to evict data, you just grab the tuples at the back, because you know those haven't been accessed recently. The reason we would embed this metadata directly in the tuples, rather than having a separate data structure, is that we avoid the cache misses of going to update a separate LRU structure: we already have to touch the tuple anyway, and since the linked list is embedded in the tuple itself, we don't take an extra cache miss accessing a separate thing.
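Here's a rough sketch of that intrusive LRU idea: the prev/next links live in the tuple "header" itself, so touching a tuple on a read or write also splices it to the front of the chain without consulting a separate lookup structure. The class and field names are mine, not H-Store's:

```python
class Tuple:
    """Tuple with the LRU links embedded directly in its header fields."""
    __slots__ = ("key", "data", "lru_prev", "lru_next")

    def __init__(self, key, data):
        self.key, self.data = key, data
        self.lru_prev = self.lru_next = None

class TupleLRU:
    """Doubly linked chain threaded through the tuples themselves."""
    def __init__(self):
        self.head = self.tail = None   # head = most recent, tail = eviction candidate

    def _unlink(self, t):
        if t.lru_prev: t.lru_prev.lru_next = t.lru_next
        if t.lru_next: t.lru_next.lru_prev = t.lru_prev
        if self.head is t: self.head = t.lru_next
        if self.tail is t: self.tail = t.lru_prev
        t.lru_prev = t.lru_next = None

    def touch(self, t):
        """Called on every read/write: move the tuple to the front of the chain."""
        self._unlink(t)
        t.lru_next = self.head
        if self.head: self.head.lru_prev = t
        self.head = t
        if self.tail is None: self.tail = t

    def coldest(self):
        return self.tail

lru = TupleLRU()
a, b, c = Tuple(1, "a"), Tuple(2, "b"), Tuple(3, "c")
for t in (a, b, c): lru.touch(t)   # access order: a, b, c
lru.touch(a)                       # a becomes hot again
print(lru.coldest().key)           # -> 2, so b is the eviction candidate
```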
Another approach is to do offline identification, where you maintain an in-memory log of how transactions access tuples, and then a separate background thread comes along and computes a histogram of the access frequencies of the different tuples, the same information we were building up here. The difference is that there a background thread maintains this stuff, whereas in the first approach we're doing it online.

So now that we can identify which tuples are cold, the next question is when we should trigger an eviction. Yes, sorry. The question is: these approaches rely on the usage patterns, but they're not relying on other semantics, like the eBay constraint that after 90 days the data is going to be gone; is there a way people encode those things, maybe even with a machine learning approach? So what I'm describing is a hard-and-fast rule in terms of exactly how transactions are using the data, but there may be higher-level logic in the application itself that says, for example, "I know that after 90 days I want to purge everything out." You can think of what I'm describing as the mechanism you use when you see your database growing in size and you need to move stuff out to disk now, because otherwise you'll start thrashing or the Linux OOM killer will kill you. It's a failsafe mechanism before you get to the higher-level logic. I'm not really talking about it here, but there are also other things we've tried where you can be smarter about how you package up the tuples you write out in blocks, whether you use some kind of machine learning technique to identify that things are related to each other, or even just foreign keys would tell you this: here's Andy's account, here are Andy's orders, Andy hasn't logged in for 90 days, so let me move out not only his customer information but also his orders. That's the kind of thing that might help and that you could embed directly in the system; the higher-level stuff, like the eBay example, needs a DBA to invoke it manually.

All right, so the next issue is when we actually evict stuff. As I said, the way this is mostly going to be used is that there's a threshold the administrator defines: when the resident set size of my database in memory gets near, say, 80% of the total DRAM capacity I have in my system, I'll go ahead and initiate the eviction protocol and start moving data out. In this case, because the database system is managing memory and managing things on disk itself, it has to have a separate background thread to go move the data out and then update whatever metadata it keeps in memory to track what was evicted.
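As a toy illustration of that trigger, the check the background thread runs might look something like this. The 80% number is just the example threshold from above, and the surrounding structures (a list standing in for the LRU chain and the cold store) are hypothetical:

```python
EVICTION_THRESHOLD = 0.80      # start evicting at 80% of the memory budget
MEMORY_BUDGET = 1_000_000      # bytes; a made-up number for the sketch

def maybe_evict(resident_bytes, lru, cold_store, batch=4):
    """Periodic check run by a background thread: if the resident set size has
    crossed the threshold, peel tuples off the cold end of the LRU chain and
    write them out as one eviction block (here, appending to a list)."""
    if resident_bytes <= EVICTION_THRESHOLD * MEMORY_BUDGET:
        return None
    victims = [lru.pop() for _ in range(min(batch, len(lru)))]  # coldest first
    cold_store.append(victims)          # pretend this is the on-disk block write
    return len(cold_store) - 1          # block id that the tombstones would record

# lru here is just a list ordered hottest-to-coldest; cold_store is a list of blocks.
lru = ["tuple%d" % i for i in range(10)]
cold_store = []
print(maybe_evict(resident_bytes=900_000, lru=lru, cold_store=cold_store))  # -> 0
```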
The other approach is to use the dreaded mmap, where we use the operating system's virtual memory mechanism to let the OS decide on its own how to move data out. This is done in the background, the database system doesn't have to maintain any extra information, and it's really up to the operating system to decide when to do this; it can obviously see that the resident size is getting too big and start moving stuff out. There'll be an example of how this is done in MemSQL and in a variant of VoltDB that was developed by researchers in Switzerland. I've always derided using mmap, but I think the way the EPFL people do it is actually pretty smart, and we'll see later on how they avoid having a transaction take a page fault when it tries to access evicted data, or at least minimize the impact of that.

All right, so now we get to the next big question: what are we actually going to keep in memory to keep track of what data has been evicted? The first choice is what I'll call tombstones. This is basically a special tuple or marker we keep in memory that says there used to be a tuple located here, it's no longer in memory, but here's the block ID and the offset within that block on disk where you can go get it. So when you evict the tuple you create this new tombstone for it, keeping track of the block ID and offset, and then you update all the indexes to point to that tombstone. Then if a transaction comes along and does a lookup in the index, it follows the pointer, sees that it landed on a tombstone, and therefore knows the data it really wants is not there. I'll show you an example of that in a second.

The other approach is to maintain some kind of approximate data structure, like a bloom filter, to keep track of the tuples that used to exist in memory but no longer do. Then, if you need to get the actual tuple, there's a disk-resident index that you have to do a lookup on, which means you need something like a buffer-pool-style manager to bring in the blocks of that index from disk, find what you want, and then get the actual data. The issue with this is that if you have a query that wants to do a lookup on an index, you first check the in-memory index to see whether the tuple exists, and then you may also have to check the bloom filter as well. The bloom filter can give you false positives, but it's guaranteed never to give you false negatives. So the bloom filter might say, "yes, the tuple you're looking for is not in memory, but I know it's out on disk," and then when you go look it up in the disk-resident index you see that it's actually not there. So you may end up doing more work than you need to when you use a bloom filter, but in exchange you use less memory than you would by maintaining the entire index in memory. In the first approach, with tombstones, all the keys for your evicted tuples still have to be present in every index on that table.
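Here's a minimal sketch of the tombstone idea: a marker holding just the block ID and offset, and an index lookup that notices it landed on a tombstone and goes out to cold storage (whether to stall or abort at that point is the policy question we get to later). All of the names here are made up for illustration:

```python
class Tombstone:
    """Marker left in memory for an evicted tuple: where to find it on disk."""
    __slots__ = ("block_id", "offset")
    def __init__(self, block_id, offset):
        self.block_id, self.offset = block_id, offset

# The index maps key -> either an in-memory tuple (here just a dict) or a Tombstone.
index = {
    42: {"id": 42, "name": "hot tuple, still in memory"},
    7:  Tombstone(block_id=3, offset=128),
}

def lookup(key, fetch_block):
    entry = index.get(key)
    if entry is None:
        return None
    if isinstance(entry, Tombstone):
        # Evicted: go out to cold storage. (Policy question: stall here, or
        # abort the transaction and let a background thread do the fetch.)
        block = fetch_block(entry.block_id)
        return block[entry.offset]
    return entry

fake_cold_storage = {3: {128: {"id": 7, "name": "evicted tuple"}}}
print(lookup(7, fetch_block=lambda bid: fake_cold_storage[bid]))
```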
With the bloom filter approach, in contrast, we just have one bloom filter per index, so it ends up using less memory, but again you may have false positives, though you won't have false negatives. And then the last approach is to use virtual memory again, the mmap stuff from the last slide. This is kind of nice because you don't have to maintain any information about which tuples are missing or out on disk, because the OS tracks all of that for you internally. We'll see how the EPFL version of VoltDB does this in a second.

All right, let's go through the first two examples, the tombstone one and the bloom filter one. We'll start with the tombstone one. Here we have an in-memory table heap, and then we have our block-based cold storage. The first thing we do is use either offline or online identification to compute the access frequencies of the various tuples, and we identify that these three tuples here are the ones that have not been accessed recently, so they're eligible to be evicted, and we go ahead and write them out to a block on disk. Now, if we're using the tombstone approach, we have to deal with the issue that our indexes used to point to these tuple locations, but now there's nothing there. Those slots go back into the free pool, and the database system can reuse them for new tuples from other queries or transactions. So we have to update our indexes, because another transaction might write, say, tuple 10 into one of those slots, and our index would still be pointing at it, which would be incorrect.

What we're going to do is create these tombstone tuples in a separate memory region for the table, and then update all our indexes to point to them. Again, these tombstone tuples have a block ID and an offset that point to the location out in the cold data storage on disk where you can go find the real data. So now what happens is that a transaction comes along, does an index lookup, follows the pointer, and checks a little flag in the header that says whether it's looking at a tombstone tuple or a real tuple. If it's a real tuple, it processes it just like before; if it's a tombstone tuple, it knows it needs to go fetch the data from disk, and again there are different policies that decide whether you just stall and wait while you go fetch it, or abort the transaction, restart it, and fetch the data in the background.

Obviously the thing we have to worry about is that if, say, the block ID is a 32-bit integer and the offset is a 32-bit or 16-bit integer, and the real tuple is smaller than the tombstone, then we're not really gaining anything here. Furthermore, if there are a ton of indexes (remember I said before that in some cases, for OLTP workloads, 50% or more of the actual memory usage of a table or the whole database can be all in indexes),
then you're not really gaining that much either. If our indexes are very narrow and we have a lot of tuples, the size difference between the tombstone and the real tuple is not going to be that great. And the reason we have to store the tombstones in a separate area is that they're going to be a different size from the real tuples, and remember the table heap is a fixed-length pool, so you can't embed the tombstones in the table itself because everything would get misaligned. The nice thing about this approach, though, is that when we do a sequential scan without the indexes, we still hit all our tuples; our iterator implementation just has to know that once it reaches the end of the regular table heap, it should start scanning the tombstone heap as well. All right, so that's the tombstone approach.

In the bloom filter approach we don't have any tombstones; we can reuse those slots for new tuples as needed. For every single index we have a bloom filter that lets us do a check: first check the in-memory index, and if the key isn't there, also check the bloom filter. If we think it's in there, then there's a disk-resident data structure we can go fetch to find the actual block location of the tuple we were looking for; you can think of it as a disk-resident version of the index that tracks just the data that's on disk. And depending on what kind of index and what kind of query you have, you may not actually have to check the bloom filter. Say I do a primary-key lookup or a unique-key lookup and I get a match in my regular index: I know I don't care about anything in the bloom filter, because the key has to be unique and it's already in memory, so it's definitely in memory. This is from the Hekaton guys, and they also have a variant of a bloom filter, I think it's called an adaptive range filter, that lets you handle scans with an approximate data structure the same way a bloom filter handles key lookups. A bloom filter can only do point lookups on keys; if you need to do a sequential or range scan this won't work, so they have an additional data structure to deal with that problem.
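Here's a minimal sketch of the bloom filter flow. This is a deliberately tiny filter (real systems size and hash these carefully): check the in-memory index first, consult the per-index filter only on a miss, and only go to the disk-resident index when the filter says the key might be out there.

```python
import hashlib

class BloomFilter:
    """Tiny bloom filter: false positives are possible, false negatives are not."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

# One filter per index, tracking which keys have been evicted out to disk.
in_memory_index = {42: "hot tuple"}
evicted_filter = BloomFilter()
evicted_filter.add(7)            # tuple with key 7 was evicted

def lookup(key, disk_index_lookup):
    if key in in_memory_index:            # hot path: nothing else to check
        return in_memory_index[key]
    if evicted_filter.might_contain(key):
        return disk_index_lookup(key)     # may still miss: false positive
    return None                           # definitely not on disk either

print(lookup(7, disk_index_lookup=lambda k: f"tuple {k} fetched from cold storage"))
```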
Okay, so now we've evicted data out to disk: we've identified what's cold, we've decided how we're going to move it out, and we've decided what metadata we're going to maintain in memory to keep track of everything. Now a transaction or query comes along and wants to access one of these evicted tuples. How are we going to deal with this? The first issue is what happens when we go fetch the block that has the tuple we want. The in-memory stuff is all tuple-oriented, so we can be more fine-grained with our access methods, but cold data storage is a block-based device, so we're always going to get a whole block of data. That means that when we need one tuple, we're going to fetch a block that may have multiple tuples in it, and the question is what we do with the other tuples that we didn't really need.

The first approach is to merge back into the regular table heap only the tuples that were actually requested. The idea is that we only needed one tuple in this block; we'll grab the block that has the data for multiple tuples, but we'll pick out just the one tuple we want and merge that back into memory, and that also means going back and updating all the index pointers for that particular tuple, changing them from pointing to the tombstone to pointing at the regular tuple that's back in memory. To do this, though, you have to maintain some extra metadata in memory about when you excise tuples out of blocks, because now you're going to have these holes. What you don't want is this: I pull a block back in and grab just the one tuple I need, but later on it becomes cold again and I write it out to another block; I need to make sure that if I crash and try to recover from my cold storage, I recognize that the first copy is no longer valid and it's the second one I really want, because I now have two copies of the data. So there's additional bookkeeping you have to maintain in your log and in your checkpoints if you want to do this first approach.

The second approach is pretty simple: you just grab the block with its bunch of tuples and merge them all back into main memory, and you don't worry about holes or additional bookkeeping. The downside is that you're merging back in tuples that weren't really requested and are probably still cold, you pay a CPU overhead to update all the indexes to point back into the table heap rather than at the tombstones, and soon after they're merged back in you're probably going to run out of memory again and have to go evict the exact same tuples you just merged back in. So you get this ping-pong effect where you merge tuples back in, they're cold, you evict them again, and they keep getting copied back and forth over and over. So these are the two issues we have to deal with when we pull a block back in.
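As a rough sketch of those two policies, here's a toy fetch-and-merge that either reinstalls only the requested tuple (leaving "holes" that would need the extra bookkeeping described above) or reinstalls the whole block. The dictionaries are just stand-ins for the heap, the index, and the fetched block:

```python
def fetch_and_merge(block, wanted_key, heap, index, merge_all=False):
    """Pull an eviction block back in. Either reinstall only the tuple the
    query asked for (leaving 'holes' in the on-disk block that have to be
    remembered for recovery), or reinstall every tuple in the block."""
    merged = []
    for key, tup in block.items():
        if merge_all or key == wanted_key:
            heap[key] = tup        # back into the in-memory table heap
            index[key] = tup       # repoint the index from the tombstone
            merged.append(key)
    holes = [k for k in block if k not in merged]   # still only valid on disk
    return merged, holes

block = {0: "tuple0", 1: "tuple1", 3: "tuple3"}    # fetched from cold storage
heap, index = {}, {}
print(fetch_and_merge(block, wanted_key=1, heap=heap, index=index))
# -> ([1], [0, 3]): tuple 1 is hot again; 0 and 3 are holes we must remember
```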
Now the question is what we do with the query or transaction that tried to access the evicted data while we fetch the data it wants from disk. The first approach is to abort the transaction, roll back any changes it made, and put it into a separate queue on the side. Then a background thread goes and fetches the block you need and merges in the single tuple or all the tuples, it doesn't matter; once the data is ready and merged back into the table heap, you restart the transaction and let it execute again. Now when it accesses the tuple it wanted before, it'll be in memory, and it doesn't have to restart.

One of the problems with this is, first, that you're assuming you're running everything as stored procedures, a VoltDB-style system where all the program logic and the queries the transaction executes live directly in a stored procedure, so it's really easy for you to restart it. If you're running in a conversational style, where all the program logic is on the application side in PHP code or something else, you can't quite do this if there are a lot of intermediate steps between the different queries. Another problem is with consistency for queries that touch a lot of data. Say I have a table that's one gigabyte in size, but I only have 512 megabytes of RAM, so half the table can be in DRAM and half has to be on disk. When my query starts, I scan the first half, and that's all in memory, so there's no problem; but then I get to the second half, which is out on disk, so my transaction gets aborted, the first half gets swapped out, and the second half gets brought in. Now, when I restart my query, because I have to maintain consistency I have to start at the very beginning again, so I hit the first half, which has now been evicted, and I have to do this all over again. I'll never be able to finish this query with this approach if I'm trying to guarantee strong consistency, the ACID guarantees of the transaction. So that's the problem with this approach.

The other approach is to do synchronous retrieval, which is essentially what a disk-oriented system would do. Yes, sorry. The question is about having two transactions, each with one query: the first transaction hits something on disk and therefore has to be aborted, while the second transaction has everything in memory. The second one is free to finish: as long as it does not see anything that was modified by the first transaction, it's allowed to finish without any problems, and you still guarantee serializability. And to the follow-up, about the first one modifying something that the second one sees: no, serializability just means the transactions have to execute in some serial order, but it doesn't have to be the serial order in which you submitted them. What you're referring to is external consistency, and no database system guarantees that other than Spanner, as far as I know. So, just to make sure everyone sees the issue: this is serializability.
This is serializability if I execute two transactions, right and the application submits transaction number one followed by transaction Number two the database system does not have the guarantee that one will finish before two It can actually execute them anywhere at once as long as they're they execute in a Schedule that produces a database that is equivalent to one that was generated by transaction that ran in a serial schedule So there's there's no problem with this. Yeah All right, so the second approach is dealing with It's basically what a disk-based system would do is like soon as you try to touch something That's that's out on disk. I stall The the the IO thread fetches the thing I need and merges it back into memory And then once that's ready then my transaction can pick up and keep on going of course now if I'm holding locks during this time Then that just makes everybody go slower because I can't you know I have to wait until my data comes back into memory before I can release those locks Right, so this is yes. Sorry You're going to do this long running queries and there will be short queries that swap blocks out The abort and restart doesn't seem like a good choice there, right? Absolutely. Yes. So the statement is If you have if you're doing transactions, I have to touch a lot of data and therefore you're doing this important restart It's gonna be a bad idea. Absolutely. Yes. And so again, this is sort of why we're focusing this on OTP because For the most part it's gonna go grab a small number of things, right? This obviously would not work between a lot because you had to scan the entire table, right? but and then but there may be a case where Then you have a fat finger transaction where you're doing you're updating everybody's salary because you're all by one or something In that case you couldn't do this because you have this that's ration effect that I talked about before There's also very tightly related to the the conventional scheme is being used This approach might be better. Oh, we'll see the second. It's not yeah our original hypothesis was this this approach is better in a whole TV a store style scheme because Since your execution end is they're single-threaded if you have to stall them and go fetch the data that you need for a particular transaction Other transactions are waiting to be quite a lot, right? And therefore it's better to board it Put it beside you and then other transactions keep on going again and we guarantee things are still done serializing orders, so we're okay Yeah, for all that this is problematic, but we'll see and later in class. 
We ran experiments on this, and yes, that's the spoiler; you can probably guess which way it goes.

All right, the next issue we have to think about is this: when we go grab a block of data, before we said we can decide to merge just the single tuple or the entire block, but maybe we don't actually merge any of it. What I mean is that we have the ability to say, well, this tuple is cold, we know this because we evicted it before, and chances are it's still probably cold, so maybe we don't even bother merging it back into memory at all. We'll put it into a temporary buffer, let the transaction that needed it do whatever operation it wants on it, and when it finishes we just throw it away and free the memory. If another transaction comes along later on, we'll do the same thing, but now we don't worry about using up memory for a bunch of cold data that we'd just have to evict again later.

So the first policy is to always merge the data in, which is what I talked about before. The second policy is to only merge tuples into the heap if they're being modified by an update query. If you're doing a delete, it doesn't matter; you just maintain some metadata saying the tuple has been deleted. And you can't insert a cold tuple, that doesn't make any sense, so it's really only for updates that we merge the data back in. For selects, we bring it in, let the query operate on it, and throw it away as soon as it's done. And then selective merge is where, regardless of whether it's a select or an update operation, we make a decision about whether we should merge it back into main memory based on how that data has been used in the past.

Basically, the same way we track the usage patterns of individual tuples in memory, we can track the usage patterns of blocks as well. The idea is that we maintain a histogram of the access frequencies of the different blocks over a window, a fixed period of time that's always moving forward. When we go fetch a block needed by a transaction, we look in this histogram and make a decision based on how often it's been accessed before: if it's above some top-K threshold, we merge the tuples back in, because we know it's being accessed over and over again and has probably become hot again; if it's not above the threshold, we just throw it away and save our memory space. Obviously you don't want to have to maintain a lot of extra metadata to be able to do this, so the way we do it is with another approximate data structure called a count-min sketch. You can think of it like a counting bloom filter: it maintains an approximation of how many times we've seen a given block in the past.
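Here's a minimal count-min sketch and how the selective-merge decision might use it. The width, depth, and threshold numbers are arbitrary, and the hash construction is just one simple way to do it:

```python
import hashlib

class CountMinSketch:
    """Approximate per-block access counters for the selective-merge decision."""
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cols(self, block_id):
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{block_id}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.width

    def record_access(self, block_id):
        for row, col in enumerate(self._cols(block_id)):
            self.rows[row][col] += 1

    def estimate(self, block_id):
        # Never under-counts; hash collisions can only inflate the estimate.
        return min(self.rows[row][col] for row, col in enumerate(self._cols(block_id)))

sketch = CountMinSketch()
for _ in range(50):
    sketch.record_access(block_id=12)     # block 12 keeps getting touched
sketch.record_access(block_id=99)

HOT_THRESHOLD = 10                        # arbitrary cutoff for this sketch
for blk in (12, 99):
    decision = ("merge back into the heap" if sketch.estimate(blk) >= HOT_THRESHOLD
                else "serve from a temporary buffer and discard")
    print(blk, decision)
```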
All right, so those are the main issues you have to deal with when we want to have larger-than-memory databases. Now I'm going to go through five different examples and show exactly how they line up in this taxonomy.

The first one is something I worked on called anti-caching, in H-Store. The basic idea is that you use BerkeleyDB as an on-disk hash table; we turn off all the locks and all the other transactional stuff that BerkeleyDB provides and just use it as disk-resident storage. For this we use online identification, where we maintain an LRU chain inside the tuples themselves, and the administrator defines a threshold for when you want to start evicting stuff. We use tombstones, the initial implementation used the abort-and-restart retrieval policy, and we have block-level granularity where we merge the entire block back in, and we always merge it in; we didn't do the selective merging I just described. This was the assigned reading for today's class, but the basic ideas are applicable to all the other systems as well.

In Hekaton they had this thing called Project Siberia, which I don't know whether it actually made it into the commercial version; it came out of MSR, and they have a bunch of papers about it. This one uses offline identification, where you have the log and a background thread figures out the access frequencies of tuples, and it also uses an administrator-defined threshold to move data out. They use the bloom filters and the on-disk stores that we talked about before, and as far as I know they do synchronous retrieval, tuple-level granularity, and they always merge data back in rather than trying to be selective.

For EPFL, the researchers there took a variant of VoltDB and modified the execution engine to support mmap-backed storage instead of malloc storage. They're doing the same thing Hekaton does with offline identification, but they use virtual memory to manage moving data out. Because it's virtual memory, retrieval is going to be synchronous: your transaction tries to access something, hits a page fault, and the OS stalls the thread while it goes and gets the data you need. They're doing page-level granularity, and the reason that's different from block granularity is that they don't really control the boundaries of where tuples lie within the pages; the OS does all that for them. And they use the always-merge policy, because again the OS doesn't let them control this.

Now, I've derided mmap multiple times in this class, I've said to put "don't use mmap in your database system" on my tombstone, but what they actually do is kind of clever. They take the in-memory table heap and split it up between hot storage and cold storage. The way it works is that you use madvise to tell the OS that the hot region of memory for your process is pinned and the operating system is not allowed to evict anything in that part. Then they maintain the log of how transactions access tuples, and the background thread decides that a given tuple is cold and moves it down to the lower, unpinned region, where the operating system is allowed to move stuff out. Anything that gets added to the database always ends up in the hot region, and when the unpinned region goes above some threshold the operating system is maintaining, the OS will go ahead and move data out. This is all transparent to the database system itself: the OS is doing all of it for you, but it knows that only the one region is allowed to have stuff moved out.
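Here's a rough, Linux-only sketch of that hot/cold split. The lecture describes it in terms of madvise; in this sketch I use mlock, which is the call I'm sure keeps pages resident, so treat the exact system call choice (and everything else here) as an assumption rather than what the EPFL VoltDB variant actually does:

```python
# Linux-only; will not run as-is on other platforms.
import ctypes
import mmap

PAGE = mmap.PAGESIZE
HOT_PAGES, COLD_PAGES = 4, 4

libc = ctypes.CDLL("libc.so.6", use_errno=True)

# One contiguous anonymous region: hot pages first, cold pages after them.
region = mmap.mmap(-1, (HOT_PAGES + COLD_PAGES) * PAGE)
base = ctypes.addressof(ctypes.c_char.from_buffer(region))

# Keep the hot prefix resident; the cold suffix is left to the OS, which is
# free to page it out under memory pressure.
if libc.mlock(ctypes.c_void_p(base), ctypes.c_size_t(HOT_PAGES * PAGE)) != 0:
    print("mlock failed (errno %d); continuing unpinned" % ctypes.get_errno())

def demote(tuple_bytes, cold_offset):
    """Background-thread step: copy a cold tuple down into the unpinned region."""
    start = HOT_PAGES * PAGE + cold_offset
    region[start:start + len(tuple_bytes)] = tuple_bytes

demote(b"evicted tuple payload", cold_offset=0)
print("cold region starts with:", region[HOT_PAGES * PAGE:HOT_PAGES * PAGE + 21])
```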
So what's one possible problem with this, for anybody that's used mmap? The problem is that if we don't control where the page boundaries fall in memory, it may be the case that a tuple lands at a memory location that spans multiple pages. So it gets split up, and when things get evicted, half the tuple goes out in one page and the other half goes in another page. Now when I do a read on this tuple, if I have to read the entire thing, it's two separate page faults, two separate reads, to go get the data and reconstruct the entire tuple. Whereas before, in anti-caching and Hekaton, the database system is managing the blocks of cold tuples itself, so we can make sure a tuple aligns entirely within a single block and we don't have this problem. Now, this only happens for tuples that fall across page boundaries, and if we know our page size and try to be careful about it, it's not like every single access requires two page reads; it's only the tuples that cross a boundary. So I don't think this is a huge penalty, but it's one of the things you lose when you let the operating system manage your stuff for you.

Okay, the next one we'll talk about is Apache Geode, and full disclosure, I don't know whether exactly what I'm telling you is 100% correct, because it's very confusing. Apache Geode, as far as I know, used to be called VMware SQLFire, and then when they merged with EMC and became Pivotal they renamed it GemFire, and then they open-sourced it to become part of Apache as Apache Geode, and then there's this other thing they sell called GemFire XD which uses Apache Geode. So I'm not exactly sure everything I'm saying is 100% correct, but this is what I understood them to do when I used to read the documentation for GemFire XD, before it became Geode. They're doing online identification with an administrator-defined threshold, and they're using tombstones with synchronous retrieval. The way they do this is that basically you say, all right, when my table gets larger than the size I want it to be, start shoving tuples out into a log-structured store on HDFS; and then any time a transaction or query tries to access the log-structured data, they retrieve it from disk and just read it there, unless it's being updated, in which case they append it back to the log structure that's in memory.
All right, so I have these question marks here because I've read the documentation a couple of times, for both the old version and the new version, and I think this is what they're describing, but I probably should have contacted them.

The last one is MemSQL, and this one's actually very different from everything we've talked about today, because they bring in cold data storage, disk storage, as a column store that's separate from the actual hot data, which lives in a row store. What I mean by that is: in all the mechanisms we talked about before, you have a single logical table that's exposed to the application, and the database system automatically decides how to move cold data out to disk, but it still appears as a single logical table. In MemSQL, when the administrator creates a table, they can specify that it should be an on-disk column store, and it appears to the application as a separate logical table. Using our eBay example, you would have an in-memory row-store table for the hot auctions, and then after the 90 days you would move them into the disk-resident table; but if you ever wanted to do a query that spans both tables, you'd have to write a join to combine them, because they appear as two separate logical tables.

The way they implement this is by using mmap as the buffer pool, so they don't have their own LRU structures or anything like that; they just let the operating system do everything. And I think this is okay; as far as I know, having thought it through a little bit, this might be okay for what they're doing, because if you ever have to update the on-disk table they have an in-memory delta store, the same way we talked about before with HANA and HyPer, and then a background thread comes along and merges those changes into the column store. They also have the pre-computed aggregates that are always in memory, which we talked about, so you don't have to go to disk to fetch things you might not need.

So in this case they're not doing anything automatic, either offline or online, to identify which tuples are cold; it's up to the administrator, the human, to specify this. They don't have any eviction metadata, because they're letting the operating system handle it. It's synchronous retrieval, because as soon as you try to touch the page the operating system page-faults and brings it back in. And they always merge, because you're doing an OLAP query on this data, so you want to put everything back in main memory in order to operate on it, and the OS can flush it out later on. So I think this is actually kind of interesting; again, I'm debating mentally.
Again, I'm still debating with myself whether relying on mmap like that was a good idea or not, but for them, they say it seems to work.

All right, so now I'm going to talk about a work in progress for us, a paper we submitted a few weeks ago that evaluates these different design decisions in our database system architecture in the context of H-Store's anti-caching. I'll go through some of the big takeaways at a high level. What I think is really cool about this work is that rather than picking one hardware device and trying all the different policies on it, we run the evaluation on five different devices that all have different properties and characteristics. The spinning hard disk drive is the standard hard drive people think of, the same basic architecture as the 1970s. The shingled magnetic recording drive is a newer technology from Seagate: it's basically a spinning hard drive, but the tracks are written more closely together on the platter, so when you do a write you end up overwriting the tracks next to you and have to do extra work to repair them, but it lets you pack more data onto a single drive than in the traditional case. Then we have a NAND flash solid-state drive, a generic SSD. And the last two are experimental hardware from Intel that is supposed to come out in the near future. 3D XPoint is their non-volatile memory device that will sit on PCI Express; think of it as a better SSD, still block-addressable. And non-volatile memory, or non-volatile RAM, is supposed to arrive a couple of years from now as basically a DRAM DIMM with roughly the same latencies as DRAM, just a little slower, but any time you pull the power, all the data you wrote to it persists. To do this correctly there are some extra CPU instructions Intel is adding that let you flush cache lines directly out to the NVM, so that if you crash, your data isn't sitting in the caches; everything has been persisted. I'm not going to say much more about this today; we'll talk about it in Wednesday's class. But this is a nice segue for us, because you'll see how all of this older, block-based technology is actually inferior if you have byte-addressable, tuple-oriented access methods that can use NVM.

Yes? The question is: do we get actual hardware for these experiments? No. I guess I can say this: both of these devices are hardware-emulated. For the 3D XPoint one, they made a special PCI Express card that has DRAM on it, so if you pull the plug everything goes away; it's not actually durable, but they've tweaked the hardware so the latencies mimic what the real thing will do. For the NVRAM case, they put hooks into the motherboard's debug interface for the memory controller, so that when you read and write to DRAM in this special mode, a little busy loop spins for a bit to slow it down, just as if it were really NVM. Intel says the first one is coming later this year. The other one they always say is two years out, but people have been saying that since 2008 or before; it's always two years. We can talk more about the disappointing timelines for these devices next class. But this is stuff we're actively working on.
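Since the new flush instructions came up, here is a tiny hedged sketch of what "flush the cache lines so the data is actually in NVM" means in code. It uses the older CLFLUSH instruction, which exists on today's hardware; the newer CLWB/CLFLUSHOPT instructions alluded to above do the same job without evicting the line. The helper name and the assumption that the destination is memory-mapped NVM are mine.

```c
/*
 * Hedged sketch of persisting a write to byte-addressable NVM: after the
 * store, flush every cache line the write touched, then fence, so the data
 * has left the CPU caches before we consider it durable.
 */
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence */
#include <stdint.h>
#include <string.h>

static void persist_write(void *nvm_dst, const void *src, size_t len) {
    memcpy(nvm_dst, src, len);

    /* Flush each 64-byte cache line covered by the write. */
    uintptr_t start = (uintptr_t)nvm_dst & ~((uintptr_t)63);
    uintptr_t end   = (uintptr_t)nvm_dst + len;
    for (uintptr_t p = start; p < end; p += 64)
        _mm_clflush((const void *)p);

    /* Order the flushes before whatever durability marker comes next. */
    _mm_sfence();
}
```

The fence is what lets you then write a commit record or flip a "valid" flag knowing the payload itself has already been pushed out of the caches.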
All right, so the first thing we do is a microbenchmark that evaluates these different devices on a workload representative of a database system reading and writing evicted tuples back and forth between DRAM and disk. We break it up between one-kilobyte reads and writes, which you can think of as accessing a single tuple, and 64-kilobyte reads and writes, which is accessing a block of tuples. As you would expect, as you go from the older drive technologies to the newer stuff the performance gets much better, but there are two things I want to point out. First, the spinning hard drive and the shingled magnetic recording drive have basically the same read speeds, which is expected because they are basically the same technology; but on the SMR drive the writes are much slower, because a write to a track is more expensive when you have to repair the neighboring tracks that get overwritten. Second, over here we see that the NVRAM numbers are very close to DRAM. This is also expected, because NVRAM is projected to have latencies about 4x slower than DRAM, so it lines up exactly with what we're seeing: accessing a single kilobyte, think of that as a single tuple in cold storage, is roughly the same on NVRAM as on DRAM, whereas on all the other devices it is much, much slower. What I'm getting at, and we'll see this next class, is that you can design a database system to treat NVRAM as if it were just an extended version of DRAM, and avoid all the slowness of a block-oriented API or block-oriented architecture for cold data storage, if you have that kind of storage device. Again, more on that Wednesday.

All right, for this next graph we're evaluating the merging-threshold policies. Remember, this is where I keep track of how often a block has been used and decide whether to merge everything back into memory, or to put the fetched data into a temporary buffer and throw it away when the transaction that accessed it finishes. So we have merging only when a tuple is updated; then top-5% and top-20%, where a tuple is merged back only if it's in the top 5% or 20% of all tuples accessed in some time window; and the merge-all approach, which is always merging everything back in. I believe for this experiment we're using tuple-level granularity rather than block-level granularity.
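Here is a small sketch of what that per-tuple merge decision could look like. The policy names match the ones above, but the ranking bookkeeping and the rule that updated tuples are always merged are my assumptions for illustration, not the paper's exact logic.

```c
/*
 * Hedged sketch of the merge-threshold decision: after a transaction pulls
 * a cold tuple off disk, decide whether to merge it back into the in-memory
 * table (paying the index-update cost) or leave it in a per-transaction
 * temporary buffer that is discarded at commit.
 */
#include <stdbool.h>
#include <stdint.h>

enum merge_policy {
    MERGE_ON_UPDATE,   /* merge only if the transaction modified the tuple  */
    MERGE_TOP_5,       /* merge only if it's in the hottest 5% this window  */
    MERGE_TOP_20,      /* merge only if it's in the hottest 20% this window */
    MERGE_ALL          /* always merge everything back in                   */
};

struct fetched_tuple {
    uint64_t tuple_id;
    bool     was_updated;   /* did this transaction write to it?            */
    double   access_rank;   /* 0.0 = hottest, 1.0 = coldest in the window   */
};

/* Returns true if the tuple should be merged back into the hot table,
 * false if it should stay in the temporary buffer. Assumption: an updated
 * tuple is always merged, since its changes cannot simply be discarded.    */
static bool should_merge(const struct fetched_tuple *t, enum merge_policy p) {
    switch (p) {
    case MERGE_ON_UPDATE: return t->was_updated;
    case MERGE_TOP_5:     return t->was_updated || t->access_rank <= 0.05;
    case MERGE_TOP_20:    return t->was_updated || t->access_rank <= 0.20;
    case MERGE_ALL:       return true;
    }
    return true;
}
```

The trade-off explored in the results that follow is that merging re-inserts the tuple into the indexes, which is exactly the CPU cost the selective policies are trying to avoid.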
The first thing to point out is what happens on the spinning hard drive and the shingled magnetic recording drive. In these plots, AR stands for abort-and-restart and SR is synchronous retrieval. For the other, faster devices synchronous retrieval is always faster, but we wanted to see whether abort-and-restart works better on a system like H-Store when the device is slow, and what we see is that it's not the case at all. Even though the disk drives are quite slow, stalling the transaction and waiting, that is, synchronous retrieval, is the better way to go. And if you're doing selective merging, you're not always merging the tuple back in, so you're not constantly paying the big penalty of updating the indexes. This is basically saying that the CPU cost of merging cold data back in outweighs the cost of actually going and reading it. We can also compare all these storage devices against DRAM, which is this line here: this is what the throughput of the system would be if everything fit in DRAM. And as you can see, for the newer non-volatile memory technologies we're pretty close to that line.

The next thing we want to compare is a generic configuration, a catch-all meant to work for any storage device, against a configuration that's optimized for each individual storage device. This goes back to the original anti-caching paper you read, which used the generic configuration; now we also have one that is actually designed for the hardware we're running on. For this we look at two workloads. The first is the TATP benchmark; I think the paper may call it the TM1 benchmark. This is basically a workload from Ericsson that mimics people driving down the highway with their cell phones: you have to keep updating which cell tower is closest to each phone, so that when a call comes in you know how to route it. For this workload we see quite a significant difference between the storage devices when using the generic configuration versus the optimized one, and in the NVM case we end up almost as good as sitting entirely in DRAM. So for this particular workload and its access patterns, the speed of the device makes a big difference in the performance you can get: this one is the slowest and this one is the fastest. But it isn't always the case. If you have a workload that doesn't have the properties the TATP benchmark has, you don't see a huge difference, and that's what the second workload shows. This is called the Voter benchmark.
I think we talked about this one before: it was designed by VoltDB to mimic the access patterns of an American Idol-style application, where people call in or go online to vote for the contestants they like on a TV show. This workload is very insert-heavy, so basically we just keep inserting into the database over and over again, the background thread keeps evicting that data out, and we almost never have to go back and read it. In that case you don't see any major difference in the performance of these different devices, because you're almost never fetching cold data. So the main takeaway from all of this is: if you have a workload that does have to access cold data a lot and the access pattern is skewed, then having a better storage device makes a big difference, and using it correctly makes a big difference; but if you never actually read the cold data and are just evicting it as new data comes in, then there's no difference at all. Any questions about this? Yes. Yeah, I see that; the question is why this bar is higher than that one. The paper is still under submission, so we need to go look at that. Yes.

Okay, so again, the reason I wanted to show these results is that they set us up for Wednesday's class. Everything we talked about today is about how to make the database system work around the slowness and complications of a block-based storage device when the system itself is tuple-oriented and byte-addressable: we want to bring blocks back in without bringing back all the buffer-pool overhead we had before. But if you have non-volatile memory, sometimes called storage-class memory or persistent memory, and it's cheap enough, fast enough, and close to the speed of DRAM, then all the machinery we talked about today goes out the door and you don't need any of it. Because, as we'll see on Wednesday, you can design a system to use NVRAM as an extension of the database system's address space. That means you can have half your database in DRAM and half in NVM, and all the execution engine machinery we talked about before can just point to data in NVM; it doesn't need to worry about fetching blocks or managing any of that. It simply extends the address space, and everything runs really fast. And we'll see why an in-memory architecture like the ones we've been talking about all semester is in a better position to exploit this new NVM hardware, when it finally comes out, than the MySQLs and Oracles of the world, because they all treat NVRAM as just a faster SSD. If you treat it as what it truly is, byte-addressable memory that happens to be persistent, you can get much better performance. Again, I don't think that's going to happen for another four or five years, but we can talk about how to design systems now to take advantage of it when it finally arrives.
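As a hedged preview of that "just extend the address space" idea, here is roughly what it looks like today to get a byte-addressable pool of persistent memory into a process by mmapping a file on a DAX filesystem. The path and pool size are made up, and a real system would also use MAP_SYNC plus the cache-line flushes shown earlier to make writes durable; this is an assumption about the mechanism, not the design covered on Wednesday.

```c
/*
 * Hedged sketch: map a file that lives on a DAX-mounted persistent-memory
 * device and hand the resulting pointer straight to the execution engine.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define NVM_POOL_BYTES (1UL << 30)   /* pretend we have a 1 GB NVM pool */

int main(void) {
    /* Hypothetical file on a DAX filesystem backed by persistent memory. */
    int fd = open("/mnt/pmem0/tuple_pool", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, NVM_POOL_BYTES) != 0) { perror("ftruncate"); return 1; }

    char *pool = mmap(NULL, NVM_POOL_BYTES, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (pool == MAP_FAILED) { perror("mmap"); return 1; }

    /* From here on, cold tuples can be allocated out of `pool` and
     * referenced by ordinary pointers -- no buffer pool, no block reads. */
    pool[0] = 42;

    munmap(pool, NVM_POOL_BYTES);
    close(fd);
    return 0;
}
```

Once the pool is mapped, index entries can hold plain pointers into it, which is what makes the "no buffer pool" claim possible.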
Okay, so any questions about this stuff? Yes. Correct: the statement is that unless your NVRAM is as big as the kind of capacity you can get in today's disks, you still have all of these problems. Yes. And it depends on whether you believe the manufacturers: when HP announced the memristor, they were talking about a petabyte of storage per square centimeter. That was 2008, and it hasn't arrived yet. We can talk about how 3D stacking and things like that might let them get there; I'm not a chip manufacturer and I don't do computer engineering, but that's essentially where they're headed. So yes, if NVM isn't very big at the beginning, you still have to do all of this stuff. Let's say it becomes unnecessary in ten years; how about that? Any other questions?

All right, so next class we'll talk about the new non-volatile memory stuff. There were really two lectures I wanted to do on experimental memory hardware, the non-volatile memory material and the hardware transactional memory material, but unfortunately I had to cut some classes. I would say non-volatile memory is going to have a bigger impact on how we design database systems than hardware transactional memory. Hardware transactional memory has been out for a while, I think Intel's implementation has had some bugs, and as far as I know nobody actually uses it; but all the major database vendors are chomping at the bit to get NVM, because I think it's going to be the bigger game-changer. Then at the end I'll also talk about what's expected and what the guidelines are for doing the code reviews for project number three. Okay, awesome, guys. See you Wednesday.