Where we're going with the next two lectures is sort of finishing up with some ancillary things we can do in an in-memory database system. Today's class looks at how to bring back the disk in a smart way, and then in Monday's lecture we'll talk about using new hardware accelerators, new types of devices, to speed up our data system as well. I'd say these things are the extra stuff you can add on once you already have an established in-memory database system.

Some administrative things to deal with. Your schedule in this class looks like the following. We have one more lecture, on Monday. I will send out the reading review emails to everyone to make sure you know whether you're on track with that. Then friends of ours from Snowflake are coming on Wednesday to give a guest lecture here, and they're actually going to give another talk, a tech talk, on Thursday, so I'll send a reminder about that. The final exam this semester is a take-home exam, so I will hand that out to you on Wednesday. You can more or less think of it as looking very similar to the midterm, except you can read and use notes and watch the videos to answer it, and you'll hand it in when we do the final presentations. The second code review will also be due next Wednesday, so I'm going to go through the PRs tomorrow, give everyone feedback, and then close everything; you can then submit a new PR for your team with all your new code. The final presentation for your project will be during our scheduled final exam, which is May 14th, a Monday, at 8:30 a.m. I don't know what room it's in yet. It sucks; it is what it is. We'll get bagels, okay? Again, this is just a friendly thing: here's everything we've done. The final PRs and the testing and everything will be due around this time as well, and then the class will vote on who has the best project, and they win a prize, okay?

I also want to talk real quickly about last week. There were two major news items in the field of databases that I didn't have time to talk about on Monday, so let's see if anybody knows what they were. What's that? No. Excellent, yes: FoundationDB went open source. FoundationDB was a startup out of Virginia; they built a transactional key-value store that Apple bought about five or six years ago, actually before I came to CMU, and at the time it was touted as a NoSQL system that actually supported transactions, and could do transactions well. So Apple bought them. I knew some of the guys from Akiban, another startup that FoundationDB bought; they're these ex-Israeli-military guys who were in Boston, moved down to Virginia to work on FoundationDB, and then when FoundationDB got bought by Apple they moved to California. They were always very hush-hush about what they were working on, but Apple actually open-sourced it, which is pretty amazing; internally they've still been working on it. We can talk about performance later.
One interesting thing about this is that Snowflake actually uses FoundationDB, and has been using it for years, as the metadata store, the catalog, for their internal system. Snowflake is an OLAP system, but you still need a catalog to know where all the data is, and they were using FoundationDB for that.

There was one other announcement last week as well: a major new release of a famous database system. Not Postgres; Postgres 10 was released last year. MySQL, excellent. Yeah, so MySQL 8 came out. Well, it's been out for a while, but now it's at general availability, meaning it's actually considered stable to use in production. The prior stable version was MySQL 5.7, and it's been that way for the last two years. I forget why they skipped six and seven and went right to eight. They added a lot of performance enhancements, but the thing I think was really cool about the latest version of MySQL is that they expanded the SQL features they support. For a long time, Postgres had the most advanced SQL functionality among the open-source systems, like CTEs, window functions, and regular expressions. Now MySQL has all these things as well, and they also added support for GIS, geospatial indexes. I was actually in San Jose yesterday and I met the MySQL team; there's a picture of me with them. These are the guys that actually work on MySQL. They told me they liked the class videos; they said I talk about Larry Ellison too much, which is true. Yeah, I have a problem. Okay.

All right, so today's class. What we're going to talk about is how we can bring back the disk in our in-memory system in an intelligent way, to allow us to support databases that exceed the amount of memory available to our system. We want to do this without just bringing back the buffer pool manager, which we said earlier in the semester was a major bottleneck in these in-memory environments. So we're going to see how to do this in a smart way. We'll do some background on how this is all going to work, then talk about all the different design decisions you have to make if you want to introduce disk back into your in-memory system, and then we'll finish up by looking at what real-world examples look like, along with an evaluation that we've done.

The motivation for this should be pretty obvious to everyone: DRAM is awesome, it's fast, it's much better than disk. Of course it's not durable, but we already know how to handle that correctly or safely using logging, and we'll actually see how to use persistent memory next class. But it's expensive. It's expensive to buy and it's also expensive to maintain. On a per-gigabyte basis it's much more expensive than SSDs and spinning hard drives, and when you actually deploy it in a server, the energy you're paying to refresh the DIMMs, to maintain the charge that holds the values they're storing, can add up a lot.
So there was a study done about a decade ago showing that roughly 40% of the total energy usage of a server is spent just on refreshing DRAM. So it's nice that we can have really fast systems, but it may be the case that we want to offload some data that doesn't need to be super fast and urgent for us to keep in DRAM. We can offload that to disk, and then we get energy savings and hardware savings in terms of cost.

The way we're going to do this is to introduce mechanisms in the database system that allow us to store and access data that's on disk, but without bringing back all the slow parts we shed at the very beginning of the semester. We don't want to bring back the buffer pool manager; we don't want to bring back all the crap we said was a big bottleneck, just to occasionally access something on disk.

The key thing to be aware of as we go along is how the hardware exposes access methods to our system for retrieving data. For the entire semester we've been dealing with in-memory storage, which is inherently byte-addressable, so we can be tuple-oriented and operate on single pointers in our indexes and in our query execution. But when we bring back disk, the granularity it exposes for retrieving data is always a block or a page, say 4 KB. So we need to be careful when we build the components of our system that access data on disk: the regular execution engine will access things normally as tuples, but these components need to handle the case where we're dealing with a batch of tuples in a page.

One other thing I'll say is that this lecture is focused on OLTP workloads. This is all about fetching a small number of tuples from disk into memory in order to crunch on them in a query or a transaction. The reason we're not really going to discuss OLAP queries is that there isn't any magic our database system can do to pull data from disk much more efficiently than a regular buffer pool manager, and the reason is that those queries are doing full table scans. There's not much you can do to mask that: you just stream in all the data from disk that you need and process it.

The only real optimization we can do in that environment is possibly the zone map stuff we talked about before. Say you have a single column like this. We can build a zone map that precomputes a bunch of aggregate information about the data in the column, keep that in memory, and flush the data itself out to disk. If a query comes along and we know we can satisfy it with the zone map, and thereby preclude any data that's actually in the block on disk, then we just read the zone map in memory, which is really fast, and ignore everything on disk. But that's pretty much the only thing you can really do to avoid reading things from disk. This is what MemSQL does, but it's not specific to in-memory databases; disk-based systems like Vertica and Greenplum do stuff like this too.
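To make the zone map idea concrete, here's a minimal sketch of a zone map over one on-disk column segment. This is not MemSQL's actual code; the struct layout and function names are assumptions for illustration only.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Precomputed aggregates for one on-disk segment of a column; this stays
// in memory even after the segment's data is flushed out to disk.
struct ZoneMap {
  int64_t min = std::numeric_limits<int64_t>::max();
  int64_t max = std::numeric_limits<int64_t>::min();
  int64_t sum = 0;
  uint64_t count = 0;
};

ZoneMap BuildZoneMap(const std::vector<int64_t>& segment) {
  ZoneMap zm;
  for (int64_t v : segment) {
    zm.min = std::min(zm.min, v);
    zm.max = std::max(zm.max, v);
    zm.sum += v;
    zm.count++;
  }
  return zm;
}

// For a predicate like "col > k": if the segment's max is <= k, nothing in
// the on-disk block can qualify, so we skip the disk read entirely.
bool SegmentCanMatchGreaterThan(const ZoneMap& zm, int64_t k) {
  return zm.max > k;
}
```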
So again, this class is really focused on OLTP. Now, the key aspect of OLTP workloads that we want to exploit is this notion of hot versus cold data. The hot data is the data you just inserted, just updated, or are currently reading. The cold data is the stuff that's less likely to be accessed by a transaction. Modern applications have access patterns like this: you go on Reddit and you post comments on the articles that were posted that day, or maybe the previous day; you're not going back 200 days and writing comments. The stuff that's hot today is more likely to be used; the stuff that's older is colder and less likely to be accessed.

What we need to make this work is a mechanism in our system that not only identifies what's actually cold data automatically, but then pushes it out to disk. And then, if at some point a transaction comes along and says "I need this data," we need a mechanism to pull it back in efficiently.

At a high level, what we're doing looks like this. Our index is always going to be in memory, and we have our in-memory table heap of tuples. This is just the regular architecture we've talked about so far in class; we're not changing anything here. But then out on disk we have what we'll call our cold data storage. This is where we're going to offload data we don't think we're going to need, so we can save space in memory by moving it out there.

The first thing we have to do is figure out what we actually want to write out. Say, using some mechanism, we figure out that these three tuples here are the cold ones. We go ahead and write them out into a block, and there'll be a header that says "this is an evicted-tuple block; here's the data I have in it." This is going to look a lot like the disk pages we saw in a disk-based system, maybe 4 KB, except that we don't really need the slotted-page mechanism. We can just store the tuples contiguously, because when we suck the block back into memory, we just want to take the data out immediately and put it back into our table heap. A rough sketch of what such a block might look like follows below.
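This is a minimal sketch of an evicted-tuple block under the assumptions in the lecture: fixed block size, a simple header, and fixed-width tuples packed contiguously with no slotted-page indirection. The field names are illustrative, not taken from H-Store or any other real system.

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t kBlockSize = 4096;

struct EvictedBlockHeader {
  uint32_t block_id;    // identifier that in-memory tombstones refer to
  uint32_t table_id;    // which table these tuples came from
  uint32_t num_tuples;  // how many tuples are packed below
  uint32_t tuple_size;  // fixed tuple width makes offset math trivial
};

struct EvictedBlock {
  EvictedBlockHeader header;
  uint8_t data[kBlockSize - sizeof(EvictedBlockHeader)];

  // Tuples are contiguous, so tuple i lives at a fixed offset; when the
  // block is fetched back, tuples can be copied straight into the heap.
  const uint8_t* TupleAt(uint32_t i) const {
    return data + static_cast<size_t>(i) * header.tuple_size;
  }
};
```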
So the first question we have to answer is how we identify that these tuples are the cold ones we want to move out. The next question: our index used to point to these tuples when they were in memory, but now that we've evicted them, what should it point to? And then if a query comes along and says "I want to access tuple 01," what should happen, given that the tuple is no longer there? How do we go fetch it and bring it back in, given that the thing it needs is out on disk? The mechanism for bringing it back is something we have to figure out as well. We have to read the entire block, because it's a block-based API, but when we put things back in memory, do we put back just the one tuple we need, or the whole block? These are the questions we're going to answer today.

Now, as I often say, you never want to use mmap for your database. The obvious question is: this sounds like exactly what the OS provides with virtual memory, right? The OS can track the access patterns and figure out automatically which of your pages in memory are hot versus cold. Why don't we just let the OS manage these things for us?

The answer is that the OS is not going to give us a good way to do asynchronous read-ahead. We want to be able to dispatch to another thread, "hey, I'm going to need this data, go fetch it for me now," while we continue processing the transaction. If you can't do that, then the moment your transaction touches data that's not in memory, you get a page fault, the OS preempts you, and you're blocked until that page is brought back in; your thread can't do anything. We're also not going to be able to write multiple pages out concurrently; the OS writes a single page at a time, and if we want eviction to be efficient, we want to batch those writes ourselves. So in general, in my opinion, mmap is not the right thing here. We'll see two examples, from MemSQL and from a research system out of EPFL, that actually use mmap, and the EPFL guys do it in a clever way. Actually, they can't avoid the first problem, but they handle the second one cleanly. To make the first problem concrete, the sketch below shows the kind of prefetch dispatch a database system has to build for itself.
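This sketch dispatches evicted-block reads to a background thread so the worker isn't stalled the way an mmap page fault would stall it. Everything here is hypothetical scaffolding; a real engine would use its own I/O and scheduling machinery.

```cpp
#include <cstdint>
#include <future>
#include <vector>

struct Block { /* 4 KB of evicted tuples */ };

// Stub: a real system would pread() the block from the cold-data file.
Block ReadBlockFromDisk(uint32_t block_id) { return Block{}; }

void ProcessTransaction(const std::vector<uint32_t>& evicted_blocks) {
  // Kick off all the reads we know we'll need, in parallel.
  std::vector<std::future<Block>> pending;
  for (uint32_t id : evicted_blocks) {
    pending.push_back(std::async(std::launch::async, ReadBlockFromDisk, id));
  }
  // ... keep executing the parts of the transaction that only touch
  // in-memory data ...
  for (auto& f : pending) {
    Block b = f.get();  // only now do we wait; the reads overlapped
    // merge tuples from b back into the table heap (elided)
  }
}
```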
All right, so here are the design decisions we have to deal with if we want larger-than-memory databases. First, what happens at runtime: while we're running transactions or queries that update the database, how do we identify which tuples are cold and which are hot? Then, once we've identified cold data, what should the eviction policy be? How do we actually move the data out: at fixed intervals in time, or by tracking some internal metric that tells us the right moment? And then, when a transaction tries to access data we know is not in memory, how do we fetch it and bring it back in: at what granularity, with which thread doing the fetching, and how do we merge the tuples back into memory? We're going to go through each of these one by one.

So the first one is cold tuple identification. There are essentially two ways to do this: in an online fashion or an offline fashion. Online identification means that as transactions access the database, as they touch tuples to update or read them, the database system maintains some metadata that records how many times each tuple has been accessed. And what we can do, as in the anti-caching paper you read, is embed this metadata in the header of each tuple rather than keeping an auxiliary data structure. We pad in a little extra information that says how many times the tuple has been accessed, or we maintain an LRU chain through the headers.

The alternative is to figure this out in an offline manner. Your system is already writing out a write-ahead log with all the tuples that get modified anyway. So a background process just reads that log back in and computes the same metadata you'd otherwise track online: these are the tuples that have been updated recently. The problem is that the write-ahead log isn't going to contain any reads; it only has inserts, updates, and deletes. If you want to account for selects, you essentially have to write extra information out to the log.

In my opinion, the first approach is better. Yes, you'll pay a slight performance penalty to maintain it, but the metadata you need to figure out what's hot versus not is always in sync with the actual application. With the offline approach, the background thread could be falling behind. How much that actually matters, I don't know, but I like the idea of a cooperative method that tracks this as you go, without relying on a separate background job that gets kicked off every so often.
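As a rough illustration of the online approach, here's what embedding the access metadata in each tuple's header might look like. The exact fields, a coarse epoch counter plus intrusive LRU links, are my assumption; the anti-caching paper describes the idea, not this precise layout.

```cpp
#include <atomic>
#include <cstdint>

struct TupleHeader {
  std::atomic<uint32_t> last_access_epoch{0};  // coarse logical clock
  uint32_t lru_prev = 0;  // intrusive LRU chain links (tuple slot ids,
  uint32_t lru_next = 0;  // not raw pointers)
};

// Called by the execution engine on every read or update; the cost is one
// relaxed store plus relinking the tuple at the head of the LRU chain.
inline void TouchTuple(TupleHeader& h, uint32_t current_epoch) {
  h.last_access_epoch.store(current_epoch, std::memory_order_relaxed);
  // move tuple to the head of the per-table LRU chain (elided)
}
```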
The next question is deciding when to evict. The first approach is an administrator-defined threshold: when my database is using some percentage of the total memory I'm allowed to use, I trigger an eviction. Say I'm only allowed one gig of RAM; when I hit 80% capacity, that's when I start writing data out. In this case, because we're maintaining this ourselves inside the database system, it's our responsibility, as the database system developers, to write the mechanisms that actually evict the data and move it out to disk. The alternative is, again, to essentially use mmap and let the OS manage all this for you; it figures out, using its own algorithms, when it's time to flush data out to disk, and it can do that in the background. And again, given my disdain for mmap, which somebody might say is irrational, I think the threshold is the better way to go: we should be managing memory and disk ourselves entirely.

The next thing to figure out is what we actually keep in memory for tuples once they've been evicted, to record the fact that we evicted them. This is necessary because we don't want any false negatives. We don't want the case where we write something out to disk and there's no remnant of it in memory, so a lookup finds nothing, and we assume the tuple doesn't exist when it actually does, out on disk. Moving data out to disk should be completely transparent to the application.

There are three approaches here, and the bottom one is basically mmap again: the operating system keeps track of what data is on disk in your address space, and inside the database system we don't have to do anything special.

Yes? So the question is: do the tombstones have to be stored somewhere else in memory, and should they be smaller than the original tuple? The answer is yes to both, and I'll have a diagram on the next slide. With tombstones, the basic idea is that when we write a tuple out to disk, we create a new tuple with a special flag that says "here's where a tuple used to be," and it holds the block id and the offset saying where on disk the data is actually stored. And his point was: does the tombstone have to be smaller than the original data item? Yes. If you have a table with a single column that's an 8-bit integer, you evict it from memory, and in its place you store a tombstone with a 32-bit block id and a 32-bit offset, then you're not saving anything; you've actually made things worse, and on top of that you still have to maintain all your indexes pointing at the tombstone.
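Here's a minimal sketch of the tombstone record and the administrator-defined eviction trigger. The field widths follow the 32-bit block id / 32-bit offset example from the lecture; the 80% threshold and function names are illustrative.

```cpp
#include <cstdint>

// In-memory stand-in for an evicted tuple: just enough to find the real
// data in the cold store on disk.
struct Tombstone {
  uint32_t block_id;  // which evicted block holds the tuple
  uint32_t offset;    // where inside that block the tuple starts
};

// Administrator-defined trigger: start evicting cold tuples once we cross
// a fixed fraction of the memory budget (80% here, purely as an example).
bool ShouldEvict(uint64_t bytes_in_use, uint64_t memory_budget) {
  return bytes_in_use > (memory_budget * 80) / 100;
}
```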
The alternative to tombstones is to use an approximate data structure, like a Bloom filter, that basically says: for this particular key, yes, there used to be a tuple with this key. It doesn't tell you where the tuple actually is; you have to maintain a separate index on disk that says where to find it. The idea is that when you do a lookup — and again, these are OLTP queries, so we're assuming point queries, or fetching a small subset of items — you check the index you maintain as usual. If the key's not there, you check the Bloom filter. If the Bloom filter says it's there, you know to go out to disk and get it, because Bloom filters never have false negatives: if it says the key is not there, it can never actually be there. But Bloom filters can have false positives, so the filter could say there's a tuple with that key, we go to disk, and we find it's actually not there. We end up wasting disk I/Os in that case. The trade-off is that the Bloom filter can be much smaller than the storage for the tombstones.

Let's look at an example of both of these. This is the same setup we had before. Assume we're maintaining some internal metadata about how these tuples are accessed; in this example I'm showing a histogram, but it could be an LRU chain, it could be the clock algorithm, it could be anything. It identifies these tuples as the coldest, so we want to write them out to disk. Now we need to update our index, because it's pointing to the positions where these tuples used to be, and we want to reuse that memory for new data; we don't want the old keys pointing at the old tuples.

The way we do this with tombstones is to have a separate storage space in memory for the table that holds these special tombstone tuples, each of which just maintains the pair of block id and offset, which is enough information to find the original data out on disk. The way we actually implemented this in H-Store is that we made an internal table, linked to the main table, that had these special columns.

The alternative, from the Hekaton guys, is the Bloom filter: you remove the keys for the evicted tuples from your in-memory index, but you add them to your Bloom filter. So when a query asks "do I have the key for tuple 01?", you check the index: it's not there. You check the Bloom filter: it is there. Therefore you have to go out to disk, consult the on-disk index, and it tells you where to get the data you actually need.

One limitation of the Bloom filter approach is that you can no longer do sequential scans on your table, because there's nothing left in the heap for evicted tuples at all. Whereas with tombstones you can do a sequential scan: you hit the tombstone and recognize, oh, this is something I need, I can't get the data from this, I have to go out to disk and get it. The overall lookup path is sketched below.
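This is a minimal sketch of the lookup path under the Bloom filter scheme. The three structures are stubs with hypothetical names; real systems (e.g., Siberia) have their own index and filter implementations.

```cpp
#include <cstdint>
#include <optional>

struct TupleRef { uint32_t block_id; uint32_t offset; };

struct InMemoryIndex {
  std::optional<void*> Lookup(uint64_t) { return std::nullopt; }  // stub
};
struct BloomFilter {  // never answers "no" for a key that was inserted
  bool MayContain(uint64_t) { return false; }                     // stub
};
struct OnDiskIndex {
  std::optional<TupleRef> Lookup(uint64_t) { return std::nullopt; }  // stub
};

enum class Where { InMemory, OnDisk, NotFound };

Where FindTuple(uint64_t key, InMemoryIndex& idx, BloomFilter& bf,
                OnDiskIndex& disk_idx) {
  if (idx.Lookup(key)) return Where::InMemory;      // hot path, no disk I/O
  if (!bf.MayContain(key)) return Where::NotFound;  // definitely not evicted
  // Possible false positive: the on-disk index is the ground truth.
  if (disk_idx.Lookup(key)) return Where::OnDisk;
  return Where::NotFound;  // wasted a disk I/O on a false positive
}
```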
Yes? So your question is: if I use the Bloom filter, do I lose the location information I had in my index? Because with the other approach we store the block id and offset directly. No, no: you use the Bloom filter first. Say I ask, "do I have the key for tuple 01 in my index?" No. The Bloom filter says yes. So then I have this other on-disk index that I have to use, and that tells me where to actually go find the tuple. And yes, you could cache parts of that. I guess they didn't talk about logging for this one, but that on-disk index definitely has to be durable, because otherwise you don't know what you're going to find when you come back after a crash. Okay.

All right, so now that we know how to write the data out, the question is how we bring it back in. Say, in our scenario, we recognize we need tuple 01. It's in a block with two other tuples. We go out to disk, get that block, and bring it into memory. What do we actually do with it now?

In the first approach, we grab just the tuples we needed, merge those back into the regular table heap, update our indexes to point to the newly reconstituted tuple, and discard all the other tuples. Then you maintain some extra metadata that says, for each of my blocks, here are the holes inside it. So if I take tuple 01 and put it back in the heap, it could physically still be in that block on disk, but I remove the tombstone that corresponds to it. This gets tricky when I want to recover, because I need to know that the last place this tuple physically existed was in memory, not in the disk block. If I update it, I need to know that the in-memory version is the one to use when I repopulate the database, not what's out on disk. So you have to maintain some metadata to track these things; a rough sketch of that bookkeeping follows.
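Here's a minimal sketch of the hole-tracking metadata that tuple-level merging implies: once a tuple is pulled back into the heap, its slot in the on-disk block must be ignored, both on later fetches and during recovery. The structure and names are entirely illustrative.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

constexpr size_t kMaxTuplesPerBlock = 256;

// For each on-disk block: which slots have been merged back into the heap,
// so their on-disk copies are stale "holes".
std::unordered_map<uint32_t, std::bitset<kMaxTuplesPerBlock>> block_holes;

void MarkMergedBack(uint32_t block_id, uint32_t slot) {
  block_holes[block_id].set(slot);
}

// Checked when fetching a block and when rebuilding state at recovery: a
// hole means the authoritative copy lived in memory, not in this block.
bool IsHole(uint32_t block_id, uint32_t slot) {
  auto it = block_holes.find(block_id);
  return it != block_holes.end() && it->second.test(slot);
}
```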
The alternative is to take all the tuples you find in the block, merge them all in, and discard the block entirely. The advantage is that you don't have to track holes in your blocks, which is much easier from a write-amplification standpoint. The downside is that for every single tuple you bring back in, you have to update all the in-memory indexes so they stop pointing at the tombstone and point at the new tuple. If you have a hundred indexes on your table and you're bringing in a block with a thousand tuples, you have to update all those indexes for every single one.

The other downside is this: say I only needed one tuple from the block. That tuple should technically become hot, because I just updated or read it. But all the other tuples just came along for the ride as part of the fetch; they're not actually hot, they're still cold. So when I do another pass and ask what data I want to evict, those tuples are likely to get written right back out, because I haven't read them since I brought them in. I'm ping-ponging, thrashing: reading in data I don't need and immediately writing it back out.

So these approaches have different trade-offs, and as with all things in databases, it often depends on what the application wants. You can maybe be clever about packing: when you write out a block, instead of only putting tuples from the same table in it, you could follow foreign keys and say, all right, I'm evicting this tuple from the parent table, let me also evict all its child tuples. That way, when you fetch a block back in, you're fetching data that's actually related in a logical sense in the schema, rather than data that merely happened to be cold at the same time.

All right, the next thing to deal with is what happens to our transactions when they actually touch this data. The anti-caching paper you read was in the context of the H-Store system, and H-Store is the academic predecessor of VoltDB. H-Store has an execution model with single-threaded engines, so only one transaction runs at a time in an engine. The question is what to do with a transaction when it accesses cold data.

In the first approach, you immediately abort the transaction when it tries to access an evicted tuple. You can actually let it go a little longer, though: you do a sort of Jedi mind trick when it touches an evicted tuple. As long as it doesn't actually try to read or update the value, you let it keep executing; as soon as it does, you abort and restart it. The idea is that you may catch the transaction trying to access a whole bunch of evicted tuples, so you can fetch them all in one batch rather than aborting and restarting over and over. So: we catch the access to an evicted tuple, abort the transaction, and put it into a background queue while another thread fetches the data it needs. Then we merge that data back in and reschedule the transaction, knowing that the data it needed last time is now in memory. The advantage is that aborting a transaction in an OLTP context can be pretty cheap.

The downside shows up with queries that access a lot of data. Say a query needs to read the first half of the table, which is in memory, while the second half is evicted. It runs, reaches the second half, and gets aborted. Then the top half gets swapped out, the bottom half gets swapped in, it starts over, and now it hits evicted tuples in the top of the table. A way around this is to let the transaction restart from where it left off when it was aborted, but then you no longer guarantee serializability. The restart loop looks roughly like the sketch below.
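Here's a minimal sketch of abort-and-restart with batched fetching, in the spirit of the anti-caching design. The exception-based control flow and all the names are hypothetical simplifications; H-Store's actual mechanism differs.

```cpp
#include <cstdint>
#include <unordered_set>

struct Txn {
  std::unordered_set<uint32_t> needed_blocks;  // evicted blocks it touched
};

struct EvictedAccess {};  // signalled when the engine must read a tombstone

// Stub: runs the stored procedure, recording each tombstone it touches in
// txn.needed_blocks and throwing EvictedAccess when one must be read.
void Execute(Txn& txn) {}

// Stub: background thread reads the blocks and merges tuples into the heap.
void FetchAndMerge(const std::unordered_set<uint32_t>& blocks) {}

void RunWithRestart(Txn& txn) {
  for (;;) {
    txn.needed_blocks.clear();
    try {
      Execute(txn);  // committed without touching evicted data
      return;
    } catch (const EvictedAccess&) {
      // Abort, fetch everything we saw in one batch, then re-run from the
      // beginning so serializability is preserved.
      FetchAndMerge(txn.needed_blocks);
    }
  }
}
```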
Yes? Is this for stored procedures only? This is in the context of stored procedures, yes, though it actually doesn't have to be; you could do this without them. You could roll back a single query and then run it again in the context of the same transaction, but that would potentially not guarantee consistency.

The alternative is to do essentially what a disk-based system does: you stall the transaction when it accesses an evicted tuple, immediately fetch the thing it needs, and once it's in memory, let the transaction keep running.

The last thing we have to deal with is what should happen with the tuple you fetched from disk. We call this the merging threshold. The alternatives are: do I immediately merge it back into the regular table heap and update all my indexes, or do I keep it in a private buffer that exists just for the transaction that needed it, so that when that transaction commits I can throw it away and never pay the penalty of merging it back in?

"Always merge" means you always merge the fetched tuples back into the regular table heap. So when the transaction that accessed the evicted tuple commits, the next transaction that runs will see all the tuples the previous one brought back in. The alternative is to merge only tuples that were updated: if you just read something, you throw it away after the transaction commits; if you updated it, you install it into the regular table heap. And then selective merge means you track how often each block is actually accessed, and you decide, based on whether you're above some threshold, whether to bring its tuples fully into memory or just discard them. The idea here is that the disk part is expensive, but updating all the indexes is expensive too. So maybe you just bring the tuple into memory without updating the indexes, let the transaction that needed it access it, and throw it away when you're done; or you merge it into the full table heap. A sketch pulling these three policies together follows.
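Here's a minimal sketch of the three merging policies just described. The per-block access counter and the threshold parameter are illustrative; real systems would track richer statistics than this.

```cpp
#include <cstdint>
#include <unordered_map>

enum class MergePolicy { kAlwaysMerge, kMergeOnUpdate, kSelectiveMerge };

std::unordered_map<uint32_t, uint32_t> block_access_counts;  // per-block stats

bool ShouldMergeIntoHeap(MergePolicy policy, uint32_t block_id,
                         bool tuple_was_updated, uint32_t hot_threshold) {
  switch (policy) {
    case MergePolicy::kAlwaysMerge:
      return true;  // pay the index-update cost every time
    case MergePolicy::kMergeOnUpdate:
      return tuple_was_updated;  // reads live in a private buffer, then die
    case MergePolicy::kSelectiveMerge:
      // Only blocks that keep getting pulled back in are worth merging.
      return ++block_access_counts[block_id] >= hot_threshold;
  }
  return false;
}
```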
All right, any questions about how we actually implement these things? Yes? The question is what we do with varlen attributes. They get serialized out with the regular tuple as well; you don't split them up.

So now I want to talk about some real-world implementations of this. The first three are academic: the anti-caching work was done in H-Store, an academic prototype; the EPFL work is an academic variation of VoltDB; and Project Siberia never actually got merged into Hekaton. The last two are real and available today. We wrote this survey paper a year or so ago, and I haven't looked recently at whether anybody else is doing this; as far as I know, nobody else has taken the approaches we're talking about here in the context of in-memory systems.

The paper you read was the anti-caching paper we wrote when I was still in grad school. The way to think about the name: a regular disk-based system does caching. All the data starts out on disk, and when it becomes hot, you move it from disk into memory and cache it. Anti-caching is the reverse: all the data starts in memory, and when it gets cold, you write the cold data out to disk. That's why it's the "anti" of caching. It's a classic Stonebraker term, and the best part is that the reviewers didn't like the term anti-caching, so when we submitted the revision we took it out of the title; then, after the paper got accepted, Stonebraker said, yeah, put it back in.

So, using the taxonomy we just defined: anti-caching uses online identification, maintaining an LRU chain through the tuple headers that tracks when each tuple was last accessed. It has an administrator-defined threshold: when you reach a certain fraction of the DRAM you're allowed to use, it starts evicting. It uses tombstone tuples to keep track of evicted data items. It does abort-and-restart, at least in the original version. And it merges at block-level granularity and always merges everything; we didn't do selective merge or merge-on-update.

Project Siberia was an intern's project at Microsoft Research, done in the context of Hekaton. As far as I know it was never ported to the full system used in SQL Server, so at this point I don't think the Hekaton engine actually supports larger-than-memory databases. They do offline identification: they look at the log, figure out the access patterns of transactions on tuples, and use that to identify what's cold and what's not. They also have an administrator-defined threshold. They use the Bloom filter approach instead of tombstones. They do synchronous retrieval, meaning they block the transaction while fetching, at a tuple-level granularity, and they always merge the fetched tuples back in.

Then the researchers at EPFL — this is Natassa Ailamaki, a former database professor here at CMU — did a version of VoltDB where they modified the storage manager to use mmap for storing these tuples. So they do offline identification using OS virtual memory, synchronous retrieval, page-level granularity, and they always merge things back in.
This is the interesting part I want to cover, because I think it's actually a clever use of mmap. What they do is take their in-memory table heap, just like the one we know, and break up the memory for the table heap into hot and cold sections. They mmap a huge region for storing these tuples, but then they use madvise and mlock to pin a certain region in memory, telling the operating system it's never allowed to evict it. So even though it's mmap-ed, by using madvise and mlock you're telling the OS that no matter what, it cannot write those pages out. The cold tuple section, on the other hand, is unpinned, so the operating system is free to write it out whenever it needs to.

Then they use offline identification: a background thread looks at the access patterns and identifies which tuples are cold. Say we've identified that this tuple is cold. Instead of moving it out to disk immediately, they just move it down into the cold tuple section. Because that's unpinned memory, the operating system is free to evict it whenever it needs to, and they can reuse the space in the hot section to store more tuples. At some point, under memory pressure, the OS decides, all right, let me write this out. Then any time you try to access data in the cold section, you hit a page fault and the page gets merged back in.

Now, the tricky thing is that you have to be careful how you lay things out in memory. Before, since we had complete control of everything we read and wrote, if our tuples fit in a single block, we knew retrieving a single tuple from disk would always be one disk I/O. With mmap, if you're not careful, the OS page boundaries could split your tuple across two pages; that can't happen when we manage the memory ourselves. So what should have been one disk I/O could end up being two. That's the only downside I can think of with this approach; other than that, I think it's pretty clever. The sketch below shows the system-call pattern.
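Here's a minimal sketch of the hot/cold split using the Linux calls the lecture mentions (mmap, mlock, madvise). The region sizes and the hint are assumptions; this is not the actual VoltDB/EPFL code, and a real system might back the cold region with a file so evicted pages go to its cold store rather than to swap.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

constexpr size_t kHotBytes  = 1ul << 30;  // pinned: OS may never evict
constexpr size_t kColdBytes = 4ul << 30;  // unpinned: OS may page it out

uint8_t* hot_region = nullptr;
uint8_t* cold_region = nullptr;

bool InitTableHeap() {
  void* base = mmap(nullptr, kHotBytes + kColdBytes,
                    PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (base == MAP_FAILED) return false;
  hot_region = static_cast<uint8_t*>(base);
  cold_region = hot_region + kHotBytes;
  // Pin the hot section: no matter what, the OS cannot write it out.
  if (mlock(hot_region, kHotBytes) != 0) return false;
  // Leave the cold section unpinned and hint at its access pattern; the
  // OS decides when (and whether) to evict it under memory pressure.
  madvise(cold_region, kColdBytes, MADV_RANDOM);
  return true;
}
```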
Yes? The question is how they manage the index. At this point the index has a memory pointer to the tuple, so all I need to do, in the context of the transaction that moves the tuple down to the cold section, is update the index to point there instead. Your statement is that with mmap you don't know when something moves from memory to disk. Correct, but it doesn't matter, because it's all still your address space. The page where that tuple lives gets evicted, but the virtual address is still valid; when you try to read it, you get a page fault, the OS blocks you, and it knows that the virtual address you're touching maps to a physical page that's been written out to disk, so it blocks you until it fetches the page back in. And yes, the statement is that the index still points to the tuple. Correct: there's no need for a tombstone at all.

Yes? So your statement is that this doesn't avoid the original problem. Which problem? Right: the reason we don't want to use mmap is the page-fault stall, and no, they can't avoid that here. And if neither region were pinned, then I wouldn't know anything about what got flushed out. Here they're essentially half-managing their memory: if you don't pin anything, the operating system decides when things get written out; by pinning, we're forcing the operating system's hand and saying, you can't flush this part out. Yes, it's a hybrid; that's a good way to think about it. And yes, his statement is that if I access the page that tuple went to, the thread has to stall. Absolutely, yes.

All right, so the next one I'll talk about is Apache Geode. Now, with this one, I should go back and read what they do again; it's confusing, because Apache Geode used to be called GemFire, but they still sell GemFire as a commercial variant that's based on Geode, and Geode is different from that, and then there's also GemFire XD, which is not the same thing either. As far as I can tell, this is what they're doing: online identification with an administrator-defined threshold, tombstones, and synchronous retrieval at tuple-level granularity with merge-on-update. The thing I find interesting about Geode, if I understand it correctly, is that they have mechanisms to read and write data out to HDFS, done in a way that can be efficient for the Hadoop file system, whereas all the other approaches assume you're writing out to an SSD. So I think they have their own storage manager for the on-disk data that lets them be more efficient at reading things back in.

The last one is MemSQL. MemSQL actually uses mmap for their column store. With MemSQL you declare tables as either a row store or a column store. If it's a row store, you get the regular in-memory storage engine, where they manage memory themselves. If it's a column store, they rely on mmap: the column portion of the table is always stored through mmap, and they have a sort of delta buffer that sits in front of it, like an LSM storage manager, and that's where all your writes and updates go. So that part never actually depends on mmap.
It's only when you fetch data from the column store, as you're doing a scan, that you let the OS bring things in for you. And they use the zone map approach I talked about before to figure out which segments or blocks of the column they can avoid reading entirely. So when I list MemSQL's identification as "manual," that just means you declare your table up front as a row store or a column store; if it's a column store, you let the OS do everything, with page-level granularity and always-merge. As far as I know this is still what they do, and I actually think it's okay here, because this is OLAP: you never update the columns directly in a transaction, you only update the delta buffer sitting in front of them, and that's always in memory. So you don't have the problem of updating column data and hitting a page fault in the middle of a transaction. I think InfluxDB does something very similar as well.

All right, so to finish up, the evaluation. This is from a paper my PhD student Lin wrote with us a few years ago. We took the H-Store system, tried all the different variations of writing data out to disk that we just talked about, and ran them on a bunch of different storage devices. We have the standard spinning-disk hard drive and a NAND flash SSD, but we also had some newer hardware that we wanted to try out in this environment. Shingled magnetic disks have been around for a while; they're basically a spinning-disk hard drive like a regular one, except that they write the tracks really, really close together, so when you write a single bit on a track you end up overwriting data in the adjacent tracks. They have an error-correction method to handle that, and it lets them achieve a higher density on the data they write out. Then 3D XPoint is the new non-volatile memory device that Intel sells; we actually have one here at CMU that sits on the PCI Express bus. And we had an experimental NVM device, basically non-volatile memory that looks like DRAM except everything is actually persistent; we'll cover this more next class. That last one wasn't a real drive, it was emulated, whereas the other ones here are actual drives.

The first micro-benchmark just looks at what happens when we read and write data to these different devices at different capacities. This is basically asking: how fast can these drives read and write data?
As you would expect, the spinning-disk hard drive is the slowest and DRAM is the fastest, and as you increase the size of the data you're reading and writing, performance gets worse, as expected. What's sort of interesting here is that the read speeds for the spinning-disk hard drive and the shingled magnetic disk are basically the same, even though their capacities are very different. And for NVM and DRAM, you see that NVM is almost as fast as DRAM, which again is the major selling point for this new hardware when it actually comes out; we'll cover that more next class.

The first thing we evaluate is the merging threshold: when we go fetch evicted data, how much of it do we merge back in? For this we're showing the spinning-disk hard drive and the shingled magnetic drive doing either asynchronous or synchronous retrieval, meaning whether we block the transaction or not; for the other devices it was so fast that I think it was always done synchronously. What you see is that if you merge everything, the performance is always the worst, because you're merging more data back into your database and then you have to evict it back out again. You're doing a lot of additional reads and writes that you wouldn't have to do if you only merged a small percentage. And you can see from this that the sweet spot for the really fast drives is to be selective: update-only merging, or, for NVM, the 3D XPoint device, merging only the top 20%. So the main takeaway is that if you have really, really fast drives, you don't want to merge everything, but for the slower drives, like the spinning-disk hard drive, you actually do.

Yes? Why is this one so slow? Which one, this one here, under asynchronous retrieval? Let me look at this again; bear with me, I'm on four hours of sleep. So, asynchronous retrieval means you abort the transaction, restart it, and fetch the data in the background. If you merge everything, then the likelihood that you'll hit the same tuple again without having to do another fetch is higher, and that's why merging everything does better there than under synchronous retrieval.
It turns out that with synchronous retrieval you're paying the penalty to merge everything anyway: you pay it either way, to merge the data in or to go retrieve it from disk. And if you're more selective about what you merge back in, you end up doing fewer writes back out. Actually, I don't think we have the graph here; we should show how much data you're actually writing out. This is just the overall throughput.

Yes? Yes: when you're dealing with disk, low performance usually means more idling; not cache misses, but the data you need not being in memory. Okay.

Those were micro-benchmarks on the previous slides, but now we want a straight-up performance comparison of the different devices. We'll look at two benchmarks. The generic configuration of anti-caching is from the original paper you read: abort-and-restart, merge everything, and 1024-kilobyte block sizes. For what we'll call the optimized configuration, we ran a parameter sweep of all possible combinations of settings for each device and set them as needed: those always use synchronous retrieval and merge only the top five percent, but the block size differs between the spinning-disk hard drive and the NAND flash drives. For the spinning disks, larger block sizes, like the 1024-kilobyte, one-megabyte blocks, make sense because they do better at sequential access, whereas the solid-state devices can access data at a smaller granularity, so 16-kilobyte block sizes actually work out to be the best.

The red bar at the top shows what you get if you just run everything in DRAM. This is the TATP benchmark, a benchmark from Ericsson from a few years ago that basically models a cell phone network: cars are driving down the highway, updating the cell phone tower to say, here's where to find me if you want to call me. So it's a mix of reads and writes. What we see is that the optimized configuration on NVM gets almost the same performance with anti-caching as when everything is in DRAM, but the generic configuration actually turns out to be the worst on NVM. You can sort of think of it as not using the hardware the way it's meant to be used: we originally designed the generic configuration for a spinning-disk hard drive. This isn't really that exciting; it just shows that as you get better devices, the optimized case gives you better performance. I actually don't remember why the shingled magnetic disk came out slightly better than the regular spinning hard drive here, given that writes are more expensive on shingled magnetic disks.

The next benchmark is the Voter benchmark. This one is modeled after American Idol: people call in on the phone to vote for a singer on the TV show, so you get this sudden blast of inserts. It actually came from VoltDB, where it was used for the Japanese version of American Idol and the Canadian version, which I think is just called Canadian Idol. So this is modeled after their environment, and it's super insert-heavy, because it's just people calling in. Here the optimized configuration actually does worse for the shingled magnetic disk, while for NVM it's roughly the same, and this is because the optimized configuration is really better for reading data back in rather than for
writing data out. Because this workload is just doing inserts, you never go back and read the old votes; you maintain an internal counter table, so you never have to go read the old data. The main takeaway I want you to get from this: in TATP, where it's a combination of reads and writes and we go back and read old data, we get better performance with what we call the optimized configuration; for Voter, which is straight inserts, the configuration that was best for reads and writes is not the best for pure writes. When I say reads and writes, I mean reading data in from disk and writing data out, whereas Voter is only writing data out, because the workload is insert-heavy.

Yes? So the question is: why is NVM almost identical to DRAM here, even though the devices are quite different? This is an artifact; it's all software overhead. There's the cost of combining a bunch of data together into a large block and writing it out sequentially, plus the overhead of maintaining the metadata, finding all the tuples you need to write out, creating the tombstones, updating the indexes. All that software overhead is the main bottleneck, the long pole in the tent, compared with just writing out the pages. You see this when you run pure, simple micro-benchmarks, like the one I showed earlier: that graph was latency, so lower is better, and clearly NVM can write out pages much faster than the spinning-disk hard drive. But here all of that washes out, because of all the extra mechanisms: figuring out what data we need to write, collecting it, updating metadata, updating indexes. And yes, your statement is correct: for this particular experiment, the overhead of writing the data out is the main bottleneck, not so much the disk. 2016? Yeah, we published in 2016, so we ran the experiments probably in 2015 or 2016. Okay.

All right, so today's class was all about dealing with databases where we want to bring back the disk, and how we work around this issue of having block-oriented access. In the case of NVM, one of the things we never did, but could have done, is that since NVM is actually byte-addressable, we could have written the data out the same way the EPFL guys did in VoltDB: just move the tuples out to the NVM device, update the indexes to point to them, and let that be the slower storage tier.
We didn't do that here because we wanted to stay within the confines of a block-oriented storage device world, since that's what the hardware available to us two or three years ago looked like. Also, none of these techniques did anything to offload the indexes themselves to disk. We skipped this part when we talked about compression earlier in the semester because we ran out of time, but we've done experiments showing that the size of the indexes accounts for something like 40 to 60 percent of the total memory used by an in-memory database. All these techniques were about writing out the tuples themselves, while still maintaining a bunch of memory for the indexes, so this only solves about half the problem.

And this is what takes us into next class. Today we used NVM as a block-oriented device; next class we'll see what happens if you use NVM as a slightly slower but larger-capacity replacement for DRAM that's also actually persistent. You may then not need a full log in the same way we needed one for DRAM. It may be that all the stuff we talked about today, and actually most of what we'd cover in an introductory class on disk-based database systems, can get thrown out the window if we have NVM. I can't say anything more, but it is coming very soon. Okay, any questions?

Yes? You're going way back, okay, to that slide. So the question is why abort-and-restart doesn't guarantee consistency for large queries. It does, if you restart from the beginning. Here's the scenario: I have a large query doing a full table scan; half the table is in memory, half is on disk. My transaction starts, I scan the top of the table, I get halfway, and all of a sudden I recognize I need a tuple that's not there. I have to abort. A background thread goes and fetches the part of the table I don't have in memory and brings it in. But say I only have one table in my database, so now I have to evict the top half to make space for the bottom half. If I come back and start from the beginning, I immediately hit an evicted tuple and have to restart again, and now I'm just ping-ponging back and forth. Or, if I come back and pick up where I left off at the second half: in between, because I got aborted, another transaction could have been scheduled and could have started updating the second half of the table. Now when I scan through, I see changes I should not have been able to see, because my transaction logically started before theirs, but I'm picking up where I left off. So don't think of this in the context of an MVCC system; think of it as a single-version system. In the context of H-Store, which is a single-version system, you would have this problem; MVCC would solve it.

There are some things you could do that we never actually explored. If you just need to do a COUNT, you don't actually care what the values are, as long as your WHERE clause isn't super complex: you could just count the tombstones. Or you could do a covering-index query, where you use the indexes, which are always in memory, to satisfy some predicate and count the number of keys that actually
match, without ever looking at the tombstones themselves.

Yes? So the statement is that the data involved cannot all be loaded into memory, so the query can't be handled anyway. What do you mean? Right, the statement is that in my example the working set of the transaction cannot fit entirely in memory, so I either have to pick up where I left off, or I can never execute the transaction at all. Right, and if I pick up where I left off, then I may be seeing dirty data if it's a single-version system. We also explored the possibility of doing partial evictions, almost like a columnar thing: I know I have, say, ten attributes in my tuples but only really use three of them in queries, so evict the other seven. We never went that far. Okay, any other questions?

All right, so next class, as I've been saying, we're going to look at new hardware you can use for databases. The paper I assigned you to read is about non-volatile memory, but we'll also talk a little bit about using GPUs for databases, and about hardware transactional memory, which is limited in how we can use it, but we can use it for very specialized things in a database system and actually do something cool. Okay. All right guys, enjoy your weekend and I'll see you on Monday.