Okay, we can get started. So today is the second lecture on recovery. Last class we talked about logging — writing out all the changes that transactions make as the system goes along — and today we're going to talk about checkpoints: how do you take a snapshot of the database, recover from that, and then replay the log. For today I'm going to start with some quick course announcements related to project two, then we'll focus most of our time on in-memory checkpoints, and then we'll finish up by talking about a little trick from the Facebook guys for doing fast restarts of the database process using shared memory. So for project two, Autolab should now be online. The thing we were battling with was trying to get more cores, to make the multi-threading testing more interesting and more thorough. You should now be able to submit, and there will be additional tests — again, ones we're not providing in the repository — that will run to make sure your implementation is correct. Because we were late getting Autolab up (I wanted to have it done this weekend but we had these problems), I'm giving everyone a week extension: the project will be due on March 9th. So not this Thursday but next week's Thursday. Everything else stays the same. Your project three proposal is still due on the 21st, which is the Tuesday when you come back from spring break. At the end of one lecture late next week I'll spend time discussing some project three topics you could work on, and then you need to get together over spring break, figure out what you actually want to do, and come give a meaningful proposal on the 21st.
By meaningful I mean not just "hey, we think we're going to do this, wouldn't it be great" — you should actually spend time looking at the code and understanding what it's going to take to do the thing you're proposing. Okay. So again, project two is due next week and I'll update the course website with the proper deadline. Any questions about this or project two? There are some questions on Piazza we'll take care of this week. Anything else? In the back. The question is about the use of malloc. Oh yeah, thank you, I should have brought this up. We prohibit you from calling malloc anywhere in the code, and there's a source-code validator that makes sure any invocation of malloc appears only in whitelisted files. What we need to do is provide you with a memory pool that doesn't have the spin latch in it — if you take the ephemeral memory pool and get rid of the spin latch, you can use that instead. So you can either make one yourself and update the source-code validator to whitelist your new file, or I'll make one today and send it out. The idea is we only let you call malloc in certain places, so we'll provide a file you can use and whitelist it in the validator. Again, we just don't want anybody calling malloc all over the place in an in-memory database, because that would be bad, so we only allow certain files to do it. And if there's anything else you need — I don't think you're going to need memset, but if there's anything else that needs whitelisting — talk to me and we'll discuss how to add your file to the whitelist. Just decide whether you actually need the operation you're trying to do. Okay, all right, cool.
All right, so as I said, last week's lecture was about logging protocols — the standard mechanism inside our database system that makes sure we can recover the database after a crash or restart. In an in-memory database, if you kill the process everything goes away, so we use the log to make sure that all the changes made by transactions — transactions we told the outside world had committed — are durable after restart. But the problem with the logging protocol is that the log grows indefinitely. That means when we restart the system, we would have to replay the log from the very beginning to get back to the database state we were in right before the system stopped. And depending on whether you're using a logical scheme or a physical scheme, this could take a long time. In the case of a logical scheme, remember, we were logging the actual SQL queries that we executed. So if you have a hundred-day log file, it's going to take you maybe a hundred days to restore the database after a crash, because you have to replay all those SQL statements. Obviously you can sometimes go faster than when they were originally submitted, but if you were already running at max throughput the first time, then it takes a hundred days to run the second time too — there's nothing about recovery that makes it go faster. So to avoid this problem of replaying the entire log, database systems take checkpoints. The idea is that a checkpoint is a snapshot of the database at some point in time, such that if we load that checkpoint upon restart, we can ignore anything that comes before it in the log.
So essentially, when you take a checkpoint — depending on what kind of scheme you use — there will be a little entry in the log file that says "I took a checkpoint at this time," and it usually records where the checkpoint file is located. Based on that, depending on what kind of checkpoint you take, you can ignore any transaction that was executed prior to it, because you know its changes will, for the most part, be included in the checkpoint data. This significantly reduces the recovery time of the database, because we don't have to replay the entire log: we load in the last checkpoint and then replay only the portion of the log that came after it. That's the main problem we're trying to solve here. Now, in-memory checkpoints are slightly different from disk-based checkpoints, but at a high level, semantically, it's essentially the same thing. In a disk-based database system, the checkpoint is basically taking all the dirty pages in your buffer pool and writing them out to disk. In an in-memory database system, you're basically taking all the pages of the database that are in memory and writing them out to disk, regardless of whether they're dirty or not. You could try to do delta checkpoints, where you only write out the dirty changes, but typically people don't, because then to restore the database you have to go back through all the previous checkpoints to find the pages missing from the last one. So typically people just take the whole thing and write it out. As we go along through the different approaches, what we'll see is that the way you do your checkpoints is tightly coupled with the underlying concurrency control mechanism of the database system.
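To make that recovery path concrete, here is a minimal sketch. The log format — `(lsn, key, value)` redo records and a checkpoint record holding a snapshot plus the log sequence number it covers — is an invented illustration, not any specific system's format: recovery loads the snapshot and then replays only the log suffix after the checkpoint's LSN.

```python
def recover(checkpoint, log):
    """Load the checkpointed snapshot, then replay only the log suffix."""
    db = dict(checkpoint["snapshot"])      # load the snapshot into memory
    for lsn, key, value in log:
        if lsn <= checkpoint["lsn"]:       # already reflected in the snapshot
            continue
        db[key] = value                    # redo changes made after the checkpoint
    return db

# Example: checkpoint taken at LSN 2, so entries 1 and 2 are skipped on replay.
checkpoint = {"lsn": 2, "snapshot": {"a": 10, "b": 20}}
log = [(1, "a", 5), (2, "b", 20), (3, "a", 99), (4, "c", 7)]
print(recover(checkpoint, log))            # {'a': 99, 'b': 20, 'c': 7}
```

Without the checkpoint record, replay would have to start at LSN 1; with it, only the two records after the checkpoint are applied.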
The paper you guys read for today was talking about doing checkpoints in the context of an H-Store or VoltDB style system, but if you're using MVCC, for example, then you may want to do a different type of checkpoint, as we'll see in a second. So essentially, when we take a checkpoint there's going to be some background thread — it can be one or many, depending on how fast you want your checkpoint to go — and it's going to scan the entire contents of the database, going table by table, block by block, and writing them out to disk. Depending on how your data is organized in memory, you may transform the data from an in-memory representation to a disk-based representation, or you can literally just take the bytes and write them out to the file in their original form. There are different trade-offs to each: it's obviously faster to do the write-out if you just leave the data in its original form, but the way you lay things out in memory may not be the optimal way to lay things out on disk. So different database systems do different things. Okay, so the paper you read talks a little bit about the ideal properties you want in a checkpoint — this actually comes from a paper that came out last year from the guys at Yale on doing different types of checkpoints, but the high-level ideas are the same. So let's talk about the ideal characteristics or properties we would want in our in-memory checkpoint mechanism. They're sort of obvious in some ways, but we'll discuss the ramifications of the different approaches in the context of these desirable properties as we go along.
The first thing is, obviously, we don't want the checkpoint to slow down the regular transaction processing workload in our OLTP database. It would be bad if we invoked a checkpoint and all of a sudden the throughput of the system dropped by 50 percent. That would discourage people from taking checkpoints, which would make recovery take longer — just a mess for everyone. Related to this, because latency is tightly coupled with throughput: we don't want to introduce any large latency spikes when a checkpoint starts or finishes. That will happen if you have to block all your transaction worker threads from running anything during different parts of the checkpointing protocol. And the last property — which is not really achieved by the protocols in the paper you read, because they actually have a large memory overhead — is that ideally you don't want the database system to have to use a ton of extra memory just to take a checkpoint. So these are the things we ideally want. The first two we'll get in the protocols we talk about; the last one depends on which approach you're using, and also on the concurrency control scheme you're using. Now, all the schemes you read about in the paper, and all the schemes we'll talk about today, take what are called consistent checkpoints. Think of a consistent checkpoint in the context of snapshot isolation: when you write out the contents of the database to disk in your checkpoint, you know it's a consistent snapshot of the database at a single point in time, meaning your checkpoint will not contain any tuples modified by transactions that hadn't finished when the checkpoint started.
Again, the idea of snapshot isolation was: when I start running my transaction, I see a consistent view of the database, with no changes from any uncommitted transactions that didn't finish before my snapshot or timestamp started. The same thing applies here. The advantage is that when the database system comes back and we reload our checkpoint into memory, we don't have to do any additional processing, because we know the checkpoint cannot contain anything we shouldn't see. There won't be any tuples in our checkpoint modified by uncommitted transactions that we would load back in and then have to figure out whether they should actually be there or not. Remember, in an in-memory database we're not recording any undo information; so unless we record undo information, we cannot allow any uncommitted transactions' changes into our checkpoints. That's why consistent checkpoints are useful for an in-memory database: we don't have to look anywhere else — we come back, and everything is okay. We still have to replay the log, to make sure we get all the changes made by transactions that came after the checkpoint, but those just won't be in our snapshot. Contrast this with fuzzy checkpoints, which are the more common approach in a disk-based database system. In a fuzzy checkpoint, you don't prevent transactions from modifying the database while you're taking the checkpoint, and therefore when you recover you have to go back and figure out what you need to reverse.
Remember, I talked about ARIES: we do the analysis phase to figure out what was going on in the database system right before the crash, then we do redo to make sure all the changes are applied, and then we go back and do undo to reverse the things that shouldn't be there — that's what you have to do when you have fuzzy checkpoints. With a consistent checkpoint, the database system can say "I'm going to start taking a snapshot" and write a single log entry recording the point in time where the checkpoint started, and you don't really need to record anything after that. You may want to record that the checkpoint succeeded, but there's no additional metadata listing the transactions that ran between when it started and when it finished. In a fuzzy checkpoint you do have to record this: which transactions were active when the checkpoint started, which were active when it finished, and which pages were modified in between, because you use that when you come back online to figure out what the correct state of the checkpoint should be. So again, for this class, for in-memory databases, we're going to focus on consistent checkpoints, because they have the property we want: we can come back online and immediately load in the checkpoint without worrying about correcting it. Another big question that comes up is how often you should take checkpoints. The papers don't really discuss this, but it's actually a big deal in practice in real systems, because if you take checkpoints all the time, non-stop, it can slow you down. In the SiloR case, from the paper you read last class, they were continuously taking checkpoints: when a checkpoint finished, they waited 10 seconds and then started all over again.
And I think they showed minimal slowdown versus when you didn't have any checkpoints or logging at all. But if I remember correctly from the experiments, they were still not allowing the threads they used for checkpoints to be used for processing transactions. So although SiloR had low overhead for taking checkpoints, in a real system you can't use those threads to process transactions, because they're busy processing checkpoints — that slows you down in that way. There are other issues too: you're spending all your time flushing buffers to make sure you get your checkpoint changes out to the file. There's a lot going on that will slow down the regular transaction processing workload if you take checkpoints all the time. But if you take checkpoints too infrequently, then depending on the logging scheme you use, recovery may take a long time. If you wait two days between checkpoints, you'll have a huge log that you then need to replay — whether it's logical logging or physical logging, recovery may take a long time. Remember, I said last class that one way around this — you still want to take checkpoints, but to avoid ever having to replay the log for a long time — is to have replicas, so that if the master fails you can promote a replica to become the new master. That helps if one node goes down, but it doesn't help if your whole data center goes down. So we still want to take checkpoints and do our logging. How often different database systems take checkpoints varies wildly. In VoltDB it's typically done on a time basis: you can say "I want to take a checkpoint every five minutes, every ten minutes, or every minute" if you really care about having high availability.
In other systems that use physical logging — MemSQL, for example, and MySQL — the way they determine when to take a checkpoint is that you set a threshold in your configuration file: when my log has written this much data, go ahead and take a checkpoint. In MemSQL, for example, when the amount of data written into the log since the last checkpoint grows to about a quarter of a gigabyte, they flush out and take a checkpoint. And again, depending on how often you do this — if you let the log get too big, it makes your recovery time longer; it just makes everything harder. How to tune this exactly is sort of a black art; usually the defaults in database systems aren't tuned to your application, and you get better performance if you know the right configuration. But that's a whole other ball of wax, which you pay DBAs to figure out for you. All right, so for everything we talk about here, we're going to ignore how often we take checkpoints — just know that this is something you can vary depending on what your availability requirements are. All right. So the paper you read was about doing in-memory checkpoints, and it talked about four different approaches: naive snapshots, copy-on-update snapshots, wait-free ZigZag, and wait-free PingPong. As a spoiler, I'll say that I don't think anybody actually implements the last two, but I think they're interesting to talk about because they show the trade-offs you can make in the different primitives you use to build up a checkpointing protocol. The second one is probably the most common, and we'll see ways to do it. Another thing about the paper: they use terminology that doesn't really fit in with the other papers we've covered so far in the course, or with the things we'll talk about in the future.
They talk about these applications where you have application state in memory that you need to checkpoint all the time — you can essentially think of that as the working state of an application. The idea is that if your entire database is 100 gigs but only one gigabyte of it is being updated all the time, then they're saying you can just checkpoint that one gig. And this is part of the reason why nobody actually does it this way: it's very hard to differentiate which part of the database should be checkpointed all the time versus which part doesn't need to be. Typically what everyone does is checkpoint the entire thing, even the parts that haven't been updated since last time, because this is not something humans can easily identify. It's easy to say a single table is read-only and therefore doesn't need to be checkpointed, but to say that one segment of a single table — half of it — needs to be checkpointed and the other half doesn't, that's a bit more difficult. So again, nobody actually does this, but we'll still talk about it because I think it's interesting.
Alright, so the naive snapshot is the easiest way to do a checkpoint. Basically, you quiesce all the worker threads in the system to prevent them from executing any new transactions — you say "hey, stop doing any work, finish the transactions you have going on, but don't take any new ones yet." Once you know all your threads are blocked, the database system takes the checkpoint by making a copy of the entire database into some new location in memory; then another thread writes the contents of that new memory location out to disk, and you can allow the other threads to process transactions again while you're writing it out. Like I said, this is the simplest thing you can do, but there are two ways to actually do it. The first is to roll your own and do everything yourself: the database system is responsible for copying blocks of data to a new location and then writing them out. The advantage is that you can be sure the system only copies the tuples — the data you actually want in your checkpoint. Remember, in-memory databases don't record any information about indexes for recovery, because you rebuild the indexes entirely when you load the checkpoint back in. So if you do it yourself, you copy only the tuples. You may also think: won't I need twice the amount of memory for my database to do this? In actuality you don't, because you can rely on the operating system doing copy-on-write: when you copy the memory, it doesn't actually make a new physical copy, it just has both virtual memory pages point to the same location, and only when you modify the original data does it do the copy.
So you could copy your entire database in the naive snapshot and not actually double the amount of physical memory used. All right, so the first way is to roll your own, and the advantage is that you copy only the tuple data. The other approach is to let the operating system do this for you by forking your process. The downside is that the operating system doesn't know that this region of memory corresponds to tuples and that region corresponds to indexes, so it can't copy only the tuples — it copies the entire address space of the database process. This is actually what HyPer does — or at least the original version of HyPer. I remember when I read the paper about this, I thought: that's actually a really clever idea. It's so simple — you rely on the operating system to do it for you, and it allows you to do some interesting things. So when they want to take a snapshot, they fork the database process, and now the child process has a complete copy of the exact same address space as the parent process. And again, the operating system does copy-on-write, so it's not actually doing a real copy from physical page to physical page; it's just making a virtual memory copy, and only when either the child or the parent updates a memory location do you actually copy things around.
So the way HyPer does this is that they don't actually wait to pause transactions when they do the fork — they just fork, and in the parent process the transactions keep processing as if nothing happened, while in the child process you use the in-memory undo log that you're maintaining for your transactions to abort the in-flight transactions and roll them back. Once you do that, the child process has a consistent snapshot of the database, and it can write that out to disk for the checkpoint. What's interesting is that the HyPer guys use this not only for checkpoints — they also use it to run analytical queries. When an analytical query comes in, rather than running it on the parent process and slowing down your transactions, you can reroute it to the child process and do the analytics directly on the consistent snapshot there as it's being written out to disk. That's actually pretty clever. Two things I'll say. First, in newer versions of HyPer they abandoned this idea and switched to using multi-version concurrency control — one of the schemes we talked about a few weeks ago. I think the reason is that forking is nice and easy, but you give up control to the operating system, whereas if you do multi-versioning yourself, you have finer control over exactly how memory is being copied and can be more efficient. The other thing I'll say is that we actually tried this technique in H-Store a few years ago, and it turned out not to work at all. It was terrible.
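Here is a POSIX-only sketch of the fork-based snapshot idea just described. The undo-log format (key mapped to its pre-transaction value, with None marking an uncommitted insert) and the JSON snapshot file are my illustrative assumptions, not HyPer's actual representations — the point is only the structure: parent keeps going, child rolls back in-flight work on its copy-on-write copy of memory and writes the consistent snapshot.

```python
import json
import os

def fork_checkpoint(db, undo_log, path):
    """Fork the process; the child rolls back in-flight transactions using
    the undo log, writes the snapshot to disk, and exits. The parent keeps
    processing transactions on its (copy-on-write) view of memory."""
    pid = os.fork()
    if pid == 0:                          # child process
        for key, old_value in undo_log.items():
            if old_value is None:
                db.pop(key, None)         # uncommitted insert: remove it
            else:
                db[key] = old_value       # uncommitted update: restore it
        with open(path, "w") as f:
            json.dump(db, f)              # write the consistent snapshot
        os._exit(0)                       # exit without running atexit hooks
    return pid                            # parent returns immediately

# Usage: tuple "b" has an uncommitted update (old value 2) when we fork.
db = {"a": 1, "b": 99}
pid = fork_checkpoint(db, {"b": 2}, "/tmp/ckpt.json")
db["a"] = 100                             # parent keeps updating its copy
os.waitpid(pid, 0)
with open("/tmp/ckpt.json") as f:
    print(json.load(f))                   # {'a': 1, 'b': 2}
```

Note that the parent's update to "a" after the fork never appears in the snapshot — exactly the isolation the operating system's copy-on-write gives you for free.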
Not because the technique doesn't work per se, but because H-Store was using Java as the front-end layer for transaction processing, networking, catalogs, and things like that. If you read the documentation about Java and the JVM, it says: do not fork your process. Never fork the JVM. We ignored that and did it anyway, and it turned out to be a bad idea, because the child process doesn't restart all the threads you had in the parent process. In a managed-memory environment like the JVM, there's a garbage collector thread and other system threads running in the background, and those don't get respawned in the child process. And then in the parent process — remember, the operating system is doing copy-on-write — when the JVM's garbage collector goes through and cleans up the heap, it starts reorganizing and compacting memory, which causes an excessive amount of copying for the child process, so things get really slow. It worked, but it was flaky, cumbersome, and brittle, and we didn't get the same kind of speed-up the HyPer guys did. The bottom line: don't fork your JVM. Again, the documentation says not to do this; we ignored it. Okay, so related to what I said about MVCC and these fork snapshots: this is very similar to the second approach to taking checkpoints, copy-on-update snapshots. You can think of this as being like multi-version concurrency control — it's basically the same thing — except that instead of having transactions overwrite tuples in place as they modify them, when we know we're in the middle of a checkpoint, we require all transactions to make new copies of the data, and we have to update our indexes to know about them.
So as the checkpoint thread reads or scans the table heap, if it comes across a tuple version that was created after the checkpoint started, it knows it can ignore it, and once the checkpoint finishes you go back and prune things. What I'm describing is basically just MVCC, but applied in the context of a checkpoint, so you can use this protocol in a system that doesn't use MVCC to get the same kind of semantics. And this is essentially what VoltDB does. Again, VoltDB has single-threaded execution engines doing in-place updates, and they can do this without any locks or latches because they know no other transaction could be touching the data at the same time as the one running transaction — it's single-threaded. But when you take a checkpoint, you don't want the checkpoint work running on the same thread you use to process transactions, so they have a separate background thread come along and scan the table heap. Now, because there are no locks or latches to protect the checkpoint thread from the transaction thread, they switch into a special copy-on-write mode. The system isn't really multi-version in the sense of arbitrarily many versions — it's a two-version system: you have the version of the tuple that existed before the checkpoint and the version that exists after it started. If a transaction needs to update the same tuple multiple times during a checkpoint, you don't keep creating new versions — you just keep overwriting the second one. Then when the checkpoint thread finishes, you have a complete snapshot on disk, and you go back and clean up all the old versions, basically doing garbage collection.
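The two-version copy-on-write mode can be sketched in a few lines. This is my own simplified model, not VoltDB's code — the class and method names are invented — but it captures the invariant: while a checkpoint scan is active, the first update to a tuple saves the pre-checkpoint version exactly once, and all later updates overwrite in place.

```python
class COWTable:
    """Two-version copy-on-write mode: at most one saved version per tuple."""

    def __init__(self, tuples):
        self.live = dict(tuples)      # current versions, updated in place
        self.snapshot = {}            # pre-checkpoint versions (one per tuple)
        self.in_checkpoint = False

    def begin_checkpoint(self):
        self.in_checkpoint = True
        self.snapshot = {}

    def update(self, key, value):
        if self.in_checkpoint and key not in self.snapshot:
            self.snapshot[key] = self.live[key]   # save the old version once
        self.live[key] = value                    # later updates just overwrite

    def checkpoint_read(self, key):
        # The checkpoint thread sees the pre-checkpoint version if one exists.
        return self.snapshot.get(key, self.live[key])

    def end_checkpoint(self):
        self.in_checkpoint = False
        self.snapshot = {}            # garbage-collect the old versions
```

So even if a tuple is updated ten times during the scan, the checkpoint thread always reads the single saved pre-checkpoint version, and memory overhead is bounded at one extra version per modified tuple.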
So again, I hope you can see that the same ideas from MVCC and the other concurrency control material we covered before can be reapplied in the context of checkpoints. All right, so there are two observations we can make about what we've talked about so far. In the case of naive snapshots, the database system has to pause all the transaction threads when we take the checkpoint, because we don't want to see any dirty data — we want consistent snapshots. That's bad. In the case of copy-on-update, we have to do additional memory copies each time, and that may require us to acquire latches — depending on what protocol the checkpoint thread is using — which may block the transaction threads. These two points are what the wait-free protocols you read about are trying to overcome. The key thing to point out is that I'm saying wait-free, not lock-free or latch-free. Wait-free means the worker threads never have to wait for the checkpoint thread — when it starts, while it runs, or when it finishes — whenever they want to update the database; they can always proceed without any problems. And the way they do this is by maintaining multiple copies of the database — essentially the entire database. Again, the paper talks about this as the "application state," but I can't see how this could be anything other than the entire database. Maybe you could use a hybrid storage model and say the row store is the thing you checkpoint and you leave the column store alone, but they don't really talk about that — though some of the ideas we discussed before with hybrid schemes might apply here.
All right, so in wait-free ZigZag, we're going to always have two copies of the database, and transactions can only write to one of those copies at a time during a checkpoint. When the checkpoint finishes and we start the next one, all the writes go to the other copy. To figure out which copy to go to, we maintain two bitmaps that tell transactions, on a per-tuple basis, where to do a read or a write. This avoids having to maintain multiple copies of each tuple, as we saw in the copy-on-update approach. The argument is that the additional computational overhead of checking these bitmaps to figure out where to read and write is less than the overhead of copying individual tuples every time you update them. So let's walk through an example of wait-free ZigZag. Again, we have two copies of the database and two bitmaps; every position in a bitmap corresponds to a position in the actual table. So if we want the second tuple, and the read bitmap at that position is set to zero, that means I should read it from copy one, and I can jump to that position there and get it. We're not talking about row stores versus column stores here — just assume this is some tuple; I'm showing a single value, but it could be multiple attributes. The write bitmap works the same way: if the bit at a position is set to one, I do my write to the second copy; if it's set to zero, I do my write to the first copy. So let's say we have our system running, and this is the initial state when we first loaded everything in — right now copy one and copy two are exact copies of each other.
The read bitmaps are all set to zero and the write bitmaps are all set to one. So we start doing a checkpoint, and the checkpoint thread is going to look at the write bitmap and take the inverse of whatever the values are there, and that's going to tell it where it should go read to find a consistent snapshot of the database that it can write out to disk. So in this case, when we start, everything is set to one, so we take the inverse of that and get all zeros, and that tells us that the thing we want to read for our consistent snapshot is over here. And so now we have a thread go ahead and write all the contents of this thing out to disk. Now let's say that as we're taking the checkpoint, a transaction comes along and wants to update the database, say these three tuples here at these positions. In this case, it would look in the write bitmap, and that would say where it should do its writes. They're set to one, telling us we should do our updates in copy two, which is what we want, because we don't want to change copy one while the checkpoint thread is still trying to write it out. So what happens is we go ahead and apply our updates to copy two. Then we update the read bitmap and flip those bits to say: if you're a transaction that needs to read the latest version of the tuple at this offset, go get it from copy two. I'm ignoring all the higher-level coordination going on in the concurrency control scheme; all of that still applies here. Whether a transaction is even allowed to read this depends on the concurrency protocol, which is independent of everything we're talking about here. So if you have two-phase locking, you still have to acquire the read locks in order to actually do this read. 
Again, after you acquire those locks, then you go look up and figure out where the thing is that you need to read from. So now let's say our checkpoint finishes, and say immediately after this we start another checkpoint. Again, same thing: we need to figure out where we need to do our reads from. But before the checkpoint starts, we need to map the values in the read bitmap over into our write bitmap, to say: we know we modified the data in copy two, so flip that bit over, so that if anybody comes along and wants to write, they write into copy one. Then, same as before, our checkpoint thread will take the inverse of all of these, and that will tell us where our consistent snapshot is. So in this case, we would see that we want to read the first tuple from copy two, the second tuple from copy one, and back and forth. Now you see why it's called zigzag: you're sort of zigzagging back and forth between the different copies to find the consistent snapshot that you want to use for your checkpoint. And then, same thing, while the checkpoint thread is running, if anybody comes along and updates tuples: for this guy here, if he wants to update the first tuple, the bitmap says do your write in copy one, and this one says do your write in copy two. And again, our checkpoint thread is not trying to write out those two versions; it's writing out these other versions here. So the checkpoint thread will not see any changes made by transactions after it started. Is this clear? All right, so some of the deficiencies of this approach: you have to do this propagation of changes from the read bitmap into the write bitmap, and you want to do that atomically, and then you have to reset everything every single time you start a new checkpoint. 
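To make the bookkeeping concrete, here's a minimal sketch of the zigzag idea in Python. The class and method names are my own, not the paper's; this ignores concurrency control entirely and just shows how the two bitmaps route reads, writes, and the checkpoint scan.

```python
# Hypothetical sketch of wait-free zigzag (names are my own, not the paper's).
# Two full copies of the database plus two bitmaps:
#   read_bv[i]  -> which copy holds the latest committed value of tuple i
#   write_bv[i] -> which copy a writer must update during this interval
class ZigzagDB:
    def __init__(self, initial):
        self.n = len(initial)
        self.copies = [list(initial), list(initial)]
        self.read_bv = [0] * self.n   # initially read everything from copy 0
        self.write_bv = [1] * self.n  # initially write everything to copy 1

    def read(self, i):
        return self.copies[self.read_bv[i]][i]

    def write(self, i, value):
        c = self.write_bv[i]
        self.copies[c][i] = value
        self.read_bv[i] = c           # the latest version now lives in copy c

    def begin_checkpoint(self):
        # Point writers away from the copy holding the latest committed
        # values, so the checkpoint can read that copy untouched. In a real
        # system this bulk bitmap update must happen atomically.
        for i in range(self.n):
            self.write_bv[i] = 1 - self.read_bv[i]

    def snapshot(self):
        # The consistent snapshot is the inverse of the write bitmap.
        return [self.copies[1 - self.write_bv[i]][i] for i in range(self.n)]

db = ZigzagDB(["A", "B", "C"])
db.begin_checkpoint()
db.write(1, "B2")       # lands in the copy the checkpoint is NOT reading
snap = db.snapshot()    # still the pre-checkpoint state: ["A", "B", "C"]
```

Calling `begin_checkpoint()` a second time flips the write bit for tuple 1, so the next snapshot zigzags between the copies and picks up `"B2"` while writes are again steered away from it.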
So these are some of the deficiencies of wait-free zigzag that they're going to try to overcome in the wait-free ping-pong approach. In the ping-pong approach, what they're going to do is trade additional memory and CPU overhead when transactions read and write data and when we take checkpoints, in exchange for not having to do a longer pause at the end of a checkpoint to reset all those bitmaps. Because again, we want to do that reset atomically, to make sure that nobody sees a partial snapshot. So what they're going to do is maintain two copies of the database, which we'll call the master copy and the shadow copy, and then we'll have another copy of the database, which we'll call the base copy, that is always being modified in every checkpoint interval. So what happens is we elect some copy to be the master, and that's where all the updates will go, and the base copy always has those changes too. And then when the checkpoint finishes, we do a pointer swap to elect the shadow to become the new master, write new changes there, and take the next checkpoint from the old master, which becomes the shadow. So I'll walk through an example here. We have a base copy here, which has the entire contents of the database, and then we have our two additional copies. For each of these we have not only the contents of the database but also a bitmap that says whether each tuple was modified in the last checkpoint interval, and then down below we have our master pointer that tells us which of these copies is the master version. 
So when we first start up, copy one is considered the master. You'll notice here that we have all the data in the base copy; the master copy only has these placeholders saying here's where new values can go; and copy two has the complete copy of the base data, because this is where we're going to write our checkpoint from. So again, we have complete copies in the base copy and in copy two. Now when our checkpoint thread starts, it looks at the shadow copy, which is copy number two, and does the same scan through and writes that out. And then as transactions come along and do writes, we always apply the writes to the base copy as well as to the master copy. So we apply our changes here in both locations, and then we flip the bit to say that this tuple is now dirty since the last checkpoint started; we flip it to one. So now, when anybody wants to do a read, they can always read the base copy, because this will always be consistent. And again, the concurrency control scheme above all this is making sure that, at whatever isolation level we're running, transactions are seeing the data they should be allowed to see. All right, so now let's say the checkpoint finishes and we flush that out. Now we want to start a new checkpoint, so we're entering the next checkpoint interval. What's going to happen here is that we're going to reset the bitmap in the old shadow copy to all zeros, because zero means that the data is clean. And in my example here I'm showing that you also reset all the old values. You don't technically have to do this, because you're just going to overwrite them with new values as transactions update them; for simplicity I'm showing that they get zeroed out. 
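Here's a minimal sketch of the ping-pong write path, with my own naming rather than the paper's. Every write hits the base copy and the current master copy and sets the master's dirty bit; the shadow copy is left alone so the checkpoint thread can scan it undisturbed.

```python
# Hypothetical sketch of the wait-free ping-pong write path (my own naming).
class PingPongDB:
    def __init__(self, initial):
        n = len(initial)
        self.base = list(initial)       # always holds the latest values; reads go here
        # copy 0 starts as the master (placeholders only);
        # copy 1 starts as the shadow (full snapshot for the first checkpoint)
        self.copies = [[None] * n, list(initial)]
        self.dirty = [[0] * n, [0] * n]  # per-copy "modified this interval" bits
        self.master = 0                  # index of the master copy

    def read(self, i):
        return self.base[i]

    def write(self, i, value):
        self.base[i] = value                  # update the base copy...
        self.copies[self.master][i] = value   # ...and the master copy
        self.dirty[self.master][i] = 1        # mark dirty for the next checkpoint

db = PingPongDB(["A", "B"])
db.write(0, "A2")
# base and master see the new value; the shadow (copy 1) is untouched
```

Note that the writer never checks a bitmap to decide where to go, which is exactly the per-write overhead that zigzag pays and ping-pong avoids.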
So then we switch the master pointer over to copy number two, and copy number one becomes the shadow. So our checkpoint thread can come along now and write out the contents of this. Well, what's one problem here? Right, as she said, we only have the changes since the last checkpoint; we don't have the original values, because remember, they were zeroed out when we started. So what we need to figure out is how we actually fill in these contents. Remember, I said we don't want to just take a delta checkpoint, meaning we only write out the things that actually changed since the last checkpoint, because that would make it difficult to do recovery: you'd have to go back and examine all the older checkpoints to fill in the missing tuples. So there's two ways you could do this. One way is you could just copy the missing values from the base copy into copy number one. Why would this be a bad idea? Some of those tuples might be updated. The way you would do this is you would go check the bitmap here and ask: has this thing been modified, yes or no? If it's zero, meaning no, then it's safe for you to go ahead and propagate the value in. But now you have a race condition, because you could check this bit, see that it's zero, and then by the time you go to do the read, someone has swooped in and modified it, so now you're copying dirty data and you wouldn't know it. The only way to prevent that is to take a latch on this guy and then go ahead and do the copy, which is not what we want to do. 
So the alternative they propose is that the checkpoint thread is going to recognize that it's missing this data, and it knows the data has to be in the last checkpoint that it took, so it's going to go back to disk, read the last checkpoint, fill in its missing gaps, and then write everything back out. So this is totally different from everything we talked about before for doing checkpoints. All the other checkpoint approaches were: these threads write out the data, take whatever's in memory and write it out to disk, and never go back and read it unless you actually have to recover from it. But now what I'm saying is, in order for this approach to work, you actually have to go back and read the last checkpoint, fill in your gaps, and then write it back out. Yes? Question: I'm confused, copy two is still there, right? When you flip the bits, the bitmaps, you don't have to wipe out the data; you can just take the data from copy two. So his statement is, when I was back here and flipped all the bitmaps, could you not just copy from copy two into copy one? The answer is no, because you're going to have the same problem as with the base copy. So I flip all my bits to zero, and yes, I'm showing that the data gets zeroed out, but I'm saying you don't actually have to do that; for illustration purposes I'm showing that. But then we have the same problem here: someone may come and modify this tuple here in between the time we read that the bit is zero and the time we read the value. I'm showing a single value here, but it could be multiple values, could be the entire tuple, so you can't guarantee that's going to be done atomically. The only way to prevent that is to take a latch, but now you're blocking your writing threads. Good question. Anybody else? 
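The gap-filling step can be sketched as a simple merge: for each slot, take the shadow copy's value if its dirty bit is set, otherwise fall back to the value from the previous checkpoint read back from disk. The function name and list-based layout here are my own simplifications.

```python
# Hypothetical sketch: produce a complete (non-delta) checkpoint by merging
# the shadow copy's modified slots with the previous on-disk checkpoint.
def merge_checkpoint(shadow_vals, shadow_dirty, prev_checkpoint):
    return [shadow_vals[i] if shadow_dirty[i] else prev_checkpoint[i]
            for i in range(len(shadow_dirty))]

prev = ["A", "B", "C"]        # read back from the last checkpoint on disk
shadow = [None, "B2", None]   # only tuple 1 changed this interval
dirty = [0, 1, 0]
full = merge_checkpoint(shadow, dirty, prev)   # ["A", "B2", "C"]
```

Because the checkpoint thread only ever reads the shadow copy and its own old checkpoint file, neither of which writers touch, no latches are needed.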
Okay, so I don't want to talk about the performance implications of these things, but I do want to talk about the primitives they use to build up a checkpoint protocol, because I think, again, it's illustrative to understand what these different approaches do at the lowest level. What they basically say is that there are four key constructs you can apply to build a checkpoint protocol. The first is that you can do bulk state copying, where you basically take the naive snapshot approach: copy the entire contents of memory to another location and write that out. But in order to do this, you have to pause transactions. The second is that you can use locks and latches to isolate the checkpoint thread from other transactions, to prevent them from modifying regions of memory as you're actually trying to write them out. The third is that you can use bulk bitmap resets to track the dirty regions and then, in one complete modification, flip the bits to say that they're clean, or that you've successfully written everything out, so you know what you need to read or not read when you take a checkpoint. And the last one is that you can just use more memory: keep additional copies of the data that you know will be consistent and that you can safely write out, and you can do this without having to block other threads. So what I really like is this table they show in the paper, where they break down the four constructs you have for your checkpoint protocols against the four implementations of checkpoints, and they talk about how each implementation uses these things. And again, what you see is that in the case of wait-free ping-pong, you pay a large memory overhead to avoid having to do locks and latches: you have to have three copies of the database at any time. 
As I said, as far as I know, no system actually does these two; I would say that this one is the most common for in-memory databases. This is essentially what VoltDB does, MemSQL does, TimesTen does. This is what pretty much everyone does. And you can do it regardless of whether you're using logical logging or physical logging; this implementation just works. So for this class and the last class, we've mostly been describing how to do logging and recovery and checkpoints when there's a database crash: someone trips over the power cord, the data center catches on fire or whatever, and your machine goes down in an unexpected way, and then you need to turn the system back on and recover the database state. Not all database system restarts, though, are going to be from these sort of cataclysmic accidents or crashes. There are other reasons that require us to restart our system. One could be because we have to update our operating system or our kernel to make sure we get the latest security patches, and therefore we have to restart the box. It could be because we want to upgrade the hardware: maybe change instance type on EC2, put in more RAM, change the SSD. Or it could be because we actually just want to update the database system software itself: we have a new version of the system, it has better features, it's faster, so we want to stop our database system, load in the new version, and turn it back on. What I'll say is that all of these, except for this last one, require you to restart the operating system: turn it off, turn it back on. But updating the database system, just the software itself, doesn't require you to restart the operating system. The problem is that because we're assuming we have an in-memory database, the primary storage location of the database is in memory. 
We have to shut the process down, that kills our address space, we lose everything in memory, and then when we turn the system back on, it goes through the same protocol to recover the database from a checkpoint and the log, just as it would if it was a hard crash. And that could be bad; that could be really slow if your database is really big. Even if your disk is really fast, it's going to take a while to suck everything into DRAM. So what we want is a way to quickly restart our database system without having to reread the entire database from disk and put it back into memory. Again, we're assuming we're doing an upgrade or something like that. So this is the problem the Facebook guys were facing in their system called Scuba, which they wanted to overcome, and they came up with a way to do fast restarts using shared memory. What this is essentially going to allow us to do is decouple the in-memory contents of our database, in the private address space of our process, from the actual process lifetime. So even though the database system process may exit and come back as a new one with a new PID, we can reuse the contents of the old process to avoid having to reload everything back into memory. And I'll show why they want to do this in Scuba in a second. Again, because we're going to use shared memory to do this, we can restart the process multiple times and the memory contents will always survive. So Scuba is an interesting system at Facebook. It's a distributed system, but we're not going to talk about how they actually implement the distributed orchestration, the coordination of query execution. It's an interesting system because they built it to do time series analysis for all their log events. So any time you click anything on Facebook, on the website or on the mobile application, they do sampling for those events. 
Like if you load a page, occasionally they will trace all the steps that your request goes through in their architecture stack and write out a log event to say: how much time did I spend in the database, how much time did I spend in the web server, how much time did I spend in the caching layer. And they want to throw all this data into a time series database, which is what they use Scuba for, and then do event analysis and anomaly detection to find out whether there are any unexpected slowdowns in their application stack. And they do this because Facebook has this sort of engineering philosophy where they like to push out updates very often. I think every two or three weeks they're pushing out a new version of a product, a new version of a system, a new version of a service. So that means things are going to break. You can test all you want, but until it's really in production, you maybe won't see the problems occur. So what they want to use Scuba for is to log all these different events that occur, and then if they notice that all of a sudden, with a new version of an application, the latency of requests increases by 3x, they can go back and figure out where in the stack they're having a problem. So I'll talk real quickly about how the system is actually architected. It's not entirely relevant to how they do restarts, but I just want to show you what it sort of looks like. They have this tree hierarchy model where you have a bunch of different nodes, and at the upper layers of the tree you have what are called aggregator nodes. These are stateless machines that don't actually have copies of the database; they just know how to take information sent from the leaf nodes, where the data is actually stored, and combine it together. 
You can sort of think of this as a MapReduce model, or doing group-by and aggregation: you take the leaf nodes where the data is actually stored, you do all your scans based on the query, you shove the results up to the aggregators, and they combine them from the different nodes, and you can have multiple levels of this. This is really nice because it's really easy for them to add new aggregators, since they don't actually store any part of the database; you just spin a new one up and it fits in with everyone else. Then at the bottom are the leaf nodes. There you have the database actually stored in memory, but there will also be a log written out to disk for all the new events that occur. So this is the problem they're trying to solve: they want to push out new versions of the leaf node software without having to do a complete restart. The paper talks about how, even with a fast disk, if you have a 120 gigabyte database that you need to load in after the leaf node process restarts, that can take two or three hours. If your database is even bigger, it can take even longer. Now if you have hundreds of machines that all need to be refreshed, a large portion of them could be down every single time you push out a new version. So what they're going to do for their fast restarts is write the contents of the database into shared memory and then record some marker of where that shared memory is located out on disk. The process restarts, comes back up, looks at that log entry, figures out where the data is in shared memory, and then sucks it into the new process. There are two ways to do this. The first is that you could always use shared memory for all the allocations in the process, meaning when you call malloc you get shared memory rather than private heap memory. And what's really interesting is that Facebook actually spent a lot of time thinking about doing this approach. 
Essentially what you'd do is write a new version of malloc so that when you call malloc it goes to shared memory rather than the heap. And they actually have the guy that created jemalloc on their payroll at Facebook, and they spent a lot of time talking with him to decide whether this was the right idea. They determined that doing this was bad, because not only do you have to write a custom allocator, but there's a bunch of extra stuff you have to do when you subdivide shared memory to make sure that the different threads don't trip up on each other, right? The other big problem is that with shared memory you can't have lazy allocation of the actual physical pages. When you allocate shared memory, the operating system has to guarantee that there is actually physical memory backing it. And you don't want that for things like copies, because you want to do them in a lazy way; that's how you get better performance. So they determined that they didn't want to do this. What they decided to do instead was: when the database system is going to go down, and again, they're doing this for an upgrade, so it's not like the power got pulled, they know they're going to shut the system down, so they can send a shutdown command to the database system that says, hey, get ready to restart. When that occurs, the system starts writing out the contents of local memory into shared memory, and then restarts the process, which comes back and sucks it all back in. Right? So again, when the database administrator says, I want to restart this node, they send the shutdown command, and then the database system is going to block all new updates for new events. You can still do reads on it, but you're not going to do any writes. 
Then what happens is the database system will start copying blocks of memory out to shared memory, and it can delete the blocks in local memory once it knows they're in shared memory. Then you update whatever pointers you have, and you can still read the data because you know where it is in shared memory. Then, when the snapshot finishes, the database system restarts and comes back with the new code, the new version, and figures out whether there is a valid version of the database in shared memory that it can copy into its heap. If not, then it just does the normal recovery process by reading the checkpoint and the log. There's a bunch of safety checks in this approach when they go to see whether they can use what's in shared memory on a restart, because what you don't want to happen is: if the new version of the database system modifies the layout of data in memory, you don't want to suck that in; you don't want the old version of the heap to be used in the new version of the system, because then all the pointers and all the alignments will be messed up. So they have a check that says: I know that when I took the snapshot into shared memory, it was at this version of the system, and only if it's valid can you reuse it. 
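As a toy illustration of the idea, here's a Python sketch using `multiprocessing.shared_memory` as a stand-in for the POSIX shared memory a system like Scuba would use; the segment name `"db_snapshot"` and the byte layout are made up. The key property is that the segment outlives the process that created it, so a restarted process can reattach by name and copy the bytes back into its own heap instead of rereading the database from disk.

```python
from multiprocessing import shared_memory

data = b"tuple1|tuple2|tuple3"      # stand-in for the in-memory table

# "Old process": on shutdown, copy local memory into a named shared segment.
shm = shared_memory.SharedMemory(create=True, size=len(data), name="db_snapshot")
shm.buf[:len(data)] = data
shm.close()                         # detach; the segment survives the process

# "New process": reattach by the name recorded on disk and copy back in.
shm2 = shared_memory.SharedMemory(name="db_snapshot")
restored = bytes(shm2.buf[:len(data)])   # in practice you'd record the size too
shm2.close()
shm2.unlink()                       # tear the segment down once it's consumed
```

In a real deployment you'd also persist a version tag alongside the name and size, so the new process can run the validity checks described above before trusting the segment.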
You wouldn't normally do this; as I said, you wouldn't want to use shared memory for your regular heap when you're actually running the real system. But for this special case, when you actually want to do a shutdown and a restart, the operating system provides you something that is very helpful here, which I think is really simple and really clever. It's sort of like the HyPer guys forking the process to do checkpoints with naive snapshots: building off a simple OS primitive to do something interesting. All right, so what are my parting thoughts? I would say the copy-on-update approach for checkpoints is the most common way to go, and it's especially easy if you're using multi-version concurrency control, because you get snapshot isolation for free: you're making new copies anyway because you have to make new versions, so all you need to do is find the consistent snapshot and write that out to disk. And as I said, this is what pretty much everyone does today. And as I showed with the Facebook restart approach, shared memory actually has some usefulness after all. When we started building Peloton, we were still based on Postgres, and Postgres is a multi-process shared memory architecture, and we found that using shared memory for an in-memory database was just too slow, especially if you want multiple threads, so we ended up getting rid of that entirely. So the older systems still use shared memory for primary storage, but newer in-memory systems don't use it at all; they use private heaps. Okay? All right, so next Thursday we'll begin the two-part lecture series on optimizers, and this is something we're actually working on now inside of Peloton. We'll talk about the early optimizers on Thursday, and then Tuesday next week we'll talk about the modern variants of them. By modern I mean like the 1990s, but that's still considered the state of the art right now. 
And then again, a reminder: project number two is due Thursday next week, still at midnight, and the project three proposals will still be due on the 21st. Any questions?