 All right, so today we're talking about recovery protocols, and so there'll be some, you know, in the interruption class we talk about areas, we talk about how to, we want to restore the database after a crash, and in the MRE database we're going to do the exact same thing, right? Obviously that if the database shuts down and crashes, we want to be able to restart the system and put the database back in the correct state, such that we guarantee the asset admissivity and durability guarantees they would have in the asset system, right? So, atomacity means that we can't have any partial transactions, consistency means that we don't put the database in like a funky state where like data could be actually incorrect, and then durability just means that, you know, if we write changes to the database, we tell it to commit a transaction, the outside world hears that that transaction is committed. When we come back, then all our changes are still there. So every recovery algorithm is going to have two parts. There's the part we're going to do at runtime where as we process transactions and they update the database, we're going to take some extra steps to record some extra information so that if we crash and come back, we can look at extra information and figure out what was going on in the system to put us back in the correct state. And then there's the recovery protocol you actually do after the crash or after a failure. All the information you've collected that you've generated in the normal processing and then you figure out, all right, what do I need to do to go back? So we'll talk a little bit about, when we spent time talking about in-memory database or logging protocol, the recovery protocols, the paper I had you guys read was not in a memory database. And I'll explain why I signed it. There's actually not a lot, not many papers about in-memory database logging recovery. When you look back at the early ones from the 1980s, they all make this big assumption that the log or the database itself will be backed by what is called non-volatile memory or persistent memory. Back in the 1980s, this just meant battery backed up DRAM. So you had a little battery on the motherboard that could be triggered any time you recognized that the power was going to go down and they used the battery to actually gracely shut down the memory, write it out the disk. And so when you came back, all your content or DRAM were still there. So battery backed up DRAM has been around for a long time. It's not widely used because it takes up space on the motherboard. They're great until you try to use them and they don't work because you have to make sure the battery is always functioning correctly. So you don't see that too often in the wild. You certainly can't go to Amazon and get an AWS instance that has this. So the other kind of persistent memory though, instead of banging battery backed up DRAM, is to use what is called true non-volatile memory or true persistent memory. So for years and years and years, every time I taught this class, I would say, oh, real non-volatile memory is coming out one year from now, one year from now. And finally, last year, Intel stepped up and released true non-volatile memory called the Optane. And again, the idea is that it goes into a dim slot like you normally would with DRAM, but this has magic inside of it that says if I lose power, I still retain everything. It's slightly slower than DRAM, but it's going to be way faster than SSDs. So we're not going to talk about NVM stuff or persistent memory right now. We'll cover this at the end of the semester. Yes? The question is, is it really expensive? What's that? What'd you say? Oh. Yes and no. It's not like outlandish. It's like maybe a couple thousand for like 128 gigs or something like that. I don't know the exact prices. I don't think Intel publishes them. It can maybe only get it from OEM. But yes, it's more expensive than DRAM, more expensive than SSDs, but like it has certain properties that you can't get out of an SSD. You can also get it in a form factor that goes down on the PCI Express. So that looks like a flash drive, even though it's the special, basically it's not NAM flash. It's not DRAM. It's a special storage medium called phase change memory. You can get it in other form factors and that one, those prices are close to what SSDs are. But for the DIMM stuff, it's more pricey. Yeah, we'll cover this later in the semester. The point I'm just going to make is like, if you go read the early papers on logging protocols for MME databases, they all assume this, but back then they didn't really have it. So we still have to design our protocols relying on SSDs or spinning just hard drives. So the good thing though for us in an MME database is that our recovery protocol is going to be slightly easier than what we had to do in a disk-coordinated system. And this is just due to the fact that since the primary storage location of the database is in memory, when we crash or there's a failure, that memory is wiped. So when we come back, all we need to do is just load in the last checkpoint and replay the log from disk and we don't have to worry about any dirty pages from transactions that may be linking around on disk that we have to reverse as we would in a disk-coordinated system. So we don't have to track dirty pages as we run, like if a transaction makes some changes and it doesn't commit and we crash, all those memory pages get wiped. So that's fine. So this means now we don't have to store any on-do records to reverse these things because there's nothing to reverse. The only thing we need to write out the disk is just redo information. The other big difference is going to be that we're not going to log any changes that we make to indexes. So in a disk-based system, as I update my table and that causes updates to the indexes, I'm also going to create log records that record the changes to the index. But in memory database, I don't do any of that. And so when I crash, I have to load the database back in from the checkpoint file anyway. So the disk is way more expensive than doing the computation. So I'm just going to rebuild the indexes on the fly as I load the checkpoint in. And in a disk-based system, you don't do that. So we still have to deal with the fact that disk is going to be slow. And so we still want to take advantage of all the optimizations we would do in a disk-oriented system, like group commit, batching things, doing speculative lock releases, things like that. All those optimizations are still going to apply for us here, even though we're in memory. So today I'm going to talk about, first, about different types of logging schemes. And again, the paper you had you guys read was for a disk-oriented system. But it's relying on multiverging. So I think that's really interesting. And we'll see how that works. And then we'll talk about checkpoint protocols. And these two here are what you need to guarantee your database is durable. This is a way to speed things up after a crash. And this is a way to speed things up if you're going to do a restart and you know that it's going to be scheduled at a certain time. So this one you absolutely have to have. This one is an optimization to make this go faster. And this is an optimization to avoid having to do all of this. So at a high level, there's two types of logging schemes. There's physical logging and logical logging. Physical logging is like taking a diff as you wouldn't get. The idea is that we're going to record the changes that we're making to a record at the byte level. If you think of also an MVCC, if you're using a Delta Store, I'm just recording, here's the column that I modified. And here's the new value. So then after a crash, we're essentially going to replay these log records and reapply the changes to the columns. The other approach is do logical logging. And this is where instead of storing again the low level bytes we're modifying within a tuple, we're just going to store the high level operation that the application invoked or requested on the database to make that change. So it would be like recording the actual SQL statements that they would send us. The insert update delete queries. So the idea here is if I have a database at the table, it has 1,000 tuples, make it big, maybe a billion tuples. If I'm doing physical logging with one query that needs to update all 1 billion tuples, then I need 1 billion log records that correspond to every single tuple that I've modified. But if I'm doing logical logging, then I only need to record that single update statement. And that's enough for my system after restart to replay that update and put me back in the correct state. So logical logging is going to be much smaller log files and potentially much faster to commit at runtime. But the recovery process could potentially be more expensive because there's no magic I can do to make this update go faster during recovery. So if I have to update a billion tuples, and it takes an hour the first time I run it, after I crash and restart and replay this update query, it might take an hour again. So it might delay the amount of time it takes for my database to come back online. So for this reason, most systems are going to choose physical logging. And this is only used in certain specialized systems. Now the question is, when do we actually flush the log records that a transaction generates? So the first approach is to do all at once, meaning as a transaction runs and they're making changes, they're executing queries to update the database. Doesn't matter whether it's physical logging or logical logging, I'm just going to record in memory in this log buffer for my transaction, here's all the log records. And only when the transaction says go ahead and commit, we pass our validation, then we hand off those log records to some our logger thread that then flushes it out to disk. Of course, we have to wait until all those log records are written to disk before we can tell the outside world that our transaction committed. So it could be, if I update a billion tubeless, I have a billion log records, I have to wait for those 1 billion log records to get written to disk. The other approach to do incremental flushing, and this is as transactions run and they're accumulating these log records, when their local log buffer gets full, they can hand that off to some writer thread who can then start writing it out to disk and then they get a new log buffer to start filling out with new log records. So that means that if there's a crash now in the log, there may be log records from transactions that did not commit yet, because we're allowing things to get written out before everything is finished. So in this approach here for the all at once, this makes recovery easier because I know if I crash and come back, I'm almost never going to see log records from transactions that did not commit. I may see a transaction did commit and they start writing out all their log records, but we crash before we write the rest of it and the commit message. In that case, I certainly have to undo it, but most of the times, I'll have all the log records. In the case of incremental flushing, I have to do some extra work to go figure out, here's a bunch of transactions that haven't finished. Let me go ahead and make sure I don't reapply their changes. So this is not that big of a deal because, again, we don't write out any dirty pages. We're reloading from the checkpoint anyway. It's just we have to do some extra processing in the log to make sure we skip things that shouldn't have committed. Yes? So the first thing, if a transaction is like, say, inserting 10 billion people, and it changed them and then it did something, and those five billion things got stored, and the last five billion things did not get stored, and then it crashed also. So now, how do you undo these last five billion things? Again, there's nothing to undo. Nobody updated things. But updated what? Updated in memory pages. We crash, that memory is gone. So when we come back online, we're loading the checkpoint, there's no dirty pages, right? It's just in this case here, we'll talk about replication in the next class, but it's sort of like, when do I send the outside world to a replica? Hey, here's all my updates for my transaction. In this case also, too, if I update 10 billion tuples, I have to have a log buffer that can handle 10 billion tuples. And I may run out of memory. So for this reason, most systems are going to do this approach. But there are some advantages to doing this. All right, these are the standard techniques we talked about before in the introduction class that would still apply for us in an in-marry system. So we'll do group commit. It just means that we have this log buffer that we can fill in with log records from different transactions, and that we can then flush them out whenever that log buffer is full, and then have another log buffer start getting filled up by other transactions. And the idea here is we want to amortize the F-sync cost across multiple transactions. So if you're the first guy getting out of this queue, then yes, you wait the longest because you have to wait for the log buffer to get full. But if you're the last guy, then you're basically running with a dedicated F-sync call, right? So this is a really old technique. As I said, it's used in pretty much every system today. It was originally developed for this thing called FASPATH, which was a specialized in-memory engine for IBM's IMS from the early 1980s. But like I said, everyone pretty much does this today. The other thing we can do now, which is related to the speculative lock release or speculative reeds we saw under Hecaton and our MVCC, is that when a transaction goes ahead and commits, we don't have to wait for the updates to become durable on disk before we release all our locks. We can let other transactions start modifying the tuples that we've modified, assuming those log records will get written out of the disk, or they read things that we've modified before the log records get flushed to disk. And we have to sort of keep track of that. We did the speculative reeds, even though the transaction is logically committed. Physically, it hasn't gotten made to disk yet. So we know that if any transaction reads something that we wrote, they have to get stalled until we actually flush the disk. So that means that my read-only transaction, then it reads an update from your transaction. But your transaction has been written at disk yet. I have to wait for you to get flushed before I can get committed. Again, this is a standard technique. And our MVCC, we would know this information because we maintain the state map that says this transaction's log records have not been flushed yet. So therefore, you can't commit right away if you read something from them. OK. So I sort of alluded to this a little bit now in the last couple of slides, where if we have a multi-version system, then the delta records we're going to generate for MVCC are more or less the same thing we're going to generate in our log. It's not exactly the same because depending on how we're doing orders to newest or newest to oldest, we could be generating redo records or undo records for our tuples we're modifying. Whereas the database log file wants to have redo records, at least for in-memory database. So the idea is that what if we can combine all the metadata we're generating for MVCC with all the metadata we're generating with the log file, and that way we're not duplicating this effort. So that was the paper I had you guys read for today. It came out just this year, or last year, in 2020-19. So as I said, it's not in-memory system, but it's a multi-version system. And I think it's a good description of how you can piggyback off of the MVCC metadata you're generating to make logging work better. So this protocol is going to be a physical logging protocol that's going to rely on the database system's time travel table, which they call the version store, the tempDB. So we're going to rely on that time travel table to act almost as a recovery log. So what will happen is we'll start writing out the changes we make to this version store to a log file, but they're only going to contain essentially the redo records of the verges we've generated. And then when we crash and come back, all we need to do now is just suck in this version store into disk, or sorry, alpha disk into memory, and we don't need to undo anything right away. We just bring those pages back in, then the database is now at the state it was at the moment of the crash. And then we just need to do some background extra work to do this logical revert to make sure that we don't read and don't persist versions from transactions that did not commit at the crash. So the problem they were trying to solve that they mentioned in the paper is that they had certain customers running on Azure in the cloud or SQL Server in the cloud where they would have these really long transactions that when there was a crash, now you need to undo all the changes. So this $10 billion table update we just mentioned. Like if I crash before I finish all updating all those tuples, when I come back, now I've got to spend a long time undoing everything. So what they wanted to do is have what they call constant time recovery, which was saying that the time it's going to take for me to restore the database back to the correct state is only contingent on the size of the log file. Because I just need to be able to read that in the redo phase. And I don't have to worry about how much time I'm going to spend going in reverse direction to undo things. Does that make sense? So let's go to some examples, and then we'll see how it works. So again, this is with SQL Server. Azure SQL is the name of the cloud version of it. But as I said, for this feature, they're putting out in the on-premise version, which includes the Linux one. So they're doing time travel version storage, but they just call it a version store. And so the idea here is that you have the main table, and this has the latest version. And they're doing newest to oldest. So this version has a pointer now to another tuple that's the next oldest tuple, and then there's a version chain allowed to go back in time to find previous versions. So now if a transaction comes along, almost update this tuple here, we first make a copy of the current version into our version store, update the pointer to connect the chain, and then we can overwrite the old version with our new one here. So the idea here is that they want to leverage this thing, which is basically redo information to then allow them to restore this data to the database just by loading this thing back in. So there can be two variants of how they're going to do this persistent version store. So the first one is that they're going to do in-road versioning, where this is actually not going to use the version store. This is going to look like that cicada inlining, where instead of making a whole copy of the tuple and putting it to the version store, they're going to pack it in the special field inside the tuple of just the delta record of what was modified. And then that gets written out to the right-of-head log, just as it would any other tuple update. And that's enough information to be able to recreate the database state. I'll show this in the next slide. And then for the off-road versioning, which is actually using the time travel table, instead of having a time travel table for every single logical table in your database, they're going to have one giant time travel table for every single table. This is way different. I've never seen this used before. And so the way it works is that in here, like instead of having discrete columns or attributes that represent the columns in the table, they're just going to have one column and then to store a blob or byte string of the serialized data for that tuple. And that way, you can store any tables, tuples inside there. Because we're not doing any processing on looking at the actual columns in this. We're almost using it as a in-memory log that then gets flushed. So they're also going to modify the data table data structure itself so that you can do fast concurrent inserts. So that means that they're going to partition the database in memory so that different threads right to different regions of the table. And then this allows you to write them out very quickly to the right-of-head log. And you never can allow to go back and update any other tuple. It's only append only. And now you do sequential writes out to the log, which is efficient. So this is the in-route versioning. This is the same thing we talked about with Cicada. Basically, when I do an update of this tuple, instead of making this copy of this old version and putting it into the time travel table, there's embedded delta record inside of here. And then that just gets written out to a right-of-head log just as you normally would. So an important distinction here about why they can do this and why we can't do this very easily in an in-memory database system is that in a disk-oriented database, the tuples can be very, very length. In a in-memory database, we said we had the primary location of a tuple has to be fixed length. So we either have to allocate this delta space for every single tuple. It can only be a certain size if we're in memory. But in a disk-based database, I can leave this thing empty. And then if I decide to use it when I insert the new tuple, then I just grab a new slot in my page and can write everything out, right? That's one important distinction between what they're doing and how you do this in-memory database. OK, so let's see how they're going to do recovery. So this almost looks like standard Aries, except that under Aries, the database is not available until you complete the undo phase. In their world, it's immediately available after the redo phase. Because all they're doing is they're loading back in that version store. And then the old versions that are around are now available. And then we just piggyback of all that same version detection or identification that we do under NVCC to identify which tuples are actually visible to our transaction. So analysis phase, again, we just scan through the log up to the last checkpoint to figure out what transactions are running. And then in the redo phase, we're going to replay the right-of-head log to put us back in the correct state for the main table in the version store. But as I said, the version store is only going to contain versions from transactions that did not commit, as well as in the main table. But because we're NVCC, we know what the timestamps were for those versions. And then when we come back online and start running regular transactions, we would know to skip over things that were uncommitted. Because we're going to maintain this global state map that says, here's all the transactions that are around, and here's what their status is. Then in the undo phase, we select transactions to start execute. And when they find things that are aborted, they can go ahead and either clean them up, or there would be a separate background process to clean them up asynchronously. And they refer to this thing as a logical reverber. For my purpose, it just sounds like garbage collection. Although the fact that transactions can do something slightly different when they find aborted versions is not exactly garbage collection. So this is already what it says. So you have separate threads. They're going to scan through the blocks. Find all the criminal versions. And then if you recognize that the latest version of a tuple is in the version store, then you just move it back into the main table. Again, that's the same thing we would see in VEGL GC that we've already talked about. The only one optimization they do that's slightly different is if a transaction recognizes that, oh, in the main table I see the current version, the master version of the tuple, is from an aborted transaction, then instead of copying that out and making a new version in the time travel, there's a completely overwrite it. So again, like I said, I think this is interesting because this is, although it's a disk based system, it's showing you how you can use the MVCC metadata to do a tiffer type of logging scheme and potentially get better performance. Any questions about this? There's this other bit about the syslog where they would store changes for non-version data structures, like page tables and D-plus tree layouts. That, again, we can ignore all of that in memory system because we're not going to maintain that information. Yes? Our extra cost when they go to storage and that thing is serialized so they can't like. Yeah, so he's scraping it and he's correct. Does this mean at runtime, if I need to go read an old version, I have to go to the version store and everything just serializes a blob, do I pay a penalty for having to read that? Yes. But I think they measured it's kind of small. Same thing with the delta record. If the in-road delta record, I have to reply it and they pay a computational cost for that. If it is in-road, then you don't have much penalty, right? This is what they say that. Yeah, but it's not as free as like going, reading, and having everything there. I agree with you, yes. Yeah, that's why we use delta store in our own system. Most transactions, I don't remember if they say this in the paper, most transactions don't update all columns. It's right, the delta store, the in-road version is usually going to be enough. So this idea of what they're doing is not new. This is actually very similar to how Postgres was originally designed back in like 1986. Again, Postgres, if there's a paper called The Design of Postgres, and Sturmbaker talks about how, oh, well, the log is not really a log file, it's a log table, and everything's getting appended to that and that gets flushed out. So at a high level, this is very similar. And they claim the difference here is that this paper is about taking a system that was not designed to be multi-versioned in this way and adding this after the fact, whereas Postgres was designed from the very beginning to be like this. Again, Postgres is an MVCC system. So again, this is a protocol that takes advantage of the fact that we're multi-versioned. So now let's look at another logging protocol that is for MMORPG databases that's more about how you could architect the system to be aware of what the CPUs actually work and how they operate. So Silo is a MMORPG OTP embedded database engine that was developed out of Harvard by Eddie Kohler. So it's not going to be multi-version. It's going to be a single version system that uses OCC. And they're going to use an epoch-based garbage reduction protocol to keep track of when you actually commit transactions and flush things out the disk. So Silo is the same people that developed the mastery. They wrote mastery first, and then they built Silo around it. So this paper here proposes a physical logging technique called Silo R that is going to try to paralyze as much as possible all the logging checkpoint and recovery processes so that we get the best performance. And the way they're going to do this is that they're going to disaggregate the log across multiple files, which could be stored on multiple disks and allow them to be read in in parallel at the same time. So the way this works is that for every single CPU socket in my system, I'm going to have a dedicated log thread and dedicated log file. And the idea here is that we want to localize all the memory writes we have to do to our local socket so we're never going over shared memory to another socket over the interconnect. And then that way, that log thread can be just pumping data out as fast as possible to the log file. So now as transactions run, they're going to get a log buffer from their local logger thread and then write out just redo information. Because we're in memory. We don't have to do any undo. And in this case here, to make it simple, assume that all the log records for a transaction will be written out. Actually, in this case here, you can do incremental. But in most cases, it could be done all at once. So again, everybody save some of this. So every logger thread is going to have a pool of log buffers. We hand them out to our local worker threads that are on the same socket as us. When my buffer gets full, I hand it back to my logger thread. The logger thread can then start writing out the disk, which means you can do incremental flushes. And then the worker thread tries to get a new log buffer. If there's no more log buffers available, then the worker thread has to stall until the logger thread comes back and says, I flushed out enough, here's a new log buffer. So now what is going to happen though, is that because now our log file is broken up across multiple files on different disks, it may be the case our transaction might modify data that's managed by different CPU sockets. And therefore, the log records for that transaction might be broken up across different disks. So I need a way to coordinate all of them so that I don't have, if I touch two log files, my transaction commits. The first log file gets flushed, but the second one doesn't. And then I tell the outside world that I crashed, and I come back. And now half the updates from that other log file never made it to disk. So you need a way to coordinate all of them so that you know that all the updates for a transaction have been safely flushed. So this is what the epoch is going to do for us. So what's going to happen is every so often, this epoch will get incremented, and that forces every log thread to write out the current contents of the log buffer. And then we record what was the epoch everyone wrote at. And now we know that any transaction that has been committed and flushed to disk prior to that epoch is now durable. And we can tell the outside world that we've committed. They're going to use this epoch in other ways of the system to minimize, again, coordination across different CPU sockets. But for our purposes here, we can just focus on how we do it for logging. So our log records contain the idea of the transaction that modified a given tuple. And SILO is going to do serializable isolation or serializable scheduling. So that means that the transaction ID will be enough to guarantee the ordering, the correct ordering of the updates to the database. So if we just replay the log in the order of the transaction IDs, then that will put us back into the correct state. At lower isolation levels, you can't do this because I may update something, and you may read something and update something else. Unless we have a log sequence number to order those things, we may come back and replay things in a different order. So by being serializable, we can guarantee that this will put it back to the correct state. Yes? Isn't that? So isn't this snapshot isolation, if you guarantee that if your first writer wins, then that also solves that, too? The main takeaway is that the transaction ID is enough to guarantee the ordering to put you back in the correct state. And you don't need a separate log sequence number. Because with a log sequence number, if I'm deco-ordinate that across multiple sockets, then that's a bottleneck. So the log file is going to be a triplet. Just the table, the key, and the value that gets modified. And the value can just be the delta record of what the change actually was. So again, if I'm doing a simple query to update all the people that are lame with Matt and myself, then the log record will have a separate log record for every single tuple that this thing modified. So here's the high-level architecture of the system. And again, the idea here is that I'm going to expose to you or show you that this is just physical logging. So there's nothing really novel being done here. It's how they organize the system that I think is actually quite interesting. And I don't see that. I haven't seen this in any other system. So again, the transactions are going to update. The worker threads are going to execute transactions. So they're only going to execute these things as stored procedures. They call them one-stop transactions. But the basic idea is that we do a request, like an RPC, say, execute this transaction. And all the program logic for what that transaction is going to do is embed inside the system. So we never go back to the client. So everything is done in one invocation. So when a transaction starts running, the worker thread has to go to the logger and get a log buffer. And once it has that, it can start filling up the changes that it starts making to the database. And at some point, this log buffer will get full. So we hand it back to the logger thread and say, this thing's full. Please flush it for us. And we'll go ahead and get another log buffer. So then now at this point here, say that transactions are still running. This other epoch thread waits up and says, all right, now the new epoch is 200. So that's going to force all the logger threads in the system to now flush whatever buffers that they have, including any ones that were handed off before. So now the worker thread has to hand back the log buffer to the logging thread. And then it could keep on running. It could say, now give me another log buffer to start filling up. But in this case here, there aren't any more. So it's going to have to stall and wait. So now the logger thread can start flushing things out the disk. And as it flushes them, it frees up log space. And then we can hand back the log buffers and let this guy keep on running. So in the simplest world, assume every transaction will be finished within epoch. If it spans multiple epochs, then you basically have to keep track of this transaction was around. It spans multiple epochs. So you have to go back further in the log to try to figure out what actually happened to it. But the main idea here is that by having this backpacture pressure mechanism where if we ran out of log buffers, we don't allocate more memory. We make this guy stall. That prevents us from generating log records faster than we can actually write them out the disk. Because otherwise, the log buffers will grow infinitely and will run out of space. So let's talk about this persistent epoch thing actually does. So every logger thread is going to have its own file where it records all the delta records that transactions are generating when they run. But then there's going to be a special log file where we keep track of the highest epoch that all logger threads have flushed out successfully to disk. Now, everyone is flushing at the same time when the epoch increments. You tell everyone to flush, but then it may not all happen exactly at the same speed. And only when everyone says, all right, I flushed it, then you go update the persistent epoch. Now, you don't need this for correctness. This is actually just an optimization so that when you crash and come back, you can just look at this one file and say, all right, what's the epoch I need to start with? Otherwise, you'd have to go look at every single file to figure out what's the intersection of the epoch across all of them. This is just an optimization. It's nice to have. And the overhead of it is somewhat small other than an F-sync. So we know that if this thing gets written out the disk, then we know that any transaction that executed in an epoch that's less than or equal to our persistent epoch, we know is durable. So it looks like this. Say we have now three logger threads, and each have their own log file that are running out the disk. And each have a bunch of worker threads. And again, these guys are just going and getting the log buffers from the logger threads. And then you have this now, the special persistent epoch thread that's going to update the file on disk. Every so often now, the epoch changes. So everyone has then flushed out all the changes they have up to that epoch. Once they all then confirm with this epoch thread, the persistent epoch thread that they've written out the disk, then we're allowed to go ahead and write out the persistent epoch file. So for this one, like I said, you don't need this for correctness. This is just an optimization. So I don't think you actually need this to be another F-sync. Like these, you want F-sync to know that you actually made the disk. I think for this one, if you crash into F-sync, now you have the F-sync. If you're going to rely on this to figure out what the intersection is of the epoch across all of them, then you have the F-sync. But that could be just another five milliseconds. Yes? At every 200, 300, I'd be also writing the pages What pages? The in-memory pages, right? We are writing them and these are not checkpoints. They're not checkpoints. These are just log records. Then we have to start from the starting thing. Three more slides. We'll get the checkpoints, yes. But yes, this is what checkpoints solve. Yes. This is just, again, the architecture here is how to do disaggregated log files across multiple disks and you just have some centralized location that you only update whenever this thing gets changed in the silo paper. They do this every 40 milliseconds. So in a real system, this is problematic because if you need sub-millisecond latency of your transactions, you're not going to be able to get that with this. You have to crank this thing down. When I asked them why they picked 40 milliseconds, they said this seemed like a decent number. So you could ratchet it down so this thing gets updated every 10 milliseconds or five milliseconds. But now you're flushing a lot and this thing's getting written out a lot. And you now could potentially have transactions in multiple epochs and you have to do more stuff on recovery to handle this. All right, so now, next slide, we get to your question. So we'll talk about checkpoints in a few more slides in more detail, but as he said, if you don't have checkpoints, then these log files grow forever and when I crash and restart, I gotta go back and potentially look at the entire log file. So every so often they're gonna take a checkpoint and then when you, after recovery, you load the last checkpoint in and that sort of bounds how much log file you have to look at. And they're gonna rebuild the indexes based on the checkpoint, as I already said, because we're not gonna make any log records for the indexes. Now, what is gonna be different though than a disk based system, which is super interesting, is that when they do recovery, instead of doing in the redo phase we saw with SQL server where they start at some point in the past and they process log records going forward in time, silos actually can start at the end of the log file and go in reverse order and start playing log records from newest to oldest. And again, we can do this because we're in memory because we know that there's no dirty pages sitting around that got loaded in from the last checkpoint. So as we replay the log, we just need to know that what should be the final state of the database of a tuple in order for me to say the database has been restored. So if I have a tuple that's been updated 20 times, if I'm going in reverse order, I don't need to replay that log record for all those 20 updates. I need to find the last update and apply that, right? That's totally different than as you would do this in a disk-oriented system because again, dirty pages may have written a disk. So at some point I need to replay everything and then undo everything. That shouldn't be around, right? So what they're gonna do again, they're gonna keep track of the transaction IDs and every tuple order to keep track of what was the timestamp of when this tuple got updated. So as I'm replaying the log in reverse order, if I find a log record that has a timestamp that's smaller than the current tuple's timestamp, meaning it was updated by a transaction in the future that I replayed earlier in the log, then I can just ignore that log record and I don't have to apply it. So maybe in the case, if I'm only updating a small number of tuples over and over again in my log, I may be able to realize that within maybe the first megabyte of log record data, I can ignore everything else after that. I still wanna look at it but I still, I don't have to replay it, yes. This won't work if it's not single version, right? This question is, this won't work if it's not single version. If you don't need the old versions, then I think this is okay. And so you could say, all right, well, if I don't need the old versions, I can just say this. Well, one is no transaction, it's not like after a restart, any transaction that was running prior to the restart, it's not magically reappear when you come back. So there's no actual transactions with timestamps that could possibly even read those things. So I could just ignore them. Now, if I'm trying to do audit logs or retain things and do time travel queries, then yes, I gotta replay everything, then yes, this won't work. I still need to redo everything. But if I don't care about it, I just care about always the latest version, then this works. The reason why most systems don't do this is that when you do replication, which I'll talk about next class, the replicas are essentially like in recovery mode and they're just replaying the log. So if you do it from oldest to newest, then the same mechanism you would do to do log replay is the same thing you can do for on the replicas. If now I have a specialized recovery mode where I can go reverse order, but only on the single node, then I have to have basically re-implement this twice. So SILO is the only system I know that does this. It's an interesting way to think about it though, which I like. Okay, this one I think we've already talked about, so we go look in the persistent epoch file, figure out what the most recent persistent epoch that was flushed at disk, then as our logger threads start replaying the log, they just ignore anything that's greater than the persistent epoch. And actually, I already said this all before, that because we're going newest to oldest, if I recognize that the tuple has already been modified by a transaction that came later in the log that I've already processed, then I don't need to replay that log record. Okay, okay. So now we also get to checkpoints in more detail. As we already said, the log file can go forever. So that means that I potentially have to replay the entire log every single time I have to restart. If I have one year's worth of log without a checkpoint, then that potentially take me one year to restore the database, which is nonsense. Like nobody could do this. So for an MME checkpoint, the different approaches we're gonna choose are gonna be tightly coupled with our concurrency scheme. And in some ways, if we're focusing on multi-version systems, MVCC, then checkpoints essentially can become easy if we, depending on what we want the consistency level to be in our checkpoint, right? Because it could just be, we just have a starter transaction that takes a snapshot, scans through a table, and we just write out all the versions of the tuple that were visible to our snapshot. And anything that's not visible to us, meaning it came in the future, we'll have the log records and we just replay those, right? So there's a paper written by Dana Boddy, who did some early work on Comm Source, a few years ago in SigBod. We basically lays out what are the ideal properties you'd want for a checkpointing scheme and memory database. And these sort of seem obvious, but it's important to just keep these back of our mind. So obviously we don't wanna slow down the regular transaction processing because it's not good if we can run really fast and all of a sudden we take a checkpoint and now the speed of our system is cut in by half. So the conventional wisdom for checkpointing schemes is that about a 10 to 15% overhead is considered acceptable. So every so often if I'm taking a checkpoint, if I get 10% slower, then people are okay with that. Likewise, you don't want a sort of huge latency spikes meaning I don't want a blocking checkpoint scheme. I don't wanna lock the system or lock a table while I take the checkpoint and have all these transactions queue up behind this. And then finally I release the lock and then they're allowed to run because that's gonna be a huge spike in our latency. And people would pay attention to this. And the last one also too is that we don't wanna require any excessive memory overhead, meaning ideally we don't have to take a complete copy of the database in memory as we write down a checkpoint. We wanna be able to minimize that overhead because that puts pressure on our dims and our memory bandwidth and our caches. So we're gonna reduce this as much as possible. So let's talk about the different properties you can have for a checkpoint for an memory database. So this one here is, this idea here is very similar to what we talked about for disk based systems last semester. So you have this notion of fuzzy versus consistent checkpoints. So a consistent checkpoint is when the snapshot of the database that's written to disk only contains updates from transactions that committed. Again, think of it just like under MVCC with snapshot isolation, the file I write to disk only contains the updates from transactions that committed before the transaction started. So now when I crash and restart, when I load the checkpoint in, I don't have to worry about does my checkpoint contain changes from uncommitted transactions? It only contains changes for committed transactions. So again, this is easy to do with MVCC. I run my query that scans the entire table and write it out. The other approach to do fuzzy checkpoints, well this is where the snapshot could contain updates from transactions that committed after my checkpoint started. So my checkpoint starts running and I have a transaction that updates two tuples and say I scan through half the table, then the transaction updates a tuple I've already passed through, but a tuple I haven't passed through yet. My checkpoint now will contain half the updates of that transaction. So now I have to do some extra stuff when I come back and recognize that, oh, there's an update that I may have missed because this guy was running when I was running and I make sure I find the log records to reapply things correctly. So this is the easiest to do with MVCC. This one is going to potentially be faster and have less memory overhead because now I don't have to worry about maintaining old versions and pausing the garbage collector. Most of them choose this. This one has advantages for storage overhead. The next is how we're actually gonna do the checkpoint. So as I already said, a do-it-yourself implementation of this would be just a sequential scan on the table and write out every single tuple that I find that's visible to me. Another approach is to do an OS fork. So this idea is interesting. Because we're in memory database, when we call fork in the operating system, what happens? We have a child process. What's in that child process memory? It's the exact same thing as the parent process. Yes, question. It's not on-demand, it's on-demand. You don't copy the entire thing. Correct, yes. So the way it works is there's a copy on right. So I call fork. The child process now has mapped in its virtual memory table all the same pages as the parent process. But if the parent process updates any of those pages or if my child process updates any of those pages, then the OS will make a copy of it and remap it for your process. So as the parent process starts modifying in memory, the database, the child process won't see this. So the only sort of well-known data system that actually does this approach is Redis. So this is how Redis takes checkpoints. And they can do this because they're single threaded, single threaded engine. So it is Paul's transactions do the fork, and then now the child process can has a consistent snapshot that it can start writing out the disk. If you don't, if you're not gonna pause all transactions or updates while you do this, then in the child process, you now need to reconcile the database to remove any uncommitted changes from transactions that were running at the time you forked. So Hyper actually did this back in the day. So this is a paper from 2011. This is the first version of Hyper. It was actually influenced by a system that I was working on or helped build H-Door, which then became BoltDB. So they basically sort of built their own version of BoltDB, but they also wanted to do analytical queries. So they would do OS fork, and then on the child process, they could run analytical queries without slowing down the parent process that was running transactions. And then they also could then take the child process and write out that checkpoint to disk. But again, without slowing down the parent process execution. But because now when they took the checkpoint or did the fork, there might have been some inflight transactions that were running at the same time, they then, in the child process, you need to look at the undue logs for those transactions, which are in memory, and make sure you reverse the database, reverse those changes so that, again, you have a consistent snapshot. And then after some period of time, either when the checkpoint was written to disk or when you're finished processing your analytical queries, they would kill the child process, all the memory gets cleaned up, and then the parent would fork it again. So again, Redis is the only one that does this. It's easy to do because they're single threaded. I don't know of any other system other than Hyper that has attempted this. Well, we try to do this in each store, but we were based on the JVM. If you read the manual for the JVM, it says don't fork it. We said screw that, we forked it anyway. But it has all sorts of problems because you have a bunch of zombie threads because the garbage collector doesn't start, keep running again, other background things don't run. So it would work, but it was a bad idea. Okay. The next issue is that what are we actually going to store in our checkpoint? Okay, the two approaches are do complete checkpoints, the delta checkpoints, complete checkpoints is taking whatever's in my snapshot on my table or my tables in my database and just write them out entirely out the disk. The delta checkpoint is where you try to recognize well, what has changed since the last time I took a checkpoint and only write out those updates. So most systems do this, right? The only system that I know that does delta checkpoints is Hecatom. Because the issue is that with a complete checkpoint from a sort of administrative management standpoint, I have this file now on disk that I can say oh, this is my checkpoint, this is the exact snapshot of the database at this given time. With the delta checkpoint, I need to retain a bunch of deltas because there may be some updates that were in this snapshot, but not the next snapshot. And in order for me to make sure I put the database back into the correct state, I need to have all of them. So when Hecatom does this, they have a background thread that will start coalescing combining these delta checkpoints to make it basically one giant complete checkpoint. But you need to be mindful of this, what the file contains that you're looking at. So this is easier to implement, it wastes more space, but from engineering and management point, this one's better. One way to also to make this not have a huge storage overhead, like if my database is one terabyte and then since the last checkpoint, I only update one megabyte, this thing stores one megabyte, this stores one terabyte over and over again, if I store the data uncompressed on a file system that supports deduplication, then the pages of memory that I write out are gonna be duplicated over and over again and the file system could compress them down for me. So you can rely on things outside the database to make this thing actually tenable. All right, the last one is gonna be the frequency of how often we're going to take a checkpoint. Again, we could just take a checkpoint all the time, that could slow down the regular transaction workload. And so typically what you do is either say, I'm gonna fire off the checkpoint at a fixed interval, like every five minutes, or I fire off a checkpoint after I've written a certain amount of data to my log file, right? So in this case here, like this one you can bound how much time it's gonna take for you to recover. Like say, if I crash, I want my database to come back within five minutes, so I can take a checkpoint every four minutes, so I know that when I crash, I only have at most four minutes of log I need to replay to put me back in the correct state. This one, you can sort of do the math and figure out, oh, if I can replay the log at one megabyte per second, then if I set it so that I take a checkpoint after 100 megabytes, then I know I can recover in 100 seconds. So they're essentially the same thing, it's just sort of a different way to think about the problem. And again, some applications where you maybe don't care about things being super highly available within a single node, they can use replicas to hide all this. So maybe you take a checkpoint at longer intervals or longer follow-up sizes. And so that way if you have a replica, if the master crashes, the replica can come up without having to recover the log. We'll cover that on Wednesday. The other thing though that we need to do, which every system has to do, is that if the data system is told, hey, we're gonna shut down, then we wanna take a checkpoint at that moment of time. We acquiesce or stop all the worker threads, let them finish whatever transactions they're running, and then take a complete snapshot of a complete checkpoint of the database. This is why you wanna tell the database, hey, I wanna shut down, just don't pass it, kill-9 into a hard sigterm. You want the database to be able to write things out gracefully, because otherwise if you do this, then you don't have to replay the log, because you know the database is in the correct state. So this is just a quick summary of what some in-memory data systems actually do. And as I said, most systems in terms of what they're actually gonna store are gonna do complete checkpoints. Only Hecaton is doing the Delta one. And then what's sort of interesting too is like, some of the MVC systems like Hecaton and MemSQL are doing consistent checkpoints again, which is just like snapshot isolation, just doing a scan or anything out. BoltDB is not an MVC system, but they still have new consistent checkpoints because what happens is when you say I wanna take a checkpoint, they switch into this sort of specialized two version or multi-version system, right? Where you just have the version of a tuple that existed at the checkpoint, and then you just have another version that's always the latest version. You don't really have version change other than you only have two versions. So that's how they do consistent checkpoints. Alt-Base can actually do fuzzy end consistent checkpoints. I think they do the same thing under times 10 where like, if I'm shutting down, I'll do a consistent blocking checkpoint, but otherwise I'd normally take a fuzzy checkpoint, right? And Hanada is fuzzy with time-based. So again, different database systems do different things. If you're doing MVCC, then my opinion, doing the consistent checkpoint and taking a complete snapshot is the way to go because there's less engineering overhead of figuring out what they actually write out. You just scan through and write everything. So any questions about checkpoints? Again, after restart, I load the checkpoint in, and as I scan the checkpoint, I'm copying data into my tables, and as I do that, it's essentially like an insert. I update any indexes that are on that table so that they get populated correctly. And then once all that's done, now the database is back online. Yes? The fourth thing indexes also get copied, right? The question is, with the fourth thing that the index already get copied, depends on implementation. You don't have to. By a fourth will copy something, but you won't write it. Fourth, yes, all the indexes get copied, but you don't have to write it, right? As pie, again, it's a waste of space. My indexes are huge. It's not worth it that this guy ought to write that out if I'm gonna repopulate it anyway on recovery, right? To trade up the computation and storage. All right, so the last thing I wanna talk about very quickly is how to do fast restarts. So everything I'm talking about so far, the crash recovery for Silo and SQL Server and all the checkpoint stuff, this is assumed that, oh, well, our system was operating, our system was running, something happened, somebody tripped over the power cord, light instruct the data center, and we did a hard crash. It was unexpected, right? But there's other times we actually may need to restart the database system. What's not gonna be from a crash, right? Very commonly, maybe we have to update our OS libraries. There's a technique in Linux called case splice that allows you to update kernels without having to restart the system, but let's say you can't always do that, sometimes you have to restart the whole OS. Certainly, if you're gonna update the hardware, some disks are swappable, but DIMMs, I don't think are, like you have to turn the system off, put the new DIMMs in, or certainly, if you're moving to another AWS instance, that's a whole another piece of hardware. And then sometimes you just wanna update the data system software. Again, Oracle has techniques that allow you to do patching without taking everything offline, but most systems don't have that. So, let's say we wanna do this one here all. Assume these ones here, we have to restart. There's no way to get around this. Restart the entire box. For this one here, though, we don't have to restart the OS, we don't have to restart the hardware. So, the interesting thing to see a way if we can restart the system for an in-memory database and not have to flush a checkpoint at the disk and then load it all back in over and over again. So this is what Facebook can do in their SCUBA system. So, I'll briefly talk about what SCUBA is in a second, but if you remember from the introduction class, we spent a whole, this is what the students voted for, the most interesting system that they wanted me to talk about. So, SCUBA is a distributed in-memory O-Lap system developed at Facebook to do event log processing. So whenever you load a page in Facebook, they're gonna keep track of what every single stage, every single service it touches for that request, record all the information and then dump it off the SCUBA so you can do analytics and say, find me the, explain to me why my page request went 20% slower than yesterday. Because they're pushing out updates all the time for their web apps, they wanna know whether things are regressing. So, what they're gonna do, though, is that when they wanna update the database system software that instead of shutting it down and ticking checkpoint and loading it all back in, they're gonna write out the contents of the database, essentially the checkpoint to shared memory, restart the process, come back up, see that my database state is now in shared memory and suck it all back in. They're essentially using shared memory as a RAM disk, which is kinda interesting. So, again, I've already said this, it's a distributed in-memory system. It has a heterogeneous architecture with leaf nodes and aggregator nodes. This is not too interesting for us, but just the thing to be mindful is the state of the database is only at these leaf nodes here. The aggregator nodes and the root node up above, these are stateless and there's combining results of queries that these guys are generating. So, all the updates, inserting new data, goes to these leaf nodes, so this is the primary storage location. So, if we wanna restart these guys, we need a way to, again, write these out the shared memory so we don't have to load the check one from disk. As a high level, the way it works is that if I have a query like this, they're gonna pick up the plan fragments and say this guy goes down, these guys are all gonna send their updates up and then we just combine it together to produce the final answer. This is actually basically how MemSQL works up well. Because the story goes, the guy that founded MemSQL, he was at Microsoft, saw the Heckelton project, borrowed some of their, I was inspired by their ideas, went to Facebook, I don't think he saw Scuba, but he saw this pattern used in other systems at Facebook, saw this idea and then combined it together and made MemSQL. All right, so there's two approaches to do this. So the, we've already said this, we could just do the shared memory heaps so that we could just, in our system, we modify the memory allocator so that whenever we call malloc for the data we're storing in the table, instead of being local to my process, now it's sitting in shared memory. As far as I know, there's no overhead in the OS of saying something is in shared memory because it's just getting back a memory address and the OS knows that that memory should last beyond the process lifetime. So to do this though, again, you have to modify J, malloc or TC malloc or whatever malloc implementation you're using to be able to write things out to shared memory and be able to divide things up efficiently so that multiple threads can be writing to the same location. So in the paper, they talk about how they can't do lazy allocation of backing pages and shared memory, meaning if I call malloc and something in shared memory, the OS is actually going to need to have that backed by physical memory. So they claim this is in the paper, they Facebook bought the guy, or they hired the guy that created Jemalloc. So in the paper, you talk about how they talked to the Jemalloc guy and he says you can't do this. I posted this on Twitter, or at least the slides are this on Twitter, and then some dude reached out to me and says you actually could do this, like he actually tried it and at least in the newer versions of Linux that you could allocate memory and shared memory and not have it backed by a physical memory right away. So this part is actually not true anymore. So you actually could still do it this way. So you could have your memory allocator allocate pages on the heap and shared memory and not worry about any threat safety issues or backing it right away. But instead what they're gonna do in this version of scuba is that when I'm told my process is going to shut down, I stop all updates from any transactions, any queries, write everything out to shared memory and then I can go ahead and restart. And so they do some extra stuff that where they keep track of what's the layout of that memory that they're writing out the shared memory, like what's the version of the database and that wrote it out so that if you restart and come back and you recognize, oh, I have some shared memory contents of the database from what it was before I restarted, then you make sure that the layout is still correct. So they basically maintain some extra metadata when they write out the shared memory to say, oh, by the way, my layout looks like this because I'm on this version. So if I come back and it's incompatible, then I just load it back up from disk. So scuba's interesting system because in their world, this is not high value data, right? It's not like your timeline or all your friend messaging crap. Like it's data that they could potentially lose, they don't want to, but it's not like they lose money if this goes away. So ideally if they say a node comes back and there's nothing on disk, but the shared memory, sorry, the database system comes back, there's nothing on disk, everything's in shared memory, but the shared memory data is not compatible with our new version of our software, then those backfill it from another disk from some results. In their world, that's okay. So I like this idea. I don't know if anybody else does this with shared memory. The most famous shared memory system is Postgres, but they're obviously not in memory and they're doing this to coordinate across different processes. This is now, this is an interesting idea to think about because it's passing data from one instance of the process to the next, even though they don't actually overlap in time. They're allowing the memory of the database go beyond the lifetime of the database system process itself, which I find super fascinating. Any questions about this? Again, if my database is one terabyte, if I don't have this technique and I restart the database, then I got to suck in one terabyte off a disk, which could be slow, but in this case here, I could come back and instantaneously have everything that I need. Okay? All right, so just to finish up, the main takeaways from this is that physical logging is probably the best approach you would want to use for an in-memory database that is going to support all possible conceptual schemes. There are also some advantages we can take out, things we can take advantage of if we're using MVCC, like doing the copy-on updates to get consistent checkpoints by just relying on the idea of snapshot isolation to only see changes from transactions that have already committed. And as I'll talk about at the end of the semester, non-volatile memory is here. It is going to change how we want to do some of these logging checkpoint protocols, but the high-level idea will still roughly be the same, that we don't need to maybe restore or don't need to log any undue information. If we're careful about where we store our data, then we only need to keep track of redo information and that makes the log replay much faster. Okay? All right, so next class we'll talk about networking protocols, and I don't have a list to hear, but we'll also introduce project two. I'll post this on Piazza. You guys should start thinking about how to form groups of three because project two will be a group project, okay? If you can't find a group to be in, send me an email and we'll figure out something for you, okay? I forget how many students are in the class, I don't think many of you are dropped, so we should have enough to do exactly three, I think 13 groups of three or something like that, okay? Yes? So right now it is like this that there are six, seven groups, all of them are of two, two people. That has to be three, okay? So make friends, and again, and whoever is in your group for project three, well, sorry, project two will also be in the same group for project three, so if someone is like an a***** with hygiene problems, you know, if you can't stand run project two, then you have to deal with them project three and that's not gonna be good, okay? And if I had to break up fights, I don't know if four and I could do it again. I ideally don't want to do this, okay? Any questions? Bank it in the side pocket. What is this? Some old pool shit. Aye yo, aye yo. Took a sip and had to spit because I ain't quit that beer called the O-E cause I'm old cheap ice cube. Aye, you look, then it was gone. Take a safe eye to the brain.