All right, so today we're talking about data recovery. The idea is that since we're an in-memory database, all our changes are gonna be hanging out in memory, and we wanna make sure that if there's a crash or a failure, or someone trips over the power cord, we don't lose any of our data. Our problem is particularly challenging because we're not assuming that the data is backed by disk; everything's in memory, but we still wanna make sure that everything is durable. So recovery algorithms in general are gonna be used by our database system to provide three of the four guarantees that we want in an ACID database: consistency, atomicity, and durability. Anyway, this should all be introduction stuff, so we don't need to cover it in too much detail. But in general, every recovery algorithm is comprised of two parts. The first part is what you're doing while the system is running, while you're executing transactions and updating the database: all the stuff you do to prepare yourself in case there's a failure. And then, after a failure or a restart, it's how to recover the database using the information that we were recording while we were running our transactions normally. So for this one lecture, we're gonna focus on how to do both of these.

So if you look at the early papers on in-memory databases, going back to the 1980s when these things were first being built, they made a huge assumption in all their implementations: that they were gonna rely on non-volatile memory as the backing store of the database, where the database is actually stored. Back then, non-volatile memory was essentially battery-backed DRAM. That meant that if you lost power to the machine, there would be a little battery there with enough juice to take the contents of DRAM and write it out to some stable, non-volatile storage, like a spinning hard drive back then, or an SSD today. So all of these early systems assumed, oh yeah, you're gonna have NVM, so they didn't really worry about how to actually implement the logging and checkpoint protocols that we'll be talking about today. Now, battery-backed DRAM is still an option today, but you can't get a machine on Amazon that has it. You can go buy it from some vendor, but in practice this is not what people are using on commodity hardware. There are a bunch of other reasons too: it's quite large, because you actually have to have the battery on the motherboard, and that takes up real estate you could be using for other devices. And it's notoriously finicky in terms of reliability: everything's great if you assume your battery's fine, and then you actually need it and your battery's dead or doesn't have enough juice for you. What I'll also say, and we're not gonna talk about this this semester but I'm happy to talk about it offline, is that there is a new class of hardware devices coming out now, like this year, called non-volatile memory, that's not battery-backed DRAM but actually a new storage medium that is truly non-volatile.
Meaning it looks like DRAM, it fits into the DIMM slot on your motherboard, it's byte-addressable, you can read and write to it just like DRAM, but if you pull power, everything gets retained like an SSD. Intel has a bunch of different marketing names for this: 3D XPoint is the term they're using, or Apache Pass is another one, or Optane Memory is a third. Again, it looks like DRAM to you in your application, but it's truly non-volatile. And normally when I give this lecture, I keep saying, oh, it's two years away, two years away, two years away. It's actually now, 2019. I went back and watched the lecture from last year, and in 2018 I said, yeah, it's coming in 2019, so hopefully I'll be correct this time. Like, this is real; we actually have access to it at CMU. We have a PCIe device here in our lab, which is not that interesting because it just looks like an SSD, but you can actually get the real hardware from Intel now, although they haven't made it publicly available. So the main thing I'll tell you is that it's coming, but nobody actually uses it yet. The techniques we're gonna talk about today have to be designed for what's available to us now: commodity SSDs and spinning hard drives.

So for in-memory database recovery, the problem we're trying to solve, how do we make sure that our database is durable after a crash, is slightly easier than what we'd have to do in a disk-oriented system. That's primarily because we don't have a buffer pool anymore, so we don't have to worry about dirty pages getting written out to disk before the log records that correspond to those changes, that record how those pages got dirty in the first place, are written out to disk. This means we don't need log sequence numbers, or LSNs, and we don't need to do all those compensation log records, all the stuff we told you about last semester about ARIES in the introduction class. We don't have to do any of that here, so our life is easier. We also need to write less data out to disk than for a disk-oriented database, because we only care about redo; we don't care about undo. You need undo if you write out dirty pages that have been modified by a transaction that hasn't committed yet, because then you need to know how to reverse those changes and go back to the original state. We don't have any dirty pages, and nothing uncommitted will ever get written to disk, so we don't need undo after a transaction commits. Now, while a transaction is running, we still need to keep undo information in memory, because the transaction may get aborted and roll back, but once it commits, we know we're never gonna reverse any of its changes, so we only need redo.

The other important difference between disk-oriented and in-memory database recovery is that, to the best of my knowledge, no in-memory database actually logs the changes that are made to indexes. Now, LeanStore, which we saw last class, was slightly different, but thinking of all the major commercial database systems, MemSQL, HANA, Altibase, VoltDB, none of these systems records any modification you make to indexes in the log, right? Because the idea is that if I crash my system and come back, I gotta load the database back from disk into memory anyway, so rather than bringing back a bunch of extra stuff about how indexes were modified, I'll just rebuild the indexes as I bring the data in from the checkpoint. This also makes things easier while we're running, because we don't have to log changes to indexes, so we run faster during normal operation as well. So again, no in-memory database, to the best of my knowledge, will actually log any changes to indexes; we just rebuild them upon restart.
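To make that concrete, here's a minimal sketch of the rebuild-on-restart idea. The `Tuple` type and the `std::map` index are stand-ins I made up for illustration, not any real system's structures: as tuples stream in from the checkpoint, you just re-insert their keys into a fresh index.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical recovered tuple: a key plus its payload bytes.
struct Tuple {
    uint64_t key;
    std::string payload;
};

// Rebuild a primary-key index from scratch while streaming tuples in
// from the checkpoint. No index changes were ever logged; the index
// is derived state, so we simply recompute it on restart.
std::map<uint64_t, const Tuple*> rebuild_index(const std::vector<Tuple>& heap) {
    std::map<uint64_t, const Tuple*> index;
    for (const Tuple& t : heap) {
        index.emplace(t.key, &t);  // in-memory inserts are cheap relative to the disk read
    }
    return index;
}
```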
So this all sounds nice, but at the end of the day we're still dealing with a slow disk, right? That's gonna be the main bottleneck we have to overcome, the same way we had this problem with larger-than-memory databases last class. Just because we're an in-memory database does not make our four-kilobyte page writes out to disk go any faster. The disk doesn't care whether you're an in-memory database or a disk database; it always goes at whatever speed it can actually do. So we're still gonna have to deal with that slowness.

All right, so today's agenda breaks up into three parts. The first two parts are about how to actually do logging and recovery to restore the state of the database correctly after a crash: logging protocols, and then checkpoints. And then, as a sort of bonus part at the end, I wanna talk about restart protocols. This is when the system didn't crash, I just need to restart it, and how can I do that very, very quickly? It sort of looks the same as restarting after a crash, but for this one we can assume that we can retain memory from one restart to the next. Whereas for the two other parts you can't assume that, because a crash could be the OS crashing, my process crashing, or the machine crashing, so I assume memory's gone, okay?

All right, so the first thing to talk about is how we can do logging. What are the different techniques or approaches? Again, this should not be groundbreaking news for anyone here, especially if you read the SiloR paper. But in general, there are two high-level classes of logging schemes. The first is called physical logging, and this is where we record in the log the low-level changes that transactions make to the individual tuples or bytes of the database. In the SiloR paper, they call this value logging. Same idea: every time a transaction modifies an attribute in a tuple, we record the bits that actually got written, the new contents of the tuple. The alternative approach is called logical logging, and this is where, instead of recording the low-level bytes of how the transaction modified the database, we just store the operation that the transaction invoked to cause that change. You can think of this as storing the raw SQL statement they invoked to do the insert, update, or delete. In the case of SiloR, they called this, I think, operation logging. Now, the SiloR guys are awesome, but that paper was not published in a database conference and was not written by database people, so they refer to things as value logging and operation logging. Converted into database parlance, those are physical logging and logical logging.
We'll see another example later where they use different terms for things that we'd describe with a different vernacular. All right, so the obvious advantage of logical logging is that you end up writing less data for each log record than you would with physical logging. Say I have a transaction that updates a billion tuples. With logical logging, I would just have that single update statement in my log record, even though it updates a billion tuples. But under physical logging, for every single one of the billion tuples I modified, I would have to record the actual change I wrote into that tuple, the physical representation of the record that got modified. So this sounds very seductive; it sounds like logical logging is what we'd wanna use for everything. The challenge, though, is that if you have concurrent transactions, meaning transactions running at the same time, interleaving their operations, then it's sometimes hard to determine the order in which the different transactions modified the tuples in the database, if they're modifying the same tuple. It becomes especially problematic if you run at a lower isolation level. With snapshot isolation, which is what we've talked about so far, it's super easy, because the first writer wins, and therefore two transactions cannot concurrently write to the same tuple. But if I'm running at read uncommitted or read committed, then maybe when I run it the first time on Monday, the first transaction updates the tuple and the second transaction updates the same tuple after the first one. Then I crash, and on Tuesday I replay the log, and since all I have is the operation each transaction invoked, I may end up with a different ordering, where the second transaction runs first and the first transaction runs after it. Technically that's correct in the sense that it's equivalent to some serial order, and there weren't any torn writes, so at a high level it's still consistent, but it's not the state I had the day before, and therefore my recovery failed, right? Because I may have exposed something to the outside world about the order in which those transactions modified the database, and when I come back the second time, I can't restore that state.

The other major issue, and this is probably the one that's easier to understand, and probably why very few systems actually do logical logging, is that it's gonna take you longer to do recovery, because now I need to re-execute all the queries in the log all over again. Say my one query updates a billion tuples, and doing that update took an hour. The second time around, when I replay the log, it's still gonna take another hour. It doesn't matter that I'm in recovery mode; it's always gonna take that long. Whereas physical logging says: update these low-level bits, and that can be applied much more efficiently than rerunning the query all over again, right? So for this reason, I think nobody actually implements this. You see logical schemes in other contexts like replication, but in general, everyone always does physical logging, to the best of my knowledge.
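Just to illustrate the trade-off in code, here's a rough sketch of what one record of each scheme might carry. The struct names and fields are invented for illustration, not taken from Silo or any other system.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Logical logging: one tiny record no matter how many tuples the
// statement touches -- but replaying it means re-executing the whole
// statement, which takes as long as it did the first time, and the
// outcome can depend on how transactions interleave.
struct LogicalLogRecord {
    std::string sql;  // e.g. "UPDATE foo SET bar = bar + 1"
};

// Physical logging: one record per modified tuple (a billion tuples
// means a billion records), but replay is a cheap byte-level
// overwrite and is deterministic regardless of runtime interleaving.
struct PhysicalLogRecord {
    uint64_t tuple_id;
    std::vector<std::byte> after_image;  // the new bytes for the tuple
};
```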
All right, so the paper you guys read was based on a system called Silo, and it's an example of a physical logging scheme for an in-memory database system. Again, just like last class, we're focused here on OLTP workloads, because OLAP workloads are read-only, so we don't have to do any logging for them; this is really about doing transactional updates to the database. Silo is a very influential system. It was created by the same authors who invented Masstree, which we talked about a few lectures ago. The project was led by Eddie Kohler, who is a professor at Harvard. The dude's amazing. If you've ever used HotCRP for submitting papers, he invented that and still maintains it. He's one of the best systems programmers I know.

So Silo is a single-version system that uses OCC with the same kind of epoch-based garbage collection we've talked about so far. We'll see in a second that the single-version-ness is not gonna affect us too much, but there are things, in particular with checkpoints, where being multi-versioned or using MVCC makes your life a lot easier. For single-version, though, they're fine. All right, so they're gonna use physical logging with checkpoints, running short one-shot transactions. And one of the key overarching themes of how SiloR is implemented is that they try to avoid any kind of centralized data structure or centralized coordination between different threads, because that's gonna slow us down. They wanna run the logging in parallel, writing out parallel log files and generating the log records in parallel, without having a single bottleneck for the entire system.

So the way it works is organized around CPU sockets. Silo is a shared-memory or, sorry, shared-everything database system, so it only runs on a single node, just like everything we've been talking about so far this entire semester. Every CPU socket has some local threads running on it: a bunch of worker threads, a bunch of checkpointing threads, and a single thread designated as the logger thread. That logger thread is responsible for writing out to disk the modifications made by the worker threads running on that same socket. Each socket also gets a dedicated storage device, in order to maximize parallelism and avoid interference with other sockets' logger threads writing to other devices. That means that to get the best performance in SiloR, every CPU socket needs its own dedicated drive that it writes to.

All right, so as the workers execute transactions, they create new log records that contain the values they overwrote or installed into the database. Again, we don't care about writing undo information out to disk; we only care about redo. At some point, a worker hands off the redo information it generated for its transactions to its logger thread, and the logger thread writes it out to disk. So the first issue is: where do we get the memory for the log records we're generating? We don't wanna just malloc on the heap; we wanna make sure the physical location of our memory is close to our logger thread, so it can grab it real quickly and write it out to disk.
So what's gonna happen is the worker threads go to their dedicated logger thread and say, hey, give me a log buffer, which is just a byte array it can write log records into. Then, when that log buffer gets full, the worker hands it back to the logger thread, saying, hey, here are my changes in this log buffer, write it out to disk, and it goes and tries to get another log buffer from the free pool that the logger thread maintains. If there are no free buffers left, then we have to stall our worker thread, because otherwise we would be generating log records faster than the logger thread can get rid of them and write them out to disk, and at some point you'd run out of memory from all these log records queued up that haven't been written out yet. This is not specific to Silo: in pretty much every in-memory database running in the real world, you specify ahead of time the amount of memory you wanna designate for your log buffers. In SiloR, it's 10%. There's no magical number; it varies by database, varies by application. So in SiloR, 10% of the memory of the entire system is allocated for log buffers. Again, if I run out of log buffers, my threads just stall, and at some point the logger says, all right, I've flushed everything out, frees up a bunch of buffers back into the free pool, and then the worker threads can pick them up and keep running.

As for the log files themselves: as I said, the logger thread just grabs the log buffers that the worker threads generate and appends them to a file. The way they organize this is that, for each epoch, and I'll explain what that is in a second, the logger takes all the log buffers it has waiting to get written out and writes them out. Then, after about 100 epochs, it closes that file, creates a new file, and starts appending log records to that. The idea is that this makes it easier to manage your files: you now have an easy way to figure out which log files are old and which are new, and you can easily truncate the log without having to reorganize one single giant log file. This is a common approach, not specific to Silo; you see it all the time. So here's actually a screenshot from a MySQL 5.7 installation that I helped run here in the CS department. You see right here, and this is the old naming, I think they renamed these files in newer versions, there are these two log files, log file zero and log file one. What happens is, when a log file gets to be 500 megs, they close it and create a new one. So once I've closed this one, if there's no open transaction that could be spanning these two log files, it's safe for me to blow the old one away if I want to. If I care about auditing, then I need to keep it around, because I need to know what transactions I ran, but there's nothing in there I need for recovery, because I'd have a checkpoint too. We can ignore that for now. So again, this is just a way to make it easier for humans to manage the log files that the database system is generating.
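Going back to the buffer handoff for a second, here's a rough sketch of that free pool. This is a minimal version of the back-pressure mechanism with invented names, not SiloR's actual code: workers pull fixed-size buffers and block when the pool is empty, which is exactly the stall just described.

```cpp
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// A log buffer is just a byte array that a worker fills with records.
struct LogBuffer {
    std::vector<std::byte> bytes;
};

// Free pool owned by a logger thread. Workers block when it's empty,
// so transactions can never generate log records faster than the
// logger can retire them to disk.
class FreePool {
public:
    FreePool(size_t num_buffers, size_t buffer_size) {
        for (size_t i = 0; i < num_buffers; i++)
            pool_.emplace_back(new LogBuffer{std::vector<std::byte>(buffer_size)});
    }

    // Worker side: take a free buffer, waiting if none is available.
    std::unique_ptr<LogBuffer> acquire() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return !pool_.empty(); });  // worker stalls here
        auto buf = std::move(pool_.back());
        pool_.pop_back();
        return buf;
    }

    // Logger side: return a buffer once its contents are fsync'ed.
    void release(std::unique_ptr<LogBuffer> buf) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            pool_.push_back(std::move(buf));
        }
        cv_.notify_one();  // wake one stalled worker
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::vector<std::unique_ptr<LogBuffer>> pool_;
};
```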
So, each log record we store is just gonna be a triplet that contains: the name of the table that was modified by the operation; some kind of record key or tuple ID that uniquely identifies the record or tuple that we modified; and then the value, which is a list of pairs mapping attribute names to new values. So say I have a simple query like this: we have a people table, and we wanna set the isLame flag for both Lin and myself. The corresponding log records would look like this: for transaction 1001, we updated the people table, here are the unique identifiers for my record and Lin's record, and then we have the mapping from the attribute to the new value. Again, this is the redo; we don't care about the undo.
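Here's what such a record might look like as a struct. These are my own illustrative names and layout, not SiloR's actual wire format, and the tuple IDs in the example are made up.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// One redo-only log record, mirroring the triplet described above.
// There is no before-image: nothing uncommitted ever reaches disk.
struct LogRecord {
    uint64_t txn_id;     // e.g. 1001; also encodes the serial order
    std::string table;   // e.g. "people"
    uint64_t tuple_id;   // unique identifier of the modified tuple
    // (attribute -> new value) pairs: the after-image only.
    std::vector<std::pair<std::string, std::string>> writes;
};

// The UPDATE above would produce two records, something like:
//   {1001, "people", 88, {{"isLame", "true"}}}   // my record
//   {1001, "people", 91, {{"isLame", "true"}}}   // Lin's record
```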
All right, so let's look at the bigger picture of how SiloR works. We're gonna break it up into the three different components of our system, all running on one socket: we have our worker threads, we have the logger thread, and then we have the storage device where we actually store our log files. And then we have an epoch thread. This is the same epoch mechanism from the garbage collection we talked about before: some other thread that, every so often, increments a global epoch counter, and then everyone synchronizes and says, all right, at this epoch, we need to do something. They're using this as a way to avoid synchronizing at a fine-grained level as transactions run; it's a coarse-grained batching mechanism where everyone synchronizes whenever the counter increments.

So say we have a transaction request that starts executing on one of our worker threads. In Silo's parlance, they call these one-shot transactions, where you have the logic of the transaction embedded along with its queries, running directly inside the database itself. In the database world, what is this? Stored procedures, right: same thing, different name. So this stored procedure starts executing and starts updating the database, so we need to generate log records. We go to the logger thread that's dedicated to this worker and get one of its log buffers from the free pool. Once it has that, the transaction can run and make its modifications, and it puts all its log records in that buffer as we update the state of the database. At some point, the buffer gets full, so we hand it back to the logger thread, saying, hey, this thing's full, write it out to disk, and we go try to get a new log buffer. Now, at some point the epoch thread says, all right, we're transitioning into a new epoch. When this happens, all of the worker threads have to hand off their log buffers to the logger thread, regardless of whether they're full or not, because this is the point we're synchronizing on. So we hand ours off, and now that we're in the new epoch, this worker thread could start executing more transactions, but because we don't have a log buffer, we gotta go back and get a free one. In this example, though, there are no free ones left, because they're all queued up waiting to get written out to disk. So when this happens, we have to stall our worker thread, because there are no log buffers for us to write into, right? Then our logger thread in the background starts flushing these log buffers out to the log files, and calls fsync to make sure they're actually durable on disk. When that's done, it takes whatever log buffers it wrote out, puts them back in the free pool, and notifies the worker that buffers are now available, so it can pick one up and actually start running again.

So, yes? Can a worker thread use the same buffer for more than one transaction? The question is, can a worker thread use the same buffer for more than one transaction? Yes. The way SiloR works, you can essentially think of transactions as executing in batches: transactions don't commit until the epoch switches, which I think in Silo is every 40 milliseconds. I asked them why it was 40 milliseconds and not some other number. They said they just picked it, pulled it out of the air, so 40 is not magical. Everyone's gonna validate and synchronize when the epoch transitions.

So that's the simple case, where we have one socket, one worker thread, one logger thread. Now let's look at what happens when we scale this up on our single box and have multiple sockets with multiple threads running at the same time. The issue we're dealing with is that if we're trying to avoid coordination between the different sockets, then we need some way to keep track of how far each log file has been written in the timescale of our epochs, without having to check all the time. So they introduce a special logger thread that maintains this thing called the persistent epoch, or pepoch. That's just a log file that keeps track of the highest epoch that it knows has been flushed to disk, and is therefore durable, across all the sockets' logger threads. Then the rule is that a transaction can only be considered fully committed, and therefore we can send an acknowledgement back to the application server saying your transaction committed, once we know that the epoch it ran in is less than or equal to the latest persistent epoch. That guarantees that no matter which socket's data this transaction may have modified, all of its log records have made it out to disk.
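A tiny sketch of that commit rule, with a made-up atomic counter standing in for the pepoch machinery:

```cpp
#include <atomic>
#include <cstdint>

// Highest epoch known to be durable on disk across ALL sockets'
// logger threads. The dedicated pepoch thread advances it only after
// every logger has fsync'ed its records for that epoch.
std::atomic<uint64_t> pepoch{0};

// A transaction that executed in epoch `txn_epoch` may be acknowledged
// to the client only once the persistent epoch has caught up: at that
// point all of its log records, on every socket, are on disk.
bool can_ack_commit(uint64_t txn_epoch) {
    return txn_epoch <= pepoch.load(std::memory_order_acquire);
}
```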
So let's look at a larger example. Now our database machine has three CPU sockets, and each of these sockets has a logger thread writing to a dedicated storage device that only it can write to. Then we have our worker threads, and as before, these guys are grabbing log buffers, filling them up as they run transactions, and handing them off to their logger thread. But now we have one additional logger thread, a special one that's been designated to maintain the persistent epoch, which I'm marking with the crown here, and it records the pepoch log file on one of the drives. It doesn't matter which one; just pick one. So what happens is that when the epoch counter increments and we go up to, say, 200, that triggers all the loggers to write out all the log records from that point or before. And then, once we know that our three logger threads have written out the log records corresponding to that epoch, the persistent epoch thread is allowed to update its log file and say: everything has been written out up to this epoch number.

Now, I'll fully admit, after thinking about this, I actually don't think you need the pepoch file. It's a nice-to-have, not a required-to-have. If you don't have it, then all you really need to do when you boot the system up is go figure out the highest epoch that all of your logger threads actually wrote. At runtime, you still need to track in memory that everyone has written out up to a given epoch, because that's how you tell the outside world whether a transaction has committed or not. But you don't actually need the file itself, because when you come back, you can figure that out just by looking at the log files. So I think that's correct: I don't think you need it, but it makes your life easier when you boot back up; you do less work. Not 100% sure, but I think that's the case. So, any questions about the persistent epoch? No? Everyone understands it immediately? Okay, all right, cool.

All right, so let's talk now about how we actually recover after a crash. Remember I said that every recovery protocol has two parts: what you do at runtime, and what you do after a crash when you need to recover the database. In SiloR, they do essentially what every in-memory database does: you have two phases. In the first phase, you load in the last checkpoint that you took, which restores the state of the database up to that checkpoint. This is also where we rebuild all our indexes, because remember, we're not writing those indexes out to disk; as we stream the data in from the checkpoint, we rebuild the indexes. Reading data from disk is way more expensive than building an index: when we looked at the Bw-Tree and Masstree and the B+tree, those were doing like five to ten million inserts per second. That's fast enough for us to rebuild the indexes, and it's not worth the penalty of having to read them from disk. Then, once our checkpoint is loaded, we start replaying the log to put us back into the correct state we were in at the moment the system crashed or went down. The checkpoint gets you up to some point, but then you need to replay all the log records that got generated after that checkpoint.

One of the interesting things SiloR does that's different from how you traditionally talk about log recovery is that they replay the log in reverse order, meaning they start with the newest log record and go backwards in time. That's somewhat different from how we talked about ARIES, right? ARIES is all about figuring out at what point you need to start in the log and then replaying forward in time. SiloR goes backwards. And what they do is keep track of which tuples they've modified as they replay the log, so that if I'm going back in time and I see a log record that modified tuple A, and then I go back a little farther and there's another log record that also updates tuple A,
I want the first one I saw, because that's the latest one, so I know I don't even have to bother replaying any other modification of that same tuple. If you're going forward in time, you don't know what the latest version is gonna be, so you have to apply every log record as you encounter it while replaying the log. If you go in reverse order, you don't have to do that. The other interesting aspect is that the transaction IDs are enough for us to figure out what the serial order of our transactions should be. SiloR is a serializable system, so we can use the transaction IDs to establish the correct ordering of transactions upon replay. That part is nice, and you can do all of this independently, because every socket has its own logger thread and its own log files to replay, and we don't have to worry about any ordering issues across different log files, because we know that if a given transaction's records are in a log file, it only modified data that's managed by that socket.

All right, so to do the replay, we first check the pepoch file to determine the most recent persistent epoch, and again, as I said, we don't strictly need this; it's a convenience. If we find any log record that comes after the persistent epoch, then we know we should ignore it. Because it's not like everything gets written out to disk only when the epoch number flips: the logger threads are writing to disk all the time, since they have to free up space in the log buffers. So maybe everybody flushes out the log buffers they have for the persistent epoch, and then, before the next epoch gets incremented, a bunch of other log records get written out. When we come back, since those log records come after the highest persistent epoch we recorded, we know we can ignore them, because those transactions never actually committed. They committed internally, but we never exposed their changes to the outside world; the outside world doesn't know they committed, so we don't have to roll anything back, we just ignore those records. And as I said before, we replay the log records from newest to oldest, and we check whether we've already installed a value for a tuple, and if so, we skip it. Yes?

But without the persistent epoch, how do you know which log records to ignore when you replay? So the point is, if you don't have a persistent epoch, then you can't immediately tell which transactions never committed, so you have to do a second pass: you do one initial pass over all the log files to determine the highest epoch that every logger fully wrote out. That's essentially what the persistent epoch does for you. You do one pass to figure out the highest epoch everyone has, and then you do a second pass where you replay and ignore everything that comes after that high-water mark. Same idea, yeah. Is that clear for everyone? Okay.
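Here's a compact sketch of that backwards replay. The types are invented, and it ignores the per-epoch transaction-ID ordering a real implementation would also apply:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct ReplayRecord {
    uint64_t epoch;           // epoch the transaction ran in
    uint64_t tuple_id;
    std::string after_image;  // redo-only: the new value
};

// Walk the log newest-to-oldest. `records` is oldest-first, as it
// would be in an append-only log file, so we iterate in reverse.
std::unordered_map<uint64_t, std::string>
replay(const std::vector<ReplayRecord>& records, uint64_t pepoch) {
    std::unordered_map<uint64_t, std::string> db;
    std::unordered_set<uint64_t> installed;
    for (auto it = records.rbegin(); it != records.rend(); ++it) {
        if (it->epoch > pepoch)
            continue;  // never externally committed: ignore entirely
        if (!installed.insert(it->tuple_id).second)
            continue;  // already saw a newer write to this tuple
        db[it->tuple_id] = it->after_image;  // first seen = latest version
    }
    return db;
}
```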
All right, so let's go back to our example and see how we'd actually do recovery on this. We have our persistent epoch thread; it loads the pepoch file and figures out what the persistent epoch is across all these log files. Then it instantiates a bunch of replay threads, which I'm showing as dump trucks; I don't know how else to draw them, because they're not workers executing transactions, they're just replaying the log. All of these threads get the persistent epoch number, so they know what they need to install. And this all runs in parallel: you load the checkpoint first, then they all replay the log files, and then you're done, back in the durable state you were in at the moment of the crash. Okay.

So, I like this paper, because I think it's a well-written description of how to do high-performance physical logging in an in-memory database. It's not the only way to do it, but I think it's one of the more straightforward discussions and implementations. Now, as I said at the beginning, the disk is going to be the slowest part, so we wanna avoid slowdowns from it as much as possible. One thing we can do, and Silo already does this, but it's general advice for in-memory databases, and actually for disk-based databases as well, is that when a transaction commits, rather than waiting for its log records to get flushed to disk before you allow anybody else to touch the data that the waiting transaction modified or generated, you can let them run speculatively: let the background thread do all the writing, and go ahead and run transactions as if the records have already been written to disk, even though they haven't.

There are a couple of techniques for this. The first is called group commit. The basic idea is that rather than doing an fsync for every single transaction when it flushes, you batch up a bunch of transactions and flush them all at once, amortizing the cost of that fsync. SiloR is already sort of doing this: when a transaction commits, it doesn't have to immediately hand off its log buffer to the logger thread; the next transaction can come along and append new log records to that same buffer. Essentially they get batched up and written out together. So this is called group commit. It's an obvious idea now, but it's an old one; it goes back to the 1980s, when maybe it was considered mind-blowing. It was originally developed for this thing called Fast Path. IMS is one of the first database systems IBM built, for the NASA moon mission in the 1960s, and in the 1980s they came out with an in-memory optimized engine for it called Fast Path, the same way Hekaton is an in-memory optimized engine for SQL Server. It's hard to read the Fast Path papers, because they're these old tech reports and the language is kind of confusing. Yes?

Is it possible for a transaction's log records to span multiple log buffers? The question is, is it possible for a transaction's log records to span multiple log buffers? Yes, why not? And in that case, what if the records are split across two buffers, and the first buffer gets written but the second doesn't? So the question is, what happens if my log records are split across two buffers, the first one gets written, and then the second one doesn't get written?
Well, again, in the case of SiloR, that's what the persistent epoch handles. When I advance my persistent epoch, all the log buffers for that epoch have to have been written. Now, how you handle transactions that span epochs is a side story, but all the log buffers have to be written to disk before that persistent epoch is considered finished, before you write that persistent epoch record. And then, once I know everything's been flushed and my persistent epoch record has been written, then I can tell the outside world that my transaction is committed. The transaction itself isn't gonna do any more work once it's generated all its log records, but I don't tell the outside world that it's finished until all its log buffers are written. The same thing applies to group commit. With group commit, if you're not using SiloR, if you're using, say, one of the MVCC systems we've talked about before, you could have, in a single batch of data being written to disk, changes from both committed and uncommitted transactions, right? And it's up to your recovery protocol to figure out which ones are actually valid.

All right, so I'm gonna go through this next part very quickly, because I wanna get to checkpoints. The other optimization is the same thing we talked about with speculative reads in Hekaton: if I have a change from a transaction that's waiting for its log records to get written to disk, rather than making everyone wait, I just let anybody else read it. I maintain some internal metadata in my database system recording that this transaction read a record that was modified by that other transaction, and I can't tell the outside world that I've committed until I know the first transaction's log records have been written to disk. They call this early lock release; it's the same thing as the speculative read stuff we talked about before. And in the interest of time, I'm gonna skip command logging. Again, there are so many things I wanted to talk about, and I wanna get to the stuff that I think is the most common, which you actually see in the real world.
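Before moving on to checkpoints, here's a minimal sketch of the group-commit idea from a moment ago. The interface is invented; the point is just that one fsync covers a whole batch of committing transactions.

```cpp
#include <string>
#include <unistd.h>  // write(), fsync() -- POSIX; error handling omitted

// Group commit: instead of one fsync per transaction, buffer the log
// records of many committing transactions and flush them together.
class GroupCommitLog {
public:
    explicit GroupCommitLog(int fd) : fd_(fd) {}

    // Called at commit time. The transaction is NOT durable yet, and
    // must not be acknowledged to the client until the next flush().
    void append(const std::string& record) { batch_ += record; }

    // Called periodically, or when the batch is big enough. One write
    // and one fsync cover every transaction in the batch, amortizing
    // the expensive part across all of them.
    void flush() {
        ssize_t n = write(fd_, batch_.data(), batch_.size());
        (void)n;        // a real system checks for partial writes
        fsync(fd_);     // durability point for the whole batch
        batch_.clear(); // now every batched transaction may be acked
    }

private:
    int fd_;
    std::string batch_;
};
```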
All right, so whether we're doing logical logging or physical logging, we have the same issue: the log file grows forever, meaning if I crash, I have to replay the entire log to put me back into the correct state. And obviously, if I have one year's worth of log records, my recovery could take on the order of a year, depending on what scheme I'm using. The way we overcome this is with checkpoints. Basically, we take a snapshot of the database as it exists at some point in time and write that out to disk. Then, when I crash and need to recover, I load that checkpoint in, and the only log I need to replay is just whatever came in after I took my checkpoint. So checkpoints let us reduce recovery time significantly by taking periodic snapshots of the database. There are a bunch of different ways to do this. ARIES is the canonical way you'd do it in a disk-based system, using fuzzy checkpoints and whatnot; for an in-memory database, we have a bunch of different options, which we'll go over.

What I'll also say is that whichever approach you end up using among the choices we'll talk about in a second, it's usually very, very tightly coupled with your concurrency control protocol, so whether you're single-version or multi-version. If you're multi-version, then doing snapshots is quite easy, because you just disable garbage collection and run what amounts to a long-running query. The other important thing to understand about checkpoints is that we want to minimize their overhead, the influence they have on the performance of the actual queries or the workers running our transactions. Because it would really suck if, while we're running a checkpoint, our system got 50% slower, right? So usually the conventional wisdom is that a 10 to 15% overhead for checkpoints is acceptable. That's what SiloR has, and I think that's what VoltDB has as well. Because again, we're writing the checkpoint out to disk, but we're also still trying to write log records, so we have interference on our log files and the disk drives as we try to write to them. Plus we have the worker threads doing computational work to figure out what our checkpoint should look like, copying things in memory, and preparing to write them to disk. So we're going to avoid that overhead as much as possible. Typically we do asynchronous flushes or asynchronous writes to disk, and we don't worry about every single write being durable at an exact moment in time. Now, when we do the final write for the last buffer, yes, we want to make sure that's durable, so we know our checkpoint is complete, but it's not like log records, where we call fsync all the time.

All right, so there's a paper written by Daniel Abadi, who wrote the column-store paper we talked about a few lectures ago, from 2016, that lays out some ideal properties you want in a checkpoint protocol for an in-memory database. These are essentially obvious, but keep them in the back of your mind as we go through the different approaches, because we want to make sure we don't violate any of them. As I already said, we don't wanna slow down our regular transaction processing, because that's gonna be unacceptable for people. Related to this, we don't want any unacceptable latency spikes: we don't want the checkpoint to have to write a shit-ton of data all at once so that our transaction latencies suddenly triple, because people want stability in their workload; they don't want these weird oscillations because we're writing things out to disk. And, this won't really be much of an issue for the protocols we'll talk about here, but there are other techniques that do a lot of copying, in general we want to avoid excessive memory overhead, because it's temporary memory we use for our checkpoint before writing it out to disk, and it can put pressure on the regular memory allocations for the database. We don't want to run out of memory because we're taking a checkpoint. Again, the protocols I'll talk about here are fairly lightweight, so this isn't an issue for them, but there are other protocols, the lock-free and wait-free checkpointing algorithms, that can triple the size of your database when you use them.
As far as I know, nobody does that. All right, so let's go through the different design decisions we have for checkpoints. The first is: what kind of checkpoint do we want to take? Again, this is just from the introduction class, where we talked about it in the context of ARIES, but you basically can have either a consistent checkpoint or a fuzzy checkpoint. A consistent checkpoint means that the snapshot of the database that's getting written to disk contains only changes from committed transactions. Think of it as the same thing as snapshot isolation: I start my checkpoint, and I exclude from it any changes made by transactions that were running but uncommitted at the time I took the checkpoint. You can think of this as just doing a sequential scan of the entire table within a snapshot and writing that out to disk. The reason this is ideal is that we don't have to do any extra work on recovery to figure out which transactions actually committed or not. If you do a fuzzy checkpoint, then you actually do need undo records, and that's why most systems don't do it. A fuzzy checkpoint, as I said before, can contain changes from transactions that have not committed. If it's a multi-version system, that's not a big deal, because you just know which version is the most recently committed one. If it's a single-version system, then you have to have undo records, and you have to do additional processing when you restart to make sure you remove any of those uncommitted changes.

All right, so now let's talk about how we're actually gonna create the checkpoint. The most common approach is to do it yourself inside the database system. Again, the most naive checkpoint mechanism would be a sequential scan over each table, taking the output of the scan and writing it out to disk, right? We can be a bit more clever about this, maybe just look at the delta records if we're doing delta versioning; there are different techniques, but the basic idea is that it's up to the database system to figure out what the checkpoint is and write it out. The alternative approach, which, as far as I know, nobody does today, though HyPer used to, as I'll show in the next slide, is that since we're an in-memory database, the database is entirely in memory, so let's just fork the process, right? The child process that comes out of the fork also has a copy of the database in memory, so we have that separate child process write it out to disk, and we leave the parent process, the original process, to do whatever it wants, right? The reason this works, and why it's not as expensive as you might think, is that the OS does copy-on-write memory management. Ignoring transactions running at the same time: if I have one process with a bunch of memory that I've allocated and I do a fork, it's not like the OS makes a copy of the contents of that process's address space and puts it in a new location for my child process. It knows it's a fork, so there's actually just a mapping from the virtual memory pages of the child process to the physical pages of its original parent process, and only when the parent process or the child process tries to modify one of those pages does the OS actually make a copy of it. So I can fork, and the child effectively has a snapshot of memory. Now, if there are transactions running at the time of the fork, then I need to do a bunch of extra work in the child process to treat those as aborted transactions and undo their changes, to put me back into a consistent state.
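Here's a bare-bones sketch of that fork-based approach. Error handling is omitted, and the undo of in-flight transactions is waved away in comments; `write_database_to_disk` is a hypothetical hook, stubbed out here.

```cpp
#include <cstdio>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical hook that serializes the in-memory tables to `path`.
// (Stubbed out; a real system would walk its table heaps.)
static void write_database_to_disk(const char* path) {
    std::fprintf(stderr, "writing snapshot to %s\n", path);
}

// Take a checkpoint by forking: the child sees a copy-on-write image
// of the parent's entire address space as of the fork.
void take_fork_checkpoint(const char* path) {
    pid_t pid = fork();
    if (pid == 0) {
        // Child: the database exactly as it was at fork() time. A real
        // system would first undo transactions that were in flight at
        // the fork to get a transactionally consistent state.
        write_database_to_disk(path);
        _exit(0);  // skip the parent's atexit handlers
    }
    // Parent: keeps executing transactions immediately. The OS copies
    // pages lazily as the parent dirties them -- exactly the CoW
    // overhead that eventually made HyPer abandon this scheme.
}
```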
The other downside of this approach is that it copies everything. In this case, we know we only care about writing out the chunks of memory holding the table tuples, but this thing copies everything: whatever internal data structures we have in our system get copied, and any indexes get copied as well, even though we don't care about writing those out.

So this is actually what HyPer did in its very first version. The version of HyPer you guys read about is what they built after this first version. The first version was heavily influenced by the system I helped build, H-Store, and they followed some of the same ideas that we did for that original system; then they realized that if you want to run HTAP workloads, OLAP queries on the same system, the H-Store model is insufficient, or it's not what you want to do. But in that first version, everything was in C++, and if they wanted to run an OLAP query or take a checkpoint, they forked the process. The child process undoes any transactions that were running at the time of the fork, because they're not running anymore in the child, and then whatever's left is a consistent snapshot that the child can write to disk, or run OLAP queries against. They ended up abandoning this because the copy-on-write overhead from the OS actually becomes quite significant. Because again, you're trying to run transactions as fast as possible in the parent process, and those are going to start dirtying up a bunch of pages, both for the indexes and the internal data structures. So all of a sudden, the OS starts doing a bunch of copies immediately after the child gets forked, and your performance drops. So they abandoned it and switched to the MVCC model that you guys read about.

When this paper came out in 2011, I thought, oh, that's sort of a clever idea. That's interesting. I was in grad school at the time, so I didn't have time to implement it myself, but I ended up implementing it when I came to CMU, with a master's student, in my first semester. This is just a simple experiment showing the performance of H-Store when you do this kind of process forking. The original version of HyPer was pure C++; H-Store, like VoltDB, is C++ plus Java. There's a Java front end for networking, query planning, and stored procedures, and then the execution engine is all in C++. So technically it's a JVM with off-heap memory and off-heap query execution. What happens here is that we do the fork at this point, this is just running TPC-C, and then we run the OLAP query on the snapshot. So here you see the performance of the parent process, and then the red is the performance of the child process. The first time you take the snapshot, everything sucks; then I think the second time, the snapshot doesn't actually do that badly. I forget why. Two things I'll say about this. We still have the same bottleneck of the OS copying a bunch of pages as they get modified, and in our case it's actually even worse than in HyPer's.
This is because we're in Java: if you read the documentation for the JVM, they say don't fork it, it's a bad idea. We said, let's fork it. And yeah, it's a bad idea. What happens in these managed-memory environments like the JVM is that it's not just your application thread running; they have a bunch of other background threads, like the garbage collector and other system threads, doing stuff. When you fork, those other threads aren't alive in the child process; they're all dead, because only the thread that called fork is running in the child. So you're in this weird zombie JVM, because you don't have a garbage collector, and you don't have the other machinery you expect to have. Then the garbage collector kicks in over in the parent process and starts reorganizing the heap, which causes a bunch of OS page copies. So that was a bad idea, which is fine; we just wanted to see whether it would work. And I'm interested, I don't remember why this one is fine while this one still tanks... oh, you know what, sorry, take that back. I think this is regular H-Store, and this is H-Store with the JVM forking. In regular H-Store, the OLAP query ties up all the threads, and that's why it bottlenecks here. In the H-Store-with-snapshot case, and this is the TPC-C number, you see that when you do the snapshot, it takes a while to recover, and that's all the OS page copying. The bottom line is that it's a bad idea, nobody actually does this, and definitely don't do it if you have a JVM.

All right, so we talked about what kind of checkpoint to take; the next choice is whether to do a complete checkpoint or a delta checkpoint. A complete checkpoint is literally the entire database, as it exists in memory, written out to disk. So if my database is 100 gigabytes and I take a checkpoint, I get a 100-gigabyte file. I could compress it, but we can ignore that. I have a 100-gigabyte file; then I take another checkpoint ten minutes later, but within those ten minutes, maybe I only updated one gigabyte of the database. My next checkpoint's still gonna be another 100 gigs. The way to overcome this is delta checkpoints: you write out only the tuples that were modified since the last checkpoint you took. One way to figure this out is to keep a bitmap tracking which pages or blocks got modified since last time; or you can derive it from the log, since the deltas essentially look like the log, and just have that representation be the checkpoint. We'll see in a few slides what the in-memory systems actually do, but pretty much everyone does complete checkpoints. I think it's a combination of being easy to implement and providing people with peace of mind, the comfort of knowing: here's a single checkpoint file that has the snapshot of my database, I can ship it around between machines, I know it's there, and I know it's safe. Whereas the delta approach generates a bunch of delta files, and if one of those files gets trashed, I might end up losing the entire thing.
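Before we get to when checkpoints happen, here's a small sketch of the bitmap-based delta idea. The block granularity and structure are invented for illustration:

```cpp
#include <cstddef>
#include <vector>

// Delta checkpoints via a dirty bitmap: one bit per fixed-size block
// of tuple storage. Writers set a block's bit on every update; the
// checkpointer writes out only the blocks whose bits are set.
class DirtyMap {
public:
    explicit DirtyMap(size_t num_blocks) : bits_(num_blocks, false) {}

    void mark_dirty(size_t block) { bits_[block] = true; }

    // Returns the blocks that changed since the last checkpoint and
    // resets the map, so the next delta starts from a clean slate.
    std::vector<size_t> drain_dirty_blocks() {
        std::vector<size_t> dirty;
        for (size_t i = 0; i < bits_.size(); i++) {
            if (bits_[i]) { dirty.push_back(i); bits_[i] = false; }
        }
        return dirty;
    }

private:
    std::vector<bool> bits_;  // a real system would use atomics here
};
```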
All right, so the last design decision is when to take a checkpoint. The most obvious way is time-based: every five minutes or every ten minutes or so, I just take another checkpoint. This is usually something that humans have to tune, because the interval between checkpoints determines how big your log file is after a crash, and that corresponds to how long it's gonna take you to recover after a restart. I think VoltDB's default is five minutes; I forget what other systems do. Another approach is to base it on how much data has been written out to the log file. In the case of MemSQL, they say if I write 250 megs to my log file, then that triggers a new checkpoint. That bounds the amount of time recovery will take, in theory, modulo some constants depending on what you're actually modifying, based on how much data is actually being written. And this avoids the issue where, if your database is not very active at night, you're still taking a checkpoint every five or ten minutes; you just do it whenever the log grows enough. The last approach is not really an option: if you implement checkpoints, you have to implement one of those first two, which are an either/or, but this last one you simply have to do, because it would be foolish not to: you take a checkpoint whenever somebody asks the system to shut down. You don't do kill -9 on your database system; that's gonna be a bad idea. You politely ask the database system to shut down. It basically blocks any new connections, quiesces all the worker threads, allows any transaction that may still be running to finish within some time limit, and then, at the point when you know there are no more transactions running, it initiates the checkpoint. Once that's durable on disk, then it can truly shut down.

So here's a little table of some of the major in-memory databases and the different approaches they use. Again, everyone's doing different things. In my opinion, the most common one would be consistent checkpoints, and that's what MemSQL, VoltDB, and Hekaton use. But you see that TimesTen does fuzzy checkpoints if you use the non-blocking variant, though they have a blocking version as well, and HANA does fuzzy checkpoints too. What's interesting about this is that fuzzy checkpoints are traditionally associated with single-version systems, right? Because transactions are running at the same time, there's only one version of a tuple, and they're all modifying it in place. So you would think fuzzy checkpointing would be done in a single-version system like VoltDB, and HANA, a multi-version system, would do consistent ones, but it's reversed. HANA, for whatever reason, does fuzzy checkpoints, and VoltDB does consistent checkpoints. Think MVCC: if I have a consistent snapshot, that's a consistent checkpoint. The way VoltDB actually does this is that they switch into a sort of multi-version mode when you take a checkpoint. It's really two versions: there's the version that existed when my checkpoint started, and the current one. But normally it's always single-version. Most systems do complete checkpoints. Only Hekaton does delta checkpoints, and the way they handle it is by doing compaction on the log files. So you have one file that records all the tuples that got deleted,
So this is a little table of some of the major in-memory databases and the different approaches they use. Again, everyone's doing something different. In my opinion the most common choice is consistent checkpoints, and that's what MemSQL, VoltDB, and Hekaton use. But you can see that TimesTen does fuzzy checkpoints if you want a non-blocking one, though they have a blocking version as well, and HANA is doing fuzzy too. What's interesting about this is that fuzzy checkpoints are traditionally associated with single-version systems, right? A fuzzy checkpoint exists because transactions are running at the same time as the checkpoint and there's only one version of each tuple, which they're all modifying. So you would think fuzzy checkpointing would be done in a single-version system like VoltDB, and that HANA, a multi-version system, would do consistent ones. But it's reversed: HANA, for whatever reason, does fuzzy checkpoints and VoltDB does consistent checkpoints. Think MVCC: if I have a consistent snapshot, that gives me a consistent checkpoint. The way VoltDB actually pulls this off is that they switch into a multi-version mode when you take a checkpoint. It's really just two versions: the version that existed when the checkpoint started and the current one. Normally it's always single-version. Most systems are doing complete checkpoints. Only Hekaton is doing delta checkpoints, and the way they handle them is by doing compaction on the logs. You have one log file that records all the tuples that got deleted and another with all the tuples that got updated or inserted, and Hekaton compacts them into a smaller checkpoint file that represents the state of the database, right?
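As a toy illustration of that compaction step, here is a sketch in C. The record layout and the quadratic latest-wins scan are invented for clarity (and it ignores re-inserts after a delete); this is not Hekaton's actual file format or algorithm:

```c
/* Toy sketch of log compaction: one log of inserted/updated tuples,
 * one log of deleted tuple IDs, merged into a smaller checkpoint. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t id; char payload[32]; } tuple_t;

/* Latest-wins scan: a tuple survives compaction only if its ID is not
 * in the delete log and no later version of it appears in the data log. */
static size_t compact(const tuple_t *data, size_t ndata,
                      const uint64_t *deleted, size_t ndel,
                      tuple_t *out) {
    size_t nout = 0;
    for (size_t i = 0; i < ndata; i++) {
        bool dead = false;
        for (size_t d = 0; d < ndel; d++)               /* deleted?    */
            if (deleted[d] == data[i].id) { dead = true; break; }
        for (size_t j = i + 1; j < ndata && !dead; j++) /* superseded? */
            if (data[j].id == data[i].id) dead = true;
        if (!dead) out[nout++] = data[i];
    }
    return nout;  /* 'out' now holds the compacted checkpoint contents */
}

int main(void) {
    tuple_t log[] = { {1, "v1"}, {2, "v1"}, {1, "v2"} }; /* tuple 1 updated */
    uint64_t dels[] = { 2 };                             /* tuple 2 deleted */
    tuple_t ckpt[3];
    size_t n = compact(log, 3, dels, 1, ckpt);
    if (n > 0)
        printf("%zu live tuple(s); first: id=%llu payload=%s\n",
               n, (unsigned long long)ckpt[0].id, ckpt[0].payload);
    return 0;
}
```

A real implementation streams sorted files and merges them rather than doing nested scans, but the visible effect is the same: the compacted file is the delta checkpoint with the dead versions squeezed out.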
And then the frequency at which these systems actually take checkpoints is all over the map. For Altibase, I looked in the manual, or the documentation, and it didn't look like they had periodic checkpoints; maybe that's changed. I mean, there's a command to invoke a checkpoint, but I couldn't see a way to set it up to run automatically. All right, so any questions about checkpoints? Yes? The question is, what did Peloton use? So, we never had checkpoints, and we still don't have checkpoints now; this is actually a project I'll discuss on Wednesday. I want checkpoints, yes. We were going to do consistent checkpoints based on a time frequency, but we never got that far. We never got logging working correctly either; we do have that now. Yes? The question is, is there any reason you would want a time-based frequency rather than a log-size threshold? So, if your database is running 24/7, then sometimes there are SLOs or SLA guarantees you have to meet, saying the database is guaranteed to recover within a certain amount of time, like if you're selling it as a service to somebody. In that case you could say, all right, I'll sell you my database system and I guarantee you'll be able to recover in five minutes, so I set my checkpoint frequency to four minutes. You'll pay a performance penalty, because while you're taking a checkpoint you're doing work other than running transactions, but this guarantees you come back very quickly. It's a good question. Yes? The question is, for the multi-version database systems, won't your memory usage spike when you kick off one of these checkpoints while transactions are still going on? Yeah. It's the same thing I said about running a long select query: it pauses your garbage collector from actually cleaning things up, because you're waiting for that one checkpoint thread to finish. So this is where the techniques from the HANA paper we read come into play. My checkpoint thread is reading from an old snapshot, but there's a bunch of intermediate garbage that nobody will ever read and that could be cleaned up. That's why being able to do interval garbage collection the way the HANA guys do is the right thing if you want to take snapshot-isolation checkpoints, and it's why I had you guys read that paper. Okay, ten minutes left, let's do this.

All right, so everything I've talked about so far has assumed that we're trying to restart the database system after a crash. But not all restarts are due to a crash, right? There are other cases where we need to restart the system in order to do maintenance. One example would be updating the operating system; it could be upgrading the machine, adding more RAM, or replacing a faulty drive, or it could be updating the database software itself. And this is problematic because if I have, say, a one-terabyte database hanging out in memory, it's going to take me a long time to do a restart: write that out to disk, restart the database system, come back up, and load it all back in, right?
And the reason we want to avoid this, at least for the software-update case, is sort of OS 101: the memory we allocate in a process is tied to the lifetime of that process. If we can decouple those two, if we can have our database live on in memory even though the database process has restarted, then when we come back up we can reclaim that memory and integrate it back into the system. So this is what Facebook came up with for their Scuba system. I'll show what Scuba is in a second, but basically it's a distributed in-memory database they built for event log processing. You have a bunch of services generating event records that say things like, I ran this PHP function for this long, or this packet went here. They dump those into this giant system and they want to run analytics over it to find, say, the cause of a slowdown. So it runs on a very large fleet of hundreds of machines, and Facebook has this agile development environment where they push out new updates every two or three weeks. That means every two or three weeks you have a new version of your database system and you need to restart to install it. If you have to dump the data to disk and reload it every single time, that's going to suck, because at any given moment a large portion of your fleet is just restarting and reading data in from disk for no reason other than that you restarted. So the way they handle this, since, as I said, they want to decouple the memory contents of the database from the process lifetime, is to rely on shared memory in the operating system: dump the database contents into shared memory, restart the database, come back, look in shared memory, which is still alive, and suck the data back into the new process. Like I said, this is a really clever idea; it's sort of obvious once you see it, but it's worth talking about.

So real quick, what is Scuba? As far as I know, Scuba is still actually in use. As I said, it's a distributed in-memory system used for analytics. This is not the primary storage location for core customer or user data at Facebook; it's all internal event data, so if they lose it, it's not the end of the world. Your timeline and all your friend data are stored in MySQL separately. They use what's called a heterogeneous distributed database architecture: there are leaf nodes that do all the scans over the in-memory data and the filtering, and they push results up to aggregator nodes that do the group-bys, aggregations, and other things. It's a tree hierarchy: you keep aggregating until you reach the final answer, and you send that back to the client. So real quick, it looks like this: you have your leaf nodes holding the contents of the database in memory, maybe writing out some log files or checkpoint files, and when a query comes in it gets broken up into fragments that run on each of the leaf nodes. Each leaf node then operates on the data it has locally, takes its intermediate results, and sends them to an aggregator node, which combines them with results from the other leaf nodes. This is a very common distributed system pattern that Facebook uses for a lot of their systems. It's how MemSQL works as well: one of MemSQL's co-founders was at Microsoft, where he saw Hekaton's ideas, then spent a year at Facebook, where he saw this general architecture, and MemSQL is based on both. But this is not unique to Facebook; it's used in a bunch of other systems too. So again, the problem we're trying to solve is restarting these leaf nodes. The aggregators are stateless, so we don't care about them; the question is how to restore the database in a leaf node after it restarts.

There are two ways to do shared memory restarts. The first is to have a special memory allocator that allocates the database's heap in shared memory. That would just be how we malloc, so the rest of the system doesn't know it's in shared memory; we treat it as regular memory, and we just know that when the database process restarts and comes back, it knows where to look to find the data it wants, and everything is fine. In the Scuba paper, they talk about investigating whether they could modify jemalloc, or build their own version of malloc, to do exactly this. Facebook doesn't own jemalloc, but they employ its inventor, so they asked him directly: do you think this would actually work? The feedback they got, and they write this in the paper, attributed to the jemalloc author by name, is that with shared memory you can't do lazy allocation of backing pages. Normally with malloc you're allocating virtual memory that isn't immediately backed by physical pages, but with shared memory, they said, the OS backs the pages immediately, which causes fragmentation issues and performance problems. So that's what the paper says, and last year when I taught this, it's also what I said, because who am I to disagree with the inventor of jemalloc, right? And this is why I love the internet: on this exact point, that you can't do lazy allocation of backing pages with shared memory, some guy I've never met, a programmer in Dubai, watched the lecture video and said, I think you're wrong. I said, well, here's the part of the paper that says this is what they found. Then he went and actually checked what the kernel does and got it to work: the kernel really does do lazy allocation and freeing. Now, for this part he was using mmap directly rather than jemalloc itself, but his point is that you can do this correctly: shared memory can be lazily allocated. So again, this is why I love the internet: I just post the lectures, and some guy in the Middle East says, yeah, you're wrong, here's how to fix it. That's awesome.
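If you want to poke at the lazy-allocation question yourself, here is a small Linux-flavored experiment in the spirit of what he did, using POSIX shared memory (link with -lrt on older glibc). The segment name is made up, and the demand-paging behavior it relies on is an OS implementation detail, not a portable guarantee:

```c
/* A minimal experiment suggesting that shared memory *can* be lazily
 * backed: create a large shm object and map it, and physical pages are
 * only allocated as they are touched. Names are hypothetical; behavior
 * is OS-specific. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/demo_db_heap"
#define SHM_SIZE (1UL << 30)   /* ask for 1 GB up front */

int main(void) {
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, SHM_SIZE) != 0) { perror("ftruncate"); return 1; }

    char *heap = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (heap == MAP_FAILED) { perror("mmap"); return 1; }

    /* Only this write forces a physical page to be materialized. */
    memset(heap, 0xAB, 4096);            /* touch one page */
    printf("mapped %lu bytes, touched 1 page\n", SHM_SIZE);

    munmap(heap, SHM_SIZE);
    close(fd);
    shm_unlink(SHM_NAME);                /* remove the segment when done */
    return 0;
}
```

Watching the process's resident set size while this runs (with top, for instance) is one way to see that the gigabyte is not physically allocated until the pages are touched.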
So they didn't do the allocator approach, for that reason. What they do instead is copy on shutdown. Basically, you use ordinary local memory, allocated with malloc in your process, while you're running. When you get the shutdown command, you copy those pages into shared memory, and once that's done, you write a little file to disk that says: when you come back, here's where to find the data in shared memory. Then you boot back up and suck it all back in. So this is essentially what I said before, right? You get a restart command, you write all your blocks out to shared memory, and when you come back you check shared memory and verify that the physical contents of the database in there match what the new version of the software you just restarted into expects. If someone changed the layout of pages in the new version, you'd come back and start reading garbage, or reading things that don't look the way they should. So they have a bunch of protection mechanisms to make sure that if the memory layout of the database has changed, they don't try to reload the old data. And if all of this fails, you just fall back to reloading the data you already stored on disk anyway, right? So like I said, this is a really interesting idea, and I'm actually interested in exploring how shared memory can be used for things beyond just this restart trick.
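Here is a rough end-to-end sketch of that copy-on-shutdown flow, with a layout-version check standing in for those protection mechanisms. The segment name, manifest format, and LAYOUT_VERSION constant are all hypothetical, and the real validation is much more involved:

```c
/* Rough sketch of copy-on-shutdown with a layout-version check.
 * All names and formats are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME       "/db_restart_image"
#define MANIFEST_PATH  "manifest.txt"
#define LAYOUT_VERSION 7        /* bump when the in-memory layout changes */
#define DB_SIZE        4096     /* toy database: one page */

static char db[DB_SIZE];        /* the process-local database heap */

/* Graceful shutdown: copy the heap into shared memory, then record
 * where it lives and which layout version wrote it. */
static int save_to_shm(void) {
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, DB_SIZE) != 0) return -1;
    char *p = mmap(NULL, DB_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return -1;
    memcpy(p, db, DB_SIZE);
    munmap(p, DB_SIZE);

    FILE *m = fopen(MANIFEST_PATH, "w");
    if (!m) return -1;
    fprintf(m, "%s %d\n", SHM_NAME, LAYOUT_VERSION);
    fclose(m);
    return 0;
}

/* Startup: reuse the image only if the layout version matches;
 * otherwise the caller falls back to the checkpoint on disk. */
static int restore_from_shm(void) {
    char name[64]; int ver;
    FILE *m = fopen(MANIFEST_PATH, "r");
    if (!m) return -1;
    int fields = fscanf(m, "%63s %d", name, &ver);
    fclose(m);
    if (fields != 2 || ver != LAYOUT_VERSION) return -1;

    int fd = shm_open(name, O_RDWR, 0600);
    if (fd < 0) return -1;
    char *p = mmap(NULL, DB_SIZE, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return -1;
    memcpy(db, p, DB_SIZE);
    munmap(p, DB_SIZE);
    shm_unlink(name);           /* image consumed; free the segment */
    return 0;
}

int main(void) {
    strcpy(db, "hello, database");
    if (save_to_shm() != 0) return 1;
    memset(db, 0, DB_SIZE);     /* simulate a brand-new process image */
    if (restore_from_shm() == 0) printf("restored: %s\n", db);
    else printf("falling back to the disk checkpoint\n");
    return 0;
}
```

The fallback branch is the important design point: the shared memory image is purely an optimization, and the checkpoint on disk remains the source of truth.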
This is actually very similar to some of the arguments the MongoDB guys made to me in the early days about why they were using mmap. They'd say, if the database crashes, which I guess in the early versions of Mongo it did, then with mmap you come back and your whole database is still hanging out in the OS page cache, so restarting is super fast, right? It's the same idea: a way to decouple the contents of the database's memory from the lifetime of the process.

All right, so to finish up: I focused on physical logging and we didn't get into logical logging, but that's fine, because physical logging is used everywhere. It's what we use in our new system, and it's what pretty much everyone uses except VoltDB and possibly FaunaDB, though I'm not sure about that one. If you're doing MVCC, copy-on-update checkpoints are the way to go, because you already have the versions: you already know which versions are visible to you and how to use them to generate a consistent view of the database, so you just write those out to disk. And as I said at the beginning, and I won't say more about this for the rest of the semester, non-volatile memory is coming, and it's actually extremely interesting; my first PhD student did his entire dissertation on it, and it was turned into a book last week. But right now I'm more interested in the self-driving, automated, autonomous database stuff, which is what we're going to talk about instead of non-volatile memory. I think that's more fun, okay? Are there any questions about recovery or checkpoints?

Next class: network protocols. How do you actually have the client talk to the database system? What do those packets look like? Then we'll have the announcement for project two, and I'll go through a bunch of potential topics. I'm also going to send out the link to a Google spreadsheet later tonight where you form your groups. You don't have to pick a topic yet, but you should start thinking, based on what we'll talk about next class, about what you want to do. I'm happy to meet with anybody as well if you want help brainstorming ideas. Okay? All right guys, see you on Wednesday. Take care.