So the next chapter of our lectures this semester is logging and recovery. Up until now we've been dancing around the idea that we're going to write this data out to disk and make sure everything is nice and durable. Now we need to talk about how we're actually going to do that: how we guarantee that if your transaction commits after making a bunch of writes, and then the system stops or crashes, you come back and everything is still there. So just real quick: homework five is due today at midnight. Project three is due on Wednesday, November 15th. And as I announced on Canvas yesterday, there will be no class this coming Wednesday. We have a bunch of scheduling issues; Joy has a meeting with his parole officer on Wednesday, so he can't teach, so we're just going to cancel the class outright, okay? Any questions about project three? Okay. All right, so to motivate what we're going to talk about today, let's look at a really simple scenario where a transaction makes changes in our database, and see what can go wrong and how we make sure that when we make a write, everything's still there. So this is our schedule, where we execute the transaction's operations. In memory we have our buffer pool, the same thing you implemented for the first project. And out on disk we have the disk manager, where we store the pages of our database. When the transaction starts, it does a read on A, and as you know, we look in the buffer pool manager to see whether the page we're trying to read is there. In this case it's not, so we go out to disk, make a copy of that page, put it into memory, and now we can read our value. In the next step we do a write on A, and that update first lands in the buffer pool, because all of our changes always go to memory first, right? But now we have a problem when the transaction goes to commit: we want to tell the outside world that, yes, we got all your changes. But at this point, is our change safe? Is it durable? No, right? Because it's only in memory. And what could happen is that Hitler comes along and cuts the power on your machine, and now everything in memory is gone. Hitler's an evil person. It took me a while to figure out who I actually wanted to show; if you show people that are in the news now, a year from now they'll be dated, so I went with Hitler. He's like the Hitler of databases: just bad, evil, right? So he cuts the power on our system, and everything we have in memory is lost. This is bad. If we tell the outside world our transaction committed, then we need to make sure that no matter whether somebody comes along and kills our machine, our data is safe and durable. In this case, we would come back, read that page back in, and we'd have A equals one. We'd have the old version, even though we already said the transaction committed and we saved all of its changes. So this is the problem we're trying to solve today. This is called crash recovery.
And the basic idea is that we're going to have algorithms inside our system that ensure the database is always consistent, that every transaction's changes are atomic, and that we always keep the changes of any transaction that commits, where we told the outside world it committed, no matter how many times you crash and restart or have problems. So a recovery algorithm is essentially comprised of two parts. The first part is everything you need to do during normal processing, the runtime execution of transactions while the database system is running normally: you open a connection from your application, execute SQL queries, and commit. We need to do a bunch of stuff during that time to make sure everything is safe and durable. And then there are the things we need to do after a failure, after a crash, when we lose power and come back and have to make sure everything is in the correct state. So to do crash recovery you need two parts: the runtime part and the recovery part. In today's class we're going to focus on the first part. We'll talk about what you need to do while your database system is running to make sure all your changes are durable. So for today's agenda, we're going to try to cover a lot. We'll first talk about what kinds of failures we have to deal with in our database system, and that will motivate how we come up with a scheme or protocol to make sure everything's safe. Then we'll talk about the policies in our buffer pool manager that decide when it's safe to move data in and out. Then we'll talk about one possible implementation to ensure durability called shadow paging. Then we'll talk about write-ahead logging, which is the most common scheme that people use and what you'll be implementing in project number four. And then we'll talk about checkpoints and different types of logging schemes, okay? All right, so for crash recovery, the way to think about it is that we divide the database system up into different components based on the underlying storage devices the system uses as the backing for its data. And then based on these storage devices, we can come up with different classifications of the types of failures the database system is going to be able to handle. So in every database system there are three categories of storage. The first is called volatile storage, which is just memory. This is typically fast DRAM that you can read and write very quickly. The key thing, the reason it's volatile, is that if you lose power, all your data is lost. There have actually been some experiments here at CMU showing that if you cut power to DRAM you can still read it for another 30 seconds or so, but in practice nobody relies on that: you lose power and you assume anything that was in memory is completely gone. Then you have non-volatile storage, where the data persists after losing power. This is your NAND flash SSD and your spinning-disk hard drive.
And then there's also a third class that we haven't talked about so far called stable storage. Stable storage is a class of storage devices that can survive any possible failure you could think of, and we'll see what those failures look like. Even if you light the machine on fire and burn these stable storage devices, your data will still be there. And obviously this doesn't actually exist, right? If you take a gun and shoot the hard drive, it's gone. So we refer to stable storage as an abstract concept we can write to, but in practice it doesn't exist. The way you get around this is by combining multiple storage devices and adding redundancy to approximate stable storage: instead of a single hard drive in a single machine, you spread the data across multiple machines or different data centers. That way, if the data center burns down or your machine burns down, you still have another copy somewhere else. So it's a logical concept rather than an actual storage device you can buy. But it's going to be important, because this is where we're going to end up putting our logs. All right, so there are three categories of failures we can deal with: transaction failures, system failures, and storage media failures. Transaction failures come in two types. The first is a logical error, where there's some error condition in the database itself that prevents the transaction from being able to complete. Say you try to insert a null value into a column that cannot be null; the database system is going to prevent you from doing that and abort your transaction. That's a logical error. Then you have what's called an internal state error, where the database system aborts your transaction because of some error condition it checks for while running transactions that prevents the transaction from completing. An example would be a deadlock under two-phase locking. The next type is system errors, or system failures. The first kind is a software failure, where the actual implementation of the database management system has a problem in it: there's a divide-by-zero or an uncaught exception, and your transaction has to fail. Ideally we hope this doesn't happen. Most database systems are tested very rigorously; in Oracle, for example, any time they put out a new release, it spends months in QA and testing to make sure there are no bugs or regressions before they ship it. So typically you don't see this in a robust system, but it can occur. The second kind is a hardware failure, where the machine your database system is running on has some kind of hardware problem, the power gets pulled or the hard drive crashes or whatever, and the machine just stops running. For this type of error, we're going to assume that the contents of the non-volatile storage devices we use will not be corrupted by a hard stop. In practice we know that's not always the case. I had this happen when I was an undergrad: you pull the plug on a machine while the hard drive is doing something, and it can end up corrupted in such a way that you can't boot it back up.
Now for storage media failures, we have to deal with the problem of the non-volatile storage being completely lost. Again, this is the example where you light your machine on fire and it melts the hard drive; there's no recovery from that. For this, we assume that the destruction of our data is detectable, meaning we'll get checksum errors from the disk controller, or we know the machine caught on fire. So we know there's a problem, and we as humans can then take action to correct it. An example of a way to recover from this: if your machine catches on fire and melts all your drives, and you have a backup version of the database, you can load the backup and replace the data that way. So the way to think about this particular type of error is that it's not something the database system itself can recover from, because it requires something beyond what software can control; it requires a human to do something extra to recover your database. All right, so given these types of errors, we need to think about how to overcome the ones we can overcome, to make sure that any changes made by committed transactions end up being durable. As I showed in the beginning, and as you know from the lectures on the buffer pool manager, the primary storage location for the systems we're talking about is always going to be non-volatile storage: an SSD or a spinning-disk hard drive. But since reading and writing those devices is slow, we always use our buffer pool manager as a cache, bringing in the pages we need so we can read and write them very quickly. Of course, the problem is that memory is volatile, so if we pull the plug we lose everything. So the database uses volatile memory as a staging area: when we want to modify data on a page, we make a copy, bring it into our buffer pool, and make the modification there. Then at some point we flush it back out to disk, and at that point we know things are safe and durable. That's the key assumption we're going to make throughout all of this. So for what we're talking about today, the database system needs to guarantee that if we tell the outside world a transaction is committed, then its data is actually durable. We can survive all the failures except for the storage media failure; again, if the hard drive melts, we can't recover from that, but we can handle all the other ones. And if a transaction ends up getting aborted, then if we crash and come back, none of the changes made by that failed transaction should be visible. The way we're going to do this is with a combination of undo and redo. Undo is the process of removing any changes made by a transaction that did not commit, either because it aborted due to a failure in our software or in the logic of the transaction, or because the database system found a deadlock and killed it, or because the transaction simply didn't complete before the machine went down. We use undo to roll back all of its changes so that nobody sees them when the system comes back up. And then redo allows us to reinstall the modifications made by the transactions that did succeed.
For those transactions, where we did tell the outside world they committed, we're going to make sure their changes persist forever. The tricky part is making sure that anything that's only sitting around in our buffer pool has some artifact on disk that says: here are the changes this committed transaction made. How exactly we do that depends on how the database system manages the movement of data in and out of disk. So let's look at an example. We have two transactions: T1 does a read on A and a write on A, and T2 does a read on B and a write on B. Assume we start the system from a cold start: there's nothing in our buffer pool manager, nothing in memory. Out on disk we have a single page that holds the objects A, B, and C. When T1 starts, it does a read on A, and since it's not in memory, we make a copy of the page into our buffer pool manager. At this point it's an exact copy of what we have on disk. Then we do our write on A, and as I said, we always make the modification to the page in memory first. So now object A in memory has been modified. Then we switch over to transaction T2. It does a read on B; that page is already in memory, so we don't do anything special there. Then it does a write on B and overwrites B's value. So now we get to this point here, and T2 wants to commit. The first question we have to deal with is: how should we make sure these changes are durable? One obvious thing you could do is take that page, write it out to disk immediately, and once it's flushed, tell the outside world that the transaction has committed. But what's an obvious problem with this? Say that again, sorry. T1 is not committed. Right, T1 hasn't committed yet, but it made a change to an object on the same page as the object that T2 modified. So if we just write the page out, and later T1 aborts, we need to roll it back, but we have nothing in our system that says: yes, T2 made this change and it should be there, while T1 made this change but it didn't commit, so it shouldn't really be there. Now you could say, well, why don't I just make sure I only write out the committed changes to the disk page? But that requires a bunch of extra work, because now you have to track that this tuple on the page was modified by a transaction that committed, so I can flush that, but this other one wasn't, so I can't, and I have to build a new page in memory and copy that out. It's a bunch of extra stuff to do, and I'd have to think about it more, but there's no guarantee it would actually work well. Okay, so there are two policies we have to talk about now in our buffer pool manager that we'll use to decide when it's okay to write out dirty pages like this, pages with changes from transactions that haven't committed yet. The first is called the steal policy. The steal policy determines whether a transaction that has not committed yet is allowed to overwrite the most recent committed value of an object in our non-volatile storage out on disk.
If you're using the steal policy, then you're allowed to flush dirty pages out to disk that contain modifications made by transactions that haven't committed yet. If you're using the no-steal policy, then any page that contains data from an uncommitted transaction has to stay in memory; essentially it gets pinned, and the database system is not allowed to write it out. The next policy is called the force policy, and this says whether the database management system is required to flush all the pages modified by a particular transaction at the moment that transaction commits. So if the force policy is enabled, then when a transaction commits, all of its changes have to be written out. And this can conflict with the no-steal policy, because no-steal says you can't write out dirty pages containing uncommitted data, so you'd have to do that extra work I described where you make a copy and ensure the page you write out only contains the data from the transaction that actually committed. If you're doing no-force, then the transaction is allowed to commit, and you can tell the outside world it committed, without flushing all of its dirty pages out. Of course, that means we have to do something extra, because if the changes are hanging out in memory, we can still lose them. So if we're using the no-force policy, we have to do extra work to make sure everything sticks around. Now, the force policy combined with no-steal makes recovery really easy, as we'll see in a second when we talk about shadow paging, because when you crash and come back, you know that the disk doesn't contain any changes made by transactions that didn't commit; it only contains data from committed transactions. So when you come back, the database on disk is good to go. This makes recovery really easy and really fast, because you come back and everything's okay. But it's going to be really bad for performance, because every single time a transaction commits, you have to pick out exactly the pieces you want to write out to disk so you don't include uncommitted changes, then fsync exactly those pages, and wait until that's done before you tell the outside world that you committed. And now you have to be careful about scheduling, because if you have many transactions trying to do this at the same time, they're going to block each other doing a bunch of fsyncs. So the spoiler is that we're going to end up using steal + no-force, but I want to walk through an example of what no-steal + force looks like first. With no-steal + force, no-steal means you're not allowed to write out dirty pages from uncommitted transactions before those transactions commit, and force means all the pages a transaction modified have to be flushed out before you can say it's committed. So again, these are the same transactions we had before: T1 does a read on A and a write on A, and T2 does a read on B and a write on B. Transaction T1 starts, does a read on A, we copy the page into our buffer pool, then it does its modification; then T2 starts, does its read on B, then does its write on B, and then it goes ahead and commits.
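To make the two policies concrete before finishing the walkthrough, here's a minimal sketch, with hypothetical names rather than BusTub's actual API, of how a buffer pool manager might consult them:

```cpp
#include <cstdint>
#include <unordered_set>

// Hypothetical policy flags; a real system picks one combination and bakes it in.
struct BufferPoolPolicy {
  bool steal;  // true => dirty pages from uncommitted txns may be flushed/evicted
  bool force;  // true => every page a txn dirtied must be flushed before its commit returns
};

struct Page {
  int32_t page_id;
  bool is_dirty;
  std::unordered_set<int64_t> uncommitted_writers;  // txns with uncommitted changes on this page
};

// Under NO-STEAL, a dirty page holding uncommitted changes is effectively pinned in memory.
bool CanWriteOut(const BufferPoolPolicy &policy, const Page &page) {
  if (!page.is_dirty) return true;
  return policy.steal || page.uncommitted_writers.empty();
}

// Under FORCE, commit must flush (fsync) every page the transaction touched before returning.
bool MustFlushAllDirtyPagesAtCommit(const BufferPoolPolicy &policy) { return policy.force; }
```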
Now at this point, because we're using the force policy, all the changes that T2 made have to be written out to disk before the commit completes. So we write everything out and do a flush. But because we're using no-steal, we have to pick out T1's changes and make sure they don't actually make it to disk. So again, we'd have to keep track of who modified what, and when we commit T2, pick out T1's changes and write out only T2's changes. So you do your update like that. And then when T1 aborts, it's super easy to roll back, because there's nothing on disk we need to reverse; we just undo the things we did in memory. So no-steal + force is the easiest thing to implement: you never have to undo the changes from an aborted transaction, because they never made it out to disk, and you never have to redo the changes from a committed transaction, because they always got flushed out to disk. You come back after a crash and the database is exactly the way you want it to be; you don't have to do any recovering. What's an obvious problem with this? Performance is one, but think of a bigger problem, ignoring performance. Yes? The question is about the atomicity I'm assuming here, because an fsync is not just one block. Yeah, so his comment is that I'm ignoring what the hardware can actually provide in terms of atomic writes, which is typically at the level of a single page. If a transaction only modifies one page, I can fsync that and the write will be atomic. If it modifies a ton of pages, I can't fsync all of them and have that be entirely atomic; the hardware only guarantees one page at a time. So that's one issue, and we'll ignore it for now. What's the other obvious problem here? What does no-steal say? No-steal determines whether the database is allowed to flush dirty pages from uncommitted transactions. Somebody says maybe a deadlock; I'm thinking of something more simplistic. Memory, exactly. My database has a billion tuples and I need to update all of them, but I can only fit a million tuples in memory. This doesn't work, because under no-steal you can't write out to disk any dirty pages from uncommitted transactions, so you can't modify the entire table, because you can't fit the entire table in memory. That's the obvious limitation. All right, so I think it's worth, from a historical perspective, spending some time looking at one possible implementation of no-steal + force called shadow paging. The memory issue is still there, but what I'll show here actually solves the atomicity issue that he brought up. Shadow paging was invented by IBM on the System R team; it's how they first designed the storage manager and buffer pool manager in System R, but they later abandoned it because it has some obvious limitations. The main idea of shadow paging is that you keep essentially two copies of the database. You have the master copy, which is guaranteed to be consistent because it only contains the changes of committed transactions, and you have the shadow copy, which is the staging area for transactions that are currently running and where all their writes go. So you can think of it at a high level like this.
It's sort of like multi-versioning, but instead of doing multi-versioning at the tuple level, you're doing it at the page level, and instead of having any number of versions of a page existing at the same time, under shadow paging there are only two: the master version and the shadow version. So transactions run and only make changes to the shadow copy of the database. When a transaction commits, we flip a pointer that says the new master version of the database is now the old shadow copy. And we can do that atomically, so even if our changes span multiple pages, we're just flipping a single root pointer, and that makes everything immediately visible. Shadow paging is an implementation of no-steal + force: we force all of a transaction's changes out to disk at the moment it commits, and we're not allowed to swap out any pages dirtied by uncommitted transactions. The way you implement this is to organize the database in a tree-like structure, where the root of the tree tells you whether you're pointing at the master or the shadow. Because the root is a single page, we can guarantee at the hardware level an atomic write that flips that pointer to the new master version. When transactions show up, they find out where the shadow copy is and start making all their changes there. So essentially it looks like this: this is just the single master version without the shadow. We organize everything as a bunch of pages, and the page table is essentially a set of pointers to where those pages live out on disk. To install the updates, as I said, all we have to do is flip the pointer at the root to point at the new master, the shadow, which becomes the new master. We also have to write that root pointer out to a known location on disk, so that when we come back, we look at that location and it tells us where to find the starting point of the current master. So let's walk through an example. Say a transaction T1 comes along, and when it starts making modifications, we first instantiate a shadow page table, which is just a copy of the master page table; all its pointers point to the same pages out on disk. At this point, the shadow and the master are exact copies of each other, and underneath the covers there's only one copy of the database, the actual disk pages themselves. Any transactions that are read-only can always go through the master page table and read a consistent snapshot, a consistent copy of the database. But any transactions that want to modify things have to go through the shadow pages. Now, as our transaction runs and starts making changes to pages in the shadow page table, we create new copies of those pages on disk and update the shadow's pointers. Then, when the transaction goes ahead and commits, we flip the pointer at the database root to point at the shadow page table. At that point, all the changes made through the shadow page table are visible.
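Here's a minimal sketch of that idea, assuming an in-memory stand-in for the page tables and the root pointer; on a real system the page tables live on disk and the root flip is a single-page atomic write:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in: each slot maps a logical page id to a location on disk.
using PageTable = std::vector<int64_t>;

struct ShadowPagedDatabase {
  // The root pointer; on disk this would be a single page overwritten atomically.
  std::shared_ptr<const PageTable> master;

  // Start modifying: the shadow begins as an exact copy of the master page table,
  // so both initially point at the same disk pages.
  std::shared_ptr<PageTable> BeginShadow() const {
    return std::make_shared<PageTable>(*master);
  }

  // A write goes to a freshly allocated disk page; only the shadow's pointer changes.
  static void WritePage(PageTable &shadow, std::size_t logical_page, int64_t new_disk_location) {
    shadow[logical_page] = new_disk_location;
  }

  // Commit: after flushing the new pages and the shadow table (not shown),
  // flip the root. Readers holding the old master keep a consistent snapshot.
  void Commit(std::shared_ptr<PageTable> shadow) {
    master = std::move(shadow);  // on disk: one atomic write of the root pointer
  }

  // Abort: just throw the shadow away; the master was never touched, so no undo is needed.
  static void Abort(std::shared_ptr<PageTable> /*shadow*/) {}
};
```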
And we can guarantee that we can recover after a crash, because we come back, follow the database root, and find a consistent view of the database from the new master. Here we don't need to maintain a log file or do anything extra; all we have to do is flip that pointer and store it out on disk, which we can do atomically because it's a single page. We come back and immediately have the database in the correct state. Of course, we do have to do garbage collection, because now we have the old page table, the old master, and a bunch of pages that are no longer pointed to by the new master page table, so we have to scan through and clear those out. Yes? Say that again? How do multiple transactions commit? So if you have multiple transactions running, and I'm showing one example here, they're all modifying the shadow page table, and then as a batch you go ahead and commit them and switch over to the new master page table right then and there. I'm showing one transaction, but you could have multiple transactions running at the same time, and all the two-phase locking and the other stuff we talked about before still applies. Correct, you can't say this one transaction committed; you have to take everybody, because otherwise you'd basically have to figure out multi-versioning inside the shadow page table, and then you're back where you started. So again, with shadow paging, undo is super easy: you just blow away the shadow page table and never update the root, so no one ever sees those changes, and there's no worry about transactions reading uncommitted data. And there's no redo at all, because as soon as we write out the database pointer at the root, everything is immediately persistent and durable. So this seems really awesome, a good idea. But as he brought up, you can't commit transactions one at a time; you commit them in a large batch. And there are other issues: you have to copy around the entire page table every time. You can try to be clever and copy only the paths in the tree that changed, but that just adds complexity to the implementation. The commit overhead is high as well, because every commit has to flush out every updated page plus the page table and the root. Over time, your data gets fragmented: say you do a bulk load and have everything nice and in sorted order; you can't do in-place updates anymore, so you end up with pages for different parts of the table scattered here and there, and sequential scans get more expensive. And of course you need garbage collection to go through, prune these things out, and clean them up. As far as I know, the only database systems still around today that do shadow paging are LMDB and CouchDB; those are the only two. Everyone else does the write-ahead logging stuff that we'll talk about next. So the key observation from shadow paging is that we have this problem where we need to write out non-contiguous pages on disk when we do a flush to make sure everything is always consistent.
But if you think about a spinning hard drive, that's essentially a bunch of random writes, because you're writing to different sections of the disk platters and the arm is jumping around to do writes in different locations. This is becoming less of an issue now with really fast SSDs and with non-volatile memory when it arrives, but historically it was a big, big problem, and it's part of the reason the IBM folks eventually ditched shadow paging. Ideally, what we want is a way to convert our random writes to disk into sequential writes, which are much, much faster because we can batch things together. That's essentially the problem that write-ahead logging solves. Write-ahead logging, or WAL for short, is the idea that we maintain on disk a separate file, separate from the actual database pages, that holds log entries recording all of the changes transactions make to our database. Any time we modify a tuple in a regular table page, we have to put an entry into the log file first. And if we're going to flush out a dirty page that was modified by a transaction, we need to make sure we write out the corresponding log entries first: you have to write the log entry before you're allowed to flush the dirty page. That way, if there's a crash, you can come back, look in the log, and figure out what actually happened: what each transaction changed, whether it committed, and whether its changes should be there or not. Write-ahead logging is an example of steal + no-force. Steal means we allow the system to flush dirty pages from uncommitted transactions before they commit, which is okay because we still have the log file: we know what the change was before we wrote it out, and we know how to undo or redo it if necessary. No-force means we don't require that all the pages a transaction modified are written out to disk at the moment the transaction commits. We do have to force out and flush the log entries, but in terms of the pages themselves, inside the buffer pool manager, we don't need to flush all of them. OK, so again, we're going to have these log records containing information about the changes each transaction made, and a transaction is not allowed to be considered fully committed, we can't tell the outside world that it's safe and durable, until we know all of its corresponding log records have been written out to stable storage. The reason I say stable storage here instead of non-volatile storage is that we want the log's home to be the end-all, be-all storage location, because the log is enough information for us to restore the entire database and put it back into the correct state. The log records all the information we need about how the data has changed over time, and we can always recreate the database from it. So we want it to be indestructible; we want it replicated and stored in as many places as possible, so that even if the data center burns down, we still have our database log and can still recover the database state. Okay?
All right, so the protocol is fairly straightforward. When a transaction starts, we always write a begin record to the log. This basically says that some transaction existed and started with this transaction ID, so we know to expect its records further down. When the transaction commits, we write a commit record, and we have to flush all the log records that came before the commit record to ensure everything is durable. You don't want a situation where you've made a bunch of changes to a table, you've written log records for those changes, but you flush the commit record before you flush the change records; you'd come back after a crash with a committed transaction whose changes are missing. So this makes sure everything is flushed in order: all the log records get flushed, then the commit record, and that tells you the boundary of the transaction's changes. For the modification log records, we store some basic information: the transaction ID, the object ID of the thing that was modified, which in simple terms could just be the record ID of the tuple, and then the before value and the after value. The before value allows us to do undo: if a transaction made a change but didn't commit, we can use the before value to put the object back the way it was before the change, as if the transaction never executed. And the after value ensures that if we crash, we can restore the correct value for a transaction that was considered committed. Let's walk through an example. Transaction T1 wants to do a write on A and a write on B. In memory we have our buffer pool as before, but we also have an in-memory staging area for our write-ahead log. When the transaction starts, we write a log record saying transaction T1 began. Then we do the write on A, and we write a log record saying transaction T1 modified object A, the before value was 1, and the after value was 8, and that goes into the log. At this point, everything is still staged in memory; nothing has been written to disk yet. Then we do the write on B, same thing: the transaction ID, the object ID, and the before and after values. And then we commit. Now, when we commit, we flush the log out to disk, and at that point the transaction is considered durable, safe, and committed. So at this point, even though we have not updated the on-disk pages of the database, because we know our log records made it out to disk, it's safe to tell the outside world, tell the application, that our transaction committed, and that if we crash and restart, all the changes it made will still be there. How do we actually go through the log after a crash and restore the database to the correct state? That's what we'll discuss in the next lecture. But again, if we crash here, there's enough information in the log to restore the database to the correct state. Any questions about this high-level idea? Yes? The question is about stable storage: when you write the log, do you always write it to stable storage? Again, stable storage is a high-level concept.
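Before continuing, here's a minimal sketch of what these log records and the commit path might look like; the types and names are hypothetical, values are shown as plain integers, and a real system would log byte- or page-level changes:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical WAL record: BEGIN/COMMIT carry only the transaction id; UPDATE also
// carries the object id plus before/after images (before = undo info, after = redo info).
struct LogRecord {
  enum class Type { BEGIN, UPDATE, COMMIT };
  Type type;
  int64_t txn_id;
  std::string object_id;                // e.g. a record id; empty for BEGIN/COMMIT
  std::optional<int64_t> before_value;  // needed for undo under the steal policy
  std::optional<int64_t> after_value;   // needed for redo under the no-force policy
};

class WalManager {
 public:
  void LogBegin(int64_t txn) { tail_.push_back({LogRecord::Type::BEGIN, txn, "", {}, {}}); }

  void LogUpdate(int64_t txn, std::string obj, int64_t before, int64_t after) {
    tail_.push_back({LogRecord::Type::UPDATE, txn, std::move(obj), before, after});
  }

  // The commit is only acknowledged after the commit record AND everything before it
  // have been flushed to stable storage; dirty data pages can stay in the buffer pool.
  void Commit(int64_t txn) {
    tail_.push_back({LogRecord::Type::COMMIT, txn, "", {}, {}});
    FlushToStableStorage();  // stand-in for appending to the log file + fsync
  }

 private:
  void FlushToStableStorage() { /* write() the tail to the log file, fsync(), clear it */ }

  std::vector<LogRecord> tail_;  // in-memory staging area for log records
};
```

For the example above, T1 would produce a begin record, a record for A with before=1 and after=8, a record for B, and finally the commit record, and the application only hears back once that last flush returns.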
Most people just write the log out to the same non-volatile storage as the regular disk pages, some hard drive in your machine, or EBS. But ideally you think of the log as the treasure, the most important part of the database system, because it records how everything got modified, so you want to make sure that no matter what, you don't lose it: it's enough information to put us back into the correct state. In this example I didn't show it, but when the system first booted up, A, B, and C were already set to certain values; I didn't show those things being inserted. The log, though, would contain the records of the transactions that inserted those entries. So we could lose the disk pages, but as long as we had the full log, we'd always know how to restore the database to the correct state. Is there another question over here? Yes. Your question is whether the performance benefit comes from transforming random writes into sequential writes. That's a big part of it, yes. In this example I only modified one page, and we'd write that one page out when we flush. But if A and B were on two separate pages, on different parts of the platter of a spinning hard drive, or if they were on different machines in a distributed system, I'd have to flush both of those pages to make them durable. With the log, I just do one sequential append, even if the modifications span multiple pages. So that's one aspect of it. What's that? His question is whether the amount of data you have to flush for the log is less than the amount of data you'd have to flush if you were writing the pages directly. Typically, yes, because you can pack multiple tuples on a single page, and most of the time when you modify one tuple, the amount of data you have to log is much less than writing out the entire page. Now, we'll get to this, but eventually you still have to write out that page when you take checkpoints, because you want these things to eventually make it to disk. I said the log is enough to restore the database to the correct state, but an obvious problem is that if you've been running your system for a week or a year, the log would cover that whole period, and when you crash, unless you have a way to restart quickly from the existing pages, you'd have to replay the entire log, which would take a long time. So typically, yes, you write less, and the log has everything you need, but eventually you still want to write out the disk pages. OK, so from an implementation standpoint, there are two questions, and I've sort of covered this already. The first is: when should we actually write the log entries out to disk? The answer is when the transaction commits. To avoid the issue I mentioned before with shadow paging, or with doing force all the time, where you have to fsync every single time a transaction commits, you can use group commit, which is essentially batching together the log entries from multiple transactions into a single buffer and then doing one fsync to flush that buffer out as a group, rather than doing an fsync for every individual transaction. This is actually what you're going to implement in project four.
The way you implement group commit is that you maintain two buffers. You have one buffer where you stage all your log entries, and every transaction can write into that log buffer regardless of whether it has committed yet. Then, when that buffer is full or a timeout fires, you take the staging buffer and make it the flushing buffer, have another thread write it out, and the old flushing buffer becomes the new staging buffer that all new writes go into. The way to think about group commit is that if you're the first transaction to put something into the staging buffer and you need to wait for the flush, you're going to wait the longest, because you have to wait until the buffer fills up or the timeout fires. But if you're the last transaction to put something into that buffer, you get a really fast response, essentially the same as running fsync immediately. So in the worst case you're roughly doubling the amount of time you have to wait, but on average this maximizes the throughput you can get from flushes. All right, the next question is: when do you actually write the dirty data pages out to disk? This is essentially the buffer pool manager's responsibility. Because we're using the steal policy, the buffer pool manager is allowed to write out dirty pages at any time it wants; it could do it every time a transaction does an update or whenever one commits. Typically it doesn't do that; it does the LRU stuff we talked about before, where it looks at what pages are in memory, which are pinned, which are dirty, and then figures out which ones to write back. All right, now one observation we can make is that if we prevent the database system from writing out dirty pages until the transaction commits, then we don't actually need to store the original values, the before values, in the log record. Say my log contains two log records for transaction T1: a write on A, then a write on B. If we ensure that none of this transaction's pages get written out before it commits, then we don't need to store the before image; we can take it out entirely. We still have to flush the log record first, that still has to get written out, but if we're sure we never write out the pages themselves early, we don't need the before values. Because what happens if you crash in these two scenarios? In the first case, if you crash, you just replay the log and reinstall the changes you want to be there; you don't care what was there before. You might be overwriting the same change, but that's fine, you always end up in the correct state. In the second case, you can simply ignore T1's updates, because its pages never made it out to disk, so there's nothing to undo; you just ignore those records entirely, because you never see a commit record. So this is the no-steal case, and it shows that just because you're using write-ahead logging doesn't mean you couldn't use no-steal; you could still do that. But in general, again, we always want to use the steal policy.
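Here's a minimal sketch of the two-buffer group commit scheme just described; the structure and names are hypothetical, not the actual project-four code:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical group-commit log buffer: worker threads append serialized log records to
// the staging buffer, while a dedicated flush thread repeatedly swaps the buffers and
// fsyncs one batch while new appends keep landing in the other.
class GroupCommitLog {
 public:
  // Called by worker threads; returns once the record is staged (not yet durable).
  void Append(const std::string &record) {
    std::lock_guard<std::mutex> lk(mu_);
    staging_.push_back(record);
    cv_.notify_one();  // wake the flusher if it is waiting
  }

  // Called by the flush thread in a loop.
  void FlushOnce() {
    std::vector<std::string> to_flush;
    {
      std::unique_lock<std::mutex> lk(mu_);
      // Wait until there is something to flush or the group-commit timeout expires.
      cv_.wait_for(lk, std::chrono::milliseconds(10), [&] { return !staging_.empty(); });
      std::swap(staging_, to_flush);  // the old staging buffer becomes the flush buffer
    }
    // Outside the lock: write + fsync the whole batch at once; transactions whose commit
    // record is in this batch can now be acknowledged (per-txn signaling not shown).
    WriteAndFsync(to_flush);
  }

 private:
  static void WriteAndFsync(const std::vector<std::string> & /*batch*/) { /* write(), fsync() */ }

  std::mutex mu_;
  std::condition_variable cv_;
  std::vector<std::string> staging_;
};
```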
And as I said, this is because we want to support transactions that update a data set larger than the amount of memory we have: we want to be able to write out changes from uncommitted transactions to make space in memory so we can bring in more pages and modify those too. And we always have to keep the undo information in the log, because if we write out those dirty pages and the transaction aborts, we'd have no way to reverse them, and we'd be persisting changes from transactions that didn't actually commit. That's why we have to use the steal policy. Another way to think about this is in terms of these two quad charts. To think about these different buffer pool policies and how you'd implement them, consider the runtime performance and the recovery performance. Runtime performance is how fast you can process transactions during normal execution. No-force + steal is the fastest at runtime, because with no-force you're not required to do an fsync immediately when a transaction commits, and with steal you're allowed to write out dirty pages whenever you want. Force + no-steal, which is roughly the shadow paging case, is the slowest at runtime, because you have to do all those fsyncs to make sure all the changes are written out to disk before you can say the transaction committed. But steal + no-force, while fastest at runtime, is the slowest for recovery, because you have to go through the log and replay all those entries. Whereas with shadow paging, recovery is the fastest, because you come back and don't have to do anything; the database is already in the correct state, with no changes from uncommitted transactions. So this shows there's a trade-off between getting really fast runtime performance and making recovery slower. And this is why every database system chooses no-force + steal: they assume that in the normal case you're not going to crash every 20 seconds, so they make the design decision that makes running transactions faster, and yes, recovery will be slower. There are ways to speed recovery up using checkpoints, which we'll talk about next, but in practice you assume you won't have to recover all the time, so it's better to make runtime performance as fast as possible. Is this clear? Again, there's this dichotomy between runtime performance and recovery time, and most systems choose to optimize for runtime performance. And again, steal + no-force requires both undo and redo, while force + no-steal requires neither undo nor redo, because you don't have to recover anything. Yes. His question is: is there a combination of force and steal? So steal says you're allowed to write out dirty changes. Yeah, you could use write-ahead logging to do that too. Force would mean that all the pages modified by a transaction have to be flushed out to disk when it commits, so basically anything in the buffer pool that the transaction modified gets flushed immediately at commit.
But because you're doing steal and could have written out pages from uncommitted transactions, you still need the write-ahead log to make that safe. So force + steal is essentially doing the extra step of forcing all the pages out when the transaction commits, even though it's unnecessary, because the log has already been written out and you know it's durable since you've already fsynced it. You're forcing yourself to write out pages that don't need to be written out right away, and you'd still have to deal with pages that contain modifications from other transactions that haven't committed yet. You're basically doing extra work when you don't need to. Yes. His point is that you wouldn't have to redo anything if you did that. Correct, yes, that's true. But again, it makes runtime performance slower, and no one is going to make that trade-off. All right, so as I said, write-ahead logging is what most systems implement. And when we talk about in-memory database systems, the more modern ones, they also still use write-ahead logging with a slight variant. But what's one obvious problem with write-ahead logging? I think I mentioned it earlier: if my database system has been running for a year, my log is going to contain all the changes from the last year. So he says you have to compact it in the background periodically. Yes, more or less; that's essentially what a checkpoint is. The write-ahead log is going to grow forever, and ideally we don't want to have to boot the system up, go back to the beginning of the log, and figure out what needs to be replayed and applied to our disk pages, because that would take a long time. So what we're going to do is periodically take a checkpoint: we flush all the dirty pages in our buffer pool out to disk, so we know those changes are now durable, even though they may still contain changes from uncommitted transactions; again, we can use our write-ahead log to sort that out. So with a checkpoint, basically, we write out to stable storage all the log records sitting in memory and all of our modified pages, and then we write a new checkpoint entry to our log that says: as of this point in time, we know the data on disk contains the modifications of every transaction that committed before the checkpoint. So you don't need to go back very far in the log to figure out what should be redone, and you don't need to go back very far in time to figure out what should be undone. Yes. His question is whether we assume, for this particular scenario, that the checkpoint has to be written out to stable storage as well. Yes, in this example we assume the checkpoints are written to stable storage, so you can come back, load the checkpoint from there, and then replay the log from stable storage as well. Okay, so again, we write this checkpoint entry to the log, and it says: anything that comes before this point, those changes have been written out to disk. So let's look at a simple example.
At the bottom we're going to have a crash, and in our log we have three transactions, T1, T2, and T3, making a bunch of changes. When we come back after the crash, we look at our log, scan through, and find this checkpoint. We know that for any transaction that committed before the checkpoint, all of its changes made it out to the disk pages and are durable, because we flushed the pages and the log records when the checkpoint was taken. In this case, though, T2 and T3 did not commit before the last checkpoint, so we have to figure out what to do with them. In the case of T2, we're going to have to redo its changes, because if we look beyond the checkpoint in the log, we see that T2 ended up committing, so we need to make sure all of its changes after the checkpoint actually get persisted. In the case of T3, we never see a commit record; it was still running when the crash happened. So if T3 made any changes prior to the checkpoint, which are now on disk, we need to make sure those get rolled back. Now, in this example I'm being very hand-wavy about the checkpoint mechanism. For this, I'm assuming that the checkpoint stalls the entire database system while it runs, so there's no situation where we start our checkpoint, write out a dirty page, and then, while the checkpoint is working on other parts of memory, another transaction modifies the page we already wrote out and we miss that change. I'm assuming we get a consistent snapshot of the contents of memory and can write it out to disk atomically. That's not how real database systems implement this; we'll talk about fuzzy checkpoints next class, but for now I just want you to understand the high-level idea of what a checkpoint does and how it speeds up recovery. So there are two issues we have to deal with. The first question is how often we should take checkpoints, and then we have to figure out what information to store in our checkpoints. If we take checkpoints all the time, then, since we're stalling all the transactions while we do it, there's this huge pause in the system every time we take one. As I said, with fuzzy checkpoints we can let transactions keep making changes while the checkpoint runs, but we'll ignore that for now. So if you checkpoint constantly, you slow the system down, because instead of processing transactions and writing out the log records they need, you're writing out disk pages as part of the checkpoint. So there's a trade-off: taking checkpoints all the time speeds up recovery but slows down the system, while taking them too infrequently makes recovery take much longer. There's no hard and fast rule or magic number I can give you, like take a checkpoint every five minutes; different applications and different customers will do different things. But this is usually something you can set in your system.
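For the T1/T2/T3 example, here's a minimal sketch of how recovery might scan the log relative to the last checkpoint to decide what needs redo and what needs undo; this is a toy version under the blocking-checkpoint assumption above, and the real algorithm (ARIES) is next lecture:

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical flattened log entries for this sketch; a real checkpoint record would
// also carry things like the active-transaction table (covered with ARIES next class).
enum class EntryType { BEGIN, UPDATE, COMMIT, CHECKPOINT };
struct LogEntry {
  EntryType type;
  int64_t txn_id;  // ignored for CHECKPOINT
};

struct RecoveryPlan {
  std::set<int64_t> redo;  // committed after the checkpoint: replay their updates
  std::set<int64_t> undo;  // never committed: roll back their updates using before-images
};

RecoveryPlan Analyze(const std::vector<LogEntry> &log) {
  // Find the last checkpoint: transactions that committed before it are already
  // reflected on the disk pages, so they need no work at all.
  std::size_t ckpt = 0;
  for (std::size_t i = 0; i < log.size(); i++) {
    if (log[i].type == EntryType::CHECKPOINT) ckpt = i;
  }
  std::set<int64_t> committed_before, committed_after, seen;
  for (std::size_t i = 0; i < log.size(); i++) {
    if (log[i].type == EntryType::COMMIT) {
      (i < ckpt ? committed_before : committed_after).insert(log[i].txn_id);
    } else if (log[i].type != EntryType::CHECKPOINT) {
      seen.insert(log[i].txn_id);
    }
  }
  RecoveryPlan plan;
  for (int64_t txn : seen) {
    if (committed_before.count(txn) != 0) continue;              // e.g. T1: nothing to do
    if (committed_after.count(txn) != 0) plan.redo.insert(txn);  // e.g. T2
    else plan.undo.insert(txn);                                  // e.g. T3
  }
  return plan;
}
```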
There's always going to be a knob for how often you take these checkpoints. It could be based on how many dirty pages you're willing to let sit around in your buffer pool before you write things out and take a checkpoint, or it could just be periodic. So again, there's nothing I can tell you that's exactly how you should tune your system; it depends on the particular environment. Yes? (Student asks about compacting the log instead.) So your question is, can you do it at a different granularity? What do you mean by that? So you're saying that you compact the log entries for a particular tuple, so that when you come back, you don't have to replay a bunch of entries, you just replay the last one, because that's the last thing you should see. That's essentially what the checkpoint is doing, except the checkpoint is for the entire database. Say I have one tuple and I do a million writes to it. If I don't have a checkpoint, then yes, as we'll see next class, the recovery protocol would have to replay all million updates. What you're describing is that if I take a checkpoint, then whatever the last update was gets written out to the disk page, and the database system knows it doesn't need to go back and look at the other 999,999 updates, because the last one is what got written to the disk page and that's the only one that should be visible. So the checkpoint is essentially doing compaction, but not by explicitly compacting the log; it does it at the disk page level. That's a good point, though. So now the next question for our logging scheme is: what information do we actually want to store? There are essentially three categories of log records, or logging schemes, you can have in your database system. The first is called physical logging, where you record the exact changes you made, at the byte level, to specific locations in the database. The way to think about this is: say I update a tuple, and that tuple is inside a page; my physical log record says, on this page, at this offset, install this byte sequence. The other end of the spectrum is called logical logging, where instead of storing the low-level physical changes to the individual bytes of a page, you just record the high-level operation the transaction performed on the database. Think of it as recording the SQL statement you executed in the log record; that's enough for you to go back and replay it, because you just re-execute the same SQL statement again. The key difference is that with physical logging you need one log record per page, so if a single query updates multiple pages, you need a log entry for every single one of those pages. Under logical logging, you just have the single SQL statement, which may span multiple pages, and that's enough to figure out how to re-execute it at recovery time.
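As a sketch of the difference, here's roughly what a physical record and a logical record might carry for the same single-tuple update; the field names are hypothetical:

```cpp
#include <cstdint>
#include <string>

// Both records describe the same hypothetical update:
//   UPDATE foo SET value = 'xyz' WHERE id = 1;

// Physical: a byte-level change at an exact location, one record per modified page
// (plus similar records for any index pages that were touched).
struct PhysicalLogRecord {
  int64_t txn_id;
  std::string table;   // "foo"
  int32_t page_id;     // which page was modified
  int32_t offset;      // byte offset of the slot inside that page
  std::string before;  // old bytes, for undo
  std::string after;   // new bytes, for redo
};

// Logical: just the high-level operation, one record no matter how many pages it touches.
struct LogicalLogRecord {
  int64_t txn_id;
  std::string statement;  // "UPDATE foo SET value = 'xyz' WHERE id = 1"
};
```

Replaying the logical record means re-running the whole statement, which is where the recovery-time and interleaving problems discussed next come from.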
So obviously, logical logging will in practice store much, much less information than physical logging. If I update a billion tuples with a single UPDATE statement, I only need one log entry for that SQL query under logical logging; with physical logging, if that update touches a million pages, I need a log entry for every one of those million pages. So it seems obvious that logical logging is what we want to do, but in practice it's hard to implement recovery with it when you have concurrent transactions and a non-deterministic concurrency control protocol. What I mean by non-deterministic is that when you re-execute the same two transactions during recovery, then because of race conditions in scheduling, like who gets a lock first, one query might get executed before another in a different order, in physical time, than the first time around. So you'd have to store a bunch of extra information saying this logical log record executed before that one, and it made its updates to these pages before the other one did; you'd have to put extra information in the log records to say exactly how two logical queries were interleaved. Now, a system like VoltDB doesn't have this problem, because it has single-threaded execution engines where only one transaction can be executing one SQL query at a time at a particular partition. So they don't have to record any of this extra information, because they're single-threaded. But most systems aren't like that. You allow multiple threads to run at the same time on the same data, and now, because of OS scheduling or whatever else happens in the system, the ordering the second time around during recovery may not be the same as the first time around. That can leave you in an inconsistent state, and now your transactions aren't durable. The other problem with logical logging is that recovery takes much longer, because you're executing every single SQL query over again. So if a query took one hour to run the first time, it's going to take one hour to run during recovery; there's no magic during recovery that makes it go faster. So if you can't tolerate a lot of downtime, this is going to be a bad choice.

So the hybrid approach, the one that tries to get the best of both worlds and sits in the middle, is called physiological logging. What happens here is that, like logical logging, we record at a high level the change we want to make to the data, but like physical logging, we have a log record per page. Then when you crash and recover, you know how to replay each entry in the log to apply that high-level logical change to the data you modified in the individual page. This is the most popular approach, and it's what most database systems use, because it's a nice trade-off between physical logging, where you say exactly how you want the bytes changed, and logical logging, where you have a high-level operation that you can replay.
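A hedged sketch of what such a record might look like, and how it gets replayed, is below. The Page type and the FindSlot and ApplyChange helpers are hypothetical placeholders, not a real buffer pool API; the point is only that the record is per-page but identifies the tuple logically inside that page.

```cpp
#include <cstdint>
#include <string>

// Physiological: still one record per page (like physical logging), but the
// tuple inside the page is identified by its object/record id rather than a
// byte offset, and the change itself is described at a higher level.
struct PhysiologicalLogRecord {
  uint64_t page_id;    // which page to touch (the "physical" part)
  uint64_t object_id;  // which tuple inside that page (the "logical" part)
  std::string change;  // e.g. "value = 'xyz'" or "value = value + 1"
};

struct Page;  // provided by the storage manager (hypothetical)
int FindSlot(Page *page, uint64_t object_id);                       // hypothetical helper
void ApplyChange(Page *page, int slot, const std::string &change);  // hypothetical helper

// On redo we fetch the page, look up wherever the tuple currently sits inside
// it, and re-apply the change -- no byte offsets required.
void Redo(Page *page, const PhysiologicalLogRecord &rec) {
  int slot = FindSlot(page, rec.object_id);
  ApplyChange(page, slot, rec.change);
}
```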
All right, so let's look at a simple example. Say we have a table foo and we want to update a single attribute called value, setting it to 'xyz' based on the primary key where id equals one. Under physical logging, you record exact information about the table you're modifying, the page, the offset, and the before and after values. I won't talk about index logging too much, we'll cover that next class, but we essentially have to do the same thing to record the changes we make to indexes. With logical logging, the only thing you store in the log record is the SQL query itself, and that's enough for us to replay it and, ideally, get back to the same correct state. And then with physiological logging, instead of storing the offset, we just store the object id. The advantage is that when we come back, we don't care about modifying some exact offset; we look in the page, find object id one, and then go ahead and make the modification. So again, physiological logging works on a per-page basis, but the modification within the page can be described at a higher level. For example, if you say value equals value plus one, we would store that expression in the physiological log record rather than the exact byte sequence.

Okay, so to finish up, what I've talked about today is primarily focused on write-ahead logging, which is the most important scheme and the one used in most database systems. It allows us to handle losing the modifications that transactions made to pages hanging out in memory: we can replay the log and put everything back into the correct state. By using steal and no-force with checkpoints, we allow transactions to modify data sets that exceed the amount of memory we have in our system, and we don't require forcing all the dirty disk pages out when a transaction commits; we just have to make sure that its log records get flushed. Then on recovery, all we have to do is undo the uncommitted transactions using the before-images in our log records to put us back to the correct state, and redo the committed transactions to ensure that all of their changes are durable and persist beyond the crash.

All right, so next class we'll talk about ARIES. ARIES is the gold standard for how you implement recovery using write-ahead logging in a disk-oriented database system. It was invented at IBM in the early 1990s, and every database system that uses write-ahead logging implements some variant of it, more or less. I'll cover at a high level how it works, and then you guys will be implementing a simplified version of it in project four. As a teaser for next class, the big issue that I didn't talk about today, and that ARIES handles, is doing crash recovery during crash recovery. If you crash and have to replay the log, and then you crash again during that replay, what do you actually do? That's the tricky part, and ARIES can handle it, okay? All right, any questions? All right, again, no class on Wednesday. Homework five is due tonight. Project three is due Wednesday night. And then we will start up again Monday next week, okay? All right, guys, thanks. See ya.