What's your name? Your ex and your sister Sugar Cuts came to my office looking for you. What do they want? Well, first of all, they said they needed to talk to you and that you had the money. Are you serious? Okay, first of all, Sugar Cuts? You should not trust a word coming out of her mouth. She used to work at Arby's; she lies for a living. Uh huh. And I said, okay, she's the one that owes me money. Also, I took the test: it's not mine. It's not mine. So Sugar Cuts, her sister, says you owe her money; the ex says the kid's yours. You're saying it's not yours? It's not mine, and they owe me money. So what are you going to do? I think I'm just going to do the evasions. Alright. I hope the best for you. Okay. Alright. For you guys in the class, alright.

So we cut off the lecture a little early on MVCC. There's nothing really deep left in there that I want to discuss; you can look at it on your own, and it's not something that will be covered on the exam. So to get us back on track with the schedule, I want to jump into what today's lecture is about, which is database logging.

Going forward, here's what's due coming up: homework three is due on November 13th, the Sunday. Project three has again been extended by three days; it's due on the 16th. The Snowflake lecture will be on the 6th, the Tuesday; the live lecture will be on the 8th, and we also have the final exam review on the 8th as well. Okay. Any questions about homework three or project three at this point? Some of you have already completed project three, with 100%, so congrats.

A bunch of optional things are coming up; there are a lot of database talks scheduled at CMU. Today at three o'clock a CMU alum is coming to talk about HDB. It's basically trying to replace the relational model, but it still uses Postgres, which I don't think is a good idea, but we'll see how that goes. Then next Monday we'll have somebody from a system project called Gaia, which is a database built for autonomous robots, so they'll talk about what they're doing there. The guy actually building this is an elite database person: he was one of the lead architects of Amazon Aurora, and he worked on Redshift at Amazon. He's so good that Amazon is letting him do another database start-up while still paying him his salary, because they don't want to lose him. So he's going to talk about what he's building at Gaia. TigerBeetle is a special-purpose transactional database system that's written in Zig. Who here has heard of Zig? Nobody. Who here has heard of Rust? Almost everyone. Zig is an alternative to Rust, so we'll hear about that memory-safe programming language, and since TigerBeetle is written in Zig, we'll talk about that too. And then the VMware guys will give a talk at the end of this month about SplinterDB.

So again, I've said multiple times throughout the semester how much I love databases. Here are four different databases targeting different workloads and different architectures, running on different platforms. And none of them are... well, one's running on Postgres, but people are building new systems all the time now. There's a lot of interesting stuff going on in databases, and this is why I find this field so fascinating. Okay, again, these are optional. I'm not the host for the first one; it's a PL thing. The other three will again be on YouTube and Zoom. All right, so today's class...
We're going to talk about how to make sure that our database is durable and safe. The last four lectures have been about transactions: we focused on concurrency control protocols, and that was about isolation, how to prevent conflicts or anomalies. Now we're going to talk about how to make sure that if we write a bunch of changes to the database and commit a transaction, and the database system tells us our transaction is committed, then even if there's a crash or something happens, we come back and our data is still there.

All right, let's go through a simple example. We have T1, running by itself; for this purpose we can ignore two-phase locking, we can ignore all the concurrency control stuff. It's going to read A and then write A. Assume that our buffer pool is entirely empty. So when the transaction starts and wants to read A, we have to go out to disk, find the page that has A, and bring it in. Then we do the write on A, and again, as we've discussed, this is just writing into the page that's sitting in memory in the buffer pool. We apply our change there, and then we're done. Now our transaction goes to commit. So let's say the worst person in the world comes along and zaps the power on the data center, on the machine. What happens? We lose the contents of memory; we lose our buffer pool. So here's the thing: if we told the outside world that our transaction had committed before its change was actually written to disk, then when we crash and come back (assuming the machine itself survived), our change is not there, even though we told the outside world our transaction committed. And again, when we say a transaction commits, it has to provide the ACID guarantees, and the D in ACID is durability. Our change did not persist across a restart after a crash. That's bad, and that's the focus today: crash recovery.

The recovery algorithm is the set of techniques the database system implements at runtime and upon startup to ensure that, assuming we're running with all the full protections, we guarantee atomicity, consistency, and durability despite any possible failure. That's not entirely true for every scenario: if you actually bomb the data center, we can't bend the laws of physics and get your data back, but we'll see how to handle that too. Basically, it's the protocols, implementations, and techniques the system provides to give us these guarantees. And there are going to be two parts to this. The first is what the database system does while processing transactions normally, during normal operations: the actual work it does, the extra metadata and information it stores on disk, in preparation for an unclean restart or a crash. The second part is, if there is a crash, what happens upon startup: how do we use all the things we stored during normal operation to recreate the database and put it back in the correct state, to make sure we don't lose anything? Today's lecture is about the first part.
What do we actually do at runtime, while we're running queries, inserts, updates, and deletes against the database, to make sure that for the second part, when we crash and come back (which we'll talk about Tuesday next week), we can put the database back in the correct state? There's a lot to cover here; hopefully we'll get through everything. The first thing is to describe what types of failures the database system could possibly have to recover from. Then we'll talk about how we want to manage memory in our buffer pool when transactions can potentially write dirty data out to disk. Then we'll cover two techniques to support durable database systems: shadow paging, which we've touched on a little, and write-ahead logging, which I've alluded to throughout the semester. Then we'll talk about what goes inside the log records if we're doing write-ahead logging, and finally about taking checkpoints to reduce recovery time when using a write-ahead log; we'll see on Tuesday next week how to improve on that. The spoiler for this lecture is that the write-ahead log is going to be the ideal approach, the one we want to use and the one most database systems use. We'll understand why as we go along. Okay?

So the first thing to understand is what types of failures can occur and what is actually possible to recover from. Recall that throughout the entire semester we've assumed disk-oriented database systems: the primary storage location of the database is on disk, we have this notion of a buffer pool in volatile memory, and we use it to stage data in memory for reads and writes. So now the question is: what are the different scenarios where we could end up losing data or having a crash in our system, and which of them can be handled? Right? This is sort of repeating what everybody said. Volatile storage, that's DRAM; it does not persist after a power loss or a crash. Non-volatile storage is retained after a power loss or a crash. And then there's this notion of stable storage; I think the textbook talks about this. It basically says that no matter what happens, if you write something to this storage, you'll never lose data. This obviously can't actually exist: even if I have a hard drive, I've synced to it, and the bits are physically on the platter, if I light it on fire and it melts, I'm not getting my data out. The way we approximate stable storage in practice is through replication, either at the hard drive level or across multiple machines. But for our purposes, we really only need to focus on the top two.

All right, so there are going to be three types of failures: transaction failures, system failures, and storage media failures. We'll see that the first two we can handle. The third one we get around using replication; it's also the one you would avoid if you had stable storage, which doesn't exist in the real world. All right, so transaction failures. These are things we've already talked about before: the reasons why a transaction may say, I can't keep running and I have to roll back my changes. First, there are logical errors.
These are things like trying to insert a non-unique value into a column that has a unique constraint on it: the database system prevents me from doing it, and I have to abort my transaction and roll back my changes. Internal state failures would be things like deadlocks: under two-phase locking, if I try to acquire a lock on an object and there's a deadlock, the database system could decide to kill me, so I need to handle that and roll back any changes. The reason I'm going through all of this is that transactions which get aborted for these reasons may already have written things to disk, and we need to make sure that when we come back after a crash, those changes aren't still around, because we aborted the transaction before the crash.

There are two types of system failures to deal with. The first is a software failure: the software itself is buggy and crashes. Say the database system hits a divide-by-zero exception that isn't caught and the whole thing crashes, or the OS has a kernel panic. This is a software-level event where the system simply has to halt, and any state we were maintaining, whether in the buffer pool or in program counters and registers, is gone. The second is a hardware failure: we lose power to the machine running the database system, all memory gets wiped out, and we have to reboot. Going forward, we'll make the assumption of what is called fail-stop behavior, meaning that if the system crashes, the data we've already written to non-volatile storage is not corrupted. Now again, in the real world this isn't guaranteed. You can totally have the disk head writing to a spinning hard drive, and you pull the power, and all of a sudden it careens onto the platter and starts destroying data. Even with SSDs, cells can go bad and you lose data. So for our purposes, within the database system, we assume this does not happen. We can use checksums to detect corruption, but obviously if the data gets corrupted we can't reverse it unless we have another copy, which we'd get through replication if we're approximating stable storage.

The last category is essentially what I just said: a cataclysmic hardware failure that the database system cannot handle. The database system is software; it's not a robot, it can't grow arms and start fixing the hard drive itself. If there's an issue with the physical hardware the system is running on, the database system simply cannot fix it. This requires a human to intervene on behalf of the database system, to either move the instance to another machine or swap out the buggy hardware. So we can handle the first two, but we can't handle this last one, and we'll talk about distributed databases and how to use replication to avoid this problem next week. Yes? The question is: why can't we use something like RAID and have the hardware detect a failure and just recover from it? Yeah, so that would still be hidden from the database system.
As far as I know, the database system is not going to communicate with some RAID controller and start making decisions about how to move things around. There's an abstraction layer between that part of the hardware and the database system itself, so we're not going to write custom code in the database system that says, oh, one of the disks is dying, let me switch to another one. And if the RAID controller runs out of disks to swap in and starts losing data, then we're hosed; we can't do anything. Again: if you light the machine on fire and the hard drives melt, we can't handle that. The human has to come save us. That's okay.

So as I said before, the primary storage location of the database is on disk, and since disk is slower than volatile memory, we always bring things into memory to stage our reads and writes. Now, when a transaction commits, we need to make sure that any changes it made are written back out to disk. That doesn't necessarily have to be the original location of the data it read: if I'm updating a record on page one, I could potentially write my change out to page two, but then the system has to know that the latest version is over there. So there's extra metadata we're going to maintain to figure out what was dirty, who updated it, and how to reconcile whether that transaction committed or not, and therefore whether the current state of the page on disk is the correct one. Again, these are the things we need to guarantee to make our database durable: once we tell the outside world your transaction has committed, those changes are persistent and durable on disk. And we don't want any partial changes, no torn updates, and no lost updates.

The primitives we're going to use to achieve this are undo and redo. We talked a little about this with the delta storage stuff in MVCC: I said the delta was almost like an undo record, where you store the diff between the old version and the new version, and if you ever need to go back to the older version, you can reapply the diff, which is essentially undoing the previous change. It's basically the same idea; it's like a diff. Undo means removing the effects of an incomplete or aborted transaction; redo means reapplying the effects of a committed transaction onto the logical tuple, record, or data, to put us back in the correct state. Now, how the database system implements undo and redo depends on how it manages memory in the buffer pool.

So let's look at an example. We now have two transactions: T1 is going to read A and write A, and T2 is going to read B and write B. Say we have a single page in our database, and at the very beginning we have a cold buffer pool; everything is out on disk. So T1 starts and reads A. What do we have to do? We fetch that page from disk and bring it into memory. Then we do the write on A, and for this example, assuming a single-versioned system, we just write the new value of A into the page.
Now, after a context switch, T2 starts. T2 reads B, and the page containing B is already in memory, so we don't fetch anything; we just read the value. But then T2 writes B, and same thing in a single-versioned system: it updates B with the new value in the page. So now T2 goes ahead and commits, and again, as I said, any time a transaction commits, to make sure its changes are durable, we need to write something out to disk. In this case we'd potentially write out this page, the one containing T2's change, back to disk. What's the obvious problem here? What's that? Right: the same page contains a change to A from T1, and we have to decide whether we're allowed to write that to disk or not. Say we do: we just flush the whole page, and now on disk we have the update to A from T1 and the update to B from T2. So here we commit and tell the outside world, T2, you've committed, you're done. That's fine. But now T1 aborts, and assume we didn't know this would happen: for whatever reason, the client disappears or something goes wrong and T1 gets aborted. But there's now a change on disk from T1, which never committed. To roll it back we have to go out to disk, read the page back in, reverse the change, and write it back out. So this would be durable for T2, but it would also be "durable" for T1, and that's not what we want, because T1 didn't commit and shouldn't have anything durable. So that's bad.

So the questions we have to figure out are: when is it okay for us to write dirty pages out to disk, and when a transaction commits, what do we have to do with its dirty pages? Do we have to make sure they're all written out to disk, or can we do something else? There are two policies in our buffer pool implementation that answer these two questions. And as I said after the midterm, the reason we're talking about all of this now, even though the first couple of lectures were about the buffer pool, is that you need to understand what the buffer pool is, then understand what transactions are, and then see how we make the buffer pool aware of transactions. That's why we're going back to buffer pool stuff.

The first policy is called the steal policy, and it determines whether the database system is allowed to let an uncommitted transaction overwrite the latest committed version of a page, or of a value in a tuple, out on disk. Basically: if a page in our buffer pool has been modified by a transaction that has not committed yet, are we allowed to steal that frame from it and write the page out to disk anyway? Under the steal policy, this is allowed. Under no-steal, it is not: any page that has been modified by an uncommitted transaction cannot leave the buffer pool, cannot be written to disk. Is this clear? Okay. The next policy is called the force policy.
Force determines whether the database system is required to flush any pages modified by a transaction out to disk before it's allowed to say the transaction has committed. Again, the client sends the commit command through SQL to the database system, but the system doesn't have to respond right away: under the force policy, it figures out what it needs to write out to disk, and only when those things are flushed to non-volatile storage does it say, yes, you've committed. So force means this is required; no-force means it is not required.

Yes? He asks: isn't it a huge bottleneck to require everybody who commits to flush their dirty pages out to disk before they're allowed to commit? Yes. Example: this setup runs the same transactions, T1 and T2, read A / write A and read B / write B, but now under a no-steal + force policy. T1 starts and reads A; that's on disk, so we bring it into the buffer pool. Then it writes A, updating A to 3. Then there's a context switch, T2 reads B, which is in memory, that's fine, and then writes B, updating it in the page. Then T2 goes to commit. The force policy says all the changes made by T2 have to be written to disk at this point, before we can tell the outside world the transaction committed. But, as we said, the problem is that the page also contains a record modified by T1, and T1 has not committed; under no-steal you can't write values modified by an uncommitted transaction to disk. So the way we handle this under no-steal + force is that we make a copy of the page, apply only the change T2 made to that copy, and write the copy out to disk. Once that's flushed, we tell the outside world T2 has committed, and we're fine. Then at some later point, when T1 resumes and aborts, it's easy to roll back its change, assuming we have undo information for T1 to reverse the change it made to the page.

Yes? You're suggesting a counter... a counter of the number of transactions that have updated this page? Okay, so his potential solution to all this extra work is: before you flush the page out to disk when T2 commits, you wait some amount of time, undefined how long, for all the transactions that modified the page to also commit, sort of group them up, and then flush the page out in a single batch. But what if my transaction runs for an hour? T1 runs for an hour, T2 takes one millisecond. Am I going to wait an hour? So the advantage of this scheme is that rolling back is obviously easy. The disadvantage is that the copying, and the reversal if necessary, are on the critical path: when I commit, I have to go make the copy and apply my change, or reverse the change from the other guy. I'm doing that work at commit time, and it's expensive.
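To make the two policies concrete, here's a minimal sketch, in Python, of how a buffer pool might enforce no-steal and force. This is purely hypothetical illustration code, not any real system's API; all class and method names are made up.

```python
# A minimal sketch (hypothetical, not any real system's API) of a buffer
# pool enforcing NO-STEAL and FORCE. No-steal: a page dirtied by an
# uncommitted transaction may not be written out. Force: at commit, the
# transaction's dirty pages must reach disk before we acknowledge.

class Page:
    def __init__(self, page_id):
        self.page_id = page_id
        self.dirtied_by = set()   # uncommitted txns that wrote this page

class BufferPool:
    def __init__(self, no_steal=True, force=True):
        self.no_steal = no_steal
        self.force = force
        self.pages = {}           # page_id -> Page

    def write(self, txn_id, page_id):
        page = self.pages.setdefault(page_id, Page(page_id))
        page.dirtied_by.add(txn_id)

    def evict(self, page_id):
        page = self.pages[page_id]
        # NO-STEAL: refuse to flush a page holding uncommitted changes
        if self.no_steal and page.dirtied_by:
            raise RuntimeError("cannot evict: uncommitted writes on page")
        self._flush(page)
        del self.pages[page_id]

    def commit(self, txn_id):
        for page in self.pages.values():
            if txn_id in page.dirtied_by:
                page.dirtied_by.discard(txn_id)
                # FORCE: flush before acking. If another uncommitted txn
                # also dirtied this page, a real no-steal + force system
                # would flush a copy containing only our change.
                if self.force:
                    self._flush(page)
        # only now is it safe to tell the client "committed"

    def _flush(self, page):
        pass  # stand-in for the actual disk write plus fsync
```

Note the two blocking points: `evict` can fail under no-steal, and `commit` stalls on I/O under force. Flipping both flags to steal + no-force removes both, which is exactly what the write-ahead log will let us do safely.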
What's another big problem with this approach? Yes, he says this requires the buffer pool manager to be aware of the page contents. Potentially, but that's okay; we're the database system, we can control that. And yes, assume there's two-phase locking, or some other higher-level mechanism protecting two transactions from trying to write the same thing. There's another big problem. Yes, so she nailed it; you have the right answer, and I'll repeat it in a different way. The problem with this approach is that a transaction cannot modify a portion of the database that exceeds the amount of memory available in the buffer pool. If my database is one gigabyte in size but my buffer pool is only 500 megabytes, I can only bring 500 megabytes in. I modify those 500 megabytes, then I try to go get the next page and I have to abort, because I can't flush out any of my dirty pages and I've used all my memory. So essentially, under the no-steal policy, you can't support transactions that touch more data than the memory available. And this is a strawman approach, a terrible idea; I'm just trying to show you the implications of the design decisions behind steal/no-steal and force/no-force, and we'll see why steal + no-force is better with the write-ahead log. To your question: yeah, if my table has a billion tuples and I only have enough memory to keep one million in memory, I can't update all one billion, at least not in the context of a single transaction.

All right, so this is the easiest approach to implement, and the nice thing about it, even with these issues, is that since the database only ever contains updates from committed transactions, when I crash and come back, my database is already in a correct, consistent state. It has no updates from transactions that didn't commit, so there's nothing to reverse; I come back and my database is ready to go. And after a crash I never need to redo the changes of a committed transaction either, because I know all its changes were flushed to disk before I told the outside world it committed. But as I said, you can't use this if your transaction needs to modify more of the database than fits in memory. It's also really bad for another reason: going back to the example, suppose T1 didn't actually abort. I flushed the copy with T2's change to disk, then T1 commits and I have to write the page out again; I wrote the same page twice for these two transactions. And on an SSD you can only write a cell so many times before it burns out, something like a hundred thousand writes (they've gotten better), so here I'd have excessive writes: one transaction updates a page and flushes the change to disk, then the next transaction updates the same page and immediately flushes the same page again, over and over. You'd burn out the disk pretty quickly.

So shadow paging is an implementation of no-steal + force that avoids some of these problems. It's still not ideal; this is not the way you want to build a database system, although there are one or two examples that do it. But let's see how we can make the no-steal + force policy more tractable, and that will segue into why we want write-ahead logging. Yes? When I say "before you commit": meaning, when you call commit, then you flush everything out. But before that, if query 1 does an update and then I have query 2, can anything be flushed between query 1 and query 2? No: under no-steal, I can't flush anything out until the commit.
And force says: the client says commit my transaction, then I write everything out, and only when that's done do I tell the outside world it's committed. Do you say commit and then you crash? So the question is: if you say commit and then you crash, what is actually correct? Well, from the database system's perspective, all that matters is whether everything got out to disk. It doesn't matter whether you actually heard that you committed, because we can't guarantee you'll get that message. You tell me to commit; I flush everything to disk; I send you a network message saying yes, you're committed; and either you crash, or the network fails, or I crash before you get that message. When you come back, what's on disk is still the result of the committed transaction. So from the database system's perspective, it's the application's responsibility to figure out: I didn't get the commit acknowledgment I expected, so did that transaction actually commit? The database system can't do that for you; once things leave the confines of the database system itself, we can't control them.

Yes? The question is: am I saying that the commit is atomic with writing to disk? No, because you can only count on the hardware guaranteeing atomicity for something like 4-kilobyte writes. And the follow-up question: what if I write part of my changes to disk and then crash; would I have to do extra work to recover things? In my toy example here, yes: if I update multiple pages and only write one of them, I have to come back and figure out what I missed. Shadow paging will handle that; we'll see it now.

All right, so I think we mentioned shadow paging at the beginning, when we started concurrency control, but now we'll go into a bit more detail. The basic idea is that instead of keeping a single version of the database, we maintain two versions; this looks a lot like MVCC, except now we're doing it at the page level instead of the tuple level. The master version of the database only contains changes from committed transactions. The shadow version is a temporary staging area: we copy pages into the shadow version before we write to them, and apply our changes there. Then there's a root pointer page that points to either the master version or the shadow version, and it always points to the master. When our transaction commits, we change that pointer to point to the shadow version, which atomically becomes the new master. That avoids the problem he brought up, torn writes when my transaction commits: I make sure all my writes to the shadow copy are staged and flushed, and once they're durable, I just do one atomic swap of the root pointer. So again, this is an example of no-steal + force; it's a little muddy, because you could flush shadow-copy pages to disk early, but roughly speaking.

Okay, here's the setup. We have memory, we have disk, and there's this database root that points to the location of the page table, and the page table points to pages on disk. A new transaction comes along, and I make a new shadow page table that initially points to the same pages on disk as the master version.
We're only going to modify the shadow copy; the master is read-only. So now, any time I want to update data in a page, I first make a copy of the page on disk to a new page, update my shadow page table to point to this new page, and apply my changes there. Update page 4, same thing; update page 2, same thing. So now, if I crash at any time here, when I come back the database root points to the master page table, which only contains the changes of committed transactions; I never see any of these other pages. If I do commit, then what I need to do is essentially a compare-and-swap on the root pointer: update the root page on disk to say, here's the new location of the master page table. And assuming the root is a 4K page, I can count on the hardware making that write atomic. If it happens and I don't crash, the shadow becomes the new master. If I don't write it before I crash, no problem: the root was still pointing to the old master, and the changes made down here technically never happened; I didn't tell the outside world I committed.

Yes? He asks how much fragmentation this causes. A lot; we'll get to that. So once I've told the outside world this transaction committed, because the root now points to the new page table, some garbage collection mechanism can blow away the old master page table; the new one becomes the master, and that's now a consistent version of the database. Yes, correct: you would also have to flush the page table contents to disk. And to the follow-up: I have to flush this page table out to disk, and it's bigger than 4 kilobytes, so how do I make sure that's done atomically? I don't have to. I flush the directory, the page table, out to disk, and if that crashes, who cares: I come back, and the root pointer is still pointing to the old one. The last step is always updating the root; once that is written to disk and flushed, everyone who comes after sees this transaction's changes. Then at some later point we do garbage collection and prune all the old pages.

So recovery is super easy, and rollbacks are super easy. If I need to roll back a transaction without a crash, I blow away the shadow page table and make a new one for the next transaction; that's easy. If I come back after a crash, the root pointer always points to the latest master version (I don't want to say "consistent snapshot", so: the latest master version), and I don't have to do any additional work to put myself back in the correct state: I only have changes made by transactions that I told the outside world had committed. So we're good; there's nothing to redo. As he pointed out, though, this has a lot of problems.

So this is shadow paging. As I said before, this is what IBM originally did for System R back in the 1970s, when they were building the first relational database system at IBM. But they ended up abandoning it in the 1980s when they rebuilt a new relational database system, DB2, that they were actually going to sell commercially; System R was only a research project. When they built DB2, they scrapped all of this and switched to the write-ahead log approach.
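Here's a toy sketch of the shadow paging commit protocol, assuming a made-up key-value "disk" and an atomically updatable root pointer. These structures are invented for illustration; they are not System R's actual design.

```python
# A toy sketch of the shadow paging commit protocol (made-up structures,
# not System R's actual design). The master page table reflects only
# committed state; a transaction copies pages before writing them and
# commits by atomically repointing a single root pointer.

disk = {"pt0": {}}   # fake disk: ids -> page bytes or page tables
root = "pt0"         # root pointer; assume updating it is an atomic 4K write
next_id = 0

def begin():
    # the shadow table starts out pointing at the same pages as the master
    return dict(disk[root])

def write(shadow_table, logical_page, new_bytes):
    global next_id
    next_id += 1
    disk[next_id] = new_bytes          # copy-on-write: old page untouched
    shadow_table[logical_page] = next_id

def commit(shadow_table):
    global root, next_id
    next_id += 1
    table_id = f"pt{next_id}"
    disk[table_id] = shadow_table      # flush new pages + shadow table first
    root = table_id                    # commit point: one atomic root update
    # the old master table and its now-unreachable pages become garbage

# Abort is free: throw the shadow table away. A crash at any point before
# the root update leaves the root on the old master, so only committed
# state is ever visible after restart.
```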
So what are the problems? The first is that copying the entire page table each time is expensive. You can avoid some of that by using a B+tree structure, where the database root is essentially the root of the tree, and any time you need to create or modify a page, you copy only the branches of the tree along that path. This is essentially what LMDB does, a system that exists today. But it's still a lot more work to maintain these coarse-grained versions. The commit overhead is super high as well, because not only do I have to flush every single page I modified, I also have to flush the page table and then flush the root, and that takes a long time. Then there's the fragmentation point he brought up, which is correct. Going back to the example: say I bulk-loaded the data at the beginning, so the pages form a nicely clustered index, sorted in some primary key order. After I run a transaction, there are now a bunch of holes on disk; assuming this is a sequential file, those holes will later gain data from new pages I create in the future. So now if I do a sequential scan, I may have to skip over pages that aren't visible to me, or that have been garbage-collected from aborted transactions; I may read a lot more data and do more random access than I otherwise would. And that was one of the big reasons IBM got rid of this in the 1980s: back then disk was super slow, the gap between sequential scan and random access was much larger than it is now, and with shadow paging we had way more random reads and random writes. The other challenge in this scheme is that we could have two transactions running simultaneously that make updates to different objects that just happen to be stored on the same page, and when we go to commit, we have to make sure we reverse the change from the uncommitted transaction while letting the committed transaction's change go out to disk. You can avoid this by batching things up, sort of what he alluded to before, but as I said: what happens if one transaction takes an hour and one takes a millisecond? Do you wait the entire time, or abort one of them?

Yes? His question is whether this solves the issue she brought up, that you can't modify more of the database than fits in memory within a single transaction. Yes, it does, because the buffer pool can swap uncommitted shadow pages out to disk just fine. And again, it's not an issue if you crash and come back: those shadow pages are simply ignored.

So as I said, this is what IBM did back in the day. CouchDB used something similar. But probably the best-known system that did something like this, one everyone here had to use for the first project, is actually SQLite. We'll see that theirs is a variation of shadow paging: instead of copying the page and modifying the copy, they copy the original page aside and modify the master version in place. This is how SQLite handled recovery before 2010; after 2010 they switched to a write-ahead log for performance reasons, but you can still get this behavior in SQLite by enabling what they call rollback mode. Note also that SQLite has a single writer thread, so only one thread can write to the database at a time, which means we don't have the batching problem we mentioned before. So basically, when a transaction is going to modify a page, the first thing it does is bring the page into memory.
Then it copies the unmodified page out to a journal file on the local file system, and after that it makes the modification to the page in memory, and you're fine, you're done. Then I go ahead and modify page 3: same thing, copy the original out into the journal file on disk first. So now, at some later point, the buffer manager says, okay, I'm running out of memory, let me write a dirty page out to disk, and it goes and overwrites the master copy in the database file; I'll call it dirty page 2 here. But then say the system crashes: the process gets killed and we lose everything in memory, even though we've already flushed out dirty page 2. Upon recovery, we look at the journal file, because it's been flushed to disk, and ask: does this correspond to a transaction that didn't commit? If yes, we copy the original pages in the journal file back into the database file, overwriting any pages that were modified by the transaction that didn't commit. And to the question: yes, correct, you only delete the journal file after you've either reversed the changes of the uncommitted transaction or the transaction has committed.

The SQLite documentation on this is fascinating; I encourage everyone to follow the link, because they describe exactly how this works. What's also interesting about SQLite is that they have to support all sorts of crazy hardware and operating systems, not just nice laptops: they're supporting embedded devices running on airplanes, satellites, and boats, weird architectures you never thought about. And the semantics of what it actually means to flush things to disk, and what the OS supports in these different embedded devices, can vary wildly. So while this seems sort of expensive to do, it was portable: this technique works in a bunch of different environments beyond Linux, and that's a big reason they did it this way. But they eventually switched to the write-ahead log.
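Here's a rough sketch of those rollback-journal semantics. This is not SQLite's actual code; every name is made up, and the "files" are plain dicts so the sketch stays self-contained.

```python
# A rough sketch of rollback-journal recovery in the style SQLite used
# before 2010 (not its actual code; every name here is made up). The rule:
# a page's original image must be durable in the journal before the
# database file is overwritten, and removing the journal is the commit point.

def modify_page(db_file, journal, page_id, new_bytes):
    if page_id not in journal:
        journal[page_id] = db_file[page_id]   # save the unmodified original
        flush_journal(journal)                # journal reaches disk first
    db_file[page_id] = new_bytes              # now the overwrite is safe

def commit(journal):
    journal.clear()    # stands in for deleting the journal file

def recover(db_file, journal):
    # a journal still present at startup means the txn never committed:
    # copy every original page image back over the database file
    for page_id, original in journal.items():
        db_file[page_id] = original
    journal.clear()

def flush_journal(journal):
    pass  # stand-in for writing the journal file and calling fsync
```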
So shadow paging, both the SQLite version and the IBM version, is not great, because it requires us to perform a bunch of random writes to non-contiguous pages on disk, and as we've said, we want to maximize sequential writes, and sequential reads where possible, to improve performance. The other thing is that we certainly don't want to lose data; we want our recovery protocol to put us back in the correct state. But maybe we don't want to pay the big runtime overhead shadow paging imposes, copying pages while we're running transactions, because we're not going to crash that often.

The question is: why does shadow paging here only support one writer transaction? Going back to the example: say I have multiple transactions update page 2, one of those transactions commits, and I write page 2 out to disk. How do I make sure the committed transaction's changes get persisted across recovery while the other transaction's don't? Say you crash, and you come back, and page 2 on disk contains updates from transactions T1 and T2, while the journal's copy of the page contains neither update. So how do you keep T1's update but reverse T2's? The journal file in this case isn't staging changes; it's just a copy of the original version of the page. You stage your changes in memory, and if you run out of memory, you write the page out to disk. So either all of the transactions have to commit at exactly the same time, or I need a way to partially reverse some of the changes to the page, so that committed T1's changes get retained and T2's get rolled back. The protocol as I've described it doesn't handle that. The way you would do this with shadow paging is to batch transactions: say every five milliseconds, all transactions commit together. If my transaction finishes in one millisecond, I wait until the five-millisecond mark, and then I'm allowed to commit. That's fine if things can be broken into five-millisecond chunks; if your transaction runs for an hour, it doesn't work, because everyone else has to wait an hour. Yes? And right: in this example I can effectively only use half the disk storage available, because in theory a transaction could modify the entire database, so my journal file needs as much space available as the entire database. Yes. I don't want to dwell too much on shadow paging, because you don't want to do it; trust me.

So the approach we are actually going to use is called write-ahead logging. Like I said, this is what pretty much every database system does; even log-structured databases, which we'll talk about in a second, have a write-ahead log. The idea here is that we now maintain a separate log file on disk that contains the changes transactions made to the database while they were running. Assume the log is on stable storage, meaning that no matter what happens, lightning strikes the machine or whatever, we can always get the log back, and that's enough to recreate the database. That works because the log records carry the information needed to undo and redo any operation any transaction made to the database, so we can either reverse a change or reapply it as needed. And the key concept, the key requirement with write-ahead logging, is this: the database system has to write and flush to disk any log record a transaction created when it modified a page or an object before it's allowed to write that modified page or object to disk. So if I have a transaction that updates a single record, I create a log record corresponding to the update and modify the page in memory, and before the buffer pool manager can flush that page out to disk, it has to flush the log record first. That's why it's called a write-ahead log: you're writing the log record ahead of the page that was modified. All right, and this is an example of steal + no-force. Steal means we're allowed to write out dirty pages modified by transactions before those transactions commit, as long as we write out the log records first. And no-force says we're not required to flush the dirty pages when a transaction commits in order to tell the outside world you've committed; but you do have to flush the log records, because the log records are enough to tell us what the transaction actually did.
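Hypothetically, that core ordering rule might look like the following sketch, assuming a single in-memory log buffer and a made-up page representation. This is not how any particular system implements it; real systems track this with log sequence numbers, which we cover next class.

```python
# A minimal sketch of the write-ahead rule under steal + no-force
# (hypothetical; real systems track this with log sequence numbers).
# Invariant: a dirty page may not be written to disk until every log
# record describing its changes is durable.

log_buffer = []     # in-memory tail of the log
durable_upto = 0    # count of log records already flushed to disk

def append_log(record):
    log_buffer.append(record)
    return len(log_buffer)                # this record's position in the log

def flush_log():
    global durable_upto
    fsync_log(log_buffer[durable_upto:])  # write + fsync the unflushed tail
    durable_upto = len(log_buffer)

def flush_page(page):
    # STEAL: evicting a page with uncommitted changes is fine, as long as
    # the log records covering those changes hit disk first
    if page["last_log_pos"] > durable_upto:
        flush_log()
    write_page(page)

def commit(txn_id):
    append_log(("COMMIT", txn_id))
    flush_log()     # NO-FORCE: flush the log, not the dirty pages
    # only after this does the system acknowledge the commit

def fsync_log(records):  pass   # stand-in for file write plus os.fsync
def write_page(page):    pass   # stand-in for the page write
```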
Yes? The statement is: this sounds just like the log-structured storage we talked about before; isn't it the same thing, and isn't that doubling up? Log-structured storage systems do still maintain a write-ahead log. Is that doubling up? At a high level, the in-memory log buffer feeding log-structured storage looks like basically the same thing, but what it feeds is a tree data structure, and that's different from the write-ahead log, which simply appends records to a file. I didn't go deep into log-structured merge trees, but they're more than just appending to a file; there's an index on top of it, which we ignored.

Next question: is the assumption here that writing the log is a lot faster than writing the changes straight to the database? The answer is yes. Think of a transaction that updates a thousand tuples stored in a thousand different pages, where each update changes only one byte. I could write roughly a thousand one-byte log records corresponding to those thousand changes, or I could write a thousand modified 4-kilobyte pages. That's significantly less data going out to disk through the write-ahead log than through updating the pages. Okay, you good? Seems like your mind is blown by this. I know. Okay, any other questions? Sorry, okay.

So again, we're going to stage all the transactions' changes in volatile storage. I usually say "backed by the buffer pool"; that's actually not entirely correct, because depending on the implementation you may have separate memory allocated on the side, whether or not it's backed by the buffer pool, that goes directly to the log file. Then all the log records corresponding to modified pages have to be flushed to disk before we're allowed to overwrite the master, single version of the database on disk. And again, we can't tell the outside world a transaction committed until all those records have been flushed to disk. There are a bunch of systems where, by default, this last part is actually relaxed: they won't wait until the records are flushed and instead do asynchronous commit. They say, all right, well, I've staged them; they'll get written out within five milliseconds, but I'm not going to wait for that. If you really care about not losing data, you wait until those things are actually flushed. Again, this is something the database system can't figure out for you; it depends on your failure tolerance. If you're okay with losing maybe the last five milliseconds of data, you don't wait for the flush; if you don't want to lose that, you definitely wait for the flush.

The worst system I've ever seen do something like this is actually Mongo, as I wear the shirt. In the original version of MongoDB, they didn't do transactions: you would do a write, and it would immediately come back and say, yep, got your write. But all that acknowledged was that the server received your write message over the network. It didn't say it actually applied the write; it didn't say it actually made it to disk. It said, yep, I saw it, good. And the only way to determine whether your write actually made it to disk was to send another message asking, hey, that thing I just told you to write, did you actually write it to disk? And that request would actually wait for the flush. So this was the early 2010s, and they had amazing benchmark numbers because they were doing this trick; trick's not the right word. Anyway, take that offline. No, I mean, in the newer versions they don't do that anymore: they now have multi-document transactions, they have a write-ahead log, and they'll wait until things flush if you want. The early versions cut some corners, how about that.
So the way this is going to work is that we have this write-ahead log file, and every time a transaction starts, we write a begin entry, a begin log record that marks the starting point for that transaction. Most systems don't add the begin record when you actually call begin, but rather when you first issue a query that modifies the database, because if I call begin and then just commit immediately, I don't want to write anything to disk for that. All right, and then when a transaction commits, we write a commit record into the log, and we have to make sure that all of that transaction's log records that came before the commit record have been flushed to disk before we tell the outside world we've committed. The begin record is essentially a guidepost that tells the database system there will be no log records for this transaction before that point; otherwise, without the begin record, you'd have to scan back to the beginning of the file just to see whether the transaction made some earlier modification.

So each log record contains some basic information. This is a gross oversimplification, but for our purposes it's enough to understand what's going on. There's a transaction ID, which again could be a timestamp or some logical counter like we talked about before. Then there's an object ID of the thing actually being modified: a page, a tuple, whatever. Then there's the before value, which allows us to undo, to reverse the change. And then there's the after value, which allows us to redo, to reapply the change. If you're doing append-only MVCC like Postgres, you don't actually need to store the before value: they're always creating a new tuple, every new physical version is a brand-new tuple, so you never need to reverse the previous one. They only need redo; they don't need undo.
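As a sketch of that simplified record from the slide, here's what it might look like in code. Real systems store more (LSNs, a pointer to the transaction's previous record, page ids, and so on); the values come from the walkthrough that follows.

```python
from dataclasses import dataclass

# The simplified record from the slide (real systems store more: LSNs,
# pointers to the txn's previous record, page ids, and so on). The before
# and after images are what make undo and redo possible.

@dataclass
class LogRecord:
    txn_id: int
    object_id: str
    before: int     # value prior to the write, used for undo
    after: int      # value after the write, used for redo

def undo(db, rec):
    db[rec.object_id] = rec.before   # remove an aborted txn's effect

def redo(db, rec):
    db[rec.object_id] = rec.after    # reapply a committed txn's effect

# From the upcoming example: T1 changes A from 1 to 8
db = {"A": 1}
rec = LogRecord(txn_id=1, object_id="A", before=1, after=8)
redo(db, rec)    # db["A"] == 8
undo(db, rec)    # db["A"] == 1 again
```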
So let's see an example. In memory we have our buffer pool and this write-ahead log buffer. When T1 starts, we append an entry into the in-memory buffer for the write-ahead log saying T1 has begun. Then we do a write on A. The first thing we need to do is put a log record in memory that describes the change being made: the transaction ID is T1, the object ID is A, the before value is 1, and the new value being written is 8. We have to do this first; only then are we allowed to go ahead and write the change into the page. The reason for this order is that we don't want the page to get flushed out to disk before our log record does. We'll talk about log sequence numbers properly in the next class, but basically every log record gets its own LSN, its own identifier in sequential order, and in the page we mark the LSN of the last modification to that page. That way, if we go to write the page out to disk, we know we first need to make sure the log record with that LSN has already been written out to disk. It's basically a watermark that tracks whether it's safe to write the page. So to make sure everything happens in the right order: write the log record first, get a log sequence number, then update the page. We'll cover the details next class.

Then I do the write on B: same thing, create a new log record, put it in the write-ahead log buffer, then update the page. Now I go ahead and commit: I flush the log buffer in memory out to disk, and once it's on disk and durable, I know it's safe to tell the outside world the transaction has committed. If there's a crash at some later point and the contents of memory are blown away before we write out the dirty page that was in the buffer pool, that's okay, because the log record has enough information for us to recover the state of that page. We'll talk about this in the next class, but basically upon startup you look at the write-ahead log: oh, I see a transaction that has committed; let me go make sure the pages it modified actually have its changes applied. Well, how do you do that? Next time.

Wait, so if you write your changes and call commit, how can you then abort? Okay, crashes, yes. The log made it to disk but the page didn't, correct. Okay, so the scenario is: my log record got flushed, I told the outside world, then I crash and lose the dirty page in the buffer pool before it reaches disk. What happens when the system starts back up? Before it's allowed to start executing queries, the system has to go through recovery mode, where it looks at the write-ahead log and asks: for every transaction I said committed, did its changes actually get applied to the pages on disk? If yes, I'm good; if no, I reapply them. Once that recovery check is done, you start executing queries, and you're guaranteed not to have the anomaly you're talking about.

Okay, so an obvious bottleneck, or the problem, is going to be that if every transaction that commits has to have its changes flushed out to disk, that's going to become slow, because everyone waits for the disk. If two transactions commit at almost the same time, but one gets in just before the other, the second transaction has to wait for the first fsync to come back. Does everyone know what fsync is? It's the POSIX call for flushing: the OS blocks your request until the hardware says, I've safely written it to disk. How safe that actually is depends on the hardware; sometimes there's a battery-backed buffer down there, but we don't care about that now. So I call fsync for the first transaction and it gets flushed to disk; the second transaction was maybe half a microsecond behind, but it didn't get to be part of that flush, so it has to wait for that flush to come back before it can call its own. So the speed at which you can execute and commit transactions depends heavily on what the hardware can actually support: if your flush time is 5 milliseconds, then you can only commit a transaction once every 5 milliseconds, and that's really slow.
So a simple optimization we can do is to batch transactions together, so their flushes are combined and we amortize the cost of the fsync: transactions that arrive roughly close to each other in time all piggyback on the same flush, and the system gets better throughput. Best-case scenario, you're the last transaction in before the flush, and you don't wait at all. If you're the first transaction, you have to wait for the queue to fill up, or for whatever mechanism you're using to decide when to flush. The way this works is that you can maintain multiple write-ahead log buffers. When a transaction starts, it puts all its log records into the first buffer and keeps writing into it; T2 starts running and puts its log records there too. Now, at this point our log buffer is full, so we go ahead and write it out to disk, flip a pointer in the log manager to say, okay, here's the next log buffer, and start appending all our new entries there. So while transactions keep running and putting new entries into the second log buffer, the disk is flushing out the first one. And when transactions commit, we obviously don't want to wait for other transactions to show up and fill out the log buffer before we flush, so there's a simple timeout mechanism: if it's been five milliseconds (usually a good number), go flush out whatever is in the buffer now. And again, that's okay: with a write-ahead log it's fine to flush log records from transactions that have not committed yet, because we have the undo and redo information, so we know how to reverse them. If there's a crash, we come back and check whether each transaction actually committed or not. So we commit, and then the flush.
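Here's a toy sketch of that group-commit idea. It simplifies to a single shared buffer guarded by a lock, whereas real systems flip between multiple buffers so appends continue during a flush, and they fsync an actual file; all names here are invented.

```python
import threading, time

# A toy group-commit sketch (hypothetical). Committers append a record and
# block; one background thread flushes the whole batch on a timeout, so
# many commits share a single fsync.

class LogManager:
    def __init__(self, flush_interval=0.005):        # ~5 ms, as in lecture
        self.buffer = []
        self.lock = threading.Lock()
        self.flushed = threading.Condition(self.lock)
        self.flush_interval = flush_interval
        threading.Thread(target=self._flusher, daemon=True).start()

    def commit(self, txn_id):
        with self.lock:
            self.buffer.append(("COMMIT", txn_id))
            self.flushed.wait()      # block until a flush covers our record
        # safe to acknowledge the client here

    def _flusher(self):
        while True:
            time.sleep(self.flush_interval)
            with self.lock:
                if self.buffer:
                    self._fsync(self.buffer)   # one fsync for the batch
                    self.buffer.clear()
                    self.flushed.notify_all()  # wake all waiting committers

    def _fsync(self, records):
        pass  # stand-in for writing the buffer tail and calling fsync
```

Best case, a committer arrives just before the timer fires and waits almost nothing; worst case, it waits a full flush interval. That's the latency-for-throughput trade the lecture describes.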
Most systems are going to assume that crashes are rare, and would therefore rather have faster runtime performance than avoid paying the penalty on the rare occasion they have to recover from the log. Question: yes, the current implementation is no-force + steal, because it's a write-ahead log. The one I showed here, SQLite, would be force + steal, because you flush out the dirty pages when transactions commit, and you're allowed to flush out dirty pages even before they commit, but you always have to make a copy of the original version in the journal, the side file; it's like shadow paging in reverse. I think I mentioned before, there was this old database system in Puerto Rico in the 1970s where they would always have power failures, so they chose a design with slower runtime performance in exchange for faster recovery time. They were crashing randomly throughout the day and didn't know how long they would have until the next power outage, so they were willing to pay the penalty at runtime to get instant recovery, because when power came back on they didn't know how long they had before it went out again. That's the only system I know of, but it's one example where you want really fast recovery time.

So let's talk about what's actually in our log files. There are basically three things you can put in there. Physical logging is like a git diff: the byte-level delta between the old version of the page and the new version of the page. Logical logging is where you store just the operation the transaction or the query made, literally the SQL command. And physiological logging, which is what most systems implement, is a combination of the two: you still store a diff of the old contents, but you do it at the tuple level rather than the page level. With physical logging, just like applying a diff in version control systems like git, you have to have exactly the right version of the page so the diff applies cleanly. But with physiological logging you say: here's the tuple and here's the diff of the tuple; I don't actually know where in the page it's stored, I just give you a slot number, and the DBMS can adjust where things sit in the slot array and figure out where the diff should actually be applied. The reason you want this is that if I run something like vacuum, where I'm cleaning up old versions, the page layout may have changed between the time I made the diff in the first place and the time I apply it during recovery, so physical logging would have a lot of conflicts.

So let's use a simple example. Say I have a query that updates a single tuple in a table. Physical logging would record the diff at an exact offset, with before and after values; for indexes you'd have the same thing, you'd say here's the actual page and here's the diff of the page I want to modify. With logical logging, you literally just store the query, and that's enough at recovery time to say: here's the change that was applied to the database. I don't know what pages it modified, but if I just rerun the query, it puts the database back in the correct state.
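As a concrete illustration, here's a hypothetical sketch of what a log record for that same single-tuple UPDATE might carry under each of the three schemes; all the field names are invented for this example:

```python
from dataclasses import dataclass

# One UPDATE, three ways of logging it. All field names are made up.

@dataclass
class PhysicalLogRecord:
    """Byte-level diff, tied to an exact location on an exact page."""
    txn_id: int
    page_id: int
    offset: int          # byte offset within the page
    before: bytes        # old bytes at that offset
    after: bytes         # new bytes at that offset

@dataclass
class LogicalLogRecord:
    """Just the high-level operation; one record even if it touches
    a billion tuples."""
    txn_id: int
    query: str           # e.g. "UPDATE foo SET val = val + 1 WHERE id = 123"

@dataclass
class PhysiologicalLogRecord:
    """Tuple-level diff: names the page and slot, but not the byte offset,
    so the DBMS is free to rearrange the slot array (e.g. after a vacuum)."""
    txn_id: int
    page_id: int
    slot_num: int        # which entry in the page's slot array
    before: bytes        # old tuple image
    after: bytes         # new tuple image
```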
Physiological logging looks very similar to physical logging, but again I have a slot number for my diff instead of an actual byte offset in the page, and for the index I can just say: here's the page, here's the record, the key/value pair I want you to put in this page; I'm not specifying exactly where it goes.

Question: you can't use logical logging for write-ahead logging because you don't know the old value? In my example I showed the old value and the new value; that was meant as a high-level demonstration of what physiological logging does. For logical logging you would replace the old value and new value with just the query. You would use this primarily for in-memory systems, because then, yeah, this is hard to explain, but you would know the state of the database on disk and which queries did not get applied to the data on disk, so you have to go rerun them. Actually, you can ignore the in-memory part: I have my version of the database on disk, I look at my log, find the first log record whose effects did not make it out to disk, and then just rerun all the queries from there. The advantage of this is that if a query updates a billion tuples, the physical and physiological approaches would need a billion log entries, whereas this is one entry, so the log is a lot smaller. But there is no magic we can do at runtime or recovery time to make the replayed query run faster than the real query: if the query took one hour to run, on recovery it potentially takes one hour to run again. So most systems don't do this.

Question: this doesn't support undo, correct? You wouldn't need it... well, you can't have it; that's what I was saying. There is a notion of knowing the version of the database that is on disk, not of individual tuples but of the database itself, think of it as a snapshot. If I know what version that is and what the last applied query was, then I can figure out the oldest log entry that hasn't been applied and re-execute all the queries that come after it. But to the question: you would not be able to use logical logging to undo back to the previous version, correct, because I can run UPDATE ... SET value = random(), literally a random number, and I wouldn't know what the old value was. Yes, this approach is used in other systems.

For the sake of time, let me skip ahead. Basically, everyone does physiological logging. Physical logging has too much dependence on having the exact page layout; physiological logging gives you the freedom to potentially move tuples around and apply the diff just to the tuple, not the entire page. Very, very few systems do logical logging.

The thing I want to bring up, which was mentioned before, is: what about these log-structured systems? Log-structured database systems don't have dirty pages; there are only log records. You buffer them into memory, and once that buffer is full, you flush it out. But that memtable could be quite large, maybe 10 megabytes, so I still need a write-ahead log as a separate log file that says: here are the changes I've made for transactions that I said committed but whose memtable contents I haven't flushed out yet. If there's now a crash, I come back, look at my write-ahead log, and use it to recreate the in-memory memtable. Same idea; this is what everyone does.
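Here's a minimal sketch of that write path and recovery for a log-structured engine, assuming a made-up JSON-lines WAL format and using a plain dict as the memtable:

```python
import json
import os

class LogStructuredStore:
    """Hypothetical sketch: writes go to the WAL first, then the memtable,
    so acknowledged commits survive a crash that wipes the memtable."""

    def __init__(self, wal_path: str):
        self.memtable: dict[str, str] = {}
        self.wal = open(wal_path, "a")

    def put(self, key: str, value: str) -> None:
        # WAL record first; only then the in-memory write.
        self.wal.write(json.dumps({"k": key, "v": value}) + "\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())
        self.memtable[key] = value

    @classmethod
    def recover(cls, wal_path: str) -> "LogStructuredStore":
        # After a crash, replay the WAL to rebuild the memtable.
        store = cls(wal_path)
        with open(wal_path) as f:
            for line in f:
                rec = json.loads(line)
                store.memtable[rec["k"]] = rec["v"]
        return store
```

Once the memtable is flushed out to disk, the WAL prefix covering it can be thrown away, which is the same log-truncation idea the checkpoint discussion below gets at.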
So, quickly, here's a preview of what we're talking about next week. An obvious problem with the write-ahead log, just like with the version chains in MVCC, is that it can grow forever: I keep committing transactions, I keep appending log records, and if I crash now, I potentially have to replay the entire log to put me back in the correct state. If my database runs for a year and then crashes, I'd have to examine the log for the entire year and try to figure out whether all my changes actually made it out.

One way to avoid this problem is to do what's called checkpoints. At some point in time we flush all the dirty pages in our buffer pool out to disk; it's okay that some of their transactions may not have committed. Then we introduce a new kind of log record that says: here we took a checkpoint. At that point we know that anything modified in memory prior to the checkpoint has been written to disk, and that caps how far back in time we have to look in our log; essentially it's a hint telling us where to start our crash recovery.

I'm going to show you a naive checkpoint scheme to give you the basic idea, then we'll see what the problems are, and next class will cover how to actually do this for real. The way it works is that when the database system decides to run a checkpoint, it pauses all queries; whether or not you wait until they finish doesn't matter for our purposes here. Then we flush all the log records in memory out to disk, then we flush all the dirty pages in memory out to disk, because again the log records have to get written first. Then we add a checkpoint entry to our write-ahead log and resume any queries we paused.

Say we have a setup like this: three transactions running, T1, T2, T3, and then at the end there's a crash, with a checkpoint taken in the middle. Upon recovery, somewhere on disk the system records the offset in the log file of where the checkpoint was taken, so that gives you at least a starting point before you look around and see what's going on. From that point we go back in reverse order up the log and figure out which transactions were running at the time of the crash; we'd find T3, T2, and then T1. In the case of T1, it committed before the checkpoint, so I know all its log records were written out to disk, and at the checkpoint I flushed any dirty pages as well, so any page modified by T1 has been written to disk by the checkpoint. We can ignore everything T1 did. But T2 and T3 started before the checkpoint and did not commit before the checkpoint, so we need to make sure their changes are handled. In the case of T2, after the checkpoint we see a commit record; we know that got written to disk, it's in our log when we come back. So we see that T2 committed, we told the outside world it committed, and we need to redo all its changes and make sure they get applied. But we don't see a commit record for T3, so that transaction did not commit. We need to use its log records to reverse any of its changes and make sure that any page it modified doesn't have those changes applied.
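Here's a minimal sketch of that recovery analysis over the T1/T2/T3 example, using a made-up in-memory log format; a real system would of course read serialized records from disk:

```python
# Hypothetical log: a list of (record_type, txn_id) entries mirroring the
# T1/T2/T3 walkthrough above. "CKPT" marks where the checkpoint was taken.
LOG = [
    ("BEGIN", "T1"), ("UPDATE", "T1"),
    ("BEGIN", "T2"), ("UPDATE", "T2"),
    ("COMMIT", "T1"),
    ("BEGIN", "T3"), ("UPDATE", "T3"),
    ("CKPT", None),                 # all dirty pages were flushed here
    ("UPDATE", "T2"), ("COMMIT", "T2"),
    ("UPDATE", "T3"),
]                                   # <- crash happens after the last entry

def analyze(log):
    """Decide, from the log alone, which transactions to redo and which
    to undo after a crash: T1 is ignored, T2 redone, T3 undone."""
    ckpt = max(i for i, (kind, _) in enumerate(log) if kind == "CKPT")
    started, commit_at = set(), {}
    for i, (kind, txn) in enumerate(log):
        if kind == "BEGIN":
            started.add(txn)
        elif kind == "COMMIT":
            commit_at[txn] = i
    redo = {t for t, i in commit_at.items() if i > ckpt}    # committed, but
                                                            # pages may not be on disk
    undo = started - commit_at.keys()                       # never committed
    ignore = {t for t, i in commit_at.items() if i < ckpt}  # checkpoint flushed them
    return redo, undo, ignore

print(analyze(LOG))   # ({'T2'}, {'T3'}, {'T1'})
```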
Yes? Alright, so the question, the statement, is: I take a checkpoint and then T2 commits, and in between there wasn't any update, so T2's changes would already have been flushed to disk. Assume there's another update in there, then yes, you have to check; to your point, yes. Again, PowerPoint limitations.

Okay, so this approach will work, but it has an obvious problem: we have to stall all queries and transactions while we take a checkpoint, to make sure it's consistent. If I have a huge buffer pool, which is not unheard of these days, and I have one terabyte of dirty pages to write out, I pause all the queries, write out a terabyte of data to disk, and then unpause the queries. Your system will look unresponsive, and that's not an option in a modern system, so we'll see how to get around that problem next class. The other challenge is how we scan and find the transactions that were actually running at the checkpoint. I was being hand-wavy here, but basically I'd have to read log entries, parse them, and figure out what's going on, going back up to some point until I don't see any more transactions that could have been running. That's obviously expensive to do every time you take a checkpoint if you have a lot of transactions running, so we can store some hints in our checkpoint records in the log, saying: here are the transactions that were running at the time of the checkpoint, and use that to short-circuit how much work we have to do.

The other challenge is how often we should take checkpoints, which isn't really defined by any of this. If we take checkpoints all the time, non-stop, we're going to have really great recovery times because we won't have to look at much of the log. But checkpoints aren't free, especially in the blocking case where we're blocking queries, so now the checkpoints are constantly stalling the system. You don't want to be at the other end of the spectrum either, taking a checkpoint once a month, because then you potentially have to replay a month's worth of log. We'll see how to avoid the blocking case, but how often you should take a checkpoint depends on the application's recovery requirements. In some cases, for highly available systems where you want zero downtime, you're very aggressive with checkpoints and you pay that penalty at runtime, because you want to come back quickly after a crash. For a lot of systems, if recovery takes five minutes, ten minutes, who cares, that might be good enough. Again, it depends on the application; it's not something the system can figure out for you.

Alright, any questions about write-ahead logging and checkpoints? Again, we'll cover checkpoints more next class, but the main thing to point out is that write-ahead logging is pretty much how everyone is going to do this: you make incremental updates to pages, so you can run steal + no-force, and then on recovery we undo the uncommitted transactions and redo the committed transactions, and we have enough information in the log to do this. Okay. Alright guys, see you next class. Hit it.