Yeah, you haven't heard anything from my ex? My ex. Oh, your ex. Yeah, they haven't bothered you or anything? No, I'm kind of in my office, so yeah. I think I'm in the clear though. They were talking about like Ethereum or something, I don't know. FTX, have you heard of that? Of course, yes. Yeah, that's all on fire. Okay, yeah, that's all a scam. Alright, for you guys, let's talk about real stuff. So Project 3 is due tomorrow, again at midnight. Project 4 will be out probably tomorrow morning, and that'll be due on Sunday, December 11th, near the end of the semester, or at the end of the semester. So we're having the Zoom session, the Q&A session, actually this week, this Thursday, because next week is the holiday, and we want to do it sooner rather than later, right? So the project will be out tomorrow. I'll discuss it at the end of class on Thursday, and then we'll have the Q&A session, which again will be recorded, on Thursday as well, okay? Any questions about Project 3 or Project 4? Okay. And then for the spring semester, for those of you that really love this stuff: again, I'm not teaching it, Charlie will be teaching the course next semester, but if you want to be involved in it as a TA, we'll post on Piazza how to sign up and reach out to him and talk about it. I have a question. Yes? It's registration week, so I was just wondering, who should take the advanced database class? The question is who should take the advanced database class. I think it says in the write-up, anyway. I mean, what do you want me to say? Everyone. Another way to think about it: in the intro class, we're spending a lot of time on transactions, concurrency control, and so forth, and I've mentioned that there are these things called analytical systems, the column store stuff. I've sprinkled some of that language throughout the semester, but that will really be the focus in the spring semester: the advanced class is about how you actually build a modern high-performance analytical database system. Now, I think we're probably going to be using Postgres for the projects, which is not high-performance analytics, but we'll figure it out. But anyway, if you want to take the advanced class, I also recommend you come be a TA next semester if you're not graduating, okay? Alright, so last class we were talking about database recovery, specifically the things the system is going to do while you're running transactions, while they're updating the database: all the extra write-ahead log entries we're going to add that we're going to use to figure out what happened when there's a crash and how to put the system back into a correct state. So today's class is: if there is a crash, how do we go back, look at that log and our checkpoints, figure out what was going on at the moment of the crash, and put the system back into a correct state? Again, I'm not saying all my lectures are easy or hard, it's hard to rank them, but this is probably the third hardest one to understand. Query optimization is difficult, the concurrency control stuff is tricky, and this is probably the third hardest one.
And so I'm going to try to be very slow and deliberate as we go along, because obviously this is super important. This is the whole reason why we tell everybody you should not be writing your own database by hand, writing files on disk in your application: you want to use a database system because it can provide you these recoverability and durability guarantees. So this class is really the whole enchilada: how do we actually achieve that promise, okay? The technique we're going to go through is a high-level summary of something called ARIES: Algorithms for Recovery and Isolation Exploiting Semantics. I think the textbook mentions this. It was a very influential paper written by some famous database people at IBM in the early 1990s, and it basically codified the rules for how you achieve reliable recovery in a disk-based database system. I'll say that not all database systems implement exactly what the ARIES paper describes. The paper is also 70 pages; it's very long, it's very laborious, so I don't recommend reading it. The textbook is a summary, and this lecture will be a summary as well: we're just going to hit the major points and the steps you have to take to achieve reliability. The paper goes into much more detail about low-level semantics and internal data structures of DB2 that we don't need to cover, but the compensation log records, the fuzzy checkpoints, all of that was defined there, and we'll describe what those are as we go along. Basically, the paper tells you exactly how to do it; there are some things you can potentially relax, not do all of it, and still achieve the reliable recovery we're trying to achieve today. So this lecture is a condensed-down version of the original paper. There's a Wikipedia article about it, which maybe shows it has some notoriety, if you want to go look for more detail. So the main ideas we're going to have in ARIES are the following. The first is that we're going to rely on the write-ahead log to record all the changes that transactions make to the database during normal processing. Any update to a page in the database has to have a corresponding write-ahead log entry, and that log entry has to be flushed to disk before we're allowed to flush the page that it modified. So we always write the log entry out to disk first, ahead of the dirty page. And this relies on the steal and no-force policies: steal says we're allowed to write out dirty pages from the buffer pool before the transaction commits, but again, we have to write out the log entry first; no-force says we don't require the system to flush out all the dirty pages for a transaction when it commits. We're allowed to do that at some later point, but again, the log records have to be flushed. So that's what we do normally while we execute queries. Checkpoints aren't required; we'll discuss checkpoints again today, but checkpoints are a way to reduce the amount of log we have to scan upon recovery. It's basically a way to say: I don't need to go past this point in my write-ahead log, because I know that things have been flushed before this.
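To make that record format concrete, here is a minimal C++ sketch of what a single write-ahead log record might carry, assuming simple physical logging with before/after images. The field names are made up for illustration; they are not from the ARIES paper or any particular system, and the LSN fields get defined in a moment:

    #include <cstdint>
    #include <vector>

    enum class LogType { BEGIN, UPDATE, COMMIT, ABORT, CLR, TXN_END,
                         CHECKPOINT_BEGIN, CHECKPOINT_END };

    struct LogRecord {
        uint64_t lsn;       // log sequence number (defined shortly)
        uint64_t prevLsn;   // previous record of the same txn (used for rollback)
        uint64_t txnId;     // which transaction wrote this record
        LogType  type;
        uint64_t pageId;    // which page the update touched
        uint32_t offset;    // where in the page
        std::vector<uint8_t> before;  // old value, for undo
        std::vector<uint8_t> after;   // new value, for redo
        uint64_t undoNext;  // only meaningful for CLRs (covered later)
    };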
But it's not required. Yes? The question is: why do I say it must be steal plus no-force? Why can't it be, say, steal plus force? So steal means you're allowed to flush dirty pages out to disk before the transaction that wrote them has committed, and no-force means you don't require all the dirty pages to be flushed out at commit. You could do force; it's just less efficient, because you're requiring that you flush out all the log records and you're requiring that you flush out the dirty pages, when you only need to flush the log records. From a correctness point of view force would still be correct, but you wouldn't do it. So that's what we do during normal operations. Then, after a crash, when we need to recover, the other big idea is that we're going to repeat history: we replay the changes that were made to the database by replaying the log records we see. The goal is to put the database back into the exact state it was in at the moment the system crashed, based on what we see in the log. Then we're going to go back and reverse any changes from any transaction that did not commit before the crash. But the key idea now is that when we reverse those changes, we also add new log entries that say: here are the changes we're reversing. What that helps with is this: if we crash and start recovering, but then crash again during recovery, we can look at our log and see what we were doing during the recovery. This keeps us in a correct state even if we keep crashing over and over again during recovery, which can happen if you have a bad disk. So those are the three key ideas. Today we're going to go over log sequence numbers; that's the key idea for how we keep track of who wrote what and when. Then we'll go through how to do the normal commit and abort operations, while we're running transactions normally or rolling them back, using the log sequence numbers. Then we'll introduce fuzzy checkpoints, because those allow us to take a checkpoint without blocking everything, and we'll finish with the three phases of the actual recovery process. I'll also say that it's often going to seem like we're being wasteful or redundant in how we write records out to disk and flush things, and that's by design, because we want to be super careful, super cautious, and very conservative in how we write things out. There are some optimizations we can apply; like, if I'm doing recovery, maybe I don't flush dirty pages immediately, because I'm not really running transactions, I'm just in recovery mode. But the original ARIES paper was very much flush here, flush there, because again, you don't want to lose data; people would notice, people would get pissed. There are some obvious optimizations we can do as we go along, but we'll keep those until the end. A question is: if Intel still made persistent memory, the Optane memory, if that still existed, would we still have to do any of this?
So you're saying, in that scenario, if you lose the memory you possibly lose the hard drive too. I don't want to go down the rabbit hole of persistent memory right now, but you would still want to do something like this for redundancy reasons, and there's still an ordering issue you have to deal with when you actually apply changes: some changes can be hanging around in your CPU caches, like L3, and those may not have been written to the DIMMs yet. Optane memory is not magic, right? You still have to do some extra work, and logging would be the way to do it. Comment? Question? No? Alright. So again, there are some optimizations; you guys are smart, you'll probably pick up on what they can be, and we'll hold off on those until the end. Alright, so the first thing we've got to do is extend the write-ahead log records we introduced last class with some additional information, and the thing we're going to add is called the log sequence number, or LSN. This is just a monotonically increasing counter that gets added to every single log record, so you know the order in which they should appear in the write-ahead log. Yes, this could become a bottleneck, because now there's a single counter everybody has to increment to get a new number; there are ways to get around that, but we can ignore that too. So every single time I update a record in the database, I have to create a write-ahead log record first: I go to this LSN counter, increment it by one to get a new number, attach that to my write-ahead log record, and it gets appended to the log. And these log sequence numbers are going to permeate throughout the rest of the system, because we're going to use them to indicate what the watermark, the threshold, is: when did something get modified, and has the log entry that corresponds to that modification been written out to disk? So every log record has an LSN, but different parts of the system are also going to keep track of LSNs that indicate who modified what and when. The first thing we're going to maintain in memory is called the flushedLSN, and this is the watermark: the last LSN that we know has been safely written out to disk. Then, for every single page, we maintain two different LSNs. One is the pageLSN, which corresponds to the most recent update to that page: what was the log record that made the most recent modification to that page? The recLSN is the oldest update since the page was last flushed to disk. So when I bring a page into memory and update it, I set the pageLSN and the recLSN, and if I keep updating that page over and over again, I update the pageLSN but not the recLSN. Then the lastLSN, for a given transaction, is the last log record in the write-ahead log that this transaction created. And we'll have this MasterRecord LSN that keeps track of the last successful checkpoint we took for the database system, because we're going to use that as the jumping-off point for where to start looking in the log after a crash, during recovery. The pageLSN lives in the page header and gets written out to disk with the page; when the page gets brought back in, the in-memory LSNs get reset accordingly. So this next part is basically summarizing what I already just said.
Each data page has a pageLSN, and it's the LSN of the most recent update to that page. Then we keep track of the flushedLSN, the newest LSN we've flushed out so far. Any time we want to flush a page out to disk, we have to check: is that pageLSN less than or equal to the flushedLSN? Meaning, has the log record that most recently modified this page been flushed out to disk? If yes, then we know it's durable, and therefore we can write that page out to disk, because we know we have the log record that can put it back to the state it was in at the moment we flushed it. If it's the other way around, if the pageLSN is greater than the flushedLSN, then we can't flush that page, because we have to make sure the log record gets flushed first. So let's look at a visual example. In memory, we have the tail of the write-ahead log, and then we have our buffer pool contents, and see, we've now included these log sequence numbers in our log records; it's just a counter incrementing by one every single time, and we store these out on disk as well. Then, within a page in memory, we have the pageLSN and the recLSN: the pageLSN again is the most recent update to the page, and the recLSN is the first update to the page since it was brought into memory. Then we have our flushedLSN, which points to the tail of the log on disk, so in this case it would be log record 16, and the MasterRecord, which points to the last successful checkpoint. For simplicity I'm showing a single checkpoint record; with fuzzy checkpoints there are actually separate begin and end records, and we'll cover that in a second. Alright, so let's say we have the pageLSN here pointing to log record 12, and our flushedLSN is pointing to log record 16. The buffer pool eviction policy says: I want to flush this page out to disk. Are we allowed to do it at this point? Yes, because the pageLSN is 12, the flushedLSN is 16, and 12 is less than 16, so we know the log record 12 that modified this page has been written to disk, so it's okay to flush it. But if the pageLSN points to 19, we cannot flush it, because 16 was the last record we flushed out and 19 is still in memory; we can't write the page before we write the log record. So in this case we have to stall and wait, or pick another page to evict, because we can't evict this one. Pretty simple, right? It's a decentralized way of keeping track of whether it's safe to write something out: you just check these LSNs. Yes? The question is: when do you increase the flushedLSN? You increase the flushedLSN when you write out the in-memory tail of the write-ahead log; like last time, when the log page gets full we write it out to disk, and once we do an fsync and know it's safely on disk, we update the flushedLSN with the largest log sequence number of the records we just wrote out. Then the question is: upon flushing the log to disk and updating the flushedLSN, would you then go back and say, here are some pages I wanted to evict before but couldn't? No, because this is a global counter, it just gets updated, so the next time whatever thread is doing the buffer pool replacement runs, it would say: okay, my flushedLSN is this, and here are the pages I can now evict.
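That eviction check is small enough to write down. A minimal sketch, assuming a global flushedLSN counter and a pageLSN in each page header as just described; the names mirror the slides, not any real system's API:

    #include <atomic>
    #include <cstdint>

    std::atomic<uint64_t> flushedLsn{0};  // newest LSN known to be safely on disk

    struct Page {
        uint64_t pageLsn;  // LSN of the most recent update to this page
        uint64_t recLsn;   // oldest update since the page was last flushed
        bool     dirty;
    };

    // WAL rule: the log record that last modified the page must be durable
    // before the page itself is allowed to go out to disk.
    bool SafeToFlush(const Page& p) {
        return p.pageLsn <= flushedLsn.load();
    }

So in the example: with flushedLSN at 16, a page whose pageLSN is 12 passes the check and can be evicted, while a page whose pageLSN is 19 fails it and has to wait for the next log flush.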
Now, there may be some cases where the buffer pool is full and you can't find anything to evict; then it has to block until it gets notified that the log has been flushed out and a page can be evicted. That can happen, but ideally you don't want it to be the case. This is also why there's the background writer stuff we didn't really talk about: there's a background worker thread that goes through and proactively flushes out dirty pages that satisfy this requirement, so that the next time the eviction policy needs to run, it has a bunch of clean pages it can pick right away. Okay, so again, this is just a summary of what I've already said: all log records have this log sequence number; every time a transaction modifies something in a page, we update that page's pageLSN; and when we flush the tail of the in-memory write-ahead log out to disk, once we get back the fsync, we update the flushedLSN to be the largest log sequence number of the records we just wrote out. So far so good? Okay. So now we want to run transactions. For this lecture we're just going to assume it's reads and writes, like we had before; we're not going to worry about SQL statements or anything more complicated. We can also ignore indexes; they more or less work the same way. Some other assumptions we're going to make to simplify things in this lecture: we're going to assume every log record fits into a single page. You obviously can update records or tuples that are larger than a single page, and then the log record has to span multiple pages. The way you handle that is you keep track of segment IDs: when you write a log record you say, here's part one of four, and if you see all four parts across four pages then you have the complete log record; otherwise you don't, it's considered torn, and you can ignore it. But for simplicity we assume a log record fits into a single page. We're assuming all our disk writes are atomic. In all our examples we're using a single-version database system with strong strict 2PL, so we don't have to worry about another transaction acquiring the exclusive lock and making changes while we try to roll back during an abort; that simplifies things. With multi-versioning some things are easier and some things are harder. And of course, as we said, we're going to use steal plus no-force. Alright, so when a transaction commits, we have to append a commit record to the log, and before we can tell the outside world that the transaction has committed, we need to guarantee that all the log records up to and including that commit record for that transaction are flushed to disk. This is called a synchronous commit. Some systems support what's called asynchronous commit, where it's best effort: you tell the outside world the transaction has committed and assume that some milliseconds later it will actually be written to disk. The algorithm is essentially the same; the difference is whether or not you block. For our purposes we'll assume you block. All the log flushes are written out to disk sequentially, and within one log buffer page there could be multiple log records: some from the transaction that's committing and some from other transactions running at the same time, and that's okay.
We'll deal with those other transactions if they don't commit later on. Alright, so at this point the transaction commits, we flush the commit record to disk, and we now know the transaction is durable: everything we need to replay that transaction later on and recover its changes is out on disk. That's why it's okay for us to tell the outside world that the transaction has committed. Then at some later point, which I'll define in a bit, we're going to introduce a new log record called transaction-end. This transaction-end is additional metadata, a hint, in the write-ahead log to tell the database system that it will not see any more log records for this transaction after this point. For commit it's pretty obvious: I do commit, and then at some later point I do end, and that's when I'm done. When we do abort, we'll see that I abort, do some extra stuff, and then do transaction-end; that's why we need it as a separate record. But for now, just assume we call commit, and if we wanted to, we could immediately add a transaction-end as well; depending on how you're doing concurrency control and whether you need to wait for other transactions, it may come at some later point. The question is: if I have one transaction commit, and when I flush my write-ahead log buffer I flush out all its log records, but those log records include a bunch of changes from another transaction that was running at the same time, and that transaction later aborts, how do I handle that? Two slides, we'll get there. Any other questions? So again, this new transaction-end record just says we can guarantee that we will never see this transaction again in the log. That makes it easier during recovery to know that nothing else is coming; we can throw some metadata away. Now, the transaction-end record does not need to be flushed to disk immediately. The commit record does, assuming you're doing synchronous commit; before I can tell the outside world, I have to flush the commit record, but the end record can come at a later point, and that's fine. Alright, so going back to our example here: T4 begins, does some changes, and then commits. When we see the commit record, we flush the contents of the in-memory write-ahead log buffer out to disk. Once we know it's durable, we can tell the outside world that we've committed, and we can update the flushedLSN to point to the tail of the log records we just flushed, so now it's 15. Then at some later point we say, alright, this transaction is completely done, we write its end record, and eventually that gets written out too. We can also assume that we can blow away the contents of this in-memory buffer, free it up, and use it for new log entries; that's the group commit thing I showed before, where you have two log buffers, you fill one up, start flushing it out, fill up the other one, and ping-pong them back and forth. Pretty easy so far, right?
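Here's that commit path as a sketch, building on the LogRecord sketch from earlier. AppendLog, FlushLogUpTo, and AcknowledgeCommit are assumed helpers I made up for illustration, not real APIs:

    uint64_t AppendLog(uint64_t txnId, LogType type);  // assumed: appends to the
                                                       // log tail, returns its LSN
    void FlushLogUpTo(uint64_t lsn);       // assumed: fsyncs the log through lsn
                                           // and advances flushedLsn
    void AcknowledgeCommit(uint64_t txnId);  // assumed: replies to the client

    void Commit(uint64_t txnId) {
        uint64_t commitLsn = AppendLog(txnId, LogType::COMMIT);
        FlushLogUpTo(commitLsn);     // block until the fsync comes back...
        AcknowledgeCommit(txnId);    // ...and only THEN tell the outside world
        // TXN_END can be appended lazily; it does not need to be flushed
        // before we acknowledge the commit.
        AppendLog(txnId, LogType::TXN_END);
    }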
Okay, so let's see how to handle aborts. Aborting a transaction can be treated as a special case where we're essentially going to create new log entries that reverse the changes the transaction made while it was running before the abort. And to make it faster to figure out what we actually need to undo for a transaction, we're going to introduce another field to our log record called the prevLSN. You can think of this almost like a linked list: here's the ordered chain of operations I applied for this transaction, and I can follow it to find what I need to reverse. It's not required; it just makes our life easier when we'd otherwise have to scan a lot of log. Alright, so we have our transaction here. The first thing to point out is that we have the LSN and now we have the prevLSN. In the last entry the prevLSN is 14, in the previous one it's 13, and so forth going up; and for the top record, since it's the first log record for this transaction, the prevLSN is just null. So the transaction then aborts, we have to do something, and then we're going to have the transaction-end; again, that just says nothing else comes after this. So the thing we need to talk about is what we actually do between the time we get the abort log record and the transaction-end: what we do to reverse those changes. And those changes also need to be written out to the log, because we need to make sure that if we crash during recovery, when we come back we know what changes we were making to reverse things, so that we can correctly reverse them again. The idea is that we're going to write these new log records called compensation log records, CLRs, which are just reversals of some previous action the transaction took before it aborted. A CLR looks exactly like a regular update log record, except it also keeps track of the LSN of the log record I'm reversing, and then an undo-next pointer: the LSN of the next thing I need to look at, via the prevLSN, so I know how to go back through time and find all the entries I need to reverse. So we add these to the log just as we do when we process transactions normally. But here's the key difference: when we commit, before we say yes, you've committed, we have to make sure all the log records are flushed to disk. If a transaction tells us to abort, or there's a deadlock and we kill one of the transactions, we do not have to wait until these CLRs are flushed to disk before we tell the outside world that the transaction has been aborted. Because who cares? If someone comes back and tries to read those changes, they shouldn't be able to see them anyway: even though we may not have reversed them yet, even though the CLRs aren't flushed to disk, the higher-level concurrency control mechanisms, the versioning if it's multi-version, or the two-phase locking, still have these things locked. No one's going to see our changes before we actually reverse them, and that's what we want. That's why we can tell the outside world you've aborted sooner rather than later. Yes? Even if you crash, does it still hold the latches or the locks? No, no. So maybe I muddled this. What I'm saying is: if my application calls commit, I don't get an acknowledgement from the database server that I've committed successfully until those log records are flushed to disk, including the commit record.
But if I call abort, then I get an acknowledgement back immediately: yes, you've aborted, even though the log records that correspond to reversing your changes have not been written to disk yet. Because of the higher-level protection mechanisms of the concurrency control scheme, whether that's multi-versioning or 2PL or OCC, when someone comes back and tries to read the things from the transaction that just got aborted, they won't be able to see them. It doesn't matter that those log records haven't been flushed to disk yet; they will be, and we do have to reverse the changes, we'll get there. I'm just saying the key idea is: on commit you have to wait until the log records are flushed; on abort you do not. Now, when you crash and come back, there's a bunch of stuff we have to do before we tell the world, hey, we're ready, start sending us queries: we have to go back into the correct state and reverse all changes from aborted transactions. I'll also say: even though we don't have to wait for the log records to be flushed before telling the outside world the transaction aborted, we still have to follow the regular write-ahead log protocol, which says the log record for the last update to a page has to be written to disk before we can flush that dirty page out. Even if the transaction has aborted and we haven't rolled back all its changes yet, we can still write the page out to disk; the steal policy allows us to do that, but we have to make sure the log record that modified it is out on disk. Alright, so I'm running out of space in PowerPoint, so we're going to switch to a tabular form for this example. Let's see how we actually want to roll back a transaction. We have T1: it updates A, and then it aborts. So on abort, the very first thing we're going to do is create a CLR that reverses the change we made before. We keep track that this is the CLR that reverses the update that happened at log record 2, and all we're doing is reversing the change: it was 30 before, we set it to 40, so now we expect it to be 40, and we reverse it back to 30. It looks like a regular update log record, but it has this special status saying we're reversing things for that particular log record from before, and it has this undo-next pointer that tells us the next thing we need to reverse. The prevLSN would point to the abort log record from before, so we want to skip that and jump up to the next thing we need to potentially undo, in this case the begin entry. And at that point there's nothing for us to undo anymore for this transaction, so this is where we add the transaction-end log record: it just says there's nothing else to abort or roll back, and we'll never see this transaction again after this. Yes? The question is: can a dirty page written by a transaction that has aborted be flushed out to disk? Yes, but again, the newest log record that modified it has to be flushed out to disk first. It's okay that there are pages on disk that were modified by transactions that got aborted; that's fine, we'll fix that.
So yeah, it may be the case that when we did the update, we brought some page into memory and modified it, then the transaction aborted, but before we actually reversed it, the page got flushed out to disk. At that point it's just like running a regular update: we've got to bring the page back in, reverse it, and add the log entry for the reversal first. Then the question is about the recovery process: I bring the page in, I update it, the dirty page gets flushed, and before I can reverse the change, I crash. When we crash and come back, we'll replay this and then reverse it, yeah. Yes, question? Correct, yes. By replay I mean reapplying the changes we see in the log. The question is: when I replay the log and see this update, do I know I don't need to apply it, since I'll have to reverse it because it aborts later on? This goes back to my statement at the beginning: it's going to seem like what we're doing is wasteful, because yeah, if this dirty change from the aborted transaction didn't get written to disk, why replay it? We're always going to replay it, because we want to get back to the state we were in exactly at the moment of the crash, and then we go back and reverse things to remove any aborted transaction's updates. Questions? What if we don't log the CLR? So you're saying: if I don't log this and I just write the change out, then I crash and come back, how do I know what I did? If you don't have the CLR and you crash and come back, you'd basically be redoing the same undo over and over again. So for efficiency reasons, this helps you avoid undoing things over and over again. Does this happen often? It would happen enough that you would care. Hardware used to be super unreliable; you guys are spoiled, things are way better than they used to be. Yes? So without CLRs, could we just do the replay over and over again and just be less efficient? I think the answer is yes. It's also like double-entry bookkeeping: if the page with this update gets corrupted in some weird way, like bits get flipped, having a log record as additional redundancy allows me to put it back into the correct state. Postgres actually does something like this: in Postgres, when a transaction first modifies a page, I think they write the entire page out into the write-ahead log, so that if the actual table heap gets corrupted, you at least have a version of the page in the write-ahead log to replay, and you apply the changes on top of it. Correct, it's almost like the log-structured stuff: you only really need the log, but for performance reasons you have two copies, and for reliability reasons you have two copies. Okay, so to abort a transaction, we've said a lot of this already, but I'll go through it one more time. We write an abort record out to the log first; we don't have to wait to tell the outside world that this happened. Then we look at all the updates the transaction made, in reverse order, going from newest to oldest, and for each one we write a CLR entry to the log and apply the change to put back the old value. Then, when we reach the begin, because we're going back in time and we see the begin record for this transaction, we know there's no other log record for it further back in the log, so we just write the transaction-end record, release our locks, and we're done with this transaction.
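And here is that rollback procedure as a sketch, walking the prevLSN chain from the transaction's lastLSN and emitting one CLR per update. GetLogRecord, AppendClr, and ApplyToPage are assumed helpers in the same spirit as before; in the T1 example above, the single CLR emitted would restore A from 40 back to 30:

    #include <cstdint>
    #include <vector>

    constexpr uint64_t NULL_LSN = 0;  // assumed sentinel for "no previous record"

    LogRecord GetLogRecord(uint64_t lsn);               // assumed: read record by LSN
    void AppendClr(uint64_t txnId, uint64_t undoneLsn,  // assumed: append a CLR
                   uint64_t undoNext, const std::vector<uint8_t>& restore);
    void ApplyToPage(uint64_t pageId, uint32_t offset,  // assumed: write the value
                     const std::vector<uint8_t>& value);

    void Abort(uint64_t txnId, uint64_t lastLsn) {
        AppendLog(txnId, LogType::ABORT);  // can acknowledge the abort already
        for (uint64_t cur = lastLsn; cur != NULL_LSN; ) {
            LogRecord r = GetLogRecord(cur);
            if (r.type == LogType::UPDATE) {
                // CLR reversing r: put the before-image back; undoNext points
                // at the next record we would still have to undo
                AppendClr(txnId, r.lsn, r.prevLsn, r.before);
                ApplyToPage(r.pageId, r.offset, r.before);
            }
            cur = r.prevLsn;  // keep walking back toward the BEGIN
        }
        AppendLog(txnId, LogType::TXN_END);  // nothing left to undo
        // locks are released only here, after the TXN_END (strong strict 2PL)
    }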
Again, assuming we're doing two-phase locking: when a transaction says it aborts, we don't immediately release its locks. We hold the locks because we still need to go back and reverse the changes; then we write the transaction-end, and then we release the locks. That avoids the problem of someone trying to update a record between the time we abort and the time we write the end. And for a CLR, we don't need a CLR for the CLR: there's nothing in a CLR for us to undo. Okay, so we've gotten through log sequence numbers, and now we know the additional things to do at runtime for commits and aborts. Of course, as we said last time, the log can grow forever, so we need a way to limit how far back in time we have to look in the log, and we do this through checkpoints. We're going to first look at two crappy ways to do checkpoints, and then we'll see fuzzy checkpoints, which are the proper way to handle this without having to block everyone. Last class I said the simple way to do checkpoints is: you halt the start of any new transaction, so nobody new can call begin; you wait until every active transaction that's still running finishes; and at that point you flush all the dirty pages out to disk. Because there's no transaction in flight that may have updated one page that gets written to disk and another page that doesn't, you have a consistent snapshot of the database in the buffer pool, and you write that out. And we said this is obviously bad for performance, because if I have a transaction that's going to take an hour, I have to stop running any new transactions for an hour, wait until that one-hour transaction finishes, and then take the checkpoint. There are some systems, really small embedded devices, that do work this way, because they assume their transactions are super short, and then you can make this simplification, but most systems do not do this. So a slightly better way would be: instead of waiting for transactions to finish, we just pause them and say they're not allowed to make any more updates. If a query is running and updating multiple pages, we halt it mid-flight; we don't wait for transactions to finish, we just take the checkpoint immediately. Of course, the problem with this is that we get inconsistent checkpoints. Say I have two threads: one is taking a checkpoint, and one is a transaction that updates page 1 and page 3. The checkpoint is just a sequential scan of all the pages in memory, writing them out to disk in whatever their current state is. Say the transaction updates page 3, but before it can update page 1, we stall all transactions, we pause them; the checkpoint thread runs through and writes all the pages out to disk, then the transaction is unblocked and updates page 1. This is obviously bad, because now we have a torn update: what should have been an atomic operation for this transaction is no longer that, because the checkpoint only saw part of its updates. So to handle this, we're going to keep some additional metadata in our checkpoints: what transactions were running at the time we took the checkpoint, and what pages were dirty when the checkpoint started. That way, when we have to replay the log from the checkpoint, we know what could potentially have been modified while we were taking the checkpoint, and roughly where in the log to look to figure out which log records need to be reapplied, because they may or may not have made it out to disk.
The checkpoint basically says: at this point, here's what the state of the pages out on disk was, and from that I know what could have been modified while I was taking the checkpoint. So for the active transaction table: for every transaction that's active, actively running, we keep track of the transaction identifier, the status of the transaction, and the last LSN it created. Once we see the transaction-end record, we remove it from the active transaction table. This is something we maintain while we're running the system, and we'll also maintain it when we do recovery. The status codes for transactions are: running, committing, or candidate for undo; there are no others. When we do recovery, the default status is the undo candidate, because while we're replaying the log we don't know what's coming up ahead; we assume we're not going to see a commit record for the transaction, and therefore it's something we'll have to roll back and undo, and only when we see the commit do we flip its status to committing. The dirty page table, again, keeps track of which pages in the buffer pool have been modified but not flushed out to disk, and the only thing we need to keep track of there is the recLSN: the LSN of the log record that first caused the page to become dirty since it was brought into memory. Remember, the pageLSN keeps track of the most recent update to the page; the recLSN is the first one since the page was brought into memory. As you can see, I'm running out of space in the PowerPoint here, but let's see what we can do. Alright, so we have our first checkpoint here, and we record in our active transaction table that, at the time we took the checkpoint, the only active transaction was T2, because T1 started before the checkpoint but we saw its transaction-end, so at that point we know we're not going to see T1 ever again and we don't need to include it. For our dirty page table, we assume there's only one page in our buffer pool that's dirty, and that's P22: we assume transaction T1 updated P11, but then that page got written out to disk, so it's not dirty anymore, and the only other update is the one to P22 that corresponds to T2. Then at the second checkpoint, same thing: in the active transaction table we have T2, because that started way above and we haven't seen... we haven't seen a commit... that's, yeah, sorry, should that just be T3, because at this point T2 has already committed? Oh no, wait: we haven't seen T2's transaction-end yet, and that's why it's still considered active. Just because you commit, if there's no end record yet, the transaction is technically still active, even though it's not going to actually do anything more. And then for T3, we haven't seen its commit or end, so it's still active. For the dirty page table it's P11 and P33. So this is enough information for us to get back to the correct state: we know which pages are dirty, and which transactions we should look for in the log to make sure we apply their changes.
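The two tables themselves are small. A sketch with exactly the fields just described; the type and status names are illustrative:

    #include <cstdint>
    #include <unordered_map>

    enum class TxnStatus { RUNNING, COMMITTING, UNDO_CANDIDATE };

    struct AttEntry {          // active transaction table entry
        TxnStatus status;
        uint64_t  lastLsn;     // last log record this transaction created
    };

    struct DptEntry {          // dirty page table entry
        uint64_t recLsn;       // LSN that first dirtied the page since it was
                               // brought into memory
    };

    std::unordered_map<uint64_t, AttEntry> att;  // keyed by transaction id
    std::unordered_map<uint64_t, DptEntry> dpt;  // keyed by page id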
But again, we're still stalling transactions while we take this checkpoint, because it's meant to be a single point in time in the log, and it assumes that flushing all the dirty pages from my buffer pool is atomic, which is not realistic: if I have a terabyte of memory, I can't flush a terabyte of memory to disk atomically. So this is still not what we want to do, because this thing could take a long, long time. This is what fuzzy checkpoints are going to solve. With fuzzy checkpoints, we allow active transactions to keep on running and make changes to the database just as they normally would, but we're still able to take a checkpoint and flush out all our dirty pages. So now, in the write-ahead log, the checkpoint entry has two parts: a record for when the checkpoint began and a record for when the checkpoint finished, and inside the checkpoint-end entry we keep track of what the active transactions were and what the dirty pages were. The idea is that those tables are only accurate, only consistent, as of the moment the checkpoint started; we can't guarantee that dirty pages modified after the checkpoint started were written out as well. So going back to our example: again, assume that P11 was flushed before the first checkpoint starts. We add a checkpoint-begin entry, other transactions keep on running while we flush out the dirty pages, and when we get to the checkpoint-end, we record: what transactions started before my checkpoint began? In this case it's just T2. And what pages were modified, were dirty, before my checkpoint began? That's just P22, assuming P11 got flushed before. The idea is that when we do recovery, we want to start at the checkpoint-begin and scan forward through the log to see all the changes whose log records came after; we'd see this update to P11 that shows up right before the checkpoint-end. But we also need to know how far back in time in the write-ahead log we have to look: the active transaction table and the dirty page table are telling us, hey, by the way, there are transactions above the checkpoint-begin that you should look for, because they made some changes that may not have gotten out to disk. And when we successfully flush out the checkpoint-end record, then we update the MasterRecord in the database system to point to the checkpoint-begin in the log, because that's going to be our starting point when we start scanning for things; and again, the ATT and DPT give us hints about what's above it. Yes? Yes, you start your analysis there; that's the next slide, we'll get to it, but basically the question is: when I say we start at the checkpoint-begin, I mean that's where we start the analysis scan to figure out what's going on, and the ATT and DPT give us a hint that above the checkpoint-begin there are changes you've got to go look at and reapply. Then your statement is: during analysis you think you're looking at what's on disk as of the checkpoint-begin, but it's really the checkpoint-end. That's fine; I'm still going to replay all the log records anyway. So again, would you double-apply something? Yes, there will be cases where we double-apply, but that's okay, because we want to guarantee that everything actually gets written.
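Putting the fuzzy checkpoint into the same sketch, reusing the att/dpt maps and the log helpers from above; AppendCheckpointEnd, CurrentLogTailLsn, and SetMasterRecord are assumed helpers. Transactions keep running the whole time, which is the point:

    void AppendCheckpointEnd(const std::unordered_map<uint64_t, AttEntry>& att,
                             const std::unordered_map<uint64_t, DptEntry>& dpt);
    uint64_t CurrentLogTailLsn();
    void SetMasterRecord(uint64_t checkpointBeginLsn);

    void TakeFuzzyCheckpoint() {
        uint64_t beginLsn = AppendLog(0, LogType::CHECKPOINT_BEGIN);  // 0 = no txn
        // Snapshot the tables as of the BEGIN record. Anything that starts or
        // gets dirtied after this point is not guaranteed to be captured,
        // which is why these tables are only hints for recovery.
        auto attSnapshot = att;
        auto dptSnapshot = dpt;
        AppendCheckpointEnd(attSnapshot, dptSnapshot);
        FlushLogUpTo(CurrentLogTailLsn());  // the END record must be durable...
        SetMasterRecord(beginLsn);   // ...before recovery gets pointed at the BEGIN
    }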
Alright, so I've been dancing around how we actually do recovery; let's walk through it now, based on the building blocks we have: the log sequence numbers, the CLRs, and fuzzy checkpoints with the DPT and the ATT. ARIES has three phases. The first is the analysis phase, where we use the MasterRecord to jump to the most recent successful checkpoint-begin entry, and then we scan forward in the log, reading the log entries and building the active transaction table and dirty page table from that point to the end of the log. The idea is to figure out what was going on in the system at the moment of the crash, both in memory and potentially out on disk. Once we've done the analysis, we do the redo phase: we jump back to some earlier point in the log, potentially above that checkpoint-begin, and replay all the log records we see, including ones from transactions that will abort later on, because again, we want to put the database back into the exact state it was in at the moment of the crash. Then, after redo, we do undo: we go in reverse order through the log, from the end up to some point, and undo any changes from transactions that we know did not commit at the moment of the crash. We'd see this in the active transaction table: if we know a transaction was running and we don't see a commit record for it, then we know it should be aborted. Some of the changes those aborted transactions made could have been written to disk, so we need to go back and make sure we reverse them, so that nothing from them persists after the crash. After we finish the undo phase, the system is back in a clean state: all the transactions that committed before the crash have all their changes applied; all the transactions that did not commit, or aborted, before the crash have all their changes reversed; and therefore we can open ourselves up for business and start running new queries and new transactions, knowing that no changes from aborted transactions carried over past the crash. I realize I'm making hand gestures here, so let's see what this looks like visually. Here's our write-ahead log, going forward in time from oldest to newest, top to bottom. After a crash, we enter the analysis phase, and the idea is to scan forward from the start of the last checkpoint to the end of the log, and see what transactions were running at that time and what changes they made. Then in the redo phase, we jump back up to the smallest recLSN in the dirty page table we identified during analysis; this is the oldest change that may not have been flushed out to disk, and we just redo everything from there, even updates from transactions that will get aborted. Then in the last step, the undo phase, we go in reverse order, back to the point in the log of the oldest update from a transaction that was active at the moment we crashed, and we make sure all their changes are not persistent: we remove them.
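At the top level the whole thing is three calls. A sketch only; each phase gets fleshed out in the sketches below, and GetMasterRecord is an assumed helper:

    uint64_t GetMasterRecord();  // assumed: LSN of the last good CHECKPOINT_BEGIN
    void Analysis(uint64_t checkpointBeginLsn);
    void Redo();
    void Undo();

    void Recover() {
        Analysis(GetMasterRecord());  // rebuild ATT + DPT, scan to end of log
        Redo();                       // replay from the smallest recLsn in DPT
        Undo();                       // reverse every loser left in the ATT
        // Only now do we accept new queries. If we crash anywhere in here,
        // we just run Recover() again from the top.
    }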
Yes? The question is: how do you find the smallest recLSN? We populate the DPT during analysis; we'll see it in the next slides. That's what the analysis phase does: you look at the dirty page table and say, okay, what was the recLSN for each dirty page? That tells me where I need to start from to make sure everything gets reapplied. There are some obvious optimizations you can do that we don't have to go through; like, if everybody's updating the same record on the same page, why reapply all of the updates if each one just overwrites the last? Why not jump to the most recent version? But that analysis can be tricky; it's just easier to blindly replay. Yes? The question is: how much of a nightmare is this to implement? Well, it's hard enough that if you're a random JavaScript programmer, you don't want to do it yourself. No, this is hard. What's the alternative? Or you could go work at a database company and get paid a lot of money to build this, there you go. Is this harder than implementing a B-tree? Again, hard drives used to be really wonky in the old days; this was really hard to get right, all sorts of weird shit would break on you. Hardware has gotten way more reliable now, so it's gotten better, but you'd still want to be super careful about all this. Yes? I get confused: I understand the redo is because some transactions said commit, and their log records were written out, but their changes may not have been flushed to disk; and that's why you want to start with the smallest recLSN, to redo the changes for those that really committed. Yes. And what does that mean for the others? For this one here: when I do redo, I'm not going to try to be clever and say, well, I know this transaction is going to abort, so I don't need to redo it. You redo everything, and that includes transactions that are going to abort. And you know they're going to abort because you've done the analysis phase: at the end of analysis you look at your active transaction table, and if a transaction is not marked as committed, it's aborted. In the redo phase you just redo everything, for simplicity reasons. Then the undo phase is: okay, which transactions in my ATT are aborted? Let me go find their log records and reverse them. The question is: maybe some transaction would have issued an explicit abort later on, after the crash? Wait, hold up: so I crash, and then they send me an abort command? How would that work? How do you know a transaction aborted when you crash? Because if it didn't commit, then it's aborted. Any transaction that was running and did not commit is aborted; that's what you want. So phases 2 and 3 are: redo, then undo any transaction that's not committed? Correct, yes. Then if you do 2 and 3, why would you need phase 1? Because in phase 1 you need to figure out what the hell was actually going on, what you need to do. If I'm going to drive somewhere, I've got to know where to go; I need a map, directions. Same thing here. Alright, so let's go through these one more time. In the analysis phase, we scan forward from the last successful checkpoint. If we see a transaction-end record, we can remove that transaction's entry from our active transaction table. Otherwise, for a given log record, if the transaction that created it is not in our active transaction table, then we add it, but with the default status of undo, because as we're reading each log record one by one, we don't know at this point whether it's going to commit or not, so we assume it's not.
But then, if we see a commit log record, we change its status to commit, and once we see the transaction-end record, we just remove it from the ATT. For any update log record, if the page is not already in our DPT, we add it, and we set its recLSN to that log record's LSN. So think of it as: we have this ATT and DPT that we recorded in the checkpoint, and the idea is to populate them with additional information as we do our analysis. Yes? Sorry, the question is: when does a transaction ever have a status of running during recovery? Never. During recovery the default is always undo until we learn otherwise, when it commits. Good question, thank you. Yes? The question is: do I have to log the ATT and DPT in the checkpoint-end at runtime? Normally, when we do fuzzy checkpoints, we log them. For correctness, if you don't log them, you could still repopulate them; but without logging them in the fuzzy checkpoint, you'd have to scan the entire log, because you don't know what you missed up above. I could have a transaction make a bunch of changes, then I do a checkpoint, begin and end, and then I never see that transaction again; I crash and come back, and without that ATT I don't know there's some transaction up above that wrote a bunch of stuff. Adding the tables to the checkpoint is a hint that says: up above, there's some stuff you've got to pay attention to that you might have missed. Alright, so at the end, what do we have? We have an ATT that tells us what transactions were active at the time of the crash, either because we saw them after the checkpoint or, as I said, because they're up above in the log. And the DPT tells us which dirty pages might not have made it to disk, which means we'll have to bring them back into memory and make sure their changes get applied during the redo phase, because we don't know which pages got written to disk after the checkpoint finished. Alright, so let's see an example. The first thing we have is the checkpoint-begin, so we initialize the ATT and DPT to be empty. Then we see a log record for transaction 96, so we add transaction ID 96 to our ATT with the status of undo; we don't know what its final outcome is going to be. We also see that it updates page 33, so we add page 33 to our DPT, with the recLSN of the update that dirtied it, which here is log sequence number 20. Then we have our checkpoint-end, and here we add additional information that we didn't know about from before; this is why we record these tables. We already knew about 96, because we saw its log record after the checkpoint-begin, but there's another transaction, 97, that we didn't see at all, so we know that up above our checkpoint-begin there's a transaction 97 hanging out doing some stuff, and we've got to check on it. Same thing for the DPT: we knew about page 33 because T96 updated it, but now there's a page 20 that was updated at log sequence number 8, and we haven't scanned that part. Yes? His point is: you could remove T96 here when its end record is written out? Yes, but how do I know I can ignore it? For simplicity reasons, the checkpoint records whatever transactions were active when the checkpoint started. So in this case, assuming transaction T96 began at log sequence number 9, before the checkpoint began, you would just add it; you don't need to be clever about whether you're still going to see it again or not.
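Stepping back, here's the analysis scan as a sketch following exactly those rules; ScanLogFrom and MergeCheckpointTables are assumed helpers, the latter pulling in the ATT/DPT entries stored in the CHECKPOINT_END record for the stuff up above:

    std::vector<LogRecord> ScanLogFrom(uint64_t lsn);  // assumed: forward scan
    void MergeCheckpointTables(const LogRecord& checkpointEnd,
                               std::unordered_map<uint64_t, AttEntry>& att,
                               std::unordered_map<uint64_t, DptEntry>& dpt);

    void Analysis(uint64_t checkpointBeginLsn) {
        for (const LogRecord& r : ScanLogFrom(checkpointBeginLsn)) {
            if (r.type == LogType::TXN_END) {
                att.erase(r.txnId);  // we will never see this txn again
            } else if (r.type == LogType::CHECKPOINT_END) {
                MergeCheckpointTables(r, att, dpt);  // hints about stuff above
            } else if (r.type != LogType::CHECKPOINT_BEGIN) {
                // default status is UNDO_CANDIDATE; flip only on a commit
                auto& e = att.try_emplace(
                    r.txnId, AttEntry{TxnStatus::UNDO_CANDIDATE, r.lsn})
                    .first->second;
                e.lastLsn = r.lsn;
                if (r.type == LogType::COMMIT)
                    e.status = TxnStatus::COMMITTING;
                if ((r.type == LogType::UPDATE || r.type == LogType::CLR) &&
                    dpt.find(r.pageId) == dpt.end())
                    dpt[r.pageId] = DptEntry{r.lsn};  // first dirtier = recLsn
            }
        }
    }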
To his point, for correctness reasons: if T96 hadn't been in the checkpoint's table, that would actually be okay, because the log record at sequence number 20 would still let us add it during the scan. The protocol just says you record anything that was active when the checkpoint started; correctness-wise you could remove it once you see its end and still be okay, but to give the algorithm less branching you just always add it. Yes? The question is: how would I know that for page 20 the recLSN was 8? It's the first operation that made page 20 dirty. You could bring the page in and look at it, or you could potentially log it; it doesn't matter. The question is: does each page have its recLSN on disk? Yes, in the page. So then we have the transaction-end here; again, this transaction-end tells us we're never going to see T96 again, so we can remove it from our active transaction table. And then we crash. So at the end of our analysis phase, this is the state we'd have: we know there's a transaction T97 out there that we need to abort, and here are some pages whose changes in the write-ahead log may not have made it out correctly. Now, in the redo phase, the idea is that we repeat the history in our write-ahead log and apply all the changes that are necessary, including for transactions that have aborted, to put the database back into the state it was in at the moment of the crash. And as I've said multiple times, there are some cases where we can avoid unnecessary reads and writes; if I know a transaction is going to abort, maybe I don't need to bring a page back in just for that transaction to reverse something that's already been reversed; there are things like that we can skip, but we'll ignore them. So we scan forward in the write-ahead log from the log record that corresponds to the smallest recLSN in the dirty page table: the oldest log record that modified a page that may not have been flushed out to disk. Then, as we scan through the log, looking at each log record one by one: if it's an update record or a CLR, we reapply the change, unless the affected page is not in our dirty page table, meaning we know it got flushed out to disk already, or the affected page is in our dirty page table but that log record's LSN is less than the page's recLSN. And if the pageLSN stored on the page itself is greater than or equal to the record's LSN, then again we know the change has already been flushed out, already been applied. To redo an action, we just reapply the logged update, set the pageLSN to that log record's LSN, and move on to the next record. We do not flush the write-ahead log, because we're not creating any new log records during redo, not even for CLRs, and we don't need to flush the dirty pages in the buffer pool; we can do that at the end. Then, once we reach the end of the log in the redo phase, for all transactions that are marked as committed and still in the active transaction table, we append a transaction-end record to the log, flush those out, and remove them from the active transaction table. After the redo, the only things left in the active transaction table are transactions we need to undo, whose changes we need to reverse.
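And the redo filter in the same sketch form; SmallestRecLsn and FetchPage are assumed helpers, and note that the replay applies CLRs too:

    uint64_t SmallestRecLsn(
        const std::unordered_map<uint64_t, DptEntry>& dpt);  // assumed
    Page& FetchPage(uint64_t pageId);                        // assumed

    void Redo() {
        for (const LogRecord& r : ScanLogFrom(SmallestRecLsn(dpt))) {
            if (r.type != LogType::UPDATE && r.type != LogType::CLR)
                continue;
            auto it = dpt.find(r.pageId);
            if (it == dpt.end()) continue;            // page made it to disk
            if (r.lsn < it->second.recLsn) continue;  // change predates the
                                                      // page's first dirtying
            Page& p = FetchPage(r.pageId);
            if (p.pageLsn >= r.lsn) continue;         // already applied on disk
            ApplyToPage(r.pageId, r.offset, r.after); // reapply blindly, even
            p.pageLsn = r.lsn;                        // for txns that will abort
        }
        // For every txn still in the ATT marked COMMITTING: append a TXN_END,
        // flush it, and drop the entry. Whatever remains is what undo reverses.
    }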
So now we're in the undo phase. Again, these are the transactions that are still in the active transaction table; we never got commit records for them, so we need to revert their changes. We're going to go in reverse order through the log and roll back their updates, and we're going to append a CLR every time we reverse a change. It's basically the thing I showed you before: during normal operation, if I abort a transaction, I go back in reverse order, reverse its changes, and append a CLR to the log for each one. We're doing basically the same thing here.

So let's go through a full example. Say we have a write-ahead log with begin and end records, and we have this transaction T1 here that then calls abort. During normal operation, to abort it, we just go look at what changes it made and reverse them, keeping track of the chain of log records we need to reverse. Then a bunch of other transactions do other things, and then we crash and come back.

So after my analysis, my ATT and DPT look like this. At this point, again, we've crashed and we're looking at the log up above: we have T2 and T3, both with a status of undo, and in our dirty page table we have P1, P3, and P5. So we look at the last LSN for each transaction and figure out which one to reverse first. 60 is greater than 50, so we reverse that update first: we reverse the update that T2 did, and the undoNext pointer in its CLR just says what the next thing we need to reverse is. Then, since we need to reverse T3's change next, because that comes next going backwards in the log, we apply that undo as well; for simplicity, since I'm running out of space, I've put it on a single line, so this is the CLR. That was the last update we needed to reverse for that transaction, so we write the transaction-end record for T3 here. At this point we could flush the dirty pages and write the log out to disk, because we know everything here is durable. It's not required; it just prevents you from having to do extra work if you come back around a second time.

So let's say we crash again, during recovery this time, and everything we have in memory gets blown away. When we come back, after the analysis phase, the only thing we'd have in our active transaction table is T2, because we saw the transaction-end record for T3, which got flushed out to disk, so we can ignore T3 when we come back around after analysis. Now the only thing we need to do is make sure we finish reversing T2. So we'd jump to the CLR here, redo it to make sure it actually got applied, follow the chain back up, apply that CLR, apply that change, flush it out, and then write our transaction-end record for T2. So I'm just showing you: if I crash during recovery, I come back and pick up the process where I left off. Is this clear?
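Putting the undo bookkeeping together, here is a sketch of the undo pass under the same assumed structures; last_lsn(), make_clr(), and apply_undo() are illustrative helpers, and we assume each transaction's record chain terminates with prev_lsn set to None.

    def undo(log, att, buffer_pool):
        # Everything still in the ATT after redo must be rolled back.
        to_undo = {log.last_lsn(txn_id) for txn_id in att}
        while to_undo:
            lsn = max(to_undo)                 # reverse the newest change first
            to_undo.discard(lsn)
            rec = log.fetch(lsn)
            if rec.type == "UPDATE":
                page = buffer_pool.fetch(rec.page_id)
                clr = rec.make_clr(undo_next=rec.prev_lsn)
                log.append(clr)                # assume append() assigns clr.lsn
                rec.apply_undo(page)           # reverse the logged change
                page.page_lsn = clr.lsn
                next_lsn = rec.prev_lsn
            else:                              # a CLR: this change was already
                next_lsn = rec.undo_next       # reversed, skip to what's next
            if next_lsn is not None:
                to_undo.add(next_lsn)          # more of this txn left to reverse
            else:
                log.append_txn_end(rec.txn_id) # rollback complete for this txn
        log.flush()

The max() pick is what gives you strict reverse order across all the transactions being rolled back, and following undo_next instead of prev_lsn for CLRs is exactly how the second pass in the example skips the work that was already reversed before the crash.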
my example here no let's go back to the here so after the crash when I do my analysis I would have seen the transaction end message for T1 so therefore at the end here T1 was not active so it shouldn't be an actual transaction table it's gone if T1 aborts I have the yes yes it's hard to do this in PowerPoint so at this point here I crash and then you're right after analysis T1 should be in here right then on the redo phase I'm going to reapply the changes for T1 here and then when I do the undo it wouldn't be in my actual transaction table right because I saw an end yes I know the confusion is I'm showing this and I'm saying okay after the crash this is what I saw but this is actually the state after I do redo and that's why T1 is not there because I saw the transaction end and therefore I would have reversed the change as defined in this CLR right so what I'm trying to show is here after I do a redo what do I need to reverse well I need to reverse T2 and T3 okay and so the first thing I'm going to reverse the first entry which is 60 going back in reverse order so I have to create a new CLR for that for verse 60 follow the undo next and that's going to tell me I need to reverse 20 right but I also need to reverse now 50 because that appears next going in reverse order at LSN 50 I follow LSN there isn't anything else I need to reverse right so I can put the end record here then now at this point before we reverse 20 which is up here we crash and come back then after the second round of analysis and redo the only thing we need to then undo still is T2 yes I need to redo for those who write a commit log but you don't see a transaction end log you still need to redo them because you don't know whether the dirty pages were written out to disk the log records have been written to disk no because you don't know whether the pages it wrote to that it modified were written out to disk transaction end does not guarantee all your dirty pages flush to disk you're not going to see it ever again at the checkpoint or the background writer right or upon eviction because we need to swap a page out make space for it yes question what is DPD actually doing here so you would basically at the very end before you tell the outside world hey guys we're back in business start running make sure you flush all these guys out too it's additional bookkeeping does DPD get flushed to disk people or what the pages that are under areas what you're supposed to do is at the moment that the before you tell the outside world okay we've fully recovered we can start running again you're supposed to flush all the dirty pages out to disk I don't think people actually do that though because like if you assume crashes are rare then like why spend the time to flush these things out I'll just they'll get flushed out during normal operations anyway to the next checkpoint or the background writer how do you know the mixing of this between where and getting the prevalescent again it's the prevalescent you would have like I should be more consistent I'm showing the different variations of the log records because you don't want to put all in but like you would have the prevalescent to say here's the prevalescent for this and that's what it is same thing okay any other questions again this is hard this is why like you don't want you don't want randos loading data systems you won't seem used to this so some quick again just reiterating everything I said before so what happens if we crash during the analysis phase what do we need to do nothing 
So, some quick points, just reiterating everything I said before. What happens if we crash during the analysis phase? What do we need to do? Nothing; we just run recovery again, because we didn't make any updates during analysis. What happens if we crash during the redo phase? Do we need to be clever about anything? No, we just come back and redo everything as normal, like before. Okay, so what can we do to make redo go faster? We can assume that we're not going to crash again, so we don't need to flush pages to disk immediately at every single step as we go along; we can do that either all at the end, later on, or just lazily once we turn the system back on. To make undo go faster, we could potentially maintain a table in memory of the changes we need to reverse for each page, so that when a transaction tries to read that page, we go ahead and actually apply the undo before we serve it up. I don't think any system actually does this, but it's an optimization you could do. Or you could just rewrite your application to remove long-running transactions, because that minimizes how far back in the log you have to go to undo things; but that's asking people to change application code, and nobody's going to do that.

We've gone through all of this, so hopefully you'll absorb some of it, and you can watch it again later on YouTube. The main ideas, again: in ARIES we have the write-ahead log with steal plus no-force; we use these fuzzy checkpoints to keep a snapshot of the dirty page IDs and the active transactions; on recovery, we redo everything from the earliest dirty page to get back to the state at the moment of the crash; and then we undo any transaction that was still active at that moment. We use these compensation log records to keep track of the changes we've undone, so we know they've actually been applied, and the thing we use to keep track of what comes before what is the log sequence numbers; that's what allows us to apply the changes in the proper order, in either the forward or the reverse direction. And we use LSNs all throughout the system to keep track of what's actually durable on disk. Again, this goes back to that diagram I showed you before, with the different layers of the system that we're building up, and then the concurrency control piece and the recovery piece that span all of them; this is why the notion of these LSNs shows up throughout the entire system.

Yes? The statement is, the CLRs are for the worst case, where you keep crashing during recovery. Yes, yep.

All right, well, congratulations: at this point in the semester, you know how to build a single-node database system, right? There was a groan, I heard that. So now, starting this Thursday, we start talking about distributed databases, and it's all the same s*** we talked about the entire semester, just now harder. All right, good. All right, guys, hit it.