So today we're continuing our lectures on how to do database logging and recovery. As a quick refresher on what's due: homework six is due a week from now on November 27th, project four is due two weeks from now on Wednesday, December 6th, and as a reminder there's no class this Wednesday because of the Thanksgiving holiday. All right, real quick, does anyone have questions about project four? General questions? Yes? Wait, say it again, sorry? For four? We'll double-check that. Okay, we'll fix that. Anything else? Are you done with project four already? So as a reminder of where we're at: last class we spent the time talking about logging as the first part of doing database recovery. This is part of the broader picture of recovery in a database system, and the idea is that we want to implement some method or mechanism inside our database system that ensures that for any transaction that commits or any transaction that gets aborted, we end up guaranteeing consistency, atomicity, and durability despite any possible failure you can have, except, obviously, if the machine catches on fire and we don't have a backup; we can't protect ourselves from that. But in the normal case that's not going to happen, and we want to be able to handle everything else. And as I said last class, there are two parts to recovery algorithms in a database system. The first part is the things we do at runtime to prepare ourselves so that, in the event there is a crash, we have enough information to recover the database and put it back into the correct state. The second part, and what we're going to talk about today, is the things we do after there's a crash to recover the database to the correct state and, again, ensure that all transactions are atomic, everything is consistent, and all our changes are durable. Last class we talked about write-ahead logging to handle the first part; today we're going to focus on the second part. The method we're going to use to do recovery is referred to as ARIES. ARIES is a specific family of algorithms from IBM Research that we're going to describe how to implement in our database system today. It was originally published in a paper in 1992 by a famous researcher at IBM named Mohan, and it is sort of the Bible of database recovery, if you will. The paper is about 70 pages and very dense; if you want to fall asleep you can try reading it, but it's actually really awesome: it lays out in exact detail all the things you need to do in order to do database recovery in your system. Although this paper came out in 1992, that's not to say that prior to this no one did recovery at all; this paper was really the one that set in stone exactly the steps you need to take to ensure that your database is fully recoverable. So what I'm going to describe to you today is a crash course, pun intended, on how to do ARIES. What I would say, though, is that not every single database system that does recovery does exactly what ARIES does.
As I said, there were systems prior to ARIES that did recovery, and there are newer systems that do recovery differently; ARIES is one specific design, and it's probably the most well known. If you took this paper and implemented it exactly as they describe, you would have a database system that's fully recoverable after a crash, but there are other minor things you can tweak, change, drop, or add while still guaranteeing full recovery, and then it won't be exactly as defined under ARIES. Again, though, this is considered the gold standard of how to do it. So the basic idea of what's going to happen under ARIES comes down to three main ideas. The first is that we're going to use write-ahead logging to record the changes that transactions make while we're running under normal processing, and as we said last class, our buffer pool manager has to use the steal + no-force policy for this to work correctly. So, quick question: what does the steal policy mean for a buffer pool manager? Right, so it's essentially what he said, but I'll say it more succinctly: steal means the database system is allowed to write out to disk dirty pages that were modified by transactions that have not committed yet. And what does no-force mean? What does that policy mean? Be very careful. So he said you can tell the application that your transaction committed before its changes actually live on disk. You're very close: the log always has to be on disk; no-force says we don't require that, when a transaction commits, all of its dirty pages have to be flushed out. So we're going to do write-ahead logging and write things out to our log on stable storage, and then on recovery we're going to repeat history during our redo phase: we go through and reapply all the changes that were made by any transaction that was running at the time of the crash. This will become very clear later on: even if we know a transaction is going to abort, we still redo its changes, because we want the state of the database to be exactly the way it was at the moment we lost power. Then, during undo, we undo the changes from transactions that we know did not commit. The key thing we're adding with ARIES is that we now also need to record log records for every undo operation. Last class I just said, yeah, you do all your updates and you commit, and if you roll back you just undo everything; but now we're actually going to have special log records that say: this is not an action from a regular transaction, this is an action that undoes a previous action because the transaction aborted. The combination of these three things is enough to guarantee that we can put the database into the correct state, exactly the way it was at the moment it crashed, without exposing any uncommitted transactions. All right, so for today's agenda: we're going to start off talking about log sequence numbers, which are important because they allow us to figure out what the order of the operations in the log should be. Then we'll talk about how ARIES behaves under normal commit and abort operations while processing transactions regularly. Then we'll extend the checkpointing protocol we discussed last class to introduce a new type of checkpoint called
fuzzy checkpoints, and we'll do this for performance reasons. And then we'll spend most of the time talking about the recovery algorithm, the ARIES part, where you actually apply all the techniques from those first three parts and put the database back into the correct state. Okay, so the first thing is that we need to extend the schema of our log records from last class to include additional information. Last class I just sort of said, yeah, when transactions modify the database you append a log record and then write it out to disk, and in my simple examples it was obvious that the order in which actions were performed on the database corresponded to the order of the operations in the log. But you can't actually guarantee that, because otherwise you'd have to lock everything or take latches on everything while you update pages and update the log. So instead we're going to maintain a counter, and we hand out a new value from this counter every single time a transaction wants to create a new log record as part of a modification to the database. This counter is called the log sequence number, or LSN. Most simply, you can think of it as a global in-memory counter where you do an atomic increment and assign the new value to whatever thread needs the next LSN. What's going to be tricky about LSNs is that we're not just going to use them to figure out the order of the log records; we're going to use these LSNs all throughout the database system to keep track of what's going on, what's on disk, and what's in memory in different parts of the system. This table here shows some examples of where the LSNs are used. Again, every time I need to update a page or a record, I have to get a new LSN, and that corresponds to the order in which my modification and its log record occurred; then I have to update other things in the system to say: I made this change, and here is the LSN of the log record where I made it. First you have what's called the flushedLSN. This lives in memory, and it's the last LSN that was successfully flushed out to disk, the last log record written to disk; we know that everything at that LSN and below is now durable and safe on stable storage. Then, for each page, we have two additional LSNs: the pageLSN, which is the newest update to the page, and the recLSN, which is the oldest update to that page, the LSN of the first change made after the page was brought into memory and became dirty. So we know the first change that dirtied the page (the recLSN) and the most recent change (the pageLSN), and we can use these to figure out whether it's safe to flush the page out to disk. Then for every transaction we also maintain the lastLSN, the LSN of the last action it took. And then we have this thing called the MasterRecord, which is an LSN that lives on disk; it's just a pointer to an LSN in the write-ahead log that corresponds to the last checkpoint we took. This last one is simply for convenience, so that we don't have to scan through the entire log to figure out where our checkpoint is; we know exactly where to jump.
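To make that table concrete, here is a minimal sketch, in Python, of where each of these LSNs lives. The names and layout are illustrative only; a real engine packs these into page headers and internal structs, but the bookkeeping is the same.

```python
# Sketch of the LSN bookkeeping described above (illustrative names, not any real system).
import itertools
from dataclasses import dataclass, field

lsn_counter = itertools.count(1)   # global, monotonically increasing LSN source
flushed_lsn = 0                    # in memory: last LSN known to be durable in the log on disk

@dataclass
class Page:
    page_id: int
    page_lsn: int = 0              # newest update applied to this page (kept in the page header)
    rec_lsn: int = 0               # oldest update since the page was brought in and first dirtied
    data: dict = field(default_factory=dict)

@dataclass
class Transaction:
    txn_id: int
    last_lsn: int = 0              # LSN of the last log record this transaction wrote

master_record = 0                  # on disk: LSN of the latest checkpoint, so recovery knows where to jump
```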
Yes? Yes, it's just part of the page, it's in the header, so whatever was in memory gets written out to disk as well. Yes, correct, the page ID just tells you where to jump on disk within a file to find that page, and no matter how many times you modify the page, that page ID doesn't change, whereas the LSNs always go up every time you modify it. Okay, so let me give an example of how we're going to use these LSNs to figure out when it's safe to flush things out to disk. As he just mentioned, every data page has a pageLSN, the LSN of the most recent update a transaction made to that page, and we also keep track of the flushedLSN, which is just a marker saying: here's the last LSN I successfully wrote out to disk. Before we're allowed to flush a dirty page out to disk, we have to check whether the flushedLSN is greater than or equal to the pageLSN. If it is, then we know the log record corresponding to the change that modified our page, as identified by the pageLSN, has been written out to disk, because it is at or below the flushedLSN. The flushedLSN can't go back in time; you always write out the log records sequentially, in the order they appear, so if the flushedLSN is some number, you know everything at or below it has been written out. Therefore, if the pageLSN is less than or equal to the flushedLSN, the log record that modified the page has been written out, all the changes are safe and durable in the log, and you can write the page out to disk. So again, these LSNs are essentially watermarks that let us figure out whether it's safe to release a page, and when we do recovery we can use them to figure out what was going on at the time of the crash and which changes we actually need to reapply or undo. All right, so let's look at a diagram like this. In memory we have the tail of the write-ahead log; we call it the tail because we don't need to keep the full log in memory, and as we flush things out we can free that memory and reuse it for new log entries, so we always keep just the tail. And in our buffer pool we have a single page we brought in. The first thing you see is that in the write-ahead log, both on disk and in memory, every log record is now prefixed with its log sequence number. Then on the page, both on disk and in memory, we have the pageLSN, the newest LSN that modified the page, and the recLSN, the oldest LSN that modified the page. And then we have the flushedLSN, which points to the log sequence number of the last record we flushed out, and the MasterRecord, which points to the last checkpoint we took in the log. Okay, so let's look at a simple scenario. Say the pageLSN is 13 in this case; it points to some log record out there. In this particular example, if the pageLSN is 13, is this page safe to write out to disk? Yes, because the pageLSN is less than or equal to the flushedLSN. But now say the pageLSN points to this later record here instead; this page can't be written out,
because the pageLSN is now greater than the flushedLSN. So again, we know that whatever log record made the change to this page, as identified by the pageLSN, is not yet safe on disk, and therefore we can't write out this page. The reason is that if we crash and come back, this page would have been written out to disk, but we possibly won't have a log record that says what that change was, so we have no idea what the actual correct state of the page should be. If the transaction aborted, then we need to roll back, but we don't know what the old value was for anything it modified, so we don't know how to reverse it. That's why every log record that modifies a page has to be flushed out to disk, safe and durable, before we can write the dirty page itself to disk; that's the key thing we have to guarantee throughout all of this. All right, so again as an overview: every log record gets an LSN, and the LSNs are always increasing in time, essentially just adding one each time. Every single time we modify a page, we have to pin that page, get a log sequence number, add our entry into the log tail in memory, apply our change to the page, set the pageLSN, and then we can unpin it. At that point the system internally tracks the flushedLSN and can decide, when the buffer pool manager wants to flush something out to disk, whether the flushedLSN is greater than or less than the pageLSN of a particular page; if the flushedLSN is less than the pageLSN, then that page is not a candidate for eviction, and this is something the buffer pool manager has to maintain. All right, so now with these LSNs we can talk about how we're actually going to execute transactions at runtime, and again there are two cases: the transaction commits or the transaction aborts. For this discussion I'm going to make the following three assumptions. We already said we're using steal + no-force with write-ahead logging, so that's nothing out of the ordinary. But we're also going to assume that all our disk writes are atomic, and what I mean by that is that, for simplicity, we're not going to worry about the case where a log record spans multiple pages; we'll just say that any time we have to flush out a log record or a disk page, it can be done atomically. Obviously we can't actually guarantee that with the hardware if you have large page sizes, so different database systems do different things to make sure it happens; MySQL, for example, writes to a doublewrite buffer first, writes all the pages out there, and once that's durable it then does the random writes throughout the table heap. The other thing we're going to assume is that we're running with strict two-phase locking concurrency control. That's not to say ARIES isn't compatible with timestamp ordering or other schemes, but for this discussion we'll just assume our transactions are serializable, and then we don't have to worry about cascading aborts; we assume no transaction will read any record or any data from another transaction that hasn't committed yet, so we don't have to worry about those kinds of rollbacks. That just simplifies the discussion, but it's not required to do ARIES.
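Putting the last couple of slides together, here is a rough sketch of the write path and the eviction check just described. The helper names and the record layout are invented for illustration, not taken from any actual engine.

```python
# Sketch of the page-update protocol and the "safe to flush?" check (illustrative only).
from dataclasses import dataclass, field

@dataclass
class Page:
    page_id: int
    page_lsn: int = 0
    is_dirty: bool = False
    data: dict = field(default_factory=dict)

log_tail = []        # in-memory tail of the write-ahead log
next_lsn = 1
flushed_lsn = 0      # last LSN durably written to the log on disk

def update_page(page, key, new_value, txn_id):
    """Apply a change to a pinned page: get an LSN, append the log record to the
    tail, modify the page, and stamp the page with the new pageLSN."""
    global next_lsn
    lsn = next_lsn
    next_lsn += 1
    before = page.data.get(key)
    log_tail.append({"lsn": lsn, "txn": txn_id, "page": page.page_id,
                     "key": key, "before": before, "after": new_value})
    page.data[key] = new_value
    page.page_lsn = lsn
    page.is_dirty = True
    return lsn       # the caller unpins the page afterwards

def safe_to_flush(page):
    """The buffer pool may only evict a dirty page once every log record that
    modified it is durable, i.e. pageLSN <= flushedLSN."""
    return page.page_lsn <= flushed_lsn
```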
All right, so when a transaction commits, we write a COMMIT record to the log, and the next step is to flush all the log records for that transaction that come before and including that commit record. Once we know the commit record has been flushed to disk, the transaction is considered fully committed, and now we can tell the outside world that your transaction is finished and everything is durable. Obviously, what's going to happen here is that we're not going to take every single log record, write it out as a single line in some file, and do an fsync on that; that would be really slow, and typically log records are smaller than a page, so we batch a bunch of log records together in a page and do a single fsync for all of them. For project four you'll have to implement group commit, which is essentially doing the same thing. And this is also why, as you'll see, we have to maintain undo information: when our transaction wants to commit we flush all its log records out to disk, but interspersed with its records will be log records from other transactions that have not committed yet, and those get written out to disk as well, because you don't want log sequence number 20 written to disk before log sequence number 10. Everything is written out in sequential order, always increasing; you don't want gaps in the log. So once we know our fsync has completed and all the changes from our transaction are flushed out to disk in the write-ahead log, we can tell the outside world that the transaction has finished. But then we also write a new special log record called TXN-END; this is just a signal to the database system that this transaction, at this point in the log, is completely finished, you will never see it again, and everything for it is durable in the log on disk. This TXN-END record does not need to be flushed right away, and we can write it after we tell the outside world that the transaction committed; it's just an internal marker that lets us know later that the transaction is truly finished. So let's look at an example. We have a transaction T4 that modifies object A and object B, and at runtime, when the transaction goes to commit at this point here, the database system says: I have to flush everything in my write-ahead log in memory up through the commit record at LSN 15. All of that gets written out to disk, we do an fsync, and then we update the flushedLSN to say that the largest log sequence number we know is durable on disk is now 15, and we can use that to figure out what other things we can write out. Then later on, once everything has been committed, we write out the TXN-END record, and at that point we know we don't need any more information in memory about what T4 did, so it's safe to free up that memory in the write-ahead log tail and reuse those pages for other log entries.
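Here is a hedged sketch of that commit path; the helpers are stand-ins for whatever the log manager actually exposes, and batching many transactions' records into one fsync is the group-commit idea mentioned above.

```python
# Sketch of the commit path: append COMMIT, flush the tail through it, ack, then TXN-END.
log_tail = []          # in-memory WAL tail, kept in LSN order
log_on_disk = []       # stand-in for the durable log file
next_lsn = 1
flushed_lsn = 0

def append(rec):
    """Append a record to the in-memory tail and assign it the next LSN."""
    global next_lsn
    rec["lsn"] = next_lsn
    next_lsn += 1
    log_tail.append(rec)
    return rec["lsn"]

def flush_log_up_to(target_lsn):
    """Write every tail record with lsn <= target to disk in order, then fsync once."""
    global flushed_lsn
    while log_tail and log_tail[0]["lsn"] <= target_lsn:
        log_on_disk.append(log_tail.pop(0))
    flushed_lsn = max(flushed_lsn, target_lsn)   # everything at or below this is durable

def commit(txn_id):
    commit_lsn = append({"txn": txn_id, "type": "COMMIT"})
    # Records from other, still-running transactions that precede the COMMIT go out too;
    # the log must hit disk in LSN order with no gaps.
    flush_log_up_to(commit_lsn)
    # ... now it is safe to tell the application the transaction committed ...
    append({"txn": txn_id, "type": "TXN-END"})   # internal marker, no immediate fsync needed
```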
Yes, a question about this page? So his question is, and maybe the diagram is confusing: once I get to TXN-END, I know I don't need anything else that comes before it for transaction T4, so I can free up this memory. In my diagram here I'm not showing it in terms of pages; I'm just saying here's the log, and anything that comes before this point is freed. But you're right: if we actually showed the breakdown of which pages the log tail is using, those pages would be freed and put back into the buffer pool so they can be reused for other log entries. Anything in those pages isn't needed anymore for these log records, so you just reuse them; I'm not showing that here, I'm just truncating anything that came before. Okay, so now for transaction aborts. Last class, when I talked about aborts, I basically said, yeah, you have some read/write set for the transaction and you just know how to undo those changes. But to do this correctly with ARIES, we actually have to record some extra information in the log to keep track of the undo operations that we perform. The way to think about this is that undo is a special-case operation: it applies to a single transaction, we're going to undo its changes, and as we undo them we store new log records that are essentially update actions reversing the changes the transaction made previously. So one thing we're going to add is an additional field in our log records called the previous LSN, or prevLSN, which is the LSN of the previous log record that this transaction added to the write-ahead log; it's maintained on a per-transaction basis. We don't need this for correctness, because we could always do a sequential scan of the log and figure out the next log record we need to reverse, but we do it for performance, because now it's really easy to walk back through this linked list and say: all right, I undid this action, what's the next action I need to undo? It's also really useful when the transaction is long and makes a lot of changes, and some of those log records may already have been written out to disk; I can just follow my prevLSN linked list and figure out whether the log record I need to examine in order to undo this transaction is in memory or on disk, and I know exactly where to go to find it on disk. Otherwise you'd have to do a sequential scan of the entire log on disk to figure out everything you need to undo, if you haven't yet reached the transaction's BEGIN record, the head of the linked list. Right, so let's go back to our example with transaction T4. You now see that we've added the current LSN for each operation and the previous LSN, and in the first case, for the transaction BEGIN record, the prevLSN is just null, saying this is the head of the linked list. Now, when we do an abort, we need to undo all those changes, and once we know all those undo actions have been successfully logged, we can write out the TXN-END record to say this transaction is completely done. The tricky thing, though, is the middle part here: what I'm not showing you yet are the steps we take to actually undo the changes we made when the transaction was running the first time.
Again, the high-level idea is that we're going to store new log records that essentially reverse the changes the transaction made when it ran the first time, in reverse order. So if it updated A and then updated B and then aborted, when we undo it we have to reverse B first and then reverse A. These special actions are called compensation log records, or CLRs; they are basically records of undoing a previous update from when the transaction ran the first time. We're also going to add another LSN pointer called undoNext, which just keeps track of the next operation we need to undo as we reverse these changes. So the prevLSN tells us how to go back in time to undo things, and the undoNext is an additional marker that says: I've undone this much, and here's the next place to jump to undo more. These CLRs matter because if we crash during recovery, so we crash the first time, start recovering the database, and then crash again, we can use the CLRs to figure out which operations we were undoing when we crashed the second time, to make sure we still put the database back into the correct state. CLRs get added just like any other log record: they have an LSN that always has to be increasing and unique, but they're a special kind of record that says this is what we did to undo the thing we were trying to undo when the transaction got aborted. So, most simply, as an example: the way to think about this table is that every row is a log record, so we have the LSN, the prevLSN, the transaction ID, the type of operation, the object we modified, the before and after values, and the undoNext. Let's say we get the abort request for this transaction. The first thing we add is a new CLR, a compensation log record, that is the reversal of the previous update record we saw earlier in the log. Essentially, if you look at what it's doing, it takes the before value and the after value of the object and swaps them. The first time the transaction ran, it saw value 30 in object A and set it to 40; now, to undo that, we expect object A to contain 40 and we put it back to 30. Again, we're undoing the update we made before. And then we have the undoNext, which is just a pointer saying: here's the next log record I need to reverse in order to undo this transaction. In this case it points to log sequence number 001, which is the beginning of the chain, the transaction's BEGIN record, so there's no additional step needed after this; we know we've successfully undone the transaction and therefore we can add the TXN-END record at this point. Yes? Is this a performance optimization, or is it needed, since if you don't write these records you could always start over from the beginning? So his question is: do we need CLRs for performance, or do we need them for correctness? It definitely helps for performance, because if you know how far you've undone something, then the next time you come back around you don't need to undo it again. For correctness I think you need it too, but I'd have to think about why. Yes? So, wait, we'll talk about recovery in a second.
His question is whether the issue is that during recovery, stuff could be written out to disk again. Yes; recovery is essentially like a normal execution of transactions, except that you're not taking new transactions from the outside world, you're looking at the log and replaying things. And yes, his statement is that you need this because of the steal policy: since at any time a page could be written out to disk, you need to log these entries so that if you crash and come back a second time, you know what was actually written out to disk and how to put it back into the correct state. Do you need it for correctness? Well, if you were guaranteed to just write an end marker saying "I'm done with recovery," then that's basically what you'd need, because if you don't see that token you replay the log from the beginning as always, and you can always look at the log and say, no, I already undid this. Say it again, sorry? Yeah. Yes, so you're correct: this is done for performance reasons, for the same reason checkpoints are done for performance reasons, because otherwise you have to replay the entire log. If your log takes ten minutes to recover from and, for whatever reason, you crash every five minutes, you'll never bring the database back online. And think about the old days, the 1980s and 90s, with really slow disks, where reading things in and out of disk during recovery was really slow; this helps speed things up, sort of like a mini checkpoint in the log that says I've already undone this much. So I think this is done for performance reasons; I don't want to say definitively that it's not needed for correctness, I'd have to think about it, but it is definitely for performance. All right, so again, for aborting: we always write the ABORT record to the log first, and we don't need to fsync it. Unlike a commit, where we do have to fsync right away, for an abort we don't. Then we start undoing all the changes in reverse order, adding a new CLR entry for everything we reverse, restoring the old value of the object that was modified, and at the end we write out the TXN-END record to say the transaction fully finished. The reason we don't have to fsync any of this is that if we crash and come back before the abort and TXN-END records make it into the log, who cares: the transaction aborted, and it's going to get aborted anyway the next time we recover. It's only when we tell the outside world that your transaction has fully committed that we need everything to be durable, so we have to fsync on the commit records but not on the abort records. All right, so at this point in the lecture we've covered log sequence numbers, so we know how to record, in the log, the order in which transactions make changes to the database, which tells us how and in what order to replay them on recovery. And we've talked about how to do normal commit and abort processing in ARIES: for commits, we make sure we write out the commit record and everything that comes before it, and we use the flushedLSN and the pageLSN to figure out when it's safe to write a dirty page out to disk; for aborts, we add CLRs to undo the changes we made at runtime for the aborted transaction.
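As a sketch of that abort path, again with invented structures (the log is just a list of dicts and `pages` maps page ids to their contents), walking the prevLSN chain and emitting a CLR per undone update might look like this:

```python
# Sketch of runtime abort with compensation log records (illustrative, not a real engine).
log = []        # the whole log, as a list of dict records, for illustration
next_lsn = 1

def append(rec):
    global next_lsn
    rec["lsn"] = next_lsn
    next_lsn += 1
    log.append(rec)
    return rec["lsn"]

def abort(txn_id, last_lsn, pages):
    """last_lsn is the LSN of the transaction's most recent log record."""
    prev = append({"txn": txn_id, "type": "ABORT", "prev_lsn": last_lsn})
    lsn = last_lsn
    # Walk the transaction's records backwards via prevLSN and undo each UPDATE.
    while lsn is not None:
        rec = next(r for r in log if r["lsn"] == lsn)
        if rec["type"] == "UPDATE":
            # CLR: restore the before-image; undoNext points at the next record still
            # to be undone, so a crash mid-abort knows where to resume.
            pages[rec["page"]][rec["key"]] = rec["before"]
            prev = append({"txn": txn_id, "type": "CLR", "prev_lsn": prev,
                           "undoes": rec["lsn"], "before": rec["after"],
                           "after": rec["before"], "undo_next": rec["prev_lsn"]})
        lsn = rec["prev_lsn"]                       # BEGIN has prev_lsn = None, ending the walk
    append({"txn": txn_id, "type": "TXN-END", "prev_lsn": prev})   # no fsync required
```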
All right, so now we need to talk about how to do fuzzy checkpoints. Again, the idea here is performance: we don't want to replay the entire log, and checkpoints let us cut off how far back in the log we have to look. Then we'll get to the recovery algorithm, where we put all of this together and actually restore the database to the correct state. Last class I talked about a really naive or simplistic checkpoint scheme where you essentially halt the entire database system while you take a checkpoint, to ensure you have a consistent snapshot. You tell the system to stop accepting new transactions, so nothing new can start, and for any transactions that are currently active, you wait until they finish; then, once they're all done, committed or aborted, you go ahead and take the checkpoint, and this ensures you have a consistent state of the database when you write it out. One thing I should be clear about: when I say checkpoint, I don't mean a delta checkpoint like you'd see in some file systems, where you're making multiple copies of the database. This really means taking all the dirty pages in memory and flushing them out to disk, so that if you come back, the database is in the exact state it should be. This particular scheme is obviously bad because we have to stop the world while we take our checkpoint: we can't accept any new transactions, and we have to wait for all active transactions to finish before we can take the checkpoint. If your transactions are short and the amount of memory you have is small, maybe this is not a big deal, but if you have a one-terabyte buffer pool, you basically have to stop all transactions while you flush that terabyte of memory out to disk. Furthermore, if you have transactions that can take hours, and you want to take a checkpoint right after one of those multi-hour transactions has just started, you have to wait for it to finish before you can take the checkpoint, and all the while you're stopping the world and not accepting any new transactions, so your database system looks down or broken for this long period. This is obviously bad. A better way to do checkpoints, which is still not the way we actually want to do it but is better than what I proposed before, is to pause transactions while we take the checkpoint, but not require the database system to wait until all the active transactions actually finish. The way to think about this: in memory you have a bunch of pages, and you want to write your checkpoint out to disk, so one thread is the checkpoint thread, and it essentially does a sequential scan over every dirty page in memory and writes them out to disk, while some other transaction wants to scan, maybe in the reverse order, and update some pages. Let's say that at the beginning the transaction modifies page three and then the checkpoint starts. The transaction gets stalled after modifying page three, and the checkpoint scans through, writes out pages one and two, and picks up the transaction's modification to page three.
Then, when our checkpoint completes, the transaction is allowed to keep running, and it goes and modifies page one. But because we stalled that transaction at the moment the checkpoint started, even though the transaction modified page three and page one, our checkpoint will only contain the changes to page three, because that modification occurred before we took our checkpoint. So this is an inconsistent state of the database: on disk we have the transaction's changes to page three but not its changes to page one. So we have to maintain some extra information to keep track of what the state of the database was at the moment we began our checkpoint, and the two additional things we need are what's called the active transaction table, or ATT, and the dirty page table, or DPT. The ATT is basically the set of transactions that were running at the moment I took my checkpoint, and the dirty page table is the set of pages that were dirty in memory when I started the checkpoint; dirty means they were modified by a transaction that has not committed yet. In the ATT we store an entry for every transaction that was active at the time we started the checkpoint: its transaction ID, its status at the moment the checkpoint started, which could be running (it's active), committing (it's flushing out its log records), or a candidate for undo, meaning we don't know whether this transaction is going to commit or not, because at the moment we took the checkpoint it had just started running; it modified the database, but the commit could occur after our checkpoint started, and at the moment we took the checkpoint we don't know what will happen. We also maintain the lastLSN, the last LSN written by the transaction. When a transaction commits or aborts, we throw its ATT entry away. In the DPT, the dirty page table, we keep track of all the pages in the buffer pool that contain changes from uncommitted transactions, meaning any page dirtied by a transaction in the ATT that's uncommitted or a candidate for undo. The only thing we need to maintain in the dirty page table is the recLSN: the LSN of the log record that first caused the page to become dirty, that is, the LSN of the first change that modified this page after we brought it into memory. So now, when we take a checkpoint, we record the ATT and the DPT in the checkpoint record, as well as writing out all the dirty pages, and again, for this scheme, we're pausing all the transactions, not letting them keep running while we take the checkpoint, but we don't wait until they actually finish. Transactions are still in flight while we take our checkpoint, so we need to record this information to say which transactions were around and what they modified.
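Here is a minimal sketch of those two tables and of the stop-and-pause checkpoint record, assuming an invented `log_append` helper that appends a record to the write-ahead log; the field names follow the lecture, not any particular system.

```python
# Illustrative stand-ins for the ATT and DPT and a (non-fuzzy) checkpoint record.
from dataclasses import dataclass

@dataclass
class ATTEntry:
    txn_id: int
    status: str      # "R" running, "C" committing, "U" candidate for undo
    last_lsn: int    # last log record written by this transaction

@dataclass
class DPTEntry:
    page_id: int
    rec_lsn: int     # LSN of the log record that FIRST dirtied the page

active_txn_table = {}    # txn_id  -> ATTEntry
dirty_page_table = {}    # page_id -> DPTEntry

def take_checkpoint(log_append):
    """Pause-but-don't-drain variant sketched here: transactions are stalled while the
    dirty pages are flushed and a checkpoint record with copies of the ATT/DPT is written."""
    log_append({"type": "CHECKPOINT",
                "att": dict(active_txn_table),
                "dpt": dict(dirty_page_table)})
```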
So in this case here, when we take the first checkpoint, T2 is still running, so our active transaction table will just contain T2; T1 committed before our checkpoint, so we don't need to record anything for it. And in the dirty page table we record that pages 10 and 12 were modified by some transaction, which in this case has to be T2, and therefore we record that we know about them when we take our checkpoint. Now at the second checkpoint, transaction T2 committed in between the first and second checkpoints, and transaction T3 started, so T3 is the only thing in our active transaction table; and in our dirty page table we have page 10 and page 33: page 10 is still hanging around, it hasn't been flushed out since last time, and page 33 was modified by transaction T3. So this is better than the checkpoint scheme we had before, because we can allow transactions to be in flight; we just pause them, we don't wait until they finish. But this is still not great, because we're still stalling all our transactions while we take the checkpoint, and the checkpoint could take hours if we have a really large buffer pool, a lot of dirty pages, and slow disks. So we need a better way that lets us take a checkpoint, still allow transactions to run while we're taking it, and still be able to reconcile the fact that some pages got modified after we started the checkpoint and we may not know about them. Yes? His question is: is everything in the dirty page table definitely dirty, and by definitely dirty you mean ahead of what's on disk? Yes. And T2 commits before T3 finishes, so the stuff that T3 did to P10 will also be on disk, so it's technically not dirty? Wait, wait, what are you saying, sorry? Okay, one last time. So his statement was: if P10 was modified by both T2 and T3, then any log record that modified that page has to be written out to disk before T2 is allowed to commit; but then we flush out P10, and P10 contains changes from T3, and T3 hasn't committed yet, so is that page still considered dirty in that case? No. Yeah, so I should have made this clearer; I should have said exactly which pages these transactions are modifying, and that would have made it more obvious. Okay, what was your point, sorry? Correct, yes. Okay. So the way we're going to solve this is with fuzzy checkpoints, and this is another contribution that ARIES provides. A fuzzy checkpoint is one where the database system allows other transactions to keep running while we're taking the checkpoint, without stalling them. To do this, we need to record not just that we took a checkpoint; we need to record the boundaries of when the checkpoint started and when it finished, and when we finish, we record what the active transaction table and the dirty page table were. So let's look at our example. The first thing to point out is that the MasterRecord on disk gets updated to point at the CHECKPOINT-BEGIN record, so that after a crash, when you come back and need to figure out what the state of the database on the disk pages should be, you find the checkpoint begin, then look for the checkpoint end, and that helps you figure out exactly what was going on at the time of the crash.
Now, any transaction that starts after the checkpoint begins will be excluded from the ATT in the CHECKPOINT-END record. In this case, transaction T2 started before our checkpoint began, so it's in our ATT, and the dirty page table contains the pages that were dirty at the moment our checkpoint began, and those all get written out. The way to think about it is that it records all the pages that are possibly dirty in memory, so that we make sure we apply those changes again when we redo transactions during recovery. So again, this is just a pair of boundary markers: our checkpoint began, our checkpoint ended, here's what was going on while we wrote all these pages out, and here are the pages that could have been modified by transactions during that window.
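Sketching the fuzzy variant under the same assumptions (the invented `log_append`, plus a `flush_dirty_pages` helper that writes pages out while other transactions keep running), the bracketing looks roughly like this:

```python
# Illustrative sketch of a fuzzy checkpoint bracketed by BEGIN/END records.
def fuzzy_checkpoint(log_append, flush_dirty_pages, active_txn_table, dirty_page_table):
    begin_lsn = log_append({"type": "CHECKPOINT-BEGIN"})
    # Snapshot the tables as of the begin record; transactions that start after this
    # point are deliberately excluded from the END record.
    att_snapshot = dict(active_txn_table)
    dpt_snapshot = dict(dirty_page_table)
    flush_dirty_pages(dpt_snapshot)          # runs while other transactions keep updating pages
    log_append({"type": "CHECKPOINT-END",
                "att": att_snapshot, "dpt": dpt_snapshot})
    return begin_lsn                         # this LSN is what the MasterRecord points at
```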
All right, so now for recovery. We have our fuzzy checkpoints, and we have our write-ahead log with all the LSNs, the prevLSNs, and the CLRs. Now we have to say: all right, we crash, we come back, we have our disk pages and a write-ahead log; how do we figure out how to recreate the world and put the database back into the correct state? The recovery protocol is broken into three phases. In the analysis phase, you read through the write-ahead log and try to identify the dirty pages that were in the buffer pool and the transactions that were active at the time we took our checkpoint and up until the moment we crashed. Then, in the redo phase, we go back and repeat history, starting at some point in the log, and reapply all of those changes to the database. We assume we don't know whether the dirty pages, the pages the transactions modified, actually made it to disk or not, so we always replay them. And when we redo things, we also replay the changes from transactions that will end up aborting later; even if we can tell from the analysis that they're going to abort, we still redo them. Then, in the undo phase, we undo those operations to reverse any transaction that we know did not commit, and we know it's not going to commit because either we see an abort record or we never see a commit record for it; in either case the transaction didn't fully finish, we never told the outside world that it finished, so we can go ahead and reverse its changes.

Visually it looks like this: we crash, we come back, we look at the MasterRecord out on disk, and that tells us the last checkpoint we took. We start our analysis phase there and scan from the checkpoint to the end of the log, going forward in time. Based on this analysis we figure out the oldest change that dirtied a page that may not have made it to disk, that is, the smallest recLSN in the dirty page table, and that tells us how far back in the log we need to go to start reapplying changes, because we know there cannot be any change in the log before that point that was not already applied to a page and written out to disk. So any modification prior to the recLSN where the redo phase starts is already durable in the disk pages, and we don't need to worry about it. Then, in the third phase, the undo phase, we start from the end of the log, the newest point in time, and go backwards in reverse order, undoing all the changes from uncommitted transactions. What I'm showing here is that undo may reach back beyond the start of the redo phase, because there may be a transaction that modified the database, whose changes were made and whose pages made it out to disk, but that never actually committed, so we need to go farther back in time to undo all of them. And this is why I said we assume strict two-phase locking: we don't have to worry about any transaction reading changes from an uncommitted transaction. All right, so I'll go through each of these phases one by one. In the analysis phase, the high-level idea is that we want to re-establish knowledge of what was going on in our database at the moment we took the checkpoint, and then we scan forward in time and figure out which transactions will have to abort and which will commit. You start by looking at the checkpoint, at its ATT and DPT, and then you scan forward from that last checkpoint. Any time you see a transaction make a modification, you add an entry for it to the active transaction table. If you see a TXN-END record, you remove that transaction, because you know nothing else will come after the TXN-END record for that particular transaction; it's been fully finished and removed. For all other log records you find, you add the transaction to the ATT with an undo status, because at that point in the scan of the log you don't know whether it's actually going to commit yet; you're going forward in time. Then, when you see its commit record, you update that transaction's ATT status to commit, which says we need to make sure all the changes for this transaction are reapplied during the redo phase and that we don't undo them. This is also a good example of a possible race condition you can have in a database system: the system thinks your transaction committed, it wrote the commit record to the log and was about to send the response back to you, but then it lost power and you never got the response. So you think the transaction got aborted, but internally the database system says it actually committed and will reapply those changes. Yes? So the question is: on a commit, do you have to update the DPT?
No, right, because that's handled by the buffer pool manager as it flushes things out. The ATT is: here are all the transactions that are active, and they could have dirty pages that, even after the commit, may not get flushed out until later. All right, and then for any update record, any log record we see that actually modifies the database: if the page is not in the DPT, we add it, and we set its recLSN, the oldest LSN that dirtied that page, to be the LSN of this update record. So again, we're building up state about what was going on in our system during the analysis phase, so that once we've scanned through, we can figure out which changes should get aborted and which should get committed. Yes? The question is: does the buffer pool manager update the DPT? That's an implementation detail; it doesn't have to. Some systems have a recovery mode with special components that maintain all this state information and track things, and at runtime you don't actually need to do it; other systems are essentially always in recovery mode, always maintaining this information automatically, and in that case it would be in the buffer pool manager. So his question is: it seems weird, because the dirty page table should be something the buffer pool manager maintains, but now there's this ARIES component, the recovery manager, that has to maintain this information as well, and couldn't they end up being inconsistent? Typically what happens is that you switch the database system into a special recovery mode where you don't accept any new transactions, and the recovery manager essentially piggybacks off the dirty page table that the buffer pool manager already maintains for it. Okay, yes? All right, so his question is about the race condition I just described, where we flush the commit record, but before we send the acknowledgement to the application that the transaction actually committed, we crash; the application then times out because the connection gets dropped, so why is that a problem? I'd say it's not a problem for the database system; we don't care. We committed it, we did what you wanted, we lost power, and we couldn't prevent that. The onus is more on the application programmer to say: I timed out, so when the database system comes back online I need to go back and figure out whether my transaction actually committed or not. So it's not a problem for the database system implementation; we did the correct thing, we got your transaction, it committed, everything is durable, we just couldn't send you the acknowledgement in time. The application has to have the extra code to check: all right, did that commit actually succeed or not? Do most applications actually do that? Most applications do not. Okay. All right, so again, the analysis phase: the way to think about it is that the ATT tells us which transactions were active at the time of the crash, and the DPT tells us which dirty pages may or may not have made it to disk. You jump to the point where you took the last checkpoint, you scan through and build up this information, and that tells you what was going on.
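Here is one way the analysis scan could look as a sketch; the log is assumed to be a list of dict records with `lsn`, `type`, `txn`, and `page` fields, which is my own simplification rather than any system's actual format.

```python
# Sketch of the analysis pass: start at the last checkpoint and rebuild the ATT and DPT.
def analysis(log, checkpoint_lsn):
    att = {}    # txn_id  -> {"status": "U" or "C", "last_lsn": lsn}
    dpt = {}    # page_id -> recLSN (first LSN that dirtied the page)
    for rec in log:
        if rec["lsn"] < checkpoint_lsn:
            continue
        rtype = rec["type"]
        if rtype == "CHECKPOINT-BEGIN":
            continue
        if rtype == "CHECKPOINT-END":
            # Supplement with whatever was active/dirty before the checkpoint began,
            # without clobbering newer information gathered during the scan.
            for txn, entry in rec["att"].items():
                att.setdefault(txn, dict(entry))
            for page, rec_lsn in rec["dpt"].items():
                dpt[page] = min(dpt.get(page, rec_lsn), rec_lsn)
        elif rtype == "TXN-END":
            att.pop(rec["txn"], None)              # nothing more will follow for this txn
        elif rtype == "COMMIT":
            att[rec["txn"]] = {"status": "C", "last_lsn": rec["lsn"]}
        else:
            # UPDATE, CLR, ABORT, BEGIN, ...: assume undo candidate until a commit appears.
            att[rec["txn"]] = {"status": "U", "last_lsn": rec["lsn"]}
            if rtype == "UPDATE" and rec["page"] not in dpt:
                dpt[rec["page"]] = rec["lsn"]      # recLSN = first record that dirtied it
    return att, dpt
```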
So let's look at a simple example. We start off: a transaction begins, then the checkpoint begins, and at this point we're taking the checkpoint. Then transaction 96 makes some modification, so we add an entry for it to the active transaction table: here's transaction 96, with its transaction ID and its status, where U means it's an undo candidate; we don't know at this point during our analysis whether this transaction will commit or not, so as far as we know right now it's not going to commit and we'll have to undo it. I'm also showing that this update modified page 33, so in our dirty page table we now have that page 33 is dirty. Then we get to the CHECKPOINT-END, which records in its ATT that there's transaction 96, which we already knew about because we saw its update at LSN 20, but also transaction 97, which started and made modifications before we began our checkpoint; that entry gets added to the ATT we're building during the analysis phase from the CHECKPOINT-END record. We also see that somebody modified page 20, so that page is dirty and ends up in our DPT as well. Then we get to LSN 40, where transaction 96 commits, so we update our ATT to say transaction 96's status is commit, because we saw the commit record, while 97 still remains an undo candidate. And when we reach its TXN-END, we can finally remove transaction 96 from the ATT completely, though we still have our dirty pages hanging around. So again, we build up the ATT as we scan along during the analysis phase, and the ATT and DPT stored in the checkpoint give us extra information about what was happening in the system before the checkpoint started, so that we don't have to scan back in time beyond the checkpoint to figure out what else was still running; we build the ATT and the DPT during the analysis phase and supplement them with the information in the CHECKPOINT-END record, which says: here are all the transactions that were still active that you may not know about because you haven't seen any of their log records. All right, so now that we've done our analysis, the next step is redo, and what we do here is repeat history to reconstruct the database back to the state it was in at the moment of the crash, and this includes the changes from transactions that will end up getting aborted, because we want the database to look exactly the way it did at the moment of the crash. So we're going to reapply all our updates and then add CLRs for anything we subsequently need to undo. There are obviously some optimizations you can do to avoid unnecessary reads and writes; for example, if you know a transaction is going to abort because you saw that in the analysis phase, and no other transaction modified the same pages, then you may not need to redo its changes at all. I'm going to ignore those obvious optimizations right now; I just want to go through the naive implementation so you understand exactly how it works, and then obviously you
could go back and do some additional things to speed it up, and this is where most systems deviate from the ARIES design, because there are a bunch of things you can do to avoid unnecessary writes. All right, so again, we scan forward from the log record that has the smallest recLSN in the DPT; that's the oldest log record that dirtied any page in our dirty page table. Then, for every update log record or CLR we encounter, we redo the action unless one of the following conditions holds. First, if the affected page is not in the DPT at all, then we know the modification made by this log record already made it out to disk, so we don't have to redo it. Second, if the affected page is in the DPT but this log record's LSN is less than the page's recLSN, then this change was already on disk before the page was re-dirtied, so we can skip it. Third, if the pageLSN of the affected page on disk is greater than or equal to this log record's LSN, then the page was flushed to disk after this log record was applied, so what's on disk is already at least as new as this change and we don't need to reapply it. To reapply an action, we just redo the logged change and set the pageLSN to this log record's LSN; we do not add an additional CLR for it, and we don't do any flushing or forcing. We just let things happen and let pages get flushed out as needed, because we can always go back and reapply those changes later on. Then, at the end of the redo phase, for any transaction sitting in the ATT with a commit status for which we never saw a TXN-END, we add a TXN-END record to the log; this tells us we've applied all of that transaction's changes, and therefore we don't need to worry about it in the ATT anymore. Then, in the last phase, the undo phase, we undo any transaction that was active at the time of the crash but didn't commit, and we know which ones those are because we built the ATT during the analysis phase: at the end of analysis, any transaction in the ATT with the undo-candidate status is one we know didn't commit, and we need to go back and reverse all of its changes. So essentially we go in reverse LSN order, starting from the greatest LSN and going back in time, undoing all the changes of these uncommitted transactions. And it's just like the normal runtime operation when we have an aborted transaction: any time we undo an operation from an aborted transaction, we have to create a CLR, so we add new CLRs for these as well. The runtime procedure for aborting a transaction is the same as the recovery procedure for aborting a transaction; you always write the CLRs to record what reversal you actually applied.
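Before the example, here is a sketch of that redo pass with the three skip conditions spelled out; `pages` stands in for fetching pages from disk, and the record layout is the same invented one as in the earlier sketches (CLRs are assumed to carry `page`, `key`, and `after` fields just like updates).

```python
# Sketch of the redo pass: scan forward from the smallest recLSN and repeat history.
def redo(log, dpt, pages):
    if not dpt:
        return
    start = min(dpt.values())                    # smallest recLSN of any dirty page
    for rec in log:
        if rec["lsn"] < start or rec["type"] not in ("UPDATE", "CLR"):
            continue
        pid = rec["page"]
        if pid not in dpt:                       # 1. change already made it to disk
            continue
        if rec["lsn"] < dpt[pid]:                # 2. earlier than the record that dirtied the page
            continue
        page = pages[pid]                        # fetched from disk if not already in memory
        if page["page_lsn"] >= rec["lsn"]:       # 3. on-disk page is already at least this new
            continue
        page["data"][rec["key"]] = rec["after"]  # reapply the change
        page["page_lsn"] = rec["lsn"]            # no CLR, no forced flush during redo
```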
So let's say in our example here that after this point in the log we crash. When we do our analysis, we build out our ATT and our DPT, and they end up looking like this. To save space I've put multiple log entries on a single line; the checkpoint-begin and checkpoint-end records shown together are really separate LSNs, I'm just drawing them on one line.

All right, so the first thing we see in our ATT is that transactions T2 and T3 did not commit, so we know we're going to have to undo them, and their status in the ATT will be undo candidates. We also know the last LSN for each of these transactions, which corresponds to the last modification of theirs that got written to the log. So we go ahead and start undoing. First we undo the modification that transaction T2 made at LSN 60, and we also record a pointer saying that the next log record we need to undo for T2 is the one at LSN 20 up here. Think of this like the previous LSN: it keeps track of how to jump back in time and quickly find the next thing to reverse. You don't strictly need it; it's there for performance, because you don't want to scan the entire log to figure out what the next thing to undo is when you can jump straight to the record you want.

Now let's say we get to this point and we start undoing transaction T3. Same thing: we reverse the change at LSN 150, and then we write the transaction-end record to say T3 is actually done. But before we can undo the next operation from transaction T2, we crash. At this point, since we have transaction T3's transaction-end record, we can flush the write-ahead log to disk, so we know all of those changes are durable; but then we crash again during recovery. When we come back, we lose everything we had in our in-memory ATT, but we redo the analysis phase, and now the ATT only needs to contain T2, because we saw T3's transaction-end record telling us it finished reversing all of its changes, so there's nothing left to undo for T3; everything for it is durable.

So now we jump to LSN 70 and follow the chain back to figure out what we still need to undo. Because the CLR at LSN 70 already undid the change at LSN 60, we don't need to undo that again; we only need to undo the one at LSN 20. So we write just that one CLR, then we flush out the log, and the transaction is fully handled. The way to think about this is: when I came back the second time, I looked in the log and saw that I already had a CLR for transaction T2 that reversed the change at 60, and I know it was durable because I flushed the log when transaction T3's end record went out. So on the second recovery I don't need to undo the change at 60, because that was already undone at 70; I only need to undo the one at 20, which is why there's only a single new CLR, for the operation at LSN 20. Then I know that's the end of all the changes from this uncommitted transaction, I can write out the transaction-end record, and the database is in the exact state it should have been in at the moment of the crash, without any uncommitted transactions.
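Here is a rough sketch of what that undo loop could look like. All the names (att, log.write_clr, buffer_pool.restore, and so on) are illustrative rather than a real API, and it assumes prev_lsn is None at a transaction's begin record. The key detail is that every CLR stores an undo-next pointer, so if we crash in the middle of undo and come back, we resume from the last CLR's undo-next instead of re-undoing work we already reversed, which is exactly what happened with T2 in the example.

```python
# Rough sketch of the undo phase. Each update record is assumed to carry
# prev_lsn and before_image; each CLR carries undo_next. Hypothetical names.

def undo_phase(att, log, buffer_pool):
    # Start from the largest LSN that still needs undo across all losers.
    to_undo = {txn.last_lsn for txn in att.values()
               if txn.status == "UNDO_CANDIDATE"}

    while to_undo:
        lsn = max(to_undo)          # walk backwards through the log
        to_undo.discard(lsn)
        rec = log.read(lsn)

        if rec.type == "UPDATE":
            # Reverse the change and log a CLR whose undo_next points at
            # the record *before* this one for the same transaction.
            clr_lsn = log.write_clr(txn_id=rec.txn_id,
                                    undo_of=rec.lsn,
                                    undo_next=rec.prev_lsn,
                                    before_image=rec.before_image)
            # Restore the before-image and stamp the page with the CLR's LSN.
            buffer_pool.restore(rec.page_id, rec.before_image, clr_lsn)
            next_lsn = rec.prev_lsn
        elif rec.type == "CLR":
            # Already undone (possibly on an earlier restart): just follow
            # its undo_next pointer, never undo it again.
            next_lsn = rec.undo_next
        else:
            next_lsn = rec.prev_lsn  # e.g. BEGIN has nothing to undo

        if next_lsn is None:
            log.write_txn_end(rec.txn_id)   # this loser is fully undone
        else:
            to_undo.add(next_lsn)
```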
And at this point I'm ready to start accepting new transactions and updating the database. Everyone's eyes are bleeding, everyone's falling asleep; don't feel bad, ARIES is very painful to get through. And again, this is another great example of why you always want to use a database management system rather than building something yourself. Unless you're like us and build database systems for a living, you don't want to roll your own recovery for some JavaScript application or whatever you're building, because doing all of this is super hard and really hard to get correct. I'm trying to cover it in an hour and a half, and there are plenty of other corner cases we're not even talking about here. This is why database systems that are vetted, well written, and have been around for a long time are going to be safer than anything you can whip up on your own in a day. ARIES is hard, implementing it is hard, and you don't want to have to do it unless you're building a database system.

Okay, so what happens if we crash during the analysis phase? This one has a pretty obvious answer: when we come back online, we just redo the analysis. You were in the analysis phase, you hadn't changed anything yet, so who cares. What if you crash during the redo phase? Yes, you start recovery over and redo everything again. And we talked about this before: how do you limit the amount of work you have to do in the redo phase? Again, because we're in recovery mode we're not accepting new transactions from the outside world, so we don't have to flush everything immediately; we can flush dirty pages in the background and keep processing. As long as we make sure that the log entries for an updated page get written out before the dirty page itself gets written, everything will still be durable and correct. For limiting the amount of work in the undo phase, the only thing you can really do is, in your application code, avoid long-running transactions. If you have a transaction that takes five hours to run, then you're going to have to go back and undo five hours' worth of work during the undo phase; there's nothing the database system can do to speed that up.

Okay, so to finish up, the main ideas we talked about with ARIES are these. We use write-ahead logging with LSNs to keep track of the order of things, both going forward in sequential order and, for a particular transaction, going backwards in time through the previous-LSN pointers. We use steal + no-force, making sure that dirty pages in the buffer pool can only be written out to disk once the log records that modified them have been written out. We use fuzzy checkpoints to record which transactions were running and which pages were dirty in our buffer pool manager at the moment we took the checkpoint. During recovery we redo everything since the earliest dirty page and then undo anything that did not commit by the end of the log. And we use CLRs when undoing, both at runtime and during recovery; that's what lets us survive failing multiple times during restarts.
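To tie the steal + no-force point back to write-ahead logging, here is a tiny sketch of the one check the buffer pool has to make before it writes a dirty page out, whether that's during normal processing or during the background flushing mentioned above for the redo phase. The names (flush_up_to, write_page) are made up; the invariant is simply "log first, page second."

```python
# Minimal sketch of the write-ahead rule the buffer pool must enforce:
# before a dirty page goes to disk, every log record up to that page's
# pageLSN must already be durable. Hypothetical names throughout.

def flush_dirty_page(page, log, disk):
    # Force the log up to (and including) the last record that touched
    # this page. This is the only "force" WAL requires; the page itself
    # can be stolen and written out even for uncommitted transactions.
    log.flush_up_to(page.page_lsn)

    # Now it is safe to write the page; if we crash right after this,
    # the log has enough information to redo or undo what is on it.
    disk.write_page(page.page_id, page.data)
    page.is_dirty = False
```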
And again, these LSNs are really important, because they let us figure out exactly the order in which we should replay all of these operations. As we'll see, and as you'll see if you take the advanced class, these LSNs become a big bottleneck: if you have multiple cores modifying the database at the same time, having everyone try to grab the same atomic counter requires synchronizing across all your cores or sockets. So there are a bunch of schemes that hand out LSNs in batches, so the different sockets don't need to coordinate with each other every single time they need a new LSN; there's a little sketch of that idea at the end of this part.

All right, so any questions about recovery? Yes? So the question is how ARIES works with multi-versioning. It works exactly the same. His point is that with multi-versioning there are a bunch of optimizations you can do, because the old version is always sticking around, so maybe you don't need to reapply a change in the same way. There are a ton of optimizations you can do with MVCC that I'm not covering here; we're doing simple in-place updates with strict two-phase locking, and if you understand that, you can work out how to apply it to a multi-version environment.

Next question: so if you had a system that was running in a constant recovery mode... yeah, so his question is this. I said that in recovery mode the database system essentially has to block all new transactions until the system is back in the correct state, and only then can it start accepting new connections. Are there any database systems with a sort of lazy recovery scheme, where the system boots up, recovers the database in the background, and in the meantime can still take new transactions? I know there are papers on this. Goetz Graefe, the guy that did Volcano, has something called instant recovery, where essentially, after you finish the analysis phase, you allow transactions to run and read and write data as long as they don't touch anything you still need to redo or undo. I don't know whether anybody actually does that in practice. And in our own system we have a technique called write-behind logging, where it's sort of the same thing: we know which pages have been recovered, and as long as you don't touch the ones that haven't, you can read and write anything else. As far as I know, MySQL and Postgres don't do this. You'll see this if you pull the plug on MySQL and boot it back up: look in the log (not the write-ahead log, the debug log) and it'll say it's restoring the database from the checkpoint and the log. Pretty much everybody does it this way.

So at this point in the semester I've taught you everything you need to know, whether you remember it or not, to build a single-node ACID database system. You can quit CMU and go do a startup, or go work on Oracle. Everything we've talked about so far is what you need for a system that supports ACID transactions and can process queries and run things correctly.
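Before we switch gears, here is the promised sketch of the LSN-batching idea from the question above: each worker thread reserves a whole block of LSNs from the shared counter at once, so the cores only synchronize once per batch instead of once per log record. This isn't any particular system's scheme, just the general shape of it; real designs also have to handle ordering and holes in the log.

```python
# Simplified sketch of handing out LSNs in batches to reduce contention
# on a shared counter. All names here are illustrative.
import threading

class GlobalLsnCounter:
    """Shared counter; a real system would use an atomic fetch-and-add."""
    def __init__(self):
        self._next = 1
        self._lock = threading.Lock()

    def reserve(self, n):
        with self._lock:
            start = self._next
            self._next += n
            return start

class PerThreadLsnAllocator:
    """Each worker grabs BATCH LSNs at once, then hands them out locally,
    so the shared counter is only touched once per BATCH allocations."""
    BATCH = 64

    def __init__(self, shared):
        self.shared = shared
        self.next_lsn = 0
        self.limit = 0

    def allocate(self):
        if self.next_lsn >= self.limit:
            self.next_lsn = self.shared.reserve(self.BATCH)
            self.limit = self.next_lsn + self.BATCH
        lsn = self.next_lsn
        self.next_lsn += 1
        return lsn
```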
So now we're going to switch over, and I'll talk about how to do all of this in a distributed environment. The basic ideas are still going to be the same: we still need concurrency control, we still need recovery, we still need ARIES-like things. But in a distributed environment things are a lot more tricky, because now you have to coordinate across multiple machines. Some of the algorithms will be the same, but the delay in sending messages can be much longer, and in some cases messages can disappear entirely, so you need to do additional things to account for that. So at this point in the semester, when we come back after the holiday break, we'll start talking about how you build a distributed database system. We'll first talk about how to do distributed transaction processing, and then we'll talk about how to do distributed OLAP, or analytical processing. Okay, any questions? All right guys, see you next week, have a good holiday.