So DJ's out — his, what do you call it, his second girlfriend found out about his third girlfriend, so he went to Vegas to hide out. That's his problem; he has to deal with that. All right, so I've been gone. What happened to me? Well, my wife had a kid. In theory. Let's be honest here: it doesn't look like me, and I'm not sure it's mine yet, so we're waiting for the paternity test to come back. So just hold off, okay? The other thing I want to update you on is that at the beginning of the semester I said I only care about two things in my life: number one was my wife, number two was databases. I have an updated version for everyone now. The new version is that my wife is still number one, databases are still number two, and, again depending on the paternity test, the baby is just getting started there, okay? I'll say one thing to everyone here: do not get pregnant or get anybody else pregnant before you finish graduate school. It is a nightmare. I have done nothing for the last two weeks except cleaning poop diapers and vomit and stuff like that. It's awful. My poor wife is at home with the kid right now.

All right, so for those of you that don't have any newborn children to take care of, here's what you have to do — this is what's coming up for you in the next month. Homework 4 is due two days from now, on Wednesday at midnight. Project 3 will be due at the end of this week, on Sunday at midnight. And then — I'll announce this on Piazza and post it on the website — we'll do the first checkpoint for the extra credit on Sunday, November 24th, one week after Project 3 is due. What the checkpoint means, basically, is that you'll submit the URL to the article that you've been working on, and myself or the TAs will look at it, give you feedback and suggestions, say what looks right and what doesn't, and that will give you guidance toward the final submission. I'll just say up front that you won't get full credit for the extra credit unless you submit the checkpoint. If you just submit the final thing at the very end without giving us a chance to give you feedback, you won't get full credit. I'll update the document to provide information about all these things. After that, there's one more homework due in December, and one more project due in December as well. Okay, so we're almost done. Any questions? Okay.

So let's talk about logging. The idea of logging and recovery is that we obviously want to be able to persist any changes we make to the database whenever there's a crash or a failure. In order to understand this problem, let's look at the kind of system we've talked about so far, see what the problems are, and then go back, add logging and recovery, and see how to handle the issues of making sure everything is durable and safe.

Say I have a simple transaction T1 that does a read on A and then a write on A. At the very beginning there's nothing in our buffer pool — we haven't brought anything into memory — and we only have one page, which has the object A in it, out on disk. So when our transaction starts, it does the read on A, we go fetch that page from disk, and we bring it into our buffer pool. That's good; we know how to do that.
We've talked about that already. So now, when I want to do the write on A, I modify the object as it is in the buffer pool — I make the change. Then my transaction says it wants to commit. What has to happen here? The application tells us it wants to commit. When do we tell the outside world that the transaction has actually committed? Say we immediately check that there are no deadlocks, no timestamp violations or validation issues with the transaction, and we immediately tell the outside world, yeah, your transaction committed. What could happen? Well, our change is just hanging out here in memory. It passed all our concurrency control checks — that's all fine — but it's still sitting in memory. So now if the most evil person for databases comes along — like the Hitler of databases — and takes out our data center or our machine and we lose power, then all the changes that were sitting in volatile memory are gone. We never wrote anything out to disk. So if we immediately tell the outside world, hey, your thing committed, and nothing got persisted to disk, we could lose power right away and lose all our changes. Now we've told the outside world you committed, but you come back and your changes aren't there — and that's bad.

This, at a high level, is the problem we're trying to deal with today. It's called crash recovery, and logging schemes are the mechanism we're going to use to prevent these things. The recovery algorithms we're talking about today are the techniques the database system uses to ensure that, for all the changes that transactions make, the database is consistent, the changes are atomic, and the changes are durable. So we care about A, C, and D in the ACID acronym. We don't care about isolation today, because that's handled by the concurrency control protocols — that's about who can read whose writes. This is really about how we make sure our changes are atomic, consistent, and durable.

Every recovery protocol or mechanism has two parts. The first part is all the things we do at runtime, while the system is running and we're executing transactions and queries, that set us up so that if we have to recover after a failure, we can do it without losing any information. The second part is, after a restart or a crash, how we use the information we collected in the first part, when we were running normally, to go back and put the database into the correct state. Today's lecture is focused on the first part: what do we do at runtime to make sure we collect the right information so that we can restore the database correctly? The second part is on Wednesday: after we crash, we look at that information, figure out what the hell we actually did, and put ourselves back into the correct state.

So for today there's a bunch of stuff we need to talk about before we get to the actual method we're going to use. First we'll talk about what kinds of failures we could have in our system, and which ones we can and cannot recover from.
Then we'll talk about how we're going to manage memory in our buffer pool in a slightly different way than we have so far, so that we can ensure we can recover after a crash. Then we'll look at two techniques for doing database recovery. The first is shadow paging; the second is write-ahead logging. Spoiler: write-ahead logging is the better way — it's what every single database actually uses — but it's good to know what shadow paging is, for historical reasons and to see why write-ahead logging is superior. Then we'll talk about the different types of logging you can do with write-ahead logging — what's actually in the log record itself — and we'll finish up with checkpoints, which segues into what we'll talk about on Wednesday for doing recovery. Okay.

Right, so the database system itself is conceptually divided into different components based on the underlying storage device they operate on. The buffer pool manager keeps things in memory; the disk manager keeps things on disk. One is volatile, one is non-volatile. Based on that, we want to keep track of and understand how these different components can fail, or have problems, under the different types of failures we can incur while we're running transactions and queries. And then, based on that, we can figure out what we actually need to support in our recovery protocol.

There are three categories of failures, and we'll go through them one by one: transaction failures, system failures, and storage media failures. The spoiler, or the heads-up for what I'm about to say, is that we care about the first two; the third one is impossible to handle, and we'll see why as we go along.

Transaction failures cover all the things we've talked about so far with concurrency control. These are cases like when a transaction hits a deadlock, or a transaction tries to update something it's not allowed to update, or update a value in a way that isn't allowed. These are cases where we can't allow the transaction to continue, so we have to abort it and roll back its changes. Logical errors are the transaction trying to violate some integrity constraint placed on the database, or a referential constraint — like if you try to insert something with a foreign key reference and that foreign key doesn't exist. The database system says you can't complete, your transaction has to fail, and we have to make sure all your changes get rolled back and never persist, even if we restart multiple times. Internal state errors are the things we talked about with two-phase locking and timestamp ordering: if we have a deadlock between two transactions, we have to kill one of them, abort it, roll back all its changes, and make sure they don't persist after a crash. Our database system — our logging protocol — needs to handle both of these. These are sort of obvious.

Then we get to the system failures — the software and hardware failures — that we have to account for in our protocol as well. The first is a software failure. This is where the database system itself is buggy: there's some crappy code in the database system, like a divide by zero, and now the database system itself crashes.
It segfaults and aborts. We need to be able to account for those kinds of failures and make sure that any transaction that was still running gets aborted and rolled back correctly, and that for any transaction that did commit before the error occurred, all its changes are persistent.

A hardware failure is when the actual machine our database system is running on crashes or ceases to operate. This could be someone tripping over the power cord, or a loose wire plugging into the disk drive. The system has a failure and can't keep running; the operating system crashes, the database system crashes, and we need to come back and recover the database state. In order to be able to handle this, we have to make the fail-stop assumption: we assume the hardware is not going to suffer unrecoverable damage when we have a hardware failure. Meaning, if we have a spinning-disk hard drive with the needle writing on the platter and we pull the power, the needle is not going to careen into the platter and start messing up sectors. We assume that if we crash from a hardware failure, we can always come back and recover the correct state.

The last category is the failures we can't handle at all in our database system, simply because the database system, as great a piece of software as it is, can't bend the principles of matter to its own will. A non-recoverable hardware failure would be something like the needle crashing into the platter, or the machine catching fire and melting all my disks. No database system can recover from that, so we're not going to design our protocol to account for it. We can do other things, like replicate the database or maintain archival backups that we can restore after a crash, but that's not really recovery in the sense we're talking about today. If I have to restore from an archived version, that's just backup and restore — that's just me loading the data back in from a separate copy. It's not doing the extra special things we're talking about today to recover the database state. So again, we only care about the first two kinds of failures. No database system can account for the third, but through redundancy, which we'll talk about with distributed databases, we can try to avoid it or mitigate the issue. Okay.

So the entire semester we've been talking about disk-oriented database systems. We've covered this already, but now we need to go over it again and see how it becomes an issue when we talk about logging and recovery. The disk-oriented architecture says that the system is designed such that the primary storage location is assumed to be on disk, and any time you want to read or manipulate a record, you first have to copy it into memory — into your buffer pool — make your change, and then eventually write it back out to disk in order to persist it. This is the von Neumann architecture from the 1950s.
It's not specific to database systems; it's the operating assumption we're basing our entire discussion on. There are some special newer hard drives that have CPU cores on the disk itself, so you can try to manipulate the data down there, but that just moves the problem somewhere else. For our purposes we can ignore that and say: whatever we want to modify, we bring into memory, make the change, and then write back out to disk.

So the question is going to be: when do we actually write those changes out? When you built the buffer pool manager earlier, all you were really doing was saying, all right, this page should be evicted, it's dirty, so I have to write it back out to disk. You didn't worry about who actually made that change, or whether it was the right time to write it out to disk. That's what we need to account for in our logging protocol today.

The basic guarantees we need, in order to provide atomicity and durability, are these. First, if we tell the outside world that a transaction is committed — meaning we send an acknowledgement saying you've committed — then all those changes are persisted and durable forever. Someone may come along later and overwrite those changes; that's fine, but until then those changes should always persist. And likewise, if a transaction makes changes and those changes make it out to disk, but then the transaction aborts — it doesn't complete correctly before the crash — we need to make sure we can reverse those changes as well. Those are the two main guarantees we need from our logging protocol.

The core principles we're going to use to provide these guarantees are undo and redo, which are exactly what they sound like. Undo is information we maintain that allows us to reverse any change a transaction has made to an object in the database. It's like: here's what the old value used to be for this attribute of this tuple, and I store that somewhere so that if I ever need to reverse the change someone made, I can always go back and put the old value back in. Redo is the opposite: redo is the information needed to reapply a change that a transaction made to an object in the database. Here's the change they made at this given time; if I ever need to make that change again, my redo information tells me how. Based on these two primitives, we can build something more complex — a logging protocol that generates this information at the right time and in the right way to let us restore the database after a crash. But how we actually use undo and redo, and when this information actually gets written to disk, depends on how we manage dirty pages in our buffer pool.
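To make these two primitives concrete before we get to the full protocol, here is a minimal sketch — my own illustration, not the actual protocol we'll build — where each database object is just a named integer, and every write remembers a before-image for undo and an after-image for redo:

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

// Toy in-memory "database": one integer value per named object.
std::unordered_map<std::string, int64_t> db;

// The undo/redo information for one change: the value before the write
// (used to reverse it) and the value after the write (used to reapply it).
struct ChangeRecord {
  std::string object_id;
  int64_t before;  // undo image
  int64_t after;   // redo image
};

// Apply a write and remember enough information to undo or redo it later.
ChangeRecord WriteObject(const std::string &object_id, int64_t new_value) {
  ChangeRecord rec{object_id, db[object_id], new_value};
  db[object_id] = new_value;
  return rec;
}

// Reverse the change (e.g., the transaction aborted).
void Undo(const ChangeRecord &rec) { db[rec.object_id] = rec.before; }

// Reapply the change (e.g., it was lost before reaching disk).
void Redo(const ChangeRecord &rec) { db[rec.object_id] = rec.after; }

int main() {
  db["A"] = 100;
  ChangeRecord rec = WriteObject("A", 999);
  Undo(rec);  // A is back to 100
  Redo(rec);  // A is 999 again
  std::cout << "A = " << db["A"] << "\n";
}
```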
So let's look at a more complex example. T1 does a read on A and a write on A; T2 does a read on B and a write on B. In this case we're not worried about deadlocks or concurrency control — we just assume the transactions are allowed to acquire whatever locks they need. All we care about at this point are the low-level changes they're making to these objects.

Transaction T1 starts and does a read on A. There's only one page in our database, so in order to do the read on A we first have to bring it into our buffer pool, and then the transaction is allowed to read it. Then it does the write on A. The page is already in our buffer pool, and assuming we can get the latch on it, we can go ahead and make our change, updating it directly in place — we're ignoring multi-versioning for now. We make our change right there and the operation finishes. Now we have a context switch. T2 starts running and does a read on B. The page is already in memory, so that happens right away. Then it does a write on B. Again, it's already in memory, we assume we get the write latch on it, we make our change, and we're fine.

So now T2 wants to commit. What needs to happen here? There are two decisions we have to make. The first is: in order to tell the outside world that our transaction is committed, should we force the buffer pool to flush all the changes it made to this page out to disk? Yes or no? Yes, right — because otherwise, if I crash and Hitler comes along and takes us out, all my changes are gone. But what's the issue? Correct: T1 modified A in the same page. So should I be allowed to write out a page that's been modified by a transaction that has not committed yet? He's shaking his head no — but what's the problem? B is in there, T2 made that change, T2 wants to commit, and it's allowed to. But there's this other change in the page from an uncommitted transaction.

Let's say I decide it's better for me to write out T2's changes even though T1 hasn't committed yet, so I write the page out to disk. Now I tell the outside world T2 is committed. I go back to T1, but now T1 aborts. So what needs to happen? Yes — I need to roll back the transaction, so I need to roll back the change it made to A. I can do that in memory pretty fast; that's not a big deal. But I already wrote out the page — with the change that T1 made to A — to disk. So now I have to make the reverse change in memory and then write the page out again to reverse the change I made. What's the problem with that?
I don't have any of this I only have what's on disk and now I have a change from t1 that I that shouldn't be there But I don't know it shouldn't be there because I have no extra information to tell me that t1 did not actually commit So the two things we talked about here Where the two decisions we had to make were here Whether we should require to force the transaction to write out all 30 pages out the disc before it's allowed to commit and Whether or not we're allowed to to to copy out a page or evict a page from our buffer pool from a transaction that has not committed yet So these two policies are called steel and force So the steel policy says whether transaction whether database systems allows a uncommitted transaction To overwrite the most recent committed value of an object in the database out on disk Before it's allowed to commit So if you say if steel if you're using a steel policy then you're allowed to do this if you're using no steel Then it's not allowed The way to think about this is if I'm running out of space in my buffer pool For one transaction is that transaction allowed to steal a page in the buffer pool or slot in the buffer pool From another transaction that has not committed yet Right, that's why it's called steel The force policy says whether we require that all Updates that a transaction makes to any object in the database have to be written to disk first Before it's allowed to commit So if you say I'm using the force policy, then it's required to do this If you're using no force, then it's not required so Forcing is gonna make it our life easier Because it's gonna allow us to recover really quickly because we just come back and we see all our changes are there Right, we don't have to you know look at any other place to try to you know redo information to redo the changes But the steel policy would be problematic because now we're gonna be writing out changes from transactions that have not committed So let's look at one way to do this Let's look at the no steel force policy right because they're sort of they have conflicting goals and you know you can only choose you know two combinations of these two things So no steel force means that No steel says that any uncommitted changes from any changes made by uncommitted transaction cannot be written to disk And the force says that Any change all changes that a transaction may have to be written to disk before the transaction is allowed to commit So t1 starts does a read on a we bring that a buffer pool. That's fine Now we do the right on a update the page in our buffer pool Then we do a context switch to t2 t2 does a read then does the right we update the buffer pool Then now once I go ahead and commit Again, the force policy says that all the changes for this transaction made have to be written out the disk But we have this This change from t1 hanging out here as well So we need to get rid of that So what do we need to do? 
So let's look at one combination: the no-steal + force policy. The two policies have conflicting goals, and you pick one setting for each. No-steal means that changes made by an uncommitted transaction cannot be written to disk, and force means that all changes a transaction made have to be written to disk before the transaction is allowed to commit.

T1 starts, does a read on A, we bring that into the buffer pool — fine. Now we do the write on A and update the page in our buffer pool. Then we context switch to T2; T2 does its read, then its write, and we update the buffer pool. Now when T2 goes to commit, the force policy says that all the changes this transaction made have to be written out to disk. But we have that change from T1 hanging out in the same page, and we need to get rid of it. So what do we need to do?

Right — copy the page in memory. We apply only the change we want — or, equivalently, reverse the other change we don't want — in the copy, and then we go ahead and write that copy out. So now when we come back over here and abort T1, it's super trivial to reverse the change T1 made, because it's just updating this page in memory. We don't have to go out to disk, because no dirty change from T1 ever got written out there.

The database system maintains some extra metadata to keep track of the write sets of these transactions — you already saw this with two-phase locking and OCC in our concurrency control protocols. We already have information about what changes each transaction made to what objects, so it's not that much extra work to reverse the change when we make the copy. And it's all in memory, so it should be pretty fast.

So does this seem like a good idea or a bad idea? What's one good thing about this approach? I've already said it — it's actually right there: it's super trivial to roll back after a crash, because there's nothing to roll back. I know that anything out on disk came from committed transactions.

What's that? Sorry — is there a dialogue between two transactions? His question is, going back to my example: one transaction wants to write something out to disk and makes a copy, but another transaction may be committing at the same time, having modified the same page, and would make another copy. We can ignore that — you have to have a latch protecting these things, there's no way to get around that — but we can ignore all of it here.

So there are two problems — actually three problems. Yes? You're close. He said that now, when you're processing a commit, this copy is more work; it's on the critical path — in the critical section of the commit protocol for the concurrency control mechanism — so commit becomes more expensive. Yes, absolutely right. But even more than the CPU cost, you have to write this page out multiple times. Going back here: say T1 didn't actually abort and instead committed. Then, in order to get its change to A out to disk, I'm going to write the same page out again. So for every transaction that commits, I potentially have to write out the same page over and over.

Yes — exactly, he's absolutely right. One big issue with this: in my simple example I have one page. Say it's four kilobytes; the hardware can guarantee an atomic four-kilobyte page write. But if I update multiple pages, the hardware can't guarantee that for me. So if I update four pages, I write out the first two, then I crash before I get to the next two — now I come back and I have torn updates. That's one big problem. There's another big one that's more nuanced.
So this is sort of a trivial example but in this case here for a given transaction Since I can't write out any dirty data from an uncommitted transaction to disk That means that I can my my right set of my transaction has to fit entirely in main memory So if I have a table that has one billion tuples I have a single query that wants to update all one billion tuples But I can only store one million tuples in my buffer pool Then I can't run that transaction in under this this this this system Because I'll hit the first one million update them. That's fine Then I try to get the million plus one and then I'm running out of I ran out of space So he says can't you is there a workaround can you write into a temporary spot? Yes, give me two slides That is that is that is the solution. Yes, it's not a good one, but it's one Okay so No steel force is the most easiest way to actually implement a Recoverable correct durable buffer pool manager and a disk-oriented database system Because I don't have to do any redo after a crash. I just come back my database is guaranteed to be in a correct state and I never have to go you know undo anything from an unaborted transaction at runtime because I know none of its changes ever made it out the disk But as we said already that you you can't support transactions that have a right set that exceed the amount of memory that's available to you The commit protocol is now more expensive because you have to do all these extra extra work to figure out You know what things actually should be written to disk versus not written disk and The and you're doing multiple rights out to disk You know for every that could it just been one right? Now for every single transaction you're writing the same page over and over again And if you're an SSD those things actually can't be written forever, right? You can burn out the cells in the SSD, right? You know it's in the you know hundreds of thousands rights per cell But eventually to keep doing this if you're just writing a lot you'll burn it out in a short amount of time So nobody actually does what I'm describing here, right? It is the easiest way to implement it But nobody actually does this the thing that he alluded to That people have tried before is to basically store the changes you're making from uncommitted transactions in a temporary space and Then at some point when the transaction commits you somehow resolve the the directory or the page table to now say here Are actually the correct the latest versions of our of our pages and that way if you crash You just ignore anything that that got modified in those in those temporary buffers So this is what shadowpaging is we briefly touched about this in the very beginning when we talk about current control This is one way to do a no steel force buffable management system that avoids some of the complications we talked about before So a shadowpaging works. It's like it's sort of like multi-versioning But at the page level instead of the tuple level And there's only going to be two copies at any given time There's always the master copy. 
The thing he alluded to, which people have tried before, is to store the changes you're making from uncommitted transactions in a temporary space, and then at some point, when the transaction commits, you somehow update the directory or the page table to say: these are now the correct, latest versions of our pages. That way, if you crash, you just ignore anything that got modified in those temporary buffers.

This is what shadow paging is — we briefly touched on it at the very beginning when we talked about concurrency control. It's one way to do a no-steal + force buffer pool management scheme that avoids some of the complications we just talked about. Shadow paging works sort of like multi-versioning, but at the page level instead of the tuple level, and there are only ever two copies at any given time. There's the master copy — the latest, most recently committed version of the database — and there's the shadow copy that new updating transactions are going to end up modifying. When a transaction commits, we want a way to atomically switch the shadow to become the new master, and we can do this in such a way that we don't have to worry about torn writes even if we're updating multiple pages.

So unlike multi-versioning, where we copy every single tuple we're going to modify, here we do it at the page level, and you can organize the directory of your pages as a tree structure. Then you only need to copy portions of the tree and do path copying to apply your updates, without having to recreate the entire page table all over again. At the root of this tree is the database root, which always points to the latest master version. That means we can make a bunch of changes to the lower portion of the tree, update the leaves to point to the new pages we just created, and then, when we're ready to apply the changes atomically across all those pages, we just swing the database root pointer to point to the shadow portion of the tree, and all our changes immediately become visible.

At a high level it looks like this: there's the database root, it points to the master page table, and the master page table points to our pages out on disk. I'm going to go through this quickly, so let me skip to the example. Say a transaction T1 comes along. Any transaction that's read-only can always go to the database root, follow it to the master copy, and see a consistent version. But if we have an updating transaction, we create a shadow page table that the transaction is going to modify. At the very beginning, all of the shadow page table's entries point to the same pages that the master page table points to. Then, as this transaction starts modifying pages, we make a copy of each page into a new location in our temporary space on disk and make our changes there, and we keep doing this for all the other pages we want to modify. When the transaction says it wants to commit, all we need to do is update the database root — which is stored in a single page — to point to the shadow page table. We flush that change out to disk, and once it's durable we swing the pointer in memory, and now everyone can follow it. So if a new transaction comes along and wants to read the latest version of the database, it just follows this root and finds the shadow page table, which is now the master.

Yes — the question is: why is the database root written to disk?
Right — say I'm here, my transaction says it wants to commit, and I want to tell the outside world I committed. If I don't update this database root and I crash and come back, my database root is still pointing at the master page table, and all these changes are gone. Correct, yes — it's a page number, a page ID. That pointer has to be durable, because if I tell the outside world I committed and then crash, when I come back I just assume that whatever this thing points at is the root of the page table. If it still points at the old master, then I told the outside world I committed, but all the changes I made in the shadow page table are gone. They're still physically there on disk; it's just that no one can see them — you can't logically reach them, so it's as if they never existed.

Your question is: can you be clever and, when you write out these pages, put a little mark in them saying you're the latest version? All right, so this transaction updated three pages; I'd need to record somewhere that it updated three out of n pages, so that when I crash and come back I see all those changes — and there may be some out-of-date pages lying around, so something has to tell you which ones are the most up to date. You'd say: whatever transaction commits writes some marker, maybe a timestamp for the transaction or something, you keep that there, and on recovery you scan all of that to reconstruct the database root. But my database is a hundred petabytes — whereas this database root is one page that has everything I need. You can always think in extremes, yes.

Excellent — so he said, in my example here I have one transaction; what if I have a bunch of transactions at the same time, how does this work? You either have to have only one transaction running at a time, which is what SQLite does, or you have to commit them in a batch. So ignoring two-phase locking — all of that is orthogonal to this — assume we have a way to figure out who's allowed to update what. If I have multiple transactions in the same batch updating things, I have to wait until they all finish, then they all get committed together: I swing my database root pointer, and everything gets applied atomically. That's one way to do it, if you assume all transactions finish in a reasonable amount of time. If you have one transaction that takes an hour, then you either need to kill it after a certain amount of time, or everybody waits that hour before they all commit. Some systems do this, but it's rare; most systems don't operate this way. IBM did this in System R in the 1970s and abandoned it in the 1980s when they built DB2, because you get fragmentation issues: I blow away my master page table and the old pages because they're no longer visible, so what's left doesn't match the original ordering of pages on disk. Now when I do a sequential scan, I may not be reading things in the right order, so I can't do the clustering indexes they were trying to do back in the day. So nobody actually does this — and you have to either commit transactions in a batch or have one transaction committing at a time.

Yes — going back to the previous example with T1 and T2: how is that handled in this case? If you assume T1 and T2 are committing together in the same batch, then yes, that's exactly how it's handled, because I have the undo information I need in memory to reverse any change a transaction made — even if a page has been written out to the temporary space on disk, I can just reverse it there — and if I crash before I flip the database root pointer, then I'm fine.
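Here is a minimal sketch of the core shadow paging trick, assuming the whole page table fits in memory and ignoring the tree-structured path copying and the on-disk root page; the point is just that commit is a single pointer swing:

```cpp
#include <cstdio>
#include <memory>
#include <string>
#include <vector>

// Toy sketch: the "page table" is a vector mapping logical page number ->
// page contents. (A real system maps page numbers to on-disk locations
// and keeps the tables and the root pointer on disk as well.)
using PageTable = std::vector<std::string>;

struct Database {
  std::shared_ptr<const PageTable> master;  // the "database root" pointer

  // Readers always follow the root to the most recently committed version.
  const std::string &Read(size_t page_no) const { return (*master)[page_no]; }

  // An updating transaction works on a shadow copy of the page table.
  std::shared_ptr<PageTable> BeginUpdate() const {
    return std::make_shared<PageTable>(*master);  // copy; changes go here
  }

  // Commit = atomically swing the root pointer to the shadow copy. If we
  // crash before this single switch, the master version is untouched.
  void Commit(std::shared_ptr<PageTable> shadow) { master = std::move(shadow); }
};

int main() {
  Database db{std::make_shared<PageTable>(PageTable{"page0 v1", "page1 v1"})};
  auto shadow = db.BeginUpdate();
  (*shadow)[1] = "page1 v2";                 // modify the copy; master still sees v1
  std::printf("before commit: %s\n", db.Read(1).c_str());
  db.Commit(shadow);                         // pointer swing makes all changes visible at once
  std::printf("after commit:  %s\n", db.Read(1).c_str());
}
```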
Okay, so nobody actually does this, so let's jump through it quickly and get to the good stuff. The reason this sucks is that copying the entire page table is expensive — even if you use a tree structure, it's not cheap. The commit overhead is high, because you have to flush every single page you modified, plus the page table and the root. The data becomes fragmented, you need a background garbage collector, just like with multi-version concurrency control, and you either have to commit everything in a batch or only have one writer at a time.

As I said, the only systems I know of that actually do this: there's CouchDB, though I think CouchDB is moving away from that and switching over to RocksDB; LMDB is a tree-based system that uses mmap, so this is sort of hidden from the user; and System R in the 1970s, but IBM abandoned that in the 1980s. The one system you've probably heard of that does something similar is SQLite — but this is what SQLite did up until 2010, and then they ditched it too and switched over to what we'll talk about next, the write-ahead logging approach.

What SQLite would do is, instead of copying the page it's about to modify and making the modification in the copy, it keeps a copy of the original page: it writes that original out to disk, and then makes the modification to the master version in place. If you commit, you just blow away the copy you kept; if you crash before you commit, you look at that separate copy file and restore the original. They call this the journal file.
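Here is a minimal sketch of the rollback-journal idea (not SQLite's actual file format or API): before a page is overwritten in place, its original image is saved to the journal; commit discards the journal, and crash recovery copies the journaled images back:

```cpp
#include <cstdio>
#include <map>
#include <string>

// Toy model: "disk" is a map from page number to page contents, and the
// journal holds the original image of every page the in-flight transaction
// has touched. (A real journal is a separate file that is fsync'd before
// the corresponding database page is overwritten.)
std::map<int, std::string> disk;
std::map<int, std::string> journal;

// Before overwriting a page, save its original image in the journal.
void WritePage(int page_no, const std::string &new_contents) {
  if (journal.count(page_no) == 0) {
    journal[page_no] = disk[page_no];  // journal the before-image first
  }
  disk[page_no] = new_contents;        // now it is safe to overwrite in place
}

// Commit: all changes are in place, so the journal can simply be discarded.
void Commit() { journal.clear(); }

// Crash recovery: copy every journaled before-image back over the database
// file, undoing whatever the half-finished transaction wrote.
void Recover() {
  for (const auto &[page_no, original] : journal) {
    disk[page_no] = original;
  }
  journal.clear();
}

int main() {
  disk = {{2, "page2 original"}, {3, "page3 original"}};
  WritePage(2, "page2 modified");
  WritePage(3, "page3 modified");
  Recover();  // simulate a crash before Commit(): both pages are restored
  std::printf("page2 = %s\n", disk[2].c_str());
}
```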
So let's say my transaction wants to update page two. Before I modify page two in memory, I first make a copy of it and persist that copy on disk in the journal file; once that's done, I can modify it. Same thing for page three: before I can modify it, I make a copy into the separate journal file, then I go ahead and make my change. Now let's say that, before this transaction actually commits, we end up flushing page two out to disk, but before we flush page three and actually complete the commit, we crash — so everything in memory gets blown away. When we come back, we say: all right, I have a journal file, so I need to make sure that all the original versions sitting in this journal file get written back to the original database file. Again, this is what SQLite did up until 2010, and then they abandoned it for performance reasons in favor of the write-ahead log. Okay.

So the shadow paging approach will guarantee correctness, but it has performance issues, and the main one is that we're doing a bunch of random I/O. Going back to the SQLite example, when I have to replay the journal file, I'm updating random locations on disk to restore the database to the correct state. In the case of shadow paging, when I was flushing out all my changes — say my transaction actually committed — again, I'm doing random I/O to different locations to persist all the changes from the shadow copy. Even with the fast SSDs we have today, sequential I/O is always going to be faster than random I/O. So we need a way to convert all that random I/O into fast sequential I/O and still have all the durability guarantees we want from our logging protocol. That is what write-ahead logging is going to achieve for us.

The idea of the write-ahead log is that we maintain a separate log file on non-volatile storage, alongside our table heap, and as transactions make changes to the database, we make entries into this log file that record the changes that were made. Then, when a transaction wants to commit, we just need to guarantee that we flushed the log records it generated out to disk — not the actual changes to the objects or the pages in the buffer pool.
To say a transaction is committed and durable under write-ahead logging, all we need to flush is the log; we don't need to flush anything else. And the log file is just sequential I/O, because we keep appending to this one file, page after page, and that sequential I/O is much faster than the random I/O of writing out pages in random locations. The reason it's called write-ahead logging is that we need to make sure that any log record corresponding to a change made to an object in the database is written to disk before that object can be written to disk. That's the most important thing to understand about write-ahead logging.

Write-ahead logging is an example of steal + no-force: it's steal because we're allowed to write dirty pages out to disk before the transactions that dirtied them actually commit, as long as the log records have been written out first, and it's no-force because we don't require that all the changes a transaction made to objects be written to disk — we only require that the log records be written to disk. Again, this is the most important thing for understanding the protocol.

The way to think about this is: if I have a transaction that updates a thousand tuples, I potentially have to create a thousand log records. But say those thousand tuples are stored in a thousand pages — my thousand log records could all fit in a single page. So I only need to write one page to flush the log, rather than all thousand pages that were modified. There are a bunch of performance advantages we get from this approach.

The database stages all the changes a transaction makes in these log records in volatile storage — this is typically also backed by the buffer pool — and, as we already said, the transaction is not considered committed until we know all of its log records have been written out to disk.

The protocol works this way. When a transaction starts, we write a begin record into our log that tells us, hey, there's this transaction, it started, it exists — here's some metadata about it, like its identifier. Then, when the transaction commits, we write a commit record into the log, and we make sure that this commit record appears in the log after every log record corresponding to a change that transaction made. The log will potentially be interleaved with changes that other transactions are making, but for the one transaction we care about at this point in time, its commit record needs to appear after all of its changes. Once we see that commit in the log, we know there's no other change the transaction could ever make.
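To make that write-ahead rule concrete — a log record describing a change must be durable before the page containing that change can be written out — here is a rough sketch; the next lecture formalizes this with log sequence numbers, so the per-page bookkeeping below is just an illustrative stand-in:

```cpp
#include <cstdio>

// Simplified bookkeeping: log records get increasing sequence numbers as
// they are appended, `flushed_up_to` is the last sequence number that is
// durable on disk, and each page remembers the sequence number of the last
// log record that described a change to it.
struct LogState {
  long next_seq = 1;       // sequence number for the next appended record
  long flushed_up_to = 0;  // everything <= this is safely on disk
};

struct Page {
  int page_id;
  long last_log_seq = 0;   // last log record describing a change to this page
  bool dirty = false;
};

// The write-ahead rule: a dirty page may be written to disk only after the
// log has been flushed at least up through that page's last log record.
bool SafeToWritePage(const Page &p, const LogState &log) {
  return !p.dirty || p.last_log_seq <= log.flushed_up_to;
}

// If the page isn't flushable yet, the buffer pool can either pick a
// different victim or force a log flush first.
void FlushLogUpTo(LogState &log, long seq) { log.flushed_up_to = seq; }

int main() {
  LogState log;
  log.next_seq = 4;
  Page p{7, /*last_log_seq=*/3, /*dirty=*/true};
  std::printf("safe before log flush? %d\n", SafeToWritePage(p, log));  // 0
  FlushLogUpTo(log, 3);
  std::printf("safe after log flush?  %d\n", SafeToWritePage(p, log));  // 1
}
```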
So what goes in each log record? In the initial, simplistic version of the protocol I'm teaching right now, we really need to record four things: the unique transaction identifier — like the timestamp assigned to the transaction when it started; the object ID of the thing that was modified; the before value, which corresponds to the undo; and the after value, which corresponds to the redo. This information by itself is enough for us to recover the database after all of the failures we talked about at the very beginning.

Let's look at an example. We have T1, which does a write on A and a write on B, and in memory we have our write-ahead log buffer as well as our buffer pool. When the transaction starts and we begin, we add an entry to our log that says, hey, there's a transaction that just started. It's usually not done exactly when you call begin — unless you're running with auto-commit turned on, it's usually done at the first write — but different systems do different things.

Now I do a write on A. The first thing I need to do is add an entry to my log buffer that says: here's the change I'm making to A, here's the before value, and here's the after value. Once that's in my log buffer, I go ahead and make my change to the page in my buffer pool. We'll talk about this more next class, but the reason you have to write the log record before modifying the page is that there's this thing called the log sequence number that gets assigned to the log record, which we use to figure out which log entry corresponds to the change made to a particular page. You may think, who cares — since I'm not writing out to disk yet, can I just update the page first and then add the log entry? We'll see next class why you have to do the log entry first, followed by the page update.

Then I do the write on B: same thing, I add my entry to my write-ahead log buffer in memory, then I go ahead and make my change to the page. Now when I go ahead and commit, I add my commit record to my write-ahead log buffer, which at some point gets flushed out to disk. In this case, since we need to tell the outside world that the transaction committed, we immediately flush it out and do an fsync. Once we know it's durable and safe on disk, the transaction is considered safe: we can return to the application and say it's committed. Now who cares if Hitler comes along and kills our buffer pool — everything we need to replay the changes the transaction made is now safe on disk. Even if the page never got written out, we can just look in the log and replay it to update the page as needed, doing exactly the same thing the transaction did when it ran the first time. So at a high level, is this clear?
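Here is a minimal sketch of those records and of the commit path from the example — the four fields above plus a record type; real systems add log sequence numbers, checksums, and so on, so treat this as an illustration rather than an actual log format:

```cpp
#include <cstdio>
#include <string>
#include <vector>

enum class RecordType { kBegin, kUpdate, kCommit };

// The simplistic record format: who made the change, what object it
// touched, the before-image (undo), and the after-image (redo).
struct LogRecord {
  RecordType type;
  int txn_id;
  std::string object_id;  // unused for BEGIN/COMMIT records
  int before = 0;         // undo information
  int after = 0;          // redo information
};

std::vector<LogRecord> log_buffer;  // in-memory WAL buffer

// Stand-in for writing the buffer out to the log file and calling fsync.
void FlushLogToDisk() {
  std::printf("fsync %zu log records\n", log_buffer.size());
}

void Begin(int txn) { log_buffer.push_back({RecordType::kBegin, txn, "", 0, 0}); }

void LogWrite(int txn, const std::string &obj, int before, int after) {
  // Append the log record BEFORE the page in the buffer pool is modified.
  log_buffer.push_back({RecordType::kUpdate, txn, obj, before, after});
}

void Commit(int txn) {
  log_buffer.push_back({RecordType::kCommit, txn, "", 0, 0});
  FlushLogToDisk();  // only after this returns do we acknowledge the commit
}

int main() {
  Begin(1);
  LogWrite(1, "A", 1, 8);  // <T1, A, before=1, after=8>
  LogWrite(1, "B", 5, 9);  // <T1, B, before=5, after=9>
  Commit(1);               // flush, then tell the application it committed
}
```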
Yes — your question is: instead of storing the write-ahead log as a separate file, what if I stored it in the page itself? You mean like the log-structured storage we talked about earlier? Right — with log-structured storage you don't have the table heap on disk, you only have the log. What you give up by not having the heap is that reads are now more expensive, because on every read you have to replay the log to figure out what the page should actually look like; but it makes writes super fast, because you don't do extra writes to write out dirty pages — you only keep appending to the log. So there's a trade-off. That would be RocksDB, LevelDB, Cassandra — a bunch of systems do these log-structured merge trees, log-structured storage. For what we're talking about here, we're assuming you have table heaps: we have table heaps and a write-ahead log, and we're not doing the log-structured stuff we talked about before. But the concept is still the same — you just don't have the table heap.

All right, the next question is: if I have two transactions updating the database, are their log records intermixed in the same log buffer, or do they get separate log buffers? They're intermixed. Again, this is independent of — orthogonal to — the concurrency control stuff. Some higher-level part of the system determines whether a transaction is allowed to update this object or that object; at this point we assume they're allowed to do it, they're going to do it, and we just add their log records to the log file. We don't maintain separate log files for different transactions; because we have the transaction ID, we know who did what. Yes. Okay.

So again: if we crash, everything we need to restore the database is in the log. That leaves two implementation questions. The first is: when should the database system write the log entries out to disk? Well, as we said, that's obvious — whenever the transaction commits. Before we tell the outside world the transaction is fully committed, we need to make sure the log records in the buffer are stored out on disk. That's all we need to be able to restore the database.

But in my example I only showed one transaction, and when it said commit I immediately did an fsync and wrote it out. That's actually going to be super slow if, for every single transaction, I do an fsync when it commits to write out the log buffer. Say an fsync takes five milliseconds on a slower disk — then I can only commit one transaction every five milliseconds, and that's bad: I can only do about 200 transactions a second. So what every system actually implements instead is what's called group commit, where you allow multiple transactions to buffer up a bunch of changes in the log buffer, and at some point you make a decision: all right, now I'm going to write whatever is in the log buffer out to disk. That may include log records from transactions that have not committed yet, but that's fine, because we know how to undo their changes — we have the undo information in the log records. This is what you're going to end up having to implement in Project 4.

The way it basically works, in the simple form, is that instead of having one log buffer, you have two log buffers.
It's sort of like shadow paging: there's the active one everyone is writing into, and there's the one in the background that you'll eventually switch to next. So when my transaction T1 starts and does a write on A, you always go to the first buffer: add the begin record, then the write on A, then the write on B. Now I context switch: add T2's begin entry, then a write on C. But at this point my log buffer is full — I can't add anything else to it — so I go ahead and write it out to disk, and I have to do an fsync, which takes some time. In the meantime, I switch over to my second log buffer, and any other changes that transactions make get added to that one. The idea is that while one buffer is being written out to disk, you fill up the other; when that one fills up, ideally the first has finished flushing, so you write the second one out and fill up the first again — you just keep going back and forth. While one is getting flushed, you fill up the other.

But let's say this transaction stalls — it's doing some additional computation — so neither transaction is generating new log records. Instead of just waiting until the buffer fills up before writing it out, there's also a second trigger: if it's been a certain amount of time since I last flushed, I go ahead and flush anyway. Ideally you tune it so that, if you know what the fsync time is for writing something out to disk, you set your timeout to roughly that. So if it takes me five milliseconds to write the first log buffer out, then I wait five milliseconds for the second one, and when the five milliseconds are up, if it's not full yet, I go ahead and write it out as well.

So again, group commit basically says: instead of each transaction fsyncing the log when it commits, you batch a bunch of them together and amortize the fsync cost across multiple transactions. If you're the first transaction that gets added to the log buffer and then you commit, you're kind of screwed, because you have to wait the longest before you're actually written out to disk. But if you're the last guy, it's as if you got the disk entirely to yourself, because you're not waiting at all. On average this works out to be much better than everyone fsyncing immediately.
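Here is a rough, sequential sketch of the double-buffered group commit idea; a real implementation flushes on a background thread and tracks which commit records each flush made durable, so this is just to make the buffer swapping visible:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Two log buffers: records are appended to the active one, and when it
// fills up (or a timeout fires) we swap buffers and flush the full one.
constexpr size_t kBufferCapacity = 4;  // records per buffer (tiny, for demo)

std::vector<std::string> buffers[2];
int active = 0;  // which buffer new records go into

// Stand-in for writing one buffer out and fsync'ing it. Every transaction
// whose commit record was in this buffer is acknowledged after this call.
void FlushBuffer(int which) {
  std::printf("fsync buffer %d (%zu records)\n", which, buffers[which].size());
  buffers[which].clear();
}

// Swap to the other buffer and flush the one that just filled up. The same
// path runs when a timeout fires with a partially full buffer.
void SwapAndFlush() {
  int full = active;
  active = 1 - active;  // new appends go to the other buffer
  FlushBuffer(full);
}

void AppendRecord(const std::string &record) {
  buffers[active].push_back(record);
  if (buffers[active].size() >= kBufferCapacity) {
    SwapAndFlush();
  }
}

int main() {
  AppendRecord("<T1 BEGIN>");
  AppendRecord("<T1 W(A)>");
  AppendRecord("<T2 BEGIN>");
  AppendRecord("<T1 COMMIT>");  // buffer full -> swap and flush; T1 is now durable
  AppendRecord("<T2 W(C)>");    // lands in the other buffer
  SwapAndFlush();               // e.g., the timeout fired
}
```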
So the last question is: when should we actually write the dirty pages out to disk? Well, this depends. Now that we've said the log contains everything we need — all the information to redo the changes transactions made — and that the log records corresponding to changes to a page are written out to disk before that page is written out, it's not actually urgent for us to flush those pages immediately when the transaction commits, because the log records have already been flushed. So our replacement policy in the buffer pool manager can account for this. We can say: all right, this page is dirty, but its log records have already been written out, so maybe I can evict it right away. Whereas this other page, even though it might otherwise have higher priority for eviction, has log records that haven't been flushed yet, so maybe I don't want to evict it, because I'd have to flush the log buffer first before I could write it out. This idea of logging and recovery permeates the entire system, and we need to account for it in the other parts we've already talked about — for example, by updating the replacement policy to keep track of these things.

Yes? His question is: in these examples, say I have a commit record in the log buffer, but before I write it out to disk, I crash. Who cares? No, really — the application said commit, but we haven't told them they've committed yet. We only tell them they committed when the log buffer containing that commit record has been written to disk. That's the genius of this. Even if we're doing OCC, where we do validation when the transaction tries to commit, it could still end up aborting — but at the point we're adding records to the log buffer, we assume all of that has been taken care of. So we see a commit record here, but there's no callback saying, yeah, you're good, everything's durable, until the log buffer has been written to disk.

His question is: is the application hanging for a while? Yes, correct — but what else would you do otherwise? If you don't want to wait to find out whether you committed, that's fine, but you could lose data. If I don't run inside a transaction — actually, even with an auto-committed single-statement transaction, I think it's still going to write a log record before it comes back and says your thing actually finished. Some systems let you turn off logging on a per-transaction basis: I can run a transaction, make a bunch of changes, and say, when I commit, don't write anything out to the log. Some systems allow that; by default, they don't.

Yes — the question is: do we need log entries for a read-only transaction? What do you think? What would you actually log? What would the log record be? There's nothing to log. That's what I'm saying — going back here... oh, sorry, that was PowerPoint, not me. In this example I showed that when the transaction started, I added a begin entry. You don't have to do that, because if it's a read-only transaction — if it calls begin but then just does a bunch of reads — then I'm storing stuff I don't care about.
You could log it, but you don't have to. In some systems you can actually declare a transaction read-only at the beginning — you say BEGIN as read-only or something in SQL — and then you can skip all the logging, and all the concurrency control if you're doing snapshot isolation, because you know you're going to see exactly what you should be seeing. For simplicity, I'm showing the begin record here.

Yes — the question is: when do you garbage collect the logs? A few more slides and we'll get there. Spoiler: it's checkpointing, but we'll get to that. Okay.

So, just to recap everything we've talked about: the way to think about the different methods for buffer pool management and recovery is in terms of runtime performance and recovery performance. Runtime performance is how fast it is to maintain all this information while I'm running transactions and queries. In the case of no-force + steal, which is write-ahead logging, that's the fastest during runtime, because when I commit I'm just flushing out the log records; I don't worry about the dirty pages hanging around in my buffer pool — I'll take care of them at some later point. With shadow paging it's more expensive, because I have to flush all the pages I modified in my shadow page table, then flush the database root to disk, before I can tell the outside world I committed. The flip side is that if I have to recover the database after a crash, shadow paging is the fastest, because I don't do anything extra: I just come back, my database root points to the most recently committed, consistent version, and I'm done. Under write-ahead logging, recovery is slower, because I have to replay the log from some point in the past — which is what she was alluding to with the garbage collection question.

Because of this trade-off between runtime performance and recovery time, most database system implementations choose write-ahead logging — they choose no-force + steal — because they would rather be faster at runtime, and they assume failures are going to be rare, which in the grand scheme of things they are. Your database system is not crashing every minute; if it is, you have other problems. So they're willing to trade faster runtime performance for slower recovery. There's only one system I'm aware of — other than the ones that do shadow paging, like the old SQLite — that makes the opposite trade-off: faster recovery time in exchange for running slower at runtime. I don't know the name of it, but it was a database system built in the 1970s for the Puerto Rican electrical system, because in Puerto Rico in the 1970s they had power outages roughly every hour. The database system was crashing literally every single hour, so for them it was much better to be slower at runtime so that every time they lost power they could recover the database immediately afterwards.

Another way to think about this at a high level: shadow paging is no-undo / no-redo, because there's nothing to reverse and nothing to reapply; with write-ahead logging, you need both undo and redo.

All right, so I showed at a high level what these log records are — there's an object ID, a before value, and an after value. But how is this actually implemented?
All right, so I showed at a high level what these log records are: there's an object id, and then there's the before value and the after value. But how is this actually implemented? There are a couple of different approaches.

One would be what is called physical logging, which is basically what I've talked about so far: we record the low-level byte changes to a specific location, to some object in the database, so that you know how to reverse them. Think of this like running diff or git diff: it gives you the before and after image of the change. The downside of this is, again, that if I update a billion tuples in my transaction, I have to have a billion log records corresponding to all those low-level physical changes.

Another approach is logical logging, where you just record the high-level operation, the change you made to the database, and that's enough for you to be able to reapply it. Undo is a little more tricky, depending on what the actual query is, but let's ignore that for now. The way to think about it: physical logging stores the diff of the change you made; logical logging stores the actual SQL query that made the change.

The trade-off between them is that logical logging allows you to record more changes with less data: I updated a billion tuples with a single UPDATE statement, so I only log that one statement. The downside of logical logging is that it's going to be tricky to figure out which of the changes I made actually got written to disk before the crash, because I don't have that low-level information. I updated a billion tuples over a billion pages, and maybe half of them got written out to disk; how do I know which ones I need to reapply? The other issue is that however long it took to run the query the first time, it's going to take the same amount of time the second time under logical logging: if my query took an hour to run, during recovery it's going to take an hour to run again. There's no magic in recovery mode that makes it go faster. So although logical logging stores less information, it makes recovery more expensive, and most systems do not make that trade-off.

The hybrid approach that most people use is called physiological logging. You're not storing the low-level byte-for-byte information about the changes you're making to the database, but it's still low level enough to say: at this page, I'm modifying this object. You're not really taking a diff as you would under physical logging; you're just saying, here's this page, and here's the logical change I want you to make within it.
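For a concrete picture of the three formats, here's a small sketch of what a single-tuple update might produce under each of them. The field names and layouts are invented for illustration; real systems differ in the details.

```python
# Illustrative shapes of the three kinds of log records for one update.
# Field names are made up; real systems differ in the details.
from dataclasses import dataclass

@dataclass
class PhysicalLogRecord:
    # Byte-level diff at an exact location: one record per changed location.
    page_id: int
    offset: int
    before_bytes: bytes
    after_bytes: bytes

@dataclass
class LogicalLogRecord:
    # Just the high-level operation; one record can cover a billion tuples.
    sql: str

@dataclass
class PhysiologicalLogRecord:
    # Page-oriented, but logical within the page: identify the slot, not
    # the raw byte offset, so replay tolerates the page being reorganized.
    page_id: int
    slot_num: int
    before: dict
    after: dict

# The same logical change, three ways:
physical = PhysicalLogRecord(page_id=42, offset=1024,
                             before_bytes=b"\x00\x01", after_bytes=b"\x00\x02")
logical = LogicalLogRecord(sql="UPDATE accounts SET balance = balance + 1")
physiological = PhysiologicalLogRecord(page_id=42, slot_num=3,
                                       before={"balance": 1}, after={"balance": 2})
```

The physiological record identifies the slot rather than a raw byte offset, which is what gives replay the wiggle room described next.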
So physiological logging is what most systems actually implement. Let's say we have this update query here. Under physical logging, the record would be: at this page, at this offset, here's the before and after image. And we haven't talked about indexes yet, but you basically have to log index changes at the same time you're making changes to the database, because if my index doesn't fit in memory, I don't want to have to rebuild it from scratch upon recovery. So most database systems also log the changes you make to the indexes. Under logical logging, for this query, you just store the SQL statement. Under physiological logging, you're saying: at this page, at this slot number, here's the change I want you to make to these attributes. By having this extra indirection, sort of like with slotted pages, it allows you to reorder the replay operations: the database doesn't need to be a byte-for-byte copy of what it was before the crash, so you have some wiggle room to reapply these changes in different ways and still restore back to the correct state.

All right, so, getting back to her earlier question: the issue with everything we've talked about so far is that these write-ahead logs are going to grow forever. If my database system runs for a year, I'm going to have a year's worth of log. So now if I crash and come back and have to replay this log, I potentially have to replay the entire year's worth of data. With logical logging that would be terrible, because, again, a query takes the same amount of time during recovery as it did the first time; if I have a year's worth of logical log records and I crash and come back, it could take a year to recover the database. That's bad.

The way we truncate the log is with what are called checkpoints. The idea of a checkpoint is that we're going to flush out all the pages that are dirty in our buffer pool to disk, and then add an entry to our log saying: at this point, there are no dirty pages that are not durable on disk. Therefore you don't need to replay anything further back in the past than my checkpoint, because I know all those changes have been persisted. And the idea here is, again, that because we're using the steal policy, or rather the no-force policy, we're not requiring that the dirty pages a transaction modified be flushed out to disk before the transaction commits, so if we crash we don't know whether they actually made it to disk. Whereas with a checkpoint, when we know the checkpoint completed, we know that at that point everything had been written to disk.

So let's look at a really simple example. For this one, I'm going to use a very simplistic checkpointing scheme that basically stops the world: it stops all transactions from running, flushes out all their changes, all the dirty pages, to disk, and once I know all those dirty pages have been written, it lets the transactions start running again. This is called consistent checkpointing, or blocking checkpoints. Most systems don't actually implement it this way; we'll see on Wednesday how to do it better, so that things can keep running at the same time. But to understand the basic protocol, assume that's the case. So I'm going to add this checkpoint entry here.
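Here's a minimal sketch of that stop-the-world scheme: block transactions, flush every dirty page, then append a CHECKPOINT record to the log. The lock standing in for "stop all transactions" and the callback names are purely illustrative.

```python
# Sketch of a blocking ("consistent") checkpoint, as described above.
# Everything here is illustrative; real systems avoid stopping the world.
import threading

class Page:
    def __init__(self, page_id):
        self.page_id = page_id
        self.is_dirty = True

class SimpleCheckpointer:
    def __init__(self, pages, append_log_record, flush_page):
        self.pages = pages                      # page_id -> Page
        self.append_log_record = append_log_record
        self.flush_page = flush_page
        self.txn_barrier = threading.Lock()     # stand-in for "stop all txns"

    def take_checkpoint(self):
        with self.txn_barrier:                  # 1. stop transactions
            for page in self.pages.values():
                if page.is_dirty:               # 2. flush every dirty page
                    self.flush_page(page)
                    page.is_dirty = False
            self.append_log_record("CHECKPOINT")   # 3. mark it in the log
        # 4. leaving the barrier lets transactions run again

# Tiny demonstration:
log = []
cp = SimpleCheckpointer({1: Page(1), 2: Page(2)},
                        append_log_record=log.append,
                        flush_page=lambda p: None)
cp.take_checkpoint()
print(log)   # ['CHECKPOINT']
```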
So what will happen is: when I take this checkpoint, I stop all transactions from running and I flush out any dirty pages. And now if I have a crash, when I come back I know that I don't need to look at any of T1's changes, because T1 committed before my checkpoint; I know all of T1's changes were written to disk, so I don't need to replay its part of the log. It's the other two, T2 and T3: those started before the checkpoint and could potentially have made changes before it, so I need to go back in the log up to that point and figure out what they actually did, in order to put my database back in the correct state. In this case, T2 committed before the crash, so I know I want to redo its changes; T3 did not commit before the crash, so I know I want to reverse its changes.

So the checkpoint is, again, basically just telling us that we know that, at this point in time, all dirty pages from any transaction have been written to disk, and it's up to us to figure out what came before the checkpoint and what came after it, to decide whose changes are actually allowed to persist.

We'll talk about checkpoints more on Wednesday, but in my simple example here I stop all the transactions to make my life easier. Otherwise I'd have to do some extra work to handle cases like this: a transaction is updating 20 pages, my checkpoint flushed out the first 10 pages that it modified, but while the checkpoint was running it modified these other pages that didn't get flushed, and now I have to figure out which of those should actually be around.

The other tricky thing is that it's not clear how often we should take a checkpoint, because checkpoints aren't free: we're writing out dirty pages, and that's hitting the disk when we could be writing out the log. Now, a lot of systems will put the log on one set of disks and the heap files on separate disks, so doing I/O on both of them doesn't slow each other down. But my checkpoint is still writing out dirty pages when I could have been evicting pages to free up space in my buffer pool and keep other transactions running.

So how often you take a checkpoint varies based on the implementation. One approach is to take a checkpoint every so many minutes or seconds. If I do that, my recovery times will be much faster, because I don't need to go as far back in the log to figure out what should actually be persisted, since my checkpoints occur more frequently; but the checkpoints themselves are slowing me down, so my runtime performance suffers. Another approach, which I actually think is better, is to take a checkpoint only after a certain amount of data has been written out to the log: say, after I've written 250 megabytes of data to the log, I take a checkpoint. That bounds how much log there can be to replay, so it bounds how long recovery takes, and you don't have to worry about whether you're taking checkpoints unnecessarily, because I know I only need to look at, at most, roughly 250 megabytes of the log when I have to recover my database.
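A tiny sketch of that second policy: count the bytes appended to the log and kick off a checkpoint once a threshold is crossed. The threshold constant just echoes the 250 MB example above, and the LogWriter class and its callback are invented for illustration.

```python
# Sketch of triggering checkpoints by log volume rather than by wall-clock
# time. The 250 MB threshold just echoes the example number above.

CHECKPOINT_THRESHOLD_BYTES = 250 * 1024 * 1024

class LogWriter:
    def __init__(self, take_checkpoint, threshold=CHECKPOINT_THRESHOLD_BYTES):
        self.take_checkpoint = take_checkpoint   # callable supplied by caller
        self.threshold = threshold
        self.bytes_since_checkpoint = 0

    def append(self, record_bytes):
        # ... the record would be written to the log buffer / file here ...
        self.bytes_since_checkpoint += len(record_bytes)
        if self.bytes_since_checkpoint >= self.threshold:
            self.take_checkpoint()
            self.bytes_since_checkpoint = 0

# Tiny demonstration with an artificially small threshold:
writer = LogWriter(lambda: print("checkpoint!"), threshold=10)
writer.append(b"12345")
writer.append(b"6789012345")   # crosses 10 bytes -> prints "checkpoint!"
```

This bounds how much log has to be replayed after a crash to roughly the threshold, regardless of how busy or idle the system was.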
Again, I'm going over this very, very fast because we're running out of time, but we'll cover it in more detail when we talk about recovery on Wednesday. Any high-level questions about checkpoints? Think of a checkpoint as garbage collection for the write-ahead log: I know that, from the point of the checkpoint, I don't (potentially) need to look at anything that came before it. Of course, in the extreme case, if I have a transaction that runs for days and I'm taking a checkpoint every five minutes, I still need to go back to when that transaction started to figure out all the changes it actually made.

Okay, so, as I said, write-ahead logging is almost always the preferable approach, the better approach, for avoiding data loss and making sure that our database is durable on disk. The core idea of how it works is that it uses a steal plus no-force buffer pool management policy: it flushes all the log records for the changes a transaction made to disk before we tell the outside world that the transaction is committed, and then in the background, at some later point, we can flush out those dirty pages. But we have to write the log records first, before we can write out the dirty pages that they modified. And on recovery, we undo any changes from uncommitted transactions, and we redo the changes of any committed transactions to make sure they get applied.

Yes? The question is why we potentially have to undo changes on recovery. Again, we'll cover that on Wednesday, but it's because changes from uncommitted transactions, those dirty pages, could have been written out to disk, since we're using the steal policy.

This question is: are there scenarios where, upon recovery, we would not have to undo, because we could check whether our changes actually made it out to disk? No. The spoiler for Wednesday is that you redo everything: you're going to go through the log multiple times, you redo everything, and then as you redo, you say, oh, I see this transaction didn't commit, so you go back through the log and undo anything that it did. You play it safe and you always undo. There are some optimizations, but I don't think most people do them.

All right, so Wednesday's class will be the second part of what we talked about for logging and recovery: the things we do after a crash or after a restart. How do we use the write-ahead log, how do we use the checkpoints, to put us back in the correct state?
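So the Wednesday spoiler isn't completely abstract, here's a deliberately naive sketch of that redo-everything-then-undo-the-losers idea. The log record layout is invented, and this ignores checkpoints and everything that makes the real algorithm (ARIES, below) subtle.

```python
# Deliberately naive sketch of "redo everything, then undo the losers".
# Log record layout is invented for illustration:
#   ("UPDATE", txn, key, before, after)  or  ("COMMIT", txn)

def recover(log, db):
    # Pass 1 (redo): reapply every update in log order, committed or not.
    committed = set()
    for rec in log:
        if rec[0] == "COMMIT":
            committed.add(rec[1])
        elif rec[0] == "UPDATE":
            _, txn, key, before, after = rec
            db[key] = after
    # Pass 2 (undo): walk backwards and reverse the updates of any
    # transaction that never logged a COMMIT record.
    for rec in reversed(log):
        if rec[0] == "UPDATE" and rec[1] not in committed:
            _, txn, key, before, after = rec
            db[key] = before
    return db

log = [("UPDATE", "T2", "A", 1, 2), ("COMMIT", "T2"),
       ("UPDATE", "T3", "B", 10, 20)]            # T3 never committed
print(recover(log, {"A": 1, "B": 10}))           # {'A': 2, 'B': 10}
```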
So this is probably the third-hardest part of database systems. The thing we'll be talking about is ARIES. ARIES is the gold standard for how you do database recovery. I don't know whether the textbook calls it ARIES, and most systems that implement a write-ahead log are not going to call what they're doing ARIES, but everybody that does write-ahead logging is basing it on IBM's protocol from the 1990s. Whether or not they know it, they're basically using ARIES.

Okay, all right. I'm holding office hours now at 1:30, and I'll see you guys on Wednesday.