This is where we were at last time. We had started looking at the different logging schemes and asking: what do we need to log to make the recovery protocols work? And there were three logging schemes. The first is physical logging, where every time you make changes to a page, you capture in the log the before and after image of the page. One downside is that this can get really large: if there were logical moves of records happening in the page, just rearrangements, the diff will be large even though the change you made was really small. The way around that might be some form of logical logging, where you effectively say, I'm going to log the operation that happened. The simplest form of logical logging is to log the entire query that made the change. But of course this means that when you want to apply the log, you have to rerun the whole query, and if that query took a long time to run, you're going to have to do all of that work again. The preferred method, the one systems actually use, is called physiological logging. The way to understand it is the first line up here: it is physical to a page and logical within a page. So imagine I've got a record that spans multiple pages, which can happen, right? Records can span multiple pages, and you made some changes to that record as a transaction. You updated one record, but you actually updated bytes across two pages in the database. What you'll do in physiological logging is create two log records, one for each page. Every page gets a log record, so it's physical at that level. But within the page you use logical logging to simply say: here is the slot number, and within that slot here are the bytes that were changed. And so you get the best of both worlds, okay? Now, of course, this also applies when you have indexes and such; things get far more complicated with all of these components when you have indexes. In this course we are going to ignore that, but if you're interested, I can recommend some papers, and perhaps the best source would be Jim Gray and Andreas Reuter's transaction processing book. If you want, you can come borrow it from me to dig deeper into some of those issues. So, physical versus logical logging. This is just restating what we've said: logical logging requires less data to be written, so the logs are smaller, and since these logs occasionally have to be flushed out to disk, having small ones is better. But it's difficult to implement recovery with logical logging. Imagine I'm logging a query and I have to undo that query: now I really need an undo mechanism at some lower level, perhaps through version chaining, to do that. So pure logical logging in the general case can become difficult. In the case of a B-tree, it turns out that doing logical logging within the B-tree is not that hard, because B-trees have certain semantics, and as long as you preserve those semantics you're okay. So these are not hard and fast rules; you might actually use some combination. For the purpose of this class, I just want you to know the difference between the two and precisely what physiological logging means. So is everyone clear on what physiological logging means? Okay, all right.
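To make physiological logging concrete, here is a minimal C++ sketch (the field names are mine, not from the lecture) of what one such update record might carry: it is physical in that it names exactly one page, and logical in that it identifies the change by slot rather than by a whole-page image. A record update that touches two pages would produce two of these.

```cpp
#include <cstdint>
#include <vector>

// A minimal sketch of a physiological update record, with hypothetical
// field names. Physical to a page: exactly one page_id. Logical within the
// page: the change is addressed by slot, not by raw page offsets, and only
// the changed bytes are captured as before/after images.
struct PhysiologicalUpdate {
    uint64_t page_id;              // physical: which page was changed
    uint16_t slot;                 // logical within the page: which slot
    uint16_t offset;               // byte offset of the change inside the slot
    std::vector<uint8_t> before;   // before image of just the changed bytes
    std::vector<uint8_t> after;    // after image of just the changed bytes
};
```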
Last class I briefly mentioned that logging can sometimes be confused with log-structured file systems: does using logging in a database system mean you end up with a log-structured file system? No. When people say log-structured file system, that's just a different file organization than a heap organization, and you've seen that in the first half of the semester. However, there is a connection. If you look at log-structured storage, most implementations have something like a memtable, kept in memory, where recent changes live before they eventually make it down to the lower layers of the log-structured store. So for a little while, actual data is simply going to sit in memory before it makes it down, right? Does that make sense? Now, if you wanted to protect changes that are happening in memory, so that when failures of the types we looked at earlier happen you don't lose anything the platform has acknowledged as written, you can imagine using exactly these logging and recovery mechanisms for the memtable, so that you don't lose changes. A lot of this is about saying: I want to use an efficient mechanism to cache. Throughout the material for this lecture we've looked at the buffer pool, and we've said the buffer pool is where you want to keep your data because it is fast, but it is volatile, and if a failure happens you want to handle that volatile information correctly. Same thing with the memtable: you could use all of these techniques to keep the memtable safe across all the types of failures we've talked about. Questions? All right. I'm going to skip over the next two slides, or skim through them really fast, because we'll talk about checkpointing in great detail in the recovery component you're about to start. So: we are logging things. We are following the write-ahead logging protocol, which basically says that any time I flush a page from my buffer pool in memory to stable storage (non-volatile storage, disk; those terms are interchangeable here), I will flush the logs before I flush the changes, and at commit time I'm going to flush the commit log records to disk. But this write-ahead log we are creating is effectively a list of all the changes we've ever made, and it can grow really large. So we'll use a mechanism called checkpointing to help us manage that log. If you did nothing, say you started a database system and kept logging and logging, a brand new system, all the hardware was awesome, but after a year things started to fail, then unless you do something like checkpointing, you'll have to recover the system by replaying all the transactions in the log for a year. That's too much. So we want to cap the amount of time we need during recovery, and we'll use checkpointing to get there. Things will still be correct without checkpointing, but you run into trouble with recovery performance: recovery could take forever. We'll talk about the different ways of doing checkpointing next, but the main point is that we're going to create a new log record type called a checkpoint record, and as you'll see in a little bit, we'll create a begin-checkpoint and an end-checkpoint record once we start doing something called fuzzy checkpointing.
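As a rough sketch of those two write-ahead logging rules, here is what the checks might look like in C++; the helpers are hypothetical stubs standing in for the real log and disk managers.

```cpp
#include <cstdint>

// A minimal sketch of the two WAL rules, with hypothetical stub helpers.
using lsn_t = uint64_t;

static lsn_t flushed_lsn = 0;            // highest LSN known to be durable
void flush_log_up_to(lsn_t lsn) {        // stub: force the log through `lsn`
    if (lsn > flushed_lsn) flushed_lsn = lsn;
}
void write_page_to_disk(uint64_t /*page_id*/) {}  // stub

// Rule 1: before a dirty page leaves the buffer pool, every log record that
// describes it (everything up to its pageLSN) must already be on disk.
void flush_page(uint64_t page_id, lsn_t page_lsn) {
    if (page_lsn > flushed_lsn) flush_log_up_to(page_lsn);  // log first...
    write_page_to_disk(page_id);                            // ...then the page
}

// Rule 2: a transaction counts as committed only once its commit log record
// is durable; only then may the client be told "committed".
void commit(lsn_t commit_record_lsn) {
    flush_log_up_to(commit_record_lsn);
}
```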
So you have a range of algorithms to make checkpointing better than the simple approach. At a very high level, the simple way of doing checkpointing is: we pause all queries, flush all the logs to disk, flush all the modified pages to disk too, which basically puts everything in stable storage, and then write the checkpoint. Now, that's going to be very expensive. If I've got a terabyte buffer pool and most of the pages are dirty, the checkpointing process itself could take a very long time. So we'll look at ways to make it better, but the general idea is that checkpointing is a mechanism to make recovery go faster. It's not essential for correctness, but without it you don't have a practical way to do database recovery. Okay, so that's all we need for this class. This is an example that just says you go create this checkpoint; we'll look at better examples in a little bit. But I do want to cover this last point before switching over to the recovery lecture, which is the question of how frequently to take these checkpoints. A checkpoint does some work. We'll try to make that work as small as possible, but it's still work, and you're going to pause the system, perhaps for a very, very small amount of time. Regardless, it has to do I/Os and so on. So the question is: how frequently do you checkpoint, which means spending resources now to make an eventual recovery faster, versus how long can you afford that recovery to take? Sometimes there are requirements that say my system has to come back up from a failure within a minute, and you can actually set the checkpointing interval accordingly. Most database systems have a knob for how frequently to take a checkpoint, every hour, every day, every week, and with it you can control what the recovery time is. So the main point is that database systems give you parameters to control different components of checkpointing, the most important one being the frequency, okay? And that's all we'll say about checkpointing and setting it up. I just want you to be aware that if you're ever running a database system, you'll have to ask yourself what your tolerance for recovery time is, because while a system is recovering, no transactions can be admitted, and if you don't checkpoint frequently enough, that recovery time could be very large. All right, so we're going to switch over to the material for today's lecture, and I feel pretty good today that if you interrupt me a lot I will still be all right, so fire away. A few announcements: project four is on concurrency control and is due on December 10th. And Andy is going to start teaching again from next week onwards, the last few lectures, so this is probably the last time I'll see you this semester. But I hope I see some of you, especially if you are around next semester and want to TA the database class. This class is going to be repeated next semester, I'm going to teach it, and I would love to get your help, especially if you've done well in this class, which I know everyone attending here has. And there's one pitch I want to make.
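As a back-of-envelope illustration of that trade-off (all numbers made up), one could bound the checkpoint interval by a recovery-time budget like this:

```cpp
#include <iostream>

// A sketch with made-up numbers: if recovery roughly replays the log written
// since the last checkpoint, the checkpoint interval bounds recovery time.
int main() {
    double log_write_rate_mb_s = 50.0;   // assumed steady log volume
    double replay_rate_mb_s    = 200.0;  // assumed recovery replay speed
    double recovery_budget_s   = 60.0;   // "must be back up within a minute"

    // Replaying interval * write_rate MB at replay_rate must fit the budget:
    double max_interval_s =
        recovery_budget_s * replay_rate_mb_s / log_write_rate_mb_s;
    std::cout << "checkpoint at least every " << max_interval_s << " s\n";
}
```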
I still don't know why so many people don't come to class. If you're paying this much tuition, give the extra two hours and forty minutes of your time to learn this stuff. Anyway, I'll make a few changes next time, perhaps, to encourage people to come to class. I've got a couple of ideas, and if you're a TA you can help me come up with more. Extra credit for coming to class? Something like that, probably. Send Andy and me a note, and if you're on the edge, maybe we might consider it. Maybe. But the real pitch is the following. I'm hoping some of you, or most of you, are super excited about databases. All of these mechanisms we are learning are very general systems mechanisms that you're going to use all across computing, so if you're excited about systems, you should be excited about databases. The reason I got so excited about databases when I was your age, which is a long time ago, was that in one place you could connect all of these mechanisms, not as the individual mechanisms you often learn in other classes like operating systems or distributed systems, but as one whole: here you actually have to figure out how to bring all of that together, and it's not just learning each mechanism by itself but the interactions between them. I can't think of any other system but a database system where all of these things, concurrency control, query optimization, recovery, come together in one place. The long story short is that when you teach for the first time, even if you think you've really understood the material, you'll be surprised how much more deeply you understand it when you actually have to teach it to someone. That's a really good reason, especially if you're excited about databases and want to really understand these mechanisms: you have to explain them to someone, and you'll be shocked how much deeper you get into it, because you'll be forced to think about it at a level you've never thought about before. Even if you get a hundred on the exam, I guarantee you haven't thought about it as deeply as when you have to go explain it to someone else. And all of you are smart; you'll ask all kinds of random questions, most of them really good (most of them), and that forces you to think and be prepared. So that's my last plug. You can apply on the website where you register if you're interested, and then shoot me a mail, especially if you've taken this class and attended, and I'll pull you into the TA pool. Okay, all right, enough of that. Let's get started with the meat of what the recovery mechanism has to do. So far we've said we've got this mechanism of logging, physiological logging, where the log records are relatively small: if I've got a thousand-byte record and I just changed four bytes, I will only log those four bytes, and the log has enough information to tell me how to reapply a change and how to unapply, or undo, a change, because I keep track of the before and after images, okay? But that still needs to be put into the context of how I'm going to recover from a crash, and that's what we'll talk about today. It's the second part. You've seen this slide before, and we've covered the first bullet point at the bottom. Now we're going to look at the algorithm you use to recover and to guarantee atomicity and durability.
Consistency comes along because you're not messing up any of the integrity constraints. So, to reset and re-emphasize what we're operating under: hopefully from the last lecture's material we're all in agreement that we want to use a steal, no-force policy to get an efficient buffer manager, where the buffer pool replacement policy is purely based on things like LRU timestamps, where the buffer pool is allowed to steal pages (as long as a page is unpinned it can be flushed out to disk even while a transaction is running), and where it is not required to force all changes of a committed transaction to disk at commit time, okay? We've established that as the efficient way to run a buffer pool; we wanted a low-latency, high-throughput transaction system, and we needed both of those properties for it. Now we have to think about how we deal with the different classes of transactions at a crash. Imagine time going from left to right, and transaction T1 started: B is the begin log record, C is the commit log record, A is an abort log record. What we want to happen here is that when the crash happens at the very end, the changes of T1 and T2 survive, because when the commit happened we already told the outside world. As per the second rule of the WAL principle, as soon as the commit log record hits stable storage, we tell the world the transaction is committed. So T1 and T2 have committed, but their changes may still be sitting in the buffer pool, not yet flushed out to disk; we have to deal with that. T3 is a transaction that explicitly aborted, either because the application issued an abort call or because there was a deadlock or some other failure, but its status is known: it was aborted, and that abort was publicly disclosed. We told the external world that this is an aborted transaction, so even if it has dirty pages that have already been flushed out to disk, we need to go undo those changes. And T4 is a transaction that had begun and was still running; its final state was not determined when the crash happened, and we are going to treat it like an aborted transaction. So: any transaction running at the time of the crash needs to be undone, treated like an aborted transaction; explicitly aborted transactions need to be rolled back; and the committed work has to be redone so it's durable. That's what we need from the recovery protocol, okay? Everyone okay with the setup here?
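A tiny sketch of that classification, under the assumption that recovery can see each transaction's last logged state; the names are hypothetical:

```cpp
#include <cstdio>

// What recovery owes each class of transaction in the T1..T4 example,
// given the state its log records reveal at crash time.
enum class TxnOutcome { Committed, Aborted, InFlight };

const char* recovery_action(TxnOutcome state) {
    switch (state) {
        case TxnOutcome::Committed:  // T1, T2: commit record reached the log
            return "redo: changes must be made durable";
        case TxnOutcome::Aborted:    // T3: abort was announced to the world
            return "undo: any flushed changes must be rolled back";
        case TxnOutcome::InFlight:   // T4: no final state in the log
            return "undo: treat exactly like an aborted transaction";
    }
    return "";
}

int main() {
    std::printf("in-flight -> %s\n", recovery_action(TxnOutcome::InFlight));
}
```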
All right, so the algorithm we are going to talk about is called ARIES, and it was developed at IBM. The person behind it is C. Mohan, a member of the National Academy of Engineering primarily for this work, and it's a beautiful protocol. The paper, however, is very difficult to read; it's about a 70-page paper. Pretty much all the textbooks have a good treatment of the high-level overview, and that's what we'll talk about. It's what everyone basically implements: even if not exactly the way the paper describes it, the mechanisms in it are essentially what every system has. Its key components are these. We have write-ahead logging, which we follow under the steal, no-force policy, with the two rules of write-ahead logging: flush the log records to disk before we flush the corresponding dirty page from the buffer pool, and flush commit log records to disk at commit. And now the key part; I'm just going to preview this diagram, which we'll look at in more detail about eight slides from now. There were lots of recovery protocols before ARIES, but ARIES has a very interesting component called repeating history, which is where part of its name comes from. What it does looks like extra, redundant work, and it is redundant work, but logically it makes three passes. The first pass is the analysis phase: it starts from the checkpoint record and reads all the logs forward from there. Whenever you see an arrow in time for recovery purposes, it means we're following the log in time order, and as we'll see, logs have that structure. It does a forward pass over the log starting from the checkpoint. After the analysis pass is done, it knows all the transactions that were running at the time of the crash: it has figured out, for each transaction, whether it committed, aborted, or was still running, plus a little more information, and it has reconstructed the state of where the world was. Then, in the redo pass, it goes back and reapplies all the changes in the log, including the changes of aborted transactions, and that's where the repeating-history part comes from. A very robust property falls out of this: even if you crash while you are recovering, you still end up in the right state, and the protocol becomes a lot simpler. At the end of the redo phase it has effectively recreated the database as of the time of the crash, and then it undoes all the quote-unquote "loser" transactions, which are the aborted ones. The term "loser transaction" is what the literature uses, so I'm using it; it's 2023, so if you were doing this again you'd probably call it something else. The repeating-history part comes from the fact that our past gets repeated: recovery starts and redoes all the work. There were protocols before this that tried to hyper-optimize recovery and do the least possible amount of work, and they had subtle bugs in them all the time and were super hard to implement. It's already hard enough to implement ARIES, and that's the beauty of this protocol: it's logically a very clean algorithm. Seems like a lot of work, right? Yeah, there are three different passes.
So hold on, this picture is a preview. The main thing I want to get across right now is that there are three passes, because as I start talking about them you'll wonder whether you should start optimizing this stuff. Hold on to your optimization questions until we finish the protocol. The main point is that of the three passes, the analysis and redo passes are forward passes and the undo pass is a backward pass, okay? The algorithm becomes super simple because of this, and you'll see that about 30 minutes from now. All right, that picture is the detailed version; the explanation is coming in a little bit, but I just want you to see that we need a way to go back and forth in time, and sometimes we may have to go backward, so we're going to have to do something special in the logs to make that happen. That's where we'll start: log sequence numbers, which is the next topic. Then we'll go into how we create these logs and what attention to detail we have to pay in regular commit and abort operations. Then we'll talk about different types of checkpointing; the one everyone implements, because it's efficient, is called fuzzy checkpointing, but we'll also talk about two that are simple to understand, instructive but inefficient examples. And then we'll go into the actual recovery protocol. Those are the four things we'll cover today. So first, this very specific thing in the log called the log sequence number. So far we've talked about log records having information about the before and after images; now we're going to add one more piece of information to these records, and as the slides progress you'll see us add a little bit more and a little bit more to each of them, and you'll see why. The first thing every log record will now need is a log sequence number, an LSN. It's a logical proxy for time. Just as in the MVCC material, where we needed some timestamp-based ordering and said we could either take a real timestamp or keep a global integer counter that you atomically grab and increment using atomic instructions, log sequence numbers are like that. They're just a number; you do not take a timestamp, for efficiency. You take this monotonically increasing variable, you grab it every time you create a log record, and you put it in the record as its log sequence number. As a result, log records are always created in log sequence number order, and the log is the history of time from the perspective of all the changes that were made, okay? All right, so don't confuse the log sequence number with the timestamps, version numbers, or transaction numbers we were grabbing before; those are different variables. This variable has nothing to do with those; it's just for recovery purposes, okay? Now, there's a lot of complexity in what we need to keep track of. Even though I said ARIES is simple, query optimization, concurrency control, and recovery are the three hardest parts of database systems, okay? So a picture will help, and again, on first look it may not be clear, but it'll become clear in a couple of slides; I just want to throw up a picture to prime your mental model of what's going on. We've got different types of storage. We've got DRAM, also called volatile storage, which is your memory, where your buffer pool sits, and there are some bookkeeping things we keep track of in DRAM.
One of them is called the flushed LSN; what it's for will become obvious in a little bit. These LSNs all come from one global counter, but they get used in different ways; I'm just introducing the fact that there are lots of LSNs, all coming from that single global counter, along with a bunch of other bookkeeping we keep in DRAM. On stable storage, shown as disk here, the storage that survives a power failure, we keep track of things called page LSNs, and there's the log file, where the log records are stored. The log is typically stored on disk too, though it could be a separate disk; often that disk is mirrored, so that if one disk fails you still have the other copy of the log, because if you lose the log, everything is lost. In many cases, especially once you start looking at distributed transactions, the log can be stored somewhere remote too, so you have a copy if a single site fails. But the important part for today's lecture is this additional piece of the picture: all pages, the pages you implemented for your buffer pool in the previous assignment, will now have an extra field called the page LSN, which records the log record that most recently updated that page. Let me bring all of this together. Yep, great. The LSN counter is grabbed any time we need to write a log record, and we write a log record any time we update a page. We'll also grab an LSN for the begin-transaction log record, the end-transaction log record, the commit record, and in a few other places, but the most common reason to ask the LSN counter for a new LSN is that we're updating a page. When we update a page, we of course create the log record for it; that log record carries the LSN we just grabbed. But we also take that LSN and put it into the page sitting in the buffer pool: inside each page is a special area where we keep track of this page LSN. Question: if a transaction updates multiple pages, do we have multiple LSNs or just one? Great question. If a single transaction is updating, say, two pages, it depends on the logging. If you're doing physiological logging, you grab an LSN for every page you touch, and that's typically what most people do. There are cases where you can optimize: if I've got a record spanning multiple pages, do I grab one LSN, since that counter can become a point of contention? You might say, I've got LSN 13, but I'm going to create 13.1 because I'm updating one record that happens to span two pages. You can do optimizations like that, but for simplicity you can assume that every physical thing you log gets its own LSN. A single transaction updating a million records across a million pages grabs a million LSNs, so this counter can get very hot, okay? Great, other questions? All right. So in the page we keep an area for the page LSN. Every time you update, you put the log record in the log buffer, which is in memory for now, and you update the page LSN. We also keep a little room on the page for something else called the recLSN, which is coming next, and you'll see why. We'll start making those changes when the page is flushed out to disk. And before that, what will we do?
We'll make sure its logs are written to disk first, and then we'll flush the page out to disk too. So pages are going to have page LSNs and recLSNs (coming in the next slide) on every page, whether they're in the buffer pool or on disk, because this is a permanent change we're making to the page structure, okay? And we'll interpret these page and recLSNs in a little bit. Now, on this side over here, you're seeing a pictorial representation of the log records. These log records sit in a log file, and they also get a little space called the log buffer pool; sometimes it's carved out of the same buffer pool, sometimes it's separate. For our purposes, assume it's a separate buffer pool. As log records get written, they're written in memory sequentially, right, and ultimately you're going to run out of log buffer space, so you need to manage it. But this is a simpler buffer pool management problem: there's no replacement policy here. Because the log is a sequential file, you simply keep flushing out the older pages; internally the log records get organized into pages too, since this is block-level storage. And you keep track of something called the flushed LSN, which marks the boundary in the log between what's sitting on disk (the black portion) and what's sitting in DRAM (the gray portion, the log tail). So this flushed LSN tells us: if I'm looking for a log record with an LSN less than or equal to the flushed LSN, I need to go to disk; if it's greater, it's in memory. But the flushed LSN will also get used in a critical way to enforce the write-ahead logging protocol, okay? All right. And as if having a page LSN and a flushed LSN weren't enough, there's a little more extra stuff we need, okay? You've seen the flushed LSN and the page LSN; the most important structural change we need to the page is this recLSN, and it keeps track of the log record that first dirtied the page after it was last flushed to disk. It's always going to be less than or equal to the page LSN. The best way to think about it, and this will become clear in a little bit, is that each page is going through different version changes, and those version changes are recorded in the log records corresponding to that page. The recLSN effectively says: what's the oldest change that was applied to this page since it was flushed out to disk? It'll all become obvious shortly, but logically it means that the changes between the recLSN and the page LSN may not have made it to disk. Question: are the recLSN and the page LSN the same on disk? On disk, they can be made the same when you flush the page out, yes. So then how can they ever be different? In the buffer pool. Great question. So when can the recLSN and page LSN differ? I bring a page in from disk; at that point I have the recLSN. Now I keep applying changes to it, and I keep updating the page LSN, so they drift apart. At some point I might flush this page out to disk, and I'll go update the recLSN as I do it. But then again, if I bring it back and modify it, the two start to separate again. And in the recovery protocol, we have to be ready to bring up a page where these can be different, and deal with that. Now we're going to keep track of another data structure. Remember, if you're doing a log-based scheme, you have a log manager, and the buffer manager also has its own metadata.
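Here is a minimal C++ sketch of that page-header bookkeeping, with hypothetical names: one LSN is grabbed per page update, the pageLSN always advances, and the recLSN is set only when a clean page first becomes dirty.

```cpp
#include <atomic>
#include <cstdint>

// A sketch of the page-header bookkeeping just described. The range
// [recLSN, pageLSN] brackets the changes that may not have reached disk.
using lsn_t = uint64_t;
constexpr lsn_t kInvalidLsn = 0;

std::atomic<lsn_t> lsn_counter{1};   // the single, potentially hot, counter

struct Page {
    uint64_t page_id = 0;
    lsn_t page_lsn = kInvalidLsn;    // newest log record applied to this page
    lsn_t rec_lsn = kInvalidLsn;     // first record to dirty it since a flush
    // ... slot array, tuple data, etc.
};

// Called once per page touched by an update (physiological logging).
lsn_t log_update_to(Page& p) {
    lsn_t lsn = lsn_counter.fetch_add(1);           // one fresh LSN per page
    if (p.rec_lsn == kInvalidLsn) p.rec_lsn = lsn;  // page just became dirty
    p.page_lsn = lsn;                               // always advances
    return lsn;                                     // stored in the log record too
}
```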
We're going to keep track of one more piece of metadata in memory, called the active transaction table. It keeps track of which transactions are active and what their current state is, and in there, too, will be an LSN: the last LSN created by that transaction. Multiple transactions could be running, and every transaction that wants to create a log record gets its LSN from the one counter, but you need to know, for each transaction, which LSN was its last, and you'll see why in a little bit. The last piece of information we need is something called the master record. This stays on disk, and when we start recovery, the first thing the recovery manager does is pull up the master record, which is always at a known location on disk, hardwired into the database system. It will always go to disk, read that, and it contains the location of the checkpoint record, more precisely the LSN of the checkpoint record, because an LSN is effectively a logical pointer to a record. So the master record says: here's the checkpoint record; start the recovery protocol from this point on. All right. A lot has been thrown at you, but as we go through examples it'll become clearer. To make matters worse, real systems sometimes keep more than this, but this is all we need to cover the basics. Just like when we started with S and X locks and said that's simple, and then said, you know, real systems have a lot more stuff: real systems have a lot more stuff here too. Some of it comes from B-trees, where all kinds of crazy things can happen and you have to do special handling; we'll ignore all of that. So let's look at normal operation. On the left side here is what's sitting in DRAM (that's the symbol for it); the other side is sitting on disk, in non-volatile storage. For this part I'm not showing any data pages; today we just care about what's happening to the logs. So there's a log tail; remember, everything above the flushed LSN is sitting on the left side, in memory. We have a buffer pool with a page that has a couple of values. Everything gets a log sequence number, monotonically increasing, and when we make a change to the page, we start updating the page LSN. Every time we grab a new log record and update the page, we update the page LSN; but notice the page also has a recLSN, which was the first log record whose changes have not yet made it to disk. We also have the flushed LSN sitting in memory, telling us where the tail of the durable log is. So this is just revisiting what we said before with the diagram. The master record says where the checkpoint record is, and we'll talk about creating checkpoint records in a little bit. And essentially what you do is... yes, question. On disk, they could be different, but hold on to that for a little bit. It need not be the same, because I can recover even if it is different: if a page on disk didn't have matching page LSN and recLSN, I can still recover it, as long as I follow the write-ahead logging protocol, okay? As long as the logs are durable first. You could also ask: should I enforce that they be the same on disk? You could, but you don't have to.
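A minimal sketch of the master-record idea, assuming a hypothetical fixed file location; recovery reads this one well-known spot to find where the analysis pass should begin.

```cpp
#include <cstdint>
#include <fstream>

// A sketch of reading the master record at recovery start. The path is a
// hypothetical stand-in for the hardwired, well-known on-disk location.
using lsn_t = uint64_t;

lsn_t read_master_record(const char* path = "db/master_record") {
    std::ifstream in(path, std::ios::binary);
    lsn_t checkpoint_lsn = 0;   // stays 0 if no checkpoint was ever written,
                                // in which case recovery scans the whole log
    in.read(reinterpret_cast<char*>(&checkpoint_lsn), sizeof(checkpoint_lsn));
    return checkpoint_lsn;      // the analysis pass begins here
}
```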
The write-ahead logging protocol says the logs must hit the disk before the page does; as long as that holds, you are fine, but you may want to optimize further by fixing certain things up, right? So in general, yes, that's what you would do: you would go fix the recLSN when you flush the page out to disk, but only after the logs have hit the disk, okay? Yep. Yeah, logically the question is: when can the recLSN and page LSN of a page that's just been flushed to disk get out of sync? It depends on the implementation; generally you would not let that happen. All I'm saying is that the mechanisms are general enough that if you wanted to use them in a different way, you could, okay? We'll punt on that question; for today's lecture you can assume they're synchronized at flush time, okay? That's a good question. Other questions? All right. So now, if this buffer pool page has to be flushed out to disk, the first rule of the write-ahead logging protocol says I have to check: its log had better be sitting on disk. And we do that check. For that page, I look at the page LSN, that's all I look at, and ask: is it less than or equal to the flushed LSN? If it is, that means the log is sitting on disk, right? Which means it's safe; I can go flush the page. At that point you could update the recLSN, okay? Then I come back up over here, and in the second example my page LSN refers to log record 19. If that has not been flushed to disk, because the flushed LSN is 16, I have to flush log records 17, 18, and 19 before I can do anything else. Now you might say: I don't need to flush all of those, I only want the log records corresponding to this page. True, but we always treat the write-ahead log as a sequential file, so we flush everything up to the log record we need. We're not going to try to break the log apart; we'll keep it simple. Things can get super complicated and you'll end up with subtle bugs, so we just flush everything, treating the log file as a sequential file. So we flush everything up to 19, and once that's there, we can flush the buffer pool page out, right? That's just the first rule of the write-ahead logging protocol: before you make changes to a page permanent, its log had better have hit the disk. Question: do you flush everything in the log buffer pool, or only up to the point you want? Generally only up to the point you want, but you won't go wrong if you flush more. Usually the log is also paginated, so you might flush up to the next page boundary, but you don't have to. Correctness is ensured as long as at least everything through 19 is flushed to disk; it doesn't matter if 20 also got flushed. You'll still be correct. You're not going to do damage by writing more of the log to disk than you absolutely need to. Question: why do we need the recLSN? That's going to come when we do the recovery protocol, so just wait for that. Yep. The flushed LSN is the latest thing on disk, right? The flushed LSN... here, let me go back to this diagram; maybe this makes it a little clearer. The flushed LSN is that.
It's saying: where's the tail of the durable log? Yes, it's the latest thing we flushed to disk. It's safe; everything after it is not on disk, and I don't have a durable record of it. Why do we need to keep track of it? Going back to the other slide: do we really need the granularity of choosing exactly which logs to flush? Wouldn't it be easier to just flush everything we have? The question is, can I flush everything I have, all the time? Yes. It'll just be inefficient; you'll be doing more log I/Os than you need to. You could in fact say the log records are always created directly on disk and there is no log buffer pool at all. That would be correct, but it would be very slow. "I meant, in this case, are we not allowed to flush specifically log 19?" Right, I'm saying we will flush everything, including 17, 18, and 19, perhaps even 20, to disk. I'm not going to try to pull out just 19 because it's the change to this page; there might be records 17 and 18 that are also related to the page, and I don't want to go try to grab all of that; the protocol would get completely complicated. "So I guess my question is, why do we need the flushed LSN?" Why do you need the flushed LSN? So that you can check whether the changes have already been made durable. For example, imagine I had log records 21, 22, and 23, and just because I needed to flush 19, I also flushed 20, 21, and 22 to disk because they were on the same log page. Then the next buffer pool page comes along with a page LSN of 20, and you can say: ah, the flushed LSN tells me I don't need to flush any logs for you; you're already covered. "So we don't necessarily have to write out the logs in strictly sequential order?" The logs are going to get written in sequential order; what's the confusion? Okay: the flushed LSN is always pointing to the tail, exactly as this diagram shows; it keeps track of which log sequence numbers have made it to disk. Without it, you'd have to go read the tail from somewhere, and that would take a disk I/O; you don't want to take a disk I/O just to find out where the tail is. That's why you keep it in memory. As I said, you could come up with a protocol where there's no log buffer pool and logs always hit stable storage immediately; it would be correct, but very slow. A lot of this is efficiency mixed with making the protocol work correctly and simply, and that "simply" part is super important; that's why we're not going to optimize exactly which log records to write. We're just going to write sequentially, and by the way, sequential writes are pretty efficient anyway, so it's actually the right thing to do. Okay. So, to reset everyone: we keep writing log records; we update the page LSN every time a transaction modifies a page; and we update the flushed LSN in memory every time we flush logs, so we know where the new tail is. That's what we check against to satisfy the first rule of the write-ahead logging protocol when we flush a page out to disk. For normal execution, we're going to simplify even further: we'll assume every log record fits in a single page, since the log file is page-ified too.
If a record doesn't fit, there are fairly straightforward extensions to deal with it, but you do have to worry about it. Say a log record spans four pages: what does it mean for that log record to have hit disk? All four pages, quote-unquote, have to be atomically written to disk, and the operating system doesn't provide you that, so you have to provide other mechanisms to deal with it. Sidebar: I don't want to go down the rabbit hole of determining whether disk writes are atomic; if they're not, there are other mechanisms to deal with that too. We're going to assume all of that is fine for us today, okay? The log disk could also fail; as I said, you could mirror it, or replicate it in a geographically distributed way, to keep the log available to you. We're also going to assume we're operating with a single version of each record. If there are multiple versions, because we're doing MVCC-based protocols, things apply in a fairly straightforward way. I'm not going to test you on this in the exam, but you can convince yourself that if I'm doing physiological logging and I've made it work with records that span multiple pages, the versioning works out similarly, okay? The basic point is that all of this is compatible with the version-creation machinery we covered, with the different types of storage structures we had for version management. And of course we are doing steal and no-force. So that's a lot of simplifications, but there are ways, not always trivial, to extend this to the other kinds of complexity that might come up in a real system. Now, we're going to introduce one more type of log record, called transaction-end. But before that, to reiterate: when a transaction commits, it creates a commit log record, and when a transaction aborts, there's an abort log record, and all these log flushes go to the log file, which is just flushed sequentially, okay? When your transaction commits, the second rule of the write-ahead logging protocol says the commit log record must hit stable storage, must hit the log disk. It's at that magical moment, when you get the reply back from the log disk saying your commit log record has hit disk, that the transaction is committed. You could have written a few more log records after the commit record; we don't care. As soon as the page containing that commit log record hits disk, you can say this transaction is committed and declare that to the outside world, and recovery will take care of the atomicity and durability, okay? But we'll also do something for bookkeeping purposes that becomes super important for efficiency. We write the commit log record; that's when the commit happened. But even for committed transactions there's all kinds of cleanup to do. For example, if I'm using OCC or some variant of it, I have to clean up the private workspace where I checked out and kept all my read copies; I allocated a bunch of memory, and I'll throw all of that away and deallocate it. When I'm completely done, I write an end-transaction log record. Even if you're not doing OCC, there's often cleanup to do; you do all of that and then write the end record. This one does not have to be forced to disk. The commit log record has to be forced to disk to declare the transaction committed.
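A minimal sketch of that commit path, with hypothetical stub helpers: only the commit record is forced, and the TXN-END record is appended lazily after cleanup.

```cpp
#include <cstdint>
#include <cstdio>

using lsn_t = uint64_t;
static lsn_t next_lsn = 1;
static lsn_t flushed_lsn = 0;

// Stubs standing in for the real log manager and transaction bookkeeping.
lsn_t append_log_record(const char* kind) {
    std::printf("log: %s\n", kind);
    return next_lsn++;
}
void flush_log_up_to(lsn_t lsn) { if (lsn > flushed_lsn) flushed_lsn = lsn; }
void release_transaction_resources() {}   // locks, private workspace, etc.

void commit_transaction() {
    lsn_t commit_lsn = append_log_record("COMMIT");
    flush_log_up_to(commit_lsn);        // the magical moment: now committed
    // ... acknowledge the client here ...
    release_transaction_resources();    // cleanup can take its time
    append_log_record("TXN-END");       // appended, but NOT forced to disk
}

int main() { commit_transaction(); }
```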
The cleanup bookkeeping can be done later. We write this end log record, and it just helps us during the recovery protocol to say: you know what, I don't need to worry about this transaction at all. As far as the recovery protocol we'll look at shortly is concerned, it's the end log record that says "I'm completely done with you." So just know there is an end log record, but it doesn't have to be flushed to disk eagerly. Is the commit log record flushed? Yes; that is the magical commit point for the transaction. All right. So let's see what happens in a regular commit. Say we get to the commit point. We create the commit log record, which sits in the log buffer pool, and now we go flush it out to the log disk. Once the reply comes back, that's when the commit has actually happened, and we have the magical moment we were looking for. Then we keep processing further: we adjust the flushed LSN, of course, since we have to keep track of the tail. Notice how we do that: we flush the log first, then update the flushed LSN. It's okay if your flushed LSN is a little behind; just don't do it the other way around. Make sure the log is durable before you advance the tail, okay? And at some point we clean up and write our end-transaction log record, so the recovery protocol has to be ready to see these end-transaction records, which is important. At this point we could also choose to trim away the flushed portion of the log in the buffer; since it's a buffered log file, we now have space to create new log pages. Okay. All right. For an abort, we have to do a little more work, so I'm going to introduce one more LSN type. A transaction like T4 in the schematic example from a little while back, the one running at the time of the crash, and T3, which explicitly aborted, are both going to be treated as aborted transactions as far as ARIES is concerned, okay? For these aborted transactions we are going to do the following. We add one more field to each log record, called the previous LSN (prevLSN), and its job is to keep track of the chain. It's not strictly needed, but it makes things much more efficient. Its purpose is this: if I'm undoing transaction T4 and I look at one of its log records, an update to some page, I undo it, and I create a new log record for the undo, called a CLR (that's coming). Then I have to figure out what other change this transaction made, and the prevLSN lets me string together the log records of the same transaction. It's like a linked list through the log records. Remember, I may have hundreds of transactions happening, all doing their work, all grabbing LSNs from the same counter, so a given transaction's log records are scattered all across the log; their LSNs are monotonically increasing but not consecutive. So this is an optimization, a very important one: without it, your recovery protocol will be really slow. I'll come back to this picture in a little bit; ignore the undoNextLSN for now, it's just hinting that one more LSN type is coming. I know your head is probably spinning at this point; I promise that's the last LSN we need.
To demonstrate the prevLSN, imagine a transaction that did an update, then some other transactions ran, there's a red transaction and a blue transaction, then the red transaction did its second update, and a few log records later its third update. Its actions were one, two, and three, but its LSNs are not consecutive, right? There are gaps: a gap of one here, a gap of two here, and so on. When you undo an aborting transaction's work, you undo it in reverse order (we'll talk about that in a little more detail): intuitively, we undo this last update first, writing a compensating log record for it, which also has a prevLSN that says "that's what I undid." And when you go to undo the next one, the record you just undid has a prevLSN pointing to it, and that makes it easy during undo to answer: which action do I undo next? Oh, I know exactly where that record is. Otherwise you'd have to traverse this chain backward, and two consecutive records of one transaction could be hundreds or millions of log records apart; that would be very expensive I/O, because all of that may not even fit in memory, right? So it's just a chain: the prevLSN chains together the log records created by the same transaction. It's not essential, but it lets us go back and figure out what the log records of an individual transaction were much more easily, because the pointers help us, okay? We'll talk about undoNextLSNs and CLRs in two slides. So what does the abort scenario look like? Here I've got the tail of a log, and now besides the LSN there's a prevLSN. In this case there are no gaps in between, just to make it all fit in PowerPoint, but you saw the earlier picture: there could be gaps. When an abort happens, abort records don't have to be flushed to disk, because if a transaction's state is unknown at recovery time, you treat it like an abort anyway. So unlike commits, we don't force abort log records to disk. Eventually we'll write the end record, but in the abort case, between the abort log record and the end log record there will be a bunch of other log records: the compensating log records, okay? That's the work an aborting transaction has to do, and there will be more log records created there. Okay, question, before we get to compensating log records (and I promise that's the last log type, because I've already introduced checkpoint log records to you). No, we don't need the prevLSN for this; we need it in the recovery protocol, which is coming in more detail. "In between the aborting transaction's records, there would be log records of other transactions?" Yeah, got it. So the question is: can you clarify why abort log records don't have to be flushed to disk? The abort itself doesn't have to be flushed, and we'll see why in the recovery protocol in a little bit, because we'll just redo and then undo it. But if, between the abort record and the end record, there was a commit of some other transaction, then we would definitely have to flush up through that commit, and as we flush it, we'll have flushed everything before it, including the abort log record and maybe part of the CLRs for this transaction, and that's okay. You never go wrong flushing more to disk; we're just trying to minimize how much we have to flush, because disk I/O is expensive.
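To make the prevLSN chain concrete, here is a minimal C++ sketch with hypothetical names: interleaved transactions share one log, but each record remembers its own transaction's previous record.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// A sketch of prevLSN chaining. Each record points to the SAME transaction's
// previous record, so an abort can walk one transaction's records backward
// without scanning the whole log.
using lsn_t = uint64_t;
using txn_id_t = uint64_t;
constexpr lsn_t kNullLsn = 0;

struct LogRecord {
    lsn_t lsn;
    txn_id_t txn;
    lsn_t prev_lsn;   // previous record of the same transaction, or null
};

std::vector<LogRecord> wal;                       // stands in for the log file
std::unordered_map<txn_id_t, lsn_t> last_lsn_of;  // per-txn tail (kept in the ATT)

lsn_t append(txn_id_t txn) {
    lsn_t lsn = static_cast<lsn_t>(wal.size()) + 1;
    auto it = last_lsn_of.find(txn);
    lsn_t prev = (it == last_lsn_of.end()) ? kNullLsn : it->second;
    wal.push_back({lsn, txn, prev});
    last_lsn_of[txn] = lsn;   // the txn's next record will point back here
    return lsn;
}

int main() {
    append(1); append(2); append(1); append(2); append(1);
    // Txn 1 owns LSNs 1, 3, 5; record 5 points to 3, which points to 1,
    // even though another transaction's records are interleaved between them.
}
```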
So again: it wouldn't be incorrect to flush more, but of course, if there was a commit between those two records, the commit has to follow the second rule of the write-ahead logging protocol. Questions? Yup. Very good. The question is: during the recovery phase, can other transactions be admitted? For this entire course we're going to assume they cannot, but there are ways to allow certain things even at that stage; we'll say no for now, okay? It gets tricky pretty fast, and certainly not during the analysis phase. Maybe during the undo phase you can start to get a little crazier, but maybe not even there; you'd have to acquire the appropriate locks to lock out the pages in the dirty page table and not let anyone touch them. That's the quick answer; if it doesn't make sense after you watch the video on Zoom, come talk to me during my office hours. There are certain points where you can start admitting transactions even before recovery is completely done, but ignore that if it didn't land. Okay. So, the last type of log record (the checkpoint log records are coming, but those you already basically know) is the compensating log record, and it carries the last LSN we need, the undoNextLSN. Again, it's a convenience mechanism. If we go back to the diagram from a few slides ago: imagine I am CLR 3, which is undoing the third update that happened in this transaction, and say I crash here. On restart I start by reading the end of the log, and imagine CLR 3 made it to the log disk. This pointer simply helps me find U2 fast: the next thing I need to undo. Otherwise I'd have to follow the chain, and if I were over here, as the second undo, I'd have to follow this chain counting how many CLRs I've seen (one) and skip that many update records to get to the next one; but CLR 2 directly tells me which record to undo next. So again, it's not essential, but it helps with all of this, including, and especially, when you're recovering from a crash that happened during recovery. And by the way, that sounds like a made-up scenario, but it's extremely common, because many times when something fails, the likelihood of that failure happening again is pretty high. Especially if the failure happened because the disk went bad: once it starts to go bad, the second failure probably comes pretty fast. So it's again a convenience, pointers that tell you where to pick up work from. All right, so if you understand that diagram, the rest is pretty straightforward. Let's just go through what the example looks like as we create these CLRs. We start with transaction T1, and we're trying to abort it, so we create a CLR (it's given a record type of CLR), and it has before and after values. That CLR is for the last update that happened, and its before and after values essentially cross off what the update log record did. Then we do that repeatedly, and this is the chain I was describing pictorially: again, a convenience. If you didn't have the undoNextLSN, you could still build a correct protocol, but these log chains can be very long in a heavy transaction system. For example, many in-memory database transaction systems do hundreds of thousands, maybe millions, of transactions per second, so logs can get very large very fast.
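Putting the CLR and the undoNextLSN together, here is a minimal rollback sketch in C++ (hypothetical names, before/after images elided): undo walks the prevLSN chain in reverse, and each CLR's undoNextLSN names the next record still to be undone, so a crash mid-abort can resume without re-undoing finished work.

```cpp
#include <cstdint>
#include <vector>

using lsn_t = uint64_t;
constexpr lsn_t kNullLsn = 0;

struct LogRecord {
    lsn_t lsn;
    lsn_t prev_lsn;       // previous record of the same transaction
    lsn_t undo_next_lsn;  // CLRs only: next record still to be undone
    bool is_clr;
};

std::vector<LogRecord> wal;   // stands in for the log file; index = LSN - 1

lsn_t append(lsn_t prev, lsn_t undo_next, bool is_clr) {
    lsn_t lsn = static_cast<lsn_t>(wal.size()) + 1;
    wal.push_back({lsn, prev, undo_next, is_clr});
    return lsn;
}

// Undo the transaction whose newest log record is `last_lsn`, emitting one
// CLR per undone update.
void rollback(lsn_t last_lsn) {
    lsn_t to_undo = last_lsn;
    lsn_t tail = last_lsn;              // threads CLRs onto the txn's chain
    while (to_undo != kNullLsn) {
        LogRecord r = wal[to_undo - 1]; // copy: push_back may reallocate
        // ... apply r's before-image to the page here ...
        tail = append(tail, r.prev_lsn, /*is_clr=*/true);
        to_undo = r.prev_lsn;           // everything newer is already undone
    }
    append(tail, kNullLsn, /*is_clr=*/false);   // finally, the TXN-END record
}

int main() {
    lsn_t u1 = append(kNullLsn, kNullLsn, false);   // three updates by one txn
    lsn_t u2 = append(u1, kNullLsn, false);
    lsn_t u3 = append(u2, kNullLsn, false);
    (void)u3;
    rollback(u3);   // emits CLR3, CLR2, CLR1, then TXN-END
}
```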
All right, and that's the picture we're now familiar with: CLRs are the records that carry these two log pointers. So, we write the abort log record when we start the abort, then we undo the transaction's updates in reverse order. As in the slide: if I had an update followed by a second update and a third update, we undo the third update first, then the second, then the first; we undo in reverse order to get the correct result. And sidebar, if you don't understand the next sentence, just let it be: why do you have to undo strictly in reverse order? Because this update and that update may have been updating overlapping bytes, and if you didn't undo in exact reverse order, you wouldn't reconstruct the world as it was before. If you didn't get that comment, let it be, because it would be a 30-minute sidebar. Yep, question. The prevLSN points to the previous log record created by that transaction. In this example, the prevLSN of CLR 1 is 11, saying: I'm CLR 1, and LSN 11 was my previous log record in transaction T1; there are a whole bunch of other transactions' log records in between, from 11 to 26. It's just reconstructing my chain of log records for me; otherwise we'd have to traverse the whole log backward all the time, which is very expensive. "Why do we need the undoNextLSN?" Imagine I'm at CLR 2, and that's where I pick up because the transaction was partially aborted and the system crashed. From CLR 2 I now need to find U1 to go apply CLR 1. I could do that by chasing prevLSNs, but imagine that instead of three updates there were a million updates in this transaction: I'd have a long linked list to traverse, and these linked lists are traversed on disk. That would be very slow. The undoNextLSN just gives me a fast pointer to get there. So it's an efficiency feature, a non-essential optimization. Yep, yeah. No, this is a CLR record, so it will carry its before and after image, very much like your update records; it basically contains the same things an update log record does. Sorry? Because it's a special type, CLR, which has to be treated differently in the recovery protocol. It's semantically like an update log record, except it says "I am an undo." Why does that matter? Because when you recover, take this example: you don't want to undo an undo. You could build a protocol that does, but really the better answer is that you don't want to undo an undo unnecessarily. So: here's the update log record; I'm going to undo it, and the record I write to track that is, for all purposes, similar to an update record, except it also has the undoNextLSN. Only CLR log records have that extra piece of information; other log records don't. So a CLR is a little bigger, and while the diagram makes it look like all log records, even commits, are the same size, log files usually hold variable-length records, and you won't waste fields unnecessarily, because you're trying to be efficient with log record size. Okay. All right. So now let's get to the checkpoints. We'll start with the very simple checkpoint. As I said at the beginning of the lecture, the checkpoint's purpose is to bound recovery: the master record tells you where to start the recovery process from, and the checkpoint limits the amount of work you have to do.
So the simplest way to do a checkpoint — nobody actually does this, I'm just introducing the idea — is to say: all right, I've got a million transactions running in the system right now. I'm going to stop any new transactions from coming in, and I'm going to drain out all the running transactions — let them finish. They'll create their log records and flush whatever they need to flush to disk, and then I have a stable state of the world. At that point I can take all the dirty pages in the buffer pool — I've stopped everyone from running — and flush them out to disk. So everything that is committed is on disk, everything is clean, nothing to undo, nothing to worry about: I've got a clean snapshot of a database that is effectively correct. The challenge is that if I've got a buffer pool of a million pages, that means writing a million pages out — and sometimes buffer pools are even larger than that, because you have terabytes of memory. You will have stopped the world for a very long time. There was a question earlier about when transactions can start to run: you don't want a checkpoint, which is an efficiency mechanism, to stop the world for hours.

The other way you could do it — which is also a bad protocol — is this. Say I've got three pages in memory, and a running transaction has updated page three and will eventually go update page one. I will not stop everything as before — no draining business — but I will pause everything, write the checkpoint record, and go through all the pages — one, two, three, usually in sequential order — flushing them out to disk. But once my checkpoint is done, transaction T1 resumes and updates page one. Effectively, what you have on disk is not a stable snapshot: it contains partially committed changes. So it's not as clean, and your recovery protocol is still complicated, but you only pause for a small amount of time. A little better, but still pretty bad.

We're going to do something that makes the recovery protocol a bit more complicated but makes checkpointing a lot faster. To do that, we introduce the final two data structures: an active transaction table (ATT) and a dirty page table (DPT). Every time a new transaction comes into the system, we create its entry in the active transaction table. The entry records the transaction ID that was assigned to you — a different number from the log sequence number — your status, and the lastLSN: the LSN of the last log record you wrote as a transaction to make some change. Initially that field is empty. The status can be active, committed, aborted, or ended — or, during recovery: running, completed, or needs-to-be-undone. The dirty page table keeps track of every page in the buffer pool that has been dirtied but whose changes have not yet been flushed to disk — and you can see how it would be pretty easy to maintain this information in the buffer pool manager that you built, okay? For each page, the dirty page table records the recLSN: the LSN of the log record that first caused that page to become dirty. That recLSN is kept in the table, and when you flush the page out, you update the table accordingly.
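Here's a minimal sketch of these two tables; the field names (status, last_lsn, recLSN values) and the two maintenance hooks are illustrative assumptions, not the actual bookkeeping of any real buffer pool manager:

```python
# Illustrative ATT and DPT, maintained as transactions run.
active_txn_table = {
    # txn_id -> status ("R" running / "C" committed / "U" needs undo)
    #           plus the lastLSN this transaction wrote
    "T1": {"status": "R", "last_lsn": 14},
}

dirty_page_table = {
    # page_id -> recLSN: the first log record that dirtied this page since
    # its last flush; everything earlier for this page is already on disk
    "P22": 9,
}

def on_update(txn_id, page_id, lsn):
    """Called whenever a transaction writes an update log record."""
    entry = active_txn_table.setdefault(txn_id, {"status": "R", "last_lsn": None})
    entry["last_lsn"] = lsn
    dirty_page_table.setdefault(page_id, lsn)  # only the FIRST dirtying sets recLSN

def on_flush(page_id):
    """Called when the buffer pool flushes a page: it is clean again."""
    dirty_page_table.pop(page_id, None)
```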
Like we talked about, you can update the table at that point — when the page is flushed — which is usually what you'd end up doing. So now we have a slightly better checkpoint. Here's the log on the right-hand side: as you can see, there's a bunch of begins and commits, and a checkpoint log record gets created. Now, this checkpoint log record carries copies of the ATT and DPT as part of the record, so it is big. Imagine I have a million active transactions — that ATT is a million entries long, so it could be megabytes in size. The dirty page table could be large too: with a very large buffer pool where everything is dirty, that could be megabytes, hundreds of megabytes, maybe more, okay? But what I'll do is store my ATT and DPT in that checkpoint record. In this case, let's assume there were T1 and T2, and at the first checkpoint — assuming P11 had already been flushed to disk — the only thing left in my DPT when I take the checkpoint is P22. Only what's in the DPT that I'm maintaining gets stored. And I'll keep creating these checkpoints periodically, based on some criterion — maybe every five minutes if I really want recovery to be fast — and whatever is active in those tables at that moment is what gets recorded. No real rocket science here; the magic starts coming in a little bit.

Okay, so the next thing is what people really end up doing, and it's called a fuzzy checkpoint. The previous technique was much better than stopping the whole world and draining all the transactions, but it still required us to grab the dirty page table and the transaction table and write all of that out to disk before finishing — and that disk I/O is going to be super expensive. So what if we could do something really simple: I still need to pause, but only for a very small amount of time. Say I've got a 100-megabyte ATT and a one-gigabyte DPT. I'll latch both structures for a short amount of time and make copies of them in memory — and a copy in memory is orders of magnitude faster than writing to disk. So yes, I stop the world, but very briefly, and then I write those copies to disk in a separate checkpoint record. When I start my checkpoint, I write the begin-checkpoint record into the log buffer — it doesn't have to be flushed — make my copies, and then prepare the end-checkpoint log record, which is the big one. But right after the begin-checkpoint record, I've already unlatched both the ATT and DPT: I have a consistent snapshot of both, and other transactions can keep making changes to the live tables. Eventually I'll get a chance to write my end-checkpoint log record, this massive record with all that information. None of this has to be flushed to disk immediately, by the way — though you may want to flush the end-checkpoint record to really get the recovery guarantee, and that's typically what is done: the end-checkpoint log record is often flushed to disk. If it didn't make it to disk, you'll just have to start recovery from further back. So that's what we'll do. And if you look at this example, you can now see a begin-checkpoint log record, which is where those copies were made.
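Here's a minimal sketch of that fuzzy checkpoint, assuming the illustrative tables from before; table_latch, log_append, and the record layout are stand-ins, not real APIs:

```python
# A sketch of a fuzzy checkpoint: latch briefly, snapshot the tables in
# memory, unlatch, then write the big end-checkpoint record at leisure.
import copy
import threading

table_latch = threading.Lock()

def fuzzy_checkpoint(log_append, att, dpt):
    log_append({"type": "BEGIN_CHECKPOINT"})
    with table_latch:                  # stop the world, but only long enough
        att_copy = copy.deepcopy(att)  # for two in-memory copies -- orders of
        dpt_copy = copy.deepcopy(dpt)  # magnitude cheaper than any disk write
    # Transactions may now keep mutating the live tables; we write out the
    # consistent snapshot without holding anything.
    log_append({"type": "END_CHECKPOINT", "att": att_copy, "dpt": dpt_copy})
    # Typically you'd flush the log here, then point the master record at the
    # BEGIN_CHECKPOINT record's LSN, since that's where recovery starts.

# Tiny usage example with an in-memory log:
log = []
fuzzy_checkpoint(log.append, {"T2": {"status": "R", "last_lsn": 14}}, {"P22": 9})
```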
You only pause for the short amount of time it takes to make that copy. The disk I/O — even just preparing that big end-checkpoint log record — may take a bunch of time, but the checkpoint is correct as of the begin-checkpoint's position in the LSN sequence; the information for it simply arrives later in the log. So all we did is split the work into two parts, letting the checkpointing process be more efficient and blocking the world for as little time as we can. There are things you can do to block it for even less — with a million entries, you can latch and copy the tables in portions and still have everything come out consistent. So there are optimizations to make this even better and stop the world for even less time, but we've already gotten a lot better than where we started, and by and large the fuzzy checkpoint gives you the biggest win over the simpler kinds of checkpoints we talked about. And then we record the checkpoint: in the master record, you store the begin-checkpoint record's LSN, because that's really where the checkpoint happened, okay? So we flush these things out to disk — typically the end-checkpoint record — while recording in the master record the begin checkpoint's location, because that's where recovery needs to start from. Okay? All right.

This slide is basically just saying the ATT and DPT in the checkpoint are constructed from those copies. If you look here, the ATT holds T2, because T1 ended — it no longer needed to be in the ATT, so it could be removed. In the ATT that we maintain, seeing the TXN-END record is what lets you remove a transaction from the table. So T2 is in the ATT in that end-checkpoint log record. Even if T2 had committed — it doesn't have a commit here, but if it had a commit and the end record wasn't there — it would still be in the ATT. The end log record is purely for ATT management: it removes the transaction from the ATT, after which you don't have to worry about it. And again, there are details as to why you do it that way; we're not going to worry excessively about them in this class, beyond the statement I just made. There are nuances even there. All right.

So now we get to the recovery protocol. We can go pretty fast, even though there are only 10 minutes left, because if you understood everything so far, it's really simple — and that's all the machinery we needed to make ARIES work well. We already said there are three phases, so let's get to it. We've got all these log records; we followed the write-ahead logging protocol; we've got the different log record types, and we've got checkpoints; and now we're going to recover from a crash. At restart, we consult the master record to figure out where the begin-checkpoint log record is. If one doesn't exist, we start from the first log record in the file, okay? Then we analyze the log forward, from that begin checkpoint all the way to the last log record we have, to reconstruct the ATT and DPT — you'll see that on the next slide. Then, from the dirty page table we reconstructed, we find the smallest recLSN — the oldest log record that might have left some page dirty in the buffer pool — and reapply changes from there. We reapply all of those changes, even though some of them may not strictly need reapplying, to keep the algorithm simple.
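Putting the three phases in one place, here's a minimal sketch of that restart sequence; the phase functions are stubbed here and sketched individually below, and master_record is an illustrative dict, not the actual on-disk structure:

```python
# A sketch of the top-level ARIES restart flow described above.
def analysis(log, start_lsn): return {}, {}   # rebuild (ATT, DPT) -- stubbed
def redo(log, start_lsn, dpt): pass           # repeat history       -- stubbed
def undo(log, att): pass                      # roll back losers     -- stubbed

def recover(log, master_record):
    # 1. Where do we start? The master record points at the most recent
    #    begin-checkpoint; with no checkpoint, start at the first record.
    start = master_record.get("begin_checkpoint_lsn", 0)
    # 2. Analysis: scan forward, rebuild the ATT and DPT as of the crash.
    att, dpt = analysis(log, start)
    # 3. Redo: repeat history from the oldest recLSN in the DPT.
    redo_start = min(dpt.values(), default=start)
    redo(log, redo_start, dpt)
    # 4. Undo: roll back every transaction still marked 'U' in the ATT.
    undo(log, att)
```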
As you revisit this lecture, you'll say: ah, I could have optimized something here. To keep it simple, we just reapply — and the whole world does that. If you try to optimize the hell out of this, a protocol that's already complicated gets so complicated that you're very likely to end up with a subtle bug that breaks your system. So don't muck with it. Just reapply; maybe you do a little extra work, but it's worth it. Take more frequent checkpoints instead — that helps flush pages from the buffer pool, which reduces all this work. Then we do the undo phase, which goes backwards, up to the oldest log record of any transaction that was active — and therefore needs to be aborted — at the time of the crash. So let's go through each of these.

I'll explain the analysis phase with a picture, and then come back to the slide. The analysis phase starts — say we have a begin-checkpoint record and an end-checkpoint record — and reconstructs the ATT and DPT. We start at the begin-checkpoint record; that's where the master record points us, right? Then we come to this record and say: oh, this is the first time I'm seeing transaction T96. I'm in the recovery protocol; my ATT has been destroyed — it was lost because it lived in DRAM — and I'm reconstructing it. So I'll say: the transaction table at that point must have had T96 in it. But at this moment I don't know what happened to T96 — maybe I'll see a commit later, maybe I won't — so for now I mark it U, meaning it needs to be undone. I also notice that page 33 was modified, so it must have been in the dirty page table, and I record the recLSN, which is log record 020. Remember, there is also an LSN in the page itself, but we are not fetching any pages — right now we are only looking at the log. So I'll say: I'm going to take 020 as your recLSN; that may be conservative, but it will be correct. Then I keep moving forward and hit the end-checkpoint log record, which says: by the way, here's what my ATT looked like. I see there's a T97 in there, so I add it and mark it U. I also look at the dirty page table that's part of that log record and update my DPT to include page 20 — I'm assuming this 008 came from that checkpoint DPT; we couldn't fit it all on the slide, all right? Then I get to this commit log record, which tells me T96 committed; it gets a C label. Then we keep going and hit the TXN-END record, which says: I can actually remove T96 from the transaction table — I don't need to worry about it anymore, even though its pages may still have changes sitting here, and that's okay. The ATT just needs to track the transactions I still have to worry about, okay? And that's all the analysis phase does: it reconstructs the ATT and DPT. It may be overly conservative, but it will be correct as of the time of the crash.

The question is: transaction T96 seems to have modified — or could have modified — P33, so why do we keep P33 in the dirty page table? Because T96 having committed doesn't matter for the page: P33's changes may not have made it to disk yet, and we will flush them. Just wait for the redo phase.
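Here's a minimal sketch of that forward scan, using the same made-up record format as the earlier sketches — illustrative, not the course's actual implementation:

```python
def analysis(log_records):
    """Forward scan from the begin-checkpoint: rebuild the ATT and DPT."""
    att, dpt = {}, {}
    for rec in log_records:                      # in increasing LSN order
        kind = rec["type"]
        if kind in ("UPDATE", "CLR"):
            # First sighting of a txn: assume undo ('U') until proven otherwise.
            entry = att.setdefault(rec["txn"], {"status": "U"})
            entry["last_lsn"] = rec["lsn"]
            # First record to dirty a page becomes its recLSN (conservative).
            dpt.setdefault(rec["page"], rec["lsn"])
        elif kind == "END_CHECKPOINT":
            # Fold in the snapshot without clobbering newer scan information.
            for txn, entry in rec["att"].items():
                att.setdefault(txn, dict(entry))
            for page, rec_lsn in rec["dpt"].items():
                dpt.setdefault(page, rec_lsn)
        elif kind == "COMMIT":
            att.setdefault(rec["txn"], {})["status"] = "C"
        elif kind == "TXN_END":
            att.pop(rec["txn"], None)            # fully done; stop tracking it
    return att, dpt
```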
The ATT is simply saying: which transactions do I still need to worry about? Removing T96 doesn't undo transaction T96's work — T96's work will show up in the redo phase. So hold on to that for a minute. The ATT only says which transactions I still need to deal with; it does not change the commit status of any transaction. Okay.

All right, the redo phase. It's best to go through it with an example, and I'll come back to the slide if it's not clear. Actually, I do need this part here; this part is important. In the redo phase, we reapply the log records from a starting point to the end, and the critical piece is that we reapply all of history. You saw the dirty page table we reconstructed, with page 33 and page 20, and it tells me the recLSNs are 020 and 008. What does that 008 tell you? It's the recLSN saying: for page 20, log record 008 was the first record, after the page was last clean, that made some change to it. That change may or may not have made it to disk, but everything for that page before 008 is on disk — I know that for sure. Now, 008's change might also be on disk, and if 010 was a log record for that page, that might also be on disk; I don't know until I fetch the page. But it tells me I never need to go back further than the smallest of those numbers. There could be a large number of log records to reapply. And imagine page 20 already had 008 applied to it: when I bring the page up, I look at the pageLSN, and if it says 010, I will not reapply the 008 log record — I'll just skip it. At this point, coming out of the analysis phase, all I've done is collect that information, but the dirty page table tells me I do not need to go any further back than the min of the recLSNs — in this case, min(020, 008). From there I just blindly start reapplying all the log records: that's my redo phase.

And in this redo phase we do the following. As we see each update log record, we ask: is that page in the dirty page table? At this point we're still just going through the log from that 008 in the example. If the page this log record refers to is not in the DPT, its changes are safely on disk — I don't need to worry about it. If it is — and note that so far we have only been reading logs — now I go fetch the page in question, which is expensive, right? So we try to avoid that as much as possible. Then we check: the affected page is in the dirty page table, but if the LSN of the log record we've just encountered in the redo phase is less than or equal to the pageLSN on the page we just fetched from disk, then this change has already been applied — the one number comes from the page itself, the other from the log record we are trying to redo. Otherwise, I apply it, and now I have a modified version of that page in the buffer pool, right? So I'm redoing all the work I did as the transactions were going forward, and I proceed through all of it. I'll explain with an example, and then talk about the undo phase. So here I've got this checkpoint log record, and I've got this abort — sorry, I'll have to do the undo phase to make that example work.
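A minimal sketch of that redo loop, continuing the same illustrative record format; fetch_page stands in for the buffer pool and is an assumption:

```python
def redo(log_records, dpt, fetch_page):
    """Repeat history from the oldest recLSN; skip what's provably on disk."""
    if not dpt:
        return
    start = min(dpt.values())                    # oldest change possibly lost
    for rec in log_records:
        if rec["type"] not in ("UPDATE", "CLR") or rec["lsn"] < start:
            continue
        page_id = rec["page"]
        if page_id not in dpt or rec["lsn"] < dpt[page_id]:
            continue                             # change made it to disk already
        page = fetch_page(page_id)               # only NOW pay for the disk fetch
        if rec["lsn"] <= page["page_lsn"]:
            continue                             # page already reflects this change
        page["data"] = rec["after"]              # reapply the after-image
        page["page_lsn"] = rec["lsn"]            # note: CLRs get redone too --
                                                 # "repeating history" includes undos
```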
So, to recap: the analysis phase reconstructed the dirty page table and the ATT as of the time of the crash; redo reconstructs the database as of the time of the crash. The last thing we need to do is undo all the transactions that were explicitly aborted — they'll have a U in the ATT — and all transactions that were still running at the time of the crash, which also carry a U, because when we started the analysis phase every transaction began with a U. So all the U's in the ATT now have to be undone. The transaction table tells us the lastLSN of each transaction that needs undoing. Say those are 9 and 14: we take the max of the two, which is 14, undo it, and write a CLR. From that CLR — for the transaction we just undid — we get the next update log record to undo; say that's 6. Now we have 6 and 9 as candidates, and we do 9 next. So we can interleave the undo work of different transactions, always picking the largest LSN first, in this strict reverse order. That's what makes it safe: if two transactions had changes to overlapping bytes of data, you still reconstruct things correctly. So this reverse processing has to happen by always taking the max LSN that remains to be undone. You can look at the slide later to get the idea.

Now let's go to this example — we'll start, and I'll see if Andy wants to redo this example in the next class. I'll take just one more minute; I know I'm over time, so I'll stop soon. We create the compensation log record for transaction T1: we undo the update that happened here, and we set the undoNextLSN to 010, saying that's the next one to undo. That record was for T2, so we simply write T2's end record — and this is the prevLSN chain — and now we can go undo the other transaction, and so on. So that was the abort for T2, followed by the changes for T1. And if I crashed over here, I can come back: it's the same thing as before. Imagine I'm in the undo phase and crash here; I've reconstructed my ATT and DPT — assume that's correct — and I've got 50 and 60 as the LSNs I need to undo next, because those are the lastLSNs. We take the max, which is 60, find that log record, undo it, write its CLR, and keep going until we're done with all of it — and then we are done. The nice thing is that even if the system crashes again while we're doing restart, because we are repeating history and redoing everything, we're safe — it all works out.
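Here's a minimal sketch of that undo loop, in the same illustrative format as the other sketches — apply_before_image and log_append are stand-ins, not real APIs:

```python
def undo(log, att, apply_before_image, log_append):
    """Roll back every loser ('U') transaction, newest change first."""
    to_undo = {e["last_lsn"] for e in att.values() if e["status"] == "U"}
    while to_undo:
        lsn = max(to_undo)                       # strict reverse order across txns
        to_undo.discard(lsn)
        rec = log[lsn]
        if rec["type"] == "UPDATE":
            apply_before_image(rec)              # restore the before-image
            log_append({"type": "CLR", "txn": rec["txn"],
                        "undo_next": rec["prev"]})
            nxt = rec["prev"]
        elif rec["type"] == "CLR":
            nxt = rec["undo_next"]               # skip: never undo an undo
        else:
            nxt = rec.get("prev")
        if nxt is not None:
            to_undo.add(nxt)                     # more of this txn left to undo
        else:
            log_append({"type": "TXN_END", "txn": rec["txn"]})  # chain exhausted
```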
All right, I know I'm over time. I'll see if Andy can go through this full example one more time, because it would probably make sense to spend five minutes on it. I'll let you read slides 44 and 45 for additional things you can do to improve performance, since you might have questions related to that. But hopefully you got a decent idea of the mechanisms of ARIES: logging that follows the write-ahead logging protocol, plus this three-phase recovery algorithm. Andy will start the last three lectures, which are about distributed databases. Sorry I went a little over time, but we'll end with music, and I hope I see you guys around the database corridor. This shit is gangsta.