Alright Shubham, I think you have a couple of announcements. You had something to share with the class? Yeah, happy Diwali to everyone. How was your Diwali? It was good, but it's shifted, right? The Diwali in India just happened, but then there's a US Diwali happening the next weekend, because everyone has to do it twice. That was nice, and my girlfriend visited me over the weekend; she was really interested in finding a good way of sending flowers, and we did come into the lab, so there was an interesting conversation, thanks to Andy. And I might be playing a couple of shows next weekend, so I'll post updates for you all on that as well. Enjoy. Great. Wonderful. Thank you.

Alright, we have a bunch to cover today. We have a few things left over from the multi-version concurrency control part that we started discussing last class, and I'm going to spend about 20-25 minutes on MVCC. There are some portions of this that, depending on how far we get today, I might just relegate to reading on your own. That won't be material for the exam, but I'll tell you when we hit that point, okay? Then, for the bulk of today, we'll get started on the recovery component, which is going to be split across today and the next lecture.

So, if you remember, we were doing multi-version concurrency control. If I go back a couple of slides, we were essentially creating chains of objects as changes happened to those objects. Let's go to slide seven; there you go. You saw how A was being read and written a bunch of times, and what was happening in the database is that we kept track of all those versions. We had a begin and end timestamp that told us when each value is valid: A0 is valid from 0 to 1, from 1 to 2 there's a different value, and from 2 onwards until some time in the future there's a third value. That's what multi-versioning does: it creates these different versions, and the readers come in and find the right version to work with. Multi-versioning is really, really popular; in fact, just about every modern system uses multi-version concurrency control.

Now we'll dig into the details of the other components we need besides MVCC itself. We need a version management component, as we talked about, and with those versions you can decide which version a reader is allowed to see based on its timestamp. But we have a couple of other things to discuss about the MVCC design considerations, so we're going to try to walk through most of this today and see how far we get; we're time-bucketing about 25 minutes for this component.

We've talked about a whole bunch of concurrency control protocols: optimistic concurrency control, two-phase locking. So the natural question you may be asking is: how are those related to MVCC? The best way to think about it is that MVCC is a mechanism that tells you how to maintain multiple versions; you still need to protect that mechanism with a concurrency control protocol. Multi-versioning basically creates a linked list of different versions, and at some point some writer is potentially writing and creating a new version, as we saw in those examples. So obviously we have to guard against things like two people trying to write and create new versions at the same time.
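To make those version chains and their begin/end timestamps concrete before we go further, here's a minimal sketch of what one entry in a chain could look like. The names (Version, IsVisible, and so on) are made up for illustration; this is not BusTub's actual layout.

```cpp
#include <cstdint>
#include <string>

// Hypothetical types for illustration only.
using Timestamp = uint64_t;
constexpr Timestamp kInfinity = UINT64_MAX;

// One physical version of a logical tuple. A0, A1, A2 from the slide
// would be three of these, chained together.
struct Version {
  std::string payload;     // the tuple contents for this version
  Timestamp begin_ts;      // version becomes visible at this timestamp...
  Timestamp end_ts;        // ...and stops being visible here (kInfinity = still current)
  Version *next{nullptr};  // link to the adjacent version in the chain
};

// A version is the right one for a reader iff the reader's timestamp
// falls inside the version's [begin, end) lifetime.
bool IsVisible(const Version &v, Timestamp reader_ts) {
  return v.begin_ts <= reader_ts && reader_ts < v.end_ts;
}
```

In these terms, the A0 from the slide would be a Version with begin_ts 0 and end_ts 1.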
Like, if you have a singly linked list, only one person can add to that linked list at a time, so only one person can be creating a new version at a time. So in addition to the mechanism that keeps these versions, we need a concurrency control mechanism. What the versioning gives us is this: if I'm a transaction reading something and my timestamp allows it, I can read an older version, and readers can get past writers. Writers don't block readers, as long as the readers have the right timestamp, the appropriate place in the serialization order, to pass through. With two-phase locking there is only one version: if I grab the lock, everyone has to wait, including readers. That's why we started relaxing things with intention locking and the like, but in the end, writers will still block readers under two-phase locking.

So multi-versioning, which is a mechanism that decides what can be read at what time by keeping different versions around, can be combined with the concurrency control protocols we've seen. Combine it with optimistic concurrency control, and you run the three-phase protocol except that the new versions live in your private workspaces. Combine it with two-phase locking, and you use locks for the objects being created, so writers don't interfere with writers; and if a reader's timestamp says it must read a version that is currently being written, it waits for that lock. You can also do something very simple with timestamp ordering, where some mechanism picks a timestamp and that effectively determines the serial order. You can build even simpler protocols on that idea: I grab a timestamp, say at the beginning of the transaction, and I only read versions as of my timestamp, which is roughly what snapshot isolation does.
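As a sketch of that snapshot-style read, reusing the hypothetical Version struct and IsVisible helper from above: the transaction picks one timestamp at begin and only ever returns the version that was current as of that timestamp.

```cpp
// Snapshot-isolation-flavored read: writers may be appending new versions
// concurrently, but this reader sails past anything newer than its
// start timestamp. Assumes the Version/IsVisible sketch from earlier.
const Version *SnapshotRead(const Version *head, Timestamp my_start_ts) {
  for (const Version *v = head; v != nullptr; v = v->next) {
    if (IsVisible(*v, my_start_ts)) {
      return v;  // the one version whose lifetime covers our timestamp
    }
  }
  return nullptr;  // record didn't exist yet at our timestamp (or was deleted)
}
```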
What we're going to concern ourselves with now are the remaining topics. There are five things to discuss in addition to MVCC as the mechanism that creates versions. The first was concurrency control, which we just covered; now we start knocking the others down. Next up: how is the storage for these versions maintained? We're maintaining a singly linked list; what are the different ways of organizing it, just as we looked at different ways of organizing records, pages, and structures like that? So, version storage. We're creating version chains with begin and end timestamps attached, and there are different ways to lay those chains out. As you'll see for all the remaining MVCC topics, each of these design decisions comes with some number of options, three, four, five, and each option has its pros and cons. So the first question with version storage is: how do we store these versions? Guess what, there are three ways to do it, all with pros and cons and some historical context.

The first is append-only storage, where I create the chains much like what we've been seeing in the diagrams so far: the version copies of a record live in the same table as the record itself. Time-travel storage is used by systems that didn't have MVCC but then realized that, wow, with MVCC you can get a lot more concurrency, especially for readers; so they bolted on a mechanism to keep versions by creating a second copy of the table where all the old versions are maintained. We'll see that in a little bit. The preferred way is delta storage; if you're building an MVCC system from scratch, that's roughly what you'd use.

So what's the simplest? It's basically what our diagrams have been. Here I've got two records, A and B, and think of each entry as the entire record, even though only one value is shown, like 111 in the first one. That's what's stored in the table, and as you create new versions, you create each one as a new record in that same physical table, the same physical file that keeps track of all the records. The new record has the same schema, and all of them carry an additional field, shown as a pointer, which holds that begin and end timestamp and a bunch of other information associated with it. You create the new version and maintain the pointers internally. You can think of it as your slotted-page organization being used to create a file in which A0 is one record ID and A1 is a different record ID, with an extra pointer field that chains them together and lets you maintain your singly linked list. Pretty straightforward, and easy to implement if you started with a slotted-page organization and designed in these extra fields for the begin and end timestamps and the pointers in the first place.

There's an interesting question: the version chain is a singly linked list, so where does a new version go? At the end of the chain, or at the beginning? You decide in the implementation, and there are pros and cons from the perspective of what is easy to implement. It's very easy to implement a chain that goes from oldest to newest, because you just create a new record, as we were doing up here: a new value needs to be created, a new tuple, you create it at the end and connect everything to it. So oldest-to-newest is the easiest, and that's the mechanism that gets used in the last project assignment you have in BusTub. The obvious downside is that if I have to find the latest version, I have to traverse the chain, and notice these chains may be spread across multiple pages, so you may have to go through multiple pages to get to the record you're interested in. The other way is newest-to-oldest, but that's more difficult to implement: logically you have to create the new record and move things around to get the chain to work in the reverse direction. And you'll see there's a subtle point with both of these as to how an index refers to the record ID.
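Here's roughly what an oldest-to-newest append could look like, again in the illustrative sketch world from earlier (the real BusTub interfaces differ):

```cpp
// Append-only storage, oldest-to-newest (O2N): the new full copy of the
// record goes into the same table as a fresh tuple, linked at the tail.
// Note the head never moves, so whatever points at the head stays valid.
Version *AppendNewVersionO2N(Version *head, std::string new_payload,
                             Timestamp commit_ts) {
  Version *tail = head;
  while (tail->next != nullptr) {  // this walk may cross several pages
    tail = tail->next;
  }
  tail->end_ts = commit_ts;        // close out the previously current version
  auto *fresh = new Version{std::move(new_payload), commit_ts, kInfinity};
  tail->next = fresh;              // chain grows at the tail
  return fresh;
}
```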
So, if I create a three-version chain for record A, as we saw in the slides a little while back, and I have an index, which version does it point to? The implications of oldest-to-newest versus newest-to-oldest will become more apparent as we go, but there are only two ways of organizing this list: it can go in one direction or the other. In newest-to-oldest, the question is how you update the chain. Imagine a singly linked list maintained in append-only storage going from oldest to newest; if it's the other way around, you have to flip the pointers around, so there's more pointer management you end up doing. You create a new record and it becomes the head, and what does it mean to be the head? It's what someone else refers to when they come from outside, most often from an index. This will become clear in a second when we get to the indexing part. The thing to know for now is: when I get to a record from somewhere else, do I land on the oldest or the newest? Depending on the scheme, to find the newest record I may have to chase down the chain; if it's newest-to-oldest and I'm only looking for new stuff, which is often the most common pattern, I find it right away. And these chains can span pages, so chasing them down can be very expensive. Okay.

The second type of storage is time travel, and systems that bolted MVCC onto an engine that didn't have it before will use something like this, because it requires the least disruptive change to the existing storage structures. Here, if I have a record and I create a new version A2, it goes in the main table, and I also have a time-travel table where I put the old values, and the old values keep accumulating in that second table. So if I'm a reader whose timestamp only allows it to read the older version A1, I start from the main table and go down to the second table, which is a different file. Obviously that's more expensive: you won't get locality chasing that down, you're crossing table boundaries. But it doesn't disrupt the main table, which holds a single copy, the most recent copy, as you can see here. And that most recent copy may not be the one that eventually gets committed; we'll start talking about recovery protocols a little later, but it's the most recent copy in the chain that exists at that point in time.

If you do time-travel storage, doesn't it implicitly give you newest-to-oldest? Yes, that's correct; good observation. As you can imagine, if you had a 30-year-old database system and all of a sudden MVCC is the way to get more concurrency, especially for readers, this is the least disruptive change you could make, because your indices can keep pointing to the latest copy in the main table, and the record ID hasn't changed; it's still in the same slot on the same page. Why is it implicitly newest-to-oldest? Because the newest is there in the main table. Although here in this diagram, as you can see, it's a little weird: A2 is the newest, which points to A1, which is not necessarily in timestamp order for the rest of the chain, and you can flip it around and arrange it in a different way.
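A minimal sketch of a time-travel update, under the assumption that the main table keeps exactly one in-place master copy per record and old copies get appended to a separate history table (structures are illustrative):

```cpp
#include <vector>

// Main-table row stays at the same page/slot forever; history accumulates
// in a second table. Reuses the Version sketch from earlier.
struct TimeTravelTables {
  Version main_row;                  // the master copy, updated in place
  std::vector<Version> time_travel;  // append-only table of old copies
};

void TimeTravelUpdate(TimeTravelTables &t, std::string new_payload,
                      Timestamp commit_ts) {
  // 1. Copy the current master out to the time-travel table first, so a
  //    reader with an old timestamp can still find it there.
  t.main_row.end_ts = commit_ts;
  t.time_travel.push_back(t.main_row);
  // 2. Overwrite the master in place: the record ID never changes, so
  //    indices keep pointing at the same page and slot.
  t.main_row = Version{std::move(new_payload), commit_ts, kInfinity};
}
```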
So it can be a hybrid between newest-to-oldest and oldest-to-newest, but the newest copy is always going to be in the main table, which is where you start the access, and that's the main point. A2 is the newest one here, but you can imagine a weird scheme where A2 sits in the main table pointing at something else: the chain becomes A1, A2, A3, with A3 in the main table pointing to A1 in the time-travel table. You can come up with schemes like that which are not purely newest-to-oldest, but if you had to say which one it most resembles, the main table holds the newest stuff and that's where you start your access. So you can do all kinds of hybrids, and with all of this I'm going to move quickly, because you can find something like fifty different schemes; there's a full-fledged paper I'll leave you with, because we could spend four weeks just talking about MVCC. And in this scheme the master version gets written in place; the slide here is just showing the master being written.

The preferred way is delta storage. In everything so far, if I changed a single column I was actually making a full copy of the record, and the record may have hundreds of columns; if I changed just one, I'm making a large copy when the diff is really small. Delta storage is exactly that keyword, diff: I take only the columns I'm changing, which could be more than one in a given record depending on the update query, store the changed values in the delta storage, and keep track of them through a pointer. So the big difference between delta storage and the other methods is that here you keep track of just the values that changed. This can get complicated too, because multiple columns could have changed, but as you can imagine you can generalize it to record only the changes you've made.

Question: when do I make the change in the main table versus the delta storage table? There will be a bunch of details, but most of the time you take a copy of the old value, put it over in the delta storage, and then update the value in the main table. You move the old value out before you make the change, because what happens if there's a crash between those two steps? If both copies carry the same value, they'll have the same timestamp and you can clean that up if you need to. And yes, you're getting into a bit of the recovery protocols here, the fact that sometimes you may need to undo a change because something went wrong; all of that happens with versioning too. Literally what we'll do is, along with the changes we make, we'll also write what we're going to talk about next, log records, to record the change; and depending on how you use those log records, you can build one of several schemes.
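Getting back to delta storage itself, here's a rough sketch of the idea, with a record's columns simplified to an id-to-string map: an update stashes only the before-values of the touched columns, and an old version is rebuilt by walking the deltas newest-first.

```cpp
#include <string>
#include <unordered_map>

// One delta = the before-values of whatever columns one update touched.
// Illustrative layout; real systems pack this much more tightly.
struct Delta {
  std::unordered_map<int, std::string> old_values;  // column id -> prior value
  Timestamp begin_ts;     // begin timestamp of the older version this rebuilds
  Delta *next{nullptr};   // next-older delta in the chain
};

// Start from the master copy and undo one update at a time until we hold
// the version that was visible at reader_ts.
std::unordered_map<int, std::string> ReconstructAsOf(
    std::unordered_map<int, std::string> current,  // master's column values
    Timestamp current_begin,                       // master's begin timestamp
    const Delta *chain, Timestamp reader_ts) {
  const Delta *d = chain;
  while (current_begin > reader_ts && d != nullptr) {
    for (const auto &[col, old_val] : d->old_values) {
      current[col] = old_val;  // roll this column back one step
    }
    current_begin = d->begin_ts;  // we now hold the next-older version
    d = d->next;
  }
  return current;  // (record not existing yet at reader_ts is ignored here)
}
```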
Anyway, back to that question: we will record what we need in order to undo things. All you need to know for now is that things can go wrong while you're making these changes; that's the recovery material we'll start on, where we keep logs so we can back ourselves out of an unsafe state if we started making a change and the system crashed in the middle, or the transaction aborted. Good question.

All right, so with delta storage, transactions can still recreate old versions. In all of the tuple-based schemes they just go to the tuple and get the whole thing; here you recreate old versions by applying the deltas in reverse order, because the chain is keeping track of the deltas. Even though it's not quite shown on the slide, each changed value has a delta associated with it, and we'll talk about the same notion from the logging perspective too; the idea of a delta there is effectively the same.

Now, the minute you have versions, you also need garbage collection. As updates keep happening to a record, the version chain builds up; new readers come in with new timestamps, and at some point no reader has a timestamp that is really, really old, so we need to clean up those old versions. Again there are multiple ways to do this: you have to find the expired versions, and then decide when it is safe to reclaim them. When is it safe? Typically you say: I assign some transaction number or timestamp to transactions, and I know the oldest transaction still running in the system; anything older than what it can see, I don't need. So you have a way to figure out what portion of the tail of a version chain you can throw away. Then the question is, having determined what you can throw away, when do you actually do it? You can work at the tuple level, where you go through the chains (I'll assume everything is at the tuple level; the same ideas generalize if not), and there are two tuple-level methods: background vacuuming, where a background thread does the work, and cooperative cleaning, which is the bus-your-own-table approach: when you see something that needs cleaning, you fix it yourself. And then there's an approach at the transaction level. Let's go through each of these techniques.

Tuple-level GC is pretty straightforward. I've got a background vacuuming thread that starts up every once in a while, looks through all the places where versions live (depending on the scheme, that's a different place), and says: I know what the oldest live transactions are, so I can determine that, say, anything with a version older than 100 is no longer needed, because no transaction in the system needs a value older than 100, and I can vacuum those out. Now remember, these version chains could be long and spread across different pages, so this vacuuming process can be pretty expensive, and it's making extensive changes across the entire database, which is obviously not just one file or one table; all the tables could have version chains needing this kind of management.
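A sketch of that tuple-level vacuum, assuming oldest-to-newest chains like the earlier snippets: compute the low watermark from the active transactions, then free the prefix of the chain that nobody can see anymore.

```cpp
#include <algorithm>
#include <vector>

// The low watermark: the oldest timestamp any active transaction could
// still read. Everything that expired before it is garbage.
Timestamp LowWatermark(const std::vector<Timestamp> &active_txn_ts) {
  return active_txn_ts.empty()
             ? kInfinity
             : *std::min_element(active_txn_ts.begin(), active_txn_ts.end());
}

// O2N chain: the oldest versions sit at the head, so expired versions form
// a prefix. Returns the new head after freeing them.
Version *PruneChain(Version *head, Timestamp watermark) {
  while (head != nullptr && head->next != nullptr &&
         head->end_ts <= watermark) {  // invisible to every active reader
    Version *dead = head;
    head = head->next;
    delete dead;  // reclaim the expired version
  }
  return head;
}
```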
One optimization you could do: instead of scanning all the files start to end, first page to last page, every time, you keep a simple piece of metadata, maybe one bit per page, that marks a page dirty whenever an update touches it. So if you've got a billion pages in your database and only a thousand of them were touched, the vacuum process doesn't need to read a billion pages to ask, "were you touched? is there a version tail in this page I need to clean up?"; it goes straight to the pages that need cleanup and changes only those. That's an obvious optimization you'd build, especially at large scale.

Cooperative cleaning works like this: if you go to a sit-down restaurant, they bring the food and take away the plates; if you go to a place where you bus your own table, you pick up your food at the counter and clean up your own tray. With cooperative cleaning, the workers, as they do their work, identify anything they see that needs cleaning and do the cleanup themselves. (There are further choices, like whether you clean while doing the work or after the transaction commits; we'll ignore those.) The difference is that there's no separate background thread doing this; you do a pay-it-forward style of work. Imagine I'm a transaction that wants to read value A. I use an index to get to the A record, and imagine the chaining technique is oldest-to-newest, so the index brings me to the head, the oldest version. I know my timestamp, so I know I don't need that one; I keep chasing down the chain until I reach the version I should be reading. But since I'm already chasing down the version chain, bringing those pages in from disk if needed, I might as well help clean up along the way, up to the record I need to read. And note that record may not be at the end of the chain: depending on my transaction number, I might need to read A2 while newer versions sit beyond it. Wherever I need to stop, I stop, having cleaned up everything before me.

Yes, it can even happen that a chain that grew really long is simply never accessed again. Say it was very hot in 2020: the ticket counter for a Taylor Swift venue was really popular and built up a million entries in its version chain; now she has a new venue, and no one ever goes back to clean the old one up. That can happen. You could ask the more philosophical question of why chains build up at all if everyone is cleaning as they go; sometimes the cleaning gets deferred, so it does happen. And again, as I said, there are a huge number of ways of combining these things; all you need to know is that these different schemes exist and can be combined in an exceedingly large number of ways.
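In the same sketch world as before, cooperative cleaning is just the reader doing that pruning itself on the way to its version:

```cpp
// Pay-it-forward cleanup: a reader already walking an O2N chain prunes the
// expired prefix in passing, instead of leaving it for a background thread.
// Reuses PruneChain and SnapshotRead from the earlier sketches.
const Version *ReadAndClean(Version *&head, Timestamp reader_ts,
                            Timestamp watermark) {
  head = PruneChain(head, watermark);    // tidy up while we're here anyway
  return SnapshotRead(head, reader_ts);  // then find the version we came for
}
```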
But I do want to get to the indexing part, which is important. Before we do, remember two slides ago we said garbage collection happens at the tuple level and the transaction level; we've finished the tuple-level stuff, so now the transaction level. The transaction-level approach is different. Every transaction keeps track of all the stuff it is doing, including the versions it is outdating because it created new ones; a transaction knows, right when it adds something to a version chain. So in this approach, as a transaction proceeds, doing an update, then a second operation, it is creating these old versions, and the basic idea is that when the transaction is done, it knows: I created two old versions, one for record A and one for record B. At that point those versions can be handed over to a vacuum process, which can then determine: the oldest version anyone needs is 10 onwards; if all of these are below 10, I can clean them up. So the vacuum process doesn't have to go through each and every page, even with the dirty-page marker, to find where the old versions are; transactions simply hand them over, and those are exactly the things that need to be cleaned up. Just a different way of clearing out the versions.
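A sketch of that transaction-level handoff: the transaction accumulates the versions it superseded and gives the whole list to the vacuum at commit, so the vacuum never scans pages looking for garbage. (Unlinking the versions from their chains is omitted for brevity; structures are illustrative.)

```cpp
#include <vector>

// What a transaction hands to the vacuum when it finishes.
struct TxnGarbage {
  Timestamp commit_ts;                 // when this transaction committed
  std::vector<Version *> superseded;   // the old versions it replaced
};

// Called by the vacuum: once the watermark has passed this transaction's
// commit timestamp, no active reader can see the superseded versions.
void ReclaimIfExpired(const TxnGarbage &g, Timestamp watermark) {
  if (g.commit_ts <= watermark) {
    for (Version *v : g.superseded) {
      delete v;  // safe: every active transaction reads something newer
    }
  }
}
```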
Now, this part I want us to spend a little time on. The material after it I'll basically leave in the slides and let you worry about on your own; in other words, everything up to and including index management is material for exam questions, but the last part I'm going to skim over and you don't need to worry about. So pay attention now.

Indices point to record IDs. Imagine a slotted-page organization: an index entry names a page number and a slot number. Now, if I've got the primary key index and I'm updating the primary key itself, I need to figure out how to track that change; we can track the version chain with the techniques we talked about, but updating the primary key itself gets very tricky. So often, when you're doing this kind of versioning, a primary key update is treated as a delete followed by an insert. That makes the semantics clean, and you won't have a version chain develop across the key change, because the delete removes everything and then you do a fresh insert.

The problem gets more interesting with secondary indexes, because they're more complicated. In some sense the primary key controls the record; the record ID and the primary key are in many ways analogous, so you can play these delete-then-insert games. For secondary keys you can't do that. There's a very famous incident at Uber from a few years ago: they had been on MySQL, which has a good way of doing secondary indexes under multi-versioning, and they moved to Postgres, which doesn't. Both are multi-version, but if you have lots of updates happening in your system, performance degrades quite badly unless you have the good way of doing secondary indexes. So Uber changed again and went back, from Postgres to MySQL, because they realized this was a problem. They hadn't taken this class. So we'll talk about the good way and the bad way.

Secondary indexes refer to the record through what you can think of as surrogate pointers; they don't control the object, so they can't just do the delete-followed-by-insert technique. The two approaches are a logical pointer and a physical pointer. Look at it with the diagram: I've got a version chain A4 down to A1, and let's assume it's newest-to-oldest (the same ideas apply to other schemes, but newest-to-oldest is the interesting one). I say "get value A" through the primary key, and the index hands me a record ID, a physical record ID, and I locate A4, the newest version of record A. So far everything is fine. And if the primary key gets updated, we delete and insert, so we don't have to worry about a bunch of cases there. The secondary indexes, however: if I look something up that has to reach record A, the secondary index entry also points to A4's record ID, and that's okay, except that I could have multiple secondary indexes on record A. If it has five columns, I could have built a secondary index on each of them, and every one of them points at the head of the chain, the newest version. Again, so far nothing bad has happened.

Now the issue: if I have to update this value and create a new version A5, what happens? A5 goes into a new record on some page, I fix up the version chain using the version management schemes we discussed, maybe delta storage, and then I have to go update each secondary index so its record ID points to the new A5. So one update to the record causes every secondary index to be updated, and you can see how this becomes a huge problem. What would be the way to avoid it? I already showed you. Good, you're paying attention. In computer science, indirection is a very powerful technique; fifty percent of problems can be solved with indirection. Instead of the secondary indexes pointing to the record ID, you say: since the primary key is kind of like a record ID, the secondary index will just point to the primary key. That does mean that when I access a record through a secondary key, I have to go to the primary key index and then get the record ID, one extra hop; but if the record is hot, the primary index is probably already in the buffer pool, so it's not too bad. What it buys me is that if I change a record, all I update is the primary index; none of the secondary indexes have to change. Think about Uber's case: the rate for a taxi ride changes all the time, and that's one field in a record; if every such change means every index on the table must be updated, that causes the massive performance problem they hit. There's another way to do the indirection: a global structure that maps a tuple ID to some sort of address, with everyone going through that. There have been proposals for it, but nobody does it, because the primary key index essentially already is that structure.
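Here's the indirection in miniature, with hypothetical types and std::map standing in for the real index structures: the secondary index stores primary keys, so moving the chain head touches one entry in the primary index and nothing else.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

using PrimaryKey = int64_t;
using RecordId = uint64_t;  // page id + slot, packed; illustrative

std::map<std::string, PrimaryKey> secondary_index;  // e.g. on a name column
std::map<PrimaryKey, RecordId> primary_index;       // pk -> chain head (N2O)

// Reads pay one extra hop; updates that move the chain head only rewrite
// the primary_index entry, leaving every secondary index untouched.
std::optional<RecordId> LookupViaSecondary(const std::string &key) {
  auto s = secondary_index.find(key);
  if (s == secondary_index.end()) return std::nullopt;
  auto p = primary_index.find(s->second);  // the extra hop
  if (p == primary_index.end()) return std::nullopt;
  return p->second;  // physical location of the newest version
}
```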
That raises a question: does this mean that if I use the fancier scheme, which is better for performance with secondary indexes, I must have a primary key index? The answer is yes, and the primary key index is typically always built anyway, because that's how the system enforces the primary key constraint. So you can assume, especially in that environment, that you have it, and this scheme is basically the preferred one. The takeaway: if you're doing versioning, which just about every system does, you have to be careful about what your secondary indexes point to.

Question? Yes: this matters when you're newest-to-oldest. If my newest version A5 comes in, it gets a new record ID that everything needs to point to. You could say that with oldest-to-newest I wouldn't have this problem, and that's right; it depends on the scheme. With newest-to-oldest, every index has to point to the newest head of the chain. And remember, for primary key updates we assumed delete-followed-by-insert, so we swept that case away with a different implementation. So in this case, newest-to-oldest, the version chain has built up, A4 is the newest and is referred to by all the indexes; if I add a new version A5, all the secondary indexes have to point to A5, because they all have to reach the whole version chain and traverse it from the same position. Oh, you're asking about time travel, scheme number two, where the main table keeps the newest copy? Yes, exactly right: in the scheme with a separate time-travel table, the lowest-lift way to add MVCC, the indexes point into the main table, and the record hasn't moved; it's the same page ID and slot ID, so this problem is simpler over there. Absolutely. But as I said, there are all kinds of schemes, and I don't want to spend an infinite amount of time walking through every combination; I'll leave you with a paper that covers many of them with their pros and cons. I'll take one final question before we move on. Yes, and again, this gets into other nuances: even for time travel, there's a version chain maintained, but if the head stays put at the same page ID and record ID and is updated in place, then this problem doesn't arise. You can come up with all kinds of schemes to avoid it. Most systems, though, do some sort of version management that can run into this, so you just need to be aware: if your record IDs move around, then under any version management scheme you're going to hit trouble like this.

Okay, good. The rest, from slide 27 onwards, is not going to be material for the exam, so I'm going to skim through it really fast. Under MVCC it can look like keys are duplicated: it turns out that even for primary keys, if I'm using a scheme where the primary key is updated and I keep versions, then, where a typical B+tree implementation for a primary key would never allow a duplicate key, here you may need duplicates for a little while. Again, I'm not going to go into the details, but you
can read the chapter in the book that talks about it; you do have to worry about those components. There's also the issue of deletes. Again, I'll let you read that on your own, but there are different ways to implement the delete, by keeping a delete flag or a tombstone flag. This has to do with having a version chain for a record that has been deleted: sometimes it's better to just mark the record as deleted and clean it up eventually, or you can use a tombstone-based approach that cleans up the versions in a two-step process. The details are not material for the exam, but I encourage you to read the textbook. Up to here, though, is the key material I do want you to know, and we may ask you questions about it.

Last piece, for those of you with more curiosity: there's an explosion in the combinations you can build from all these choices. What's my garbage collection? What does my index point to: a physical record ID, or a logical ID that I can build in one of many ways, including pointing through the primary index? What protocol do I use to protect the right objects as versions get created, since creating a new version still needs protection: two-phase locking, or optimistic concurrency control? Postgres uses a combination of 2PL and a timestamp-ordering protocol. A lot of these combinations are feasible, and if you're curious about which ones, there are two papers you may want to read. One is a beautiful paper of Andy's on in-memory MVCC, an even more constrained setting than disk-based MVCC, with even more choices; he and his students do an excellent job of cataloging the different mechanisms, giving them proper names and categories. It's a beautiful paper to read if you're interested. And in-memory MVCC is super interesting, because a lot of transactional databases fit in memory: today you can get a full-terabyte server, and very few transactional workloads need much more than 8 or 10 of those to handle everything, including replication and a bit of fault tolerance. The other paper, if you want depth on one specific protocol, is the Hekaton paper, developed at Microsoft by a group including Paul Larson, a giant in this field who has since retired; one of my students worked on this. It's a very clean protocol, as simple as Paul and his collaborators could make it, and it has another interesting aspect: they had SQL Server, a disk-based system, and wanted to add this in-memory MVCC component, and there are very interesting ways in which they put the two together cleanly. So Hekaton is in-memory MVCC that can be added as an extension to SQL Server, an on-disk system: very clever engineering and a very clean protocol for the in-memory case. Those of you who have been asking a million questions, I love that; please go take a look. These papers won't be covered in the graduate database class, but if you're interested, stop by my office hours and I'd be happy to walk you through them and take questions; I'm sure Andy would too. All right, you guys are doing a good job of making sure the exam won't have a lot of material, so, great. But let's
keep moving. All right, let me just make sure sharing is still working. Great. So we're going to talk about logging next; it's the first of a two-piece component. We'll figure out how to log things so that we can make changes, and if something crashes, like some of the questions that were asked (what happens if I'm adjusting this chain but haven't finished fixing it, or I fixed the chain and then something bad happens), we can deal with it. Today we're largely going to focus on single-version protocols, but the concepts apply in the other cases too. If we can get the single-version foundational pieces in place, you'll be able to read papers by yourselves and figure things out. I was just telling a couple of students last week, and earlier today too: the best we can do as teachers is to teach you how to learn things. So hopefully we give you the foundational material you need; this is a never-ending game of inventing better and better techniques.

Before that, a couple of announcements. Project 3 is due really soon. There are special office hours on Saturday, 3 to 5 p.m. in GHC 4303, if you still need help; some of you are already done, congratulations, but for those who need help, it's available. Project 4 is on concurrency control. It goes out today and is due December 10, so you have a little time, but there's a bunch in there, so again, don't wait until the last minute. The write-up is unfortunately going to be a bit detailed, because getting into concurrency control is hard; the optimization piece you did was very light, but here we have to get you into MVCC and the like. So buckle up; it's going to be fun, and you'll get to do some interesting things with MVCC and O2N-style version chains. Also, a bunch of database talks: there are three more left, and I think these are the last three of the semester. I strongly recommend not missing the Alibaba talk especially, because they're doing things at massive scale and it'll be super interesting; the pgvector talk and Chroma are coming after that.

All right, let's get to recovery protocols; really fun stuff. Remember the ACID components: so far we've only done I, the isolation part. We still have A and D to cover, and C, as we talked about, is consistency, based on the database's integrity constraints and other ways of defining what semantics must be enforced, which we will enforce. So let's get into the recovery component, the last missing piece to complete ACID. We want transactions to be all-or-nothing, which is the atomicity part, and we need transactions to be durable: if a transaction is declared committed, then even if the system crashes afterward, its changes are recorded in the state of the database.

We'll start simple, with motivation: a simple transaction reading and writing. Now we have a buffer pool, which database systems obviously have, but we're going to make it explicit in our diagrams, because it's the buffer pool that forces us to worry about a lot of things. We bring in a page; the page has a lot of stuff, a lot of records with a lot of columns, and one of them is the value A that we're trying to
write, the column A. We bring it in, and when we write it, we update it in the buffer pool. That's what we do, right: the updates happen in the buffer pool. Now, at commit time, what do we do to get durability? We could insist on taking every change that has been made and pushing it out to disk. If we do that, we get durability, but it will be very slow. So say we have other protocols that don't require flushing everything to disk at commit time, because this transaction could have touched a billion records on a billion pages, and you don't want to flush a billion pages. Then what happens if the power goes out, or someone zaps the memory? We've committed the transaction, but the change was only in the buffer pool, and we've lost it; we no longer have a durable transaction. That was Andy's picture of the bad guy coming and zapping it. So that's what crash recovery is all about. We have to deliver the atomicity and durability components, and we'll do it in two parts, across two lectures. Today, the first part: what actions do we need to take during regular transaction processing to create the stuff we'll need in order to recover? The second part, next lecture on Wednesday: once you've kept around what you need, how do you actually recover?

So here are the six things we need to work through today. First, why failures happen. Then buffer pool policies, because those are at the heart of why we need these fancy new mechanisms. Then two different mechanisms, shadow paging and write-ahead logging, as ways of keeping track of the changes we make; I'll tell you up front that shadow paging is a bad idea, it's what old systems used and it has all kinds of issues, so I'll only touch on it and then move to write-ahead logging, which is the main thing we need. Then we'll talk about the logging schemes, what you actually put in these logs, and finally checkpointing.

First, we have to concern ourselves with the different storage types, and we've talked about this; let me go to the next slide and come back. Remember this slide on the right, about data living in different places. There's a reason we have processor caches and DRAM and slow SSDs and fast SSDs: as you go down the hierarchy you get more capacity, but it's also slower. You don't want all the data to live at the lowest level, say an SSD or a spinning disk; if you didn't have a buffer pool, the database system would be very slow. So we want to use the highest level, the volatile storage, DRAM, because that's where you want to make all of your changes; it's fast, but it's volatile. The non-volatile storage, the SSD layer, is where you have to make sure your changes land, so that if the bad guy comes and zaps the DRAM, the volatile storage, you can still provide the durability property you have to guarantee. There's a third type of storage described in the textbook, called stable storage, which is something that survives all possible failure scenarios. Such theoretical storage doesn't exist, but you can come close by making replicas of everything you do: replicate the disks and synchronize the replicas using distributed transaction
protocols; we'll talk about that briefly in the last three lectures of the class. So we won't touch stable storage at all today. For the remainder of today's lecture and the next, we'll concern ourselves with the two-tier scenario: the buffer pool in DRAM, with dirty pages getting flushed out to some non-volatile storage, an SSD or disk, that can survive a power failure.

So let's begin: why do transactions fail? There are multiple reasons. First, transactions can fail because of logical errors: I updated a record and, oops, an integrity constraint on the database failed, the C part, so now I need to abort this transaction. Or the transaction fails because everything it's doing is fine, but it's deadlocked with another transaction, and sorry, you got picked as the transaction that needs to be killed. So transactions can fail for a variety of reasons, and that all feeds into the atomicity and durability work, because a transaction that has failed may already have made some changes, and we'll have to undo all of them. There can also be system failures: surprise, surprise, software sometimes has bugs, operating systems sometimes have bugs. So your system can crash from a software failure with part of the work done; maybe the transaction hasn't committed, or it has committed but its changes were in the buffer pool, and we still need to make sure the right thing happens. There can be hardware failures too. In the early days of data centers, there used to be a failure mode that the Meta folks, then called Facebook, wrote papers about: the DRAM modules on their servers, because everything was packed so close together and nobody, not just them, had figured out what happens at that scale, would after a while simply come loose from their slots, even though everything else was fine. Now they glue them in, most of the time, so that doesn't happen. So hardware can fail for reasons like that, or a part actually dies, or the system crashes; all kinds of things happen. One thing we will assume today is that the non-volatile storage, the SSD or the disk, does come back with its contents, because we don't have that ideal replicated stable storage. The storage media itself can also fail: you thought you wrote a page to disk, the disk driver came back and said it was written, but the bits got corrupted. We've talked about bit rot and similar things; you need other mechanisms, with replication, to deal with that, and again we won't cover it. We'll assume something else takes care of it and concern ourselves with just the two tiers we talked about, where the working copy is in volatile DRAM and the durable copy is on the non-volatile storage.

What we need to do as a database system is make sure that committed transactions' changes make it to the non-volatile storage, and that no partial changes are left around; even if they made it to durable storage, we must be able to unwind ourselves from them. We need two key mechanisms for this. One is called undo, which is: whoops, I put something into the non-volatile storage that was a change made by a transaction that later aborted, because of a
transaction failure or one of the other failures we talked about, and now I need to undo it. The other is redo: a committed transaction made changes that were only in the buffer pool, but we told the world the transaction committed, so now we need to go and reapply those changes. Those are the two mechanisms we need to build into our system.

Another example: you start with the buffer pool, you read A, you get the page, which has a bunch of things on it including the A value, and you write a new value for A; that's transaction T1. Then B, which lives on the same page as A, is changed by a different transaction, T2. So now two transactions have made changes to the same page, and remember, we move things between the buffer pool and disk in units of pages. T2 commits. What do we do at this point? If we flush that page to disk to make T2's changes durable, we also carry along T1's change to A, and T1's outcome is not yet known; if we flush all of that to disk, T1 could ultimately abort, and we'd need to unwind changes that already made it to disk. There are all kinds of schemes you can come up with, and we'll categorize them into a nice quad chart shortly, but the basic point is: because things share pages, and the buffer pool moves things around in pages, a given page may contain changes from multiple transactions in different states, and we have to make everything work in that scenario. And if T1 needs to be rolled back, I need to know that the previous value was not the new A. That may be sitting in a version chain, if that's the storage technique I'm using, but I'd have to chase it down and figure it out. Either way, I still have to keep track of the changes I made, which is the logging stuff, and it works alongside all of the techniques we've discussed before; it's orthogonal to all of that. As I said, today we'll concern ourselves with the single-version case; you can do more interesting things with multi-versioning, but let's get the single-version foundation in first.

The other complication arises from the buffer pool's replacement policy. You implemented the LRU-K policy when you wrote the buffer pool, and that gave a lot of freedom to the buffer manager: a simple piece of code that tracks the recency of a page using the LRU-K counters and decides what to kick out when an eviction is needed. The only rule was: if a page is pinned, someone is actively using it and you can't kick it out; but if a page is unpinned, it can be evicted even if it's dirty. That's how the buffer pool gets maximum performance, maximum use of its space as a cache: it's allowed to kick things out even if they carry dirty changes. We call that a steal policy, and it's one dimension of this problem. Stealing means the buffer manager can take an unpinned page and flush it to disk even if it's dirty and the transaction that dirtied it hasn't committed yet; we will write uncommitted changes to stable storage, and that's okay.
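Why is that okay? Because, as we'll see, we keep enough information around to take it back. As a preview, here's the shape of a physical log record carrying both images: steal is what makes the before-image necessary, and no-force (coming up next) is what makes the after-image necessary. The field layout here is illustrative, not any particular system's format.

```cpp
#include <cstdint>
#include <string>

// One physical log record, in sketch form. If a stolen page carrying T1's
// uncommitted write reaches disk and T1 aborts, the before-image undoes it;
// if T2 commits but its dirty page never got flushed, the after-image
// redoes it.
struct LogRecord {
  uint64_t txn_id;           // who made the change
  uint64_t page_id;          // where it landed
  uint32_t offset;           // position of the change within the page
  std::string before_image;  // bytes needed to UNDO (because of steal)
  std::string after_image;   // bytes needed to REDO (because of no-force)
};
```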
No-steal says: no, no, I'm going to take that power away from the replacement policy; besides pinned pages, you also can't steal pages for transactions that are still running. That obviously gives a lot less flexibility, so it will be a poorer-performing system.

The second dimension, alongside steal versus no-steal, is the force policy: what do I do at commit time? If I say that all the changes made by the committing transaction must be forced to disk before it commits, I get durability, but it's very expensive. Imagine a transaction touching one byte in each of a billion 100-byte records: I'm changing only a billion bytes, but I've touched everything, and I have to bring it all into memory, update it, and write all of it out at commit time. That's a lot of flushing at commit, and you get a slow system; but it is one way you could work. So force says: at commit time, force the committing transaction's changes to disk. No-force says: no, don't do that; find a more efficient way to commit and deal with the bad things that can happen in a different way, which is where logging comes in.

Let's look at the simplest of these combinations of steal/no-steal and force/no-force: no-steal plus force. The buffer pool can't do much, it can't steal a page, and force makes you write everything at commit. That combination works, but it's obviously pretty slow. Let's walk through an example of this simple scheme, which is not realistic; we want to do better, and we will. You read A and bring that page in; you've seen this page multiple times now. You write to it. A second transaction comes in and makes its write, and force means I now force the changes of this page to disk. But because the page carries changes to A made by T1, which is not done, even under a force policy I still need to figure out the older version of this page; I need some mechanism to keep track of it and flush only the committed state to disk. So notice that even the simplest scheme you can think of has a complication; there's no free lunch. Nobody implements it this way, because it would still be a very slow system.

This easiest-to-implement scheme has a couple of other problems too. One concerns the number of objects I'm writing: imagine the buffer pool has a million pages, and a transaction wants to update a field in every record of a table that occupies a million-plus-one pages. You can't do this under the no-steal-force policy, because every page has to be brought into memory and, because of the no-steal part, can't be pushed out to disk until the transaction is done. So the no-steal part can be really problematic, and the force part is not trivial either, because you have to figure out which parts of each page changed. Ultimately we want to live in a world that makes the opposite of this work, the highest-performance scheme: steal plus no-force. As we start making changes, we'll have to keep track of what we've changed and work with that.
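Here's the design space as the promised quad chart, in code-comment form (my summary of the discussion above):

```cpp
// The buffer-pool policy design space:
//
//                 FORCE                           NO-FORCE
//   NO-STEAL      trivial recovery: no undo,      needs REDO only
//                 no redo needed, but slowest
//   STEAL         needs UNDO only                 needs UNDO + REDO  <-- what we want
//
enum class StealPolicy { kNoSteal, kSteal };  // may dirty pages be evicted early?
enum class ForcePolicy { kForce, kNoForce };  // must all changes flush at commit?

// Write-ahead logging exists precisely to make the fastest corner workable:
constexpr StealPolicy kDesiredSteal = StealPolicy::kSteal;
constexpr ForcePolicy kDesiredForce = ForcePolicy::kNoForce;
```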
So, one of the options we'll have to... hold on, this thing is running out of battery for some reason, even though I've got it plugged in. Hold on before everything dies out here; maybe this connector is broken. Jeez, not good; it's saying it's going to shut me down in a minute. I'll have to switch this out and set everything back up again; give me a minute. Fun. That gives you a little downtime to think about all this stuff. Less stuff for the final. I have half a mind to record a 20-minute lecture and make it part of the material, but you guys would kill me, so I won't do that; I was just thinking of ways to get out of this. No, no, I'm nearly there; I'm fighting here, guys, for every minute. Thanks for all the help. Okay, I think we're back; let's make sure. This connector to the external monitor is very finicky today. There we go, I think we're back in business, and hopefully the lecture is still recording; otherwise I'll record it at home and add extra material for the exam. I'm just kidding, I won't do that.

All right. So we still have to record what we need, either to unwind changes or to make sure they made it to disk. There are two techniques. The first is shadow paging. It's a bad idea and essentially no one does it; it was the first thing people did when they realized they needed a recovery protocol. I'll describe it, but I'm going to skip through the slides on it and refer you to the deck if you want to look deeper. Remember, we ran into trouble where two transactions' changes were on the same page; there's a bigger version of that problem when I've got a whole bunch of changes in flight, and effectively what we need is to keep track of a before and an after version of the changes we're making. One way to do that: keep all the pages on disk, and keep in memory something called a master page table. Those of you who've taken an operating systems class know there's a page mapping table from your virtual address space to a physical address space; this is like that, but not that complicated. It just says page 1 is at this position in my stable storage; it's a list of pointers to where the pages are. When a transaction comes in, it makes a copy of that page table; think of it as getting a snapshot by virtue of copying it. Now when it makes changes, if page 1 has to be updated, it makes a full new copy of that page, a shadow, and makes its changes only there; if the transaction changes page 1 again, there are ways to merge and combine that, but assume one change at a time. Effectively, all the changes create new pages, pointed to only by my shadow page table copy. Ultimately, when I'm ready to commit, I do the following. There's a pointer in memory to the root of the master page table, and a copy of that is also kept on disk, because that's the stable point: when the database system starts from scratch, it reads that to know where everything is. I flush the new pages and the shadow page table out there, then switch the root pointer to the shadow page table, which now becomes the master; eventually some background process cleans up the old pages.
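The whole scheme in miniature: the atomic commit is a single root-pointer swap. Types here are illustrative (and remember, this is the bad idea we're about to discard):

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Maps page id -> position on stable storage, like a tiny OS page table.
struct PageTable {
  std::vector<uint64_t> page_locations;
};

struct ShadowPagingDb {
  std::shared_ptr<PageTable> master;  // what the on-disk root points at

  // Precondition: every new/changed page referenced by `shadow`, plus the
  // shadow table itself, has already been written to disk. The pointer
  // swap below is then the single atomic commit point; the pages only the
  // old master referenced become garbage for a background cleaner.
  void CommitTxn(std::shared_ptr<PageTable> shadow) {
    master = std::move(shadow);
  }
};
```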
So that was the easiest implementation you could imagine for someone who wanted this type of recovery protocol. No one does it now; well, I guess there are some systems that do, but it's a bad idea, and the better approach is what we're going to talk about next, the write-ahead logging protocol. Shadow paging obviously has massive problems, right? Huge fragmentation, garbage collection at the page level, copies and copies of pages, lots and lots of problems. We won't talk through all of that, but as I said, you can look at last year's lecture, which spends about 10 to 15 minutes on the next two slides. I'd also recommend, just for learning purposes, going back and looking at what SQLite used to do. They had the old shadow-paging scheme, because they started around 2000, when people hadn't advanced all these protocols as much, but then they switched over to write-ahead logging, which we're going to talk about next. Their scheme also makes copies and copies, in a slightly different way, but it's effectively the same kind of thing.

All right, this is what we need to pay attention to; the material for the exam starts back up here. We won't ask you questions on shadow paging. The write-ahead log is the foundation on which you implement the recovery protocol. The idea is that we want to make steal plus no-force, the opposite of the easiest scheme and the hardest one to get right, actually work. For that we'll have to keep track of the changes being made and use those changes in two different ways. We'll create something called a log file. Don't confuse this with a log-structured file system; I'll make a comment about that in a little bit. A quick preview: even a log-structured storage engine will do logging like this, write-ahead logging, for its memtable. These are database logs, and we'll keep them in a database log file, which is a separate file. We're going to create log records, and they'll get created in the buffer pool, in pages that will eventually get flushed out to the log file. Usually there's a separate place in the buffer pool for buffering log pages.

So you can think of the design space as having these two dimensions. Force, yes or no: at transaction commit time, do I force the changes to disk? And steal: how flexible is the buffer pool; is it allowed to steal pages (we want that, because it's more efficient) or not? We already talked about how the no-steal plus force policy, the easiest combination, is not as trivial as it looks, and the opposite corner is the desired spot. The force policy basically says: at every commit, flush the updated pages to disk. That means committed transactions' changes are on disk, so for a committed transaction you can say you've met the D property, durability. But you'll get poor response times, because transactions that update a lot have to flush a lot of pages to disk. On the steal axis, what all the no-steal policies say is that the buffer manager cannot take a page away from a transaction that is still working. That works out fine for an aborted transaction, because its changes never make it to disk, but it gives you low throughput, because very soon the buffer manager has very few degrees of freedom to take pages and run its replacement policy.
So we really want the no-force plus steal policy. What's the complication with no-force? Remember, no-force means that at commit time I'm not requiring the dirty pages to be flushed to disk; we'll let the replacement policy, the LRU timestamps, dictate when a page gets written out. But what happens if the system crashes before a committed transaction's changes make it to disk? The example we started out with, Putin zapping the machine: we have to go deal with that. So we'll write these things called log records, which are essentially diffs of the changes we made, and we'll use them at recovery time, next lecture, to redo the changes that should have made it to stable storage in the first place. So we need logs to redo changes.

Now the steal policy. We want stealing allowed. And this is 2023, so if all of these techniques were invented now, we'd probably not call them "steal" and "force," but these terms were invented in a less PC time, so pardon that; we'll stay with the terms that are in the literature. With the steal policy, our concern is that a page that was stolen and flushed to disk may contain changes that were made by an uncommitted transaction, and now we need to undo those. So logging will also be used to allow us to undo. Logging must carry enough information to let us both redo and undo, and the recovery protocol we'll talk about in the next class will decide, for each log record, whether it has to be redone or undone.

Along with that, there's another fundamental protocol we need, and it's called write-ahead logging. There are two very foundational pieces in databases. One is the notion of two-phase locking, because from it you get the dependency graph on which everything is based for understanding how to get isolation. The other is write-ahead logging, the protocol that says at what point you can declare a transaction's status to be committed. There's a magical moment in the protocol: the point where the commit log record hits disk is when the transaction changes its status from active to committed. It has to be one very precise boundary, and that magical moment is determined by the write-ahead logging protocol. What the protocol says is this: the database system creates log records, and they stay in volatile storage, in the log buffer pool, but before I write a page to stable storage, I must make sure all of that page's log records have been written out first. That's what the write-ahead logging protocol says; that way I have the log information to undo and redo. So before I can overwrite a page in stable storage, all of its log must hit the disk, and without that, all of the stuff we'll talk about fails. Intuitively it makes sense, right? If I wrote the page and then crashed before the log hit disk, I don't know how to redo and undo the things I may need to. So: record your changes, and put that record of changes someplace safe and stable, before you push the page's changes out to stable storage.
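Here's a minimal sketch of that rule, assuming the usual log-sequence-number bookkeeping: a page remembers the LSN of the last log record that described it, and the page may be written out only once the log is flushed at least that far. The class and function names here are mine, not any particular system's.

```python
class LogManager:
    """Hypothetical sketch of a log manager enforcing the write-ahead rule."""

    def __init__(self):
        self.buffer = []       # log records still in the log buffer pool
        self.flushed_lsn = 0   # highest log sequence number safely on disk
        self.next_lsn = 1

    def append(self, record):
        record["lsn"] = self.next_lsn
        self.next_lsn += 1
        self.buffer.append(record)
        return record["lsn"]

    def flush_to(self, lsn):
        # One sequential write of the log file up through `lsn`.
        self.buffer = [r for r in self.buffer if r["lsn"] > lsn]
        self.flushed_lsn = max(self.flushed_lsn, lsn)


def evict_dirty_page(page, log_mgr, disk):
    # The write-ahead check: the page's log records hit stable storage first.
    if page["page_lsn"] > log_mgr.flushed_lsn:
        log_mgr.flush_to(page["page_lsn"])
    disk[page["id"]] = page["data"]   # only now may the page be written out
```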
Yes, the log. So there are two things happening. I've got a table in which I'm making changes, a bunch of tables in the database, and all of those changes we're going to record in the log, which is sitting in the buffer pool, because we also don't want the log written to disk on every single change; that's slow. Instead we'll flush the log portion of the buffer pool to disk when we need to. So if I'm the buffer manager and I say I want to evict page 13, first I go to the log manager, which is managing the log buffer pool (a different buffer pool, right; it may come from the same memory space, but it's a different manager) and say: hey, log manager, flush all your logs for page 13 and tell me when you're done. The log manager flushes them to the log file (all the writes to this log file are sequential), then comes back, and now the buffer manager can go and evict page 13. We'll see the structure of the log, what's in the log records and what they look like; that's coming. But basically, think about it that way. And this is only on a dirty page: if it's a clean page, the buffer pool manager doesn't have to talk to the log manager. On evicting a dirty page, though, it has to first tell the log manager to flush everything, because the write-ahead protocol requires that before it can do anything else: all log records pertaining to an updated page are written to non-volatile storage first. And non-volatile is stable, yes, the disks and SSDs. I know the textbook says non-volatile; previously people used to call it stable, and before that they used to call it disk. The terms change every few years because the storage hierarchy is changing. Non-volatile is the good stuff that will not get zapped. So the slide is correct.

All right. Besides this, we're going to do a couple more things. We're going to write a special type of log record called a begin log record, and when a transaction finishes with a commit status, we'll write a commit log record. There's also an abort log record; we'll see that in the next class. And again, about that magical moment before I can declare the transaction committed: the write-ahead protocol has two components. One is on eviction of a dirty page: please flush your logs before you flush the page. The other says that when you commit a transaction, the commit log record must be created in the log buffer pool, and that log page must be flushed to disk, into this non-volatile storage. When you get the signal back that the page has been written correctly, that's the magical moment when the commit happens, and now you can tell the world that the transaction is committed. So the write-ahead logging protocol has those two components. Essentially it says that when the log of the change you're trying to make hits stable storage, at that point we have a mechanism to unwind ourselves from any changes we need to back out of, or to reapply changes as needed, because the logs will have the redo and undo information.
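Sketched in code, that second component might look like this, reusing the hypothetical LogManager from the earlier sketch; the function name is mine.

```python
def commit_transaction(txn_id, log_mgr):
    # Append the commit record, then flush the log up through it.
    lsn = log_mgr.append({"txn": txn_id, "type": "COMMIT"})
    log_mgr.flush_to(lsn)   # block until the disk acknowledges the write
    # Only after the acknowledgment, the "magical moment," may the
    # transaction's status flip from active to committed.
    return "committed"
```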
All right, questions? I want to make sure you get this part, because it's the foundation for understanding everything today and the rest of the next lecture. The begin record is just a marker, and you'll see when we get to checkpointing that it helps us figure out boundaries. Otherwise, with a big log file, and the log file can have millions or billions of records, if you ask "where did transaction T1 start?", you'd have to scan everything. The begin record tells you: oh, I don't need to go any further if I'm worried about transaction T1. So you want that.

All right, so what's in these log records? I'm going to simplify at a pretty dramatic level. There's a whole set of papers on what to put in log records, and there are different types of log records, too, besides begin and commit and the ones I'm telling you about; we're going to ignore those. There's a big, fat textbook that Jim Gray wrote with all the details you'd ever want to know about transactions and logging; borrow it from me if you're interested. For this course we're going to assume logs have this common structure. Each record says: which transaction do I belong to; which object am I logging (and we'll just assume it's record-level for this class); and what type of information am I logging, something used for undo, or for redo? And notice that the undo part, "give me my old value," is something you wouldn't need if you're working with an MVCC-based storage system, since the old versions are already kept around. So there's a little connection across topics, but for the rest of this we'll work with a single-version system to keep all the PowerPoints sensible.
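Pinned down as a sketch, the common structure assumed above might look like this; the field names are illustrative, not any particular system's format.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class LogRecord:
    """The common record-level log structure assumed in this class."""
    lsn: int                         # position in the log's total order
    txn_id: int                      # which transaction I belong to
    kind: str                        # "BEGIN", "UPDATE", "COMMIT", "ABORT"
    object_id: Optional[str] = None  # which record was touched (UPDATE only)
    before: Any = None               # old value: the undo information
    after: Any = None                # new value: the redo information

# The kind of record in the example coming next: T1 updates A from 8 to 1.
rec = LogRecord(lsn=2, txn_id=1, kind="UPDATE", object_id="A", before=8, after=1)
```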
So, a write-ahead logging example. Start with a transaction. As you can see, the buffer pool has the page-oriented part plus a little additional thing called the write-ahead log buffer. It's a buffer pool like your page-oriented buffer pool, probably much smaller, and things get flushed from it top to bottom: the log, as you'll see, is a linear, ordered sequence of log records. The write-ahead buffer pool is typically pretty small; you need at least two pages, and it might have a few hundred. Everything gets flushed from top to bottom, and that will become clearer when we talk about the recovery protocol in the next class. So the log is a sequential file, a little bit different from the other files, but it has a buffer pool so that when log records are created, they get created in memory.

Transaction T1 starts, so there's a begin record. Now A is getting written, so the transaction records that A's old value was 8 and the new value is 1; that's why the record has T1, A, 8, and 1, because 8 was the old value. So in the log record I'm keeping track of my old and new values (and, as we just talked about, you can optimize the old value away under MVCC). Now B gets written; again I've got old and new values. When the transaction is ready to commit, even though that log buffer page for the write-ahead log file isn't full, I will flush the half-full page to disk before I can declare the commit. When the I/O comes back saying the page was written okay, the transaction is declared committed.

Now, obviously, this scheme has optimized a lot at commit time. If this transaction were touching just one byte in each of a million objects, it may have created just a few log pages, as opposed to having to write a million data pages to disk. That's why you can see this is faster. But it's still slow, because at commit time I still have to wait for that disk write, and those are, as you know, many tens of milliseconds. So even though it is safe, this can be slow. On the other hand, if at that point the DRAM gets zapped, which means I've lost the buffer pool, I can reconstruct everything from that log file.

Still, for each commit I have to wait for the disk to finish. Now imagine I've got a hundred transactions running and they're all ready to commit. The first transaction commits and waits for its page write to come back; the other transactions cannot commit until that page comes back. Now you're blocking transactions for tens of milliseconds, which is an eternity. We still have a pretty slow system, much better than the no-steal force policy we had, but can we make it better?

So most systems, for high performance, do something called group commit. It's a very simple idea: I'm going to batch up the commits that are coming together. I pick an interval, five milliseconds or something like that is pretty common, and even when a transaction is ready to commit, I hold off. T1 comes in and creates its records in the write-ahead buffer pool here (I'm not showing the data buffer pool; the remainder of this class will only care about the write-ahead log buffer pool, the log manager's buffer pool). Records start to go in, that page gets flushed to disk, and that's fine, maybe we needed that page out; we start writing to this new page, and now one of the transactions is ready to commit. As you can see, another transaction is going to be ready to commit very shortly, though we don't know that yet. With group commit, the transaction says: I'm ready to commit, but you know what, I'm going to hold off for five milliseconds, or whatever the timeout is, and wait for anyone else who wants to write a commit log record into the log buffer pool. Then, every five milliseconds, I take everyone who is waiting to be committed (they've raised their commit flag and said "tell me when I'm done"), collect all of their log records, and flush them out to disk in one write. Effectively, if five milliseconds is your timeout interval, the transactions that were ready to commit wait 2.5 milliseconds on average, but what you get is a much higher-throughput system. You've added a little bit of latency to each transaction, 2.5 milliseconds on average in this case, but the throughput of the system will be much higher, because with one disk I/O I will commit a whole bunch of transactions. If lots of transactions are active, maybe you got a hundred transactions that were ready to commit in that five-millisecond period. So pretty much everyone does some form of group commit.

Some systems even play additional tricks, like: I won't even wait. They go further and say: I'll write this out to disk, I've initiated the disk I/O, and I won't even wait for the I/O to come back before telling you it's committed. You can read the manuals, and sometimes you'll see that among the database vendors: oh, you could set it up that way. It comes down to your tolerance for a failure in that short window while the I/O happens. There are options like that.
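Here's a toy sketch of the batching idea, assuming the hypothetical LogManager from before. In a real engine this lives inside the log manager rather than in Python threads, so treat the names and structure as illustrative.

```python
import threading
import time

class GroupCommitter:
    """Toy group-commit sketch: batch commit requests and pay one
    sequential log flush per interval, not one per transaction."""

    def __init__(self, log_mgr, interval=0.005):   # e.g. a 5 ms window
        self.log_mgr = log_mgr                     # the LogManager sketched earlier
        self.interval = interval
        self.waiting = []                          # (commit_lsn, event) pairs
        self.lock = threading.Lock()

    def request_commit(self, commit_lsn):
        done = threading.Event()
        with self.lock:
            self.waiting.append((commit_lsn, done))
        done.wait()            # blocks ~interval/2 on average

    def flush_loop(self):      # runs in a background thread
        while True:
            time.sleep(self.interval)
            with self.lock:
                batch, self.waiting = self.waiting, []
            if batch:
                # One disk write commits every transaction in the batch.
                self.log_mgr.flush_to(max(lsn for lsn, _ in batch))
                for _, done in batch:
                    done.set()
```

The design trade is exactly the one described above: each transaction pays a little extra latency so the system amortizes one sequential write across the whole batch.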
You'll see a couple more slides, and then I promise I'll stop; I want to finish the logging schemes and then we'll stop. There are three different types of logging schemes, so let's just go through them. The first one is physical logging, which says that what I record is effectively a good diff: the before and after image of what was changed. Usually that's done at the value level, but as you can imagine, this diff can start to become really large. The challenge with this scheme: imagine I've got a slotted page, and records can move around within a slotted page while still keeping their slot IDs; you might have compaction happening inside the page, for example. Physical logging, since it's keeping before and after images, might record changes at the page level for stuff that wasn't even changed by this transaction, because it's just physically diffing the bytes on the page before and after. So it can pick up a whole bunch of stuff that's irrelevant to the actual changes, which is a cue for the physiological logging that's coming next.

The other extreme is logical logging: I'm just going to record the function that caused the value to change. In a very simple form, that's recording the query. More often it's going to be something like "I incremented the value by 10": if the update query said A plus 10, this will record "I'm an increment function and my parameter was 10" against this record. Or it might say "I updated to 10 everything matching the predicate B greater than 10." Effectively, take whatever is in the query, find some representation for it, and put that logical structure in the log. You can get a much more compact log record, but when you have to go apply these changes, you have to actually go run that query again to redo it, and if that query took an hour to run, then while applying the log you're going to take an hour. That could be very expensive, so people don't do logical logging.

What people do is physiological logging, which, going back to the previous slide, is physical to a page: for every page you change, you create a log record, with the before-and-after tracking at the page level, but within the page you just say logically what update was made. So if the records got moved around by compaction, you won't have logged that. The downside is that a given operation might now generate multiple log records: if an update in a transaction involved changes to multiple pages, you'll have one log record per page. The bottom line is that there are different logging schemes, and what we will use these logging schemes for is to do the redo and the undo. We'll pick up on that in the next class.
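As a closing sketch, here is the same hypothetical change, "add 10 to the record in slot 4 of page 13," written the three ways; the dictionary layouts are mine, purely for illustration.

```python
# One update, three logging schemes. 0x08 + 10 = 0x12.

physical = {         # byte-level before/after images of the page region
    "page": 13, "offset": 1024,
    "before": b"\x00\x00\x00\x08", "after": b"\x00\x00\x00\x12",
}

logical = {          # record the operation itself: tiny, but redo must
    "op": "increment", "arg": 10,          # re-run the (possibly slow) query
    "predicate": "b > 10",
}

physiological = {    # physical to the page, logical within it: one record
    "page": 13, "slot": 4,                 # per page touched, immune to
    "op": "increment", "arg": 10,          # intra-page compaction
}
```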