All right, so today is the last lecture we're going to have on concurrency control and transactions. Of course, it will permeate through what we talk about next when we do logging and recovery and distributed systems. Just as a reminder for everyone, homework number five is due on Monday, November 13th, and project three is due a week from today, on Wednesday, November 15th. Any questions about either of these?

So what we're going to talk about today is called multi-version concurrency control, or MVCC, as it's usually abbreviated. As I sort of said last class, this is not a concurrency control protocol in the same way that two-phase locking or timestamp ordering was a protocol. This is a larger idea, a larger concept that permeates throughout the entire database management system. The high-level idea is that the database management system is going to maintain multiple physical versions of an object in the database that together represent a single logical object. As I go through and show you an example, it will become clear what I mean. The basic idea is that I can have a single logical tuple in my database, so when I go through SQL, I see that one tuple. But internally, underneath the covers, the database management system may maintain multiple versions or multiple copies of that tuple that exist at different points in time and are visible to some transactions and not others.

So what happens is that if you have an existing object and you update it, the database management system creates a new version of that object and leaves the old one in place. This is different from what we saw in the last two classes, when we talked about timestamp ordering and two-phase locking, where every time a transaction updated a tuple we just overwrote the value with the new value; that's called doing in-place updates. Here, with multi-versioning, we always create a new version. For an insert, obviously, there isn't an existing version, so it's just an insert, and for a delete you just mark the tuple and say this thing has been deleted. But for updates, we really are creating a new physical version for that logical tuple.

The details can vary depending on what isolation level you're using, but in general, when you start a transaction and you read an object, the database management system figures out the newest version of that object that existed right before you started. So transaction one creates an object, then transaction two updates that object and creates a new version. When transaction three comes along, the database management system knows it should be reading the version from transaction two, not the version from transaction one, because the version from two is the newer one. This idea of MVCC, multi-version concurrency control, like a lot of things in databases, is old.
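To make the visibility idea concrete before we get into the examples, here is a minimal sketch in Python. This is not how any real system lays things out; the `Version` class and `read` function are invented names for illustration, and it only shows the begin/end timestamp bookkeeping.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Version:
    value: int
    begin_ts: int                 # timestamp of the txn that created this version
    end_ts: Optional[int] = None  # stamped once a newer version supersedes it

def read(versions: list[Version], txn_ts: int) -> Optional[Version]:
    # A version is visible if our timestamp falls in [begin_ts, end_ts).
    for v in versions:
        if v.begin_ts <= txn_ts and (v.end_ts is None or txn_ts < v.end_ts):
            return v
    return None

# T1 (ts=1) created A0; T2 (ts=2) later superseded it with A1.
a = [Version(value=100, begin_ts=1, end_ts=2), Version(value=200, begin_ts=2)]
print(read(a, 1).value)  # 100: a txn with timestamp 1 still sees the old version
print(read(a, 3).value)  # 200: a txn that started later sees the newer version
```

A time travel query, which we'll get to in a minute, is just this same lookup run with a deliberately old timestamp.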
The first known implementation, or at least discussion, of an MVCC protocol comes from a 1978 PhD dissertation out of MIT, and the first actual implementation in a real database management system was in the early 1980s, in a system called InterBase, which was a little bit later open-sourced and renamed Firebird. There's another InterBase out there today; I actually don't know whether it's the same one as this one from the 1980s, but I know the source code for this version of InterBase is now all in Firebird. One thing worth noting is that the guy who first implemented MVCC in InterBase, at DEC, which I think was bought by Compaq, which was then bought by HP, was Jim Starkey. He also claims to have invented blobs, and I think triggers as well, and he went on to found NuoDB, one of the newer startups in this space, which also uses MVCC.

So what's interesting to point out here is that the idea of using multiple versions in your database management system has been around since the late 1970s, but back then most systems didn't actually implement MVCC. Oracle did, Postgres did, as we'll see, but most of the systems at the time were doing two-phase locking with in-place updates. Now, in the last 10 years, with the exception of maybe VoltDB and FaunaDB, essentially every database management system does multi-version concurrency control. Part of the reason is that MVCC has this nice property: writers don't block readers, and readers don't block writers. If you don't remember anything else from this lecture, just remember those two bits. The reason this works is that we create multiple versions. If you and I start at the same time, and you update a tuple, you create a new version of it; when I then want to read that tuple, I don't have to get a shared lock on the object you just wrote to. I can go back and read the older version without blocking on you. In the same way, someone can write to a tuple without interfering with someone else reading it.

These are called snapshots. The idea is that when your transaction starts, the database management system provides you a consistent snapshot of the database as it existed at the moment you started. Again, depending on what isolation level you're running at, you may see things from committed transactions or not, but in general, if you're running with full protections, you will not see anything uncommitted from any transaction, even one that started before you did. The way we do this, as I'll show in some examples, is that we maintain timestamps for the different versions, and those timestamps determine the visibility of the versions. When every transaction starts, it's given a timestamp, and that timestamp can then be used to determine whether you're allowed to see a particular version of a tuple, because inside the tuple we now store the timestamp of the transaction that created that version. I'm being very hand-wavy about how this actually works; it's actually very complicated. But the general idea is, again, that you don't want to read a version of an object that had not been committed at the moment you started.
So consistent means you only see things from committed transactions. Of course, this is going to complicate a lot of things, because now we have to do garbage collection and worry about indexes and all this other stuff, and that's actually really hard. But the high-level idea should be pretty easy to understand.

One interesting benefit you get from multi-version concurrency control is these things called time travel queries. If you don't do any garbage collection, if you just let every version you ever created sit around in the database forever, then you can ask questions like: run this query on the state of the database as it existed two weeks ago. All you have to do is figure out the timestamp for two weeks ago, and then when you run your query and scan the database, you skip anything that was created after that two-week threshold, and you see the data exactly as it existed at that time. This is not a new idea. It was actually one of the original ideas that Postgres promoted in the 1980s: they said they were going to support time travel queries. They ended up getting rid of it around 1998 or 1999, because if you're not doing garbage collection on all these old versions, you run out of storage space pretty quickly, and in the 90s hardware was expensive and we didn't have drives as large as we have now. So Postgres got rid of it, but it's sort of coming back in the last four or five years: for the commercial systems you can get extensions or add-ons or plug-ins that enable this feature, and at a high level it essentially just means you stop doing garbage collection. SQL Server will sell you something that allows you to do time travel queries. I have to admit, though, I don't come across a lot of people outside of the financial industry that actually want to do time travel queries. It's mainly for auditing or regulatory requirements, where I may need to go back seven years in time.

All right, so let's look at two quick examples. We're going to have two transactions: transaction T1 is going to read A and then read it again, and transaction T2 is going to write A. In our database over here on the side, we have a version ID; this is a logical thing I'm showing you, you wouldn't actually store something like this. So this is object A, and we'll say it's version zero, then we have the value for it, and then a begin and an end timestamp. Different databases call these different things; we'll see in Postgres in a second that they call them xmin and xmax. But the basic idea is that this is the range in which this version of the tuple should be visible. We assign timestamps to our transactions when they start: T1 gets one, T2 gets two. The first thing that happens is T1 does a read on A, and we look in the database and say, well, our timestamp is one. Actually, this end timestamp here should be null, so ignore that for now. Our timestamp is one, so we want to find the version of object A that existed and was committed at any time before one. In this case, zero is less than one, so that's the version we're allowed to read.
All right, so now transaction two starts, and transaction two wants to do a write on A. Again, we're not going to overwrite the existing data we already have in our database, as we would under in-place updates with two-phase locking or timestamp ordering or OCC. We're going to create a new version, A1, and then we're going to set its begin timestamp; that's a typo on the slide, it should be two. The begin timestamp should be two, and then we go and modify the end timestamp of the other version to also be two. I could just edit the slide right now, but that's always a bad idea in class, especially on video. So we created a new tuple and set its begin timestamp to two, and that says anybody that comes at timestamp two or later should be able to see it. And we set the end timestamp of version A0 to two, and that says that if you find this particular tuple and your timestamp is greater than two, this is not the version you should be seeing: follow a pointer, and we'll talk about how that works in a second, to find the next version of the tuple, and do the same check to see whether your timestamp is within that version's valid range and therefore you're allowed to see it.

So now we go back over here to transaction T1. T2 hasn't committed yet, and T1 wants to do a read again. In this case it ends up reading A0, because its timestamp is one, so it reads the exact same version that it read before. Yes, in the back: again, the begin value should be two; it's a typo, I'll fix it in the slides later. And the end timestamp up there should be two as well, because the transaction with timestamp two wrote a newer version, so the prior version has now been invalidated as of timestamp two. And the end timestamp on the new version should be null, not zero, because otherwise nobody would ever be able to see it. I'm grossly oversimplifying how this actually works, and we'll look at Postgres in a second, but this is the general idea: you set these timestamps, and they determine what's visible to you.

All right, so let's look at another example. Same thing: transaction T1 gets timestamp one, transaction T2 gets timestamp two. T1 is going to do a read on A and then a write on A, and transaction T2 wants to do a read on A and then a write on A. Again, we start with transaction T1: it reads A, so it reads that first version up there, and then it does a write on A and we create a new version. I did it correctly on the slide this time. The new version is A1, its begin timestamp is one, and we update the end timestamp of the previous version to be one. So version A0 is only visible from zero to one, and the new version A1 is visible from one to infinity at this point. Now T2 comes over here and wants to do a read on A, and T2 will actually read version A0, because A1 hasn't committed yet.
Now, what I'm not showing you here is that the database management system maintains an additional data structure that keeps track of the timestamps of the transactions that are still running. In Postgres they call this the commit log, the clog. So the system knows that this tuple was created by the transaction with timestamp one, and it can look at this additional data structure and ask: has the transaction corresponding to timestamp one committed yet? If yes, I can read that version. If no, I go back to the previous version. Some database systems, the in-memory ones, will actually store the transaction's commit status directly alongside the transaction ID inside the tuple itself, whereas Postgres stores it separately. In an in-memory system you wouldn't have to check an additional data structure; you'd know just by looking at the tuple whether the transaction that modified it has committed or not. Postgres keeps it separate because its tuples live on disk pages hanging out in the buffer pool, and any of them could get swapped out to disk at any time. If you embedded the commit status inside the page itself, you might have to go out to disk to figure out whether a transaction committed; with a separate data structure, Postgres can just look there and know whether transaction T1 or T2 has committed yet.

So now T2 wants to do a write on A. At this point it actually has to stall, because the other guy, transaction T1, wrote to A, and we can't create a new version until we know whether T1 actually commits. I said before that readers don't block writers and writers don't block readers, but all the concurrency control stuff we talked about before still comes into play when writers need to write the same thing written by other transactions. So T2 stalls, and T1 then does its read; it can read the same version that it wrote earlier. Then it goes ahead and commits, and at that point T2 can create its new version. Whether T2 is allowed to commit depends on the isolation level you're running at: under serializable it would have to abort, but under read committed it's okay, because even though it read the older version rather than the one T1 created, under read committed that's allowed.

Now, this is all in addition to the concurrency control stuff we talked about before. When you do writes, you still have to do two-phase locking or timestamp ordering; the versioning is on top of that, and it determines which version you're allowed to look at. In the back: the statement is that this only cares about write-write conflicts, and everything else goes through the versions. Correct. When you do writes, you have to use two-phase locking or timestamp ordering to protect the write operations, but your reads can always go back and read the older versions. Is there another question? Yes: her point is that in my example here, if T2 is allowed to commit, this is not serializable. That is absolutely true. But it is still valid under read committed.
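In code form, that commit-status check looks roughly like this, extending the little sketch from earlier. The clog is modeled as a plain dictionary, and again every name here is invented for illustration; a real system packs this into tuple headers and bitmaps.

```python
clog = {1: "committed", 2: "in-progress"}   # txn id -> status, kept outside the tuples

def visible(v, my_ts: int) -> bool:
    if v.begin_ts == my_ts:                  # a transaction always sees its own writes
        return True
    if clog.get(v.begin_ts) != "committed":  # creator still running, or aborted
        return False
    # Then the usual range check: our snapshot must fall in [begin_ts, end_ts).
    # (A real system also checks whether the txn that set end_ts committed.)
    return v.begin_ts <= my_ts and (v.end_ts is None or my_ts < v.end_ts)
```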
I don't want to get into the details of this right now, but MVCC actually supports a different isolation level that I haven't told you about, because I didn't want to complicate things. It's called snapshot isolation. It basically says that T2 read the database at a consistent snapshot based on its timestamp: it started with timestamp two, so it read the version of the database containing everything that was committed prior to timestamp two. In that case, A0 was committed, its timestamp was zero, and so T2 was allowed to read it. Under serializable isolation, if you're running two-phase locking or whatever additional protocol you're using to ensure the writes are done correctly, then yes, this guy would actually have to abort. And we can see this now in Postgres.

All right, any questions? Okay, so let's do a quick demo in Postgres, and there are some details I'll explain as we go along where what Postgres actually does is slightly different from what we talked about here. We have a simple table for the transaction demo that has two tuples. We start a transaction at the top, running under read committed, and a transaction at the bottom, also running under read committed. Now I'm going to read one tuple from our table, and we see its current value. As I said before when I talked about the begin and end timestamps, you can actually see these in Postgres directly in SQL. Postgres doesn't call them begin and end; it calls them xmin and xmax. So we see that our tuple has an xmin, which is the timestamp of the transaction that created this version, and an xmax set to 0, which you can think of as null, meaning there's no expiration for this particular version of our tuple; it's the latest version that we have. And then id and value are just the values we were storing before.

Every transaction, when it starts, is assigned a timestamp, and we can see this in Postgres by invoking the function txid_current(). This says that the transaction I just started has timestamp 62990, and before class, when I created the table, that transaction had timestamp 62989, so this is the next transaction in sequential order. You also see here that Postgres is using a logical clock: it's just a simple counter that it increments, not the system clock, to figure out the timestamps of things.
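If you want to poke at this yourself later, here is roughly the same first half of the demo driven from Python with psycopg2 instead of two psql windows. I'm assuming a table called txn_demo(id, value) already exists and local connection defaults; adjust those to your setup.

```python
import psycopg2

# Two connections act as the two concurrent transactions from the demo.
c1 = psycopg2.connect(dbname="test")   # the "top" session
c2 = psycopg2.connect(dbname="test")   # the "bottom" session
cur1, cur2 = c1.cursor(), c2.cursor()  # default isolation: read committed

cur1.execute("SELECT txid_current()")  # the logical-clock timestamp
print("session 1 txid:", cur1.fetchone()[0])

cur1.execute("UPDATE txn_demo SET value = value + 1 WHERE id = 1")

# Session 1 sees its own new version: xmin = its txid, xmax = 0 ("no expiry").
cur1.execute("SELECT ctid, xmin, xmax, value FROM txn_demo WHERE id = 1")
print("session 1 sees:", cur1.fetchone())

# Session 2 still sees the old physical version (note the different ctid);
# its xmax is stamped with session 1's txid, which hasn't committed yet.
cur2.execute("SELECT ctid, xmin, xmax, value FROM txn_demo WHERE id = 1")
print("session 2 sees:", cur2.fetchone())

c1.commit()
c2.rollback()  # under read committed, session 2's next query sees the new version
```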
So now what happens is, I'm going to update the tuple and just increase its value by one. Now, if I go back and select the same tuple again, you see that its begin timestamp, the xmin, is my transaction ID: it matches what I got when I asked Postgres for my transaction ID. So Postgres created a new version internally and set the begin timestamp to my transaction. This transaction hasn't committed yet, so now we can go down to the other transaction, which started at the same time, and do our lookup on that same tuple. We see the previous version, the one that was created before class started, but now its xmax, the end timestamp, is set to my transaction ID, because this is saying that this version is only visible to transactions in that range. If we look at the bottom transaction's own timestamp, it should be 62991, and it is. The reason the bottom guy is still able to see the older version is that it does the lookup into that additional data structure and sees that transaction 62990 has not committed yet, so the newer version isn't visible to it, and it sees the older one.

We also know Postgres is actually storing multiple physical entries for this, because we can use the trick from when we talked about buffer pool managers: we can run a query that includes the ctid, which is the page ID and offset, the physical location of this tuple on disk. We see the top one is at page zero, offset three, and the bottom one is at page zero, offset one, so we know we're actually accessing two distinct physical versions of the same logical tuple, and the way Postgres figures out what's visible to you is based on your timestamp and the timestamps embedded inside the tuples themselves.

All right, so now, if I commit the top transaction and do a select at the bottom, what should I see? The second version or the first version? The second version, correct, and now the bottom sees the same value as we had at the top. Internally Postgres still has these different versions, but I don't think it's going to let me access them anymore. Let me try filtering on the xmin, which was at 89. Okay, so this would be sort of an example of a time travel query, where I ask it to go back in time and give me the version of the tuple that existed before. Actually, no, that's the wrong tuple; this is the second tuple, and that's why it let me do that. So it looks like I can't go back in time that way. Let's try ctid equals '(0,1)'. Yeah, at this point Postgres has blown that version away and we can't see it anymore. Yes? His question is: are the xmin and xmax numbers specific to this relation, or are they global to the database?
It's global to the database. It's the transaction ID; the ID corresponds to the transaction, and you use it no matter what table you access or modify. You need that because you have to make sure you have a consistent snapshot of the entire database: if you had different timestamps when you accessed different tables, it would be very difficult to guarantee a consistent view across all of them.

Okay, let's do another quick demo. We'll start another transaction at the top, after I roll back here, and another one at the bottom, again running under read committed. Same thing: we update the top tuple, then do a select at the bottom, and we see the old version. We know this because the timestamp at the top is not the timestamp at the bottom, so we're seeing the older version. Now the bottom guy tries to update that same tuple. What should happen here? What's that? Stalls, right. And at the top, if I go ahead and commit, we see the bottom one gets released and is then allowed to do its update. And now, if the bottom commits, should that be allowed? Raise your hand if you say yes; raise your hand if you say no. The answer is yes: it was allowed to create its new version once the other transaction committed, so its version now comes after the top guy's. It had to block to see whether the other guy actually commits, and then it was allowed to commit.

Now contrast this with running the same thing under serializable. Top guy does an update, bottom guy does an update. What should happen? Stalls. But now, if I commit at the top, what should happen at the bottom? Raise your hand if you say it's allowed to update; raise your hand if you say it'll abort. Not very many hands. It aborts, right? With multi-version concurrency control, just because you have multiple versions doesn't mean you can violate serializable ordering. You still have to do all the same protections we talked about last time if we want true serializability; with read committed, again, we get a lower isolation level.

And actually, I can give one more quick example. Let's say I run the top guy under serializable and the bottom guy, after we roll back real quick, under read uncommitted. The top guy does a select, we do the same select at the bottom, and we know we're both looking at the same version. The top guy does the update. Now, at the bottom, we do a select. What version should it see, the old one or the new one? It's the old one. And again, this is because, remember, an isolation level doesn't mean you will actually hit those anomalies. Read uncommitted doesn't mean you're going to read uncommitted data; it means you could. In the case of Postgres, because they're doing multi-versioning with these timestamps, they would actually have to do extra work to let you read the latest version from the uncommitted transaction.
Postgres just looks at your timestamps and says: this is the version that existed when you started, so that's what you get to read. And now, if I commit at the top and do a select at the bottom, I get the new version, because again, Postgres uses these timestamps to figure out what's visible to you. All right, any questions about that? What's that? With MySQL, I think read uncommitted will show you uncommitted data as you'd expect.

Okay, so internally, in our tuple, we now have a bunch of extra metadata we want to store. There's the transaction ID, which some systems have in the tuple, not always. Then you have the version lifetime, the begin and end timestamps, and then we have this pointer, which will come up when we start talking about how we actually implement this. The pointer lets you say: what's the next or previous version of my tuple? So you basically maintain a linked list: if this is not the version I want and I need to go further back, the pointer tells me where to go. It's usually a singly linked list, because going in both directions would cost extra space, and usually you only need to go in one direction. Then some systems have additional metadata, and then there's your actual regular data for the tuple.

So, I said this at the beginning, and I'm going to stress it again: the timestamps we're using to maintain different versions are not the concurrency control mechanism. We still have to do two-phase locking, we still have to do all this other stuff. The way to think about multi-version concurrency control is that it's a design of the entire system. There are all these different database management systems out there that do MVCC, and this is just a small sample; they all do it very differently, with different semantics and different behaviors. The four design decisions we're going to have to deal with are the following. First, the concurrency control protocol, which we've already covered, but I'll say a little bit about it again quickly. Then you have to decide how you're going to store these different versions. Then how you're actually going to do garbage collection to prune the older versions: if you're not doing time travel queries, if you're not keeping the history around forever, how do you go through and find the old stuff to throw away, because otherwise you're going to run out of space? And then how do you do index management, in particular for secondary indexes: what should a secondary index actually point to?

For the concurrency control protocol, all the same things we talked about before can be used in MVCC. Some systems use timestamp ordering, the basic protocol that we talked about last time; you can do OCC with the three-phase protocol that we talked about before; and you can do regular two-phase locking. What happens in all these approaches is that the concurrency control for writes is done only for write-write conflicts, on the actual physical version you're trying to modify. So think of it like this: under two-phase locking, if I want to update a tuple, I have to acquire a write lock on the physical version of that tuple before I'm allowed to create a new version.
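Here is a sketch of that write-write behavior that you can reproduce against Postgres, again with psycopg2 and the assumed txn_demo table. Under SERIALIZABLE, the second writer blocks and then aborts once the first commits; under the default read committed, the same second update would simply proceed after the wait.

```python
import threading
import time

import psycopg2
from psycopg2 import errors
from psycopg2.extensions import ISOLATION_LEVEL_SERIALIZABLE

def second_writer():
    c2 = psycopg2.connect(dbname="test")
    c2.set_isolation_level(ISOLATION_LEVEL_SERIALIZABLE)
    try:
        # Blocks on session 1's write lock (write-write conflict), then fails
        # with "could not serialize access due to concurrent update".
        c2.cursor().execute("UPDATE txn_demo SET value = value + 10 WHERE id = 1")
        c2.commit()
    except errors.SerializationFailure:
        c2.rollback()
        print("session 2 aborted, as serializability requires")

c1 = psycopg2.connect(dbname="test")
c1.set_isolation_level(ISOLATION_LEVEL_SERIALIZABLE)
c1.cursor().execute("UPDATE txn_demo SET value = value + 1 WHERE id = 1")

t = threading.Thread(target=second_writer)
t.start()
time.sleep(1)   # give session 2 time to block on the row
c1.commit()     # releases session 2, which then hits the serialization failure
t.join()
```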
That write lock is exactly the stall we saw in the Postgres demo. So the next thing we have to talk about is how we actually store these different versions. As I said, there's going to be a pointer inside the tuple header that points to the next physical version of that logical tuple. This essentially forms what's called a version chain, which the database management system can traverse to find the correct physical version it should be looking at for a particular tuple. In the query we did at the bottom of the demo, we had to figure out whether to look at the old version or the new version: you always land on either the oldest or the newest version, look at the timestamps, and decide whether you're allowed to see it. If not, you follow the pointer to the next one and do the same inspection, and you stop once you find the version you want. The way to think about this is that all our indexes, whether the primary key index or a secondary index, always point to the head of the chain, and we always traverse the chain in that one direction. That's why you only need a singly linked list: you follow your index, jump to the head of the chain, and do your checks as you go along to find the version that's actually visible to you.

The different storage schemes that are out there determine where and what you store for each version. They have different trade-offs and different performance implications, and there's no one implementation that's the best for all possible workloads. The three approaches, and I'll go through examples of each, are append-only storage, time travel storage, and delta storage.

Append-only is the simplest to think of: any time you create a new version of a tuple, you just insert an entire new tuple into the main table. This is what Postgres does. Any time you update a tuple, it copies the old version into a new slot inside a page, and then you overwrite the values you want to update. Time travel storage is where you take the old versions and store them, in their entirety, in a separate table. You have sort of a master or main table that always contains the latest version of the tuple, and any time you do an update, you copy the old version out and put it into the other table. The difference between time travel storage and append-only storage is that with append-only you append to the same table, and with time travel you append to a separate table. Yes? The question is: under append-only storage, will the garbage collector go through and kick out any version that we know is not visible to any transaction? Yes, but that's not just for append-only; we can do that for all of them. We'll get to that.
And the last one is delta storage. Under delta storage, you have a main table with the master version of the tuple, the latest version, and then a separate storage space that essentially contains the deltas, the changes that were made between versions. So if I have a tuple in my main table and I modify only one out of its three attributes, then in my delta storage I put just that one change: the old value of the one attribute that I modified.

So let's go through examples of each of these. Along the side I'm showing sort of the logical names of the versions of these tuples; the actual values don't matter, and this is not what you'd actually store. And you see we have our pointer: in this case we're pointing from the oldest version to the newest version. AX is the oldest version and AX+1 is the latest, so AX has a pointer to AX+1. For BX, we don't have another version at this point, so it's by itself.

Under append-only storage, all the physical versions of the same logical tuple are always stored in the same table space. The same table heap you would maintain for an in-place update system, we maintain the same thing, just now with different versions inside the same table. Any time we have an update, again, we copy the old version of the tuple into a new slot and then update it with the new data. Say we update object A and want to make version AX+2: we copy down the data, override it with our new values, and then go back and update the pointer on what used to be the newest version so it points to our new version. So again, if you want to do a lookup on object A, you follow the index, land first on version AX, check the timestamps, and if it's not visible to you, you keep following the version chain until you find the one that actually is.

Now, this version chain can actually go in either direction. In this particular example I'm showing oldest to newest, but you can also go newest to oldest, and as you'd expect, these have different performance implications for different workloads. Under oldest to newest, any time we create a new version, all we have to do is plop it down at the end of the chain and update the last pointer. So adding a new version is really easy, because it's only updating one pointer in one location, but lookups get more expensive, because we may have to traverse the entire chain. Garbage collection will kick in and prune things to reduce the number of versions you have to look at, but that doesn't come for free, and if you have a workload that's updating the same tuple over and over again, your version chain can get kind of long. The alternative is to go from newest to oldest. In this case, the head of the version chain is always the latest version, so when you update a tuple, you plop down the new version and it points to the old head of the chain. Now it's really fast to find the latest version of the tuple, because you land on it first, and that's probably the one you want. So it makes lookups really fast. Here's a little sketch of that newest-to-oldest scheme.
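This is a toy model in Python, with dictionaries standing in for tuple headers and in-memory indexes; nothing like a real page layout, and all the names are invented. It's just to show the pointer movement.

```python
def update(indexes, key, new_attrs, txn_ts):
    """Install a new version as the head of the newest-to-oldest chain."""
    old_head = indexes[0].get(key)
    new_head = {"begin_ts": txn_ts, "end_ts": None,
                "attrs": new_attrs, "next": old_head}
    if old_head is not None:
        old_head["end_ts"] = txn_ts       # the old version now ends at our timestamp
    for idx in indexes:
        idx[key] = new_head               # one pointer rewrite per index(!)

def lookup(index, key, txn_ts):
    """Walk the chain from the head until a visible version is found."""
    v = index.get(key)
    while v is not None:
        if v["begin_ts"] <= txn_ts and (v["end_ts"] is None or txn_ts < v["end_ts"]):
            return v["attrs"]
        v = v["next"]                     # follow the version chain pointer
    return None

pk_index = {}
update([pk_index], "A", {"value": 100}, txn_ts=1)
update([pk_index], "A", {"value": 200}, txn_ts=2)
print(lookup(pk_index, "A", 1))  # {'value': 100}: an older txn walks the chain
print(lookup(pk_index, "A", 3))  # {'value': 200}: lands straight on the head
```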
But you can see the problem in that sketch: if your indexes point to the head of the chain, you have to update all of your indexes to point to the new location of the new head every time you create a version. So if you have a lot of indexes, this becomes expensive to do. And we'll actually see this later on: this is one of the reasons Uber cited for why they went from Postgres back to MySQL, because they had issues doing these updates on all their indexes.

All right, so the next version storage scheme is time travel tables. Again, this is basically the same thing as append-only storage, but instead of appending the new versions into the main table, you write the old versions out to a separate time travel table. Here, in our main table, we always have the latest version, and its pointer just points to the location of the older version in the time travel table. So if I want to do an update, I copy the current version from the main table into our time travel table, then we update the pointers to complete the version chain, and then in the main table we update the master version's pointer to point to what used to be the most recent version, the one we just swapped out. Yes? His question is: does this copying happen before the transaction commits? Yes. And his next question: does that mean the master version in the main table could have uncommitted data? Also yes. But again, you use these timestamps to figure out whether the master version is actually committed and visible to you, and if it's not, you follow the pointer to the time travel table and keep going until you find one that is. And his last point: if you had to roll back, you'd have to copy the old version back over. Correct. There are some implementations that actually do the reverse, where the main table holds the oldest version, and when you do garbage collection, you move versions out. There's no one way that's better than another; these are all design decisions you have to deal with.

Okay, the last one is delta storage. Here, instead of storing the entire tuple in a separate table or data area, we only store the changes that were made from one version to the next. We have our main table, and it always holds the master version, or at least the most recent version, whether it's committed or not. Then when we modify, say, a single attribute, our delta record in the delta storage segment holds only the change that was actually made. So to create a new version A2, we first copy the value that's being modified, in this case the value attribute, over into our delta segment, and then we can update the master version in the main table. And we can do this again and again, creating a new version each time and keeping a chain going in the delta segment, just like before with the time travel table. So what's one obvious benefit of this? The obvious benefit is the space savings: if your tuple has a thousand attributes and my transaction only updates one of them, I only have to store that one attribute in my rollback segment, in my delta record, whereas in the other two schemes you have to make an entire copy of the tuple and put it in whatever version segment you're using. The downside, of course, is that if you need to go back in time, because you need to find a version that's visible to you, you're essentially replaying a log, applying all these changes to put the tuple back into the form it should be in for your timestamp. Here's a little sketch of that reconstruction.
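A tiny sketch of that replay, assuming each undo record stores the begin timestamp of the version it reconstructs plus the old attribute values. The structure is invented, but it's roughly the shape of what a rollback segment holds.

```python
master = {"value": 300}   # the master version in the main table
master_begin = 3          # txn timestamp that wrote the master version

# Each undo record: (begin_ts of the version it reconstructs, old attribute values)
deltas = [(2, {"value": 200}), (1, {"value": 100})]

def read_version(txn_ts: int) -> dict:
    tup, begin = dict(master), master_begin
    for prev_begin, old_attrs in deltas:
        if txn_ts >= begin:        # this version is the one visible to us
            break
        tup.update(old_attrs)      # replay the undo record to go one version back
        begin = prev_begin
    return tup                     # (commit-status checks omitted for brevity)

print(read_version(5))  # {'value': 300}: sees the master version
print(read_version(1))  # {'value': 100}: replayed two undo records backwards
```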
This is actually what MySQL and Oracle do. It also makes garbage collection really easy in this case: both the time travel table and the delta storage segment make garbage collection much easier in some ways, because there's a single region in memory or on storage where you can say, none of these versions are visible to any active transaction anymore, and just blow it away. Whereas in the append-only case, you actually have to scan through the table, find the versions that shouldn't be there, and prune them out.

And that's essentially what garbage collection does. The idea is that we need to go back and find what are called reclaimable physical versions. Reclaimable means either that there's no active transaction that can see that particular version, or that the version was created by an aborted transaction. If I do an update in Postgres and then abort my transaction, that version is physically still there; it's just that no one can see it. In Postgres, they call this the vacuum: you go through and do garbage collection to remove these things and reclaim storage space. The two design decisions we have to deal with are how we're going to look for expired versions and how to decide when it's safe to reclaim the storage space or memory. There are other decisions too, like how often you should run garbage collection, and all the same standard issues you'd have in the JVM and other memory-managed languages apply here. If you're too aggressive with garbage collection, yes, you free up memory, but you burn cycles and slow down the regular workload. If you're too lax, then at some point you get a big pause, because you have to go clean things up when you're running out of space.

At a high level, there are two types of garbage collection we can do: tuple-level garbage collection or transaction-level garbage collection, and within tuple-level garbage collection, we can have either vacuuming or cooperative cleaning. At the tuple level, the basic idea is that we look at the tables directly, examine the tuples, and figure out whether they're actually visible to anyone. At the transaction level, every single transaction maintains the old versions that it created or invalidated in a single location; when it commits, at some point the database system will know that the versions the transaction invalidated are not visible to any other running transaction, so it looks directly at those data structures and throws them out. So, under tuple-level garbage collection, there are the two approaches.
The first tuple-level approach is background vacuuming, where we have a separate thread that periodically scans the table and tries to find old versions we can get rid of. This works with any storage scheme we talked about: delta storage, append-only, and time travel tables. Say we have two threads: the first has transaction ID 12, and the second has transaction ID 25. In the background, a vacuum thread comes along and scans through, trying to find versions that are no longer visible. The vacuum knows there's a transaction with ID 12 and a transaction with ID 25, so anything that's not visible to the oldest transaction, number 12, we know we can throw away. Versions AX and BX have begin and end timestamps of one to nine, and 12 is greater than nine, so there's no active transaction that can actually see those records, and the vacuum knows it can go ahead and clean them up.

You can be pretty smart about this. For example, if I have a bunch of pages and I know a page has not been modified since the last time I ran the vacuum, then the vacuum doesn't need to look at it again; you can maintain an in-memory bitmap, a dirty bit per page, to skip those. Otherwise, you're essentially doing a sequential scan, looking at every single physical version and deciding whether it's actually visible to anyone, and that can be really expensive. Yes: his question is, do you actually condense the pages once you do this? That's also called compaction. I don't know whether Postgres does this on a regular vacuum; on a full vacuum, they actually might. And a lot of systems will do compaction but still keep the pages. So if you have a 10 gigabyte database and you run the garbage collector and do compaction, maybe you get down to a 2 gigabyte database, but it's not like the database system actually gives that storage space back to the OS, because it wants to keep its extents nice and contiguous for sequential writes. And again, we can maintain a dirty bit that says whether we even need to look at a page the next time the vacuum comes around.

The second choice is cooperative cleaning. What happens here is that as the threads executing the queries scan through our tables, checking the different versions to find the ones that are visible to them, they can also notice: I'm looking at a version that's not visible to me, and I know it's not visible to anybody else, so I can go ahead and clean it up right now. Say I do an index lookup for the first transaction and land on AX, and AX is not visible to any transaction; say it was created at timestamp one and already invalidated. The thread knows its own timestamp is 12 and the other guy's is 25, so nobody can see AX, and at the moment it finds it, it prunes that old version and updates the chain and the index to point to the latest version. So when the next transaction comes along and follows the index, it jumps straight to AX+1 rather than going through AX.
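Both flavors of tuple-level cleaning hinge on the same check: a version is reclaimable once its end timestamp is below the oldest timestamp among the active transactions, the low watermark. A sketch, with invented names and dict-based versions like before:

```python
def garbage_collect(table: dict, active_txn_timestamps: set) -> None:
    low_watermark = min(active_txn_timestamps)   # e.g. min({12, 25}) == 12
    for key, versions in table.items():
        # Keep a version only if some current or future txn could still see it.
        # (A real system also reclaims versions created by aborted transactions.)
        table[key] = [v for v in versions
                      if v["end_ts"] is None or v["end_ts"] > low_watermark]
```

A background vacuum runs this over whole tables, or just over the pages a dirty bitmap flags; cooperative cleaning runs the same test on whatever chain a query happens to be walking anyway.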
The idea with cooperative cleaning is sort of load balancing: yes, it makes transactions run a little bit longer, but you don't have to worry about scheduling a separate background thread to do this periodically. I would say this approach is mostly done in main-memory systems that are doing MVCC, and it also only works if your version chains go from oldest to newest. What's an obvious reason why this is the case? Why wouldn't it work for newest to oldest? Correct: because when you come in through the index, you'd always land on the newest version; you would never make it back to find the oldest versions. An additional problem you have under cooperative cleaning is tuples that nobody ever reads again: the threads never come across them, so you could do a ton of updates to an object, build up a bunch of versions, and then if nobody ever reads that object again, those versions never get cleaned up. At Microsoft, I think they call these the dusty corners. The idea is that even though you're doing cooperative cleaning, you still have an additional background thread that occasionally goes through and looks at everything, to find the things the other threads missed at runtime.

And then for the last approach, transaction-level garbage collection, there's nothing much to say: the transactions keep around their write sets. We do this because each transaction knows what versions it created and what versions came before them. When a transaction commits, we keep around that metadata about its write set, and at some point the older versions the transaction invalidated will no longer be visible to any other transaction, so at that point we can just go through and clean them up. Again, this avoids the problem of having to do sequential scans or having a background thread look at everything.

Okay, so the next thing to talk about is how to manage indexes. This is actually often overlooked when people talk about MVCC: people talk about the version storage or the concurrency control protocol, but index management can actually make a big difference in the performance of a system as well. The idea here is that our primary key index always points to the version chain; whether the head is the oldest version or the newest version doesn't matter, it always points to the head. And typically, when you update a tuple's primary key, in some implementations of MVCC you actually do a delete and then an insert. So logically it's still the same tuple, but physically there will now be two separate version chains: the deleted one and the new one. But if you update other attributes that aren't the primary key, it becomes more complicated, because depending on what your secondary indexes point to, you may have to go through and update all of them. This actually came up in a blog article from Uber last year, where they talked about why, in the internal database system that runs Uber, they switched from Postgres to MySQL.
Actually, the true story is that they went from MySQL to Postgres and back to MySQL, but the blog article only deals with the second half, and what they're pointing out in the article is actually what we'll talk about next. The basic idea is that in some systems, the secondary index doesn't actually point to a version; it contains the primary key. So when you do a lookup on the secondary index, you have to do another lookup on the primary key index to actually find the tuple that you want. Now, this may make it seem slower for reads, and obviously it is, because you're doing two index probes. But if you're doing a lot of updates and have a lot of secondary indexes, you don't have to go update what the secondary indexes point to, because they always point into the primary key index, and that stays the same.

So there are two ways we can maintain our secondary indexes under MVCC. The first is to use what are called logical pointers: immutable identifiers that we store in the secondary index, which don't change no matter how many different versions of an object we create. To make this work, though, we need a second indirection layer, a mapping that goes from this logical identifier to the actual physical version location. In the example I showed you before, you could use the primary key, or you could use another internal identifier like a tuple ID or record ID that then lets you find the thing you want. The alternative is to use physical pointers, where the secondary index always contains the pointer to the head of the version chain. Of course, that means that if you're doing newest to oldest, where every new version becomes the new head of the chain, then you have to go through all your secondary indexes and update the value portion of what they're pointing to, because the new head is not in the same physical location. If you're doing delta storage, this is not a big deal, because you always point to the master version in the main table and then follow the version chain to find what you want. But under append-only storage going newest to oldest, you always do have to update the secondary indexes.

So let's go through both of these. Say I have two index data structures, the primary index and the secondary index, and we store our versions using append-only storage with newest to oldest. If I want to do a lookup on the primary key index, this is trivial: the primary key index has the physical address of the latest version, so I jump down to the version head, and if the latest one is what I want, I'm done; otherwise, I follow the chain to find the version that's actually visible to me. For the secondary index, though, if we store the physical address, then we have this problem again, where every single time the head's location changes, we have to go through and update the secondary index as well. We already have to do that for the primary key index, and now we have to do it for the secondary index too.
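To make the two options concrete, a sketch with invented names: with physical pointers, every new chain head means a write to every secondary index, while with logical pointers the secondary index stores the primary key and is never touched by updates, at the cost of one extra probe on reads.

```python
primary = {}          # primary key -> physical address of the chain head
sec_physical = {}     # email -> physical address (rewritten on every update)
sec_logical = {}      # email -> primary key      (never touched by updates)

def install_new_head(pk: int, email: str, new_addr: int) -> None:
    primary[pk] = new_addr
    sec_physical[email] = new_addr    # repeat for EVERY physical-pointer index

def lookup_by_email(email: str) -> int:
    # Logical pointer: an extra indirection through the primary key index.
    return primary[sec_logical[email]]
```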
So if we have a lot of secondary indexes, then every single time we update things, we have to update all these pointers, and what could have been a simple update to a single tuple gets propagated into updates on a large number of indexes. The alternative is to store either the primary key, so we do a lookup on the primary key index to get the physical address, or some additional mapping that says, for this logical tuple ID, here's its physical address; that could be a hash map or something like that.

In Uber's case, they were hitting this problem in Postgres because Postgres stores the physical address in the secondary index. They were doing a lot of updates with new data, and every single time they created a new version, they had to go update all of those pointers to the new location. MySQL does delta storage and actually uses the primary key inside the secondary index, so any time you did an update, you only had to touch the primary key index; you didn't have to update the secondary ones. I'm not saying MySQL is better than Postgres, or that they do MVCC better than Postgres, but for Uber's particular workload, the way MySQL did it was actually better for them than the way Postgres did it. There are a bunch of other things in the article that they cite for why they switched, but this is one of the main things that was highlighted.

Yes: his question is, why isn't this a tunable knob inside of a database system? I would say you're oversimplifying how much you'd have to change in the system to make that work. I actually don't know of any system that lets you switch this; maybe some do, but I think it would be a lot of engineering to maintain both of them.

Another question: how do these systems handle deletes? Deleting a physical version, or a logical tuple? A logical tuple, okay. So let's take these one at a time. Take this case here: you know the tuple you want to delete, so you can extract from it the keys that are indexed in the secondary indexes, do a lookup in each of them, and delete the entry. That's easy to do. And the same thing for the primary key index: you know the primary key, so you know how to get to the entry, and you delete it. Then, in the actual version storage, what you typically do is store a dummy version at the end of the chain, or the beginning of the chain, that says the latest version of this tuple has been deleted. Now, if anybody comes along and follows the chain, they see that marker and know the thing has actually been deleted. You can store that as a separate dummy version, like a virtual version, or you can flip a flag inside the tuple header that says, as of this point in time, this thing has been deleted. And then the garbage collector knows how to go through and clean those chains up once it knows that the delete marker comes before any other active transaction. That's a good question.
So, as I said at the beginning, MVCC is done by a ton of database systems, and this is just a sampling of some of them that we came up with in a paper we wrote, last year or this year, evaluating all these different design decisions. At the top you see Oracle, Postgres, and MySQL's InnoDB; those are obviously not in-memory database systems, but everything below that is an in-memory system. I'll also say that Oracle came out with MVCC, or transaction support, around 1984 or 1985; the first version of Oracle didn't do transactions at all, and they added it later on. The first version of Postgres did multi-version two-phase locking, and then in the 90s, when it was forked from the Berkeley code and became the project it is now, it switched to timestamp ordering, and those are the xmin and xmax timestamps we saw in our demo. InnoDB came in the late 90s, and it does MVCC with two-phase locking. Below that, it looks almost completely random what protocol the systems use: some use OCC, some use 2PL, and in our system I think we use timestamp ordering. And then you see they use different storage schemes, different garbage collection mechanisms, and different mechanisms for storing the index pointers to tuples. In the case of Postgres, they store physical pointers to the tuples, so every single time you create a new version, you have to update all those indexes; in the case of MySQL, they use a logical identifier, in this case the primary key, so you don't have to do all those updates every single time. So again, the main takeaway is that there isn't one design decision or one implementation of MVCC that's better than the others. There are all these different trade-offs, and the implications vary based on what your application and your workload actually look like.

So, again, MVCC is the most widely used scheme in database systems, and it encompasses all the things we talked about today, not just the concurrency control protocol. And what's interesting is that in all the examples here, I'm talking about multi-statement transactions, updating multiple objects within my transaction, but even some of the NoSQL systems that don't support multi-statement transactions, where you can only atomically update one object at a time, also do MVCC, because it makes it easier to reason about what's actually visible to a particular transaction or a particular query at a given time. Okay? Any questions? Awesome.

Okay, so this ends the chapter of the semester where we talk about concurrency control. Starting Monday next week, we switch over to talking about logging and recovery. The way to think about this is that everything we've talked about so far assumes everything is fine: nothing's going to crash, we're not going to lose any data. But at the beginning of the concurrency control lectures, I talked about durability as being an important issue. So now we're going to talk about how we can actually guarantee that any change made by a transaction that commits, and gets flushed out to disk, will be safe and durable no matter how many times we restart. Okay? All right guys, have a good weekend and I'll see you on Monday. Thank you.