This is the last lecture we're going to do on concurrency control. As a reminder for everyone, the next homework is due next Monday, and then Project 3 will be due the Monday after that. Okay, and is everything on Gradescope working yet or no? I haven't checked. It's working? Okay, awesome.

All right, so today we're going to talk about multi-version concurrency control, and the term is a slight misnomer, because it's not a concurrency control protocol in the way that we talked about two-phase locking and timestamp ordering. It's not another category of protocol you could use to do concurrency control; rather, it's a way to architect the system when you have concurrent transactions or queries, by maintaining multiple versions. And we sort of saw this last class when we were talking about OCC. We talked about how transactions would maintain a private workspace, and any time they wrote to an object or read an object, they would copy it into that private workspace. So that was sort of like maintaining multiple versions, except that the version the transaction generated was only visible to itself. Another transaction couldn't read anything that was inside the private workspace; it was local to the transaction. Multi-version concurrency control is sort of like that idea, but instead of having a private workspace where we maintain these different versions, the versions are now going to be part of the global database, and we're going to figure out how to determine whether some version is visible to a particular transaction.

So what's going to happen is that every single time a transaction updates an object, it's going to make a new version. The way to think about this is that we're going to have a distinction between logical versions and physical versions. There'll be a logical tuple that really has only one logical version: if I want to read the tuple with key A, I only see that one tuple. But underneath the covers, our database system is going to maintain multiple physical versions for that single logical object. So when a transaction writes to an object, updates it, we create a new physical version, and when we read an object, we have to figure out which physical version is the visible representation of that logical object for my transaction. Your transaction can be running, my transaction can be running, and what is visible to me versus you can be completely different. That's the high-level idea of what we're doing here today.

The idea of multi-version concurrency control, or MVCC as it's normally called for short, goes back to the 1970s. The first reference to the idea was actually in a dissertation by a PhD student at MIT in 1978. Then there's a bunch of papers from Phil Bernstein, who I mentioned last class, that come around in the early 1980s, and they reference this dissertation as the first example of multi-version concurrency control. But the first real implementation of it wasn't until a few years after that, 1983, 1984, at a company called DEC. You may not have heard of it. DEC was a large computer company in the 1980s. They got bought by Compaq, and then Compaq got bought by HP. So it's no longer around anymore, but they did a lot of early pioneering database work.
So the first implementation of MVCC in a database system was a product out of DEC called Rdb/VMS, the relational database for VMS. VMS was an old operating system; if you've ever heard of VAX, same thing. Then there was another system implemented at DEC a few years after that called InterBase. Both of these projects, Rdb/VMS and InterBase, were built by this guy Jim Starkey. Jim Starkey was one of the early database pioneers. He claims he invented blobs. He claims he invented triggers. I don't know whether it's true or not, but that's what Wikipedia says. So he implemented both of these versions of MVCC at DEC, in Rdb/VMS and InterBase. And later on he went off to found NuoDB, a newer database startup in Boston, which also uses MVCC. DEC Rdb got sold to Oracle, and now it's called Oracle Rdb. This is where things get confusing. Now you start to see why I wanted to write the database of databases, the encyclopedia, because it's hard to keep track of all these things. There's the Oracle database, there's the Oracle relational database, and then there's the Oracle Rdb database, and Oracle Rdb came from DEC Rdb/VMS. And then InterBase was later sold off by DEC, went through a couple of different holding companies, and was eventually open-sourced, I think in the late 1990s. The open-source version is now called Firebird, but InterBase still exists; it's now a mobile embedded database, and as far as I know it's still the same code from the 1980s. Firebird is not as well known as MySQL and Postgres, but it's one of the earliest open-source database systems out there. So if you ever wondered why Firefox, the web browser, is called Firefox: it was originally called Phoenix, they had to change the name because it conflicted with another company, then they wanted to call it Firebird, but they couldn't because that conflicted with this database, so they had to change the name to Firefox. Again, these are good examples of the earliest implementations of MVCC.

So the basic concept you need to understand about MVCC, the way it's going to work and the benefit you're going to get from it, is that writers are not going to block readers, and readers are not going to block writers. It's only when you have two transactions trying to write to the same object at the same time that we have to fall back and rely on two-phase locking or OCC or the other concurrency control protocols that we talked about before. The high-level concept of how this works is that we're going to assign timestamps to transactions in the system, and we're going to provide each transaction with a consistent snapshot of the database as it existed at the moment it arrived. What I mean by this is that a transaction is not going to see any changes from transactions that had not committed yet at the moment the snapshot was created. And this is a virtual snapshot. It's not like we're copying the database and putting it in another location that your transaction runs on. We're going to use these timestamps to figure out what is actually visible to you, generating this virtual snapshot as you go along, execute queries, and try to read data. So if you're a read-only transaction, this is actually an awesome thing, because now I can show up and declare to the database system that my transaction is read-only, which we saw briefly at the end of last class.
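As an aside, the declaration itself looks like this; `START TRANSACTION READ ONLY` is real syntax in both Postgres and MySQL, though the table here is just a hypothetical example:

```sql
-- Tell the system up front that this transaction will not write anything.
START TRANSACTION READ ONLY;
SELECT * FROM accounts WHERE id = 1;   -- 'accounts' is a made-up table name
COMMIT;
```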
And now the database system won't require you to acquire any locks or maintain any read-write sets for your transaction, because you're going to have a consistent snapshot. You're only going to see things that existed at the moment you started. That's super efficient; that's really fast. This is the example I was alluding to before: just because some systems let you declare that your transaction is read-only doesn't mean the system is going to optimize your transaction and make it run faster. The way you actually optimize it is to not maintain any read-write sets, which you can do if you have a consistent snapshot.

The other advantage you get from MVCC is that you can support what are called time travel queries. The way to think about this is that you can tell the database system: run my query, but not on the current state of the database; run it on the state of the database as it existed three hours ago, or three days ago, or three years ago. You can use these timestamps embedded inside the tuples to go back and say, what was the timestamp range that would have been visible to me if I ran three weeks ago? And now I know to skip the latest versions and only read the version that existed at that point in time. This idea was actually one of the original innovations of Postgres in the 1980s, to support time travel queries. It doesn't exist in Postgres now; they took it out in the late 1990s when Postgres came out of academia and started being used outside of Berkeley. The reason is that, essentially, what they were doing was just turning off garbage collection on the older versions. You accumulate all these old versions as your transactions update the database, and of course you're going to run out of disk space very quickly if you're making a lot of changes. So Postgres now has this thing called the vacuum, and that's essentially doing the garbage collection. I say this every year, and every year when I go talk to companies I always ask them who actually needs time travel queries. The only application domain I've seen where people tell me, yes, we care about time travel queries, is financial companies, because they have to do this for regulatory compliance. They have to maintain the last seven years of history of all their financial transactions. So they want to be able to ask: what is my risk assessment, what is the total amount of money I had a year ago, two years ago, three years ago, things like that. But think about it: for every website you access, you don't really say, show me the state of the website as it existed three years ago.

So let's look at an example of how MVCC is going to work. The first thing to point out is that in our table we're now going to maintain this version field in the header; just assume this is going to be the unique key for our objects, our tuples. So this is saying there's an object with key A, and the subscript I'm using says it's at version zero. Then we're also now going to have begin and end fields. These are going to be timestamps. Whether they're logical timestamps or physical timestamps or hybrid timestamps doesn't matter; they have to provide the same guarantee that we talked about last time with timestamp ordering: they always have to move forward in time. We can't roll back.
These two fields are going to determine when this particular version of the tuple is visible. So now, when my transaction shows up, I'm assigned a timestamp, and I can use my timestamp, because it's in the same domain as these begin and end timestamps, to figure out whether a particular tuple version is visible to me or not. The thing to point out here is that I'm not even talking about two-phase locking or OCC or any of that yet. This is just how to figure out what versions are visible to me. All the same things we talked about in the last two classes are still applicable here when we start to have write-write conflicts.

So my transaction T1 starts, and it has timestamp one. I want to read object A. Ignore how I got to this particular tuple; we can talk about indexes later, but for our purposes we don't care. We just know that we want to do a scan on our table and find the version of A that is visible to us. Obviously, we only have one version at this point; assume there was some transaction with timestamp zero that loaded it in. So we look at the begin timestamp and the end timestamp to figure out whether this version is visible to us at timestamp one. The begin is zero, meaning a transaction with timestamp zero created it, but the end is null, and that means this is the latest version of this tuple: there is no other version that comes after it in logical time. One is between zero and infinity, so we're allowed to read this version, because it's visible to us. We go ahead and do that.

Now we start transaction T2, and it wants to do a write on A. The way this works is that we're going to create a new version A1, because we're always incrementing the version counter, and we're going to set the end timestamp of the previous version to be our timestamp. So we create the new version A1 with the updated value; its begin timestamp is our timestamp, two. Then we go back up and set the end timestamp of A0 to two, because this now says that version A0 of A is only visible between timestamp zero and two. Anybody that comes after this, say at timestamp three, would skip that version and look at this one, because this one is valid from two to infinity.

So what's one obvious problem with this? We know that we have enough information in our timestamps to tell whether a version is visible or not. If I'm at timestamp three, I can look and say, all right, I can't read this one because three is not between zero and two, but I can read this other one. Yes? No? What's that? He says you have too many copies. We're not there yet; we'll get to that later. The problem is that we don't have any information about what the hell is going on with the transaction that made this change. What if that thing aborts? I have to go back and reverse the change to make sure that nobody thinks this version is actually visible. One way to do this is, on abort, to go back and undo all the end timestamps, reverting them to what they were before. But if I updated a billion tuples, that's going to be a lot of writes to do on abort.
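Before we deal with that problem, here's the visibility rule made concrete. This is a minimal sketch that models the version metadata as an ordinary table, purely for illustration; the names (`versions`, `begin_ts`, `end_ts`) are made up, and in a real system these fields live in the tuple headers, not in a user-visible table:

```sql
-- Hypothetical model of physical versions, for illustration only.
CREATE TABLE versions (
    key      TEXT,     -- logical key, e.g. 'A'
    version  INT,      -- A0, A1, ... in the lecture's notation
    value    INT,
    begin_ts BIGINT,   -- timestamp of the transaction that created this version
    end_ts   BIGINT    -- NULL means "infinity": this is the latest version
);

-- The version of A visible to a transaction with timestamp 1:
-- visible iff begin_ts <= my timestamp < end_ts (treating NULL as infinity).
SELECT value
FROM versions
WHERE key = 'A'
  AND begin_ts <= 1
  AND (end_ts IS NULL OR 1 < end_ts);
```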
So one way to get around the abort problem is to maintain an additional table called the transaction status table, where you just say: if you want to find out what's going on with the transaction that created this version, the one with timestamp two, here's its current status. So if anybody comes along and wants to read this version, they can check the table and see, well, that transaction is still active. As far as I know, at this point in time, this version is not visible to me, and I should go to the next one. Yes? Why not just change the database after the transaction finishes? So the question is, why don't I change the end timestamp only when the transaction commits? At this point, if we commit, this information is correct: the old version is valid between zero and two. The problem is abort. If I abort, then I have to go back and ask, what are all the tuples that transaction T2 modified, and go flip their timestamps back. And his point is that if you only update the table when the transaction commits, there's some inconsistency in the database: you may see this version of these tuples, but the older version of another tuple. Yes? He says part of the transaction's work would be visible and another part wouldn't. Right. That's it.

All right. So his other statement is that, for this particular transaction, I made a change in the global database, and anybody that comes along can now read it, right? Well, it depends on the isolation level. I'm ignoring all that, and I'm ignoring what concurrency control protocol you're actually using. For now, that's fine. So this guy over here: does this mean we're doing two-phase locking? Again, readers don't block writers and writers don't block readers under MVCC. This guy over here is not taking a shared lock on the tuple. He just says, all right, I can read any tuple that existed at timestamp one or before. So I can read A, that's fine. Later on, I'll read A again; that's fine. I don't need to acquire a shared lock for this. This other guy over here does the write on A, and depending on what isolation level you're running at and what concurrency control protocol you're using, you could take an exclusive lock on this to prevent anybody else from writing it. But we're ignoring that; we're just trying to understand how we figure out what's visible to you or not. Okay?

All right. So again, at this point, T2 updated A, so we created a new version A1 and updated the old version's end timestamp. Then we flip back over here, and this is the point I was making to him: now I'm going to read A again, and my timestamp is still one. If I look at the newer version, I can't read it, because one is less than two. So the old version is the one I want to read, and I'm now guaranteed a repeatable read: I'm reading the same thing I read before. So I go ahead and commit, everyone's fine, and we blow away the entry in the transaction status table. Pretty straightforward, right?

So let's look at an example now where there are conflicts. T1 is going to read A, write A, then read A. T2 is going to read A and write A. Again, we're starting back where we were before, with only a single version A0, valid from zero to infinity. T1 reads A; that's fine. Then it writes A.
We're going to create a new version A1, set its begin timestamp to one, set its end timestamp to infinity, then go back and update the end timestamp of the previous version and set that to one. Now we start this other transaction over here. The first thing we have to do is add an entry into our transaction status table to say there's an active transaction with timestamp two. And then we want to read A. For this one, depending on what protocol we're using, we could either read A0 or read A1. Assume we're running with serializable isolation, so we want to see a consistent snapshot of the database. In that case, we're going to read A at version A0, because when we look at the status table, T1 is still considered active. It hasn't actually committed yet, so we don't want to read its version, because that version didn't exist before we started; the transaction is still in flight. So we read A0, but now we want to write A. For this one, we're going to have to stall, because now we have a write-write conflict. Assuming we're doing two-phase locking, we'd try to acquire the exclusive lock on A. We can't do that, and so, assuming we're running deadlock detection, we stall and wait to find out what happens with T1. Then we switch back over to T1: it reads A, and it reads the version it wrote earlier, so that's fine. Then it goes ahead and commits, and we update its entry in the transaction status table to say it committed. So now when we switch back over to this guy, we can create our new version. But again, this depends on what concurrency control protocol we want to use. If you're running at read committed or something weaker, then we're allowed to do this. If we're running at serializable isolation, then we shouldn't be allowed to do this. The point I want to make here is that the multi-versioning aspect is independent of the concurrency control protocols we talked about before.

Yes? Isn't this called branching, in the database? So he asks whether this would be called branching in the database. No, because in this particular example, we're just going to overwrite whatever version A1 is; we're now the newer version. Don't think of this like Git, where you have branches. It's only a linear history. So if you're running serializable isolation? Serializable isolation: it should be like T1 before T2, or T2 before T1? Yes. But this seems like T1 executed on A0, and T2 also executed on A0. Right, so this is not running serializable isolation. Why do you say serializable? Because in T2 you read A from A0, but not A1. Right. So it's not serializable; it's snapshot isolation. I'm running this with a consistent snapshot of the database. At this point here, I read A, and the snapshot says I should only see things that were committed at the moment my transaction started. At this point in time, there is a new physical version A1, but it was not committed before we started. So we look at it and say, well, T2's timestamp is two, so by the timestamps this is visible to us; but then we look in our transaction status table and see, oh, T1 hasn't committed yet, so this version is not visible to us. All right, so when T2 commits, you will see there are A1 and A2 in the database, and A2 is what T2 wrote, without T1.
If you examine the content of A2, you'll see it's kind of like going from A0 to A2, with nothing to do with T1. You're applying Git concepts to this, and it's not the same thing. There's no branching; it's a linear history. Physically, it's a linked list of all these versions. Logically, in this case, if you're not running serializable isolation, then yes, in theory you're sort of jumping over a version. But it's not like there's a diff between them; I'm getting ahead of myself, but it's not like there's a branch history where things got merged together. It's not like that at all. We're just saying, here's the new version. This version doesn't necessarily derive from the previous one. Right, but you don't know that; maybe it came from the original version. Right, but all you know is what the version before you was. You don't know the full lineage; you're not maintaining that history inside each commit. You're also sort of applying a blockchain idea to this, and it's not that way at all. Okay. Is it only because there's a partial ordering? He asks whether it's only because there's a partial ordering. I wouldn't use that term either. Let's go forward; when I show you how the versions are actually maintained with pointers, this might make more sense. Sure. Okay. Any other questions?

Okay. Everyone's favorite: let's do a demo. So this is Postgres. I couldn't figure out how to get transaction IDs in MySQL very easily, so we're just going to do Postgres. We're going to create the same table we had last time, this transaction demo table. It just has two tuples: id 1 with value 100, and id 2 with value 200. So we're going to start a transaction at the top terminal. Sorry, let me reload the table again first, because I probably just blew that away. Okay. So we start the transaction at the top, and we're just going to read a single tuple. Remember I said before that the database system is going to maintain internal fields that keep track of the timestamps for when a physical version of a logical tuple is visible to your transaction. In Postgres, they call these fields xmin and xmax. Let me make this bigger. Sorry. All right. So I just read my tuple, and here we see these fields xmin and xmax. This number here corresponds to the transaction ID of the transaction I used to insert this data into the database, and xmax is set to zero, which is sort of like infinity in our last example: it says this is the latest version. I can see the current timestamp for my transaction by executing the function txid_current(). The transaction ID for my current transaction is 918, and the tuple's xmin is 917, because the last transaction I ran was the one that inserted this data. So now if I update this tuple and go back and read it again, you see that I've updated the value, but now the xmin is the timestamp of my transaction, 918. Right?
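Here's a rough reconstruction of what I'm typing, if you want to try it yourself. The table name is whatever you created (I'm calling it `txn_demo` here, and the new value is made up), but `xmin`, `xmax`, `ctid`, and `txid_current()` are real Postgres features:

```sql
-- Hidden system columns expose the MVCC metadata of each physical version.
SELECT xmin, xmax, ctid, * FROM txn_demo;

-- The transaction ID assigned to the current transaction.
SELECT txid_current();

-- Creating a new physical version; its xmin becomes our transaction ID.
UPDATE txn_demo SET value = 101 WHERE id = 1;
SELECT xmin, xmax, ctid, * FROM txn_demo WHERE id = 1;
```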
And we know that the way Postgres maintains these versions is by actually making new physical tuples inside the database. We can see this by running the same query but now including the ctid. Remember from when we talked about buffer pools and storage: in Postgres, this is what they use to identify the page ID and the offset. So again, here's my transaction, here's the begin timestamp 918, here's the xmax of zero, and this is saying that this particular version we're looking at is at page zero, offset three. And now if I go down below to the other terminal and run the same query, what you see is that the xmin is what we had before, 917, because that was the transaction that inserted the entry. The xmax is 918, because that's my transaction at the top, which set it as the end timestamp, the boundary. And the ctid is page zero, offset one. So this proves that this is a physically distinct tuple inside the database from the one I have up here. Is this clear? And Postgres will do this every single time you update a tuple: even if you update the same tuple within one transaction, it always creates a new version. So if I go update this thing again, it doesn't overwrite the version I just created; it actually makes a new one. Again, offset three becomes offset four, but the min/max timestamp ranges are the same.

So now I'll start a transaction down here in the terminal on the bottom, running at the read committed isolation level. If I want to go read that same tuple, I see I get the original version, the one that starts at 917, because the guy at the top has not committed yet. We can check the timestamp for the guy on the bottom too: it would be 919. Again, these timestamps always have to be increasing, so Postgres just uses a logical counter, adds one to it, and assigns it to the current transaction. So this proves that I can read an older version of a tuple even though a newer version exists, because the guy at the top has not committed yet. So now if I go back up to the top and commit, then come down here and read again, I get the version the top guy just created. Again, this is what the isolation level stuff is telling us: even though we're maintaining these timestamps to figure out what's actually visible to us, we're running at read committed, so we're allowed to read things as soon as they're committed.

Now let's do the same demo, but instead of running the transactions at read committed, we'll use serializable isolation. So let's reset the database, select xmin, xmax, and ctid to see the current state of the database, and run my transactions at serializable isolation, same thing on the bottom. Let's see what our timestamps are: this guy has timestamp 924, this guy has 925, as expected. The top guy will do the update, and when we do a select we see we successfully created the new version. Down below, we do a select on that same thing, and we see the older version, with value 100. Again, the ctid is different. But now the bottom guy is going to try to update the same tuple. What should happen here? It's going to stall, right? Because we're running at the serializable isolation level.
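Sketched out, the two sessions look something like this. Again, `txn_demo` and the values are my placeholders, but `BEGIN ISOLATION LEVEL SERIALIZABLE` is real Postgres syntax, and the error text at the end is what Postgres actually reports when a serializable (or repeatable read) transaction loses a write-write conflict:

```sql
-- Terminal 1:
BEGIN ISOLATION LEVEL SERIALIZABLE;
UPDATE txn_demo SET value = 101 WHERE id = 1;

-- Terminal 2:
BEGIN ISOLATION LEVEL SERIALIZABLE;
UPDATE txn_demo SET value = 102 WHERE id = 1;
-- ...blocks, waiting on terminal 1's exclusive lock...

-- Terminal 1:
COMMIT;

-- Terminal 2 now fails with:
-- ERROR: could not serialize access due to concurrent update
```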
So to do the update, we have to acquire the exclusive lock, and that lock is being held by the transaction at the top. Now say I commit the guy at the top. What should happen to the guy at the bottom? What's that? He says it should start up again: after the top guy commits, the query should finish. But we're running at serializable isolation. Should we be allowed to do that? What's that? He says it should abort. Yes. Right? So even though there are multiple versions, and in theory we could create multiple versions, this is what he was asking about on the last slide. If you're running at serializable isolation, you're not allowed to do this. You would not be allowed to create a new version that overwrites what a transaction that ran at the same time as you wrote.

All right, one more demo. So let's roll everything back. Here is our table; the xmin timestamp is 929. Now we'll run the top guy at serializable isolation; this guy is still in a transaction. The bottom guy will run at read uncommitted. No, it's about to run. There you go. All right, the bottom guy is running read uncommitted, the top guy is running serializable. Everyone's going to read the table. So read at the top; looks like that. Down below, read the same thing. Everyone's seeing the same version. Now the top guy will update tuple 1, and the bottom guy is going to update tuple 2. Should this conflict? No, right? They're different objects, different tuples, so that's fine. What should the top guy see? It's going to see its own write for tuple 1. What version will it see for tuple 2? The original version, right? So here we see the old version of tuple 2, but we see that its end timestamp has been set by the transaction at the bottom: 931 here, 931 there. Again, this is all the physical metadata for the tuple, but logically we're still seeing tuple 2 with value 200. And we see tuple 1 with value 101, because that's the tuple we updated, and our timestamp is 930. So let's do the same select at the bottom. I was going to ask you what was going to happen, but the output sort of spoiled it. We see tuple 2 with value 201, because that's what we updated. But what are we seeing for tuple 1? 100. Should that happen? We're running at read uncommitted. So what does that mean? What's that? Exactly, yes. As he said, it's running at a higher isolation level than what we asked for. Even though I told Postgres I want read uncommitted, it can't actually do that, because of these timestamps. It would actually have to do more work: it would have to say, oh, you're read uncommitted and this other transaction is still active, so let me go figure out where that guy's uncommitted version is and let you read that one. So even though we told Postgres we want read uncommitted, it didn't actually give it to us, because it's using these timestamps to figure out what's actually visible.

We can try the same thing in MySQL. I don't think I'm actually connected, though. Okay. So in MySQL, you don't say begin transaction with an isolation level; you have to say SET SESSION TRANSACTION ISOLATION LEVEL. So we'll set this guy to be serializable, and we'll set the bottom one to be read uncommitted.
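For reference, the session setup in MySQL looks like this; unlike Postgres's `BEGIN ISOLATION LEVEL ...`, the level is set on the session before the transaction starts (both statements are real MySQL syntax):

```sql
-- Terminal 1:
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
START TRANSACTION;

-- Terminal 2:
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
START TRANSACTION;
```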
So in MySQL, I don't know how to get at the timestamp information, so we're just going to look at the logical tuples. We do a select at the top. All right, let's actually get everything, sorry. Get everything there, get everything there; so that's the same. All right, so the top guy is going to update tuple 1 and the bottom guy is going to update tuple 2. What should the top guy see? Say it again. He says 101, 200. It's 101, 201. What does the bottom guy see? 101, 201. But we did set the isolation to serializable, right? Yeah. Sorry, yes. That was really stupid of me: we're not actually running inside transactions. All right, let's go back. Set the isolation level; make sure we do that. It's all SQL. It's, you know, a universal language. Make sure we run serializable and read uncommitted, call begin, do a select here, do a select there. At the beginning, everything is the same. Then let's do an update on tuple 1 at the top, and this guy at the bottom is going to update tuple 2. It stalled. The guy at the top has a shared lock on the tuple, right? So the bottom guy tried to get the exclusive lock to update it, and it wasn't allowed to. Because again, even though we're running read uncommitted, we still have to acquire exclusive locks to update things. We can't just blindly write whatever we want; that can mess things up. Yes? Say it again. He says that at the point where we did the select, we hold shared locks on everything we read. That's right.

If I do this again: roll this guy back, roll this guy back. Start the transaction at the top, and I'm just going to do the update. Come down here, start my transaction, and if I do a select, this would normally block, but it doesn't, because it's read uncommitted. If I now try to update tuple 2, should this block? No, right? So now when I go back and read, I see 101, 201. What will the guy at the top see? He says 101, 200. Right. So again, the bottom guy is running read uncommitted and the top guy is running serializable, so the top guy can't get the shared lock for the tuple where the guy below holds the exclusive lock.

So again, the main takeaway I want you to get from this is that this is more a demo of the isolation levels than of the different versions, since we can't inspect the actual physical version IDs in MySQL. The main thing I want you to get out of it is that when you set your isolation level, it's for your transaction. It says what you're allowed to be exposed to. It doesn't affect what other people can or cannot see. Yes? He asks: if the second transaction is running read uncommitted, why should it need to acquire a lock at all? Because it's two-phase locking. It's strict two-phase locking, so you're holding locks until the end. And his follow-up, which I agree with, is: if you're allowed to see uncommitted changes anyway, who cares? Acquire the exclusive lock on the thing you need, make your change, and then immediately release the lock. Some systems actually do that. I don't know why they're not doing it here. Again, this is another good example: the textbook says this is how you do it, and that's what the high-level concepts mean.
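Before moving on, here's that second run sketched end to end. `txn_demo` is still my placeholder table name; the comments describe standard InnoDB behavior at these two isolation levels:

```sql
-- Terminal 1 (SERIALIZABLE):
START TRANSACTION;
UPDATE txn_demo SET value = 101 WHERE id = 1;  -- takes the exclusive lock on id = 1

-- Terminal 2 (READ UNCOMMITTED):
START TRANSACTION;
SELECT * FROM txn_demo;                        -- no shared locks needed: returns the
                                               -- uncommitted value 101 (a dirty read)
UPDATE txn_demo SET value = 201 WHERE id = 2;  -- different tuple, so no conflict

-- Terminal 1:
SELECT * FROM txn_demo;                        -- stalls: serializable needs a shared
                                               -- lock on id = 2, held exclusively above
```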
In practice, people do things completely differently. I don't know enough about MySQL internals to understand what they're actually doing here. Okay. So again, the main takeaway, and the Postgres example is what I really wanted you guys to focus on, is this idea that there are these min/max timestamps, there's a transaction ID drawn from the same value domain as those timestamps, and you use them to figure out what's visible to you or not. Okay? So let's jump back into this.

Okay. So the thing I want to impress upon you, and what the rest of the lecture will be about, is that MVCC is more than just these timestamps. There are all these other design decisions we have to make in order to actually implement a system that supports MVCC. And just to show you how popular MVCC is: as far as I know, every single database system created in the last ten years, except for a small number of them, is using some variant of MVCC. This is just a small smattering of the systems I'm aware of, based on the database of databases.

So the other things we have to worry about when we actually build a system that supports MVCC are the following. One is what concurrency control protocol we're using: again, two-phase locking, OCC, things like that. Then we have to worry about how we're going to store these different versions, how we're going to clean up old versions when they're no longer visible, and how we're going to have our indexes point to the correct version. Different systems do different things, and there are trade-offs for all of them.

As I've said multiple times, all the stuff from the last two classes is applicable here. When we actually do a write, we have to rely on either two-phase locking or OCC or basic timestamp ordering to figure out who should be allowed to write which object, and what isolation level we're running at. So there's nothing really new to say about this part. We're maintaining different versions now, but all the same concepts still apply: deadlock detection, deadlock prevention, OCC private workspaces, all of that is still applicable here. So we can skip over this, because we've already covered it.

So now, for version storage: for a particular tuple, we need a way to find the version that should actually be visible to us. I'm ignoring how the indexes point to things or how we find this data; assume we're doing a sequential scan on the entire table, and we look and say, all right, we want this particular object; where do we find the version that we want? The way we're going to implement this is to maintain an internal pointer field that allows us to say, here's how to find the next or previous version of this particular logical tuple. So you can start to think of it as forming a linked list that we can jump into and scan across: here are the different versions I have, sorted by their timestamps. The index is always going to point to what is called the head of the chain, and whether the head is the oldest version or the newest version of that tuple depends on the implementation.
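Here's a minimal sketch of that layout, again modeled as an ordinary table purely for illustration. Real systems keep these fields in the tuple header and use raw page/offset pointers; the `slot` and `next_slot` columns here are stand-ins for those physical addresses:

```sql
-- Hypothetical model of a version chain: each physical version points to the
-- next version of the same logical tuple, forming a linked list.
CREATE TABLE version_chain (
    slot      INT PRIMARY KEY,  -- stands in for a physical page/offset address
    key       TEXT,             -- logical key, e.g. 'A'
    value     INT,
    begin_ts  BIGINT,
    end_ts    BIGINT,           -- NULL = infinity (this is the latest version)
    next_slot INT               -- pointer to the next version in the chain
);
```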
There are different schemes we can use for where we actually store these different versions. The first approach, which is the simplest, is called the append-only approach, and it basically means that every single time we create a new version, we copy the old version into a new physical tuple in our table space, update it, and then update our pointer to say here's the next version. The next approach is called time-travel storage, and this is where you have one table, sort of the master version table, that always has the latest version of the tuple, and you copy older versions out into a separate table called the time-travel table, maintaining the pointers from the master version into the time-travel table. The last one, which in my opinion is the best approach, is delta storage. Think of this like diffs in Git or in a version control system: instead of copying the whole old version every single time and updating it, I just maintain a small delta that says here's what changed from the previous version.

So let's go through each of these one by one. All right, so again, append-only storage is the simplest way to do this. Every time I create a new physical version, it's just a new tuple in my table. So let's say we have a transaction here that wants to update object A; somehow it followed the version chain to find the version it wants to update. To create the new version, I just find an empty slot in my table space, copy the values out of the old version into the new one, update it, and then update the pointer from the old version to point to my new version, and now this version is installed. Postgres does this; Postgres is probably the most famous system that does.
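Sketching the append-only update path against the hypothetical `version_chain` table from above (a transaction with timestamp 5 updating key 'A'; the slot numbers and values are made up):

```sql
-- (Assume slot 3 currently holds A's latest version: begin_ts = 0, end_ts = NULL.)

-- 1. Install the new version in a free slot.
INSERT INTO version_chain (slot, key, value, begin_ts, end_ts, next_slot)
VALUES (7, 'A', 456, 5, NULL, NULL);

-- 2. Close off the old head: set its end timestamp and its chain pointer.
UPDATE version_chain
SET end_ts = 5, next_slot = 7
WHERE key = 'A' AND end_ts IS NULL AND slot <> 7;
```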
Another aspect we now have to consider is in what order we want to store these versions. In my example, assume the head of the version chain is the oldest version. If I follow the index to find key A, the index will point me to this version here, and if I want the latest version, I have to follow the version chain all the way down. That's what I did in my example, but you could also do it newest to oldest, and there are performance implications and trade-offs for these different approaches. Oldest-to-newest means that any time I create a new version, all I have to do is find the end of the version chain, append my new tuple, and update the last guy. That's really easy to do. If you go newest to oldest, then you add the new entry, and its pointer now points to the old head, but then you have to go update all the indexes to point to your new version. Again, if I do my lookup on A in this example, I land at the oldest version, and if I want the newest version I have to follow the version chain. If I reverse this, so that A2 is the head of the chain, then any index that was pointing to A1 before now needs to be updated to point to A2. Oh, it's missing the bullet, sorry; I shouldn't be updating slides on the fly, this is embarrassing. Anyway: there's a great blog article from Uber that shows why this actually matters a lot. It basically describes how they switched from MySQL to Postgres and back to MySQL, because even though MySQL and Postgres both do multi-versioning, one was doing oldest-to-newest and one was doing newest-to-oldest. Postgres does newest-to-oldest, so every single time they updated a version, they had to go update all the indexes, and that was expensive, because they had a lot of secondary indexes.

All right, so the next way to store versions is the time-travel table. The way this works is that we have our main table, which always has the latest version of the tuple, and then a separate table space called the time-travel table, into which we copy the old versions as they get modified. Say our transaction wants to update A2 here: we copy A2 into a free slot in the time-travel table, then we update the version pointer to say, if you want to go backwards, here's the old version; then we overwrite the master version in the main table with our new value, and update the pointer to point to the version we copied over. The reason you might want to do this is that it makes garbage collection somewhat easy, because there's only one place to look: the time-travel table. You can also physically store these two tables in different formats; you could have one be a column store and one be a row store, depending on your environment.
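You can approximate the time-travel-table idea at the application level with a trigger that archives the outgoing master version on every update. This is just a sketch of the concept with hypothetical table names, not how a system implements it internally; the trigger syntax is Postgres (`EXECUTE FUNCTION` requires version 11 or later):

```sql
-- Hypothetical main table, plus a time-travel table with the same schema.
CREATE TABLE accounts (id INT PRIMARY KEY, value INT);
CREATE TABLE accounts_history (LIKE accounts);

-- Copy the outgoing master version into the time-travel table, then let the
-- update overwrite the master version in place.
CREATE FUNCTION archive_old_version() RETURNS trigger AS $$
BEGIN
    INSERT INTO accounts_history SELECT OLD.*;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER accounts_time_travel
    BEFORE UPDATE ON accounts
    FOR EACH ROW EXECUTE FUNCTION archive_old_version();
```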
The last approach, which I think is the best way to do this, is delta storage. (Postgres actually uses append-only storage, though I've seen blog articles from some of their main developers saying they want to switch to this approach; it's what MySQL and Oracle actually do.) What happens is that every time you do an update, you copy only the values that were modified into a delta storage segment; in MySQL they call this the rollback segment. So if I update A, and I'm updating the value field, I make a copy of the old value for that field in my delta storage, and then update the pointer to say, if you want to go back in time for A, here's where to go. Same thing if I update it again: I maintain pointers to all these different versions. So if you need to read an old version, you're essentially replaying the deltas to put the tuple back into the form it had at your timestamp.

This is another good place to see the trade-offs between reads and writes. Reading old versions in the append-only approach is super easy, because I just find the version I want and, boom, everything's there. With delta storage, writes are going to be much faster, because I don't have to copy the whole tuple when I change a subset of its attributes, but now to read an old version I actually have to replay the deltas to reconstruct the correct form. So Postgres will be faster for reads; MySQL will be faster for writes. Delta storage also makes the overall size of the database much smaller: if I have a thousand fields and I only update one of them in every transaction, my delta storage is going to be tiny, whereas in Postgres, with the append-only approach, I'm copying all one thousand fields every single time, even though I updated only one of them.

All right, the third thing we have to worry about is garbage collection. Again, all these old versions are accumulating; our transactions are finishing, and at some point we know that a particular version cannot be visible to any active transaction: there is no active transaction with a timestamp that fits in the ranges of these old guys. So we want to go ahead and blow them away to reclaim the space. It's more than just saying, I'm going to sweep everything and try to figure it out: we have to worry about how we're going to look for these expired versions, and we have to figure out when it's actually safe to reclaim them. Those two things we're going to punt on for the most part; we cover them in the advanced class, if you take that. I just want to describe, at a high level, the mechanisms you can use to implement garbage collection. The two ways to do this are either tuple-level or transaction-level. Tuple-level basically means we do sequential scans on our tables and look at the timestamps to figure out, given the active transactions I have, is this version actually visible to any of them? If not, I go ahead and prune it. The reason this is complicated, or expensive, is that we don't just have to look at the pages in memory; we have to look at the pages that may have been swapped out to disk too, because we want to vacuum everything. The two approaches here are vacuuming and cooperative cleaning, which I'll show in the next slide. The second mechanism, transaction-level, which we won't really discuss, is to have the transactions maintain their read-write sets, so that when they commit, you know whether anybody could still see the old versions they replaced, and once you know a version is no longer visible, you go blow it away.
So you can rely on something almost like the private workspace you generated under OCC; you can think of that as the read-write set of the transaction, because it is. You just keep it around after the transaction commits, and at some later point you go clean up anything it replaced. But I want to talk about tuple-level garbage collection, because that's the most common.

Say I have two threads running in my system, with two active transactions T1 and T2; T1 has timestamp 12 and T2 has timestamp 25. The first approach is background vacuuming. Think of this as one or more threads in the background that get kicked off every so often, do sequential scans on every single table, look at what transactions are active, and figure out whether each version is still visible. So the background thread goes to the transaction threads and asks, what are your current transaction IDs? 12 and 25. Then it does a sequential scan over this table and figures out, based on those two numbers, whether these tuples could actually be visible to them. A version valid between one and nine is not visible to 12 or 25, so we know we can reclaim it and throw it away, whereas this last one is actually still visible. And again, this is just a sequential scan; there's no fancy way to do this. That means we may start evicting things from the buffer pool, because we have to bring pages in for vacuuming; that's unavoidable. So this is a good example of where you could have a separate buffer pool just for the vacuum thread, so it doesn't interfere with the regular worker threads.

One obvious problem with this is that there may be a portion of a table that hasn't been modified since the last time the vacuum thread ran, and you don't want to have to go out to disk to figure that out. A really simple optimization is to maintain what's called a dirty bitmap: you have one bit per page in the database, and any time you modify a page, you flip its bit to one, so that when the vacuum comes back around, it just scans the bitmap to figure out which pages it has to bring in and vacuum. This is typically run as a cron job that fires periodically; you can also run it manually in Postgres by calling VACUUM from the terminal, or you can set a threshold that says if, say, 20% of my table has been updated, run the vacuum and clean things up. No one way is better than another.
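The core check the vacuum performs can be sketched against our hypothetical `version_chain` table. This is a deliberate simplification of what a real vacuum does, using a made-up `txn_status` table like the status table from earlier in the lecture: a version is reclaimable if its lifetime ended at or before the timestamp of the oldest active transaction, since every future transaction gets a larger timestamp.

```sql
-- Hypothetical status table from earlier in the lecture.
CREATE TABLE txn_status (txn_ts BIGINT PRIMARY KEY, status TEXT);

-- Reclaim any version whose lifetime ended before the oldest active
-- transaction began; no active or future transaction can see it.
DELETE FROM version_chain
WHERE end_ts IS NOT NULL
  AND end_ts <= (SELECT MIN(txn_ts) FROM txn_status WHERE status = 'ACTIVE');
```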
The other approach is cooperative cleaning, and this is basically where the worker threads, as they're executing queries and come across old versions that they know are not visible to anybody else, clean them up as they go along. This only works with oldest-to-newest ordering for the version chain, because if it's newest-to-oldest and I'm always looking for the newest version, I'm never going to get to the end of the chain, so nobody will ever see the old versions, and I'd have to fall back on the background vacuuming approach. So let's say I have an index, and this transaction does a lookup on object A; it lands at the head of the version chain, which is the oldest version, and then scans along, trying to figure out which version is actually visible to it. If I recognize that the versions I'm looking at are not visible to anybody, then I want to go ahead and mark them as deleted so I can reclaim the space. Are we done here? What's that? Yes, exactly: unless I go back and update the index, because the version chain has now shrunk, whatever A is pointing to is going to point to nothing. So you have to get the ordering right: you update the index to point to the live version first, so anybody will skip around the dead ones, and then you can go ahead and reclaim the space that came before it. So no one way is better than another. I don't know offhand what MySQL or Oracle does, but Postgres does the vacuuming, and most of the other systems, I think, do cooperative cleaning; I have a slide at the end showing what people actually do. And again, we already talked about transaction-level garbage collection: you maintain the read-write set of each transaction and use that to figure out what's no longer visible.

So the last thing to talk about is indexes. As I said before, the primary key index is always going to point to the head of the version chain, so any time we create a new version and update the version chain, we may have to update the index. This gets tricky when you want to update the primary key itself, because now you could end up with two version chains for the same logical tuple. The way you implement this is that an update to the primary key is treated as a DELETE followed by an INSERT of a new logical tuple, with some bookkeeping maintained so you know how to roll back if necessary.

For secondary indexes, it's more complicated. The two approaches are these. The first is to maintain a logical pointer, some kind of unique identifier for the tuple that does not change, plus an indirection layer that maps the logical ID to a physical address or location in the database; any time you update the tuple and the version chain, you just update that indirection layer, not every single index. The other approach is to use physical pointers, which I talked about before, where you point directly to the head of the version chain, so any time that gets updated, you have to update every single index.

So it would look like this. Say we have a simple database down below, with a version chain running append-only, newest to oldest. For the primary key index, if I do a lookup on object A, the index entry is a physical address, a page ID and offset, that puts me exactly at the head of the version chain, and any time I create a new version, I update it. For secondary indexes, you could use the physical address again; same issue: any time I update the tuple, I have to update the secondary index to point to the new head. That's fine if I have one index; it's not a big deal. But it's actually very common in OLTP environments for people to have maybe a dozen or so secondary indexes on one table, so that all their queries run really, really fast. If I have a lot of secondary indexes, then every single time I update the version chain, I have to update all of them, and that's expensive, because if it's a B+Tree index, you have to traverse it, taking latches on the way down, and then apply the update. The alternative is, instead of storing the physical address in the secondary index, to store just the primary key: literally, a copy of the primary key is the value in my secondary index.
So now when I want to find the tuple, I get the primary key out of the secondary index, then do a lookup in the primary key index, and that gives me the physical address. Now, any time I update the head of the version chain, I just update the primary key index, and that automatically covers all my secondary indexes. This is actually what MySQL does; Postgres uses physical addresses. The last alternative, logical IDs, is to have some synthetic value like a tuple ID, and then there's some hash map, some hash table, that maps tuple IDs to addresses. I get a tuple ID out of the secondary index, do a lookup in the map, and I get my physical address; and any time I update the version chain, there's only that one hash map to update. So again, it depends on the workload, it depends on the application domain; no one approach is better than the others in all possible scenarios.

So this is a table we generated for a paper we published a year or two ago, where we looked at the historical MVCC systems like Postgres, Oracle, and MySQL, and a bunch of newer ones from academia and industry, and it shows how they handle the various design decisions we talked about: here's what concurrency control protocols they use; here's how they do version storage, and as you can see, append-only is actually pretty common, whereas I think delta is the right way to go; here's how they do garbage collection; and here's how they handle indexes. So again, the main takeaway is that everyone does different things. This is why I don't want to just show you numbers and results and say, all right, two-phase locking is better than OCC, or OCC is better than two-phase locking, because it's actually all this other stuff that makes a bigger difference. And the spoiler, as I've said: in the experiments we've done, we find that Oracle and MySQL, and I think NuoDB, with the way they do MVCC, encompassing all these design decisions, are actually the fastest for OLTP workloads, and Postgres is actually the slowest.

All right, so any questions about multi-version concurrency control? Again, I want to impress upon you that we're using these timestamps to figure out what versions are visible to our transactions, but it's more than just the visibility aspect: it's also figuring out how we actually store the versions, how we update them, how we update our indexes, and things like that. Okay.

All right, so this is it for concurrency control for this semester. Starting Monday next week, we're going to start talking about logging and recovery, and that will be the last week of material you need to build an entire database system. Everything that comes after that will be distributed systems and other topics that go beyond the basics. The way I think about it: if you understand everything so far in the course, and you understand what we cover next week, then you can go off and build your own database from scratch. Okay. All right, any questions? All right, guys, have a good weekend, and see you on Monday. Bye, see you next week.