So, today we're going to talk about MVCC, multi-version concurrency control. This is actually the first part of a three-part series where we're going to go into much greater detail than we did in the introduction class and actually talk about how to implement MVCC in a modern system. But, just so everyone's on the same page at a high level: multi-version concurrency control is not so much a specific concurrency control protocol, but a way of designing a database system where you maintain multiple physical versions of a single logical object in the database. And by object I'm being vague here. It could be anything: a table, a block of tuples, a single tuple, even a single attribute within a tuple. In practice, most people just do it on a per-tuple basis, but again, the idea is that for a single logical object that the application sees, we're going to have multiple physical versions underneath the covers. So what's going to happen is that when a transaction writes to an object, or does an update to an object, then instead of overwriting the existing values, it's going to create a new version, and that new version is going to contain the change that we just made. All right, so again, the idea is that when we do an insert, the version doesn't exist yet; we're just putting it in the database. When we do a delete or an update, we need to keep track of the fact that there were previous versions and that we're creating a new one. And then when we try to go read that object at the logical level, underneath the covers the database system has to figure out, well, what is the correct version that we need to read? And it's going to be the one that existed when our transaction started. So MVCC is not a new idea. It goes back to the late 1970s. It was first proposed in a dissertation by this guy Reed at MIT, and that's the first description of how MVCC would work. The first real implementation, as far as we know, was in this system called InterBase, at DEC. InterBase is actually still around today; it has changed hands a couple of times. It got bought by Borland, the old compiler company, and now it's owned by some mobile company and has been rebranded as a mobile database. But when Borland had it, they did fork the source code and open-sourced it as Firebird. So that still exists today. I've never really come across anybody using it, but they're still maintaining it and still working on it. If you ever wondered why Firefox is called Firefox: when Netscape went under and Mozilla took over the browser, they were originally going to call it Firebird, but that conflicted with this system, so they had to rename it to Firefox. So what's interesting about MVCC is that pretty much every single database system that's been developed or released in the last 10 years, including the two we've been building here at CMU, uses MVCC. It's super common, so it's important for us to figure out how we actually want to build it in a real system.
All right, so the main idea of MVCC, the benefit we're going to get, is that writers don't block readers and readers don't block writers. That means that when a writing transaction, a writing thread, goes and updates a tuple, unlike in two-phase locking, where you'd have to take an exclusive lock on the tuple and block anybody else from doing anything with it, even reading it, under MVCC we don't have that problem. We can create a new version, and other transactions can read the older versions without any conflicts. Now, if we have a write-write conflict, we'll have a simple rule that says the first writer always wins. That's the simplest way to do this; you obviously can't let both writes go through, you can't have write-write conflicts. But this is going to allow us to do some things we couldn't normally do in a single-version system, like have a long-running analytical query run in the background and read old versions while newer transactions are updating the database. So for read-only transactions, what we get is the ability to read a consistent snapshot of the database. Everybody's going to have a consistent snapshot, and I'll define what that means on the next slide. But the big advantage is that if we declare our transaction as read-only at the very beginning, then we don't actually have to acquire any locks to read anything, because we know we can always read a consistent snapshot. In some systems like MySQL, a read-only transaction doesn't even acquire a transaction ID, because for them handing out those transaction IDs is a centralized bottleneck, so you just skip that entirely. And the way we're going to determine what is visible to us is through timestamps. Again, we'll go over this in more detail for the rest of the class. But then we're also going to be able to support, with MVCC, what are called time travel queries. So again, in a single-version system, what happens? I update a tuple and the old values are gone. They're in the log in case I need to undo them, but when queries run they're not looking at the log, they're looking at the tuple. With MVCC, if we don't do any garbage collection and keep all the old versions around, then some systems will let you run time travel queries: run this query on the snapshot of the database that existed two years ago or three years ago. A bunch of systems claim they support this. It's not a new idea either. Postgres had this in the 1980s: you basically just don't do garbage collection, and you do a little extra work to allow queries to specify what snapshot they're looking at. It wasn't until Postgres actually started getting used outside of academia in the late 1990s that they realized that this is a feature nobody wants and you run out of disk space very quickly. So one of the first things they did when they made Postgres actually usable outside of Berkeley was get rid of this support. A bunch of newer systems claim they support time travel queries today; I have yet to see a real use case for it. All right, so we need to define what snapshot means. In the introduction class, when we talked about isolation levels, we never really talked about this, because it's something that's not exactly defined in the SQL standard, but it's something you get in some systems when you use MVCC.
So the way it's gonna work is that when a transaction starts, it sees a consistent snapshot of the database as it existed at the time that transaction started. What that means is that you only see versions of tuples that were created by transactions that committed before you started. So if my transaction starts and there's another transaction that started before me but hasn't committed yet, and say they updated ten tuples, five before I started and five after, I won't see any of their changes, because none of those versions were committed before my transaction started. So that's what I mean by consistent: you're not gonna see any torn writes from other active transactions. You only see things that you know committed before you started. So then also, as I said before, if we have two transactions running at the same time and they try to update the same object, then we'll just use a simple rule that says the first writer wins. Under two-phase locking, when you do deadlock prevention or deadlock detection, there are all these different rules: who's older, when should I wait, should I abort, who holds the most locks, maybe I let them go before the other guy goes. We don't do any of that, because that's too complex for the in-memory environment we're trying to run in. We just say: whoever wrote it first wins. The second guy tries to write the same object, we need to be able to detect that we're both trying to write the same object, and if so, we just kill ourselves and abort. So you may be thinking, all right, this seems awesome, snapshot isolation gives us exactly what we want. Are we able to achieve serializability with it? Yeah, question? So the question is: when I say the first writer wins, is it based on the time they started or the time they committed? It's based on the time that they wrote. So you can actually start before I do, but if I write to this tuple and then you try to write to the tuple, you have to go abort. And again, think about it, it's the simplest thing to do, because as I get to the tuple to try to write it, I would see, oh, someone got to it before I did, so I just go ahead and kill myself. I don't know yet whether you're gonna write to the same thing, so how would I abort before you wrote to it, right? That's a good question. All right, so snapshot isolation is not serializable. I'll show you why: it's susceptible to what's called the write skew anomaly, but it has different anomalies than, say, repeatable reads and read committed. And what's also confusing about this is that if you take Oracle, for example: with Oracle, if you declare that you want your isolation level to be serializable, what you're really getting underneath the covers is snapshot isolation, but they don't tell you that unless you read the documentation. So it's important to understand what the distinction is, and you'll see why this is not gonna be serializable. In the next class we'll see methods to add additional machinery to the database system to make snapshot isolation serializable. Yes? The question is: are anomalies possible under this? Yes. In particular, it's susceptible to this write skew anomaly, which repeatable reads doesn't have.
All right, so the easiest way to understand the write skew anomaly: remember from the introduction class we talked about dirty reads, non-repeatable reads, and the other conflicts you can have. Write skew is something that's specific to snapshot isolation, that you get in a multi-version system. The easiest way to understand it is through this simple visualization that supposedly was created by Jim Gray. So let's say I have a database of marbles, and marbles can have two colors: they can either be black or white. And I have two transactions: one wants to change all the white marbles to black, the other wants to change all the black marbles to white. Say these two transactions start at the exact same time. Under MVCC, or under snapshot isolation, they'll each have a consistent snapshot of the database as it existed when they started: each of them sees two black marbles and two white marbles. So the first guy here is gonna switch all the white marbles to black, and the bottom guy wants to switch all the black marbles to white. They go ahead and do that. There are no write-write conflicts, because they updated different things. And then they both commit. But what happens? Now we have two white marbles and two black marbles; they essentially just got swapped. Is that serializable? You're shaking your head no, why? Or are you just drinking your drink? Okay. Yes? Right: if it were serializable, one of them would go first, so they'd make everything black and then everything white, or everything white and then everything black, and we'd end up with all white or all black. Exactly. So he said, if it was truly serializable, the result would have to be equivalent to a serial ordering of the transactions, which would be either all black or all white. But in my example here we ended up with two white and two black. So under serializable isolation this could not occur. Yes? The question is: aren't I working on the assumption that I don't need to change the bottom two because they're already white? Right, but the query is: find everything that's black and make it white; find everything that's white and make it black. So when this guy ran, when he looked at his snapshot, he saw that these marbles were already white, so he knows he doesn't need to update them. But then when both commit, I end up with that anomaly. Yes? So the assumption here is that each transaction is essentially reading all of the tuples, the marbles, and then only writing to the ones that are black or white respectively. Would this issue be fixed if it just blindly wrote to every tuple, so that under snapshot isolation you get something equivalent to a serializable execution? So the statement is: in this case here, we read everything, find the ones we need to update, that's our write set, and we write that out. You're asking, what if we just blindly wrote everything: this guy was already white, make it white; this guy was already black, make it black, and use that as our write set? Yeah, in that case one of them would have to abort. Yeah, right.
So then, yes, one of them would abort and retry, and then you're fine. I guess more generally what you're asking is: is there some way to rewrite queries into a serializable form under snapshot isolation? So the issue here is that my query is written in English: change black marbles to white. You can infer that to mean find the black marbles and make them white. You could in theory also say, all right, just make everything white, because that's what I want; assuming I only have two colors, it doesn't matter what's white now, just make sure everything is white. No database would actually do that, because one, there might be higher-level semantics about what this query is trying to do. My case is super simple, so you can say, oh yeah, of course, if it's already white, making it white is fine. But in a real database it's probably too difficult to reason about that. Furthermore, you're also writing way more than you need to. If my database has a billion marbles, I don't want to have to write out things that I don't need to write out, because otherwise I might be better off just executing these transactions one after another on a single thread. The main takeaway I want to point out here is that under snapshot isolation this anomaly can occur, and the end result, from the higher-level perspective of what the database should look like, is wrong: it's not equivalent to executing the transactions in serial order. Yes? Is serializability just defined as whether or not these two... No, serializable means, again, this was covered in the introduction class, that the final state of the database is equivalent to some serial ordering of the transactions. So in this case here I had T1 go first followed by T2, but I could reverse that: T2 could go first followed by T1, and both outcomes are considered correct. Yes? The question is: can this anomaly also occur in read committed? Yes, but I don't think repeatable reads has this issue. It's this picture here: in the textbook definition of isolation levels, in the ANSI standard at least, you have this one path up to serializable: read uncommitted, read committed, repeatable reads, serializable. But snapshot isolation is this other thing off to the side: the write skew anomaly can occur under it, but it cannot occur under repeatable reads. And repeatable reads is susceptible to, I think, phantoms, I forget exactly, and that can't happen here. You can never have a dirty read under snapshot isolation, because you only see versions of tuples that were created by transactions that committed before you started. So you can't have a dirty read. All right, so the main point of showing this is that snapshot isolation is not something you get by default with MVCC. Postgres uses MVCC, and by default you get, I think, read committed; I think the same is true for MySQL. But if you follow the strict definition of how we're gonna do concurrency control and determine which versions are visible to us when our transactions run, as far as I know you end up with snapshot isolation.
And then in our next class on Monday, we'll see a bunch of extra stuff you can do in the database system to actually make this serializable, okay? All right, so the paper I had you guys read is an overview of the different design decisions you have to make when you build a modern MVCC system. And despite having the words concurrency control in the name, MVCC is more than just a concurrency control protocol. The idea of multi-versioning actually permeates the entire architecture of the system. So it's not just a matter of saying, I'm gonna pick two-phase locking and you're done; there are all these other things you have to worry about, and the implications of these design decisions can be quite significant depending on the application or workload you're trying to support. So the paper I had you guys read actually came out of this class. When I first taught this class in 2016, I wanted to cover modern MVCC, and we were building our own database system here that was going to be multi-versioned. A bunch of questions came up, like: how should we do version storage? How should we do garbage collection? And when you go read the literature or the documentation for all these other systems doing MVCC, they usually just say, this is what we do. They don't justify why they do certain things, and all these different systems were doing something slightly different. So the idea of this paper was to go through and actually understand all the trade-offs for these different design decisions, do a bake-off, and then put whatever turned out to be the best into our system, Peloton. So the original title of the paper that I had you guys read was actually "This is the best paper ever on in-memory multi-version concurrency control." Papers are sort of like your children, you're not supposed to pick your favorites, but this one is actually one of my favorites. I had a lot of fun writing it and I think the results are very interesting. So we literally submitted the paper with this title. The reviews were very positive, but the very first thing they said was: remove "this is the best paper ever." They were quite adamant about that. So I'm like, all right, I understand their point; we didn't want to be flip or subjective in our paper, although I still think it's the best paper. But would it always be? I don't know. So then we changed the title to "If you only read one empirical evaluation paper on in-memory multi-version concurrency control, make it this one!" with an exclamation point. That one they didn't like either; they came back and said this is subjective, this is your opinion, you have to make the title more scientific, more precise. And so the third title was "We think you will really enjoy this paper." And it's true: I did think they would enjoy it. That's a factual statement. So at this point the paper got accepted, but they were starting to get pissed off, and they said if you don't change the title, we're gonna flat out reject it. And I didn't have tenure, I still don't, and the students needed the paper to graduate and get jobs. So we capitulated, and that's why the title of the paper you read is what it is. We had to make it real dry and boring. I feel like I could have fought a bit more and gotten one of these, but I caved.
It is what it is. Okay. All right, so again, the four design decisions we're gonna discuss are concurrency control, version storage, garbage collection, and index management. So let's go through each of these one by one. For the concurrency control protocol, again, it's all the same methods you guys read about last class or that we talked about in the last lecture. There isn't anything special you do differently because you're using a multi-version system; these are just adaptations of those classic protocols to make them work in a multi-version system. And in particular, what we're gonna discuss is how you actually want to do this in an in-memory database system. Again, in a disk-oriented system, we would separate the locks and the locking information from the actual tuples themselves and store them externally, because we wanted to keep the lock information in memory so that if the tuples get swapped out to disk, we can still know who holds locks on what. But now everything's in memory, so we don't want separate data structures. So we're gonna try to be clever and store in the tuples themselves all the information we need about what's going on in our concurrency control protocol, in addition to all the information we need about the versioning of our tuples. So we don't have to store anything separately. So: timestamp ordering, OCC, and 2PL. We're gonna discuss the first and the last one. OCC, as I said, is just a variant of timestamp ordering where all your writes go into a private workspace that's not immediately visible; whereas in timestamp ordering, the writes are immediately visible, but we use some extra metadata to make sure that people don't read things they shouldn't be reading, okay? I also apologize about the paper from the last class: we wrote that one before we wrote this one, and in that paper about the thousand cores, we described MVCC as being a single concurrency protocol using timestamp ordering, because that's what was defined in the original MIT dissertation. But in actuality, under MVCC you can use any of these different approaches. So timestamp ordering, MVTO, is what we used in the thousand-core paper, but you can still use 2PL. So the first thing we need to discuss is how we're actually gonna maintain this information in our tuples. Every tuple is gonna have a header where we store additional metadata. For simplicity, assume it's a row store; if it's a column store, you just have a separate column with this metadata. The way to think about it is there's some offset into the fixed-length data pool, and the first number of bytes will be this metadata. So the first thing we're gonna have is a unique transaction identifier: what is the ID of the transaction that either created this version or currently holds the lock on it? Then we'll have the begin and end timestamps, and these are gonna be used to determine the lifetime, the visibility, of this particular version. And these timestamps could come from a logical counter, a single counter, or they could be really high precision, like a hardware clock in nanoseconds. Typically people use a counter, because it needs to be in the same domain as the transaction IDs. Then we're also gonna have a pointer to either the next or the previous version in the version chain for our tuple.
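To make that header layout concrete, here's a minimal sketch in C++ of what a per-version tuple header along these lines could look like. The struct and field names, and the trailing slot for protocol-specific metadata, are my own illustration, not the layout of any particular system.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical per-version tuple header. Each field is 64 bits, so this
// sketch costs 32 bytes of metadata per version before the tuple data.
struct VersionHeader {
    std::atomic<uint64_t> txn_id{0};          // 0 = unlocked; otherwise the ID of the
                                              // transaction creating / writing this version
    std::atomic<uint64_t> begin_ts{0};        // version becomes visible at this timestamp
    std::atomic<uint64_t> end_ts{UINT64_MAX}; // "infinity" until a newer version supersedes it
    struct Tuple* next{nullptr};              // next (or previous) version in the chain

    // Protocol-specific metadata would follow here, e.g. a read timestamp
    // for MVTO or a reader count for MV2PL, and then the actual tuple data.
};
```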
So again, when we do a lookup to find a tuple, say through an index, we're gonna land at the head of its version chain. The version chain is a linked list that we can then traverse, either backward or forward in time, to find the right version of the tuple that's visible to us. So we need to maintain a 64-bit pointer in our header that says: here's the next or previous version, here's the next thing you should look at in the version chain. And then some concurrency control protocols will have some additional metadata here, followed by the actual tuple data. So the first thing I'd point out is that these are all gonna be 64-bit integers, eight bytes each. In this example, if you ignore the additional metadata, I have four 64-bit values, four 8-byte values, so that's 32 bytes I'm storing for every single tuple. That doesn't seem like a lot, but if I have a lot of tuples, it starts to add up: if I have a billion tuples, these four fields alone are gonna need 32 gigabytes. That's a lot. But it's unavoidable, because it's this trade-off between compute and storage that's classic in computer science, and we're gonna hit it a lot this semester. I could be a bit smarter and maybe store these begin and end timestamps at the block level, so a batch of tuples shares begin and end timestamps. But now that means that when I start doing lookups, I'm spending more CPU to find the actual version that I'm looking for. So for this reason, pretty much everyone always stores this for every single tuple, and that's good enough because it gives you the fine granularity you need to achieve high concurrency and high parallelism. All right. So the first thing to point out, as we store this in our table, is that I have this column here of version IDs. That's just for illustration purposes; we're not actually gonna store it in a real system, because we can use the begin and end timestamps to figure out what the actual version is. It's just for us to understand this visually. So the transaction ID, again, is gonna be a 64-bit integer, and we're gonna use it to keep track of whether anybody holds the lock on this tuple. At the very beginning, it's gonna be zero. Then we have our begin and end timestamps, and these determine the visibility of each version. So in this case here, some transaction created this version at timestamp one, and because this is the latest version of our tuple, we set the end timestamp to infinity, because it's visible to anybody that comes after timestamp one. All right, and then for MVTO, we need to also include the read timestamp. This is a timestamp that keeps track of the last transaction that read this version. We need it because when we try to write a new version of this tuple, if the read timestamp is greater than our write timestamp, then we know that somebody logically in the future already read this tuple and would have missed our newer version, so therefore we have to abort.
So this is a good example of another overarching theme we're gonna have throughout the entire semester: we wanna minimize the number of global data structures we have to coordinate across our different transactions. By recording the read timestamp within the tuple itself, I don't have to do a lookup in some global data structure and ask, what transactions have read this tuple? I look at the tuple and I immediately know exactly what happened. All right, so now a transaction comes along, say it's given timestamp 10, and it wants to do a read on A followed by a write on B. So the read is pretty easy: you check whether the transaction ID is zero, meaning nobody is currently writing to this tuple, and then you check whether your timestamp falls within the version's visibility range. Our timestamp is 10, and 10 is between one and infinity, so we know this version is visible to us. So then all we need to do now is a compare-and-swap on the read timestamp to update it with our timestamp. And if the compare-and-swap fails, who cares: that just means that somebody else, either in the future or in the past, also read this. So we just check it again and only try to update it if it's less than ours. So it's okay if we try to read it, someone comes along with timestamp 11, our compare-and-swap fails, and we come back and see that timestamp 10 is less than 11; that's fine, we don't have to update it. Who cares if another thread read the same thing we read. All right, so now we're gonna do the write. So for the write on B, the first thing we need to do is a compare-and-swap on the transaction ID so that we can take the lock on it, to prevent anybody else from creating a new version while we are. So we do the compare-and-swap and set it to our transaction ID. This implicitly means we hold the lock on this tuple, and then we can go ahead and create a new version; and implicitly we hold the lock on that new version as well, because there's a pointer from the old version to the new one, nobody can get to B2 without going through B1, and I hold the lock on B1, so no one can follow the chain. So I can do whatever update I need here, and then I flip this end timestamp to 10, which sets the visibility of this version. And when I'm done, I set the transaction IDs back to zero; these don't actually need to be compare-and-swaps, you just set them to zero, and now this new version is installed. If anybody comes along with a timestamp greater than 10, they would be able to see this version. Is that clear? Yes? The question is: for the read part, say a transaction with timestamp 15 comes along before the one with timestamp 10 and reads A, setting the read timestamp to 15. Then if 10 comes, do we change it? No, the read timestamp only moves forward in time. So say the read timestamp is one. A compare-and-swap means you read the value first and then say: I think the value should be one, and if it's one, set it to 10. If I do that and it fails because 15 came along, then I loop back around, see that now the value is 15, and since 15 is greater than 10, I don't try to update it. If it were eight, 10 is greater than eight, so I'd try the compare-and-swap again.
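Here's a minimal sketch of that MVTO read path, assuming the hypothetical header layout from above plus an extra read_ts field; it's meant to illustrate the compare-and-swap loop on the read timestamp, not any specific system's code.

```cpp
#include <atomic>
#include <cstdint>

constexpr uint64_t kInfinity = UINT64_MAX;

struct Version {
    std::atomic<uint64_t> txn_id{0};
    std::atomic<uint64_t> begin_ts{0};
    std::atomic<uint64_t> end_ts{kInfinity};
    std::atomic<uint64_t> read_ts{0};  // MVTO: timestamp of the last reader
};

// Try to read a version under MVTO with timestamp `ts`.
// Returns true if the version is visible and the read was recorded.
bool mvto_read(Version& v, uint64_t ts) {
    // Someone else holds the write "lock" on this version: back away.
    if (v.txn_id.load() != 0) return false;

    // Visibility check: begin_ts <= ts < end_ts.
    if (ts < v.begin_ts.load() || ts >= v.end_ts.load()) return false;

    // Advance the read timestamp, but only if ours is larger.
    uint64_t cur = v.read_ts.load();
    while (cur < ts && !v.read_ts.compare_exchange_weak(cur, ts)) {
        // CAS failed: `cur` now holds the newer value; the loop re-checks cur < ts.
    }
    return true;
}
```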
Yes? The question is how am I actually doing this atomically: do I grab this whole chunk of memory, the entire row, and compare-and-swap the whole thing back in? What if some other thread jumps in between my reads in exactly the wrong order? So it really starts with this transaction ID. If I do a compare-and-swap on that and set it to my transaction ID, then nobody else can come along and do a write at the same time I do; I hold the lock on the whole thing. The transaction ID is 64 bits, so I just do a compare-and-swap on 64 bits. I think what you're proposing is taking a lock on the entire piece of memory for this tuple, and that's essentially what this is doing. You can't do a compare-and-swap on more than 128 bits, at least on Intel, and the tuple is way larger than that, so you can't do atomic operations at that granularity. So by setting this transaction ID, I prevent anybody else from writing a new version before I'm done. So now you say, well, what if someone comes along and tries to read this? Well, the first thing the reader would do is see that the transaction ID is not zero and back away, because this tuple is being written to right now, and the reader doesn't know what the end timestamp is going to be yet, so it doesn't know whether it should read this version or the next version that's about to be created. So now you say, all right, what if the reader comes along, sees zero, and starts doing the read, and then the writer sets the transaction ID to 10 while the other guy's still reading. What happens? Well, who cares, because this version is not going away. That reader who got in before I did can still read it, and we're fine, because this is not gonna corrupt the data or do anything weird. The reader just potentially misses the new version. Which then leads to my next question: how do we actually keep track of whether the transaction that created a version actually committed? Because all I showed was that when I updated the versions, I flipped the transaction ID back to zero, which means I released the lock. But now, how do I know whether that transaction is actually gonna commit, and therefore whether I should actually read this version? So what's missing here is an auxiliary data structure that keeps track of the status of transactions. So that when I go do a read, in this case here, you keep track that I read this version, and you need a way to know later on whether the transaction that created that version actually committed, because the begin timestamp is the transaction ID of the transaction that created it. So I know that the transaction at timestamp 10 created this version; when I go to commit, I need to know whether that transaction committed as well. Because if it hasn't committed, then I read something I shouldn't have read and I need to abort. So what's missing here, and what we'll see in Hekaton on Monday next week, Microsoft's version of MVCC, is a separate data structure that you keep on the side, a big hash table, that keeps track of which transactions are running and whether they've committed or not.
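A minimal sketch of what such a side structure might look like, a hypothetical global status map, not Hekaton's actual implementation, which we'll see next week:

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

enum class TxnStatus { Active, Committed, Aborted };

// Hypothetical global transaction-status table. A transaction that
// speculatively read a version created by transaction `tid` checks this
// at commit time to decide whether that read was legal.
class TxnStatusTable {
public:
    void set(uint64_t tid, TxnStatus s) {
        std::lock_guard<std::mutex> g(latch_);
        status_[tid] = s;
    }
    TxnStatus get(uint64_t tid) {
        std::lock_guard<std::mutex> g(latch_);
        auto it = status_.find(tid);
        // Missing entry: assume it was garbage collected long after committing.
        return it == status_.end() ? TxnStatus::Committed : it->second;
    }
private:
    std::mutex latch_;  // a real system would use a concurrent hash table
    std::unordered_map<uint64_t, TxnStatus> status_;
};
```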
Yes? The question is: wouldn't it suffice to just not change the transaction ID back to zero before you commit, to just leave it set until the commit? That's basically holding the lock on this version until I commit. By releasing it now, anybody can come along and read the new version, because you're assuming transactions aren't going to conflict and aren't going to abort, and therefore you want to allow speculative reads. Yes? The question is: why not just embed another commit ID somewhere? So that's what Microsoft is gonna do. There are two ways to do that. One, you just have another commit ID, but that's another 64-bit field and now you're adding way more space. What Microsoft does, and we'll see it in the next class, is piggyback off of, I think, the end timestamp: they use one bit in the timestamp to record whether the transaction has actually committed or not. You still have to go back and update things, but that avoids having a separate column. All right, so I wanna jump ahead to two-phase locking because there's still a lot to cover. So with MV2PL, we get rid of the read timestamp, but now we're gonna add this read count. The read count essentially acts as the shared lock on the tuple. So my transaction comes along and it wants to read A. I'll do a compare-and-swap on these two fields at the same time; they're both 64-bit fields, and x86 allows you to do a 128-bit compare-and-swap. So I check that the transaction ID is zero, and, as part of the same atomic operation, increment the read count by one. If I do that, then that means I have a shared lock on this tuple, and I can read it along with other readers running at the same time. But now when I do the write, same thing: I do an atomic 128-bit compare-and-swap to set the transaction ID and the read count. Actually, I don't think you need to set the read count, but you definitely have to set the transaction ID. So now I have the exclusive lock on this tuple, I can go ahead and create a new version, I update the end timestamp with my timestamp, and then I set the fields back to zero, which unlocks it.
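Here's a rough sketch of that double-word compare-and-swap for the MV2PL locking described above. The 16-byte lock word and function names are my own illustration; on x86-64 you'd compile with -mcx16 so the 128-bit atomic is lock-free (cmpxchg16b).

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical 16-byte lock word packed at the front of each version:
// the transaction ID acts as the exclusive lock, the read count as the
// shared lock. Must be 16-byte aligned for a lock-free 128-bit CAS.
struct alignas(16) LockWord {
    uint64_t txn_id;
    uint64_t read_cnt;
};

// Acquire a shared (read) lock: only succeeds if no writer holds the tuple.
bool acquire_shared(std::atomic<LockWord>& lw) {
    LockWord cur = lw.load();
    do {
        if (cur.txn_id != 0) return false;           // a writer holds it
        LockWord next{0, cur.read_cnt + 1};          // bump the reader count
        if (lw.compare_exchange_weak(cur, next)) return true;
    } while (true);                                   // CAS reloaded `cur`; retry
}

// Acquire the exclusive (write) lock: no writer and no active readers.
bool acquire_exclusive(std::atomic<LockWord>& lw, uint64_t my_txn_id) {
    LockWord expected{0, 0};
    return lw.compare_exchange_strong(expected, LockWord{my_txn_id, 0});
}
```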
Yes? The question is: earlier I said the first writer wins, but under two-phase locking you take locks as you go, so if one transaction takes the write lock on B first, won't the transaction that acquired the lock win, not the first writer? So, to be clear: in two-phase locking you acquire locks as needed, in the growing phase; it doesn't necessarily mean you acquire them exactly when the transaction starts. But you're not gonna release them immediately, correct. And it is still first writer wins: this guy got to the tuple first, he wrote to it. Anybody else that comes along, in this case here, say another transaction with timestamp five, would see that this version's visibility range is one to 10. Five is in between one and 10, so it knows there's some other version that came after it, at timestamp 10, that physically got written before it did: logically it's in the future, but in physical time it's in the past. So my write intersects with this range, therefore there's another write here before mine, and I have to abort. First writer wins; these timestamps tell us everything. Okay. So in the paper I had you guys read, when we wrote it originally, the transaction ID and the read count were actually 32-bit integers, because at the time, at least the x86 hardware we were looking at didn't support 128-bit compare-and-swap, so we packed these two together. If you think about it, there are probably never going to be 2^32 - 1 threads reading the same tuple at the exact same time, so giving the read count 32 bits is overkill. But we still wanted the transaction ID to be as large as possible, so it had to be at least 32 bits, maybe 48. But even then, although it's a large number, if your system runs for a long time you can burn through these transaction IDs pretty quickly. And what happens when you get to the end of the transaction IDs? You wrap around, and then you start to have problems. So let's say we have a simple table with one tuple and this one version here. Some transaction comes along with transaction ID 2^31 - 1 and does the write on A: it sets the transaction ID, does all the locking stuff that we just talked about, and creates a new version, setting the end of the old version's range to 2^31 - 1 and then setting the transaction ID back to zero. But now another transaction comes along, and because we wrapped around, it has timestamp one. It creates a new version, but now the ranges are all messed up, because this new version claims to be visible from one to infinity: physically it's in the future, but based on these timestamps it's logically in the past. So now someone could come along, try to find the right version, and end up going nowhere, because the ranges are disjoint. This is problematic. And if you have a 32-bit integer for your transaction ID, which Postgres does, you can get to this point pretty quickly. Even with a 64-bit integer, at some point you run out and you need to wrap around and handle that. So there are a couple of different ways to deal with this. I like describing what Postgres does because it's so simple and easy to understand, which in some ways makes it elegant. What they do is add an additional bit in the header of every single tuple. Tuple headers are usually padded out with some extra bit space, so if in the future you decide you need to keep track of some additional information about each tuple, you have room for it. So they add an extra bit that says that a particular version, a particular tuple, is considered frozen. And any version that's considered frozen is always deemed to be in the past, no matter what transaction ID or timestamp you compare it with.
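A minimal sketch of that rule, assuming a hypothetical header flag; this isn't Postgres's actual code, just an illustration of how a frozen bit short-circuits the timestamp comparison:

```cpp
#include <cstdint>

// Hypothetical header flags; kFrozen marks a version as "always in the past".
constexpr uint8_t kFrozen = 0x01;

// Is the version that begins at `begin_ts` older than my timestamp `ts`,
// taking wraparound into account via the frozen bit?
bool is_older_than(uint64_t begin_ts, uint8_t flags, uint64_t ts) {
    if (flags & kFrozen) return true;   // frozen: precedes every transaction
    return begin_ts < ts;               // otherwise, normal timestamp comparison
}
```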
So even in this case here, even though this version could have a greater timestamp, 2^31 - 1 is greater than one, if I set the frozen bit, then this one will always be deemed older than that one. The way they do this is that when the system recognizes that it's about to wrap around, it runs the vacuum, the garbage collector, which scans through, finds old versions, and sets this bit flag. There are a couple of posts on the internet, if you go look around for the Postgres transaction ID wraparound problem, about cases where people actually turn off the garbage collector on Postgres during the day because it adds some overhead. You're trying to get the system to run as fast as possible during the day, because that's when most of your customers are using your website. And then all of a sudden you start hitting this wraparound problem and the system freaks out: it has to stop accepting new connections and new transactions because it has to handle the wraparound, but the vacuum can't run because you've turned it off, and then you basically have to run the vacuum manually, and that can do a full compaction, a full vacuum pass, which can take hours and hours or even days. So the Postgres way is pretty simple: the vacuum will do what you want it to do, unless you turn it off. So, is this clear? Now, when you read the academic literature about concurrency control or transactional database systems, the emphasis is always on the concurrency control protocols: is two-phase locking better than OCC or timestamp ordering? But what we actually found in this paper was that it's the other components that actually matter a lot more, and in particular what mattered a lot was the version storage mechanism, the version storage component of the system's architecture. So again, as we said, for every single logical object we'll have multiple physical versions. There are different ways for us to actually store these versions, but at a high level they're gonna be organized as a latch-free, singly linked list: the version chain. And the head of the linked list is always what the indexes point to, so if I'm doing a lookup on a tuple, I'm always gonna land at the head and then traverse along the version chain to find the right version that's visible to me. So again, the indexes always point to the head, but the head could be either the oldest or the newest version, depending on which approach you're using. The different schemes we'll talk about here determine where we're actually gonna store each version we create and what we're gonna store for it: sometimes we'll store a delta, sometimes we'll store the entire tuple. The three approaches are append-only, time-travel storage, and delta storage. And the spoiler is that delta storage is the best way to go. But there are still a lot of systems, PostgreSQL in particular, that do append-only. So what's gonna happen here: because we're an in-memory system, again, we wanna avoid having global data structures, so we wanna avoid having to allocate space for new versions out of some global memory chunk. So instead what we'll do is, as threads create new versions, they'll create the versions inside thread-local storage, memory that's allocated just for that thread. Now, that means the version chain could traverse storage across multiple threads, but that's okay, right?
Because everything's always just in memory. The other thing we also need to think about as we describe these different approaches is not just how fast they are, but also how much storage space they're gonna require to store these versions, and what the engineering effort is to actually implement them, because some are obviously easier than others. And you have to think about this both in terms of the overhead of finding the right version that we need for our query, and the overhead of actually cleaning up older versions over time as they accumulate, in the garbage collector. All right, so we'll go through each of these one by one. So with append-only storage, the idea is that we have a single table, a single physical space where we're storing our tuples, and any time a transaction updates a tuple, it creates a new physical tuple in that same table space. So say our transaction wants to update A1: we just make a copy of A1, append it to some new space or a free slot in the table, and then update the pointer in our version chain to say, here's the new version. So it's pretty simple to do. Again, this is how Postgres originally designed it in the 1980s; I think InterBase did the same thing. The downside, obviously, is that I'm making a lot of copies: I'm copying the entire tuple even though I may have only updated a single field. Another important thing to point out is the order in which we're traversing the version chain for our tuples. So in this example here, I'm going oldest to newest. I started off with A0 and A1, where A0 was the head of the chain and pointed to A1; then when I created A2, A1 now points to it, but my indexes still point to A0. So to find A2, I have to jump to A0, check its begin and end timestamps to see whether it's visible to me; if not, I jump to A1, do the same thing, and then jump to A2. So that may not be the best approach, depending on your application. So again, that was oldest to newest: you always have to traverse the chain to find the newest version. You could also go newest to oldest, where the head of the version chain is always the latest version that just got created. You don't need to update any previous versions; you just add your new version and have its pointer point to the old head of the version chain. What's one obvious problem with that? It's faster for lookups, obviously, because if I want the newest version, I land at the head of the version chain and I'm done. Yes: you have to update the indexes, correct. So again, we'll get to this in a few more slides, but all the indexes point to the head of the version chain. If for every single update the head of the version chain changes, because it's newest to oldest, then I have to update all the indexes to point to the new head. And actually, that's what Postgres does, and compared to MySQL that's problematic. So again, I'm not saying one is strictly better than the other. In most cases, newest to oldest probably is the better way to go, because most of the time, most transactions and most queries want the newest version. Depending on how you do garbage collection, or if you have queries that look at old data, oldest to newest may actually be better.
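To make the traversal cost concrete, here's a small sketch of an oldest-to-newest lookup under the append-only scheme, using a simplified version struct of my own; newest-to-oldest would be the same walk, it would just usually stop at the head.

```cpp
#include <cstdint>

struct Version {
    uint64_t begin_ts;
    uint64_t end_ts;
    Version* next;     // oldest-to-newest: points toward newer versions
    // ... tuple data follows
};

// Walk the chain from the head (which the index points to) until we find
// the version whose [begin_ts, end_ts) range contains our timestamp.
Version* find_visible(Version* head, uint64_t ts) {
    for (Version* v = head; v != nullptr; v = v->next) {
        if (ts >= v->begin_ts && ts < v->end_ts) return v;
    }
    return nullptr;  // nothing visible to this transaction
}
```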
Yes? The question is: can you resolve that by not updating the indexes immediately, lazily updating them, or doing some optimization in how you update them, rather than incurring that cost on every single update? So: is there a way to avoid the overhead of having to update every single index every single time by doing it lazily? No, because that could give you false negatives. If you have an indirection layer, which we'll see in a second, that will avoid the problem, but then you have to maintain that layer, which is additional storage overhead. Okay, so again, in practice I think newest to oldest is the better way to go if you're doing append-only, but oldest to newest does have the benefit that you don't have to update the indexes. All right, so the next approach is time-travel storage. The idea here is that as we create new versions, instead of appending them into our main table space, we have this other table that looks exactly the same as the first one, same schema, same allocation of columns and rows, and that's where we put versions as we create them. So in this case here, we're doing newest to oldest, so A2 is in the main table space and has a pointer to A1 in the time-travel space. So now if I'm gonna update this tuple, I first copy it into the time-travel space, have its pointer point back to A1, and then just overwrite the master version with the new version I wanted to create, all right? And again, first writer wins: I don't have to worry about two transactions trying to update this master version at the same time. I do the write here, and anybody else who tries to update the same thing will have to get aborted, because they'll conflict with me. So then I update the version pointer to now point to A2. This is actually what HANA does, and this is actually what SQL Server does. We'll see how SQL Server does logging with MVCC in a few weeks, and I think they do it this way as a byproduct of the system not originally being designed to be multi-versioned: doing it this way requires the fewest changes to the overall architecture. The other interesting thing you can do is have the main table use a different storage model than the time-travel table. Because you're doing a lot of updates against the main table, you could make it a row store, and then the time-travel table, because those are older versions, you could maybe over time convert to a column store; that way, if you need to read old data, you can access it through a columnar layout, which is faster for that. All right, and the last one is gonna be what I think is the better approach: delta storage. The idea here is that instead of copying the original version every single time and then making our change to it, we only record what the change was and store just that information. So if I'm gonna update A1, instead of copying it, I'll just have a delta record in some space in memory that says: here's the change I made, here's the old value for this tuple, and now the version pointer points to that.
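Here's a rough sketch of what a delta record and the replay described next could look like; the single-attribute, fixed-length delta and the field names are my own simplification, not any particular system's format.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical delta record: stores only the before-image of the one
// attribute that changed, plus the usual visibility timestamps.
struct DeltaRecord {
    uint64_t begin_ts;
    uint64_t end_ts;
    uint32_t attr_idx;        // which column this delta restores
    uint64_t old_value;       // before-image (fixed-length attribute assumed)
    DeltaRecord* next;        // next-older delta in the version chain
};

// Reconstruct the version visible at `ts`: start from the master tuple in the
// main table and apply before-images until we reach the visible state.
std::vector<uint64_t> materialize(std::vector<uint64_t> master,  // latest attrs
                                  const DeltaRecord* chain, uint64_t ts,
                                  uint64_t master_begin_ts) {
    if (ts >= master_begin_ts) return master;  // the latest version is visible
    for (const DeltaRecord* d = chain; d != nullptr; d = d->next) {
        master[d->attr_idx] = d->old_value;               // roll this attribute back
        if (ts >= d->begin_ts && ts < d->end_ts) break;    // reached the visible state
    }
    return master;
}
```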
So then for this delta record, I'm gonna maintain the same begin and end timestamps just as I would for a regular tuple, but the only thing I'm storing is the delta, like a diff in git. So now if I wanna do a lookup and find an older version, I just follow the version chain and replay the delta records to put the tuple back into the state it was in for that version. So again, if I have a thousand columns in my table and I only update one of them, then instead of having to copy all thousand columns every single time, I can just store the one column that got changed. So this is what Oracle does, this is what MySQL does, and this is what we now do in our system. I think this is the better approach. Computationally it's more expensive, because I've essentially gotta replay the deltas to put the tuple back in the correct state when I wanna do a lookup into the past. Most of the time, though, you only need to read the latest version, and all that data is just there and you're done. And you end up storing less data, which also improves your cache locality, puts less pressure on the cache, and keeps as much data in memory as possible. So there's a big benefit to that. Yes? The question is: what if a lot of my transactions are doing scans and some of them are updating? Wouldn't this be very slow, because you have to keep reconstructing tuples from the delta storage? So again, in a real database system, most of the time you're not updating every single tuple. So if I'm doing a sequential scan across the entire table, most of the time I'm just gonna rip through the main table space and I don't have to follow the version chain at all. In the event that I do have to follow the version chain, yes, I pay a little computational penalty to follow that pointer and replay the deltas to put me back in the correct state, but most of the time I don't have to do that. I would also say that, depending on how you architect the system, to do an update you're already generating the delta anyway, because that's what you're gonna write out to the log on disk; so you might as well just record it in memory as the delta record, and then you don't have to do any extra work to make a copy when you create the version. So there is absolutely this trade-off between reads and writes and which one you wanna favor. But I think the evidence is that this approach makes the writes go faster, and the penalty you're paying for reads is not significant enough to outweigh that write benefit. Plus you use less storage space. Garbage collection is also easier in this world, because I only need to go through the delta storage and clean it up; I never touch the main table, because that always holds the latest version. Yes? The question is: say we have a value attribute and another attribute, and the last hundred updates were all to the value attribute; then to reconstruct the other attribute, we'd have to traverse all hundred deltas? Right, so he makes a good point: if I have a bunch of deltas that each update a separate attribute, do I need to go back in time really far to find the correct version?
So the way you handle that is with garbage collection: you try to shrink down and compact the version chain, the deltas, as quickly as possible. We'll cover that next week. Yes? The question is, for clarity: if you're changing multiple attributes, can you store that as one delta? Yes. In my example only one attribute changes, but every attribute I modify in an update will be within one delta record. But the latest version is in the main table, right? Yes. So why do we ever need to go back? Because that latest version could be uncommitted, or it could have been written logically in the future, outside my snapshot, so I shouldn't see it; therefore I have to go back in time to find the version that's actually visible to me. All right, cool. So there's one additional problem if you're doing append-only, and it's another reason I think the delta store is better: if you have string values stored in the variable-length data pool, then every single time I create a new version, I have to make a copy of the variable-length data so that my new version can have its own pointer. Because if I start doing garbage collection and I clean up A1, because A2 is now the latest version, and A2's pointer was shared with A1, then when I clean that up, my pointer points to nothing, and that's bad. Right, so that sucks: it means that every single time, even if I don't update this string at all, I'm still making a copy of all the string values. One way to handle this, obviously, is to add a reference counter in the variable-length data pool so that I know how many versions are actually pointing to a given value; if I go clean up the first version, I decrement the counter, and if it's still greater than zero, then I know that some version out there is still pointing to this string in my variable-length data pool and I shouldn't go ahead and clean it up. The downside, though, and this is actually what we tried in our old Peloton system, is that while I'm keeping track of how many versions point to this value, I don't actually know where they are. So if I ever need to move this piece of data around in memory, like if I start compacting my variable-length data pool, then I have to do a sequential scan to try to find the versions of tuples that are actually pointing to it. So this actually turns out to be a bad idea, and it's something we abandoned pretty quickly. This is essentially what dictionary compression does as well, because instead of storing the pointer to something in the variable-length data pool, I would have a dictionary code that I could then use to do another lookup and find what the actual value was. But at least in that case, if you're compressing the data, then it's usually cold data, and therefore you're not gonna have to clean things up very often. Yes? The question is, what is this pool? So again, in our last class I talked about how there's the fixed-length data pool and the variable-length data pool. All our tuples have to reside in the fixed-length data pool; that's their location. But because we want the size of every single tuple to be fixed, anything variable-length has to be stored in a separate data pool. So inside my fixed-length data, instead of actually storing the string, I store a pointer to the string, and that's another piece of memory that's managed separately.
And the point I'm trying to make here is that, on one hand, we could just duplicate the string over and over again for every single new version, because each of these versions needs to have its own unique pointer. But that's obviously gonna be wasteful, because if I don't update the string, then I'm copying it for no reason, and if my string is huge, that's gonna get expensive very quickly. Or you could try to share pointers by adding a reference counter to every entry in the variable-length data pool. But now the problem is, if I ever try to move that data around, I don't know what else is pointing to it and I could have broken pointers. Yes? The question is, could you use a logical pointer? That's essentially what dictionary compression does; we'll cover that in a few weeks. It's hard for me to balance what you guys need to know now with what's coming in the future, but yes, dictionary compression solves this. Okay, we have a lot to get to. Garbage collection. The idea of garbage collection, again, is just like GC for the JVM: we need to find physical versions that are reclaimable. A reclaimable physical version is one where we know there's no active transaction that can see it — it's in the past, and all new transactions have timestamps in the future and can't get to it — or the version was created by a transaction that later aborted, so we know no one's ever gonna read it. Either way, we gotta go ahead and clean it up. So there are a bunch of design decisions we need to be mindful of: how to find expired versions, how to determine when it's safe to reclaim a piece of memory, and where to actually look for them. We're gonna focus today on how to look for them. We'll cover the rest a little bit now, but on Wednesday next week we'll spend an entire lecture discussing how we actually do garbage collection in MVCC and go into more detail. All right, so the two ways to do this are at the tuple level or the transaction level. At the tuple level, the idea is that there's not gonna be any central location that says, here are all the versions of the tuples that we can go clean up. Instead, we're gonna have threads run and do scans, and when they come across versions that are reclaimable, they go ahead and clean them up. We can either do this with separate threads or cooperatively with the threads that are running queries. At the transaction level, the idea is that each transaction keeps track of every version it invalidates, and therefore what could be reclaimed. When it goes to commit, it hands off this set of pointers to the garbage collector and says, hey, by the way, here are some things I invalidated, you should go ahead and clean them up. That avoids having to do the sequential scan. So let's go through tuple level first. With vacuuming, again, the idea is that there are separate threads that are gonna run and do sequential scans to try to find old versions. When the vacuum starts, we have to look and say, well, what are the active transactions and what are their timestamps? Then as we scan, we compare the begin and end timestamps for all the versions we see and check whether the range they specify intersects with any active transaction. So in this case here, we have transactions with timestamps 12 and 25. So we know that a version with the range one to nine — these two transactions can't see that. So therefore no one can see it, and it's safe for us to go ahead and reclaim this memory, right?
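As a rough sketch of that reclaimability test — the function name and the exact timestamp convention are assumptions, not from the paper — a superseded version is safe to reclaim once its end timestamp is at or below the oldest timestamp held by any active transaction.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-version check a vacuum thread might run.
// A superseded version lives over [begin_ts, end_ts); once every active
// transaction's timestamp is at or past end_ts, nobody can ever read it again
// (new transactions always get even larger timestamps).
bool IsReclaimable(uint64_t end_ts, const std::vector<uint64_t>& active_txn_ts) {
  if (active_txn_ts.empty()) return true;  // no one is reading anything
  uint64_t oldest_active =
      *std::min_element(active_txn_ts.begin(), active_txn_ts.end());
  // Example from the slide: a version with range [1, 9) and active transactions
  // at timestamps 12 and 25 -> 9 <= 12, so the version can be reclaimed.
  return end_ts <= oldest_active;
}
```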
So this can be expensive to do — sequential scans across the entire system — especially if you want this thing to run all the time. A simple trick is to keep a bitmap that says, here are all the blocks that got modified logically since the last time I ran. Then I just need to scan through those and I can skip anything that wasn't modified. What I mean by "logically" is that it depends on whether I'm doing oldest-to-newest or newest-to-oldest. Say I have two versions in two separate blocks, the chain is oldest-to-newest, and I add a new version that lands in the second block — but now I can reclaim the old version that's in the first block. So I need to mark, logically, going back through the version chain, which block has the thing I should examine. Whereas with newest-to-oldest you would know: I add my new version, I know what it needs to point to as the previous head, so I know where that thing is located and I can update that bit easily. So cooperative cleaning: the idea is that we're not gonna have any separate background threads like in vacuuming; instead, as our transactions or queries run, if they come across old versions, we go ahead and clean them up. So say this query runs: it's gonna do a lookup in the index to find key A, and we land at the head of the version chain. Then, as I'm scanning along to find the right version, I check the timestamps — which I'm doing anyway to see whether a version is visible to me — but I also know the low watermark, the oldest timestamp of any active transaction. If I see that a version is not visible even to that oldest transaction, I can go ahead and garbage collect it right there as I keep scanning along, pruning out anything that's dead, and then I also have to update the index to point to the new head of the version chain. The benefit of this approach is that you don't have to maintain separate background threads, but now your queries could potentially run slower, because they may come across a long version chain and have to clean things up before they can actually finish. What's another problem with cooperative cleaning? Yes? It only cleans tuples that someone reads. Correct — well, one issue is that it only works when a query actually visits that tuple. But even then there could be another problem. What if no one ever reads it? There you go, yes. So let's say I create a new version and then no one ever goes back and reads that logical tuple again. Now I have that old version sitting around that no one's ever gonna clean up. Microsoft calls these dusty corners, and basically the way to handle them is that you periodically still have to run a background thread to go find things that are still sitting around.
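Going back to the pruning step itself, here is a minimal sketch of cooperative cleaning under assumed conventions: versions are linked newest to oldest, a version is visible during [begin_ts, end_ts), and the low watermark is the oldest active transaction's timestamp. All names are illustrative, not actual code.

```cpp
#include <cstdint>

struct Version {
  uint64_t begin_ts;
  uint64_t end_ts;   // "infinity" (a huge value) for the current version
  Version* next;     // next-older version
};

// Walk the chain looking for the version visible at my_ts; along the way,
// unlink any older version that no active or future transaction can ever see.
Version* FindVisibleAndPrune(Version* head, uint64_t my_ts, uint64_t low_watermark) {
  Version* visible = nullptr;
  Version* v = head;
  while (v != nullptr) {
    if (visible == nullptr && my_ts >= v->begin_ts && my_ts < v->end_ts) {
      visible = v;                 // the version our transaction should read
    }
    Version* older = v->next;
    if (older != nullptr && older->end_ts <= low_watermark) {
      v->next = older->next;       // unlink: nobody can ever read `older` again
      // (a real system would hand `older` to the allocator / garbage collector here)
    } else {
      v = older;
    }
  }
  return visible;
}
```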
Yes? Is there a notion of how clean or dirty the database is — can we quantify that? Yeah, so the question is whether there's a notion of cleanliness in our system that we could quantify. Like the example I was giving before, garbage collection takes time, right? Microsoft roughly estimates in their paper that garbage collection adds about a 15% overhead. So if during the day I want my transactions to run as fast as possible, I could potentially disable garbage collection. The system becomes more dirty, and then maybe at night I run the garbage collection — I'm burning more cycles, but I have spare ones I can use to clean things up. It's usually the storage overhead that people care about: you run out of space pretty quickly if you have a lot of versions you're churning through. I would say the notion of cleanliness is something that has to be human-defined. Like, I use fewer threads for garbage collection because I want queries to run faster, but then there's a pushback that says, well, I'm running out of space, especially if we're in memory, so I want to be more aggressive in cleaning things up. Yes? Is that overhead just from a background vacuuming thread? Because maybe during the day you only run the cooperative cleaning and then at night you clean up the dusty corners — so does that 15% include the cooperative cleaning too? Yeah, because think about it: my queries are running slower now, because a query could have just said, well, I don't care that these versions are reclaimable, let me just go get what I want. But it's implemented such that you clean things up as you find them. Okay, the other approach is transaction level. Again, we'll discuss this more on Wednesday next week. The idea is that as my transaction runs, I'm creating new versions because I'm updating tuples, so I know what the old versions were. So I just record, hey, by the way, here's this version, here are its begin and end timestamps — which I know because I had to read them — and if I commit, go ahead and reclaim it. We put this in a queue, and then the garbage collector kicks off some background threads to clean it up over time. So for this one, garbage collection is no longer on the critical path of queries like it is with cooperative cleaning. That means our queries can run faster, and we could end up creating old versions more quickly than before, so we may actually need multiple threads to make sure we clean things up in a timely fashion. In our current system today we do this, but we only have a single garbage collection thread. And in some of the experiments Matt has done, we burn through transactions quickly enough that we start running out of space and the single thread can't keep up. So we'll discuss more about this next week.
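As a minimal sketch of that hand-off — with made-up names, and a plain mutex-protected queue standing in for whatever the real system uses — each transaction records the versions it supersedes and passes them to the garbage collector at commit, so no scan is needed to find them.

```cpp
#include <cstdint>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

// A reference to a version that a transaction superseded (and its end timestamp).
struct VersionRef {
  void* version;
  uint64_t end_ts;
};

class GarbageCollector {
 public:
  // Called at commit time; the batch becomes reclaimable once end_ts drops
  // below the oldest active transaction's timestamp.
  void Enqueue(std::vector<VersionRef> garbage) {
    std::lock_guard<std::mutex> guard(latch_);
    queue_.push(std::move(garbage));
  }
  // A background thread would pop batches that are now invisible to everyone
  // and free them; omitted here.
 private:
  std::mutex latch_;
  std::queue<std::vector<VersionRef>> queue_;
};

class Transaction {
 public:
  // Called whenever this transaction creates a new version and supersedes an old one.
  void RecordInvalidated(void* old_version, uint64_t end_ts) {
    invalidated_.push_back({old_version, end_ts});
  }
  // On commit, hand everything off so the GC never has to scan for it.
  void Commit(GarbageCollector* gc) { gc->Enqueue(std::move(invalidated_)); }
 private:
  std::vector<VersionRef> invalidated_;
};
```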
All right, the last design decision, and it's super important, is index management. We've already touched on this a little bit, but basically: how do we find the right version? This depends on the version storage scheme, how much work we have to do to maybe replay deltas, and what the ordering of the version chain is. The primary key index is always gonna point to the head of the version chain. It doesn't matter whether it's oldest-to-newest, newest-to-oldest, time travel, or delta storage — it's always the head, right? That just means that, depending on the ordering, we may land exactly on the first version we want (the head), or we may have to traverse the whole chain. Now, if a transaction updates an attribute that's part of the primary key, rather than trying to get clever and maintain different version chains for the same logical entry in the index, we're just gonna treat this as a delete followed by an insert. So even though conceptually it's the same tuple, from the database system's perspective it's now two discrete logical tuples. Secondary indexes are more complicated because, as I already said, depending on whether we're pointing directly at the version chain or not, every single time you create a new version it could get really expensive. And so there's this great blog article from Uber that came out exactly when we were writing this paper, and it talks about their journey from Postgres to MySQL. The true story is they originally started with MySQL, they hired some guy that really loved Postgres so he switched them to Postgres, then they realized it was a mistake and had to switch back to MySQL. So that was probably super expensive — if they had just taken this course, they would have saved a lot of money, right? They talk about a bunch of different things Postgres does differently than MySQL, but one of the main things they stress is how the two systems manage secondary indexes in a multi-version system: Postgres points to the head of the version chain, whereas MySQL uses a logical pointer. So those are the two different approaches. A logical pointer is some kind of identifier that's fixed for the tuple, so that we don't have to update our secondary indexes every time the physical location of the version chain changes. It can either be the primary key or some kind of synthetic tuple ID. The primary key is what MySQL does — if you don't declare a primary key in MySQL, it will actually generate one for you, called the row ID. Or we just have some kind of global counter that numbers tuples one, two, three, four, five, six, and then an indirection layer to look up the physical address. The physical pointer, as I said, is where you just always point to the head of the version chain. So let's look at this visually. Suppose we're using append-only storage, newest-to-oldest, and we wanna do a lookup on key A in the primary index. Again, the primary index always points to the physical version — no problem, we just jump there and then scan along to find the version we're looking for. The secondary index, if it's using a physical pointer, same thing: you land there and scan along. If I only have one index, it's no big deal — every time the head changes, I just go update that one index. But if I have a lot of indexes and they're all pointing to the physical address, then any time I create a new version, going newest-to-oldest, I gotta update all of them. This is what Postgres does, and in the Uber example their application had a lot of secondary indexes, and it got really expensive to update all of them every single time a new version of a tuple was created. So how can you handle this? With an indirection layer: for example, store the primary key in the secondary index and then do a second lookup in the primary key index to get the physical address. Now if the physical address changes, I only have to update the primary index and nothing else. If your primary key is not that big, like a 64-bit integer, no big deal; but if it's a large text field, you're storing that as the value in every secondary index over and over again, and that can be expensive.
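Here's a minimal sketch of that primary-key-as-logical-pointer scheme, with hash maps standing in for the real index structures; the names and types are assumptions, and a lookup costs two probes, but the secondary index never has to change when the head of a version chain moves.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

struct VersionChainHead { /* head of the tuple's version chain */ };

// Primary index: primary key -> physical location of the version chain head.
std::unordered_map<int64_t, VersionChainHead*> primary_index;
// Secondary index: e.g. email -> primary key (the logical pointer).
std::unordered_map<std::string, int64_t> secondary_index;

VersionChainHead* LookupByEmail(const std::string& email) {
  auto s = secondary_index.find(email);
  if (s == secondary_index.end()) return nullptr;
  auto p = primary_index.find(s->second);   // second probe, via the primary key
  return p == primary_index.end() ? nullptr : p->second;
}
```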
The alternative, again, is to use a synthetic tuple ID, but now you need a hash table or some other lookup table to map that tuple ID to a physical address. So any time the physical address changes, you just update this map and you don't have to touch the indexes themselves.
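A minimal sketch of that synthetic tuple-ID indirection layer, with hypothetical names: all indexes store the tuple ID, and only this one map is updated when a version chain's physical address changes.

```cpp
#include <cstdint>
#include <unordered_map>

using TupleId = uint64_t;
using PhysicalAddr = void*;

class IndirectionTable {
 public:
  // Assign a new tuple ID when a tuple is first inserted.
  TupleId Allocate(PhysicalAddr addr) {
    TupleId id = next_id_++;
    map_[id] = addr;
    return id;
  }
  // Indexes store TupleId; this resolves it to the current version chain head.
  PhysicalAddr Resolve(TupleId id) const { return map_.at(id); }
  // Called when a new version becomes the head of the chain: no index is touched.
  void Relocate(TupleId id, PhysicalAddr new_addr) { map_[id] = new_addr; }
 private:
  TupleId next_id_ = 0;
  std::unordered_map<TupleId, PhysicalAddr> map_;
};
```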
So is this clear? Yes? If we store versions newest-to-oldest, can we assume that the tuple in the main table can be uncommitted or committed, but the ones stored in the time travel table are the committed ones? So the question is about time travel storage — I think the next few slides will actually answer your question. Is it essentially, is there a way to avoid having to update this? Not exactly? Well, we're short on time, we'll take it afterwards, sorry. All right, so another nasty thing — this actually wasn't in the paper that you read, but it's something we've encountered building our own system — is that you need to be able to support duplicate keys that exist in disjoint snapshots. The issue is that in our indexes, we're not actually gonna store version information about our tuples, right? For our B+tree, we don't wanna store, here's the key for version one, here's the key for version two, because that would be super expensive to maintain — every single time I create a new version, I'd have to go update the index and store some additional metadata. So most systems don't do this. If you're using an index-organized table, like MySQL's InnoDB, where the tuples are stored in the leaf nodes of the index itself, then you kinda get this for free, but most systems don't do that. So the issue is that the same key could exist in different snapshots, and therefore in my index I need to store that same key multiple times with pointers to different version chains. Let me walk through a short example. Say I have a simple tuple with key A, and I have a single version. My first transaction comes along and does a read on A. We'll cover what begin and end timestamps mean in the next class, but just assume this is the timestamp it's given when it starts. So this transaction does a read on A, no big deal: I follow my pointer in my index and I land here. It does an update on A, same thing: I follow the pointer, create the new version, and update the version chain. That's fine. But now I do a delete on A. So what needs to happen? Well, I'm gonna mark this thing as deleted — there's a little bit you can set in the header as well. That's fine. This transaction goes ahead and commits. We update our timestamps to record when it actually finished: the begin and end timestamps are both set to 25, the same value, which says this thing has been deleted. Now another transaction comes along and does an insert on A at timestamp 30. So I have to create a new entry in my index that points to this new version chain. I can't get rid of the old entry, because that first guy is still running at timestamp 10. So now when I do a read on A through the index, I gotta make sure I get the right one and not the other, even though the value of the key is exactly the same. And this is allowed to happen because the deleting transaction committed before the inserting one started, so the insert is allowed to write A — it's not a conflict, because A is gone at that point in its snapshot. So I'm creating two entries for A in the same index. That's kind of weird, right? That's the idea: the same key can exist multiple times, at different snapshots, even though it's supposed to be a unique index because it's the primary key index. So one way to handle this — actually, how do we do this? Do we have two entries for the same key? I don't think so, do we? Yeah, we're allowed to duplicate the key, but it has two different version chains. I think we store some extra metadata in the key, right? To say that this is unique — no, we don't. Okay, what do we do? This is a problem; let me come back to you, we'll discuss this next time. Or maybe we don't handle this and we're broken. We definitely had this problem in Peloton, the old system, and I know how we got around it there. Even though it was a unique index — where you'd think a single key has exactly one value — you would get back a list of values, and those values were pointers to different version chains. Then you'd have to traverse every single version chain to find the version that's actually visible to you. So for a single key, you get back a list of pointers, but you have to traverse each of those version chains to find the entry that's visible to you. Yes? Even after the tuple is deleted, why can't the version chain just keep continuing, and we mark that it's deleted? So the question is, even though this tuple got deleted, why not just have the new version chain point back to the old one so I can traverse it, and let garbage collection take care of it? Because you need a sentinel value, right — a way to say nothing else comes after this point in the version chain, so you don't need to scan any further. But I guess if that pointer is not null, you could still do that. Yeah, that actually might work too; I'd have to think about it. Give the tuple one extra bit that says what? That at this point it is deleted, so if your timestamp is before that, you're done. Yeah, he's right, you may actually be able to do that — I'd have to think about it, though. I think what we do is give you back multiple pointers. Oh, we don't do that at all? No, there are duplicate keys, and updates to the index happen after you delete. But you're not updating the attributes — you deleted the entry. Like, we didn't update the key, we just updated the tuple somewhere else, and then it gets deleted, marked as deleted, and then the other transaction inserts; I thought we added another index entry. Anyway, the details are not important — you just need to understand that this issue can occur, right? Because this transaction wants to do a read, and we have to make sure it gets the right version.
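Here is a minimal sketch of the workaround described above — a "unique" index that maps a key to a list of version chain heads, one per disjoint snapshot the key existed in, and a lookup that checks each chain for the version visible to the reader. All names and the use of a hash map are assumptions for illustration.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Version {
  uint64_t begin_ts;
  uint64_t end_ts;     // delete/overwrite timestamp, or "infinity" for the live version
  Version* next;       // next-older version
};

// key -> all version chain heads ever created for that key
std::unordered_map<int64_t, std::vector<Version*>> unique_index;

Version* Lookup(int64_t key, uint64_t read_ts) {
  auto it = unique_index.find(key);
  if (it == unique_index.end()) return nullptr;
  for (Version* head : it->second) {           // one chain per snapshot the key existed in
    for (Version* v = head; v != nullptr; v = v->next) {
      if (read_ts >= v->begin_ts && read_ts < v->end_ts) return v;  // visible to us
    }
  }
  return nullptr;                              // the key does not exist in our snapshot
}
```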
All right, we're almost out of time, so I'm gonna skip ahead and just get to the evaluation real quickly. So again, this came out of this paper. Peloton is dead because it had a bunch of other problems, which we'll cover many times this semester. But the main experiment was basically us trying to bake off the different design decisions, and the goal was, whatever turned out to be the best approach, we would keep that in Peloton. This pissed me off because we did not do that, right? If you look at this graph here — I'll come back to this — this is the main graph to understand. This is scaling up the number of threads running TPC-C. What we did was implement, based on this previous table, the configurations of all these different systems; for each one, we configured our system to match what the real system actually does. And Oracle, NuoDB, and HyPer do it the best way, while Postgres does it the worst way. And note that Postgres and NuoDB are actually using the same concurrency control protocol — it's their version storage, garbage collection, and index management that are different. Those other things actually matter more. So I was like, fantastic: Oracle, NuoDB, and HyPer are the best, let's do it the way they did it. The student who was writing the paper just left us with this, went back to Singapore, and didn't actually finish it. So after all this work, we ended up in Peloton with the worst way of doing it, not the best way. The new system does it the best way, right? So to go back to this earlier slide — this is in the paper — it's just showing you that all these different systems do all these things differently. And at the end of the day, I think the main takeaway was that version storage and index management turned out to matter the most, and concurrency control not so much. Now, I'm not saying it's because of our paper, but I think PostgreSQL recognized that they had some issues. There's a blog article that came out in 2018 — I think it links to our paper — where they basically said, the way we're doing append-only storage kind of sucks, because you have this vacuum and it's a lot of overhead. And this guy is one of the main developers of PostgreSQL now. So they talk about how they're gonna switch to delta storage in newer versions — I don't know whether it's coming out in version 13 or 14 — it's called zheap. And this is something they want to get to. So, all right, we're well over time, but let me briefly talk about Project One before you guys run. I'll do it now and then we'll release it later today, okay? Project One: the idea here is to get you comfortable with the system. We're asking you to work on one very specific part, and that's parallel, or concurrent, sequential scans. The task itself is not overly difficult, and we'll have different levels of complexity that you can try to achieve. It's mostly to teach you how to actually work in the system and do profiling and performance measurements, okay? Matt covered this yesterday — here's what our system does, blah, blah, blah. So we're gonna provide you with a bunch of C++ benchmarks written with Google Benchmark, but there's one in particular, the concurrent slot iterator microbenchmark, that has the bottleneck you're gonna investigate. Basically, when you do a scan in the system, you get an iterator and it has a latch on it. So as you start scaling up and adding more threads, they all try to acquire the same latch and it becomes a bottleneck. The idea is that you'll learn how to use perf and Callgrind to understand what the main contention points in the system are, and then you try to fix them. So it's a small, focused project, but it'll expose you to the full system. And you want to try out different workloads, thread counts, and access patterns beyond just the microbenchmark we're providing you, okay?
So the way we're gonna do grading is that your score will be based on how much faster you are than our implementation. We will have a basic implementation that tries to improve this, so we'll have the before and after, and your score will be based on how much faster you can get over what Matt writes, okay? Matt talked about this yesterday too: you have to run clang-format and clang-tidy and make sure your code is all clean. You wanna use the Google sanitizer stuff — you'll get this automatically, but make sure you don't have any memory leaks. We talked about this before: it runs on Ubuntu 18.04 or newer and on OS X, and you can also use a VM or Docker. This is CMU, so I assume everyone has access to a machine to do development locally; if not, email me and we can fix that. The important thing, though, is that grading isn't just gonna be you submitting on Gradescope and it running a smoke test to see whether your thing compiles. You're not gonna be able to identify this bottleneck if you try to run it on a machine with fewer than eight cores, and most of your laptops have four cores, or eight cores or less. So what we're gonna give you is 50 bucks on Amazon, so you can go get one of these c5.9xlarge instances, which has I think 32 or 36 cores, and do your analysis on that machine, because that's how you'll be able to identify the bottleneck, okay? So everyone gets 50 bucks. If you get the on-demand instance, it's about $1.50 an hour. If you get a spot instance — which means you're saying, hey, these machines are idle, I'll use them; they can be taken away from you at any time, but you pay a fraction of the price — I encourage you to use that, because if you blow through your 50 bucks, I can't reimburse you, okay? I'll send an email out to everyone who's enrolled in the class with your Amazon credit code. You do need a credit card to sign up for EC2 or AWS; if that's a problem, let me know and we'll figure something out, right? But anyway, all right. So this will be due on the 16th. We'll post it later today. And then next class, we'll discuss more MVCC implementation details, okay? All right, guys, see you.