This is the third lecture now we're going to do in the class, and we're focusing on concurrency control. As I said in the first lecture, we are focusing on transaction processing first, because the idea there is we want to ingest data, and then be able to process it later on. So that's why we're focusing on the transactional side of things. I do have one correction that I want to make from last class. So if you watched all the way through two hours of the lecture, you would have seen the bit where I talked about isolation levels. I made a mistake saying that the original specification of the isolation levels in the SQL standard from 1992 was based entirely on the assumption that you're running a two-phase locking database management system. Then later on I said that there was this paper that came along called "A Critique of ANSI SQL Isolation Levels" that basically lays out the two different anomalies that we talked about: the lost-update anomaly under cursor stability, and the write-skew anomaly under snapshot isolation. As I said, this is a very famous paper that came out in 1995. Well, the dude that actually wrote it sent me an email yesterday and said I was completely wrong. He's got this little bit here where he says that the SQL standard wasn't actually written assuming that you're running under two-phase locking. It was just written based on the anomalies that we talked about. Then he talked about the guy that wrote the specification and the standard. He gave this guy an early copy of what the spec was. He was too busy, didn't read it thoroughly, and then the lost-update anomaly ended up in the standard. It's only later on, when Microsoft was coming out with SQL Server and claiming that they were serializable, that he was like, no, no, no, you're not serializable because you're susceptible to the lost-update anomaly.
Then they finally got together with the guy that wrote the standard and some other famous database people to actually write this paper and say, here are all the issues with the standard. So this little snippet here where he talks about the story of how this all happened — it's amazing, because he starts name-dropping all these authors in here. Some of these you may not have heard of, but Pat O'Neil did a lot of early work on log-structured merge trees. So if you know LevelDB, Cassandra, RocksDB, that's all based on his early work. Jim Gray won the Turing Award for databases in the 1990s. He invented two-phase locking when he was at IBM on System R. Phil Bernstein invented the first distributed database, called SDD-1, in 1978 or something like that. So this email is amazing. This is like, if you're really into the p****r and then you find the chronology from someone like that, or if you're really into Fifty Shades of Grey, whatever. But to me, this is amazing, and again, this is why I love the internet. We just put this on YouTube and this random dude just sends me, no, no, here's what really happened. So I think it's amazing, and he watched all two hours of it. So that was awesome. So today's class, as I said, we're going to focus on in-memory multi-version concurrency control. I'll go through what multi-version concurrency control is, but then we'll go through the actual design decisions, which are in the paper that you guys were assigned to read. Again, the reason why we're focusing on this is because pretty much every modern database management system, whether it's actually in-memory or on disk, that's been created in the last 10 years is using some variant of MVCC. So it's worth it for us to actually think about this and understand it better. So, multi-version concurrency control — if you took the intro class last semester, some of this will be review.
But the basic idea is that the database management system is going to maintain different physical versions for any single object in the database that gets modified. I'm being vague here when I say object because, again, these protocols don't really care whether you're talking about tuples, single attributes, tables, or batches of tuples. For our purposes, we're going to focus on just tuples, but the more formal way is just to say an object in the database. So the difference between the physical and the logical is: you'll have a single logical record, like I insert a tuple into the student table with just your name. That's a logical record. But anytime I go and modify the information in your tuple, underneath the covers the database system is going to maintain different physical versions. So again, the way to think about this: anytime I do an update on a tuple, underneath the covers it creates a new version for that, and how it actually creates it, where it actually stores it — those are all the things we're going to talk about today. So what happens now, now that we have a single logical object with multiple physical versions, when I do a query and say, read that object? The database system has got to figure out what's the correct version that I need to see. This is what we're going to be focusing on today, and a lot next class: how it's going to figure out, for a given timestamp, what should be visible to me and where to go find the version that I want. So MVCC is old. MVCC was first proposed back in 1978 in a very famous dissertation done at MIT. The first commercial implementation in an actual real database system was done at DEC in the early 1980s in a system called Interbase. Later on, the guy that worked on Interbase, Jim Starkey, went off to try to build a replacement for InnoDB called Falcon for MySQL; that failed.
Then he went off and built NuoDB, which is a NewSQL startup in Boston. So Interbase is really old. It actually still exists, but it sort of exists in two forms. DEC got bought by Compaq, but I think before then they sold off Interbase to Borland, the compiler company, if you've ever heard of them, and then Borland open-sourced it as Firebird. But there still is the commercial version of Interbase, which has been rebranded as a mobile-phone database. So it still exists; Firebird is probably the most up-to-date version that you would want to use. If you ever wondered why Firefox is called Firefox: when Netscape went under and they went to open-source the web browser, they were going to call it Firebird, but they couldn't because this database system existed. So they had to rename it to Firefox. So the key thing here is that MVCC is used in pretty much every single database system I know about that's been built in the last 10 years, with some minor exceptions, and so this is the way people build modern systems and we should understand it very carefully. So why use MVCC? What benefits do you get? Why are people actually choosing to implement this? The biggest benefit you're going to get is that writers don't block readers, and readers in some cases don't block writers, but the first one's most important. So think of two-phase locking. If I want to write to an object and it's a single-version system, I take the exclusive lock on that object and now nobody else can get a shared lock and read it. But if I actually don't need to see the latest version — if my transaction is running in the past from your transaction in logical time — then I can read the older version and not interfere with you. So the way we're going to do that is that we have transactions, whether they're read-only or not, and we can talk about what that means in a second.
Every transaction will read a consistent snapshot of the database that existed at the time that transaction started, because you look and see: I know my timestamp when I started, I know what versions of tuples in the database should be visible to me, and the database system figures out which ones you actually can go see. The other advantage you get from MVCC is that you can support what are called time-travel queries. So if you don't do any garbage collection — if you just keep all versions that ever existed in your database around — some systems allow you to write queries that say, go get me this tuple as it existed two years ago, or go do a scan on this table as it existed five years ago or one week ago. These are called time-travel queries. Time-travel queries are primarily only used, as far as I know, in the financial industry, where you care about having the last seven years of all the transactions in your bank for regulatory reasons. Most applications don't actually need this because it's done at the application level. It's only really in the banks that I see people actually using the explicit time-travel feature of MVCC. A lot of systems claim they support this, but again, I haven't really seen a compelling reason, other than the banks, why people want to do this. This is not a new idea, either — some systems tell you, look how great we are, we do time-travel queries, look how novel we are. Postgres did this in the 1980s. The first version of Postgres had time-travel queries, because all you do is not implement garbage collection and you get it for free. It wasn't until Postgres actually started being used outside of academia, like in 1998, 1999, when they forked it from the Berkeley version and added support for SQL, that they realized, well, if you don't have garbage collection, you're going to run out of space really quickly, so they got rid of this.
Only recently, I think, did Postgres add back support for explicit time-travel queries. It's a combination of turning off garbage collection and exposing some extra commands to you in SQL. So the key thing I hope you got out of the paper that I assigned you guys to read is that MVCC, despite concurrency control being in the name of what we're talking about, is more than just a concurrency control protocol. If you want to support multi-versioning in your database system, it's going to affect the entire architecture of the system. There's a bunch of design decisions we're going to have to deal with in order to actually implement this efficiently and correctly, and we have to think about them as we go along. So again, that's what the focus is on today, but at a high level, you should be able to understand what's going on with snapshot isolation, because that's really important here. So again, snapshot isolation: what happens when a transaction starts is that it gets access to, or a view of, a consistent snapshot of the database that existed at the time it started. The key word that I'm pointing out here is consistent, and that just means that in our snapshot for our transaction, even though there may be other transactions running at the moment we started, and they may have modified the database, we will not see their changes, because those transactions had not committed at the moment our transaction started. So a consistent snapshot means we're only going to see changes from transactions that committed before we started. When I say snapshot, I don't mean this in a physical sense, like I'm going to make a copy of the database, put it over here, and then do lookups and make modifications on that. This is all done underneath the covers, and we try to do the minimal amount of copying possible.
So it's not like literally making a copy of files or copies of blocks in memory; we're just using this version information that we're maintaining to say, what is the actual correct snapshot we should be seeing. Another way to describe it is to say that you're not going to see modifications from uncommitted transactions. These are called torn writes. So if my transaction updates two tuples, I should either see all of them or none of them. I don't want to see just one of them. That would be a torn write. Now, under these snapshots, and we'll see this as we go along, if two transactions try to update the same object at the same time, then the simple thing is we just let the first writer win. The second guy will abort and get rolled back. So with MVCC, we're essentially going to get snapshot isolation for "free". Free is in quotes because this is a bit of a tautology: if you want to get snapshot isolation, you implement MVCC; if you implement MVCC, then you get snapshot isolation. It's not that you couldn't do something different with MVCC — you could do read uncommitted if you wanted. But essentially, if you just follow the standard protocols that we'll talk about, you'll get snapshot isolation. If you want serializable isolation — if you want to avoid phantoms or the write-skew anomaly — then you have to do some extra stuff. For this lecture and the paper you guys read, we're not doing that extra stuff, except for the certifier thing, which we can ignore for now. Next class, we'll see how we actually add back the extra stuff, and I'll describe what the extra stuff is, in order to get serializable isolation. So any questions about what MVCC is at a high level and what snapshot isolation is? Again, I get excited when we talk about databases; tell me to stop and slow down if I'm going too fast. So what are the four design decisions we're going to focus on today?
There's the concurrency control protocol, the coordination method we're going to use to keep track of what transactions are allowed to read what. But then we're also going to talk about how to do version storage, which turns out to be the most important thing, garbage collection, and index management. Next class, we're going to focus more on modern implementations of the concurrency control protocol for MVCC. Then the class after that, we're going to go into more detail on how to do different types of garbage collection. We're not going to talk too much about version storage and index management beyond this lecture, so we'll focus on those today. So the paper I had you guys read was actually something that came out of this course, something that I wrote with some students here. I'm still kind of a new professor, but I've done enough papers now that I can start to pick favorites. You're not supposed to pick your favorite children if you have kids; you're not supposed to pick your favorite papers. This is actually one of my favorite papers that I've been involved in. I love it so much that when we submitted it to get reviewed at VLDB, we put the title of the paper as "This Is the Best Paper Ever on In-Memory Multi-Version Concurrency Control". And I believed it then and I believe it now. So the paper got great reviews — the reviewers really liked it — except that when we got back the comments, the very first thing they wrote was: remove "this is the best paper ever" from the title and try to be more scientific, right? So that was sort of disappointing, but I could sort of see how they would say that you can't call your paper "this is the best paper ever". So we came back with a second title: "If You Only Read One Empirical Evaluation Paper on In-Memory Multi-Version Concurrency Control, Make It This One", exclamation point. That didn't fly either. They came back and said that this is subjective — you're assuming that people should only read yours, and you can't do that.
You have to be more grounded in the facts about what the paper's trying to prove. So then the third title was "We Think You'll Really Enjoy This Empirical Evaluation Paper on In-Memory Multi-Version Concurrency Control". So at this point, the paper got — sorry, it got accepted, but they were starting to get a little pissed off, and the program chair was like, look, either you change the paper title or we're rejecting it. And I don't have tenure, so I needed this paper, the students needed the paper. So we ended up with the boring one that you guys read. But this is the true paper. The true title is "This Is the Best Paper Ever", but it has to be referenced by the, uh, correct title. All right, so again, four design decisions: concurrency control protocol, version storage, garbage collection, index management. Let's start with the first one, concurrency control. So last class I went through, sort of at a high level, three different concurrency control protocols: timestamp ordering, two-phase locking, and OCC, optimistic concurrency control. All those same protocols can now be applied in a multi-version environment. And we can adapt them to work in an in-memory environment. At a high level, they're gonna work exactly the same. So if you understand how we did this in a single-version system, which we talked about last class, then you can understand somewhat easily how we're gonna do this in a multi-version environment. I'm only gonna focus on the timestamp ordering and two-phase locking protocols; OCC is basically timestamp ordering, but just with private workspaces. The key thing I'll point out also is that in the thousand-core "Staring into the Abyss" paper I had you guys read for the previous class, we talk about using MVCC, but we didn't label it as being MVCC with 2PL, OCC, or timestamp ordering. For that one, we were actually following what was done in the original 1978 dissertation, which was MVCC with timestamp ordering.
But again, as we'll see when we talk about what real systems actually implement, they're doing a combination of all of these things. So again, there's no one standard MVCC algorithm. Even though systems will claim they use MVCC, you need to sort of drill down to understand which of these three they're actually doing. So we're gonna go through the first one and the last one, but before we get into that, we wanna understand what we actually need to store in our tuples to keep track of the different versions. This is unlike in a disk-based system, and this will come up mostly when we talk about two-phase locking. In disk-based systems, sometimes they'll store the metadata about tuples separately from the actual tuples themselves. So in two-phase locking, you have shared locks and exclusive locks. They will actually store those in a separate hash table, separate from the actual data itself — the tuples — because they don't want the lock table to get flushed out or written out to disk when you run out of space and the tuple gets written out. Now in an in-memory database, the data's always gonna be in memory, so we don't have to worry about that anymore. So we can actually get rid of some of these additional data structures and just inline the metadata about tuples with the tuple itself. So in this case here, all of the tuples are gonna have some kind of unique transaction identifier, and for our purposes here for MVCC, this is just some timestamp. It can either be a physical timestamp, like a wall-clock time; a logical clock, like a counter; or a physiological clock, which is a combination of wall-clock time and a logical counter. Then we're gonna have our begin and end timestamps, and we're gonna use these under MVCC to figure out what the version lifetime is for this particular tuple. So we're gonna use this to figure out, for my transaction, does this particular physical version of a tuple exist in my consistent snapshot, based on what my timestamp is?
Then we're gonna have a pointer that points to either the next or previous tuple in our version chain for a single logical tuple. So again, for every single logical tuple, or logical record, I can have multiple physical versions, and I'm gonna use this pointer field — it's a 64-bit pointer — to figure out where I need to go next if I'm traversing that chain. We'll see this when we talk about version chains in a second, but this is gonna be a single direction, a singly linked list. We're not gonna do a doubly linked list, because in order to efficiently make modifications to the linked list, we want one 64-bit field so we can compare-and-swap on it. And then there's a bunch of additional metadata; depending on what protocol we're using, we may have to store some extra crap in here — again, it varies per protocol. So the one thing I'd like to point out here is that this doesn't really seem like a lot, but relative to how big your tuple could be, it is a lot. I'm storing here — these are all 64-bit integers; you can ignore the additional metadata — four eight-byte integers per tuple. So if I have a billion tuples, then just this metadata is gonna be 32 gigabytes. And think about it too: if my data is a single 16-bit integer, then I'm storing 32 bytes per tuple just to store those 16 bits. So this overhead is not trivial, but we have to do this, right? This is the classic trade-off between storage and compute in computer science, and we're gonna see it all throughout the semester. We could store this stuff separately and do some clever tricks to reduce the amount of data we have to store for each individual tuple — like we could store these things in block headers instead of on a per-tuple basis; there's a bunch of stuff we could do to alleviate or reduce this overhead — but that would require more instructions, more computational power, to process the data as we run transactions.
And no system's gonna make that trade-off. You want your transactions to run as quickly as possible, so we just pay the extra penalty of storing data like this, right? And we'll focus on how to maybe compress the size of the actual data itself later on. All right, so let's look at how to do multi-version timestamp ordering. The first thing to point out here is that, again, we have our transaction ID, and say in this particular instance there was some transaction at timestamp one that inserted these two tuples, A and B. For this column here, the version — this is just for illustration purposes; it's not something you would actually store in the real system. I'm just showing you that there's object A, object B, and here's version one for each of those guys. So we have the transaction ID, right? And we're gonna use this to keep track of whether a transaction is actively modifying our tuple. When it's zero, it means nobody's working on it. If it's some other number, then someone holds the latch on it. Then we have the begin and end timestamps. So again, some transaction that came before and inserted these two tuples had timestamp one. So the begin timestamp for each of these tuples is one, and the end timestamp is infinity. And that means this is the newest version of these two particular tuples, and there's no other version that comes after it — because infinity just goes forever. And then we'll have the read timestamp, which we'll use to keep track of the timestamp of the last transaction that read this tuple; this is under the basic T/O protocol. We're gonna use it to figure out whether a transaction in the future already read the old version while we, logically in the past, are trying to write a new one — a write that the guy in the future would have missed. We can use the read timestamp to check for that.
All right, so say our transaction comes along, running on one thread, and we're gonna assign it timestamp 10, right? In the first operation, we wanna do a read on A. We're gonna allow this transaction to read A if the transaction ID is zero, meaning nobody holds the latch on it, and our timestamp is within the range defined by the begin and end timestamps. Our timestamp is 10; begin is one and end is infinity, so 10 is in that range. Therefore we're allowed to read it. So then we go ahead and update the read timestamp to say it's now 10, because we read it. Now we're gonna do a write on B, and what's gonna happen is: if nobody holds the latch for this tuple — which again is indicated by the transaction ID being zero — and our timestamp is greater than the read timestamp — meaning nobody in the future has read the old version before we install the one we're trying to write now — then we do a compare-and-swap on the transaction ID to set it to 10, because that's our timestamp. And that essentially gives us a latch on this entire tuple. So nobody else can read it and nobody else can write to it while we're doing this. Does everyone know what compare-and-swap is, or no? Okay, cool. I can now install my new version, and for this one I set the begin timestamp to be 10, because that's what my timestamp, or my transaction ID, is, and the end timestamp goes to infinity. I don't have to set the transaction ID to 10 here, because I'm implicitly getting the latch on this as well: since nobody can read the old version while I'm latching it, they can't follow the version pointer to get to this next version. So they wouldn't be able to see it either. So now I go back to the older version, right, B1.
I change its end timestamp to be 10, which is what mine is, and then I do a compare-and-swap on each of these guys, set them back to zero, and that releases the latches and now the new version is installed. So is this clear? What's missing? Yeah, go ahead. [So in the read case, you don't grab the lock, the transaction lock?] His question is, in the read case, do I not grab the lock? Correct, yes. So his question is, do I need to acquire the latch on the object when I do the read? The answer is no, because if I go back here — I want to do my read, so all I need to do is follow the version chain. I find that this version is in my range, and I flip the read timestamp to be 10, my timestamp. So now if another transaction comes along and their timestamp is in the past — say they have timestamp five — they would come here, see that I read it in the future before they installed their update, and so they would have to abort, because they wouldn't be allowed to do that. If a transaction comes along with timestamp 20, then who cares, right? I can still read this. So it's still all done transactionally and correctly. Any other questions? So at this point here, what's one big problem that's missing? What's one piece of information that we're not conveying in this? How do we know that the transaction committed, right? Because there's nothing in here, right? I installed my timestamps, anybody can come along and read this, but there's nothing in here that says, oh yeah, you just read a version created by transaction ID 10, he committed, yes, you're allowed to do that. So what's missing is an auxiliary data structure, and again, we'll see this next class when we talk about Hekaton and other systems. You need some other way — a global data structure — to be able to say, yes, you're reading data from a transaction that committed.
Or you need an extra field or some way to specially identify that, yes, the version you're reading is from a transaction that has committed. One system, the way they do it, is they'll use an extra bit that says whether this was installed by a transaction that committed versus one that's still running — it's the same timestamp either way, but the bit tells you whether it's a committed timestamp or a still-running one. So you can store that in there. But that doesn't solve all your problems; you still have to have some kind of global data structure to figure out what actually happened to the transaction. All right, so let's look at two-phase locking. We get rid of the read timestamp now, and instead we're gonna have a read count. The read count we're gonna use essentially as the shared lock, to keep track of how many transactions hold the shared lock on the tuple we're trying to read. And we're now also gonna use a combination of the transaction ID together with the read count to represent the exclusive lock. So if the transaction ID is not set to zero — if it's something other than zero — then someone holds the exclusive lock for this tuple while they're modifying it. So let's look at an example. I wanna do my read on A. Same thing: I check the transaction ID to see whether it's zero. If yes, then I know no one has the exclusive lock. Then I do a compare-and-swap here to flip the read count to one. Now, in the paper, they talk about how, if you wanna have this done atomically with a single compare-and-swap, then this has to be 32 bits and this has to be 32 bits — because together they have to fit in 64 bits. So this assumes you're allowing your transaction ID to be 32 bits, whereas in all of the other protocols we talked about, they're all 64 bits. I don't know whether the newer versions of the Xeons have 128-bit compare-and-swap instructions. I should check that.
But that's a minor detail. When you flip this one, you're also making sure that this is still zero, and you wanna do that with a single compare-and-swap, because otherwise you'd have to take a latch on the whole thing, which is expensive. So this gives me the shared lock, and I can do whatever read I wanna do. Now when I wanna do my write, same thing: I check that both the read count and the transaction ID are zero, then do a compare-and-swap to set both of them, and now I hold the exclusive lock. And now I can go ahead and install whatever new version I want. When I commit, I go back and flip that to be my timestamp, and when I'm done, I flip it all back to zero, and now it's been fully installed. So again, the big difference here is that we don't have a separate lock table. If you took the intro class last semester, you built a lock manager, and you used these locks and a waits-for graph to figure out who's waiting for what. We have none of that, right? We have no global data structures to figure out who holds what locks. We just embed all of that inside our tuples. Now, this works great if you're doing deadlock prevention. If you wanna do deadlock detection, then you do need a global data structure, because you need to see who's waiting for what lock, to build that waits-for graph and figure out where to make cuts. That's why most systems doing in-memory multi-versioning do deadlock prevention, not deadlock detection. Okay, so my comment here about the 32-bit integer transaction IDs is actually an important problem. This wasn't covered in the paper; it's something that we realized afterwards.
And that is: if you reach the max value for the timestamps that you're assigning to transactions, you're gonna have a bad time, you're gonna have problems, because all sorts of weird shit starts to happen that you wouldn't expect, right? So let's say that I have a transaction here and we have 32-bit integer transaction IDs. Its transaction ID is 2^31 − 1, the largest number you can have for a signed 32-bit integer. I do a write on A; that's fine. I flip my transaction ID to be what my timestamp is. I update all my end timestamps, and the new version correctly conveys that I've installed a new version that goes from 2^31 − 1 to infinity. But now another transaction comes along, and now we wrap around our transaction IDs and go to one, right? Because it can't be zero — we use that to denote that no one holds the latch. So we go to one. But now I wanna do a write on A, and I've got this weird situation where I've installed a new version that's valid from one to infinity, but my homeboy up here is valid from 2^31 − 1 to one, which is not correct, right? So wraparound causes all sorts of problems. It'll affect things like: rows you thought were deleted now become visible, rows end up reverting back to the state they were in before. This can mess up all sorts of things, like referential integrity and foreign-key constraints. This is really bad if you allow it to happen without taking some extra steps to prevent it. And the way Postgres does it, I think, is the simple and in some ways elegant way to handle this: they just have an additional flag in the tuple header that denotes that a version is considered frozen. What that means is that if that bit is set, then no matter what timestamp you have, when you do a comparison, the version you're looking at is always older than you, right?
So again, the issue here is that if these two transactions are still running, this guy has transaction ID one, this one has 2^32 minus one. One is less than 2^32 minus one, but the transaction with ID one is newer, right? In logical time, it should come after the other one. But if we just wrap around, then we have that problem. So again, what Postgres does is say: I scan through all my tuples before I wrap around, and I set that frozen bit. And so now anytime I do a lookup and ask, what version do I want? Is this version older than me or not? Is this version within my snapshot? I use that bit to figure out whether something is visible to me or not, right? And so the way they do this is that when they're getting close to wrapping around, the vacuum runs and it goes through and just flips that bit for everyone. And at some point when you wrap around, everything's completely frozen. So if you look up "transaction wraparound problem Postgres", you'll see people talking about how they turned off the vacuum for some reason, they start getting up to that 2^32 limit, and then all of a sudden the system has to halt because now they gotta go through everything manually and flip that frozen bit. Postgres will do this in the background for you, so it shouldn't really be a problem. But if you have a lot of churn, a lot of versions, you could hit this. And again, setting the frozen bit makes this problem go away. So now in our system, yes? The question is, when you're running the vacuum, do you have to make sure that nobody's using the version that you're freezing? Yeah, so what you basically do is: you know what active transactions exist, and so you know what the lower bound is for transaction IDs for anything that could be visible.
So anything less than that, you can set the frozen bit. Anything that existed before you wrap around, this still handles. If you have open transactions when you wrap around, I think they have to stop and do some extra stuff. But if you just have a bunch of old crap, then the frozen bit solves that problem. Yes? The question is, what happens if you wrap around twice in a row? You have versions frozen from the first time you passed 2^32, then you wrap around again and get a new batch. Why is everything in the database still correct? You just wrap around again, right? Once it's marked frozen, it's always marked frozen. Yeah, but how do you compare two frozen things from different intervals? So the question is, how do you compare two frozen things from different wraparound intervals? Again, you don't need to do comparisons between two frozen version timestamps. It's: I'm an active transaction, what versions can I see? Right? And the version chain would also give you the chronological order, because the pointers go in the order the versions were added. Yes? How do you do time travel queries then? So this question is, how do you do time travel queries under this environment? I don't think they actually support it under this. The way you could fix this would be an easy fix: instead of having a single bit, either frozen or not frozen, you have epochs. You have another counter that says how many times you've wrapped around, and that would solve the problem. Yes? How do you prevent a transaction from writing a tuple if that tuple has been written by another transaction in the future? So this question is, how do you prevent one transaction from writing to a tuple that's been written by another transaction, given that there's no read timestamp? Under which one?
Under two-phase locking or which one? Two-phase locking. Oh, or timestamp ordering? Right. Okay, so let's go here. The question is, if this guy writes to this tuple, how do I prevent somebody else that's in the past, or in the future, from coming along and writing to the same tuple? Because a transaction writing a tuple has no read timestamp to look at, so it doesn't know whether this tuple has been read by another transaction in the future. Well, you do know that, because that's what the begin and end timestamps give you. So say I'm here, I installed this update. I have B1 and B2. B1 is valid from 1 to 10, B2 is valid from 10 to infinity. Now I have a transaction in the future, say with timestamp 15, and it wants to write to B. So what will happen is, I go to the version chain, I see that B2 is the latest version, then I just go through the same process as before where I flip that field, now I hold the latch for it, I can install a new version, and I'm fine. If instead I'm in the past, say my timestamp is 5, then I come along and see B1 goes from 1 to 10. Oh, well, that means there's another version that comes after timestamp 10, from a transaction that is in the future relative to me. Even though physically they got to it before I did, logically they're in the future, and my timestamp is less than theirs. Therefore I have to abort. First writer wins. I have to abort, I can't complete. But what if another transaction, with timestamp 11, has already read B1? So the question is, what if a transaction comes along with timestamp 11 before we install B2, so at this point here, and it flips the read timestamp to 11? Right, and that's MV-TO, not two-phase locking. In two-phase locking, you have the read counter instead. But the read counter gets set back to 0 after that transaction. Yeah, but who cares? You commit and you're done. Yeah, sorry. So I understand your question. This is what I was saying.
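As an aside, the first-writer-wins rule walked through above can be sketched as follows. This is my simplification, not the full protocol: I collapse the latch-acquisition step into a single check against the chain head, and the `Version` class and `try_write` names are illustrative.

```python
# Sketch of first-writer-wins on a version chain with begin/end timestamps.
# A writer whose timestamp is behind the latest version's begin timestamp is
# logically in the past of an already-installed write, so it must abort.

class Version:
    def __init__(self, value, begin, end):
        self.value, self.begin, self.end = value, begin, end

INF = float("inf")
chain = [Version("b1", 1, 10), Version("b2", 10, INF)]  # b2 is the latest

def try_write(chain, txn_ts, new_value):
    latest = chain[-1]
    if txn_ts < latest.begin:
        return False          # a logically-newer txn already wrote: abort
    latest.end = txn_ts       # close out the old version's lifetime
    chain.append(Version(new_value, txn_ts, INF))
    return True

assert not try_write(chain, 5, "b-old")  # ts 5 is behind b2's begin=10: abort
assert try_write(chain, 15, "b3")        # ts 15 installs the new version
assert chain[-1].begin == 15 and chain[-2].end == 15
```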
You need some global state to say: if I'm in the past, and this transaction has committed and it read something, you may have to stall and wait for the writing transaction to finish before the reading transaction finishes. There's some extra metadata that has to be handled here, which we'll cover next class. God damn, I love transactions. Okay. So any questions about timestamp ordering, two-phase locking, or the wraparound? Again, MV-OCC is essentially very similar to timestamp ordering; you just do the validation at the end rather than as you go along. Okay. So now we can talk about version storage. So again, we're going to use the 64-bit pointer field in each tuple to build a latch-free version chain, a linked list, of all the versions that belong to a single logical tuple. What this is going to do is allow us to quickly jump to the version that we want for a particular tuple. We could just scan the entire table until we find what we want, but that would be stupid. We don't want to do that. So we want to jump to what we know is the version chain for our logical tuple and then scan along the pointers until we find the version we want. So the indexes are always going to point to what we're calling the head of the linked list, the head of the chain. Now, whether the head is the oldest version or the newest version depends on how it's actually implemented, and you can go both ways. One important thing to point out, too, is that in order to get good performance, all these different versions that we're going to generate will be stored in what are called local memory regions, or local memory pools, for each thread. So don't think of this as: I'm going to malloc some giant block of memory, and then every thread tries to get a slice of that. Every thread is going to do a malloc into its own local storage.
It's going to be close to where the thread is physically located; on a NUMA system it would be on the same DIMMs that are local to your socket. And that way, we can make new versions quickly without worrying about synchronizing on some global data structure. Again, that's going to be a recurring theme throughout the entire semester: we want no global synchronization, or the minimal amount of global synchronization, because that's always going to be a bottleneck in a highly parallel environment. Of course, this now means that when I have to scan through my version chain, I may be jumping around to different memory pools on different sockets. But depending on what your application or your query is actually doing, if the head of the version chain is always the newest version, then you're really only going to take maybe one or two hops in the linked list. You're not going to be scanning a ton of data to find the version you want. So physically, the memory region determines where in the hardware we're going to store our data. Up above, inside our software, the different storage schemes will each determine where and what we're going to store for each particular version that we generate. So we'll go through the three examples one by one. The first example is append-only storage, which is just where we create new versions and add them to the table space. Time-travel storage is a variant of this: we'll have essentially two tables, and we'll store old versions in one of them. And then delta storage is where, instead of storing complete copies of tuples, we just store the delta of the changes that we make. So append-only: this is the easiest way to implement multi-version concurrency control. We have a single table space, a single logical table, and all the different versions are intermixed in it.
Although underneath the covers, again, physically the memory could be stored on different sockets, different NUMA regions. So what happens is, every single time we do an update, we just make a copy of the tuple we're modifying into a new tuple slot inside our table and apply the change there. So say our transaction wants to update A1 here. We'll make a new copy, A2, with the new value we want to write into it. And then we update the pointer chain to now point to it. So think of this as the head of the version chain; in this case the head is actually the oldest version, A0. So if I want to get to A2, I follow the chain from A0 to A1 to A2, down there. That ordering is called oldest-to-newest. You can also go newest-to-oldest. And these are gonna have different trade-offs in terms of performance, and it depends on what your application is actually doing. So if most of your queries are time-travel queries, where you always want to say, find me the older version, then oldest-to-newest is obviously gonna be faster, because you don't have to scan the entire version chain to find the version you want. A lot of applications have hotspots in the workload, where most of the queries access a small portion of the database or some subset of the tuples. So most of your logical tuples are not gonna have very long version chains, but for the ones that are hot, if they're being updated all the time, then these version chains can get quite long. Newest-to-oldest will be faster if you always want to find the newest version on lookups. But the downside is that every single time you do an update, you have to go change all your indexes to point to the new head of the version chain. With oldest-to-newest, you only have to do this when garbage collection compacts the version chain.
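The trade-off between the two chain orderings can be sketched with a toy cost model. This is purely illustrative (the `simulate` function and its cost accounting are mine): it just counts index re-pointing on writes versus chain hops to reach the newest version on reads.

```python
# Sketch contrasting the two chain orderings for append-only storage: with
# oldest-to-newest the index never moves but reads walk the chain to find
# the newest version; with newest-to-oldest reads are one hop but every
# update also has to re-point the index at the new head.

def simulate(ordering, n_updates):
    index_updates = 0
    chain = ["A0"]                      # the head is chain[0]
    for i in range(1, n_updates + 1):
        if ordering == "newest-to-oldest":
            chain.insert(0, f"A{i}")    # new version becomes the head
            index_updates += 1          # index must point at the new head
        else:
            chain.append(f"A{i}")       # head stays the oldest version
    # hops needed to reach the newest version starting from the head:
    read_hops = 0 if ordering == "newest-to-oldest" else len(chain) - 1
    return index_updates, read_hops

assert simulate("oldest-to-newest", 2) == (0, 2)
assert simulate("newest-to-oldest", 2) == (2, 0)
```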
In this case here, under append-only storage, if you're always adding new versions and the new version is the head of the version chain, then every single time I do an update, I have to update all my indexes to point to my new version. And that can be really expensive. We'll see in a second how that was actually problematic for some famous company. The next type of version storage is time-travel storage. The idea here is that we have our main table, which holds the master version, the latest version of our tuple. And then we have some separate auxiliary table that has the same schema, essentially the same layout, as the main table. But any time we modify a tuple here, we just append the old version over there. So, say I update A2: I'll make a copy of A2 into the time-travel table, update the pointers to now point to that tuple there, and then I'll overwrite the master version to now be A3. And then I can update my pointer to point to A2. So it goes A3, A2, A1. So, can anyone think of a really simple advantage of doing something like this over the single table? Yes? He says that since the main table always holds the latest version, when you garbage collect, you can just erase your entire time-travel table. Perfect, yes. It may not be committed yet, but assume for now it is. Since this is always the latest version, when I do garbage collection, I can just blow the entire time-travel table away, right? Assuming, again, that no one's actively scanning it; we'd have to be careful about that. But that whole table gets blown away, and I don't have to scan anything in the main table, because it always holds the latest version. There's another obvious advantage as well. Yes? Boom, exactly, yes. Sequential scans are super easy, because I just rip through the main table.
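The time-travel update path just described can be sketched in a few lines. The table layout and names here are my own simplification: a dict as the main table holding only master versions, and a list as the time-travel table collecting old versions.

```python
# Sketch of time-travel storage: the main table holds only the master
# (latest) version of each tuple; an update copies the old master into the
# separate time-travel table first, then overwrites the master in place.

main = {"key": ("A2", 2)}        # value and version number of the master
time_travel = []                 # old versions accumulate here, in order

def update(key, new_value, new_ver):
    time_travel.append(main[key])        # copy the old master out of the way
    main[key] = (new_value, new_ver)     # overwrite the master version

update("key", "A3", 3)
assert main["key"] == ("A3", 3)          # sequential scans touch only `main`
assert time_travel == [("A2", 2)]        # GC can drop this table wholesale
```

The two advantages from the discussion fall out directly: a scan over `main` never follows version pointers, and garbage collection can truncate `time_travel` in one shot.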
I don't have to worry about following the version chain. I still use the timestamps to check visibility, but I don't have to follow any pointers to figure out what version I should actually be looking at, right? So this is what SAP HANA does, and if you give Microsoft extra money for SQL Server and you want support for time-travel queries, they set it up like this. Other than those two systems, I don't know of any other system that actually does this. Right, so the one that is, in my opinion, well, spoiler, the better way to do this, and what we do now, is delta storage. The idea here is that we have our main table, and this holds the master version, the latest version of each logical tuple. Then every time we do an update, we store a delta record that only contains the change that was made to the tuple. So in this case here, I'm gonna update the value field, set that to 222. In my delta record, I just have a map that says: I updated the value attribute, and the old value was 101. Right? Do this again: I update it again, I have another update here, and then I have a pointer down there. So what's an obvious advantage of this? He says if you're doing sparse updates, then you're not storing a lot of data. This example here only has one field, but let's say I have 1,000 fields and my update query only updates one of them. In append-only and time-travel storage, I have to copy all 1,000 fields even though I didn't update 999 of them. In this case here, it's super simple: I just copy the things that actually get modified. But now, what's one downside of this? Right, it's reconstruction. I may have to scan back farther and farther to find the right version, to put the tuple back into the correct state. So, like append-only storage, you could do oldest-to-newest or newest-to-oldest.
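The delta-record scheme just described can be sketched like this; reconstruction walks the chain backwards, which is why the ordering question that follows matters. The structures here are my own simplification (a dict for the master tuple, a newest-to-oldest list of delta maps holding old attribute values), using the 101 → 222 example from the slide.

```python
# Sketch of delta storage: the main table keeps the full latest tuple, and
# each delta record stores only the attributes that changed plus their OLD
# values. Older versions are reconstructed by replaying deltas backwards.

main = {"key": 123, "value": 222, "name": "x"}   # master (latest) version
deltas = [{"value": 101}]                        # newest-to-oldest old values

def reconstruct(n_back):
    tup = dict(main)
    for d in deltas[:n_back]:    # walk back n_back versions
        tup.update(d)            # re-apply the old attribute values
    return tup

assert reconstruct(0)["value"] == 222    # latest version: no deltas applied
assert reconstruct(1)["value"] == 101    # one hop back restores the old value
assert reconstruct(1)["name"] == "x"     # untouched attributes never copied
```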
Oldest-to-newest actually doesn't make any sense here, because I might have to go back through the entire version chain to reconstruct the tuple. So anybody that does delta storage always does newest-to-oldest. You look at the main table and it always has the latest version, and I may need to go back to some point and reconstruct the tuple. So this is what Oracle does, this is what MySQL does, this is what we do now, this is what Hyper does. This is the better way to do it. But it's more engineering work. And you actually get the same advantage that he brought up before for garbage collection, because I can just blow this thing away. I don't need to scan through it. I just say, all right, here are all my old versions in some table space for the delta storage, let me just drop that. So MySQL has really easy garbage collection, whereas Postgres, at least in the current version, has to do complete vacuuming, because it's doing append-only storage, so it has to look at all the tuples. All right, so one additional thing that you may have to deal with is how to handle non-inline attributes. So remember I said in the very beginning that unlike in a disk-based system, where in a single page or block of data we try to store both the fixed-length fields and the variable-length fields of each tuple together, in an in-memory database we're gonna store the fixed-length fields together in one data pool and the variable-length data in a separate pool. Because we can pack in the fixed-length guys efficiently and jump to offsets without doing any additional indirection lookups, whereas the variable-length guys we can store in a heap space somewhere else. So now the problem is that, for all of these schemes, if I'm updating my tuple but not updating my variable-length field, I end up making a bunch of copies of redundant data. So in this case here, I only updated the integer value, which is fixed-length; my string value is just a pointer.
So I have to make a copy of my long string here so that the new version has its own pointer to the data as well, right? So what's the one obvious way to fix this? What's that? Sorry, a reference counter, exactly. So I can reuse the data I already have in my variable-length pool and have the tuples just point to the same thing, if I keep a reference counter to track how many tuples are pointing at it. The problem with this, though, is that now you can't move this memory around very easily, because I don't have back-pointers telling me who's pointing at it. So if I want to move this piece of data in memory from one location to another, I've got to scan through all my tables and see who's actually pointing to it. We actually tried implementing this in the Peloton system a few years ago, and it turns out we did something stupid where every fixed-length block had its own variable-length block, and so if we wanted to reuse pointers like this, we'd have to have pointers span blocks of data, and it became a big nightmare. When we talk about dictionary compression, that sort of solves this problem automatically for us. So as far as I know, nobody actually does this; dictionary compression essentially does the same thing for you for free, okay? All right, go ahead with the question. So again, just like in Java or other managed-memory programming environments, we need to reclaim the memory, the space we're generating for these physical tuples, over time, because otherwise we're gonna run out of space. So there are two requirements that a physical version has to satisfy before we're allowed to reclaim it. The first is that no active transaction that's running can "see" that version of the tuple. And I'm putting "see" in quotes here because a transaction can still scan over it, right?
If it's following along the version chain, it may come across that particular old version, but the version won't be in its timestamp range, therefore it can't actually read it. So as long as no one can access that version in a logical sense, or if that version was created by an aborted transaction, because for an aborted transaction no one should see its versions anyway. If we satisfy either of these two criteria, then we can reclaim the memory we're using for those physical versions, all right? So there's a bunch of design decisions we have to deal with to figure out how to find these expired versions and expunge them: how to look for them, how to decide whether it's safe to reclaim the memory, and where to actually look for the expired versions. For this lecture we're only gonna focus on the first part. We'll discuss the others when we spend an entire day on garbage collection next week. So at a high level there are two approaches: tuple-level and transaction-level. We're gonna focus mostly on tuple-level; we'll cover transaction-level in more detail later on. So basically, for tuple-level, what's gonna happen is we look at the physical versions themselves, at the data in the tables, and decide whether we can reclaim them, whether we can garbage collect them. And the process of looking at them could either be background vacuuming, where separate threads check things, or what's called cooperative cleaning, where the worker threads themselves, as they scan the data and scan along the version chains, find old versions and go ahead and delete them. In transaction-level garbage collection, basically every transaction keeps some internal data structure that says: these are the tuples that I saw, or these are the tuples that I modified and created old versions for. And so when they commit or abort, the garbage collector looks inside these data structures and figures out what it should remove.
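The two reclamation criteria from above can be sketched as a single predicate. The function name and parameters are mine; the logic is just: reclaimable if the version's lifetime ended before the oldest active transaction began, or if an aborted transaction created it.

```python
# Sketch of the reclamation test: a physical version can be garbage
# collected if (a) no active transaction can see it, i.e. its end timestamp
# is at or before the oldest active timestamp, or (b) it was created by an
# aborted transaction, which nobody should ever see.

def is_reclaimable(begin_ts, end_ts, active_txn_timestamps, aborted=False):
    if aborted:
        return True                       # aborted: always reclaimable
    if not active_txn_timestamps:
        return True                       # no active txns at all
    oldest_active = min(active_txn_timestamps)
    return end_ts <= oldest_active        # invisible to every active txn

assert is_reclaimable(1, 10, [12, 25])             # ended before ts 12 began
assert not is_reclaimable(1, 10, [5, 25])          # txn at ts 5 can see it
assert is_reclaimable(30, 40, [12], aborted=True)  # aborted short-circuits
```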
So let's go through the tuple-level one first. So again, I have two threads here: thread one has transaction ID 12, thread two has transaction ID 25, and we have a really simple database here. With background vacuuming, we have separate threads that run periodically and just scan through all the data and figure out what versions are actually still visible. Again, they know what the active transactions are and what their timestamps are, and therefore what things are viewable. So it really is just a thread doing a sequential scan: look at the begin and end timestamps, figure out which ones have no intersection with the intervals defined by the transactions' timestamps, and go ahead and remove them. It's the easiest thing to do, just a sequential scan with the extra step of actually removing the physical versions. But this can get expensive, because you may be going over data, over and over again, that you shouldn't delete. If you have a hotspot, only 10% of the database is being updated by transactions, and therefore has old versions; the other 90% is not being modified at all, so it doesn't make sense to scan it, and you don't wanna waste time doing so. So the way you can overcome this is to maintain essentially a simple bitmap that keeps track of which blocks of data have been modified since the last time you scanned through. You skip any of the blocks that have not been modified, because you know you've already vacuumed them, and you only look at the ones that have been modified. Postgres does this; a bunch of other systems do this. It's a pretty simple optimization. Now with cooperative cleaning, the way it works is that as transactions scan the version chains themselves, since they're already checking to see whether something's visible to them or not, they also check whether it should be garbage collected.
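The dirty-block bitmap optimization for background vacuuming can be sketched like this. The block layout and names are mine; the point is only that a second vacuum pass over unmodified blocks costs nothing.

```python
# Sketch of background vacuuming with a dirty-block bitmap: the vacuum
# thread only scans blocks that writers marked as modified since its last
# pass, and clears the mark once a block has been vacuumed.

blocks = {0: ["v0", "v1"], 1: ["v2"], 2: ["v3"]}
dirty = {0: True, 1: False, 2: True}     # set by writers, cleared by vacuum

def vacuum():
    scanned = []
    for blk, is_dirty in dirty.items():
        if not is_dirty:
            continue                     # already vacuumed, nothing new
        scanned.append(blk)              # (real code would reclaim versions)
        dirty[blk] = False
    return scanned

assert vacuum() == [0, 2]    # block 1 was skipped entirely
assert vacuum() == []        # second pass: nothing dirty, nothing scanned
```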
So again, the threads know what the active transactions are and what range of timestamps is still in play. So now they do a lookup through an index, they land in a version chain, and as they scan along, if they see something that is not visible to any active transaction, they go ahead and just delete it, right? And in this case here we're doing oldest-to-newest, so you have to update the version chain and update the index to now point to the new head of the version chain, right? What's one problem with this? Beautiful, awesome. So as he said, if no one's touching a tuple, no one's ever doing a lookup on it, and that tuple has old versions, then no one's gonna come clean it, because no one's gonna scan it, right? Microsoft calls these "dusty corners". So say B0 and B1 are old versions that should be removed, but no one ever does a lookup on B; these guys just hang out forever. The way you resolve this is that you still run the background vacuuming every so often, just go through and prune out this stuff. So you sort of need both. Again, with transaction-level garbage collection, the basic idea is that every transaction maintains a read-write set of the old tuples that it's generated. Then, when the transactions commit or abort, the garbage collector maintains an internal data structure that says, here are all the tuples that I know I need to go clean up later on. And at some point, when it knows they're not visible, it sort of polls every so often, it goes through and cleans things up. Question? No? Good, okay.
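Cooperative cleaning can be sketched as a read that prunes as a side effect. This is my simplification (the oldest-active threshold is passed in as a parameter, and the chain is a plain list of `(begin, end, value)` triples), but it shows the two checks happening in the same pass, and why untouched chains, the "dusty corners", never get cleaned this way.

```python
# Sketch of cooperative cleaning: a reader already walking a version chain
# (newest-to-oldest) finds its visible version AND prunes versions whose
# lifetimes ended before the oldest active transaction began.

def read_and_clean(chain, reader_ts, oldest_active_ts):
    visible = None
    kept = []
    for begin, end, value in chain:          # newest-to-oldest order
        if begin <= reader_ts < end:
            visible = value                  # the version this reader sees
        if end > oldest_active_ts:           # still visible to someone: keep
            kept.append((begin, end, value))
    chain[:] = kept                          # prune the dead tail in place
    return visible

INF = float("inf")
chain = [(10, INF, "B2"), (5, 10, "B1"), (1, 5, "B0")]
assert read_and_clean(chain, 12, oldest_active_ts=7) == "B2"
assert chain == [(10, INF, "B2"), (5, 10, "B1")]   # B0 pruned in passing
```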
So with this one, in our experiments we found that if your system's really fast and generating a lot of versions very quickly, you may actually need multiple threads running in parallel in the garbage collector to clean things up fast enough. Because otherwise, if you can't reclaim memory faster than you're creating new versions, then eventually you're gonna run out of memory. That's true for all of these schemes, but I think this one is more susceptible to it. All right, the last design decision is index management. I sort of alluded to this already: the indexes always point to the head of the version chain. So for the primary key, it always points to the head; how often we have to update it depends on how often we change the head of the version chain. Right, it depends on how often we change the tuple and whether we're going oldest-to-newest or not. If you update the primary key values, then the easiest way to handle this is, instead of treating it as an update, you treat it as a delete followed by an insert. Like, if my primary key value is five and I update it to four, rather than trying to maintain a version chain going from five to four, I'll just delete the first one and insert a new tuple that gets its own new version chain. That makes your life way easier than trying to be smart about how to keep a tuple spanning different primary keys across versions. Secondary indexes, though, are gonna be more complicated. So again, the primary key index always points to the version chain. If it's oldest-to-newest, then every time we make a new version, we don't have to update it; if it's newest-to-oldest, then every time we make an update, we do. For secondary indexes it's harder, because you may not want to point to the physical version, because you could have way more secondary indexes than primary key indexes.
Every table is gonna have one primary key index, but it could have multiple secondary indexes, and you don't wanna have to update all of them potentially every single time. So there's this great blog article from Uber, this is 2016, where they come out and tell basically the story of how they switched from Postgres to MySQL. The true story is they went from MySQL to Postgres and back to MySQL: they hired some guy that really liked Postgres, so they switched from MySQL to Postgres the first time, and then the dude left, and they were like, wow, we're fucked, we gotta figure something out, it's not working, so they went back to MySQL. So, a very expensive discovery for them. And they point out a bunch of the differences between how Postgres and MySQL manage versions in MVCC, but this figure here is actually what we're gonna talk about right now: how they maintain their indexes. So again, the primary key index is always gonna point to the head of the version chain for a tuple, but the secondary index could either point to the head of the version chain or point to something else. MySQL points to something else, so that means every single time you update a tuple, you don't have to update the secondary indexes. Postgres always points to the head of the version chain, so every single time you update a tuple, you have to update the indexes. And for Uber's particular workload, as they describe in this example, that became a real bottleneck for them. All right, so the two choices are logical pointers versus physical pointers. With a logical pointer, we have some placeholder value, a unique value for our tuple that doesn't change across physical versions, that we can use to identify the tuple uniquely, and then we need to map that logical ID, this internal ID, to the actual physical pointer that gets us to the head of the version chain.
And the two choices there are the primary key or a tuple ID; I'll explain what that is on the next slide. Physical pointers are what I said before, what Postgres does: they always point directly to the in-memory, physical location of the head of the version chain. So if I'm doing newest-to-oldest, every single time I create a new version I prepend it to my version chain and it becomes the new head, so with physical pointers I always have to update the indexes. All right, so I use this visualization to explain these different concepts. Say in our simple example we're doing append-only, newest-to-oldest. For a lookup on the primary key, we go through the index, the index spits out a physical address that points us directly down to the head of the version chain. That's fine. For a secondary index, say we do another lookup: if we again have a physical pointer, then every single time this thing gets updated, we have to update that pointer. For one index, who cares? But if you have a lot of secondary indexes, then this starts to get expensive, because all of these guys have pointers to the head of the version chain too. And we'll talk about indexes in a few weeks; it's not cheap to do probes in an index, especially in memory it's going to be expensive, and it'd be nicer if you could minimize this overhead. The alternative is to use the primary key, which is what MySQL does. So if you have a secondary index, the key is whatever you're indexing on, and the value is literally the primary key for that tuple. If my primary key is an integer, then I'm just storing that integer as the value in my secondary index. So then I take the primary key that's output from the secondary index and do a lookup in my primary key index, which is always going to be there, because you have to have a primary key. Well, logically you don't, but physically you do.
You probe the primary key index, that gives you the physical address, and that finds the head of the version chain. So this avoids the problem we had with multiple secondary indexes pointing to the physical address, because no matter how many times I change the head of the version chain, I only update the primary key index; I don't update any of those other guys, because they have a logical reference to something that the primary key index tracks. Of course, the downside is, if my primary key is huge, then I'm fucked again, because now I'm storing that entire primary key inside of my secondary index. If my primary key is a one-kilobyte string, then I'm duplicating that all over inside the secondary indexes, and that's gonna suck. The last alternative is to have some kind of indirection layer. Think of this as a hash table where the key is my logical tuple ID, an internal reference or internal counter that I'm maintaining for every single tuple. So the secondary index spits out the tuple ID, I do a lookup in my hash table to get the physical address, and that points me to the head. So now if I update the version chain, if I update the head, I update my primary key index (you always have to do that), but then I just do an O(1) lookup and update the hash table. So again, depending on what your workload is, what the application looks like, one approach might be better than another. If you have no secondary indexes, then who cares? If you have one secondary index, having the physical address in the index is fine, because it's not that expensive to update one. But if you have thousands of indexes, then that's gonna be problematic. So any questions about these design decisions? Yes? How much more expensive is it to do two memory lookups?
So the question is, how much more expensive is it to do two memory lookups instead of one — is it worth it? Well, it's not just one extra lookup. Compared to a physical address, this thing is not a single memory-address lookup; it's a hash table. Right, so you have to hash the key, then do a lookup, and depending on how the hash table is implemented you may have to scan, right? This is not one load from memory; it's way more. Yes? Why is it not configurable to switch from newest-to-oldest to oldest-to-newest? So the question is, why can't a DBA say, by default you're storing newest-to-oldest, but now go oldest-to-newest? That's a major engineering issue, right? Because think about all the other crap around the system that's making assumptions about how these versions are stored: as I scan along a chain, the next thing I look at should be older than the one I just saw. That seemingly trivial assumption permeates throughout the entire system. It would be a major engineering overhead to change it. So as far as I know, no system does that. Yes? Is this table from tuple ID to address also considered an index? So the question is, is the lookup table from the tuple ID to the address also considered an index? Logically yes, physically no. This is invisible to the application. The application isn't going to be able to see this, right? You can't see it in SQL. Underneath the covers, the system uses it to route you to where your tuple actually is, right? So this is just an aspect of the internals of a database system that we need to make the system actually run, but it's not something that speeds up individual queries like an index does. So I wouldn't call this an index, but it's a hash table, so it sort of is, right?
Okay, so is it only used to solve the problem of going from the secondary index to the tuple's physical address? Yes. All right, so what was stated is that we're only using this to solve the problem of the indirection from the secondary index to the physical address. Yes. Yes. So this question is about garbage collection: say I'm garbage collecting this — or sorry, A4. Why not copy A3 over A4 and then change the pointer? All right, so he says, say I'm garbage collecting A4. This is newest-to-oldest, so A4 is the oldest, and I want to garbage collect it. So instead of blowing it away and updating the pointer to now point to A3, what if I copy A3 over here and then change the pointer to skip past it, so that you don't have to relink all of that? Well, that requires updating two memory locations, and therefore you need a latch, right? Updating the linked list the normal way is a single pointer swap; I can guarantee that's done atomically without having to take a heavyweight latch. To do what you're describing, to copy things over, how do I make sure that nobody else is trying to do the same thing at the same time, right? So let's break it down — this will come up more when we talk about in-memory indexes and things like that, but let's walk through the steps you described. We're gonna take A3, copy it over on top of A4, and update the pointer to now point to A2, right? Say I update the pointer first: I update it to point to A2, someone comes along looking for A3, and they can't see it because that link isn't there anymore, right? Now say instead I copy A3 over before I update the link. So I copy A3 over, but then something comes along and scans this chain and sees A3 twice.
You can't atomically update two memory locations the way you're describing without taking a latch, and we want to do this latch-free. Yeah, another question: when you talk about linked lists, do you mean a traditional linked list, where every node is just malloc'd individually, or a linked list embedded in a large chunk of contiguous memory? All right, yeah, I know what you're asking. So the question is, when I describe this version chain as a linked list, am I describing it as mallocing a bunch of nodes and linking them together, or do I malloc a big chunk of memory ahead of time and just reuse that memory and point to different parts of it? It's the second one, because remember I said in the beginning, every thread's gonna malloc some thread-local storage. That way, when a thread needs a slot to store a version, it doesn't have to go to a global data structure to get it; it just gets it from its local memory. It doesn't need to synchronize. So yeah, it's exactly what you're saying: you're mallocing a big chunk of memory and you're slicing it up to maintain these versions, but we're doing that on a per-thread basis. Okay, awesome, good questions. All right, so the paper — did you guys read it? The best paper ever. Again, this came out of this course. So as I was teaching — I think it was 2016, the first time I taught the course — there were all these questions about MVCC, and when this came up, when we were building the Peloton system, we were like, oh, how should we do this? How should we make these different design decisions?
And when you go read the papers we'll see next class, they talk about their concurrency control protocol and how they handle these different design decisions, but they don't say why they do it. They don't evaluate other options. They say, this is the way we did it. And so this paper's idea was: we were gonna take all these different techniques from all these different papers, build them into our single system, run a complete bake-off on all of them, figure out which one turns out to be the best, and that's the one we'd end up picking and using, right? So these are the students, actually in this class — this is on the final exam day when we do the final presentations. And the prize they got was the anime textbook for databases and a fake autographed copy of Larry Ellison's unauthorized biography. The difference between God and Larry Ellison is that God doesn't think he's Larry Ellison. Anyway, Larry Ellison is the founder and CEO of Oracle. I can't talk about him on video because the MySQL guys complained that I talk about Larry Ellison too much, and when they watch these class videos they wanna share them throughout the company, but I say too much crap about Larry Ellison so they can't do that. Okay, all right. So the first part was, again, we isolated each of the different design decisions, did a bake-off of them, and saw which ones turned out to be the best when making just that one decision. So for the concurrency control protocol, I would say the results are inconclusive. Some protocols work better than others for different workloads, and this is true not just for MVCC but for pretty much anything — even single-version systems or distributed databases. There's no one protocol that is so much better than everyone else. They have different trade-offs, and there are other aspects of the implementation that end up mattering more.
For version storage, it turns out that deltas work the best, because again, you're minimizing the amount of data you have to copy every single time you update a tuple — you're only copying the subset of the attributes that you modified. Tuple-level vacuuming turns out to be the best. I think cooperative is actually slightly easier to engineer, but this was slightly better. And then, no surprise, logical pointers are much better than physical pointers for workloads that look like the Uber environment. So what was really disappointing about this paper was we did all this, and then we said, all right, well, let's go put this in Peloton — these are the best ones — and then we didn't actually end up doing that. We ended up picking like the worst things, which is why we had to throw away all the code. All right, so this is the table that summarizes what these different systems actually implement. MySQL, Postgres, and Oracle are not in-memory systems; we're showing them for historical reasons, and obviously they're very popular and widely used. Everything down here is an in-memory system, and the main thing to point out is, again, everyone's doing different things. Sometimes it's for performance reasons — for a particular workload they were trying to optimize for, they picked certain design decisions — and sometimes it just came down to engineering. We asked them why they did something and they said, this is just the way we wrote it, right? For our new system — the name is to be determined, please do not call it Terrier, we're not calling it that, that's the name of my dog. We had to put something in the GitHub repo, so we put that; we're gonna rename it, but don't call it Terrier. So in the old version of Peloton, we were actually doing MVTO, append-only, vacuuming, and then I think logical pointers — really kind of the nasty ones.
In our newer version, we actually went back to HyPer and copied a lot of the ideas that they did. Because when you do the bake-off and look at how they all perform, at the very top here you have Oracle, NuoDB, MySQL, and HyPer, right? These guys are all doing delta storage; they vary on the concurrency control protocol that they're using, but they're doing the background vacuuming and the logical pointers, right? So this is running TPC-C with some additional scan queries. Postgres actually ends up being the worst, followed by Hekaton and so forth, right? So the way to think about it: over here at 40 cores is the maximum amount of parallelism we're trying to achieve on our box for this workload, and it turns out the delta storage stuff ends up mattering the most in this environment, because again, you're reducing the memory pressure in the system. So I'm not saying it's because of these results, but somebody who is a high-level committer on the Postgres team announced last year that they were exploring the idea of dropping the append-only storage that they're using and switching over to delta storage. It has a name; this is all very experimental. But they talk about how, if they switch to delta storage, then they don't have to do any more vacuuming, because you just look in the rollback segment, or the delta storage portion, and you find all the old versions to delete. So Postgres is moving in this direction with delta storage, and I think that's the right way to go. Okay. All right, what are my parting thoughts? MVCC is the best approach to support mixed workloads, which is the focus of the semester. We want to do transactions and we want to do analytics on the same database instance, and having multiple versions around makes our lives easier, right, because we can have read-only transactions, read-only queries, go read old versions without interfering with the updates.
Primarily in this class, though, we only focus on OLTP environments, but the delta storage stuff, as we'll see next class in HyPer, is still really good. There's additional work you have to do to make the analytical queries run faster, and we'll see how they handle that next time. Okay, all right. Next class, we're gonna go into more detail about real-world implementations of in-memory MVCC. You guys are assigned to read the HyPer paper, which is what our system is based on, but it also covers Cicada, which came out of CMU, Microsoft Hekaton, and if you have time, MemSQL. Okay, all right guys, any questions?