For today's class, I'm going to spend time talking about multi-version concurrency control. In last class we talked about OCC, but there was a lot of material I wanted to cover at the beginning about stored procedures, so we rushed through the actual Silo protocol and ended up skipping the TicToc protocol entirely. Last year when I taught MVCC I did it in a single day, like I did OCC in a single day, and we had the same problem: we spent so long on background material that we never actually got to the core protocols you read about in the papers. So this year we're going to split MVCC into two lectures. Today's lecture will be all the background on how you implement MVCC, and then on Monday we'll go into the details of how Cicada, Hekaton, and HyPer implement it. That's why this is part one of MVCC: we'll cover the background material from the paper you read, and then we'll do the hardcore implementations of the protocols on Monday. So for today's class, I'm going to start off talking about compare-and-swap and how you actually use it, because that's the core primitive we're going to rely on throughout in-memory databases, especially for in-memory concurrency control. Then we'll talk about isolation levels, then we'll cover the paper you read on MVCC design decisions, and we'll finish with the announcement of project two. Okay? All right, so last class I asked who here knew what compare-and-swap was, and not everyone raised their hands, so I want to spend some time going through what it is, because you're going to need it to understand how we implement in-memory concurrency control protocols. We're also going to need it for project two, because you have to implement a latch-free data structure. The basic idea of compare-and-swap is that it's an atomic instruction the CPU provides that lets you examine a location in memory, check whether the value there is what you expect it to be, and if it is, go ahead and modify it. This is not a new idea; it's been around in processors since the 1970s, but it's in vogue now because everyone wants to build or use lock-free or latch-free algorithms, and this is the core construct you need to implement one. So again, you look at some memory location and check whether it matches the value you give the instruction; if the values are equal, you're allowed to install a new value and overwrite the old one, and if it's not what you expect, the instruction fails. The way you use this in your C/C++ code looks like this. It looks like a normal function you'd find in the STL or libc, but it's what is called a CPU intrinsic. Who here knows what an intrinsic is? One in the back, two. Okay. An intrinsic is basically a way the compiler lets you invoke an instruction without having to write the raw assembly yourself. So even though this looks like a function, when the compiler sees it, it knows how to convert it into the single instruction needed to do the operation. And you can tell it's an intrinsic because it has the double underscore prefix.
So in this example we're going to invoke a compare-and-swap on a memory location, check whether it holds the integer we expect, and install our new one. The first argument to the intrinsic is the address, some location in memory that has a current value. Then we give it our compare value and check whether the contents equal it; in this case they do, so we're allowed to install our new value, and we update the memory location to 30. This particular intrinsic is called __sync_bool_compare_and_swap, and it returns true or false to tell you whether the compare-and-swap succeeded. There are other intrinsics that do something similar, for example one that returns the value that ends up in the memory location, which you can use to check whether your new value was installed. The key point is that this is literally a single CPU instruction. If you had to do the same thing with a mutex, you'd have to acquire the mutex, have an if clause to check whether the current value equals the value you expect, install your new value if so, and then release the mutex. And as I said last time, mutexes are the Hitler of concurrency, so we want to avoid them; compare-and-swap is much better. Now, if I try the same thing but my compare value is 25 and I want to install 35, this instruction fails, because the value in memory doesn't match my compare value, so it returns false and I know my new value was not installed. We could use this to do the timestamp or transaction ID increments from OCC last class, compare-and-swap to add one each time, and we'll use it for other things later today: installing new transaction IDs and timestamps inside our tuples to indicate when a particular version is visible. Is this clear? Also, in C++ you can use std::atomic and wrap an integer or a boolean or some other primitive type in it, and that provides the same functionality. So in general when you're writing C++ you don't always want to write intrinsics directly; you want to use what C++ provides you.
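Here is a minimal sketch of the two flavors just described, the GCC/Clang intrinsic and the portable std::atomic wrapper; the variable names and printed output are just for illustration.

```cpp
#include <atomic>
#include <cstdio>

int main() {
  // Intrinsic flavor: the double-underscore prefix marks it as a compiler
  // intrinsic that lowers to a single atomic instruction (e.g. CMPXCHG on x86).
  int value = 25;
  bool ok = __sync_bool_compare_and_swap(&value, 25, 30);   // succeeds: 25 == 25
  std::printf("first CAS:  ok=%d value=%d\n", ok, value);   // ok=1 value=30

  ok = __sync_bool_compare_and_swap(&value, 25, 35);        // fails: value is 30, not 25
  std::printf("second CAS: ok=%d value=%d\n", ok, value);   // ok=0 value=30

  // Portable C++ flavor: wrap the primitive type in std::atomic and use
  // compare_exchange_strong, which also reports the current value on failure.
  std::atomic<int> counter{0};
  int expected = 0;
  counter.compare_exchange_strong(expected, 1);             // counter becomes 1
  std::printf("atomic counter=%d\n", counter.load());
  return 0;
}
```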
Okay, so the next thing I want to cover is isolation levels. In last class, and certainly in most introductory database classes, we spend a lot of time going on about how great serializability is. We say it's the gold standard you want in a transaction processing database system, because it lets your programmers not have to think about the correctness of their program: they can write transaction code in their application and assume the transactions are executed in a serial order, even though we know that most systems will interleave their operations. The end state of the database is equivalent to one where the transactions executed in serial order, which means we don't have to write any special code in our application to reason about inconsistent data. We don't have to worry about our transaction reading data from a transaction that hasn't committed yet, or getting back different values. The problem with serializability is that while it's useful for us to reason about, in practice people often don't use it, because achieving serializability often requires additional checks and mechanisms and limits the amount of parallelism and concurrency you can have in your system. So a lot of systems forgo serializability altogether and run at what is called a lower isolation level. These lower isolation levels, as defined in the SQL standard, are specified in terms of what anomalies a transaction may encounter while it runs. In today's class we'll mostly talk about snapshot isolation, and then on Monday, when we get to the actual protocols, we'll see how they try to achieve serializability; but the reality is that in the real world not everyone needs this, though it's good for us to understand what's going on. The way you control the amount of parallelism you can have in your database system is to specify what isolation level you want your transactions to run at, and the isolation levels are defined in terms of which anomalies your transactions could be exposed to while they're running. The standard defines them in terms of three types of anomalies, and anybody who's taken an introductory database class should be able to give me definitions for these. What's a dirty read? He said you read data from an uncommitted transaction. Correct. What's an unrepeatable read, someone other than him? She said you read a value once, somebody changes it, and then you read it again and get a different value. Correct. This can occur even if the other transaction that modified the data committed: you see its committed data, and it's not what the value looked like when you first read it. This one's a little trickier: what's a phantom read? He got it half right, so let me expand. A phantom can occur when you read a range of tuples, you do the scan a first time, then you do it a second time, and in between someone else has added new data or deleted data, so when you do the scan again you see things that weren't there before. This can come up in an aggregation query: if I want to count the number of students with age 18 to 25, like this class, I scan it the first time and see 10 tuples, then somebody inserts a new record for someone aged 22, I scan it again and now I get 11 records. That's a phantom; it's something that wasn't there when I ran my query the first time. And it's not the same thing as a dirty read or an unrepeatable read, because in the insert case I couldn't have read the thing before, it didn't exist; it's not that I read it before and got a different value. So the SQL standard defines the isolation levels in terms of those anomalies, specifying which anomalies could occur at each level.
And I say could occur, because there's no guarantee your transaction will actually hit these anomalies if you run at a lower isolation level; it depends on what other transactions are running at the same time and what they're doing. If your application has a single thread and executes one transaction after another in serial order, then even at the lowest isolation level you're not going to have any dirty reads or other problems, because no other transaction can be running at the same time. This really only matters when you're in a highly concurrent environment with a bunch of simultaneous transactions, where we could be reading data we shouldn't be. At the very top, the strongest level is serializable: no phantoms, all your reads are repeatable, and no dirty reads. As you go down to lower isolation levels, you can incur more anomalies. At repeatable reads you could have phantoms; at read committed you could have phantoms or unrepeatable reads; and at read uncommitted all bets are off and any of those anomalies could occur. Question? No. Another way to think about this is as a hierarchy: at the very top you have the most protection, at the bottom you have the least, and the levels subsume each other. Again, in an academic setting we say serializability looks great, of course you want to run your transactions with that protection, but the reality is most people don't. Part of the reason is that when you look at what real systems actually support, serializable doesn't come up a lot. This is a table that Peter Bailis, a professor at Stanford, generated a few years ago. He looked at the manuals for a bunch of database systems that people actually use for real applications, and recorded what the default isolation level was and what the maximum isolation level each system claimed to support. What you see in the default column is that only two systems are serializable by default: VoltDB and Ingres. VoltDB is based on H-Store, the system I helped build, and that's because it has single-threaded execution engines that only allow one transaction to run at a time, so you can never have an anomaly and you essentially get serializability by default. And in the maximum isolation level column, some of these systems, like MemSQL, don't even support serializable at all; they're okay with running at read committed. So the main takeaway is that most systems by default run at a lower isolation level, usually read committed, and most applications don't change it. I don't have the graphs here, but we've done surveys with DBAs, and we find that most people run with whatever their system's default is, and that's probably good enough for them. Facebook runs at repeatable reads, they have the largest MySQL installation in the world, and for them that's fine. The other thing I want to point out is that there are two isolation levels in this table that weren't in the chart of four I showed before. For DB2, the default isolation level is something called cursor stability, and for Oracle the maximum isolation level is something called snapshot isolation.
In Oracle you can actually say you want serializable transactions, and the system will come back and say, sure, you got it, but in reality you don't get that; you get something weaker, you get snapshot isolation. So what are these other isolation levels, and why weren't they included in the standard? It turns out that when the standard came out in 1992, the standards body that wrote it was looking at concurrency control from the perspective of a database system that implements two-phase locking, and therefore all the anomalies it talks about are in the context of a two-phase locking system. But as I said last class, there's a whole other class of concurrency control protocols, like the timestamp ordering ones, that work very differently from two-phase locking, and it turns out they don't hit exactly the same anomalies I showed before. In particular, we'll see this when we talk about multi-version concurrency control, which by default gives you snapshot isolation guarantees. So we need to understand what these actually are, and I'll go through both of them. The first one is cursor stability, and as I said, this is the default in DB2. To understand cursor stability you kind of have to throw out everything you know about two-phase locking. A cursor is essentially a pointer inside the database system that points at whatever data item, say a tuple, the query is operating on at that moment. The cursor acquires locks for the things it wants to access. So say I want to scan two tuples: my cursor finds the first tuple, acquires the lock on it, does whatever it needs to do, releases that lock, then grabs the cursor lock on the second tuple, does whatever it needs to do on that, and releases that lock when it's done. That looks a lot different from what we talked about before with two-phase locking. With two-phase locking we said that in the growing phase you acquire all the locks you need, but as soon as you release one lock you're in the shrinking phase and you can never go back and acquire another lock. With cursor locking you can do that. And you don't have to worry about deadlocks, because only one cursor lock can be held by a query at a time: I can't acquire a lock and then acquire another one; I acquire my lock, do what I need to do, and give it up before I'm allowed to go get the next one. Because of that, you don't have deadlocks. With multi-threading it gets a little more difficult, but the general idea is the same. So cursor stability turns out to be another isolation level that sits in between repeatable reads and read committed: it's stronger than read committed but weaker than repeatable reads. The reason it's stronger than read committed is that it prevents something called the lost update anomaly, which wasn't among the anomalies I talked about in the beginning, dirty reads, unrepeatable reads, phantom reads, so we need to understand what that is. Yes?
Can cursor stability have dirty reads? Like he said, say transaction one takes a lock on tuple one, changes it, releases the lock, and takes the lock on tuple two. Then transaction two comes in and takes the lock on tuple one, which basically holds a dirty value, because transaction one hasn't committed yet. So the question is, can you have dirty reads with cursor locking? There's some extra stuff you have to do to make sure you don't read things from transactions that haven't committed yet. But you're not guaranteed repeatable reads, because you could come back and read something else that has since committed. Question, or no? Okay. So let's look at a really simple example. We have two transactions, T1 and T2. T1 wants to do a read on A and then a write on A, and T2 wants to do a write on A. For this we'll assume we're running on a system with only one core, meaning only one thread holding the program counter can be making forward progress at a time. We start here: transaction T1 wants to do a read on A, so it acquires the cursor lock on A, reads it, and releases the lock. Then we have a context switch and transaction T2 starts running: it grabs the cursor lock on A, does its write, and releases that lock. Then we context switch back over here: T1 gets the cursor lock on A, modifies it, releases the lock, and commits. At some later point transaction T2 commits. But now the problem is that we've lost an update: even though in physical time T2 committed after T1, every transaction afterwards will only see T1's update, because T1's write overwrote T2's write, and that happened because we only had cursor locks. A cursor lock in general will prevent this problem, but not always; there are scenarios where it doesn't. If T1 had held its cursor lock on A from the read all the way through its write, then T2's write would have blocked until T1 was done, and we wouldn't have had the anomaly. So this is a bit esoteric; it only occurs in systems that use cursor locks, like DB2. The in-memory systems we're talking about today and next class don't have cursor locks, so they don't have this problem. But it's good to understand because it shows up in the literature when people talk about the different isolation levels that exist beyond what the standard specifies. Yes? Correct, yes: his question was, with a cursor lock, do you hold the lock until you need to acquire the next one? Yes. Okay. The one I do want to spend more time on is snapshot isolation. The way to think about snapshot isolation is that the database system guarantees that when a transaction starts, it sees a consistent snapshot of the database as it existed at the moment the transaction started. The key word in that sentence is consistent: the snapshot does not contain any modifications from transactions that have not committed yet.
So say transaction T1 starts and updates the database, and then transaction T2 starts: T2 will not see T1's updates, because T1 has not committed, and therefore a view of the database containing T1's modifications is not a consistent snapshot. T2 will see the view of the database as it existed before T1 started. When a transaction wants to commit under snapshot isolation, we check whether its write set conflicts with any other updates made since it started and created its snapshot, and if there are no conflicts it's allowed to commit. So this sounds awesome, right? It sounds like it's basically serializable. Well, it turns out snapshot isolation is susceptible to another anomaly that wasn't in the original three I defined before, called the write skew anomaly, and this is something that cannot occur under repeatable reads; it's specific to snapshot isolation. The way I like to explain write skew is with an example invented by Jim Gray, I guess about two decades ago. Rather than worrying about tuples and indexes, let's say we have a database of marbles, and marbles can be either black or white. We start off with two black marbles and two white marbles, and we execute two transactions at the same time on two separate threads, both making changes to the database. The first transaction wants to change all the white marbles to black, and the second transaction wants to change all the black marbles to white. When they start running at exactly the same time, this represents their consistent snapshot of the database: they see it exactly as it existed when they started. The top transaction finds all the white marbles and switches them to black; the bottom one finds all the black marbles and switches them to white. So in its own snapshot, the top transaction now sees all black marbles, and the bottom one sees all white marbles. But when they go to commit, we end up with this state, and the reason is that the write set of the top transaction does not overlap with the write set of the bottom one: the top one was allowed to install its two black marbles and the bottom one its two white marbles. Is this equivalent to a serializable schedule? No. What would a serializable schedule be? We would first execute transaction one, which flips all the white marbles to black, then transaction two, which flips all the black marbles to white, and we would end up with either all white marbles or all black marbles, depending on the order. So the point is that this is an anomaly that should not occur if we have a truly conflict-serializable schedule, but under snapshot isolation you get something less than that, and it requires extra work to avoid this problem of these separate, non-overlapping updates.
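Here is a toy, single-file sketch of that marble example; it just simulates the two transactions sequentially, each reading its own private snapshot, to show why the outcome is not serializable. The names and structure are mine, not from any real system.

```cpp
#include <array>
#include <cstdio>

// Write skew illustration: each "transaction" reads a snapshot taken when it
// starts, computes its writes from that snapshot, and commits because its
// write set does not overlap the other transaction's write set.
enum Color { WHITE, BLACK };

int main() {
  // Initial database: two black marbles (0,1) and two white marbles (2,3).
  std::array<Color, 4> db = {BLACK, BLACK, WHITE, WHITE};

  // Both transactions take their snapshot of the same initial state.
  std::array<Color, 4> snap1 = db;   // T1: flip all WHITE marbles to BLACK
  std::array<Color, 4> snap2 = db;   // T2: flip all BLACK marbles to WHITE

  // T1's write set is {2,3}; T2's write set is {0,1}. They do not overlap,
  // so snapshot isolation lets both commit.
  for (int i = 0; i < 4; i++) if (snap1[i] == WHITE) db[i] = BLACK;  // T1 commits
  for (int i = 0; i < 4; i++) if (snap2[i] == BLACK) db[i] = WHITE;  // T2 commits

  // Final state: the colors just swapped. Any serial order (T1 then T2, or
  // T2 then T1) would have produced all-black or all-white, so this outcome
  // is not serializable.
  for (Color c : db) std::printf("%s ", c == BLACK ? "black" : "white");
  std::printf("\n");
  return 0;
}
```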
So the way to think about your isolation levels now is not the straight hierarchy I showed you before; it's something more complicated. Now we have cursor stability in between read committed and repeatable reads, and snapshot isolation is something over here on the side, completely separate, another path going up. And it's actually way more complicated than even this. One of the optional readings for today's lecture is a great paper by Atul Adya, who wrote an amazing thesis in the late 90s about the different isolation levels you can have in data management systems, and this diagram is not even the full thing, but it's roughly what the picture looks like. We have cursor stability, we have snapshot isolation, and there are way more levels beyond those; not all database systems provide all of them, typically they only have the main four or fewer. We also see that what we've been calling serializable in database systems is full serializability, and there's something even above that called strict serializability, which is equivalent to what is called external consistency. That means that if I submit transaction T1 first, followed a little bit later by transaction T2, then T1 will commit first, followed by T2; under general serializability you're allowed to flip that order. I only bring this up to show that this is way more complicated than what I'm showing here, and this is why transactions and concurrency control are really hard in database systems: there are all these corner cases and other issues you have to deal with beyond just "am I reading data from an uncommitted transaction?". So, any questions about snapshot isolation or the write skew anomaly? As I said, in this lecture we're not going to go into much detail on how to handle the write skew anomaly in MVCC; we'll see that on Monday. And in the paper you read that I wrote, we cheat in some ways, cheat in quotes, because we avoid the write skew anomaly: we don't have any range queries, everything is point queries, and those are easy to make serializable. Okay, so with that we can now talk about MVCC. The basic idea of the MVCC approach is that the database management system maintains multiple physical versions of an object in the database to represent a single logical object, and for simplicity we can assume an object is a tuple. When a transaction updates an object, when it writes to it, the DBMS makes a new version of that object, moving forward in time, that represents the change the transaction made, and we maintain the history, up to a certain point, of all the previous versions that came before it. This is why we get snapshot isolation for free: if we have those older versions, then when a transaction comes along with its timestamp, it knows how to look at those older versions to figure out what it should be allowed to see, how to get its consistent snapshot. And that's what we do with reads: when we do a read on an object, we need to know what version we should be allowed to see, and that's based on what existed in the database, from a committed transaction, at the moment our transaction started. MVCC itself is not a new protocol: it was first proposed in a dissertation by somebody at MIT in 1978, and it was later implemented in the early 80s at DEC in a system called InterBase.
InterBase still exists, I think; DEC doesn't exist anymore, but InterBase as a system still exists, some company has it, and they're rebranding it as a mobile database, though as far as I can tell it's still derived from the original InterBase from the 1980s. The InterBase from the 1990s, after Borland bought it, was open-sourced as Firebird about a decade or so ago, maybe longer, so that's available today: you can go get Firebird, or you can buy InterBase from the company. The fun fact is that the original name of Firefox was actually going to be Firebird, but because it clashed with these guys, they had to rename it to Firefox. So although this protocol is from the 1970s, what's fascinating is that, with a few exceptions, basically every single major database system that has come out in the last 10 years that does transactions uses MVCC. Part of what this lecture and the reading are about is trying to figure out why: what is it about MVCC that makes it something people want to use, versus in-place updates in a single-version system? The three main benefits people tout for MVCC are the following. First, your writers, your writing transactions or writing threads, will not block any reading threads, because the reading threads can always read an existing version while the writing threads create new versions. The second benefit is that read-only transactions can read a consistent snapshot of the database without requiring any locks and without requiring us to track their read set. In SQL, when you start a new transaction with BEGIN, you can specify that it's going to be a read-only transaction, and the database system can use that hint to optimize how it handles your transaction. In our system, in Peloton, if you tell us you're a read-only transaction we disable read-set tracking, because there's no reason to do it: you're reading a consistent snapshot of the database, seeing things as they existed at the moment your transaction started. It may not be the newest state of the database, but who cares; with a serializable ordering that's fine. The last feature that comes up often is support for time-travel queries. A time-travel query basically allows you to say: execute this SELECT statement on the state of the database as it existed one hour ago, or one day ago, or one week ago, depending on how much history you have. We can do this because we assign timestamps to all our transactions, so we know what timestamp, what version, existed one hour ago when someone wants to execute this query, and we know we can skip any newer versions that come after that timestamp and only read that snapshot of the database in the past.
This is not a new idea either; it was in the original feature set of the first version of Postgres, where they talked about how they could support time-travel queries. Later on, in the 1990s, around 1999, when people started using the open-source version of Postgres, they ended up turning this off, because you run out of space pretty quickly if you're doing a lot of updates and maintaining all these version histories. But a lot of new systems that come out now talk about how they can do time-travel queries, and as we'll see as we go along, it essentially means you just turn off garbage collection and maybe add a little extra syntax to the SQL statement to let you go back in time. Microsoft is happy to sell you a package that supports time-travel queries, and that does require some additional code to do something different than what Microsoft normally does, but again, when you're doing MVCC you essentially get this for free if you just don't clean anything up. The reason I say I don't know how useful this is, is that the only place I've ever seen it used is in financial companies that have to do it for regulatory reasons. Under Sarbanes-Oxley, because of the US government, you have to maintain a seven-year history of your database, so you need to be able to go back in time and say what a transaction did six or seven years ago, and time-travel queries are one way to do this. But for most applications, when you think about it, you're already kind of doing this in your application anyway. When you order stuff on Amazon, Amazon doesn't use multi-versioning to maintain different order histories; it creates a new tuple to insert for your new order, it inserts all the new order items, and you can easily go back and see what orders you bought a year ago because they're just separate tuples. So I think the usefulness of this is actually quite limited, and I haven't seen a really compelling reason beyond what the financial guys need, and they often just take snapshots anyway and keep those as separate files. The important thing I want you to take away from the paper I assigned, and this is a mistake I made when I was younger, is that although it's called multi-version concurrency control, it's more than just concurrency control. This is the confusing part from last class: there is the original optimistic concurrency control protocol, but there's also a whole class of protocols that gets called optimistic concurrency control. Likewise, there is an original multi-version concurrency control protocol from the 1978 thesis, but now the term encompasses much more, because it permeates how you design all the different parts of the system, all the different design decisions. That's essentially what the paper I had you read is about: we tried to figure out how you actually want to implement a modern MVCC database system. The way this started was that when we first began building Peloton, all these questions came up about how we should store different versions, how we should maintain our indexes, how to do garbage collection. And when you go read a lot of the academic papers on in-memory MVCC, they usually talk about the first part, the concurrency control protocol, and evaluate it in much more detail than the other parts. For garbage collection they'll say, this is the way we do it, and maybe a little about why, but they never actually compare all the other possibilities.
So in this paper, we basically implemented everything we knew about in one database system, ran a complete parameter sweep of benchmarks over every possible combination, and whichever configuration turned out to be the best is what we'd end up using in Peloton. That was the motivation for the paper. As I said last class, the original title of this paper was not "An Empirical Evaluation of In-Memory Multi-Version Concurrency Control"; the original title was "This Is the Best Paper Ever on MVCC", and I still stand by that today. The reviewers, this was VLDB, came back and said, well, that's a subjective statement, can you be a bit more scientific? So our next title was "If You Only Read One Empirical Evaluation Paper on In-Memory Multi-Version Concurrency Control, Make It This One!", exclamation point. They came back and said, no, you can't do that either, can you make it a little less direct? So then I said, "We Think You Will Really Enjoy This Empirical Evaluation Paper on MVCC". At that point the paper was accepted, but then the program chair basically came down and said, change this title or we reject the paper. I don't have tenure, I needed this paper, the student working on it needed this paper, so I broke down and we ended up with that title. It is what it is; that's fine. Okay, so these are the four design decisions that are covered in this paper. I also want to spend a little time on something that goes beyond the paper, which is how you handle transaction ID wraparound; I think that's actually an interesting problem, and I know how Postgres does it, but for the other systems it's not clear how they do it, so this is something we can investigate later. All right, so we'll go through each of these. This is the table we generated for the paper, which shows you the different database systems out there doing MVCC and how they implement those four design decisions. The first three at the top, Oracle, Postgres, and MySQL, are obviously not in-memory database systems, but we included them for historical reasons; the first version of Oracle from the 1970s didn't actually do MVCC, it didn't do transactions at all, and they only added that later, around 1983 or 1984. The rest down here, HyRise, Hekaton, MemSQL, HANA, NuoDB, and HyPer, are all in-memory systems, and the main thing to see is that no two systems implement MVCC exactly the same way. So we need to figure out what it is about these different implementations that makes one better than another, identify that, and implement it in our system. Before we get to that, let's talk about what we now need to store in our tuples to do multi-version concurrency control. In every single tuple we're going to have a header that keeps track of metadata about this particular physical version of the tuple. You always have a transaction ID, which corresponds to the transaction that is either holding the lock or latch on this tuple or the transaction that created this version. Then we have a begin and an end timestamp, which correspond to the visibility of this version: any transaction with a timestamp in between that range should be allowed to see it. Then we have a pointer that points to the next or previous version of the tuple in our version chain, which I'll explain later, and there may be some additional fields for whatever extra metadata you have; for two-phase locking, for example, we could put the read lock and the write lock in there, and the other protocols do other things. The key thing I want to point out is that if we ignore the additional metadata, we have four fields per tuple, and in every system these are all 64-bit integers or pointers, so we have four 8-byte fields to store per tuple. That doesn't seem like a lot, but when you look at larger database sizes it becomes quite significant: 32 bytes per tuple means that if I have a billion tuples, I have to dedicate 32 gigabytes of memory just to store the multi-versioning metadata. That's a lot, and it's meant to show you that although multi-versioning seems awesome and provides a lot of benefits, it doesn't come for free. This will come up multiple times throughout the semester: there's a classic trade-off in databases of compute versus storage, compute versus memory, and this is one good example. We'll see next class that in Hekaton you can actually get rid of one of these fields, but it's still a non-trivial amount.
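A rough sketch of that per-version header is below; the field names are illustrative, not taken from any particular system, but they show where the 32 bytes go.

```cpp
#include <cstdint>

// Per-version header as described above (names are illustrative). With four
// 64-bit fields this is 32 bytes of metadata on every physical version.
struct VersionHeader {
  std::uint64_t txn_id;    // txn holding the latch on / that created this version (0 = free)
  std::uint64_t begin_ts;  // first timestamp at which this version is visible
  std::uint64_t end_ts;    // last timestamp at which this version is visible (infinity for newest)
  VersionHeader *next;     // adjacent version in this tuple's version chain
  // ...protocol-specific metadata (e.g., read/write locks for a 2PL variant) would go here
};

// A transaction with timestamp `ts` may see this version if ts falls inside
// [begin_ts, end_ts) -- the minimal visibility check under snapshot isolation.
inline bool IsVisible(const VersionHeader &v, std::uint64_t ts) {
  return v.begin_ts <= ts && ts < v.end_ts;
}

static_assert(sizeof(VersionHeader) == 32, "four 8-byte fields per tuple on a 64-bit target");
```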
The first design decision we have to deal with is the concurrency control protocol, and there are essentially three approaches you can use: timestamp ordering, OCC, or the approach from serializable snapshot isolation, SSI; I'll cover that last paper in more detail on Monday. I'm going to skip the discussion of the latter two here, because OCC is what we discussed last class, and it's fairly easy to see how you take OCC and make it multi-versioned: you do all your writes to your private workspace, and when you want to commit you do the validation as you normally would, and the workspace essentially becomes the new versions; you just acquire the lock on a physical version before you're allowed to read or write it. I want to spend more time on timestamp ordering, because this is how it was defined in the original MVCC work from 1978, but I'm going to show how to do it in the context of an in-memory database. Say we have a sample database here with two versions. The first thing to point out is that we store the transaction ID, and we use it to represent the transaction that holds the lock or latch on this physical version. Then we have the begin and end timestamps, and for the end timestamp, if it's the newest version we just set it to infinity, meaning anybody with a timestamp greater than the begin timestamp should be able to see it. The new thing we add is a read timestamp, and just like in the basic timestamp ordering protocol we saw before, this records the timestamp of the last transaction that read this version, and it always has to increase; it can never go back in time. So let's say we have one transaction that wants to do a read on A followed by a write on B, and assume that when the transaction starts we give it some transaction ID; it doesn't matter what method we use to generate it, it just has to be unique and always increasing. In this case the transaction ID is 10. A transaction is allowed to read an object if its timestamp is within the range between the begin and end timestamps and the transaction ID field is set to zero, meaning no transaction holds the latch on this physical version; in this case that's true, so we're allowed to do the read. Then we need to update the version's read timestamp to move it forward, and we do this with compare-and-swap: we read the current value and get a 1, then we invoke the compare-and-swap and say, replace 1 with 10 if it's still 1 when you check; in this case it is. If we came back and saw that someone else had changed it, then if the new value is less than our timestamp we would try again, and if it's greater than our timestamp we wouldn't bother, because the read timestamp only ever has to increase. Now I want to do the write on B. For this I'm allowed to create a new version if no other transaction holds the lock on this version and my timestamp is greater than the read timestamp of the last transaction that read it. As we said last class, we need this check because we don't want someone in the future, who already read the logical tuple, to have missed our new physical version. So in this case we do the compare-and-swap on the transaction ID field and set it to 10; that means we now hold the latch for it. Then we make a copy of the old version and create the new version here, and we set its transaction ID to us, which means we implicitly hold the lock on it, and now we can update the value however we want. The next step is to go back and set the end timestamp of the old version to our transaction ID, which tells anybody who comes along: if your timestamp is greater than 10, you don't want this version, so follow the pointer, which I'm not showing here, to get to the next version you might want to read. Now, as described, this is not completely serializable, because if I release the locks, someone may come along and read this newer version even though I haven't committed yet. We'll see in Hekaton and the other protocols next week that there's extra stuff you have to do so that other transactions know whether you've committed or not and don't read data from uncommitted transactions. I'm just showing you how you can use compare-and-swap to flip these different fields embedded in the header to tell other transactions what's going on, without needing a central data structure. That's another key thing we get from an in-memory system: when I set these locks I can do it directly inside the tuple header, I don't need a separate lock table, because I know this thing is never going to be swapped out to disk. Is this clear?
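Here is a hedged sketch of that read/write walkthrough in code. The structure and names are hypothetical, and real systems (Hekaton, Cicada, etc.) add commit-status checks so readers never see versions from uncommitted writers; this only shows the compare-and-swap mechanics on the header fields.

```cpp
#include <atomic>
#include <cstdint>
#include <limits>

struct Version {
  std::atomic<std::uint64_t> txn_id{0};   // 0 means no transaction holds this version
  std::atomic<std::uint64_t> read_ts{0};  // timestamp of the newest reader
  std::uint64_t begin_ts{0};
  std::uint64_t end_ts{std::numeric_limits<std::uint64_t>::max()};
  Version *next{nullptr};
  std::uint64_t value{0};                 // tuple payload (one attribute for brevity)
};

// READ: allowed if our timestamp is inside [begin_ts, end_ts) and nobody holds
// the version; then bump read_ts forward with CAS (it only ever increases).
bool ReadVersion(Version &v, std::uint64_t ts, std::uint64_t &out) {
  if (ts < v.begin_ts || ts >= v.end_ts) return false;
  if (v.txn_id.load() != 0) return false;              // someone holds the latch
  std::uint64_t last = v.read_ts.load();
  while (last < ts && !v.read_ts.compare_exchange_weak(last, ts)) {
    // CAS failed: `last` now holds the newer read_ts; retry only while it is < ts
  }
  out = v.value;
  return true;
}

// WRITE: allowed if no one holds the version and our timestamp is not older
// than the last reader; grab the version via CAS on txn_id, then link in a new
// version and expire the old one by setting its end timestamp.
Version *WriteVersion(Version &old_v, std::uint64_t ts, std::uint64_t new_value) {
  if (old_v.read_ts.load() > ts) return nullptr;       // a later reader already saw it
  std::uint64_t expected = 0;
  if (!old_v.txn_id.compare_exchange_strong(expected, ts)) return nullptr;
  Version *new_v = new Version();
  new_v->txn_id.store(ts);                             // we implicitly hold the new version
  new_v->begin_ts = ts;
  new_v->value = new_value;
  old_v.next = new_v;                                  // single-direction version chain
  old_v.end_ts = ts;                                   // readers past ts must follow the chain
  return new_v;
}
```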
Okay, so the next and arguably more important design decision is version storage. As I said, under MVCC the database system creates multiple physical versions for a single logical object, and the question is how we actually want to store those different versions; there are a bunch of different ways. The way the system keeps track of these different versions is through what is called a version chain. We need an efficient way to say: here's the logical tuple, here's the first physical version, and here's how to find the physical version that is actually visible to you. You can think of the version chain as a singly linked list that goes from one version to the next. You only want it to go in one direction, because you can't compare-and-swap two addresses at the same time. What will happen is we do a lookup on an index, or a scan on the table, and we find the head of the version chain, and depending on how we order it, the first version may be the oldest version or it may be the newest, and we may need to traverse the chain to find the one we're looking for. As we'll see, the indexes always point to the head of the chain, and then the thread knows how to jump from one version to the next. One thing I won't talk about too much, but that in-memory systems end up doing, is storing these versions in thread-local storage: each thread has its own, I think the paper calls it a memory space or memory pool. So when my thread creates a new version, I don't store it in a global heap; I store it in memory that's local to me, meaning it's physically close to my core, so on a multi-socket machine it's my local memory rather than another socket's. I do this because I want to avoid contention on any centralized data structure: I'm the only thread that can be writing to my local storage, so I don't need to acquire a latch to get a new free slot, I can just grab the next one knowing nobody else is trying to steal it from me. This does mean that the version chain for a single logical tuple may span the local memory pools of different threads, which is fine, because a thread can read anywhere in memory; it's just that updates always have to be local to you. The different storage schemes we'll go through now determine where in memory we store these new versions and what we actually store. There are three approaches: append-only storage, time-travel storage, and delta storage, and we'll go through examples of all of them. With append-only storage, the basic idea is that every time I create a new version, I just insert a new tuple into my regular table heap. There's a single storage space for the table; we can ignore how things are actually spread across the thread-local pools, but the way to think about it is that there's one heap, one chunk of memory, that everyone puts new versions into. What I'm showing here is that for a tuple, the pointer just points to the next version: the head of the version chain points to A1, so that's the oldest version, then we follow the pointer to get to A2, and B1 doesn't have another version so it doesn't point to anything. So all the physical versions of a single logical tuple are stored in the same table space, and any time I update a tuple, I copy the contents of the tuple I'm updating into a new slot in the table space with the updated value, and then I go back and update the pointer to point to my new version. Up until that point nobody can see my new version, because I haven't updated the pointer yet, so I do a compare-and-swap to change that pointer to point to my version, and when somebody else later creates another new version, they follow the chain to mine and do a compare-and-swap on my version's pointer to point to theirs.
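A minimal sketch of that publish step for append-only storage with an oldest-to-newest chain is below; the names are illustrative. The single compare-and-swap on the tail's pointer is what makes the new version visible atomically.

```cpp
#include <atomic>
#include <cstring>

struct Tuple {
  std::atomic<Tuple *> next{nullptr};  // pointer to the next-newer version (null = I am the newest)
  char data[64];                       // fixed-length tuple contents
};

// Copy the current tail version into a fresh slot, let the caller apply its
// update to new_version->data, then publish with one CAS on the tail's pointer.
bool AppendNewVersion(Tuple *tail, Tuple *new_version) {
  std::memcpy(new_version->data, tail->data, sizeof(new_version->data));
  Tuple *expected = nullptr;
  // Fails if another writer already hung a newer version off this tail, in
  // which case the caller must walk to the new tail and retry (or abort).
  return tail->next.compare_exchange_strong(expected, new_version);
}
```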
This is in some ways the easiest scheme to implement, because the tuples are always the same size, so it's easy to make a copy and put the contents of the old tuple into the new one. In this example I'm showing oldest-to-newest, where the head of the version chain is always the oldest version and you have to traverse it to find the newest. The other approach is newest-to-oldest, where the head of the version chain is always the newest tuple, and if you need to go back in time and find an older version you follow the chain the other way. There are obviously trade-offs for each of these, and it depends on what your application or workload is actually trying to do. If you have a lot of queries that always want the latest version, then newest-to-oldest is probably what you want, because you follow the index, land on the head of the version chain, and voila, that's the exact tuple you wanted. If you're doing a lot of read-only queries that want to look at older versions, then oldest-to-newest might make sense. But it also depends on how many indexes you have, as we'll see in a second, because with newest-to-oldest, every time you create a new version you have to go update every index's pointer to the head of the version chain, whereas with oldest-to-newest you don't. So again, they all have different trade-offs; no one way is strictly better than another. Yes? The question is: when you create a new version, instead of making a copy of the tuple and applying my change to the new copy, why can't I copy the old data out to a new node, apply my change in place in the old node, and just move the pointers around, so the location of the latest version never changes and I don't have to update the indexes? That is one way to do it, and it's essentially what time-travel storage is: instead of having a single table space, you have a separate space where you put the older versions, and it gives you basically the same thing. Whether exactly what you're proposing is good again depends on the application and the workload; I'd have to think about it, but I think you might have some correctness issues, because you'd need to update two pointers atomically, and that's hard. With the version chain always going in one direction, you just do a compare-and-swap on one address and that automatically fixes your problem; we can take this offline. Yes? The next question is: is it common for these to be knobs in the database, not just oldest versus newest but all of these design decisions, so you can change which scheme you're using? Nobody does that, because it's way too much engineering; most systems pick one and that's what they use. In the back: wasn't it a huge effort to implement all of these in Peloton? Yes, it was; we had three students who took this class and a visiting PhD student bang it out over four or five months. It was a lot of work.
So the next version storage scheme is time-travel tables. Instead of having all our versions stored in the same table heap, we have a separate table, a separate storage area, where we put either the oldest versions or the newest versions. In this case the version chain points from the main table version of the tuple into the time-travel table where we store the older versions. On every update, what we do is copy the current version into the time-travel table, update its pointer so it points at the right place in the version chain, then do a compare-and-swap to overwrite the master version in the main table with the new value, and fix the pointer so the version chain is correct. This is what HANA does; HANA actually goes oldest-to-newest, so the master version in the main table is always the oldest version. I don't remember exactly why they argue for this, I'd have to go check their papers again, but I think the gist was that they wanted to be optimized for reading older data to do analytics, so you can always find the oldest version in the main table. For SQL Server, when you buy their temporal table extension, they give you this architecture: the main table always stores the master version, and older versions get copied over to the other side. The last approach is delta storage. This is like time-travel tables in that you have the main table and a separate storage space for the other versions, but the difference is that instead of storing the entire old tuple, we store only the changes that were made relative to it. So on an update, we copy the values that were modified out of the old version and put them into the delta storage. In this example, if a transaction updates only this one field, the value field, we store just what the old value was in the delta storage, and we have our pointer so that if we need to go back in time we know where to find it. Do it again, same thing: I modify the tuple, I create a new delta record, and then I can update the master record in place. So what's the obvious benefit of delta storage? You can always go back in time and recreate an older version by essentially replaying these delta records until you're back in the state you want. This is what HyPer does, this is what MySQL does, this is what Oracle does with its rollback segments, and I'll show at the end that, as I found out yesterday, Postgres is considering this as well. It also makes garbage collection really easy, because as we'll see in a second, you don't have to scan through the main table heap to find older versions that are no longer visible: you just go to one location, check what the high-water mark is for transaction IDs, and blow away the whole rollback segment below it, and that's enough.
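Here is a toy sketch of a delta record and how an older snapshot could be reconstructed by replaying deltas; the layout and names are purely illustrative, not any system's actual format.

```cpp
#include <cstdint>
#include <vector>

// A delta record stores only the columns an update changed, plus their old
// values, and chains to the next-older delta.
struct DeltaRecord {
  std::uint64_t commit_ts;                                // timestamp of the update that overwrote these values
  std::vector<std::pair<int, std::int64_t>> old_values;   // (column id, previous value)
  DeltaRecord *next = nullptr;                            // next-older delta in the chain
};

// Reconstruct the tuple as it looked at timestamp `ts` by starting from the
// master version in the main table and undoing newer changes one delta at a time.
std::vector<std::int64_t> ReconstructAt(std::vector<std::int64_t> master,
                                        const DeltaRecord *chain,
                                        std::uint64_t ts) {
  for (const DeltaRecord *d = chain; d != nullptr && d->commit_ts > ts; d = d->next) {
    for (const auto &[col, old_val] : d->old_values) master[col] = old_val;  // undo the change
  }
  return master;
}
```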
Yes? So the question is whether, for each tuple, you need to go through multiple version chains, one per column: if I modified another column long ago and need to go back in time to find it, do I have to walk a separate chain for each attribute? No; the version chain here is not per column or per attribute, it's for the entire tuple. This tuple only has two attributes and I'm only showing the value field, but if it had a third attribute and that was modified, that change would be in there as well. One additional thing we can talk about is how to deal with non-inline attributes, and this is independent of which version storage scheme you use. The issue is that, as I said, in an in-memory database, unlike a disk-based one, we store all the fixed-length attributes together contiguously in a fixed-length data pool, and all the variable-length data, large varchars, varbinaries, text fields, is stored in a separate data pool; in our fixed-length tuple we just have a 64-bit pointer to the location in the variable-length pool where that data lives. So if I create a new version, and my transaction only updates one fixed-length field, I still have to create another copy of my string or varchar field and make sure the new version points to it. An obvious way to optimize this is that if the varchar field doesn't change from one version to the next, there's no reason to make another copy of it; instead we can reuse the pointer to the variable-length value across versions whose attribute didn't change. To do this, though, I have to put a reference counter as a prefix on the field to keep track of how many tuples, how many different versions, are actually pointing at it. The downside is that it makes it difficult to relocate memory: say I'm doing compaction and want to reorganize my variable-length pool; since I don't know exactly who is pointing at me, I either have to do a sequential scan if I want to move this data to another address, or I have to add an indirection layer, which is another overhead, to deal with the fact that multiple versions could be pointing at me when I want to change the physical address. This is less of a problem if your system can do compression, because if you have strings that repeat you can just dictionary-encode them; but if it's a unique varchar that gets copied from one version to the next, one you wouldn't want to put in a dictionary, then you can use this approach to handle it. This is something we tried in our own system and ended up abandoning, because we store a variable-length pool per tile group, per block of tuples, and we didn't want tuples in different blocks pointing into variable-length pools that aren't in the same block. So we tried it and abandoned it, but it's something I'd like to look at again later.
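Here is a rough sketch of that reference-counted variable-length value; the layout and names are illustrative only, and the actual allocation/free of pool memory is elided.

```cpp
#include <atomic>
#include <cstdint>

// The counter is stored as a prefix in front of the string bytes so that
// multiple physical versions can share one copy when the attribute is unchanged.
struct VarlenHeader {
  std::atomic<std::uint32_t> ref_count{1};  // how many tuple versions point at this value
  std::uint32_t length;                     // number of bytes of string data that follow this header
  // ...the actual variable-length bytes are laid out immediately after the header
};

// A new version that leaves the varchar unchanged just reuses the pointer and
// bumps the counter instead of copying the string into the var-length pool.
inline VarlenHeader *ShareVarlen(VarlenHeader *v) {
  v->ref_count.fetch_add(1);
  return v;
}

// Garbage-collecting a tuple version only releases the value once the last
// reference goes away; returning the memory to the pool is left to the caller.
inline bool ReleaseVarlen(VarlenHeader *v) {
  return v->ref_count.fetch_sub(1) == 1;  // true: caller may reclaim the value's memory
}
```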
The last design decision is garbage collection. If you understand how garbage collection works in a memory-managed platform like Java, it's essentially the same idea: we're creating all these new versions of tuples, and eventually those versions are no longer visible to any transaction, so we want to go back and reclaim that memory. The definition we'll use is that we're allowed to reclaim a physical version when we know that no other transaction could possibly see that version. This is well defined under snapshot isolation, because we know the timestamps of all the transactions that have started, so we know what is and isn't visible to them. And obviously, if a version was created by an aborted transaction, we know that nobody should ever be able to see it, and we want to go ahead and reclaim its memory.

There are two additional design decisions we have to deal with for garbage collection: first, how we actually look for expired versions, and second, how we decide when it's safe to reclaim the memory. For the sake of time I'm only going to discuss the first one in this class. The second one we'll cover next week when we talk more about indexes, and you're going to have to implement it in your index for project two, because you have to build essentially the same epoch-based garbage collector that Silo uses. So we'll cover that more later; I'll focus on the first one for now.

There are two approaches, and the way to think about them is in terms of the scope at which they look for old versions to reclaim. The first is tuple-level scope, where the database system examines tuples individually and decides whether a version should be reclaimed or not; the two ways to do this are background vacuuming and cooperative cleaning. The other approach is transaction-level scope, where every transaction maintains its read/write set, essentially pointers to the versions it read or modified, and when it commits it hands that read/write set off to a separate garbage collection thread, which figures out whether those versions are still visible and prunes them if not. We'll see how Hekaton handles this next week; of these approaches, the tuple-level one is probably what most people use. In general, garbage collection accounts for roughly 15% of the runtime overhead of the system, which is unavoidable, because if we didn't clean up these old versions we'd run out of space.

So here are high-level examples of the two tuple-level approaches. With background vacuuming, we have a separate thread that periodically scans through the tables of the database and checks whether tuple versions are still visible. This is probably the most common approach, and it works with any of the version storage schemes we talked about. This is supposed to represent a vacuum thread: it records what the active transactions are, then scans through every single tuple, looks at the begin and end timestamps, and figures out whether any version's lifetime fails to overlap with any of those active transaction IDs. So in this case the active transaction IDs are 12 and 25: the version with the range 1 to 9 is not visible to either of them, and there's no other active transaction, so it's safe for us to go ahead and delete that version. For this other one, B2, since 12 falls between 10 and 20, we do not delete that version. This is what Postgres does, and it's probably the most common approach people use when they want to do garbage collection.
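As a rough illustration of the check the vacuum thread is doing, here's a small C++ sketch under the simplifying assumption that each version carries a begin and end timestamp and is reclaimable when no active transaction's timestamp falls inside that range. The names are illustrative; this is not Postgres's actual code.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Version {
    uint64_t begin_ts;  // timestamp of the transaction that created it
    uint64_t end_ts;    // timestamp at which it was superseded (max if current)
};

// A version can be reclaimed if no active transaction can still see it,
// i.e., no active timestamp falls inside [begin_ts, end_ts).
bool IsReclaimable(const Version& v, const std::vector<uint64_t>& active_txns) {
    return std::none_of(active_txns.begin(), active_txns.end(),
                        [&](uint64_t ts) {
                            return ts >= v.begin_ts && ts < v.end_ts;
                        });
}

// With active transactions {12, 25}: a version spanning [1, 9) is reclaimable,
// while one spanning [10, 20) is not, because 12 still falls inside it.
```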
One simple optimization you can do, which Postgres does, is to keep a dirty bit per block. When the garbage collector passes through, it checks whether that dirty bit is set to true, meaning the block was modified since the last time the vacuum came through. If it's not set, you can skip the block entirely; if it is set, you go inside and figure out what needs to be reclaimed.

The other approach is cooperative cleaning. Basically, as worker threads scan the version chains, they're already checking whether each version is visible to them, so they might as well also check whether it should be deleted and its memory reclaimed, and then the thread, as it traverses the version chain, actually cleans up the old versions for you. So in this case the transaction with ID 12 comes along, checks whether version A4 is visible to it, and if not, it sets a little flag to say that it's going to clean it, prunes it on the fly, and keeps going until it finds the version it was actually looking for. You're piggybacking the garbage collection off of the normal scans you already have to do over the version chains.

There's one additional issue you have to deal with if you do cooperative cleaning: because you only clean things when somebody scans through them, a version chain that hasn't been scanned in a while will essentially never get reclaimed. In the Hekaton paper they call these "dusty corners," so you periodically have to invoke a thread that does background vacuuming anyway, just to check whether there's anything left that needs to be reclaimed. And this obviously only works with oldest-to-newest ordering, because if the chain were newest-to-oldest and you're only looking at the newest versions, you would never walk to the end; by being oldest-to-newest, every thread is forced to walk down the version chain and will find the things to delete. Yes? So your question is: as the thread is scanning through the version chain, how does it know what should be deleted or not? You have to know what the other active transactions out there are.

The last way to do garbage collection is at the transaction level. As I said, in this approach every transaction, as it runs, collects its read/write set, and when the transaction commits you hand that off to the garbage collector, which then knows whether those versions are still visible or not and goes ahead and deletes them when they aren't. For this you still need to maintain multiple threads, because you may be creating new versions so fast that you'd otherwise run out of memory, so you have to add threads to your garbage collection to keep reclaiming space. There's a classic tradeoff between how many garbage collection threads you want versus how many transaction-processing threads you want: if you have too many transaction threads, you may not reclaim memory fast enough and you start thrashing; if you have too many garbage collection threads, then they're obviously not processing transactions, and that slows down your throughput. Figuring out the right balance between these is non-trivial.
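Here's one way the transaction-level handoff could look. This is just my sketch of the mechanics described above, a commit-time handoff of the write set to a garbage collection thread, with the types and the reclamation rule being illustrative assumptions rather than any specific system's design.

```cpp
#include <cstdint>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

// A pointer to a superseded version plus the timestamp at which it expired.
struct VersionRef {
    void* version;
    uint64_t end_ts;
};

class GarbageQueue {
public:
    // Called by a worker thread when its transaction commits.
    void Enqueue(std::vector<VersionRef> write_set) {
        std::lock_guard<std::mutex> guard(mutex_);
        pending_.push(std::move(write_set));
    }

    // Called by a garbage collection thread: reclaim sets whose versions all
    // expired strictly before the oldest active transaction's timestamp.
    void Reclaim(uint64_t oldest_active_ts) {
        std::lock_guard<std::mutex> guard(mutex_);
        while (!pending_.empty()) {
            bool safe = true;
            for (const VersionRef& ref : pending_.front()) {
                if (ref.end_ts >= oldest_active_ts) { safe = false; break; }
            }
            if (!safe) break;  // later sets are likely newer; stop for now
            // free the physical versions here (elided in this sketch)
            pending_.pop();
        }
    }

private:
    std::mutex mutex_;
    std::queue<std::vector<VersionRef>> pending_;
};
```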
Real quickly, I want to talk about how to deal with wraparound. All of these timestamps are unsigned integers that always need to be increasing, and at some point, even with 64-bit integers, we're going to hit the limit of that ID and wrap around back to zero. This is problematic because we're using these transaction IDs to figure out what's visible to us. What will happen is we'll have a bunch of versions that were created by transactions with really high timestamps, and then we wrap around back to zero or one, and now all of those versions, which should really be in the past, look like they're in the future, because their transaction IDs are greater than my transaction ID. So in this case here, if I wrap around and start a new transaction with transaction ID one, this tuple is no longer visible to me because it was created with a really, really high timestamp. We need a way to deal with this, because a bunch of weird stuff can start to happen if you wrap around and don't handle it correctly: rows that were deleted start reappearing, updated versions go back to their old state, because everything we just talked about when traversing the version chain depends on the transaction IDs or timestamps always increasing in time, and now I'm back at one.

I'll skip through this, but I think the easiest way to handle it is what Postgres actually does, and it's almost trivial in its simplicity. Basically, as you get closer to the upper limit of your transaction IDs, they will stop processing transactions. You can avoid having to stop entirely by vacuuming ahead of time and cleaning things up, but at some point you do have to stop processing transactions, because otherwise you hit all those wraparound problems. What they do is have the vacuum go through and set a flag in every tuple that is an older version from before the wraparound, to say that it's now frozen in time. It's a simple bit flag, and any transaction ID generated afterward is always considered to be newer, in the future, relative to any tuple that has this frozen bit set. So say I have a tuple created at version 10, I flip its frozen bit, and now my transaction ID is one. I check the frozen bit, see that it's set to true, and no matter what the actual timestamp or transaction ID of that frozen tuple is, I know it always has to be in the past. Basically they have to do a full scan and set all these frozen bits before the IDs wrap around, and that ensures everything is back in the past, so they avoid having to do a complete stop. If you do get to the point where you can't generate any new transaction IDs, Postgres will stop, drop you down to single-user mode, and not accept any new updates; the one thing you can do then is run a full vacuum. So they try to be aggressive and avoid this ahead of time by running the vacuum and cleaning things up before it gets too bad.
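Here's a tiny sketch of how the frozen bit changes the visibility comparison; the field names are made up for illustration and this glosses over Postgres's actual tuple header layout.

```cpp
#include <cstdint>

struct TupleHeader {
    uint64_t begin_txn_id;  // ID of the transaction that created this version
    bool frozen;            // set by the vacuum before the IDs wrap around
};

// Was this version created in my past? Once the frozen bit is set, the answer
// is always yes, no matter how the raw IDs compare after a wraparound.
bool CreatedInPast(const TupleHeader& t, uint64_t my_txn_id) {
    if (t.frozen) {
        return true;
    }
    return t.begin_txn_id < my_txn_id;
}
```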
Alright, the next thing to discuss is index management. I'm running out of time again, which is not good, but I love this stuff too much. So as we said before, for a primary key the index is always going to point to the head of the version chain, and how often we have to update that primary key index depends on how often we're creating a new head of the version chain. Plain updates are simple: you keep the existing key and just set its value in the index to point to the new head of the version chain. For updates to the primary key value itself, say I change my primary key because it's my email address, I'd need to be careful about having two keys in my index pointing to the same version chain and about the logic you'd need to handle that. To avoid that, as far as I know most systems just treat an update to a primary key attribute as a delete followed by an insert: delete the old tuple, then insert a new one. So at the highest level you can think of it semantically as the same logical tuple, but under the covers it's actually treated as two separate version chains.

Now, the tricky one is secondary indexes, and a great example of why this actually matters a lot in MVCC is the blog article Uber put out in April 2016. It came out right around the time we were writing this paper, and I felt so vindicated: yes, this is a genuinely hard problem worth thinking about. In that article, Uber talks about how they had to go from Postgres to MySQL, and part of the reason was the way MySQL and Postgres handle their secondary indexes in a multi-versioning environment. The true story is that they actually started off with MySQL, then they hired some guy that really loved Postgres, so they switched to Postgres, and then they realized that for their workload the secondary index management was a problem, so they had to switch back to MySQL. I'm sure that cost them millions of dollars; if they had read my paper beforehand, they could have saved that money and sent it to us, but that's fine.

For secondary indexes, the tricky part is what we actually point to, and there are two ways to do this. We can use a logical pointer: some fixed identifier that doesn't change no matter how many versions of the tuple we create. For that we can use the primary key, or some kind of internal tuple ID or record ID, but this requires an indirection layer to go from the logical identifier to the physical address of the head of the version chain. The other approach is to use physical pointers and always point directly at the version chain.

Let's look at this visually. Say we have a single version chain ordered newest-to-oldest, so every time I create a new version I have to allocate a new record, put it at the front of the version chain, and update the pointers. The primary key index always holds a pointer to the physical address of the head of the version chain, so every time I create a new version I have to go update the primary key index; that's unavoidable, and it's not a big deal. But for secondary indexes, if I point at a physical address, the same thing happens: every single time I add a new version, I have to go update that index. If it's one index, that's not a big deal, but if you have a lot of secondary indexes, then every single update, regardless of whether I even touched the values those indexes are keyed on, means I have to go and update all of those pointers. This is what Postgres does, and this is the problem Uber was having: in their workload they were doing a lot of updates, they had a lot of secondary indexes on those tables, and every single update required them to modify the physical address stored in all of those indexes.

A way to get around this is to use a logical identifier. One option is to just store the primary key: when you want the physical address, you do a lookup in the secondary index, get back the primary key, then do a second lookup in the primary key index, and that gives you the physical address. Now, no matter how many times I update the tuple and create new versions, the primary key doesn't change; and if the primary key does change, that's a delete followed by an insert and gets treated separately. The other alternative is to use a logical identifier like a tuple ID, which requires an indirection layer that maps the tuple ID to an address in the system. That's actually what we use in our own system, whereas MySQL, for example, stores the primary key in its secondary indexes. Neither way is strictly better than the other: if your primary key is just a 64-bit or 32-bit integer, it's small and storing it in the secondary index is fine, but if it's something really large, then having to store it in every secondary index is wasteful.
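To make the physical-versus-logical distinction concrete, here's a small C++ sketch; the map types are stand-ins for real index structures, and the names are mine, not any system's API.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

struct Version { /* payload, begin/end timestamps, pointer to next version... */ };

// Physical scheme: the secondary index stores the address of the chain head,
// so every new version means updating every one of these indexes.
using PhysicalSecondaryIndex = std::unordered_map<std::string, Version*>;

// Logical scheme: the secondary index stores a stable tuple ID, and a single
// shared indirection map is the only thing updated when the head changes.
using LogicalSecondaryIndex = std::unordered_map<std::string, uint64_t>;
using IndirectionMap        = std::unordered_map<uint64_t, Version*>;

Version* LookupLogical(const LogicalSecondaryIndex& index,
                       const IndirectionMap& indirection,
                       const std::string& key) {
    auto it = index.find(key);
    if (it == index.end()) return nullptr;
    auto slot = indirection.find(it->second);  // the extra hop: ID -> chain head
    return slot == indirection.end() ? nullptr : slot->second;
}
```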
We're short on time, so I'm just going to show one graph, and this is sort of the money shot, the takeaway graph from the paper. For this we ran the TPC-C benchmark on a machine with 4 sockets and 10 cores per socket, and along the x-axis we scale up the number of threads the system allocates to execute transactions. The main takeaway I want you to get is that even though all of these configurations are running inside the same system, you see there's quite a gap in performance. At the very top you have NuoDB, HyPer, MySQL, and Oracle, and for those there isn't one single thing I can point to and say this is what you need to do to get the best performance if you're building an MVCC system: Oracle and MySQL use rollback segments, that is, delta storage; NuoDB is doing append-only; some of them use logical pointers for their indexes; others do garbage collection in different ways. So again, it depends on the workload, it depends on what the application wants, and it depends on what's in your database: the size of the database and how many indexes you have will change things a lot. That was one of the main takeaways of the paper: it depends.

One interesting thing we do see, though, is that at the bottom is Postgres. The things that I think hurt Postgres are the append-only storage and the physical pointers for secondary indexes; the combination of those two causes the most problems. And as I said, there's a blog article that came out just yesterday, January 30th, written by Robert Haas, one of the lead Postgres developers, who works at EnterpriseDB. He basically talks about switching from the append-only storage model to the delta storage model, the same thing used in MySQL and Oracle. So this is an active thing people are considering right now to try to fix this in Postgres and boost these numbers, and we'll see how that goes. I actually think this is probably what we should do in our own system too, but right now we're using append-only storage.
Again, this is a tradeoff: if you're doing a lot of reads, append-only storage can be good, depending on how you organize things. So in some ways it's sort of disappointing, right? You get through this great twelve-page paper that talks about all the details of all these different things you have to consider in an MVCC database, and then the answer is: it depends. That was disappointing.

Alright, so the main takeaways. MVCC is probably the best approach for doing transactions in a mixed-workload environment. We'll talk a bit more about this during the semester when we get to analytical queries; for concurrency control we're mostly focusing on OLTP operations, but we also want to be executing analytical queries at the same time, what's called a hybrid or mixed-workload environment, and MVCC, because it has the nice property that readers are not blocked by writers, is ideal for that scenario. I think that's part of the reason people choose to use it.

There are a bunch of interesting research topics I want to explore in the context of our own system that I don't think have been addressed. One is how to do block compaction; this is maybe only an issue for append-only storage, but with variable-length pools and compression and other things I think it could be interesting. We're also looking at how we can exploit the semantics of snapshot isolation to do online schema changes. The way to think about this: if I add a new column, say I call ALTER TABLE and add a column, most systems today will go apply that change immediately, so they scan through every single tuple, make a copy of it, add the new column, and insert that as the new master version. I don't think you have to do that. I think we can do it lazily: we can have different versions of tuples that are governed by different versions of the schema, so you can call ALTER TABLE to add a column, it immediately comes back and says it's done, but underneath the covers you control when that column actually gets propagated through the table. This is actually a paper we're working on now, so if you're interested in getting involved, send me an email.

We're well over time, so I'm going to stop now. I'll discuss project two in more detail on Monday; I'm short on time for it today, but the website will be up. You will have to implement a skip list: a latch-free concurrent index, and you need to support garbage collection and forward and reverse iteration. The skip list is the primary index used in MemSQL; they are huge on skip lists and claim it's amazing because it's latch-free. It's not that great, so you will be implementing something that our other indexes can beat: the Bw-Tree will beat a skip list, and the ART index and the B+Tree can crush both of them. The reason I have you implement a skip list is that it's actually really easy to implement, or I shouldn't say easy, but relative to the others it's much simpler. The first year I taught this class we had everyone implement the Bw-Tree, and that was a huge mistake; it was way more complicated and I completely underestimated it. So now we're doing the skip list, because it covers all the important things we want to talk about, like compare-and-swap, garbage collection, and how the index fits into the rest of the system, but it's something you can actually implement in a month. I'll cover this more on Monday.
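Since the whole point of the project is to build the index latch-free, here's the flavor of the compare-and-swap step you end up writing at each level of a skip list: link a new node between a predecessor and successor only if nobody else changed the pointer in the meantime. This is just the linking step under my own simplified node layout, not a full skip-list implementation.

```cpp
#include <atomic>

struct Node {
    int key;
    std::atomic<Node*> next;
    Node(int k, Node* n) : key(k), next(n) {}
};

// Try to splice new_node in between pred and succ at one level of the list.
// Returns false if another thread changed pred->next first, in which case the
// caller must re-find its insertion point and retry.
bool TryLink(Node* pred, Node* succ, Node* new_node) {
    new_node->next.store(succ, std::memory_order_relaxed);
    return pred->next.compare_exchange_strong(
        succ, new_node, std::memory_order_release, std::memory_order_relaxed);
}
```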
I'll bump the deadline for it by an extra few days, because we won't announce it until later. If you haven't signed up for your group on the spreadsheet yet, please do that, or contact me if you don't have a group and we'll find somebody for you. Okay? Alright guys, have a good weekend. See ya.