For today's class, we're going to continue our discussion on multi-version concurrency control. So last class, we spent time talking about the four major design decisions when you build an MVCC database system: the concurrency control protocol, even though concurrency control is in the name of what we're designing, again, this is how we're going to coordinate transactions; the version storage, basically every time we update a tuple, where do we actually store that new physical version; garbage collection, cleaning up old versions; and then index management, basically how do we keep track of what the secondary indexes point to? They either point to the actual physical version, the head of the version chain, or they point to some logical identifier. So for today's class, we're now going to go into more detail and look at real-world implementations of modern MVCC systems. And what will happen is we'll do this in the context of the four design decisions that we talked about last class, but we'll spend a little more time talking about how they actually implement the first one, the concurrency control protocol itself, how they're going to coordinate transactions. But then as we go along, there'll also be other aspects of these database systems that we'll cover that are potentially good ideas for having a high-performance MVCC system but that don't fit exactly into the four design decisions that we talked about before. And the way we'll proceed with the discussion is we'll first start with Microsoft Hekaton, because that was sort of the first implementation of a modern in-memory MVCC system, in 2011-2012. And that will serve as the baseline for understanding how these other systems implement MVCC. So if you understand this, then for the others we can ask, how can we tweak it, do different things, or add additional things to that MVCC implementation that improve on what Hekaton originally proposed. Again, Hekaton focused on OLTP. These two here, HyPer and HANA, focused on OLTP with OLAP. And then we'll come back with Cicada to discuss how to do OLTP again, all right? So who here has actually ever heard of Hekaton before that hasn't hung around me? That's OK. So it started off as a research project at Microsoft, and now it actually ships in SQL Server today. But when you read the documentation, they don't call it Hekaton; I think it's just called MVCC storage. But if you look at the papers, they'll refer to this project as Hekaton. So the Hekaton project started in 2008 at Microsoft as a way to extend the lifetime of SQL Server. At this point, in 2008, SQL Server was 20 years old. And so they started a project to say, well, what can we do to improve the performance and modernize the system such that SQL Server is still relevant 20 years from now? SQL Server makes Microsoft a little money. So they said, how do we keep that gravy train going 20 years into the future? So they decided to work on how you actually build a new OLTP engine specifically for the SQL Server ecosystem. So if you don't know the history of SQL Server: in 1992, 1993, they actually bought a license to Sybase, and then they ported it to Windows NT. And a little bit after that, they bought the source code license from Sybase and ended up forking it. And at this point, they've rewritten everything from scratch, I'm sure, all over again, but there are still remnants of Sybase's influence in SQL Server today.
Like, SQL Server uses T-SQL, which is a variant of SQL, because that's what Sybase used. And certainly now today, SQL Server is considered a state-of-the-art system, whereas Sybase is more or less in maintenance mode. SAP bought Sybase. They still make a shitload of money on it, because we had the guy that leads the Sybase project give a guest lecture in this class last year. But no one would start by saying, oh, I'm going to build my new system based on Sybase. Nobody does that. So to start this project, Microsoft got two very famous database people. Paul Larson was in MSR. He's a famous database person; he's been around for a while. He invented linear hashing in the 1980s and did a bunch of work at DEC. Mike Zwilling was actually one of the main people they hired at the very beginning when they started SQL Server to help them port the Sybase code over to Windows NT. And he worked at the University of Wisconsin on a very influential system called Shore. So they got basically the two best people to lead this project. So unlike HANA, HyPer, Cicada, and a bunch of other systems we'll talk about this semester, instead of building a brand new database from scratch, for the Hekaton project they had to make sure that it could integrate with the existing SQL Server ecosystem. And this is more of a business decision. You have millions and millions of people running SQL Server today. There are all these tools and all these visualization frameworks and things built on top of SQL Server that you want to keep around because that helps sell the product. So if you threw all that away and built Hekaton as a new standalone system, you'd basically have to re-implement all that ecosystem. So for this reason, they had to make it run inside SQL Server as an engine. And the way it works today is that when you call CREATE TABLE, you can pass a flag at the end that says in-memory, and that then ends up using the Hekaton engine. Another interesting thing about the design decisions they're going to make is that part of the reason they chose MVCC and the approach that they're pursuing is that they want to be able to support all possible OLTP workloads with the same predictable performance. So I don't think we talked about H-Store or VoltDB in the intro class, but this is a system that I helped build when I was in grad school. And the way it works is that you have a single-threaded execution engine that can execute one transaction at a time. Of course, you parallelize this across multiple cores, but now that one transaction is super fast because there's no contention. You don't require locks or latches because nobody else is running at the same time. I can run almost at bare-metal speed. But the problem is, as soon as I touch data across multiple cores, then I have to hold locks and then things slow down. So some of your transactions will be insanely fast; other transactions will be super slow. And in the Microsoft world, that's a hard product to sell, because basically you're saying, hey, go buy our new product, give us a lot of money; for some of your transactions, it'll go really great. 90% of your transactions run really fast, 10% of your transactions run really slow. You can't sell that; no one would buy that. The example I like to use is you try to sell a car and say, hey, this is the fastest car, it's gonna have great fuel economy, you should buy this.
But oh, by the way, 10% of you are gonna get cancer when you drive it. No one would buy that car, right? So for this reason, they're not gonna get the most optimal performance relative to something like H-Store/VoltDB, but it'll at least be predictable. And this is what MVCC is gonna provide us. So at its core, again, it's gonna use timestamps to determine the ordering of transactions and what versions are visible to them. But unlike last class, when we talked about timestamp ordering and basic MVCC, transactions are now gonna have two timestamps. So when we talked about timestamp ordering, again, the transaction gets assigned a timestamp when it shows up. In OCC, you get a timestamp when you validate. And in this world, you're gonna have both. You're gonna get assigned a timestamp when you start, and then you'll get another commit timestamp when you have to pass validation. And then, again, you're using this information to go figure out what is actually visible to us. All the tuples are also gonna contain two timestamps. We'll have the begin and the end. And the begin timestamp is just like it was before, where it's gonna represent the begin timestamp of the active transaction that created it or the commit timestamp of the committed transaction that created it. And then what's gonna happen, as we'll see in the next slide, is that we'll have to go back and change this thing, because we start out like this: when we create a new version, we populate it with the begin timestamp of our transaction. But then when we commit, now we have the commit timestamp, and now we gotta go back and update all our versions and replace the timestamp. And I'll show you how everything still works out and there are no anomalies and no issues even though we're changing timestamps. The end timestamp for a tuple is a little bit tricky. So this is gonna be the begin timestamp of the active transaction that created the next version in the version chain. Or infinity, meaning that it's the latest version and nothing else comes after it. Or the commit timestamp of the committed transaction that created the next version, the one that invalidated the current one and points to the next one. So again, this one's a bit more confusing, but it'll make more sense when we show an example. Okay, so say we have a simple table. Again, Hekaton is doing append-only storage, oldest-to-newest in the version chain. And say for this one we have just a single logical tuple with two versions, A1 and A2. So our transaction comes along, and at the very beginning we have to give it a timestamp. So its begin timestamp in this example here is 25. So now we wanna do a read on A. So we'll land somehow at the head of the version chain at A1 here. How we got there, whether it was an index or a sequential scan, doesn't matter; for our purposes here we don't care. So now we're gonna check to see whether this version is visible to us. So again, we have to go look at the begin and the end timestamp; that tells us the visibility of the tuple. In this case here, 25 is greater than 20, so we know that this thing comes before us, or comes before the snapshot for our transaction. So then we follow the version chain pointer and now we land here. In this case here, 25 is in between 20 and infinity. So we know that this is the version that's visible to us, and this is what we'll read.
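To make the visibility rule concrete, here is a minimal sketch (not Hekaton's actual code, and ignoring uncommitted versions for the moment) of the check we just walked through: a version is in our snapshot when our begin timestamp falls between the version's begin and end timestamps.

```cpp
#include <cstdint>
#include <limits>

struct Version {
    uint64_t begin_ts;                                        // begin/commit timestamp of the creating txn
    uint64_t end_ts = std::numeric_limits<uint64_t>::max();   // "infinity" until a newer version exists
    const Version *next = nullptr;                            // next version in the chain
    int64_t payload = 0;                                      // the tuple's attributes (simplified)
};

// A version is visible to a transaction that started at txn_begin_ts if that
// timestamp falls inside [begin_ts, end_ts).
bool IsVisible(const Version &v, uint64_t txn_begin_ts) {
    return v.begin_ts <= txn_begin_ts && txn_begin_ts < v.end_ts;
}

// Walk the chain oldest-to-newest until we find the visible version.
const Version *FindVisible(const Version *head, uint64_t txn_begin_ts) {
    for (const Version *v = head; v != nullptr; v = v->next) {
        if (IsVisible(*v, txn_begin_ts)) return v;
    }
    return nullptr;
}
```

In the example above, A1 has an end timestamp of 20, so a transaction at 25 skips it, and A2's range of 20 to infinity is the one that is visible.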
So that's just the same as before, as in last class, okay? All right, say now we wanna do a write on A. So same thing, somehow we land at the head of the version chain. We know that this is the logical tuple we wanna update. So we wanna find where we should install our new version. So in this case here again, 25 is greater than 20, so we know this is not the version we wanna append after. Come down here, 20 to infinity. So we know at this point that this is the most recent committed version, and because this is infinity and the pointer is null, we know that nobody else is trying to create a new version after it right now, right? So now we need to install our new version. So for this one, we need to be careful how we do the ordering, because unlike last class, where we would sort of take a lock on the whole tuple and then do our updates, for this one we wanna do everything with compare-and-swap. We wanna avoid having to lock this whole thing when we make a change. So the first thing we need to do is append our new version. Right, it's append-only storage; we just find a free slot in our table space, we claim it, and then we can install our new version. But now you'll see that for the begin timestamp, we're doing something special here. So I have these little markers here that say "Txn" at 25. So what does that mean? So what Hekaton is doing, instead of having a sort of special transaction ID field to say this represents an uncommitted transaction, is just using the begin timestamp to record which transaction created this version, for both committed and uncommitted versions. And what they're gonna do is take the most significant bit and set it to one to mean that the transaction at this timestamp is uncommitted. So I'm representing that by having the prefix "Txn" in front of it. This is just for illustration purposes, but underneath the covers what they're really doing is just flipping this one bit. Anybody wanna take a guess why you'd do this? So that any transaction that looks at this version won't have a begin timestamp that is greater than this one, because the bit is a one. So she's saying that because this thing's set to one, we know that nobody else can ever have a timestamp that's bigger than us. No. When we do the comparison, we're actually gonna ignore that bit. Yes. It can help you avoid cascading aborts. Does it help you avoid cascading aborts? No. Hekaton's gonna support speculative reads, so we still have cascading aborts. You can support isolation levels that prevent you from reading this, like preventing you from reading uncommitted data if you're running read committed or something like that. So he says, you could use this to prevent you from reading something that's not committed yet, by checking this and saying I don't wanna read this. So again, they're gonna support speculative reads. You wanna be able to read things that are uncommitted because you're gonna assume the transaction that you read from, the one that modified this thing, is not gonna abort. So you wanna go ahead and allow this. She says, you can read your own things you updated. Yeah, but I don't need the bit to figure that out. These are actually good; this is actually very interesting. Yes, go. So you're saying if you read this and you see the one, then you know that you read something that was uncommitted. Yeah, so go back and see.
These are the transactions that read something that was... You're almost there, yes. It's very close, yes. The garbage collector can figure out that this one is unused while traversing and then delete it, right? So he says that you can go through the garbage collector and recognize that this thing is unused and go ahead and clean it up, but it's not unused; we put something in there. No, but right now it's not committed, right? A3 is not committed. Yes. If it is committed, you remove the one and make it zero so that... Yeah, but why? What does the one actually do for us? You guys are very close. Make it last? He says make it last. Recovery? No, she said recovery. All right. And this is all very... This is good, because I read the paper and it's like, oh, of course, yeah, you do it this way. And you guys are coming at it without having all this prior knowledge, so it's interesting to see what you guys think actually matters. So what we're actually gonna do with this is that at some point we need to figure out, and he was very close to this, did we read something that was not committed? And if yes, we need to know what happened to that transaction. So by setting this bit to one, when another transaction is reading this, it'll say, oh, I see this bit set to one. That means that this thing is potentially uncommitted. It actually may be committed; we just don't know it yet, because they haven't come back and updated this timestamp to remove the one. So we'll see this in a few more slides, but there's actually gonna be this global state map that keeps track of the current state of every single transaction that's running in the system. So if I'm gonna allow speculative reads, I need a way to figure out later on whether the thing I read actually got committed or not. And by setting this to one, I don't have to go check right away whether I'm actually allowed to read this. I make a little mark in my local metadata, in my thread, that says, by the way, I read this A3, figure out what happened to it later on because it may not have committed. Yes. What if, on commit, when you go to check, that transaction is still uncommitted? So he says, what happens if, when my transaction commits, I go check the thing I read and it's still uncommitted? I have to wait. We'll get to that in a few more slides. Yes. Shouldn't A3 update the version chain pointer? We're not there yet, yes. The question is, do we have to update the version chain pointer? Yes. Okay. But if we have not updated the version chain pointer, then how can somebody read it? Yes, so at this point here, no one can see this, right? Because all I've done is install this. Instead of taking a lock on A2, I just said, let me go ahead and create this. Then now I'll do the compare-and-swap here, and if that succeeds, then I know that I was successfully able to install this new version. But at this point here, can anybody see my new version yet? Why not? The end timestamp is still infinity, right? So now I need to go back here and flip that to be my timestamp. And now my version is installed. It's still uncommitted. My bit is still set to one to denote that this is potentially from an uncommitted transaction. But at this point here, anybody that comes along that wants to read data after timestamp 25 will be able to see my new version. Yes?
Why keep track of both the begin timestamp and the end timestamp? Your question is why keep track of the begin timestamp and the end timestamp here? Yes, like why keep track of the end timestamp in, say, A1, when you could just use the begin timestamp in A2, or vice versa. Right, so his question is, why do I need to keep track of this end timestamp here? Couldn't I just follow the version chain and say, oh, well, I know that the visibility of this tuple is 20 to 25 because I have it here? But now when I do garbage collection, I always gotta keep the last one around, right? So if my tuple is super big, has a lot of columns, then I'm not actually saving anything. Do you actually have to keep the last one around? Because if you know the pointer is pointing to null, then you could just treat the end as infinity. Yes, so his point is, if you know this pointer is null, then shouldn't that just be read as infinity? Yeah, maybe. Say there's a transaction at timestamp 24. Yes. Then it should read A2 instead of A3. But if there's a transaction at 26, it would read A3. So there is a difference between A2 and A3. So he said that if there is a transaction at timestamp 24, it should read this one, and what he's saying is that you could infer it; I think you want to reduce the pointer chasing. I think it's compute versus storage cost. So if I'm scanning along, oldest-to-newest, I see this, I don't know the end timestamp, my pointer is not null, so I gotta follow the pointer to go figure out what this should actually be. And that's a waste of time; it's not worth it. So for 64 bits per version, you avoid that extra computation. So when we're comparing, say if we have a transaction at 24... Yes. The current end timestamp at A2 will be greater than 24, because we just set the MSB to one, so it will seem like infinity to it as well. Oh, so his point is, back here, this thing is set to one, so therefore the number is super high. When you do the comparison, you only look at 63 of the bits; you ignore this one. Yes. So we are introducing this extra one bit so that other transactions can read it. Why not just change the pointer only at the end? Why not change what? Sorry? Change the pointer only at the end; like after you change the pointer, then you make the one into a zero, right? No, no, no. That still stays; this is our begin timestamp. We only flip that one to zero when we get a commit timestamp, which I'll show next, okay? This is just being able to tell us, hey, I read something that's not committed, you need to figure out later on whether it actually was committed or not. All right, so now our guy commits. Again, it's OCC, so we're doing validation. So we need to go check to see whether we're allowed to commit. In this case here, yes we are. So we get our commit timestamp of 35. And all we need to do now is just go back at some later point and flip this to now be 35. It doesn't need to happen immediately, because again, this is saying that if anybody reads this, go figure out later on in my global map whether I actually committed or not. But now if I install 35, the bit is set to zero, so I know that I'm reading something that definitely was committed, and I don't need to check that map. Yes. What if somebody reads it before you've updated that? Okay, so this question is exactly this. So let's rewind now.
Let's remove the commit, and we're back at this state here. Let's now bring in another transaction. Okay, this will answer your question. So our transaction begins at timestamp 30, and it wants to do a read on A. So again, follow the version chain. 30 is greater than 20, we don't want this. 30 is greater than 25, we don't want this. So then we land here, and this is the version that we actually want to read. Again, we're doing speculative reads in Hekaton. Instead of taking a read latch or read lock on this tuple, I'm allowed to read something that's uncommitted. I'll have to check later on whether the transaction that created this version I read actually committed or not, but I'm allowed to do this. But now this guy wants to do a write on A. So in this case here, my timestamp is 30, that's after 25, but again, first writer wins. So this guy is gonna come down here and recognize that there's another uncommitted transaction here that created a new version, that I cannot create a new version after it, and therefore I have to abort my transaction. So again, by seeing this thing here, I know it's uncommitted. Actually, in this case, you would want to go check, and if you go check, you could see, oh, it actually is committed, therefore I can install my version. But for this example here, 25 is not committed yet. So it's a write-write conflict, and we go ahead and abort. Is that your question? Okay. Yes? So say the thread is not able to read, but thread one commits. Sorry, so which thread cannot read? So this thread right now, is that thread two? No, thread two is dead at this point here, when it tries to write on A. It has just done a read, and it begins at 30. So you say you're back here? Yeah. Okay. And thread one commits. Yes? So A2 is 20 to 35. Yes. And so shouldn't this one now come in and read A2 for the value? So this one will get a commit timestamp, say it gets 35 again. Then when this guy does a write... Just the read, just the read. Just the read? Yeah. It's committed at 35 and then this one executes at 30; is that allowed or not? So this guy commits. Yeah. And A2 gets an end timestamp of 35. Yes. And then this guy reads. It begins at 30, and then it would read from A2, right? Yeah, so the question is, in my last example here, this guy got a commit timestamp of 35. But now if I come back and do this read, but I have timestamp 30, would I see it? No, because under snapshot isolation, I should only see things that were committed before I started. In this case here, it's 35 when it goes and installs it. Even with a speculative read, would you be able to read it? No, you would miss it. You would miss it. But that's okay, that's still serializable. Yes. Do you have an issue with, like, missing data or reading wrong data? Because, I guess I'm wondering, how do you atomically update both the begin timestamp in A3 and the end timestamp in A2 to the same value and know that some other thread hasn't jumped through in between, like you updated one to 35 before you updated the other to 35, and so you miss some stuff or see some stuff? Yeah, so this question is, how do I make sure things stay consistent if I need to go flip these? So what would happen is, if I go do a lookup and I see this thing's flagged, right, the one says I need to go figure out what is actually going on. So if it's set to one, I go look at my map and I would say, oh, it's committed.
The real commit timestamp is 35, so I would substitute that into my local view of this. And when you do the update, you go from newest to oldest, because readers are going in the other direction, so you wouldn't have that issue. Okay, so let's jump now and talk about the state map. Again, this is a global hash map that we're gonna maintain throughout the entire database system that just keeps track of the state of every single transaction running. So active is just when the transaction has shown up; it's done a bunch of reads and writes, it has a begin timestamp, it could still do more reads and writes, we just don't know the outcome yet. Validating is when the application says I wanna commit my transaction, and now we begin the process of figuring out whether it is actually allowed to commit, right? Like we have to go check the read/write sets and see whether there are any conflicts. Committed is when we've passed the validating phase; anybody who goes and reads our data will know that it's actually been committed, but we just haven't gone back yet and flipped those begin timestamps and end timestamps to be our commit timestamp. And then terminated is when we've updated all the timestamps for the versions that we created, and at some point we'll get pruned away, we'll get thrown out of the map. So let's go through the lifetime in more detail. At begin, the transaction gets its begin timestamp, we set it in the map to active, then we do our normal processing; again, under OCC they would call this the read phase, we're doing reads and writes on the data. But as we run, we're gonna maintain the read set, the scan set, and the write set of the transaction. For the read set and the write set, these are just gonna be physical pointers to the versions that we read and wrote, right? In the case of the read set, it's going to be the versions that were produced back to the access method of the transaction, or sorry, of the query executor; I'll explain that in the next slide. And the scan set will be the actual WHERE clause that we used to do a scan on a table, because we need that to do phantom checks for serializability. Then we hit pre-commit. This is when the application tells us we wanna commit; we now get our commit timestamp and do our validation to figure out whether there are any conflicts. And in validation, again, we just check everything and see whether we're violating serializable order. But one interesting thing here is that this is when they're gonna write the new versions to the write-ahead log out on disk. So this is different from a disk-oriented system. In a disk-oriented system, remember, you're just appending to the write-ahead log buffer, and at some point the buffer gets full, or there's a timeout when you're doing group commit, and then all your log records get flushed to disk, even though that transaction may not have committed yet. In the case of Hekaton, they're gonna buffer all the log writes in memory and only flush them out when the transaction actually commits. So that means if there's a crash and you come back, you're not gonna see any log records from transactions that did not commit; you only see log records for committed transactions.
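Here is a minimal sketch, not Hekaton's real data structures, of the two pieces of bookkeeping just described: the flagged begin-timestamp field (most significant bit set means "this holds an uncommitted transaction's ID") and a global map from transaction timestamp to its current state. The names are illustrative, and a real implementation would make the map latch-free.

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

enum class TxnState { Active, Validating, Committed, Terminated };

// MSB set => the field holds an (possibly uncommitted) transaction's ID
// rather than a final commit timestamp.
constexpr uint64_t kTxnFlag = 1ull << 63;

inline bool IsTxnId(uint64_t begin_field) { return (begin_field & kTxnFlag) != 0; }
inline uint64_t StripFlag(uint64_t field)  { return field & ~kTxnFlag; }

// Global transaction state map. A mutex keeps the sketch simple; Hekaton
// would want this latch-free like everything else.
class TxnStateMap {
public:
    void Set(uint64_t txn_ts, TxnState s) {
        std::lock_guard<std::mutex> g(mu_);
        states_[txn_ts] = s;
    }
    TxnState Get(uint64_t txn_ts) {
        std::lock_guard<std::mutex> g(mu_);
        auto it = states_.find(txn_ts);
        return it == states_.end() ? TxnState::Terminated : it->second;
    }
private:
    std::mutex mu_;
    std::unordered_map<uint64_t, TxnState> states_;
};
```

A reader that hits a version whose begin field has the flag set records the creator's ID locally and later consults the map to find out whether that transaction really committed.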
Then we do the commit: we set our transaction state to committed, we go through and update our version timestamps, flipping the begin timestamps and end timestamps to now be the transaction's commit timestamp. Then at that point we're done, we set ourselves to terminated, and we'll get cleaned up later on. All right, so let's talk about this metadata we're tracking, because this is important for understanding how we're actually maintaining serializability. So the read set and the write set, again, these are gonna be pointers to the physical versions in memory that I accessed or created. And for the read set, it's gonna be the physical versions that were produced back to the access method that I used to execute the query. So the thing with this is, say I have a version chain and I have three versions in it; the access method does a scan and wants to get back a tuple, it's only gonna get back one version. So my read set doesn't contain the first two versions that I looked at that weren't visible to me; I don't care about those. It's the one I then returned back to the query executor. Now, up in my query plan, that one version may get pruned out. Like if it gets fed into a join operator and the join operator says, I don't need this tuple and throws it away, I still need to know about it, right? Even though it doesn't show up in the final result of the query, because the access method returned it, I have to track it. And then the scan set is gonna be, think of this as the bare minimum of the information we need in order to re-execute a scan operation. So think of it as just the WHERE clause of a query. And what's gonna happen is, in our validation, we're gonna basically re-execute the scan all over again and see whether we get back the same result. Because if we don't get the same result, then we know there was a phantom: somebody inserted or deleted something since we started, we got back a different result, and therefore we have to abort the transaction because that would violate serializable order. Now, if we're running at a lower isolation level, we don't care about this, but for serializability we do. Yes. When? During the validation phase, right? So I'm checking my read set to see whether I've accessed anything that is from an uncommitted transaction and whether it committed or not, and then I run my scans again to figure out, if I run this a second time, do I get the same result? If I don't, then I know there's a phantom and I abort. So you may think this is kind of heavy-handed, right? Like if my scan reads a billion tuples, when I do validation, I'm gonna read a billion tuples again. Yes. So where does the abort happen in this? So this is just the metadata we have to track in order to do this. The abort would happen here, or actually here: if I have a write-write conflict, I can abort myself right away; under validation, if I run the scan again and get a different result, then I kill myself. Okay. And do you remove yourself from the global hash table so that the hash table doesn't blow up? Well, yeah, you just skip down to terminated and say I'm not gonna commit, and then the garbage collector will come through and clean up any versions that shouldn't be around. It's more for the other ones that are kind of waiting on you, is my point, like if they see that you're not... Yeah, we'll get to that.
Paulina, a question, or no? Is the WHERE clause, what, sorry? Also a clause of a predicate? Yeah, so the WHERE clause contains predicates. A predicate would be like A equals one and B equals two. Each of those, A equals one, I call that a predicate. Yeah, okay. So it's a set of predicates. Yes. Okay. All right, and the commit dependencies he was asking about here: this is basically a list of all the transactions that are waiting for my transaction to commit, because they read something that I wrote before I committed. So think of this as a pub/sub system. A transaction reads my uncommitted data; it then subscribes to my commit notification, whether I'm going to finish or not. And if I read something that you wrote and you haven't committed yet, then I need to wait to see whether you commit before I can commit. Yes. Is it going to use the commit timestamp to do the scan again? Yes, his question is, would I use the commit timestamp to re-execute the scan? Yes, you have to, because that's when I'm installing everything. All right. Yeah, correct. Can you have circular dependencies, because the commit timestamp, like the original timestamp, might change, and somehow you get a circularity in the dependencies there? The question is, can you have a circular dependency in the commit timestamps? No, because when I ran the first time, I can write a bunch of stuff; I'm before you, but I haven't committed yet, so you can't see any of my stuff. And if your timestamp is greater than mine, then I won't read the stuff you write. It's always one direction. All right. So in the Hekaton paper, they actually propose and evaluate both an optimistic and a pessimistic version of what I described. The optimistic is what we've already talked about. Again, the key thing there is that in order to ensure serializability, I have to check for phantoms by re-running my scans at commit. And when you download SQL Server and you run an in-memory table, this is actually what you get. But they also implemented a pessimistic version based on two-phase locking. And the main thing to understand is that it doesn't have a validation phase, because they're going to use index locks and shared locks and other special lock types to avoid any phantoms, and they have a background thread to do deadlock detection. All right. And so this is the only graph I want to show you from the paper. Again, the paper is from 2012, so relative to the other systems we're talking about, this is pretty old, and the hardware they're running on is a bit dated. The main takeaway is that as you scale up the number of threads for this sort of high-contention workload with only 1,000 tuples, the pessimistic version is worse than the optimistic one. And this gap may not seem like a lot, but if you look at the numbers, the optimistic version is doing at most 1.5 million transactions a second, and the pessimistic one is doing 1.2 million. So the difference is 300,000 transactions per second. That's actually a lot. Running this particular workload, and I don't know the exact details of the hardware, MySQL can maybe do at most 100,000 transactions a second; if you disable logging, maybe a little faster. So this is a lot, even though relatively between the two of them it doesn't seem like much, right?
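Going back to the commit-dependency list from a moment ago, here is a rough sketch of that pub/sub idea, loosely based on the dependency counter and abort flag the Hekaton paper describes; the names and structure here are illustrative, not the real API.

```cpp
#include <atomic>
#include <mutex>
#include <vector>

struct Txn {
    std::atomic<int>  commit_dep_counter{0};  // transactions I'm still waiting on
    std::atomic<bool> abort_now{false};       // set if someone I depended on aborted
    std::mutex        mu;
    std::vector<Txn*> commit_dep_set;         // transactions waiting on *me* (my subscribers)
};

// Called when `reader` speculatively reads a version created by the
// still-uncommitted `writer`.
void RegisterCommitDependency(Txn &reader, Txn &writer) {
    reader.commit_dep_counter.fetch_add(1);
    std::lock_guard<std::mutex> g(writer.mu);
    writer.commit_dep_set.push_back(&reader);
}

// Called once `writer` reaches its final outcome: notify everyone who read
// its uncommitted versions. A reader can only finish committing once its
// counter drops to zero and abort_now was never set.
void ResolveCommitDependencies(Txn &writer, bool committed) {
    std::lock_guard<std::mutex> g(writer.mu);
    for (Txn *r : writer.commit_dep_set) {
        if (committed) r->commit_dep_counter.fetch_sub(1);
        else           r->abort_now.store(true);
    }
    writer.commit_dep_set.clear();
}
```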
So this paper basically said, an optimistic approach for an in-memory multi-version system is the way to go, and not the pessimistic one, okay? All right, so what are some of the takeaways we can get from Hekaton? From a design standpoint, they talked about having this global map to keep track of the state of everything, they're gonna allow speculative reads, and of course everyone's gonna do first-writer-wins. But some of the things that came out of the paper that they discussed were, if you're designing an in-memory system, the sort of tenets you should follow. The first is that they said you only wanna use lock-free or latch-free data structures. So that means no latches, no spin locks, no critical sections; everything should always be written in a lock-free manner. And this would be for your indexes, your transaction map, the memory allocator, the garbage collector. So I agree with everything except for this one here for indexes. So next week you guys will read about the Bw-Tree, which is a lock-free index that they built as part of the Hekaton project. We ended up implementing the same thing. And it's not crap, but it loses to indexes that still use sort of traditional locking techniques, right? The way Hekaton originally started was they went all in with skip lists. A skip list is a lock-free data structure that sort of looks like a probabilistic B+tree that you can make lock-free. So in the original versions of Hekaton they were like, skip lists are the way to go, this is how they designed the system. Then they said, actually, skip lists are crap, you wanna use a Bw-Tree. But even then, the Bw-Tree is gonna lose to a regular B+tree. This is what you guys end up building for the second project. What's also interesting about this is that the one database system out there that is very big on skip lists is MemSQL, right? And the reason is because one of the co-founders of MemSQL was at Microsoft at the time, working on SQL Server when the Hekaton project was going on, and he saw a bunch of these internal talks from Paul Larson and Mike Zwilling, the two guys leading the Hekaton project. And in the very beginning of the project they were like, skip lists are the way to go, this is what you should be using. So the MemSQL guy saw that and, you know, sort of borrowed those ideas when he went off and built MemSQL, but he missed the second half of the talk, where basically they said skip lists are crap and that you wanna use a Bw-Tree. And I think still today, MemSQL is very big on skip lists, but again, the research shows that it's garbage. So, how do I say this? With very rare exceptions, you never wanna use a skip list. There are some cases, like for small in-memory stuff, where a skip list might be okay; it has some nice properties for memory fragmentation. But in general, for an index, and we'll cover this all next week, skip lists are a bad idea, don't use them. And don't use the Bw-Tree either, even though we still use it. All right, the other interesting thing they talk about, and I agree with this, is that you wanna minimize the number of serialization points or bottleneck points in the system. And the only place you need one in the case of Hekaton is when you hand out the begin and commit timestamps, 'cause these always need to be monotonically increasing and you can't give out duplicate timestamps.
So the way they do this is with a compare-and-swap or atomic addition, a single instruction to add one to the counter. In the paper you guys read last class, we showed that at really high core counts this approach can get maxed out at maybe 100 million timestamps per second. So you could try to batch these things. In our current system, we actually don't do this; we do what HyPer does, and in that case there actually is a spin lock around a sort of critical section. In the case of Hekaton, they don't have that issue; they just hand out these timestamps, and it's efficient. All right. So again, the reason I spent the first 40 minutes talking about Hekaton is that it provides a baseline for understanding other modern implementations or variants of multi-version concurrency control. So we can make some observations now about what the Hekaton architecture looks like and what challenges or bottlenecks it's gonna face when we look at workloads other than OLTP. The first thing, which I've already mentioned, is that the read-set and scan-set validation is gonna be really expensive if our transactions access a lot of data. Typically under OLTP workloads, transactions update maybe a couple dozen tuples or access a couple dozen tuples. So in that case, the scan set or the read set is gonna be quite small, and we can do that validation very quickly. But now if I'm running an analytical query where I scan the entire table and my table has a billion tuples, when I do that validation I gotta go read one billion tuples all over again, and that's gonna take a long time. The next issue is that the append-only version storage that they're using, going oldest-to-newest, is gonna be bad for OLAP scans, because as I'm scanning along the version chain, I'm reading a bunch of stuff that I probably don't need just to get to the version that I actually want. Depending on how fast I do garbage collection, the version chain can get quite long. And all this pointer chasing and branching is gonna be bad on a modern CPU architecture, because these have long instruction pipelines; if I mispredict on a branch, I gotta flush my pipeline and fetch everything back in, and I'll have long stalls. The last issue with Hekaton is that they were doing coarse-grained conflict detection to determine whether two transactions conflict. What I mean by that is, all they were doing to determine whether there was a conflict is: does the pointer exist? Is there another version after the version that I'm trying to read? So let's say I have a tuple with a thousand attributes, and my transaction writes one of them. Your transaction reads the other 999 of them. You don't read the attribute that I wrote, but that's still gonna be considered a speculative read, because you read a version that I created, even though you didn't actually read the part of the data that I modified. So now if I abort, you have to abort too, because the database doesn't know anything about what you read; it just knows that you read a version that was created by this other transaction, and that transaction aborted, so therefore you have to abort too. So, right, again, we're doing this because the mere existence of the pointer is enough to say that you read something that I wrote, but there's nothing more fine-grained.
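Before moving on to HyPer, here is what that single serialization point mentioned above, the timestamp counter, might look like in a minimal sketch; the names are illustrative.

```cpp
#include <atomic>
#include <cstdint>

// One global counter; every begin and commit timestamp is drawn from it,
// so timestamps are unique and monotonically increasing across all threads.
std::atomic<uint64_t> global_ts{1};

uint64_t AllocateTimestamp() {
    // fetch_add compiles down to a single atomic add (e.g., LOCK XADD on x86).
    return global_ts.fetch_add(1);
}
```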
So the HyPer system, again, this is out of TU Munich. These are the Germans; anytime I say, oh, the Germans are beating us, we have to work harder, it's these Germans. They're very good. So HyPer was the first database they built. It's an in-memory column store using delta-record versioning going from newest to oldest. And they're gonna support more fine-grained conflict detection than what Hekaton can do; we'll see how they do that in a second. They're also gonna avoid write-write conflicts by saying the first writer wins. And then the other thing that we talked about at the end of last class is the way they're gonna handle indexes in a multi-version system: anytime you have an update on a tuple, if the attributes you're updating are covered by some index, then you have to do a delete in the index followed by an insert, right? Because otherwise you have to maintain all this versioning information inside the index to point to both of these, and it becomes a big pain. So you treat it as a delete followed by an insert, and that solves the problem. So let's look at an example here. Again, they're doing delta storage, and it's a column store. So inside of every block of tuples, they're gonna maintain a special column called the version vector. And this version vector is gonna point into a delta storage space that's allocated per thread or per transaction, where we're gonna maintain all the updates that we have for this particular tuple. So if the pointer is null, then we know that this is the master version and the latest version of this tuple, and there is no other version we need to check. If the pointer is not null and we need to go back and try to find an earlier version, we just follow this pointer and look inside that thread's space. So, sort of related to what Hekaton was saying, that you don't want any global data structures or global memory space, this delta storage is gonna be on a per-thread basis. So when I allocate a bunch of space for my thread, right, thread-local storage, I don't worry about contending on the memory allocator with other threads, because that space is allocated for me. Other threads can read my memory, that's fine, but nobody else can write into it but me. And that avoids having to take any latches on any internal data structures we have in here. So now say I have another transaction that comes along that wants to do an update, right? Say it wants to update this tuple. So the first thing we're gonna do is copy out the attribute that we're gonna modify. So we're gonna modify attribute two; we put the old value over here, we set our pointer to point to what the previous old version was, and then we can go ahead and do a compare-and-swap to install our entry in the version vector and update our value, right? Same thing for the next guy, right? We'll update this tuple. There is no older version, so our pointer doesn't point to anything; we copy the old value here and we update the version vector pointer there. So in our current system, we do it the same way, right? But we actually do this all in a latch-free manner, which gets a little tricky in order to make sure you get things in the right order when you have deletes; we can ignore that for now. In HyPer, the way they actually did it was to have a latch on the entire block of tuples. Each block is like 1,024 tuples.
So in that case, I take the write latch on the whole thing, and I don't worry about someone else coming along and reading something in an intermediate state. In our system, because deletes are tricky, we do some extra stuff to make this all latch-free, but you don't have to worry about that for now. All right, the thing that is really cool about HyPer is, again, how they're gonna do validation to ensure serializability. So again, first writer wins, that's easy, right? If I have two transactions that try to update the same thing, whoever got there first succeeds and the other guy aborts. The paper that you guys are starting to read describes how to handle this; but in actuality, and this is not discussed in the paper and not really public, they don't really have write-write conflicts, because they actually only have one writer thread. So only one thread can update the database at a time. There's a bunch of reader threads to do analytics, but there are no write-write conflicts because there's only one guy updating the database at any time. The protocol you guys read about, though, would actually handle multiple writer threads. So now, the way they're gonna do validation, instead of re-executing the scan set, is with a technique called precision locking. So this is really cool, because this precision locking paper came out before I was born, in 1980, and at the time, before these Germans picked it up, it had like 40 citations. So a 30-, 40-year-old paper with 40 citations is essentially forgotten, right? No one's citing it, no one's reading it. And somehow the Germans dug this thing out. What it's gonna do is act like an approximate predicate locking. So I don't remember whether we discussed predicate locking in the intro class, but basically predicate locking is a way to look at the WHERE clauses of two queries and figure out whether they intersect, right? And for simple cases, like A equals one and B equals two, those don't intersect, that's not a problem, but for more complex things it becomes very, very difficult; I think it's NP-complete. So precision locking is like an approximation of that, where I only need to look at what was actually modified, whether the tuples that were modified overlap with what my queries are trying to do. And so rather than having to re-execute the scan, you just sort of re-evaluate your queries against the delta records that were generated by other transactions. It'll make more sense on the next slide. But basically, when my transaction commits and does validation, I look at all the transactions that committed after I started, meaning I did not see their updates when I ran because they did not commit before I started. And I'm going to see whether they created new versions that I should have read when I ran the first time. All right, so let's look at an example here. Say this is the transaction we want to validate. It has three queries, and there are three transactions that have already committed. So they're in the past, they're gone, and we have their redo records. These are the same redo records we would write out to disk anyway, so it's not like we're generating them just to do this validation. We have to log them out to disk, so we just keep them around in memory to do our check. And the idea is that we're going to look at the WHERE clause of every single query and check whether there is an overlap between what we're looking up here and what they actually modified.
And if so, then we know that we should have read what they wrote, but we didn't, so therefore we have to abort. So the easy way to think about this is that we look at what attribute we're accessing in each predicate of our WHERE clause, then we look at whether they modified that attribute in any log record, and then we just substitute that value in. So for this first one, the predicate is attribute two greater than 20 and less than 30. This guy modified attribute two and set it to 99, so we substitute 99 in here. And 99 is not greater than 20 and less than 30, so we know this thing would not evaluate to true, and for our query here, we wouldn't have read it. Yes? So these transactions are the ones that started after? No, no, no, they committed after we started. Yeah, but they also started after? No, not necessarily. Again, snapshot isolation says I only see things that are visible to me if they were created by a transaction that committed before I started. So I've started, and there are a bunch of transactions that have already started and are making some stuff, and there could also have been transactions that started after I started, wrote some stuff, and then committed. So in this case here, for both these predicates, for this transaction, it evaluates to false, so we know again that this transaction did not create anything that I would have read if it had been installed when I actually did the read the first time. So I didn't miss anything, there's no phantom here, I can ignore that. Same thing for the next one here, right? Attribute two is in (10, 20, 30). This guy created 99 and 33, and for both these predicates it evaluates to false: 33 is not 10, 20, or 30, and 99 is not 10, 20, or 30. So therefore I didn't miss anything; that's false, so we can skip it. In this case here, I'm doing a lookup on attribute one. This guy only modified attribute two, so there's nothing this guy modified that I would have read, so I replace that with a null. And null compared against anything, wildcard or not, evaluates to null. As long as it's not true, we're fine, and I didn't miss anything. So I'm gonna go down the line like this for all these guys, one by one. But now we get to this guy here. Attribute one, LIKE with a wildcard, something like '%ice%'. In this case here, somebody installed the string 'ice cube'. So now my predicate would evaluate to true. So that means this transaction did an update that was committed after I started, but because I was looking at a previous snapshot, I missed it, right? This is the write skew anomaly: because I missed it, I have to abort, because it's something that I should have read but didn't read, and that would have violated serializable ordering, okay? Yes? What if the first transaction... like, let's say both transactions 1001 and 1003 updated the same row. Yes. And the first transaction that committed passes these where clauses, but the third transaction does not. So what would happen? Yeah, so the question is, what if, like here, this evaluated to true, and then, going forward in time, so transaction 1001 is older than 1003, this evaluates to true but then this evaluates to false? What should happen? That's a good question. There's a brief period where it was true, but the end result would be false. I think the last writer would win. Yeah, I think the last writer would win.
So you could, in theory... if this guy updated attribute one on the same tuple, would you actually have that information though? I don't think you would; you wouldn't know that. So this is an approximation of phantom checking. False positives are okay, because we just abort when we didn't have to. False negatives are bad: we don't want to fail to abort when we should have. So in that case, I think the safe thing to do is, even though there was a window where it conflicted and after this other guy everything would have been okay, because I conflicted at least once, I'll play it safe and abort immediately. Again, predicate locking would be an exact determination of whether there was a conflict. Precision locking, despite the name, is a little less precise, and it's okay for it to have false positives. Okay. The other cool thing that HyPer does is these version synopses. So again, the issue with Hekaton is that we have these long version chains, and as we're scanning along, we basically have to traverse them all the time to find the right version that we actually want. So if you're doing OLTP queries, where you're doing lookups on a single attribute through an index, and I land at the head of the version chain and traverse it, yeah, who cares? I just follow the version chain and I find what I want. But now if I'm doing analytical queries, where I basically want to do entire table scans, I don't want to have to go check that version vector and chase some pointer over there, because that's going to ruin my cache locality, and I want to be able to rip through these columns very efficiently. That's the whole point of having a column store. But having these pointers gives me indirection. So what they're going to do, on a per-block basis, is maintain this version synopsis. And this is just going to keep track of what range of the tuples in this block might have a version vector entry, i.e., might have older versions that I need to go check. So in this case here, it's positions two to five, inclusive-exclusive, and the offsets in this block are numbered from zero to six. So this is saying that for anything outside that range, I don't need to look at anything else. So in this case here, when I do my scan, I look at my version synopsis, and it would say, oh, I'm at offset zero, offset one, that does not intersect with my version synopsis, so I know I can ignore the version vector entirely. But for this range here, I gotta go check. Sometimes it'll be null, that's fine, but as I'm doing the scan, I don't even have to look at it for the other parts. And now I can rip through these columns as fast as possible. You could even do something like, because SQL or relational algebra is unordered, maybe do a fast scan on these two pieces first, and then at the end go check this one. I don't actually know how that's implemented. You probably don't actually want to do that for really large columns or large blocks, because you'd be striding through your cache and jumping to another part, which could trash your cache. So you're better off just ripping through sequentially. So this is sort of like a zone map, but for the physical versions. A zone map is a little thing in the header of the block that says, here are the values that are in my block, and you use that to figure out whether you actually need to read it.
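To make that concrete, here is a minimal sketch, not HyPer's actual layout, of how a per-block version synopsis might be used during a column scan: positions outside the synopsis range are read straight out of the column, and only positions inside it have to consult the version vector.

```cpp
#include <cstdint>
#include <vector>

struct Block {
    std::vector<int64_t>     column;        // in-place "master" values for one attribute
    std::vector<const void*> version_vec;   // per-tuple pointer into delta storage (null = no versions)
    uint32_t versioned_first = 0;           // synopsis: positions [versioned_first, versioned_last)
    uint32_t versioned_last  = 0;           // may have older versions; empty range = none at all
};

int64_t ScanSum(const Block &b) {
    int64_t sum = 0;
    for (uint32_t i = 0; i < b.column.size(); i++) {
        bool may_have_versions = (i >= b.versioned_first && i < b.versioned_last);
        if (!may_have_versions || b.version_vec[i] == nullptr) {
            sum += b.column[i];             // fast path: no version check, no pointer chasing
        } else {
            // slow path: follow the delta chain to find the visible version
            // (omitted in this sketch)
        }
    }
    return sum;
}
```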
The version synopsis is like a way to say: here's where the versions are, and therefore here are the tuples where you know you don't have to check for older versions, all right. The idea is that as data gets cold over time, you can have progressively larger and larger ranges where you don't have older versions, and that makes your scans go faster, right? Under Hekaton, you had no notion of this. Well, it's also append-only, but without any notion of this, you have to scan through everything. Yes. So this version synopsis is for each tuple? The version synopsis is per block. So a block has 1,024 tuples, and this is just saying, here's the range where you have to check the versions. I actually don't know whether you can have more than one range; I suppose you could. Anything that's not in the range, you don't check. Yes. Why use a range instead of something like a bitmap that tells you which ones you don't have to look at? So, why use a range versus a bitmap? Yeah, I mean, that might be... so actually, this would be a great project for project three: try the bitmap, try this one, and see which is faster. And imagine, the larger the blocks are, the larger the bitmap. The more tuples you have in a block, the bigger the bitmap gets, but if you're doing 1,024, that's not a large bitmap. Actually, probably what you'd do is just take this, convert it to a bitmap, and use that for the scan, and you might as well just keep it around. Yeah, this is something we should explore for project three; that'd be a cool project. All right, so another system we want to talk about now that does MVCC, but with a focus on doing transactions and analytics, is SAP HANA. So HANA is an in-memory database system that was built by SAP, mostly designed to power their big enterprise resource planning (ERP) applications, right? The things that you use to keep track of payrolls and stuff like that, to get off Oracle. But it is actually a standalone database system, something that you could use for other applications. So they're gonna be doing time-travel storage with newest to oldest. And like Hekaton, they're gonna support both optimistic and pessimistic protocols. I actually don't know what you get by default; presumably this is tunable. And they're gonna have a hybrid storage layout: they're gonna have rows and columns. So some versions are gonna be in the row store and some versions are gonna be in the column store. And the idea is that you wanna have the oldest version in the column store, because most of your analytical queries can just rip through that, and then you have the row store, the time-travel space, where you can append the new versions very quickly; they're just gonna be sort of deltas as you're adding them. So the history of HANA is actually kind of interesting. It originally started off as this Frankenstein system, where they bought a bunch of database companies, mashed them together, and called it HANA. So P*TIME was an in-memory database system that came out of South Korea. For whatever reason, there are a lot of awesome in-memory databases in South Korea; Altibase is another one. So they bought these guys and they had this thing; TREX was a search system; and then MaxDB was an embedded row store system. They mashed these things all together and called it HANA.
Turns out that was kind of a cluster, and since then, as of last year, they've released a new version where they've rewritten everything into a clean code base. I don't know how much of P*TIME is still in there, but it's much better now; it's actually pretty interesting. The name HANA, they told me, doesn't mean anything. For large corporations the name is a big deal, because they're paranoid about getting sued. That's why all the Intel CPUs, the Xeons, things like Kaby Lake and Skylake, are actually named after real lakes: nobody can sue them for stealing the name, because it's a geographical region. By making up a name like HANA that doesn't mean anything, they avoid getting sued. I've heard that it stands for "Hasso Plattner's New Architecture" (Hasso Plattner is one of the founders of SAP; he's the "P" in SAP), but I don't think that's true. I think someone came up with that afterwards.

Okay, so let's talk about how they do version storage, because this is kind of interesting. The oldest version is always in the main data table, and that's a column store. Then they have a little flag that says: hey, there might be a newer version in the time-travel space that you need to go check to see whether that's actually the version you're looking for. For some queries, like analytical queries, maybe I don't need the exact latest version of a tuple. If I'm computing the number of open orders, who cares if I'm ten milliseconds behind? So I just scan the column store in the main table space. It may not be the newest version, but it's good enough for my query. And when you do land in the time-travel space, they have a hash table that lets you look up the head of the version chain for a particular tuple.

Yes? Shouldn't this be oldest-to-newest? What did I say, newest-to-oldest? You're right, it's oldest-to-newest here, sorry. Okay, so every tuple in the main data table has a record ID, an immutable identifier that represents the logical tuple, and then there's the version flag, which could just be a bitmap that says, for each tuple, whether there's a newer version to go check in the time-travel space. So if I'm doing a lookup on A1 and the flag is set to true, I take my record ID A, land in the hash table, and traverse the version chain. So it's oldest-to-newest globally, with the main table holding the oldest version, but when you land in the hash table the chain itself is actually newest-to-oldest. That's what the question was about: globally it's oldest-to-newest, and I look at the oldest version in the main space, but if I want the newest version I land in the hash table and the head of that chain is the newest. Because again, if I'm doing updates, I mostly care about invalidating the latest version, and this hash map makes that cheap: if I need to install a new version, I can do it very efficiently. And I think this architecture grew out of the fact that they mashed these different systems together; this part could be the TREX side, and this part the in-memory P*TIME side.

All right, so now, unlike Hecaton and Hyper, where the timestamps are kept in the header of every single tuple, HANA embeds in every tuple a pointer to some auxiliary metadata object.
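Before getting to that metadata object, here is a minimal sketch of the newest-version lookup path just described: a column-store main space, a "has newer version" flag, and a record-ID hash table into the time-travel space. Every name here is invented for illustration; this is not HANA's actual code or API.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// One version in the time-travel space; the chain is newest-to-oldest.
struct Version {
    uint64_t begin_ts;   // first timestamp at which this version is visible
    uint64_t end_ts;     // timestamp at which it was superseded
    int64_t  payload;    // stand-in for the actual tuple data
    Version* next_older; // next (older) version in the chain
};

struct MainTable {
    // Main data space (column store) holding the oldest version of every tuple.
    std::vector<int64_t>  column;     // one attribute
    std::vector<uint64_t> record_id;  // immutable logical identifier
    std::vector<bool>     has_newer;  // "go check the time-travel space" flag
    // Time-travel space: record id -> head (newest) of the version chain.
    std::unordered_map<uint64_t, Version*> version_index;
};

// Return the value visible at `read_ts` for the tuple in slot `slot`.
int64_t ReadVisible(const MainTable& t, size_t slot, uint64_t read_ts) {
    if (!t.has_newer[slot]) {
        return t.column[slot];  // only the main-space version exists
    }
    auto it = t.version_index.find(t.record_id[slot]);
    Version* v = (it != t.version_index.end()) ? it->second : nullptr;
    for (; v != nullptr; v = v->next_older) {
        if (v->begin_ts <= read_ts && read_ts < v->end_ts) {
            return v->payload;  // newest-to-oldest, so the first hit wins
        }
    }
    return t.column[slot];      // fall back to the oldest, main-space version
}
```

An analytical query that tolerates slightly stale data never takes the hash-table branch at all; it just scans `column` sequentially, which is exactly why the oldest versions live in the column store.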
That auxiliary metadata object is what holds all the timestamp information. The idea is to reduce storage: instead of repeating the same timestamps over and over for every version a transaction creates, each version carries a 64-bit pointer to one metadata object that has everything you need to know about it, which cuts down the header space I store for every single version. And when I need to update timestamps, like we did in Hecaton at commit time, I go to that one location, update that one metadata object, and that effectively updates all my versions at once. So there's a trade-off here between slightly slower reads and faster updates, especially for big updates, because I can update one single thing and it applies everywhere at once.

So it looks like this. Down here there's a transaction metadata space, and for every transaction they maintain a context object that records what that transaction has been modifying and what its current status is. So when my transaction comes along and does an update, say I write to C and D, the new versions I create all point down to that single object for my transaction. I don't have a global map, and I don't maintain all the version metadata inside every single tuple; I go down to the context object to figure out what I'm actually looking at when I read a version. Then they have another layer of indirection that combines a bunch of these transaction contexts together when it does group commit, into a group commit context, and that's what gets flushed out. So when the upper parts of the system want to ask, did transaction ID three actually commit, instead of maintaining that information for every individual transaction, they can look at the group commit context and know that yes, everything in it got flushed out to disk. I think this is a bit wonky, a bit over-engineered, but I can see why they do it: if you're doing large updates, which I'm sure their target applications are, this approach makes sense. There is something to be said for the idea of a single shared object that you can update quickly; the group commit context layer, though, I don't really see the need for.

All right, so let's finish up quickly with Cicada. Let's talk about some limitations of the approaches we've covered so far for MVCC and optimistic concurrency control, and then we'll see a system that's specifically designed to overcome them. The first one is all the indirection: because of the version chains, we have to do lookups all the time to figure out which version is the right one for us. In the case of HANA, as we just saw, there's the global context object; Hyper tries to reduce the cost with the version synopses so scans can avoid checking the version chains. But in general, if we're updating tuples we have to maintain these version pointers, and then we have to do garbage collection to prune old versions so we don't run out of space, which also shortens the version chains and reduces the number of versions we have to look at when we do scans.
Most of the approaches we've talked about so far also maintain a bunch of state in global memory: there's the global hash map in Hecaton, and in Hyper there's the global space of tuple blocks. Because threads can read and write any address in that space, they aren't really cognizant of where the data they're touching is physically located and how that affects their cache locality. We'll discuss NUMA in more detail later in the semester, but on modern architectures with multiple CPU sockets, the access latency to a given memory location is not the same from every socket. So if I don't know where my memory actually lives, I may end up reading it from another socket, and that can be really slow. And for all of these approaches, we also talked about how there's a single counter doing timestamp allocation, and that can become a bottleneck at really high core and thread counts.

For OCC, some of the limitations we talked about aren't specific to multi-versioning; they're general to OCC protocols. If there's high contention, we're going to get a lot of aborts. We saw that in the thousand-core paper, where everything cratered at a thousand cores: most of the time spent executing transactions was wasted, because they just aborted on a conflict right away. We also end up doing extra reads and writes if we implement a private workspace for OCC; that's additional copying in memory, and it slows us down because it's extra work. And then, and we'll talk more about this next class, in some protocols the way to handle index updates before a transaction commits is to install what are called virtual index nodes. Basically it's a placeholder that says: I'm going to insert this key into this index; I don't know my commit timestamp yet because I haven't committed, but this prevents somebody else from inserting the same key at the same time. It's another way to do first-writer-wins for indexes.

So Cicada was an OLTP engine built here at CMU by Hyeontaek Lim. He was a PhD student of Dave Andersen, stuck around as a postdoc, dabbled in databases, and ended up building this system. There are a bunch of optimizations Cicada proposes that are interesting to us and that I want to go through: best-effort inlining, loosely synchronized clocks (which we're not going to cover much), contention-aware validation, and index nodes stored as table data. The first one and the last two are things we could consider in our own system. The loosely synchronized clocks I haven't thought through carefully enough to know whether it makes sense for us; it's essentially borrowing techniques from the distributed-database world, which is reasonable if you think of a machine with a lot of cores and a lot of sockets as a distributed database on a single box. But I want to focus on the other three because they connect better to what we've talked about so far.

So, best-effort inlining. The basic idea is that instead of a version chain that can point to anywhere in memory, what if I allocate a small fixed amount of space right alongside the version pointer, so that when I do a lookup and ask, what's the latest version?
Well, maybe I can just find the latest version right there, inline. The idea is that as I'm scanning along, instead of following a pointer off to some other location, there's a small amount of space right next to the pointer that holds the latest data I need, and most of the time that will be good enough. This one's okay. I think the paper shows that for their workloads it makes a big difference, but I think it's tricky to do when you have variable-length data, because if these are strings of different sizes, inlining is hard, and for analytical queries the column-store approach is obviously going to be better.

All right, now let's talk about how to do fast validation. Again, the scan-set approach guarantees serializability in Hecaton, and the precision-locking approach guarantees serializability in Hyper. For this paper, I think he's mostly using the scan-set approach from Hecaton; he didn't evaluate precision locking, because I don't think he was aware of that paper when he wrote this one, and he wasn't specifically targeting MVCC systems. He basically has three techniques for speeding up validation.

The first is that instead of validating every transaction blindly, you maintain a little information about what happened with the transactions you validated before. If you can identify records that are highly contended, like everybody trying to update this one tuple, then instead of checking that record at the end of validation, or at some arbitrary point, you check it at the very beginning. That way, if I'm going to abort, I abort right away without having to check everything else. Basically you keep a counter for the most contended tuples, the ones transactions keep aborting on, and when I do my validation, if I've read one of those tuples, I make sure I check it first. That avoids doing wasted work.

The second technique is the early consistency check, and this is borrowed from a paper we wrote after the DBx1000 work; we had another OCC protocol called TicToc. The idea is that, just like before, if I keep track of what I'm contending on during the validation phase, I can push that information up to when I'm actually running the transaction. So if my transaction reads something it could conflict on, instead of waiting for the validation phase, I check immediately after the read, or even before the read, whether reading this now would cause me to abort. Instead of waiting out the entire lifetime of the transaction just to find out it's going to abort, I do some early checks and avoid all that work.

The last one is incremental version search. Since the version chains are newest-to-oldest, if I know I keep reading some older version, then rather than walking the whole chain every time, I can keep a little memory address in the master version that says: the version you're probably looking for is over here, jump straight to it. If I land on a version and it's exactly what I wanted, I'm done. If it's not what I want, I just repeat the search as I normally would.
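Going back to the first of those three, here is a minimal sketch of contention-aware validation: order the read-set check by how often each record has recently caused aborts, so a doomed transaction fails as early as possible. The types and the `still_visible` callback are stand-ins I made up; Cicada's actual validation logic is more involved than this.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical read-set entry: which record we read and which version we saw.
struct ReadEntry {
    uint64_t record_id;
    uint64_t observed_version;
};

// Contention statistics: how often validation has recently failed on each
// record. In a real system this would be approximate and periodically decayed.
using ContentionStats = std::unordered_map<uint64_t, uint32_t>;

// `still_visible` stands in for the usual check that the version we read is
// still the visible one (i.e., no conflicting write has committed since).
bool Validate(std::vector<ReadEntry>& read_set,
              const ContentionStats& abort_counts,
              bool (*still_visible)(const ReadEntry&),
              ContentionStats& abort_counts_out) {
    auto contention = [&](uint64_t id) -> uint32_t {
        auto it = abort_counts.find(id);
        return it == abort_counts.end() ? 0 : it->second;
    };
    // Check the most-contended records first, so that if we are going to
    // abort we do it before wasting time validating everything else.
    std::sort(read_set.begin(), read_set.end(),
              [&](const ReadEntry& a, const ReadEntry& b) {
                  return contention(a.record_id) > contention(b.record_id);
              });
    for (const ReadEntry& r : read_set) {
        if (!still_visible(r)) {
            abort_counts_out[r.record_id]++;  // remember what we conflicted on
            return false;                     // abort early
        }
    }
    return true;
}
```

When contention is low you would skip the sort and the bookkeeping entirely, which is exactly the adaptivity point made below.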
That last one, incremental version search, costs a little extra memory but avoids having to do the full version-chain search. For the first two techniques, you want to skip them if your transactions are mostly committing successfully; it's only under high contention that they actually make sense. So you need to be adaptive and have a way to notice, oh, I keep aborting because of conflicts, and then switch these things on so they can reduce the wasted work. Because again, with OCC we only do validation after the transaction has finished running. So if I'm doing a thousand updates and the first one is the one I'm going to conflict on, but I don't find that out until I've done the other 999, that's all wasted work. Techniques like these try to avoid that.

All right, the other thing about Cicada that I find extremely fascinating, and this is the third year I've said this is something we should look into, and hopefully I'll have a PhD student start looking into it, is that instead of storing the index nodes for the B+tree out in the heap, they actually store them as tuples themselves, as blobs in a table. The idea is that instead of having to do the scan-set checks or precision locking, if the index nodes are just tuples and I validate whether I'm allowed to read them like any other tuple, then I potentially get serializability and phantom checks for free, because everything is a table. Normally, and you'll see this in the B+tree you'll build or the Bw-Tree we already have, the nodes just sit out in the heap. Instead, we store them as blobs inside a table: we serialize the keys and the node pointers. So now when I do a scan to get down to G, I do my lookup on A, I know how to interpret those bytes, and they tell me that for the key I'm looking for I want to jump to B or C; then I know how to do a lookup and find the node that represents C. Of course, the issue is that if I update a single key in a node, that invalidates the entire node version, because I have to create a new version of the whole node, and that could mean excessive copying. So it's a trade-off between how much churn there is in the index and how much benefit you get from this. But I think the paper is correct that this gives you serializability checks for free, though I don't know that for certain. He also claimed that SQLite does this. As far as I know, other than Cicada, no real system actually does. I checked with the SQLite creator last night, and he told me they don't do this for B+trees, but they do it for things like full-text indexes, like inverted indexes, because there's no notion of those as first-class objects inside the database system, so you store them as blobs in a table and you get the storage for free, whereas the B+trees are stored separately as their own pages in the heap. Again, I think this is super interesting, and if we eventually want to get off the Bw-Tree in our system, we might want to pursue it. We'll see.

All right, two quick graphs for Cicada. This is the first workload, with low contention.
They compare against a bunch of different systems: two-phase locking, the standard approach; Silo, a system out of Harvard that we'll talk about a bit later in the semester (they have a data structure called Masstree, which we'll also cover); TicToc, the OCC protocol I helped work on; and then the rest are variants of Silo, with ERMIA and FOEDUS both being better versions of Silo. And then there's Cicada. Under low contention, you see that Hecaton actually does the worst here, and the Silo variants all do about the same. But under high contention, Cicada does better than the others because of all the extra stuff I talked about: the early consistency checks, jumping straight to the version I need, and checking first during validation the tuples I'm most likely to abort on. The overall architecture of Cicada is interesting, but it's all those little one-off optimizations that I think are worth pursuing as well. So this might also be another project for you: in addition to the Cicada-style B+tree, we could look at adding all these early validation checks to our own system, because we're doing the same kind of thing. I rushed through that last part very quickly, but the main idea is that instead of treating every transaction as starting from scratch when deciding whether it's going to abort, we can carry some information from one transaction to the next and make better decisions when we do validation.

All right, the other main takeaway you should get from this lecture is that there are a bunch of design decisions that don't fall exactly into the four categories we talked about last class but that still affect the performance of the system, depending on whether it targets OLTP or OLAP workloads. If you're doing long scans for OLAP queries, then Hecaton's global hash map and append-only version chains are probably a bad idea, and the Hyper way might be better. And there's this classic trade-off again between adding layers of indirection, storing things separately or keeping shared information in one location to reduce storage overhead, and the performance cost that indirection can have. Some systems, like HANA, decided to make that trade.

Okay, so next class will be the last lecture on this, covering garbage collection. You'll read another paper from the Hyper group, which came out in VLDB just last year, on how they do garbage collection. And then we'll spend time at the end doing a demo of how to run perf for project number one. Okay? Any questions?