So, this is the second lecture now in our three-part series on MVCC. Last class, we spent time talking about the four major design decisions based on the paper that we covered: the concurrency control protocol, how you actually coordinate transactions; version storage, where you maintain the information about the new physical versions you create; garbage collection, how you clean them up; and index management. And as I said, for today's lecture, we're going to focus on the first one in a bit more detail, and then in Wednesday's class, we'll discuss garbage collection in more detail. For today's class, we're going to look at real-world implementations of MVCC systems in the context of those design decisions, but primarily focusing on how they coordinate transactions. And then we're also going to spend a little time talking about other aspects of their architectures that don't fall directly into those four design decisions, but are actually important if you want to build a real system.

So, we're going to start off talking about SQL Server and Hekaton. We'll use this as a baseline to understand how they do concurrency control in their system; they're primarily focused on just OLTP workloads. And then we'll look at Hyper, HANA, and Cicada. These will be derivations of Hekaton or have additional optimizations, either for making transactions run faster or making analytics run faster. So, the way we're setting this up is we'll spend a lot of time at the beginning talking about Hekaton and how they do things, and then we can build upon that and say, here's how to do it even better in these other systems. Hekaton came first, and these other ones came later. Okay? All right, so let's jump right into it.

So, actually, a quick show of hands. Who has ever heard of Hekaton before this class? All right, a couple of you. Yeah, you were at Microsoft. Okay. So, it's not called Hekaton now. The product isn't called Hekaton. It's called the in-memory engine or something. Hekaton is the research name, and it doesn't really mean anything; it's just what they were calling it internally when they were building it out. Now, if you give Microsoft money and pay for this feature, I think it's just called the in-memory engine. It's not called Hekaton anymore. But in academia, in the research community, we still know it as Hekaton.

So, Hekaton started as an internal incubator project at Microsoft in 2008, and they were trying to build a new OLTP engine for SQL Server. At that point, SQL Server was 20 years old. SQL Server is actually based on Sybase. In the 1990s, Microsoft bought a license from Sybase to port it to run on Windows NT, and then they bought a source code license to modify it heavily. Since then, obviously, the two branches of code have diverged significantly. Sybase is kind of old. They had a new version come out one or two years ago, but it's pretty much in maintenance mode at this point. Like, no new startup is saying, man, we're building on Sybase, we're super excited, right? Whereas SQL Server is actually pretty state-of-the-art, because Microsoft spent a lot of time and money making it better and better. In fact, it runs on Linux now, which is crazy. So, at this point in 2008, they're like, you know, SQL Server has been around for 20 years; what do we need to do to make SQL Server relevant for the next 20 years? And so, they ended up building a new engine for this.
And the project was led by two very famous database researchers. Paul Larson was on the Microsoft Research side. He's been around and been involved in many different database projects over the last 20, 30 years; he invented linear hashing back in the 80s. And then Mike Zwilling is an incredible database hacker, originally from the University of Wisconsin. He worked on the Shore system, which was an influential storage manager back in the 1990s. And then Microsoft hired him to lead, or be part of, the effort to port Sybase to Windows NT. So, he's been there since then.

So, they were trying to build a new in-memory engine to run transactions very quickly inside of SQL Server. And the way to understand what Hekaton is: it's not a standalone database system. It runs inside of the SQL Server ecosystem. The reason why they did this is because they had 20 years of people building applications for SQL Server, and not just applications, but also tools to help manage SQL Server. So, you don't want to throw all that away just because you have a new system. So, they had to make it work inside of the existing SQL Server architecture. And this is very common among the major database system vendors. Oracle and IBM do the same thing: they have special engines you can buy, give them money for, that run inside of the core system to make something run faster.

One of the key design decisions they had to make in designing Hekaton is that they wanted to make sure they supported all possible OLTP workloads with predictable performance. Now, for the papers we're going to read this semester for transactions, we're not really looking at kooky ideas or ideas that only work for a narrow class of workloads. So, if you took the intro class last semester, we talked about H-Store and VoltDB. VoltDB works really well if your transactions look a certain way, right? And if your transactions don't look the way VoltDB wants, you're going to get terrible performance. So, in the case of SQL Server, they couldn't go down the route that VoltDB went, because if you want to be able to say, here's this engine you can pay for and run faster, but you might be the unlucky one whose application gets slower, that's hard to sell as a product, right? If you don't know whether you'll get faster until you buy the product and switch over to use it, that's terrible; that's hard to sell. It's like saying, hey, look, I have this car, it's going to be really fast, but you might get cancer if you drive it. No one's going to buy that, right? So, they wanted predictable performance, so they didn't go down the route of doing single-threaded partitioning. They wanted to make sure that you got at least some improved performance for all applications, even though it might not be the huge gain you would get in something like VoltDB.

So, the way they're going to do MVCC is that they're going to assign timestamps to transactions. We're going to cover the OCC version: you get a timestamp when the transaction starts, that's your begin timestamp, and then you get a second timestamp later on, and that'll be your commit timestamp, right?
And then the tuples themselves, just like we talked about for MVCC and snapshot isolation, are also going to have two timestamps: the begin timestamp and the end timestamp. The begin timestamp is going to be either the begin timestamp (transaction ID) of the active transaction that created the version, or the commit timestamp of the committed transaction that created it. The end timestamp is going to be either the begin timestamp of the active transaction that is creating the next version in the version chain, infinity, meaning it's the latest version, or the commit timestamp of the committed transaction that created the next version. So, the first one is sort of easy to understand; the second one is a bit more nuanced. So, let's actually go through an example and see how this all fits together.

So, here's our simple database. We have one logical tuple A, and right now we have two versions, A1 and A2. Again, the begin timestamp and the end timestamp specify the visibility of the tuple: based on these timestamps, this version of A is visible from 10 to 20 exclusive, and then there's a pointer to the next version, A2, which is visible from 20 to infinity. So, A2 is the latest version. So, what order are they storing the chain in? Newest-to-oldest or oldest-to-newest? Oldest-to-newest, right.

So, now a transaction comes along, and at the beginning we assign it a begin timestamp. Say this transaction gets begin timestamp 25. It's going to do two operations. The first operation wants to do a read. Again, we're going to use the begin and end timestamps to determine what version of A is visible to us based on our begin timestamp. So, we start at the head of the version chain; we can ignore how we got there, whether it was through an index or not, it doesn't matter, the protocol works the same. So, we land here at A1 and check: is 25 between 10 and 20? No. So, we follow the pointer and come down to the next one: is 25 between 20 and infinity? Yes. So, this is the version we read. We covered that last class, that's pretty straightforward.

Now, I want to do a write. Same thing: I land at the head of the version chain, and I want to find the latest version, because I want to write my new version after it. 10 to 20, that's not it; 20 to infinity, we know this is the latest version because the end timestamp is set to infinity. There's nothing else that comes after it at this point. So, the first thing we need to do to create a new version is actually create it. We allocate some space in memory, get a free slot (we're doing append-only storage, so it's going to be in the same table space as the other versions), and then we fill in all our new information. At this point, though, for the begin timestamp, we're going to use a special marker, Txn25. This is me indicating that this is a special begin timestamp that corresponds to an uncommitted transaction. So, really what they're doing is using one bit, the most significant bit in the timestamp: it's one if it's uncommitted, or zero if it's committed. So, Txn25 is me telling you that this has been marked with the special flag that says it's from an uncommitted transaction.
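To make that visibility rule concrete, here's a minimal sketch in C++ of what a version header and the timestamp range check might look like. The struct layout, the names, and the single high bit used as the uncommitted flag are my illustration of the idea, not Hekaton's actual code, and it ignores the speculative-read case for now.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical sketch of a Hekaton-style version header (not the real layout).
// The most significant bit of a timestamp field marks "this holds the ID of an
// uncommitted transaction" rather than a real (committed) timestamp.
constexpr uint64_t kUncommittedBit = 1ull << 63;
constexpr uint64_t kInfinity =
    std::numeric_limits<uint64_t>::max() & ~kUncommittedBit;

struct Version {
    uint64_t begin_ts;   // commit ts of creator, or (kUncommittedBit | txn id)
    uint64_t end_ts;     // kInfinity, commit ts of next version's creator, or txn id
    Version *next;       // pointer to the next (newer) version in the chain
    // ... tuple payload would follow here ...
};

// Is this version visible to a reader whose begin timestamp is read_ts?
// For simplicity this treats a version created by a still-uncommitted
// transaction as invisible (i.e. it skips the speculative-read path).
bool IsVisible(const Version &v, uint64_t read_ts) {
    if (v.begin_ts & kUncommittedBit) return false;   // creator still uncommitted
    uint64_t end = v.end_ts & ~kUncommittedBit;       // treat a txn marker as its ts
    return v.begin_ts <= read_ts && read_ts < end;    // [begin, end) range check
}
```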
Anyone want to take a guess why you'd want to do this? Yes? He says it's to prevent readers from reading an uncommitted version. Yes, but even if I didn't have that special bit, I could go look in a global map to figure out whether it's committed or not. So, just by setting that bit, I don't have to do that lookup. They're avoiding one additional lookup by setting that bit. So, you're right, it warns you that you're reading something that's uncommitted, but Hekaton will actually allow that; we'll see that in a second. It's just avoiding having to go check whether it's committed or not. Yes? So, your statement is that the size of the transaction ID space is now smaller because you lose a bit. Yes, so you just wrap around, like we talked about in the last class. Does that happen more frequently? Yes, but these are actually 64-bit numbers, so it's not as bad as Postgres with 32 bits. Okay.

So, we create the version first, then we update the pointer to now point to our new version, right? Now, at this point, is this version visible? Why or why not? Because the old version still has an end timestamp of infinity. Exactly. So, if anybody comes along right now and tries to look this up, they will see 20 to infinity and say, well, this is the version that's visible to me, and they won't follow the pointer. So, once I've created the version and updated the pointer, now I flip the old version's end timestamp to be my transaction's begin timestamp marker, and at this point the new version becomes visible.

The main thing I want to say here is that the order we do this in has to be correct. We can't just flip the timestamp first and then try to create the version and update the pointer, because someone could end up following a pointer that points to nothing. So, we create the version. Then we do an atomic compare-and-swap on the pointer to try to point it to our new version. If that fails, then we know that somebody else got there before us to create a new version, and we have to abort ourselves and roll back. If we succeed, then we know that nobody else can create a new version because we got there first, and now we can just do a regular update on the end timestamp; it doesn't have to be an atomic compare-and-swap, right?

All right. So, now we're done with the write, and our transaction wants to commit. So, now we get a commit timestamp. The only thing we need to do at this point, for this example, is go back and update the timestamps where we wrote our transaction marker, the old version's end timestamp and the new version's begin timestamp, to now be the commit timestamp 35. We do this after we do validation, which we'll talk about in a second. So, after we know this transaction is safe to commit, that there are no conflicts with other transactions, then we go back and flip these. And for this case, it doesn't have to be atomic, because as I said before, if we see the special timestamp marker, it tells us this is from an uncommitted transaction. In this case, I've created a new version and all these time ranges are correct, so I should be reading it; to figure out whether I'm allowed to read it or not, I can go look in the state map for this transaction and ask, is this guy still active, or has it committed? All right.
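Here's a rough sketch of that ordering, again my own illustration rather than Hekaton's code: fill in the new version, compare-and-swap the chain pointer, and only then flip the old version's end timestamp with a plain store.

```cpp
#include <atomic>
#include <cstdint>

constexpr uint64_t kUncommittedBit = 1ull << 63;
constexpr uint64_t kInfinity = ~kUncommittedBit;   // all lower bits set

// Same Version idea as before, but with atomics since multiple writers race here.
struct Version {
    std::atomic<uint64_t> begin_ts;
    std::atomic<uint64_t> end_ts;
    std::atomic<Version*> next;
    // ... payload ...
};

// Append a new version after the current latest one. Returns false if another
// writer beat us to it (first-writer-wins), in which case we must abort.
bool InstallNewVersion(Version *latest, Version *fresh, uint64_t txn_id) {
    // 1. Fill in the new version first, marked as created by an uncommitted txn.
    fresh->begin_ts.store(kUncommittedBit | txn_id);
    fresh->end_ts.store(kInfinity);
    fresh->next.store(nullptr);

    // 2. Atomically swing the chain pointer; failure means someone got there first.
    Version *expected = nullptr;
    if (!latest->next.compare_exchange_strong(expected, fresh)) {
        return false;
    }

    // 3. We won the race, so nobody else can append: a plain store is enough to
    //    flip the old version's end timestamp to our transaction marker.
    latest->end_ts.store(kUncommittedBit | txn_id);
    return true;
}
```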
So, let's rewind now; I don't know how else to convey that we're going to go back in time, but we're going to roll back our example to before we tried to commit, and we're going to introduce another transaction. So, again, we got rid of the commit, we rolled these timestamps back to the special begin timestamp marker for our transaction, and now another thread comes along and starts a transaction at timestamp 30.

Say the first thing it wants to do is a read on A. Same thing: start at the head of the version chain. 30 is not between 10 and 20. 30 is not between 20 and 25. So, we get to the last one, 25 to infinity. This is the version we want to read, but at this point the transaction that created it has not committed. So, Hekaton allows what are called speculative reads. We're going to assume that the transaction that created this version is going to commit successfully. So, rather than aborting ourselves because we tried to read something that hasn't committed yet, or rather than stalling to wait to find out whether that transaction commits, we go ahead and allow ourselves to read it, and then we do some extra work, which I'll talk about in a second, where when we try to commit, we check whether the transaction we read from actually committed or not.

Now I want to do a write on it. Same thing: 30 is not between 10 and 20, then I land here, 30 is not between 20 and 25, and then I land here, and now I'm trying to write a new version after a version from an uncommitted transaction. This is essentially a write-write conflict. In Hekaton, and every other system we'll talk about today, they are not going to allow write-write conflicts, so they use a simple policy called first-writer-wins. In this case, we would land here, recognize that we're trying to create a new version after a version that is not committed, and we would abort ourselves because they got there before we did. This makes your life way easier in terms of implementation and what things you have to check: you don't allow write-write conflicts, and you say the first writer wins. You could do more nuanced policies, like the later writer wins, or whoever has done more work wins; maybe that's better for some applications, but the engineering cost isn't worth it. The simple policy of first-writer-wins is the way to go.

All right, so any questions at a high level about how we're doing transactions here? Yes? For the transaction that begins at time 30, should it read version A2 or A3? At this point here, for the read on A? You're saying A2, because 30 is between 20 and Txn25? It's not; 30 is after 25, right? The Txn notation is just me telling you that the flag in the first bit is set, saying this is uncommitted. We still treat it as a regular timestamp, so the range is 20 to 25, right? And we're allowed to speculatively read the latest version, right? And then we keep track that we read it, and when we try to commit, we find out whether the transaction specified by that timestamp committed. So, are we treating the timestamp Txn25 as a very large number? No; again, that notation is just for illustration purposes. The timestamp that we would evaluate against is 25 here. Okay, okay.
But I'm just marking it to say this is uncommitted. Okay, got it. Okay, yes? So, if you do a speculative read, are you blocked from committing? Yeah, the question is, if I do a speculative read, am I blocked from committing until that transaction commits? Yes, until you know whether it committed or not. So, if you do a speculative read, and we'll see this in a second, I have to record that I read something from Txn25 that has not committed. Before I'm allowed to commit, if I'm running with serializable isolation, I have to make sure that they commit. If I'm running at read uncommitted, who cares, right? I'll just read whatever I want. All right, cool.

So, let's talk about the transaction state map. I'm going over this because when we talk about Hyper and the other systems, we'll see they do this in a different way. So, in addition to maintaining all the version information about the tuples, we're also going to maintain the status of transactions. At the global level, we have this map that basically says, for every single transaction running in the system, what is its current state? What is it doing? At the very beginning, you're in the active state, because you're in the process of executing queries that read and write tuples. Then, when the application says it wants to commit, you enter the validating phase. This is where the system uses a bunch of information collected during the active state to figure out whether you're allowed to commit, whether you would violate serializable ordering. Then, once we pass the validating phase, we enter the committed phase. The transaction is finished, we're not going to execute any more queries, and we've been completely validated. Now we need to go back and update all those versions that we touched while we were executing and flip their timestamps to be our new commit timestamp. So, at this point, this is where, as I was saying, you could recognize that if I do a read on a tuple and I see that it has the uncommitted transaction timestamp, I can go do a lookup in this state map and check whether that transaction has actually committed or not. If it's been committed, then I know it's not a speculative read; it's just that the other thread hasn't gotten around to flipping that timestamp yet. Then, once I've updated all the timestamps of the versions I modified, I enter the terminated phase, and after some timeout or TTL my entry is removed from the map. So again, we maintain this for every single transaction. It's a way for other threads to figure out what's going on with the other transactions that are running at the same time.
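Here's a rough sketch of that state map and the reader-side lookup, again a simplification I'm making up for illustration; Hekaton's real implementation is lock-free, whereas this sketch just guards a hash map with a mutex.

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical sketch of the global transaction state map described above.
enum class TxnState { ACTIVE, VALIDATING, COMMITTED, TERMINATED };

struct TxnEntry {
    TxnState state;
    uint64_t begin_ts;
    uint64_t commit_ts;   // meaningful once the transaction starts validating
};

class TxnStateMap {
public:
    void Register(uint64_t txn_id, uint64_t begin_ts) {
        std::lock_guard<std::mutex> g(latch_);
        map_[txn_id] = TxnEntry{TxnState::ACTIVE, begin_ts, 0};
    }
    // A reader that hits a version with an "uncommitted" begin timestamp looks up
    // the creator: if it is COMMITTED, this is not a speculative read (the writer
    // just hasn't flipped the timestamps yet); if ACTIVE or VALIDATING, the reader
    // records a commit dependency on it.
    TxnState Lookup(uint64_t txn_id) {
        std::lock_guard<std::mutex> g(latch_);
        auto it = map_.find(txn_id);
        return (it == map_.end()) ? TxnState::TERMINATED : it->second.state;
    }
private:
    std::mutex latch_;
    std::unordered_map<uint64_t, TxnEntry> map_;
};
```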
So, let's go through the life cycle of a transaction. On one side we have events: what happens when a transaction executes, what is it telling the system it wants to do. And then we have the phases we enter based on the state map. We start with begin: we get the begin timestamp for our transaction, we add an entry into the state map, and we're now an active transaction. Then we go through the normal processing of executing read/write queries.

And here, in addition to updating all the version information that I showed two slides ago, we're also going to maintain some metadata about the transaction: what tuples did it read, what tuples did it write, and what scan operations did it perform. We need that to do validation later on, to figure out whether somebody else did something to the data we read that would violate serializable ordering. Then we get the commit message from the transaction. We enter the pre-commit phase, we get a commit timestamp for the transaction, and we go through the validating process, using the information we collected during normal processing to figure out whether we've violated serializable ordering or not. The only things we need to validate are the reads and the scans. We don't have to validate our writes because, again, first writer wins, and we really only care about what data we read. If we pass validation, we can then append all the new version information for the things we created to our redo log and flush that out to disk. Then we set our transaction state to committed, go back and update all our timestamps, the begin timestamps of the new versions and the end timestamps of the older versions, and then we set ourselves to terminated and remove ourselves from the map. This is the standard lifecycle of any transaction that goes through the normal commit process. If you abort, you basically don't do anything beyond this point; you just blow away the new versions you created. Do you have to go update the old timestamps? Yeah, you have to reset any timestamp that you flipped on older versions, and make sure that anybody who may have read something you wrote gets aborted as well.

All right, so I want to spend more time now talking about what's happening here. What do we actually need to maintain to do this step? This will be different when we talk about what Hekaton does versus what Hyper does versus what Cicada does. For every single transaction, we maintain, in a local context specific to that one transaction, the read set, the write set, and the scan set. The read set is just a pointer to every single version that I read and emitted as the result of a query. What I mean by that is, say I have a query that does a select on A, but I had to scan through three versions. My read set doesn't contain all three versions; it only contains the one that I ended up returning as the result of the query. I don't care about the older versions because they're already committed. There can only be one uncommitted version of a tuple, and that's really the only thing I care about, whether I read that or not. And that's an advantage of first-writer-wins: there are never multiple uncommitted versions of a tuple, there can only be one. The write set is just pointers to the versions I updated, both old and new, the versions I deleted, and the versions I inserted. We need this not for validation, but to go back and update their timestamps later on after we've committed; it's just metadata to track what we actually modified. The scan set, the way to think about it, is the minimum information you need about the query in order to re-execute the scan operation; I'll give an example in a second.
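Before the scan-set example, here's a rough picture of this per-transaction bookkeeping as a C++ struct; the field names and types are my own stand-ins for what the lecture describes, not Hekaton's actual data structures.

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

struct Version;   // version header as sketched earlier

// Hypothetical per-transaction context, roughly matching the description above.
struct TxnContext {
    uint64_t begin_ts = 0;
    uint64_t commit_ts = 0;

    // Read set: only the version actually returned for each read, not every
    // version skipped over while walking the chain.
    std::vector<const Version*> read_set;

    // Write set: (old version, new version) pairs plus deletes/inserts, kept so
    // the timestamps can be flipped to the commit timestamp after validation.
    std::vector<std::pair<Version*, Version*>> write_set;

    // Scan set: the minimum information needed to re-run each index scan at
    // validation time, e.g. which index was used and the where-clause predicate.
    struct ScanDesc {
        uint64_t index_id;
        std::function<bool(const Version&)> predicate;
    };
    std::vector<ScanDesc> scan_set;

    // Commit dependencies: transactions waiting to hear whether we commit.
    std::vector<uint64_t> waiting_txns;
};
```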
Let's say I have a select query that has some kind of UDF doing a complex calculation in the projection list. I don't care about any of that; I only care about what's in the where clause, because I want to go back and re-execute the scan at validation time to see whether I read the same tuples when I scan again. Is that clear? And you're not storing the actual SQL string, because it would be stupid to parse and plan that query all over again at validation time. You store the access methods from the query plan, which lets you go back and reapply the operations from the where clause. So again, it's the bare minimum information you need to re-execute every scan and see whether you get the same result each time.

And the last one is commit dependencies. Think of this as a linked list inside our transaction of the transactions that are waiting to see whether we commit. Think about it: when I do a speculative read, I see the begin timestamp of an uncommitted version, and I know what transaction created that version. So I go find its handle in the global map, get that transaction's context, find its commit dependencies, and append myself to its list of waiting transactions. Then, when that transaction commits, it goes back and notifies all the transactions that were waiting on it to find out whether it committed or not. And based on that, they can determine, oh, now it's time for me to go commit as well, because all the transactions I was waiting for have finished.

So, Hekaton actually supports both optimistic and pessimistic versions of concurrency control. I think they implemented both for the research paper; if you use Hekaton with SQL Server now, I think you only get the optimistic version. For the optimistic version, the way we do validation is that we check whether the versions we read are still visible to us at the end of our transaction, and then we repeat our index scans using the scan set I talked about on the last slide. Again, that checks whether there are any phantoms for any scan query. For the pessimistic concurrency control, they take exclusive locks on individual records and buckets. We don't really talk about this, but the way Hekaton actually organizes the table itself is through a hash table, so you can take locks on buckets to avoid phantoms that way. You don't need to do any validation, and they do basic deadlock detection with a background thread.

So, this is the only graph I'm really going to show for Hekaton. The paper is from, I think, 2012 or 2013. It's a pretty old system at this point by the standards we're talking about today, or at least it's running on old hardware. The main thing I want to point out here is the difference between the optimistic version and the pessimistic version. What you obviously see is that as you increase the number of threads, you get better parallelism in the optimistic version. This is a small database, it's only got a thousand tuples, but 80% of transactions are doing reads and 20% of transactions are doing updates. So again, as you increase the core count and increase the parallelism and concurrency in the system, the optimistic version does better. This gap here may not seem like a lot in relative terms.
But in absolute numbers, this is actually pretty massive. We're already doing 1.5 million transactions a second, and this gap here is about 300,000. To put this in perspective: we've done experiments on MySQL and Postgres and Oracle on roughly the same hardware, and those can maybe do 30,000 to 100,000 transactions a second. So these guys are already up at 1.5 million; that's pretty significant, and a difference of 300,000 is a lot. Now, they're not doing logging, and they're not doing a lot of the extra stuff that the full systems actually do, but this is a pretty high number, right? We'll see this later throughout the semester: people run TPC-H, people run TPC-C, and a good way to ground yourself when you're trying to decide whether the numbers they report are amazing or not is to have a basic understanding of what MySQL and Postgres can actually do, and to some extent the commercial guys too. I'll try to say, here's what the numbers are for their experiments, and here's what a real system can actually do. So this is a lot, but keep in mind it's running entirely in memory.

All right, so what are the lessons we can take away from Hekaton? There are two major things. The first one is that they argue you should only use lock-free data structures. And it's not just for indexes; the entire architecture of the system should be predicated on lock-free algorithms and lock-free data structures. That means indexes, the transaction map, memory allocation, the garbage collector, any kind of internal thing the system maintains. I actually disagree with this. We're going to read a paper next week on the lock-free, or latch-free, data structure we built here at CMU called the Bw-Tree, which actually came out of the Hekaton project. It gets crushed by regular indexes using traditional latching methods. So for other things, sure, for internal data structures, lock-free might be the right way to go, but for the index, I think they're actually wrong here.

The MemSQL guys are super big on skip lists, because the co-founder of MemSQL was at Microsoft when they were building Hekaton and saw some of the early discussions and presentations in the SQL Server team about how great skip lists and lock-free data structures are. He then left, went off to Facebook, and later founded MemSQL, and borrowed a lot of ideas from the Hekaton project. So he saw all the early talks from Paul Larson and Mike Zwilling about how great skip lists were, but missed the second half, where it turns out skip lists are terrible. That's why they built the Bw-Tree instead, and we'll see that in the paper you guys read next week. Both of these get crushed. Skip lists are the worst, right? There's no reason to... I should be careful here: there are very narrow use cases where skip lists are a good idea for general-purpose database indexes. Bw-Trees, again, lose to a B+tree, and then the radix trees, the tries, crush all of these things.

The other point they make, which actually is valid, is that you want to minimize the number of coordination or serialization points within the system. In the case of Hekaton, it's really just at the point where you have to go get the begin and commit timestamps for transactions, because you have a global counter that everyone needs to increment with an atomic compare-and-swap to get a new timestamp. At very high parallelism levels, that contention can become a bottleneck.
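A minimal sketch of that kind of global timestamp counter, just to make the serialization point concrete (my own illustration, not Hekaton's code):

```cpp
#include <atomic>
#include <cstdint>

// Every begin and commit timestamp comes from one shared atomic counter, which
// is exactly the kind of single serialization point being described here.
class TimestampAllocator {
public:
    uint64_t Next() {
        // fetch_add is the "atomic add" flavor of read-modify-write instruction
        // mentioned in the Q&A below; a compare-and-swap loop works the same way.
        return counter_.fetch_add(1, std::memory_order_relaxed);
    }
private:
    std::atomic<uint64_t> counter_{1};
};
```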
You can alleviate this with some of the batching techniques that you guys read about in this paper, but in general it's risky because it can become a single bottleneck. I don't think it's covered in the Hyper paper, but when every transaction starts there they also have a single latch that everyone has to acquire, and that can become an issue.

All right, so let's make some observations about Hekaton. Before we go ahead, any questions about Hekaton? As I said, and it's all relative, it's only a 7-8 year old system at this point, but we're using it as the baseline for understanding in-memory MVCC, and now we can ask, what are some of the limitations of Hekaton, and how can these other systems improve on it? Yes? So the question is, can the validation phase have cyclic dependencies, since transactions can depend on each other's writes? I guess what you're saying is: I write something, you speculatively read it, you write something, and I speculatively read it; could that be a cyclic dependency? And the way you would handle that in validation is you would look at the begin timestamps and decide who should get preference. Do you have to break it? Let me think. Sorry, yep. You're reading based on your begin timestamp, so wouldn't one of them not read it? Yes, he's right. If you're reading based on your begin timestamp, and I'm ahead of you, you can't read anything I wrote, so that can never happen. You started at timestamp one, I started at timestamp two. You wrote something, I can read it; I write something, you can't read it. Thank you, yes. Yes? Could you not avoid using compare-and-swap by using an atomic add? Same thing, yeah, sorry. The statement is, instead of using compare-and-swap, can I use an atomic add? Yes; I'm using compare-and-swap loosely for that whole class of atomic read-modify-write instructions, and atomic add is one of them. There will be an instruction that does an atomic add. I'll be more careful, yes.

Okay, so what are some observations we can make about Hekaton? The first is that the re-scan set validation I talked about is fine for the kinds of workloads Hekaton was focused on, OLTP workloads where transactions don't touch a lot of data and the scans aren't very big. That's fine for those applications and queries, but if you want to do analytical queries, this becomes problematic. Say I want to do an OLAP query that scans the entire table and computes some aggregate. In my read set, I have to maintain a pointer to every single tuple that I read. If I have a billion tuples, I have a billion pointers. So that sucks. Same thing with the scan set validation: if I have to scan the entire table twice, once when I run the query and again when I do validation, that doubles the latency of every query, because every query executes twice. So that sucks too. The other issue is that as we append new versions to our version chains, that hurts performance for analytical queries, because now I have all this non-determinism in my execution: I have to follow the version pointers and figure out which version I should actually be reading as I scan along. The CPU gets more branch mispredictions, it flushes its instruction pipeline, and that makes everything really, really slow.
The last issue is that the way we do conflict detection in Hekaton is based on the mere existence of a tuple: did I read something that you wrote? It doesn't actually look at whether the thing you wrote is something I read. So think about this: if I have a thousand attributes, a thousand columns in a single tuple, and my updating transaction updates one of them while my reading transaction reads the other 999, I didn't actually read anything you wrote, so we shouldn't really conflict. Hekaton can't handle that, because it's just looking at pointers. Whereas we could do something more sophisticated and actually look at what the operation was and what you modified to determine whether there's a conflict or not.

So this is what Hyper does. As I said, Hyper is what our own system is currently based on. We threw away all our code, took this paper you guys read, and built our system to be very similar to it. So this is a column-store MVCC system that uses delta record versioning. They use newest-to-oldest version chains. Any time you do an update on a non-indexed attribute, you just update it in place. If you do an update on an indexed attribute, you have to treat it as a delete followed by an insert; it's the same thing we talked about last class for primary keys, but I think they do it for all indexes. The key thing that makes Hyper completely different from everyone else is the way they do validation for serializability. They don't do any scan checks. They don't do any locking on the tuples like in two-phase locking. They use a technique called precision locking, which looks a lot like predicate locking but isn't implemented the same way full predicate locks are. And just like in Hekaton, they avoid write-write conflicts by letting the first writer win.

So let's look at the architecture at a high level. We have our main data table, and again, Hyper is a column store, so all the values for a single attribute are stored together in a contiguous block of memory. But they also maintain a separate column called the version vector, and each entry in the version vector is a pointer to the head of a version chain stored in the delta storage space. If the pointer exists, it points to a version; if the pointer is null, then whatever is in the main table space is considered the latest version, right? Now, within the delta storage part, just like we talked about before, it's organized on a per-transaction or per-thread basis. Rather than having a global rollback segment or global delta storage space, where every time I want to create a new version I have to acquire a slot from that shared space and then write into it, every thread has its own local memory that it can append these new versions into as needed. And then if other transactions come along and want to reconstruct the right version going back in time, they just follow these pointers into my memory region and read it.
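Here's a minimal sketch of that column-plus-version-vector layout with per-thread delta space, using made-up struct names to illustrate the idea rather than HyPer's actual data structures:

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// A column block plus a parallel version vector whose entries are either null
// (the in-place value is the committed, visible one) or point to a chain of
// delta records living in the writing thread's local delta space.
struct DeltaRecord {
    uint64_t ts;            // transaction / commit timestamp for this delta
    int attr_id;            // which attribute this before-image belongs to
    int64_t before_value;   // prior value, used to reconstruct older versions
    DeltaRecord *next;      // next (older) delta in the chain
};

struct ColumnBlock {
    std::vector<int64_t> values;            // contiguous column values (main table)
    std::vector<DeltaRecord*> version_vec;  // one entry per tuple; null => no deltas
};

struct ThreadLocalDeltaSpace {
    // Per-thread append-only storage; deque keeps pointers stable as it grows.
    std::deque<DeltaRecord> records;
};
```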
So, at a high level, that's basically how it works. This part is actually not that novel; there are other systems that do this. What makes Hyper really interesting is the validation part.

So, I've already said this: first writer wins, and the version vector always points to the latest committed version. We don't need to worry about conflicts between two different writes, because the first writer who tries and fails to flip the pointer where it wants to append its new version knows there's a conflict, and it rolls back and aborts. But now, when a transaction goes to commit, the way we do validation is that we check all the delta records created by transactions that committed after our transaction started, and we want to see whether they modified the database in a way that conflicts with any query we ran. The reason we only have to check transactions that committed after we started is that, under snapshot isolation, if they had committed before we started, we would have seen their writes no matter what. It's only the ones that committed after we started. And we don't care about ones that commit after we commit, because we weren't around to see them; who cares.

The way they do this is super cool: it's a technique called precision locking. What's crazy about this is that the paper is from before I was born, it's from around 1980, and before this Hyper paper came out it had maybe a few dozen citations. A 30-year-old paper that was basically completely forgotten. And then the German guys working on Hyper dug it out somehow, said, oh, this is actually what we want, and this is what they ended up implementing. Nobody else, as far as I know, has ever implemented precision locking until these guys found it. It's awesome. What's great about it is that the only thing we need to store now is the read predicates, just like the scan set we saw in Hekaton. We keep track of those for our transaction, and we don't need to keep anything extra or special from any of the committing transactions other than the delta records, which we were creating anyway. So we're reusing the delta records we already generate to run transactions in the first place to figure out whether we're serializable when we do validation.

So let's look at an example. Say this is our transaction, the one we want to commit, and it executes three queries. And over here I have three transactions that committed after our transaction started. What we do now is go through each query one by one, look at its where clause, and compare that against the values of the attributes that were modified by each of those transactions, and see whether the predicates evaluate to true. If they do, then we know those transactions created a version that we should have seen but didn't, because we ran under snapshot isolation. Say this first query's where clause is: where attribute two is greater than 20 and attribute two is less than 30. We go look in the delta record and see, oh, you modified attribute two. So this is the undo record: when this transaction ran, it set attribute two to 33 and it set attribute two to 99. So now we substitute the placeholder, the attribute reference inside the query predicate, with the actual values they wrote. So instead of attribute two greater than 20, it's 33 greater than 20, or 99 greater than 20.
And then we just run the predicates like we normally would when we evaluate expressions during query execution, and see whether they evaluate to true or false. In this case, 33 is greater than 20, but 33 is not less than 30, so the predicate is false; and 99 is greater than 20, but 99 is not less than 30, so that's false too. So we know this transaction did not create a version that we should have read but didn't when we ran the first time. Go to the next query: attribute two in (10, 20, 30). We check whether either 33 or 99 is 10, 20, or 30; both are false, so this transaction didn't create a version we should have seen, and we skip it. Same idea for this one: it references attribute one, and this transaction did not modify attribute one, so attribute one is not referenced in the delta record. We substitute the placeholder for attribute one with null, and null LIKE '%ice%' is always false, because null compared to anything else is false, right?

So we go down the line for all the other where clauses, for all the other queries, against the rest of the transactions. There's one here: again, attribute one LIKE '%ice%'. In this case, this transaction created a delta record where it modified attribute one and set the value to 'Ice Cube', so the predicate evaluates to true. Therefore we know we should have read that version but didn't, so we abort our validating transaction and roll back its changes, because it would violate serializable ordering. Is this clear?
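As a rough sketch of this validation loop, here's how the predicate-versus-delta-record check could look in C++; the types, the numeric-only values, and the NULL-as-false rule are my simplifications of what the walkthrough describes, not HyPer's implementation.

```cpp
#include <functional>
#include <optional>
#include <unordered_map>
#include <vector>

// A delta record is modeled as a map from attribute id to the value a committed
// transaction wrote; a read predicate evaluates over an "attribute id -> maybe
// value" accessor and, per the NULL rule above, comes out false whenever it
// touches an attribute the delta did not modify. A real system handles every
// SQL type (e.g. the LIKE '%ice%' case), not just numbers.
using DeltaRecord   = std::unordered_map<int, double>;
using AttrLookup    = std::function<std::optional<double>(int)>;
using ReadPredicate = std::function<bool(const AttrLookup&)>;

bool PassesValidation(const std::vector<ReadPredicate> &my_read_predicates,
                      const std::vector<DeltaRecord> &recently_committed_deltas) {
    for (const DeltaRecord &delta : recently_committed_deltas) {
        AttrLookup attr = [&](int attr_id) -> std::optional<double> {
            auto it = delta.find(attr_id);
            if (it == delta.end()) return std::nullopt;   // not modified -> NULL
            return it->second;
        };
        for (const ReadPredicate &pred : my_read_predicates) {
            if (pred(attr)) return false;   // they wrote a version we should have read
        }
    }
    return true;   // no overlap between our reads and their writes: safe to commit
}

// Example: the "attr2 > 20 AND attr2 < 30" predicate from the walkthrough.
// Against a delta that set attr2 = 33 (or 99) it evaluates to false, so that
// committed transaction does not conflict with the validating one.
const ReadPredicate kExamplePredicate = [](const AttrLookup &attr) {
    std::optional<double> v = attr(2);
    return v.has_value() && *v > 20.0 && *v < 30.0;   // NULL compares as false
};
```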
Yes, sorry? So you mentioned that you look at all the transactions that started after validation started, right? No, these are transactions that committed after I started. After you started... No, no, after I, the validating transaction, started. Yeah. So it's a set of transactions you're checking against. Can't this list of transactions that committed after I started just keep growing and growing? I mean, if the validating transaction takes an hour, yes, this thing can keep growing and growing. Most OLTP transactions don't take an hour. The most common scenario is I update a small number of things in a few milliseconds and commit. The bulk update stuff we talked about when we covered transaction models, those are rare corner cases; the common case is that transactions do a small amount of updates and commit right away. So yes, in theory this thing could be huge, but in practice, no.

Is there only one transaction validating at a time? In the actual implementation of Hyper, yes. The dirty secret about Hyper, and it probably doesn't come out in the paper, is that it's actually single-threaded execution for write transactions. So in that case, yes. In practice, though... Because while you're validating, there can be other transactions validating, and they can commit, and then you might miss some. So the statement is: if I'm validating and some other transactions commit while I'm still validating, what happens? If those transactions started before I started, I can go check them. If they started after I started, then I don't need to. What you could do is, again, you know what transactions exist in the system, so you could stall and wait until you've heard from all the transactions that were still running and started before you, if you wanted to handle that. Otherwise, again, first writer wins: if I finish my validation and there are still transactions that were active when I started, and they wrote something that I didn't read, that's okay, because they would get a commit timestamp after I committed. Their commit timestamp would be after I finished, so I wouldn't have been expected to see their writes anyway. That's fine.

Yes? So what's the difference between the 'Ice Cube' entry and the 'B.I.G.' entry here? Like, this is one tuple: 'Ice Cube', 200. That's one tuple. The latest uncommitted value is 200, and it's uncommitted; and then this pointer always points to the latest committed version. First writer wins, so nobody else can come and try to overwrite it from another uncommitted transaction. Okay. Yeah, and Biggie here, The Notorious B.I.G.: this one's version vector is null, so that means the committed version is exactly what's here in the main table space. If the pointer is not null, then what's in the main table is uncommitted; if it is null, then it's committed. Yes?

So how do you handle complex joins here? The question is how to handle joins. Again, how do you do a join? A join is just access methods feeding into a join operator. At the end of the day, you have a where clause that determines what tuples you're actually reading from the base tables, and whether you join them after that or not doesn't matter. So you don't actually need to re-execute the join. But isn't it likely that the join result is empty, and some other write breaks the predicate...? So I think what you're saying is: if one side of the join produces no tuples, I short-circuit the join and don't actually do the scan on the other side... I'm missing your example. For example, your result was empty, but one of the writes on the right side breaks one of the predicates, and the transaction is... Yes. What's wrong with that? Actually, it shouldn't be aborted. Why shouldn't it be aborted?
No, I think you're confusing logical tuples with physical tuples. Logically, I should have read that tuple; physically, I didn't, because it wasn't there when I went to read it. Again, that would be a phantom, and that would violate serializable ordering. To be very clear: we're doing this because we want to maintain serializability, so we don't want phantoms. Okay. Like I said, this is an old idea, and it's very cool.

The other cool thing about Hyper is the version synopsis. Again, Hyper is designed for maximizing, or at least improving, performance for analytical queries. The issue with these version vectors is that we don't want to have to check, every single time as we scan along, whether this entry is null or not. We'll see this when we talk about query compilation and other optimizations for making queries run faster. Don't think about it in terms of a disk-based system, where who cares if I'm checking whether a pointer exists or not. If everything's already in memory, I want to run as fast as possible, and I want my query to have almost zero if-clauses, zero branches, because that's the fastest way to use modern CPUs. But if I'm doing a scan on this table and every single time I have to say, if this is null do this, otherwise do that, that's going to suck in terms of performance, because the CPU won't do a very good job predicting which branch I'm going to go down.

So what they do is add this additional column, this additional metadata, to keep track of which portions of a block of data, and I think they store 1,024 tuples per block, which offsets within a block have a version pointer that's set versus null. My version synopsis for this block here is two to five. So if I treat all of these locations as offsets, then this is saying that outside the boundary of two to five, so zero to one and five to six, the version vector pointer is null for both of those strides. That means when I'm scanning along and I'm in those offset ranges, I don't need to check the version vector at all; that's a code path during query execution I completely skip, so I have no if-clauses and I run much faster. And then for the range two to five, I have to go check the version vector for each tuple as I go along. Yes? The question is, if I update the version vector, do I also need to update this synopsis? Yes. Do you have to latch it as well? So he says, if I update the version vector, do I have to latch, not lock, the synopsis when I update it? Yes, but again, you can do that with an atomic compare-and-swap as well.

Again, the idea here is that we want to be able to find long strides of tuples where the version vector is null, so that we don't have to check it, and that makes our queries run much, much faster. Think of it as getting down to instruction-level or cache-level optimizations. That's one of the big things we're going to focus on when we talk about analytical queries: it's not about reading data from disk, it's about how fast we can read stuff that's already in memory. So not having if-branches when we scan these portions makes a big difference.
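Here's a small sketch of how a scan can use that per-block synopsis to keep the version check out of the hot loop; the block layout and field names are my own illustration, not HyPer's.

```cpp
#include <cstdint>
#include <vector>

// Offsets outside [versioned_begin, versioned_end) are guaranteed to have a
// null version vector entry, so those strides run as tight branch-free loops.
struct Block {
    std::vector<int64_t> values;           // one column of the block (e.g. 1024 tuples)
    std::vector<const void*> version_vec;  // null => in-place value is the one to read
    int versioned_begin = 0;               // synopsis: only [begin, end) may have versions
    int versioned_end = 0;
};

int64_t SumColumn(const Block &b) {
    int64_t sum = 0;
    // Branch-free stride before the synopsis range.
    for (int i = 0; i < b.versioned_begin; i++) sum += b.values[i];
    // Only here do we pay for the per-tuple version check.
    for (int i = b.versioned_begin; i < b.versioned_end; i++) {
        if (b.version_vec[i] == nullptr) {
            sum += b.values[i];
        } else {
            // follow the version chain to find the visible value (omitted here)
        }
    }
    // Branch-free stride after the synopsis range.
    for (size_t i = b.versioned_end; i < b.values.size(); i++) sum += b.values[i];
    return sum;
}
```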
All right, any questions about Hyper? Okay, so let's jump into HANA. HANA is the flagship in-memory HTAP database system from SAP. SAP is a sh** company; they have many databases, they bought Sybase a few years ago, they bought a ton of stuff, but HANA is sort of the main go-to system now. So it's an in-memory HTAP system that uses time-travel version storage, and like Hekaton, they support both optimistic and pessimistic MVCC. I actually don't know what the default is; I think the paper you guys read last class said it was optimistic, and then the Korean guys that work on this stuff told me it's now actually both.

HANA originally was sort of a Frankenstein of a bunch of different database systems that SAP bought. P*TIME actually came from South Korea; it's an awesome in-memory system that they bought. TREX was a search engine that they bought. MaxDB is another system that they bought. They mashed these things all together and called it HANA. I actually don't know what HANA stands for; if you read Wikipedia, it stands for Hasso Plattner's New Architecture. Hasso Plattner is the P in SAP, it's three guys' names, and he was really big on building this new system, so they named it after him. It also supports a hybrid storage layout with both rows and columns; this will make more sense in a few weeks when we discuss all that. And it's been around for a while.

So the way they do version storage is that they store the oldest version in the main data table, and then the newer versions, including uncommitted versions, always live out in the version space, the time-travel space. And unlike in Hekaton, where you store the timestamp information about the visibility of a tuple within the tuple itself, they have a separate flag that says, this tuple has other versions, go find where they are. And instead of, as in Hyper, a version vector that points to where these versions are, they maintain a separate hash table that maps the record ID, a logical record ID for the tuple, to the head of the version chain, and it goes from newest to oldest. So if I'm scanning along and I'm reading a tuple, I check this flag, and if it's set to true and I want to find the newest version of the tuple, I jump over to the hash table and find the version that way, right? And they do this because they don't want to store all the additional metadata about the versions over here in the main table.

Yes? Is there a reason for having the oldest version here? So the question is, why is the oldest version here and not the newest version like in Hyper, and why is it newest-to-oldest? I actually don't know the answer; I don't remember why. I think what happens is that this is the row-store version of the data, and they then transition it to the column store, but again, I don't remember. It's in the paper, actually; this is based on the paper you guys read for Wednesday, so I'll correct this on Wednesday.

All right, so the other interesting thing is that, as I said, they're not going to store a global state map like Hekaton to say whether a transaction is committed or not. What they do instead is store separate transaction context objects that tell you everything about whether the transaction that modified a particular tuple has committed or not.
So, instead of storing metadata like the transaction ID inside of the version itself, they just store a 64-bit pointer to one of these transaction context objects. If you want to figure out whether the transaction is committed or not, you follow that pointer. So there's no information inside the version itself about whether it was created by a committed transaction; it's always in these separate transaction contexts. Let's say we have a transaction here, it gets timestamp 3, and it does a write on C and a write on D. Up here in the version storage space, these two versions will have pointers to this transaction context. And then, if you want to find out whether this thing has actually been written out to disk, they have something else called the group commit context: additional metadata that tells you whether all the transactions that are part of this commit group have been flushed out to disk or not. So you have these multiple layers of indirection that maintain state about the transaction, and whether it's been flushed out to disk in the log, in these global objects rather than embedding it inside the versions themselves. They do this to reduce space and to reduce the overhead of denoting that a transaction has committed. Again, say I do an update that touches a billion tuples. Instead of going through every single tuple, as I would in the case of Hyper, and flipping those flags to say this thing has been committed or not, I just go and update this one context object. I think that's also why they store the oldest version in the main table: because at some point this context goes away, and then you can migrate the version over. So I think that's the answer. So this is another way of organizing the system and tracking all these different things to reduce the overhead that Hekaton would be susceptible to.
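A tiny sketch of that indirection, with invented names, just to show why commit becomes a single write to a shared object rather than a pass over every version; this is my illustration, not HANA's code.

```cpp
#include <atomic>
#include <cstdint>

// Every version holds a pointer to a shared transaction context, so committing
// means flipping one flag in one object instead of rewriting the metadata of
// every version the transaction touched.
struct TransactionContext {
    uint64_t txn_id = 0;
    std::atomic<bool> committed{false};
    std::atomic<uint64_t> commit_ts{0};
};

struct VersionEntry {
    TransactionContext *creator = nullptr;   // 64-bit pointer instead of inline timestamps
    // ... version payload ...
};

// A reader resolves commit status by chasing the pointer once.
bool CreatorCommitted(const VersionEntry &v) {
    return v.creator != nullptr && v.creator->committed.load();
}

// Commit touches only the shared context, no matter how many versions were written.
void Commit(TransactionContext &ctx, uint64_t commit_ts) {
    ctx.commit_ts.store(commit_ts);
    ctx.committed.store(true);
}
```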
The next thing to discuss is the limitations of implementing optimistic concurrency control. Hekaton was optimistic, HyPer was optimistic, and SAP HANA, as well, supports an optimistic mode. If your workload has low contention, meaning transactions are updating different, essentially random tuples, then aborting won't be an issue: you wrote to one thing, I wrote to something completely different, who cares. But if there's high contention, which is very common in OLTP settings where most transactions are trying to update a small portion of the database, then you're going to get aborts because of first-writer-wins and other problems, and those aborted transactions mean wasted work and a lot of churn. For systems implementing pure OCC, you also have to copy tuples from the global database into your private workspace, and those are additional memory writes that can slow down performance. I want to be able to read data without having to do a write just in order to read it, but in a lot of these systems, if you implement pure OCC, you have to copy things so that you get repeatable reads. The last one is a bit more nuanced, and it will probably make more sense a week from now: to ensure we have no phantoms when we have indexes, that is, to make sure two transactions don't try to insert the same record into the same index at the same time, under OCC we have to install what are called virtual nodes. A virtual node is basically a placeholder in the index that says, "I just inserted this key; I haven't really committed it yet, but I'm going to come back later and make it real." This adds additional writes into memory, which can slow us down and cause other contention issues in our indexes. Again, we'll cover index locking, gap locks, and related things next week.

So, there was a post-doc here at CMU who worked with Dave Andersen, Hyeontaek Lim, and he built his own OLTP execution engine called Cicada. This was again expanding on Hekaton and a bunch of other systems to make in-memory MVCC work better for pure OLTP workloads. He's not worried about OLAP in the way that HyPer and HANA are; this is pure OLTP. The paper points out four key contributions, but I'm only going to cover the ones I think are most interesting. First, they do best-effort inlining: rather than having a version chain hanging off every single tuple, you try to inline the latest version, the next version, with the tuple itself, so that in most cases you don't have to follow the version chain at all. Second, they apply techniques from distributed systems and use loosely synchronized clocks for timestamp allocation; we don't need to worry about that one for now. It's a good idea and at some point we should discuss it in more detail, but for now we can ignore it. Third, they do contention-aware validation: rather than just blindly checking who you might conflict with when you do validation, they can be slightly smarter and say, "I'm likely to conflict with this transaction over here for some reason," so they validate against it first, rather than just walking the list in the order the entries were added.
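Here is a rough sketch of what that validation ordering could look like; the structures, fields, and helper function are hypothetical stand-ins, not Cicada's actual code:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch of contention-aware validation ordering.
struct AccessEntry {
    std::uint64_t last_writer_ts;  // commit timestamp of the newest writer
                                   // of this tuple when we read it
    // ... pointer to the tuple / the version we expect to still be visible ...
};

// Stand-in for the real visibility re-check done during validation.
bool still_visible(const AccessEntry&) { return true; }

// Check the entries whose tuples were written most recently first: those
// are the most likely conflicts, so an abort is discovered early instead
// of after validating the whole uncontended remainder of the access set.
bool validate(std::vector<AccessEntry>& access_set) {
    std::sort(access_set.begin(), access_set.end(),
              [](const AccessEntry& a, const AccessEntry& b) {
                  return a.last_writer_ts > b.last_writer_ts;  // newest first
              });
    for (const auto& e : access_set) {
        if (!still_visible(e)) return false;  // conflict, abort now
    }
    return true;
}
```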
And then the last contribution, which I think is super interesting, I can't say too much about because I don't know whether it's a good idea, but they do it, and it's something I want to pursue eventually in our own system: they store the index nodes in tables themselves rather than just on the heap. I'll explain what that means in a second.

So here's how best-effort inlining works. The metadata about the tuple itself is stored in a fixed location, and the idea is that most of the time you don't need to follow the version chain, because you probably only need the latest version. So rather than having a pointer off to some other version somewhere in the heap, you embed the next version directly alongside the tuple itself. The way I think about it: you store the record itself over here, and then you have a little extra space right next to it for the next version, so you're packing two versions together. And you can do this selectively, based on how often you think you'll need to traverse the version chain. This would get tricky if versions were variable length, but I guess they're always fixed length here, so that's not an issue. And again, you wouldn't want to do this in HyPer, because HyPer is a column store; this is purely a row-store technique. I'll show a little sketch of what this might look like at the end of this discussion.

Now, when it comes time to do validation, there are three optimizations you can do. For the first one, you can be aware of what contention or conflicts might exist among the transactions that ran at the same time, so rather than comparing read sets in plain primary-key order, you reorder the checks based on write timestamps: transactions that have written something recently are the ones I'm most likely to conflict with, so I check them first. And if everyone checks in the same timestamp order, you never get deadlocks, because everyone is always going in the same direction. The standard technique is primary-key order, where everyone checks A, then B, then C; here you go in timestamp order instead. The next thing you can do is, during the validation phase, before you go ahead and apply any global writes to shared data structures or the shared table space, you peek ahead and ask: am I actually going to abort, given the transactions running right now? You keep track of which tables transactions often conflict on and have to abort because of, and do a pre-validation to see whether you're going to abort later anyway. The idea is that rather than waiting until the very end of the transaction to do validation, only to find out that the first query I ran caused the problem and I have to roll everything back, maybe if I check immediately after that first query I can kill the transaction ahead of time. The last optimization is that instead of always doing a complete walk down the version list, I keep a little extra table, a hash map on the side, that says: if you're looking at this tuple, here's how to jump straight to the version you probably really want. The idea is that you identify which tuples cause a lot of cache misses or long version-chain traversals, and you keep a little index on the side to jump ahead more quickly.
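Here is the little sketch of best-effort inlining I mentioned: one fixed-size version slot is co-located with the record head, and only when that slot is occupied does an update fall back to allocating a chain version on the heap. The layout and names are my own guesses for illustration, not Cicada's actual structures:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Rough sketch of best-effort inlining; layout and names are assumptions.
constexpr std::size_t kPayloadSize = 64;  // fixed-length tuples keep this simple

struct Version {
    std::uint64_t write_ts;
    Version*      older;              // rest of the chain, newest-to-oldest
    char          payload[kPayloadSize];
};

struct RecordHead {
    std::atomic<Version*> newest;          // head of the version chain
    Version               inlined;         // one version packed with the record
    std::atomic<bool>     inlined_in_use;  // is the inlined slot taken?
};

// Common case: claim the co-located slot, so reading the latest version stays
// on the same cache lines as the record head. Otherwise ("best effort") fall
// back to an ordinary heap-allocated chain version.
Version* acquire_version_slot(RecordHead& rec) {
    bool expected = false;
    if (rec.inlined_in_use.compare_exchange_strong(expected, true)) {
        return &rec.inlined;
    }
    return new Version{};
}
```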
Going back to the validation optimizations: for the first two, you essentially want to skip them if most of your recent transactions have been committing successfully, because they are extra work that slows down transactions under normal execution. Checking whether I'm going to pass validation immediately after I do an update query is something I'm going to do again later anyway, during actual validation; rather than doing it twice, or making the transaction take longer because of the extra step, if I know I'm not likely to abort, I just skip it. The last one, the side index into the version chain, is something you would do anyway, and it reduces contention.

The last thing we'll talk about is the index storage. As far as I know, and maybe I missed it in the paper, nobody else actually does this. SQLite might do something similar; I need to check with Richard, the guy who implemented it, to see whether that's true. Normally, for an in-memory index, the nodes are just sitting on the heap: I malloc a bunch of blocks and keep track of where they are, but they're separate regions of memory, apart from the fixed-length and variable-length data pools for tuples. In Cicada, the nodes of the index are actually mapped to tuples in a table. There's a special index table alongside each regular logical table, and in Cicada's case the entries are just blobs that pack in the binary representation of the index node itself. If the node were on the heap, it would just be an offset into some memory, and you'd know how to interpret the bytes inside it; it's the same thing here: if it's a blob, I know how to jump to the offset where the node is stored and interpret the bytes.

What's really interesting about this is that you get phantom avoidance for free, I think, or so they claim. I'm already doing phantom detection in the ways we talked about before for my regular tuples. If I want phantom detection for index scans, I would otherwise have to do the predicate precision locking from HyPer, or the gap locks or next-key locks we'll see next week on the B+tree. But if I pack the nodes of the index inside a table, and I'm already doing validation on the tuples in regular tables, then doing the same kind of validation on the nodes packed inside the index table gives me phantom protection for free. That means I don't need separate special-case code to handle indexes; indexes are just another table. I think that's really cool. I think it might work; I don't know if it's actually a good idea or not.

And as I said, the data about a node is stored as a blob, but you could imagine breaking out the keys and the pointers and whatever other metadata a node maintains into separate columns, and doing more fine-grained validation, the way the HyPer guys do with precision locking. To be clear, you still need precision locking to make this work, but you're already doing that for tables; now it's the same thing for indexes. They treat the node as a blob, so it's very coarse-grained, but you could imagine breaking the metadata out on a per-column basis. As I said, I don't know of anybody else that does this. I think it's really cool, and it would be an interesting research paper to see whether it's actually a good idea. It's one of those things that's in the paper as just one or two paragraphs, like "oh, by the way, we did this," and they didn't explore it in any further detail.
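Here is a tiny sketch of the idea of treating index nodes as rows in a companion table; the names and the blob encoding are assumptions for illustration, not Cicada's actual code:

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Sketch only: an index node is a row (an opaque blob) in a hidden companion
// "index table", keyed by a node id, instead of a raw heap allocation. Links
// between nodes are node ids rather than pointers, so every node read and
// write goes through the same versioned tuple access path as a normal table,
// and the validation that already catches conflicting tuple writes also
// catches conflicting index updates (phantoms) with no index-specific code.
using NodeId   = std::uint64_t;
using NodeBlob = std::vector<std::uint8_t>;  // serialized keys + child/sibling ids

struct IndexTable {
    // In a real engine these accesses would be recorded in the calling
    // transaction's read set and write set like any other tuple access.
    std::unordered_map<NodeId, NodeBlob> rows;

    const NodeBlob& read_node(NodeId id) const { return rows.at(id); }
    void write_node(NodeId id, NodeBlob blob)  { rows[id] = std::move(blob); }
};
```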
As I've said, I think this is something we should look into, but that's an aside. All right, let's finish up real quick and look at some results.

The post-doc who implemented Cicada is an amazing hacker; he also implemented the latest versions of all the OCC and MVCC schemes that were available. So we're going to compare against Silo, a very influential system, and an improved version of Silo that came out of Harvard. TicToc is another OCC system that I was involved with, along with a PhD student, back at MIT. FOEDUS came out of HP Labs. Hekaton we've already covered. ERMIA is an improvement over Silo out of Toronto. And then Cicada is this particular system here.

For this first workload, we're running YCSB with almost no contention: every transaction reads or writes a single tuple. In this environment, the modern implementations, TicToc, the improved version of Silo, FOEDUS, and Cicada, all scale really well. These other ones flatline over here; I forget exactly why. I think the one crapping out there is Hekaton, ERMIA is crapping out over here, and then 2PL looks a lot like what we saw before when we talked about lock thrashing. So that's what's going on there; this one is not that interesting.

So let's look now at high contention. This is running the TPC-C benchmark with one warehouse. You'll be very familiar with TPC-C by the end of the semester; it's the standard benchmark people use for OLTP systems. The way it's modeled, think of it like an Amazon storefront: one warehouse has districts, customers, orders, and order items, in a tree structure. With one warehouse, almost every single transaction is trying to update that one warehouse tuple, so it's super high contention, and everyone performs terribly except Cicada, which does a little bit better. My system is the purple one, so that's us down near the bottom; I'm okay with that. In the paper, they discuss how it's the combination of all the various optimizations they built specifically for high-contention workloads: the inlining, the early validation, all those extra steps you can take for transactions that are going to abort anyway. There's no magic to make conflicting transactions not conflict; that's just going to happen. The question is really what you can do to reduce the amount of wasted work you do on transactions that are going to abort. Remember the graph from the Staring into the Abyss paper: under high contention at a thousand cores, most of the protocols spend most of their time aborting transactions. So all the extra stuff that Cicada does lets it get slightly better performance than the other ones. Again, the question is what percentage of that can be attributed to which technique; how much comes from the index storage stuff we talked about? I don't know; look at the paper. Any questions about Cicada?

So again, what was today about? Today was about looking at different implementations of MVCC. We saw how they check for phantoms in different ways, we saw how they organize the version chains in different ways, and after the initial Hekaton discussion we saw a bunch of optimizations to make MVCC run better for different workloads.
We saw that HANA stores global transaction contexts so it doesn't have to update every single tuple. We saw that in HyPer, the version-vector storage, precision locking, and version synopses make it better at analytical queries. And for Cicada, we saw a bunch of tricks to make it run better under contention. Again, there's no one system that does all of the amazing tricks we've talked about; it's our job to understand which ones are actually worth pursuing. We can only implement so many things, so which ones should we give high priority to? It depends on what kind of workload you want to target. Any questions?

The reading for Wednesday is the SAP HANA paper on garbage collection. It's an okay paper; there will be typos, and you will see them. It's from the industrial track, where the reviewing bar is a little lower, but in my opinion it is a good paper, because it covers all the sub-design decisions you have to make about garbage collection in way more depth than what we've discussed so far, and for the most part it's pretty straightforward to understand. It's probably the only MVCC garbage collection paper out there, which is why I picked it. Okay? All right, guys, thanks for watching.