All right, it's Thursday night. Classes got canceled on Wednesday because of the weather, so I'm here at my home office. Let's talk about databases. Today we're going to have the third and final lecture on in-memory multi-version concurrency control, and we're going to go into more detail on how we do garbage collection, which is a very important part of doing MVCC. The idea is to go deeper than what we've covered so far, because we want to understand how garbage collection, and the need to be able to do it, permeates the entire system, and to start understanding again how we're actually going to build a real system that can do all the things we've talked about.

We've already defined what garbage collection is, so I just want to reiterate it: it's the idea of determining when physical versions are reclaimable. Transactions come along and update existing logical tuples in the database, and under MVCC that causes us to create new versions because we don't want to overwrite the old ones. At some point those older versions are no longer visible to any active transaction, and therefore they're deemed reclaimable, meaning we can free up the memory and reuse it for something else. Also, if a transaction does a bunch of updates that create new versions but then gets aborted, we want to go back and reclaim that memory as well. So the idea is that we want to prune out the old versions, unless we're doing time-travel queries, but for our purposes in this class we said we're not going to make that assumption.

The basic way garbage collection works in MVCC is that all the metadata we talked about over the last two classes, the metadata the database system maintains inside the tuples to figure out which versions should be visible in my snapshot, is the same metadata we're going to use to figure out which tuples, or which versions, are reclaimable so we can free them up.

There are a couple of things we need to talk about today that are slightly different from what we've covered so far, and this sets the stage for why we need more sophisticated garbage collection than what we've been assuming. Up until now we've focused primarily on OLTP workloads, where the transactions or queries that update the database are short-lived: they do some operation and commit in milliseconds. Because the transactions are short-lived, the visibility of older versions is also short-lived, meaning the garbage collector can come along and free up memory fairly quickly from the time a version gets expired. But now if we think about HTAP workloads or HTAP environments, where there's an amalgamation of fast-running transactions and longer-running analytical queries, and those analytical queries are running under snapshot isolation as well, then they can cause the garbage collector to get paused, because it can't clean up old versions while there's still some query running that's still active and
could still see those old versions.

I had you read the SAP HANA paper on garbage collection, and they talk about how, for some of their customers, they saw queries running for hours. So garbage collection gets paused, all these old versions get backed up, and you can't remove them because you have to wait for those long-running queries to finish. They talk a little about how sometimes the long-running queries exist because people wrote bad code that held on to cursors longer than it should have, but in some cases there are queries that genuinely do need to run that long. And again, you have to let them read what they need to read: under snapshot isolation you need to be guaranteed that you see everything that existed at the moment you started.

So what are the issues if you have these old versions that you can't free up? The most obvious one is increased memory usage. All these old versions you can't remove, because you're waiting for the long-running queries to finish, and you keep allocating more and more memory to store the versions that pile up behind them, because the OLTP side is still running transactions that update the database, and those create new versions you need to store. Memory isn't cheap, not only to buy and put in your machines but also to maintain, because you have to keep feeding it energy just to hold the charge. So having to provision a large amount of memory on your machine just because there could be a spike in memory usage from old versions you can't clean up,
that's not a good selling point for a database system.

The other thing is that if we end up allocating more memory, we have to go to the operating system to get it, and the operating system is our enemy: we want to ask it for as few favors as possible. Having to call malloc, where occasionally malloc has to go down to the operating system and do a syscall to extend the process's address space, is something we want to avoid as much as possible. So if we find ourselves allocating more and more space because we can't go back and clean up old versions, that's more syscalls, and the OS is just going to get in the way of our life.

Inside the database system, if you have these old versions, then depending on the version storage scheme you're using and what order your version chain is maintained in, you can end up with really long version chains that take a long time to traverse in order to find the correct version your transaction needs. If you're doing newest-to-oldest, it's not that big a deal, because most transactions need the latest version, and that's always at the head of the version chain, so you're not traversing a long chain. But in the case of Hekaton, for example, we saw it was doing oldest-to-newest, so if you have a really long version chain you have to traverse it every single time just to get to the version you should be seeing, because you have all these old versions you can't clean up since a long-running query is blocking the garbage collector.

The next issue is a bit more nuanced, and it's something that doesn't really come up much in academic papers. If you have long periods where you can't collect any garbage because a query is blocking the garbage collector, then when that query finishes there's a whole slew of old versions you can suddenly clean up. Now your garbage collector goes to town: it starts deleting all these old versions and freeing up memory, and that spikes the CPU during this period, because there's a lot of computational work it needs to do. This is not good in a commercial or real-world deployment setting, because people don't like wild swings in performance. During the period when the garbage collector is freeing all these old versions that have been piling up for the last two hours, the CPU is pegged because it's finding them and freeing them over and over again. Any query or transaction that runs during that garbage collection spike ends up with worse latency, because you're now using CPU, maybe multiple threads, to do garbage collection and free as much memory as possible. People don't like that; they don't like the p99 latency of their transactions wildly fluctuating for random periods of time. So this is bad, and ideally we want the garbage collector to run incrementally, doing a little bit of work
every so often, so that we smooth out performance and don't have wild oscillations.

The last issue is also a bit more difficult to wrap your head around, but we'll cover it later in this lecture, and it comes up more when we talk about other storage issues. If versions are being stored in all sorts of random locations in these blocks, then when we go back and delete all the old versions, we end up with blocks that have a bunch of gaps, empty space where versions used to be, while maybe one or two versions in that block are still active. Those were written a few hours ago, or at some other time, and now you're putting new versions into those gaps that weren't added to the database around the same time as the versions already in that block. So within a single block, if you have a long period where you can't clean things up and then all of a sudden you open the floodgates and free all that memory, you end up with tuples in a single block that were added to the database at dramatically different times, and therefore have different access patterns when transactions touch them. That causes more cache misses and makes it more difficult to do compaction, compression, and other things, because now some tuples in a single block are old and some are new. I'm being a bit hand-wavy about this part; it'll come up later in this lecture and again later in the semester. The basic idea is: if I can't free up memory within a block, then when the garbage collector later does free it, I have a bunch of space I can put new tuples into, but they'll be mixed together with old tuples, and that's usually not good for cache locality and the other things we'll talk about.

All right, so given these problems, today we're going to talk mostly about garbage collection, but before we get there, there's a bunch of stuff I need to cover about MVCC, about how you build a system using MVCC, that I should have covered in the last two lectures and that you need in order to understand the issues we're going to deal with when we do garbage collection. The first thing I'm going to talk about is how we actually handle deletes in in-memory MVCC. Then we'll talk about how we do indexing: not so much whether you use the logical versus the physical pointers, or the indirection layer we talked about in the first lecture, but how the index stores information about versions inside of itself. Then we'll get into garbage collection, and then we'll talk about block compaction, which is basically coalescing: once you free up a bunch of memory, what do you actually do with it?
The issue we have to deal with is that, just like with an update, when we do a delete we logically want to delete the tuple, but we don't want to physically delete it until we know nobody can still see it. But we do need to maintain some information to say that the logical tuple was deleted, because once it's deleted we want to make sure no new version gets added to the chain after that. This is a good example of why using the simple first-writer-wins rule to avoid write-write conflicts makes your life easier: you don't have to worry about the case of one transaction deleting a tuple while another transaction tries to update it, and how that fits into the version chain. All of that goes away if you just say whoever writes to it first, whether it's an update or a delete, always wins.

So what we need in order to handle deletes is just a way to mark that we've logically deleted a tuple, and then we garbage collect it later on. There are essentially two ways to do this. The first, which as far as I know is the most common one, is that you just have a little flag, a single bit, that says a tuple has been deleted, and it always gets applied to the newest physical version of the tuple; that way you know nothing else comes after it in the version chain. There are two places you can store this: either in the version's tuple header, where there's a little extra space for flags and other information about the tuple, or in a separate column that's just a bitmap saying whether each tuple has been deleted.
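To make that concrete, here's a minimal C++ sketch of the flag approach and the visibility check a reader would run while walking a version chain. All the names and the struct layout are illustrative assumptions, not any system's actual code, and it glosses over the commit protocol details (real systems temporarily stash transaction ids in these timestamp fields while a writer is in flight).

```cpp
#include <cstdint>
#include <limits>

constexpr uint64_t kInfinity = std::numeric_limits<uint64_t>::max();

struct VersionHeader {
  uint64_t begin_ts = 0;          // when this version became visible
  uint64_t end_ts   = kInfinity;  // when it was superseded
  bool     deleted  = false;      // logical delete marker; could equally live in
                                  // a separate bitmap column, one bit per tuple
  VersionHeader* next = nullptr;  // next entry in the version chain
};

// A version is visible to a snapshot at `read_ts` if the snapshot falls inside
// [begin_ts, end_ts).
bool IsVisible(const VersionHeader& v, uint64_t read_ts) {
  return v.begin_ts <= read_ts && read_ts < v.end_ts;
}

// Walk the chain; if the version this snapshot would see carries the deleted
// flag, the logical tuple does not exist in this snapshot, so report "not found".
const VersionHeader* FindVisible(const VersionHeader* head, uint64_t read_ts) {
  for (const VersionHeader* v = head; v != nullptr; v = v->next) {
    if (IsVisible(*v, read_ts)) {
      return v->deleted ? nullptr : v;
    }
  }
  return nullptr;
}
```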
In our system, which is based on HyPer, we use the separate bitmap column. The downside is obviously that you're now spending space just to keep track of whether a tuple has been deleted or not. It's one bit per tuple, and if you pack the bitmap you have to pad it out, which I'll talk about later, but it's not that big of a deal: most tuples aren't going to be deleted, and yes, you're storing space for something that's almost always false, but it makes your life easier when you go do the garbage collection later.

The other approach is to use what's called a tombstone tuple. Here you append a new physical version at the head of the version chain (or at the end, depending on how you're ordering it) that has a special bit pattern in its next pointer. You check for that pattern when you're traversing the version chain, which is an inexpensive operation, basically one instruction, and if you see it, you know it's a special marker saying that the tuple you're looking at was logically deleted at this point in time. You could store this special tombstone tuple within the regular table storage: if you're doing append-only, you just store it as a regular row. The problem is that if your tuples are really wide, with a lot of attributes that take a lot of space, then you're allocating a bunch of space just to store that bit pattern, which is wasteful. One way to overcome this is to have a separate pool of tombstone tuples that only have the metadata, the timestamp ranges you need to figure out when the delete actually occurred, plus the next pointer with the special pattern, but none of the data attributes. And for this pool it doesn't matter which table the tombstone is for: you can share the pool across all tables, because there's nothing about the tombstone that's specific to any particular table; the only thing that matters is that pattern.

So again, as far as I know, everyone implements the first approach. In Peloton we actually implemented the second one; I don't remember why. Our first implementation didn't even use the tombstone pool, we allocated a whole new tuple just to store the tombstone flag, which is a bad idea. We got rid of that, and of course we've now switched over to using the separate delete flag, which is the right way to go.
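For completeness, here's roughly what the tombstone-marker check on the next pointer can look like, assuming the marker is a reserved low bit in an otherwise aligned pointer. This is one plausible way to encode the "special bit pattern" described above, not necessarily how any particular system does it.

```cpp
#include <cstdint>

// The low bit of a version pointer is always zero for real (aligned) versions,
// so it can serve as the reserved "this is a tombstone" pattern.
constexpr std::uintptr_t kTombstoneBit = 0x1;

inline std::uintptr_t MakeTombstonePointer(std::uintptr_t next_raw) {
  return next_raw | kTombstoneBit;          // install when the delete commits
}

inline bool IsTombstone(std::uintptr_t next_raw) {
  return (next_raw & kTombstoneBit) != 0;   // the cheap check during traversal
}

inline std::uintptr_t StripTag(std::uintptr_t next_raw) {
  return next_raw & ~kTombstoneBit;         // recover the real pointer value
}
```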
All right, the reason I brought up deletes is that they complicate some things we now need to talk about with indexes and garbage collection. We've already talked about what actually goes in the index values: is it a physical pointer to the head of the version chain, or is there some kind of indirection layer, like we discussed in the first lecture?

The thing to understand about indexes and multi-version concurrency control is that, as far as I know, most MVCC databases don't store any version information about the tuples along with the keys. There's nothing about the timestamps of when a key was added; none of the begin timestamp and end timestamp metadata we saw for regular tuples is stored in the index itself. I think the reason is that a lot of database systems, at least the newer ones, use open-source data structures that don't have any of this versioning information in them; you'd have to build something custom to get it. We can take offline whether that's actually a good idea or not, but as far as I know, nobody stores version information about tuples in the index itself. The only exception would be something like an index-organized table, say a B+tree where the leaf nodes actually store the tuples themselves, like MySQL does. In that case you could argue the version information is in the index, but not to the degree we're talking about here.

That means every time we follow a key and it points us to some version chain, we have to traverse that version chain to figure out whether the key we found is actually visible to us or not. We also need every index to support duplicate keys, because the same key may appear, disappear, and reappear in different snapshots; the same key may end up pointing to multiple version chains. So when you ask the index for a bunch of keys, or you do a fetch on a single key, you may get back a list of version chains that you then have to follow to figure out which one is actually the correct one for you. And this happens not just for non-unique secondary indexes but also for primary key indexes and unique indexes. For all indexes in an MVCC database, unless there's version information in the index itself, which at this point we're assuming there isn't, you have to follow the version chains, because you may get back multiple entries.

This is tricky, so let's understand why we have to do it. We have a table with a single tuple A that right now has only one version, and an index that points to the head of the version chain. We're not going to do the full Hekaton flavor of MVCC here, but something very similar: transactions get a begin timestamp when they start and later a commit timestamp. So we have one thread that starts a new transaction at timestamp 10, and this transaction just wants to do a read on A: it follows the index, gets the head of the version chain, and since we're going oldest-to-newest it sees that version A1 is visible to it, and that's the one it reads. Now another thread starts a new transaction at timestamp 20, while the first transaction in thread one is still running. This second transaction does an update on A: it follows the version chain, finds the head, appends a new version after it, and updates the version pointer
to point to our new tuple, then flips the end timestamp on the first version. Now this second transaction wants to do a delete on A. For this it doesn't matter whether we're using the delete flag or the tombstone tuple; somehow we mark the tuple as deleted so other transactions will be able to see that change. The second transaction then wants to commit: it gets commit timestamp 25, we go back and update the timestamps in the old version and in the new version we ended up deleting, setting them to 25, and then we commit and we're done.

Now the transaction at timestamp 10 in thread one is still running, but a third thread comes along and starts another transaction at timestamp 30, and this one wants to do an insert on A, the exact same key we were using before. At this point in time it sees the modification made by the second transaction in thread two: since that transaction committed before our third transaction started, the third transaction sees all of its changes. That means it doesn't see A anymore, because the second transaction deleted it and committed, so it's allowed to do an insert on A. But now we have to start a new version chain, because we can't connect it to the old one; that one has been deleted, and we said that once there's a delete marker in a version chain, no other version can come after it. So this is treated as a completely separate logical tuple with its own separate version chain: we start its version ID at A1 again, and in our index we now need a pointer to the new version chain. But because our first transaction in thread one hasn't committed yet, we can't get rid of the old version-chain pointer in the index. We have to support both, because if thread one comes along and does another read on A, we want to make sure it sees that first version A1 that it saw the first time it ran the query. It can't see the second one, because that's from a transaction that's still modifying the database and hasn't committed yet, and it would violate snapshot isolation if it could see it. This is the tricky thing we have to handle that you don't have to handle in a single-version system: there, if something gets deleted, you just delete it from the index and don't worry about somebody else still being able to read it. You may have to do some extra work under serializable isolation, but at a high level you can remove it from the index right away.

So what does this mean? It means the underlying data structure for every index in an MVCC system has to support storing non-unique keys. We covered how to do this in the introduction class, whether you use a linked list or duplicate keys in the arrays; it doesn't matter how you do it, you just need every index to support non-unique keys. Of course, since some indexes are logically unique, like the primary key or an index you declare unique when you create it, we need to do some extra work in the execution engine to make sure those keys really are unique. It would be bad if, for a primary key, because we have to support multiple version-chain pointers in the index, we could end up inserting the same key multiple times; we have to make sure that doesn't happen even though underneath the covers the index allows duplicates.
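Here's a rough sketch of what the lookup-and-filter path can look like when a single key maps to multiple version chains. The container, struct names, and visibility rule are illustrative assumptions (a plain multimap standing in for whatever index structure the system actually uses), not a real system's API.

```cpp
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

constexpr uint64_t kInfinity = std::numeric_limits<uint64_t>::max();

struct Version {
  uint64_t begin_ts = 0;
  uint64_t end_ts   = kInfinity;
  bool     deleted  = false;
  Version* next     = nullptr;   // older versions in the chain
};

// Non-unique even for "unique" indexes: duplicates are allowed underneath,
// and uniqueness is enforced logically by the execution engine.
using SecondaryIndex = std::unordered_multimap<std::string, Version*>;

// Fetch every chain head registered for `key`, walk each chain, and keep the
// versions this transaction's snapshot is allowed to see. For a logically
// unique index at most one should survive, so we can stop at the first match.
std::vector<Version*> Lookup(const SecondaryIndex& index, const std::string& key,
                             uint64_t read_ts, bool logically_unique) {
  std::vector<Version*> visible;
  auto [lo, hi] = index.equal_range(key);
  for (auto it = lo; it != hi; ++it) {
    for (Version* v = it->second; v != nullptr; v = v->next) {
      if (v->begin_ts <= read_ts && read_ts < v->end_ts) {
        if (!v->deleted) visible.push_back(v);  // deleted => not in this snapshot
        break;                                  // only one version per chain matches
      }
    }
    if (logically_unique && !visible.empty()) break;
  }
  return visible;
}
```

In the example above, thread one's second read on A would get two chain heads back from the index and end up with exactly one visible version, the original A1.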
The way you enforce that uniqueness, and you'll see this later in our own system, is that you basically do a conditional insert: you check whether the key already exists, and if not, you're allowed to insert it; otherwise you throw an error. And you have to do this all atomically. Whether you do it in a latch-free way or with a latch on the node you're inserting into doesn't matter; it just has to be guaranteed to be atomic.

So on the implementation side, we have to write extra code in the execution engine for when we use our indexes, to handle the case where our workers may get back multiple version-chain pointers for a single fetch on a key, and then follow those version chains to figure out what's actually visible to us. What I mean is, going back to the example: when thread one did that second read on A, it gets back two results for that single key A, the one it saw the first time plus the new one thread three just installed, and it has to figure out which tuple version should actually be visible to it. It can apply some additional logic: if I know it's a unique index, then I should only see one version chain, so as soon as I find the one that matches me I know I'm done; if it's non-unique, then I may have to look at all of them.

All right, so now we know how to handle deletes in MVCC, and we know how to handle indexes. So let's actually talk about the paper you read from the SAP HANA team on how to do garbage collection. In this paper they focus on two main design decisions: the granularity of garbage collection and the comparison unit. I also want to talk a bit more about how you do index cleanup, and go into more detail on the version tracking we touched on when we read the earlier paper, now put into the context of the SAP HANA paper. Like I said, the reason I picked this paper is that, for one, it's pretty much the only research paper out there in SIGMOD or VLDB that covers in-memory multi-version garbage collection, and I also like how they framed the problem they were solving around the real-world issues they were seeing with customer applications, which was nice.

All right, so the first issue we have to deal with is how we want to clean up our indexes. We just talked about how we need to handle multiple version chains for a single key in our indexes under MVCC, but now the question is how we go back and remove keys
from our index once we know the versions that correspond to those keys are no longer visible to any active transaction. The basic way you do this is that, just like we record the read/write set for transactions in internal metadata to figure out what they actually did when they ran and then do some kind of validation later when they try to commit, we want to maintain an internal log of how every transaction modified indexes. You basically keep track of all the indexes I modified, either adding a key or deleting a key, as my transaction ran, and then depending on whether the transaction commits or aborts, you can go back and reverse those changes. Where things get tricky is that you essentially need to know the timestamp of the transaction when it made these changes, so you can go look in the catalog and figure out what indexes were actually visible to me at the moment I applied them. We'll talk about transactional catalogs in a few more lectures; the basic idea is that in the same way you have snapshot isolation for the actual tuples of a table, you have snapshot isolation for what tables and what indexes exist as well, as long as everything runs under the same MVCC protocol.

So I do want to talk about a mistake we made in our Peloton system. This is just a side comment about what we got wrong in our first system we were building here at CMU, and sort of why we had to throw it all away and start over again; this one in particular bothered me a lot. We had this issue where we were trying to be smart about how we stored multiple updates to the same physical version within the same transaction. We were trying to be clever and say: rather than creating a new version every time, if a transaction updates the same tuple multiple times, we just overwrite the old version with our new information within our transaction.

So say we have a transaction that starts at timestamp 10 and does an update on A, and say it sets the key to 222 for this tuple. This is the first time we've updated this tuple within this transaction, so we append a new version and add a new entry in the index to point to our new head of the version chain, and things are fine. But now this transaction updates the same tuple again and sets the key to 333. With everything we've talked about so far we'd say: create a new version, just like we did for A2, and make a new physical version in our table space. Instead, what we did was go back, find the version we created, and just overwrite it. So A2 no longer ever existed; we only have the new version A3. Same thing again for key 444: we go back and update the version we created. So what's wrong with this?
Well, what would happen is that if our transaction aborted and we had to go back and roll back all our changes, and we weren't tracking the updates we made to the index for all those intermediate versions we created, key 222 and key 333, then we had no way of knowing what the hell we actually put in the index, so we couldn't go back and remove those entries. The way we did it, we'd just look and say: all right, I need to roll back this transaction, what versions did I create? Oh, I see, in this case I created A4; I see I set the key in our index to 444; let me go delete 444 from my index. But it doesn't know about the other two keys I added. So the transaction would abort, we'd think we had garbage collected everything it created, but there would be a bunch of keys left in the index pointing to nothing, because we didn't know to go back and remove them.

So the two mistakes here were: one, we were overwriting the same entry multiple times. We did this because then you didn't have to go malloc new space or get a new slot to put the new version in. It's not that common, so it wasn't that big of a win for us; I forget exactly why we did it in the first place. It did reduce contention on getting a slot, but it prevented you from doing things like knowing exactly what the transaction actually did. It also made it impossible to do savepoints, because you couldn't roll back to an earlier savepoint when you'd been overwriting it every single time. The other mistake was that we weren't recording, at a per-index level, which keys we actually ended up modifying (a sketch of the kind of bookkeeping we should have kept follows below).

We did this overwriting back when we were using append-only storage, where every time you create a new version, even if you only update one attribute, you end up copying the whole old tuple into a new slot all over again; we were doing it to reduce that memory copying. Now that we've switched to delta storage, which I think is the better way to go, this really isn't an issue anymore, because creating a new delta record is super cheap: it's just the attributes you modified. So for append-only this might be a decent optimization, but I don't think it's worth it; it created more problems than it actually solved, and that's why we had to get rid of that code.
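Here's a minimal sketch of the per-transaction index write log that avoids the mistake described above. The types, names, and the stub index calls are hypothetical stand-ins, not Peloton's or anyone else's actual interface.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct IndexWriteEntry {
  enum class Op { kInsert, kDelete };
  uint32_t    index_id;   // catalog identifier of the index that was touched
  std::string key;        // encoded key that was inserted or removed
  Op          op;
};

struct Transaction {
  uint64_t begin_ts = 0;
  std::vector<IndexWriteEntry> index_log;   // appended as the txn runs
};

// Stand-ins for whatever the real index API looks like (assumptions).
void RemoveKey(uint32_t /*index_id*/, const std::string& /*key*/) {}
void ReinsertKey(uint32_t /*index_id*/, const std::string& /*key*/) {}

// The execution engine calls this every time it modifies an index on behalf of
// this transaction, even if it is the Nth update to the same tuple.
void RecordIndexWrite(Transaction& txn, uint32_t index_id,
                      std::string key, IndexWriteEntry::Op op) {
  txn.index_log.push_back({index_id, std::move(key), op});
}

// On abort, walk the log backwards and undo each change, so no dangling keys
// are left behind in any index.
void UndoIndexWrites(Transaction& txn) {
  for (auto it = txn.index_log.rbegin(); it != txn.index_log.rend(); ++it) {
    if (it->op == IndexWriteEntry::Op::kInsert) {
      RemoveKey(it->index_id, it->key);      // take out the key we added
    } else {
      ReinsertKey(it->index_id, it->key);    // put back the key we removed
    }
  }
}
```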
All right, so now we want to talk about how we actually keep track of our versions for garbage collection. We already talked about this; it was the main thing we covered about garbage collection in the earlier paper. There are two approaches: tuple level and transaction level.

With tuple-level garbage collection we don't store any external metadata about which versions transactions created or invalidated while they were running. We just have a mechanism where, as we scan the version chains of tuples, if we find versions that are not visible to any active transaction, we do the garbage collection right then and there. This could be a separate background process, like the vacuum you have in Postgres, or it could be cooperative cleaning, where threads traversing the version chain looking for the right version while executing queries clean up any expired versions they come across, right then and there.

The other approach is transaction-level garbage collection, which is what the SAP HANA folks are doing. Here we leverage the metadata that transactions record about what they did while they were running to figure out where the old versions are that we need to remove. I didn't talk about this in the first lecture on MVCC, so we're going to focus on this second approach, because it ties into what the HANA paper is talking about. The basic idea is that as our transactions run, in the same way we have to record the read/write set for validation, we piggyback on that same kind of information to figure out what needs garbage collection later.

So we have a simple transaction that starts at timestamp 10. The first thing it wants to do is an update on A: it finds the head of the version chain, reads the version it's supposed to see, creates a new version, and updates whatever pointers now need to point to the new version it just created. But we now know that we've invalidated, in this case, version A2, so internally we track that in our transaction: this is the old version I invalidated. All this really is is a 64-bit pointer to the tuple; that's the only thing we need to store, because we know what our timestamp is, so we know that version has to precede us. If we later need to figure out the begin timestamp for that version, we can follow the pointer and look, but at this point we don't need it. Then the transaction does its next operation, an update on B, and it's the same thing: it finds the latest version, creates the new one, updates whatever pointers it needs to, and adds an entry to our old-version metadata for the transaction. Then we're done.

Now our transaction goes ahead and commits, and after we flip the pointers and the timestamps in our tuple versions, we pass our old-version set along to the garbage collector, the background vacuum thread, or whatever it is that's doing the garbage collection. We keep track of the fact that these versions are only visible to transactions that came before timestamp 15, because that was our commit timestamp. So the garbage collector just keeps that additional metadata: here's a bunch of versions you can garbage
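A minimal sketch of that hand-off could look like the following. The class and function names are assumptions for illustration; a real collector would obviously also worry about multi-threading and allocation details.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

struct Version;   // whatever a physical version looks like in the table

// Everything one committed transaction invalidated, plus its commit timestamp.
struct ExpiredBatch {
  uint64_t commit_ts = 0;             // high watermark for this batch
  std::vector<Version*> old_versions; // 64-bit pointers to the superseded versions
};

class GarbageCollector {
 public:
  // Called by the commit path after the version timestamps have been flipped.
  void Submit(ExpiredBatch batch) { pending_.push_back(std::move(batch)); }

  // Called periodically. `min_active_begin_ts` is the smallest begin timestamp
  // among transactions that are still running.
  void Collect(uint64_t min_active_begin_ts) {
    // A batch committed at ts=15 is only visible to transactions that began
    // before 15, so it is safe to free once every active txn began at >= 15.
    while (!pending_.empty() &&
           pending_.front().commit_ts <= min_active_begin_ts) {
      for (Version* v : pending_.front().old_versions) {
        Free(v);
      }
      pending_.pop_front();
    }
  }

 private:
  static void Free(Version* /*v*/) { /* return the slot to the table's pool */ }
  // Batches arrive in commit order, so a FIFO is enough here.
  std::deque<ExpiredBatch> pending_;
};
```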
collect once you know there's no active transaction with a timestamp less than 15. So that's really all transaction-level garbage collection is: an internal data structure where, on a per-transaction basis, we keep track of the old versions you've invalidated, and then at some later point somebody else comes along and cleans them up.

Now what the HANA folks are discussing is how you actually organize these expired versions, this metadata you're passing along after a transaction commits, and how you determine whether there are ranges within the timestamps of old versions that you could actually garbage collect.

The first design decision is the granularity. The basic idea is how the database management system internally organizes all the expired versions that committed transactions have invalidated, so that it can go check whether they're reclaimable. The trade-off is whether we want a relatively expensive operation where we check every single individual version we've invalidated to see whether it has become reclaimable, or whether we want to group them together with some kind of high watermark that says: within this group, here's the maximum timestamp that has to be no longer visible before you can free the entire group. So there's this trade-off: do you want expensive checks that are more fine-grained and let you release versions more quickly than would otherwise be possible, or do you want to group them together so that when the group's ready to go, you clean them all up at once?

So again, the two variants: with the single-version one, for every expired version you individually keep track of when it becomes reclaimable. It costs you more to check every single version, but you can free them up more quickly. With the group-version one, you organize them into groups and say: when the group's ready to go, they all go. In HANA's case, the way they organize their versions into groups is by commit group ID. They're already grouping versions by transaction, and when transactions commit they get put together into a group-commit batch that gets flushed out to disk. So they take the batch of transactions that got flushed to disk in the same write; those all share the same high-watermark timestamp, and when that timestamp is no longer visible to any active transaction, you can go ahead and reclaim the memory for all of those versions. So there are trade-offs between each of these, and in their case they actually use a combination of both.
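To make the granularity trade-off concrete, here's a small sketch contrasting the two checks. The structures are hypothetical; the point is just that the fine-grained path touches every version while the coarse-grained path makes one comparison per group.

```cpp
#include <cstdint>
#include <vector>

struct ExpiredVersion {
  void*    slot;     // where the physical version lives
  uint64_t end_ts;   // when it was invalidated, i.e. its "reclaimable after" mark
};

// Fine-grained: examine every expired version on its own, so each one is freed
// the instant it falls out of sight, at the cost of touching each one.
void CollectPerVersion(std::vector<ExpiredVersion>& versions,
                       uint64_t min_active_begin_ts) {
  std::vector<ExpiredVersion> still_pending;
  for (const ExpiredVersion& v : versions) {
    if (v.end_ts <= min_active_begin_ts) {
      // free v.slot here: nobody can ever see this version again
    } else {
      still_pending.push_back(v);     // somebody might still need it
    }
  }
  versions.swap(still_pending);
}

// Coarse-grained: many versions (e.g. everything invalidated by one group-commit
// batch) share a single high watermark, so one comparison decides all of them.
struct ExpiredGroup {
  uint64_t high_watermark;               // max end_ts of anything in the group
  std::vector<ExpiredVersion> members;
};

bool GroupIsReclaimable(const ExpiredGroup& g, uint64_t min_active_begin_ts) {
  return g.high_watermark <= min_active_begin_ts;   // free every member if true
}
```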
The third approach they describe was actually pretty interesting, and I have not seen anybody else do this: the ability to do a sort of table-level garbage collection for versions. The idea is that so far, when I do an update, it doesn't matter what table my transaction touched: I just know there's an old version and the timestamp after which I can reclaim it. But if you also keep track of which table that version belongs to, then you can start reasoning about which active transactions exist in the system at this point in time. If you know those active transactions will never access a particular table, and you have a bunch of old versions for that table, then since no transaction can read anything in that table, you can go ahead and free and reclaim all the memory for those old versions.

This is a corner case, and they talk about how it can only be used when you actually know whether a transaction is going to be able to read a table or not. You might ask: how can you know this? Well, if it's a stored procedure, then you know all the queries ahead of time. You may not know exactly which queries will execute, because there may be conditional branches: if some value, execute this query, else execute that other one. But within the stored procedure, unless it's doing dynamic construction of SQL queries, which we'll talk about later, you have a rough idea what tables it will access. So if you know your table is not going to be accessed within this stored procedure, you can go ahead and garbage collect those versions. Prepared statements are essentially the same: if you're running single-statement transactions and you know the prepared statement, you know the queries ahead of time, so if you know this transaction is executing this prepared statement, you already have that information.

I wouldn't say this is the most common setup for applications; we can look at numbers later, and I do have results from a survey I did two years ago on this. Most people actually don't run with stored procedures. Now at SAP, HANA's biggest customer is probably SAP's own CRM or their enterprise software, so I'm pretty sure a lot of that runs with stored procedures and they already know what all the queries look like. In that case they have better control, more metadata and more information about what their transactions are doing, than your run-of-the-mill applications, and that's probably why they put emphasis on this. So this is a nice-to-have feature, but unless you're running stored procedures it doesn't actually help, because you can't determine anything.
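A tiny sketch of that check, under the assumption that each active transaction advertises the set of tables it could possibly touch (names are made up for illustration):

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

struct ActiveTxnInfo {
  uint64_t begin_ts = 0;
  bool tables_known = false;                // false for ad-hoc SQL
  std::unordered_set<uint32_t> table_ids;   // tables the txn could possibly touch
};

// True if no active transaction can ever read `table_id`, in which case that
// table's expired versions can be reclaimed right away, regardless of how
// recent they are.
bool TableIsCollectible(uint32_t table_id,
                        const std::vector<ActiveTxnInfo>& active) {
  for (const ActiveTxnInfo& txn : active) {
    if (!txn.tables_known) return false;            // have to assume the worst
    if (txn.table_ids.count(table_id) > 0) return false;
  }
  return true;
}
```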
All right, the last design decision is the comparison unit, and the basic idea here is: how do I look at my versions and determine whether they're reclaimable or not, regardless of whether they're tracked in groups or as single versions? How do I examine the time ranges of my active transactions and the time ranges of my versions and figure out which ones are reclaimable?

One thing I'll say about the implementation is that everything needs to be latch-free, because we want this to be as efficient as possible, and we don't want the mechanism for figuring out which versions we can reclaim to block any actual transactions that are running. What I mean is that we don't want to take a latch on the list of active transactions when the garbage collector runs, because that would stall new transactions from starting up or committing. That means that when we go ask what the timestamps of the active transactions are, the answer may actually be inaccurate; it's a race condition, and we may read that data structure and miss somebody that just started or just finished. But it doesn't matter: if we miss a transaction the first time through, we'll come back and check again and pick it up the next time. So it's okay if this computation is inaccurate in the sense that we get false negatives, things we could have cleaned up but didn't. We obviously don't want any false positives; we don't want to reclaim things that shouldn't be reclaimed. Doing this in a latch-free manner just avoids blocking transactions while they're running.

So the two approaches for doing the comparison are the traditional timestamp approach, the minimum global timestamp we've talked about so far, and the interval ranges that the HANA folks introduce, which solve the problem they were dealing with of really long-running queries. The traditional approach is that you just keep track of the minimum global timestamp that all versions have to be older than in order to be reclaimable, and this ensures there's no thread out there that could follow a version chain and land on a pointer that points to nothing. It's safe, easy to implement, and cheap to execute, but you may end up not reclaiming things that truly are no longer visible to any active transaction.

That's what the interval approach tries to solve. The idea is that if you can identify the ranges within your timestamp domain that you know aren't visible to anyone, then you can pull those versions out and leave maybe even older ones still around, because no active transaction can see the ones you pulled out. This is obviously more difficult; you have to identify those ranges. The paper describes a sort of merge-based algorithm to figure this out, and it seems reasonable; I think this is the right way to do it. We don't do this in our system, but I'd say this might actually be an interesting final project for the course, adding support for it, so we can talk about that later.
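Here's a rough sketch of the core idea behind the interval check, simplified to a per-version test against a sorted list of active begin timestamps rather than the paper's actual merge-based algorithm; the example that follows walks through the same situation.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct ExpiredVersion {
  uint64_t begin_ts;   // e.g. 25 in the example coming up
  uint64_t end_ts;     // e.g. 35
};

// `active_begin_ts` holds the begin timestamps of running transactions, sorted
// ascending. The list may be slightly stale: a transaction that starts after it
// was captured gets a begin timestamp newer than any end_ts we test here, and
// missing one that just finished only delays reclamation one round (a false
// negative), which is the safe direction.
bool IsReclaimable(const ExpiredVersion& v,
                   const std::vector<uint64_t>& active_begin_ts) {
  // The version is needed only by snapshots whose timestamp lies in
  // [begin_ts, end_ts). Find the first active timestamp >= begin_ts; if it is
  // also < end_ts, somebody can still see this version.
  auto it = std::lower_bound(active_begin_ts.begin(), active_begin_ts.end(),
                             v.begin_ts);
  return it == active_begin_ts.end() || *it >= v.end_ts;
}
```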
All right, let's look at an example. We have one table with a single tuple A. One transaction starts first in thread one: it begins at timestamp 10, does a read on A, follows the index to get to the version, and just reads that one tuple. While that transaction is running, thread two starts a transaction: it gets timestamp 20 and does an update on A, so we end up with a new version. That transaction then commits, and we update the timestamps with its commit timestamp (25), and we're done. Now thread three comes along, does an update on A, we get another new version, and then it commits and updates everything with its commit timestamp (35).

At this point the transaction in thread one is still running, but it's running at timestamp 10, so it won't see anything that occurred after that; it doesn't see the changes from thread two or thread three. Any new transaction that comes along after thread three commits can see version A3. But that means the middle version, A2, cannot be seen by anybody, because there's no active transaction in that timestamp range.

With the timestamp approach, our garbage collector cannot reclaim A2, because our lowest active transaction timestamp is 10 and the end timestamp for A2 is greater than that. We just say: we don't know what thread one is actually reading, we just know this is our low watermark and we can't pass it, so we have to keep A2 around. But with the interval comparison unit, we can identify that A2 is not visible to any active transaction, because no active transaction's timestamp falls within the interval of 25 to 35. Nobody can see this version, so we can go ahead and reclaim it. Computing that interval and excising those versions is slightly more expensive than the all-or-nothing timestamp approach, but I actually think the computational overhead of the interval approach is worth it over the global minimum timestamp. That's why I think this is a good idea from the HANA folks, and something we may want to pursue later on.

Okay, so now I want to talk a little bit about how to actually free up memory. We've talked about how transactions delete tuples in MVCC, and we've talked about how to remove the keys of deleted tuples from indexes. That's fine, but what do we actually do with the memory of the versions we just garbage collected and removed? We have this fixed-length data pool with slots where we can add new tuples; if we delete a tuple and garbage collect it, we now have the slot that used to be occupied by that old physical tuple, and it's empty. What are we going to do with it? For the variable-length data pool, you just always reuse that space; that's a no-brainer. For the fixed-length data pools, you may or may not want to reuse the slots, and we'll explain why in a second. The other thing to think about is: instead of deleting just one tuple, what if my transaction comes along and deletes a whole bunch of tuples? Now, within my table space, my fixed-length data pool for the table, I have a lot of holes. What should I actually do with them?
So let's talk about whether we want to reuse slots or not. In the first case, you allow the workers in the database system to insert new tuples into the slots where deleted tuples used to exist. For append-only storage this is a no-brainer, because you're always adding new versions to the table space anyway, so you just find whatever space was freed and put them there; you don't care about locality at all, you just put the new version wherever an old one was. The downside is that you destroy any temporal locality of the tuples in your table: within a single block you may have tuples that are really old and really new, all mixed together. For append-only storage that's maybe not a big deal, because versions are scattered all over the space anyway. For delta storage it can matter a lot. Within the table space itself, for the fixed-length data, in the case of HyPer-style storage we have these columns, and we don't have versions mixed in with the columns; all the delta records for our versions are stored in local memory pools tied to threads, and they get garbage collected later on. But within our columns we can again end up with a mix of old and new. When we talk about compression and other things, that's going to be problematic, because in most OLTP applications, the more recently a tuple was added to the database or updated, the more likely it is to be updated again, and as things age they're less likely to be updated. So within one block you could have some tuples that are being updated often and some tuples that are never updated at all.

The way to avoid that problem is to just not reuse any slots at all: as soon as I delete a tuple, I never insert a new tuple into its slot; it's essentially marked as off-limits. What does this give us? It solves the temporal locality problem: the logical set of tuples in a block were all added to the database at roughly the same time, even as some of them get deleted. The problem, obviously, is that now we have a bunch of holes in our blocks from slots that are no longer being used, and that's wasted memory, so we need to do something to reclaim that space.

This is where block compaction comes in; sometimes it's called defragmentation, same idea. The goal is to identify blocks that are less than 100% full and consolidate them into blocks that are 100% full, and then for any blocks we're no longer using, we can return that memory back to the operating system. The way to think about this is: if I insert a million tuples,
I should see my memory usage spike up, and if I then delete those million tuples, I should ideally see it go back down to where it was before. In most cases that's not going to happen, because the database system isn't going to give back all the memory it allocated, but I should at least see it go down somewhat. If I just delete a million tuples and keep all that memory around, I may think something's wrong with my system.

The way we're actually going to implement consolidation, and this is the beauty of transactions and of MVCC for concurrency control, is that we obviously want to do it in a transaction-safe manner. As we do the consolidation or compaction, we don't want any false negatives, where tuples aren't visible to transactions during the brief window where we're copying them, and we also don't want tuples to appear twice to any running transaction. So the way we do consolidation is that we just do a delete for every single tuple in a block and then reinsert them into a new block, all in the context of an internal transaction that falls under the same snapshot isolation guarantees as every other transaction. We piggyback on all of that for free and get all the correct ACID mechanisms a normal transaction would have. (A rough sketch of this mechanism appears at the end of this discussion.) The goal, again, is to maximize the utilization of memory within blocks, and if we're clever about it we can try to put tuples together within a block that are related in some way, based on access patterns or other aspects of their existence, so that when we do other things on those blocks of cold data, we have tuples that are similar to each other. Think about compression: if I want to compress a whole block, I don't want any tuple in that block to get updated, because then I may have to redo the compression all over again. So I want tuples that are unlikely to be updated to end up together in the same block.

So the main things to talk about with compaction are how you identify what to compact, what policy you use to decide whether to do compaction, and which tuples you should try to put together. It's almost like a bin-packing problem: you try to figure out the minimum number of blocks you need to hold all your data. Obviously, if I have one block with only one tuple and another block with only one tuple, I may want to put those two together into a new block; in general I want to find as many tuples as I can to pack into a single block and reduce my memory footprint.

So how do we identify which tuples to put together in a single block? Here are three basic ways to do it. The first, the time since the tuple was last updated, is probably the most common. As I said, in OLTP workloads the likelihood that a tuple will be updated is directly tied to the last time it was updated. Think of Reddit: nobody's commenting on articles from three months ago, you're commenting on the articles that were added today.
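Here's the promised sketch of the delete-and-reinsert mechanism for compaction, running inside an ordinary internal transaction. Everything here (block layout, policy threshold, the storage and transaction calls) is a hypothetical stand-in, not a real system's API.

```cpp
#include <cstddef>
#include <vector>

struct Block {
  std::size_t capacity = 0;
  std::size_t live     = 0;    // slots still holding a visible tuple
  // ... tuple storage elided ...
};

struct TupleSlot {
  Block*      block  = nullptr;
  std::size_t offset = 0;
};

// Stand-ins for the storage / transaction layer (assumptions, not a real API).
struct Txn {};
Txn  BeginInternalTxn() { return {}; }
void Commit(Txn&) {}
std::vector<TupleSlot> LiveTuples(const Block&) { return {}; }
void DeleteTuple(Txn&, const TupleSlot&) {}
void InsertInto(Txn&, Block&, const TupleSlot&) {}

// Rough policy: any block under the fill threshold is a compaction candidate.
bool NeedsCompaction(const Block& b, double threshold = 0.5) {
  return b.capacity > 0 &&
         static_cast<double>(b.live) / static_cast<double>(b.capacity) < threshold;
}

// Move the survivors of one sparse block into `target`. Because this runs as an
// ordinary transaction under snapshot isolation, readers never see a tuple go
// missing or show up twice, and an abort rolls everything back cleanly.
void CompactBlock(Block& sparse, Block& target) {
  Txn txn = BeginInternalTxn();
  for (const TupleSlot& slot : LiveTuples(sparse)) {
    DeleteTuple(txn, slot);          // expire the old physical version
    InsertInto(txn, target, slot);   // re-insert the same logical tuple
  }
  Commit(txn);
  // Once the garbage collector reclaims the expired versions, `sparse` is empty
  // and its memory can be handed back to the operating system.
}
```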
So how do we identify which tuples to put together in a single block? These are three basic ways to do it. The first one, the time since the tuple was last updated, is probably the most common way. Again, as I said, in OLTP workloads the likelihood that a tuple will be updated is directly tied to the last time it was updated. Think of something like Reddit: nobody's commenting on articles posted three months ago; you're commenting on the articles that were added today. For this one, the nice thing about MVCC is that we can just leverage, or reuse, the same begin timestamp we're already using to track the visibility of tuples. We can use it to figure out when a tuple was last updated and group tuples together based on that.
The second way is to group tuples together based on the last time they were accessed. The idea here is that tuples that are read together within the context of transactions may be worth organizing into a single block, because then you've reduced the number of fetches out to memory to get the data for a particular transaction. This one is a bit trickier to do because it requires you to keep track of how tuples are being accessed. Unless you're doing basic timestamp ordering concurrency control, where you already have a read timestamp embedded in the tuple, you have to extend the metadata for each tuple to keep track of this as well, which can be expensive. We'll cover this later on when we talk about anti-caching and evicting things out of memory to disk, but in general this is harder to do unless you're already tracking accesses.
The last one is a bit more complicated to understand, and as far as I know nobody does this, but I know people want to: you can try to exploit some aspect of how the application uses its data so that you can put tuples together in the same block, again for compression or for writing out to disk. The way to think about this is, say I know that within a single table there's a foreign key relationship between two tuples, and therefore those tuples are going to be used together often in transactions; one might be read and one might be updated. They're not going to have the same last-updated or last-accessed timestamps, but I know they're linked together through this foreign key, so I want to put those guys into a single block. This is really difficult to do automatically, and it's not clear how useful it is within a single table. If you want to start doing physical denormalization, where you're packing foreign key references from other tables into the same block, that actually makes a bigger difference. But beyond foreign keys it's hard to figure this out automatically, so as far as I know nobody actually does this.
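To show what the first heuristic boils down to, here is a minimal sketch that sorts tuples by the begin timestamp of their visible version and fills blocks in that order, so tuples of a similar "age" land together. The `TupleRef` struct and the block size parameter are invented for illustration; a real system would shuffle slots between blocks rather than copy values around like this.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct TupleRef {
  std::uint64_t begin_ts;  // commit timestamp of the version we can see
  std::size_t slot_id;     // where the tuple currently lives
};

// Assign tuples to new blocks so that each block holds tuples whose current
// versions were created around the same time: hot (recently updated) tuples
// cluster together and cold tuples cluster together, which is what we want
// before compressing or evicting the cold blocks.
std::vector<std::vector<TupleRef>> GroupByLastUpdate(
    std::vector<TupleRef> tuples, std::size_t tuples_per_block) {
  std::sort(tuples.begin(), tuples.end(),
            [](const TupleRef& a, const TupleRef& b) {
              return a.begin_ts < b.begin_ts;  // oldest first
            });
  std::vector<std::vector<TupleRef>> blocks;
  for (std::size_t i = 0; i < tuples.size(); i += tuples_per_block) {
    std::size_t end = std::min(i + tuples_per_block, tuples.size());
    blocks.emplace_back(tuples.begin() + i, tuples.begin() + end);
  }
  return blocks;
}
```

The access-time policy would look identical except that the sort key would be a per-tuple read timestamp, which is exactly the extra metadata you would have to start tracking.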
All right, the last thing I'm going to talk about is a special-case scenario for compaction: how to handle truncates. The TRUNCATE SQL command is basically a DELETE without a WHERE clause: delete all the tuples in a table. If you did this the way we've talked about so far with garbage collection and compaction, then for every delete you'd keep track of every individual version you've expired, hand that off to the garbage collector to figure out whether each version is still visible, and then go clean it up. Doing it that way on a table with a billion tuples means a billion versions to keep track of, and whether you group them or not, it's still going to be expensive. The way you actually do this is super easy: you just execute the truncate as a DROP TABLE followed by a CREATE TABLE. The drop basically invalidates all the versions within that table, and the create produces the new empty table. You obviously need to do this for all the indexes as well: you drop the indexes and add them all back, and now the indexes are empty.
We will discuss this more when we talk about catalogs, but again, the beauty of having all your catalogs be transactional, meaning the catalogs, the metadata about your database, are stored in the database itself, is that they get snapshot isolation for free. If everything is transactional, then doing this drop and create atomically in a transaction is super easy. We're building this now in our own system. For Postgres, I think their catalogs are pretty close to being transactional, with some corner cases where they're not. MySQL version 8 has entirely transactional catalogs; version 5.7 did not. And most of the major commercial vendors all do this correctly. But yeah, this is a nice little trick, a nice advantage of having transactional catalogs: you can do all of this very easily without doing any garbage collection or compaction at all.
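Here is a hedged sketch of that trick, TRUNCATE rewritten as a drop and a create inside one transaction. The `Catalog` interface is invented purely for illustration and is not the actual API of Postgres, MySQL, or our system; the point is that the whole operation is a handful of catalog changes covered by the same snapshot-isolation machinery, rather than a billion per-tuple deletes.

```cpp
#include <string>
#include <vector>

class Transaction;

// Hypothetical transactional catalog interface.
class Catalog {
 public:
  struct TableSchema { /* column definitions, etc. */ };
  TableSchema GetSchema(Transaction* txn, const std::string& table);
  std::vector<std::string> GetIndexes(Transaction* txn, const std::string& table);
  bool DropTable(Transaction* txn, const std::string& table);   // drops its indexes too
  bool CreateTable(Transaction* txn, const std::string& table,
                   const TableSchema& schema);
  bool CreateIndex(Transaction* txn, const std::string& table,
                   const std::string& index_name);
};

class TransactionManager {
 public:
  Transaction* Begin();
  bool Commit(Transaction* txn);
  void Abort(Transaction* txn);
};

// TRUNCATE implemented as DROP TABLE + CREATE TABLE in a single transaction.
// Readers that started before this commits keep seeing the old table's
// versions; after the commit the table and its indexes are simply empty,
// and the old storage can be reclaimed wholesale instead of tuple by tuple.
bool Truncate(TransactionManager* txn_mgr, Catalog* catalog,
              const std::string& table) {
  Transaction* txn = txn_mgr->Begin();
  Catalog::TableSchema schema = catalog->GetSchema(txn, table);
  std::vector<std::string> indexes = catalog->GetIndexes(txn, table);
  if (!catalog->DropTable(txn, table)) {
    txn_mgr->Abort(txn);
    return false;
  }
  catalog->CreateTable(txn, table, schema);
  for (const std::string& index_name : indexes) {
    catalog->CreateIndex(txn, table, index_name);
  }
  return txn_mgr->Commit(txn);
}
```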
All right, so what's the main takeaway here? As I said before, and I'll say multiple times throughout the semester, so many of the things we're talking about are this classic trade-off of storage overhead versus computational overhead. We saw the case where, if I do garbage collection at the granularity of single versions, I may be able to free things up and free memory more quickly, but I may have to store some extra metadata on every single version of the tuple. If I do it in groups, then maybe I don't free up memory as quickly, but the overhead of computing which versions are visible, or which versions are reclaimable, is much lower. We'll see this with a bunch of other things.
The other thing I wanted to stress, and it's sort of why we talked about deletes and indexes in the context of MVCC, is that putting all this together, handling all the additional things you need in a real database system (indexes, materialized views, triggers, and so on), and getting it all to work together is not trivial. But I think having a core, transactionally correct and transactionally consistent data storage layer that you can then build more complicated things on top of, as we'll see as we go through the semester, makes your life easier. That's one of the advantages that transactions give you.
I can't prove this scientifically, but I would say it's my impression, from my various travels going into companies and talking with people who are running in-memory databases, that the memory footprint is the major issue people are dealing with: in terms of cost, because memory is not cheap, but also just in terms of the size of the database they actually want to store on the hardware they have. So they're willing to pay the additional computational overhead of better garbage collection, like the interval-based stuff and the more fine-grained garbage collection; they're willing to pay that penalty in exchange for reducing memory. This is why, in the case of the HANA paper, they talk about a hybrid approach where it's sort of all of the above. I think that's actually a really good idea. It might be very specific to the kind of workloads SAP is looking at and their main customers; for general-purpose applications I think the interval and single-version approach, with maybe some grouping, is the right way to go, but I think we should add intervals into our own system as well. Okay.
So with that, I want to very briefly introduce the first project, which we announced earlier today. The first project is an individual project that everyone does by themselves, and the idea is that you're going to introduce yourself to the code base of our new database system and learn how to do profiling in a highly concurrent environment. We're going to provide you with a certain branch of our code that has a known problem that we've identified in our system, and you're going to learn how to use perf, which is a profiling tool, to figure out where that bottleneck is, and then go about refactoring the code: move some data structures around, introduce some new latches in the right places, to alleviate the bottleneck and improve the scalability of the system. Again, it's an individual project because we want each of you to do this separately, so that when you get to the final project everyone will be able to contribute equally, because everyone will have worked on the system enough.
I've already talked about this before: Peloton is dead. We have a new system that we don't have a name for yet; we still have to figure that out. At a high level, it's an in-memory HTAP MVCC database management system, and here's a bunch of features that the system will have. A lot of this code is actually being ported over from Peloton and cleaned up. I would say that from this portion of the list down, you don't have to worry about any of it for the first project. You may be thinking, what the hell does all this actually mean? I'll be covering all of it over the entire semester, so by the end of the semester you'll know what each of these is.
The ultimate goal of this project, which I haven't really talked about so far, is that we're trying to build a new system from scratch in order to make it self-driving, autonomous. In the last lecture of the semester we'll talk about what a self-driving database actually is; the basic idea is that we want the system to be able to tune and optimize itself automatically, without any human intervention. We decided to build the system from the ground up because that's the best way for us to achieve this goal, since we have complete control over the entire architecture. Again, we'll talk more about this later.
I'm sort of showing you the list of all these different features because I remember when I was an undergrad: you'd start a new course, look at the back of the textbook, and think, how the hell am I ever going to learn any of this? I hope you guys have the same impression here. You look at all these buzzwords and mumbo jumbo I have listed and think, I don't know what any of this means, and you will know what all of it means by the end of the semester. So it's going to be really cool.
So I'm excited for that. As I said, the project write-up explains exactly what you need to do and how to go about profiling the system and figuring out where the issue is and what you need to fix. We're not going to tell you exactly what to fix; we'll tell you how to find the problem.
The source code you'll be downloading, the repository, is going to have a bunch of test cases, unit tests that you can run, but also a bunch of micro-benchmarks that exercise the parts of the system you'll be focused on for transaction management. For the profiling, you're really only going to want to run the concurrent-read micro-benchmark, because that's what hits the bottleneck you should be looking for. But there are all these other benchmarks you'll want to run as well, to make sure the changes you make don't break them or cause any unexpected failures. We're grading you on the concurrent-read micro-benchmark for the speed you get, but we are going to check correctness by essentially running all the other micro-benchmarks too. You don't want to make some change in the transaction management code that affects, say, write transactions that the concurrent-read micro-benchmark never hits, and then have all these other micro-benchmarks fail. That's how you'll check that you're not breaking something unexpected.
We strongly encourage you to do additional testing beyond what we give you. That essentially means taking the other micro-benchmarks we provide and maybe tweaking them, changing their access patterns or the number of threads they run with, so that you explore parts of the code base you might not have exercised otherwise. We have ways to do code coverage as well, to see whether the code you modified is actually being adequately exercised by these experiments.
For grading, the project write-up has all the information, but there are essentially two phases. First there's the correctness phase, where again we'll run all the tests to make sure your code actually produces the correct result. Then we're going to compare your implementation against the implementation written by Lin Ma, the TA, and see how fast yours is compared to his. The grade you get will be based on the relative performance difference between your implementation and his: if you get exactly what he gets, I think you get 90 percent, and in order to get 100 or higher you have to go faster than what he's done. So it's not just "I got it all to work correctly and I'm going to turn it in"; there are some other things, which we're not going to talk about, that you could go find and fix to speed things up. In class on Monday next week I'll teach you how to do additional profiling with perf and Callgrind and other tools, to go test and find additional ways to optimize the system. Again, an important aspect of this course is that it's not just about whether you're writing correct code: you're writing high-performance code. The obvious way to "fix" the problem is to just take one latch on the entire database and only let one transaction run at a time; that'll run correctly, but it's going to be slow, and we're grading you on performance as well.
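Just to give a flavor of the kind of change that distinction implies (this is emphatically not the project's actual bottleneck or its fix, only a generic sketch of replacing one coarse latch with finer-grained ones), compare a map protected by a single global mutex with the same map split into shards, each with its own latch:

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <mutex>
#include <unordered_map>

// Generic illustration: instead of serializing every operation behind one
// global latch, split the structure into shards so concurrent threads that
// touch different keys contend on different latches.
template <typename K, typename V, std::size_t NUM_SHARDS = 64>
class ShardedMap {
 public:
  void Put(const K& key, const V& value) {
    Shard& s = GetShard(key);
    std::lock_guard<std::mutex> guard(s.latch);
    s.map[key] = value;
  }

  bool Get(const K& key, V* out) {
    Shard& s = GetShard(key);
    std::lock_guard<std::mutex> guard(s.latch);
    auto it = s.map.find(key);
    if (it == s.map.end()) return false;
    *out = it->second;
    return true;
  }

 private:
  struct Shard {
    std::mutex latch;
    std::unordered_map<K, V> map;
  };

  Shard& GetShard(const K& key) {
    return shards_[std::hash<K>{}(key) % NUM_SHARDS];
  }

  std::array<Shard, NUM_SHARDS> shards_;
};
```

Under a read-mostly workload you could go further and use a `std::shared_mutex` per shard so readers don't block each other; the general point is that finding the right granularity of latching is exactly what the profiler is supposed to help you do.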
In addition to that, we're also going to run the Google sanitizer checks that are in the build pipeline, which just make sure you're not leaking memory or doing weird stuff with your allocations. And we're very strict about how we do formatting in the code; this is again to avoid some of the issues we had in the old system. We're running clang-format and clang-tidy, which is also part of why we use C++17, so this is going to be very strict style checking. You have to follow our formatting guidelines to make sure your code conforms to our standard; we essentially follow the Google style guide, or something very similar to it, and there's a documentation page that explains all of this in more detail. Basically, if you write code that doesn't follow our style guide, the build will fail and you get a zero, so you want to get that right.
All right, so our current database system only builds on Ubuntu 18.04 and OS X, I think the latest version. If you're running on Windows and don't want to switch over to Linux or whatever, you can run it in a VM. As we'll talk about on the next slide, you're going to do all your testing on Amazon anyway, where you'll want to use a Linux VM, but for your local development you can do it inside a VM. This is Carnegie Mellon University, so I'm assuming everyone here has access to a machine that you can do some development on; if for whatever reason you don't have one, contact me and let me know. I'm pretty sure it won't build on any of the Andrew machines, because we need the latest versions of clang and gcc and the Andrew machines haven't been updated, but again, you can do this inside a VM if you don't want to set up your local environment to run Linux.
The important thing I'll stress is that you can do all your development locally on your laptop, but you're not going to be able to identify the bottleneck we're asking you to look for unless you're running on a machine that has more than 20 cores. If your laptop has four cores, you're going to run perf and you're not going to see the bottleneck we want you to find, because there's not going to be enough concurrency, enough parallelism, hitting these contention points. That's why we're asking you to run on a machine with more cores, because most of us don't have machines with 20 cores.
We're giving each of you 50 dollars; I sent an email out this morning that everyone is getting 50 dollars for Amazon AWS. You basically go on EC2, instantiate a Linux instance of this c5.9xlarge, which I think is 36 cores, build the system, run the perf experiments we ask you to run, and collect some data about where the bottleneck is. Then go back to your local machine, fix it up, commit it, and try to run it on Amazon again. We're only giving you 50 dollars, and if you run this instance type, the c5.9xlarge, on demand, it's going to cost you $1.53 an hour. You're also going to have to
pay for EBS, which is some trivial amount, but it's always running as well because that's your storage. As much as possible, try to use a spot instance, because it's going to be a fraction of the price. Of course, they can take it away from you at any time, but it's not like you're going to be running this non-stop: you make a little change, run an experiment, and come back to it later, and if your spot instance gets taken away you just fire it back up. So use a spot instance as much as possible. Do not run out of money, because if you run out of money they're going to charge your credit card, and you can't come to me and ask for more, because everyone only gets 50 dollars and that should be enough for this project. It shouldn't take you hours and hours and hours on Amazon where you burn through all your money.
All right, so the deadline is February 27th at 11 p.m., as we say in the project write-up. If you miss this deadline, you lose 25 percent for every 24 hours that you're late. You're going to be submitting the source code and a final PDF report, with screenshots of your perf commands and output, on Gradescope. But when we actually run the source code you provide and do the grading in terms of benchmarking and correctness, we're going to run that on a different machine here at CMU, because again, Gradescope is only a single thread and you're never going to hit the contention bottlenecks there. So you'll submit your code to Gradescope, and we'll build it and make sure it actually builds and runs the basic experiments, but that's not going to be your final grade, because we can't check the things we want to check on a Gradescope VM. Think of Gradescope as the smoke test that catches some stupid bug in your code that prevents it from compiling in our environment. You submit there just to see whether it builds, but you need to test it yourself on the Amazon machines we're providing you.
Okay, so again, the deadline is February 27th, and the webpage is up now with all the information about the project and how to get started.
Next class we'll be doing index locking and latching. We'll start off talking about more traditional locking methods for indexes and how to enforce serializability, and then we'll talk about different ways to implement latching and some problems that can arise in latch-free environments and in latching environments. Okay, all right guys, stay warm and I will see you on Monday next week. Thank you.