 So this is going to be the last lecture we have for the course on concurrency control. So going forward, we'll start talking about other aspects in the system. But I sort of wanted to front load the lectures talking about concurrency control, because as we go through all the other parts and all the other optimizations and techniques and methods inside of database systems, I want you to be keeping concurrency control in the back of your mind. Like, how would you do this awesome technique if you have other transactions running at the same time trying to update things? So that's why it's important to understand the basics of concurrency control in these high performance systems in the beginning, because that will sort of guide our thoughts and discussions for all the things that come after that. So for today's class, I want to spend some time in the beginning talking about store procedures, what does the store procedure look like, why we actually would want to use them, especially if we're doing transactions. And then for the focus today we'll be on optimistic concurrency control, which is another variation of timestamp ordering. I would say it's not as common as multi-version concurrency control. But as we go along, you'll see that at a high level they're basically the same thing. And then I'll have some brief comments at the end talking about the course projects. OK, so as we said before, that when we switch to an end-memory database, the one of the key advantages we get from that is that dealing with stalls because a transaction try to access data that was not an end-main memory goes away, because you assume the primary location of the database is a main memory, so everything's always going to be right there where you need it. So this means that for the database system, distals are pretty much almost gone entirely. We're still going to need the disk in order to log things out in the right-of-head log so that all of our changes that are in memory are still durable. So that's not going to go away. We still need to do that. But again, we're not going to have these long pauses while we go fetch the block we need into the buffer pool. But although we get rid of the distals for the best part, there's still going to be other stalls that the database system has to deal with when you're using a conversational style API to access and communicate with the database system. And I'll show in a second what I mean by a conversational API, but pretty much all the standard libraries that people are using today to write applications to interact with the database system is using a conversational API. So ODBC, JDBC, any of the database system specific protocols or wire protocols like when you download the driver and a library into your C program and write back and forth to the database system, these are all called conversational style APIs. And what it looks like is something like this. So say now we have our application and we have our database system. The application is running on a separate server and the database system is running on a separate machine. So inside your application you're going to have this code that essentially is doing high level operations. Like when you click the load a page, whether it's Python or PHP or whatever, it's going to call some function that's going to figure out what values to put into the HTML that you give back to their user in response to that request. And so essentially you're going to be doing transactions. So you'll have a begin and then intermixed with SQL statements will be the program logic to process the data that's retrieved from the database system. So what will happen is when the application invokes one of these transactions, you call the begin statement. And usually at this point when you call begin, you don't have to do anything because you don't tell the database system I'm starting a transaction. It's not really until you actually execute the first query that you send that notification. So then we get to this point, now the program invokes a SQL statement. And so we'll send a request over the wire to the database system. It'll get the SQL string and it's going to have to go through all the regular steps or the procedures to process it that you learned about in an induction class. So you're going to have to parse the SQL. You're going to have to turn it to a query plan. Then you have to optimize that query plan. And then you're going to execute it. And the application is stalled and waiting for the response to come back before it proceeds and keeps processing. Most database APIs that are incorporated into programming languages operate on this blocking method. And this is why it's called a conversational API because it's sort of this call and response or request and response protocol where the application is going to ask for something, wait for the database system to process it, and come back with reply. So once it gets the response, then it proceeds forward now to the next stage of the program where it's actually going to process the response that the database system gave back to the application. You get a select statement, and then you're going to get back and look at the values and then do one thing versus another. But now during this time, while we're processing the data we got back in our SQL statement over here, the database system is sitting idle, waiting for the next query to execute. Now, it'll be executing other queries for other transactions, certainly if you have those. But for this particular transaction that we just started, it doesn't know what the next thing is going to do. It doesn't know whether it's going to abort. It doesn't know whether it's going to execute another query or commit. So it has to maintain all that metadata that we talked about before, like the read sets, the write sets, the scan sets, all those things it's using to keep track of the actions of the transaction is to be sitting around here. So we're prolonging the life of the transaction because we have to keep going back and forth over the network. Then we get back down to this next SQL statement, and then we get a call and response. Then we do more program logic. And this could take, again, an arbitrarily long amount of time. The database system is waiting for the next thing. And then only until now do we get the commit statement that we can tell the database system, our transaction, and we want you to try to commit it. We want you to try to make saves changes. And then here, when we get the commit statement, we'll finally flush things in the log, and we know all changes are durable. So, again, this back and forth in all these long waits is problematic. And there's three ways we can get rid of it. The first way is to use what's called prepared statements. And so in a prepared statement, you basically tell the database system ahead of time, I'm going to execute these queries. And instead of putting constant values, like for someone's name or social security number, you'll put a little wildcard or question mark and say, I'll fill in the parameter at runtime, so I don't know what it is yet. And so this allows you to get rid of the overhead of having to do the parsing, the planning, and the optimization because it's sort of like you have a cache query plan for the query. Now, a lot of the applications are out there. Don't use prepared statements. It's all sort of these ad hoc queries. So it's the first time the database system pretends it's the first time it's ever seen this query, even though it's executed multiple times before. And it goes through all that same parsing and planning steps every single time. Now, a lot of the systems can cache things if they see it enough, but initially it's going to redo it over and over again. And this is sort of one of the arguments that the NoSQL guys made about why you didn't want to use a SQL database or a relational database because of this planning overhead for the queries. And they would claim, oh, it's better for you to write your queries in these proprietary APIs and tell the database system exactly what you want to do to avoid this overhead, which is kind of a stupid argument because if you use prepared statements, you don't have that parsing overhead. The only downside of a prepared statement is that because the optimization is done at a certain point in time based on what the database looks like at that point at that moment, if the distribution of values of the distribution of the cardinalities of the data changes over time, then that may not be the best plan to use at some point in the future. So you have a sort of a stale, prepared statement. So let's say, for example, if you have a join in your query and the database system picks one table to be the outer relation versus the inner relation based on their sizes, but then a couple days later you load a ton of data and now that distribution changes. You may actually want to join them in a different order. And a prepared statement, you'd have to refresh it every so often. Another way to get rid of the overhead is to use query batches. So instead of sending queries sort of one out of time here, if the input of this SQL statement did not rely on the output of this one, I could execute them in a batch and then have it be one round trip back and forth to execute both of them together. A lot of applications have this pattern, but it doesn't always work because it could be that the output of this is then used in an if branch here. So if the output of this is one way, then execute this SQL query. If it's another way, execute another SQL query. In that case, you couldn't do a batch. Store procedures removes both of these bottlenecks by allowing you to basically suck out the logic that's in the application and run it directly inside of the database system. And so you think of a store procedure as a group of queries that form the logical unit or some higher level operation that the application wants to perform. And you'll be able to run that on behalf of the application directly inside of the database system. So think of this as, again, if you want to do a debit and credit, instead of doing the debit and credit and checking for balances on the application side, you can do it directly inside of the database system. So the standard programming language that you would use to write a store procedure, or at least as it's defined in the SQL standard, is called SQL PSM. PSM stands for Persistent Stored Module. So the standard was sort of defined as an extension to SQL 92 a few years after in 1996. And it wasn't until, I think, SQL 99 that actually ratified what the standard looks like. And of course, there's all things in databases. No one really follows the standard. And so there's PL SQL from Oracle. And this is sort of what IBM and MySQL follow. SQL Server uses this thing called transact SQL. Postgres has SQL slash PLGS or something like that. So they're all sort of slightly different. But at a high level, they look the same. But I would say the differences between the dialects for the store procedure languages are much more severe than in the SQL differences. So certainly probably, if you write something in transact SQL, it won't work on MySQL or Postgres or DB2. So here's at a high level what a store procedure is doing. So this is the application we have before. We have our begin and commit. SQL statements intermixed with program logic. We're basically going to take this unit of operation and just move it to run directly inside the database system. So now on the application side, all we have to do is use the call SQL command or in other systems, it's exec or execute. And we'll invoke the name of the store procedure that we provided the database system when we installed it. And then we can pass in parameters, right? So it's almost like an RPC call to allow the application to invoke higher level operations directly inside the database system. And what will happen is depending on what programming language you wrote this in, the store procedure will essentially run as if it was a regular query. Like the same sort of worker thread will be able to process it just as if it was processing a query. And so now, again, it's one round trip back and forth. There's no network delays and all the operations will be committed atomically just as if it was a regular transaction, which is really nice. So just to give you an idea what SQL PSM looks like or PL SQL looks like, this is sort of a simple example where we have a test procedure. We're given in a number and a name, and then we can loop through and insert that guy multiple times, right? It's a pretty stupid store procedure, but essentially this isn't more or less what they look like. And in my opinion, it's hideous, right? This kind of programming language style clearly reeks of the 1980s, 1990s, even early 1970s. And that's because it actually kind of is. Because SQL PSM, in the standard, there was like one dude on the committee who really liked Ada. If you don't know what Ada is, Ada is a programming language that came out of the 1980 or so, and it's an extension of Pascal. If you don't know Pascal, it's what they used to teach middle schoolers when I was in middle school. And it's an even older language. So it doesn't have all the modern things you would want in a programming language. And so it's not very many people know how to write this kind of stuff. So what are the advantages of using store procedures? Well, we've already said that it allows you to reduce the number of round trips between the application server and the database server in order to invoke these high level operations. Because again, it's one call, one network message to say, invoke this store procedure, and then one response to get back the acknowledgment. So now you don't want the transactions aren't holding locks, aren't holding resources for long periods of time because it's all done atomically very quickly. It's also going to allow you to get all the advantages of a prepared statement because all the queries will be defined ahead of time. And the database system will be able to do all the optimizations it needs ahead of time. It's also going to allow you to do, from a software engineering standpoint, it's going to allow you to reuse complex operations or complex logic across multiple applications. Because now everything's sized in the database system. So say if you're at a large enterprise and you have some legacy application code written in COBOL or FORTRAN, and then you have more modern stuff in Python and C++, instead of having to reimplement some complex functionality in all the different programming languages for all your applications, you can write it once in SQL PSM as a store procedure, and then all the different applications can just invoke that. And what's nice about this is, again, you don't have to worry about that each implementation is exactly the same in the different languages, and that they're always in sync, and that there's one guy who knows where everything is and everything always works out. You just write it once, put it into the database system, and then reuse it over and over again. I would also say, too, I can't prove this, but I think it's possible to also achieve view serializability if you have store procedures, because then you can understand actually how the application plans to use data in a much deeper way than you could if you just had arbitrary Java code or PHP code. So again, most all the algorithms, the conventional algorithms we've been talking about so far, they're achieving conflict serializability, but there's a lot of false positives where they prevent certain amount of concurrency that you can have if you knew how the application actually was going to use the data, and that's essentially what view serializability provides. So actually I think you can get better performance if you use store procedures and actually understand how it actually plans to use the data. I don't think anybody does that yet, and that's kind of a PL thing, and I don't want to get anywhere near that. But that's on the side. So what's the downside of store procedures? So the first one's kind of obvious, as I said. There's not many people out there in the world that know how to write SQL PSM. The example I showed you is pretty trivial, but when you start throwing cursors in and other issues, it becomes very complex. Now there have been some attempts to allow you to write store procedures in other languages. So for example, in Postgres you can write store procedures in Perl. But the issue is that since the store procedure is going to run directly on the database system, you don't want to let people write store procedures in a language like C, because C will let you go jump to any place in memory and start doing whatever you want. So it would be terrible if you had a store procedure written in C, and they start writing to a memory location that's being used by the database system, and it corrupts it and trashes it. It's even worse if you have an in-memory system. You can call some real damage. So languages like SQL PSM and PL SQL, these are considered slave languages where the store procedure is not going to be able to do anything dangerous in the address space. For other systems, you can run things like in C or C++ in what are called sandbox environments. So in Oracle, you can write a store procedure in C, but they'll fork a process and run the store procedure in that process so that it can't affect the address space of the regular database system. But then you lose some performance of having to set up an IPC between the database server and the store procedure. So that's not ideal. Another software engineering issue is that the store procedures are going to be outside the scope of the application. You sort of have this bifurcated environment where half the logic is in your application code, and the other half of the logic is now in your store procedures. And so it really becomes difficult to manage different versions, keep track of things, to make sure that everybody's in sync, and it's actually pretty difficult to debug. If you're a programmer and you're trying to figure out what's wrong with the application, you start reading the PHP code, and then all of a sudden you see an execute call to some store procedure, and then you go find where that is, figure out what version is actually running in the database system at the time, and try to make sense of what's going on. So it actually becomes very difficult to develop and maintain these things. And this is why a lot of people don't actually use them as much as they should. I would say also too, if you think about it, you make changes in the application, and you're using a version control system, like Git or something, and the store procedures that's installed in the database system may not actually match up with the version you expect there to be. In a lot of enterprise environments, the DBAs are very, very cautious about what changes you can make to the database system, and when can you make them. So you may be in this agile environment where you're making changes to your application and putting out new releases every week. But every week you've got to fight your DBA to let them install a new store procedure code for you, and they probably won't do that. So again, it kind of hurts the productivity of the developers. And as I said before, the SQL PSM stuff is probably not as portable as SQL is. So if you write your stuff in DB2, it's probably not going to work in Postgres. And again, I think it's even worse than just straight SQL. I would say, though, the advantage you can get from store procedures are quite significant. They talk about in the silo paper that in the previous experiments they get around roughly 23% production and performance if you don't use store procedures. If you hire a DBA to come tune your database application, the first three things that he's going to tell you to do, he or she, would be one, buy more RAM, two, fix your indexes, and three, suggest that you write things with store procedures. Because these are known techniques to make things speed up, make things go faster. And again, I think this is the real deal breaker. The fact that nobody really knows how to write this crap prevents people from actually using it. OK, so the reason why I brought up I'm bringing up store procedures is because in the silo paper, they talk about a one-shot request model. And that's essentially store procedures. The people that wrote the paper, in particular, Eddie Kohler, he's a systems guy. He's not a database system person. So they use different terminology. But essentially store procedures are what they're talking about. OK, so now we're going to start talking about optimistic and current control. So again, this is going to be another time stamp ordering scheme. The last one we're going to discuss. And the main idea of how it differs with NVZC is that we're not going to maintain multiple versions in global database. We're going to maintain multiple versions in these private workspaces for each transaction, but are not visible to any other transaction running at the same time. Remember, in NVZC, we said that when you modified a tuple, you would copy it and make a new version of it. And then other transactions could read that version. And then you had to do some extra checking to see, as you went along, am I allowed to actually read that version or not? Or am I allowed to actually create a new version or not? In NVZC, we're going to do all our changes in a private workspace, and it's only until the very end do we look to see, do we have any conflicts, right? And then we have, if there are no conflicts, then we'll be able to install the updates directly in the global database, and then they become visible to everyone else. So OCC was first proposed in a paper in 1981. Actually, it was done by a professor here at CMU, HT Kung. And what's crazy about it is he's not a database person. He actually was a networking person who just sort of stumbled on this protocol that actually works quite nicely in database systems. And it's probably one of the most well-studied and well-cited commercial algorithms out there. He's now at Harvard. He's been there since the early 90s. He's sort of what is hired to rebuild the Harvard CS department after the 1980s. I met him. He's actually a really, really cool guy. All right, so let's do the same thing we did before. We'll go through the textbook example of how OCC works. And then from that, we'll go to talk about the modern variance of it. So for this transaction here, it'll start off doing a read on A, a write on A, and then a write on B. And then it'll go to commit. So now in our database, we're going to have some kind of identifier for the record. So here we'll say it's the record ID is A or B. They'll have the value for its attribute. And then we're also going to include the right time stamp of the last transaction that successfully committed that wrote that version, or wrote that latest update to this tuple. So now when our transaction starts and wants to do a read on A, we're going to define the protocol based on phases. And so this part here where we're doing the normal operations of executing queries on the database, this is called the read phase. And I realize this is a confusing terminology because we're doing writes. But it's in the original OCC algorithm. It's called the read phase. So just go with that. So we'll start off with doing a read on A. And for this, we're going to read directly from the global database, read the latest version that's in there. What we're going to do is we're going to copy it. When we read it, we're going to copy it into our private workspace. So we'll take an exact copy of all the current version, the value, and this time stamp, and bring it into our private workspace. And now, anytime we read this thing again, we know to look in here to get the thing that we want. So now when I do a write on A, again, I know that I have that object in my workspace. So I'll do my write in here. But now, when I modify the value, I'm going to change its write timestamp to be infinity. Because again, here's one of the differences between MVCC and OCC. In MVCC, you've got a timestamp in the beginning. In OCC, you get a timestamp at the end. So when we do this modification here, we don't have a timestamp yet. So we just put infinity as a placeholder to know that we modified it, and then at some later date, we're going to come back and change that. Now here, we do a write on B. And again, we want to be able to do our change in our workspace, but it's not there yet. So we'll copy it, and then we'll update it with the new value and set the write timestamp to be infinity. So now at this point, the transaction wants to commit. The application server will tell the database system I want to commit, but it doesn't commit yet. There's actually two more phases that get added here. So it'll be the validate phase and the write phase. So again, it's not as if the application server is going to specify, all right, I want my transaction to now go in the validate phase or go in the write phase. The application says commit, and then the database system sort of does this stuff internally. So in the validate phase, what we need to do is we need to go see whether we have any conflicts. And I'll explain what kind of conflicts we can have in a second for anything that we read or wrote to in our private workspace. And so long as there's no problems there, then we can switch into the write phase. And I didn't know what the icon to use for the write phase, because we already have write here. So I picked a guy that looked like Bill Murray. But anyway, so here at this point, we get our new timestamp, and again, we can use all the different timestamp allocation methods that we talked about before, like the mutex, the atomic add, the batching, the harbor clock. It doesn't matter. But we'll get a new timestamp that's greater than any previous timestamp that's come along in our database system, and then we can apply all our changes that are in our private workspace and put it into the global database. So at this point, and we'll do this atomically, and now all our changes will be visible to any other transaction that starts after we do. So this is clear for everyone. You sort of see how this is different than MVCC. We're getting timestamps at the end. We're storing our changes in our private workspace. And we're not aborting as we're in the read phase. In MVCC, you could realize that you're trying to write trying to write to an object that somebody already else has written to, and you'll shoot yourself in the head and restart. In this, you wait to the very end to figure out, did I complete successfully? To parallel validation? The question is, is allocating the timestamp only have to do with serial validation or parallel validation? No. You always have to allocate a timestamp in the right phase. We'll talk about serial validation in a second. OK, so now let's go through each of these phase a little more in detail. So the read phase is basically more or less the same thing we talked about with Hecaton. We're going to track the read-write sets of every transaction, and we're just going to store all any modifications we make into our private workspace. And it's more than just tracking the read-write sets, we're actually copying tuples from the global database into our private workspace so we ensure that we have repeatable reads. So that way if we read object A and we keep reading it again in our transaction, if someone else comes and modifies it later on, we still always get the version that we read originally, right? In the validation phase, this is sort of the main meat of the algorithm. So when the transaction invokes commit, or sorry, the database is going to check to see whether it has any conflicts, and there's two ways we can do this. We can do what's called forward, or backward validation, and forward validation. And I'll go through what these are. So in backward validation, the basic idea is that we're going to check to see whether the transaction that's trying to commit, whether it's read-write sets, intersects with any transaction that has already committed. So we don't care about anybody that's still running with us at the same time. We're only going to care about what was in the past. So in this case here, I say transaction two. And this is a timeline sort of when this transaction was alive and when it was doing stuff. So it commits at this point here. So what it needs to do now is go back in time and look to see, well, what other transactions have committed recently, and did I read or write the same stuff that they read and wrote to? So this sort of area where it's going to look is called its validation scope. So this is how far back or what it needs to look at to figure out whether there's any conflicts. In forward validation, the difference is that rather than looking back in time, you only look to see what transactions are running that are still running when you're trying to commit. So in this case here, this transaction is still in the read phase because it's still reading and writing tuples. So we only need to check to see whether we read and wrote something that it read and wrote to. And the key thing about forward validation and backward validation is that you're always doing the validation in the same order, in the same direction. So in backward validation, you're always looking back in time, you don't care about anything that came in the future. In forward validation, you only care about things in the future, you don't care about the past. Because if you get through and can validate without any conflicts with the active transactions, you know you didn't conflict with anything before you. So it's therefore it's safe for you to commit. Now, which one is better than another, again, is all things in databases. It depends on what the workload is, it depends on what the application is trying to do. I don't think one is better than another. So the other thing, actually, one advantage you get from forward validation is that if you get past the right phase and actually able to commit the changes, you can immediately throw away the read-write set and all the metadata you're tracking for any committed transaction. Because you don't need to go back and look at them. If you're doing parallel validation, so this would be doing serial validation, you assume that one guy is in the validation phase at a time. If you're doing parallel validation, you just have to look to see if your read-write set's conflict with anybody also committing at the same time. As well as looking forward or backward, appropriately. Actually, this is essentially what I said here. So basically, in parallel validation, it means each transaction must check to see whether the read-write sets of other transactions that are in the validation phase, whether there's any intersections there. And the key thing to make this work is that you have to acquire locks for the records that are in your write-set in some kind of fixed global order. Remember when we talked about lock thrashing? We talked about the ideal workload scenario that wouldn't have any deadlocks is when transactions acquire locks in some kind of primary key order, or say memory address order. Because you're going, say, you're trying to acquire locks one, two, three, and four in that order. Everybody else has acquired in that same order. There won't be somebody that holds the lock on four trying to get one, and someone holds the lock on one trying to get four. So you don't ever have any deadlocks, as long as you just enforce this order. That makes doing parallel validation pretty easy. You also don't need to keep track of locks for the read-set records because you're checking to see whether someone has written to something, and you just need to check whether they have a lock for that. Nobody checks to see whether you're reading something and hold a lock for it on a read. All right, so now in the right phase, assuming you get past the validation step, this is when you propagate the changes that are in your private workspace and just store them now in the global database. And for this, you can now release all the right locks you acquired in the validation phase, and so that way nobody has an inconsistent view of the database. Everything's sort of done atomically. So any questions about the basic OCC algorithm? Three phases, read, validate, and write. Most of the action happens in the validate phase. Yes? In the forward validation? Yes. The one that is validating with a bond itself if it is in conflict? Correct, yes. It's not the one that started later for you? Correct, yes, this is a good point. So in all OCC, you never shoot the other guy. You shoot yourself, right? We're in NBCC, I think in the Hecodon case, or in the hybrid, you just set a little flag and say, you should abort yourself, right? You can't do that in OCC. So everything has to go in one direction. So now let's talk about modern variants of it. So again, the OCC algorithm was invented in 1981. Back then, computers are much different than we have now. So now there's different versions of the algorithm that essentially at the high levels, doing the same three phases, but that can be more clever about how they organize the data and how they synchronize with each other to reduce the overhead when you're running in multi-core environments. And so unlike last class, when we talked about, like Hecodon, Hecodon is a commercial system, you can write a big check and you can go get it from Microsoft right now. I think it's like $10,000 a core, right? It was actually pretty cheap when it comes to databases. In the case of OCC, as far as I know, there's no commercial or sort of widespread, widespread, widely used implementation of it in a real database system. So for these two particular systems we're talking about today, these are both academic prototypes, right? So they don't support SQL, they're not sort of full-fledged systems. You could run your Drupal website or WordPress website off of it. Whereas in the NPCC, again, that's used in a ton of different things. So we're gonna start off talking about Harvard's Silo system and then we'll talk about a little bit at the end about TikTok, which is the system that I was involved in or as an algorithm that I've been involved in with the student that did the thousand-core paper that you guys read last week. So Silo is a single node in-memory, all-it-to-be database system. So it's designed mostly to execute fast transactions that's gonna be able to provide, it uses a serializable OCC algorithm with parallel backwards validation. And then again, in the paper, they talk about how they use a one-shot request model. Again, this just means that they're using stored procedures. So the sort of two key things that in general, that the main idea is that you should take away from Silo is that it's very careful about avoiding having to write to shared memory while transactions are running. So in the case if you have a read-only transaction, you never have to communicate or synchronize with any other core or any other thread. And then they minimize the amount of synchronization you have to do while transactions are running by using this concept of epochs. The basic idea is you should be executing transactions in batches and you only need to synchronize at the beginning of each batch. And I'll go through what that means in a second. So Silo was written by was the primary author of Silo was Eddie Kohler. Eddie Kohler is probably one of the best systems professors I think that are out there today, because he's not just like awesome in terms of like coming up with research ideas. The dude is a ridiculous coder. They give you an idea how awesome Eddie Kohler is. When you actually look at the code of Silo in Mastery and we have another student we've looked at this, it's like he writes, he doesn't trust any other code out there, so he writes everything himself. So for example, if I wanted to use, if I needed the JSON library in our new database system, I'd probably find an open source one and just use it, right? Eddie basically wrote his own JSON library just for his own database system because all the other ones weren't fast enough, right? So I don't know how he does this while teaching classes and all these other things. And he looked at his GitHub account, he's committing all the time, which is crazy. And if you ever use hot crap for, when you submit papers, that's written by Eddie Kohler, he wrote some early like X11 games in the 90s that ended up being ported into early entertainment centers on planes, so like the screen that sits in front of you, if you go play Mahjong, that's Eddie Kohler's implementation, right? This is awesome. And he's a vulgar man too, which I love even more. Okay, but he's super sweet. Okay, so again, the big idea to remember from Silo is this concept of epochs. And basically it's gonna do logical time that's gonna be sliced into discrete chunks. And what'll happen is that for every epoch, at the beginning of it, you'll figure out how to allocate transaction IDs, and then you'll, at the end, you'll have everybody commit all at the same time and you can figure out all the conflicts all together. And so in the paper, they talk about epochs being 40 milliseconds. I asked them why they picked this number and they said they just picked it and it sort of worked. So there's nothing special about the number 40, it's just sort of what they ended up using. And so the key is then the worker threads are only gonna need to synchronize at the beginning of each epoch and they never need to talk with each other until the very end. So the way they allocate timestamps is through batching. So if you remember from the thousand core papers, we talked about doing atomic batches, right? So you use the atomic compare and swap and then you allocate a range of transaction IDs or range of timestamps to each thread and they only need to go back and ask for more as when they deplete the batch. So in Silo, basically what happens is you have a single thread that's in charge of the epoch. So every so often, every 40 milliseconds, it's gonna say, all right, here's the new epoch. And then it communicates with the other worker threads and says, we're starting a new epoch and then here's the ranges of your timestamps that you can use to generate new transaction IDs. So this allows us to generate globally unique transaction IDs across all cores without having a single bottleneck or a single synchronization point. Remember in Hecaton, what Hecaton did was they had a single logical counter and did a compare and swap on it in atomic addition over and over again so all the cores had to go to this one spot, update it and then send out all the invalidation methods on the CPU across all the cores every single time you allocated a new timestamp. And that adds additional traffic on the chip and then if you use batching, this gets rid of this problem. So once a hundred is done, you come to 200, the next one and you get the same update. And again, if you run through your batch, you can always go back and get more but in general, you don't need to. And so now we're gonna talk about the commit protocol of SILOM and again, this is sort of another example of where someone who's not a database person writes a paper in SOSP, which is a very good conference but he doesn't use sort of the standard database terminology. So he talks about, in the paper, he talks about the different phases and it's different from the phases that I talked about in the beginning if you understand the canonical OCC algorithm. There's always the read phase but he has sort of phase one, phase two, phase three. Phase three is actually the right phase and phase one and two is the validation phase. So it has something to be mindful when you read it. So in SILO, unlike in Hecaton which had time stamp ranges, we only need to keep track of a single version and what they're gonna do is for these transaction IDs, they're gonna pack multiple units of information in a single 64-bit transaction ID. So in the higher level bits, you'll have the epoch so every 40 milliseconds you get a new logical epoch then they'll have the unique time stamp within the thread executing this transaction within its batch and then to use the last three bits as lock placeholders. So the advantage of this we talked about in the beginning where in a disk-based system, a lot of times they have a separate lock table, a separate hash table that you use to maintain locks but now we can embed the locks directly in the tuples because we know they're never gonna be written out the disk. So this improves our cache coherence because we're gonna read this line anyway to go read the tuple but we get the locks in for free as well. So for this, actually for this explanation as well, I'm not gonna talk about the multi-version aspect of this in the paper they talk about how you can have different versions, you have these pointers, you can sort of go back in time to the early versions, I'm just gonna ignore all that for now. So let's say we have a transaction here and we have in its workspace, we're keeping track of its read set and its write set and notice that in the read set, we still have to keep track of the transaction ID of the version that we've read but in the write set we don't because we don't care because if we try to go write something and someone else has already written to it, we just kill ourselves, right? So we don't care what version we're writing to because we're always gonna overwrite it. So in the first step, we're gonna lock the write set. So for every single tuple we have in our write set, we have to go acquire the lock by flipping the bit with the compare and swap operation for the lock bit. It allows us to have exclusive access to this tuple and then we're the only one that's allowed to write to it. Then in the next step, we're gonna do the validation on the read set and this is basically going through every tuple one by one and checking to see whether the someone else has written to it in the future and we're trying to, we read something that we shouldn't have and we can determine this by checking to see whether a transaction already holds the write lock for the tuple we wanna read. In this case here, let's say the transaction is green. We would check that lock bit is already set so therefore we know that someone has written to something that we didn't read and therefore we would violate serializable order if we committed. So assuming this tuple does not have a write lock, then we just need to check the pointer to see make sure that this is still the latest version if it's an all that we know it is. And then we get to the next guy and do the same thing, check to see whether somebody holds the lock and is modifying something that we read. In that case, we're the tuple that holds the lock or so we're the transaction that holds the lock so therefore we know we're safe and no one else has modified it. And then once we have that, once we do all this validation, then we can actually install the writes and this is just basically copying the changes into here just as you would normally in regular OCC. Yes. Is this forward validation? This is backwards validation. So when we are checking that each second we are seeing whether it is, whether it's, whether it is. So hold on. You're doing two things. You're checking to see whether somebody else has already locked it and that would be, you have to do that because it's parallel validation because there's some of the thread that could be validated at the same time, right? Then you check to see whether the version is still the latest and that means there wasn't somebody in the past that wrote to it before I was able to commit. Yes, you could do this through the pointer, you could do this through the version, yes. So that's that clear, right? So again, we're doing parallel backward validation. We have to check, we check the lock because if somebody else could be validating at the same time and then we check the pointer to see whether it's null and that would tell us whether someone had already locked it, modified it, released the lock before we got to that point where we could check and then that would tell us our version is, you know, we read a stale version and that would violate serializable order. Prevented by the lock. Yes, if you try to write to it, if you try to write to it and someone holds the lock and you kill yourself right away, if you try to write to it and someone else wrote to it and you kill yourself right away, right? So it's the same, the first writer wins. The same thing as heck of Tom. Okay, so Silo does garbage collection which again we talked about last time was important because you want to be able to throw things away and not have everything built up and take up memory. And so the way they do it is using the same cooperative thread technique that Hecaton uses but what's kind of clever about it is that they limit it to be, they limit the thread to only be able to reclaim storage reclaim objects that it was responsible for creating. So what they're gonna do is they're gonna have this idea of every object that you no longer, think you no longer need, you're gonna mark it with a reclamation epoch which is corresponds to some point in the future, some epoch in the future where you know this object is no longer visible and no longer needed from any other transaction. Now if you have a long running transaction that's going from epoch to epoch to epoch you may have to update this going forward in time but assume all your transactions finish in the same epoch, you know within the next epoch no one's ever gonna be able to read it so I can throw things away. And so the each thread is responsible for maintaining the references of the objects it's gonna wanna clean up in thread local storage so that way you don't have to have, you don't move a lot of data around as you would if you had a vacuum thread. So if you had a centralized background vacuum thread it would have to go through and find all these objects that are stored in different cores and when you go to reclaim things you would have to move it into your CPU caches and then do the free on it but if you're freeing the things that you're responsible for you know they're gonna be local to you and you don't need to do any cache coherence overhead you don't have any cache coherence overhead underneath the covers to move things around. All right so this is another example by careful engineering of how we do sort of the basic operations, the basic design choices you have in these time stamp ordering protocols we can get better performance because we're careful about how we use the hardware. Silo also since it's serializable in addition to dealing with the write write conflicts and the read write conflicts we have to also be able to handle phantoms and Silo essentially does the same thing that Hecaton does where you keep track of the scan set they call it the node set because they're dealing with indexes but it's the same idea and that is basically it's keeping track of all the nodes within the tree indexes that the transaction accessed and then when the transaction goes to validate they check to see whether anybody has modified those nodes and possibly put in new data or deleted data that we scanned through and saw the first time and if so then we kill ourselves on a board because that would violate serializable order. Now in the paper they talk about how you to be very careful to deal with changes in the physical data structure of the index that correspond to actually new values being added versus like I did a merge operation and there was some node that didn't have any new values added to it or deleted but you got coalesced or moved around because some higher level structure changed in the index. So they talk about how you need to be able to handle those things and not have false positives. They also talk about how you need to be able to handle cases where if I say I'm gonna check to see whether I'm inserting a new value with an auto increment key or primary key I wanna make sure it's unique. I do a probe in the index to see whether that value already exists. If not then I know it's safe for me to insert it and so in order to make sure that nobody else inserts something the same thing you do the same thing that you try to do a probe on you have to maintain virtual entries in the index for keys that don't exist yet but you know you're gonna wanna put something there later on in the context of your transaction. So these kind of stuff dealing with indexes this is what we're gonna focus on on Monday. So Monday we're gonna talk about how to do index locking like the key range locking index gap locking all this and how to do this in B plus trees and whatnot. So we're sort of ignoring all this for now but just to know in order to avoid phantoms you have to maybe do some extra stuff inside the indexes. Now in Hecaton they were basically re-executing all the scans again. So they weren't doing any sort of low level index locking they didn't care what the index was they just notified scan something again do I see new values or not. And so this using these kind of techniques that's what we're talking about on Monday this is a lot of you did not do the re-scans over and over again. All right so I'm gonna show one graph of performance because I think this is pretty the sums up silo quite nicely. And for this experiment they actually implemented the H door protocol or the volt TV protocol where you have a partition database and single-threaded execution engines that are assigned to each partition. And they have a version the regular version of silo and it's called mem silo because it's not actually durable they're not logging anything which we're ignoring for now. And then mem silo plus split is essentially partitioning tables in the same way that the H door protocol does but still allowing for transactions to touch any partition that it wants. And it's still running the regular OCC protocol across the global state of the database. So what I really like about this graph is it basically shows the advantages and disadvantages of the H door protocol. So up here when you have zero multi-partition transactions so every single transaction only has to touch data at one partition. So every single core can execute a transaction from beginning to end without coordinating with any other cores. You get the best performance. But then there's this cross point here where roughly around 15% multi-partition transactions the H door protocol does worse and the silo sort of stays the same. And what I really like about this graph is like this is almost exactly the same results we see in our version of H door, like the thing called H door, right? So this is kind of cool. This is showing that you take some basic algorithm and you implement it in two different systems in two different languages, programming languages and you see the same trends. So in H door we start to fall apart. Performance gets really bad roughly around 14 to 15% multi-partition transactions. This is exactly what he shows. And so this is also sort of showing you the motivation that the hecaton people were talking about of why they didn't use a partitioned concurrent scroll scheme, right? Because here we're showing that some applications are doing amazingly well with it but then some other ones are doing terrible. Whereas in the case of silo, it sort of gets slightly worse as you add more multi-partition transactions but it's pretty much stable across the whole board, right? So you don't get the absolute performance you can get with the H door model but you get good performance for everyone, right? So I think that's sort of why you'd want to use MVCC or OCC versus like the H door protocol. And certainly when you start doing a hybrid workloads with analytical queries, this performance would be even worse for H door, right? Because of the single threaded nature of how it x2 is transactions. Okay, so that's silo and again, this was an academic system that paper came out in 2013 and I have to admit silo is pretty damn good. We spent some time thinking about it, we spent some time playing with it and it kind of really has no flaws that we could find at least. And so what we set out to do is can we just do certain things better? It's not that it does things bad, but can we do things better? And so what I'll talk about now is a variant of OCC that we developed in the last year called TikTok and this was implemented in the same DBX1000 system that was used in the 1000 core paper, the staring tube is paper that you guys read last week. And so TikTok is gonna be doing parallel backward validation and it's gonna be using the same store procedure only API that silo does. But the key difference is that it's not gonna do global timestamp allocation either with the mutex or the atomic addition or the batches that silo does. And instead it's gonna use what we're calling data-driven timestamp management where the timestamps of transactions are derived by logical counters that are embedded inside of the tuples. So there's no one thread that's allocating or giving out timestamps. Transactions are figuring out the timestamps on their own by looking to see what are the timestamps embedded in the tuples. So I'll show what I mean in a second. So in every tuple or every record there's gonna be two timestamps. There's gonna be the right timestamp that corresponds to the transaction that created it and then a read timestamp that corresponds to the last transaction that successfully committed and read it. And what will happen is again the transactions are gonna be moving these timestamps on their own without sort of a global coordinator. So we're gonna say that we're not dealing with multiple versions although we can in the same way that silo can which is for now we'll just deal with a single version environment. And we're gonna say that a record it will be considered valid for a transaction as long within the range of its right timestamp and the read timestamp. So the right timestamp is always gonna has to come before the read timestamp. So let's look what I mean by this. All right so say we have a transaction it does a write on A, a write on B, sorry write on A, a read on B and a read on C. And so what I'm showing here in this timeline is the logical time that corresponds to the read and write timestamps for each of these tuples. So you think of like this is tuple A, this timeline, this is B and C. And then at the beginning you'll have its write timestamp followed by its read timestamp. And again the read timestamp always has to be after the write timestamp. So we execute these transactions and again we execute these operations as just like before in regular OCC. We're gonna do all our modifications inside of our private workspace. And now we finish the read phase and now we're in the validation phase. So now we need to go through a commit protocol. So in the first step we're gonna do is just lock the right set of any tuple we modified. So again that nobody else comes and modifies at the same time, which we have to do because it's parallel validation. In the second step we need to compute the commit timestamp for the transaction. Now remember in silo this is what those batch timestamps from the epoch were from. But now we're gonna derive the commit timestamp from the tuples that the transaction accessed. And so what we're gonna say is that we have to pick a commit timestamp that comes at least one after the read timestamp of any tuple we modified or at least after the right timestamp of the any tuple that we read. So in this case here and it's the max of these three here. So in this case our commit timestamp will be here because it comes after here, after here, and after here. Okay, it's getting there's no global allocation. We can derive this directly from the tuples. So now we need to validate the read set. So what this means is that we need to go look to see for any tuple that we read is it still valid from the time we're proposing this transaction should commit, right? So in this case here for this read timestamp we know that it was written at timestamp two and somebody read it in four. So therefore the range falls nicely in our commit timestamp. So we know that this guy is still valid. This guy's a bit more tricky because the timeline ends here. Now it's not to say that this thing isn't valid. It's just that we don't know whether anybody has read or written to this transaction yet, right? Again, we're always checking to see whether we can acquire whether anybody's holding the lock for these and if so we kill ourselves. But let's assume that not in this case. So we need to figure out what's going on here and can we move its read timestamp forward in time knowing that nobody else is disappointing. Did I do that? The whole thing just like died. Projector cooling down. Okay, I can hold it and hold it up at you. Or we can wait. I can't even, there's no turn back on button. I think it overheated. All right, you know what, let's, I mean we only have 20 minutes anyway. Well, I'll redo this next class. The only thing I wanna say about the, just in case, again, to remind you about OCC versus MCC, they're more or less equivalent, right? The only difference is when you allocate the timestamp, where do you store the different versions, right? The validation part is roughly the same. And that there is a, there's trade-offs about deciding whether to abort transactions early or later. So again, in MCC, you can abort a transaction early. Like if you recognize that you're trying to write something that somebody else already wrote to, you stop yourself right away, right? But this checking has additional overhead. In the case of OCC, you just do everything and then it's only till the very end do you check to see whether there was any problems, right? So that means at runtime it's really easy because you're not checking anything, but you may have done a ton of work that you're gonna end up having to abort anyway. Like say for the first query you execute, you violate serializable order and there's gonna be a conflict, but then you execute another 1,000 queries. Then you get to the validation phase and you realize, oh, those, all those 999 queries were wasted. I need to roll everything back. So again, I'm gonna say this over and over again, there's no real difference in the, ah, see if it works. There's no real difference in, there's no one way that's better than another, it just depends on what the application wants to do. If you have a lot of transactions that take like a second to run, then OCC is probably not the way to go because again, you're gonna have all these one-second transactions that are gonna have to get rolled back over and over again. Let's see if it comes back up in time. Very suspenseful. Okay, sorry. All right, so I'll back up, all right? So we've gone, we're validating the read set. This guy was read somewhere in the future so we know the range still fits in our commit timestamp so that's okay. It's this guy here we gotta worry about because we don't know what's going on here. So there's three cases we have to be able to handle. The first is that it's the latest version, no one else has read it and therefore it's safe for us to now move the read timestamp to our commit timestamp to move it forward in time. The other case is that somebody else modified it, there's a different version and for the same object there's now a write timestamp that comes before our write timestamp that we read and therefore it overlaps with our commit timestamp and in that case that would be a conflict and we have to abort the transaction because that would again violate serializable order. In the third case someone else modified it but they generated a commit timestamp for themselves that's after our commit timestamp and again we would see this because the changes would be applied atomically but we're not synchronizing with each other and in that case that's okay because we read something in the past that was still valid and they modified something in the future that we weren't gonna read anyway. So again the main takeaway what TikTok is doing versus Silo, Silo is having a centralized allocator give out these timestamps for transactions whereas in TikTok it's deriving the timestamp, the commit timestamp based on what objects it read or wrote to. So in this case here our timestamp would be three so we would leave this alone because the read timestamp is exactly where we are. We would leave this alone because the read timestamp is further in the future and in this guy here we would bump its read timestamp now to be three. Yes. Sorry, is that again? Is there any console? No effects, you don't need them. So you're not trying to preserve the idea that you never write to shared memory on reads. The question is I'm not trying to preserve, not trying to avoid writing shared memory on reads. Correct, yes. Nice segue, awesome. All right, let's look at performance, right? So for this again we're running on DBX 1000 so we already had implementation of deadlock detection, implementation of no weight and then we added Silo, we improved our implementation of NVCC to now include what Hecaton does. And so for this we're using the YCSB workload, we're running on a machine at MIT that has four sockets of 10 cores per socket. It actually has hyperthreading but that goes up to 80 cores but for this we sort of stopped it at 40 because once you go to hyperthreading things don't look as good for everyone. So for the first example we're gonna run a work of medium contention so a 40% hot spot on 10% of the database and we're gonna do extra transactions that have a query mixture of 90% reads and 10% writes and so like I said I think Silo is very, very good. So on this workload we're not that much better in some cases we're doing worse, right? We can sort of tweak these parameters a bit and Silo is basically equal to what TikTok can do. Where we see the difference and actually you see Hecaton sort of come in nicely in third and then these two other guys since they're not really optimized for multi-core environments they don't really do anything special. When we switched to high contention and a workload with more writes now we actually see the difference here. So in this case here TikTok does about 300,000 more transactions a second than what Silo does. Surprisingly Hecaton actually does worse than no weight and this is because the cost of maintaining different versions and rolling them back when you realize there's always, when you have a lot of conflicts sort of bogs this thing down whereas in no weight you abort yourself immediately before you're allowed to do anything real spec make any significant progress and therefore rolling things back as much cheaper. So again this is sort of the same trend that we saw the difference between and the Hecaton and the pessimistic case and the optimistic case. It may not seem that quite significant but 300,000 on a single box is pretty massive when you consider the disk based systems can max out at roughly 30,000, right? So I attribute the difference performance to the fact that using the data driven timestamp management allows us to avoid certain false positives having false positive conflicts and having abort transactions when they actually wouldn't violate serializable order whereas in Silo it's a bit more conservative than what we can do and therefore they abort things that maybe didn't actually have to abort but that's how the protocol is specified. Okay, so again this paper we actually the paper got accepted in November so it's on the website but it won't officially be out until June later in the year. All right, so this is essentially what I was saying when the screen was down. OCC and MVC are essentially equivalent and their differences just have to be when you do all sort of the different aspects of the protocol and where you store the data. And then we talked about this trade off between aborting early versus aborting late. My hunch I feel like MVCC is the best for hybrid workloads but we haven't gone and proven that yet. I think that's sort of the next step. I think a lot of the papers in the last couple of years have been focusing only on OLLTP stuff and now I think we'll see more things on the hybrid workloads because things definitely get much different. Okay, so any questions about OCC? Again, as I said, this is the last current actual method lecture we'll have in the course but you should always be keeping this in the back of your mind when we talk about everything else, right? When we talk about vectorization when we talked about scan sharing all these things you think about well what's gonna happen? How does this work when you have transactions multiple transactions updating things at the same time? Okay, so some quick comments about the projects. Project one is as a reminder is due February 8th at midnight. We're still working on getting AutoLab working for the submissions if it comes back with all zeros that's because it's not quite correct yet. Part of the issue is that we didn't realize AutoLab uses Red Hat Linux and everyone's been using Ubuntu so it's just a matter of fixing up the scripts to take care of that so hopefully we won't have it working by the weekend and we'll send an announcement out to everyone once it's done. So another thing too is I've seen some of the people's code and you came to the office hours and asked me to look at some stuff and I'm seeing a lot of printf statements should not do that. If you read the project specification we'll talk about how to do proper logging like debug logging, right? There's these log info macros that will write things out to the database system log file and you can look at it and it'll tell you the file name and the line number of what produced the output. If you leave those printf statements in when you actually submit it and we do the speed test your thing definitely will run slower. Writing to standard out slows you down you don't wanna do that, right? So you wanna use those log macros and that way you can turn them off when you submit them the code. We'll turn all those things off when we print it. And then for some of you been emailing me about signing up for project groups I didn't realize people were so anxious to organize so that's awesome. For some of you that don't have groups yet what we'll do is I'll send out a list tonight or a form on the Google and everyone who's in a group just sort of fill out who's in your group and then anyone who's looking to be a group sort of add yourself in a bullpen list and then we'll coordinate and assign people as needed to fill in gaps. So there's 30 people enrolled in the course so there's gonna be 10 groups of three and then there's three people that are signed up for auditing and they'll be their own separate group and so you can't mix the enrolled people with the auditing people because I really have no leverage or over the auditing people if they bail. So, any questions? Any questions about project one? Everyone should at least have started. Who has not started? No one wants to admit to it. Again, it's not meant to be a super strenuous like mentally exhaustive project. It's really just to get you to understand the build environment, the idea of these tiles and how the query execution stuff works. And then the second project will be on doing concurrent indexes and that sort of explore another aspect of the system as well. Okay? So next class, we're switching over to indexes. So there'll be three lectures on just how to do indexes in general. We'll start off with doing index locking. Then we'll talk about transactional O-to-me indexes and then OLAP indexes. Okay? All right guys, have a good weekend.