All right, that was awesome. That's great. So I know you have a whole bunch of gigs still left coming up this week — the gig at CMU, the undergrad Diwali party. Okay, that sounds great. And hopefully it's helping you put a little bit of a dent in the CMU tuition, or is it insignificant? I don't think you'd even see it. Okay, okay, all right, all right. But you're getting a great education in return, right? So it's all gonna be worth it. Gigs in exchange for tuition, I love it. That's awesome, great. I have an ID if one of you wants to come in — it's still valid till 2024, so you have a year to go. All right, okay, so let's get started. We have a ton of material to cover today. We are going to pick up where we left off in the last class. If you remember, we had talked about two-phase locking, and then we had started to talk about hierarchical two-phase locking, which was a way to balance the number of locks we have to acquire against the parallelism we allow in the system. And if you remember, there was this compatibility table. So life is no longer just a shared lock and an exclusive lock; now we also have these other lock modes, including these weird things called IS and IX, which signal an intention to do something as you traverse down the hierarchy. And there's a very interesting lock mode called the SIX lock, which says: I have a shared lock on everything below, and if I need to grab something in exclusive mode, I will grab an X lock explicitly, but the S applies to the entire subtree at which that SIX lock is set. So how do we use these? Let's go look at a couple of examples. Imagine we have a very simple database: one table and a bunch of tuples below it. And we have a bunch of transactions. Remember, the thing we are trying to do is get as much parallelism in the system as possible without having to acquire a lot of locks, because acquiring locks has overhead.
So you have to put stuff in the lock table, and you have to do deadlock management and all kinds of other stuff. So we were working with Andy and his bookies example, where you want to read Andy's record from this table. That's transaction T1. And so we'll start by accessing the record. You'll always access the record through the hierarchy — the hierarchy mirrors the storage hierarchy. So you'll open up the file, start a scan on it, and read these records as you go through it. And the way you go about doing this is you set an intention-to-share lock on the table. You're not grabbing an S lock on the table, because you don't want to block everyone; you just grab an IS lock on that table. So if you go back again to this lock hierarchy, you can see that the lock mode that's really restrictive is X, right? Nothing is compatible with it. You want lock modes with a lot more green in their rows of the table. And you can see how the IS lock has a lot more green in its row — a lot more lock modes are compatible with it — but it's not an explicit shared lock. You have to grab that explicit lock mode as you go further down. So coming back over here to our example: to read a record, you grab an IS lock on the table. The other compatible lock modes are still permissible, so other transactions that need those lock modes can still proceed. And then you go and grab the S lock on the record that you need to read. Okay, for now, assume indices are not present — we'll talk about that briefly today, if not in the next lecture. So transaction T1 grabbed only two locks: an IS lock and an S lock. So far, it seems like if I only had S and X locks and I'm only reading one record, I grabbed one extra lock, right? I grabbed an IS lock plus an S lock. So did I do worse here? Maybe. But then why do we have these lock modes? Let's go and imagine what happens in this case.
If you didn't know where Andy's record is, you might actually be going through and grabbing a whole bunch of locks on these tuples. So now let's take a look at another mode. Let me just back up from there. We're assuming no indices. In this case, we still assume that we know roughly where Andy's record is — tuple 1 — so we just have these two locks, right? If you didn't know, you would go and grab S locks as you go further down. If you knew that you needed read locks on all the records below, what you would have done is grab an S lock or a SIX lock on the table — and more likely, in this case, you would have just grabbed an S lock. So that is permitted. If you knew that you're going to touch every record, there's nothing that stops you from grabbing an S lock on the table itself. But when you think you don't need that lock at a higher level and you can do with a weaker lock — because you kind of know what your access path is going to be below — you can grab a weaker lock mode up above. Okay, so let's go into this with a little bit better example. Yep — so the question is, what's the purpose of an IX lock? It's exactly what we are going to do right now. Imagine I had grabbed an S lock on the table and that's all I had done — I didn't have these different lock modes. A concurrent transaction that wanted to go and update the bookies record would be blocked. And Andy's record and the bookies record are obviously different records, right? So what is permissible now, because you have different lock modes, is that the second transaction can say: I'm going to access the table and some record below it, but the table I'm only going to lock in IX mode — I intend to lock a record down below.
And then on the record it wants to write — the bookies record, which is this last record in that case — it'll grab a write lock. So now we allowed a read of some portion of this table to happen while a write was happening to some other portion of the table, because we had these two different lock modes. So we are allowing more of these things to happen. And these different lock modes also give us options: nothing stops us from grabbing an S lock on the table and switching over to a protocol like that. We just have more toys to play with now. Question? Yeah. So maybe I confused the situation a little bit. I said ignore the indices, but then in Andy's case, I still only set one lock on the record. So the natural question is: how did you know that tuple one was Andy's record? I'm assuming I know that. If I didn't know that, and I had to grab an S lock on everything, then after the IS lock on the table for the first transaction, as I go further down, I would have to grab S locks on everything. At that point, I could decide whether the S locks are at the page level or the tuple level, but I'm still grabbing way more locks. So as I said, if I know I'm a read-only transaction reading everything, I will not do the IS lock on the table; I will just grab an S lock on the table. I'm still allowed to play all the games I was playing before, right? I just have more room now to play around with things. Now, if I knew there was an index and I knew where record one was, that's where this really shines. I'm not showing an index as a separate access path, but the general theory works out: this resource hierarchy of where the data is organized is not necessarily a tree — like database to tables to records — but a DAG. So it could be database to table to indices to records, and you could have access go through the index to the records, and this whole theory still works.
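The compatibility rules the lecture keeps pointing back to can be written down directly. Here's a small sketch — hypothetical Python, not from the lecture — of the standard IS/IX/S/SIX/X matrix, used to check the Andy-versus-bookies scenario: T1 takes IS on the table and S on Andy's tuple, T2 takes IX on the table and X on the bookies tuple, and the two table-level modes coexist.

```python
# True means the two lock modes can be held on the same object by
# different transactions at the same time (the "green" cells).
COMPAT = {
    ("IS", "IS"): True,   ("IS", "IX"): True,   ("IS", "S"): True,
    ("IS", "SIX"): True,  ("IS", "X"): False,
    ("IX", "IX"): True,   ("IX", "S"): False,
    ("IX", "SIX"): False, ("IX", "X"): False,
    ("S", "S"): True,     ("S", "SIX"): False,  ("S", "X"): False,
    ("SIX", "SIX"): False, ("SIX", "X"): False,
    ("X", "X"): False,
}

def compatible(a, b):
    # The matrix is symmetric, so look the pair up in either order.
    return COMPAT.get((a, b), COMPAT.get((b, a)))

# T1 reads Andy's record: IS on the table, then S on tuple 1.
# T2 updates the bookies record: IX on the table, then X on tuple N.
assert compatible("IS", "IX")    # both table-level locks are granted
assert not compatible("S", "X")  # would conflict only on the same tuple
```

If both transactions had needed the same tuple, the S/X conflict at the tuple level would serialize them — the intention modes only resolve things at the table level.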
And again, plug for the advanced database class — we'll talk about DAG structures and stuff like that — but that's what was happening over here. I kind of knew where tuple one was, but I didn't show you the index, which seemed a little confusing. The main point is: if I wanted to grab an S lock, I could still do everything I could do with just two lock modes. I can just do a lot more now. Okay? Yep — so the question is: since the transaction is scanning, can I grab an S lock, release it, and then go and grab the next lock, like the latch coupling we were doing in the B-tree? The answer is: what would we violate? 2PL. You would violate 2PL. That means I no longer have a serializable schedule. So we shall not do that if we want 2PL semantics. Furthermore, to do strong, strict 2PL, we'll keep those locks till the very end. Okay? But, you know, as we start to play around with the advanced concurrency control protocols that we're gonna start on today, you'll see we'll start to do exactly that. We'll try to let go of stuff, or we'll do things with timestamps to try and guess what would happen and give ourselves a little bit more room to rearrange things, right? Locks are a pessimistic form of concurrency control, and they basically process conflicts in the order in which things are happening. As we do timestamps and stuff, you'll see we can start to play other games. So the next question is about the practical difference between having an IS on the table and not having an IS on the table — like if we just had an S on the table and S locks down below. The practical difference is that if we wanted to do an S lock on the whole table, then we have a conflict: the IX lock would not have been allowed, and T2 would not have been able to proceed. Exactly, that's right. So the question was, what's the role of the IX in this situation? Imagine I knew I needed to read all the records.
I would grab an S lock on the table, and that would block everything. But imagine I also had this index and I said, you know, I really don't think I'm reading the whole table — I'm just reading a few records. I'll determine that by going to the index, which is not shown here, since we're trying to keep the material simple. But that index is going to tell me: go to tuple one. And on that, I will grab an S lock. That index is also telling me: go to tuple N for the bookies record, and grab an S lock on that. And so I can go make all of this happen. As I said, you can still do the S and X lock stuff at any level, as before. You just have more playroom now, okay? Yep, yeah. So the question is, how does this all relate to the latch coupling stuff that we did in the B-tree? I would say those are separate things. The intention lock is meant to work with any arbitrary storage hierarchy in which you have these containment-like semantics: I've got some organizational structure like a table, below that is another structure, pages, and then records — I didn't show pages over here. The latch coupling stuff that you used in the B-tree was very specific to the B-tree. And there you were letting go of the latches that you were taking and then moving forward. Semantically, what we are trying to do there is treat the B-tree as a logical structure and allow maximum parallelism in it. All I need of the B-tree is that it retains its semantic structure, so that when I go looking for a key, I will find it if it's there, and if it's not there, I won't. So we are playing tricks over there by releasing stuff. You start to violate strict 2PL, but the tree as a whole still behaves, in the broader scheme, as a semantic structure. The nuance here needs a full lecture, but think about it this way.
When we talked about conflict serializable and view serializable, remember we said that with view serializability, if I know the application semantics, then I can play games that seem wrong, but the application semantics is all I care about, and I get more admissible schedules. That's kind of what we are doing inside the B-tree: keeping its structure intact. We are playing all these games because we know what the B-tree's semantics are — we are playing with the semantics to get more parallelism in there. Hierarchical locking, by contrast, is a completely general scheme: I don't need to know what a page and a record are. All I'm saying is the page is a collection of records — it's a slotted page, whatever, doesn't matter. You could apply this to LSMs and it would still work. Any hierarchical structure — this will work. And again, this is a plug for the advanced database class, where we spend a whole semester, or half a semester, talking about these types of things. So the question is: does this illustrate the difference between latches and locks? Yes, definitely. What we were doing in the B-tree was latches, but remember, latches and locks are both trying, at some level, to make some form of parallelism safe. So that's kind of the interesting thing: there are ideas that can be crossed over from one side to the other. Should there be hierarchical latches? Nothing in the theory here says you can't play games like that. Now, would that be overkill for the types of things you need to do? Maybe, maybe not — these are open questions. They're going after the same types of questions. They've come from different places, but the concepts we're talking about over here — two-phase locking and intention locks — are very general. And there are more than two types of latches now, right? Many advanced programming languages have more latch modes, and they start to play games that look like this. Great question.
Both of them are trying to achieve higher parallelism by making some notion of safety hold in the application, right? They share that same philosophy. And if you're taking an operating systems class, you're always encouraged to think about mechanisms, right? Mechanisms are general — they're tool sets that you can use for doing things. This is a general mechanism that works with any hierarchical structure to make it safe, okay? You might apply this in all kinds of places, even outside databases, once you understand the concept. Okay, great questions. All right, so now you can see what we are doing over here with these intention locks. Let's keep going and take a look at how we might play around with three transactions. So here's a transaction that's going to scan all the records, okay? It needs to scan all the records, and what it can do is grab a SIX lock. The SIX lock says: I have a shared lock on this — and by "this" it means everything below. Notice that in the whole hierarchical structure, we require that all access to the records at the bottom of the tree has to go from top to bottom. You have to follow that protocol, top down. Again, the advanced database class talks about it more formally: you always acquire top down and release bottom up. Ignore all of that if you like, because it'd be an hour's conversation — "top down" is all we need for now. So you'll follow top down. This means no one can go grab an S lock on a record without coming through me — they have to come through the hierarchy. So I'm putting a lock there saying: I have a shared lock, plus something else — I may actually go and get an X lock on some records below. So this is a transaction that may want to read all the bank accounts, for example, and give $50 more to the highest bank account.
First, it needs to determine which the highest bank account is — or accounts, if several have the same value — and then do some updates, right? Read everything and update a few things, okay? So we'll start, and under the protocol and the hierarchy — I'll let you work that out — after a SIX lock, I am allowed to take an X lock on a record below in the same transaction. That is allowed by the protocol, okay? And that's what the SIX lock lets us do as we go further down. Now a second transaction that just wants to read a single record comes in. That's the only record it wants to read. As you can see, there's no conflict between these two transactions, because they are doing different things, right? They're not conflicting on that record. But if I didn't have a SIX lock — if table-level locks were the only level of hierarchy I had — I would have to grab an X lock on the table, and I wouldn't get parallelism. Remember we talked about how MongoDB had a global database lock, right? Of course, they don't do that now. But in this case, what I can do is go and grab an IS lock — intention to share — which is compatible with the SIX lock in the matrix, and I'm allowed to proceed. Now notice what happened. Transaction T1 only grabbed two locks, so it's efficient: two locks, and it locked exactly what it needed. And transaction T2 also grabbed two locks — and yeah, you might say it grabbed an IS lock that it wouldn't have needed under a simpler scheme, but that's the trade-off. Some transactions are gonna have to take a few more locks, but the bigger ones are gonna have a lot fewer locks to acquire, right? It's a trade-off. And again, as I said, nothing tells you you have to use all the lock modes. It's just saying: if you want to play these games, you now have the way to do it.
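To make the three-transaction scenario concrete, here is a minimal table-level lock manager sketch — hypothetical Python, with names invented for illustration; a real lock table also tracks tuple-level locks, deadlocks, and wakeups. T1 scans everything and may update a few records, so it takes SIX on the table; T2 reads one record, so IS suffices and is granted alongside the SIX; a third transaction wanting an S lock on the whole table must wait.

```python
# For each held mode, the set of requested modes compatible with it
# (a compact, symmetric encoding of the compatibility matrix).
COMPAT = {
    "IS":  {"IS", "IX", "S", "SIX"},
    "IX":  {"IS", "IX"},
    "S":   {"IS", "S"},
    "SIX": {"IS"},
    "X":   set(),
}

class TableLock:
    def __init__(self):
        self.holders = {}   # txn id -> granted mode
        self.waiters = []   # (txn id, mode), FIFO queue

    def request(self, txn, mode):
        # Grant only if the requested mode is compatible with every
        # mode currently held by some other transaction.
        if all(mode in COMPAT[held]
               for t, held in self.holders.items() if t != txn):
            self.holders[txn] = mode
            return "granted"
        self.waiters.append((txn, mode))
        return "waiting"

lock = TableLock()
assert lock.request("T1", "SIX") == "granted"
assert lock.request("T2", "IS") == "granted"   # IS coexists with SIX
assert lock.request("T3", "S") == "waiting"    # S conflicts with T1's SIX
```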
This other transaction that wants to read all the records can come in, but when it tries to grab, say, an S lock on the table, that won't be compatible with the SIX lock, so it's gonna have to wait, right? In the compatibility matrix, that's not allowed, okay? So it doesn't mean transactions never have to wait. Some of them will have to wait, but you're gonna get more parallelism, and you're trying to strike this balance between how many locks transactions have to acquire and how much parallelism you allow in the system. Okay, questions. Yep — is there a way to upgrade locks? Yes, and we will defer the details to the advanced database class. It could be that you say: you know what, I'm taking a lot of X locks on tuples in transaction T1 — can I go and upgrade my SIX lock on the table to an X lock? That's allowed. You go through an upgrade protocol: the lock table, which I told you is built on a hash table, has waiting requests in it, and upgrade requests are in there too. Upgrade requests are treated differently from requests that are just waiting, because you might want to give them higher priority — let them get ahead of the queue and finish their work if it doesn't conflict, okay? Other questions? Yeah, yes. So, a SIX lock versus an S lock — let's just go back here. Yeah, so you would do a shared lock, and again it'll depend on how you're implementing it. If you say, I know in this transaction that I'm gonna read everything — every record — then I will grab an S lock on the table. But the other transaction took a SIX lock because it said: I also want to update some. So it'll depend on the operations the transaction wants to do. And you'll try to grab the weakest lock mode at the highest level. Yeah — so the question is, what if I have to lock a subset of the object?
So let's take transaction T1 — and after this I'm gonna stop, because we could spend three lectures on this and I'd love it, but I need to get through some other material. These are great questions, and you're asking a very interesting and important one. I am transaction T1, okay? Let's focus on that. Say when it started, the code was written as: begin transaction; scan all the records to find the interesting bank accounts; then, for those interesting bank accounts, a second SQL statement goes and updates them; end transaction. So the transaction's gonna do two pieces of work. Does it know how many records are going to be updated in that second phase, for which it needs an X lock? What if all the bank account balances are exactly the same? They're all the highest, and everyone needs to be given 50 bucks. Or what if the number of things I need to update is more than one — maybe it's half the records? Imagine I was reading everything, and then when I start to update, I find I'm updating everything. Wouldn't I have been better off just grabbing an X lock on the table up front? The answer is yes. But that can go through the lock upgrade request, for example. And so these protocols are defined in more detail as to what you do. The general rule is: try to grab the weakest lock at each level of the hierarchy as you go down. That way you're allowing as many others as possible to go through. And then if you start to find — whoops, I'm grabbing a lot of X locks — I should upgrade my table-level lock to an X lock, and you'll go through an upgrade path. So don't make static decisions. You'll try to make more dynamic decisions, because you don't know until you actually start to look at the values. Again, this is a full lecture in the advanced database class, but you guys are thinking of the right things: what am I winning, and how do I win? And it's a complicated answer.
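The "don't make static decisions" point can be sketched as a tiny escalation policy — hypothetical Python; the threshold of 3 is made up, and real systems pick their own limits. The transaction starts with IX on the table plus per-tuple X locks, and once it has locked "too many" tuples, it upgrades the table lock to X and drops the fine-grained locks.

```python
# How many per-tuple X locks we tolerate before escalating to a
# single table-level X lock (an invented threshold for illustration).
ESCALATION_THRESHOLD = 3

class Txn:
    def __init__(self):
        self.table_mode = "IX"   # intention to write tuples below
        self.tuple_locks = set()

    def write_tuple(self, tid):
        if self.table_mode == "X":
            return               # the table lock already covers everything
        self.tuple_locks.add(tid)
        if len(self.tuple_locks) > ESCALATION_THRESHOLD:
            # Upgrade request: one coarse lock replaces many fine ones.
            # (A real system must ask the lock manager, which may block.)
            self.table_mode = "X"
            self.tuple_locks.clear()

t = Txn()
for tid in range(5):
    t.write_tuple(tid)
assert t.table_mode == "X"       # escalated after the 4th tuple
assert len(t.tuple_locks) == 0   # fine-grained locks were absorbed
```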
The only thing you need to know is that you now have more toys to play with, and if you follow the protocol, everything is safe. All of 2PL still holds: you grab these locks, whether they're intention or regular locks, and hold them till the end of the transaction — and for strong strict 2PL, you hold them till the very end and don't go through that monotonically shrinking phase. So all the things we talked about for 2PL work with hierarchical locks. And that's beautiful. And you can prove it — there's a paper that proves it formally and says: you don't have to take my word for it, here's the proof. All right, lock escalation. We just talked about that. If I have to switch over and upgrade a lock to something else, that's called lock escalation, and there are protocols you follow for that too. All right, notice that with all of these things — just reiterating from the last class — you're not acquiring these locks manually. You, as a SQL application programmer, are typically not acquiring the locks manually. SQL statements do have options to let you lock entire tables — not recommended; you'd only do that if you're a power user who really knows what you're doing. But in general, these locks get acquired at the right point inside the system. You, as a database system programmer — if you're the person developing the database system — will have to worry about that and find the right abstraction: whether it's on the call to the buffer pool, or the call to open the file, or open the page, or open the index, you'll have to start making these lock calls in there. But the application programmer generally doesn't. However, many database systems have options to allow explicit locking of tables. It's not part of standard SQL. For example, Oracle, Postgres, and DB2 have similar — or the same — syntax: you can say LOCK TABLE, the name of the table, IN, and give an explicit mode.
They'll typically only let you request shared and exclusive locks. And you're saying: look, I know you're gonna do all this hierarchical locking and stuff like that, but I know what I'm doing — I want a shared lock, don't try to do this other stuff. And they may not even be doing hierarchical locking; they may be doing some other locking protocol, or a timestamp-based protocol like MVCC, which — as we'll see — is what Postgres does. But this allows you to say: I know what I'm doing, go ahead and grab that. Now, along with great power comes great responsibility, and the application code had better know what it's doing — so it's generally not recommended. You start splintering your SQL code with all of that stuff. We'll talk about isolation levels, which you can set at the database level to say I want read committed or read uncommitted, but that's the lecture we're going to start on next. Okay? SQL also has modes for when you're doing a SELECT query and you want to set an exclusive lock on the matching records. You can do that — kind of like the transaction I was describing that reads all of this stuff and then updates some part of it. So there are all kinds of ways in which you can give SQL hints about what to do so that you get locks at the appropriate time. You don't have to — the system will do the right thing. But if you want to, it's kind of like query optimization: the system does things by itself, but every database system also has hints where you can say, oh, R is joined with S — do that join first, and use a hash join for it, don't try something else. So SQL has optional hints. Again, those hints are not part of standard SQL, but you can give hints to say: I'm gonna tell the optimizer what to do, or at least tell it where to look. Similarly, there are ways you can explicitly take over some of these transaction mechanisms.
So, wrapping up the 2PL part: two-phase locking is used in almost every database system, because this whole idea of how to get concurrency control is super important, and that theory is what all the products are built on. We talked about locks and the protocols — 2PL and strong strict 2PL. When you do locking, that doesn't mean you're completely out of trouble: you can still get into deadlock, so you need deadlock handling — you can detect deadlocks and resolve them, or you can do deadlock prevention. And of course, we talked about hierarchical locking and all the other fun stuff that comes with it. All right, let's wrap this part up — as you have probably guessed, we are running behind in the semester. The good thing is, if you keep asking questions and make us run behind, then there's less stuff we can ask you about in the final exam. But that also means we won't get through the last chapters we hoped to cover — so it's a trade-off. But keep asking questions; it's good. All right, so now we are going to talk about a different way to do concurrency control. We covered locking and looked at all the protocols, and effectively the main theory we got from everything we've discussed is that we want this notion of serializable schedules, so that we can allow arbitrary interleaving of actions from concurrent transactions, maximize the parallelism, but at the end of the day guarantee that the database is in some consistent state, as set up by the theory of serializable schedules. We largely focused on conflict serializability, and we said there's this notion of view serializability that allows a little bit more — and we touched on view serializability today in connection with a different protocol. So, two-phase locking: what are we trying to achieve? We're trying to do concurrency control — the isolation part of ACID, right? It is the I in ACID.
We're still on that topic, and when we introduced it, we noted that locks are a pessimistic protocol, right? If you and I are going to have a read-write conflict or a write-write conflict or a write-read conflict, the lock is basically a way of saying: I'm noting that down, and I will stop it at the first arc — I'm not gonna let the cycle in the dependency graph close, right? As soon as the arc forms, I'm gonna suspend one of the transactions, so the loop never closes. And hopefully you got that as we discussed it over the last two lectures. There's something we're gonna start talking about today called timestamp ordering. That's not gonna need locks, right? All this discussion we had today was: locks are expensive, you have to grab this many locks, you have to figure all of this out. With hierarchical locking, yes, we made life a lot better — but is there a different way? And that's what we are going to look at with these timestamp-based protocols. We'll start with a very simple textbook example called timestamp ordering. That's the name of the protocol, and no one uses it as-is, but it introduces the concepts on which we build the rest: optimistic concurrency control and MVCC, which a lot of systems use, and which is the topic for the next lecture, okay? So it's the foundational stuff we're going to talk about. Now, as we start talking about these protocols, a quick note: they have names like "timestamp ordering" and "optimistic concurrency control" — the names they were given when the protocols were invented in the 70s and 80s. As you'll see, the terms used there can feel a little confusing alongside everything we talk about now, because the terminology was still evolving at that time. So bear with us; we want to keep the historical name for each protocol.
And I'll try to point out where a name doesn't mean what it seems to mean, but we'll still keep the name around, okay? All right, if we can, we're going to try to get through both the timestamp ordering and the optimistic concurrency control protocols today — I don't expect we'll get through both, but we'll try. So, the timestamp ordering protocol takes a philosophically different approach. We're going to use timestamps, and the way we'll use them is we'll associate them with records. Let's assume we are doing everything at the record level, to keep life simple. We'll mark when a record was read, and mark when it was written — we'll keep timestamps like that around. (It's not always these two timestamps; as you'll see with optimistic concurrency control, it'll be just one timestamp.) The general idea is: if I can mark every time I read or write an object, then I can use those timestamps and ask, hey, these two transactions that read or wrote this record — are they conflicting? If they touched it at very different times, then I'm okay going forward. If they're conflicting, then can I resolve the conflict by finding a serial order in which it all works out? And I have to do this across all the records that transactions touch, okay? We'll also give transactions numbers. And those numbers are no longer just random numbers — they're gonna mean something. If I'm transaction 10 and you are transaction 20, then all my work should appear to be done before your work. Our numbers correspond to the order of the equivalent serial execution schedule: the lower-numbered transaction's work must come first in that equivalent serial schedule. Of course, the work is all happening in parallel, but eventually we have to support a serial schedule, right? And the serial schedule — remember, when we had two transactions, we said it's either T1 followed by T2, or T2 followed by T1.
Now these T1s and T2s — the numbers are gonna mean something, and we're gonna need to ensure one happens before two, if those are the transaction numbers. Okay, and then we'll use the timestamps and play tricks like assigning the numbers carefully to get more parallelism. We'll see how we do that. So, is everyone okay with the basic setup we now need? Timestamps and transaction numbers now matter; they're not random numbers — they determine the order of the serializable, conflict-serializable schedule that we are trying to achieve. All right, now where do these timestamps come from? What do they look like? Some systems will use a wall-clock timestamp: just grab the wall clock, and that's my timestamp. But obviously, if you're a distributed system, the clocks may be out of sync, so you can't quite do that. Sometimes people say there's a global counter, and I can read it to get my number. But of course, two people may try to read that number and increment it at the same time. Luckily, in hardware you've got instructions that atomically let you read and update a number. So if I have a global counter and I protect it with one of those instructions that the hardware can do atomically, then I can build a counter that I can count on. But again, that works if I have a single machine. If I have distributed machines, I have to do something else, which is what Spanner and all these distributed systems have to do. So it's a logical counter, or some combination. It's not that important where these numbers come from — it is, of course, important if you have to implement it, but the material today will just assume there's some way we're getting these numbers. Even the counter has to be protected, because multiple people may be trying to write to it at the same time. So it's a non-trivial thing.
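The "protected global counter" idea can be sketched like this — hypothetical Python; real hardware would use an atomic fetch-and-add instruction, which we emulate here with a mutex around the counter. On a single machine this yields the unique, ordered transaction numbers the protocol needs.

```python
import threading

class TimestampOracle:
    """Hands out unique, monotonically increasing transaction timestamps."""

    def __init__(self):
        self._next = 0
        self._mutex = threading.Lock()

    def next_timestamp(self):
        # The lock stands in for the hardware's atomic read-and-increment:
        # no two callers can observe the same value.
        with self._mutex:
            ts = self._next
            self._next += 1
            return ts

oracle = TimestampOracle()
stamps = []
threads = [threading.Thread(target=lambda: stamps.append(oracle.next_timestamp()))
           for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Even under concurrency: 100 distinct timestamps, no duplicates.
assert sorted(stamps) == list(range(100))
```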
You can't ignore that if you're trying to implement something. Okay, so we have a number. We have these timestamps. Let's start with the basic timestamp ordering protocol, which has the following components. Transactions are going to read and write objects, but there are no locks now. This is a competing scheme to locking: no locks are needed. But remember, strict 2PL and 2PL were all about getting a serializable schedule. We will still get a serializable schedule, but without doing any of that stuff. So it's a completely different way of thinking about it. Okay, every object will be tagged with the timestamps of the last transactions that read it and wrote it. So again, every object has these two timestamps, and when I say object, just think of it as a record, but it's generalizable to other things. If I'm doing page-level locking, the object is a page; if I'm doing file-level locking, it's a file; it depends on what that notion is. Now, the main principle, the one written up here at the bottom, is that we'll use these timestamps in the following way. If I'm trying to do something to an object, a read or a write, I'll look at the timestamps there and ask: whoops, what do these timestamps tell me? Are they ahead of me in that logical transaction-number order? If so, I need to back out, because if I insert my operation now, I'm guaranteed to end up with a non-serializable schedule. Okay, so now our problem just becomes how to develop these conditions, these simple equations, that tell us when I shouldn't do bad stuff. And of course, every time I do some operation on an object, I'd better go update the timestamps, so that I leave a marker behind saying I was here and this is what I did to it. So let's get going.
So the basic TO protocol, again, as I said, is not practical, but it provides us a foundation. Every time a transaction wants to read an object, it does the following. You're going to start to get a little more familiar with these equations, so let's slow down on these slides. You'll see TS(Ti), that is the timestamp of transaction Ti; think of it as the transaction number, if that's all we have. And then W-TS(X), the write timestamp of that object X. So transaction Ti wants to read object X; that's the action noted at the top. And if I see that the write timestamp of that object is bigger than my timestamp, then something happened to the object that is a future value I should not be seeing, right? Because these timestamps, these numbers, now mean something. The most intuitive way to think about it is to say: I cannot read stuff from the future, okay? Because otherwise, if I start doing that, I'm going to get some sort of anomaly; in this case it would be an RW anomaly. So if I hit that condition, when I'm seeing something from the future, I'll abort. Otherwise, I will read, but now I need to let the world know that I read it. So I will update the read timestamp, and here's the slightly tricky bit, to the maximum of my timestamp and whatever was already there. Since reads are compatible with each other, and you'll see this in an example in a little bit, this max is saying: if another, future reader got ahead of me, I don't care. What I must never do is see some future transaction's write, okay? So you'll see that max in action in a second.
And now, one more thing that we will do here: we're also going to make a local copy of the object we just read, so that if I need to repeat that read, I'm okay doing that. And you'll see that in a second, right? Because X's value will keep getting changed. Remember we had that unrepeatable read anomaly? If I want to guarantee repeatable reads, then I need to make a copy for myself. Yep. Yes. And we'll talk about that in a bit. Yes, this can cause starvation. Locks could cause deadlocks; this will cause starvation, and there are ways of getting around that. Just hold on to that question for about five minutes. Good question. Yep, starvation will happen, and then we'll just make sure the timestamps are assigned in a way that we don't starve someone forever. That's the quick answer. Okay, now we've figured out what to do with reads. Let's figure out what to do with writes. For writes, the conditions are a little more complicated. TS(Ti) is the timestamp of the transaction that's trying to write. And then we check whether the read or the write timestamp of the object we are trying to write is in the future. Okay, again, similar to before, but for writes we have to check both the read and the write timestamps; for reads, we only had to check the write timestamp. If you want a mental model, the previous slide was about the RW anomaly; this one is about WR and WW. And if you understand dependency graphs and why cycles are bad, you can take any complicated protocol, put your mind to it, and it'll start to look simpler. Okay? So this basically says: I cannot write if a future transaction has read or written the object, and I will abort if I detect that condition. Otherwise, I will write, and oh, I'd better tell the world about that, so I need to go update the write timestamp. Okay?
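The read rule and the write rule just described can be put together in a toy sketch. All names here are mine; this ignores commits, restarts, and the local copies discussed above, and only shows the two timestamp checks:

```python
class Abort(Exception):
    """Raised when a timestamp check fails and the transaction must restart."""

class BasicTO:
    """Toy sketch of the basic timestamp-ordering checks (not a real DBMS)."""
    def __init__(self):
        self.db = {}    # object -> value
        self.rts = {}   # object -> timestamp of the youngest reader
        self.wts = {}   # object -> timestamp of the youngest writer

    def read(self, ts, obj):
        # Read rule: a transaction may not see a value written in its future.
        if ts < self.wts.get(obj, 0):
            raise Abort(f"T{ts} cannot read {obj}: written at {self.wts[obj]}")
        # Leave the marker: max() because future readers may already be here.
        self.rts[obj] = max(self.rts.get(obj, 0), ts)
        return self.db.get(obj)

    def write(self, ts, obj, value):
        # Write rule: may not overwrite an object a future transaction
        # has already read OR written.
        if ts < self.rts.get(obj, 0) or ts < self.wts.get(obj, 0):
            raise Abort(f"T{ts} cannot write {obj}")
        self.db[obj] = value
        self.wts[obj] = ts

# The first interleaving from the lecture, which is allowed:
to = BasicTO()
to.read(1, "B")        # T1 reads B: rts[B] = 1
to.read(2, "B")        # T2 reads B: rts[B] = max(1, 2) = 2
to.write(2, "B", 99)   # T2 writes B: neither timestamp is in T2's future
to.read(1, "A")        # T1 reads A: fine, reads never conflict with reads
```

The `max()` in `read` is exactly the trick from the slide: a later reader getting ahead of an earlier one is harmless, so the read timestamp only ever moves forward.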
All right, let's take a, yep? When do we assign the? Yeah, when do we assign the RTS and the WTS? Let's actually go into that right now with this example, and that will make it clear. Okay? So here's an example. I've got a schedule in which transactions are happening: begins and reads and writes. I've got a database, and now associated with each object we're going to have a read and a write timestamp. So every object needs those two values associated with it. And let's start with the begin of a transaction, and let's assume for now that that's when we assign the timestamp. So transaction T1 gets one; that's its number. This number has a meaning, which is that all of T1 must happen before T2 in the schedule we are going to allow. Okay? Now a read happens. This is the first part; remember two slides ago, we read and say, hey, what's the write timestamp of this object? Oh, zero, fine. It's in the past, I can go read it, and that's totally fine. And oh, I need to record that I read it. So I take the max of zero and one, which is one, and I put one there. Okay? That was the max call, if you remember, in the read portion of the protocol. Now I go to the second transaction; the context switch happens, and let's say the second transaction gets to run. It gets transaction ID two. So now all of its actions must come after transaction one in the final serial order. It reads B and says, oh, I'm two, the write timestamp is zero, that's fine. I'll just make sure everyone knows I've read it, so I update the read timestamp to two and move on. Now I get to a write. I have to write object B, and the write timestamp just before this, if I go back, was zero, so I'm fine. The read timestamp is two, not in the future. So I look at both the read and write timestamps, they're not in the future, I can do that write, and I make sure the write timestamp is now my timestamp.
So that answers your question, okay? Then the context switch goes over to T1. It has to read A, and that's okay because A has not been written, so I'll just let everyone know that A is read. So I take A's read timestamp, an object we hadn't touched so far, and set it to one; the same thing happens there, very similar to what happened with B. Then I go back and read the value A again, and that's okay, because the write timestamp is still zero. T2 has interfered with me, but only with a read, so it doesn't matter; reads don't interfere with reads. So notice how on the read side I checked the write timestamp, and now you can see why I didn't need to worry about the read timestamp on the read side: there's no RR anomaly, okay? Now, this is okay, T1 is less than T2, so that's allowed. A write happens, we go update the write timestamp, that's also allowed, and this transaction commits. Even though we had this interleaving, the final state of the database is as if it were T1 followed by T2. Okay, so it's a totally different mechanism that doesn't use locks; it uses timestamps, but we have to use them properly. Let's keep going. Second example: T1 reads A, and now you know how that works, you go update the read timestamp. Then T2's write of A happens, and we go do that. So after the read and the write, object A has read timestamp one and write timestamp two, right? Pretty similar to what we've done so far. Now T2 commits, all right? Then T1 has to go write the value A. What should happen here? Can it write A? The write timestamp is two, right? So following the protocol we just stated, it cannot. It violates that condition, because the object was written in its future. I'm only one; my serial order in the conflict serializable schedule is one followed by two. How can I be seeing T2's stuff? That's wrong. So I cannot overwrite that. So transaction one has to go and basically abort, and then it has to restart.
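That second, aborting example can be replayed directly with the two timestamp checks. This is a self-contained sketch where plain dicts stand in for the per-object metadata:

```python
# Replaying the second example: T2 reads and writes A and commits; then
# the OLDER transaction T1 tries to write A and must abort.
rts = {"A": 0}   # read timestamp of A
wts = {"A": 0}   # write timestamp of A
T1, T2 = 1, 2

# T2 reads A: allowed, since wts[A] (0) is not in T2's future.
assert not (T2 < wts["A"])
rts["A"] = max(rts["A"], T2)           # rts[A] becomes 2

# T2 writes A: allowed, rts[A] (2) and wts[A] (0) are not in T2's future.
assert not (T2 < rts["A"] or T2 < wts["A"])
wts["A"] = T2                          # wts[A] becomes 2; T2 commits

# T1 tries to write A: T1 < wts[A], i.e. a future transaction already
# wrote A, so T1 must abort and restart with a fresh timestamp.
must_abort = T1 < rts["A"] or T1 < wts["A"]
assert must_abort
```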
At that point it restarts, it'll grab a new timestamp, and go about its business, okay? And when it gets a new timestamp, its timestamp will be more recent, though as we'll talk about, there are other ways of doing that. Yeah, so the question is related to the cascading-aborts kind of situation, right? What if there was some other interleaving between the write and the read, and there was a cascading abort situation? So can cascading aborts happen? Yes, in a similar kind of way. To avoid cascading aborts, you would basically do the same thing as we talked about: you'd have to keep what's called the commit graph, tracking who's committed and when I can commit. And again, I will defer that topic, because we could go down a rabbit hole figuring that part out, but it's very similar in philosophy to what happens with all the abort stuff. Go for it. Yeah. Yes, correct. So it's a bad protocol from that perspective: now even a pure reader has to update a timestamp. So it's making updates to someplace in the database, and that is expensive. So no one implements this as-is; it's just an example to get us going, and we'll talk about more efficient ways to do this. Absolutely. Does it happen at runtime, or in the planning, offline stage? What part happens at runtime? Can you be more specific? Yeah, the timestamp stuff? Yes, yeah, like checking for conflicts. Checking at runtime. So the question is, does the timestamp check happen at runtime or someplace else? Runtime. When I'm accessing the record, I'll check. Yep, it's checked on every access. That's correct. And the same caveat applies in the distributed case. Yep, you totally got that.
So it all has to happen at runtime, because the only thing you can do at static time is grab an X lock on the whole database, since you don't know what you're going to touch, right? And as we've talked about, if you were building a database in a rush and only cared about correctness, that's what you'd do, but it would be a very slow database system. Questions? Can you analyze the transactions directly? For that, I would need to know all the transactions that are going to arrive while I'm running. I don't know how long I'm going to run; I don't know what's going to come while I'm running. If I had a fixed schedule of transactions, if my database system on Monday morning only ran these two transactions and they touched only these two records, then yes, I could make a perfect plan. But you can't do that, right? The database gets queries when it gets queries, and you don't know what data a transaction is going to touch until it actually starts running. But it's good, you're thinking in the right ways: oh, could I get better at this if I knew something about the timing and the properties of these transactions? What we want is to build something completely safe and general, so that no matter what happens, we are efficient and correct, okay? Which is hard. Which is hard. All right, other questions? Both of these are committed right now? So the question is, do you abort T2? Sorry, T2 has already committed, so it's fine; T1 is the one we'll abort. Doesn't T1 come before T2? Yeah, so you're asking a good question, because I said T1 got transaction number one, so it's as if it is ahead of T2 in the serial schedule. That is true, but right now there's no dependency from T1 to T2, right? T2 is reading stuff that was already there before. So effectively, once we abort T1, it's as if the world had started with only T2 in the picture.
No, no, which write? So right now we are on the write of A in T1. We'll abort it, so that write won't go through. Yeah, yeah, I know exactly what you're saying. So you're saying: I started by saying that if T1 and T2 are both in the system, I want the serial schedule of T1 followed by T2. I'm playing a little loose here. We are aborting T1, so it's as if I'm saying, oh, you know what, I went and fixed things in a correct way so that T1 never existed. When an abort happens, it's all related to all this cascading abort and other stuff too. We've always talked about equivalent serial schedules as transactions T1 and T2, but implicitly we've always been assuming T1 and T2 commit. If a transaction aborts, it's as if it never existed. So it's a little trick. On the slide I can only fit two examples, but if you imagine T1, T2, T3, where T1 aborted because of some violation and T2 and T3 safely got along, then it's as if the schedule were just two followed by three, and the aborted transaction never existed. It's kind of like I can go and rewrite the history of the past for an aborted transaction. Yeah, but that's a great observation. It's like, whoa, whoa, you were telling me one followed by two, but you took away one. You took away one because we're changing the rules in that specific way. Yeah. Couldn't you intelligently schedule these? Because it seems pretty clear by looking at it. Yeah, so the question is, can you intelligently schedule this? Yes, but I don't know what T1 is going to do when T1 starts. I don't know if it's going to write A. It only learns what it's reading and writing as the transaction proceeds; that's the whole game. We don't know what the transaction is going to do until it starts doing its work. Isn't there any way to analyze it?
Not necessarily, because suppose I've got a transaction that reads all the bank records and gives a $50 bonus only to the highest ones. Unless I look at the data, I cannot tell anything statically. You can tell what columns you're going to touch, but that's not going to do you much good, because you don't know which records you're eventually going to update. But isn't that enough to tell you what order to run them in? The transaction, see, time proceeds from top to bottom in all our schedules. At any point in time, imagine we're at the begin of T1: we don't know what the world is going to look like at the next time tick. A read may come, a write may come. We are saying: no matter what you throw at me, I want to make all of that safe for you. If I knew exactly what every transaction was going to do, it's exactly the question that was asked before: if every transaction in my system only reads A and writes A, and the other transactions read and write A and B, and that's all I was doing, I could do all kinds of clever scheduling. But that's a database system that can't do much. So we don't know what reads and writes are going to come until the transaction proceeds. So, okay, yep. Sorry, ask again, sorry. Once we have aborted T1? Yeah, once you have aborted T1. Yeah, exactly. If T1 is aborted, then it will get rerun by some mechanism. That mechanism could be you, as the SQL programmer: you write the SQL code, you check for the error condition, and if you get an abort, you retry, maybe five times or some number of times. So the application code will typically have some handling of that. Okay, all right, I need to keep moving. I'm on slide 10 of 85. All right, great.
So some of you might have noticed, going back to the question that was just asked here: hey, what if I had just let that write of A go through? Because if all I want is to end up with T1 followed by T2, the database already has T2's write of A, which is exactly the final value I need. I could just have thrown T1's write away and let T1 actually commit. And it turns out that in very specific conditions like this, where you have a write over someone else's write, but you are the earlier transaction, you can actually, under some conditions, throw that write away and allow the transaction to proceed. You were perhaps starting to think like that when you asked that question. There's a rule called the Thomas write rule, which says, in a more mathematical form, that when you have that specific condition of a write over a future write, based on these timestamps, where the earlier transaction is just trying to install a value that would be overwritten anyway, you can skip that write and let the transaction proceed. And effectively, with the Thomas write rule, you're now allowing schedules that are view serializable, a set a little bit bigger than just the conflict serializable schedules. I won't prove it, but I'll leave it as a thought exercise: you're allowing more schedules than you would strictly allow otherwise, okay. So it's a proper rule, but it's not that important in practice. As I said, no one writes a transaction management system with this basic TO protocol, though the Thomas write rule is very famous in database systems. If you ever talk to a transactions person, they'll know about it and sometimes refer to it. But in practice it's not a rule that gets used, because, as we've talked about, view serializability is not what we typically end up going for; it's super hard to implement.
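The change the Thomas write rule makes to the write check can be sketched in a few lines. The function and variable names are mine; a full system would also keep a local copy for later reads inside the skipping transaction:

```python
def to_write(ts, obj, value, rts, wts, db):
    """Write check with the Thomas write rule added (toy sketch)."""
    if ts < rts.get(obj, 0):
        # A future transaction already READ this object: still an abort.
        raise RuntimeError(f"abort: T{ts} write to {obj} after a future read")
    if ts < wts.get(obj, 0):
        # Thomas write rule: a future transaction already WROTE this
        # object, and nobody in between read it, so this older write would
        # be overwritten anyway. Skip it instead of aborting.
        return
    db[obj] = value
    wts[obj] = ts

rts, wts, db = {}, {}, {}
to_write(2, "A", "from T2", rts, wts, db)   # wts[A] = 2
to_write(1, "A", "from T1", rts, wts, db)   # skipped, NOT aborted
assert db["A"] == "from T2"                 # the younger write survives
```

The only difference from the plain write rule is that the write-timestamp violation becomes a silent skip rather than an abort; the read-timestamp violation still aborts.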
So I'm gonna just leave one note that Andy had. Andy likes to go dig up all these things: okay, who's this Thomas guy? And when he dug it up, what he found was that this was a guy at BBN, which was a networking company that did some of the earliest internet work, but they were also a think tank, and they did a whole bunch of actually super interesting database stuff in the late 70s and early 80s. And Andy suspects this is the same guy who also wrote the first computer worm. There's a Wikipedia article saying the first computer worm was written by someone called Bob Thomas, also at BBN. So it's probably the same person. And if you know the security literature, there's always an Alice and Bob, right? They always have Alice and Bob trying to do something that someone else is trying to interfere with. So very likely it is the same guy. I'll just leave that in the slides over here; you might find it interesting. It also gives you a little bit of relief from thinking about transactions, which can start to weigh you down. But let's get moving. Here's another example with the basic timestamp ordering protocol. We basically just start going through it: a read happens, a write happens, same as before. Then this commit happens, and now, oops, lost my slide, the write of A that happens next interferes, because A's write timestamp, two, is higher than T1's timestamp. So we do not update that write timestamp. Now, we could skip doing this actual write, as we talked about with the Thomas write rule, if we were supporting that. But there's a read following it in the same transaction. So what you have to do is perform the write, but keep it in a local copy, so that the later read within your own transaction sees what you just wrote and not T2's value.
So you're gonna start making copies of the things you write to, if you want to support that advanced mode, so that the reads in your transaction, which are now allowed, don't see T2's write; they see your write. And so you're making local copies of things. And we'll use this same mechanism of local copies for the next protocol we're going to talk about. All right, so we've already covered this; I'm gonna skim over it because your questions covered it. There are no deadlocks, because there are no locks. There's a possibility of starvation, the question that was asked over here a little while back. If I've got a long-running transaction that starts at one end of the file, I may have a billion records to go through, and I'm doing this at the record level. By the time I get to the other end, someone has probably gotten ahead of me, and I have to go restart. Now, if that happens, there are all kinds of things you could do, including, at some point, saying I just have to pause all these other guys so that I can get through. But there are other protocols for that. All of that is not super important right now; as I said, no one ever builds this basic protocol. We're just using it to understand the other mechanisms. But notice, even in the next mechanism we'll talk about, there's going to be this overhead of copying data into the transaction's workspace. As you saw, if I wanted to allow that read in T1 to happen while operating under the Thomas write rule, I needed to make a copy. We'll do a lot more of that in the OCC protocol; we're going to do it even more aggressively, and that comes at a cost. Locks have a cost, copies have a cost, right? So we're making different choices here. And long-running transactions can be handled, so let's move on.
The key observation, however, and this is important: why didn't we stop at locking? Why are we so interested in these protocols and their properties? It's because transactions are mostly short-lived, which is what happens a lot in OLTP systems, right? Your shopping cart application is just going to pull up your customer record out of all the customers. These OLTP applications read and write very few records. The database might have billions or trillions of records, but every transaction touches a very, very small number of them, reading and writing them. And if everything is short-lived, then forcing transactions, even with hierarchical locking, to come down through the entire hierarchy grabbing all those locks seems a little expensive. And the protocols we're starting to look at now, timestamp ordering and MVCC, which we'll talk about in the next class, perform a lot better in those cases, okay? Now, if you have a lot of conflicts, even in an OLTP workload, say there's a pink Barbie doll at Christmas time, everyone wants to buy that pink Barbie doll, and there's a lock you need to grab on that object for the inventory count, then nothing's gonna save you, right? Everything's just gonna conflict. So it's not just short-lived, but also low-conflict: I'm mostly not interfering with anyone else. In that setting you can allow more parallelism with these schemes; those are the trade-offs you're making. So the protocol we are going to look at now is the optimistic concurrency control protocol. And it was actually invented over here. The locking stuff happened in the late 70s, and H.T. Kung was here at Carnegie Mellon and came up with this beautiful protocol. It's a short paper. You'll read it in the advanced graduate-level class, but even if you don't take that class, just read this paper. It's so beautifully written.
It's one of those short papers where everything you need to know about the protocol is in the paper, and it's not like 20 pages. And every word in there matters. So if you skip a word, you'll be like, whoops, I missed an important detail. And obviously we won't go into the gory details of the whole paper, but we'll get through the essence of the main parts. So what we will do is: transactions are going to read stuff, and they'll create a workspace, like we were starting to do with that local copy of object A for transaction one. We'll create a workspace where we keep everything we read or write in the transaction, not just the things we write. And you'll see examples in a second. And all the modifications the transaction makes happen in that local workspace. So it's kind of like GitHub: you check out the code, maybe just the part you need, you make all your changes there, and at some point you say, I'm going to commit that back, like a GitHub commit. So it kind of works like that. While you've checked out, there's no interference. If you and the other members of your team are working on completely separate pieces of code, the commits will all merge in just fine. The ideas are like that. You do everything in your workspace, and then eventually you have to do that final write to the master. And before you write into what is called the global database, you go and make checks to make sure that everything is safe and correct. All right, so let's get going on that. And today this camera is just refusing to stay focused. All right, so there are three phases. Now, this is where the terminology is going to start to look a little weird. The first phase is called the read phase.
It's called a read phase because, from the global database's perspective, all each transaction is doing in the read phase is reading stuff. But the read phase is when all the work of the transaction happens. It's actually making its changes; the reads and writes are all happening, but on local copies of the database, just like when you checked out your code: to the repo, it just looked like you read it, right? It's only later on that the write comes in. Similarly, it will then go through a validation phase. In the validation phase, we check: is it safe for me to do the final thing, which is to actually make all my changes from my workspace permanent? And making the changes permanent just means the write. Okay, so let's go into that with an example first. So now we have a database. We'll have the notion of checking stuff out into our own workspace. And unlike GitHub, where you might check out the whole repo, here we'll just copy objects into our workspace as we need them. As we talked about, the transaction, as it is running, doesn't know upfront everything it's going to read and write; all of that just evolves, right? As far as the database system is concerned, it's just getting read and write requests. It has to make sure all those requests are done as efficiently as possible, allowing as many concurrent requests as it can while keeping everything safe. Correctness is important, right? Same thing here: we don't know what's getting checked out. So that's where the GitHub analogy starts breaking down, okay? So we start: a transaction T1 starts and reads object A. Notice now that in the global database, this is the main database, right, there's only one timestamp with each object, which is a write timestamp. We don't need the read timestamp.
So we just talked about how, in the TO protocol, every transaction was updating read and write timestamps, and that's very expensive, because even a reader has to do it. Here we have only one timestamp, and we worry about it only on writes. So we still need one more field in the object, but it's a lot better than just a few slides ago. So in the protocol, the transaction now has a read phase, this new thing called a validate phase, and a write phase, and as I said, these are the protocol's terms. The write phase is different from W(A), which is the write to the object; the highlighted labels are the steps of the protocol. So don't confuse the write phase with the write of A. As I said, we are retaining the names from the original terminology in the paper, but hopefully you get it, right? It's just the write phase of the protocol, hence shown in a different color. Okay? So the read phase starts, and as soon as it does, you create a workspace, which is empty, and you read object A into it. A has a value of 123, and you pull that in. You can think of this workspace as being organized as a key-value store: here's the object, and the value is whatever it is, plus the write timestamp, which right now is zero. Then TJ starts, and notice I don't have T1 and T2 anymore, okay? I'm calling them TI and TJ, because whether I comes before J or J comes before I, we're going to decide in a little bit, right? So now we're using variables. TJ comes in, grabs that same stuff from the master database, does its thing, and at some point it reaches the validate phase. In the validate phase it asks: what do I need to do with my stuff? I'll talk about the protocol in a little bit, but at that point it's going to grab its timestamp.
So until now, it's as if TI and TJ were babies born without a name; now we assign them a name, and that name is a number, and that number decides the order, okay? So we delayed it. We delayed assigning that number because we're trying to be optimistic. An optimistic protocol, compared to a pessimistic protocol, philosophically says: I think life is good, most transactions won't interfere with each other, so let them keep doing their stuff. If I pick TI's and TJ's proper names right up front, then I'm giving myself less room to allow parallelism, and I'll just leave that as a statement. You'll have to read the paper or take the advanced database class to really understand that. In fact, you could delay this even further: not just to the end of the read phase, but till the end of the write phase, and for read-only transactions you could say, I don't even give you a number; it's as if you were a ghost transaction that never happened and never interfered with anyone, so you can just go ahead. So there are very cool tricks you can play with when you assign these transaction numbers while still maintaining correctness. When do you name your transactions? When do you give them their names, which are numbers in this case? So TI is assigned its name, then it does its write phase, which had nothing to do in this case, because it was a read-only transaction. TJ actually writes to an object, and notice what happens to the timestamp in the local copy: it's set to infinity, because the transaction doesn't have a name yet, doesn't have that number yet.
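The workspace mechanics just described can be sketched in a toy class. The names are mine, not from the paper, and validation is omitted since the lecture covers it next; the point is that writes land only in the private workspace, tagged with infinity until the transaction is "named":

```python
INF = float("inf")  # write timestamp placeholder: transaction not named yet

class OCCTxn:
    """Toy sketch of the OCC read phase: copy-on-access into a workspace."""
    def __init__(self, global_db):
        self.global_db = global_db          # obj -> (value, write_ts)
        self.workspace = {}                 # private copies, key-value style
        self.ts = None                      # assigned only at validation

    def read(self, obj):
        if obj not in self.workspace:       # check out on first touch
            self.workspace[obj] = self.global_db[obj]
        return self.workspace[obj][0]

    def write(self, obj, value):
        # Writes never touch the global database during the read phase.
        self.workspace[obj] = (value, INF)

db = {"A": (123, 0)}
t = OCCTxn(db)
assert t.read("A") == 123
t.write("A", 456)
assert t.read("A") == 456      # the transaction sees its own write
assert db["A"] == (123, 0)     # global database is still untouched
```

This is the GitHub-checkout analogy in miniature: everyone works on private copies, and nothing merges back until the validate and write phases.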
So infinity means something in the future: think of the write timestamp as saying this copy is valid until infinity, for now. Then TJ actually gets a number, let's say it is two; it puts that in, and finally the copy becomes the permanent copy in the global database. Validate checks: am I safe to go or not? And only if the answer is safe does the transaction go forward. All right. That sounded like a lot of stuff, but it's actually super simple. Remember I told you about 30 minutes ago: if you really understand dependency graphs and the anomalies, the WR, WW, and RW conflicts, and you can picture them happening as you run any protocol, it'll be super easy to understand. The first time I read the OCC paper, the first five times I read it, I found it super complicated; you kind of get it but you don't get it, and you miss certain corner cases. Then I drew the pictures I'm going to show you next, and it became super clear. So before we get into the pictures, the main thing is that in the read phase we keep these local copies, and in these local copies we do all of our writes. The DBMS copies every tuple the transaction accesses from the shared space to the workspace; it's like a checkout system, where you check out a file or a directory. For now, ignore what happens if these reads and writes to records happen via indexes; we won't cover that at all for optimistic methods in this class. How you get to object X through an index, or how you update an index, all of that is brushed under the covers, but these techniques all work with it; that's all I need you to know. Okay, so where are these pictures I've been promising? Optimistic concurrency control works in three phases. Read is where I do all my work in my local copies, checking out from the database only the objects I touch, either for reads or writes. Then I will enter the
validation phase. For the purposes of this course, we're saying the end of the read phase is when I get named, when I get my transaction number. In the validation phase I check: if I take what I've checked out and put it back, am I going to cause some violation of the serial schedule we're all trying to achieve? I have a number now, so I know where I belong in that order. Am I going to cause trouble? If I think so, I'll back out and abort; if not, I'll go to the write phase, where I make my changes happen in the global database. So time proceeds this way, and a transaction's life is broken up into three phases: T1 starts, does a read phase where even the writes happen in the local workspace, then the validation phase, and then finally the write phase. Every transaction has those three phases. The main work of the transaction all happens in the read phase; the rest is concurrency control. If T2 starts and does its three phases entirely after T1, no problem, right? There's no conflict. Trouble starts when you have things like T3. Compare T3 and T2: T3 started before, and these names are not numbers now, 3 is not a number, it's just a logical name; we haven't given it a real name yet, so I probably should have called them I, J, and K. T3 started here, and its read phase finishes way later; it's in the validation phase while other things have happened. It's these kinds of overlaps that we want to make safe. So how do we make them safe? Again, transactions are assigned numbers at the end of the read phase. The paper is beautiful because it says you have to worry about only three things, three checks. For every pair of transactions you're considering, if any one of these checks passes, that pair of transactions is fine. Apply these three conditions to all the
transactions and you are basically done. So what are these three conditions? If any one of them holds, that pair of transactions is safe. I've got TI and TJ, and let's say I is before J; we want to assign I as being ahead of J in the serial schedule, so we want to make sure all of TI's effects happen before TJ's. Pictorially it looks like this: I happens before J, and if all of TI happens before all of TJ, this is trivially safe. TI completely finished before TJ started, and I'm totally fine. It's defined in the paper as: the write phase of TI completes before TJ starts its read phase. A very precise definition. That means there's no overlap, and it's all safe with no conflicts, because you can think of all the changes in I happening before J. Slightly more complicated is condition number two: TI completes its write phase before TJ starts its write phase. So TJ could, as you can see in this example, have started to read stuff that TI is writing, because the write phase of TI overlaps the read phase of TJ. We disallow trouble here with a very simple check: I look at the write set of TI and the read set of TJ, and if they don't conflict, if there's nothing in common, then I declare these two transactions safe. The last, hardest case is where there's a bunch of overlap, and the precise condition is: TI completes its read phase before TJ completes its read phase. You're just saying I before J; every other type of overlap is allowed. The check is that the write set of I and the read set of J don't overlap, which is the same condition from the previous slide, and additionally that the write sets of the two transactions don't conflict. If so, I can tell you it is safe. But this sounds magical: you give me three rules and you're telling me we get a serializable schedule? And this
is where it takes a long time to understand, and this is the master picture. Case one, case two, case three can be thought of in the following ways. I just drew exactly the same conditions as before. Remember, case one said the write phase of I is done before the read phase of J starts; I drew just that. There's a black transaction and a blue transaction, and the red line is the condition that Kung gave in those three conditions. Now, why does this work out? We're saying that if a pair of transactions satisfies case one, it's fine; if it satisfies case two, it's fine; if any one of these checks passes, the pair is safe with each other. So why is this true? As I told you, think of the dependency graph and the three anomalies that cause arcs: read-write, write-read, write-write. Case number one trivially says none of those happen, because everything in I happens before J, so all those dependencies are taken care of; I is less than J, we've given these names and they mean something. The other cases are where we run into trouble. Case number two is slightly more general than case number one, but the write phase of TJ only starts after TI's finishes, so write-write dependencies are taken care of by the timing; that's the red arrow establishing the condition. The only thing the timing can't tell you, which we have to do some work for, is TI's write set against TJ's read set, and that gives me my WR dependency; if the intersection is empty, I don't have any arc of the WR type, so life is good. In case three, the timing can tell me neither the WR part nor the WW part, so I have to do both of those set checks. And that becomes the basis for optimistic concurrency control. Questions? It's actually a beautiful protocol. Yes, you're checking out these copies, for
which there's a cost, but these three conditions cover everything; you check them and you've got a protocol that is correct. Could I explain the write-write case for case two? For case two, the condition is implicit in the definition: case two said TI completes its write phase before TJ starts its write phase. In the protocol, what happens is that when you're in the validation phase, you say: I am this transaction; for everyone else active with me, I check each pair, and for each pair I see whether one of these cases passes. That is the if-condition which says: I know, for this pair, that the write phase of I completes before TJ starts its write phase. So it's that if-then-else statement you're evaluating in the pairwise transaction checks you're making. And it is a disjunction: one of these has to hold. In the worst case you're always in case three, where you have to do all the set checks, write set against read set and write set against write set. How to do these checks efficiently, let's leave aside; it's set intersection, which you have to do efficiently, but there are algorithms for that. Yes, all of this happens during the validation phase. Think of it as happening between TI and TJ. Why am I hesitating to give you an answer? For simplicity, you can assume all of this happens at validation. It's symmetric in the sense that assigning the I's and J's is delayed, but it doesn't matter exactly when; you're just saying, if I'm going to make a decision, I need to make sure this holds against everything else that is active. Yes, but the reads can overlap; they all happen on the local copies. When the reads are happening, the write set and read set, see, this is where the read terminology gets confusing: what does that read mean? In the read phase we are creating the read set and the write set, so that read
phase is where all the work happens, and those write and read sets are getting created. So read phases can overlap and that's fine; it only matters when transactions try to write that you have to start worrying. Does that make sense, or were you asking something else? If you're asking what happens when TJ's read phase starts before TI's read phase ends, so they overlap: then eventually one of them reaches its write phase before the other does, right? So you'll hit case number two, where you have to go check the write phase dependency. Between any pair of transactions you have to check these conditions to determine they are safe; if none of the conditions hold, then the transaction doing the check aborts itself and starts over. Hold on, let me just make sure that got answered. Okay, so I think the confusion is: how can you convince me that there are only these three cases and no others? Is that the question you're asking? Okay, so the simple answer is that if you try to construct any other case, and we can take this offline, it ends up that these are the checks you have to do to declare safety; if you can't establish any of them, you abort, because transactions enter the write phase in a constrained way. Let me talk about the serial protocol, and I'll come back to that question if it's not answered; I know there's a question on the table. This all relates to the fact that I've only talked about the read phase, where you create the read and write sets, and the validation phase, where we do these checks; it's this write phase that everyone's getting stuck on, but that's the next slide. So the question is, how do I do these writes? The simplest
protocol is something called serial commit: only one transaction is allowed to be in the write phase at any given point in time. That is why case number two will never have a write-write overlap. Any arrangement of cases you try to construct boils down to that. All of these questions are of the form "what happens if I have some overlap I haven't thought of," and they all reduce to whether I allow write phases to overlap with each other in some way. The first version of the protocol in the paper says: I grab a latch, and only one transaction at a time is allowed into the write phase; once in, it does all its writes to the global database. It's like only allowing one person at a time to commit to the master repo; everyone else who wants to commit has to wait. You can't have parallel commits, and that's the serial protocol. Obviously the write phase is slow, but there's a parallel protocol, which I'll only allude to at a high level because it takes an hour to get through; let me put a pin in that, and hopefully that clarifies the confusion. Happy to take questions offline if you need. Can you go back to the slide for case number three? Yeah. So for what it says here... yeah, I think I have my diagram wrong over here. Yep, I will go fix that; you're right, slide number 25 is wrong. This is the one I spent like three hours on; this one is correct. Thank you, good catch. Okay, good. What you're asking is: if you take any pair of transactions, will it really fall into one of these three cases?
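Put together, the three conditions amount to a small pairwise check. The sketch below paraphrases the paper's conditions under the serial-write assumption just described; the phase-timestamp bookkeeping (`start_read`, `finish_write`, and so on) is invented for the example, not the paper's pseudocode.

```python
# Sketch of the three validation conditions for a pair TI < TJ.
# Each transaction records when its phases started and finished;
# read and write sets are plain Python sets of object ids.

class T:
    def __init__(self, read_set, write_set,
                 start_read, finish_read, start_write, finish_write):
        self.rs, self.ws = read_set, write_set
        self.start_read, self.finish_read = start_read, finish_read
        self.start_write, self.finish_write = start_write, finish_write

def valid(ti, tj):
    # Condition 1: TI's write phase finishes before TJ's read phase starts.
    if ti.finish_write < tj.start_read:
        return True
    # Condition 2: no WR conflict, and TI finishes writing before TJ
    # starts writing (so the WW order is forced by the timing).
    if not (ti.ws & tj.rs) and ti.finish_write < tj.start_write:
        return True
    # Condition 3: no WR and no WW conflict, and TI finishes its read
    # phase before TJ finishes its read phase (I before J).
    if not (ti.ws & tj.rs) and not (ti.ws & tj.ws) \
            and ti.finish_read < tj.finish_read:
        return True
    return False   # none hold -> the validating transaction must abort

# TI wrote A and finished entirely before TJ began: case 1, safe.
ti = T(set(), {"A"}, 0, 1, 2, 3)
tj = T({"A"}, set(), 4, 5, 6, 7)
print(valid(ti, tj))   # True

# Overlapping, and TK read what TI wrote: no condition holds.
tk = T({"A"}, {"B"}, 1, 2, 3, 4)
print(valid(ti, tk))   # False
```

In the worst case you fall through to condition three and pay for both set intersections, which is exactly the "always in case three" situation mentioned above.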
It should, and it's because the serial commit protocol is in place. I hadn't told you that until now, which is why you were coming up with cases where the three conditions wouldn't hold; that's what keeps every pair inside one of the three cases. The hand-wavy thing I'll add is that you can do better than the serial protocol with something called the parallel commit protocol, which allows parallelism in the write phase while behaving identically to the serial version. There's a whole section in the paper about that, and we'll have to defer it to the advanced database class. Okay, so I'm going to stop here since I'm a little over time already, and we'll pick up from slide 29 in the next class. All right, thank you.