All right. Today's topic is distributed transactions. These come in really two implementation pieces, and that's how I'll cover them. The first big piece is concurrency control. The second is atomic commit. The reason distributed transactions come up is that it's very common for people with large amounts of data to end up splitting, or sharding, the data over many different servers. So maybe if you're running a bank, the balances for half your customers are on one server, and the balances for the other half are on a different server. That splits the load, both the processing load and the space requirements. This comes up for other things too. Maybe you're recording vote counts on articles at a website, and there are so many millions of articles that half the vote counts are on one server and half are on another. But some operations require reading or modifying data on multiple different servers. If we're doing a bank transfer from one customer to another, their balances may be on different servers, and so in order to do the transfer we have to read and write data on two different servers. One way to build these systems, and we'll see others later on in the course, is to try to hide the complexity of splitting the data across multiple servers from the application programmer. This has traditionally been a database concern for many decades, so a lot of today's material originated with databases. But the ideas have been used much more widely, in distributed systems that you wouldn't necessarily call a traditional database. The way people usually package up concurrency control plus atomic commit is in an abstraction called a transaction, which we've seen before.
And the idea is that the programmer has a bunch of different operations, maybe on different records in the database, and would like all those operations to be a single unit, not split by failures or by observation from other activities. The transaction processing system requires the programmer to mark the beginning and the end of that sequence of reading, writing, and updating operations, and it provides certain guarantees about what happens between the beginning and the end. So for example, suppose we're running our bank and we want to do a transfer between the accounts of user x and user y. The balances for both of them start out as 10. So initially x equals 10 and y equals 10, where x and y are meant to be records in a database. We'll actually imagine that there are two transactions that might be running at the same time: one to transfer a dollar from account y to account x, and the other to do an audit of all the accounts at the bank, to make sure that the total amount of money in the bank never changes. Because after all, if you do transfers, the total shouldn't change even if you've moved money between accounts. In order to express this with transactions, we might have two transactions. The first transaction, we'll call it T1, is the transfer. The programmer is expected to mark the beginning of it with a begin-transaction marker, which I'll write as BEGIN_X. Then come the operations on the two balances, on the two records in the database. So we might add 1 to the balance x and add minus 1 to y. And then we need to mark the end of the transaction. Concurrently, we might have a transaction that's going to do an audit of all the balances: look at all of them, find the sum, and make sure they add up to a number that doesn't change despite transfers.
So the second transaction we might be thinking about, T2, is the audit transaction. Again we need to mark the beginning and end. This time, we're just reading. This is a read-only transaction. We need to get the current balances of all the accounts. Let's say there are just these two accounts for now. So we have two temporary variables. The first one is going to be the value of balance x. We'll just write get to mean we're reading that record. We also read y. And we print them both. And that's the end of the transaction. And the question is, what are legal results from these two transactions? The first thing we want to establish is, given the starting state, namely the two balances of $10 each, what could be the final results after you've run both these transactions, maybe at the same time? So we need a notion of what would be correct. And once we know that, we need to be able to build machinery that will actually execute these transactions and get only those correct answers, despite concurrency and failures. So first, what's correctness? Well, databases usually have a notion of correctness abbreviated as ACID. The A stands for atomic. This means that if a transaction has multiple steps, maybe writes multiple different records, then despite failures, either all of the writes should be done or none of them. It shouldn't be the case that a failure at an awkward time in the middle of a transaction leaves half the updates completed and visible, and half the updates never done. It's all or nothing. So this is all or none, despite failures. The C stands for consistent. Actually, we're not going to worry about that. It's usually meant to refer to the fact that the database will enforce certain invariants declared by the application. It's not really our concern today. The I, though, is quite important. It usually stands for isolated.
And this is really a property of whether or not two transactions that run at the same time can see each other's changes before the transactions have finished, whether or not they can see intermediate updates from the middle of another transaction. And the goal is no. The specific technical thing that most people mean by isolation is that the transaction execution is serializable. I'll explain what that means in a bit. But it boils down to this: transactions can't see each other's changes, can't see intermediate states, but only complete transaction results. And the final D stands for durable. This means that after a transaction commits, after the client or whatever program submitted the transaction gets a reply back from the database saying, yes, we've executed your transaction, the transaction's modifications to the database will be durable. They'll still be there; they won't be erased by some sort of failure. In practice, that means that stuff has to be written to some non-volatile storage, persistent storage like a disk. And so today, in fact for this whole course, really, our concerns are going to revolve around good behavior with respect to failure, good behavior with respect to multiple parallel activities, and making sure that the data is still there even if something crashes. So the most interesting part of this for us is the specific definition of isolated, or serializable. I'm going to lay that out before talking about how it actually applies to these transactions. So the I, isolated, usually means serializable. And the definition is this: if a set of transactions executes concurrently, more or less at the same time, they yield a set of results. And here, the results refer to both the new database records created by any modifications the transactions might do and, in addition, any output that the transactions produce.
So for our transactions, these two adds change records, and the changed records are part of the results. And the output of this print statement is part of the results. So the definition of serializable says the results are serializable if there exists some order of execution of the transactions. That is, we're going to say a specific parallel, concurrent execution of transactions is serializable if there exists some serial order, and I'm really emphasizing serial here, a serial order of execution of those same transactions that yields the same result as the actual execution. The difference is that the actual execution may have had a lot of parallelism in it, but it's required to produce the same result as some one-at-a-time execution of the same transactions. And so the way you check whether some concurrent execution is serializable is you look at the results and see if you can find some one-at-a-time execution of the same transactions that does produce the same results. So for our transactions up here, there are only two one-at-a-time serial orders available: transaction one, then transaction two, or transaction two, then transaction one. And so we can just look at the results that they would produce if executed one at a time in each of these orders. If we execute T1 and then T2, then we get x equals 11, y equals 9. And since T1 executed first, this print statement sees these two updated values, so it'll print the string 11, 9. The other possible order is that T2 ran first, and then T1. In that case, T2 will see the two records before they were modified, but the modifications will still take place since T1 runs later. So the final results will again be x equals 11, y equals 9. But this time, T2 saw the before values, so it prints 10, 10. So these are the two legal results for serializability.
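To make the definition concrete, here is a minimal Python sketch that enumerates every serial order of the two example transactions and collects the results each one yields. The function names (`transfer`, `audit`, `serial_results`) and the dictionary standing in for the database are illustrative assumptions, not part of any real transaction system.

```python
from itertools import permutations

def transfer(db, out):
    # T1: add 1 to x and add -1 to y.
    db["x"] += 1
    db["y"] -= 1

def audit(db, out):
    # T2: read both balances into temporaries and "print" them.
    tmp1 = db["x"]
    tmp2 = db["y"]
    out.append((tmp1, tmp2))

def serial_results(transactions, initial):
    """Return the set of (final records, output) pairs over all serial orders."""
    results = set()
    for order in permutations(transactions):
        db = dict(initial)   # fresh copy of the starting state
        out = []
        for txn in order:    # run the transactions strictly one at a time
            txn(db, out)
        results.add((tuple(sorted(db.items())), tuple(out)))
    return results

legal = serial_results([transfer, audit], {"x": 10, "y": 10})
for final, out in sorted(legal):
    print(dict(final), out)
```

Both orders end with x at 11 and y at 9; they differ only in what the audit printed, which is why the printed output has to count as part of the results.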
And if we ever see anything else from running these two transactions at the same time, we'll know that the database we're running against does not provide serializable execution. It's doing something else. So while we're thinking through what would happen if the system did this or that, we'll always be checking against these. These are the only two legal results, and we'd better be doing something that produces one or the other. It's interesting to note that there's more than one possible result. If you submit these two transactions at the same time, you don't know whether the order is going to be T1, T2 or T2, T1. So you have to be willing to expect more than one possible legal result. And as you have more transactions running concurrently, and more complicated ones, there may be many, many different correct results that are all serializable, because there are many orders that could be used to fulfill this requirement. So now that we have a definition of correctness, and we even know what all the possible results are, we can ask a few what-if questions about how these could execute. For example, suppose that the way the system actually executed this was that it started transaction 2 and got as far as just after reading x. Then transaction 1 ran at this point. And then after transaction 1 finished, transaction 2 continued executing. Now it turns out that with different transactions than these, that interleaving might actually be legal. But here we want to know if it's legal. So we're wondering, if we executed that way, what results will we get, and are they the same as either of these two? Well, if we execute transaction 1 here, then transaction 2's first temporary variable holds 10, the value of x it read before transaction 1 ran, and its second temporary variable holds the value of y after the decrement, which is 9. So this print will output 10, 9. And that's neither of these two outputs here. So that means executing in this way that I just drew is not serializable.
It would not be legal. Another interesting question is, what if we started executing transaction 1 and got as far as just after the first add, and then at that point all of transaction 2 executed right here? That would mean at this point x has value 11. Transaction 2 would read 11 and 10, and print 11, 10. And 11, 10 is not one of these two legal outputs. So this execution is also not legal for these two transactions. The reason why serializability is a popular and useful definition of what it means for execution of transactions to be correct is that it's a very easy model for programmers. You can write complicated transactions without having to worry about what else may be running in the system. There may be lots of other transactions, maybe using the same data as you, maybe trying to read and write it at the same time. There might be failures. Who knows? But the guarantee here is that it's safe to write your transactions as if nothing else was happening, because the final results have to be as if your transaction was executed by itself in this one-at-a-time order, which is a very simple, very nice programming model. It's also nice that this definition allows truly parallel execution of transactions as long as they don't use the same data. We run into trouble here because these two transactions both use x and y. But if they were using completely disjoint database records, this definition allows you to build a database system that would execute them completely in parallel. And if you have a sharded system, which is what we're working up to today, with different data on different machines, you can get true parallel speedup, because maybe one transaction executes purely in the first shard, on the first machine, and the other in parallel on the second machine. So there are opportunities here for good performance.
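The two what-if interleavings examined above can be checked mechanically. The following Python sketch, under the assumption that each transaction is just a list of small steps, runs a given interleaving and compares its results against those of every serial order. The step encoding and the names `run`, `serial_results`, and `is_serializable` are hypothetical choices made for this illustration.

```python
from itertools import permutations

# Each transaction is a list of steps, interpreted by run() below.
T1 = [("add", "x", 1), ("add", "y", -1)]        # transfer
T2 = [("get", "x"), ("get", "y"), ("print",)]   # audit
TXNS = {"T1": T1, "T2": T2}

def run(schedule, initial):
    """Execute one step per entry in schedule (a list of transaction names)."""
    db = dict(initial)
    next_step = {name: 0 for name in TXNS}
    temps = {name: [] for name in TXNS}
    output = []
    for name in schedule:
        step = TXNS[name][next_step[name]]
        next_step[name] += 1
        if step[0] == "add":
            db[step[1]] += step[2]
        elif step[0] == "get":
            temps[name].append(db[step[1]])
        elif step[0] == "print":
            output.append(tuple(temps[name]))
    return (tuple(sorted(db.items())), tuple(output))

def serial_results(initial):
    """Results of every one-at-a-time order of the transactions."""
    results = set()
    for order in permutations(TXNS):
        schedule = [name for name in order for _ in TXNS[name]]
        results.add(run(schedule, initial))
    return results

def is_serializable(schedule, initial):
    # Serializable if the results match those of some serial order.
    return run(schedule, initial) in serial_results(initial)

START = {"x": 10, "y": 10}
# T2 reads x, then all of T1 runs, then T2 finishes: the audit sees (10, 9).
first = ["T2", "T1", "T1", "T2", "T2"]
# T1 adds to x, then all of T2 runs, then T1 finishes: the audit sees (11, 10).
second = ["T1", "T2", "T2", "T2", "T1"]
print(run(first, START)[1], is_serializable(first, START))
print(run(second, START)[1], is_serializable(second, START))
```

This checker compares results only, which matches the definition given in the lecture; it is a brute-force illustration for two tiny transactions, not how a database would enforce serializability.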
Before I dig into how to implement serializable transactions, there's one more small point I want to bring up. One of the things we need to be able to cope with is that transactions may, for one reason or another, fail or decide to fail in the middle of the transaction. This is usually called an abort. For many transaction systems, we need to be prepared to handle what should happen if a transaction tries to access a record that doesn't exist, or divides by zero. Or, since some transaction implementation schemes use locking, maybe a transaction causes a locking deadlock, and the only way to break that deadlock is to kill one or more of the transactions participating in it. So one of the things that's going to be hanging in the background, and will come up, is the necessity of coping with transactions that all of a sudden, in the middle, decide they just cannot proceed. And it may really be in the middle, after they've done some work and started modifying things, so we need to be able to back out of these transactions and undo any modifications they've made. All right. The implementation strategy for these ACID transactions I'm going to split into two big pieces, and those are the main topics of the lecture. The first big implementation topic is concurrency control. This is the main tool we use to provide serializability, the I in ACID, isolation. So concurrency control buys us isolation from other concurrent transactions that might be trying to use the same data. And the other big piece that I mentioned is atomic commit. This is what's going to help us deal with the possibility that a transaction is executing along, and it's maybe modified x, and then all of a sudden there's a failure in one of the servers involved.
But other servers that were executing other parts of the transaction may still be running. That is, if x and y are on different machines, we need to be able to recover even if there's a partial failure of only some of the machines the transaction is running on. And the big tool people use for that is atomic commit, which we'll talk about. All right, so first, concurrency control. There are really two major approaches to concurrency control, and we'll talk about both during the course. The first strategy is usually called pessimistic concurrency control, and this is usually locking. We've all done locking in the labs in the context of Go programs. It turns out databases and transaction processing systems also use locking. The idea is the same as the one you're quite familiar with: before a transaction uses any data, it needs to acquire a lock on that data. If some other transaction is already using the data, the lock will be held, and we'll have to wait for the other transaction to finish before we can acquire the lock. So in pessimistic systems, if there are locking conflicts, if somebody else has the lock, it'll cause delays. You're sort of trading performance for correctness. The other main approach is optimistic. The basic idea here is that you don't worry about whether some other transaction is reading or writing the data at the same time as you. You just go ahead and do whatever reads and writes you're going to do, although typically into some sort of temporary area. And then only at the end do you go and check whether some other transaction might have been interfering. If there's no other transaction, you're done, and you never had to go through any of the overhead or waiting of taking out locks. Locks are reasonably expensive to manipulate.
But if somebody else was modifying the data in a conflicting way at the same time you were, then you have to abort that transaction and retry. The name for this is optimistic concurrency control. It turns out that under different circumstances, one of these two strategies can be faster than the other. If conflicts are very frequent, you probably want to use pessimistic concurrency control, because with frequent conflicts you're going to get a lot of aborts in optimistic schemes. If conflicts are rare, then optimistic concurrency control can be faster because it completely avoids locking overhead. Today will be all about pessimistic concurrency control. A later paper, in particular FaRM in a couple of weeks, will deal with an optimistic scheme. OK, so today I'm talking about pessimistic schemes, which basically means locking. And in particular for today, the reading was about two-phase locking, which is the most common type of locking. The idea in two-phase locking is that transactions use a bunch of records, like x and y in our example. The first rule is that you acquire a lock before using any piece of data, before reading or writing any record. And the second rule is that a transaction must hold any locks it acquires until after it commits or aborts. You're not allowed to give up locks in the middle of the transaction. You can only accumulate them until after you're done. So hold until completely done. So this is two-phase locking. The phases are the phase in which we acquire locks, and then the phase in which we just hold onto them until we're done. To see why two-phase locking works here, consider a typical locking system, though there's a lot of variation.
Typical locking systems associate a separate lock with each record in the database, with each row in each table, for example, although the locks could be more coarse-grained. Transactions start out holding no locks. Let's say transaction one starts out holding no locks. When it first uses x, before it's allowed to use it, it has to acquire the lock on x. It may have to wait. And when it first uses y, it acquires another lock, the lock on y. When it finishes, after it's done, it can release both. If we ran both these transactions at the same time, they're going to basically race to get the lock on x. Whichever of them manages to get the lock on x first will proceed and finish and commit. Meanwhile, the other transaction, the one that didn't get the lock on x first, is going to sit waiting before it does anything with x until it can acquire the lock. So suppose transaction two got the lock first. It would get the value of x and get the value of y, because transaction one hasn't gotten to this point and hasn't locked y yet. It'll print, and it'll finish, and release its locks. And only then will transaction one be able to acquire the lock on x. As you can see, that basically forces a serial order. In this case, it forced the order T2, and then only when T2 finishes, T1. So it's explicitly forcing an order, which causes the execution to follow the definition of serializability. It really is executing T2 to completion, and only then T1. So we do get correct execution. So one question is, why do you need to hold the locks until a transaction is completely finished? You might think that you could just hold the lock while you were actually using the data, and that would be more efficient, and indeed it would. That is, maybe only hold the lock for the period of time in which T2 is actually looking at record x, or maybe only hold the lock on x here for the duration of the add operation, and then immediately release it.
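The two locking rules can be sketched in a few lines of Python. This is a simplified illustration under several assumptions: one exclusive lock per record (real systems usually distinguish shared and exclusive locks), an in-memory dictionary as the database, and no handling of aborts or deadlock. The class and function names (`LockTable`, `Transaction`, `transfer`, `audit`) are invented for this example.

```python
import threading

class LockTable:
    """One exclusive lock per record (an assumption made for simplicity)."""
    def __init__(self, records):
        self._locks = {name: threading.Lock() for name in records}

    def lock_for(self, name):
        return self._locks[name]

class Transaction:
    def __init__(self, db, locks):
        self.db, self.locks, self.held = db, locks, []

    def _acquire(self, name):
        # Rule 1: lock a record before first using it; this may block.
        if name not in self.held:
            self.locks.lock_for(name).acquire()
            self.held.append(name)

    def get(self, name):
        self._acquire(name)
        return self.db[name]

    def add(self, name, delta):
        self._acquire(name)
        self.db[name] += delta

    def commit(self):
        # Rule 2: release locks only once the transaction is completely done.
        for name in self.held:
            self.locks.lock_for(name).release()
        self.held = []

def transfer(db, locks):
    t = Transaction(db, locks)
    t.add("x", 1)
    t.add("y", -1)
    t.commit()

def audit(db, locks, out):
    t = Transaction(db, locks)
    tmp1 = t.get("x")
    tmp2 = t.get("y")
    out.append((tmp1, tmp2))
    t.commit()

db = {"x": 10, "y": 10}
locks = LockTable(db)
out = []
threads = [threading.Thread(target=transfer, args=(db, locks)),
           threading.Thread(target=audit, args=(db, locks, out))]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(db, out)
```

Whichever thread wins the race for the lock on x runs to completion before the other touches x, so the audit can only ever record 10, 10 or 11, 9. Both transactions happen to lock x before y here, which is why this particular pair cannot deadlock.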
In that case the transaction would immediately release the lock on x, thereby disobeying this rule, of course. But if it immediately released the lock on x, then transaction 2 might be able to start a little bit earlier. We'd get more concurrency and higher performance. So this rule definitely is bad for performance, and we want to make pretty sure that it's required for correctness. So what might happen if transactions did release locks as early as possible? Suppose T2 here reads x, and then immediately releases its lock on x. At this point in the execution, T2 doesn't hold any locks, because it's just illegally released the lock on x. That means T1 could completely execute right here. And we already know from before that this interleaving is not correct, as it doesn't produce either of these two outputs. Similarly, if T1 released its lock on x after it finished adding 1 to x, that would allow all of T2 to slip in right here. And we also know from before that that produces illegal results. But there's an additional kind of problem that can come up with releasing locks after modifying data. If T1 were to release the lock on x, it might allow T2 to see the modified version of x, to see x after adding 1 to it, and to print that output, and then for T2 to complete after printing the incremented value of x. Now suppose transaction 1 were to abort after that point, maybe because bank balance y doesn't exist, or maybe bank balance y exists but its balance is 0, and we're not allowed to decrement below 0 for bank balances, because that's an overdraft. So T1 might modify x, then abort. And part of the abort has to be undoing its update to x in order to maintain atomicity. What that would mean, if it had released the lock, is that transaction 2 would have seen this sort of phantom value of 11 that went away because T1 aborted.
It would have seen a value that, according to the rules, never existed. Because if transaction 1 aborts, then it's as if it never existed. And so that means the results from T2 had better be as if T2 ran by itself without T1 at all. But if it sees the increment, then it's going to print 11 for x, 11, 10 actually, which just doesn't correspond to any state in the database, given that T1 didn't really complete. OK, so those are the two dangers, two violations of serializability, that are averted because transactions hold their locks until they're done. A further thing to note about these rules is that it's very easy for them to produce deadlock. For example, if we have two transactions, one of which reads record x and then record y, and the other reads y and then x, that's just a deadlock if they run at the same time. Each of them gets the lock on the record it read first. They don't release until the transactions finish. So they both sit there waiting for the lock that's held by the other transaction, and unless the database does something clever, which it will, they'll deadlock forever. In fact, transaction systems have various strategies, including detecting cycles or using timeouts, in order to notice that they've gotten into this situation. The database will abort one of these two transactions, undo all its changes, and act as if the transaction had never occurred. OK, so that's concurrency control with two-phase locking. This is just completely standard database behavior so far, and it's the same in single-machine databases as it will be in the distributed databases that are of a little more interest to us. But our next topic is actually specific to building databases, or storage systems in general, that support transactions in a distributed setting, that is, with the data split over multiple machines.
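Going back to the deadlock case for a moment, cycle detection can be illustrated with a wait-for graph: an edge from one transaction to another means the first is waiting for a lock the second holds. The Python sketch below is one possible way to find a cycle with a depth-first search; the function name `find_cycle` and the graph encoding are assumptions for this example, and a real database may use a different scheme or simply rely on timeouts.

```python
def find_cycle(waits_for):
    """waits_for maps a transaction ID to the set of IDs it is waiting on.
    Return a list of IDs forming a cycle, or None if there is no deadlock."""
    visiting = []   # current depth-first search path
    done = set()    # transactions already known not to lead into a cycle

    def visit(t):
        if t in done:
            return None
        if t in visiting:
            # We came back to a transaction on the current path: a cycle.
            return visiting[visiting.index(t):]
        visiting.append(t)
        for u in sorted(waits_for.get(t, ())):
            cycle = visit(u)
            if cycle:
                return cycle
        visiting.pop()
        done.add(t)
        return None

    for t in sorted(waits_for):
        cycle = visit(t)
        if cycle:
            return cycle
    return None

# T1 holds the lock on x and waits for y; T2 holds y and waits for x.
deadlocked = {"T1": {"T2"}, "T2": {"T1"}}
# T1 waits for T2, but T2 is not waiting on anyone.
fine = {"T1": {"T2"}, "T2": set()}

print(find_cycle(deadlocked))
print(find_cycle(fine))
```

Once a cycle is found, the system picks one transaction in it as the victim, aborts it, and undoes its changes so the others can proceed; how the victim is chosen is a policy decision this sketch leaves open.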
So now the topic is how to build distributed transactions, and in particular how to cope with failures, more specifically the kind of partial failure of just one of many servers that you often see in distributed systems. So we have distributed transactions, and we want to make sure they're serializable and also have all-or-nothing atomicity even in the face of failures. The way this looks is that we may have two servers. We've got server one, which stores record X in our bank, and server two, which stores record Y. Both records start out with value 10. We need to run these two transactions. Transaction one, of course, modifies both X and Y. So we need to send messages to the servers saying, please increment X, please decrement Y. But if we weren't careful, it would be easy to get into a situation where we had told server one to increase the balance for X, but then something failed, maybe the client sending the requests, or maybe server two that's holding Y, and we never managed to do the second update. So that's one problem: a failure somewhere may cut the transaction in half, and if we're not careful, it'll cause only half of the transaction to actually take effect. This can happen even without crashes. If server one does its part of the transaction, it could be that over on server two, the server actually gets the request to decrement bank account Y, but discovers that this bank account doesn't exist, or that it does exist and its balance is already zero and can't be decreased. And so it can't do its part of the transaction. But server one has already done its part. So that's a problem that needs to be dealt with. The property we want, as I mentioned before, is that either all the pieces of the system do their part of the transaction, or none do.
So the property we've violated here is atomicity in the face of crashes and other failures, where atomicity means all or none: either all parts of the transaction we're trying to execute happen, or none of them do. And the kind of solution we're going to be looking at is atomic commit protocols. The general flavor of atomic commit protocols is that you have a bunch of computers, all doing different parts of some larger task. The atomic commit protocol is going to help the computers decide that either they're all capable of doing their part and they're actually going to do it, or something has gone wrong and they're all going to agree that none of them will do their part of the overall task. The big challenges are, of course, how to cope with various failures, machine failures, and loss of messages. And it'll turn out that good performance is also a little bit difficult to achieve. The specific protocol we're going to look at, and this is the protocol explained in the reading for today, is two-phase commit. This is an atomic commit protocol, and it's used both by distributed databases and by all kinds of other distributed systems that might not at first look like traditional databases. The general setting is that we assume, in one way or another, the task we need to perform is split up over multiple servers, each of which needs to do a different part. So for example, in the setup I showed here, it's really the data that's split up, and so the tasks being split up are incrementing x and decrementing y. We're going to assume that there's one computer driving the transaction, called the transaction coordinator. There are lots of ways of arranging how the transaction coordinator steps in, but we'll just imagine it as a computer that's actually running the transaction.
There's one computer, the transaction coordinator, that's executing the code for the transaction, like the puts, the gets, and the adds. It sends messages to the computers that hold the different pieces of data, which actually execute the different parts. So for our setup, we're going to have one computer, the transaction coordinator, plus server 1 and server 2, which hold x and y. The transaction coordinator will send a message to server 1 saying, please increment x, and a message to server 2 saying, please decrement y. And then there'll be more messages in order to make sure that either they both do it or neither of them does it. That's where two-phase commit steps in. Something to keep in the back of your mind is that in the full system, there may be many different transactions running concurrently and many transaction coordinators executing their own transactions. So the various parties here need to keep track of which transaction each message is for. The same goes for wherever they keep state. It turns out these servers are going to maintain tables of locks, for example, and when they keep state like that, they need to record that this is a lock being held for transaction 17. So there's a notion of transaction IDs, or TIDs. I'm just going to assume, although I won't actually show it, that every message in the system is tagged with the unique transaction ID of the transaction it applies to. These IDs are chosen by the transaction coordinator when the transaction starts. So the transaction coordinator will send out a message saying this is for transaction ID 95. All the state it keeps about the transaction will be tagged with 95, and the entries in the various tables in the different participants will be tagged with transaction IDs too. And so that's another piece of terminology. We've got the transaction coordinator.
And then the other servers that are doing parts of the transaction are called participants. All right, so let me draw out an example execution of the two-phase commit protocol. I'll abbreviate it to 2PC, for two-phase commit. The parties involved are the transaction coordinator and, we'll just say, two participants. That is, maybe we're executing the transactions I've shown, with X and Y on different servers. We've got participant A and participant B, two different servers holding data. The transaction coordinator is running the whole transaction. It's going to send puts and gets to A and B to tell them to read the value of X or Y, or add one to X. So at the beginning of the transaction, we're going to see the transaction coordinator sending, for example, a get request to participant A and getting a reply. Then maybe it sends a put for whatever. And we might see a long sequence of these if it's a complicated transaction. Then the transaction coordinator gets to the end of the transaction and wants to commit it, so that it can release all those locks, make the transaction's results visible to the outside world, and maybe reply to a client or a human user. We're assuming there's some external client or human that said, please run this transaction, and is waiting for a response. Before we can do any of that, the transaction coordinator has to make sure that all the different participants can actually do their part of the transaction. In particular, if there were any puts in the transaction, we need to make sure that the participants doing those puts are still capable of doing them. So in order to find that out, the transaction coordinator sends prepare messages to all of the participants. So we're going to send prepare messages to both A and B.
And when A or B receives a prepare message, it knows the transaction is nearing completion but is not over yet, so it looks at its state and decides whether it is actually able to complete the transaction. Maybe it needed to abort it to break a deadlock, or maybe it crashed and restarted between when it did the last operation and now, and it's completely forgotten about the transaction and can't complete it. So A and B look at their state and say, oh, I'm going to be able to, or I'm not going to be able to, do this transaction. And they respond with either yes or no. So the transaction coordinator is waiting for these yes or no votes from each of the participants. If they all say yes, then nothing has gone wrong and the transaction can commit. And the transaction coordinator sends out a commit message to each of the participants. And then the participants usually reply with an acknowledgement saying yes, we now know the outcome. This is called the acknowledgement. All right, so the transaction coordinator sends out prepares, and if all of the participants say yes, it can commit. If any of them, even a single one, says no, actually I cannot complete this transaction because I had a failure or there was an inconsistency, like a missing record, and I have to abort, if even a single participant says no at this point, then the transaction coordinator won't commit. It'll send out a round of abort messages saying oops, please retract this transaction. Either way, after the commit or abort, two things happen that are of interest to us. One is the transaction coordinator will emit whatever the transaction's output is to the client or human that requested it and say, look, oh yes, the transaction's finished. And so now, if it didn't abort, if it committed, it's durable. The other interesting thing is that in order to obey these locking rules, the participants unlock when they see either a commit or an abort. 
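To make the message flow concrete, here is a minimal sketch of the failure-free protocol just described, written in Python. Everything in it is an illustrative assumption rather than anything from the lecture's own code: the `Participant` class, its `can_commit` flag (a stand-in for "no deadlock, no crash"), and the `two_phase_commit` function are hypothetical names, messages are plain method calls, and the lock table records who holds a lock without ever making another transaction wait.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Hypothetical in-memory participant holding a few records."""
    name: str
    data: dict
    can_commit: bool = True                        # stand-in for "able to finish"
    pending: dict = field(default_factory=dict)    # tid -> {key: new value}
    locks: dict = field(default_factory=dict)      # key -> tid holding the lock

    def add(self, tid, key, delta):
        # Record the lock for this transaction and buffer the new value.
        self.locks[key] = tid
        writes = self.pending.setdefault(tid, {})
        writes[key] = writes.get(key, self.data[key]) + delta

    def prepare(self, tid):
        # Vote yes only if we still know the transaction and can finish it.
        return self.can_commit and tid in self.pending

    def commit(self, tid):
        # Install the buffered writes, then release this transaction's locks.
        self.data.update(self.pending.pop(tid, {}))
        self._unlock(tid)
        return "ack"

    def abort(self, tid):
        # Throw the buffered writes away, then release the locks.
        self.pending.pop(tid, None)
        self._unlock(tid)
        return "ack"

    def _unlock(self, tid):
        self.locks = {k: t for k, t in self.locks.items() if t != tid}


def two_phase_commit(tid, participants):
    """Coordinator side: collect every vote, then broadcast one decision."""
    votes = [p.prepare(tid) for p in participants]
    decision = "commit" if all(votes) else "abort"
    for p in participants:
        if decision == "commit":
            p.commit(tid)
        else:
            p.abort(tid)
    return decision


a = Participant("A", {"x": 10})
b = Participant("B", {"y": 10})
a.add(95, "x", +1)
b.add(95, "y", -1)
print(two_phase_commit(95, [a, b]), a.data, b.data)
# prints: commit {'x': 11} {'y': 9}
```

Setting `can_commit=False` on either participant makes its vote no, and then both participants discard their buffered writes, which is the all-or-nothing outcome. Nothing here survives a crash or a lost message.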
And indeed, in order to obey the two-phase locking rule, each participant locked any data that it read or wrote as part of doing its part of the transaction. So we're imagining that in each participant there's a table of the locks associated with the data stored at that participant, and the participants sort of lock things in those tables. They remember, oh, this piece of data, this record, is locked for transaction 29, and when finally the commit or abort comes back for transaction 29, the participant unlocks that data and then other transactions can use it. So other transactions may have to wait here, and this unlock may unblock them. That's really part of the serializability machinery. So far, the reason why this is correct, basically, is that if everybody's following this protocol and there's no failures, then the two participants only commit if both of them commit, and if either of them can't commit, if either of them has to abort, then they both abort. So we get that either-they-all-do-it-or-none-of-them-do-it result that we wanted, the atomicity result, with this protocol so far, without thinking about failures. And so now our job is to think through all the different kinds of failures that might occur and figure out whether the protocol still provides atomicity, either both do it or neither does it, in the face of these failures, and how we have to adjust or extend the protocol in order to cause it to do the right thing. So the first thing I want to consider is, what if B crashes and restarts? I mean, there's a power failure or something, B just suddenly stops executing, and then the power is restored and it's brought back to life and runs maybe some sort of recovery software as part of the transaction processing system. Well, there's really two scenarios we have to worry about. One is B might have crashed before sending its yes message back. If B crashed before sending its yes message back, then it never said yes. 
So the transaction coordinator couldn't possibly have committed or be about to commit, because it has to wait for a yes from all participants. So if B can convince itself that it could not possibly have sent a yes back, that is, it crashed before sending the yes, then B is entitled to unilaterally abort the transaction itself and forget about it, because it knows the transaction coordinator can't possibly commit it. There's a number of ways of implementing this. One possibility is that all of B's information about transactions that haven't reached this point is in memory and is simply lost if B crashes and reboots, so B just won't know anything about transactions that haven't sent yes back yet. And then if the transaction coordinator sends a prepare message to a participant that doesn't know anything about the transaction because it crashed before sending yes, the participant will say, no, no, I cannot possibly agree to that. Please abort. OK. But of course, maybe B crashed after sending a yes back. So that's a little more tricky. Suppose B gets a prepare, it's happy, it says yes, I'm going to commit, and then it crashes before it gets the commit message from the transaction coordinator. Well now, we're in a totally different situation. B has promised to commit if it's told to do so, because it sent a yes back. And for all it knows, and indeed the most likely thing that's happening, is that the transaction coordinator got yeses from A and B and has sent a commit message to A, so that A actually will do its part of the transaction and make it permanent and release locks. And in that case, in order to honor all or nothing, we absolutely require that if B should crash at this point, on recovery it is still prepared to complete its part of the transaction. It doesn't actually know at that point, because it hasn't received the commit yet, whether it should commit or not, but it must still be prepared to commit. 
And what that means, the fact that B can't lose the state for a transaction across crashes and reboots, is that before B replies to a prepare, it must make the transaction state durable: the sort of intermediate transaction state, the memory of all the changes it's made, which may have to be undone if there's an abort, plus the record of all the locks the transaction held. It must make that durable on disk, and in particular it's almost always in a log on disk. So before B sends the yes in reply to a prepare message, it first must write to disk, in its log, all the information required to commit that transaction. That is, all the new values produced by puts plus a full list of locks must be on the disk or some other persistent memory before it replies with yes. And then if B should crash after sending the yes, as part of recovery, when it restarts, it'll look at its log and say, oh gosh, I was in the middle of a transaction. I had replied yes for transaction 92. Here's all the modifications it should make if committed and all the locks it held. I'd better restore that state. And then when B finally gets a commit or an abort, it'll know from having read its log how to actually finish its part of the transaction. So this is an important thing I left out of the original laying out of this protocol: B must write to its disk at this point. And this is part of what makes two-phase commit a little bit slow, that there's this necessary persisting of information here. The final place, I guess, where B might crash is after receiving the commit, or really after actually processing the commit. But in that case, it's made the modifications that the transaction needed to make permanent in its database, presumably also on disk, after it received the commit message. And in that case, there's maybe not anything to do if it restarts, because the transaction's finished. 
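The write-before-yes rule and the recovery step can be sketched like this. It is only an illustration under assumptions that are not from the lecture: the `DurableParticipant` class, the JSON-lines log format, and the record types `"prepared"` and `"done"` are all made up for the example, and a crash is simulated by simply constructing a new object over the same log file.

```python
import json
import os
import tempfile


class DurableParticipant:
    """Hypothetical participant that logs to disk before voting yes."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.prepared = {}   # tid -> logged record with writes and locks
        self.recover()

    def recover(self):
        # Replay the log: anything prepared but not finished must be restored.
        self.prepared = {}
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as f:
            for line in f:
                rec = json.loads(line)
                if rec["type"] == "prepared":
                    self.prepared[rec["tid"]] = rec
                else:  # "done": the transaction committed or aborted
                    self.prepared.pop(rec["tid"], None)

    def _append(self, rec):
        with open(self.log_path, "a") as f:
            f.write(json.dumps(rec) + "\n")
            f.flush()
            os.fsync(f.fileno())   # wait until the record is really on disk

    def prepare(self, tid, writes, locks):
        rec = {"type": "prepared", "tid": tid, "writes": writes, "locks": locks}
        self._append(rec)          # durable first ...
        self.prepared[tid] = rec
        return "yes"               # ... and only then reply yes

    def finish(self, tid):
        # Called after a commit or abort has been fully processed.
        self._append({"type": "done", "tid": tid})
        self.prepared.pop(tid, None)


log_path = os.path.join(tempfile.mkdtemp(), "participant_b.log")
b = DurableParticipant(log_path)
b.prepare(92, {"y": 9}, ["y"])
restarted = DurableParticipant(log_path)   # simulated crash and reboot
print(sorted(restarted.prepared))          # prints: [92]
```

The `os.fsync` call is where the cost comes from: the yes cannot go out until the disk write has finished.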
So when B receives the commit message, it probably copies the modifications from its log onto its permanent storage, releases its locks, erases the information about the transaction from its log, and then replies. Of course, we have to worry about what happens if it receives a commit message twice. One possibility is for B to remember the transaction, but that takes memory. It turns out that if B simply forgets about committed transactions that it's made durable on disk, it can reply to a repeated commit message for a transaction it doesn't know anything about by simply acknowledging it again. And that'll be important a little bit later on. So that's the story of what happens when one of the participants crashes at various awkward points. What about the transaction coordinator? It's also just a single computer, so if it fails, that might be a problem. OK, so again, where things start getting critical is that if any party might have committed, then we cannot forget about that. If either of these participants might have committed, or if the transaction coordinator might have replied to the client, then we cannot have that transaction go away. Suppose A has committed, that is, maybe the transaction coordinator sent out a commit message to A but hadn't gotten around to sending a commit message to B. If it crashes at that point, the transaction coordinator must be prepared on restart to resend the commit messages to make sure that both parties know that the transaction is committed. OK, so whether that matters depends on where the transaction coordinator crashes. If it crashes before sending commit messages, it doesn't really matter. Since the transaction coordinator didn't send any commit messages before crashing, neither party can have committed, so it can just abort the transaction. 
And if either participant asks about that transaction, because they see it's in their log but they never got a commit message, the transaction coordinator can say, I don't know anything about that transaction. It must have been aborted, possibly due to a crash. So that's what happens if the transaction coordinator crashes before the commit. But if it crashes after sending one or more commit messages, then the transaction coordinator can't be allowed to forget about the transaction. And what that means is that at this point, after the transaction coordinator has made its commit versus abort decision on the basis of these yes-no votes, and before sending out any commit messages, it must first write information about the transaction to its log in persistent storage, like a disk, that will still be there if it crashes and restarts. So the transaction coordinator, after it receives a full set of yeses or noes, writes the outcome and the transaction ID to its log on disk, and only then starts to send out commit messages. And that way, if it crashes at any point, maybe before it sent the first commit message, or after it sent one, or maybe even after it sent all of them, its recovery software will see in the log, aha, I was in the middle of a transaction, and the transaction was known to have been either committed or aborted. And as part of recovery, it will resend commit messages to all of the participants, or abort messages, whatever the decision was, in case it hadn't sent them before it crashed. And that's one reason why the participants have to be prepared to receive duplicated commit messages. So those are the main crash stories. We also have to worry about what happens if messages are lost in the network. You might send a message. Maybe the message never got there. You might send a message and be waiting for a reply. Maybe the reply was sent, but the reply was dropped. 
So any one of these messages may be dropped, and you need to think through what to actually do in each of these cases. So for example, suppose the transaction coordinator has sent out prepare messages but hasn't gotten some of the yes or no replies from participants. What are the transaction coordinator's options at that point? Well, one thing it could do is send out a new set of prepare messages saying, I didn't get your answer, please tell me your answer, yes or no. And it could keep on doing that for a while. But if one of the participants is down for a long time, we don't want to sit there waiting with locks held. Because suppose A is unresponsive but B is up. Because we haven't committed or aborted, B is still holding locks, and that may cause other transactions to be waiting. So we don't want to wait forever if we can possibly avoid it. So if the transaction coordinator hasn't gotten yes or no responses from the participants after some amount of time, then it can simply unilaterally decide, oh, we're going to abort this transaction. Because it knows, since it didn't get a full set of yes or no messages, that it can't possibly have sent a commit yet. So no participant could have committed. So it's always valid to abort if the transaction coordinator hasn't yet committed. So if the transaction coordinator times out waiting for yeses or noes, because messages were lost or somebody crashed or something, it can just decide, all right, we're aborting this transaction, and send out a round of abort messages. And if some participant comes back to life and says, oh, I didn't hear back from you about transaction 95, the transaction coordinator will say, oh, well, I don't know anything about transaction 95, because it aborted it and erased its state for that transaction. And it'll tell the participant, you should abort this transaction, too. 
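The coordinator's side of this, retry the prepares for a while, treat a missing vote as grounds to abort, and record the decision before telling anyone, can be sketched as follows. The names are hypothetical and the simplifications are assumptions for illustration: `run_coordinator` is made up, each participant is modelled as a function that returns `True`, `False`, or `None` for a lost message or timeout, and a plain Python list stands in for the coordinator's on-disk log.

```python
def run_coordinator(tid, participants, log, max_attempts=3):
    """Collect votes with a bounded number of retries, then decide.

    participants: dict mapping a name to a function ask(tid) that returns
    True (yes), False (no), or None (no reply before the timeout).
    log: a list standing in for the coordinator's durable log.
    """
    votes = {}
    for _ in range(max_attempts):
        for name, ask in participants.items():
            if name not in votes:          # only re-ask those who never replied
                reply = ask(tid)
                if reply is not None:
                    votes[name] = reply
        if len(votes) == len(participants):
            break

    # Commit needs a yes from everyone; a no or a missing vote means abort.
    everyone_voted = len(votes) == len(participants)
    decision = "commit" if everyone_voted and all(votes.values()) else "abort"

    # The decision is recorded before any commit or abort message is sent.
    log.append((tid, decision))
    return decision


log = []
silent_a = {"A": lambda tid: None, "B": lambda tid: True}
print(run_coordinator(95, silent_a, log))   # prints: abort
```

Aborting on a timeout is safe here only because no commit message can have been sent before the decision was made.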
Similarly, if one of the participants times out waiting for the prepare here, then, since the participant hasn't received a prepare, it hasn't sent a yes message back. And that means the coordinator can't possibly have sent any commit messages. So if a participant times out here waiting for the prepare, it's also always allowed to just bail out and decide to abort the transaction. And if at some future time the transaction coordinator comes back to life and sends out prepare messages, then B will say, no, I don't know anything about that transaction, so I'm voting no. And that's OK, because it can't possibly have started to commit anywhere. So again, if something goes wrong with the network, or the transaction coordinator is down for a while and the participants are still waiting for prepares, it's always valid for participants to abort and thereby release the locks that other transactions may be waiting for. And that can be very important in a busy system. So that's the good news about what happens if the participants or the transaction coordinator time out waiting for messages from the other parties. However, suppose participant B has received a prepare and sent its yes. And so it's somewhere around here, but it hasn't received the commit. And it's waiting and waiting, and it hasn't gotten the commit back. Maybe something's wrong with the network, or maybe the transaction coordinator's network connection has fallen out or its power has failed or something. But for whatever reason, B has waited a long time and it still hasn't heard a commit. And it's sitting there still holding onto those locks for all the records that were used in its part of the transaction. And that means other transactions may also be blocked waiting for those locks to be released. So we're pretty eager to abort if we possibly can, and release the locks. 
And so the question is, if B has received prepare and replied with yes, is it entitled to unilaterally abort after it's waited, say, 10 seconds or 10 minutes or something to get the commit message? And the answer to that, unfortunately, is no. In this region, after receiving the prepare, or really after sending the yes and before getting the commit, if you time out waiting for the commit, you're not allowed to abort. You must keep waiting. This is usually called blocking. So in this region of the protocol, if you don't receive the commit, you have to wait indefinitely. And the reason is that since B sent back a yes, the transaction coordinator may have received the yes. It may have received yes from all of the participants. And it may have started sending out commit messages to some of the participants. And that means that A may have actually seen the commit message and committed and made its changes permanent and unlocked and shown the changes to other transactions. And since that could be the case for all B knows, in this region of the protocol, B cannot unilaterally decide to abort when it times out. It must wait indefinitely to hear from the transaction coordinator, as long as it takes. Some human may have to come and repair the transaction coordinator and finally get it started again and have it read its log and see, oh, yes, we committed that transaction, and finally send long-delayed commit messages. So on a timeout, you can't unilaterally abort. It turns out you can't unilaterally commit either, because for all B knows, A might have voted no, but B just hasn't got the abort message yet. So in this region, you can neither abort nor commit on a timeout. So this blocking behavior is a sort of critical property of two-phase commit. And it's not a happy property. It means if things go wrong, you can easily be in a situation where you have to wait for a long time with locks held, holding up other transactions. 
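The participant's timeout rules boil down to a tiny decision table, sketched here. The state names and the `on_timeout` function are hypothetical labels chosen for the example, not terminology from the lecture.

```python
from enum import Enum


class State(Enum):
    WORKING = "working"     # has not yet sent yes (no prepare answered)
    PREPARED = "prepared"   # sent yes, still waiting for commit or abort
    DONE = "done"           # has processed the commit or abort


def on_timeout(state):
    """What a participant may do when it times out waiting for the coordinator."""
    if state is State.WORKING:
        # No yes was sent, so the coordinator cannot have committed.
        return "abort"
    if state is State.PREPARED:
        # The coordinator may already have committed or aborted elsewhere,
        # so the participant can neither abort nor commit on its own.
        return "block"
    return "nothing"


for s in State:
    print(s.value, "->", on_timeout(s))
# prints:
# working -> abort
# prepared -> block
# done -> nothing
```

The `PREPARED` row is the one that hurts: the locks stay held until the coordinator's decision finally arrives.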
And so among other things, people try really hard to make this part of two-phase commit as fast as humanly possible, so that the window of time in which a failure might cause you to block with locks held for a long time is as small as possible. So they try to make this part of the protocol very lightweight, or even have variants of the protocol that for certain special cases may not have to wait at all. OK, so that's the basic protocol. One thing to notice about this, which is a fundamental part of why we're able to actually build a protocol in which A and B either both commit or both abort, is that really the decision is made by a single entity. It's made by the transaction coordinator alone. Neither A nor B, except when one of them votes no, is deciding whether to commit or not. And they certainly are not engaged in a conversation with each other to try to reach agreement about, what is the other thinking, are they thinking to commit, maybe I'll commit too. Instead, we have this quite sort of fundamentally simple protocol in which only the transaction coordinator, the single entity, makes the decision, and it just tells the other parties, here's my decision, please go do it. The penalty for that, for having the transaction coordinator, really a single entity, make the final decision, again, is the fact that there are some points at which you have to block, waiting for the transaction coordinator to tell you what the decision was. One further question is that we know the transaction coordinator must remember information about transactions in its log in case it crashes. And so one question is when the transaction coordinator can forget about information in its log about transactions. 
And the answer to that is that if it manages to get a full set of acknowledgments from the participants, then it knows that all the participants know that that transaction committed or aborted, that all the participants know the fate of that transaction and have done their part in it, and will never need to ask for that information, since they all acknowledged it. So when the transaction coordinator gets acknowledgments, it can erase all information, all memory of the transaction. Similarly, participants, once they've received a commit or abort message and done their part in the transaction and made their updates permanent and released their locks, at that point can also completely forget about that transaction, after they send their acknowledgment back to the transaction coordinator. Now of course, the transaction coordinator may not get their acknowledgment and may therefore decide to resend the commit message on the theory that maybe it was lost. And in that case, if a participant receives a commit message for a transaction which it knows nothing about because it's forgotten about it, then the participant can just send another acknowledgment back, because it knows that if it gets a commit message for an unknown transaction, it must be because it had forgotten about it, because it already knew whether it committed or aborted. OK, so that's two-phase commit for atomic commit. For a little perspective, two-phase commit is used in a lot of sharded databases that have split up their data among multiple servers. And it's used specifically in databases or storage systems that need to support transactions in which multiple records may be read or written. There's a lot of more specialized storage systems that don't allow you to have transactions on multiple records. And for them, you don't need two-phase commit, if the storage system doesn't allow multi-record transactions. 
But if you have multi-record transactions and you shard the data across multiple servers, then you need to support two-phase commit if you want to get ACID transactions. However, two-phase commit has an evil reputation. One reason is it's slow due to multiple rounds of messages. There's a lot of chitchat here in order to get a transaction that involves multiple participants to finish. There's in addition a lot of disk writes. Both A and B have to not just write data to their disk between the prepare and the sending of the yes, they have to wait for that disk write to finish. So certainly, if you're using a mechanical drive that takes 10 milliseconds to append to the log, that puts a real serious limit on how fast participants can process transactions. 10 milliseconds a pop means that without some cleverness, you're limited to 100 transactions per second, which is pretty slow. And in addition, the transaction coordinator also has a point at which, after it receives the last yes, it must first write to its log, make sure the data is safe on disk, and only then is it allowed to send the commit messages. And that's another 10 milliseconds. And both of these are 10 millisecond periods in which the locks are held in the participants and other transactions are held up. And I keep mentioning that, but it's very important. Because in a busy transaction processing system, there's lots and lots of transactions. And many of them may be waiting for the same data. And we'd really prefer not to hold locks over long periods of time in which there's lots of messages going back and forth and we have to wait for long disk writes. But two-phase commit forces us to do those waits. And a further problem with it is that if anything goes wrong, messages are lost or something crashes, then if you're a little bit unlucky, the participants have to wait for a long time with locks held. 
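The 100-transactions-per-second figure is just arithmetic on the log-write latency, which this small sketch spells out. The `max_tps` helper is a hypothetical name, and the model behind it is a deliberate simplification: one synchronous log write per transaction, one transaction at a time, and no batching of log appends.

```python
def max_tps(sync_write_ms, writes_per_txn=1):
    """Upper bound on transactions per second when each transaction must
    wait for writes_per_txn synchronous log writes, one after another."""
    return 1000.0 / (sync_write_ms * writes_per_txn)


print(max_tps(10))   # prints: 100.0
```

Under the same assumptions, the participant's write before the yes and the coordinator's write before the commit add up to at least 20 milliseconds during which locks are held, before counting any network round trips.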
So therefore, two-phase commit you really only see within relatively small domains, within a single machine room, within a single organization. You don't see it, for example, used to do transfers between different banks. You might possibly see it within a bank if it has sharded its database. But you would never see two-phase commit run between distinct organizations that were maybe physically separate, because of this blocking business. You don't want to put the fate of your database, and whether it's operational, in the hands of some other organization, where if they crash at the wrong time, your database is forced to hold locks for a long time. And because it's so slow, also, a lot of research has gone into either making it fast, or relaxing the rules in various ways to allow it to be faster, or specializing two-phase commit for various specific situations in which you can shave a message or a write to the disk or something off it, because you know you're only supporting a certain limited kind of transaction. So we'll see a fair amount of this in the rest of the course. One question that comes up a lot: this exchange here, where you have a leader, essentially, and it sends these messages to the followers, and the leader can only proceed if it receives acknowledgments, replies, from enough of the followers. This construction looks a lot like Raft. However, the properties of the protocol and what you get out of it turn out to be quite different from what we get out of Raft. They solve very different problems. So the way to think about it is that you use Raft to get high availability by replicating data on multiple participants, on multiple peers. The point of Raft is to be able to operate even though some of the servers involved have crashed or are not reachable. And Raft can do this because all the servers are doing the same thing. 
So we don't need all of them to participate. We only need a majority. In two-phase commit, however, the participants are not at all doing the same thing. The participants are each doing a different part of the transaction. A may be incrementing record X, and B may be decrementing record Y. So in two-phase commit, all the participants are doing something different. They all have to do their part in order for the transaction to finish. We really need to wait for every single one of the participants to do their thing. Raft is replicating. It doesn't need everybody to do their thing. In two-phase commit, everybody's doing something different that has to get done. Two-phase commit does not help at all with availability. Raft is all about availability. It can go on even if some of the participants are not responding. Two-phase commit is actually not at all highly available. If anything goes wrong, we risk having to wait until that's repaired. If the transaction coordinator crashes at the wrong time, we simply have to wait for it to come up and read its log and send out the commit messages. If one of these participants crashes at the wrong time, if we're lucky, we simply have to abort. And if we're not lucky, we have to keep asking, did you finish that? Did you finish that? So two-phase commit is not at all about high availability. In fact, it's quite low availability as such things go. Any crash can hold up the whole system. And of course, Raft doesn't ensure that all the participants do whatever the operation is. It only requires a majority. There may be a minority that totally didn't do the operation at all. And the fact that in Raft all the participants do the same thing, and we don't have to wait for all of them, is why Raft gets high availability. So these are quite different protocols. It is, however, possible to usefully combine them. Like, two-phase commit is really vulnerable to failures. 
It's correct with failures, but it's not available with failures. So the question is, could you build some sort of combined system that has the high availability of Raft through replication, but has two-phase commit's ability to cause various different parties each to do their part of the transaction? And the construction you want, actually, is to use Raft or Paxos or some other protocol like that to individually replicate each of the different parties. So then for this setup, we would have three different clusters. The transaction coordinator would actually be a replicated service with three servers. And we'd run Raft on these three servers. One would be elected as a leader. They'd have replicated state. They'd have a log that helped them replicate. We'd only need a majority of these to be up in order for the transaction coordinator to do its work. And of course, they would all then execute through the various stages of the transaction and the two-phase commit protocol, basically by appending relevant records to their logs. And then each of the participants would also be a Raft replicated cluster. And they would exchange messages back and forth. We'd send a commit message from the replicated transaction coordinator service to the replicated A server and the replicated B server. And this is admittedly somewhat elaborate. But it does show you that you can combine these ideas to get high availability, because any one of these servers can crash and the remaining two keep operating. Plus, we get this atomic commitment, where A and B are doing completely different parts of the same transaction. And we can use two-phase commit to have the transaction coordinator ensure that they either both commit the whole thing or they both abort their parts of the transaction. 
You'll actually build something very much like this as part of Lab 4, in which you will indeed build a sharded database where each shard is replicated in this form. And there's basically a configuration manager which will allow essentially transactional shifting of chunks, of shards of data, from one Raft cluster to another, under the control of something that looks a lot like a transaction coordinator. So Lab 4 is like this. And in addition, in a little bit, we'll be reading a paper called Spanner, which describes a real-life database used by Google that also uses this construction in order to do transactional writes to a database. All right, thank you.