So this is the second lecture we have on distributed OLTP systems. Remember, I said it was originally slated to be one lecture, and then I split it up into two. On Wednesday we'll discuss distributed OLAP analytical systems, and we'll sprinkle a discussion of cloud environments, cloud databases, into that lecture. So here's what you have coming up to finish the semester. Next Monday in this room we'll have the lead engineer from VoltDB come give a talk about their distributed in-memory database system. That'll be a nice way to end the semester, because they'll touch on things we've talked about throughout the entire semester. They'll also talk about being an in-memory database system, so if you're taking the advanced class next semester, it'll be a nice segue into that material. But VoltDB is also a distributed database system, so the talk will cover a lot of what we're discussing in this class and the last one. Then on the last day of class we'll have the final review, and I'll do the systems potpourri. So if you haven't yet, go online and vote for what database system you want to learn about, and I'll spend 10 or 15 minutes giving a crash course on it. I haven't looked to see who's currently winning. I'll post this on Piazza as well: we'll do an extra credit check. Some of you have gotten started; there was a bug in the system that I fixed yesterday on the website. If you want me to look at your assignment before you turn it in, so you can try to get full credit, I'll set up a way to do that next week, and I'll send you email with feedback like: this doesn't look right, this looks right, what about this, what about that. For the remaining database talks this semester: in class we'll have the VoltDB talk on Monday, and that same Monday afternoon they'll give a more research-oriented talk in the database group meeting at 4:30 in the Gates building. Then this Thursday we have the founder of a German database hardware accelerator company coming to give the final lecture in our seminar series on hardware-accelerated databases. Swarm64 builds an FPGA that does query acceleration. So think of the things we've talked about this semester, how to do predicates, how to do joins, and so on; they push that into hardware to speed things up. The idea itself is not new, but they're a new startup actually trying to build one of these things, so that should be interesting. Last class, when we were discussing distributed OLTP systems, we covered three main topics. The first was the system architecture: how we organize the nodes of the distributed system in terms of what is local to each node, whether each node has its own local disk and local memory, or whether they share a disk across the entire fleet or cluster of machines, and how we coordinate all of that together. So: shared memory, shared disk, shared nothing. Then we talked about how to do partitioning, also known as sharding in the NoSQL world. And we talked about hash partitioning, where you pick one or more attributes or columns, take those values for every single tuple, hash them, and then assign each tuple to a partition.
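To make that recap concrete, here's a minimal sketch of hash partitioning in Python. The function name, the key choice, and the fixed partition count are illustrative assumptions for this example, not anything from the lecture; real systems also have to handle re-partitioning when nodes are added or removed.

```python
import hashlib

NUM_PARTITIONS = 4  # illustrative; think of this as the cluster's node count

def partition_for(key: str) -> int:
    """Hash the value of the partitioning attribute and assign it to a partition."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Every tuple with the same partitioning key lands on the same partition:
assert partition_for("user_42") == partition_for("user_42")
print(partition_for("user_42"))
```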
You're breaking up a single giant database into individual pieces on different nodes. Logically it still looks like a single global database, but underneath the covers it has been physically broken up. And then we finished off talking about how to do transaction coordination. The two main choices we have for this are a centralized model or a decentralized model. The centralized model is where everyone goes through some kind of middleware or coordinator that's in charge of figuring out whether transactions are allowed to commit, or of acquiring locks on the different nodes in our cluster. A decentralized model is one where the nodes figure things out on their own. So it roughly looks like this in the decentralized model. I'm showing this one because it's what we're going to focus on today for the most part, but the things we'll talk about can also be applied to a centralized model; I just think it's easier to reason about in a decentralized one. We have an application server that wants to access data at these three partitions. It's going to pick one of these partitions to be the base partition, the coordinating partition, the home partition. This is the node that's responsible for the execution lifetime of this transaction. So we go to the first node and say, hey, we want to start a transaction. Once we get the acknowledgement that we're ready to go, the application server can send whatever query requests it wants to the different partitions. And when it wants to commit, it sends a commit request to its base partition. That base partition is then responsible for communicating with the other nodes it knows were involved in the transaction to ask: is this transaction allowed to commit? So last class, this last step here, "safe to commit", is what I was super vague about, and that's what we're going to talk about mostly today. We need a way to ensure that if our transaction says, hey, we want to commit, we go ask everybody else that was involved in the transaction, or could see its effects: is this transaction allowed to commit? And we need a way to make sure they all agree that we're going to commit, and that once we agree, we never fall back and have a weird reversal of that change. When everything works super fast and all your connections are reliable, this is not that hard. But in a distributed environment, where the network is another resource you have to account for in your protocols, this coordination becomes tricky. Because what happens if a node fails while we're trying to commit? What happens if the messages asking, hey, is it safe to commit this, show up out of order or show up late? Furthermore, what happens if we don't actually wait for all the nodes to agree that it's safe to commit the transaction? Some systems will actually let you do that; what are the implications? So this is what we're going to focus on today. We're going to focus on atomic commit protocols. Again, that's the method we use to get everyone to agree that it's okay to commit a transaction, or that we need to roll back. Then we'll talk about replicated environments: how we make copies of pieces of data on different machines or different nodes.
And then we'll talk about consistency issues with the CAP theorem, and we'll finish off talking about federated databases. I realize I'm going through this very quickly, and at a fairly high level. We're trying to condense into three lectures what could be an entire year on distributed databases. The idea is not that I'm going to teach you everything you need to know to go off and build your own distributed database. My job is to expose you to the concepts, issues, and difficulties of distributed databases, so that if you're ever out in the real world and you think you want to build one, or you think you want to use one, you know what issues you should be thinking about. There's no distributed database course at CMU, just because I'm the only database systems person here, so this is the best you're going to get. But again, as I said last class, everything we talked about up to this point in the semester about single-node systems is still applicable here. You still have tuples. You still have indexes. You still have transactions. All of that still matters. But now, getting everyone to agree whether it's okay to commit a transaction, that's the hard part we're adding on. Okay, so an atomic commit protocol is what a distributed database uses, when a transaction spans multiple nodes, to get them to agree that the transaction is allowed to commit. Because what we don't want is this: we're involved with three nodes, one says yes, we want to commit this, the other two say we don't, and for some weird reason we decide to commit it anyway. That puts us in a bad state; we would be hurting the integrity of the database, because the transaction shouldn't have committed, but for some reason we committed it. If you've taken a distributed computing course, in their parlance they call this a consensus protocol. In databases, it's usually referred to as an atomic commit protocol. It's essentially the same thing. So there's a bunch of different algorithms or protocols you can use to get this atomic commit property in a distributed setting. The most famous one is probably two-phase commit; that's what we're going to focus on here, and it's what most distributed databases actually implement. There's also a three-phase commit from the 1980s, invented by Mike Stonebraker, the guy that invented Postgres. Nobody actually uses it; it's too slow. The other one you may be familiar with, or have heard of before, is Paxos. The way to think about the relationship between two-phase commit and Paxos is that two-phase commit is a subset of Paxos; they're actually equivalent in some properties. Then there are other things like Raft, ZAB, and Viewstamped Replication. Viewstamped Replication actually came before Paxos and all these other ones. Raft came out of Stanford; it's designed to be more understandable than Paxos, which is debatable. But the ones we're going to focus on are these two here, because again, most distributed databases implement one of them, or a combination of the two. So two-phase commit is exactly as it sounds: it has two phases.
So say our application server executed a transaction on these three nodes, and then it sends the commit request to its base partition, the base node here. In the vernacular of two-phase commit, the node responsible for communicating with the other nodes to determine whether it's safe to commit the transaction is called the coordinator, and the other two nodes are called the participants. The coordinator has to know somehow that these other participants were involved in the transaction, so it only sends messages to those nodes. If it doesn't have that information, it has to broadcast to every single node in the cluster, and if we only touched a subset of the machines, most of them will say: I've never heard of this transaction, I have nothing to do with it, ignore me. So somehow the coordinator knows it needs to talk to these two nodes. The first phase is called the prepare phase. It's just a message to the participants saying: hey, this transaction wants to commit, are you prepared to do that? Is that okay? And they can respond with either a yes or a no, a success or an abort. In this example we'll assume the transaction is safe to commit, so they come back with OK after the prepare phase. But the coordinator node has to block and wait until it gets the acknowledgement, the OK message, from every participant in the transaction. If the first one sends its vote and the second one gets delayed, we can't proceed into the next phase, because we have to wait until we hear back from everyone. This is an important distinction between two-phase commit and Paxos: in two-phase commit, every single node involved in the transaction has to agree to commit it; in Paxos, as we'll see in a second, you just need a majority. Once we get back all the acknowledgements, the prepared messages, we enter the next phase and send a commit message. Same thing: all these nodes say, all right, I said this was okay to commit, now I'm being told to commit, so let me go ahead and commit and send back an acknowledgement to the coordinator. At this point, the transaction is considered committed, and we can send an acknowledgement back to the application server. So we successfully committed this transaction. One additional thing I'm not showing here, and I don't know whether our textbook covers this, but if you read textbooks on two-phase commit protocols: at every single step along the way, for every message we receive and every acknowledgement we send back, every request and response at every single node, we're actually logging this to disk. Because if we crash and come back, we want to know: all right, I was involved in this transaction, I saw the commit message and said it was okay to commit. Did that actually happen? What do I need to do? So you write out these commit messages and prepare messages in your write-ahead log, the same way you would with regular transactions. So let's look at an example of an abort. Again, the application server says we want to commit, and we send out a prepare request to everyone: hey, is it okay to commit this transaction? But let's say this bottom node here, for whatever reason, we don't know, we don't care, says no, we don't want to commit this transaction, we want to abort.
As soon as the coordinator gets one abort message from any participant in the transaction, it knows the transaction cannot commit. So it can immediately send the abort message back to the application server, and then it sends out abort messages in the second phase. So even if this node up here said, it's okay for me to commit this, it can't actually commit until it sees the commit message in the second phase. Instead it sees an abort and says: all right, even though I said I wanted to commit, we can't commit this, we're going to abort this transaction. Yes? The question is: I said every single message we send over the network has to be logged to disk; upon recovery, would that be treated as an undo or a redo? It depends on what phase you are in. Going back here: say I got the commit request, but before I sent back my acknowledgement, I crashed. I'd come back and say: all right, I was involved in this transaction, the last message I saw was the second-phase commit, therefore I'm allowed to commit; I should make sure that actually happened. So this is multiple round trips in order to commit a transaction, but we have to do it, because we need to make sure everyone agrees this is the right thing to do before we're allowed to go ahead and do it. There are some optimizations, though, where you can send the acknowledgement back to the application server earlier, or start the voting phase of two-phase commit earlier than the point where we actually say commit. Two-phase commit is old; it's been around since, I think, the late 1970s or early 1980s, and there have been a bunch of proposals and papers along the way showing how people actually implement it in practice. As I said, the textbook says you log everything and are super careful about everything you do, but in practice nobody actually does that. So the two optimizations are early prepare voting and early acknowledgement. In early prepare voting, the way it works is: if I know, as I'm sending queries to a participant node, that this is the last query I'm ever going to send to that particular node, I can piggyback a message and say, hey, this is the last query I'm going to send you, so also tell me whether it's okay to commit this transaction. What is your prepare vote? Now we don't need a separate round trip for the prepare phase. We send our query request along with the prepare-phase message, and the node sends back the result of that query plus its vote for the prepare phase of two-phase commit. That way, when I enter two-phase commit, I don't need to wait for a round trip from the coordinator to that participant, because I already know whether it's going to vote to commit or not. The other optimization is that instead of waiting for all the nodes to commit the transaction at the end of the second phase, you send the acknowledgement back to the application server at the end of the first phase. As long as everybody agrees to commit, and everything has been logged to disk, at that point the transaction is considered committed. And if I crash, I can always come back and put myself back in the correct state. So the second optimization looks like this.
The application server says: I want to commit. We send out the first phase, the prepare phase. All our participants come back and say: yes, this is okay. At this point here, if everything is durable on disk, my transaction is considered committed, and no matter what, the effects of this transaction are durable on disk. So I don't need to wait for the second phase to complete before I tell the application server the transaction successfully committed. This is early acknowledgement: everyone agrees we can commit this transaction, so at this point the transaction is committed; why wait for the second round? I still have to do the second phase so that everyone knows the transaction successfully committed, but the application server doesn't need to wait for that. Another advantage, thinking about two-phase locking: at this point here I can release all the locks for my transaction, because I know I'm going to be able to commit. Rather than waiting until the very end, after the second phase, I can release everything right away. Yes? The question is: in this case, I send back the success message early, after the prepare phase; how do I make sure the transaction is truly committed, that everything is durable on disk? Again, I've logged all these messages in my system. So if I crash and come back, I look in the log and say: well, everyone agreed we should commit this transaction, therefore the transaction is committed. Now I just complete the process as I would going through the second phase. Yes? The next question is: what if I fail at this point here, say the whole cluster goes down? (We'll talk about a single node going down in a second.) The whole cluster goes down, I come back and look at my log and ask: what was I doing at the moment of the crash? Oh, I was in the middle of the two-phase commit process for this transaction, and look, everyone said it was okay to commit it. So now let me make sure everyone knows we're going to go ahead and commit this transaction. Whether the recovery is done globally through the coordinator, or every node comes back up and figures out on its own what the correct state should be, we're not going to discuss here; there are different ways to do both. Yes? He brings up an excellent point: isn't this more useful when the nodes are replicas rather than partitions? Again, I'm just drawing a node here; I didn't say whether it was a replica or a partition, and I'm being vague on purpose. So why do you think it would be different for a partition versus a replica? His answer: because if they're partitions, and the data was not actually committed, the application still cannot access the data on those partitions; but if they're replicas, the data could be read from the nodes that already committed. So, if this is partitioned data, what's the failure scenario? One node goes down.
Because in the case where you've already acknowledged to the application that the transaction committed, if the application now tries to access the data on that partition, you don't know whether it's actually committed there yet. Yes, sorry, his statement is: at this point here, if someone tries to read data here, this node knows it told the coordinator it wanted to commit, but it doesn't know whether the transaction actually committed, because it hasn't gotten the second-phase message yet. Right. And his point was: if it's replicated, you could figure that out by looking at other nodes; if it's partitioned, you can't, because you're only looking at your local state. Right. So you'd have to do some extra work when someone tries to read data from this transaction: I said it should commit, but it hasn't actually committed yet, because I haven't heard back. I could either block and then let you read it, or let you do a speculative read and not allow the second transaction to commit until I get the final acknowledgement. There's extra bookkeeping you have to do underneath the covers to handle that. His next statement is: this doesn't improve availability. That's right; this optimization is not for availability, it's to reduce the latency of sending back the acknowledgement. We still go through the full two-phase commit process; it's just a question of when we tell the application server whether it committed or not. Yes? Her question is: is the second phase actually necessary? Oh, yes, you need phase two. But do you actually need to wait for it to come back? When you say "you" don't need to wait, who is "you": the coordinator, the whole machine, or the application? Right, so waiting is the textbook definition of two-phase commit. What I'm saying is you don't have to: with early acknowledgement, at this point here it doesn't matter whether I've sent out the second-phase messages yet, I can tell the application server the transaction committed. Are you asking about the homework? I'll have to look at what I put in the homework; I don't remember. I try to be very specific about different parts, so I'll double-check. As I said, we have to log all the messages as we go along. In my scenario here, I always knew whether to commit or abort, because everything was always up. So what happens if the coordinator crashes while we're running the transaction? It depends on what phase we were in, but now it's up to the participants to figure out: hey, we were involved in this transaction, the node we were talking to as our coordinator is gone, what the hell do we do? One option is to elect a new coordinator: one of the participants becomes the new coordinator and reconciles things that way. In practice, what most systems do, the easiest thing, is to just kill the transaction and abort. Again, it depends on what phase you're in. What happens if one of the participants crashes? Same thing: the coordinator treats a node crashing as equivalent to the node sending an abort message. And because in two-phase commit all the nodes have to agree it's okay to commit the transaction, if one of them disappears, we just abort the transaction outright. All right?
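To tie the protocol together, here's a minimal sketch of the coordinator-side logic in Python. The `send` callback and the `wal` list are stand-ins I've invented for real network and write-ahead-log machinery; this is an illustration of the protocol shape described above, not any system's actual implementation.

```python
def two_phase_commit(wal, participants, send):
    """Coordinator-side sketch. `send(node, msg)` returns that node's reply;
    `wal` stands in for the coordinator's write-ahead log."""
    # Phase 1: prepare. The coordinator blocks until EVERY participant votes.
    wal.append(("PREPARE", tuple(participants)))        # log before sending
    votes = [send(p, "PREPARE") for p in participants]  # each reply: "OK" or "ABORT"

    if all(v == "OK" for v in votes):
        # Unanimous yes: the outcome is now decided. With the early-acknowledgement
        # optimization, the application could be told "committed" right here,
        # before phase 2 finishes.
        wal.append(("COMMIT",))
        for p in participants:
            send(p, "COMMIT")   # Phase 2: everyone applies the commit
        return "COMMITTED"

    # A single "no" vote, or a crashed participant (which the coordinator
    # treats the same way), aborts the whole transaction.
    wal.append(("ABORT",))
    for p in participants:
        send(p, "ABORT")
    return "ABORTED"

log = []
print(two_phase_commit(log, ["node2", "node3"], lambda node, msg: "OK"))  # COMMITTED
```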
So as I said, two-phase commit is, as far as I know, the protocol most used in distributed databases. The other one you may be familiar with is Paxos. Do they cover that in the intro distributed systems course? Yes. So Paxos is a consensus protocol; again, consensus protocol is what the distributed computing literature calls it, and in databases we call them atomic commit protocols, but it's the same thing. A coordinator sends out a proposal for whether we want to commit a transaction, just like before, and all the participants vote on whether it's allowed to commit. But unlike two-phase commit, where everyone has to agree, under Paxos we only need a majority. If one node disagrees because it doesn't want to commit the transaction, but everyone else says they do, that node has to fix itself, reconcile the issue, and correct itself. Another way to think about this: if a node goes down and we still have enough information about the transaction, we don't have to abort it; we can keep on running. Paxos was invented by Leslie Lamport. He's a senior research fellow, or whatever his title is, at Microsoft Research, and he won the Turing Award a few years ago, in part for this work. The paper that described Paxos is this one here, called "The Part-Time Parliament". It was originally written around 1990. The way the story goes, according to Leslie Lamport and others, is that he submitted it and the reviewers rejected it. So he put it on the shelf, or in his desk, and left it there, because no one would understand or appreciate it. Then people started publishing papers that looked a lot like Paxos, but weren't quite solving the problem he had solved. So he finally said: no, no, I actually solved this problem already; here's the paper I wrote years ago, and it finally got published. If you read it, it's amazing; it's not written like a regular computer science paper. It's written as if an archaeologist were studying an ancient Greek civilization on an island called Paxos, and it describes how their parliament did voting on tablets, leaving them in a hole or whatever and coming back later to figure out what was decided. The issue, he claimed, was that the reviewers wanted him to rewrite the paper and remove all the archaeological story, and that's why it got rejected. The truth, supposedly, was that it needed a proof at the end to explain what was actually going on, and he refused to change any part of the paper, which is why it was withdrawn from publication early on. I've never met Lamport, but I know one of the people involved in getting this paper published, and they have differing stories about what Leslie Lamport says happened and what really happened. Okay, so for Paxos, our transaction is going to have four nodes involved, because we need a majority vote to agree to commit a transaction. Just like before, the application server goes to the first node and says: I want to commit. In Paxos, instead of a coordinator, this node is called the proposer, and the other nodes are called acceptors. Same concept, same idea.
The proposer goes out and says: hey, I propose to commit this transaction, just like the prepare phase. And let's say this second node says: no, I'm down, or I crashed, or I don't want to commit this transaction. Under two-phase commit, we'd have to stop everything right then and there, abort the transaction, and retry it. Under Paxos, though, as long as a majority of the nodes agree to commit the transaction, the protocol is allowed to proceed to the second phase, where we commit the transaction. The dissenting node has to come back later and say: all right, I was either down or I didn't want to commit this transaction, but everyone overruled me; let me go correct myself. Another way to think about it is in terms of this timeline. In my example here, it's always been one node proposing to commit a transaction. But under the Paxos protocol, you can have multiple nodes trying to commit transactions at the same time, and they may interfere with each other. Paxos has a way of resolving this: there's a logical counter that is always marching forward in time. Every time you want to propose to commit a transaction, you add one to this counter and say: I'm trying to commit this transaction at this particular time. So say we propose to commit a transaction at time n. The acceptors say: yeah, that's okay, we agree to commit it at time n. But now another node comes along and says: hey, I want to commit a transaction at n+1. What happens is, when the first proposer says, all right, everyone agreed, let me go ahead and commit this, the acceptors come back and reject it: we can't commit n, because we've seen n+1, a higher proposal. They instead agree to commit the n+1 proposal, and it can go ahead and commit. The other proposer has to come back later and say: either this is the same transaction and I'll try to commit it again, or I re-execute it and here's new information about it. This will make more sense in a second when we talk about replication in a multi-master setup, where you can have transactions accessing the same data on two different nodes, two different copies of the same data, both trying to commit, and you need a way to reconcile that. Yes? The question is: does this actually apply to partitions? Absolutely, yes. Next question: wouldn't forcing the minority to commit, when the majority says commit but one node says it doesn't want to, cause inconsistencies? Well, no. Because again, there's concurrency control running on top of this; you have to make sure you can only read from committed transactions, and all of that still applies from before. The node that says, I don't want to commit, gets overruled, and it's its job to figure out how to put itself in the correct state for committing that transaction. The follow-up point: so when you implement this, you have to write the rules for resolving that situation.
Right: in the implementation of the database system, you have to write code that can recognize: oh, I did not want to commit, but I got overruled and we committed this; let me go reconcile that and fix myself up. Yes? This is a good place to point out the difference between what we're talking about here and blockchains. In this environment, we assume all of our nodes are friendly, meaning we control them; they're running our database system software that we trust, and they're doing what we want them to do. So yes, I got overruled, but I'm part of the same distributed database, I'm here to help, so let me go correct myself. A blockchain, so to speak, is a distributed database where you don't trust the other nodes, and therefore you can't just assume that if I say abort and you say commit, we're both going to end up doing the right thing. That's why blockchains need extra protection mechanisms to get everyone to vote that it's okay to make a change. The next question is: what if the abort is caused by an inconsistency on that partition? Can we always assume an abort can be resolved? (And just as a point of English: "abortion" is technically the right word, but people get upset about that, so say "the abort", not "the abortion".) I mean, in what scenario would an abort not be able to be resolved? Going back here: say this node votes to abort and gets overruled. It gets the commit message and says: I wanted to abort, but I got overruled, so let me go ahead and commit whatever I was going to do. I think the issue being raised, and I talked about this last class, is that during the commit phase we could be doing checks for integrity constraints, and we might recognize: oh, committing would violate this constraint, therefore we can't commit the transaction. So if I then get overruled and have to commit anyway, that would put my node in an incorrect state. And he's right; I think I'm blurring the lines between concurrency control and the atomic commit protocol, the consensus protocol. His point is that you would have found the violation earlier, before proposing to commit the transaction. If I try to insert something that shouldn't be inserted, then at the moment I send the query over to do that incorrect thing, I would have gotten an abort; you wouldn't check it later. There are global asserts defined in the SQL standard that can be fired on commit, but you don't want to use those. So his follow-up is: if, before sending the propose or prepare message, you already know that every other node is okay to commit, then what are we voting on?
So his statement is: if I'm claiming that at the moment we start Paxos or two-phase commit I haven't violated any integrity constraint, because I caught them all earlier, why do we still need Paxos, or any commit protocol? Couldn't we do this in one phase? That's definitely not true, and when we talk about replication in a second we'll see why. And as someone else said: you can fail at every single step along the way, and the protocol is being super careful to make sure everyone agrees on the state we should be in. Okay. Right, so this example here is as defined in the original Paxos protocol: you can have multiple proposers at the same time, you keep moving this logical clock forward, and you make sure that if someone comes along with a newer proposal than yours, you roll back and don't commit your transaction. Yes? The question is: why, back here, do the acceptors reject n as soon as they've seen n+1? I won't get into all the details, but think of the state of the database as a log, where we're appending the changes we're making: n, n+1, n+2, with each entry saying the state of the database is now n, now n+1, and so on. The idea of rejecting n once we've seen n+1 is to make sure we don't end up in a weird state where some of the nodes think we're at n and some think we're at n+1. Follow-up: so if proposers continuously propose transactions to the acceptors, only the last one will eventually be accepted? Correct. And if I have multiple proposers all proposing, commit this, commit this, commit this, they're essentially clobbering each other, and I could end up starved, because nothing ever gets done. The solution to this is called Multi-Paxos: you run a leader election to pick one node to always be the proposer, and you give it a lease and say, you are the proposer for the next minute or so, and you're the only one allowed to propose anything. Now you don't have the issue of another node proposing at the same time as you, because all proposals go through this one node. That solves the starvation problem, and if the leader fails, you just run regular Paxos to elect a new leader. Yes? The question is: is there a reason why the newer proposal always wins over the older one? Why not let the older one win? The intuition is the same thing we talked about with latches: if you always go in one direction, you don't have to worry about things coming from different directions and screwing you up. Could you always let the older one win instead? Honestly, I don't know the answer to that; it might work. But here's the problem a student points out: suppose I always accept the older one. I've seen n+1, but I let n proceed and reject n+1 instead. Now the proposer of n crashes, and it's going to take a while for the acceptors to figure out that it crashed; meanwhile nothing can commit. You can add a timeout and say, if I don't hear back from you in 10 seconds, I consider it aborted, but that just adds complexity. Okay. So again, many of the older distributed databases use two-phase commit, while the newer NoSQL systems and some of the newer distributed databases use Paxos, or Raft, or some variant of it. And again, the key distinction: in two-phase commit, if the coordinator fails, we have to block until we figure out what the hell was going on. Paxos can be non-blocking, as long as a majority agrees on the change we want to make to our database.
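Here's a minimal sketch of that majority-vote rule with proposal numbers, in Python. The dictionary-based acceptors and the function names are assumptions made up for this illustration; it shows only the promise/reject behavior discussed above, not a complete Paxos implementation.

```python
def propose(acceptors, proposal_num, value):
    """One proposer round: ask every acceptor to promise this proposal number."""
    promises = 0
    for acc in acceptors:
        # An acceptor promises only if it hasn't already seen this proposal
        # number or a newer one; once it has seen n+1, it will reject n.
        if proposal_num > acc["highest_promised"]:
            acc["highest_promised"] = proposal_num
            promises += 1
    # Unlike two-phase commit, a majority is enough to proceed; a down or
    # dissenting minority must reconcile itself with the outcome afterwards.
    if promises > len(acceptors) // 2:
        return ("ACCEPTED", value)
    return ("REJECTED", None)

acceptors = [{"highest_promised": 0} for _ in range(4)]
print(propose(acceptors, 1, "commit txn A"))  # ('ACCEPTED', 'commit txn A')
print(propose(acceptors, 1, "commit txn B"))  # ('REJECTED', None): 1 is now stale
print(propose(acceptors, 2, "commit txn B"))  # ('ACCEPTED', 'commit txn B')
```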
All right, moving along: replication. This is probably what you'll encounter most with distributed databases. Whether you'd even call it distributed: if you take MySQL or Postgres and replicate it onto another machine, is that still a distributed database? In my book, yes, but it's not the shared-nothing partitioned distributed database most people think of. The idea with replication is that we make redundant copies of objects on different nodes to increase availability: if one node goes down, there are others available that can continue to serve requests. The tricky part is how we keep these replicas in sync and how we propagate changes between them. So there's a set of design decisions we have to discuss for building a replicated database, and they apply in both partitioned and non-partitioned systems. I'm being deliberately vague about what we're replicating; I'm just saying "object". It could be a tuple, a table, or a partition; everything we talk about here is still germane. This is also another area in databases where the terminology is not standardized: different systems call the same replication schemes different things. But let's try to distill what they actually are. The first design choice is how we configure the system with replicas, and the two approaches are master-replica and multi-master. Master-replica is sometimes called leader-follower; traditionally it was called master-slave, but people don't use that term anymore. The idea is that all updates for an object go to a single master node, and that master node is responsible for propagating those changes to its replicas, so it has to know where its replicas are. Depending on how we configure the system, read-only transactions can go to the replicas; we can offload the read requests to them. But all the write requests always go to the master. If the master node goes down, we run Paxos to do a leader election: propose that this replica is now the new master, and if everyone agrees, it gets promoted. Multi-master is where the object is replicated in different locations and transactions can update it at any one of them, and underneath the covers, the replicas are responsible for synchronizing the changes with each other as needed. So let's look at these visually. In the first one, master-replica, we have one partition, P1, on the master, replicated to two replicas. All writes, and some reads, go to the master node, and in the background it propagates the updates to the replicas. If I have a read-only transaction and I'm okay with possibly reading stale data, depending on how I do the replication, I can send the read-only transactions to the replicas. You want to do this because the master node then only has to absorb the writes, which are expensive, and your reads are offloaded to the other machines. Under multi-master, I have a single partition replicated on two nodes, and any read-write transaction can go to either node. We have to use two-phase commit, or whatever consensus protocol we choose, underneath the covers to keep the two in sync. Master-replica is probably the most common setup you'll see out there. Unless you're running Postgres with something like CitusDB, one of the extensions that turns it into a shared-nothing system, most Postgres installations look like this: one machine as the master, plus a hot standby as your replica; if the master goes down, you promote one of the replicas to be the new master. Another concept we have to consider in a replicated database environment is K-safety. I don't know if this is part of the ANSI standard; it's the term Mike Stonebraker used for VoltDB and his other systems. K-safety is a threshold we set in the system to determine how tolerant we are to failures: K is the minimum number of copies of the data that must remain available, and if we drop below it, we take the system offline because we've had too many failures. So going back here, say my K-safety is 2 for this setup: I have three copies of the partition, one master and two replicas. If one of these nodes goes down, with K-safety 2 I can keep on running, because I still have two copies of the data available. But if a second node goes down, the system will say: I don't have two copies of this data, I'm going to halt all execution until that gets resolved. The idea is to be overly careful about making sure you never end up losing data.
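As a quick illustration, here's a minimal sketch of that K-safety check in Python, assuming the system tracks how many live copies of each partition exist; the names here are made up for this example.

```python
K_SAFETY = 2  # minimum number of live copies we insist on

def can_keep_running(live_copies: dict) -> bool:
    """Halt the whole system if any partition drops below K live copies."""
    return all(count >= K_SAFETY for count in live_copies.values())

print(can_keep_running({"P1": 3}))  # True: master plus two replicas
print(can_keep_running({"P1": 2}))  # True: one node down, still two copies
print(can_keep_running({"P1": 1}))  # False: halt until the failure is resolved
```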
Because what you don't want to happen is this. With K-safety 2, I do a write here and the write gets propagated here. If this node goes down, I still have two copies of every write at these two locations. But suppose I keep accepting writes with only one copy left: I do writes here, I tell the outside world my transaction committed, but then this machine catches on fire and melts. Now I bring the data back up on these two replicas, and whatever writes I did on that machine that didn't get propagated are gone forever, because the machine melted. So K-safety is an extra protection mechanism you can set in the system to say: keep running only while I still have at least two copies, or one copy, or whatever threshold you want. So now we need to talk about how we actually propagate these changes between replicas. An easy way to think about this, not the only way, but the easiest, is that you're streaming the write-ahead log contents to the other machine. Sometimes it's called the bin log, sometimes the op log; different databases call it different things, but it's the same idea. All the log records you generate as you apply changes to the database on one machine, you stream out to another machine, and it applies them as if it were in recovery mode. The question, though, is how long we have to wait, and what we have to wait for, before we tell the outside world the transaction is safely committed and replicated with our K-safety factor. The three choices are synchronous, asynchronous, and semi-synchronous, and this is where you start to get into what makes NoSQL systems different from traditional transactional distributed databases. Synchronous: say you're in a master-replica setup where the master sends updates to its replicas. The master has to wait until those replicas acknowledge that they committed the changes and wrote them out to disk before it sends the acknowledgement back to the application server. So say I have a master and a replica. The application server says: we want to commit. We send a commit-and-flush request to the replica; think of this as being part of a two-phase commit. I have to wait until the replica says: okay, I've actually flushed it. The replica flushes everything to disk, sends back the acknowledgement, and then we send the acknowledgement to the application server. The reason we do this is that if the replica crashes at this point, we don't know whether it actually saw those changes yet, so technically our transaction has not committed. We don't want to tell the outside world we've committed and then come back, try to read, and not see any of our changes. Asynchronous is when you don't wait at all. I commit on this node, I say, hey, go flush to my replica, but I don't wait for anything; I don't even care whether you got the message. I immediately send an acknowledgement back to the application server: your transaction committed, it's durable. At some later point, the replica ends up flushing. This is what NoSQL systems use in many cases, and this is what eventual consistency means: I apply a change to my master node, and the replica will eventually get it. There have been studies measuring how long that actually takes; depending on what the network looks like and how far away the replica is, it's roughly 50 milliseconds in some systems. But that means there's a 50-millisecond window where I could do a read on the replica and not see something I thought I committed; I'm seeing inconsistent, incorrect data.
The third approach is called semi-synchronous. I know this is used in MySQL, and it's what MySQL calls it; I don't know whether it's available in other systems. The idea is that rather than waiting for the replica node to flush everything to disk, as long as I get an acknowledgement that the replica received my messages to commit this transaction, that's good enough for me. So I want to commit, I send a flush request to my replica, and I wait until the replica comes back and says: I've received your flush request, I will eventually do it. As soon as I get that acknowledgement, I send the acknowledgement back to the application server, and at some later point the replica actually flushes to disk; only then is the change truly durable there. If the master crashes after that point, we know our data will still be there. So this is trying to bridge the gap between synchronous and asynchronous. Semi-synchronous basically means: my message made it over there, I know the replica has it in memory, but it's not durable on disk yet, and I'm making a trade-off by assuming the replica is not going to go down in the time it takes to flush to disk, maybe 5 to 10 milliseconds depending on the speed of your disk. So I'm allowing a window where I've told the outside world my transaction committed, but it hasn't been durably replicated yet. This is an important concept in distributed databases: the design of the system, or how you configure it, affects what correctness and durability guarantees you actually get, and you trade off having things fully replicated and committed everywhere against better performance. Yes? The question is: what is the relationship between Paxos and this replication scheme? In this case here, it's master-replica, so the master decides on its own whether to commit; it doesn't need to ask the replica, can I commit this, it just tells it: commit this. If I'm in a multi-master setup, then I do need coordination across the masters, using an atomic commit protocol to reconcile those changes. I should be clearer: this diagram is master-replica, but the concept still applies in a multi-master setup where you just don't wait for things to actually be committed everywhere. The main point I'm making is that if I sacrifice durability and correctness, I can get better performance. This, in addition to giving up the relational model and joins and SQL and other things, is one of the core things about the NoSQL systems. Yes? His statement is: in the worst case, semi-synchronous can be just as bad as asynchronous. Yes: if I'm here, I got the acknowledgement that the replica received the message, but it hasn't actually flushed it yet, and I tell the outside world the transaction committed; the replica could crash, and I come back and it's not there. But the idea is that the window is much smaller than with asynchronous; because the change is already in the replica's memory, the likelihood that it gets lost in that window is very low.
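Here's a minimal sketch, from the master's point of view, of the three propagation choices just described. The `Replica` methods are stand-ins invented for this illustration; real systems stream write-ahead-log records over the network rather than calling methods like these.

```python
class Replica:
    """Stand-in for a connection to a replica; real systems ship WAL records."""
    def send_flush(self):        # returns once the replica RECEIVED the request
        pass
    def send_flush_nowait(self): # fire and forget
        pass
    def wait_durable(self):      # returns once the replica FLUSHED to disk
        pass

def commit_and_replicate(replica: Replica, mode: str) -> str:
    if mode == "synchronous":
        replica.send_flush()
        replica.wait_durable()      # wait for the replica's disk flush
        return "ack"                # no window: change is durable everywhere
    if mode == "semi-synchronous":
        replica.send_flush()        # wait only for receipt, not the flush
        return "ack"                # small (~5-10 ms) window of vulnerability
    if mode == "asynchronous":
        replica.send_flush_nowait() # don't wait at all
        return "ack"                # eventual consistency: largest window
    raise ValueError(mode)

print(commit_and_replicate(Replica(), "semi-synchronous"))
```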
But there is still that 5-to-10-millisecond window where I could be vulnerable, and again, people make this trade-off, giving up some of those protections in exchange for performance. We saw the same thing with concurrency control when we talked about isolation levels. If I run with the serializable isolation level, I'm running with the highest guarantees: everything runs as if it were isolated and I don't see the effects of other transactions. If I relax that and go to a lower isolation level, I get better performance, because I have more interleaving opportunities, but I can be exposed to anomalies and sacrifice correctness. It's a tug of war between performance and correctness, and most people probably don't need the full protection, so something like semi-synchronous might be good enough. The last two things I'll talk about are the timing of when we propagate updates, and where we generate the changes for the database. Propagation timing is pretty simple: when do we actually tell the replica, here are the changes from our transaction? Continuous is what almost everyone does: as the transaction makes changes on the master node, or whatever its base partition or base node is, it generates log records, and in addition to writing them out to disk on my node, I also send them over the network to my replicas. I'm continually shipping those log records. The other approach is on commit, where I buffer all my log messages at the master node, and only when I find out my transaction is going to commit do I actually send them over. You might think this is kind of stupid, but the advantage is that you send fewer messages if you have a lot of transactions that abort, because if a transaction isn't going to commit, you never waste messages shipping its changes. And you're guaranteeing that the replicas never see changes from transactions that don't commit, so recovery is way easier: you only have to redo, never undo. But in practice, as far as I know, everyone does the first one. The last one, and I don't know exactly what to call this, is about where the transaction runs and where it generates the changes that need to be propagated to replicas. People call this different things in different systems, but the two choices are active-active versus active-passive. The way I was taught, and my understanding in the context of VoltDB and H-Store: active-active is where the same transaction runs independently on the different replicas. Say I have a transaction that updates a counter, adding one to some tuple; active-active would run that query at both replica locations, and then they just need to do two-phase commit, or whatever consensus protocol they're using, to make sure they're in sync. Active-passive is where the transaction runs at one replica site, applies the change, and then generates a log record that it ships over to the other location, which applies that change. Note that this is not the same thing as master-replica versus multi-master, because you can have both of these scenarios in either a multi-master or a master-replica database.
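Here's a minimal sketch contrasting the two, using the counter-increment example from above. The dictionary replicas and the `("SET", key, value)` log record format are assumptions made up for this illustration.

```python
def run_active_active(replicas, key):
    # The same transaction logic executes independently at every replica;
    # the replicas then need a commit protocol to confirm they agree.
    for r in replicas:
        r[key] = r.get(key, 0) + 1

def run_active_passive(replicas, key):
    # The transaction runs at ONE replica, which ships the resulting log
    # record to the others, and they apply it verbatim.
    primary, *others = replicas
    primary[key] = primary.get(key, 0) + 1
    op, k, v = ("SET", key, primary[key])   # the shipped log record
    for r in others:
        r[k] = v

nodes = [{}, {}]
run_active_active(nodes, "counter")   # each replica runs the increment itself
run_active_passive(nodes, "counter")  # one runs it, the other applies the log
print(nodes)  # [{'counter': 2}, {'counter': 2}]: the replicas stay in sync
```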
VoltDB is a master-replica database, and they support active-active; other systems do active-active in a multi-master database. Hopefully they'll come next week and make all of this clear. All right, any questions about replication? Yes? You want to go back to Paxos; what's your question? Whether Paxos applies to partitions when a transaction spans multiple partitions, and which phase a single-partition operation would go through? I want to get through the last bit of the slides, so let's talk about this afterwards. Okay, real quickly: who here has heard of the CAP theorem? Less than 50 percent? Okay, good. The CAP theorem is a way to understand the trade-offs we just talked about in a distributed database: which of the problems you can have in a distributed environment is your database going to be resilient to? It was proposed by a Berkeley professor named Eric Brewer. This is an old picture of him with this insane moustache; he looks like a 50-year-old man with it. I've met him in real life; he doesn't have the moustache now, he looks like he's 25, and he's super nice. He proposed this as a conjecture in the late 1990s or early 2000s, and it has since been proved rigorously with theory. He says that if you're going to have a distributed database system, or a distributed system in general, it's impossible for it to be always consistent, always available, and network-partition tolerant all at once; I'll explain what those mean as we go along. So he basically says you can pick two out of three. It's a bit hand-wavy, because "you always get two" is not exactly true; what these things mean in different environments can change. The way to think about it is with this Venn diagram. C corresponds to consistency, linearizability; I'll go through an example where the database is always correct as I apply changes. Availability means I'm always up, no matter how many nodes go down. Partition tolerance means I can still execute operations and keep my data online even though the network may be severed and I'm losing messages. And the middle part is a no-go: no system can be all three. You may sometimes see posts from people claiming they defeated the CAP theorem. People don't do this anymore, but it happened for a while; the NuoDB guy, for example, Jim Starkey, there's some Google Groups posting where he says NuoDB defeats the CAP theorem, and everyone pummeled him. The theorem has been proved correct; you have to make this trade-off. So let's go through the examples. The first one has to do with consistency. Say I have a transaction running on this application server over here, and I want to set A equal to 2. I apply my change, and now I need to propagate it to my other replica using one of the replication schemes we talked about. Once I know that's done, I send back the acknowledgement saying my transaction successfully committed and is copied everywhere. Now, if another transaction on another node comes along and reads A, given that I told the outside world my transaction committed, and I'm doing replication consistently, it is guaranteed to see the correct value of A here. This is the property you want to guarantee in your database system.
Availability means that if a node goes down, then no matter what, I can still execute transactions. So say this node goes down here. I can read B and get a value, I can write something and get a value, and the other application server knows that it can't go to this node, so it'll go to the other one and read A correctly as well. So we're guaranteeing that no matter how many nodes go down, the entire database is still available. Say instead we were doing partitioning, and this node had partition two and this one had partition one, and these are the only copies that I have. If a node goes down, then I have to halt the database, because I don't want people to get false negatives when they try to read data on that partition and it's not there. So in that case, we'd have to shut everything down if we're partitioned and not replicated. In this environment, the database is replicated at both locations, so I still have a copy here that this guy can read from, and I'm still available.

The last property is partition tolerance. The easy way to think about this is that the network goes down so that these two nodes cannot communicate with each other. It doesn't have to be that the network literally goes down; this can also occur if your system is written in Java or uses the JVM. The JVM's garbage collector could have a long pause because you have a huge heap, and it's going to look like your node is down or the network is down; it's just not responsive because it's doing garbage collection. So it doesn't have to be that the network is actually severed; it could be other pauses in the stack.

At this point here, the network has gone down, so these two guys can't communicate with each other. What's going to happen? Well, it's a master-replica setup, so as I said before, the replica is going to recognize, hey, I don't have a master anymore, or I can't communicate with him; let me run Paxos, now I'm the new master. So now each node is going to think that it's the master of this replicated database, and each application server is going to start running transactions. They're going to update both copies, making different changes that conflict, and they both get an acknowledgement that these changes are allowed to happen. But now the network comes back, and I have to resolve this: this guy thought he was the master and made A equal to 2, but this guy thought he was the master and made A equal to 3, and I've told both of the application servers that these transactions are allowed to commit. We obviously don't want that to happen.

So this is what the CAP theorem is saying at a high level: we have to pick two out of three. If I want to be network partition tolerant, then I can't be consistent, because this is not consistent: A equals 2, A equals 3, I have a split brain. If I want to be available, then I can't be consistent in this case here, because these nodes can't communicate with each other, but they're both going to be able to write to the same thing. Okay, so in general, how the distributed database management system handles these failures will determine what aspects of the CAP theorem it's going to support. Traditional relational transactional systems, or this new class of systems called NewSQL systems that I helped build when I was in grad school, these are considered consistent and partition tolerant... sorry, consistent and available, meaning if the nodes can't communicate with each other because there's a network partition, then they stop being available, because they don't want to end up with a split-brain environment.
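Here's a toy sketch of that split-brain failure, the one those CA systems shut down to avoid. It's my own illustration with made-up class names: during a partition, both nodes believe they are the master, both acknowledge conflicting commits, and when the network heals the copies have diverged.

```python
# Hypothetical split-brain sketch: during a network partition, both
# replicas think they are the master and accept conflicting writes.

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.is_master = False

    def write(self, key, value):
        assert self.is_master, "replicas reject writes"
        self.data[key] = value
        return "committed"  # acknowledged to the application

left, right = Node("left"), Node("right")
left.is_master = True          # the original master

# Partition: 'right' can no longer reach 'left', runs leader election,
# and (wrongly, since 'left' is still alive) promotes itself.
right.is_master = True

print(left.write("A", 2))      # app 1 is told "committed"
print(right.write("A", 3))     # app 2 is also told "committed"

# Network heals: both commits were acknowledged, yet the copies disagree.
assert left.data["A"] != right.data["A"]   # A=2 vs A=3 -- split brain
```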
A lot of NoSQL systems have mechanisms in place to actually handle those cases, to handle nodes going down while the system stays available, and they do this in exchange for sacrificing consistency. They have other mechanisms to deal with it: all right, well, I don't care if I lose some writes, or I keep a list of all the writes I've ever done, and they push it onto the application server to figure out how to reconcile the changes and put the database back in the correct state. So going back here: we have A equals 2, A equals 3, the network comes back online, the database system would recognize that these copies are not in sync with each other, and then it would send a message to the application server saying, hey, something bad happened, figure out how to correct us. So some systems push that onto you as the application writer.
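One common shape for that application-side reconciliation is: the store detects divergent versions of a key and, instead of silently picking one, hands them all to a merge function the application supplies. A minimal sketch, assuming a `reconcile` helper I made up purely for illustration:

```python
# Hypothetical sketch of app-side reconciliation: after a partition heals,
# the store notices divergent versions of a key and, rather than choosing
# one itself, hands them back to the application to merge.

def reconcile(versions, merge):
    """If replicas disagree, let the application's merge policy decide."""
    if len(set(versions)) == 1:
        return versions[0]          # replicas agree, nothing to do
    return merge(versions)          # "something bad happened, fix us"

# The application writer supplies the policy, e.g. keep the largest value
# (a shopping-cart-style merge would union the items instead).
merged = reconcile([2, 3], merge=max)
assert merged == 3
```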
So in the last ten minutes I want to talk about federated databases, but real quickly, are there any questions about the CAP theorem stuff? Okay.

So up until this point, when we've talked about distributed databases, we've been assuming that all the nodes in our database are part of the same software package, or whatever you want to call it. If I have distributed MySQL, every single node in my cluster is a MySQL node, and there's some framework or middleware mechanism that knows it's communicating with MySQL and knows how to send queries to the different MySQL machines. Or if I'm doing replication, I can have one MySQL node replicate to another MySQL node, and they natively support that. But in a lot of major organizations, and I'm sure you guys have seen this when you go off and do internships, one database shop will have a ton of different crap: they have MySQL, they have Postgres, they have Mongo, they have Oracle. And what people often want is the ability to have a single interface that can read data from all of these disparate database management systems. So I don't have to say, well, I know my human resources data is in MongoDB over here, and my stock ticker data is over there in Oracle, and I have to write an application server to go get the data that I need and put it all together. What I really want is a single database interface that makes it look as if it's a single logical database, even though underneath the covers it's different software vendors.

So this is what a federated database is: a distributed architecture where you tie together a bunch of different database systems that could be from different software vendors, and to the outside world, to the application server, it looks like you have a single database that encompasses all the data available to you. And you have a single query interface, using SQL as an example: I write a single SQL query, and something underneath the covers knows how to go read data from XML data sources or from Amazon data sources or whatever, and I don't have to worry about how to access these different machines or phrase queries in the particular semantics those systems want. I have a single interface that does it all for me.

This sounds amazing; this would be the holy grail of databases. In practice, though, almost nobody actually does this, because it's hard. Not only do these systems have different interfaces, like Mongo doesn't support SQL, they have their own JSON thing, and other systems have their own proprietary stuff, but even the different relational databases that do support SQL have different dialects of SQL, and they don't always match up. So just getting the query interface to be correct is hard. And if you want to do query optimization, like predicate pushdown for joins and other things, that's almost impossible in these environments, because all these database systems assume they're the master of the universe. They don't know and don't care about other database systems, so they only know how to do query optimization or execution for the queries that touch their own data, and now you need something above all of this to tie it together. The other issue that ends up happening is that you see a lot of data copying: if I want to do a join between my MySQL data and my HDFS or HBase data, I've got to copy it out from one machine to another and then do the join; I can't natively do a join across all of that.

So federated databases are an old idea; they go back to the 1980s. There is some software available that does a subset of this, but it's not that common; most people just build one-off solutions for these things. To give a really quick example, say you have some kind of federated database middleware, and in our back end we have MySQL, MongoDB, Redis, and PumpkinDB, which is a real system. Again, we write one query to our middleware, and the middleware says, oh, I know you need to touch data at all four of these systems. It has its own catalog that says this table here is in Mongo, this table here is in MySQL, so it knows how to take the query that you sent it and break it apart into subqueries that it can send out to the different database back ends. This mechanism for getting data out of these different systems, these are usually called connectors; that's usually how they're marketed in enterprise scenarios. And then depending on what the query is, if it's a join, we may have to copy a bunch of data from the back ends into the middleware and then do the join in the middleware. Or, if we're really intelligent about it, we could take the data out of PumpkinDB, put it directly into MySQL, and then do the join in MySQL. This all depends on how the federated database layer is actually implemented, and I'm telling you, nobody does this well; most systems have to copy everything to some central location and then do the work there.

There are commercial products that do this. The one system that I think is probably getting the closest is Postgres. Postgres has a feature called foreign data wrappers; it's been around for a couple of years now. The idea is that you implement the foreign data wrapper API and have it communicate with whatever back-end data source you want. And these don't have to be full-fledged database systems; they could just be a bunch of XML files you have in a directory, or data you have on S3 at Amazon. These connectors essentially know how to read that data and shove it into something else. I'm not saying Postgres is the only one that does this; a lot of systems provide a connector API to go ahead and do this. But again, it's almost always treated as a black box where you can't actually run complex queries down in the source; you always have to pull the data up.
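To sketch what that middleware might look like, here's a toy Python version, entirely my own illustration: a catalog maps tables to back ends, the "connectors" are stubbed out as a fetch function, and a cross-system join is done by copying both sides up into the middleware. All table names, placements, and function names here are made up.

```python
# Hypothetical sketch of a federated-database middleware: a catalog maps
# each table to its backend, queries are split into per-backend fetches,
# and any cross-system join is done in the middleware after copying data up.

CATALOG = {
    "employees": "mongodb",   # assumed placement, for illustration only
    "tickers":   "mysql",
}

def route(table):
    """Look up which backend 'connector' owns a table."""
    return CATALOG[table]

def federated_join(fetch, left_table, right_table, key):
    """Copy both tables out of their backends and join in the middleware."""
    left = fetch(route(left_table), left_table)
    right = fetch(route(right_table), right_table)
    right_index = {row[key]: row for row in right}
    return [l | right_index[l[key]]           # merge matching rows
            for l in left if l[key] in right_index]

# Stub 'connector' standing in for real MySQL/MongoDB drivers.
def fake_fetch(backend, table):
    data = {
        ("mongodb", "employees"): [{"id": 1, "name": "Ann"}],
        ("mysql",   "tickers"):   [{"id": 1, "symbol": "DB"}],
    }
    return data[(backend, table)]

print(federated_join(fake_fetch, "employees", "tickers", "id"))
# -> [{'id': 1, 'name': 'Ann', 'symbol': 'DB'}]
```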
Okay, so again, I guarantee you're going to come across this: in big enterprises, people will complain about all these different disparate databases. Sometimes these are called data silos, because all the data you need is siloed in one particular system, and ideally they want a middleware layer to pull everything together and make it look uniform. But there's no magic tool or magic oracle that'll do this.

Okay, all right, so the main takeaways: we talked about how to do commit protocols, and we talked about how to do replication. And again, the lines are blurred in how these two things fit together; depending on how you're actually doing replication, that determines at what stage you run Paxos or at what stage you run two-phase commit.

And then the last thing I wanted to bring up, as I said earlier: in this entire lecture, the last lecture, and actually the lecture next class, we assume that our distributed database is running in a friendly environment. What I mean by that is the nodes are under our control, they're running our software, and we don't think they're going to intentionally do something malicious or adversarial. If we say we want to go ahead and commit a transaction on one node, that node's not going to lie to us and say, yeah, f**k you, we're not going to do it. It'll do it, because we control it. Contrast this with the blockchain, like Bitcoin: you're running the Bitcoin protocol on your node, you're running the Bitcoin protocol on my node, I don't trust you, you don't trust me. So in order to commit transactions, there's a bunch of extra stuff we have to do to prove that we're actually allowed to commit a transaction. Hashing, the way Bitcoin does it in its proof of work, is one example of this (see the sketch below). And then you have to run essentially a variant of something like Paxos that's Byzantine fault tolerant to make sure that everyone agrees to go ahead and commit the transaction, and I prove to you that I did the work needed to commit it. Once the majority agrees, then it can go ahead and commit. All that extra work slows down the entire process, and that's why, in the case of the Bitcoin blockchain, I think they're doing something like one transaction a second. In a real system, you know, if you're running a website, you couldn't actually live with that. They have to do this extra work because they don't trust that when you say you're going to commit something, you're actually really going to commit it. In our environment, we don't worry about that, because we control the hardware, at least at the level we care about, and we control the software. Okay, so that's all you'll hear from me about blockchain and Bitcoin this entire semester: it's just a transactional database where you don't trust the other people.

Okay, all right, next class we'll talk about how a lot of systems do analytics in a distributed environment. All right guys, see you on Wednesday.
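And here is the proof-of-work sketch referenced above: my own minimal illustration of the idea, not Bitcoin's actual implementation. Grinding through nonces is expensive, while verifying a claimed nonce is cheap, and that asymmetric extra work is exactly what caps throughput.

```python
# Minimal proof-of-work sketch: grind through nonces until the block's
# SHA-256 hash has a required number of leading zero bits. Producing the
# nonce is expensive; verifying someone else's nonce is one cheap hash.

import hashlib

def mine(block: bytes, difficulty: int = 16) -> int:
    """Find a nonce whose hash has 'difficulty' leading zero bits."""
    target = 1 << (256 - difficulty)
    nonce = 0
    while True:
        digest = hashlib.sha256(block + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

def verify(block: bytes, nonce: int, difficulty: int = 16) -> bool:
    """Cheap check that the expensive work was actually done."""
    digest = hashlib.sha256(block + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty))

block = b"txn: A = 2"
nonce = mine(block)          # expensive: ~2^16 hashes on average
assert verify(block, nonce)  # cheap: one hash to check
```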