All right, DJ Mushu, thank you, thank you for being here under difficult circumstances. My lawyer is handling it. Okay, good to hear. All right, for you guys in the class, a lot of things to go over as we're wrapping this semester up. So, Homework 5 is out and due on Sunday, December 4th. Project 4 is out, and that'll be due the following week on December 11th. And then for the two additional lectures of the last week of classes: on December 6th, we're going to have a guest speaker from Snowflake talk about their database system. I'm still getting word back on whether that will be in person or not; it might be a virtual talk. And then on the following day, the Thursday of that same week, we'll again have the live Q&A session people can call in to. So again, I'll post details about this on Piazza and on the database group website. And the final exam is on December 16th, Friday at 1 p.m. I don't think we've even been assigned a room yet. I just haven't looked. Right. So, any questions on Homework 5 or Project 4? Yes? So, your question is, what, sorry, is the final exam hard? Are the special lectures part of the final exam? Oh, the question is, are the special lectures part of the final exam? No. You asked me this last time. Like, for all these different database systems, will I expect you to know the exact details of everything I say? No. This is just, like, vomiting database information at you guys. So, yes, these will not be on the final exam. You still have to attend, though. Any other questions? All right. So, last class, we were talking about distributed databases: a quick introduction about what they are at a high level and the different system architectures you could have. Remember, there was shared memory, shared disk, and shared nothing. Shared nothing is usually what people think about in a distributed database, where each node has some partition of the database.
But as I said, as cloud platforms become more prevalent and most databases move to the cloud, there's a bunch of cloud-native database systems that are going to be shared disk. We talked about how to do partitioning or sharding, again, how to break the database up into distinct subsets. The primary way we do this is through horizontal partitioning. And then we talked, at the end, about transaction coordination: again, whether we want to have a centralized coordinator or a middleware be responsible for figuring out whether a transaction is allowed to commit, or whether we let the nodes figure it out amongst themselves. So, for this lecture and next week's lecture, we're going to talk about distributed databases in the context of either OLTP or OLAP, right, online transaction processing or online analytical processing. Today's lecture is focused on transaction processing. The next class will be how to do analytics on a distributed database. And Snowflake will come as well and talk about their distributed analytical database, so that'll reinforce what we talk about here. So, just to remind you guys what the difference between these two workload categories is: in OLTP workloads, think of sort of front-end applications, things that are communicating or interacting with the outside world. Like, you're running a website, you're punching in orders, you're updating a post, I would say on Twitter, but maybe now Mastodon or Hive or whatever you want to use, right? It's a bunch of these small little updates. And we want to do this in the context of a transaction; we have all the ACID guarantees we would have in a single-node system, and we now want to apply that to a distributed database system.
For analytics, again, it's these long-running queries that are going to be, for the most part, read-only, doing complex joins, doing sort of random queries that people are typing in through dashboards, right, where the workload's not going to be the same over and over again. So again, just to remind everyone, today's class is focusing on the first one: how do we do small updates or small reads to a small amount of data, where there could be a lot of these transactions occurring at the same time? So this is the setup we talked about last time when we had a distributed coordinator. The application comes along and says, I want to begin a transaction, and it sends it to some node we'll call the primary node. How exactly we decide which one's the primary node, that's the focus of today. And then we send a bunch of queries to the different partitions that we want to touch. We do reads, we do updates, it doesn't matter. And then when our transaction wants to commit, we go to the primary node and say, hey, I'm ready to commit my transaction. And the primary node is responsible for communicating with all the other nodes participating in the transaction to say, is this transaction safe to commit? So the thing we're focused on today is, first, how do we decide which one's the primary? And then second, how do we decide whether it's safe to commit? They're actually the same problem, right? Because it's basically figuring out what the new state of the database should be, whether that state is which one's the primary, or whether that state is, is this transaction allowed to commit? At a high level, they're essentially the same. So, this is reiterating what I just said.
The thing we haven't talked about so far, which we're going to talk about today, is how the nodes are going to agree to commit, right? In the best-case scenario, if all the nodes are always up and our messages are arriving right away and nothing gets lost, then it's easy. But of course, the real world doesn't look like that. We have to deal with the case where, you know, we may be trying to commit a transaction and a node fails, right? Like a hard crash: it loses power, the machine gets wiped, and it comes back and doesn't know anything, right? But a lot of times that isn't what happens. Instead you have these ephemeral failures where somehow the network just gets disconnected, because somebody reboots a switch or whatever, and now some of the messages show up and others don't, or maybe a message gets delayed and shows up late. Or maybe your database system is built in Java or Scala and the garbage collector kicks in, does a full sweep on the heap, and pauses your process for a second or two. That's a long time. So now, from another node communicating with the node that's suffering the GC pause, or garbage collection pause, it looks like it's down. That node looks like it's down even though it is up; it's just not communicating, right? And then we have to deal with the case of, what if we don't want to wait for every node to agree that we want to commit a transaction or do something in our database system? How do we handle the case where, okay, some agree and others don't agree? What do we do for the ones that don't agree? Right? Again, we didn't have to worry about all these things in a shared-everything system because it was running on the same box, right? Like, it didn't make any sense to have one thread not agree to do something with another thread.
There was always some kind of centralized coordinator, similar to what you guys are building in Project 4, that can come down with a hammer and say, no, this transaction is committing, here's what we're going to do, and everyone just does it. So we have to deal with that in this case here. So, one very important assumption that we're going to make that's going to simplify things, well, it does simplify things, but it's still not easy. One important assumption we're going to make with all this is that all the nodes in our distributed database system are going to be well behaved and controlled by us, right? What I mean by that is, it's not going to be some random actor on the internet that can download the database system software and is now trying to participate in our transactions, right? That's stupid. You wouldn't do that if it was your business, right? If you're a bank, you don't let other people run your code, right? You're the bank, you're controlling everything, right? Most applications look like that. So again, the nodes in our database system are going to be under our administrative control, and therefore if we say we're going to commit a transaction and everybody agrees that we're going to commit this transaction, no one's going to show up later on and say, hey guys, that's wrong, right? We didn't commit that, right? Now, there may be hardware failures, and things can get corrupted in your log, and we have to deal with that; the write-ahead log essentially is the answer. But we're not worried about someone coming along and saying, this transaction you think committed, things are completely different than what I thought they were. If you don't trust the other nodes, then you need what is called a Byzantine fault tolerant protocol, and this is essentially what they build for blockchains, which are stupid, right? So we don't have to worry about any of that.
If you need a Byzantine fault tolerant protocol, it's way harder and slower than everything we'll talk about today. But I can't think of any application other than Bitcoin that would actually need this, okay? Right, so today's agenda: we're going to start off talking about commit protocols and then dabble into consensus protocols, which is what Paxos and Raft are, but I'm only going to really describe them in terms of what we care about Paxos and Raft for: committing transactions in a database system. You can use these replicated state machine protocols, consensus protocols like Raft and Paxos, to do a bunch of other things, to maintain state in distributed systems. You can take the distributed systems course at CMU if you want to learn more. I care, but don't care right now, right? Like, we care about these protocols being correct, but I want to focus just on how we actually apply them in our database system, right? Because we could spend an entire semester on distributed databases, and I'm just trying to give you exposure to a bunch of material, just so that when you go off into the real world and someone comes along with their distributed database, you at least know what kind of things you should be thinking about and reasoning about to decide whether that system is correct or not. Right, so: commit protocols, then replication. I would say a replicated database is still a distributed database, and this is probably what most people need when they say they want to start scaling things; a replicated database would be the first thing I would recommend. Then we'll talk about consistency issues with the CAP theorem, or the PACELC theorem, I'll follow up with that, and then we'll finish off talking about Google Spanner, because it combines a bunch of the things that we've talked about, both in this lecture and throughout the semester.
We'll see how they put it all together to make a really interesting distributed database, okay? Right, so as I said, the thing we gotta figure out first is, when a transaction wants to commit, how we decide whether it's okay for it to commit. If it's on a single node, it's easy, right? We're running two-phase locking or running OCC; the validation, all that still works here. All right, so we still have to do that, but now we have this extra step to say, okay, well, I touched data at different nodes, now I gotta go commit. Can everyone agree that it's now safe to commit? All right, and so there's a bunch of implementations, different protocols we can use, that'll solve this problem for us. The two that we're gonna focus on are two-phase commit and Paxos. Actually, out of curiosity, who here has heard of two-phase commit before? Oh, half. Who here has heard of Paxos before? About the same. And Raft, who's heard of Raft? Who's heard of Viewstamped Replication? One in the back, yeah, you saw the talk yesterday. All right, so that's my number one PhD student in the back. So as I was saying, Viewstamped Replication was, I think, actually the first protocol in this space; Paxos came afterward, and that was the first one proven correct. Viewstamped Replication was later proven correct, but it predates Paxos. Paxos is what everyone thinks of when they think of consensus protocols. Raft is an implementation of a consensus protocol from Stanford that's designed to be more easily understood. At a high level, we don't really care how exactly it works; hopefully it just works. Two-phase commit will be sort of the default choice for most distributed databases if they're in the same local area network. When you go over the WAN, you wanna use something else.
And then three-phase commit is sort of a way to deal with a liveness issue in two-phase commit. Paxos and Raft deal with that too, but nobody uses three-phase commit. There's a four-phase commit from Microsoft. Nobody does that either, other than Microsoft. Yes? Just to clarify, when you say multi-node, are these nodes holding the same data or different data? The question is, when I say multi-node, do I mean nodes holding the same data or different partitions of the data? It does not matter. So your question is, if it's partitioned, like, Paxos says you have to have a majority, a quorum, to say we're allowed to commit this thing, right? Yeah. But for our purposes right here, we're not making a distinction about what's actually in each node. The question we're trying to answer with these commit protocols is, can we commit, right? Can we move the state forward, right? So if I have one node and a replica, but we'll get to this, I have one node and a replica, and I need to apply a change to them, I need to have both of them apply the change and then both of them agree that the change has been applied. Is it a different thing for sharding? They're different, but it's still state that's split, right? Or sorry, think of the state that we care about as what transaction committed, not what changes we made, right? So the difference would be, the write-ahead log is, here's all the individual changes that I'm making, commit, done, right? This is, hey, transaction one is making a bunch of changes. Does everyone agree that it's been applied? That's the state we're changing, right? And it represents the state of the database we changed. Whether it's replicated or sharded, I don't think it matters.
That's a, we'll get to that in a second. All right, so, Paxos, and sorry, Raft, is probably the more common implementation. At a high level, these are all equivalent, except two-phase commit, well, two-phase commit is a degenerate case of Paxos. Most modern systems are implementing Raft these days because there's a bunch of implementations of the Raft protocol in a bunch of different languages. Like, for a while, there wasn't a libPaxos that you could just download and use. With Raft, the Stanford guys made a bunch of versions of it in different languages, and that's why it's seen wider adoption. And then Zab is this thing, the ZooKeeper atomic broadcast protocol, that's used in Apache ZooKeeper. This came out of the Hadoop ecosystem; it's a way to do distributed consensus as a service. We'll focus on just two-phase commit and Paxos. All right, so, here's the success case of two-phase commit. Two-phase commit is basically as it sounds: there's two phases. So what happens is the application says, I want to commit, and it goes to what we'll call a coordinator node. So there'll be some node that's going to coordinate with the other nodes and say, hey, we want to commit this transaction. And those other nodes will be known as participants. So the first message goes out and says, hey guys, this transaction wants to commit, are you prepared to do this? And then the different nodes can come back and vote and say yes or no: is it safe to commit this transaction? And then once the coordinator gets all the positive votes from the participants, it enters the second phase and says, okay, now go ahead and commit. They apply the changes, commit the transaction, the state moves forward. Then you get back all the acknowledgments at the coordinator. And at this point, the transaction is considered committed, and then you can tell the outside world, you can tell the application, that the commit was successful.
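That happy path can be sketched in a few lines. This is just a hypothetical illustration of the message flow, with in-process `Participant` objects standing in for real nodes; a real coordinator would also flush every message to its write-ahead log and handle timeouts.

```python
# Sketch of the two-phase commit success/abort flow described above.
# "Participant" and its vote flag are made-up stand-ins for real nodes
# exchanging network messages.

class Participant:
    def __init__(self, will_commit=True):
        self.will_commit = will_commit
        self.state = "RUNNING"

    def prepare(self, txn_id):
        # Phase 1: vote yes or no. A real node would flush this vote
        # to its write-ahead log before responding.
        self.state = "PREPARED" if self.will_commit else "ABORTED"
        return self.will_commit

    def finish(self, txn_id, decision):
        # Phase 2: apply the coordinator's decision and acknowledge.
        self.state = decision
        return "ACK"

def two_phase_commit(participants, txn_id):
    # Phase 1: 2PC needs a unanimous yes vote from every participant.
    votes = [p.prepare(txn_id) for p in participants]
    decision = "COMMITTED" if all(votes) else "ABORTED"
    # Phase 2: broadcast the decision and wait for all the acks.
    acks = [p.finish(txn_id, decision) for p in participants]
    assert all(a == "ACK" for a in acks)
    return decision
```

For example, `two_phase_commit([Participant(), Participant()], 1)` commits, while a single dissenting `Participant(will_commit=False)` forces the whole transaction to abort.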
Right? So in this case here, the coordinator has to wait at each phase for all the nodes, the participants, to come back and say, yes, I can proceed. We'll see that in Paxos, you just need a quorum; that's the majority thing he was asking about. In two-phase commit, you have to have everybody. And so there's a liveness issue, because if I send out, say, the prepare message here, and node two doesn't respond but node three does right away, I gotta wait some amount of time for node two to either send me the message or for me to time out. Right? So during that period, I can't commit any new transactions, because I'm waiting to find out what happened to the one I'm trying to commit here. We'll see how we handle that in Paxos, right? So the abort case is basically the same thing. The commit request shows up. You send out the prepare request to the different participants. One of them comes back and says abort. And this could be for whatever reason, right? It doesn't matter. It doesn't have to explain to the coordinator why it's doing it; it just says the thing has to abort. Maybe there's an integrity constraint violation. Maybe another transaction is trying to commit that conflicts with this one. It doesn't matter. But as soon as we get the abort message at the coordinator from one of our participants, we can immediately tell the outside world this transaction has been aborted. And then we just send out the abort message in the second phase, all right? And then we wait for them to acknowledge that. So far so good, pretty simple. So what I'm not showing here is that each node is also gonna maintain a log of all the messages that it sent out and their destinations, and all the messages that came in with that node's response. We gotta store this in a log, like a write-ahead log.
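A sketch of that message logging. The record format and `log_message` helper are invented for illustration; the point is just that every 2PC message gets appended and flushed before the node acts on it, so recovery can replay it later.

```python
import json

def log_message(log_file, txn_id, direction, peer, msg):
    # Append one 2PC message (inbound or outbound) to the log as a
    # self-describing record: which transaction, which peer, what was said.
    record = {"txn": txn_id, "dir": direction, "peer": peer, "msg": msg}
    log_file.write(json.dumps(record) + "\n")
    # Flush before acting on the message; a real system would also fsync
    # so the record survives a power failure.
    log_file.flush()
```

A coordinator would call this before sending each prepare, and again as each vote arrives, so a restarted node can reconstruct exactly how far the protocol got.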
So that way, if any of the nodes crashed during this two-phase commit, when that node comes back up, it can look in the log and figure out: okay, I was involved in a transaction as part of two-phase commit. What should happen? What do I need to do? So say a node comes back up, and say it's a participant, not the coordinator. We look in the log and say, oh, I see a bunch of messages that I got for this two-phase commit, for some transaction that I was running. If I'm in the prepared state, meaning I've told the coordinator, yes, I vote to commit this transaction, then when the node wakes up, it goes back to the coordinator and says, okay, transaction one-two-three, we were running two-phase commit, I crashed before we finished. Tell me what happened. And then based on what the coordinator says, it can decide what to do. Because the issue here is that we sent an outbound message to the coordinator to say, yes, commit this transaction, but we don't know whether that message made it there or not. Because if it didn't make it, then the coordinator would say, well, we didn't hear back from the participant, and we have to have all the nodes agree to commit this transaction, therefore this transaction aborted. Or maybe my vote-to-commit message did show up, and the coordinator tried to send me the okay-commit, but I never got it because I crashed. So the coordinator will tell me what happened, right? If the local transaction never even sent out the prepare vote, then we know it was never gonna commit anyway, so we just go ahead and abort it. If we were committing and we are the coordinator, then we may have to go send the commit messages to the participants and say, hey, by the way, I crashed when we tried to commit this; go fix it up, make sure it committed, right?
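Those participant-side recovery rules can be sketched like this. The log entry names and the `ask_coordinator` callback are hypothetical, just to show the decision: a participant that already voted yes must ask the coordinator for the outcome, while one that never voted can safely abort on its own.

```python
def recover_participant(log, ask_coordinator):
    # log: this node's replayed 2PC records; ask_coordinator: callback
    # that returns the coordinator's final decision for the transaction.
    if "SENT_VOTE_COMMIT" in log:
        # We voted yes but crashed before learning the outcome. Only the
        # coordinator knows whether our vote arrived and whether the
        # transaction committed, so we have to ask it.
        return ask_coordinator()
    # We never sent a prepare vote, so the coordinator could not have
    # gotten unanimous agreement: the transaction cannot have committed,
    # and it is safe to abort locally.
    return "ABORTED"
```

Note the asymmetry: only the prepared state forces the node to block on the coordinator, which is exactly where the liveness problems discussed next come from.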
So just to reiterate what I just said: if we're doing two-phase commit and the coordinator crashes, then the participants have to decide what to do after a timeout. The default, or easiest, thing to do is to say, okay, the coordinator's dead, this transaction's aborted, and we just abort. Of course, now someone has to take control: someone has to become the new coordinator and say, this transaction did not commit, we have to make sure everyone aborts and rolls back. And again, there's a liveness issue in the protocol here, because we're waiting for this timeout to get triggered before we decide, okay, the coordinator's not coming back, they're dead, and then we have to elect someone and say, you're the new coordinator, go ahead, all right? If a participant crashes during two-phase commit, assuming we haven't gotten its vote back in the prepare phase, then we assume that it's aborted and we go ahead and just kill things, all right? And likewise, we're gonna use this timeout, basically like a heartbeat, to say, we haven't heard anything from this node in a while, assume that it's dead. That's gonna cause problems later on when we have network partitions, because the nodes could be up but they just can't communicate with each other. So this guy thinks the other guy's dead, and the other guy thinks this guy's dead, and then they each try to take control, and we need to handle that case. All right, so this seems like kind of an expensive protocol, because there's two round trips between the nodes to say, hey, can we go ahead and commit this, right? And that can obviously be expensive, especially if the nodes are far away from each other. Like, if I'm going over a wide area network, the round trip around the earth is what, 300 milliseconds or something, right?
That's at the speed of light, so you're never gonna beat that; that'll be your lower bound, and you're never gonna get close to it. So having to do these round trips makes things really expensive, and now you may commit ten transactions a second if you're always waiting for the network. So there are some optimizations or shortcuts that we can take to speed things up. The first is called early prepare voting, and the idea here is that before we go ahead and commit, when we send a query from the application or a middleware to a node, if we know we're never gonna run another query at that node for this transaction, we piggyback a, hey, I'm gonna commit, begin the two-phase commit process. So the node gets the request, runs the query, sends the result back, but it can also piggyback its prepare vote to say, if you're gonna ask me to prepare this transaction later on, just so you know, I'm okay with it. So don't wait for me in the first phase of two-phase commit; I've already told you what I wanna do. Now, this is rare, because you don't write database applications with this idea of, okay, I'm gonna send a query and it's the last query I'm ever gonna send. As far as I know, in JDBC and ODBC, the client application APIs, you write SQL queries and there's no command like, hey, this is my last query, right? This only shows up in systems where you're running transactions as stored procedures. Think of it like an RPC, where I have this function that runs queries with intermixed if statements and for loops, and that runs directly inside the database system. In that case, the system would know, at whatever point it is in the program, that it's never gonna run another query, and it could send that message. So that one's rare. The one that's super common:
As far as I know, when you read the documentation for two-phase commit, pretty much everyone does this: early acknowledgement after prepare. This is where you begin two-phase commit, you send the prepare message to everyone, and you wait for the votes to come back. As soon as you get back all the responses, rather than waiting until you get the second-phase responses, you immediately tell the application that it committed at that point. All right, so it looks like this. The application server says, I want to commit. We do the prepare phase. We get back the acknowledgments. And then at this point here, we tell the application, you've committed. Have you? So his statement is, you have not committed yet, have you? Well, logically you have, right? Because everyone says we're gonna commit, and we've logged all these messages on disk. At this point, we would have logged and flushed to disk that we got an okay from everybody, right? So if we tell the application that it's committed, but then we crash, as I said before, we would use the log to figure out what the correct state of this transaction should be. Yeah, so logically it's committed, because we got the messages that said everyone's gonna commit, but physically, node two and node three don't know whether you've committed yet. So there's a separate issue of, if someone now tries to read the updates from this transaction, will they actually see them? You may have to stall on nodes two and three and say, well, I think this transaction has committed, I'll let you read it, but I don't actually know whether the modifications you're reading have persisted, whether they've committed or not. So when you try to commit, I'll stall you. Or sorry, we'll go ahead in the back, yes. So the question is, in this case here, isn't the coordinator still a point of contention?
Yes. So the statement is, doesn't that defeat the purpose of a distributed database, for the commit process? I see, the statement is, in my example here, all the transaction commits are always going through node one. Isn't the overhead of processing the transactions gonna be expensive to do on one node? So we're not there yet; we'll get to where different transactions can be committing at different nodes, right? But even then, I will say the bottleneck isn't CPU. In this case here, it's the network round trips, right, having to wait for those. It's not processing packets; that's not the cost. Same question? Okay, yeah, yeah. And we'll talk about multi-master versus primary-replica in a second. But for now, assume everybody's committing in this one place, though you could have other transactions that are touching different parts of the database committing in different places, right? Like, I could have a transaction committing on node three, where node three is the coordinator for that transaction, and this guy's the coordinator for another transaction. Then we've got to figure out who goes first. Yes? So the statement is, in a real system with a million nodes... most transactional databases are not a million nodes, but you would have transactions touching different segments of the database, and therefore they could be disjoint. Yes, ideally, yes. Next question: does any distributed database not use two-phase commit? There's some databases that run exclusively with Paxos and Raft. I have to think; actually, I think maybe TiDB. Spanner uses both; we'll cover Spanner in a second. Yugabyte uses Raft. I'd have to look this up. I would say the classic distributed databases from the 80s, like DB2, NonStop, Oracle, they were built before Paxos was a thing, so they're using two-phase commit. Okay, one more question. So this then leads into Paxos.
So Paxos is another way to do commits for transactions. The difference here is that under two-phase commit, you have to have all the nodes agree that we're going to commit something. In Paxos, you need a majority to agree. And so there was a paper actually done by Leslie Lamport and Jim Gray. Leslie Lamport was the inventor of Paxos; Jim Gray, I don't think he was the inventor of two-phase commit, but he was an early adopter in the 1980s. They basically showed that two-phase commit is a degenerate case of Paxos. Like, if you require all the nodes to agree that they're going to commit a transaction in Paxos, it's essentially the same thing as two-phase commit. And Raft, again, at a high level, is going to be the same as Paxos. The only difference is that in Raft, when you do leader election, you take the node that has the most up-to-date log as the leader, whereas in Paxos, anybody can become the leader if they're elected, but then you still have to go update your log to make sure that everyone is back in sync. Right, so this is the original paper on Paxos, called The Part-Time Parliament, from Leslie Lamport. The story is that he wrote it around, I don't know, '91, '92, and he submitted it to a conference in distributed computing. I think the story was that he was trying to prove that there was no protocol that could be safe and resilient over an asynchronous network, and he ended up designing Paxos; like, he was trying to use it as a proof that it couldn't be done, but it turns out it could. So then he submitted this paper to a distributed systems conference. If you ever read it, it's a bizarre paper, because it talks about this Greek island that had this ancient tribe with these stone tablets that he's digging up.
And he's trying to weave a story about this ancient tribe on his fictional island of Paxos, but what he's really describing is the protocol. And so it's more of an amusement or a curiosity to read it; you won't actually learn anything. There's a follow-up paper from Leslie Lamport called Paxos Made Simple. That is also hard to read. The one, in my opinion, that is easier to understand is Google's paper called Paxos Made Live, from like 2003, 2004. That's the one where it actually clicked for me, like, okay, now I understand it. And then the Raft thing out of Stanford, that was meant to be a consensus protocol like Paxos that a human could understand more easily. Even then, it's still not light reading. And again, there's a bunch of corner cases we have to deal with in Paxos and these distributed commit protocols that we're gonna ignore; the devil's in the details, of course. But again, we only care about the context of transactions, right? And I always tell this story. So when I was in grad school, I took a distributed systems course taught by Maurice Herlihy, who used to be a professor here at CMU. This was back at Brown, and I presented Paxos and was trying to explain it to the class. And when you go to Leslie Lamport's website for all his papers, he has these little blurbs that talk about what was going on in his life every time he wrote a paper, like what food he was eating, who he was dating, and so forth, right? And for the Paxos paper, he talks about how it was a work of art, and he submitted it to the conference in like '92, '93, and it got rejected because they didn't see the genius of it, right? And so I said this in class, like, oh yeah, this paper's great, I don't know why it got rejected, whoever rejected it was an idiot. Turns out Maurice Herlihy, the guy teaching the class, was the one that rejected it. But he said that they were okay with the Greek story he had about the archeologist.
All they asked him to do was put a proof of the actual algorithm in an appendix. And Leslie Lamport refused to do that, all right? So what happened is the paper got rejected, he put it on his shelf, and didn't do anything with it. Then in the 90s there were a bunch of papers that were sort of dancing around the problem, almost solving this problem that he had already solved with Paxos. Mind you, viewstamped replication also solved it, but nobody knew that either. But then he said, okay, well, screw this. Only after he saw all these people trying to solve the problem he had already solved did he put the paper out. And then Google adopted it in their Chubby lock service, as did a bunch of other distributed systems in the late 90s and early 2000s, and that's when Paxos became super popular. But again, this paper, you read it once, it's cute, but from a pedagogical standpoint it's a train wreck. Okay, so here's how Paxos is gonna work. Instead of participants and coordinators, we're gonna have proposers and acceptors. There's also a third category in Paxos that we're not gonna cover here called learners; I think Raft calls them followers. They'll see the updates, but they're not involved in voting. As far as I know, I don't think any database actually uses that terminology, unless a node is coming back up for recovery, right? So, all right, we have our commit request. Our proposer then sends out a message to all the acceptors to say, hey, I propose that we commit this transaction. Let's say one of them goes down, node three here; that's fine. We get back the agreements from two and four. And because we have a majority, meaning in this case we have four nodes and three out of four agree to commit this transaction, that's a majority, so it's okay for us to commit. 
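The majority rule just described can be sketched in a few lines. This is a minimal illustration of the counting logic only, not how any real Paxos implementation is structured; the function name and vote encoding are made up for this example.

```python
# Hypothetical sketch of the accept round described above: the proposer
# commits once a majority of acceptors agree, so one crashed node
# (e.g. node 3 returning no response) does not block the transaction.
def paxos_commit(acceptor_votes, num_acceptors):
    """acceptor_votes: list of True/False/None (None = no response)."""
    agrees = sum(1 for v in acceptor_votes if v is True)
    return agrees > num_acceptors // 2

# Four acceptors, node 3 is down (None), nodes 1, 2, 4 agree:
print(paxos_commit([True, True, None, True], 4))  # True: 3 of 4 is a majority
# Two nodes down leaves only 2 of 4 agreeing, which is not a majority:
print(paxos_commit([True, None, None, True], 4))  # False
```

Contrast this with two-phase commit, where a single `None` or `False` would already force an abort.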
So then we send the commit request or message out to all the nodes, get back the acknowledgments that they accept to do this, and then we can tell the outside world that our transaction has committed. Yes? If these nodes are holding different data, how come it's acceptable that node three fails, if it was running through its locking and...? So you're saying, why is this guy allowed to say... So you would do a leader election, that would be one approach for this. No, I think it's running 2PL and it has to abort according to the 2PL policy. Yes. And shouldn't the whole transaction abort? Right, so the question is, say this guy doesn't crash, he just says, I can't commit this for some other reason. Then shouldn't he... but everyone else agrees to commit this transaction, so doesn't that screw him up? So think of the state of the system as the global state across all the nodes, and the log says what transactions have committed, right? It's a global order. So in this case here, if node three cannot commit because somebody else holds a lock or did something else that conflicts, he says, I can't commit this transaction. But again, these nodes are not malicious; we control them. Node three can't go rogue and say, okay, screw you guys, I am committing this transaction. He will have to get the update that this transaction committed, and then he has to update his log to get up to date. So the question is, does that just mean the transaction is partially committed? So at this point here, if we tell the outside world we've committed, the state has moved forward. Therefore node three is no longer in sync; it's no longer up to date. So if I want globally consistent transactions, I can't read any data from node three until it gets up to date. The question is, how do you bring it up to date? 
You have to learn from the other nodes that this transaction committed and get the updates accordingly. Am I retrying the transaction, or is it more like replaying the write-ahead log? The latter: you would say, hey guys, I missed out on this transaction, seemed like a fun thing to do, give me the write-ahead log entries to put me back in the right state. Yes? And this is assuming two, three, and four are replicas of the same data, but what if they aren't? Yeah, so now we're getting into leases. There would be a lease, and the reads would have to go through the coordinator, the proposer leader, and say, what's the latest version I should be reading? And then there's the split-brain issue. That's why you have this timeout: if you can no longer communicate with the majority, then you have to wait until the timeout before you elect a new leader, to avoid split brain. Yes? So his statement here is, could it be the case that node three had a violation, like on commit you checked some integrity constraint and it was violated, so the transaction should not be allowed to commit? In that case, if you assume they're replicas, then they all have the same integrity constraint violation, right? If it was sharded, good question. Right, so to say it again: if it's sharded and you have a localized integrity violation on this node, then this won't work. Yes, we'd have to use two-phase commit, because two-phase commit means everyone agrees. I mean, I could jump ahead to Spanner if you want. Do it real quickly, show you how. TiDB uses two-phase commit? They use two-phase commit, okay. So this is Spanner. I'm jumping way, way ahead, but say that instead of partitions, they call them tablets. These are replicas of each other, and these we put together in a single Paxos group. You run Paxos leader election, actually I can show you, right? 
You'd run Paxos leader election, which again is just thinking of the leader as a state machine: which node in my group is gonna be the leader? And then I can send all reads and writes to go through the leader. The leader is responsible for using Paxos to get all the replicas to agree that this is the change we're gonna make, right? And again, to his example, what if there's an integrity constraint violation? Because these are replicas, they would all have the same integrity constraint violation. I can also support snapshot reads, where, again, think of the multi-versioning idea: if I have a timestamp, I can guarantee with snapshot isolation that I read the data that's been committed as of my timestamp. So I could send read requests to different replicas and not worry about burdening the leader with all those requests. But then if I need to update data across different groups, which aren't replicas of mine, then I use two-phase commit for that. As for what systems use Raft, I have to double-check this. I know Cockroach uses Raft for everything, but maybe they're using two-phase commit like this. I'll double-check that and post it on Piazza afterwards. All right, so I'm jumping ahead; let me jump back to Paxos. And that is basically using two-phase commit on top of Paxos groups? Yes, yes. Because you believe later on that node is gonna get the data anyway? Your statement is, in my example here, ignoring integrity constraints: node three goes down, the majority agreed to commit this transaction, and I can tell the application that we've committed because I assume node three is gonna come back up to date. So, jumping ahead now, there's another notion called K-safety. It says how many copies of the data you need before you say, I have to turn everything off, right? Because I don't want to have writes that then can't get replicated, where if my node goes down, I lose everything. 
I lose my most recent transactions. So the idea would be, I say I can tolerate the failure of so many nodes, or I have to have so many nodes always up and online. So in my example there, I had four nodes and one went down. If I say K-safety of two, K equals two, I need to have at least two nodes up; I can still lose another node and that would be okay. So yeah, we're jumping around here, but think of it like this: if these are all replicas, then yes, this is fine. In your case, if these are all individual shards, that's what I was showing in Spanner. Say node four is a partition or a shard; I'd have to have so many copies of node four, and the K-safety factor says how many failures of node four's group I can tolerate before I shut everything down until a node comes back up. Right. Yeah, I'm trying not to get into replication versus sharding here just yet, but you guys are bringing up very good points. All right, so, sorry for that. Your question is: if I have a write-ahead log, and every node keeps track of the messages it's sending back and forth, then what's so special about majority voting? Because if one node can at least come back and say, this is what we committed, isn't that enough? That would not solve the split-brain problem, where there's a network partition: there's a bunch of nodes over here and a bunch of nodes over here, they can't communicate with each other, but they can communicate locally. Then what happens is, if I'm just saying the coordinator always wins, or the proposer always wins, at some point there'll be a timeout because this group can't communicate with that group. They run leader election locally, they both anoint a new leader, a proposer, and then they start committing transactions with different changes to the data. 
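Going back to the K-safety idea from a moment ago, the stop-the-world threshold can be sketched like this. This is a toy illustration of the rule only; the names and the dictionary layout are made up for this example.

```python
# Hypothetical K-safety check: with K = 2, the system stays online only
# while at least two copies of each partition are alive; otherwise we
# stop accepting new transactions until a node comes back up.
K = 2

def is_available(alive_replicas_per_partition):
    """Map of partition name -> number of replicas currently alive."""
    return all(alive >= K for alive in alive_replicas_per_partition.values())

# Partition P4 has three replicas and one fails -> 2 alive, still OK:
print(is_available({"P4": 2}))   # True
# A second failure drops P4 to one copy -> shut down and wait:
print(is_available({"P4": 1}))   # False
```

The point of the threshold is exactly the scenario in the lecture: with only one copy left, any further writes would be unrecoverable if that last node crashed.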
Then the network gets fixed and they start talking to each other, and then I'm screwed, because I said A equals five over here and A equals eight over here; which one's right? So one way to handle that, again from Lamport, is a technique called vector clocks, where you basically store the updates to an object or key as a list, almost like multi-versioning, and then when you resume connectivity you have a way to reconcile what the latest version should actually be. There are some databases that do that; that's what some of the NoSQL systems were doing, but then you have to have client code figure out how to resolve the conflicts. So if you want strong consistency and to avoid these issues, you basically say: when I have a partition and neither side has a quorum, then the system is just unavailable. There's no way around it. These are good questions. All right, so this is just another time-series view of how to run Paxos. Again, think of the database as just a log that represents the state machine of what transactions are committing. So we go from one entry to the next: transaction 123 commits, transaction 456 commits. And as I said before, we're not recording in this log the individual changes we're making to the tuples; it's just global state saying here's the transactions that committed. So the first proposer proposes N to all the acceptors. They all come back and agree, we want to do this. So you propose some position in the log, some number, and you always need to be moving forward in time. Then this other proposer comes out of the middle of nowhere and says, I'm gonna propose to commit N plus one. As soon as the acceptors see any proposal that's greater than the ones they've seen before, the lower ones are immediately discarded, right? 
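That "discard anything lower than the highest proposal you've seen" rule is the core of the acceptor's job, and it can be sketched in a few lines. This is a simplified illustration, assuming a single monotonically increasing proposal number; the class and method names are invented for the example and real acceptors track more state than this.

```python
# Hypothetical acceptor sketch: once an acceptor has seen proposal N+1,
# it rejects anything for a lower proposal N.
class Acceptor:
    def __init__(self):
        self.highest_seen = -1       # highest proposal number seen so far

    def on_propose(self, n):
        if n >= self.highest_seen:   # newer (or same) proposal wins
            self.highest_seen = n
            return "agree"
        return "reject"

    def on_commit(self, n):
        # Seeing a higher proposal is enough to reject a lower commit.
        return "ack" if n >= self.highest_seen else "reject"

a = Acceptor()
print(a.on_propose(5))   # agree
print(a.on_propose(6))   # agree: a newer proposer showed up
print(a.on_commit(5))    # reject: we already saw proposal 6
print(a.on_commit(6))    # ack
```

This is exactly the timeline on the slide: the first proposer's commit for N loses because the acceptors saw N+1 in between.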
So when this guy comes back and says, I'm trying to commit this, the acceptors, because they've seen N plus one, will reject that and say, we can't commit N because we saw N plus one. And then they agree to commit N plus one for that proposer; that guy commits. And then the next guy has to come back and say, I have to do N plus two, right? The question is, is N the transaction number? Just think of it as a logical counter. It could be the transaction ID, yes. The question is, would we have to redo N again? Again, for simplicity, yes, assume you have to rerun transaction N again. It's like validation in OCC: I try to commit, I may have made a bunch of changes, but when I go to commit, validation says there are conflicts, so you have to go back and run it again, yes. Yes? The question is, do the proposers and acceptors need to be different categories, like separate nodes? No, they would be the same, could be the same. In most systems they are; you would not have a separate node that is only a proposer, okay. So obviously this protocol is susceptible to liveness issues, because if these guys are just proposing N plus one, N plus two, N plus three back and forth and overriding each other, nothing will ever get committed. And this is where the leader election stuff I keep talking about comes into play, where we can just run another round of... sorry, question, yes? Sorry, the question is, going back here: so agree, reject, yes. Does this have to happen before this? No. Because if I tell this guy agree on N plus one, it doesn't matter, because whenever this guy sends, I want to go commit... actually, sorry, he already said commit N. So if we go back, if we're here, we commit N, we could immediately come back and say, no, we saw N plus one here. We haven't agreed to commit it; we saw it, that's all that matters. 
As soon as we see it, we can reject the lower one. Or we can say, yeah, we're gonna do this, and send that back too. I don't think it matters. All right, so the way to handle the issue of the system thrashing, with proposers trying to propose new transactions and nothing ever getting done, is a technique called Multi-Paxos, where you allow a single leader, a single proposer, to propose and commit multiple transactions one after another, and you don't worry about another proposer trying to show up at the same time to commit something, right? Basically, think of it like a lease. You say, okay, you're the anointed leader for some amount of time. Spanner I think is 10 seconds, Yugabyte is two seconds, I think Cockroach is five minutes; I might be remembering the wrong numbers, and they all do something slightly different. But the basic idea here is that you have one node be responsible for always proposing new things to commit, and then you can skip the propose phase as a way to speed things up. Of course, now what if there's a failure, right? We can talk about how you want to detect that. The heartbeat is the most common approach: every so often you send a ping to another node and get a response back, and if you don't get a response back within a certain amount of time, you assume that it's dead, right? So you just have this one node get anointed as the leader, and all the proposals and commit transactions always go through that one node, and then you don't have to worry about getting clobbered by somebody else proposing something at the same time. So that's Multi-Paxos, and Raft has something similar built in with its own leases; they basically work the same way. 
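The lease-plus-heartbeat idea behind Multi-Paxos can be sketched as follows. This is a toy single-process illustration, not how any real system implements it; the class name, the 10-second lease length, and the way the expiry is simulated are all assumptions for the example.

```python
# Hypothetical Multi-Paxos lease sketch: one node is anointed leader for a
# fixed lease window; while heartbeats keep the lease fresh, it can propose
# commits back-to-back without competing proposers clobbering it. If the
# heartbeats stop long enough, the lease expires and we run leader election.
import time

LEASE_SECONDS = 10.0   # lease length; real systems pick different values

class LeaderLease:
    def __init__(self):
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        # Receiving a ping response from the leader renews the lease.
        self.last_heartbeat = time.monotonic()

    def leader_alive(self):
        return time.monotonic() - self.last_heartbeat < LEASE_SECONDS

lease = LeaderLease()
print(lease.leader_alive())   # True: heartbeat just received
lease.last_heartbeat -= 60    # simulate 60 seconds of missed heartbeats
print(lease.leader_alive())   # False: lease expired, elect a new leader
```

The timeout doing double duty here is the same one from the split-brain discussion: a node that cannot reach a majority must wait out the lease before anointing itself.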
All right, so just to reiterate: with two-phase commit, nodes block if the coordinator fails after the prepare message, until it recovers. Paxos is non-blocking if a majority of the participants are alive, and it'll be able to converge and commit things as long as there's a long enough period where you don't have nodes disappearing all the time, okay? All right, so we've been talking about replication a lot; let's go into a bit more detail. So replication: the idea is that we're gonna maintain multiple copies of database objects, or portions of the database, on multiple machines. And the goal here is availability: if one node goes down, we don't have to take the whole system down; we can have another node service the requests as needed. The challenge, of course, is how do we actually support this? There's different ways to set things up. Where do the writes go? How do you propagate the updates? What do you do if the nodes are very far away from each other, and so forth, right? So we're gonna go through each of these one by one. The first question is, what's the configuration? And there's basically two approaches: primary-replica and multi-primary. The textbook I think calls them primary replica; sometimes you'll see them called leader-follower. The older term would be master-slave; of course, we don't use that anymore. If you do Google searches, most of the time if you Google "primary replica" you'll get a bunch of different stuff; if you search "master slave database," unfortunately, that's where most of the material is written, because it's older. The ideas are the same. So under primary-replica, all the updates from transactions for any object have to go to some primary copy, right? And then the primary is responsible for sending out the updates, the changes, to the replicas, without an atomic commit protocol. 
Then when you actually wanna commit the transaction, you go to the replicas and say, okay, we wanna go ahead and commit this change. Again, with Spanner, we saw that with Paxos. So as I'm making updates, like running update queries one by one, I just send out my write-ahead log to the replicas and they can apply the changes to update their state. You do this instead of actually rerunning the entire query, because the write-ahead log is a more compact representation of the changes that were made; you don't have to rerun the query, you just apply the change. So the replicas are almost like in recovery mode, like a single-node system recovering from a crash. When you set up the replica and say, I'm gonna be following this primary, you can't run any of your own update queries on that node itself, because it's always applying the write-ahead log. All right, so then when the primary goes down, we just hold leader election with Paxos or Raft like we saw before. Multi-primary is the more complicated case. This is what's sometimes called multi-home, where the data will be located at different replicas, but the system is allowed to run update queries on any one of those replicas. There's no longer a distinction between the primary and the replica; everybody's a primary. So now when we go and commit a transaction that updated data, we have to make sure that all the other copies of that data agreed to do it as well. So let's look at these visually. Primary-replica is like this: again, we have partition P1, all reads and writes go to the primary, and then we send out the write-ahead log changes to the replicas, and they just apply them. To deal with the comment from before, won't the primary become the bottleneck if it's committing all the transactions? The answer is yes. So a common setup is to allow the read-only queries to go to the replicas. 
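The log-shipping scheme just described can be sketched like this. It's a deliberately simplified, in-memory illustration, assuming made-up `Primary`/`Replica` classes and a tuple-shaped log record; real systems ship binary write-ahead-log pages over the network.

```python
# Hypothetical sketch of primary-replica log shipping: the primary applies
# the update locally, appends a compact write-ahead-log record, and streams
# that record to the replicas. The replicas just replay the log, like a
# single-node system permanently stuck in recovery mode.
class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, record):          # replicas only replay log records
        op, key, value = record
        if op == "SET":
            self.data[key] = value

class Primary:
    def __init__(self, replicas):
        self.data, self.replicas = {}, replicas

    def update(self, key, value):
        self.data[key] = value
        record = ("SET", key, value)  # the change itself, not the SQL query
        for r in self.replicas:
            r.apply(record)           # ship the log record to each replica

replica = Replica()
primary = Primary([replica])
primary.update("A", 2)
print(replica.data)   # {'A': 2}
```

Note the replica never re-executes the query; it only applies the final change, which is why shipping the log is cheaper than shipping SQL.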
You don't need to coordinate on commit with the primary because you didn't make any changes. Yes, so the statement is, and he's correct: if I have a transaction that gets committed here, or say a transaction is actually running here, won't my reads potentially not see it? Correct, yes. But that might be okay, right? It is a slight sacrifice on the timeliness or the consistency of the data, correct, yes. Think about real-life applications, right? Who cares if I'm slightly off? Like on Reddit, the number of upvotes on a post: if it's 99 versus 100, do I care? Probably not; actually, you shouldn't, right? In that case, it's okay to do stale reads. So the question is, how do you make sure you route queries to the replica? This is something I don't think database systems do automatically. They could, but I don't think they support it, because they don't know when you're okay with stale reads, right? So typically it's either done in the application code, or there's a proxy in front that says, oh, if it's a read-only transaction, I'll route it to the replica. There might be a middleware in front of this that's routing queries. Multi-primary looks like this: we have two copies of P1, both of them can accept read and write queries, and now when I want to commit a transaction, I've got to figure out between the two of them, if they're updating the same object, which one's allowed to commit, right? So I always like to use Facebook as the example of multi-primary versus this thing here, because they started with this one and then eventually went to that one. In the first version of Facebook, like late 2000s, they had a single primary database in a data center in California, and all the updates, no matter where you were in the world, went to that primary. 
And then through change propagation, they sort of sent out the write-ahead log, and it would eventually get propagated to the other data centers they had around the world, right? So what's the problem with that? If I'm down in Brazil and I update, what do they call it, Timeline, whatever that thing is? So they update their status, there you go. I'm updating my status, but then I refresh the page. I'm gonna be reading from my local copy, and I wouldn't see my own update. So then people would think it's broken and try to submit the same thing over and over again. So the trick they played was, in the browser, they would maintain a cookie with your local status update or whatever you changed, so that when you refreshed the page, it would check the cookie and say, here's the thing I need to put into the webpage I'm rendering. Because even though I'm reading stale data here, eventually the change would get propagated, and then it knew the cookie was no longer necessary; it could read it from the local copy, right? Then, maybe mid-2010s or so, they switched to a multi-primary approach, where, again, if I'm down in South America, I do all my writes locally, and those things will still get propagated to the other data centers, but when I do my reads, I'll have local reads. Yes? Sorry, is it possible to have multi-primary with replicas as well? Yeah, so the statement is, could I have this, but where node two also has multiple copies locally? Yeah, that's the Spanner thing I just showed you, right? One guy would be the leader; all writes within that region go there, and then it's responsible for propagating to everyone else. But when it has to coordinate up above with the other region, then you would run two-phase commit for that. 
Okay, K-safety we already talked about. Again, for replication it's saying: how many nodes can go down before I stop everything? We want to stop the world because, if we have no replicas up and only a single node, and we do a bunch of writes to that node but then it crashes, there's no other copy, so we might just lose everything, and that would be bad. So there's a basic stop threshold: the system goes offline and doesn't accept any new transactions until another node is available. All right, the next question is, how do we actually propagate the changes? Some of this we've alluded to so far, but to state it more concretely, there's basically two approaches: synchronous and asynchronous. Synchronous is what we would call strong consistency, meaning if I apply a change to a primary, I do not tell the outside world that the change is persistent and the transaction's committed until all the nodes agree. The alternative is asynchronous, or eventual consistency, where I do a write and I assume it's eventually gonna get to my replicas, so I tell the outside world your transaction's committed, and at some later point it'll eventually get applied. But I could have a small window where I read the data at one of those replicas that hasn't gotten the change yet, and I don't get the result that I expect. So it looks like this. With synchronous replication, again, we have a primary and a replica. We want to commit the transaction, so we go to the replica and say, hey, here's this change, go ahead and flush it, and we stall and wait on the primary until the replica comes back and says, I've done that, like actually flushing the write-ahead log to disk. Once it's flushed, the replica sends back the acknowledgment to the primary, and then we can send the acknowledgment back to the application, right? 
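The difference between the two propagation modes comes down to where the acknowledgment happens relative to the remote flush. Here is a minimal sketch of that ordering, with both variants shown for contrast; the function, class, and method names are all invented for this example, and the "replica" is just an in-process mock.

```python
# Hypothetical sketch of the two propagation modes: synchronous replication
# waits for the replica to flush its write-ahead log before acknowledging
# the commit; asynchronous acknowledges immediately and lets the flush
# happen later (leaving a small window where a crash can lose the change).
class MockReplica:
    def send_flush(self, txn):
        self.pending = txn           # log records shipped, not yet durable

    def wait_for_flush_ack(self):
        self.durable = self.pending  # pretend the remote flush completed

def commit(txn, replica, synchronous=True):
    replica.send_flush(txn)          # ship the write-ahead-log records
    if synchronous:
        replica.wait_for_flush_ack() # stall until the log is on disk remotely
    return "committed"               # only now tell the outside world

r = MockReplica()
print(commit("T1", r, synchronous=True))   # durable on the replica first
print(commit("T2", r, synchronous=False))  # acknowledged, flush still pending
```

In the asynchronous case, `T2` sits in `pending` when we acknowledge, which is exactly the window where a crash on both nodes can lose the commit.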
So at that point, if either node crashes, we're guaranteed to be in a consistent state. Asynchronous basically means that I go to commit, I send the flush request to the replica, but I don't wait, and I tell the outside world that I've committed. Now there's a small window here where the primary may crash, and this flush may not have actually been applied, so when I come back I might lose changes, right? Or if I do a read on this guy, I might not see the commit I expect to see. So this goes to the example I gave before: stale reads are sort of an example of eventual consistency, and that might be okay for some applications. But for things where you want exact measurements or exact calculations, you want to use strong consistency, synchronous replication. By default, most database systems that support replication from a primary to a replica use asynchronous, just because it's so slow to wait for the flush and the response back. You say, all right, well, it's a five-millisecond window; maybe it's not that big of a deal if I lose my last five milliseconds of data. Yes? So your question is, if when I do an update I have to synchronize with everyone, and that's slow, then what's the purpose of the replicas other than being a failsafe? I would argue failsafe is very important, right? If everybody's updating one record, then replication adds additional overhead to keep things synchronized, and you're not gonna get better performance. If, though, I have a workload where I can partition it and have different machines hold different portions of the database, then they can run in parallel; maybe sometimes they have to synchronize, but most of the time they don't, and then I get the horizontal scale that I want. But I would argue you should not underestimate the importance of high availability, right? 
Again, OLTP workloads are the things interacting with the rest of the world for the most part, like a website. If you wanna buy something on a website and it crashes, or the database crashes because it's only running on one node, that company's losing money, right? So it's super important that the OLTP database does not go down. So you're willing to pay this performance penalty for replication because you want to be available. And again, I keep saying this: there are some applications where you don't need strongly consistent transactions, you care about availability, so you relax the synchronization guarantees in exchange for always being able to serve query requests, even though the data you're getting back might not be what you want. So when I say flush, I mean flushing the write-ahead log, yeah. Your question is, sorry, under asynchronous replication: if I'm back here, I say commit, I send the acknowledgment to say I got it, but I didn't actually wait for the flush; if both of them crash, could I lose my commit? Yes, yes. Your question is, how do I avoid inconsistent writes when using asynchronous propagation in the multi-primary case? You can't, right? There can be a small window where, if you're unlucky, you can't guarantee it. Okay, so in the interest of time, I'm gonna go through this quickly. This is about when you send the changes to the other node: is it all at once on commit, or continuous? Continuous means that for every update query, as soon as I apply the query and put the write-ahead-log change in my memory buffer, I also put it on the network and send it over to the other node. The alternative is that I batch all my update messages in my write-ahead log locally, and then when I commit, I send them out. 
And the difference is that if I have a transaction that aborts, then if I send everything all at once at the end, for the aborted one I don't send anything. No system does the first approach. Active-active versus active-passive: let's skip this; I mean, it's important, but most systems are active-passive. Spanner, again, we may cover this at the end of the semester as well; it's an important system. Actually, at this point in the semester, for all these things at the top here, everyone should know what I'm talking about, right? Geo-replicated means it's replicated across different data centers around the world. It's semi-relational; well, the newer version supports JSON now, so it's relational. It's decentralized, shared-disk, and it has log-structured on-disk storage. It was originally built on Bigtable, then they wrote their own storage manager. It's doing strict two-phase locking with multi-version concurrency control, plus Multi-Paxos and two-phase commit. Externally consistent means they guarantee that the order in which transactions commit corresponds to the order in which they arrive in the system, and they have some special magic to do that. And we can ignore lock-free transactions. So this is more of what I already said before: the database can be broken up into these tablets, the partitions; within each tablet group we elect a leader, and then we use two-phase commit when we have to span multiple tablets. And this is just the setup I showed before. The point I was trying to make here is that at this point in the semester, I can list a bunch of different buzzwords for a database system, and hopefully you guys will be like, oh yeah, I know what that is, I know what that is, and start putting these things together. Because that's how you're gonna see database systems in the real world. It isn't gonna be like, here's the one lecture on two-phase locking. 
We don't just cover two-phase locking once and we're done with it. It's gonna be two-phase locking plus MVCC plus sharding; it's gonna be all the things together, cumulatively, that we talked about through the semester. And so hopefully you start putting the pieces of the puzzle together. Yes? Why do you need two-phase locking when you use MVCC? The question is, why do you need two-phase locking when you use MVCC? Because you have to take locks on the objects when you do writes. So MVCC alone is not the... So, yeah, the statement is, MVCC alone is not enough to protect the database for those transactions. You can do optimistic concurrency control with MVCC; it's somewhat similar, because I'm writing things to my private buffer, but I only check at validation, when I commit, whether there were any conflicts. With two-phase locking, you're acquiring the locks as you apply the updates and make new versions. Okay, there's other magic in Spanner, where they actually use hardware clocks. They use two things: atomic clocks that are attached to the rack, and a GPS receiver on the roof of the data center. And they use those to have really, really accurate clocks. When I read that, I was like, that's amazing; that's Google money. As far as I know, no other system does that. There are ways to get around it: Cockroach and Yugabyte use hybrid clocks, but Google, as far as I know, is the only one reading times from satellites. All right, so I wanna briefly talk about the CAP theorem. Again, this is something else that's gonna come up when you start looking at real-world database systems. The CAP theorem is a bit old now. 
There's a newer PACELC theorem that's a sort of modern adaptation of the CAP theorem, but the CAP theorem by itself is, for our purposes here, good enough to reason about distributed databases. So the CAP theorem came out in the late 1990s, early 2000s, from this professor at Berkeley named Eric Brewer. He's now like a VP or something at Google; I think he's still at Berkeley too. And he basically says, when you're building a distributed system, you have these three sort of notions. Like, is it gonna be consistent across all nodes? Is it always gonna be available? And is it gonna be able to tolerate network partitions, like when the network gets severed and nodes can't communicate? And the CAP theorem basically says you can pick two out of three. That's not entirely true, because some combinations aren't really on the table; there are portions of it you can't do, like being always available and partition tolerant at the same time. And then PACELC basically says, assuming everything goes okay and there's no partition, there's another trade-off you have to consider, between how fast I can run transactions and how up-to-date, or consistent, things will be across the different nodes. That's what PACELC adds. We can ignore that for now. So you can think of the CAP theorem as a Venn diagram. Consistency here means linearizability, basically strict serializability on a single object. Availability means that all the nodes can satisfy any request at any time. And the last one, partition tolerance, means that if nodes can't communicate with each other, I can still run the system and still get results back. And the no-man's land is in the middle here: it's been theoretically proven that you can't have a system that satisfies all three properties. So let's go through each of these one by one and sort of see some of the things we've already talked about before, about what it means to have stale reads and updates and so forth.
So, consistency in the CAP theorem: the idea is that I can guarantee that any change I make to an object in the database, like a key in the database, will be applied to all the other copies of that data, and I can guarantee that after my transaction commits, any query that comes along afterwards is guaranteed to see my changes. So this application here wants to set A equal to two. Say we have a replica over here, so we're gonna send a message to apply the change. We can use Paxos to say this change has been committed, or two-phase commit. And then once the replica has acknowledged that it applied the change, we can send the acknowledgement back to the application that the change has been persisted. So now at this point, if another application server comes along and reads A, it will be guaranteed to see the correct version, because we've told the outside world that our transaction has committed, right? It basically says that when the primary for this copy of the data tells the application you've committed, all the replicas will see that change. Availability says that if the replica dies, we can still always serve requests. So I can read B and get back a result here. The other application server can read A; notice it skips the one that's failed, goes over the network, and reads from the primary, right? And gets back the right result. That's fine. Partition tolerance says that if the network now goes down, then when I run my Paxos leader election, this node over here is gonna say, well, I can't communicate with that one, I'm assuming he's down, so I'm gonna make myself the new primary. This guy says, I can't communicate with that guy, he's down, so I'm still the primary. But now the problem is that if the two application servers then try to apply writes to the same object on the different nodes at the same time, one could update it to two, the other could update it to three. Both think they're the primary, both say this is allowed, and both send back acknowledgements.
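The split-brain scenario just described can be sketched as a toy example (the node names and values here are made up for illustration): two replicas lose contact, each naively promotes itself to primary, and each acknowledges a conflicting write to the same key:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {"A": 1}
        self.is_primary = False

    def heartbeat_lost(self):
        # Naive (broken) failover: "I can't reach the other node,
        # so it must be down -- I'll promote myself."
        self.is_primary = True

    def write(self, key, value):
        assert self.is_primary, "writes go to the primary only"
        self.data[key] = value
        return "ack"  # acknowledged back to the application

n1, n2 = Node("n1"), Node("n2")
n1.is_primary = True                 # n1 starts as the primary

# Network partition: both sides promote themselves.
n1.heartbeat_lost()
n2.heartbeat_lost()

# Two application servers each write to "the primary" they can reach.
n1.write("A", 2)
n2.write("A", 3)

# When the partition heals, the copies disagree, and both writes
# were already acknowledged to the outside world.
print(n1.data["A"], n2.data["A"])
```

Both writes got an ack, so the system stayed available under the partition, but consistency is gone: the two copies of A now have to be reconciled somehow.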
But now the network comes back and I have to resolve the conflict, right? So you kind of see how this violates consistency, but I got availability, right? The server's still up, I can still communicate with it, but I'm getting back inconsistent results. Or, the other case: if I wanna make sure the data is consistent, then I can't tolerate network partitions, because I would run my leader election, not have a majority of nodes, and therefore not be able to service any requests until I get reconnected, right? So again, as always in databases, or computer science systems in general, there's no free lunch, right? You can't build a system that can do all these things. There was a system called NuoDB where the guy claimed in some forum that their database defeated the CAP theorem; it was all bulls*** and they backed down, like, you can't have a system that does all this. So what has come out in the last decade is, you know, the CAP theorem doesn't describe everything you would care about in a distributed system, but basically what happened is that there were all these NoSQL systems that were choosing high availability and network partition tolerance over consistency. So they were okay with, you know, maybe doing updates on the same object on both sides and having people manually reconcile them when the network got reconnected. But the traditional distributed relational database systems with SQL and transactions, they would choose consistency and availability over network partition tolerance, meaning, like, if I can't have all my nodes available, or I can't communicate between my nodes and have a majority, I'd rather turn the system off and not accept any new requests or transactions, because I don't want this split-brain problem of doing updates on one side and another and getting inconsistent results. So since then, the NoSQL guys realized, oh, transactions are a good idea after all.
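The "turn the system off rather than go inconsistent" choice boils down to a majority check during leader election. Here's a minimal sketch, assuming a hypothetical five-node cluster (the numbers are invented for illustration): a node may only become leader, and keep serving requests, if it can reach a strict majority of its peers.

```python
def can_lead(reachable, cluster_size):
    """A node may become leader only if it can reach a strict
    majority of the cluster (counting itself)."""
    return reachable > cluster_size // 2

CLUSTER = 5

# Minority side of a partition: only 2 of 5 nodes reachable.
# A consistency-first system refuses to elect a leader here and
# stops accepting requests instead of risking split brain.
print(can_lead(2, CLUSTER))  # False

# Majority side: 3 of 5 nodes can still elect a leader and proceed.
print(can_lead(3, CLUSTER))  # True
```

Since at most one side of any partition can hold a majority, two nodes can never both promote themselves, which is exactly the failure the split-brain scenario above runs into.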
Google actually claims this in one of their papers, where they talk about how, without transactions, all the random programmers end up writing their own inconsistent data-handling code, and they always do a bad job. So it's better to have a fast transactional system that everyone just uses, and that's what Spanner became. So a lot of the NewSQL systems since then have added, or are trying to add, support for strongly consistent transactions, because in a lot of cases it's a good idea. But again, there are some applications where, like, it's your shopping cart, who cares if it's off? Right, if there's a failure, it's not the end of the world. Okay, so as I said, this is a crash course, no pun intended, on distributed databases for transactions. So hopefully the main takeaway is that there's a lot of different things that can happen, a lot of things that can go wrong, and this is really hard to do. And this, again, is why you'd want to use a distributed database system that's been well written and tested, that supports transactions and strong consistency, instead of having a random programmer try to build this stuff themselves, right? Because things will go bad even if you control the nodes, even if you have high-end hardware; there's always gonna be problems. And as I said before, the blockchain guys, they solve a different problem, where the nodes could potentially be malicious, and you don't use Paxos for that, you use another protocol. But all of that is orders of magnitude slower than everything we talked about here. Like, if you were concerned that all this replication seems super slow, why would we want to do this, well, blockchain is even slower, it's even worse, it's stupid. So if you're interested in this kind of stuff, I encourage you to check out this blog here from Kyle Kingsbury; he has this thing called the Jepsen project.
He started doing this, I don't know, 10 years ago or so, where basically he built a torture chamber for distributed databases. He would try different things, like network failures, node failures, funky orderings of requests, and he would see whether distributed database systems that claimed to be strongly consistent, or claimed to support distributed transactions, actually behaved correctly; his thing would break them. And he has these amazing write-ups, it's almost like reading a paper, describing how these different database systems get things wrong, and he actually works with the companies as a contractor, a consultant, to have them fix it. So there's a bunch of databases that have gone through this process where they claimed, oh, we support transactions, he rolls through, and then they have to change their marketing language. So it's highly amusing; I encourage you to check that out. All right, so next class we'll do distributed OLAP systems, how to run large analytical queries on a multi-node system, okay? Hit it.