DJ Jupil is not here. He got sick. I took him to the free clinic this morning, so hopefully he'll be back here feeling better on Wednesday. Again, homework five is due on the Sunday coming up, on the 3rd. Project Four is due the following week, on the 10th, also at midnight. Next Monday, not in class but over Zoom, we'll have a guest speaker from SingleStore to talk about their distributed database system. It'll cover a lot of the core ideas we've been talking about the entire semester, and then you'll realize, oh, Andy's not crazy, he didn't just make this up; this is how people really build systems. And as I posted on Piazza yesterday, please go vote. We'll do a speed run on the last day of class: we'll start off with the final review for the final exam, and then we'll just plow through as many systems as you guys want to learn about. Instead of having a giant dropdown of 900 different databases, because nobody would click through that, just copy and paste URLs in. That way I can sort by uniqueness, right? Just do a group-by on it. There's a post on Piazza that takes you to the Google Form, and you can vote as many times as you want; I don't care. I think the number one vote this morning was Redis. It's an interesting system. All right, the final exam will be Tuesday, December 12th, somewhere; we'll announce the room. It'll be 8:30 in the morning, and because it's early and that sucks, we'll do coffee, donuts, and cigarettes for everyone in the morning. And then Jake Nash is teaching this class by himself in the fall, so if you want more databases and want to go even deeper into BusTub, please sign up to be a TA. We'll post on Piazza how to make that happen, all right? Any questions? We're almost at the end of the semester. OK, all right, let's get through it.
So last class, we started off talking about distributed database systems. It was a high-level introduction to the key concepts about how people think about, reason about, and design these systems. Unfortunately, there's no distributed database course at Carnegie Mellon. Eventually, maybe we'll teach one, but right now it's just me, so we can't do that. So last lecture and this week's lectures are pretty much all the low-level detail you'll be able to get on distributed databases at Carnegie Mellon. But again, the idea is to give you a litany of different concepts, ideas, and techniques, so that when you go out into the real world and see these ideas come up, you can reason about how people describe their systems. So we started off talking about the different system architectures. We said shared everything is basically what BusTub is: a single-node system where the processes or workers can all communicate very quickly. Then shared disk and shared nothing are the two major distributed database architectures. Shared disk means there's a single repository or storage device that all the worker nodes can read and write to. Shared nothing means each node in the system has some portion of the database, and you have to send messages and pass data between the nodes directly. Then we talked about how to do partitioning or sharding: simple range partitioning, hash partitioning; again, splitting the database up into disjoint subsets and distributing them across different nodes. And then we talked about how to do transaction coordination at a high level. Like, is there a centralized coordinator that's in charge of taking all the queries, figuring out what locks or what data they're going to touch, and communicating with the different nodes?
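As a quick aside, the range and hash partitioning schemes just recapped can be sketched in a few lines of Python. This is only an illustration, not any particular system's implementation; the key format, node count, and boundary list are made up:

```python
import hashlib

NUM_NODES = 3

def hash_partition(key: str) -> int:
    # Hash partitioning: hash the key and mod by the node count.
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

def range_partition(key: str, boundaries: list) -> int:
    # Range partitioning: sorted upper bounds, one per node except the last,
    # which catches everything above the highest boundary.
    for node, upper in enumerate(boundaries):
        if key <= upper:
            return node
    return len(boundaries)

# Any node (or a coordinator) can compute where a tuple lives
# without asking anyone else.
print(hash_partition("user:42"))        # some node in [0, NUM_NODES)
print(range_partition("alice", ["m"]))  # 0 (falls in the a-m range)
print(range_partition("zoe", ["m"]))    # 1 (falls after m)
```

The key property in both cases is that the mapping is a pure function of the key, so every node computes the same answer independently.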
Or is it decentralized, meaning the nodes are free to decide on their own how they're going to commit and order transactions? So today's class, we're going to focus on transaction processing. Wednesday's class will be about analytical processing, or decision support systems. We've talked about this many times; I just want to bring up the distinction between the two today, because it'll help us understand the different design trade-offs, the things we care about in one category of system versus another. Something like Postgres is meant to be a general-purpose database system. It's a row store and it can do transactions. Primarily people use it for OLTP, but there are some aspects of it that you would see in an OLAP system. But there are certain things Postgres does that you wouldn't want in a system like Snowflake, because Snowflake isn't worried about running transactions. So the two rough categories can be broken up by these characteristics. For OLTP workloads, transactions are going to be short-lived, and they're going to be a combination of read and write queries. Short-lived means less than 100 milliseconds, and even that's a long time; 50 milliseconds is considered about the max you want for a transaction. Anything longer than that is considered a long-running transaction. That 50-millisecond number comes from conventional wisdom in internet advertising. When you visit a website, assuming you don't have an ad blocker turned on (if you don't, please install one), the ad-serving company runs an auction: it sends information to all the people who might want to place an ad and asks them to bid, saying, here's the information about this person visiting the website. And you have to give back a response in roughly 50 milliseconds. That's roughly where the number comes from.
It might be down to 30 milliseconds nowadays. If you're doing high-frequency trading, those guys are freaks and they want things in less than a millisecond, but the workload basically still looks the same: a lot of small gets and sets. And these operations are going to be very repetitive. Again, think of going through a website: you can only do so many things on a website. You're not sitting at a raw terminal typing in SQL queries; you're running application code that's going to run the same queries over and over again. And there are certain optimizations you can do if you know you're going to be running the same queries over and over again. For OLAP, and we'll focus more on this workload on Wednesday, these are your long-running queries: typically read-only, doing a lot of joins, and oftentimes one-off ad hoc queries. Someone's sitting at a dashboard, clicking a bunch of buttons to form a query; they click Go to generate the visualization, and it's maybe the first time the database system has ever seen that query, and it may never see it again. That limits what sort of optimizations you can do. All right, so this is the setup we care about today. We have some application server that wants to run a transaction that's going to touch data at these three partitions. I'm not saying whether it's shared disk or shared nothing; for our purposes right now, it doesn't matter. Somehow we've elected this first partition here to be the primary node. So the begin request for the transaction goes to this node, and let's say there's no centralized coordinator and the application's allowed to send queries directly to the various nodes. Then when it comes time to commit, it goes to the primary node and says, hey, can I commit? And the primary node is responsible for figuring out, amongst the nodes that participated in the transaction, whether this thing's allowed to commit.
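That vote-collecting step on the primary can be sketched in a two-phase-commit style. To be clear, this is a minimal sketch: the Participant class and its prepare/commit/abort methods are invented stand-ins for partitions voting over the network, and a real protocol also logs every step for recovery:

```python
class Participant:
    """Stand-in for a partition; a real one votes over the network."""
    def __init__(self, vote_yes: bool):
        self.vote_yes = vote_yes
        self.state = "active"

    def prepare(self) -> bool:   # phase 1: are you able to commit?
        return self.vote_yes

    def commit(self):            # phase 2: everyone said yes
        self.state = "committed"

    def abort(self):             # phase 2: somebody said no
        self.state = "aborted"

def coordinate_commit(participants) -> bool:
    # Phase 1: collect votes from every node that touched the transaction.
    votes = [p.prepare() for p in participants]
    decision = all(votes)        # unanimous yes is required to commit
    # Phase 2: broadcast the decision to everyone.
    for p in participants:
        p.commit() if decision else p.abort()
    return decision

nodes = [Participant(True), Participant(True), Participant(True)]
assert coordinate_commit(nodes) is True       # everyone voted yes

nodes = [Participant(True), Participant(False)]
assert coordinate_commit(nodes) is False      # one no vote aborts it all
```

Note that each node still runs its local concurrency control (two-phase locking, OCC, whatever) to decide its own vote; this layer only gets the nodes to agree on the outcome.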
All right, so our focus today is really this safe-commit step: how to get everyone to agree that, yes, this transaction is safe to commit, and then, if they all say yes, to actually commit it. All the stuff we talked about before, two-phase locking, multi-version concurrency control, OCC, all of that is still happening here. That's still the mechanism used to determine whether this thing is allowed to commit on each individual node. There's just a higher-level process above that, trying to get everyone to agree that it's now time to commit this transaction. That's the big picture of what we're trying to do today. And even though my last example showed the application going directly to different partitions, ignore that for now. The big picture of what we're trying to achieve today, and what you would want in any distributed database system, is a single logical view: you run against what looks like a single database system, even though underneath the covers it's composed of multiple physical servers or multiple physical resources. So we haven't yet talked about how we're going to get the nodes to agree that we can commit a transaction, and then, if we decide it is going to commit, how we make sure it does commit. If there's a crash or a failure, then when the system comes back up, if we told the outside world that a transaction committed, we have to make good on that guarantee. That's the D in ACID, the durability guarantee. If the whole thing goes down and comes back up, that's one thing. But what happens if just one node goes down, which will happen? How do we deal with that? And what happens if we send a message to commit a transaction and the message disappears somehow, because someone tripped over a cable, or there's a weird hiccup in the network or something?
And then our message to commit this transaction shows up late. What do we do? What does the node do in that setting? And what happens if we don't want to wait for everyone to agree to commit, but we still want to commit in some cases, or we don't want to block everything when one node goes down? So one very, very important assumption we're going to make in today's discussion is that all the nodes in our distributed database system are well-behaved and under the same administrative domain. What I mean by that is: if you're the database system operator, these are nodes you control, that you own or are renting or whatever, and they're running the software they're expected to run as part of the database system. That means we're not in some weird untrusted world where there are nodes in our distributed database that we don't control. It's not going to be the case that we go commit a transaction and one node turns nefarious and starts trying to screw with us, saying, no, no, no, we're not going to commit that. Or, after everyone agrees that it commits, some node isn't going to come back later and say, hey, look, I lied, we didn't commit that transaction, and start trying to change the state of things. Tolerating that kind of behavior is called Byzantine fault tolerance, and that's what you get with a blockchain. A blockchain is basically just a database: a distributed ledger, a distributed write-ahead log. And in that world, with something like Bitcoin, where it's a bunch of people running a peer-to-peer network with no single authority controlling all the different nodes, you need a BFT protocol. We don't care about that shit, right?
That's a bunch of overhead we don't have to deal with. And to be honest, if you want to run transactions at scale, you wouldn't want to use a blockchain anyway, because they measure their latencies in seconds. If you go to a website and make an order on Amazon and it takes 20 seconds to commit that transaction, you're going to give up, right? So nobody runs a real database system doing transactions on a blockchain. This is stupid. The real world doesn't work this way. We can ignore that entirely, okay? Now, that doesn't mean a node won't crash, disappear, and come back saying, hey guys, what's going on, I missed all the updates, and then you've got to go bring it up to date. But that's just a hardware failure, and we'll have mechanisms to deal with that. We're still not worried about whether the node is nefarious or not, okay? It makes our lives a lot easier if we don't have to worry about Byzantine fault tolerance. All right, so today's lecture is really me attempting to condense a year's worth of material on distributed databases into one or two lectures. We can't cover everything in detail, so we're going to go through the most important things: what you need to be aware of that exists, and the challenges you'll face when you have a distributed database system. So we're going to first talk about what replication looks like in this environment. For most people, when they say, oh, my database can't scale, the very first step you should actually take is replication. You should probably be doing this anyway for high availability and durability, but depending on your workload, you can also offload some reads to replicas to alleviate some of the load and start scaling things out.
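That read-offloading idea is often just a routing decision in the driver or middleware. Here is a minimal sketch of what such a router might look like; the connection strings and the crude read-only check are placeholders, not any real driver's API:

```python
import random

PRIMARY = "db-primary:5432"
REPLICAS = ["db-replica-1:5432", "db-replica-2:5432"]

def route(query: str, needs_latest: bool = False) -> str:
    """Send writes (and reads that must see the newest data) to the
    primary; everything else can go to any replica."""
    is_read_only = query.lstrip().upper().startswith("SELECT")
    if is_read_only and not needs_latest:
        return random.choice(REPLICAS)   # spread read load across replicas
    return PRIMARY                       # writes always hit the primary

assert route("UPDATE accounts SET balance = 0") == PRIMARY
assert route("SELECT * FROM accounts") in REPLICAS
```

The `needs_latest` flag foreshadows the consistency question we'll keep coming back to: a replica may lag the primary, so reads that must observe the most recent committed write still have to go to the primary.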
So even though most people don't need a massive distributed database system like Spanner, you're almost always going to need replication, if you care about your data. Then we'll talk about atomic commit protocols: two-phase commit, Paxos, Raft; how to actually get everyone to agree. I realize there's a distributed systems course at CMU that does a better job and spends more time on these things, but we just want to describe them in the context of databases. Then we'll talk about consistency issues in the context of the CAP theorem, or PACELC, which is the follow-up to it; again, in the context of databases. And if we have time, we'll finish off by talking about Spanner real quickly, just because it encapsulates a lot of the ideas we'll talk about today. I consider it a state-of-the-art system, even though Google put it out over 10 years ago. It does a lot of very interesting things, and it does one thing that nobody else does, which you can do if you're Google and have Google money. Anybody want to take a guess at what that is, if you're familiar with Spanner? Atomic clocks, yes, we'll get there in a second. So, spoiler alert: Google puts satellite hookups in every data center to get the time from GPS satellites, and has atomic clocks in every data center. That doesn't make things magically always in sync, but it reduces the bound on how much time you have to spend waiting for new transactions to show up. Again, we'll get there in a second. But they're still going to do Paxos, they're still going to do two-phase commit, they're still going to do MVCC, all this other stuff. Okay. So with replication, the idea is that we want to replicate the database, either a portion of it or all of it, across multiple nodes. We want to do this to increase availability and, in some cases, scalability. Not always, but in some cases, yes. And we want to do this whether the database is partitioned or not.
If it's partitioned, meaning I split my database up into disjoint subsets, I still want to have multiple copies of those disjoint subsets, so that if one node goes down, I can still potentially serve queries. If it's not partitioned, which is most database systems, then I want to be able to use the replicas for offloading, say, read-only queries. So the idea again is that we want multiple copies of the data so that if any node goes down, we can still potentially remain online. There's a bunch of design decisions we have to make if we want to add replication: what the overall configuration or architecture looks like, how we're going to propagate the updates, when we should propagate the updates, and the method by which we propagate those updates. At a high level, there are basically two approaches to replication: what's called primary-replica, and multi-primary. Primary-replica is the most common one. Sometimes you'll see it called leader-follower. The older term is master-slave; we obviously don't say that anymore, but the problem is that a lot of the literature describing this configuration is a bit older, so you may have to Google "master-slave"; the newer material will usually call it primary-replica. The idea here is that all our updates go to a single node, the primary node in the database, and it's that primary's responsibility to then propagate those updates to any replicas. That's leader-follower, or primary-replica. In some cases, depending on how the system exposes this capability, you're allowed to run read-only transactions on those replicas. And you can do this because everything is transactional: if I do a transactional update on the primary, then I propagate that update to the replica, and the update is applied atomically, then any read query will see a consistent view of the database on the replica.
It may not be the most up-to-date version, depending on how we propagate the updates, but it'll be at least consistent. And of course, you can relax that and allow partial updates to be visible if you want. So what happens if the primary goes down? We hold some kind of election process, which will be Paxos or something like it, and we elect one of the replicas to become the new primary, and now all the writes go to that new primary. And if the old primary ever comes back up, it gets relegated to being a replica. Again, we'll handle that in a second. So the top one here is what most people are going to do. The bottom one, multi-primary, sometimes called multi-home, is where every object is replicated across multiple nodes, and any transaction can update that object, a record, a table, whatever, at any node. Then when you go to commit, the replicas need to synchronize amongst each other, because they're all considered leaders or primaries; they need to coordinate with each other to figure out who's allowed to commit, who has the latest version, and how to reconcile any changes. So those are the two visualizations of this. Again, in primary-replica we have a single primary and two replicas; all the read and write queries go to the primary, and the writes get propagated to the replicas. This is typically just sending the write-ahead log we talked about before. You're streaming the write-ahead log of all the changes coming out of the primary, and the replicas are more or less in recovery mode, as if they're reading a never-ending file from disk and replaying the write-ahead log to apply the changes. The write-ahead log records could be physical or physiological, like here are the deltas or diffs I'm making on individual pages, or they could be the queries themselves.
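That replay loop on the replica can be sketched as follows. The record format here is invented purely for illustration; real systems ship physical or physiological records, and the "log" would be a stream over the network rather than a shared list:

```python
primary_pages = {}
wal = []   # the primary's write-ahead log, streamed to replicas

def primary_update(page: str, value: int):
    primary_pages[page] = value
    wal.append(("SET", page, value))   # log the change, then ship it

def replica_replay(replica_pages: dict, log_records):
    # The replica is effectively in recovery mode forever: it just
    # applies each shipped record in log order.
    for op, page, value in log_records:
        if op == "SET":
            replica_pages[page] = value

primary_update("A", 10)
primary_update("B", 20)
primary_update("A", 11)   # later write to A supersedes the earlier one

replica_pages = {}
replica_replay(replica_pages, wal)
assert replica_pages == primary_pages   # replica converges to the primary
```

Because records are applied strictly in log order, the replica always moves through the same sequence of consistent states the primary went through, just possibly behind it.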
If you send the queries, things get tricky, because if there's anything in the query like timestamps or random(), you've got to make sure the replicas compute the exact same values for those functions when they run, and there are tricks to handle that. And as I said, some systems will let you run read-only queries on the replicas. Now, how you actually find these replicas, that's the coordinator middleware stuff we talked about before. You can either explicitly say in your application, I know the IP address of some replica, let me send my query there; or you send it to a middleware that says, oh, this is read-only, let me route it to a replica. And if you can offload all the reads to the replicas, then the primary is doing nothing but writes, which are typically going to be slower anyway. So this takes the overhead and the burden of running read queries off the primary and puts it on the replicas. In some cases this is fine: again, if you don't really care about having the most up-to-date view of the database, which will be a recurring theme throughout the entire lecture, then this is fine. But if you need exactly the latest version, then you have to go to the primary, because I haven't said yet how long propagation is going to take. We'll get there in a second. In multi-primary, again, every node is considered a primary and can take reads and writes. So now what happens if you have two transactions doing writes on different nodes? You have to propagate those updates across the nodes and then decide who's allowed to commit and what the latest version is. This is less common, because it's hard to do. A lot of times people will start with primary-replica, and then if the single write node becomes the bottleneck, they switch to multi-primary. Although it doesn't necessarily have to be a bottleneck.
It could just be that the geographical distance between the primary and the replicas is really far. The example I always like to use is Facebook. In the old days, 2010-ish, over a decade ago, Facebook used the primary-replica model. The primary data center was somewhere in California, but the replicas were all across the world. So if you were down in Brazil and did an update to your timeline, or whatever it's called now, that would get shipped up to California, stored in that data center, and then eventually propagated down to the replicas in Brazil and whatever other countries. Of course, what's the problem with that? The update has to travel that geographical distance and get propagated back. So if someone posts a cat picture on their timeline, clicks submit, but then refreshes the page, that page refresh is going to pull the data from the database, but it's going to pull from the local replica, and it won't see the update. People wouldn't see their own timeline updates. So Facebook played a little game by putting something in a cookie in your browser, so that when you refreshed, you saw your own writes locally, not actually from the database. They hid all that from you. But eventually they had to switch to the multi-primary model, because more things were doing updates where the browser-cookie trick doesn't work, and they had to scale it that way. All right, so the next issue we've got to consider is K-safety. I think this is primarily a database term, but the idea is that it's the number of failures your distributed database is allowed to tolerate before it decides it doesn't want to proceed any further, because it may end up losing data if there are more failures. This is sort of like, you know, think of a quorum, right?
Same kind of ideas we talked about before, but it's basically saying: how many nodes can go down, out of all the replicas I have of a given object in my database, before I decide enough is enough? We'll see this in a second when we talk about the CAP theorem, but in a distributed relational database, I'll say traditional, but in the relational database world, we typically don't like losing data; we care about ACID. So we don't want a bunch of nodes to go down, leaving one last copy of an object, then do a write to it and crash, so that now that portion of the database is missing. So you might say: I need to have at least two copies of every object in my database for the database to be considered online. And if I go below that threshold, the system just stops; it doesn't accept any new requests. There are self-healing models and other tricks where you can say, all right, I've gone below the threshold, so let me make another copy of this data before I run any more queries, so that I get back above the threshold; again, there are different ways to handle that. All right, I've already alluded to the next one as well: the propagation scheme. This is how we decide when and how we propagate changes from a primary to a replica, and this applies whether it's the primary-replica model or the multi-home, multi-primary model. There are basically two approaches, and obviously there are different degrees of propagation strength between them, but we'll just take the two extremes. The first is synchronous commit, or what's called strong consistency in the distributed systems world.
And the idea here is that if I do an update to an object that may have copies on multiple replicas, then my application doesn't get an acknowledgement back that my change has been committed or saved until all the replicas have been updated and agreed that the transaction has committed. Eventual consistency means that if I do a write, it will eventually get propagated to the replicas, but I may get a response back before they're updated. Yes, question. The question is: how do you make sure you'll be eventually consistent if there's no guarantee of when you get a response? We'll get there in a second, but the answer is, eh: you'll get there eventually, but when? Other questions? Yes. The question is: in a multi-primary setting, how do we prevent simultaneous updates to the same object, and how do we keep things consistent if writes are allowed on multiple nodes? Two-phase locking, OCC, all of that. You basically run two-phase locking across multiple nodes. It's the same protocols, except now we're distributed. Someone's going to have to build the waits-for graph, assuming it's two-phase locking, and someone's going to have to decide, okay, there's a deadlock, let me go ahead and kill a transaction. That could be a central coordinator, or it could be decided amongst the nodes: they say, okay, I'm waiting on this, and you have that; the two nodes decide, yes, there's a deadlock here, and then there's some ordering scheme to determine the priority of who gets killed. So all the stuff we talked about before is still here. Question back there? Don't we have to worry about the network traffic, the latency?
Yeah, so the statement is, and he's correct, that in a distributed system, and we said this before when comparing parallel and distributed systems, the communication latency is much higher, the cost is much higher. We do worry about it, but there's no magic wand to make it go away; we're limited by the speed of light. We'll see this in a second, but the trick systems basically use is to say: I'll wait a certain amount of time to see whether someone shows up who may conflict with my transaction, and if nobody shows up within that time, then I'm allowed to commit. But there's no way to get around the latency itself. There's no magic two-phase locking you can do over a wide-area network; it's going to be basically the stuff we talked about before. The games they play are about how many hints you can pass along: instead of sending one lock request at a time, say, here are all the locks I'm going to acquire, or here are the locks I think I'm going to acquire, and maybe you do more pessimistic things like that. That's really all you can do. Okay, going back. All right, the propagation scheme. Synchronous commit, or synchronous propagation: the idea here is that when the application sends a commit request to the primary, the primary sends that request to all its replicas, and it has to wait until they come back and say, yes, I've got this change, I've gone ahead and applied it, and I've flushed the changes to disk. Once it gets acknowledgements from all its replicas that things are durable and safe and the transaction is allowed to commit, then and only then can it send the acknowledgement back to the application.
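That wait-for-all-acknowledgements step can be sketched as follows. The replica objects and method names here are stand-ins invented for illustration; in a real system the acknowledgements arrive asynchronously over the network:

```python
class FakeReplica:
    """Stand-in for a replica; a real one acks over the network."""
    def __init__(self):
        self.durable = False

    def apply_and_flush(self) -> bool:
        self.durable = True
        return True              # ack: the change is safely on disk

def synchronous_commit(replicas) -> bool:
    # The primary blocks until EVERY replica confirms durability;
    # only then may it acknowledge the commit to the application.
    # (Waiting for fewer acks trades safety for lower latency.)
    acks = sum(1 for r in replicas if r.apply_and_flush())
    return acks == len(replicas)

replicas = [FakeReplica() for _ in range(3)]
assert synchronous_commit(replicas)
assert all(r.durable for r in replicas)   # nobody was left behind
```

The comparison `acks == len(replicas)` is exactly the knob the rest of this discussion turns: relaxing it to a majority or to zero gives the quorum and asynchronous schemes described next.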
It's just like before, when we committed transactions with the write-ahead log, except now we've got to wait for somebody else over the network to come back and say they got it too. Asynchronous commit, at a high level, is basically: the commit message shows up, I send the request to my replicas, but I'm not going to wait for them to come back; I immediately say, yep, I got it, commit. And as I was saying before, these are two extremes; there are a bunch of games you can play about how long you really wait, or how long you should wait. Going back to the synchronous one at the top: say there were four replicas. Instead of waiting for acknowledgements from all four, maybe I wait for two out of four, or one out of four, so I at least know it's propagated to at least one. Same thing for the asynchronous one: I send my requests out, and maybe I wait for some of them. We saw this with consistent hashing in Cassandra, right? They do quorum writes. You can say, every tuple is replicated three times, and I'll wait for two out of three responses from my replicas before I tell the outside world that my data's been saved. I don't want to use the term "transaction committed" with Cassandra, because the older versions didn't support transactions; newer versions may support them, but the older ones definitely didn't support multi-object or multi-node transactions. For this reference, assume we're updating things on just a single node. So again, there are pros and cons to all of these, and there isn't one scheme that's universal for everyone. It depends on the tolerance of the organization running your database, and it depends on what the application is. If it's posts on Reddit or Twitter or Hacker News, then if I lose something, maybe it's no big deal, right?
But if it's your bank, then you don't want to lose anything, and you want to make sure everything is fully committed. The NoSQL guys were saying, to be web-scale and support online applications with a lot of users, you want to use asynchronous commit, or eventual consistency. And the traditional database people say, no, you don't want to lose any data; you want fully ACID transactions even though you're across multiple nodes. The correct answer is, again, somewhere in between: for some applications yes, for some applications no. The difference is that the NoSQL guys said, straight up, no transactions, and they never had them, at least in the beginning, and they eventually had to add them back. Whereas the relational database systems started off with transactions and then eventually relaxed them a little bit. And I would argue you're better off starting with full transactional, strong-consistency support and then dialing it back as needed, rather than trying to shoehorn it in after you've already built the system. All right, the next question is when the changes are actually propagated. In my example here, I just showed it happening on commit: at commit time, tell the other replica to go ahead and apply my changes. But let's say I'm doing a transaction that updates a lot of data, with a lot of round trips between the application server and the database server: do one update, do another update, and so forth. Do I want to wait until I get the commit message before I propagate those changes to the replica? Maybe not, right? Because as the updates come in, I can start sending them piecemeal to the replicas, so that when I go to commit, they don't have to apply a big batch of changes; they've already seen most of them. But that now means that if I roll back the transaction, they've already applied a bunch of changes that they have to reverse.
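One way to picture that trade-off, streaming updates as they arrive but keeping enough undo information to reverse them on abort, is the sketch below. Everything here is invented for illustration; real replicas track undo state in their recovery log, not an in-memory list:

```python
class StreamingReplica:
    """Applies updates as they stream in; must undo them on abort."""
    def __init__(self):
        self.pages = {}
        self.undo = []   # (page, old_value) recorded before each change

    def apply(self, page, value):
        # Remember the old value so the change can be reversed later.
        self.undo.append((page, self.pages.get(page)))
        self.pages[page] = value

    def commit(self):
        self.undo.clear()          # changes already applied, so commit is cheap

    def abort(self):
        while self.undo:           # roll back in reverse order
            page, old = self.undo.pop()
            if old is None:
                self.pages.pop(page, None)   # page didn't exist before
            else:
                self.pages[page] = old

r = StreamingReplica()
r.apply("A", 1)
r.apply("A", 2)
r.abort()
assert r.pages == {}   # rollback undid everything
```

This is why streaming propagation makes commit fast (the work is already done) at the cost of having to do real work on abort, whereas batching everything at commit time flips that trade-off.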
But again, this is sort of what I was saying before: the replicas are more or less in recovery mode — ignoring multi-master setups — because they're just replaying these log messages, and they're maintaining the undo state or undo buffer just as you would during regular recovery. So those systems do the top one. If you do the bottom one, it does make it easier to implement, but it makes the commit process a bit longer, because now you've batched up a bunch of updates you have to apply. This one's a bit nuanced and always confuses students, so ask questions if it doesn't make sense. So then there's the question of what the nodes involved in, say, a multi-node transaction are actually doing. And the two ideas are active-active versus active-passive. Active-passive is what I've been describing so far: there's some primary, the queries go there, they do updates, and then those get propagated to the replicas. And the replicas aren't really running the queries or doing the work — they're replaying the updates that occurred on the primary. But in active-active, the idea is that when I execute a transaction, it's gonna actually execute the logic of that transaction at every replica. And when you go to commit, you don't need to send, hey, here's my updates from the write-ahead log, because they all did the same thing. Think of it like this: I have an update query — ignoring random(), timestamps, and all that other stuff — and assuming I can execute it deterministically, I send it to the two nodes, and they both execute it in the exact same order and produce the exact same result. So now when they go to commit, they just need to agree: I did this, yep, you did that, and you're done. Most systems do this bottom one because it's easier — again, you're just piggybacking off of the write-ahead log mechanism.
But the top one, if you can do it, is way more efficient, because you're just sending less data. It's similar to pushing the query to the data versus pulling the data to the query, yes. The question is: does active-active defeat the purpose of being able to run a lot of read queries on replicas? Yes. The question is: when do you want to do this? If the amount of data you have to send to say "here's the transaction" is going to be less than "here's the write-ahead log updates," then it may potentially be better. It also shrinks the window of synchronization time, because — again, assuming the two nodes run at the same speed — you send them the request, they rip through it, and then you just commit. Whereas with active-passive, I've got to run the query, generate the write-ahead log messages after I run the query, and then send the updates to be propagated, whereas the other two nodes can potentially run at the exact same time. The question is: if you do active-active and you don't end up at the exact same result, what do you do? You have to abort the transaction. That's why most people don't do it. Yeah, yes. The question is: with active-active, wouldn't you need to send a lot more data over the network? Why? So the statement is — and the answer is, of course, it depends — the statement is: with active-active, you would have to send the query; with active-passive, you'd send the write-ahead log, and couldn't the write-ahead log records be larger than the query? Well, it depends. I could have a single update query that updates a billion tuples, and I could just send that single string instead of a billion log records, right?
I don't want to go too much into this, but to handle all of these problems, you can run your transactions as stored procedures, which we haven't really talked about. Think of it like an RPC: I literally have a function that I can put inside my database system that has if/else statements, for loops, and things you can't easily do in SQL — it's procedural code that then makes invocations of SQL queries. Now the application, when it wants to run a transaction, doesn't say, okay, begin a transaction, run this update, get a response, run the next update, and so forth. You literally say, execute this function with these input parameters, and then that runs on the database server. This is very common in enterprise systems, in older systems. And so if you can run everything as stored procedures, the database system actually can look inside the code, figure out what's going on, and flag anything — oh, there's a timestamp call, there's a random function, and so forth. And then you can play a bunch of games to make sure that when you send that function request to the different servers in active-active replication, it's guaranteed to execute in the exact same order, right? Actually, the way you handle random and timestamp is you piggyback on the fact that the request has to go to one server first. That server then assigns: hey, if you ever call timestamp, here's the timestamp; if you ever call random, here's the starting seed. And then you send that information along with the request to every other server, so when they execute it, they'll get things running in the exact same order. That assumes, though, that you're running transaction requests in the exact same order, which you can do with serializability — and that's why we care about that. Again, most systems will be active-passive, but active-active, if you can do it — not all systems can — provides a lot of benefits.
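Here's a tiny sketch of that trick. The coordinator pins the timestamp and the random seed once, up front, and ships them along with the request, so every replica executes the procedure deterministically and lands in the same state. All the function and parameter names here are made up for illustration:

```python
import random
import time

def run_procedure(proc, args, ctx):
    """Execute a stored procedure with all non-deterministic inputs pinned
    by the coordinator, so every replica computes the same result."""
    rng = random.Random(ctx["seed"])   # same seed -> same random sequence
    now = ctx["now"]                   # timestamp chosen once by the coordinator
    return proc(args, rng, now)

def discount_proc(args, rng, now):
    # Hypothetical transaction logic: pick a "random" discount, stamp a time.
    return {"price": args["price"] * (1 - rng.choice([0.1, 0.2])), "at": now}

# The coordinator picks these once and forwards them with the request.
ctx = {"seed": 42, "now": time.time()}
r1 = run_procedure(discount_proc, {"price": 100}, ctx)  # "replica 1"
r2 = run_procedure(discount_proc, {"price": 100}, ctx)  # "replica 2"
assert r1 == r2  # both replicas reached the exact same state
```

Without the pinned context, each replica's `random()` and `now()` calls could diverge, and active-active would have to abort the transaction.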
All right, so now we've got to talk about how we actually get everyone to agree that we want to commit transactions. Now, out of curiosity, who here has taken a distributed systems course, either at CMU or somewhere else? All right, so more than half, okay. So some of this will be somewhat redundant, but again, I want to focus on the things we care about in the context of databases. In the distributed systems literature, they'll call this state machine replication or log replication. And the way to think about it in the context of databases is that the thing we're trying to get everyone to coordinate on is the order in which transactions are committed. That's essentially our state machine, right? This transaction committed, followed by this transaction, then this transaction, right? It's not so much "here's the changes that they made" — we know what the changes are, that's part of our log — the higher-level thing that we're trying to order is the commit order. Because if you say "here's the commit order of my transactions," along with that is the metadata that says "here's the changes that they made." So if you get the commit order right and you piggyback the changes that the transactions made, then you can propagate that to all the nodes, and everyone will be synchronized in order. For what we're talking about now, it doesn't matter whether the database is partitioned or replicated — we still want to do this to get everyone to agree on the order that transactions are going to commit. So there's a bunch of different — actually, I take that back. There are not a lot of protocols that do this, because it's really hard. These are the main ones that show up in databases. Two-phase commit is the original one. This was originally thought to be invented by Jim Gray at IBM — he won the Turing Award in the 90s.
But there's a famous transaction book where he actually attributes it to this other dude in Italy who implemented two-phase commit for one of the early databases they built for the Italian social security system in the early 1970s. But even then, the idea of two-phase commit comes from the real world. It comes from contract law — legal arrangements between humans, right? So two-phase commit is implemented in database systems and distributed systems, but the idea predates computing. Then there's three-phase commit, and that's from Stonebraker. Nobody actually does this; we can ignore it. Viewstamped Replication is considered to be the first consensus protocol that was shown to be correct — it can handle liveness issues. That was invented by Barbara Liskov, who won the Turing Award a few years ago, at MIT in 1988. Paxos is probably the one everyone has heard about. This was invented a year later, but the paper itself didn't come out until the 90s — I'll explain why in a second. Zab is from the Apache ZooKeeper people — the ZooKeeper atomic broadcast protocol. That's roughly 2008. And then Raft is another popular one that was invented at Stanford in 2013. This is the one that's sort of in vogue now: if you're building a distributed database system, oftentimes you would use Raft. But for this lecture, we're only gonna focus on two-phase commit and Paxos. Raft is considered to be a more readable, understandable version of Paxos. The key differences are that there are fewer node types or participant types in the network, and when you do a leader election, only the nodes that have the most up-to-date log are allowed to win the election, whereas in Paxos anybody can. Viewstamped Replication, or VSR — again, this is becoming popular now because there's a distributed database out of South Africa called TigerBeetle that's written in Zig, which is like Rust but more rare. And they use VSR for that.
Okay, so here's two-phase commit. The basic idea is that there's an application server — again, we ignore whether this is actually going through middleware or whatever, it doesn't matter — and it made a bunch of updates to the database. We don't care whether it's partitioned or replicated; it doesn't matter. It wants to say, okay, I want to commit my transaction. The commit request is gonna go to some node — before we called it the primary, but in two-phase commit parlance it's called the coordinator. And the other nodes that were involved in the transaction are called participants. Two-phase commit is what it sounds like: two phases. The first phase is called the prepare phase. The coordinator sends a message to all the participants: hey, there's this transaction, it's got this ID — whatever metadata or identification you wanna use — go ahead and prepare to commit. And they come back and vote: yes, okay, we wanna commit this. Once you get all the OKs from all the participants, then you go through the second phase and say, okay guys, go ahead and commit. You send that message to them, they come back with OK, and once you get all of those from the participants, then you're allowed to go tell the application that the transaction has committed. Now, what I'm not showing here is that on every node, we're actually recording in our write-ahead log all the messages that we got and our responses to them, and we flush that to disk. That way, if there's a crash and we come back, the node can look at the log and see whether it was involved in this transaction or not. At least it knows about it, and it knows that it voted a certain way. So again, if there's a failure, we can come back and reason about what the state of the system was when we were committing transactions at the time of the failure, right?
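The two phases can be sketched as a short Python coordinator loop. This is a toy, of course — the classes and logs here are hypothetical stand-ins, and a real system sends these messages over the network and force-flushes each log record — but the structure of prepare-then-commit is the same:

```python
class Participant:
    """Hypothetical participant that logs every message before answering."""
    def __init__(self, name, will_commit=True):
        self.name, self.will_commit, self.log = name, will_commit, []
    def prepare(self, txn_id):
        self.log.append(("prepare", txn_id))  # flushed to the write-ahead log
        return self.will_commit               # vote yes or no
    def finish(self, txn_id, decision):
        self.log.append((decision, txn_id))   # record the outcome, then ack
        return True

def two_phase_commit(coordinator_log, participants, txn_id):
    # Phase 1 (prepare): collect a vote from every participant.
    votes = [p.prepare(txn_id) for p in participants]
    decision = "commit" if all(votes) else "abort"  # one no kills it
    coordinator_log.append((decision, txn_id))      # decision point, logged first
    # Phase 2 (commit/abort): tell everyone the outcome, wait for acks.
    for p in participants:
        p.finish(txn_id, decision)
    return decision
```

Notice the blocking nature: the coordinator waits on every participant in both phases, which is exactly the liveness problem discussed below.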
Another key thing about two-phase commit is that we have to wait for all the OKs to come back from all the participants before we enter the next phase, or before we tell the outside world the transaction committed. That's going to be different from Paxos — Paxos only has to wait for a majority, Raft as well, right? And I'll talk about this in a second, but basically two-phase commit is a degenerate case of Paxos. But it has this liveness issue where any one node can take the whole thing down, because you're waiting for it until you time out. All right, so this is the abort scenario. Say the application server sends a commit request. We send the prepare message out to everyone, and one of them, for whatever reason — we don't care, we don't know why — comes back and says abort. If one participant in the network says we want to abort this transaction, then immediately we can tell the outside world that our transaction has aborted, and then we enter the abort phase and send the abort messages to everyone. Even though one node might have said, oh yeah, I really want to commit this transaction, I love it — if we get one abort, we have to kill the whole thing. If the coordinator tells us we have to fail, we have to fail, right? And then they send back their acknowledgements, and we log all of this. Pretty simple, right? Yes? The question is: is this for active-active only? Why would it be only for active-active? Because we're checking with the other nodes. The statement is: because we're checking with the other nodes whether it's okay to commit, does that assume they're running the same query? It doesn't matter. You have to have someone be the coordinator. Someone has to say, okay guys, we're committing. It doesn't matter how nodes one and two got the updates we're propagating. Active-passive, active-active — it doesn't matter.
Okay, so the question is: in what case does a participant send abort, if it's only propagating updates? Well, there could be another coordinator committing another transaction that node three is involved in, and for whatever reason it can commit that guy's transaction but not this one. Yes? The statement is: participants are not limited to replicas, correct? Yes, right? It could be multi-master, multi-primary. Again, when a transaction commits, someone has to be in charge, right? Okay, so as I said, all the inbound and outbound messages we have to store on disk, and we have to flush them like the write-ahead log. Then when we crash and come back, if we go look at our log and we see that we were in the prepared state — meaning we got a request to prepare the transaction and said, yeah, let's go ahead and prepare it, but we never saw whether the vote was successful or not — then the node contacts the coordinator to get the updates it missed from the log. If the transaction was not in the prepared state, then we assume it was aborted. Because if I was involved in a transaction in two-phase commit and then I crashed, it clearly couldn't have committed, so I'll correctly assume that it was aborted. If the transaction was committing and the node that failed is the coordinator, then when it comes back up, it just sends out again: hey guys, you may have missed this, but I'm pretty sure you were supposed to get this — this transaction was committed. Of course, the other nodes now have to wait for that coordinator to come back online. Or at some point there'll be a timeout, and they say, well, that coordinator's dead, he's never coming back, this transaction's aborted. Right? But again, one node going down can take down the whole thing.
So, all right, what happens if the coordinator crashes? I already said this, right: the participants have to decide what to do after a timeout, and the system's not available during this time. And available means I can't take any new queries. I could take read queries if I'm okay with things being a little loosey-goosey — if I don't need that strong consistency guarantee. But I can't take any writes, because I can't propagate those writes until I find out what the transaction before me actually did. And if a participant crashes, the coordinator's just gonna assume that it hasn't responded, and it's gonna abort. We just time out and say, okay, this guy's dead, and abort the transaction. Right? So, there are two optimizations we can do in two-phase commit. The first is called early prepare voting, which is actually very rare. You can only do this with the active-active approach I talked about before, and with stored procedures. You could try to do it over JDBC, but the JDBC wire protocol doesn't support this technique. Some systems can do this if you use stored procedures. The one that's more common is early acknowledgement after prepare. Early prepare voting is: if I'm sending a query request to another node and I know this is the last query I'm ever gonna execute on that node for this transaction, then I piggyback on my message and say, hey, execute this query, and oh, by the way, I'm gonna commit pretty soon — I'm never gonna come back and ask you to run another query, so go ahead and tell me what your vote's gonna be for two-phase commit. So you get the query response back and the vote for the prepare phase, all in one round trip.
Early acknowledgement after prepare is basically: if the coordinator gets the acknowledgement from all of the participants that this transaction is gonna commit, then I don't wait until I get the acknowledgements from the commit phase — I immediately send the commit acknowledgement back to the application. Because at that point, I assume that if there's a crash and I come back, the coordinator or the nodes will look in the log and say, okay, we all agreed to commit this transaction, so let's go ahead and actually apply the changes. So there's a small window where — sorry, there is not a window where you could crash and lose data, but it will make recovery a little bit longer. So it looks like this: I get my commit request, I do my prepare phase, and they all come back and say, okay, go ahead and commit. Immediately, once I get back all the OKs, I can send back the success. And then I still have to do the commit phase and get that round trip. Again, all of this is written to the log — which not everyone always does, or not everyone always flushes — but most systems doing two-phase commit are gonna do this simple optimization, because the failure window is pretty small. Yes? The question is: do we still need to do two-phase commit if we're doing asynchronous commits? You should, because you want everyone to say, okay, we all agree that this is the update that happened. You don't have to, though. You'd have to do more work on recovery, but you could potentially not use two-phase commit. I think that's true, yes. Yeah, I think a lot of the NewSQL systems, when they did that propagation, weren't using two-phase commit. So Paxos, as I said, is considered a superset of two-phase commit. And I think this was the first correct protocol that was provably resilient in the face of asynchronous networks.
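The early-acknowledgement ordering can be sketched like this — the client hears "success" after the prepare votes but before the commit-phase round trip. The callbacks here are hypothetical stand-ins for the real network messages:

```python
def early_ack_2pc(collect_votes, run_commit_phase, ack_client):
    """Sketch of early acknowledgement after prepare: once every participant
    votes yes, the outcome is decided and durably logged, so the client can
    be told 'success' before phase 2 even starts."""
    if all(collect_votes()):           # phase 1: everyone voted yes -> decided
        ack_client("success")          # early ack to the application
        run_commit_phase("commit")     # phase 2 happens after the client ack
        return "commit"
    ack_client("aborted")
    run_commit_phase("abort")
    return "abort"

# Usage: record the order of events to see the ack precede phase 2.
events = []
result = early_ack_2pc(
    collect_votes=lambda: [True, True, True],
    run_commit_phase=lambda d: events.append(("phase2", d)),
    ack_client=lambda msg: events.append(("client", msg)),
)
```

The safety argument from the lecture lives in the prepare votes: since all the yes votes are on disk, recovery can always re-derive the commit decision even if the crash happens mid-phase-2.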
But again, it wasn't the first one that could actually do this — Viewstamped Replication came before it. The idea is basically that we're gonna send out votes just like before, but instead of having all the participants come back and acknowledge that this transaction is gonna commit, you just need a majority. The ones in the minority are basically treated as failures — they have to pseudo-crash and then replay the log to get back up to the correct state. And this goes back to what I was saying before about why we don't care about Byzantine fault tolerance in this world: it's not like someone's gonna vote no because they want to be malicious, right? If everyone else is going to commit and one node votes no, there's something really wrong with that node or the network — but when it comes back, it can get back up to date, assuming there's no massive hardware failure. So the original Paxos paper — I think, is there a date on this? Yeah, 1998 — Lamport actually wrote the paper in '89. Who here has ever read the Paxos paper? One, two — it's a wild read, right? Yeah, it's crazy. He's trying to be very illustrative, not poetic exactly, but instead of just saying, here's the protocol, it does this, this, and this, he describes it as if he's an archaeologist finding this ancient Greek tribe on the island of Paxos and how they would do voting by throwing tablets in a hole and then coming back later to read them, right? I mean, it's amusing, but if you really try to read it and understand what the hell it's actually trying to do, you can't.
And so the story goes — actually, if you go to Leslie Lamport's website and look at his bibliography, which is kind of cool, he lists, for every single paper, at least the major ones, what he was doing in his daily life when he wrote it: what he was eating, who he was dating, and so forth. And he talks about how, when he submitted the Paxos paper with this Greek archaeology story inside of it, it got rejected, and he says the reviewers were stupid because they couldn't appreciate his genius, or whatever. When I was in grad school, I took a distributed systems course and presented the Paxos paper, and I started going on about, oh yeah, this paper's brilliant, and the reviewers were stupid for rejecting it — he was a genius. Turns out, though, the professor teaching the course — this was back at Brown; it was Maurice Herlihy, who used to be here at CMU — said, yeah, I actually was one of the reviewers on that paper. And the story goes that they were okay with all the Greek stuff. They just wanted him to add an appendix with an algorithm and a proof showing what the thing actually was. And Leslie Lamport was apparently super stubborn and didn't want to change anything in the paper, because he thought it was perfect as is. So it got rejected. He put it in his filing cabinet, or on a shelf, and didn't do anything with it for like 10 years. And then over the 90s, people started publishing papers that danced around the problem that he had already solved back in '89. He actually was not aware of Viewstamped Replication, which came out in '88. But once he saw enough papers barking up the same tree as Paxos, he put the original Paxos paper out, along with another paper called Paxos Made Simple — which is not simple.
Actually, the one paper to read, if you want to read a Paxos paper, is the Google one, Paxos Made Live. That's the one where it clicked for me and I said, okay, now I understand what they're doing. And once you know that two-phase commit is actually just a degenerate case of Paxos — at least for me, coming from the database world — it makes sense. So there's this paper here, from 2003 or 2005, 2006, from Jim Gray and Leslie Lamport — right before Jim Gray got lost at sea — where they show that two-phase commit is a special case of Paxos. Right, and Raft is gonna be basically the same idea here, just with fewer node types. And actually, I'm not even gonna show it, but there's another node type in Paxos called learners. We can ignore that — I don't think anybody uses them, or the nodes have multiple roles, so it doesn't matter. But again, in Paxos, when they do a leader election, any node can be a leader, whereas in Raft, they choose the one that has the most up-to-date log. That's sort of the major distinction. So going back to our example, now we have three nodes. Our transaction goes ahead and commits. In Paxos, instead of calling them coordinators and participants, they're called proposers and acceptors — again, there's also learners, we can ignore those. We send out a proposal that we want to commit this transaction. Say node three here goes down, for whatever reason — it doesn't matter. We get back agreement from the other two nodes, and since two out of three of the nodes agreed that we can commit this transaction, we do the commit phase, and the transaction is allowed to commit. Eventually, when node three comes back up, it can learn about the changes that it missed. Another way to look at it is this sort of timeline graph here, right? You have a proposer who says, I want to commit a transaction, with a timestamp N, right?
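The key contrast with two-phase commit — a majority decides, and a down node is a non-vote rather than a veto — can be sketched in a few lines. Acceptors are modeled as plain callables purely for illustration:

```python
def majority_commit(acceptors, txn_id):
    """Sketch of the Paxos-style decision rule: a strict majority of
    acceptors agreeing is enough to commit; unreachable nodes are skipped
    and catch up from the log later."""
    yes = 0
    for accept in acceptors:
        try:
            if accept(txn_id):
                yes += 1
        except ConnectionError:
            pass                       # node is down: no vote, but not a veto
    return yes > len(acceptors) // 2   # e.g. 2 out of 3 is enough

def down(_txn):
    """Stand-in for a crashed or partitioned node."""
    raise ConnectionError("node unreachable")
```

With three acceptors, one crashed node no longer blocks the commit — the exact liveness failure that takes down two-phase commit.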
Again, think of the log as just an ordered list of transactions to commit. So the first guy says, I want to commit transaction N. All the acceptors get that and agree to do it. But during this time, another proposer comes along and says, I propose committing transaction N+1. Once the acceptors see a new proposal number that's greater than any they've seen in the past, anything that arrives afterward with an older number is automatically rejected. So the log is always moving forward — you never move backwards. So now when the first guy says, I want to go ahead and commit — assuming we're not doing the early acknowledgement optimization — the acceptors have already seen N+1, so they reject him and say, I can't take N because I've seen N+1. They can then go ahead and agree to commit N+1, commit that, and the first guy can resubmit his request to commit his transaction. So in this world, we have multiple proposers proposing to commit transactions, and obviously that's gonna be problematic, because you have this contention of everyone trying to clobber each other. Yes. The question is: would two-phase commit actually have the same problem, since it looks like the difference is just the majority, and here they're all responding at the same time? So the statement is: would two-phase commit have the same problem if you have multiple coordinators? Yes. So the way we're gonna handle this is through leases, to limit which node can propose committing transactions — basically, to anoint which one's gonna be the primary, the coordinator, or the proposer for the distributed database system. This shows up in the Google Paxos work as Multi-Paxos. The idea is that you run a leader election every so often to say, okay, this node here is anointed as the proposer.
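The forward-only rule from that timeline example fits in a few lines: an acceptor remembers the highest proposal number it has seen and rejects anything older. This is a hypothetical single-slot sketch, not the full protocol:

```python
class Acceptor:
    """Sketch of the monotonic proposal rule: once an acceptor has seen
    proposal number N+1, any late-arriving proposal numbered below that
    is rejected -- the log never moves backwards."""
    def __init__(self):
        self.highest_seen = -1
    def propose(self, n):
        if n < self.highest_seen:
            return False               # stale proposer: must retry with a higher number
        self.highest_seen = n
        return True
```

So a proposer stuck at N finds out it lost when the acceptors answer False, and has to resubmit with a number above the one that beat it — which is exactly the clobbering contention that leases and Multi-Paxos are there to avoid.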
And then in some cases you can skip the propose phase, because now the coordinator — the proposer — just says, here's the transaction I want to commit, everyone agree that we're all gonna commit this, right? And if at any time there's a failure — either the leader goes down or some node goes down — then you just run the leader election all over again, which is again just running Paxos, right? Because, again, think of the state machine as the order in which transactions commit, and as part of that state machine, which node is the proposer or the coordinator, right? So you send heartbeats out between the different nodes to figure out who's actually alive and responding to requests. If you don't get a heartbeat within a certain amount of time, then you say, okay, this node is down, and you rerun the leader election, right? And then you set the lease timeout so that, say, after every 10 minutes or so, you automatically run the leader election all over again. Maybe you re-elect the same leader, which is ideal, but at least you're running that election over and over again. And different systems run that leader election at different time intervals: Spanner runs it every 10 seconds, Yugabyte is every 30 seconds or so, and I think CockroachDB in the source code is every five minutes. Again, it depends on how reactive you want to be to failures. Sorry — Yugabyte is less than Spanner; it's two seconds. All right, but generally this technique falls under the umbrella called Multi-Paxos. When most people say they're running Paxos, they're most often running Multi-Paxos. All right, I've already covered this, but here are the main differences between two-phase commit, Paxos, and Raft. In two-phase commit, if the coordinator fails or any node fails, we have to block until the coordinator comes back.
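The heartbeat-plus-lease mechanism described above is essentially a timer. Here's a minimal sketch (class and parameter names are made up; real systems like the ones just mentioned tune the timeout from seconds to minutes):

```python
import time

class LeaderLease:
    """Hypothetical lease tracker kept by each follower: the leader is
    presumed alive as long as heartbeats keep arriving; if none arrives
    within the timeout, the follower triggers a new leader election."""
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_heartbeat = time.monotonic()   # lease starts now
    def heartbeat(self):
        self.last_heartbeat = time.monotonic()   # leader is alive; renew lease
    def election_needed(self, now=None):
        now = time.monotonic() if now is None else now
        return (now - self.last_heartbeat) > self.timeout_s
```

A shorter timeout reacts to failures faster but risks needless re-elections on a slow network — which is the trade-off behind the different intervals Spanner, Yugabyte, and CockroachDB picked.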
In Paxos, as long as the majority of our participants are alive, and we've waited long enough to make sure there are no further failures, we can go ahead and commit. And Raft, as I said, is basically the same thing: fewer node roles, and when you do leader election, you choose the one that has the most up-to-date log. Yes. The question is: what were the node types for Paxos? Paxos has proposers, acceptors, and then there are learners, which are just sort of downstream from the acceptors — hey, here's what we just committed. We can ignore those. Okay, so most of the modern distributed database systems are gonna be running Paxos or Raft. Raft is probably way more common than Paxos, just because when it came out in 2013, a bunch of people took the protocol and re-implemented it in different languages. For a long time, there wasn't a libpaxos you could download and just say, let me put this in my distributed database. But with Raft, there were a bunch of correct implementations — in Go, in Python, in C++ (Rust wasn't around yet) — libraries you could just plop in and get Raft consensus out of. So I think that helped evangelize it a bit more. Yes. The question is: what's the difference between Multi-Paxos and Raft? Same thing — you run leader election. Multi-Paxos is like 2004, 2005-ish. The question is: what did Raft invent? Again, Raft is meant to be a more understandable version of Paxos, right? At a high level, they're the same. But in terms of implementation, you don't have to have these learners — just participants and proposers, so there are only two node types. And when you do a leader election, you choose the one that has the most up-to-date log, versus Paxos, where anybody can be the leader.
And I think — sorry, in Raft there are explicit timeouts for heartbeats and things like that, whereas in Paxos people add them, but I don't think the original protocol talks about those things. Yes. The question is: I'm a little confused on Multi-Paxos versus Paxos — is the main difference that Multi-Paxos will periodically shuffle where the leader is? That's the main difference in Multi-Paxos, yes; in plain Paxos, anybody can be a leader and propose changes or updates. Oh, I see. Yeah. Okay, so Multi-Paxos elects one, and then all the changes get sent there to be coordinated? Yes. Because otherwise anybody's proposing, and you just clobber each other, and you have to do back-off: okay, my transaction got overwritten, I couldn't commit it, so maybe I wait a certain amount of time and then submit another one. Instead of that back and forth, you just say, okay, you're the leader, everything goes to you, and you don't worry about conflicting proposals coming in. Yes? So you technically have just one proposer? You technically have one proposer within some boundary in Multi-Paxos, yes. And whether that boundary is a rack of machines or a data center depends on the implementation. All right, let's finish up. So, the CAP theorem. This was the hot thing in distributed systems — the buzzword everyone used in the 2000s to describe their distributed systems. It was a way to characterize what the system could handle in the face of failures, right? In the CAP theorem, the C is consistency — basically strong consistency. The A is always available: if there's a failure or a network partition, you can still run any possible query. And that's tied to the P, network partition tolerance: what happens when there's a network partition — can you handle updates in that environment? It was proven to be a correct way to describe systems. The idea is you can pick two out of three, right?
Think of it like a boyfriend or girlfriend: are they good looking, crazy, or smart — pick two out of three, right? Same thing for your distributed database. And so this was proven to be correct, but a very important aspect is that the CAP theorem only describes what happens when there's a failure — and most of the time there isn't a failure, right? So what happens when you're running normal operations? So it got extended by Daniel Abadi, who did a lot of the early column-store stuff at MIT and is now a professor at Maryland. The first part is the CAP theorem — partition tolerant, always available, consistent — but then during normal operations you have this trade-off between your latency tolerance and your consistency tolerance. And we'll see what that looks like in a second, right? So let's go through some examples. With consistency, the idea is that if I commit a change to a copy of an object in my database and there are replicas, when do I tell the outside world that my change has been propagated? And can anybody see an intermediate state? We talked about this before — eventual consistency versus strong consistency, same idea. So if I do an update to A here, this is the primary, this is the replica. I make the change to the primary, I then propagate the update to the replica (or if it's active-active, they both run at the same time, it doesn't matter). And then once I get the acknowledgement from the replica that the change is durable and safe, then I can tell the outside world my transaction is committed. So now if any other application comes along and does a read on A, they're guaranteed to see the update from the committed transaction, right? For single objects, this is doable. It really starts to matter when you have multiple objects: if I update A and B at the same time, can I guarantee that I'll see a consistent view of the database?
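The primary-replica commit path described above can be sketched in a few lines. This is a minimal illustration (class names and the in-memory "durability" are assumptions, not any real system's API): the primary acknowledges the commit to the client only after the replica has acknowledged the write, so any subsequent read on either copy sees the update.

```python
# Minimal sketch of strongly consistent (synchronous) replication:
# the primary only reports "committed" after every replica has
# durably applied the write, so later reads anywhere see the update.

class Node:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value   # pretend this write is durable
        return True              # acknowledgement back to the primary

class Primary(Node):
    def __init__(self, replicas):
        super().__init__()
        self.replicas = replicas

    def write(self, key, value):
        self.apply(key, value)
        # block until every replica acknowledges before committing
        acks = [r.apply(key, value) for r in self.replicas]
        return all(acks)         # only now tell the client "committed"

replica = Node()
primary = Primary([replica])
assert primary.write("A", 2)     # commit acknowledged...
assert replica.data["A"] == 2    # ...so reads on the replica see it
```

The cost of that `all(acks)` line is exactly the latency trade-off discussed below: the client waits on the slowest replica's round trip.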
Availability determines what happens to the overall system when there's a failure. So say this node goes down here for whatever reason. My applications over here can still read the copy on the primary, that's fine, but what happens if those applications are over there? Well, assuming they can connect to this other replica, they can still access this content and read the database over there — assuming the network hasn't been severed, right? And ideally that holds no matter how many nodes go down. Partition tolerance covers what happens when the network gets severed and these two nodes can't talk to each other. So what's going to happen here under Paxos? This guy is the primary, right? He's holding the lease on that. But now the network gets severed, so the replica can't talk to him. The replica says, oh well, this guy's down, let me run leader election — and, look at that, I'm now the leader. So it becomes the primary. Well, what's the problem here? Now both applications send updates to A, both nodes think they're the primary, and both say, okay, great, we'll go ahead and commit your changes. That's fine. And you get back acknowledgements that those changes got persisted. But then the network comes back up, and now there's a problem: they both think they're the primary and they both think they have the latest version of the object — but who's correct? This is called split brain, right? And in most systems, when you have this problem — this goes back to the K-safety thing — once I don't have enough copies of my data, I'm going to stop the system. A small number of NoSQL systems back in the day used what are called vector clocks — building on Leslie Lamport's logical clocks — which are basically a form of multi-versioning: you're keeping track of the different versions of the object over time.
So now when the network gets reconnected, you say, okay, you have these changes and I have these changes at these times, and you merge the two vector clocks together. But you still have a bunch of different copies — different versions — of these objects in your database. So in your application code, you have to write logic to reason about which version is actually correct. You, as the application programmer, have to write a bunch of stuff up there that people always get wrong. It's super hard, right? So in the NoSQL world, in some cases they would actually still allow this to happen, but then when the network gets reconnected, they'll just pick whatever has the latest timestamp and say that's the latest version. And whether that's right or wrong depends on whether the application cares or not. Someone will notice. Yes? Why would that be wrong? The question is, why would that be wrong? Say my bank account has $100. This guy pulls out $100, that guy pulls out $100. I come back and now I have negative $100. That's definitely wrong — and the bank has already given out $100 twice, right? You're happy, the bank's not happy, right? Or think about tuition payments — they take it out twice. You'd definitely care, right? All right, so going back to the CAP theorem quickly, you can kind of see why you can't get all three of those properties. I can't guarantee consistency if I'm going to allow both sides of a partition to get updated, because that's not a consistent view of the database, right? So what typically happens is, say you have an odd number of replicas — I have three. In this case, maybe two are over here and one's over here. That one knew there were three before, but now it can't get a majority — two out of three.
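The vector-clock merge described above can be sketched concretely. This is a hedged illustration (node names `n1`/`n2` and the dict representation are assumptions): each replica keeps one counter per node, clocks merge element-wise on reconnect, and two versions conflict when neither clock dominates the other — which is exactly when the application, not the database, has to decide.

```python
# Sketch of vector clocks for the split-brain merge scenario: each
# replica tracks a counter per node; on reconnect the clocks are
# merged element-wise, and two versions are concurrent (conflicting)
# if neither clock dominates the other.

def merge(vc1, vc2):
    keys = set(vc1) | set(vc2)
    return {k: max(vc1.get(k, 0), vc2.get(k, 0)) for k in keys}

def dominates(vc1, vc2):
    keys = set(vc1) | set(vc2)
    return all(vc1.get(k, 0) >= vc2.get(k, 0) for k in keys)

def concurrent(vc1, vc2):
    # neither version happened-before the other -> the application
    # code has to resolve which write actually wins
    return not dominates(vc1, vc2) and not dominates(vc2, vc1)

a = {"n1": 2, "n2": 0}   # updates applied on one side of the partition
b = {"n1": 1, "n2": 3}   # updates applied on the other side
assert concurrent(a, b)                  # conflict: app must resolve
assert merge(a, b) == {"n1": 2, "n2": 3}
```

Note the merged clock tells you a conflict happened; it cannot tell you which value is semantically right — that's the hard part the lecture is warning about.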
So that guy just says, okay, I'm not available. And the other side can run leader election just fine. All right, so this is all about what happens when there's a failure. But under normal operations, even with these different CAP properties, there's a trade-off between how long we want to wait for changes to get propagated and acknowledged by our replicas, versus guaranteeing that everything's consistent. So say my application wants to do an update on A, setting it to two, and then it's going to propagate those changes to the replicas. And say these replicas are in different data centers — one's across the country, one's over the ocean in Europe. Now the question is, how long do we want to wait for the acknowledgements from those other replicas before we tell the application server we have those changes? Again, this goes back to strong consistency versus eventual consistency, and as I was saying, there's a spectrum in the middle. Maybe just waiting for the acknowledgement from US-West is good enough, because now if I crash and come back, I can at least pull the change from one of the locations. But it depends on how much I really care about my data, which is actually not something the database system can reason about. It depends on the application, it depends on the organization — whether you're a bank handling financial data, or some Mastodon website where, who cares if you lose some stuff, right? So there's no hard and fast rule that says, okay, this is the right way to do it no matter what. It really takes a human to decide what the tolerance actually is. But you can clearly see the trade-off: I can make sure everything's strongly consistent, but I'm going to have to wait for that.
Because there are speed-of-light issues with sending and propagating messages — there's no magic wand to make that go away. But if I don't wait, then I could have inconsistent data, right? So as I said, most distributed relational database systems, including the traditional ones from the 1980s like DB2 and Oracle RAC, lean heavily toward strong consistency over availability and partition tolerance. Meaning if a bunch of nodes go down, they're just going to stop the world until those nodes come back up, because they don't want inconsistent data and they don't want to cause you integrity issues. Whereas the NoSQL guys, in a lot of the original versions, would not support multi-node consistency, because they were trying to provide high availability and the ability to scale out. They would do something really simple like last-update-wins. And this last one here, these are the vector clocks — and as I said, this is pretty rare; very few systems do this, okay? All right, in the last five minutes, let's go through a very complicated database system that Google spent years and years working on, called Spanner. So, the background is, Google was one of the biggest proponents of NoSQL systems back in the day, right? They wrote this paper called Bigtable, around 2004, where they said: transactions, we don't need them; SQL, we don't need that. They didn't explicitly call it NoSQL, but a bunch of people read those papers and said, oh yeah, let's build the same things Google's building — Google makes a lot of money, we'll make a lot of money, and it's clearly because of their database. So a lot of people re-implemented what Google described in their papers. But Bigtable didn't have transactions. I think a lot of their stuff had vector clocks, where you had to reason about inconsistent data in the application code.
But while everyone else in Silicon Valley was re-implementing Google's papers, Google was realizing, hey, SQL and transactions are actually a good idea. And over a five-year period, while NoSQL was the hot thing, they were building this thing called Spanner, right? And what is Spanner? At least the original version was not a relational database system — it didn't support SQL; they eventually added that. But it's a decentralized shared-disk architecture using log-structured on-disk storage. And for concurrency control, it hits all our favorite buzzwords from this semester: strict two-phase locking, multi-version concurrency control, Multi-Paxos, and two-phase commit. Hopefully, at this point in the semester, all these words start clicking — you see, oh yeah, I know what those things are — and when you see them in the context of someone describing a database system, you can now get a big picture of what they're actually doing and whether it makes sense, right? Now, the one thing Google does that is very rare, that most people don't do, is support what is called external consistency, or strict/strong serializability, which I think we might have talked about. The basic idea is that the commit order of transactions is equivalent to the arrival order of transactions. That's not something we required when we talked about serializability on a single node — there we said, oh yeah, this transaction could commit after that one even though it arrived first. In their world, they need to guarantee that transactions get committed in the same order that they arrive. They claim in the papers it's because of something about ads, right? They built Spanner for running their behemoth ad infrastructure. Very few people need this, and very few systems actually support it. We'll ignore their lock-free read-only transactions for now.
So the way it works is they guarantee the ordering of transactions through globally unique timestamps, generated through a combination of atomic clocks and GPS receivers in every data center. They have a satellite dish on top of every single data center pulling down timestamps from the GPS satellites. Now, this doesn't guarantee they know the exact ordering of any given transaction at any given time, but it lets them bound how long they have to wait before they can be sure no transaction will show up with a lower timestamp than the one they're about to commit. And that's the novel aspect of this, because otherwise, if you want this global ordering across different data centers, across the wide-area network, you have to wait — and because the clocks can be skewed, slowly drifting off from one data center to another, you might have to wait a lot longer than you would with this infrastructure. And as far as I know, no other company publicly talks about having this kind of infrastructure. The high-frequency trading guys might, but they don't write papers and they don't share what they're doing, because they're too busy counting their money. Fun fact: when I was in grad school, in my early years, my research funding was sponsored by this company in Chicago whose claim to fame was that, in the '90s, they figured out that if you put the database server on the floor above the stock exchange, you make money faster. They've obviously done a lot more since then, but the things they were telling me they would do were insane — like digging trenches to run their own fiber-optic cables because it would save them five milliseconds. Because again, in their world, that matters a lot.
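The "bound how long we have to wait" idea can be sketched as a commit-wait loop over an uncertainty interval. This is a hypothetical illustration, not Google's TrueTime API (the function names and the 7 ms bound are assumptions): the clock returns an interval `[earliest, latest]` guaranteed to contain true time, and a transaction with commit timestamp `s` waits until `earliest > s` before making its writes visible, so no later transaction anywhere can receive a smaller timestamp.

```python
# Hypothetical sketch of Spanner-style "commit wait" over a bounded
# clock-uncertainty interval. tt_now() returns [earliest, latest]
# guaranteed to contain true time; a transaction with commit
# timestamp s waits until earliest > s before exposing its writes.

import time

EPSILON = 0.007  # assumed clock uncertainty bound (e.g., ~7 ms)

def tt_now():
    t = time.monotonic()
    return (t - EPSILON, t + EPSILON)   # [earliest, latest]

def commit_wait(s):
    # spin until timestamp s is definitely in the past on every clock
    while tt_now()[0] <= s:
        pass

start = time.monotonic()
s = tt_now()[1]          # commit timestamp at the top of the window
commit_wait(s)
# we had to wait out roughly the full uncertainty window (~2 * EPSILON)
assert time.monotonic() - start >= 2 * EPSILON
```

The smaller the atomic-clock/GPS infrastructure can make `EPSILON`, the shorter every transaction's commit wait — which is why Google invested in the hardware rather than just using NTP.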
So, all right, the way Spanner does this is it breaks the database up into what they call tablets — these are just partitions — and it uses Paxos to do leader election within each tablet group. But if I have to update data across multiple tablets, then I use regular two-phase commit on top of that. Visually it looks like this: there's a tablet, it's replicated across different data centers, and one data center's copy is elected leader. All reads and writes have to go to this guy, and then Paxos propagates them to everyone else. Snapshot reads — again, think of a consistent snapshot at a point in time — can run on the non-leader tablets against whatever data they have. But if I have a transaction that has to update another tablet group — whether or not it's in my same data center; in their world they're replicating across multiple data centers — I use two-phase commit to propagate those updates to the other tablets: the updates go to each group's leader, which then propagates the changes to its group using Paxos. So this is a good example of how these things aren't mutually exclusive — Paxos and two-phase commit, you can use a combination of both. Most systems just do one or the other. All right, I'm going to skip this slide; it basically covers how they guarantee strict serializability through timestamping. So there you go — three minutes to cover Spanner. If you want to learn more about it, again, vote for it and we can cover it in the speed run at the end of the semester next week. All right, so the main takeaway here is that maintaining a consistent view of the database across multiple nodes is hard. And it's important that you get this right, because there will be failures.
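The cross-tablet path layers two-phase commit over the per-group consensus. Here's a minimal sketch of that coordinator logic (class and method names are illustrative, and the per-group Paxos replication of the decision is elided): the coordinator asks each tablet group's leader to prepare, and only if all vote yes does it send commit.

```python
# Minimal two-phase commit sketch for the cross-tablet case: the
# coordinator collects prepare votes from each tablet group's leader
# (phase 1) and broadcasts commit only if everyone voted yes (phase 2).
# Each leader would then replicate the decision to its tablet group
# via Paxos -- elided here.

def two_phase_commit(leaders, txn):
    votes = [leader.prepare(txn) for leader in leaders]   # phase 1
    decision = all(votes)
    for leader in leaders:                                # phase 2
        leader.commit(txn) if decision else leader.abort(txn)
    return decision

class TabletLeader:
    def __init__(self, vote):
        self.vote, self.state = vote, None
    def prepare(self, txn):
        return self.vote            # can this group commit the txn?
    def commit(self, txn):
        self.state = "committed"
    def abort(self, txn):
        self.state = "aborted"

g1, g2 = TabletLeader(True), TabletLeader(True)
assert two_phase_commit([g1, g2], "T1")      # all prepared -> commit
g3 = TabletLeader(False)                     # one group can't prepare
assert not two_phase_commit([g1, g3], "T2")
assert g1.state == "aborted"                 # everyone aborts together
```

Running Paxos underneath each participant is what makes this 2PC tolerable in practice: a "participant" is a replicated group rather than a single machine, so a node crash doesn't block the protocol the way it does in classic two-phase commit.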
Not just the machine going down, but weird things outside your control, outside the purview of the database system — GC pauses, network hiccups — you have to prepare for all of these things. And as I said, blockchains are a waste of time, because most applications don't run in that world. Bitcoin is the only really useful thing you can use a blockchain for, right? Most people running transactions are going to trust the people, trust the machines, and trust the authority, because this is sort of how the real world works: if you buy something, you give them your credit card number and you trust them to charge that credit card correctly. All of those are trusted transactions, so you don't need blockchains — the overhead of getting Byzantine fault tolerance is unnecessary. If this stuff is super interesting to you, there's a great blog from this guy Kyle Kingsbury. He has this thing called the Jepsen project. About ten years ago, at Stripe, he basically wrote a torture chamber for distributed databases to prove that they don't achieve all the guarantees they're claiming: they're not strongly consistent, they can't handle this kind of failure. Think of it in the context of the CAP theorem — are they truly partition tolerant, and so forth. His blog articles are fascinating; they're very, very detailed and they go into all the failures that occur. He was originally just doing this as a hobby, but he's since spun it off as a consultancy, so a bunch of these distributed database companies hire him to run his torture chamber on their databases, and then he writes about all the failures. It's awesome. A bunch of companies claimed they could do this feature or that feature; he comes in and destroys them, and they have to change their marketing language to reflect that they don't actually provide it, until they eventually fix it. So again, don't just take my word for it.
Read his blog articles — you'll see why. Next week, or sorry, next Wednesday: OLAP systems. We won't worry about transactions there; the big problem we're going to face is how to do joins in a distributed environment. Okay, that's it for today.