We have two weeks left of lecture, so for this week, we're going to focus on building distributed database systems. We're taking everything we've learned the entire semester, and now we're going to bring it into a distributed environment, and I'll basically show you how much harder it actually is. So real quick, homework six is due today at midnight, and then the last project, project four, is due next week on December 6th, which is also the last day of classes. Do you have any high-level questions about the last programming project? Okay, so the other thing is that next week, on Monday, December 4th, we're going to have Barry Morris from NuoDB come give a talk about their system. I've known Barry for a while now. He's the co-founder, former CEO, and current executive chairman (although he's there all the time) of NuoDB. So I'm really excited to have Barry come and give a talk, because the way I first met Barry was back in 2009, 2010, when I was still a grad student. We had the New England Database Summit (or Day) at MIT, and this was before NuoDB was called NuoDB; it used to be called NimbusDB. And Jim Starkey, who I mentioned implemented the first database system that supported MVCC at DEC, is the other co-founder of NuoDB. He came and gave a talk, which I completely did not understand. I don't want to say anything on video, but I was like, all right, it seems like there's something interesting here, but I don't understand what's going on. So we reached out to Barry, had him come down to Brown and give a talk, and that's when it clicked for me. I was like, all right, what they're doing is actually kind of interesting and kind of novel, kind of cool, and it totally made sense to me. Barry will come and enlighten you guys on what NuoDB actually does.
And I'll mention a little bit, as we go along in today's lecture, about how they fit into the things we're talking about, and this will be enough to prime you to understand what Barry will talk about when he comes. And then on the last day of class, Wednesday, December 6th, the lecture will be split into two parts. The first part will be the review for the final exam. The date is posted; I think it's a Friday, like 5:30 p.m., which is a terrible time, but it is what it is. And I'll be handing out a practice copy of the final exam, and that will be the only place you can get it, right, we're not going to post it online. But then for the second half of the lecture, something I traditionally do every year is have what I call the systems potpourri. So basically, you guys can go to this URL here, it's a Google form, and you can vote for what database system you want to learn about. And I'll pick the top three or four, cover them, and give you a 15-minute crash course on everything you need to know about each database system. What it does, why it's different, why it's interesting, why it may suck, right? So whatever you guys pick, I will teach you about it. So I'll say two things. One, don't pick things like NuoDB or any of the time series databases, since we already had those guys come give talks; that's already being covered and everything's on YouTube. And the second thing I'll say is, this is the third or fourth year that I've done this, so you can go back to previous lectures and see what the students who took earlier incarnations of this course selected. But I ask that you don't do that; go to this URL first, pick what systems you want to learn about, then go check out what other people wanted and see how you match up. Because I don't want to say what they are, because I don't want to taint you guys.
I want you guys to sort of say, what have I seen on Reddit or Hacker News or the internet that you want me to comment and talk about? So you can vote for multiple systems, but please only vote once, right? It's not a competition, no need to stuff the ballot, okay? Any questions? Yes? His question is, what if you just vote for one? Yeah, who cares? Can you, what, sorry? You're asking again because I said you could pick multiple. I forget what I said, but the Google Form allows you to vote for multiple systems; you could select all of them if you want to do that, right? Okay, all right, so for this week's lecture, we're now going to focus on distributed systems. We've already covered parallel systems; we had a lecture on that a few weeks ago, and the main thing we were concerned about there was how we can take a query, split it up, and have it run in parallel across multiple CPU cores. And in a parallel system, we made the assumption that the nodes would be connected together with a high-speed interconnect, and that we're not even going to worry about the communication costs of sending data from one socket or one CPU core to another. We also didn't spend time talking about fault tolerance or recovery in a parallel system, because we just assume that if one node goes down, whether that's a core or a socket or another node in your rack, then the whole system goes down. But now when we start talking about distributed databases, we can't make the assumption that the network communication is minimal or nothing. We also can't ignore the fact that in a large system, at any time one or more nodes could go down, and we don't want to have to stop the entire system while we wait for that thing to recover, right? We want the system, as much as possible, to be able to continue and still make forward progress, still process transactions, still execute things.
So that's what we focus on today. And as I mentioned at the end of last class, the reason why we're doing this at the end rather than the beginning is because you need to understand all the things we talked about this entire semester for single-node databases, and now we're going to apply them to a distributed environment. So just because you have a distributed database doesn't mean you don't need logging and recovery, right? All those things still apply in a distributed environment. We still have to worry about concurrency control. We still have to do query optimization and planning, right? So basically we're taking all those things that we had from before and now we're putting them in a distributed system. And this complicates things, and some of the problems we have to deal with in this environment we didn't have to deal with before. But hopefully you'll see how all this can be put together into a single cohesive database system. And so for this week, it's all distributed databases. What I decided to do was split up the lectures to focus on online transaction processing databases today, and then on Wednesday's lecture, the next lecture, we'll focus on analytical systems. And the reason why I'm making this distinction is that the types of problems you have to deal with when you implement a distributed database for an OLTP environment or an OLAP environment vary a lot. There'll be some high-level things that we'll cover in the beginning that are the same across both kinds of systems. Like replication, for example, or how you bring a node back up after a crash. But the things that an OLTP system has to worry about and try to optimize for are different than an OLAP system. This is because the workloads are different. So in an OLTP system, we're going to be mostly doing transactions that are going to be modifying the state of the database. And those transactions are only going to make small changes, right?
Think of like you log into Amazon and you update your payment information. That's a transaction; you're only updating the records that pertain to you, not updating the entire table. They're also short-lived, and they're also repetitive, meaning you're doing the same operations over and over again. In an analytical system, you're not really doing updates. Although you would take batch updates or maybe some streaming updates from the front-end OLTP system, the transactions are not actually making changes to the state of the database. Instead, we have long-running queries that are going to read large portions of the database to compute some kind of complex aggregation or join operation. We also tend to see what are called exploratory queries or ad hoc queries, meaning it's not a program running the same transactions or the same queries over and over again. It's someone opening up a terminal, or someone opening up a visualization tool like Tableau or MicroStrategy, and running queries that we've never seen before or may never see again. So I always like to show this Venn diagram that Mike Stonebraker came up with a few years ago, that shows you on two axes how the workloads are different. Along the Y-axis, you have operation complexity. At the low end, you have simple queries: updating a single record, retrieving a single record, single-key lookups, things like that. And then at the top, you have more complex things: multi-way joins, trying to retrieve a lot of information, full table scans. And then along the X-axis, you have whether the workload is focused on doing a lot of writes or doing a lot of reads. So again, for OLTP, it's a lot of writes that are mostly simple. And then in the upper corner, you have OLAP queries, which are very complex but doing mostly reads. Now the HTAP stuff sort of encompasses both of them. That's why it's in the middle.
And in the diagram that Mike originally made for this article, in the middle he had social networks, right? Because they're doing a lot of writes that are simple, but also a lot of complex query analysis to figure out who your friends are and things like that. I've since replaced that with HTAP, the hybrid workload. So again, that middle part is trying to do both things inside a single database system. We're not going to cover that here. We'll cover that next semester, to show how to do this all in a single database instance, but only in the context of a single machine; we won't do distributed HTAP systems. But in general, again, for today's lecture we're focused on the bottom, OLTP, and next class we'll focus on OLAP in a distributed environment. So we'll start off talking about the different kinds of distributed system architectures you can have for distributed databases. Then we'll cover some design issues we're going to have to deal with and overcome in order to allow the system to operate correctly with minimal impact to the application. And then we'll talk about distributed concurrency control, replication, and then depending on how much time we have left, I'll discuss the CAP theorem in terms of how it applies to distributed databases. Real quick, show of hands: who here has taken a distributed systems course? Or is taking it now? All right, so if I say Paxos, that's something most of you guys are familiar with, right? Good, okay, awesome. All right, so there are essentially four types of system architectures you can have. And what we've covered so far in this semester is what is called shared everything. So think of, again, a single node that has a processor, or more than one, that has access to some pool of memory, and has access to some disk, right? We can scale these out horizontally within a single box, but everything's contained in a single logical space, right?
So shared everything means every processor can read every disk, and every processor can read and write any chunk of memory. And the distributed architectures are the other three here: shared memory, shared disk, and shared nothing. So I'm going to go through each of these one by one so we understand their trade-offs and what they actually mean. So in a shared-memory system, what you have is one or more processors or CPUs that have access to a global address space of memory. And this memory is not always going to be local. So think of a single box: it has a CPU, it has memory. But then there'll be another box in the same cluster that has its own CPU and memory. And any node or any box can access the memory of any other node, right? So again, think of it as distributed shared memory. It's a global address space. And this means that any processor can look inside the memory of another machine and see what's going on inside of it, right? And so this means that you can have all the same sort of single data structures that we talked about before, like a single hash table, a single buffer pool manager, and things like that, that all the different processors can write to and read from. And there's a fast interconnect that allows you to shuffle those requests over to the machine that has the particular chunk of memory you're trying to access, without knowing anything about where it actually is. So you write to an address in memory, and if it's on the local machine, it's fast. If it's on another machine somewhere else, there's some transport or some hardware device that takes care of that for you. So I'm sure someone's going to correct me on YouTube, but to the best of my knowledge, the only system that I can find that actually does this type of architecture is the product that Oracle sells called Oracle RAC. They use a distributed memory fabric to allow you to read and write into the buffer pools of other nodes.
So this is not just taking Postgres and making it use a shared memory pool or a memory fabric. You actually have to modify the database system to understand that it's not running on just one machine, and that there are other machines also writing to the same address space. Because think of Postgres, for example: when you boot it up, there's this thing called the postmaster that thinks it's the central point for that database. And if you now have Postgres running on different machines with a distributed memory fabric, now you have multiple postmasters that each think they're in charge, but there are other processes that they don't know about writing into memory. So it's not just taking any off-the-shelf database system and running it in this environment; you do have to modify the system to be aware that there are other instances of the database. But the low-level coordination between those processes is all done over the network, into shared memory. The next type of architecture is called shared disk. The way to think about this is that a single node will have its own processor and its own memory, but the disk will be shared across multiple machines. So if you're familiar with cloud computing or Amazon's environment, think of this as like EBS: that is the shared disk that you can write all your changes to. So this is actually becoming more common. These ideas are not new; they've been around for a long time, but shared disk has become more prevalent in the last 10 or 15 years because now in these cloud environments, you have this persistent storage like EBS, and then you can have your different EC2 instances that can all access it. So the key advantage of shared disk is that it allows you to scale up the execution layer, the execution nodes, separately from the disk.
So depending on whether you're disk-bound or CPU-bound, you can increase capacity by adding new nodes that target one specific layer. I'll show an example of what I mean in the next slide. And so you could treat the shared disk as the central location where you have state, or where you exchange state between the different processors. But that would be slow, because you have to send things over the network to the storage, and then you have to hope the other processors can somehow find that update. If you want better performance, you can have the different execution nodes or processors send messages between themselves so that they can learn about each other's state. But this is something you have to add in your database system; the shared-disk model doesn't do this for you automatically. So let's look at an example. We have a single application server, and then we have two nodes in our execution layer. And these execution nodes obviously have a disk, because you can't boot an operating system in this environment without a disk or something. But the way to think about it is that there's no state of the database stored permanently in these execution nodes. The primary shared location of the database at all times has to be in the storage layer. And I'm showing this as amorphous disks. Again, think of this as EBS, where you just say, I want these volumes, and you don't really care how many machines Amazon actually uses to store that; it just manages all that for you. Or say you have a NAS or a SAN or a file server that you're running on premise. That would be handled in the storage layer, and the execution-layer nodes don't really need to know how many machines you're actually using to serve out the shared disk. So let's say the application server now sends a query request. It goes to this node. And now all the same things that we talked about before for our buffer pool manager apply here.
So the machine's going to check and say, oh, I need to access this tuple. It's in this page. Do I have this page in my local memory? If no, then I've got to go out to the shared disk, copy it, and swap it into memory here. And then it gets sort of complicated if you're doing updates, like how you pin things if the page lives over there; we can ignore that for now. So now the same thing: if the query goes to this node here, it can access the same page, and it's going to get that copy from the storage layer, bring it into its buffer pool, and then it can process the query. So all the same things that we did before apply here. It's just that now the disk is on another machine rather than local to me. So let's say now I find out that I'm CPU-bound or memory-bound, and I need to increase my execution-layer capacity. With this, I can just bring in a new node and have it work just like all the other nodes, without having to worry about copying any data, because all the state of the database is in the shared-disk storage layer. So you can think of the execution nodes as stateless, and I can bring them up and down as needed without worrying about having to move data around, because the shared disk is always going to have the complete copy of the database. Likewise, say you need to increase your capacity on the storage layer: you can just add more disks there. And again, that doesn't affect the execution layer. All of a sudden, the disk is faster, or you have more disk, and you didn't have to change anything or move any data around at the execution layer to make that happen. Now, of course, Amazon and EBS could be moving your data around, and you don't know it. And from the database's perspective, you don't care. That's all handled underneath the covers.
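The buffer-pool logic just described can be sketched in a few lines. This is a toy illustration, not any real system's code: the class and method names (`SharedStorage`, `ExecutionNode`, `read_page`) are all invented for this example, and the "shared disk" is just an in-process dictionary standing in for a remote storage service like EBS.

```python
# Hypothetical sketch of a shared-disk page fetch. The execution node
# keeps a local page cache (its buffer pool); on a miss it copies the
# page in from the shared storage layer. All names are illustrative.

class SharedStorage:
    """Stand-in for a remote shared-disk layer (e.g., an EBS-like service)."""
    def __init__(self):
        self.pages = {}            # page_id -> bytes

    def get(self, page_id):
        return self.pages[page_id]

    def put(self, page_id, data):
        self.pages[page_id] = data

class ExecutionNode:
    """Stateless execution node: all durable state lives in SharedStorage."""
    def __init__(self, storage):
        self.storage = storage
        self.buffer_pool = {}      # page_id -> bytes (local in-memory cache)

    def read_page(self, page_id):
        if page_id not in self.buffer_pool:              # buffer pool miss
            self.buffer_pool[page_id] = self.storage.get(page_id)
        return self.buffer_pool[page_id]

storage = SharedStorage()
storage.put(1, b"tuples for page 1")
node = ExecutionNode(storage)
node.read_page(1)   # miss: fetched from the shared disk
node.read_page(1)   # hit: served from local memory
```

Because `ExecutionNode` holds no durable state, "adding capacity" is just constructing another `ExecutionNode(storage)` against the same storage object, which mirrors why scaling the execution layer needs no data movement.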
So now one optimization you can do: say you want to do an update here. If all state about the database is only stored in the storage layer, then when you do this update, you have to make sure that either you have the only copy of the thing you're trying to update, or you have a way to have the other nodes pull from the storage layer and get the new version. But one obvious optimization you can do here is sideways message passing: have the system somehow know, at the execution layer, that for this particular tuple my other nodes have a copy of it, so I'll broadcast the update to those guys and have them get the updated version without having to go out to the shared disk for the latest version. And you can do this because the connection between the execution nodes is usually faster than the connection to the storage layer. And the storage layer is going to have some large block size for transferring data, whereas if you're sending direct messages over TCP between these different nodes, you can do more fine-grained updates. So this is an obvious optimization. Not all shared-disk systems actually do this, but this is one thing that NuoDB does do. Yes? Yeah, but then if your node is fetching some piece of data from the storage, doesn't it also have to tell all the other nodes that it has this copy? And can there be a delay between that and the broadcast, such that I just fetched a copy but missed the broadcast? So your question is, if the only copy of the data is here, and when you do your update you're at the top, then someone tries to update the same piece of data at the bottom? I don't store all copies of the same data on all nodes, but when I fetch a new piece of data, I have to broadcast to my peers that I now have this piece of data.
His statement is, if the bottom guy fetches a new piece of data, you need some kind of directory service to tell the other guys, hey, if you need this piece of data, I actually have a cached copy of it. And yes, you do have to maintain, which I'm not showing here, some additional coordinator or leader that keeps track of what node has what piece of data, so that when you have to propagate updates, instead of broadcasting everywhere, which would be an N-way message send-out for every single update, you only send to the nodes that you know have a copy of that data. Yes, you do have to do that. I'm not showing that here. That is actually something that NuoDB does. Yes? Say it again. You're asking about the buffer pool on the nodes in the system, because I said the words buffer pool and then broadcasting. So the buffer pool is just the same buffer pool we talked about earlier in the semester. I need a page. It's not in memory. Let me go fetch it, either from another node or from the shared disk. In a shared-disk environment, typically you always want to get it from the shared disk, because that's the primary copy. So now let's say I do an update here. Say the top node is the leader for a particular item, a data record, but it knows that there are copies on the other two nodes. So if a transaction does an update at the top node, it says, oh, well, when I commit, I should send the latest version to the memory of the other ones. You don't have to do this. It's just an optimization to avoid having everyone do a round trip to get the latest version from the shared disk. Right? Say it again. These nodes here? This is a black box. You don't know how many disks there are. You don't know what's on the disks. You just know that there's a key-value store: give me the page for this page ID, right? We should be careful with the language: we bring up a new execution node, right? So I'm ignoring something here.
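The directory-based propagation discussed in this exchange can be sketched as follows. This is a hedged toy model, not NuoDB's actual protocol: the `Directory`, `Node`, `fetch`, and `update` names are all invented, and real systems would also handle concurrency, failures, and invalidation ordering, which this ignores.

```python
# Toy sketch of "sideways" update propagation with a directory service:
# a coordinator records which execution nodes hold a cached copy of each
# page, so an updater pushes new versions only to those peers instead of
# doing an N-way broadcast to every node. All names are illustrative.

class Directory:
    """Coordinator that tracks which nodes cache which pages."""
    def __init__(self):
        self.holders = {}                       # page_id -> set of Nodes

    def register(self, page_id, node):
        self.holders.setdefault(page_id, set()).add(node)

    def holders_of(self, page_id):
        return self.holders.get(page_id, set())

class Node:
    def __init__(self, name, directory):
        self.name = name
        self.directory = directory
        self.cache = {}                         # page_id -> bytes

    def fetch(self, page_id, data_from_disk):
        # On fetch, tell the directory we now hold a cached copy.
        self.cache[page_id] = data_from_disk
        self.directory.register(page_id, self)

    def update(self, page_id, new_data):
        self.cache[page_id] = new_data
        # Push only to peers known to cache this page, not to everyone.
        for peer in self.directory.holders_of(page_id):
            if peer is not self:
                peer.cache[page_id] = new_data

d = Directory()
a, b, c = Node("A", d), Node("B", d), Node("C", d)
a.fetch(7, b"v1")
b.fetch(7, b"v1")        # node C never fetched page 7
a.update(7, b"v2")       # pushed to B only; C is never contacted
```

Note how node C, which never registered for page 7, receives no message, which is the point of keeping the directory rather than broadcasting every update.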
So we'll get to this later; I'm not mentioning anything about how this application server figured out that it needs to go to this node to get that data. For our purposes, in the most simplistic form, you don't need to do this, right? The most naive shared-disk architecture is basically that every node knows nothing, and every time you want to access a tuple, you always go to the shared disk to get the tuple. And then when you go commit, you try to write it back and see whether somebody else has already modified it. That's the dumbest thing you can do, because it'd be super slow. So say I'm back here. I have two nodes, right? And now I'm CPU-bound and I want to scale out. When people talk about elastic databases, they mean adding more capacity, and it can be for computation or memory or for storage. So if I want to increase my computation and memory, I can bring a new node in. But because there's no database state in the execution nodes, I can bring it up without having to reshuffle anything, because the shared disk always has the complete copy of the database, and my new node knows how to go talk to it and start copying in the things it needs to process queries. Correct, this is a shared-disk environment. And there's a disk here because you have to have one in order to boot the system up and log stuff, but the storage location of the database is on these disks over here. And here's another advantage that you see in cloud environments: say it's the end of the day and I don't need to process any more queries. I can shut down these execution nodes and just keep the shared disk there. You pay Amazon for maintaining and retaining the storage, but you're not paying for the execution nodes that aren't running. Whereas, as we'll see in the next slide for shared nothing, if the state of the database is stored on the nodes themselves, I have to keep everything there and keep those machines live.
Otherwise, if I shut them down, I lose everything. So there's this nice decoupling between these two layers. And as I said, from this list of systems here, all of the big cloud service providers that offer database systems are always going to be in this environment. So Spanner, Amazon Aurora, Redshift, they're all going to be based on a shared-disk architecture. Things like HBase, Hive (Stinger), Presto, and Impala, these are systems designed to run on HDFS, and you can treat HDFS as the shared disk. Snowflake is for OLAP queries; think of that as like Redshift. Same thing: you can have a separate execution layer and storage layer. And then NuoDB, as I said, has storage managers that act as the shared disk, and then you have a bunch of execution nodes that can pull data in and process transactions. OK? Yes? Is a distributed file system enough to be the storage layer, or do you have to have something extra? So that's a great question. The question is, is your favorite off-the-shelf distributed file system, HDFS probably being what people think of mostly, good enough to be the shared disk, or do we need something special? So if you look at systems like HBase, for example, HBase runs on top of HDFS. The basic thing you need to make this work is just get and put: I need to get a block, I need to put a block. And that's all you really need. Now, with HDFS, because it's append-only, you have to use a log-structured architecture, which we talked about before, so that you're always appending to a file, because you can't go back and do in-place updates. Aurora is a good example where Amazon can actually add some additional stuff in the layer, I think, above EBS that allows them to do things more efficiently than just always appending. So your question is, is an off-the-shelf distributed file system enough to make this work? The answer is yes. For analytics, who cares about doing in-place updates, because you're just reading files as fast as possible?
So all you really need is get. That's enough. For transaction processing, you either need to use a log-structured architecture like in HBase, or add additional mechanisms, as in the case of Aurora. In the case of NuoDB, they basically implement their own key-value store, and that's enough to be a shared disk. It's a good question. So the next architecture, the one people usually think of when they think of a distributed database, is called shared nothing. Again, this is a term that's been around for a while. If you Google shared nothing, Stonebraker, of course, wrote a paper in the 1980s that touts the virtues of shared nothing. But again, when most people think about distributing a database, they really mean the shared-nothing architecture. And here, basically, the idea is that every node is going to have its own CPU, its own memory, its own disk. It can't view into the memory of other machines, and it can't write to a shared disk. The only way it can communicate with other nodes in the cluster is to do message passing over TCP. And so the advantage of this is that it's somewhat easy to increase capacity because you just sort of plop in new nodes, but the tricky thing is going to be: how do you maintain consistency? How do you add new nodes and rebalance while you're executing transactions? So let's look at an example here. So here now I have two nodes. And on each node, we now have the processor and memory and disk. And I'm showing you a rudimentary partitioning scheme where, in partition P1 at the top, I have all the records for a table with ID values from 1 to 150, and at the bottom, I have all the IDs from 151 to 300. So now, if I have a query that comes along and wants to get the record where ID equals 200, ignore the question of how the application figured out that it needed to go there, we'll cover that later.
But just assume that it knew how to get there, that that's the node that has the data it wants. So it can go to that one node, that node can process the query entirely, and it sends back the result. So now let's say I have a query that wants to get the record where ID equals 10 and the record where ID equals 200. It would send the request to the node with ID equals 10, and that node would know how to find the record where ID equals 200, so it can send just that part of the query down to the node here, get back the result, coalesce it, and then send the final result back to the application. So this will come up when we talk about data transparency, but the idea is that in our distributed database, the application doesn't need to know anything about how the data is stored. It just knows that it can go to some node, whether it's one of them or any of them, and that node has enough information to figure out how to route your query, or part of your query, to the nodes that have the data you need, and send the result back. So now the tricky thing is, how do we increase capacity? In the shared-disk environment, I said that the execution nodes are stateless, so you just bring up a new one, and you obviously update some catalog information with the partitioning information, but you don't have to move any data around. And in terms of increasing capacity in the storage layer, that's handled by EBS or whatever file system you're using, and we don't care about that. But in the environment we have now, we control everything in a shared-nothing system. So if I want to add a new node, that means I'm bringing in its own processor, memory, and data, and I need to figure out how to move data around and rebalance the system. So what I want to do is take some portion of the data from the top node and move it to the middle, and some portion of the data from the bottom and move it to the middle. So then my partitioning scheme looks like this.
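The range-partitioning and routing just described, with IDs 1-150 on one node and 151-300 on the other, can be sketched like this. It's a minimal illustration under invented names (`Cluster`, `node_for`, `node1`, `node2`), not any real system's routing layer.

```python
# Minimal sketch of range partitioning with query routing: any node can
# route a request, because each one knows the partition map. Partition
# bounds follow the lecture example (ids <= 150, ids <= 300).

import bisect

class Cluster:
    def __init__(self):
        self.bounds = [150, 300]                  # upper bound of each range
        self.names = ["node1", "node2"]           # owner of each range
        self.nodes = {"node1": {}, "node2": {}}   # each node's local data

    def node_for(self, record_id):
        # Find the first range whose upper bound covers this id.
        idx = bisect.bisect_left(self.bounds, record_id)
        return self.names[idx]

    def insert(self, record_id, value):
        self.nodes[self.node_for(record_id)][record_id] = value

    def get(self, record_id):
        # Whichever node receives the query forwards it to the owner.
        return self.nodes[self.node_for(record_id)].get(record_id)

c = Cluster()
c.insert(10, "alice")     # lands on node1 (ids 1-150)
c.insert(200, "bob")      # lands on node2 (ids 151-300)
c.get(200)                # routed to node2 regardless of entry point
```

The multi-record query from the example (ID 10 and ID 200) would simply call `node_for` twice, fan the two sub-requests out, and coalesce the results before replying to the application.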
So now every partition on every node has an equal amount of data. This seems trivial, and obviously moving data around is not that hard; it's just copying bytes. The tricky thing, though, is that if we're doing this for an OLTP application, we want to be able to continue to execute transactions while we're moving this data around, and not have any false negatives or any false positives. So we don't want a query, say for ID equals 150, that before we were sending to the top, but then, as we're doing our transition to move data to the middle, the query lands at the top, it does its lookup, and that data has already moved, so it comes back with an empty result even though the data actually exists. That would be a false negative. So we want to make sure that when we move data around, we hide all this. The easiest way to do that is actually to stop everything while you move data around, but if you're moving terabytes of data, that could take a long time, and you don't want to do this. So I'm not going to talk about this here, but I've done research in the past with colleagues about how to ensure that you can continue to execute transactions while you move data around. And the way to think about it is that you're just doing some extra bookkeeping, and you're doing deletes and inserts to take data out of one machine and put it into another machine, all in the context of a transaction. But again, the main takeaway I want you to get from this is the difference between shared disk and shared nothing: with shared disk, we let the disk layer handle storage scalability and we just worry about our execution nodes; in a shared-nothing architecture, we have to worry about everything.
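The delete-plus-insert migration idea above can be sketched as follows. This is a hedged toy, not the actual research protocol referenced in the lecture: all names (`Cluster`, `migrate`, `n1`, `n3`) are invented, and a real system would wrap each step in a transaction with locking and logging, which this single-threaded sketch only gestures at through its ordering.

```python
# Toy sketch of online rebalancing: a key is moved to a new node as an
# insert at the destination followed by a delete at the source, with the
# routing entry flipped in between, so a lookup never misses the key
# (no false negatives) at any point during the migration.

class Cluster:
    def __init__(self):
        self.nodes = {"n1": {}, "n2": {}}      # node name -> local data
        self.route = {}                        # key -> owning node name

    def insert(self, key, value, node):
        self.nodes[node][key] = value
        self.route[key] = node

    def get(self, key):
        return self.nodes[self.route[key]].get(key)

    def migrate(self, key, dest):
        src = self.route[key]
        value = self.nodes[src][key]
        self.nodes[dest][key] = value          # 1. insert copy at destination
        self.route[key] = dest                 # 2. readers now routed to dest
        del self.nodes[src][key]               # 3. delete stale copy at source

c = Cluster()
c.insert(150, "carol", "n1")
c.nodes["n3"] = {}            # bring up a new, empty node
c.migrate(150, "n3")
c.get(150)                    # still found, now served by n3
```

The ordering matters: because the destination copy exists before the route flips, and the source copy is deleted only after, a concurrent reader following `route` always finds some copy, which is the bookkeeping invariant the lecture alludes to.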
So I would say again that most distributed databases built in the last 15 years follow this shared nothing architecture, but there's been a trend or movement towards the shared disk model, especially for OLAP queries, because cloud computing has become so prevalent and because EBS is good enough. So, in case you think distributed databases are new, they've been around a long time, and the usual suspects for building databases have built distributed databases. Mike Stonebraker and Phil Bernstein built what we know as the first distributed databases. Stonebraker was working on Ingres at the time, so there's a tech report from 1979 that talks about a distributed version of Ingres called MUFFIN. I asked him at one point what MUFFIN actually stands for. I'm not sure whether he was joking or not, but he said it stood for **** And then SDD-1 stands for the System for Distributed Databases. Phil was an early transaction processing professor at Harvard, and he built this in conjunction with CCA. There's a great talk that he gave this summer, and I have to put it on YouTube; I have it. He talked about the early days of SDD-1 and all the problems they had to deal with. These two are considered to be the first distributed databases. And then Mohan, of ARIES fame from last class, was at IBM Research, and he helped build a system called R*, which is the distributed version of System R. David DeWitt built an academic system called Gamma. Technically it's a parallel database machine, not a full distributed or shared nothing database system, but a lot of the ideas that came out of Gamma are certainly used today in distributed systems. And then Jim Gray, of the early System R days at IBM, worked on a fault tolerant distributed database called NonStop SQL at Tandem. The way they achieved fault tolerance was through hardware redundancy. So on a single machine it would have three redundant processors or three redundant sticks of RAM.
It would always run those things at the same time. Right, so again, this is just to show that distributed databases are not a new idea. People struggled with them back then and people struggle with them now; they're very difficult. All right, so now that we have an idea of what our system architecture is gonna look like, for the rest of the lecture I'm mostly gonna be focusing on shared disk and shared nothing. And I'll try to specify when something we're talking about is specific to shared nothing and not shared disk, but for the most part, a lot of these ideas apply to both models. For shared memory, I think that'll become more prevalent in the future, certainly with things like RDMA, but for now there are not many systems that follow that model. All right, so the questions we have to deal with now are: How are we gonna store data across the nodes? How is the application gonna find the data? If I have a query and I send it to a node, what node do I send it to, and how is that node gonna know where to find the data that it wants? How are we gonna actually execute our queries on our distributed database? This is mostly gonna be an issue we'll cover next class. The two basic models are: you either push the query to the data and then send back the result, or you pull the data you need, bring it to the machine that you're at, and then process the query there. The shared disk model would be an example of the second approach, right? Because you can't execute any queries on the shared disk; it's always just get or put. So you're gonna get the blocks that you think have the data you want, process the query, and then send back the result. Again, we'll cover this more when we talk about OLAP queries, because in OLTP it's fairly simple, right?
Because you're gonna go get a single record, and the cost of copying that block back is more or less the same as sending the query over and processing it there. And then of course the hard problem is: how is the database system gonna guarantee correctness during all of this when we're doing updates? So I mentioned this before, but the key thing we wanna have in a distributed database is that how the data is physically stored across multiple machines should be completely transparent to the application. Nowhere in your application code should you say, all right, I need to touch the record where ID equals four, so I wanna go to node four, or I wanna go to node five, right? We want an abstraction layer that hides all these details from the application code. That way, if machines go down, or if we scale up or scale out, we don't have to go back and update the application, because the state of where to find the data we need is essentially a database in itself, right? The catalog is the metadata about the data. So all the guarantees we want for our regular database, we wanna have for that partitioning information, for that catalog information, as well. And we don't wanna put that inside the application, because now if there's a failure, we have to make sure all the applications know that this node is no longer available and this other node is the one you should actually go to to get the data you want. Another way to think about this is that if you have a SQL query and a copy of your database that can run on a single node, then when you go to a distributed environment, that same exact SQL query should run without any changes. And this is the beauty of a declarative language like SQL, where you don't have to worry about how to actually implement things.
You just say, here's my SQL query, you shove it off to some machine, and then it figures out where the data is that you need and how it should process the query. So we've already talked about partitioning before, when we talked about timestamp ordering. I wanna mention a little bit here, and then we'll cover it more in the next lecture, about what kind of partitioning we're gonna care about in a distributed OLTP system. So again, partitioning is basically taking the database and splitting it up across multiple resources. We have some giant table, and we want a way to say, all right, some portion of the data goes on this node, some portion of the data goes on that node, and then we have some internal metadata catalog that knows how to find the data you need based on what the query is actually looking for. The traditional or academic way of describing these chunks of data is to call them partitions. In the NoSQL systems, and I suppose out in the real world, they're often called shards. So partitioning and sharding are essentially the same thing; they're just different words for the same concept. So again, what's gonna happen is we're gonna partition or shard the database, and then when the query shows up, the database knows, all right, I need to touch this data, or here's the nodes I need to go look at or talk to to find the data that I need. And then maybe you send some portions of the query as fragments to those nodes, they execute their portion of the query and return data, and then you combine that all together to produce a single answer. So from a single terminal you type a select statement, and maybe it's blasted out to a thousand machines and then put all together into a single response and sent back to you as a single message. You're not gonna get a thousand responses back, one for every single node in your cluster.
Again, from the application standpoint, you don't know, and you should not need to know, how many machines are actually involved in your query. So the main type of partitioning we're gonna talk about for OLTP applications is horizontal partitioning. Again, we've already covered this when we talked about timestamp ordering, but the basic idea is that we're gonna pick some set of attributes or columns that we're gonna use to divide a table up, where each tuple's entire contents get put inside of a single partition. So if you remember DSM versus NSM, the column store versus the row store, think of horizontal partitioning as sort of doing row partitioning: you take an entire row and you store it on a single partition. And so we'll cover this in the next class, but the ways you can do partitioning are: round robin, where you just assign tuples to partitions in order; hash partitioning, where you take the value of the partitioning key, hash it, mod it by the number of machines you have, and send the tuple to that node; and range partitioning, which we already showed before, where you basically say keys within the range from one to a thousand go here, a thousand and one to two thousand go there. So I showed this diagram before from TPC-C. This is just to show you why we can do partitioning for OLTP, and how it's gonna solve some of the problems we'll deal with when we talk about distributed concurrency control. If you take the warehouse and district tables from the schema and look at the actual create table statements, you'll see that the district table has a foreign key reference to the warehouse ID in the warehouse table. So what we're gonna wanna do is devise a partitioning scheme such that all tuples of the same warehouse, because the warehouse is always gonna be the parent, get put into a single partition, so that most of our transactions only touch data at one partition.
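The three partitioning schemes just described can be sketched in a few lines. This is a hedged illustration, not any system's real implementation; the node count and range boundaries are made-up assumptions.

```python
import hashlib
from itertools import count

NUM_NODES = 4  # assumed cluster size for this sketch

def hash_partition(key) -> int:
    # Hash the partitioning key, then mod by the number of machines.
    digest = hashlib.sha256(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

def range_partition(key: int, boundaries=(1000, 2000, 3000)) -> int:
    # Keys 1..1000 -> partition 0, 1001..2000 -> partition 1, and so on;
    # anything past the last boundary lands in the final partition.
    for i, upper in enumerate(boundaries):
        if key <= upper:
            return i
    return len(boundaries)

_rr = count()
def round_robin_partition() -> int:
    # Ignore the key entirely; just deal tuples out to nodes in order.
    return next(_rr) % NUM_NODES
```

Note that hash partitioning spreads load evenly but destroys range locality, while range partitioning keeps adjacent keys together, which is why the choice depends on the access patterns.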
And then we can scale those partitions out independently without worrying about doing any coordination between them. So essentially we're taking the schema and putting it into a tree structure like this, where a slice down the tree is always gonna be data within a single warehouse. In the case of the item table, it doesn't have a warehouse ID, so we're just gonna replicate that on every single node, or every single partition. So now when we assign data to partitions, we'll take the warehouse ID, and whether we use hash partitioning or range partitioning or round robin, it doesn't matter; we'll assign tuples to these partitions one by one. In the case of the item table, again, because it doesn't have a warehouse ID, we're gonna replicate it on every single node. Of course, that means that anytime anybody comes and updates the item table, we have to make sure we broadcast that to every single partition, so that everyone has a consistent view of the same table. So we typically only do this for tables that are small and read-mostly. Yes, right here. So the question is, I said that you choose columns to divide the database, but each tuple keeps all of its columns? Yes, so think of it like you pick a column. I have five columns, I pick the second one, and then for every single tuple I'm gonna look at that second column, hash whatever value that tuple has, and then assign it to a partition. Right? And that actually brings up a good point: in my example, I'm just doing hashing, so every partition is gonna have an equal amount of data. In practice you may not actually wanna do that, because the tuples may not always be the same size, and they may not always be accessed with the same frequency. So it's very common in social networks for the very popular people to essentially be given a rack of machines to themselves, where only the data for that particular celebrity is stored on that rack.
So Justin Bieber is the famous example of this. Justin Bieber has his own rack of machines at Twitter. I'm sure Donald Trump does too. So Justin Bieber is sort of his own partition of data, because those machines are CPU bound with everyone trying to access his updates and tweets, right? And so how you partition things can depend on what objective function you're trying to maximize, or what resource you're trying to optimize for. Okay, so again, we assign these things to partitions and then we replicate the item table everywhere. So now say an application server comes along and wants to access this one partition, right? It can start a transaction, then it says, all right, here's the query I want to execute: give me the record from the warehouse table where warehouse ID equals one. And then I can go ahead and commit. This is the best case scenario for a distributed database system, because at no point did I need to touch data at any other partition. So in order for me to commit this transaction that only touched data at partition one, I only need to look at what happened at my single partition; I don't need to check with anybody else. This is the ideal scenario in a distributed transaction processing database. And for a lot of applications, it is the case that you can transpose the schema into that tree structure so that you can do this. These are referred to as either single node or single partition transactions, and these are what you want to maximize in your database system. So you want to choose a partitioning scheme that gives you mostly these. Because again, you don't need to coordinate with any other node or any other partition, or with any concurrent transactions running there, because you know they couldn't have read any data that you read, because you didn't read anything at their partition. Not all applications can actually work this way, though, right?
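The TPC-C tree-partitioning idea above, co-locating each district row with its parent warehouse and replicating the small read-mostly item table everywhere, can be sketched like this. The partition count and table names are illustrative assumptions.

```python
NUM_PARTITIONS = 3  # assumed number of partitions for this sketch

def partition_for_warehouse(w_id: int) -> int:
    # Hash partition on the warehouse ID (mod works fine for integers).
    return w_id % NUM_PARTITIONS

def place(table: str, row: dict) -> list[int]:
    """Return the partition(s) a tuple gets stored on."""
    if table == "item":
        # No warehouse ID: replicate on every partition, so any update to
        # item must be broadcast to all of them to stay consistent.
        return list(range(NUM_PARTITIONS))
    # warehouse, district, etc. co-locate on the parent warehouse's
    # partition, so a transaction on one warehouse stays on one node.
    return [partition_for_warehouse(row["w_id"])]

# A warehouse row and its districts land on the same partition...
assert place("warehouse", {"w_id": 7}) == place("district", {"w_id": 7})
# ...while the item table lives everywhere.
assert place("item", {"i_id": 42}) == [0, 1, 2]
```

This is exactly why a New Order transaction against a single warehouse can be a single partition transaction: every row it touches hashes to the same place.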
So when a transaction has to access, read or write, data at two or more partitions, you have what's called a distributed transaction. Now you are going to have to coordinate between different partitions, and now you're going over the network, sending messages, and you have to figure out what's going on, right? And I'll say that a lot of the time you can't partition the database so that everything is single partition, because the schema doesn't fit that way, or the access patterns or queries don't work that way. But sometimes there are also legal reasons why you can't do that. So in 2009 or so, we went to go visit a very well-known payment processing company in the valley, because they wanted to talk with Stonebraker and see whether we could solve their problems. And they had this restriction where bank accounts from people in different countries couldn't be on the same partitions or the same hardware. So if you were sending money to somebody in the same country, that could possibly be a single node or single partition transaction. But if I had to send money to somebody in China, those two accounts would be on separate machines, so that would always be a distributed transaction, which is always much more expensive. So again, you want to maximize the single partition transactions, but it's not always the case that you'll be able to do that. All right, so now if we're gonna have distributed transactions, we gotta figure out how to coordinate their execution. And the first thing we have to talk about is how we can design our distributed database system to be aware of what transactions are going on, and then allow us to make decisions about whether transactions are allowed to commit or not.
So this is the same concurrency control stuff that we talked about before, but in a shared everything system, we have a global view of the database because everything's on our single node. Now, in a distributed environment, you could have transactions touching data at different nodes, and we need something to figure out whether they're allowed to do that, and whether they're allowed to commit and make changes. So there are essentially two approaches to this. One is a centralized approach: we have a global traffic cop that has, again, a complete view of everything that's going on. Well, almost complete view; I'll say why in a second. And then you have a decentralized architecture, where the nodes essentially have to figure out amongst themselves whether transactions are allowed to commit. So yes, this question is, are there transactions that have to be distributed? Yeah, I just mentioned that before, right? This question is, does this mean we cannot use shared nothing in this scenario? Why wouldn't you be able to use shared nothing here? Why can't you be distributed? Right? So say this is a shared nothing architecture. I can start my transaction there, but then I can update something in P3. Actually, I'll show examples in the next slide. There's no reason you can't do that. I think what you're asking is, are there transactions that only have to touch data at a single node? If the data is at a single node, then the transaction doesn't have to be distributed; you don't need to coordinate with anybody else, right? If you have to touch data at multiple nodes, then that's considered a distributed transaction. Let me go through my examples, and if you're still confused, we can answer more questions, okay? Okay, so again, centralized versus decentralized architectures. In the centralized architecture, one example of this is called a TP monitor, or transaction processing monitor.
These were really big in the 70s and 80s. Think of it like this: you have a separate piece of software running on some separate machine, and your application always has to go through it and ask whether it's allowed to do something on the database or the other machines. So these were really big in the 70s and 80s because people had databases that supported transactions locally, but there was nothing that could let you federate them and have a global transaction across them. So say you're an airline: you have one database for payment information and one database for airline reservations, the seat assignments. What you wanna be able to do is have a single global transaction that updates the seat database and updates the payment database, and then once both of those are completed, the transaction is actually finally done. So this is what TP monitors were used for. They're still used today. They're actually very expensive, but they're very common in older enterprise legacy applications. Nowadays, though, if you have a distributed database that supports transactions, you don't need a TP monitor, because the database does this for you. All right, so let's look and see how this would work. We have again our application server, but now we have our coordinator. Again, this could be our TP monitor, or essentially whatever fills that role, right? And so now say I wanna touch data at these three partitions, so this is a distributed transaction: I'm touching data at partitions one, three, and four. So before I can begin, the application server has to go to the coordinator and say, I need to lock these partitions because I'm gonna access data at them. The coordinator now is gonna have that same lock table that you guys implemented for the third project, where it's gonna know what transactions hold what locks on what data objects.
To keep this simple here, I'm just gonna assume that we lock entire partitions, but all the same lock granularity and hierarchy stuff that we talked about before can be used in this environment as well. So we could lock single records or ranges or pages or whatever; it doesn't matter, for our purposes we're gonna lock partitions. So if nobody else holds locks on these partitions, the coordinator updates its lock table and sends back an acknowledgement to the application server to say, yes, now you can proceed because you have the locks for these data objects. So now the application server can send its query requests to these different nodes and do whatever it wants on them. Then when it wants to commit, it's gonna go to the coordinator and say, all right, I did all my changes, I need to commit now. Is that gonna happen or not? Tell me yes or no. What the coordinator is gonna do is send some internal messages to the partitions and say, all right, the transaction that made these updates that you know about wants to commit; let me know whether it's safe to do this or not. And then once they all say, yes, you're allowed to do this, the coordinator sends back the acknowledgement to the application server and says, yes, your transaction committed. Now, whether the coordinator pushes the commit message and tells these guys to commit, or whether the application server does that, doesn't really matter; it's an implementation detail. But the reason why the coordinator has to go to these other guys and ask, is this transaction allowed to commit, is that there may be some integrity constraints or other things that need to be enforced that the coordinator is not gonna know about, that are only known at these local nodes. In my example here, all I did was lock the partition; I didn't tell the coordinator exactly what I was gonna do.
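The centralized-coordinator flow just described, acquire partition locks, execute, then ask every touched partition to vote on commit, can be sketched as a toy like this. The `Coordinator` and `Partition` classes and all names are illustrative assumptions, not any real TP monitor's API.

```python
class Partition:
    def can_commit(self, txn_id: str) -> bool:
        # Stand-in for the local checks (integrity constraints, etc.)
        # that only this node can perform.
        return True

class Coordinator:
    def __init__(self, partitions: dict):
        self.partitions = partitions
        self.lock_table = {}        # partition id -> txn id holding the lock

    def begin(self, txn_id: str, needed: list) -> bool:
        # Grant the partition locks only if nobody else holds them.
        if any(p in self.lock_table for p in needed):
            return False            # caller must wait or abort
        for p in needed:
            self.lock_table[p] = txn_id
        return True

    def commit(self, txn_id: str) -> bool:
        held = [p for p, t in self.lock_table.items() if t == txn_id]
        # The coordinator cannot decide alone: local constraints are only
        # known at each partition, so every touched partition must vote yes.
        ok = all(self.partitions[p].can_commit(txn_id) for p in held)
        for p in held:
            del self.lock_table[p]  # release locks either way
        return ok

coord = Coordinator({1: Partition(), 3: Partition(), 4: Partition()})
assert coord.begin("T123", [1, 3, 4])   # locks granted, queries may run
assert not coord.begin("T456", [3])     # conflicting request is denied
assert coord.commit("T123")             # every partition voted yes
```

The key structural point is in `commit`: the lock table lives at the coordinator, but the yes/no vote has to come from the partitions themselves.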
So I could have updated some record in a way that I shouldn't have, or there could be some other reason why I would not be allowed to commit, and the coordinator is not gonna be able to make that decision, right? It can do deadlock detection or whatever concurrency control protocol it's using up above; it can do that for the state it knows about. But you always have to ask these local nodes whether the transaction is allowed to commit or not. Does that make sense or no? Yes? Your question is, could there ever be a transaction that you could not perform in a distributed manner? No. Let me think about whether that's true or not. Yeah, I can't think of any reason why not. There might be practical reasons, but logically there is no reason. Right, so think about it: the simplest thing you could do is build a distributed database that only has serial execution. You could have a coordinator that only allows one transaction across a thousand machines to run at a time. For OLAP it doesn't matter either, but we're focusing on updating the state of the database here. Okay, yes. Right, so his statement is that there may be a security reason that would prevent a transaction from being executed in a distributed environment. Yes, but that's higher level semantics, a higher level concept, that as database implementers we don't know about and don't care about. I'm sure this is on video now and it's going to burn me at some later point, but I don't care about security, at least at this point here, okay? Okay, so another way to have a centralized coordinator is to have what's called middleware, which is basically a proxy that the application server always has to go through in order to execute a query.
So the application server sees the middleware as the single entry point for the database; it can't access the partitions directly, right? And so if I send in my query request, it always has to go through the middleware, and the middleware has some kind of internal partitioning table or lock table where it can say, all right, these queries need to touch these three partitions, so I'll go ahead and lock these three partitions for this transaction. Then it sends the messages to the partitions, gets back the data, and sends the result back to the application. So the application server only sees the middleware machine; it doesn't see the other ones. Then when it wants to go ahead and commit, same thing as before: the middleware has to ask these guys whether it's allowed to commit, because again, it doesn't know exactly what updates were made on the machines; all it was doing is routing the queries to the appropriate location. So this is probably more common now in distributed, shared nothing systems. I won't go through all the other examples, but this was actually very common in a lot of early cloud systems, right? This middleware would essentially be the router that says which shard of MySQL you need to send your query to. And in a decentralized system, what happens is that we send a begin request to start a transaction to some partition, and this partition essentially gets anointed as the base partition, or the coordinating partition, or the home partition; it has different names in different systems. Now what'll happen is this partition will send back a transaction ID to the application server, and now it can send any query request to any machine that it wants, but it passes along the transaction ID that it got from the first guy, and the partitions keep track internally of all the updates that you're doing.
So then when I want to commit, I go back to my base partition and say I want to commit my transaction, and then internally this guy checks with the other partitions to figure out whether it's actually safe to commit or not. Yes? So the question is, how does this square with the transparency idea? When I said before that we have data transparency, how does my example here, where the application needs to know which partition to start at, fit into that paradigm? So, how do I say this? Some systems will push back some metadata about where to route queries to which partition. But it could also be the case that, say, I only send my query request to partition P1. P1 knows, all right, I don't have the data you want, it's really at P3, so it forwards that request to P3, P3 executes it and sends the result back to P1, and then P1 sends back the result. That's how you solve that. But again, this is obviously an extra round trip, extra traffic on the network. For OLTP it doesn't matter too much, because you're not transmitting much data, but it does add congestion on the network. So again, you can provide hints back to the application: next time you execute this query, go to P3, not me. Of course, if you're using stored procedures, you just send your transaction request to one machine, it executes there and knows how to get the data that you need, and you don't have that problem. Typically, what you want is for the base partition to be the partition that has most of the data you're gonna access. So let's say in that TPC-C example, I'm gonna access data mostly at this one warehouse, but I need to get one record from another warehouse. My base partition should be the location where most of my queries need to access data, and then if I have to go to another machine, it's not that big of a deal.
Okay, so I've been very hand-wavy so far about committing transactions, about how we actually figure out whether it's safe to commit or not. All of the things we talked about for three weeks with concurrency control apply here. And as I said before, a concurrency control protocol is either two-phase locking or timestamp ordering; that's still true in a distributed environment. So there are distributed versions of all the algorithms we talked about before, right? You just adapt them to say, all right, now I need to go to this other machine to figure out what's going on, rather than having all the data I need locally. So timestamp ordering, OCC, all the variants of two-phase locking, they all still apply here. The only thing I'll show you in the next slide is that when you wanna do deadlock detection, that's a bit more tricky, because now you have to figure out how to get a global view of who's waiting for whom. But I would say distributed concurrency control is much harder than single node concurrency control, because you gotta deal with replication, which we'll talk about later. You gotta deal with the delay of sending network messages, and the messages can go missing or arrive out of order. Nodes can go down while you're executing transactions, and how do you handle that? And then any time you need to assign unique timestamps to transactions, you can't rely on the system clock anymore, because the nodes may not be in sync, right? The hardware clocks drift all the time, so you run things like NTP to try to keep them in sync, but the best you can maybe get is a couple milliseconds of accuracy. So you don't have a global clock anymore to figure out who comes first. The way Google solves this in Spanner is that they have hardware atomic clocks and GPS devices on every single node to allow them to assign timestamps. But they're still not gonna be completely in sync.
You still have to wait a little bit to see if anybody shows up from another machine with a lower timestamp than yours when you're trying to commit a transaction. The atomic clocks just allow them to bound what that drift is; it doesn't go away entirely. Yes. So the question is, what is the difference between the middleware and a TP monitor or centralized coordinator? With the centralized coordinator, the coordinator is just the thing you ask, am I allowed to do something, before you do it. So in this case here, I wanna access a partition, so I wanna lock that partition. The lock table state is stored at the centralized coordinator, but once I have the locks, I can send my messages directly to the partitions. In the middleware approach, I always have to go through the middleware. It maintains that same lock table as the TP monitor, but I route all my queries through it, and then it figures out where to actually send the requests. Why is the middleware preferred? So the question is, why is the middleware preferred over the coordinator? I wouldn't say it's preferred, but it's probably more common now, right? Because with the coordinator, after acquiring locks, the application still goes to the partitions directly. It's related to the question asked earlier: doesn't having the application go directly to the partitions break that data transparency? The middleware solves that, right? With the coordinator approach, you have to either route queries where they need to go or push some logic into the application server that says, here's where you need to go, and then you need to make sure that's always in sync. With the middleware, you never have to do that. You just point the application at a single proxy, you send your queries there, and it's responsible for figuring out where things go.
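Going back to the timestamp problem from a moment ago: since you can't rely on synchronized wall clocks, one classic software-only alternative (not what Spanner does; Spanner uses the hardware clocks) is a Lamport-style logical clock, where a counter plus a node ID yields unique, causally ordered timestamps. This is a hypothetical sketch, not any particular system's implementation.

```python
class LogicalClock:
    """Lamport-style counter: unique, causally ordered timestamps
    without a synchronized wall clock (node_id breaks ties)."""

    def __init__(self, node_id: int):
        self.node_id = node_id
        self.counter = 0

    def now(self) -> tuple:
        # Tick on every local event; the (counter, node_id) pair is
        # globally unique because node ids are unique.
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, remote_counter: int) -> None:
        # On receiving a message, jump ahead of the sender so our next
        # timestamp orders after everything we have already seen.
        self.counter = max(self.counter, remote_counter)

n1, n2 = LogicalClock(1), LogicalClock(2)
t1 = n1.now()        # event on node 1
n2.observe(t1[0])    # node 2 hears about it
t2 = n2.now()        # this event is ordered after t1
assert t2 > t1
```

The trade-off is that logical timestamps only capture causal order, not real time, which is why systems that promise external consistency reach for hardware clocks instead.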
Again, when most people in the mid-2000s, and probably even still today, wanted to scale out MySQL, they would write their own sharding layer that does essentially this middleware routing, right? Yes. Does the middleware have to be a separate machine? No, it doesn't have to be. Think of it as a single separate logical software process, right? You could have everything on the same machine in a bunch of Docker containers; it doesn't matter. So let me show you a quick example of distributed two-phase locking. We have our database partitioned across two nodes: object A is stored at node one, object B is stored at node two. Two transactions start at exactly the same time from different application servers. In the first step, the first query for transaction T1 wants to update A, so it gets the lock on that. The second transaction wants to update B, so it gets the lock for that. Each one executes its query, gets back the result, and it's done. Now, the next query the first guy wants to do is update B, and the second guy wants to update A. And of course we know this is a deadlock: because it's two-phase locking, they're going to hold the locks for the objects they already acquired, then they each try to go get the lock for the other object, and they'll find that it's being held by the other transaction, so they're both waiting for each other. So the tricky thing here is: where do we actually maintain the waits-for graph to figure out who's waiting for whom? With a centralized coordinator, either the TP monitor or the middleware, you know that information, because transactions have to ask you for locks before they can actually access anything. In the decentralized model, every partition has its own local view of what locks are being held by the transactions running on it. It doesn't know about the other guys, right? In a shared disk or shared nothing architecture, it can't look into the memory of another node to see its lock table.
So it has to send messages now to figure out: all right, what locks do you hold? What locks do I hold? And things like that. And then somebody has to build this global waits-for graph and decide how to break the cycle by aborting one of these transactions, allowing the other transactions to proceed. So all the same challenges we had with two-phase locking before are still applicable here, right? How often do we check for deadlocks? Which transaction do we abort? But now we're sending messages over the network, and that makes things more complicated. So any questions about this? Yes. So the question is: in this example, this would be shared-nothing, but what about shared disk? So again, it actually applies to both. It depends on where you store the lock table. Say the lock table's on the shared disk, right? Then in the first case here, transaction T1 gets a lock on A, transaction T2 gets a lock on B. Now they come back and try to get the locks on the other objects, and they recognize that they can't. Who decides who should abort? You could immediately do deadlock prevention and say, all right, I can't get this lock, I can't get this lock, and they both abort, when what you really want is to abort just one of them. But I would say, if you're maintaining some lock state in the actual memory of the execution nodes, it's the same thing: you have to send messages to figure out who's waiting for whom and who can acquire what locks. So the next question is: are partitions only for shared-nothing? So I would say the coordinator stuff versus the decentralized stuff also applies to shared disk. Yeah, yeah, yeah. Well, no, so the state of the locks would be held... okay, so think of those as logical partitions, right? You need to decide who's trying to access what on the shared disk, and therefore you need to figure out who holds what locks, and all the deadlock detection still needs to be done up above. You're right, the data is still all in a central location.
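To make the mechanism concrete, here's a rough sketch in Python of the centralized detection approach described here: each partition ships its local waits-for edges to a detector, which unions them into a global graph and searches for a cycle. All the names and the victim-selection policy are made up for illustration; a real system would exchange these edges as network messages and would pick a victim by age, lock count, or similar.

```python
# Hypothetical sketch of global deadlock detection for distributed 2PL.
# Each partition reports its local waits-for edges (Ti -> Tj means Ti is
# waiting on a lock that Tj holds); a detector unions them into a global
# graph and looks for a cycle. All names here are illustrative.

from collections import defaultdict

def build_global_graph(local_edges_per_node):
    """Union the waits-for edges reported by every partition."""
    graph = defaultdict(set)
    for edges in local_edges_per_node:
        for waiter, holder in edges:
            graph[waiter].add(holder)
    return graph

def find_cycle(graph):
    """Return one cycle as a list of txn ids, or None (iterative DFS)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)
    for start in list(graph):
        if color[start] != WHITE:
            continue
        stack = [(start, iter(graph[start]))]
        color[start] = GRAY
        path = [start]
        while stack:
            node, it = stack[-1]
            advanced = False
            for nxt in it:
                if color[nxt] == GRAY:          # back edge -> cycle found
                    return path[path.index(nxt):]
                if color[nxt] == WHITE:
                    color[nxt] = GRAY
                    path.append(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    advanced = True
                    break
            if not advanced:
                color[node] = BLACK             # fully explored, no cycle here
                path.pop()
                stack.pop()
    return None

# The deadlock from the lecture: on node 1, T2 waits for T1 (lock on A);
# on node 2, T1 waits for T2 (lock on B). Neither node sees a cycle alone.
node1_edges = [("T2", "T1")]
node2_edges = [("T1", "T2")]
g = build_global_graph([node1_edges, node2_edges])
cycle = find_cycle(g)
victim = max(cycle) if cycle else None  # toy victim policy: abort highest id
```

Note that neither node's local edge set contains a cycle on its own; the deadlock is only visible in the union, which is exactly why the nodes have to exchange messages.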
You could use that as the focal point for how you synchronize, but in practice people do it in memory up above. Okay, so we're short on time, so let's see how far we get into this. All right, so everything I've talked about so far, when I say, oh yeah, we're gonna figure out which transactions we want to commit, all of that is just the concurrency control stuff we talked about before. Are there any deadlocks? Are there any read-write conflicts or write-write conflicts? All that still applies in this environment. And then I said, all right, well now the coordinator, or whatever it is that's in charge of deciding whether a transaction commits, once everyone agrees that it's okay to commit, needs to tell the system to go ahead and commit. So I've glossed over or been hand-wavy about that point. So now we need to talk about how we can actually have the system, when the nodes agree that it's safe to commit a transaction, when everyone says yes, actually commit that transaction. And we need to handle the case where nodes may fail during this operation and figure out what should happen. And we need to handle the case where our messages might show up late, or even not at all. So if a node has a long garbage collection pause or some other delay on the network, then that's essentially the same thing as the node being down, so how do we handle that? And then how do we handle the case where a node comes back up and starts getting commit messages for transactions from a couple of minutes or seconds ago? So the thing we're going to use to allow a multi-node or multi-partition transaction to commit atomically across multiple machines is called an atomic commit protocol. This is another good example of where the terminology gets a little fuzzy between how we describe things in database systems versus distributed systems.
So in a database system, we would call this an atomic commit protocol. In distributed systems, we would call this a consensus protocol. The basic idea is that we want to have multiple entities or multiple nodes in our cluster come to an agreement about whether some action or some state change should happen. The thing we're going to require here, though, in a distributed database, is that all the nodes have to agree that we want to commit this transaction. In a consensus protocol like Paxos, you can just have a majority agree. But typically in a distributed database system that does transactions, an OLTP system, you want them all to agree. And so the protocol we're going to use is called two-phase commit. You could still use all these other protocols. So Stonebraker came up with something called three-phase commit in the '80s. It adds extra round trips, it's overly complicated, nobody does it. Paxos and Raft and ZAB from ZooKeeper are examples of distributed-systems consensus protocols that you could apply to solve this problem. But in our environment, we're going to focus on the common case, where we assume nodes aren't going to fail, and we're going to use two-phase commit. All right, so here's an example of two-phase commit working successfully. The application server does a bunch of updates to these three nodes, one, two, three, and then it wants to commit its transaction. So it's going to send the commit request to whatever the base partition is, the home partition where it actually began that transaction. And in two-phase commit terminology, we'll call this the coordinator. And all the other nodes that it modified, where we need to make sure we commit the transaction as well, are called participants. In Paxos terminology, the coordinator would be the proposer and the other nodes would be the acceptors, same idea.
So in the first phase of two-phase commit, you're going to send the prepare message to the different nodes and say, all right, here's this transaction, it says it wants to commit. Are you okay with that? Are you going to allow this to happen? And at this point, each node can do whatever internal integrity constraint checks or security checks it wants to do to see whether that's allowed to happen or not. And then they will vote on whether they're okay with this transaction committing. So they send back an OK message. Then the second phase is the commit phase, where the coordinator sends a message to the participants to say, all right, everyone has agreed this transaction should commit, so now we'll go ahead and commit it. And then when the participants get the commit message, they send back their OKs. And at this point, when you get the final OKs from the second phase, the transaction is considered fully committed and we can send back the acknowledgement to the application server. Now in this case here, the first node, the coordinator, could also be a participant. So it could do its own local check to see whether the transaction is allowed to commit. And maybe it does that first and decides, all right, I can't commit this transaction, so I don't even bother sending out the prepare messages, because I know I'm gonna abort right away. But in general, what happens is that as soon as you get one abort message from a participant, the transaction is considered to be immediately aborted. Let's see how this works. So now I do a commit. Again, in the first phase, we send out the prepare messages. Say node three comes back and says, for whatever reason this transaction can't commit, so I'm gonna abort it. So it sends back the abort message.
And at this point, the coordinator does not need to wait for any other response from any other node, because under two-phase commit, either all the nodes agree to commit a transaction or none of them will commit it. So as soon as it gets back one abort, it doesn't even wait for the rest of the messages from the other guys. It just immediately tells the application that it's aborted, and then sends out the second-phase abort message to everyone, right? And then they send back OK, and at this point the transaction is done. One thing I'm not showing here, that you would have to do if you implement this in a real system, is that you'll end up logging all of the two-phase commit messages to stable storage at each node as you go along. And you use that as a record to figure out: if I crash and come back, and I was in the middle of a two-phase commit operation, what should happen, or where should I pick up where I left off? In practice, though, if one of the nodes goes down, the transaction usually just fails immediately, right? So two-phase commit has been around for a long time, and there are a lot of obvious optimizations you can do to speed things up. The first is early prepare. This is where, if you know you're sending a query request to a partition, and that's the last time you'll ever touch data at that partition or node, then you also piggyback a message that says: oh, by the way, I'm never coming back to you, so tell me what would happen if I sent you a prepare message now. And so they can send back whether it's an OK or an abort along with the query result. The idea here is that once you start the prepare phase early, you can start letting other transactions acquire the locks that you're holding and allow them to start processing, instead of waiting for the second round trip of two-phase commit. Another optimization is early acknowledgement. This one is super, super common.
This is what pretty much everyone does. And basically what happens is, as soon as you get back all the OK messages from the first phase, the prepare phase, you can already tell the application that its transaction committed. Technically it's not fully durable yet, because the commit messages have not been logged, but in practice this is a small window, and most people make that sacrifice. So basically: I send my commit request, I send out my first-phase prepare messages, I get back all the OKs at the coordinator, so now I know everyone agrees that this transaction is gonna commit. So I'll immediately tell the application server that its transaction was successfully committed, and then continue on with the rest of the commit protocol. This one is probably the most common optimization. All right, so as I said, each node has to record all of the two-phase commit messages it receives to stable storage. In the back, yes? The question is: what is the benefit of this? Why do this? Right, so at this point here, you get the commit request, and then at this point everyone says, I'm okay with committing this transaction. There cannot be any other outcome, right? So even if I crash, the transaction is still considered committed. The application server may not know about it, but everything's been logged and is durable on disk, so if I crash, come back, and recover, I'll replay that transaction. So you basically don't need to wait for the second round trip over the network, collecting all the OKs, before you tell the application everything is committed. Technically, by the textbook definition of two-phase commit, that's incorrect, but this is what everyone does. Okay, so now we're gonna deal with failures. What happens if the coordinator crashes? The participants have to figure out what to do.
And this is where we start to deviate from Paxos. In Paxos, if the proposer fails, you just do another round of election to figure out who the new proposer is, and things like that. We essentially need to do that here too, because we could have the case where the participants decide, all right, we actually do want to commit this: the coordinator went down, so maybe we elect a new coordinator and still try to proceed with the protocol and actually commit the transaction. That's all left up to the implementation. It's a special case that Paxos handles natively, but in two-phase commit we'd have to handle it ourselves, and that gets complicated. If one of the participants crashes, then the coordinator just treats that the same as it voting to abort the transaction. So the entire transaction is then considered aborted. The main thing to point out with two-phase commit, though, is that the nodes have to block when they're in the middle of the commit process in order to figure out what to do next. So if one guy goes down, you have to time out and wait until you figure out whether you're actually allowed to commit or not. So we're out of time for today, but I'll just jump ahead and show one slide here. Two-phase commit is actually considered a degenerate case of Paxos. There's a great paper from Leslie Lamport and Jim Gray where they show that two-phase commit is essentially equivalent to Paxos with a single coordinator, or proposer, and it only works if everyone is up. Whereas in Paxos, you can use leases to determine whether someone's allowed to propose a new update, without continually rejecting transactions over and over again. All right, so this was a bit rushed at the end. I can pick up a little bit more of this next class.
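Since the lecture ran short, here's a minimal toy sketch in Python that ties the two phases together. The Participant class and two_phase_commit function are made-up names, in-process method calls stand in for network messages, and a plain list stands in for the durable log; this is a sketch of the protocol's control flow under those assumptions, not a real implementation.

```python
# Toy sketch of a two-phase commit coordinator. Method calls stand in
# for network RPCs; a list stands in for the coordinator's durable log.

class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.state = "active"

    def vote(self):
        # Phase 1: run local integrity/security checks, then vote.
        return "OK" if self.will_commit else "ABORT"

    def finish(self, decision):
        # Phase 2: apply the coordinator's decision and acknowledge.
        self.state = decision
        return "OK"

def two_phase_commit(coordinator_log, participants):
    # Phase 1 (prepare): as soon as one participant votes ABORT, the
    # whole transaction aborts; there is no other possible outcome, so
    # we don't need to wait for the remaining votes.
    decision = "committed"
    for p in participants:
        if p.vote() == "ABORT":
            decision = "aborted"
            break
    coordinator_log.append(decision)  # durable log write; with the
    # early-acknowledgement optimization, you would tell the application
    # the outcome right here, before the second round trip completes.
    # Phase 2 (commit/abort): broadcast the decision, collect the OKs.
    for p in participants:
        p.finish(decision)
    return decision

log = []
ps = [Participant("node1"), Participant("node2"), Participant("node3")]
two_phase_commit(log, ps)   # everyone votes OK -> commits

ps2 = [Participant("node1"), Participant("node2"),
       Participant("node3", will_commit=False)]
two_phase_commit(log, ps2)  # node3 votes ABORT -> whole txn aborts
```

In a real system every participant would also log its own vote to stable storage before replying, so that after a crash it can figure out where in the protocol it was, which is exactly the blocking problem discussed above.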
I'll talk about replication and the CAP theorem at the beginning of next class, and then we'll also focus on how to do distributed OLAP queries. Any questions about this or anything else? Okay guys, awesome. See you on Wednesday.