It's beautiful weather outside today, so hopefully we can get through the content quickly and you can go enjoy it. Similar administrative stuff: homework five will be due on December 2nd, and project four has already been released and will be due on December 5th. Similar to the other projects, I'd suggest starting project four a little early; quite a few people have already used up all their grace days, so it's better to start early and find out about issues sooner. For the upcoming database talk next Monday, we'll have a company called Dremio. They build an analytical engine on top of Apache Arrow, an open-source columnar data format, so they're going to come talk about how they do their query optimization and how they accelerate data analytics with that format. If you're interested, check it out.
In the last class, we talked about the basic concepts of distributed database systems. First, there are a few design choices for the architecture. The single-node database systems we discussed earlier this semester use what's called a shared-everything architecture: all the memory, processors, and disks are on a single node. In a distributed environment there are other choices. The first is shared memory: all the processors access a single unified address space that combines memory chips potentially located on different machines, for example via RDMA. The second and third choices are the more common approaches. Shared disk means all the disks sit behind a single interface: each machine has its own CPUs and memory, but to read data it goes to a central storage service that orchestrates all the data management, whether on one physical machine or several. An example of this is Amazon S3, and this is the common approach in many recent cloud-based database systems. The last one, shared nothing, is also very common, though more so in distributed database systems before 2010: every node has its own memory, CPUs, and disk, but is responsible for only one portion of the data, and a higher-level network layer coordinates data management across the different nodes for different partitions of the data. Then
we talked about the partitioning schemes for the data in a distributed database system: usually people use either hash partitioning or range partitioning, and you can also naively distribute data in round-robin fashion, though that's less common. Lastly, we touched a little on the notion of transaction coordination: you can have either a centralized coordinator or a decentralized coordinator. So today we're going to go into more detail about transactions in a distributed database system.

First, I want to make a distinction between two commonly used types of workloads in the database world. Roughly speaking, there are many different kinds of workloads, but two are very common: one is called online transaction processing, the other online analytical processing. The first, OLTP for short, refers to short-lived read and write operations; they typically don't last long, and they often have repetitive logic. A very typical example is online shopping: when you open your Amazon account, put something in your shopping cart, and buy it, most of those operations or transactions would be classified as online transaction processing. Most of the time you're only touching, say, the 10 or 20 items in your cart: you buy them, deduct them from Amazon's inventory, they go into your order history, and the products get shipped to you. When I say these transactions have repetitive logic, that's because you don't actually go write the SQL queries to deduct items from Amazon's inventory and insert them into your order history yourself. Usually, for this kind of online transaction workload, the core logic of the transaction is baked in, developed by the application, in this case Amazon: they have a template for the different transactions used to buy things on their website. When you want to buy some specific items, the system just replaces the parameters in that transaction template with your specific products and executes the transaction. So there is a template that gets used over and over again.

On the other hand, the other common scenario is online analytical processing. These are typically long-running but read-only queries or transactions, and they can perform complicated analytical logic. They're usually ad hoc, exploratory queries where there may not be a predetermined template or logic. A simple example, taking Amazon again: say the day has passed, there have been lots of transactions, and a data analyst at Amazon comes along and wants to figure out the average sales of a particular product in North America over the last month. That type of query would be considered an analytical query: it typically reads lots of rows, in this case maybe all the sales of that product over the last month, and performs some complex logic to compute the exact value you want to evaluate. And again, these queries are exploratory.
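Going back to the OLTP template idea for a moment, here is a minimal sketch in Python with SQLite. The table names, columns, and product IDs are all made up for illustration; the point is just that the SQL is fixed ahead of time and only the parameters change on each checkout:

```python
import sqlite3

# Hypothetical schema for illustration only (not from the lecture).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory (product_id INTEGER PRIMARY KEY, stock INTEGER);
    CREATE TABLE orders (product_id INTEGER, quantity INTEGER);
    INSERT INTO inventory VALUES (42, 10);
""")

def buy(product_id: int, quantity: int) -> None:
    """The checkout 'template': the SQL never changes, only the parameters."""
    with conn:  # one atomic transaction: commit on success, rollback on error
        conn.execute(
            "UPDATE inventory SET stock = stock - ? WHERE product_id = ?",
            (quantity, product_id),
        )
        conn.execute(
            "INSERT INTO orders (product_id, quantity) VALUES (?, ?)",
            (product_id, quantity),
        )

buy(42, 3)  # a short-lived OLTP transaction touching a handful of rows
print(conn.execute("SELECT stock FROM inventory WHERE product_id = 42").fetchone()[0])
# prints 7
```

Each call to `buy` is exactly the kind of short, repetitive, parameterized transaction that characterizes OLTP.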
When I say exploratory, I mean that oftentimes there's no predefined logic driving a fixed set of queries one by one. You may look at the average sales over the last month, but after you get that value you may have new ideas: you might be curious about the sales from two months ago, do some comparison, and explore other things. So there's often no predefined logic; these queries are often exploratory and long-running, and most of the time they are read-only queries that perform analytics on the existing dataset. Today we're going to focus on the OLTP scenario, with reads and writes potentially distributed on different machines, and next Monday we'll look at the OLAP scenario. Makes sense? Cool.

So, coming back to the example I discussed last class about coordination of distributed transactions, I want to use that as a motivating starting point. Assume here a decentralized scenario, which means any node among the partitions of my distributed database can be responsible for committing a transaction. Say the application server sends a request to begin a transaction, and it lands on one particular node; let's say this node is the primary node responsible for committing this transaction. The transaction initiated by the application server may then send different requests that span different partitions. After that, when the application server sends a commit request, this primary node is responsible for coordinating with the other nodes that participated in the transaction to figure out whether the transaction is safe to commit, and if so, reply to the client and write the local records, or if not, abort the
transaction. So for today's lecture, we're going to focus on this question mark: how does the distributed database determine whether a distributed transaction is safe to commit, and if so, how does the system ensure the transaction is successfully committed on the different partitions and eventually reported back to the client? Unfortunately, for the scope of this lecture, we're not going to talk about questions like how the application server decides which node is the primary node, or what information and logs need to be maintained on the primary node and the other participating partitions, just for the sake of time. But at a high level, for all the concurrency control protocols we discussed earlier for single-node database systems, say timestamp ordering, two-phase locking, or optimistic concurrency control, the ideas and algorithms mostly still apply in a distributed scenario. Of course, you need to tweak and extend them a bit so that you can communicate information like lock states across different machines, but for the most part those concurrency control protocols stay similar, at least at an intuitive level. The thing we did not talk about is the last part: how does a distributed transaction actually commit when there are multiple participating partitions or nodes? In a single-node scenario you just write a commit record in your log, reply to the client, and you're done; in the distributed scenario it's more complicated, because there are multiple nodes involved. Okay, so that's what we're going to
talk about today. Essentially, like I said, we're going to talk about how the database system knows that a transaction is safe to commit, and how the distributed system ensures that a transaction that is safe to commit actually commits successfully, with all the changes persistent and consistent. In particular, we're going to focus on a few questions. What if, while you're trying to commit a transaction, some of the nodes fail? In a distributed scenario, nodes can fail. Or, if a node didn't fail, what if a message is delayed, say there's a network hiccup, and after a while you still haven't received the message? How do you handle that situation? And the last one: if there are multiple nodes participating in a distributed transaction, do I have to wait for all the nodes to successfully write their records to disk and commit the changes before I reply to the client? And if I don't want to wait for all the nodes to finish their operations, how do I still handle the transaction correctly? These are the kinds of questions we need to think about for transaction commits in a distributed system, and we'll go through them one by one in this class.

Before the details, one very important assumption we need to clarify: typically, in a distributed database system, we assume that all the participating nodes are friendly, or well behaved. No node will lie to us, telling us it successfully wrote some record to disk when in actuality it didn't. We assume no such malicious nodes exist, and that's a reasonable assumption, because if you have your own database
system that is managing all the nodes the system has, it's reasonable to assume the system controls each node responsibly. In contrast, the other scenario is called a Byzantine failure scenario: what if I assume the nodes participating in my distributed system can behave maliciously? For example, what if a node tells me it has written a record to disk when it actually didn't? That kind of scenario is called a Byzantine scenario, and the protocols or algorithms to coordinate a distributed system in that setting are called Byzantine fault-tolerant protocols. One well-known example of a Byzantine fault-tolerant protocol is essentially a blockchain: in a public blockchain, you don't know who is managing which node participating in the network, so you need a protocol that is resilient to arbitrary behavior from the nodes. But that's not the focus today. Today we assume the distributed database has control over all the nodes it has, and all the nodes report true information to the system. Makes sense? Okay, nice.

So these are the specific topics we're going to talk about today. First, like I mentioned, we'll talk about commit protocols. Then we'll talk about two concepts I briefly mentioned last class: how do we do replication in a distributed database, and how do we deal with consistency issues, in particular the CAP theorem. Lastly, we'll talk about a concept that is not exactly related to transaction commit, but is something you would
see in practice, called a federated database.

All right, the first topic: how does a transaction commit in a distributed database? There are actually many algorithms for this. The first, and I should say the most common, is called two-phase commit, and it's the one we're going to focus on today. There's also an extension of two-phase commit invented by the founder of Postgres, who is actually my advisor's supervisor, Michael Stonebraker: the three-phase commit algorithm. It turned out to be too complicated, and essentially nobody uses it. There are also other algorithms you've probably seen in your distributed systems class, called Paxos and Raft; they're very similar, with minor differences, and they can also be used to commit transactions in a distributed database. Lastly, there are less-used algorithms: for example, Apache ZooKeeper has an algorithm called ZAB, and there's also the Viewstamped Replication algorithm, but these are less often used in distributed databases. So today we'll focus on these two protocols: two-phase commit, which is arguably the most commonly used algorithm, and Paxos.

Okay, so first an illustration of two-phase commit. It's kind of straightforward; it's already reflected in the name of the protocol: there are two phases. Say the application has finished all the operations it wants to do in the transaction and now wants to commit it, and assume node one is responsible for coordinating this commit. In the terminology of two-phase commit, the node coordinating everything is called the coordinator. In my earlier example shown in class it was called a primary node or something, but it's the same idea; in two-phase commit terminology it's the coordinator, and the other nodes are called participants. When the application server sends a commit request, the coordinator first goes into the first phase, sending a prepare message to all the participants to check whether each participant is ready to commit. If there's no problem, each participant sends an OK response back to the coordinator. The coordinator has to wait for all the participants to send this OK response; only then can it start the second phase, the commit phase, sending a commit request to all the participants. Then, similar to the prepare phase, the coordinator waits for all the participants to respond with an OK message saying they have successfully committed. After the two phases finish, the coordinator can report back to the application that the commit succeeded. That's it: two-phase commit, just as the name suggests, one prepare phase and one commit phase. Any questions? Okay, cool. We'll talk about this in detail later, but in the textbook definition, at every step of two-phase commit, every participant logs which phase it is in and what messages it sent, so that after a crash you can look at the log and recover. It's not really the focus today, but in the textbook algorithm, all the participants log every stage of this two-phase commit.
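To make the message flow concrete, here is a minimal in-memory sketch in Python. The names are invented for illustration; a real implementation exchanges these messages over the network and, as the textbook version requires, writes every step to persistent storage:

```python
# Minimal in-memory sketch of two-phase commit (illustrative only).

class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit     # stand-in for local validation
        self.log = []                    # stand-in for the persistent log

    def prepare(self):
        vote = "OK" if self.can_commit else "ABORT"
        self.log.append(("PREPARE", vote))
        return vote

    def commit(self):
        self.log.append(("COMMIT",))
        return "OK"

    def rollback(self):
        self.log.append(("ABORT",))
        return "OK"

def two_phase_commit(participants):
    # Phase 1 (prepare): the coordinator must hear OK from *every* participant.
    votes = [p.prepare() for p in participants]
    if any(v != "OK" for v in votes):
        for p in participants:           # tell everyone to roll back
            p.rollback()
        return "ABORTED"
    # Phase 2 (commit): again wait for every acknowledgement.
    for p in participants:
        p.commit()
    return "COMMITTED"

print(two_phase_commit([Participant("node2"), Participant("node3")]))
# -> COMMITTED
print(two_phase_commit([Participant("node2"), Participant("node3", can_commit=False)]))
# -> ABORTED
```

The second call shows the abort path: one participant votes no during prepare, so the coordinator tells everyone to roll back.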
[A student asks what happens if a node fails after the commit request is sent.] Yes, so essentially the question is: say the coordinator sends the commit request to both nodes and then one of them fails, what happens? We'll get there later in the class, but essentially there are different strategies. In one scenario, you can just announce that the transaction has failed and abort everything; but there are other things you can do to let the transaction continue, and then of course you have to do extra work to deal with the failed node after it comes back. Good question, thanks.

All right, we just finished the scenario where a transaction commits; let's quickly go through an example where a transaction aborts. Similarly, the application server sends a commit request to node one, and the node sends the prepare request to the different participants. Say node three is not ready to commit, for whatever reason; maybe it detects a conflict, some record that doesn't have the correct version (again, unfortunately we don't have time to talk about concurrency control specifically). Node three cannot commit, so it sends back an abort response. In this case the transaction will simply be aborted: node one tells all the other participants, here node two, that the transaction aborted, please roll back all the changes, and after nodes two and three finish their rollbacks, or undos, they send an OK response back to node one. The thing to notice here is that once node one realizes the transaction is going to abort, it can immediately reply to the client that the transaction has aborted. So for
everything else, it only needs to happen afterwards; the client doesn't need to know, and the system guarantees that all the undos happen correctly, similar to how we handle a single-node transaction. All right, cool.

Okay, so this is related to the earlier question a student asked: there are ways to optimize this basic two-phase commit protocol. First, as everyone has noticed, during each phase, at least in the basic protocol, the coordinator has to wait for all the participants to send back their responses, and that waiting can take quite some time, especially when there are stragglers. The first optimization you can apply to two-phase commit is called early prepare voting. The intuition is that instead of sending a separate prepare request to all the participants during the first phase, if the coordinator can somehow tell that this is the last query of the transaction, meaning after this query the transaction is going to either commit or abort, then the coordinator can send the prepare request along with that last query. If a participant receives this combined message and successfully executes the last query, it can immediately validate whether it can commit and directly send back the OK response for the prepare phase alongside the result of the last query. This saves one round trip. Of course, this assumes the coordinator has the ability to know that this is the last query of the transaction, which may not always be true; but for some applications, for example those using stored procedures, it's actually kind of common. In many of the OLTP applications I talked about earlier, you have predefined logic for specific operations: when you do a checkout transaction in online shopping, the logic of that transaction may be predefined in a template. In that case you can know whether a query is the last one of the transaction, and do this early prepare voting.

The second optimization is called early acknowledgement after prepare. The idea is that if all the nodes have already voted that they are going to commit the transaction, then right after the coordinator sends the commit request to the participants, it can already send the client the acknowledgement that the transaction has successfully committed, because at that point the coordinator already knows that every participant is ready to commit; it's just a matter of the participants finishing flushing their records and making all the changes. So the coordinator already knows the transaction will succeed, and it can already tell the client that it has committed. Of course, this comes with additional overhead: if the database system crashes right then, after you come back from the crash you have to check the logs to see which phase each participant of each transaction was in, and whether some participant said it was prepared to commit but didn't actually finish the commit; if so, you have to finish the commit process on that participant. But it gives you the advantage that you don't have to wait for all the participants to finish the second commit phase: you can get back to the client earlier and reduce the commit latency.
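Here is a small, illustrative-only sketch of the early acknowledgement idea; the event names are invented, and a real system would run the second phase asynchronously and rely on its logs to finish any half-done commits after a crash:

```python
# Sketch of "early acknowledgement after prepare": once every participant
# has voted OK, the outcome is decided, so the coordinator replies to the
# client before the commit phase finishes. Illustrative only.

events = []  # records the order in which things happen

def prepare(node):           # every participant votes OK in this example
    events.append(f"{node}: prepared")
    return "OK"

def commit(node):
    events.append(f"{node}: committed")

def commit_with_early_ack(nodes):
    votes = [prepare(n) for n in nodes]
    if all(v == "OK" for v in votes):
        events.append("client: acknowledged")  # one round trip earlier
        for n in nodes:                        # commit phase runs off the
            commit(n)                          # client's critical path
        return "COMMITTED"
    return "ABORTED"

commit_with_early_ack(["node2", "node3"])
print(events)
```

The printed event list shows the client acknowledgement happening before the participants' commits finish, which is the whole point of the optimization.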
To quickly walk through the early acknowledgement example: again, the application server sends the commit request, and the coordinator first sends the prepare request to the participants. All the participants say they're OK to commit, so it's just a matter of actually finishing the commit. Now, at the same time that node one, the coordinator, sends the commit request, it can already tell the application server that the transaction has succeeded. Then, after some time, all the nodes finish the commit and send their OK responses, and the coordinator knows that all the operations of the transaction have finished. A simple illustration.

Okay, a few additional questions. As I mentioned, in the textbook definition each node, especially each participant, is required to record the outcome of each phase in a log record on persistent storage. In actuality, most people omit this step, because it turns out to be not that useful, but the standard textbook algorithm requires it. The next question: what happens if the coordinator crashes during the two-phase commit protocol? Essentially, the participants have to decide what to do. The most common way for a participant to deal with a coordinator crash is to set a timeout: if the participant has sent its response back to the coordinator but hasn't heard back within the timeout, it just aborts the transaction. That's the most common approach. Alternatively, the participants can decide to do a
leader election and promote a new coordinator, but if you choose to do that, you might as well use a leader-election-based algorithm to begin with, like Paxos or Raft, which I'll get to on the very next slide. So most commonly, the participants just time out and abort.

Next: what if a participant crashes? In that case it's the coordinator that is waiting for messages, to see whether it gets the OK responses, and it depends on the phase. If the coordinator is still in the prepare phase, the first phase of the protocol, and it sent the prepare message but doesn't hear back an OK response from one of the participant nodes, then once the timeout is reached the transaction has to abort, because there could be a node that is not ready to commit. That said, if you use the early acknowledgement optimization we just talked about, and the first phase finished successfully but in the second phase the coordinator didn't hear back from a participant, then the transaction can still continue, because at that point you've already told the client that the transaction finished. The only thing you have to do is, after the crashed node comes back from recovery, check all the logs and status on both the coordinator and the participants, see whether any transaction promised it was ready to commit but didn't finish the commit, and redo the operations accordingly. Makes sense? Cool. So that's all about two-phase commit; it's fairly straightforward.

Next, let's talk a little about Paxos. As I sort of mentioned, one major potential performance bottleneck in two-phase commit is that, at least during the prepare phase, the coordinator has to wait for all the participants to send back the OK response before it can proceed. This is a blocking operation, and if there is even one straggler among the participants, the prepare phase can take a very long time, because you need to wait for everyone to come back. A potential alternative that alleviates this issue is the algorithm called Paxos, or Raft, which is very similar. Essentially, this protocol allows you to proceed when only a majority of the participants (they have a different name in Paxos, but in two-phase commit terminology they'd be participants) have told you they're ready to commit; then you can already proceed to the next step and potentially commit the transaction. So if there are one or two stragglers among the participants, it doesn't matter; you can still proceed. Of course, this comes with other overheads and challenges you have to deal with to ensure you proceed correctly, and we'll get into a bit more detail, but essentially it enables non-blocking operation, especially when there are stragglers.

The very first paper that proposed the Paxos algorithm is actually called "The Part-Time Parliament." It's a paper written by the famous Leslie Lamport; many of you who have taken a distributed systems course may have heard of this paper, and especially of Leslie Lamport. This paper is actually not a conventional computer science paper; it's
sort of a fictional story, in which Lamport describes a fictitious protocol discovered on an ancient Greek island, where a tribe is trying to vote to select their leaders and so on, and the members of the tribe may reside on different islands in the ocean, so they need a distributed consensus protocol. The author embedded the actual protocol in this fictional story, and it turned out to be very difficult to understand and very confusing. If you're curious, you can check it out; it's kind of fun, but the original paper is not easy to understand.

Okay, so let's describe the Paxos protocol. Again, similar to before, when the application server wants to commit, it sends the commit request to one of the nodes. In Paxos terminology, this node is called the proposer; it's essentially the same as the coordinator in two-phase commit, just a different name. And all the nodes participating in the transaction, the participants in two-phase commit terminology, are called acceptors. In this case, say we have three acceptors, because we need a majority vote. Right after the proposer receives the transaction commit request, it sends a propose request to all the acceptors. Now, similar to the two-phase commit scenario, suppose one of the acceptors is down: there's a failure on that node, you've lost the connection to it, and you don't get any response. But if the other two nodes send back the response, called an agree message, to the proposer, then a majority of the nodes in my cluster, in my distributed system, already have this change,
so I can proceed with my operation. So what's the next step? The next step is the commit step, again similar to two-phase commit: if a majority of the nodes among my acceptors send back that they have successfully committed the changes, then the proposer can send the success message back to the application server, and this round of Paxos is done. Make sense? Any questions? Okay, let me illustrate it in a slightly different way. One thing I need to note here: because Paxos allows the system to commit a transaction without the agreement of all the nodes, requiring only a majority of the participants, or acceptors, to agree on the change before continuing, we actually need to enforce additional rules, additional restrictions, on the commitment of transactions, so that all these transactions at the end of the day resolve to a correct and consistent state of the entire system. Specifically, we're going to use something similar to the timestamp-ordering concurrency control algorithm we talked about earlier. It's not the same thing, but it uses a similar notion: we assign a timestamp to each transaction, and then we limit the interactions between the different transactions in the Paxos protocol, so that even though we only need a majority of the nodes to commit, we can still resolve the state of the database correctly. So look at this example. I have two transactions, which in this case means two proposers in Paxos terminology. The first one, transaction n, wants to send a propose to these three acceptors. Again, this timestamp could be a physical timestamp or a logical timestamp, a counter that just keeps increasing; it doesn't really matter, it's orthogonal to our discussion here. Now say the acceptors have sent their agreement with that proposal back to the proposer, and then another transaction, transaction n + 1, comes along and also wants to propose. Assume the acceptors have not decided whether to accept this new proposal yet; maybe they're doing some calculation to validate whether the transaction has a conflict or not. Now, if the original proposer, the one with transaction n, sends a commit message to the acceptors, the restriction in Paxos is this: if an acceptor has seen a propose from a transaction with a higher timestamp, for example transaction n + 1 here, then transaction n cannot commit. Essentially Paxos uses the timestamp to enforce a serial order over the transactions and guarantee a correct state of the system. In this case, because all the acceptors have received a proposal from a more recent transaction, they have to reject the commit from the first proposer, and that transaction has to abort. Then, after some time, say the acceptors figure out they can agree with the commit proposal of proposer n + 1 on the right: they send the agree message back, that proposer can finally send the commit message, and the transaction commits with the accept messages from the acceptors. Of course, the actual Paxos algorithm has many more implementation details.
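The two rules just described, majority agreement plus the reject-if-a-higher-timestamp-proposal-was-seen restriction, can be sketched in a few lines. This is a toy illustration with made-up names (`Acceptor`, `try_commit`), not the full Paxos algorithm:

```python
# Toy sketch of the majority rule and the higher-timestamp restriction.
# All names here are illustrative, not from any real implementation.

class Acceptor:
    def __init__(self):
        self.highest_seen = -1   # highest proposal timestamp seen so far

    def receive_propose(self, n):
        """Agree only if this proposal is newer than anything seen."""
        if n > self.highest_seen:
            self.highest_seen = n
            return True          # the "agree" message
        return False

    def receive_commit(self, n):
        """Reject the commit if a higher-timestamp proposal arrived since."""
        return n >= self.highest_seen

def try_commit(proposal_n, acceptors):
    majority = len(acceptors) // 2 + 1
    agreed = [a for a in acceptors if a.receive_propose(proposal_n)]
    if len(agreed) < majority:
        return False             # could not even get a majority of agrees
    accepted = sum(a.receive_commit(proposal_n) for a in agreed)
    return accepted >= majority

acceptors = [Acceptor() for _ in range(3)]
assert try_commit(1, acceptors)                 # transaction n commits

# Transaction n=2 proposes, then a newer n=3 proposes before n=2 commits:
for a in acceptors:
    a.receive_propose(2)
for a in acceptors:
    a.receive_propose(3)                        # all acceptors now promised 3
assert all(not a.receive_commit(2) for a in acceptors)  # n=2 must abort
```

Note that two of the three acceptors agreeing is already enough here, which is exactly why one dead acceptor does not block the commit.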
There are more specific steps to ensure correctness, but here I just want to illustrate at a high level what Paxos can achieve and what the potential trade-offs are. That's all we can cover in this class. Make sense? Okay, so let's continue. Obviously, if we have this restriction that we keep a serial order over the proposals of these different transactions, then we'll potentially get a lot of aborts of transactions that don't really need to abort. One potential way to address this issue is to only allow one node to propose. Essentially, that's a bit similar to the centralized transaction coordinator we talked about last class: if you only allow one proposer at a time, then no other proposer is going to compete with your proposal. In fact, you can essentially skip the proposal phase as well, because you're the only one proposing; you only need to check whether the acceptors can commit or not. Of course, the straightforward question is: which node should the system decide to be the proposer? In Paxos terminology, this single central proposer is called the leader. So which node should the system pick as the leader? In practice, most systems will actually rotate the leader among different nodes. In the famous Spanner system from Google, the duration of each leadership, which their paper calls a lease, is 10 seconds: every 10 seconds the system has a sole leader that is responsible for proposing all the commit requests, and after those 10 seconds have
finished, the system needs to elect a new leader. And how does it elect a new leader? Essentially using the same Paxos algorithm. It's kind of similar to a distributed transaction: all the participants, the acceptors, need to come together and reach an agreement on which node should be the single leader, and everyone, or at least a majority of them, needs to accept it. So this leader election is essentially one run of the Paxos protocol. And what if the leader fails? Same thing: if the leader fails during its lease, during this 10-second period, all the participants run a leader election as well and select a new leader. Makes sense? That's essentially what happens in practice. Okay, so here let me emphasize the key difference between two-phase commit and Paxos. In two-phase commit, we actually have to block if the coordinator fails after the prepare messages have already been sent: every participant either has to set a timeout and abort the transaction if it doesn't get a response from the coordinator within a reasonable time, or it has to wait forever, until the coordinator recovers. That can be a real problem. Paxos doesn't need to do that. If a leader fails, you can still set a timeout, but when the timeout expires you don't have to abort the transaction or wait forever for the leader to come back; you can just run another round of leader election, and as long as a majority of the acceptors are still alive, you get a new leader and continue from there. Makes sense?
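The lease rotation and failover behavior just described can be sketched like this. The election itself is simplified to "a majority of live nodes pick the lowest-id live node," which is just a deterministic stand-in for the real Paxos round; `LEASE_SECONDS`, `elect_leader`, and `run` are made-up names for illustration:

```python
# Sketch of Spanner-style lease rotation: re-elect a leader when the
# lease expires or the current leader dies, but only with a quorum.

LEASE_SECONDS = 10

def elect_leader(live_nodes, total_nodes):
    """Return a new leader only if a majority of the cluster is alive."""
    if len(live_nodes) <= total_nodes // 2:
        return None                      # no quorum: cannot elect
    return min(live_nodes)               # deterministic stand-in for Paxos

def run(now, lease_expiry, leader, live_nodes, total_nodes):
    """Re-elect when the lease has expired or the leader has died."""
    if now >= lease_expiry or leader not in live_nodes:
        leader = elect_leader(live_nodes, total_nodes)
        lease_expiry = now + LEASE_SECONDS
    return leader, lease_expiry

# Node 0 holds the lease; at t=12 the lease granted at t=2 has expired.
leader, expiry = run(now=12, lease_expiry=10, leader=0,
                     live_nodes={0, 1, 2}, total_nodes=3)
assert leader == 0 and expiry == 22      # re-elected for another 10 s

# Node 0 then fails mid-lease: the survivors still form a majority.
leader, expiry = run(now=15, lease_expiry=22, leader=0,
                     live_nodes={1, 2}, total_nodes=3)
assert leader == 1                       # a new leader takes over
```

The quorum check is the part that mirrors the lecture's point: unlike two-phase commit, nobody waits forever for a dead coordinator, as long as a majority survives.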
One more comment I want to make: in practical distributed database systems, as far as I know, most systems actually use two-phase commit, just because it's simpler, and in the general case you're not going to assume that the system fails that often. If the participants (in Paxos you'd call them acceptors) don't fail often, and there's no straggler that's always much slower than the others, then two-phase commit is just the much simpler protocol. But if for whatever reason your operating environment has frequent failures, then you may consider a protocol like Paxos. Okay, any questions about two-phase commit and Paxos before we move on to replication? Okay, nice. So, replication. Most people who use a distributed database system don't only want the ability to store more data than fits on a single node; most users also want the other property of a distributed system, which is having replicas of the data. For example, if you deploy a distributed database system across different data centers, say in different US states or even on different continents, then if one data center goes down, you still have your data available in a different data center. You can make sure your application is still serving your customers: no downtime for your website, no impact on your business. That's a very, very important reason why people use distributed database systems in practice. So now we're going to talk about a few specific design decisions related to this replication functionality. Okay, the first is the configuration strategy you use to replicate your data. At a high level, there are two strategies for making data copies in
a distributed database system. The first is called the primary-replica strategy. I actually don't know what it's called in the textbook, but previously, and often in material you find online, it's called the master-slave strategy; because that terminology has other meanings, other interpretations, nowadays we generally prefer the terms primary and replica. In this first configuration strategy, all the updates go through a single designated primary in your distributed database system. The replicas are only responsible for receiving the updates from the primary, and all the writes go to that single node. After the primary finishes making the changes according to the write requests of your queries, it propagates all the updates to the replicas. Note that this doesn't need an atomic commit protocol either; it's different from the two-phase commit protocol we talked about earlier. Indeed, when we think about replication, especially in a primary-replica configuration, the primary is the single source of truth for the data: whenever there's a crash and you come back, you always look at the primary node to see what the content of the data is, and that's the true source of the data. So you don't need to run the two-phase commit protocol when you propagate data from the primary to the replicas. Then, depending on your consistency level, the system could also allow read-only transactions to execute on the replicas to increase the throughput of the system. But of course, as you can imagine, if there's a read-only transaction on a replica and the changes from the primary have not propagated to the replica yet, then
there will be a small time window where it may read inconsistent data. Depending on the ACID level, or the consistency level, your application needs, you may or may not want that to happen. So it depends, but it's an optional thing you can do: execute read-only transactions on the replicas to increase the throughput of the system. And lastly, if the primary goes down, you essentially run a leader election among all the replicas to elect a new primary, using either Paxos or Raft. Makes sense? Okay. Then, on the contrary, the alternative strategy is called multi-primary: essentially, every replica node of the data is a primary as well. There's no difference between a replica and a primary, and writes can go to any node; any replica serves as a primary. Of course, when a transaction wants to commit, you have to resolve the potential conflicts between the different replicas as well, and as you can imagine, there's additional overhead relative to primary-replica. So again, the primary-replica approach is the more common approach, as far as I know. In fact, in actuality, most users of distributed database systems just use this primary-replica setup without even partitioning the data. Of course, if you work for a Google or a Facebook, they have lots and lots of data, and you actually need to partition it across different machines and manage the partitions collectively. But in practice, say more than 90 percent of the users of these systems don't have so much data that it can't fit on a single machine.
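The primary-replica configuration described above can be sketched in a few lines. This is purely illustrative (the `Cluster` and `Node` names are made up, not any real system's API): writes only ever touch the primary, which then propagates them, and reads may optionally be served by a replica:

```python
# Sketch of primary-replica routing: writes go to the single primary,
# the primary propagates to the replicas, reads can hit any node.

class Node:
    def __init__(self):
        self.data = {}

class Cluster:
    def __init__(self, n_replicas):
        self.primary = Node()
        self.replicas = [Node() for _ in range(n_replicas)]

    def write(self, key, value):
        # Writes are only ever applied at the primary, the single
        # source of truth...
        self.primary.data[key] = value
        # ...and then propagated to every replica (no 2PC needed).
        for r in self.replicas:
            r.data[key] = value

    def read(self, key, from_replica=None):
        node = self.primary if from_replica is None else self.replicas[from_replica]
        return node.data.get(key)

c = Cluster(n_replicas=2)
c.write("a", 2)
assert c.read("a") == 2                  # read at the primary
assert c.read("a", from_replica=1) == 2  # same value at a replica
```

In multi-primary, by contrast, `write` could be invoked on any node, and the hard part becomes reconciling conflicting writes between copies.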
And in that scenario, the biggest reason they use a distributed database system is actually availability: you can put your data on different machines, so that if one machine in one data center goes down, you still have machines in another data center running; you can elect a new primary, it doesn't impact your normal business operation, and you can still serve your customers. So in actuality, in most cases as far as I know, this alone is sufficient for users' requirements, for their need to use a distributed database system. Of course, if you work for a giant company, then in many cases you would also need to actually partition the data. All right, so to summarize the primary-replica scenario: you obviously have one primary and then potentially one, two, or many replicas, and all the writes (the reads too, possibly, but especially all the writes) go to the primary. After the primary applies all the writes, it propagates those changes to the replicas, and the replicas only execute reads. That's the primary-replica setup. In the multi-primary setup, or configuration, all the nodes are both replica and primary, so reads and writes can go to any of the nodes, and when there's a conflict you have to coordinate between the different copies, the replicas, which can add overhead. All right. Okay, so another related concept for replication in distributed database systems is called k-safety. Essentially, k-safety, or the number k, just represents the minimum number of data copies that have to exist in the distributed database system. So let's say k is three, which is actually the most common value in practice; then every single record in your database system needs to
exist on at least three machines; it could be one primary and two replicas, for example. And if, for some reason, some machines fail and the number of copies of any record in your distributed database goes below three, then you either have to immediately stop the database system and report an error, and let somebody fix it, or, in some systems, you can automatically spin up a new machine and create new copies of the data, to make sure every record again has at least k copies, and then you can continue. So that's kind of an availability guarantee the system provides to the users, and it's called k-safety. It's also a common concept that people specify to the system when they use distributed databases. So now, the next topic: how does the system propagate all the changes? Again, I mentioned this a little bit last class when we talked about consistent hashing: we talked about this ring of hash values, where you locate the hash value of a key on the ring and then look ahead to see which machines you need to put the record on, and we said you can just look ahead a few positions if you want the data to be replicated on a few machines. So here, the formal definition of the propagation scheme is this: when you replicate the data, do you have to wait for all the copies of the data to be successfully written, to have the latest update of this record, before you get back to the client and say, hey, I have successfully written this data? Say you require that all the replicas of this data have successfully applied the changes and written them to disk before you acknowledge to the client.
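As a small aside, the k-safety rule described a moment ago can be sketched as a simple check plus a repair step. The names (`check_k_safety`, `repair`) and the machine/record identifiers are made up for illustration:

```python
# Sketch of k-safety: every record must live on at least k machines;
# when a failure drops a record below k, either halt or re-replicate.

def check_k_safety(placement, k):
    """placement maps record -> set of machines currently holding it.
    Returns the records that violate k-safety."""
    return {rec for rec, machines in placement.items() if len(machines) < k}

def repair(placement, k, spare_machines):
    """Re-replicate under-copied records onto spare machines."""
    for rec in check_k_safety(placement, k):
        needed = k - len(placement[rec])
        for m in list(spare_machines)[:needed]:
            placement[rec].add(m)
    return placement

placement = {"r1": {"m1", "m2", "m3"}, "r2": {"m1", "m2"}}  # m3 lost r2's copy
assert check_k_safety(placement, k=3) == {"r2"}
repair(placement, k=3, spare_machines=["m4"])
assert check_k_safety(placement, k=3) == set()   # back to 3 copies everywhere
```

The alternative to `repair`, as mentioned above, is to simply stop and report an error until an operator intervenes.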
If you tell the client that the record has been successfully modified only after every replica has applied the change, then that is called strong consistency: whenever a new transaction comes along, no matter which replica, which node, it reads the data from, it's going to read the same copy; it's always consistent. That's strong consistency, and in the terminology of replication propagation schemes, it's called synchronous replication; we'll see why it's called synchronous in the examples later. On the other hand, if we say we actually don't need the change to be propagated to every single replica before we go back to the client and tell it that we have successfully made and committed these changes, then that is called eventual consistency. Obviously it's a weaker consistency level than strong consistency: if a new transaction comes along and reads the value of the data on a replica that has not received the recent changes yet, then that value could be outdated, which is not very consistent; hence, eventual consistency. In the terminology of propagation schemes in distributed database systems, it's called asynchronous replication. All right, so let's go into a bit more detail, first with synchronous replication. What's required here is that the primary is responsible for sending all its updates to the replicas and then waiting for them to fully apply all the updates; in most cases, that just means the log records are flushed to disk. Only then can it go back to the application and say that the change has successfully been applied. So take this example: the first node is the primary (I didn't write it here, but say the first node is the primary), and then the application
wants to commit. Before the primary commits, before it sends the commit acknowledgement back to the application, it needs to send the changes to the replica, and in most cases it needs to wait for the replica to flush the record to disk. If the replica uses write-ahead logging with ARIES, the primary needs to wait until the replica has successfully written these changes to the log records on disk; only then can the replica send the acknowledgement back to the primary, and the primary can tell the application that the change has been successfully committed. And then, as you can tell, any read on the replica will be consistent: it's exactly the same value you would read on the primary. In the other scenario, the asynchronous case, the primary can actually return to the application immediately once it sends the data to the replica; it doesn't need to wait for the replica to flush the record. Similar setup: if there's a commit request from the client or the application server, then while the primary sends the flush request to the replica, it can immediately send the acknowledgement back, and the replica can flush at whatever time it determines is proper. Eventually the change propagates to the replica, assuming the replica didn't fail; that's why it's called eventual consistency. All right. Okay, so the next strategy, and there are actually quite a few strategies related to replication, is called propagation timing. Oh, by the way, one comment I need to make about this propagation scheme: unlike some other design decisions, where oftentimes a system just chooses one strategy to apply, typically the system will actually support both.
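The difference between the two propagation schemes above can be sketched as follows. The `Replica` class and `commit` function are illustrative stand-ins: synchronous means the primary waits for the replica's durable flush before acknowledging, asynchronous means it acknowledges right away and the flush happens later:

```python
# Sketch of synchronous vs. asynchronous replication.

class Replica:
    def __init__(self):
        self.durable = {}       # what has been flushed to the replica's log
        self.pending = {}       # received but not yet flushed

    def receive(self, key, value):
        self.pending[key] = value

    def flush(self):
        self.durable.update(self.pending)
        self.pending.clear()

def commit(key, value, replica, synchronous):
    replica.receive(key, value)
    if synchronous:
        replica.flush()         # wait for the replica's log flush...
    return "committed"          # ...before acknowledging the client

r = Replica()
commit("a", 2, r, synchronous=True)
assert r.durable["a"] == 2      # strongly consistent: visible at the replica

r2 = Replica()
commit("b", 8, r2, synchronous=False)
assert "b" not in r2.durable    # a read here could still see stale data
r2.flush()
assert r2.durable["b"] == 8     # eventual consistency: it gets there later
```

Notice that the only difference is where the `flush` call sits relative to the acknowledgement, which is why supporting both as a configuration option is cheap.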
A distributed database system will actually expose this as an option, because it's not that difficult to support both: in your algorithm, you just decide whether you're waiting or not. So the system supports both and exposes the option to the application; depending on the requirements and properties of the application, the developers can decide whether they need the synchronous or the asynchronous replication strategy to handle their logic. All right, that was an additional comment. Okay, so the next strategy we need to decide on when we talk about replication is: when is the time that you perform this propagation of the log records? It's also fairly straightforward. At a high level, there are just two different approaches to handle this. The first is called continuous: essentially, every time a transaction applies some change on the primary with a query, the primary can decide to immediately send this change to the replica, no matter whether the transaction will commit or not, and the replica just keeps applying those changes incrementally. When the transaction finally commits or aborts, the primary just sends a commit or abort message to the replica, and the replica either finally commits the changes or rolls back, undoes, all the changes from earlier. The obvious advantage of this approach is that the replica is continuously applying changes; it's not waiting for all the updates to arrive at the end and then doing one huge write, which could potentially cost a lot of time. The disadvantage is that if the transaction aborts, then all the changes applied earlier are actually wasted, because you have to undo everything. On the immediate contrary, the
other approach people use in this case is called on-commit propagation. In this scenario, the primary actually doesn't send any log messages to the replica until the transaction is either going to commit or abort. If the transaction decides to commit, the primary finally batches up all the changes, all the log records of this transaction, along with the commit message, and ships them to the replica, for the replica to apply them all together. It's a sequential write of the log records, but this batch can potentially be large, and it also needs to fit into memory; otherwise it's difficult to send to the replica. The advantage of this on-commit approach, which is also kind of obvious, is that you don't need to send anything if the transaction aborts: you never sent any updates to the replica before the abort, so once the transaction aborts, you only need to roll back the changes on the primary, and nothing on the replica is affected. So there's just a trade-off between the two approaches. Make sense? Okay, so the last strategy we need to talk about here, and most of the time people use the second one, is whether you use an active-active or an active-passive strategy to apply changes. This is actually different from the primary-replica versus multi-primary distinction we talked about earlier. In the active-active strategy, instead of having one specific node make all the changes from the transactions, from all the queries, and propagate those changes later, what you do is this: assuming the data has three copies on three replicas, you actually send a copy of the query to each of these replicas.
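Pausing for a second on the propagation-timing choice: the continuous versus on-commit trade-off just described can be sketched like this. The function names and the "log record" strings are made-up stand-ins, not a real log format:

```python
# Sketch of the two propagation timings: "continuous" streams each log
# record as produced; "on commit" batches and ships only at commit.

def continuous(records, outcome, network):
    """Ship each change immediately; on abort the replica must undo."""
    replica_log = []
    for rec in records:
        network.append(rec)      # one message per change
        replica_log.append(rec)  # replica applies incrementally
    if outcome == "abort":
        replica_log.clear()      # wasted work: roll everything back
    return replica_log

def on_commit(records, outcome, network):
    """Ship nothing until the outcome is known."""
    if outcome == "commit":
        network.append(records)  # one (potentially large) batch
        return list(records)
    return []                    # abort: the replica never saw anything

net1, net2 = [], []
assert continuous(["w1", "w2"], "abort", net1) == []
assert len(net1) == 2            # messages were sent, then wasted

assert on_commit(["w1", "w2"], "abort", net2) == []
assert len(net2) == 0            # nothing ever crossed the network
```

The aborting case makes the trade-off concrete: continuous propagation paid for two network messages and an undo, while on-commit paid nothing but would have had to ship one large batch at commit time.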
So each replica gets an exact copy of the query, and you actually execute that query on each of the replicas, instead of making changes on one replica and then shipping the changes over. The advantage is that you potentially send less data. But the challenge is: what if there are other transactions executing on these replicas, with interleavings between different transactions; how do you resolve that? So this is actually very infrequently used. Occasionally, I think there are one or two systems that use it, but it's very rare; still, it's one choice you have. The other strategy, active-passive, is more similar to the examples we talked about earlier: if you have one primary and a few replicas, you only execute the query from the transaction on the primary, and then you just look at what the changes are, in other words the log records generated by that transaction or query, and you only send those changes, after the query execution, to the replicas and apply them there. Then you don't have to deal with the scenario where you execute the same queries on different replicas and other things interleave with them and you have to resolve the conflicts, and so on. Make sense? So most systems use the active-passive approach; it's just much simpler. Okay, so in the next 15 or so minutes, we'll talk about the CAP theorem as well as federated databases. We've talked about lots of strategies, lots of properties of distributed databases, but there's one theorem that summarizes very succinctly what kinds of properties a distributed database system can provide and what it cannot provide, called the CAP theorem. It was actually invented by a professor from UC Berkeley called Eric Brewer; I think he was also one of the very, very first engineers at Google, but later on he taught at UC Berkeley.
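Before the CAP discussion, one last sketch for the replication strategies: the active-active versus active-passive contrast above. The "query" here is just a Python function, and both helpers are illustrative, not any real system's mechanism:

```python
# Sketch: active-active ships the query itself to every replica and each
# executes it; active-passive executes once on the primary and ships
# only the resulting changes (the "log records").

def active_active(query, replicas):
    # Every replica runs the same query. If concurrent transactions
    # interleave differently on different replicas, copies can diverge,
    # which is the hard problem mentioned above.
    for state in replicas:
        query(state)
    return replicas

def active_passive(query, primary, replicas):
    # Execute once on the primary, then ship the resulting values
    # and apply them on each replica.
    query(primary)
    for state in replicas:
        state.update(primary)
    return replicas

inc_a = lambda db: db.__setitem__("a", db.get("a", 0) + 1)

replicas = [{}, {}, {}]
active_active(inc_a, replicas)
assert all(db == {"a": 1} for db in replicas)

primary, replicas = {}, [{}, {}]
active_passive(inc_a, primary, replicas)
assert primary == {"a": 1} and all(db == {"a": 1} for db in replicas)
```

With a single query both end in the same state; the difference only bites once interleaved transactions run, which is exactly why active-passive is the simpler and more common choice.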
Essentially, he came up with this theorem, proven formally, that in a distributed database system you can't really have all three of these properties at the same time, namely: consistent, always available, and network-partition tolerant. We'll go into the details of these properties in a moment. Essentially, at best, your distributed database can pick two of these properties and provide that functionality to the users. And when I say pick two, it's not that every two-way combination of these three properties is actually possible: not only can you never satisfy all three properties together, there's also one specific combination of only two properties, I think it's consistency together with always-available, that isn't really achievable either. But you can certainly never achieve all three properties at the same time. Okay, so in these terms: C represents consistency, A represents availability, and P represents partition tolerance. Consistency here is similar to the linearizability concept; in a database, you want to guarantee that, at the end, the state of the database is equivalent to some serial-order execution of a schedule of the transactions: you want the state of the database to be consistent, with all the records in a correct state. Availability means that all the nodes that are up should actually be able to accept requests. In other words, as long as a node is alive, it should be available to provide service; it should be able to answer requests and get back to the clients. That's what availability means. And then, lastly, the
network-partition tolerance property means that the system is still able to operate even when there's a network failure and some nodes in our distributed system cannot communicate with some other nodes. If the system can still operate in that case, we say the system is network-partition tolerant. And what's impossible is the overlap of all three: it's impossible to satisfy all three properties together. Any questions about these definitions of the three properties before we continue? Yes? [Student question] So the question is: what's the difference between linearizability and serializability? Okay, at a high level they are similar; the intuitions behind them are very similar, and if you just want to understand the high-level concept of the CAP theorem, you can view them as the same. But specifically, linearizability is a concept developed in the distributed systems world, and serializability is a concept developed in the database community, and there are small differences in their definitions. Linearizability means that the schedule of a set of operations, a set of transactions, has a result equivalent to one specific serial order of those transactions; the order is essentially determined. That's linearizability. Serializability, in the database community's definition, means that as long as the execution of this set of transactions is equivalent to some serial order of the schedule, then it's considered serializable; it doesn't need to follow one specific order. So there's a subtle difference, but intuitively they are very similar: essentially, both want to make sure that the system, whether
it's a distributed system or distributed or distributed system or database system it needs to be in a correct state right yes yeah any other questions yes please oh so sorry could you say it again yeah yes so uh I'm not sure why exactly get your question but essentially the I think the question is how tolerant partition tolerant needs to be but essentially it just means that if there's a network partition right uh there if there's a subset of nodes in your system cannot communicate with some other subsets at all right there's a essentially there's a partition there's a wall between two subsets of the node then the system can still operate right either one subset operate the other operate or both operate but the system can still can still work assuming that use assuming there are no other failures right so that just means a network tolerant okay okay yeah okay so here let me give you a some user street example right so here we have um this uh I mean uh two two application servers right sending sending request and assuming that we have a two nodes in a database system right and then there's a network in between and assuming that at the beginning of this example the network is is successful right it's like a good network and everyone can communicate with each other and then I have a primary node and a replica node so say here an application have these send this our first request to set records a equals and then I mean successful finish right and then uh this uh this I mean when the network I mean is like always called um operates normally then this primary node can send these changes to the replica node right and then replica would have this the other the value of the other record as well right and if assuming that we have the we have a synchronous replication scheme strategy then in this case the system would be consistent right because you would immediately propagate the change to the replica and then after that the primary can send the uh this this economic 
acknowledgment back to the application server, saying that the change has been successfully applied. Now if the other application server comes along and tries to read this record, it will immediately see the value two from either node. Essentially, once the primary sees that the transaction has committed, the change should be immediately visible on all the replicas, and that is what we consider a consistent state. We talked about this under synchronous replication earlier. Makes sense? Okay, so now, what does availability mean? Same example: two nodes, a primary and a replica, and two application servers. Assume that one node is now down; say the replica is down. What happens is that for the first application server, when it tries to read record B, it is the same read path as before: it can still go through the primary and get the result back, B equals eight. For the other application server, the replica may originally have been closer, so usually it would just go to the replica. But now the replica is down, and as long as some other node in the system is still alive, the system needs to be able to serve that read as well. In this case the application server just goes to the primary, reads the record, and gets the value back. That satisfies the availability requirement. Lastly, partition tolerance. Again, same example with a primary and a replica, and assume that for some reason there is a network failure: a partition appears, a hard wall between these two nodes, and they cannot communicate with each other anymore. So now if we
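The synchronous replication behavior in this example can be sketched roughly as follows; the `Node` and `Primary` classes here are hypothetical stand-ins for the two database nodes, not a real system:

```python
# Hedged sketch of synchronous replication: the primary pushes each
# change to every replica BEFORE acknowledging the write.
class Node:
    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)

class Primary(Node):
    def __init__(self, replicas):
        super().__init__()
        self.replicas = replicas

    def write(self, key, value):
        self.data[key] = value
        # Synchronous replication: propagate the change to every
        # replica before sending the acknowledgment to the client.
        for replica in self.replicas:
            replica.data[key] = value
        return "ack"

replica = Node()
primary = Primary([replica])
primary.write("a", 2)
# After the ack, a read on ANY node immediately sees the new value,
# which is the consistent behavior described in the example.
assert primary.read("a") == 2
assert replica.read("a") == 2
```

If the replica is down, reads can still be served by the primary, which is the availability half of the example.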
want to keep this database operating, which would satisfy network partition tolerance, we need at least one node to remain running. But in the meantime, because the two nodes cannot talk to each other anymore, no single node knows whether the other node is still operating or not. So if we also want to guarantee availability, we need to make sure that as long as some node is still up, it can answer requests from the clients. In this case, if the replica still wants to satisfy availability and answer requests, but it does not know whether the primary is dead or not, then when it loses the connection to the primary it needs to run a leader election. Since it is the only node on its side of the partition, it will simply elect itself the leader. So now two primaries coexist in the database system if you want to satisfy network partition tolerance and availability at the same time. What happens then? Assume two different applications send different writes, one setting A equal to two and the other setting A equal to three; they update the two primaries respectively, and both send back acknowledgments. If at some point later the network comes back, the two nodes will be in an inconsistent state. Well, the state of the system is already inconsistent; some systems may allow that and resolve it later, but if you want strong consistency, you need to either sacrifice network partition tolerance or sacrifice availability. Make sense? Any questions on this example? Okay, nice. So just a little bit of commentary on the CAP theorem. Of course, different
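The split-brain situation described above can be sketched minimally, with plain dicts standing in for the state of the two nodes:

```python
# Hypothetical split-brain sketch: before the partition the two nodes
# are in sync; during the partition both act as primaries.
primary = {"a": 1}
replica = dict(primary)   # replica starts as an exact copy

# Network partition: replication stops, and the replica elects
# itself leader, so both nodes now accept writes independently.
primary["a"] = 2          # one application writes through node A
replica["a"] = 3          # another application writes through node B

# When the partition heals, the two "primaries" disagree on record a:
# the system is in an inconsistent state that must be resolved somehow.
assert primary["a"] != replica["a"]
```

This is exactly the trade-off the CAP theorem forces: to avoid this divergence, a system must give up either availability or partition tolerance.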
database systems choose different combinations of properties with respect to the CAP theorem. Traditional relational database systems, as well as the so-called NewSQL systems, the in-memory relational database systems, usually choose consistency over the other properties, availability and network partition tolerance. They want to make sure the database is always consistent; if there is a network partition, then, going back to the earlier example, the system may simply fail the replica directly instead of letting the replica elect itself, so that the database can guarantee the consistency property. But some NoSQL systems, for example Cassandra, prefer partition tolerance and availability over consistency. In those systems, if a network partition occurs, every node is allowed to continue, trying to serve its users as much as possible. Back in the earlier example, if the network later resumes and there is an inconsistent state, they will try to resolve the conflicting records afterwards. But note that no matter how the system resolves the records, it was already inconsistent: if some other transactions read those records in between and performed operations based on those values, the inconsistency has already propagated. Still, depending on the requirements of the application, some systems do allow this. Okay, so the last topic, federated databases, in the last three or four minutes. Everything we have talked about so far has been a distributed database system: a centralized architecture that has total control of every single node in the system, with a centralized view from which to make all the decisions and
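As one illustration of how an AP-style system might reconcile divergent records after a partition heals, here is a hedged sketch of timestamp-based last-write-wins resolution; the real reconciliation machinery in systems like Cassandra is considerably more involved, and this only captures the intuition:

```python
# Hypothetical last-write-wins reconciliation: each write carries a
# timestamp, and on conflict the value with the newer timestamp wins.
def resolve(v1, v2):
    """Return the (value, timestamp) pair with the later timestamp."""
    return v1 if v1[1] >= v2[1] else v2

write_on_a = ("2", 100)   # a = 2, written on node A during the partition
write_on_b = ("3", 105)   # a = 3, written slightly later on node B

# After the partition heals, both nodes converge on the newer write.
assert resolve(write_on_a, write_on_b) == ("3", 105)
```

Note that, as the lecture points out, convergence after the fact does not undo any reads of the stale value that happened during the partition.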
determine all the strategies: replication, synchronous versus asynchronous, and so on. But in many practical applications, people need different database system software for different purposes. For example, a bank may run several different database systems, say PostgreSQL, MySQL, or even Amazon Redshift, for very different workloads within the organization. At some point, someone may need to run queries that access records, and results, from several of these databases at once, perhaps to aggregate some information and perform data analytics. One naive approach is to handle this at the application level: the application writes separate queries against the different systems, each system sends back its results separately, and the application combines them afterwards. You can do that, but it would be nicer if a single interface handled all of it for you, hiding the different database systems and their implementations behind one centralized piece of software, so that the application can write queries against that single interface as if it were one giant distributed database. That is essentially the idea of a federated database system. As I described, it is almost like a middleware, or a coordinator, that makes different database systems collaborate so that users see them as a single distributed database. Of course, this is actually pretty difficult, because different database systems may have different data models and different query languages, and if the query
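The middleware idea can be sketched as follows; the `Connector` and `FederatedDB` classes are hypothetical, with in-memory lists standing in for the real backend databases:

```python
# Hedged sketch of a federated database middleware: each backend sits
# behind a connector exposing one query(...) interface; the middleware
# fans a request out to every connector and post-processes the results.
class Connector:
    def __init__(self, rows):
        self.rows = rows   # stand-in for a real backend (MySQL, MongoDB, ...)

    def query(self, predicate):
        return [r for r in self.rows if predicate(r)]

class FederatedDB:
    def __init__(self, connectors):
        self.connectors = connectors

    def total_balance(self, account):
        # Fan out: collect matching rows from every backend,
        # then post-process (here: a simple aggregate).
        rows = []
        for c in self.connectors:
            rows += c.query(lambda r: r["acct"] == account)
        return sum(r["balance"] for r in rows)

mysql_like = Connector([{"acct": "alice", "balance": 100}])
mongo_like = Connector([{"acct": "alice", "balance": 250}])
fdb = FederatedDB([mysql_like, mongo_like])
assert fdb.total_balance("alice") == 350
```

The application only ever talks to `fdb`, as if it were one distributed database, which is exactly the single-interface idea described above.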
is complex, you actually need a centralized optimizer that will optimize the query execution and coordinate it across the different database systems. Often you also need to copy data out of the individual systems into the centralized coordinator and do the processing there. But it is a convenient piece of functionality that makes users' lives easier. Let me give you a simple use case. Say a bank, or whatever organization, has a few backend database systems: MySQL, MongoDB, Redis, and so on. Through this middleware, in other words the federated database, when a user issues a request to the system, the middleware sends the appropriate queries to the separate systems; the communication channels between the middleware and the individual database systems are usually called connectors. After the middleware receives all the responses, it does the post-processing and returns the results to the application server. Again, that is what we call a federated database. One interesting example of this is actually PostgreSQL. Postgres is an ordinary database system, not a middleware, but it provides the functionality to serve as a federated database through an interface called a foreign data wrapper. It is kind of interesting: even though Postgres has its own facilities to store data, process it, and execute queries, it also gives you this interface to hook up what is essentially a virtual, customized table backed by an external database. You can define a table in Postgres using a foreign data wrapper, backed by a different database system, and then, within the Postgres architecture, run queries, with query optimization, across tables living in different databases, and send the user back a single result,
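For concreteness, a typical setup with the built-in `postgres_fdw` wrapper looks roughly like this; the server name, host, credentials, and table names below are all hypothetical placeholders, and `accounts_local` is assumed to be an ordinary local table:

```sql
-- Minimal postgres_fdw sketch (all names are hypothetical).
CREATE EXTENSION postgres_fdw;

-- Describe the remote database the foreign tables will live in.
CREATE SERVER remote_db
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'other-host', dbname 'bank', port '5432');

-- Map the local user to credentials on the remote side.
CREATE USER MAPPING FOR CURRENT_USER
    SERVER remote_db
    OPTIONS (user 'app', password 'secret');

-- A local table definition that is actually backed by a table
-- in the remote database.
CREATE FOREIGN TABLE accounts_remote (
    acct    text,
    balance numeric
) SERVER remote_db
  OPTIONS (schema_name 'public', table_name 'accounts');

-- Queries can now join local and remote tables in one statement,
-- and Postgres optimizes across both.
SELECT l.acct, l.balance + r.balance AS total
FROM   accounts_local l
JOIN   accounts_remote r USING (acct);
```

Postgres ships foreign data wrappers for itself and for flat files, and third-party wrappers exist for many other backends.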
so that is one way to achieve a federated database with Postgres; it is kind of cool. All right, so that is pretty much what we wanted to cover today. To summarize the key assumption behind everything we discussed in this class, distributed concurrency control and distributed transactions: we assume that all the nodes in your database behave friendly, and that we have full control over all of them. If some nodes may behave maliciously, that is a different scenario, and you would use a Byzantine fault tolerant protocol; blockchain systems are one example of this. There is also a famous tool developed by a person called Kyle Kingsbury: the Jepsen project. He has built a bunch of customized workloads that deliberately trigger corner cases, issuing very difficult distributed transactions against your system to test whether it actually satisfies consistency, or whichever of the properties from, for example, the CAP theorem it claims. It started as a side project, but the tool became really good; nowadays it is practically an industry standard, and mature distributed databases use it to test whether their implementations are actually correct, because it is very difficult to implement distributed transactions correctly. So if you want to test your own system, you may want to check out this tool as well. Okay, that is all for today on OLTP distributed databases. In our next class we will talk about analytical queries in distributed database systems. All right, see you next Monday.