Welcome to the 37th lecture in the course Design and Engineering of Computer Systems. In this lecture we continue our discussion from the previous one, where we said that we need replication for fault tolerance, but then the question came up: how do we do this replication in a way that all the replicas stay consistent with each other? So in this lecture we want to understand how we can achieve replication with consistency. Let us get started. We have seen the concept of replicated state machines before: active replication is also called the replicated state machine technique. You have multiple replicas, all of which maintain some application state. Then you get some input, say a request from the user to add an item to or delete an item from a shopping cart. All of the replicas handle the same input and all of them go to the same new application state. So all replicas start with the same initial state, they handle the same inputs from the user in the same order, and therefore they reach the same new state; the same happens with the next input, and the next, and so on. So all the replicas are always in sync with each other. This is what happens in active-active systems. If you are using active-passive replication, of course, you are not running a replicated state machine across the replicas: the passive replicas are not at the same state as the active one all the time. But then you are using some reliable data store to hold your checkpoints, and within that data store, once again, replication is typically done as a replicated state machine. So replicated state machines are the fundamental building block of fault-tolerant systems, whether it is active-active, or active-passive with a replicated data store inside.
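The replicated state machine idea above can be shown with a minimal sketch. The `ShoppingCart` class and the operation format here are purely illustrative, not from any real system; the point is only that deterministic replicas applying the same inputs in the same order reach the same state.

```python
# Minimal sketch of a replicated state machine: all replicas start in the
# same state and apply the same inputs in the same order, so they stay in
# sync. ShoppingCart and the (action, item) format are illustrative.

class ShoppingCart:
    def __init__(self):
        self.items = []

    def apply(self, op):
        """Apply one input (operation) deterministically."""
        action, item = op
        if action == "add":
            self.items.append(item)
        elif action == "delete" and item in self.items:
            self.items.remove(item)

replicas = [ShoppingCart() for _ in range(3)]
inputs = [("add", "book"), ("add", "pen"), ("delete", "book")]

for op in inputs:                  # every replica sees every input, in order
    for r in replicas:
        r.apply(op)

# All replicas reach the same state.
assert all(r.items == ["pen"] for r in replicas)
```

The only requirement on `apply` is that it is deterministic: given the same state and the same input, every replica must compute the same new state.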
This idea is very important, and when you build replicated state machines it is very important to have consistency. But why would the system ever be inconsistent? If you give the same inputs to all the replicas, they should always stay in the same state, so where does the issue of inconsistency come up? It comes up because of failures and faults in the system; life is not perfect. Sometimes, when an input is sent, one of the replicas may miss it: all the other replicas receive the input, but this one replica was down with some temporary failure and missed it. From then on, this replica is in a different state from all the others, and although it keeps handling inputs, it never comes back to the same state. Or sometimes there is reordering in the network: packets get jumbled up at some router, inputs are received in a different order, and because of that a replica may reach a different state. So there are faults in real systems because of which replicated state machines may not stay consistent with each other, and in this lecture we are going to see how you can do replication in a way that guarantees consistency. Before we understand how to guarantee consistency, we have to define what we mean by it. There are many different consistency models, that is, definitions of consistency; some are very strong and some are very weak.
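A tiny simulation makes the "one missed input" failure concrete. This is a toy model with the replica state reduced to a running sum; the node and input values are made up for illustration.

```python
# Sketch of how a single missed input makes a replica diverge permanently.
# Replica 2 is "down" for one input; afterwards it applies everything,
# but its state never matches the others again.

states = [0, 0, 0]                  # three replicas, state = running sum
inputs = [5, 3, 7, 2]

for i, x in enumerate(inputs):
    for r in range(3):
        if r == 2 and i == 1:       # replica 2 misses the second input
            continue
        states[r] += x

assert states[0] == states[1] == 17
assert states[2] == 14              # diverged, and stays diverged
```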
For example, one consistency model is atomic consistency. This is a very strong model; it is what we intuitively mean when we say consistency. It says that all replicas must receive and execute all inputs in exactly the same order. Your inputs could be requests to add or delete items in a shopping cart, or in some other server the billing orders, the shipping orders, or requests to upload videos; whatever kind of request your system handles, all those requests should be executed at all replicas in the same order, and then the replicas will all be consistent. There is also an additional condition: if some operation Y starts after operation X finishes, according to some global clock, then Y should always see the result of X. Suppose operation X adds an item to a shopping cart, and the user then makes a request for operation Y, which views the shopping cart, with Y starting after X has finished. Then Y must see the result of X: the item you added should be visible in your shopping cart. That is all this seemingly complicated definition of atomic consistency is saying, and it is common sense: you do an operation, then you look at the system, and that operation should be reflected in the system. That is what we expect, and that is the definition of atomic consistency. There are also other consistency models; for example, there is something called eventual consistency, which is an example of a weak consistency model.
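The real-time condition in atomic consistency can be written down as a small check. The history format and timestamps below are invented for illustration; the point is only the rule: whenever Y's start time is after X's finish time, Y must observe X's effect.

```python
# Toy check of the atomic-consistency condition: if operation Y starts
# after operation X finishes (per a global clock), Y must see X's effect.
# The history tuples and timestamps are illustrative, not a real trace.

history = [
    # (name, start_time, finish_time, observed_X)
    ("X: add item C", 0, 5, None),
    ("Y: view cart",  7, 9, True),   # starts at t=7, after X finished at t=5
]

x_finish = history[0][2]
y_start, y_observed_x = history[1][1], history[1][3]

if y_start > x_finish:       # Y began strictly after X completed...
    assert y_observed_x      # ...so atomic consistency demands Y sees X
```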
Eventual consistency says that if operation Y starts after operation X finishes, it is not necessary that Y sees the result of X immediately; it should see the result of X eventually. What does that mean? It means the system does not need to show the update right away: showing it after a few seconds or a few minutes, after some delay, is also fine, but eventually it must be shown. And what is "eventually"? You do not know: it could be now, the next second, the next minute, tomorrow, next year. The definition does not specify; it simply says the system should eventually become consistent. That is called eventual consistency, and you can see it is a much weaker guarantee than atomic consistency, which is a much stronger consistency model. Of course these are not the only two definitions; there is an entire spectrum of consistency models, ranging from very weak, through progressively stronger ones, up to very strong. For example, there is causal consistency, which says that operations that impact each other must be executed in the same order everywhere. For example, all operations on the same shopping cart should be executed in the same order, but operations across different shopping carts may be interleaved differently; the system does not have to order them identically across all replicas. So distributed systems theory has defined many different consistency models, your system can choose to implement any of them, and how you replicate will depend on whether you are aiming for strong consistency or are willing to compromise for a weaker one.
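Eventual consistency can be illustrated with two toy replicas: a write lands on one, a read from the other returns the old value, and a later sync (often called anti-entropy) brings them back in line. The replica structure and the explicit sync step are assumptions made for the sketch.

```python
# Toy illustration of eventual consistency: a write reaches only one
# replica, a read from the other replica returns stale data, and a later
# background sync makes the replicas consistent again.

replicas = [{"cart": ["A"]}, {"cart": ["A"]}]

replicas[0]["cart"].append("B")        # write reaches only replica 0

stale_read = replicas[1]["cart"]       # read served by replica 1
assert stale_read == ["A"]             # does not yet see the write

# "Eventually": an anti-entropy sync copies the newer state over.
replicas[1]["cart"] = list(replicas[0]["cart"])
assert replicas[1]["cart"] == ["A", "B"]
```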
Now let us see how to achieve strong consistency; in this lecture we will see examples of both, how you replicate for strong consistency and how you replicate for weak consistency. For strong consistency, the protocols used are usually what are called consensus protocols. What does the word consensus mean in English? Agreement, being fully in agreement with each other. Consensus protocols exchange messages among the replicas in such a way that all the replicas agree on the state of the system and are strongly consistent with each other. A very popular consensus protocol, widely used in today's systems, is the Raft protocol. What Raft does is let multiple replicas agree on a log of entries: every replica has a log, and the same log is replicated consistently across all the replicas, so all of them have the same view of it. What is this log? It can be anything; in a shopping cart server, for example, the log can be the sequence of operations to be performed, add an item, delete an item, all the operations on your cart. If you run Raft at all the replicas, it will exchange messages in such a way that all replicas see the same consistent log.
Once you have this log, it is easy to build a replicated state machine: all replicas start in the same initial state; whenever a request comes, it is put in the log, replicated consistently across all the replicas, and then executed. In this way all your replicas start with the same initial state, get the same inputs from the log in the same order, execute them, and therefore reach the same final state, so your replicated state machine is maintained consistently across all replicas. It is easy to see that once you have Raft, you can build any strongly consistent replicated system using this log as a building block. There are also other consensus protocols. For example, Paxos is an older protocol than Raft (Raft is essentially an improvement over Paxos) that lets all replicas reach agreement on a single value; it does not have the concept of a log, but you can run multiple iterations of Paxos to agree on multiple values. So there are many consensus protocols by which all the replicas agree on some set of things and thereby stay consistent with each other, and if you need strong consistency you have to run some such consensus protocol across your replicas. We will now briefly see how Raft works. Of course, if you take a distributed systems course you will study these consensus protocols in much more detail, but in this lecture I just want to touch upon the main ideas of how strong consistency is achieved using consensus protocols. What is the basic idea of Raft at a high level? All the replicas maintain a consistent replicated log.
There is a log of entries: the zeroth entry, the first entry, the second entry, and so on, a sequence of entries. An entry can be any operation: add an item, delete an item, do this, do that. And this log is replicated consistently across all the nodes in the system; Raft ensures that. How does Raft work? One of the replicas is elected as the leader, and all the other replicas are followers; Raft has this concept of a leader and followers. The leader is nothing special, just one of the replicas, elected by the replicas. What the leader does is receive inputs from the clients and propagate those inputs to all the replicas: when a user adds an item to a cart, the leader adds this operation to its log and ensures that the operation is propagated to all the replicas in the form of a log entry. Anything the system has to do, the leader turns into a log entry, adds to its log, and tells to all the replicas so that they add it to their logs. Once an operation has been replicated at a majority of the nodes in this way, the leader declares it done: the entry is safe, it can be committed, it is applied to the state machines, and a confirmation is returned to the client. The important phrase here is "a majority of the nodes". Suppose there are five nodes in the system. The leader will not send a response back to the client saying your operation is done after replicating at only one or two nodes; it replicates at a majority, at least three out of the five nodes, which must have the entry in their replicated log. Only when the entry is replicated at a majority and applied to the state machines, that is, the request is completely processed, is a confirmation returned to the client.
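The majority-commit rule can be captured in a couple of lines. This is a toy model of just the counting rule, not the Raft protocol itself; `can_commit` and its arguments are names invented for the sketch.

```python
# Minimal sketch of the majority-commit rule: the leader counts how many
# replicas (including itself) have stored a log entry, and only confirms
# the request to the client once a strict majority have it.

def can_commit(acks, cluster_size):
    """True once the entry is stored on a strict majority of nodes."""
    return acks > cluster_size // 2

assert not can_commit(1, 5)   # 1 of 5: keep waiting
assert not can_commit(2, 5)   # 2 of 5: still not a majority
assert can_commit(3, 5)       # 3 of 5: majority reached, reply to client
assert can_commit(2, 3)       # 2 of 3 is also a majority
```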
What if the leader cannot contact the replicas, if there are a lot of failures in the system and it cannot replicate? Then it will not send any response to the client. So either the value is replicated at a majority of nodes, or the leader says, sorry, I cannot handle this request. Such protocols are examples of what are called quorum protocols. What is a quorum? A quorum is a subset of participants sufficient to reach an agreement; that is the meaning of the English word. Raft is a quorum protocol because it waits to contact a majority of nodes, builds agreement on that majority, and only then sends a confirmation that the request has been handled. Concretely, suppose there are 2F + 1 replicas in your system. If the leader can contact at least F + 1 of them, that is a majority, more than half, and the operation can be declared done. But if it cannot contact F + 1 nodes, that is, if there are more than F failures, the system will not make progress: it can tolerate up to F failures, but not more than F. The leader also keeps changing periodically: the replicas monitor each other via heartbeats, and if the leader fails, even temporarily, the followers elect a new leader; the old node can rejoin as a follower later on. So all the replicas work as either leader or followers, and every time the leader changes, a new term, a new round, is said to have started in Raft; that is the terminology used. That is the basic idea. Now, how does Raft guarantee consistency when nodes keep failing? All sorts of bad things can happen; for example, when replicas fail, the logs can diverge.
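The 2F + 1 arithmetic is worth a quick sanity check. The helper name below is made up for the sketch; the rule it encodes is the one from the lecture: a cluster of n nodes tolerates up to floor((n - 1) / 2) failures, because the survivors must still form a majority.

```python
# With 2F + 1 replicas, a majority is F + 1, so the system keeps working
# as long as at most F replicas have failed.

def tolerable_failures(n):
    """Max failures a majority-quorum system of n nodes can tolerate."""
    return (n - 1) // 2

assert tolerable_failures(3) == 1
assert tolerable_failures(5) == 2
assert tolerable_failures(7) == 3

# With F failures, the surviving n - F nodes still form a majority:
for n in (3, 5, 7):
    f = tolerable_failures(n)
    assert n - f > n // 2
```

Note that adding a node to make the cluster even (say 4 instead of 3) does not help: a majority of 4 is 3, so you still tolerate only 1 failure, which is why such clusters are usually run with an odd number of nodes.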
Suppose the leader is sending some updates to all the nodes, and one node failed briefly and missed a couple of them: the other nodes got those updates, but this one did not. So a follower can crash and miss some log entries. Remember, the leader only replicates at a majority of the nodes; it cannot ensure that everything is replicated at all nodes all the time, because failures keep happening. So some followers may have missed some entries. Conversely, some followers can have extra entries. How is that possible? One node was the leader in a term and accepted a lot of entries, but this leader died before it could replicate them. The other followers do not have those entries; one of them becomes the new leader with a shorter log, while the old leader, when it comes back up as a follower, has a longer log full of uncommitted entries. So the logs of the replicas can diverge: some followers have fewer entries, and some followers that were previous leaders have longer logs than the present leader. What does the leader do in such cases? The leader always reconciles these logs: it syncs all the followers' logs to its own. The leader is the one actively maintaining consistency here; it says, all of you listen to me, here is the log, please update yours to match. The way this is done is that whenever the leader sends the kth entry of the log, say the 10th entry, it also mentions the previous entry, entry k-1, and a follower will accept the new entry only if its own previous entry also matches.
For example, suppose the leader has replicated 4 entries to everybody, but one follower missed the 3rd and 4th entries. When the leader sends the 5th entry, it also says: here is my 4th entry; if yours matches, install the 5th. A follower whose 4th entry matches will install it, but the lagging follower says, no, my 4th slot is blank, I do not have the 4th entry, so I will not install the 5th. The leader then steps back: do you have the 4th entry? No. The 3rd? No. The 2nd? Yes. So the leader identifies the point up to which the logs are in sync, rolls back to where they match, and helps the follower catch up: here is the next entry, here is the next one, and so on, until all the previously committed entries have been delivered to the follower that missed them. In this way the leader is always telling the followers whatever they have missed. Similarly, if a follower has extra uncommitted entries that have to be erased, the leader rolls those entries back at the follower too. So the leader ensures that every follower ends up with exactly the same log as itself: it is not just replicating one entry, but ensuring that the entire log is consistent. The leader's log is what everybody ends up following, which is why it is very important to elect a good leader.
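The step-back-and-catch-up procedure can be sketched as a small function. This is a simplified toy: real Raft compares (index, term) pairs and sends incremental AppendEntries messages, while here logs are plain lists of entries and the whole reconciliation is done in one call.

```python
# Toy sketch of Raft's log reconciliation: walk back to the last index
# where the follower's log agrees with the leader's, discard anything
# conflicting or extra at the follower, then copy the leader's entries
# forward from that point. Omits terms and messaging details.

def sync_follower(leader_log, follower_log):
    """Return the follower's log after syncing it to the leader's."""
    match = 0
    limit = min(len(leader_log), len(follower_log))
    for i in range(limit):
        if follower_log[i] != leader_log[i]:
            break
        match = i + 1
    # Keep the matching prefix, drop the rest, catch up from the leader.
    return follower_log[:match] + leader_log[match:]

leader = ["e1", "e2", "e3", "e4", "e5"]
behind = ["e1", "e2"]                   # follower that missed e3..e5
ahead  = ["e1", "e2", "e3", "x4"]       # follower with a stale extra entry

assert sync_follower(leader, behind) == leader
assert sync_follower(leader, ahead) == leader   # x4 is rolled back
```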
What if you elect a leader that is missing a lot of entries? Suppose a node with only 2 entries, missing all the rest, becomes the leader. It would tell everybody: listen to me, this is the log, remove all your extra entries. That is not right: those entries could have been committed, with confirmations already sent to users; you cannot roll back entries as and how you wish. Therefore it is important to elect a good leader. Who is a good leader, and how do the replicas vote for one? All the replicas vote for the node that has the most up-to-date log. A replica that has missed some log entries will never be made the leader; you always elect a leader with the most up-to-date log, one that has not had failures, has been up most of the time, and has all the entries. This ensures that when the leader propagates its log to everybody else, the correct log with all the entries is propagated, not a log with missing entries. But how is this guaranteed? How do you know that, at election time, there is always a node with the most up-to-date log available? What if you only have a bunch of replicas that are all missing log entries; how do you elect a good leader then? That is where the beauty of the protocol comes in: the notion of a majority. Leader election, committing an entry, everything requires a majority of votes, and any two majorities always have an intersection; you cannot have two majorities with no common element. That is not possible.
Any time you take a majority, and at a later point take another majority, there will be an intersection: for example, if you always pick 3 out of 5 nodes, any two such sets of 3 share at least one node. So suppose one node was the leader in a term, and some replica was a follower that received all the entries. Then the leader fails, another majority votes, and this old follower becomes the new leader. There will always be at least one node at the intersection of the two majorities that has seen all the updates from the previous majority and also gets counted in the majority of votes in the new term. It can never be the case that everyone who was up to date in the previous round is unavailable for election now. Every time an election happens, there is at least one node that has seen all the updates, was part of the majority in the previous term, has the most up-to-date log, and is therefore available for election in this term; and it will be elected, because replicas vote for the node with the most up-to-date log. This new leader then propagates the up-to-date log to all the followers. So, because any two majorities intersect, you always have a good leader available with the up-to-date log, who then ensures that this up-to-date log is what carries over into the next term.
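The intersection claim is easy to verify exhaustively for a small cluster. The sketch below enumerates every 3-node majority of a 5-node cluster and checks that every pair of majorities shares at least one node; the node names are arbitrary.

```python
# Quick check that any two majorities of a cluster must share at least
# one node: with 5 nodes, every pair of 3-node subsets intersects. This
# is why some node in the new majority has seen the old term's commits.

from itertools import combinations

nodes = {"n1", "n2", "n3", "n4", "n5"}
majorities = list(combinations(nodes, 3))

assert len(majorities) == 10          # C(5, 3) = 10 possible majorities
assert all(set(a) & set(b)            # every pair has a non-empty overlap
           for a in majorities
           for b in majorities)
```

The general argument is just counting: two subsets of size F + 1 drawn from 2F + 1 nodes contain 2F + 2 elements between them, one more than the cluster has, so they must overlap.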
This ensures that Raft always has strong consistency properties, and across all such consensus protocols this is a common feature: if a majority cannot be contacted, if for some reason you cannot replicate your log entry at a majority of the nodes, then you do not return any response to the client; the system becomes unavailable. You never replicate at only a minority and then go back and tell the client the request is done. Why? Because in the future, nobody from that minority may be available for the next term, and a bad leader could get elected; you do not want that. Then the question comes up: sometimes I want high availability; I want my system to always return a response in spite of failures. In such cases, some systems say: we will accept weak consistency, and we will return a response even if we could only replicate at a minority of the nodes. Note that in such cases you will not have strong consistency. You can either have strong consistency, with the system sometimes unavailable when a majority of the nodes cannot be contacted, or, if you want high availability, a system that always responds to you, then you must be willing to replicate at only a minority of the nodes and sometimes accept inconsistent results. For example, you can say that a client request is replicated at only a minority of the nodes and a response is still returned. What can happen in such cases? Two minorities need not intersect: out of five nodes, you can have one set of two nodes and another set of two nodes with no common node at all.
So you add an item to a shopping cart and it is replicated at two nodes; then those two nodes fail; then you view the shopping cart and your request goes to two other nodes. Now what do you see? The item you added is not available at these nodes; it was never replicated at a majority. So it is possible for you to see inconsistent values. It is rare; the system will try to ensure that everything is replicated everywhere, it will try to be consistent, but it is not guaranteed. There are real systems like this. For example, Amazon has a key-value store called Dynamo that has high availability: it will always return a response, but it may not be consistent all the time; you only get eventual consistency. That is, the system tries to catch the replicas up: the two nodes that took the write will try to tell the failed replicas about the added item later on, but there are no guarantees on timelines; you are only eventually consistent. Such protocols are also called sloppy quorum protocols: you are still doing a quorum, talking to a subset of the nodes to propagate the decision and replicate the operation, but you are being a bit sloppy about it; if I cannot contact everybody, so be it, but I will return a response to my client. This is an example of a weakly consistent, eventually consistent, but highly available design. Note that systems with weak consistency can sometimes return conflicting values of your application state, as in the example I gave earlier. Suppose your shopping cart has items A and B, and all the nodes in your system agree on that. Now you add an item C, and the request reaches only a minority of the nodes; fine, the system still responds to you saying item C has been added.
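The stale-read scenario above can be replayed as a sketch. The node names, the choice of which nodes take the write, and which node serves the read are all assumptions for illustration.

```python
# Sketch of why a sloppy quorum can return stale data: a write lands on
# a minority {n1, n2}, those nodes then fail, and a read served by n3
# never sees the item even though the write was confirmed to the client.

carts = {n: ["A", "B"] for n in ["n1", "n2", "n3", "n4", "n5"]}

for n in ("n1", "n2"):             # write "C" reaches only a minority
    carts[n].append("C")
# ...and the client is told "item C added".

# n1 and n2 now fail; a later read is served by n3 (or n4, or n5).
read_result = carts["n3"]
assert "C" not in read_result      # the confirmed write is invisible
```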
Now you add another item D to your shopping cart; the nodes holding C are down, so D reaches some other set of nodes, which add it to the cart. Now, when you view your cart, if you talk to the first set of nodes they tell you your cart has items A, B, C; if you talk to the other set they tell you it has A, B, D. Which is correct? You do not know; both are valid options. Note that this could never happen with something like Raft, which guarantees strong consistency. Why? Because when you add item C you replicate at a majority, and when you add item D you replicate at a majority, so between the two sets there is at least one common node that has seen the A, B, C state and also sees the D update. Therefore you cannot have such divergent values: your system will have consistent values, the most up-to-date node will be the leader, it will ensure the C update is propagated into the next term as well, and your shopping cart will have A, B, C, D. But if you are using a weak consistency protocol that returns a response even when only a minority of nodes has been contacted, then you can get two different values; and then what do you do? Incidentally, Dynamo was developed at Amazon precisely to manage shopping carts, and there they argue that for shopping carts this is acceptable: if an item is sometimes missing from your cart, the user will not be terribly upset, and if you get two different carts like A, B, C and A, B, D, the system can even merge them, adding up all the items and saying your cart is A, B, C, D.
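A simple union-style merge of the conflicting cart versions can be sketched as below. This is an illustrative simplification: the real Dynamo uses vector clocks to detect that two versions conflict before the application merges them, and the function name here is made up.

```python
# Union-merge of divergent shopping cart versions, preserving the order
# in which items are first seen. Illustrative sketch of the cart-merge
# idea described for Dynamo, not Dynamo's actual reconciliation code.

def merge_carts(*versions):
    """Merge conflicting cart versions by taking the union of items."""
    merged = []
    for cart in versions:
        for item in cart:
            if item not in merged:
                merged.append(item)
    return merged

assert merge_carts(["A", "B", "C"], ["A", "B", "D"]) == ["A", "B", "C", "D"]
```

One known drawback of merging by union, which the Dynamo paper itself notes, is that items deleted in one version can reappear after the merge.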
It is easy to handle inconsistent values for some applications like shopping carts, but not for others, like bank accounts. "Your bank balance is either X rupees or Y rupees, we do not know which": will you be happy? No, you will be terribly upset. For such applications, the approach of returning all possible values and letting you sort it out will not work. So which replication and consistency model you use depends heavily on the application's needs. Some applications need a strong consistency model, and they will use a consensus protocol like Raft or Paxos; that is more work and involves more overhead, but it guarantees strong consistency. Other applications are fine with a weaker consistency model; they will use something like a sloppy quorum protocol and accept weaker consistency in return for better performance. What is the tradeoff here? It is between consistency and availability. If you provide strong consistency, then sometimes, when you cannot contact a majority of nodes, you have to be unavailable; you have to say, sorry, I cannot handle this request. For instance, there can be a network partition: some link fails, your network splits into two parts, and the nodes on one side cannot contact the nodes on the other. When such things happen, if you cannot contact a majority of the nodes, you say, sorry, I cannot handle this request; you become unavailable. Systems with weak consistency, on the other hand, remain available even when there is a network partition or network failure and you cannot contact everybody, but they may sometimes return inconsistent results.
There is a very famous theorem called the CAP theorem; if you take a distributed systems class you will learn about it in much more detail, but the basic idea is fairly intuitive. There are three properties. Consistency: the application state is the same no matter which replica you look at. Availability: the system always handles your request and gives you a response back, instead of saying, I cannot handle your request. Partition tolerance: even if the network fails and some nodes cannot contact others, the system keeps working. The CAP theorem says you can get any two of these three, but you cannot have it all; you have to make your tradeoffs. If consistency is very important to you, and of course network partitions will keep happening in real life, then you cannot have availability all the time: you pick a design that is strongly consistent but not always available. Or, if you say you do not care so much about consistency, you always return a response and are highly available, but sometimes you return inconsistent values. What you cannot have is a fully consistent system that is available all the time and also tolerant of network partitions and failures; all three at once are impossible. A distributed systems course will discuss this in detail, but I hope this discussion gave you a high-level picture of why the CAP theorem holds in real life.
So, to summarize: in this lecture we studied what consistency is, what strong and weak consistency are, the different ways of doing replication to achieve strong and weak consistency, and the tradeoff between consistency and availability captured by the CAP theorem. As extra homework, you can go back and read the original Raft and Dynamo papers, which deal with strong consistency and weak consistency respectively. Of course, not all the details of the papers will be understandable at this point, but you will get a high-level picture of how you achieve strong consistency and how you build systems with weaker consistency models. That is all I have for this lecture; let us continue our discussion in the next lecture. Thank you.