Let me get started. It is a great pleasure to be here, talking with fellow friends, coming back to the conference after some time. Though my work in the last year or so has primarily been on AI, deep learning, and related things, this topic interests me because it is close to my heart: I did my PhD in distributed systems, so it's a pleasure to look at distributed systems again and at some of the issues there. The inspiration, of course, came from the fact that Google came up with Spanner and made it available in the cloud recently, so I just wanted to look at that space and what's happening around data safety, consensus, and so on. This talk is the result of that.

We'll see if this works. I just want to bring up some of the old systems, the ones considered complex systems about 15 years back. These were the monolithic systems of old: people could statically program the behavior of the system, there was not much dynamicity, and they were mostly stateless kinds of systems. Why am I bringing this up? To illustrate the contrast with some of the recent systems, for example TensorFlow, which can be seen as a really complex distributed system.

TensorFlow, I don't know how many people are aware, started as the DistBelief system about four or five years back. The DistBelief paper was published by Jeff Dean and his colleagues at Google Brain, and it became a bit of a rage at the time, because that was the first time people were talking about distributed deep learning, and it was a good piece of work, so people started looking at it. One of its key features was a parameter server, which was centralized: there were stateless worker nodes on each machine, connecting to the parameter server, which held all the state, and all of them ran certain computations, for example a deep learning computation. That's how it started, and it was the precursor to TensorFlow.

TensorFlow is the next level of system compared to DistBelief. But what's different about it? It is of course a dataflow system, and there is a huge number of dataflow systems around: DryadLINQ was one of the older ones, Amazon has its own, Facebook has Caffe2. So what's really different about TensorFlow? Two things are important to note here. One is that the user builds the TensorFlow graph, and then the system prunes it, figures out which subgraphs to run, and runs certain computations on the subgraphs; concurrently executing multiple parts of the graph is one unique part of it. The second is that most dataflow systems consist of a set of vertices and the edges connecting them: the edges are typically used for the flow of data (that's why they are called dataflow systems), and the vertices are the ones doing computations on the data that flows in. Traditionally, dataflow systems have always done computations on immutable data; the computations do not modify the data that flows into the vertices. TensorFlow allows mutable data.
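Before going on, here is a minimal, self-contained sketch of that parameter-server pattern in plain Python. This is not DistBelief or TensorFlow code; the names and the toy update step are illustrative assumptions. The point is the architecture: all mutable state lives in one place, and the workers themselves are stateless.

```python
import threading

class ParameterServer:
    """Holds all mutable model state; the workers stay stateless."""
    def __init__(self, num_params):
        self.params = [0.0] * num_params
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return list(self.params)

    def push(self, gradients, lr=0.1):
        # Apply a worker's gradient update to the shared state.
        with self.lock:
            for i, g in enumerate(gradients):
                self.params[i] -= lr * g

def worker(server, data_shard):
    # Stateless worker: pull current params, compute a toy gradient, push it.
    params = server.pull()
    gradients = [p - x for p, x in zip(params, data_shard)]  # illustrative only
    server.push(gradients)

server = ParameterServer(num_params=3)
shards = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
threads = [threading.Thread(target=worker, args=(server, s)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(server.params)
```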
That mutability is a significant difference, because now the vertices can actually modify the data. So what does it mean for us? One key consequence is that shared state has to be managed: TensorFlow effectively merges state management with the graph execution. That's one point we need to keep in mind.

The complexity of the system arises from various factors, including the fact that it's distributed. You have the distributed master, which prunes the graph, partitions it, and distributes the pieces to different parts of the network. You have the kernel implementations, the specific implementations of each operation; there are many, many kernels, and the dataflow has to be executed on the specific kernels the user chooses. That's one layer of complexity. Then of course you have the different devices on which they can execute: GPUs, CPUs, and what not. One key thing, again contrasting with its predecessor DistBelief, is that TensorFlow works on a heterogeneous cluster of nodes; DistBelief worked only on homogeneous nodes. That's another key difference.

Why am I talking about TensorFlow at this level of detail? (Am I on the mic properly? It's fine, right?) To illustrate that this is one of those really complex distributed systems, with shared state across a set of nodes, which therefore needs to solve the distributed consensus problem, and to solve it at scale. That's the whole point I was trying to make.

So, to illustrate the problem, we go back to school, pretty much; this is how consensus problems are described in distributed computing classes, so bear with me for a minute. Two people sitting in their own offices, Bob and Alice, want to meet for coffee. The only medium of communication (these were older days) is email, and the email is sent over a medium that is unreliable. So when Bob sends an email to Alice saying "can we meet for coffee?", Bob does not know that Alice got the email; Alice sends an acknowledgement saying "yes, got it, we will meet", and now Alice does not know that Bob received that acknowledgement. The medium is unreliable: messages could be lost, messages could be delayed, they could be delivered after an unbounded period of time. That makes it very complex.

If you think about it, it looks like a very trivial problem that a phone call would solve, because the other person knows: if I say "can we meet for coffee?" and she says "yes", I know she heard me. The issue here is that because the medium is unreliable, you are not sure the acknowledgement reached the other person. So just before stepping out, Bob thinks, did Alice receive my acknowledgement? If she doesn't see the acknowledgement, she may not come; so he sends another acknowledgement. And Alice thinks the same thing: if he doesn't see this acknowledgement, he may not come. And so it goes on. The point I am trying to make is that in a purely asynchronous distributed system, in which the medium of communication, the network, is unreliable,
consensus cannot be achieved. This is the famous Fischer-Lynch-Paterson (FLP) theorem, or the impossibility result, as it is called, which says that in a purely asynchronous distributed system, consensus is impossible to achieve.

This looks like a dead end: consensus cannot be achieved, so what should we do? However, it turns out that under certain conditions, when the purely asynchronous system becomes slightly synchronous, or the completely unreliable network becomes slightly reliable, conditions arise under which consensus is achievable. That is the broader perspective on the problem.

So here are those conditions. This is a table which tells you under what conditions consensus can be achieved in a distributed system, along several dimensions. The way to read it: processors can be synchronous or asynchronous. Synchronous processors means that when one processor takes, say, n steps, it is guaranteed that all the others take at least one step. In the asynchronous case there is no such guarantee; processors may not take any step, so there is no guaranteed progress, and liveness is not assured. Message delivery can be bounded or unbounded: bounded means there is a time within which a message will be delivered; if it is unbounded, there is no such bound, so timeouts cannot be used. That's another aspect. And of course there is the communication medium, whether it is broadcast or point-to-point, and messages can be ordered or unordered.

If you look at the asynchronous-processor rows, consensus is almost always not achievable. The only exception you see is ordered broadcast. This seems a bit puzzling: how is it possible to implement an ordered broadcast primitive in a purely distributed system? It turns out that case covers systems which are not really distributed, for example parallel architectures, in which processors may share a common bus. If the bus is shared, it is likely they can implement a broadcast primitive which is an ordered broadcast. And if ordered broadcast is solvable, then consensus is also solvable; they can be treated as equivalent problems, in the sense that one can be used to solve the other. So in an asynchronous system, only in that case, with an ordered broadcast, is consensus achievable; in all the other processor-asynchronous cases, consensus is not doable. The interesting cases are the two or three we can look at next.

There are also certain properties of consensus that any protocol must provide. Consistency, or agreement: all agents will agree on the same value. Validity: the agreed value is a value that one of the agents actually proposed. And termination: eventually the protocol should terminate.
All these conditions must be met by any consensus protocol. And the key case is here: if processors are synchronous and message delivery is bounded, consensus is always achievable. What should the system do in this case to achieve consensus? Use timeouts, because you have a bound within which messages must be delivered. For example, I can take twice that bound as my timeout, and if something has not been delivered by then, since the network guarantees delivery within the bound, I can safely assume the processor is down. The key difficulty in most settings is that we cannot distinguish a processor being down from a processor that simply isn't receiving messages: a problem in the network versus a process failure. If you cannot make that distinction, consensus is not achievable; but if processors are synchronous and message delivery is bounded, you can make it.

The other case where it's achievable is the one we saw before: when an atomic (ordered) broadcast is available. The last case is when processors are synchronous and message delivery is unbounded but messages are ordered; it turns out the protocol for that case is so lengthy that it is of no practical use. So essentially there are two conditions under which consensus is doable: one is the ordered-broadcast kind of setting, and the other is synchronous processors with bounded message delivery, where you know for sure you can use a timeout and achieve consensus.

Let's go on. Just to recap: we saw the properties of consensus and the Fischer-Lynch-Paterson impossibility result, which says consensus is not achievable in a purely asynchronous system, and the cases where it is achievable are also given here.

What about shared memory? Shared memory can be seen as a setting where everybody can read each other's messages, so the question does arise whether consensus is easier there. It turns out it's not so trivial: even in a shared-memory system with just reads and writes, it is pretty difficult to achieve consensus. You need stronger primitives, for example test-and-set, fetch-and-add, compare-and-swap, and so on. There are a number of primitives that help you achieve consensus even in a shared-memory system, and those primitives have consensus numbers: a consensus number of n means that consensus can be achieved among n processes even with n minus 1 failures. The key thing to understand is that it is the possibility of failures that makes this hard.
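Going back for a second to the case that matters most in practice, synchronous processors with bounded delivery: here is a toy heartbeat monitor built on the twice-the-bound rule described above (a simulation under that stated assumption; the constants and names are invented):

```python
import time

DELIVERY_BOUND = 0.5          # assumed network delivery bound, in seconds
TIMEOUT = 2 * DELIVERY_BOUND  # as in the talk: take twice the bound

class HeartbeatMonitor:
    """Declares a process failed if no heartbeat arrives within TIMEOUT.
    This is only sound because message delivery is bounded; in an
    asynchronous system a slow message is indistinguishable from a crash."""
    def __init__(self):
        self.last_heard = {}

    def heartbeat(self, process_id):
        self.last_heard[process_id] = time.monotonic()

    def suspects(self, process_id):
        last = self.last_heard.get(process_id)
        return last is None or (time.monotonic() - last) > TIMEOUT

monitor = HeartbeatMonitor()
monitor.heartbeat("node-1")
time.sleep(1.2)                      # longer than TIMEOUT
print(monitor.suspects("node-1"))    # True: safe to declare node-1 down
```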
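And as a toy illustration of the shared-memory primitives just mentioned (a sketch of the standard idea, with a lock standing in for hardware atomicity): with compare-and-swap, the first process to install its proposal wins, and every process, winner or not, decides whatever value sits in the cell, so all of them agree even if the others crash.

```python
import threading

class CASRegister:
    """A shared cell with an atomic compare-and-swap operation."""
    def __init__(self):
        self._value = None
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

    def read(self):
        with self._lock:
            return self._value

def decide(register, my_proposal):
    # Try to install my proposal; win or lose, adopt whatever got installed.
    register.compare_and_swap(None, my_proposal)
    return register.read()

reg = CASRegister()
decisions = []
threads = [threading.Thread(target=lambda v=v: decisions.append(decide(reg, v)))
           for v in ("coffee", "tea", "juice")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(decisions)  # all identical: agreement holds, and the value was proposed
```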
One other thing to remember is that so far I have been talking mostly about what's called non-Byzantine consensus, with what are called fail-stop failures: processors fail only by stopping, and a failed processor does nothing malicious. In Byzantine consensus, which is a much harder problem, processors can actually act maliciously: a processor may send wrong messages to other nodes, or send one message to one node and a different message to another. In all of those cases it becomes very difficult to achieve consensus.

For example, extend our Bob and Alice parable to three people: John joins them, and all three want to meet for coffee. Again, with the right medium of communication, a conference call would solve the problem: you get on the call, everyone hears what everyone else says and acknowledges it, and then you meet. A conference call is equivalent to an ordered-broadcast kind of mechanism, and in that case agreement is doable. But suppose you have only point-to-point communication: you can talk to only one other person at a time on the phone. How is consensus doable then, if even one of the three can be malicious? Say John decides to tell Bob that he will come, and to tell Alice that he won't. Can the others still reach agreement?

This is the Byzantine generals problem: there are generals in a war trying to coordinate an attack on a camp, and some of the generals are traitors who try to break the protocol. The general result is that you need 3f + 1 processors to tolerate f malicious ones; in other words, you cannot achieve consensus if a third or more of the processors are malicious.

So, as we said, there are certain conditions under which consensus is doable, and there are protocols that implement consensus under certain assumptions. Paxos is one of the most common protocols that implement consensus. Paxos was proposed by Leslie Lamport; I tend to refer a lot to Lamport's papers and his work here, because he has been a seminal contributor in this space. He wrote a paper called "The Part-Time Parliament", which describes the problem in a rather abstract way. It talks about a remote island in Greece called Paxos, where the people contribute as legislators; but the people are all busy, so they attend to their own businesses, walk into parliament only when they can, and walk out whenever they have business elsewhere, transacting some legislative business while they are in. The key point is that this models a distributed system, because legislators walking out of parliament mimics nodes failing or dropping out of the network. He modeled the whole narrative around this parliament of Paxos managing to pass any useful decrees. That is how the paper was written, and it turned out not to be well received at first, for whatever reason; but it is actually a beautifully written paper, so if you have time, please read "The Part-Time Parliament". It is a wonderful paper, and it was the first time Paxos was presented.

The point here is that Paxos is actually a very complicated protocol, so several attempts were made to simplify it ("Paxos Made Simple", Fast Paxos, and so on), and many current distributed systems, including the ZooKeepers of the world, use some variation or other of Paxos; many NoSQL systems likewise implement one variation of Paxos or another. It is a very, very
common protocol; please keep that in mind. The protocol works something like this. You have a set of proposers. A proposer says, "this is the value I'm proposing, do you agree?", and sends it out to what are called acceptors. There is a set of acceptors, and each acceptor makes a local decision of its own; there is no global decision maker. Every proposal carries a proposal number, and proposal numbers are assumed to be monotonically increasing (there are various ways to achieve that). Given that, acceptors look at whatever they receive and do not accept proposals that arrive later with a lower proposal number; that way the protocol tends to make progress. Acceptors send their acceptances back, and also to what are called learners; a learner finally figures out that a majority of acceptors have accepted a certain proposal, and that proposal's value becomes the decided value. That's how Paxos really works, and it is the commonly used protocol for achieving consensus.
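To make that proposer/acceptor exchange concrete, here is a minimal single-decree sketch (illustrative only: real implementations add retries with higher proposal numbers, distinct learner messages, uniquely numbered proposals per proposer, and durable acceptor state):

```python
class Acceptor:
    """Each acceptor decides locally; there is no global coordinator."""
    def __init__(self):
        self.promised = -1      # highest proposal number promised so far
        self.accepted = None    # (number, value) last accepted, if any

    def prepare(self, n):
        # Phase 1: promise to ignore proposals numbered below n.
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n, value):
        # Phase 2: accept unless a higher-numbered promise was made meanwhile.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "rejected"

def propose(acceptors, n, value):
    """Run one round with proposal number n; return the chosen value or None."""
    replies = [a.prepare(n) for a in acceptors]
    granted = [prior for verdict, prior in replies if verdict == "promise"]
    if len(granted) <= len(acceptors) // 2:
        return None  # no majority promised: retry later with a higher number
    # If some acceptor already accepted a value, adopt the highest-numbered one.
    already = [prior for prior in granted if prior is not None]
    if already:
        value = max(already)[1]
    votes = [a.accept(n, value) for a in acceptors]
    if votes.count("accepted") > len(acceptors) // 2:
        return value  # a majority accepted: learners can learn this value
    return None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, n=1, value="meet-for-coffee"))
```

The adopt-the-highest-accepted-value rule in phase 2 is what keeps competing proposers from overwriting a value that may already have been chosen.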
The other approach in this space is what are called failure detectors, which are a different way of framing the problem. A perfect failure detector is one in which every failed processor is eventually suspected of having failed by every other processor. It turns out, due to the impossibility result, that a perfect failure detector cannot be both complete and accurate; it can be only one of the two, because if it were both, it would violate the Fischer-Lynch-Paterson impossibility result. So failure detectors are classified along two axes, completeness and accuracy. Completeness relates to how many of the processors that have actually failed are suspected by the others; accuracy relates to how many of the processors that have not failed are wrongly suspected. Certain systems use variations of failure detectors and implement some of these, but Paxos has become much more common, so you may not see failure detector implementations in the commonly used distributed systems of today.

One other thing in this space we need to keep in mind is the CAP theorem, first stated by Eric Brewer. What he said, essentially, is that in a partitioned distributed system you cannot achieve both consistency and availability. People tend to read it as: you have three things, partition tolerance, consistency, and availability, and you get only two of the three. That reading is also true, but mine is slightly more restricted: only in the face of a partition is there a forced tradeoff between consistency and availability. That is the key tradeoff the CAP theorem talks about. Why bring up the CAP theorem here? Because most distributed system designers have to keep it in mind when designing.

Notice that systems are typically labeled AP or CP. AP systems are those which sacrifice consistency in the face of a partition and stick to availability: they keep the data available for reading, but it may not be strictly consistent, so they typically implement eventual consistency or other weaker forms of consistency. CP systems are those which sacrifice availability in the face of a partition and keep the system consistent; the MongoDBs of the world would appear here. What they do in the face of a partition is make some data elements unavailable for writes, keeping them available only for reads; that is the availability sacrifice they make in order to stay consistent. The third corner is CA systems: the traditional relational stores, which achieve both consistency and availability but do not tolerate partitions, since you cannot have all three. They tend to work on a small cluster, often even a single node, so they do not need to think about partitions, but they are obviously not horizontally scalable; they end up being CA kinds of systems.

The point here is that while the consistency-availability tradeoff is one thing practical distributed system designers keep in mind, there is a different tradeoff which people do not look at as much: the tradeoff between consistency and latency. A partition can be seen as a rare event; it is not every day that a distributed system partitions. A partition is a worst-case network failure in which the set of nodes is divided into two groups that cannot communicate with each other, an extreme kind of event. People tend to forget that in the normal operation of a distributed system, the tradeoff is between consistency and latency, because strict consistency involves protocols like Paxos or 2PC, which tend to be time consuming; so consistency versus latency is a tradeoff people have to keep in mind. The PNUTS system from Yahoo illustrates that tradeoff beautifully: it makes the system available for what are called stale reads, serving reads irrespective of consistency, and it makes that tradeoff mostly irrespective of partitions. Keep that in mind, because most NoSQL systems talk only about the CAP theorem and say "we make this tradeoff because of CAP", but they are trading off consistency even in the absence of a partition event, which is not actually necessary: only in the case of a partition do you have to trade off consistency for availability.

The other question, of course, is where exactly consensus is useful. I have been talking about it for so long, but is it really useful? Well, even simple things like commit protocols are based on consensus: the participants have to agree that a transaction is a commit or that it is an abort. Even something that simple requires consensus; that is two-phase commit. And then there are three-phase commit protocols: there are certain conditions under which two-phase commit fails (notably, it blocks if the coordinator crashes at the wrong time), and three-phase commit protocols were proposed to address that. So that is one way of looking at transactions.
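Since commit protocols just came up, here is a toy sketch of two-phase commit (an illustration with invented names; real implementations add write-ahead logging and timeout/recovery handling). Phase one collects votes; phase two broadcasts the unanimous decision. The comment marks where a coordinator crash leaves participants blocked, the weakness that motivated three-phase commit:

```python
class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "init"

    def vote(self):
        # Phase 1: vote yes only if the local part of the transaction can commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, decision):
        # Phase 2: apply the coordinator's global decision.
        self.state = decision

def two_phase_commit(participants):
    """Agreement on commit/abort: commit only if every participant votes yes."""
    votes = [p.vote() for p in participants]
    decision = "commit" if all(votes) else "abort"
    # If the coordinator crashes here, prepared participants are stuck waiting:
    # they can neither commit nor abort on their own. Hence three-phase commit.
    for p in participants:
        p.finish(decision)
    return decision

print(two_phase_commit([Participant(True), Participant(True)]))   # commit
print(two_phase_commit([Participant(True), Participant(False)]))  # abort
```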
And of course, the key thing is to be able to achieve the ACID properties, as they are called: atomicity, consistency, isolation, and durability. Achieving them in a distributed system tends to be hard, so a lot of distributed systems trade one off against another, relaxing consistency and so on, primarily because of CAP, or in certain cases because of the latency-consistency tradeoff, as in the PNUTS system.

So with consistency and the ACID properties being important, we come to the kind of systems that have come up, not very recently but in the last few years: what are called NewSQL systems. They aim to be horizontally scalable and to achieve the ACID properties as well, in a distributed setting. The key thing to look at is the kinds of workloads that have been coming up, and especially the kinds of operations involved: write intensive or read intensive, simple operations or complex operations. The data warehouses sit at one end, doing mostly reads; the OLTP data stores do mostly writes. Your traditional relational stores handle some of these: they assume row-wise storage (they are row stores), index through B-trees, use locking for concurrency control, have a query optimizer, use disk for storage, and so on; those are their properties. The NewSQL systems of the world have a different set of properties. As characterized by Stonebraker, they support the ACID properties, they use non-locking concurrency control protocols, they have shared-nothing architectures (which make the rest easier to implement), and they focus on high performance.

Let's look at some of these. As an example, one could take VoltDB, which is one of the interesting NewSQL stores. It is what is called a translytical database, both analytical and transactional, hence the name; it can scale out efficiently, has tiered data stores, and supports streaming data. I am going to skip quickly through a couple of slides in the interest of time. Another is Clustrix, which is also an interesting NewSQL store. Let me go on to Spanner, because Spanner is of particular interest: it is recent, it is from Google, and the fact that they have made it available on the cloud is what prompts me to talk about it. I will take five minutes on Spanner.

One key thing in Spanner is what is called the TrueTime API, which they have implemented with very specific hardware devices: they have implemented synchronized clocks. One of the key things we saw earlier is that consensus is harder to achieve when clocks are not synchronized; for example, in a commit protocol, getting a global timestamp is very hard. That is what Spanner aims to address by having synchronized clocks. They do the clock synchronization only within their own data centers, so Spanner obviously cannot be used outside the Google data centers.
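To see why synchronized clocks matter so much, here is a toy model of Spanner's commit-wait idea. The shape of the interface, a clock call that returns an uncertainty interval, follows the published Spanner design; the simulation and the constants here are invented for illustration:

```python
import time

EPSILON = 0.005  # assumed clock uncertainty bound; the talk cites under 10 ms

def tt_now():
    """TrueTime-style clock: the true time lies within [earliest, latest]."""
    t = time.monotonic()
    return (t - EPSILON, t + EPSILON)

def commit(txn_id):
    # Pick the commit timestamp at the upper edge of the uncertainty window.
    _, s = tt_now()
    # Commit wait: hold the commit invisible until s is definitely in the past,
    # so any transaction that starts afterwards gets a strictly later timestamp.
    while tt_now()[0] <= s:
        time.sleep(EPSILON / 2)
    return s

print("txn-1 committed at", commit("txn-1"))
```

The smaller the uncertainty bound, the shorter the wait, which is why investing in GPS and atomic clocks pays off.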
Within their data centers, though, they achieve certain levels of reliability in the network, which helps, plus there are nodes in the data centers that act as time masters. They have Armageddon masters, which maintain atomic clocks, and GPS masters, which maintain GPS clocks; most nodes in the data center can reach both kinds of clock, and they maintain time very accurately. One key reason Spanner is able to do a lot of this is that it is able to synchronize clocks: once the clocks are synchronized, it becomes much easier, for example, to take a timestamp and treat it as global, because the nodes have synchronized clocks and the error is very small; they claim it is less than 10 milliseconds. That is the key difference.

One other thing worth talking about is CockroachDB, which is an open-source implementation of the Spanner design. The key difference is that CockroachDB can run on any infrastructure; it does not need Google's infrastructure to run. So what is the difference between the two? CockroachDB will not have the same clock synchronization that Spanner has, because Spanner runs in Google data centers with physical clocks that keep time very accurately. What does that give practitioners? CockroachDB can only provide what is called serializability, whereas Spanner can actually achieve linearizability. Serializability is the weaker condition: it essentially gives the view of the database as if every transaction executed one after the other. Linearizability is the harder one, which says that in addition to that, a snapshot read is also consistent. A snapshot read means I go and look at the state of the database at a point in time; there should not be inconsistent timestamps in what I see. That inconsistency can occur in CockroachDB, and it will not occur in Spanner. So Spanner achieves linearizability, which is the harder thing to achieve, and CockroachDB is only able to achieve serializability, which is fine for most transactions; but if you need a point-in-time read of the database, that becomes very hard to do.

And of course, Spanner does not violate the CAP theorem in any way; the CAP theorem is inviolable. But one key thing is that you can look at Spanner as an effectively CA system, consistent and available, because Google makes partitions such a rare event, through its infrastructure and so on, that availability is extremely high. In theory it is a CP system: it is built to trade off availability to stay consistent. But because of its infrastructure and what they have achieved, it is an effective CA system, and that is significant, because a CA system in practice, at scale, is really hard to get. That is what Spanner is: an effectively CA system.

Let me wrap up quickly; I will not go into too much detail, as we do not have much time. I wanted to touch on formal verification, because one key thing is that the ability to verify all these systems is not easy; work has been done in the last few years to look into this, especially verifying the safety properties of distributed systems.
People have done some work there that we can look at. With that, let me quickly recap. We started from the consensus problem, then went a little into consistency issues and the ACID properties, looked at the CAP theorem and its significance, and then came to the NewSQL systems of the world, which tend to achieve ACID properties at scale. Spanner is one of the perfect examples; it has certain unique features because of its ability to synchronize clocks, and that results in a lot of benefits. Formal verification is something we couldn't really touch upon because of time, but the slides will be made available, so you can look it up. I'll stop right here. We have time for questions, right? Okay, cool.

Q: While talking about TensorFlow, you mentioned kernel implementations. Can you give a little more insight into kernel implementations in the case of multiple nodes and multiple clusters? Suppose multiple clients, like Python clients, are connecting over the network: how do load balancing, query prioritization, and object serialization take place?

A: We can discuss that offline, because I took TensorFlow just to illustrate the complexity of distributed systems; the focus of the talk is more on distributed systems in general, consensus, and so on. I mentioned kernels in the context of the fact that TensorFlow does many, many things, which makes it very complex to implement, and of the need for a consensus protocol in such a complex system. That's the context in which I talked about it, but we can discuss the details offline; I don't want to take too much time on it, since it's a bit of a digression. Yeah, go ahead.

Q: Hi. You mentioned earlier in your talk that there were some systems trading off consistency when they didn't need to?

A: Right. For example, a system tends to say, "we sacrifice consistency because the CAP theorem tells us so-and-so", but even in the face of a non-partition event, and a partition is a rare event, even when there is no partition, they sacrifice consistency. That is surprising: they should not be doing that, but they are. That's the point I was making, that the consistency-latency tradeoff becomes very important in that case, but they're not looking at that.

Do we have another question? There's time for one more; hand the mic over.

Q: So what's the catch? I mean, if these NewSQL systems scale as well as the NoSQL ones, and they give all the goodness of traditional SQL, then basically they'll take over both of the other categories, right? So what's the catch?

A: My view, and in fact I wrote a blog post saying that Spanner and the advent of the NewSQL systems could mean the end of the NoSQL world as we know it, is that it's not just NoSQL; even the traditional SQL systems are affected. The traditional SQL systems have their own use cases: for example, smaller data sets on which you need to run more computation, a rather specialized kind of system, so they might still be there. But the NoSQLs of the world have to be really aware of what's going on, because the fact that Spanner and some of these systems can do ACID transactions at scale is significant, and most NoSQL systems
won't be able to do that. The real point is that NoSQL systems explicitly give up ACID transactions, while the NewSQL systems allow them, though with a penalty to be paid at scale. That's really the difference. But you're right, the NoSQLs have to be aware of this. That's the answer to your question, but we can also talk offline.

Moderator: There's also a question-and-answer session today after lunch at 2:40 with all of this morning's speakers, so you can hold your questions for them. Thank you very much.