All right. Hello, everyone. Let's get started. Today's topic is causal consistency, and I'm going to use the COPS paper that we read for today as a case study. The setting is familiar: we're talking again about big websites that keep data in multiple data centers, and they want to replicate all of their data in each of their data centers, both to keep a copy close to users and perhaps for fault tolerance. So as usual we have, say, three data centers. And because we're building big systems, we're going to shard the data: every data center has multiple servers, with, say, all the keys from A through M on one server and the keys from N through Z on another, and corresponding shards everywhere else. We've seen this pattern before, and there are many different designs for how to make it work, but the usual goals are the same. You'd certainly like reads to be fast, because these web workloads tend to be read-dominated; you'd like writes to work; and you'd like as much consistency as you can get. The fast reads are interesting because of who the clients are: there's going to be some set of clients of the storage system, which I'll call clients, but they're really web servers acting on behalf of users' browsers. And the typical arrangement is that at least the reads happen locally; writes might be a little more complicated. One system that fits this pattern is Spanner. You'll remember that Spanner writes involve Paxos running across all the data centers. So if a client in one data center needs to do a write, the communication involved requires Paxos, running on one of these servers, to talk to at least a majority of the other data centers that hold replicas.
So the writes tend to be a little bit slow, but they're very consistent. In addition, Spanner supports two-phase commit, so we have transactions. The reads are much faster, because reads use the TrueTime scheme that the Spanner paper described and really only consult the local replica. We also read the Facebook memcache paper, which is another design in this general pattern. In that paper, there's a primary site that holds the primary set of MySQL databases. If a client wants to do a write, and let's suppose the primary site is data center three, it has to send all writes to data center three, and then data center three sends out new data, or invalidations, to the other data centers. So writes are actually a little bit expensive, not unlike Spanner. On the other hand, all the reads are local: when a client needs to do a read, it can consult a memcached server in the local data center, and those memcached reads are just blindingly fast. The paper reported that a single memcached server can serve a million reads per second, which is very fast. So again, the Facebook memcache scheme needs cross-data-center communication for writes, but the reads are local. The question for today, the question the COPS paper is answering, is whether we can have a system that allows writes to proceed purely locally, at least from the client's point of view, so that a client that wants to do a write can send it to the local replica in its own data center, as well as sending reads to just the local replica, and never has to talk to or wait for other data centers. What we really want is a system with local reads and local writes. That's the big goal, and it's really a performance goal.
This would help performance, of course, because unlike in Spanner and the Facebook paper, purely local writes would be much faster from the client's point of view. It might also help with fault tolerance and robustness: if writes can be done locally, then we don't have to worry about whether other data centers are up or whether we can reach them quickly, because clients never wait for them. So we're going to look for systems with this level of performance, and we're going to let the consistency model trail along behind. If you only apply writes initially to the local replica, what about the other data centers' replicas of the same data? We'll certainly be worried about consistency, but the attitude for this lecture, at least, is: first figure out how to get good performance, then figure out how to define consistency and decide whether what we get is good enough. Okay, so that's the overall strategy. I'm going to talk about two straw-man designs, two okay-but-not-great designs, before actually talking about how COPS works. First, the simplest design I can think of that follows this local-write strategy; I'll call it straw man one. In straw man one, we have three data centers, and let's just assume the data is sharded two ways in each of them: keys from A to M and keys from N to Z, sharded the same way in each data center. Clients read locally, and if a client needs to write, say, a key that starts with M, the client sends the write to the local shard server responsible for keys starting with M, and that shard server replies to the client immediately, saying: yes, I did your write.
But in addition, each server maintains a queue of outstanding writes it has recently received from clients and needs to send to other data centers, and it streams those writes asynchronously, in the background, to the corresponding servers in the other data centers. So after replying to the client, our shard server sends a copy of the client's write to each of the other data centers. Those writes go through the network, maybe they take a long time, but eventually they arrive at the other data centers, and each of those shard servers then applies the write to its local table of data. This is a design with very good performance: writes are all done locally, so clients never have to wait, and there's a lot of parallelism, because the shard server for the A keys and the shard server for the M keys are completely independent. If the A shard server gets a write, it has to push the data to the corresponding shard servers in other data centers, but it can do those pushes independently of the other shard servers' pushes. So there's parallelism both in serving requests and in pushing the writes around. If you think about it a little, this design also effectively favors reads: reads never have any impact beyond the local data center, while writes do a bit of work, because the shard server has to push each write out to the other data centers, after which reads of that new data at the other data centers proceed very quickly. So reads involve less work than writes, and that's appropriate for a read-heavy workload. If you were more worried about write performance, you could imagine other designs, for example a design in which reads actually have to consult multiple data centers and writes are purely local.
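Before considering the alternatives, straw man one itself can be sketched in a few lines of Python. This is my own minimal sketch, not code from the paper; names like `ShardServer` and `apply_remote` are invented for illustration:

```python
import queue
import threading

class ShardServer:
    """One shard in one data center (straw man one, as a sketch)."""

    def __init__(self):
        self.store = {}                # key -> value
        self.outbox = queue.Queue()    # writes awaiting async replication
        self.peers = []                # corresponding shards in other DCs
        threading.Thread(target=self._replicate, daemon=True).start()

    def put(self, key, value):
        # Apply locally and acknowledge immediately; the client never
        # waits for other data centers.
        self.store[key] = value
        self.outbox.put((key, value))
        return "ok"

    def get(self, key):
        # Reads are purely local.
        return self.store.get(key)

    def apply_remote(self, key, value):
        # Called when a write streams in from another data center.
        self.store[key] = value

    def _replicate(self):
        # Background thread: stream queued writes to each peer shard.
        while True:
            key, value = self.outbox.get()
            for peer in self.peers:
                peer.apply_remote(key, value)
```

Note there's no ordering or conflict handling here at all; that's exactly the weakness the rest of the lecture is about.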
So you can imagine a scheme in which, when you do a read, you fetch the current copy of the key you want from each of the data centers and choose the most recent one; then writes are very cheap and reads are expensive. Or you can imagine combinations of these two strategies, some sort of quorum-overlap scheme, where you write to only a majority of data centers and read from a majority of data centers, and rely on the overlap. And in fact there are real systems, used commercially on real websites, that follow much this design. If you're interested in a real-world version of it, you can look up Amazon's Dynamo system or the open-source Cassandra system. They're both much more elaborate than what I've sketched out here, but they follow the same basic pattern. The usual name for this kind of scheme is eventual consistency, and the reason is that, at least initially, if you do a write, readers at other data centers are not guaranteed to see it. But they will someday, because you're pushing out the writes, so they'll eventually see your data. There's no guarantee about order, though. For example, if I'm a client and I write a key starting with M and then a key starting with A, the shard server for M sends out one write and the shard server for A sends out the other, but these may travel at different speeds, or along different routes, over the wide-area network. Maybe the client wrote M first and then A, but the update for A arrives first and the update for M second at some other data center. So different clients are going to observe updates in different orders. There's no ordering guarantee.
The ultimate meaning of eventual consistency is that if things settle down, people stop writing, and all of these write messages finally arrive at their destinations and are processed, then an eventually consistent system ought to end up with the same values stored at all of the replicas. That's the sense in which it's eventually consistent: if you wait for the dust to settle, everybody ends up with the same data. And that's a very weak spec. But because it's a loose spec, there's a lot of freedom in the implementation and a lot of opportunity for good performance, because the system basically doesn't require you to do anything instantly or to observe any ordering rules. That's quite different from most of the consistency schemes we've seen so far. As I mentioned, eventual consistency is used in deployed systems, but it can be quite tricky for application programmers. So let me sketch an example of something you might want to do in a website where you'd have to be pretty careful; you might be surprised by how an eventually consistent system behaves here. Suppose we're building a website that stores photos. Every user's photos are stored as key/value pairs, with some sort of unique ID as the key, and every user has a list of their public photos, the ones they allow other people to see. Suppose I take a photograph and want to insert it into the system: I contact a web server, and the web server runs code that inserts my photo into the storage system and then adds a reference to the photo to my photo list. Say this happens on client C1, which is the web server I'm talking to, and the code calls the put operation for my photo. It really should be a key and a value; I'm just going to write it as a key plus a value.
So I insert my photograph, and then, when that put finishes, I add the photo to my list. That's what my client's code looks like. Somebody else looking at my photo list will fetch a copy of the list and then look at the photos on it: client C2 calls get on my list, looks down the list, sees the photo I just uploaded, and calls get with that photo's key. This is totally straightforward code that looks like it ought to work, but in an eventually consistent system it's not necessarily going to. The problem is that these two puts, even though the client did them in such an obvious order, first inserting the photo and then adding a reference to it to my list, can be reordered: in the eventually consistent scheme I outlined, the second put could arrive at other data centers before the first put. So this other client, reading at a different data center, might see the updated list with my new photo in it, but when it goes to fetch the photo that's on the list, the photo may not exist yet, because the first write may not have arrived over the wide-area network at client C2's data center. This is just going to be a routine occurrence in an eventually consistent system, if we don't do anything more clever: behavior where the code looks like it ought to work at some intuitive level, but when you actually go and read the spec for the system, which is to say, no guarantees, you realize this correct-looking code may totally not do what you think it's going to do.
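Here is that scenario as runnable code, a toy sketch of my own (the `DataCenter` class and key names are invented); replication is simulated by hand-delivering the writes in the reordered sequence the network might produce:

```python
class DataCenter:
    """A toy replica; in the real system, replication is asynchronous."""
    def __init__(self):
        self.store = {}
    def put(self, key, value):
        self.store[key] = value
    def get(self, key):
        return self.store.get(key)

# Client C1 writes at data center 1: first the photo, then the list.
dc1 = DataCenter()
dc1.put("photo:123", "cat.jpg")        # first put
dc1.put("list:alice", ["photo:123"])   # second put

# The network delivers the two writes to data center 2 in the opposite
# order; suppose only the list write has arrived so far.
dc2 = DataCenter()
dc2.put("list:alice", ["photo:123"])   # second put arrives first
# ... the photo write is still in flight ...

# Client C2 reads at data center 2.
photo_list = dc2.get("list:alice")
assert photo_list == ["photo:123"]     # sees the new list
photo = dc2.get("photo:123")
assert photo is None                   # but the photo isn't there yet
```

The two asserts at the bottom are exactly the anomaly: the list mentions a photo that doesn't exist at that replica yet.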
These are often called anomalies, and the way to think about it is that this behavior, seeing the photo on the list when the photo doesn't exist yet, is not an error. It's not incorrect, because after all the system never guaranteed that this code would actually yield the photo here. It's just weaker than you might have hoped. It's still possible to program such a system, and people do it all the time; there's a whole bag of tricks you can use. For example, a defensive programmer might write code knowing that something on the list may not really exist yet, so if you see a reference to a photo in the list and the get of that photo comes back empty, you just wait a little bit and retry, because by and by the photo will probably show up; and if it doesn't, you skip it and don't display it to the user. So it's totally possible to program in this style, but we could definitely hope for behavior from the storage system that's more intuitive than this, behavior that would make the programmer's life easier. You could imagine systems with fewer anomalies than a very simple eventually consistent one. Okay, before I go on to talking about how to make the consistency a little bit better, I want to discuss something important I left out of this eventual-consistency system: how to decide which write is most recent. For any data that might ever be written by more than one party, there's the possibility that we have to decide which data item is newer. So suppose we have some key, call it k, and two writes to it: two clients launch writes for k, one client writes a value of one and another client writes a value of two.
We need to set up the system so that all three data centers agree on the final value of key k, because after all we're at least guaranteeing eventual consistency: when the dust settles, all the data centers have the same data. So data center three is going to receive these two writes and pick one of them as the final value for k; data center two sees the same pair of writes, its own plus the one from data center one; and they had all better make the same decision about which becomes the final value, regardless of the order the writes arrive in. Data center three may observe them arriving in one order and some other data center in a different order, so we can't just accept whichever arrives second as the final value; we need a more robust scheme for deciding what the most recent value of a key is.

So we're going to need some notion of version numbers, and the most straightforward way to assign version numbers is to use wall-clock time. The idea is that when a client generates a put, either the client or the local shard server it talks to looks at the current time, say it's 1:25 right now, and associates that time, as a version number, with its version of the key. You'd actually both store the timestamp in the database and annotate the write messages sent between data centers with it. So maybe one write was issued at time 102 and the other at time 103. Suppose the write stamped 103 arrives first: data center three puts the key in its database with timestamp 103, and when the write stamped 102 arrives, the data center says, actually, that's an older write, and just ignores it, because it has a lower timestamp than the timestamp already stored. And of course if they had arrived in the other order, data center three would have stored the write stamped 102 briefly, until the write with the higher timestamp arrived and replaced it. Since everybody sees the same timestamps, once all the write messages are finally received over the internet, everyone ends up with databases holding the highest-numbered values.

Okay, so this almost works, but there are two little problems with it. One is that two data centers doing writes at the same time may actually assign the same timestamp. This is relatively easy to solve, and the way it's typically done is that timestamps are actually pairs: the time in the high bits, and some sort of unique identifier, it could be almost anything as long as it's unique, like a data-center or server ID, in the low bits, just to make all timestamps from different data centers or different servers distinct. Then if two writes arrive carrying the same time from different data centers, they'll have different low bits, and those low bits disambiguate which of the two writes has the lower timestamp and therefore yields to the one with the higher timestamp. So we stick some sort of ID in the bottom bits; the paper talks about doing this, and it's very common.

The other problem is that this system works well only if the clocks at all the data centers are exactly synchronized, which is something the Spanner paper stressed at great length. If the clocks on all the servers at all the data centers agree, this is going to be okay; but if the clocks are off by seconds, or maybe even minutes, then we actually have a serious problem. One not-so-important consequence is that writes that come earlier in real time, but are issued by a server whose clock is wrong, can be assigned high timestamps and therefore never be superseded by writes that come later in real time. Now, we never made any guarantees about this; our eventual consistency never promised that writes that come later in time win over writes that came earlier, so we don't really owe the clients anything in this department. Nevertheless, the system already has weak enough consistency; we don't want it to have needlessly strange behavior on top of that. Users really will notice if they update something, update it again later, and the later update doesn't seem to take effect, because the earlier update was assigned a timestamp that was too large. In addition, if some server's clock is, say, a minute fast, and it does a write, then there'll be a whole minute during which no other write can take effect, because no other write will have a high enough timestamp; we have to wait for all the servers' clocks to catch up to the fast server's clock before anybody else can do a write to that key.

One way to solve that problem is an idea called Lamport clocks. The paper talks about this, although it doesn't really say why it uses Lamport clocks; I'm guessing it's at least partially for the reason I just outlined. A Lamport clock is a way to assign timestamps that are related to real time but cope with this problem of some servers' clocks running too fast. Every server keeps a value called Tmax, the highest version number it has seen so far from anywhere else. So if somebody else is generating timestamps that are ahead of real time, then all the other servers that see those timestamps will have Tmax values that reflect them, ahead of real time. When a server needs to assign a timestamp, or version number, to a new put, it takes the maximum of Tmax plus one and the wall-clock time. These are the version numbers that accompany the values in our eventually consistent system. So each new version number is higher than the highest version number seen so far, in particular higher than whatever the last write to the data we're updating was, and at least as high as real time. If nobody's clock is ahead, Tmax plus one will usually be smaller than the wall-clock time, and timestamps end up just being real time. If some server has a crazy clock that's too fast, then that causes all the other servers, at least the ones that see its updates, to advance their Tmax, so that when they allocate new version numbers, those are higher than the version number of whatever late write they saw from the server whose clock is too fast. Okay, so that's Lamport clocks, and that's how the paper assigns version numbers. They come up all the time in distributed systems.

All right, another problem I want to bring up about our eventually consistent system is what to do about concurrent writes to the same key, and, even worse, the possibility that concurrent writes might both carry important information that ought to be preserved. So again, suppose we have two data centers, and two different clients, client one and client two, each issue a put to the same key, and both of these puts get sent to data center three. The question is what data center three should do with the two pieces of information. This is a real puzzle, actually, and there is not a good answer. What the paper uses is last-writer-wins: data center three looks at the version number assigned to each write, and one of them will be higher, whether because
it's slightly later in time or because its data-center ID in the low bits is higher, and data center three simply throws away the data with the lower timestamp and accepts the data with the higher timestamp. That's it: it's using this last-writer-wins policy. That has the virtue of being deterministic, so everybody gets the same answer, because everybody is looking at the same timestamps, and it satisfies eventual consistency. But you can think of examples where it's terrible. For example, suppose what these puts are trying to do is increment a counter: both clients saw the counter with value 10, both added one, and both put 11. What we really wanted was for both increments to take effect and for the counter to end up with value 12. In that case last-writer-wins is really not that great; what we really would have wanted was for data center three to somehow combine the one increment and the other and end up with a value of 12. These put/get systems are generally not powerful enough to do that, but we'd like better. What we'd really like is more sophisticated conflict resolution. The most powerful way other systems we've seen solve this is to support real transactions, like a database such as MySQL: instead of just put and get, it actually has increment operations that do atomic, transactional increments on the data, so that increments aren't lost. Transactions are maybe the most powerful way of resolving conflicting updates. We've also seen some systems that support a notion of mini-transactions, where at least on a single piece of data you can have atomic operations like atomic increment or atomic test-and-set. You can also imagine wanting a system that does some sort of custom conflict resolution. Suppose the value we're keeping here is a shopping cart with a bunch of items in it, and our user, maybe because they're running two windows in their web browser, adds two different items to their shopping cart from two different web servers. We'd like these two conflicting writes to the same shopping cart to resolve, probably by taking the set union of the two shopping carts, instead of throwing one away and accepting the other. I'm bringing this up not because there's a satisfying solution; indeed, the paper doesn't really propose much of one. It's just a drawback of weakly consistent systems that it's easy to get into situations where conflicting writes to the same data really call for sophisticated, application-specific resolution, and that's generally quite hard, a thorn in people's sides that typically has to be lived with. That goes both for my eventual-consistency straw man and for the paper's system; the paper hints in a couple of paragraphs that it could be used to do better, but doesn't really explore that, because it's actually quite difficult.

Okay, back to eventual consistency and my straw-man system. If you recall, it had a real problem with even this very simple scenario: we did a put of a photo and a put of the photo list, and then somebody else in a different data center reads the new list, but when they read the photo, they find there's nothing there. So can we do better than that? Can we build a system that still allows local reads and local writes but is maybe slightly less anomalous? I'm going to propose one, straw man two, that's one step closer to the paper's design. In this scheme I'm going to propose a new operation: not just put and get, but also a sync operation that clients can use. You give sync a key and a version number, and what sync does, when the client calls it, is wait until every data center's copy of key k is at least as up to date as the specified version number. So it's a way of
forcing order: the client can say, look, I'm going to wait until every data center knows about this value, and only proceed after everyone does. And in order for clients to know what version numbers to pass to sync, we're going to change the put call a bit: put takes a key and a value, and returns the version number of the updated key. You could think of sync as acting as a sort of barrier, or fence; we could call this scheme eventual consistency plus barriers, with the sync call as the barrier. I'm going to talk about how it's used in a moment, but keep in mind that this sync call is likely to be pretty slow, because the natural implementation actually goes out and talks to all the other data centers, asks each of them, is your version of key k up to at least this version number, and then has to wait for the data centers to respond; and if any of them says no, it has to keep waiting until that data center says yes.

All right, so how would you use this? Again, for our photo list: client one, the one updating the photos, calls put to insert the photo and gets back a version number. The programmer now has to keep the danger in mind: I'm about to update the photo list, but what if some other data center hasn't seen my photo yet? So the programmer calls sync on the photo's key with the version number that put returned, waiting for all data centers to have at least that version, and only after the sync returns does client one call put to update the photo list. Now suppose client two comes along and wants to read the photo list and then read a photo. Client two does a get of the photo list, say some time later, and if it sees the photo on that list, it does a get, again in its local data center, of the photo. And now we're in a much better situation: if client two, in a different data center, saw the photo in the list, that means client one had already called put on the list, because it's that put that adds the photo to the list. And if client one had already called put on the list, then, given the way this code works, client one had already called sync, and sync doesn't return until the photo is present at all data centers. So the programmer for client two can rely on this: the photo is in the list, which means that whoever added it there had their sync complete, and a completed sync means the photo is present everywhere, and therefore this get of the photo will actually return the photograph. Okay, so this works, and it's actually reasonably practical. It does require fairly careful thought on the part of the programmer. The writer has to think: aha, I need a sync here, put, then sync, then put, in order for things to work out. The reader's side is much faster, but the reader still has to think too: if the reader does a get of the list and then a get of a photo from that list, the programmer has to verify that the code that modified the list really did call sync before adding things to the list. That does require a bit of thought. What the sync call is all about is enforcing order: making sure the first put completely finishes everywhere before the second happens. The sync explicitly forces order for writers; readers also have to think about order. The order is obvious in this example, but it is true in general that if the writer did a put, then a sync, then a put of a second thing, then readers almost always need to read the second thing and then the first thing, because the guarantee you get out of this sync
scheme out of these barriers is that if a reader sees the second piece of data, then it's guaranteed to also see the first piece of data. That means readers need to read the second piece of data first, and then the first item of data. Okay, there's a question about fault tolerance: namely, if one data center goes down, the sync blocks until that data center is brought back up. That's absolutely right; you're totally correct, this is not a great scheme. It's a straw man on the way to COPS, and this sync call would block. The version of this that people actually use in the real world, to avoid the problem of a down data center making sync block forever, is that puts and gets both consult a quorum of data centers. The sync then only waits for, say, a majority of data centers to acknowledge that they have the latest version of the photo, and a get has to consult an overlapping majority of data centers in order to get the data. So real versions of this are perhaps not as rosy as I may be implying. The systems that work this way, if you're interested, are Dynamo and Cassandra, and they use quorums to avoid the fault-tolerance problem. Okay, so this is a straightforward design, and it has decent semantics, even though it's slow and, as you observed, not very fault tolerant. The read performance is outstanding because reads are still local, at least if the quorum setup is read-one/write-all. The write performance is not great, but it's okay if you don't write very much, or if you don't mind waiting. The reason you can maybe convince yourself that the write performance is not a disaster is that, after all, the design in the Facebook memcached paper has to send all writes through the primary data center. Facebook runs multiple data centers and clients talk to all of them, but the writes all have to be sent to the MySQL databases at the one primary data center. Similarly, Spanner writes have to wait for a majority of replica sites to acknowledge the write before the client is allowed to proceed. So the notion that writes might have to wait to talk to other data centers, in order to allow reads to be fast, does not appear to be outrageous in practice. Still, you might nevertheless like a system that does better than this: to somehow keep the semantics of sync, which forces the first put to definitely appear to everyone to happen before the second put, without the cost. So we'll be interested in systems, and this is starting to get close to what COPS does, in which instead of forcing clients to wait at this point, we encode the order as a piece of information that we tell the readers, or tell the other data centers. A simple way to do that, which the paper mentions as a non-scalable implementation, is a logging approach. At each data center, instead of having the different shard servers talk to their counterparts in other data centers independently, we have a designated log server that's in charge of sending writes to the other data centers. That means if a client does a put to its local shard, that shard server, instead of sending the data out separately to the other data centers, talks to its local log server and appends the write to the one log that this data center is accumulating. Then if the client does a write to a different key, say we're writing key A and then key B, again instead of that shard server sending the write to key B independently, it's going to tell the local log server to append the write to the log, and the log server sends out its log to the other data centers in log order. So all data centers are guaranteed to see the write to A first, and they'll process the write to A first, and then all data centers will see the write to B. That means if a client writes A first and then writes B, the writes show up in that order in the log, A and then B, and they're sent in that order, the write to A first and then the write to B, to each of the other data centers. They probably actually have to be sent to a single log-receiving server at each data center, which plays out the writes one at a time as they arrive, in log order. So this is the logging strategy that the paper criticizes. It actually regains some of the performance we want, because now the clients no longer wait: we've eliminated the syncs, the clients can go back to just calling put(A) and then put(B), and client puts can return as soon as the data is sitting in the log at the local log server. So client puts and gets are quite fast again, but we're preserving order, basically through the sequence numbers of the entries in the logs, rather than by having the clients wait. That's nice: we get the order, we're forcing ordered writes, and we're causing the writes to show up in order at the other data centers, so that reading clients will see them in order. My example application would actually work with this scheme. The drawback the paper points to about this style of solution is that all the writes now have to go through the one log server. If we have a big database, with maybe hundreds of servers serving, at least in total, a reasonably high write workload, all the writes have to go through this log server, and possibly all the writes have to be played out through a single receiving log server at the far end.
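The log-server straw man can be sketched in a few lines. This is my own hypothetical simplification, not the paper's implementation: the class names, the single-process setup, and the in-memory stores are all invented for illustration. One local log server assigns sequence numbers to every write in the data center, and a single receiver at each remote data center applies entries strictly in log order, holding back anything that arrives out of order.

```python
class LogServer:
    """One per data center: serializes all local writes into one log."""
    def __init__(self):
        self.log = []    # ordered (seq, key, value) entries
        self.peers = []  # log receivers at remote data centers

    def append(self, key, value):
        seq = len(self.log)
        self.log.append((seq, key, value))
        # Ship the write to every remote data center, tagged with its
        # position in the log.
        for peer in self.peers:
            peer.deliver(seq, key, value)
        return seq


class LogReceiver:
    """Single receiver per remote data center: applies writes in log order."""
    def __init__(self):
        self.store = {}    # this data center's key/value state
        self.next_seq = 0  # next log entry we're allowed to apply
        self.pending = {}  # out-of-order arrivals, keyed by seq

    def deliver(self, seq, key, value):
        self.pending[seq] = (key, value)
        # Apply entries strictly in sequence-number order; anything that
        # arrives early waits in `pending`.
        while self.next_seq in self.pending:
            k, v = self.pending.pop(self.next_seq)
            self.store[k] = v
            self.next_seq += 1


# A client's two puts are ordered by the single local log, so every
# remote data center applies A before B, even if messages race.
local = LogServer()
remote = LogReceiver()
local.peers.append(remote)
local.append("A", "photo")
local.append("B", "photo-list")
assert list(remote.store) == ["A", "B"]
```

The single `LogServer` and `LogReceiver` in this sketch are exactly the bottleneck the lecture goes on to describe: every write in the data center, for every shard, funnels through them.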
And as the system grows, there are going to be more and more shards, and a single log server may stop being fast enough to process all these writes. So COPS does not follow this approach to conveying the order constraints to other data centers. Okay, so we want to build a system that can, at least from the client's point of view, process writes and reads purely locally. We don't want clients to wait in order to get order. We like the fact that writes are being forwarded asynchronously, but we want to eliminate the central log server: we want to somehow convey the order information to other data centers without having to funnel all our writes through a single log server. All right, that brings us to what COPS is actually up to. What I'm going to talk about now is what COPS does, although it's the non-GT version of COPS, COPS without get-transactions. The basic strategy is that as COPS clients read and write locally, they accumulate information about the order in which they're doing things, a little more fine-grained than the logging scheme, and that information is sent to the remote data centers whenever a client does a put. So we have this notion of a client context. As a client does gets and puts, say a get of x, then a get of y, then a put of z with some value, the library the client uses to implement put and get accumulates this context information on the side as the puts and gets occur. So if the client does a get of x and that yields some value with version 2, just as an example (the get returns the current value of x, and the current value has version 2), and the get of y returns the current value with version 4, then what's accumulated in the context is that this client has read x and got version 2. Then, after the get of y, the COPS client library adds to the context, so it records not just that we've read x and got version 2, but also that we've now read y and got version 4. When the client does a put, the information sent to the local shard server is not just the key and the value, but also these dependencies: we tell the local shard server for z that, before doing the put, this client had already read x and got version 2, and y and got version 4. What's going on here is that the client is expressing ordering information: the client had seen x version 2 and y version 4 before doing this put to z, so anybody else who reads this version of z had better also be seeing x and y with at least these versions. Similarly, if the client then does a put of something else, say q, what's sent to the local shard server is not just q and its value, but also the fact that this client had previously done those gets and the put. Let's suppose the put of z yields version 3: the local shard server says, I've assigned version 3 to your new value for z. Then when we come to do the put of q, it's accompanied by dependency information that says this put of q comes after the put that created z version 3. At least notionally, the rest of the context ought to be passed as well, although we'll see that, for various reasons, COPS can optimize this information away, and if there's a preceding put it sends only the version information for that put. There was a question: is it important for the context to be ordered? I don't believe so; I think it's sufficient to treat the context, or at least the information sent in the put, as just a big bag of dependencies, at least for non-transactional COPS. Okay, so the clients accumulate this
context, and basically send the context with each put. The context is encoding the order information that, in my previous straw man, straw man two, was enforced by sync. Instead of waiting, we accompany each put with a note: this put needs to come after these previous values. COPS calls each of these relationships a dependency, and it writes them as follows: supposing this put produces z version 3, there are really two dependencies here. One is that x version 2 comes before z version 3, and the other is that y version 4 comes before z version 3. This is just the definition, or notation, that the paper uses to talk about these individual pieces of order information that COPS needs to enforce. All right, so what does this dependency information, passed to the local shard server, actually cause COPS to do? Each COPS shard server, when it receives a put from a local client, first assigns the new version number, then stores the new value (it stores this new value for z along with the version number it allocated), and then it sends the whole mess to each of the other data centers. At least in non-GT COPS, the local shard server only remembers the key, the value, and the latest version number; it doesn't actually remember the dependencies, it only forwards them across the network to the other data centers. So now the position we're in is this: let's say a client did a put of z and some value, it was assigned version number 3, and it had these dependencies, x version 2 and y version 4. This is sent from data center one, let's say, to the other data centers, so data center two and data center three both receive it. In fact, this information is sent from the shard server for z: there are lots of shard servers, but only the shard server for z is involved in this. So at data center three, the shard server for z receives this put, sent by the client and forwarded by its local shard server. What this dependency information, that x version 2 and y version 4 come before z version 3, really means operationally is that this new version of z can't be revealed to clients in data center three until its dependencies, these versions of x and y, have already been revealed to clients there. That means the shard server for z must hold this write, must delay applying this write to z, until it knows that these two dependencies are visible in the local data center. So the shard server for z has to go off and actually send a message to the shard server for x and the shard server for y, asking: what's the current version number for x, and for y? It has to wait for the results. If the shard server for x reports a version number that's 2 or higher, and the shard server for y reports 4 or higher, then z can go ahead and apply the put to its local table of data. However, maybe those two shard servers haven't yet received the updates that correspond to version 2 of x or version 4 of y. In that case the shard server for z has to hold on to this update until the indicated versions of x and y have actually arrived and been installed on those two shard servers, so there may be some delays. Only after these dependencies are visible at data center three can the shard server for z go ahead and update its table so that z has version 3. And what that means, of course, is that if a client at data center three does a read of z and sees version 3, then, because z already waited, if that client then reads x or y, it's guaranteed to see at least version 2 of x and at least version 4 of y, because the shard server didn't reveal z until it was sure the dependencies would be visible. Okay, so a question: what if x and y never get their values, perhaps due to a network partition? Would the z shard block forever? Yes; the semantics require the z shard to block forever, that's absolutely true. There are two ways that might turn out okay. One is that somebody repairs the network, or whatever was broken, and x and y do eventually get their updates; then z will finally be able to apply the update, though it might have to wait a long time. The other possibility is that the data center is entirely destroyed, maybe the building burns down, and then we don't have to worry about this at all. But it does point out a problem that's a real criticism of causal consistency, which is that these delays can be quite nasty. Imagine z waiting for the correct value of x to arrive: even if there are no failures and nothing burns down, mere slowness can be irritating. It could be that the update for x has already arrived at its shard server, but that update itself had dependencies, maybe on some key a, so the shard server can't install it until the update for a arrives. And z still has to wait for all of that, because what z is waiting for is for this version of x to be visible to clients, which means it has to be installed. So if the update for x has arrived but is itself waiting on some other dependency, we get cascading dependency waits. In real life these probably would happen, and it's one of the problems people bring up against causal consistency when you
try to persuade them it's a good idea: this problem of cascading delays. So that's too bad, although on that note, the authors of the COPS paper have a couple of interesting follow-on papers, and one of them has some mitigations for this cascading-wait problem. Okay, so for our photo example, this COPS scheme will actually solve the problem. The reason is that the put we're talking about is the put for the photo list, and among the dependencies in its dependency list is the insert of the photo. That means that when the put for the photo list arrives at a remote site, the remote shard server is essentially going to wait for the photo to be inserted and visible before it updates the photo list. So any client at a remote site that is able to see the updated photo list is guaranteed to be able to see the photo as well. So this COPS scheme fixes the photo and photo-list example. The kind of consistency that COPS is implementing is usually called causal consistency. There's a question: is it still up to the programmer to specify the dependencies? No. It turns out the COPS client library can accumulate this context information automatically. The program only does gets and puts, and may not even need to see the version numbers. So a simple program just does gets and puts, and internally the COPS library maintains these contexts and adds the extra information to the put RPCs, so that the programmer just does gets and puts and the system automatically tracks the dependency information. That's very convenient. Just to pop up a level for a moment: we've now built a system whose semantics are powerful enough to make the photo example code work out correctly, to have the expected result instead of anomalous results. And at least arguably it's reasonably efficient, because the client never has to wait for writes to complete (there's none of this sync business) and the communication is mostly independent: there are no central log servers. So arguably this design is both reasonably high performance and has reasonably good semantics, reasonably good consistency. The consistency this design produces is usually called causal consistency, and it's actually a much older idea than this paper: there have been a bunch of causal consistency schemes before this paper, and indeed a bunch of follow-on work. It's an intriguing idea that people like a lot. As for what causal consistency means: here I'm putting up what I think is a copy of figure 2 from the paper. The definition says that the clients' actions induce dependencies, and there are two ways dependencies are induced. One is that if a given client does a put and then a get, or a get and then a put, or a put and then a put, then we say the later operation depends on the previous put or get; so in this case the put of y=2 depends on the put of x=1. The other form of dependency is that if one client reads a value out of the storage system, then we say that the get issued by that second client depends on the corresponding put, the put by a previous client that actually inserted the value. Furthermore, the dependency relationship is transitive: this get by client two depends on the put by client one, and by transitivity we can conclude that client two's get also depends on client one's earlier get. That means that the last put, by client three for example, depends on all of these previous operations. So that's the definition of causal dependency. A causally consistent system then says that if, through the definition of dependency I just outlined, b depends on a, and a client reads b, then the client must subsequently also be able to see a. So if a client ever sees the second of two operations ordered by dependency, the client is thereafter guaranteed to be able to see everything that operation depended on. That's the definition, and in a sense it's directly derived from what the system actually does. This is very nice when updates are causally related, that is, when the clients are in some sense talking to each other indirectly through the storage system. If somebody reads this value and sees 5, and inspects the code, they can conclude that there's a sense in which this put definitely must have come before that last put, and so if you see the last put, you really deserve to see the first put. In that sense causal consistency at least allows programmers to see well-behaved values coming out of the storage system. Another good thing about causal consistency is that when two updates in the system are not causally related, the causally consistent system, the COPS storage system, has no obligation to maintain any order between them, and that's good for performance. For example, if client one does a put of x and then a put of z, and around the same time client two does a put of y, there's no causal relationship between the put of y and any of the actions of client one, and so COPS is allowed to do all the work associated with the put of y completely
independently of client one's puts. The way that plays out is that the put of y happens entirely in the servers for y's shard, while client one's two puts involve only the servers for the shards that x and z are in. It may require some interaction, because the remote servers for z may have to wait for the put of x to arrive, but they never have to talk to the servers in charge of y. So that's a sense in which causal consistency allows parallelism and good performance. This is potentially different from linearizable systems: in a linearizable system, the fact that the put of y came after the put of x in real time actually imposes some requirements on the storage system, but there's no such requirement for causal consistency, and so you might be able to build a causally consistent system that's faster than a linearizable one. Okay, there's a question: would COPS gain any more information by keeping the gets in the client context after a put? This may be a reference to today's lecture question, so why don't I explain the answer to that. Suppose a client does a get of x, then a put to y, then a put to z. Initially the context is x at some version, and when the client sends the put of y to the server, it includes this context along with it. But in the actual system there's an optimization: after a put, the context is replaced by simply the version number returned by the put, and any previous stuff in the context, namely this information about x, is erased from the client's context. So after the put, the context is just the version number returned from the put; let's suppose this put returns version 7 of y. The reason this is correct, and doesn't lose any information for non-transactional COPS, is that when the put of y is sent out to all the remote sites, it's accompanied by x and its version in its dependency list, so that put won't be applied at any data center until that version of x has also been applied there. Then when the client does the put of z, what's sent to the other data centers is a put of z with some value whose dependency is just y version 7. Before applying z, the other data centers check that y version 7 has been applied at their data center. But we know y version 7 won't be applied at their data center until the right version of x has been applied there. So the dependencies cascade: telling other data centers to wait for y version 7 to be installed implies that they must also already be waiting for whatever y version 7 depended on. Because of that, we don't need to also include the x version in z's dependency list; those data centers will already be waiting for that version of x. So the answer to the question is no: non-transactional COPS doesn't need to remember the gets in the context after it's done a put. All right, a final thing to note about this scheme is that COPS only sees certain relationships; it's only aware of certain causal relationships. COPS is aware that if a single client thread does a put and then another put, the second put depends on the first. Furthermore, COPS is aware that when a client reads a certain value, it's depending on the put that created that value, and therefore on anything that that put depended on. So COPS is directly aware of those dependencies. However, it is often the case that causality in the larger sense is conveyed through
channels that COPS is not aware of. For example, suppose client one does a put of x, and then the human controlling client one calls up client two on the telephone (or sends an email, or whatever) and says: look, I just updated the database with some new information, why don't you go look at it. Then client two does a get of x. In the larger sense, causality would suggest that client two really ought to see the updated x, because client two knew from the telephone call that x had been updated. If COPS had known about the telephone call, that is, if the telephone call had itself been a put, there would have been a put of the telephone call here and a get of the telephone call there, and if that get had seen that put, COPS would know enough to arrange that client two's get would see client one's put. But because COPS was totally unaware of the telephone call, there's no reason to expect this get to actually yield the new value. So COPS is enforcing causal consistency, but only for the kinds of causation that COPS is directly aware of. That means the sense in which COPS's causal consistency eliminates anomalous behavior holds only if you restrict your notion of causality to what COPS can see. In the larger sense you're still going to see odd behavior: you're definitely going to see situations where someone believes a value has been updated and yet does not see the updated value, because their belief was caused by something COPS wasn't aware of. Another potential problem: you remember that for the photo example, with the photo list, there was a particular order for adding a photo, and a particular, different order for looking at photos, that made the system work with causal consistency. We're definitely relying on the reader reading the photo list and then reading the photo, in that order, so that the fact that a photo is referred to in the photo list means the read of the photo will succeed. It is however the case that there are situations where no order of reading, or combination of orders of reading and writing, produces the behavior we want. This is leading to transactions, which I won't have enough time to explain fully, but at least I want to mention the problem the paper sets up. Suppose we have our photo list, and it's protected by an access control list, which is basically a list of user names that are allowed to look at the photos on my list. That means the software that implements these photo lists with access control lists needs to read the photo list and read the access control list, and check whether the user trying to do the read is in the access control list. However, neither order of getting the access control list and the list of photos works out. If the client code first gets the access control list and then gets the list of photos, that order doesn't always work well, because suppose my client reads the access control list and sees that I'm on it, but then, right at this point, the owner of the photo list deletes me from the access control list and inserts into the list a new photograph that I'm not supposed to see. So client two does a put of the access control list to delete me, and then a put of the photo list to add a photo I'm not allowed to see. Then my client gets around to the second get: it may see the now-updated list that has the photo I'm not allowed to see, but my client thinks, aha, I'm in the access control list, because it read an old version, and here's this photo, so I'm allowed to see it. In that case we're getting what we know to be an inconsistent combination of a new photo list and an old access control list, but causal consistency allows this: causal consistency only says that every time you do a get, you'll see data that's at least as new as the dependencies. And indeed, as the paper points out, if you think it through, it's also not correct for the reading client to first read the list of photos and then read the access control list, because, sneaking in between the two gets, the owner of the list may delete a private photo and add me to the access control list. The list I read may have a photo I'm not allowed to see, and at that time the access control list didn't include me, but by the time of my second get I may see myself in the access control list. So in this order it's also not right, because we might get an old list and a new access control list. So causal consistency, as I've described it so far, isn't powerful enough to deal with this situation: we need some notion of getting a mutually consistent photo list and access control list, either both from before some update or both from after. COPS-GT actually provides a way of doing this, by essentially doing both gets. COPS-GT sends the full set of dependencies back to the client when it does a get, which puts the client in a position to check the dependencies of both returned values and notice that, aha, the dependency list for the access control list mentions a version of the photo list that's further ahead than the version of the list actually returned, in which case COPS-GT refetches the data. All right, one more question: is causal consistency related to the thread of execution? Yeah, it's true that causal consistency doesn't
really involve wall-clock time; it has no notion of wall-clock time. The only forms of order it obeys that are even a little bit related to wall-clock time are that if a single thread does one thing, then another, then another, causal consistency considers those three operations to be in that order. But that's because one client thread did that sequence of things, not because there was a real-time relationship. So, to wrap up and put this into a larger context: causal consistency has been, and is, a very promising research area, and has been for a long time, because it does seem like it might provide you with good enough consistency but also more opportunities than linearizability for high performance. However, it hasn't actually gotten much traction in the real world. People use eventually consistent systems and they use strongly consistent systems, but it's very rare to see a deployed system use causal consistency, and there are a bunch of potential reasons for that, though it's always hard to tell exactly why people do or don't use some technology in real-world systems. One reason is that it can be awkward to track per-client causality in the real world: a user and their browser are likely to contact different web servers at different times, which means it's not enough for a single web server to keep a user's context; we need some way to stitch together the context for a single user as they visit different web servers at the same website, and that's painful. Another is the problem that COPS only tracks the causal dependencies it knows about, which means it doesn't provide ironclad causality, only certain kinds of causality, and that limits how appealing it is. Another is that eventually consistent and causally consistent systems can provide only the most limited notion of transactions, and people, more and more as time goes on, wish their storage systems had transactions. Finally, the amount of overhead required to push around, track, and store all that dependency information can be quite significant. I was unable to detect this in the performance section of the paper, but the fact is that it's a lot of information that has to be stored and pushed around, and if you were hoping for the millions-of-operations-per-second level of performance that Facebook was getting out of memcached, the overhead you'd pay for causal consistency might be extremely significant. So those are reasons why causal consistency maybe hasn't caught on so far, although maybe someday it will. Okay, that's all I have to say. Starting next lecture we'll be switching gears away from storage to a sequence of three lectures that involve blockchains. See you on Thursday.