Okay, welcome back everybody to CS162. Believe it or not, this is our last lecture of the term. It's pretty hard to believe the term is over already, but it's been rushing by and we've reached the end. Today we're gonna pick up where we left off last week with distributed storage, and then we're gonna go on to key-value stores, and Chord in particular as an algorithm for building them. And if you recall, we were also talking about TCP last time, and in particular the congestion avoidance algorithm. The congestion avoidance algorithm is really the thing that makes TCP a good-citizen protocol, as opposed to UDP. The idea is to pick an amount of outstanding data to put into the network before we hear back ACKs, so that we avoid overloading the network while at the same time maximizing our pipelining. And the assumption here is that most data loss is caused by overloading: most of the links that packets traverse are pretty reliable and very rarely lose data, so really the loss happens because too many clients are sending data through some node in the middle of the network, and that node starts having to drop packets because there isn't enough buffer space. So this is really a question of the sender's window size. The window size is the amount of data, in bytes, that a sender has outstanding before it receives an ACK, and we're gonna adjust it to maximize pipelining while avoiding congestion. This sender's window can't be larger than the receiver's advertised window, because if we have too much data outstanding and the receiver can't forward data to its applications, the receiver would have to start dropping packets, and that would be very unfortunate. So what we're gonna do is try to match the rate of sending packets to the rate that the slowest link can accommodate. The sender uses an adaptive algorithm, which we talked about, to decide the window size, and the goal is basically to fill the network between the sender and the receiver with data. The basic technique is to slowly increase the size of our window until acknowledgements start being delayed or lost, and at that point we know we have too much data in the network. As an example, which I didn't mention explicitly last time: how much data do we want in the network, barring congestion? If we knew the round-trip latency, the time for data to go to the destination and come back, we would like the round-trip latency times the bandwidth of the slowest link to be the amount of data we have outstanding in the network. We want the ACK to arrive just as we're about to send the next packet, so we always have the maximum amount of data outstanding. So for instance, if you've got a hundred-millisecond round-trip time, you take a hundred milliseconds times the bandwidth of the slowest link, and that gives you the amount of data that should be outstanding. Now, there's a question here: is the window size a limit on the number of packets we can have outstanding? The answer is no, it's not packets, it's total data in bytes. If you remember, TCP is a stream-based protocol, so it makes its decisions based on the number of bytes that have been sent, not on the number of packets.
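Just to make that window-size calculation concrete, here's a minimal sketch of the arithmetic. The 100 ms round-trip time is the number from the example above; the 10 Mbit/s bottleneck bandwidth is a made-up figure purely for illustration:

```python
# Bandwidth-delay product: how much data we'd like to have "in flight" at once.
rtt_seconds = 0.100                     # round-trip time from the example above
bottleneck_bits_per_sec = 10_000_000    # assumed bandwidth of the slowest link

window_bits = rtt_seconds * bottleneck_bits_per_sec   # 1,000,000 bits
window_bytes = window_bits / 8                         # 125,000 bytes, about 122 KiB

print(f"target window: about {window_bytes:,.0f} bytes outstanding")
```

With roughly that many bytes outstanding, the ACK for the oldest data comes back just about when we're ready to send more, which is the full-pipelining behavior being described.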
And the ACKs, correspondingly, are based on which byte has been received at the destination. Okay, and there's another question here: when talking about bandwidth in general for TCP, are headers included in the bandwidth? In fact the headers are included, because they're overhead. You can think of headers as giving you an effective bandwidth for transmitting the actual data that's a little lower than the raw bandwidth of the bits being transmitted, and we need to take that into account when doing this calculation, because the data in the window is just data on the other end, not packet headers. Okay, so the TCP solution we mentioned was called slow start: you start sending slowly and keep increasing the window size for every successfully acknowledged segment, and then when you start losing packets you back off by dividing the window by two. That gives the classic sawtooth behavior in the amount of data TCP puts into the network, which we call additive increase, multiplicative decrease. So the other thing we talked about was the CAP theorem. The CAP theorem, as I mentioned before, was really a hypothesis or speculation on the part of Eric Brewer here at Berkeley back in the early 2000s. Consistency, availability, and partition tolerance are all properties of storage systems in a network, and the CAP theorem says you can have two of these, but not all three. The simplest way to keep it in mind: if you have consistency and availability but not partition tolerance, what that says is that as long as you never have any breaks in the network, you can always be available, clearly, because you can always route to wherever your data is stored, and you can always be consistent, because you can make sure that when you update the data, everybody who's reading it sees the update. Okay, and so this is called Brewer's theorem. It was originally a conjecture, as I mentioned, and then some ten years ago or so there was a call for papers about the CAP theorem, and a number of different theoretical papers appeared, all with slightly different assumptions and proving slightly different theorems. But by and large it's true, and it's an important thing to keep in mind when we're looking at distributed systems. So, back to consistency again. We were talking about the Network File System, NFS, which is one of the longest-lived and most popular network file systems over the years; it was originally from Sun Microsystems. And the protocol here is one of weak consistency. The client polls the server periodically to check for changes; it polls the server if the data hasn't been checked in the last three to thirty seconds. As a result, when a file is changed on one client, the server might be notified quickly, but other clients will continue to use old versions of blocks until a timeout. And so we showed this situation with two clients: the first client at the top left has an old version, V1, of the data; the second client has written that data, so it's got V2, and that's already gone back to the server, so the server has the most up-to-date value. What happens is, polling every now and then, the top client says, gee, is F1 still okay? So it's asking questions about the file attributes, the metadata.
And the server says no, there's a new updated version, in which case the client gets the updated version and will eventually see updated data. Now, I will point out that these updates are on a per-block basis, 4K blocks, which isn't necessarily the unit of data the clients care about. So that's an issue, all right? It's possible that the records the clients are storing in the files, on a byte basis, span blocks. And so this consistency is not only a little delayed, it may also show partially updated objects, which can be a problem. And so in this scenario, if multiple clients are simultaneously writing to the same file, in NFS you can get different versions of blocks and you can get inconsistent information. That led us to the sequential ordering constraints we were talking about, which is: what sort of cache coherence might we expect from a system like this? For instance, if one CPU changes a file and, before it's done, another CPU reads the file, what might we see? So here we have a scenario with three clients. What you can see is a situation where we start out with file content version A in all the blocks. Client one is busy reading, and they get A in this timeframe, and then they write B. Meanwhile, client two starts reading, over a period of time that somewhat overlaps those two writes. So in this case client two might get partially A and partially B, and then it might write C. And so when client one goes back to look at the data, it might get some weird combination of B and C, and client three might get some combination of B and C in here as well. So this is a potentially very scrambled consistency. And you might ask what you'd actually want. One option might be that we want the system to behave exactly the same way as if all three of these clients were really processes on the same physical node. What that would mean is that if a read finishes before the write starts, you get the old copy; if the read starts after the write finishes, you get the new copy; and otherwise you get something in between. That's almost what we've gotten here, but there's a little delay on it, three to thirty seconds, and that delay can add some strange effects if you're not careful, okay? So for NFS, basically, if the read starts more than thirty seconds after the write, you get the new copy; otherwise you could get some combination of partial updates. So what can we do about this to get something a little more consistent? There was another file system, the Andrew File System, that came later than the original NFS. And by the way, NFS came in several versions, one, two, three, four, each of which changed these semantics a little bit to give you better control over the consistency of the data, but it's still a weakly consistent model. In the Andrew File System, they adopted a callback model, where the server keeps track of everybody who has a copy of a file, and when changes happen, the server immediately tells everybody that there's been a change. As a result, the clients that get this information from the server can go ahead and invalidate the data in their caches, and thereby know that they've got the most up-to-date data. So what's good about this idea is there's no polling bandwidth; in the NFS file system, you have to keep polling over and over again.
And so if you have many, many clients, the polling bandwidth can be exorbitant on the server. In the Andrew File System, there's only traffic to the clients that actually have to know about a change to a particular file, okay? The downside, though, is that we've changed from the stateless model we had originally with NFS, where the server didn't have to know anything about the clients, to a stateful model where the server has to keep track of which clients have which files. And that means that if the server crashes and comes back up, it has to query all of the clients out there to find out what files they have cached, so it can get its callbacks set up again. Now, that takes care of the polling bandwidth of NFS. The other thing the Andrew File System does is write-on-close, which gives it better semantics for updates. The idea here is that when you open a file in the Andrew File System, you get a snapshot of the file at the time you opened it, and you continue to read a completely consistent version of the file even as other clients are writing it; it isn't until you close the file and reopen it that you get to hear about writes. Furthermore, clients that are writing data don't actually propagate that data to the server until they've closed the file again. And so the writes, a completely consistent set of writes to the file, are propagated to the server all as one grouping, okay? So we have session-level semantics: updates are visible to other clients only after the file is closed, and you don't get the partial writes I was talking about with NFS; it's really all or nothing. Those are different semantics than NFS, and much cleaner. So in AFS, basically, everyone who has a file open continues to see an old version. You can think of it as a snapshot, or a slightly outdated version if you like, but it's a fully consistent one, and you don't get any newer version until you reopen the file. The other thing the Andrew File System did was cache data on the local disk of the client as well as in memory. NFS basically used the buffer cache for caching data, so it was of limited size, whereas AFS can cache files entirely on the local disk, and so it can have a much larger cache. When you open with a cache miss, meaning the file's not on the local disk, you get the file from the server and set up a callback with the server. When you write, followed by a close, at that point you send your updated, consistent copy back to the server, and it tells all clients with copies to fetch new versions from the server. If the server crashes, you lose all the callback state and end up reconstructing that callback information from the clients, which means you have to ask every client which files it has cached. Now this callback process, by the way, doesn't have to be used on read-only partitions. In a read-only partition, nobody's gonna be updating anything, so we can cache pretty much all the files we wish, and the server doesn't have to keep track of callbacks because it never has to relay any changes. Okay, so the pros of AFS relative to NFS: less server load, since clients are not polling all the time like they were in NFS. And furthermore, the disk is a cache, so many more files can be cached locally.
Callbacks mean that the server is only involved in communication between clients when there's actual information to be forwarded, namely when a file that client A is updating is being looked at or cached by client B. So those are the positives of that callback system. For both AFS and NFS, however, the central server is a bottleneck. It's a performance bottleneck: all writes have to go through the server in both cases, and cache misses have to go through the server. There's also an availability issue: the server is a single point of failure in the system, so if it goes down, all of a sudden everybody who was using those files can no longer use them or communicate changes to each other, because the server is that central point. And so the server machine ends up having a relatively high cost relative to the workstations, because it needs to be beefier in both cases to handle all that traffic. So that leads us to a question: can we do better? Can we do something a little different that doesn't focus so much of the performance and reliability on a single system? And maybe to do that, we might consider changing our model a little bit. So what about sharing data rather than files? Up till now we've been talking about files, just like a regular file system: they're bags of bits that have names in some namespace, the hierarchical namespace starting from a root file system and working its way down. And we think of files as streams of bytes that we can open, modify in the middle, close, et cetera. Perhaps by changing the semantics of what we want for our data, we might be able to do something larger. Okay, and so what we're gonna think about instead is key-value stores. A key-value store is used pretty much everywhere. Lots of languages that you're used to, Perl, Python, Go, C#, have associative arrays, where your key can be an arbitrary string and you can say, array indexed by string equals an arbitrary value, and then get it back later by indexing the array with that key, all right? Now, a question we have here: is this somewhat like JSON? JSON is a serialization format for objects, and so in a key-value store, you could take a JSON object and store it as the value. So yes, you could do that with it. And so rather than thinking about a shared distributed file system, perhaps we want to think about a shared distributed key-value store, and use that instead of message passing or file sharing as our communication mechanism, okay? And once we think of that, the question will be: can we make it scalable and reliable? That's gonna be the rest of this lecture. So just to emphasize this a little bit: the key-value store is an extremely simple interface relative to a file system. We have two operations, put and get. Put takes a key and a value and inserts the pair into the key-value store; get takes a key and retrieves the value associated with it. Notice that we've abstracted this down to its simplest operations, put and get. I'm not even talking about an array index here; we're just talking about puts and gets, okay? And why do we go for a key-value store? Well, the thing is that this particular interface is simple enough that it's gonna be very easy to scale, and I'm gonna show you how we do that as we go on with the lecture.
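Just to make the shape of that interface concrete, here's a toy, single-machine sketch of the two-operation API; everything that follows in the lecture is about what happens when this one dictionary has to be spread across many machines:

```python
class KeyValueStore:
    """Toy key-value store: the entire interface is just put and get."""

    def __init__(self):
        self._data = {}    # in the distributed version, this gets spread over many nodes

    def put(self, key, value):
        # Insert (or overwrite) the value associated with key.
        self._data[key] = value

    def get(self, key):
        # Return the latest value for key, or None if it was never put.
        return self._data.get(key)

kv = KeyValueStore()
kv.put("customer:1234", {"profile": "...", "history": ["..."]})
print(kv.get("customer:1234"))
```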
A key-value store can handle huge volumes of data, petabytes for instance. You can distribute items easily and partition them across a bunch of storage nodes. It has very simple consistency properties, because we can talk about the latest value associated with a given key. And it's a simpler, more scalable database for the many applications that only need a key-value store. You can think of a key-value store like a table that's indexed by the key and has a single column other than the key, and for a surprisingly large number of applications that's more than enough for their activities. And just so you know, there are many examples of key-value stores out there. Amazon, for instance, has key-value stores where the key might be a customer ID and the value might be all sorts of stuff associated with the customer: profile, buying history, credit card, current cart contents, et cetera. For Facebook or Twitter, the key might be a user ID and the value a user profile. For iCloud or iTunes, the key could be a movie or song name, and the value the movie or song itself. So key-value stores are pretty general, and they're used by a lot of cloud-level systems you're familiar with. To dive in on this just a little more: Amazon has DynamoDB, an internal key-value store that powers Amazon's shopping cart, and they also have the Simple Storage Service, S3, which is a key-value store as well. When we talk about the Chord algorithm in a little bit, we're gonna be talking about some of the underpinnings of DynamoDB. At Google there's Bigtable, and HBase and Hypertable, which are all examples of distributed, scalable data storage. We have Cassandra, a distributed data management system that's also a key-value store, kind of like the Chord-style systems we're gonna talk about. Then there's memcached, and eDonkey/eMule. So there are a whole bunch of key-value stores used for large distributed systems. This is a fairly powerful API, and it can be implemented in lots of different ways, so let's explore it a little. The basic idea here is taking the idea of a hash table, which you're well aware of, and distributing it. So what does distributed mean here? Well, the main idea is that after we've simplified the storage interface to put and get, we're gonna partition our keys and values across many machines. One option might be to partition chunks of the key space and send ranges of the key space to different machines. And the tricky part is gonna be: once we've done this, how do we figure out where our data is stored? Once we've distributed it, we have to start worrying about finding the data afterwards. Okay, and some challenges. For instance, scalability: we would like to scale to thousands of machines, or hundreds of thousands of machines, and to petabytes of storage, so huge amounts of storage. We also need to deal with fault tolerance: what happens if one of our storage nodes fails? Okay, then what? And there's a question in the chat here: when we talk about nodes, are we talking about basically storage devices? The answer is that we're actually talking about, typically, a complete machine with a lot of disk, of course, because it's a storage node, but it also needs to have a CPU and a network interface so it can participate in the protocols we're gonna talk about. These are more than just storage devices.
These are actual complete machines, just with a lot of storage. So in terms of fault tolerance, we're gonna need to figure out how to handle machine failures without losing our data and without degrading performance. We're gonna need to think about consistency: for instance, if we have multiple copies of data and multiple clients writing at the same time, how do we come up with a consistency model that's usable? This harkens back a little to what we were talking about with NFS, and the fact that we had this strange situation where we could see some updates and not others and end up with a strangely inconsistent file. And then we also need to deal with heterogeneity. Now, heterogeneity is something you probably haven't thought about much before, if you haven't thought about deploying something that's gonna be around for months and years. It arises in the following way: suppose you buy 100 machines for your machine room and you're gonna use them for this key-value store, and then you start incrementally scaling. In three months you buy 10 more machines, and then later another 100 machines, and so on. Those machines are gonna vary in capability; they're probably gonna get larger over time. So whatever system we build here, as we're adding individual machines we're gonna want to make sure we can deal with the fact that the machines in the system are not all identical, okay? So that's gonna be an interesting challenge. So some questions that come up: okay, when we do a put with our key and value, where do we store the new key? Now, the "where" here is an interesting one, right? Because if we're talking about distributing the data widely over hundreds or thousands of nodes, or a hundred thousand nodes, this question of how to actually find things is complicated, and it's gonna be part of how we come up with an actual algorithm that produces a storage system for us. And by the way, another aspect of "where" is that if we're keeping multiple copies, we have to figure out where to put more than one copy. And then when we want to do a get, we're gonna have to figure out where to get the value we want, okay? So this "where" aspect is gonna be important for us. And of course we want to do all of the above while still providing scalability, fault tolerance, consistency, et cetera, okay? So how do we solve this "where" question? Well, the simple answer, the one you give when you jump in the elevator and somebody says, how do I solve "where" in a key-value store, is hashing, right? But hashing is really only the beginning of the solution. As you know from hash tables, you take a key space, which might be large; our typical key spaces these days are m bits, where m might be 256, and you run a hash over your data to give you a 256-bit key. And that hash is gonna tell us where to put the data. So if we had 10 nodes and we hashed our key, it might give us one of 10 different nodes to put things at. That would be the kind of hash table you build in software and learned about in 61B, but in our distributed networking situation we have something much more complicated. What if we don't know where all the nodes are, or who all the nodes are, because nodes are coming online over time, or exiting because they've died and need to be rebooted, or because they're being decommissioned?
So nodes come and go, which means that this hashing from the key space to a location needs to be much more dynamic than a typical hash table you're used to. And then of course, what if some keys are really popular? The simple hashing algorithm would just pick one location for every key, but for a really popular key we would like to pick a bunch of different locations, and soak up some of that hot-spot behavior by spreading copies over multiple nodes. So this hashing is really just the beginning of the answer, and we need to do something stronger, or more dynamic, than that. And then of course lookup comes into play. Whatever lookup we come up with, the simple idea is that we hash to figure out where the location is and then store that location in some directory; but won't that directory become a bottleneck? That could be a problem, so we're gonna need to solve these problems. Let's do this in steps. We're gonna start by not worrying, for the moment, about this directory being a single point of failure and a performance problem, and just assume we do have a directory which helps us map from a key to the node that's storing it. Okay, so for instance, here a client sends in a message that wants to put key 14 with value 14, and it sends it to the directory. What does the directory do? It says, well, I'm gonna tell you where to store this, but first I'm gonna record the fact that I've decided key 14 is gonna go on node three. So it forwards your put request to node three, an acknowledgement comes back, and now node three has the mapping between key 14 and value 14, and the directory keeps track of the fact that key 14 is on node three. Why do we do that? Well, now the get goes to the directory, and the directory can say, I know where key 14 is, I'll forward you to the node, and then the node gives the result back to the directory, which forwards it back to the client, and we're good to go. All right, now this slide shows the recursive directory architecture we're talking about. We call this recursive because essentially we're treating this like a routing problem that's recursive in its nature. So for instance, in this get example, we send the get to the master, and then the master figures out what the next hop is and recursively sends it on to the next hop. So this routing idea, that we route through the directory to the destination and then from the destination back to the client, is what we call recursive. The alternative here is iterative, and that would be one in which the original put request goes to the directory, and the directory says, okay, I'm gonna put key 14 on node three, but it goes back and tells the client about node three. And now the client asks node three, could you please go ahead and put key 14 for me, and that happens. This is iterative because we do each step one at a time: we first find out where to store the data, and then we go ahead and put it there, okay? And of course the iterative idea works for the get operation as well: we ask, where is key 14, the master directory says, look at node three, and then we ask node three, can you give me key 14, and at that point it returns the value, okay? So this difference between iterative and recursive is an interesting one, and it's really a choice of implementation rather than anything too profound.
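Here's a rough sketch of that directory-based arrangement, in the recursive style just described. It's a toy: everything runs in one process instead of over a network, and the placement policy, a hash over the key, is just a stand-in for whatever the directory might actually use (load, free space, and so on):

```python
import hashlib

class StorageNode:
    def __init__(self, name):
        self.name, self.data = name, {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

class Directory:
    """Maps each key to the node that stores it, and forwards requests (recursive style)."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.location = {}                  # key -> StorageNode

    def put(self, key, value):
        # Placeholder placement policy: hash the key to pick a node.
        idx = int(hashlib.sha256(key.encode()).hexdigest(), 16) % len(self.nodes)
        node = self.nodes[idx]
        self.location[key] = node           # remember where the key lives
        node.put(key, value)                # forward the put to that node

    def get(self, key):
        return self.location[key].get(key)  # forward the get and relay the answer

directory = Directory([StorageNode(f"node{i}") for i in range(3)])
directory.put("K14", "V14")
print(directory.get("K14"))
```

In the iterative style, put and get would instead return the chosen node to the client, and the client would contact that node itself.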
So on the left here we have the recursive version that I told you about earlier. That's really like routing, where we forward through each hop to the destination and the reply gets forwarded back, whereas the iterative version is where the client keeps asking the next step for its information and makes all the requests on its own: it asks the directory, then it asks the node, and so on. The advantages of the recursive case are that it's faster, since the directory server might be closer to the storage nodes and we're essentially forwarding through and back along as short a path as possible, and it's also easier for consistency, okay? The reason it's easier for consistency is that if we have multiple clients all updating simultaneously, the directory can help with that by keeping tabs on how many operations are currently outstanding, so it can make sure we keep a consistent value, okay. The downside is that the directory can quickly become a bottleneck in this scenario. The iterative one is much more scalable, because every client can do its own querying, okay. So you can have many clients using bandwidth throughout the network; not all of that bandwidth has to route through the directory. The downside is it's a little harder to enforce consistency, because now each client is doing its own thing, so there's much less control over when updates happen to a given value, okay. Now, I'm having some issues here with my mouse. Hold on one second... okay, I'm going to restart something. Okay, so as we were talking about before: the question here was, why is it that iterative is harder to keep consistent? The answer is pretty simple. It's basically that if every client is doing this on its own, it's much harder for them to coordinate and make sure there's only one latest version of an item, whereas if the master directory is coordinating, it knows about all the ongoing updates and can make things work out properly. So now the question, if you look at this clearly, is: how do we make this scalable? Well, we can add nodes; that doesn't seem like a big deal. But dealing with this directory seems a little tricky, okay. For storage, we'd like to just add nodes to scale things arbitrarily. We'd also like to be able to serve requests from all the nodes where a value is stored, in parallel. Just like with RAID 1 or RAID 5, the way you get more bandwidth is by finding all the places, particularly for RAID 1, where there are multiple replicas, and being able to serve reads from those, okay. So that would be nice if we could do it. The master can replicate popular items on more nodes, but now we have to worry about the master directory getting overloaded. So how do we make it more scalable? Well, we could replicate it, so there are multiple identical copies of the directory; but how do we make that work properly? We could partition it, so that different directories store different parts of the key space; but how do we do that, okay. And so it seems like we can almost reach our scalability goals just by adding nodes, but the directory gets in the way, and we're gonna need to come up with a good way of dealing with that, okay. So how do we deal, among other things, with fault tolerance?
So as I mentioned, as you put more nodes in the system, your overall rate of failure is gonna go up. And so at the same time that we're adding nodes to the system, we're also gonna want to deal with fault tolerance, possibly by replicating each value on several nodes, okay. And maybe we'll place replicas on different racks in a data center, or in different data centers in the country, or on different continents. There are lots of different types of replication we might want, and we need to figure out how to do it. So for instance, here we could have the put we saw earlier, where we're putting key 14, value 14, and the directory might say, well, send it to node one and node three, at which point we send multiple copies. And of course you can have different combinations of iterative and recursive and so on involved in that. So multiple replicas do seem like a way to get fault tolerance, but now we've made things even more complicated, in that the directory has to keep track of all the places where the copies are. And there's also this consistency question, which shows up when we have more than one node writing and we have multiple replicas; now all of a sudden we've got some real complexity. Okay, so how do we know that a value's been replicated properly on every node? If we've got thousands of nodes in the system and we're trying to write three replicas, how do we even know that all three of them have been written? Well, maybe we wait, but then what does that protocol look like? And what happens if a node fails in the middle of replication? Do you copy somewhere else? What if a node is slow? Suppose I want three copies, but one of the nodes I've picked is running extremely slowly because it's overloaded; now my write speed is limited by the slowest node in the system, and that might be really problematic. So in general, when you have multiple replicas, your puts are slow, because you're trying to replicate to as many places as you can, but your gets might get faster, no pun intended, because you have multiple copies and you can be served by different nodes. So just to put this on the table: what do we mean by consistency? Suppose the red node is trying to put key 14 with value 14 prime, and the blue node is trying to put value 14 double prime into the system. What happens? Well, they get to the directory in some order and they leave the directory in some order, but perhaps because of network ordering or whatever, what actually happens is that red arrives last at node one, but arrives first at nodes two and three. And suddenly we've got two different values for key 14 in the system: 14 prime on the one hand and 14 double prime on the other, and we have to worry about how we keep that consistent. Because if you think about it, when you go to read, what do we get for get of 14? Well, it's undefined, because there are two different options there. So there are many consistency models, okay? There are classes you can take where they talk about consistency models, in 252, and we talk about cache consistency in 262; you can talk about all sorts of different flavors of distributed systems consistency.
The simplest idea here, atomic consistency or linearizability, shows up in databases, where gets and puts to replicas appear as if there were a single underlying copy: even though there are many replicas, there's one image, and only one updated value at a time. This is like the transactions and the sequential consistency you talk about in those classes. Another option here is something called eventual consistency, which says: well, we don't really know what the absolute latest value is at any given time, but if we let the system stabilize without any writes, all the replicas will eventually converge to a single up-to-date version, okay? So while a lot of writes are going on from different people, different readers might get different views of the system, but they're all converging eventually to the same place. And you might ask yourself what makes sense in a large distributed system. Well, I'll tell you: atomic consistency, sequential consistency, linearizability, these kinds of behaviors have a tendency to be very slow and hard to achieve well in a big distributed system; they work better in centralized systems, okay? Now, the question on the chat here is: are most distributed systems eventually consistent? By and large, distributed systems try to give you a view that's like eventual consistency most of the time. There are some things for which that just isn't sufficient. For instance, if you were to try to run a banking application on a key-value store, and you're gonna take money out of one account and put money into another account, you really can't afford there to be slop in the amount of money. Okay, so that gets to be problematic, and that's why, for instance, you might actually use something like a blockchain algorithm to give you a stronger version of consistency, but then you still have to wait for the chain to stabilize. Now, the next question is: does Byzantine agreement help? The answer is, well, Byzantine agreement is a decision algorithm that you might use to run a consistency model, so they're slightly different things; but you could use something like Byzantine commit, or a two-phase commit, to make a decision about the ultimate order in which things are committed. And with that, you can then put something like a timestamp on each update that gives you an idea of what the final order is. And the only thing that Byzantine agreement gives you in that scenario is that it prevents malicious writers from screwing up the ability of the system to come to a single consistent state at the end. So with Byzantine agreement you might still have eventual consistency, but it's a type of consistency where everybody agrees on the final order of the values, and it's impossible for a malicious party to screw up the ordering. There are many others: causal consistency, sequential consistency, strong consistency, lots of consistencies. In fact, there's inconsistency in the consistency, but that's a whole other story. So I'm gonna give you one particular option that tends to show up a lot in these distributed algorithms; it's called quorum consensus. The idea here is to improve the put and get performance in the presence of multiple copies, and to do so in a way that gives us a nice strong consistency while not having to wait for every replica.
And this is pretty commonly used in big key-value stores; Cassandra and several of these others were some of the first to have it, and it's now pretty much everywhere. The idea is that we have N replicas of any key, and a put waits for acknowledgements from at least W replicas, while a get waits for responses from at least R replicas. Okay, and a put is also gonna carry something that lets us differentiate newer writes from older ones; we can use a timestamp that we put on our update before it goes into the system to help us with that. But our constraint on W and R is that W plus R has got to be greater than N. And why is that? Well, if W plus R is greater than N, that means that if we make sure W replicas have been written, and then we go read R replicas, the two sets have to overlap, so we know we'll get enough up-to-date replicas back that we can always find out what the latest value is. Okay, so there's at least one node in our read set that contains the update, and because we've got timestamps, we can figure out which response is the most up to date. Okay, and why might you make W plus R greater than N plus one? You might do that if you were really interested in making sure your data was replicated very widely. The reason you might want W smaller than N is this: suppose I have a replica set of size three, three copies of everything, and W is equal to two. That says that after I've got acks back from two different nodes, I'm gonna assume that I'm done, and there are enough copies in the system so that if R is equal to two as well, then when I read and get back two replies from the three nodes, I know that at least one of them has the up-to-date data in it. So this is what quorum means. Okay, so for instance, here we have an example where N is equal to three and W and R are equal to two. And so our replica set for K14 is gonna be N1, N3 and N4; actually, that's a bug on the slide there, that second one should be a three, so let's put a three in there just for you guys. And what we're gonna say is that ideally we write all three of these copies, but when we get acks back from two of them, we say that's enough and we go forward. So notice how this has sped us up: we haven't had to wait for that third copy to land, even though our eventual replica set needs to be three copies of everything. And also, because we have R equal to two, we're gonna read from two of our nodes, and at least one of them is gonna come back with the most recent value. And so we know that as long as W plus R is greater than N, we have a system where we'll always see at least the most up-to-date copy, all right. Are there any questions on that? Let's see; well, that's a good question. The question here on the chat is, what if two of these fail and it happens to be the two that I asked? Then you're in trouble. So you have to pick N to be big enough, and W to be big enough, that the chance that all the nodes you're querying have failed is negligible, okay. So that's a bit of a trade-off there. I will tell you that a lot of the workhorse key-value stores in the cloud, things like Cassandra or Dynamo, often use N equal to three, W equal to two, R equal to two, because that works pretty well, okay.
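Here's a minimal sketch of those quorum rules, with N in-memory replicas standing in for real storage nodes. It's a toy: a real system would issue the requests to all N replicas in parallel and just stop waiting after the first W (or R) responses, and it would use more careful timestamps than time.time():

```python
import time

class Replica:
    def __init__(self):
        self.store = {}                      # key -> (timestamp, value)

    def put(self, key, ts, value):
        # Keep the write only if it's newer than what we already have.
        if key not in self.store or ts > self.store[key][0]:
            self.store[key] = (ts, value)
        return True                          # acknowledgement

    def get(self, key):
        return self.store.get(key)           # (timestamp, value) or None

class QuorumKV:
    def __init__(self, n=3, w=2, r=2):
        assert w + r > n, "read and write quorums must overlap"
        self.replicas = [Replica() for _ in range(n)]
        self.w, self.r = w, r

    def put(self, key, value):
        ts = time.time()                     # differentiates newer from older writes
        acks = 0
        for rep in self.replicas:            # in reality these go out in parallel
            if rep.put(key, ts, value):
                acks += 1
            if acks >= self.w:               # stop waiting once W replicas have acked
                return True
        return False

    def get(self, key):
        answers = []
        for rep in self.replicas:
            resp = rep.get(key)
            if resp is not None:
                answers.append(resp)
            if len(answers) >= self.r:       # stop after R responses
                break
        # Because W + R > N, at least one answer carries the latest timestamp.
        return max(answers)[1] if answers else None

kv = QuorumKV(n=3, w=2, r=2)
kv.put("K14", "V14")
print(kv.get("K14"))
```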
So the question on the chat here is, I think, what if N3 also returns a value, so that we have two values back? We would use the timestamp to figure out which one's the latest. And what we know for a fact is that as long as two nodes return reads to us, at least one of those two is gonna have the later timestamp, and that's the one we're looking for, good. By the way, I mention the timestamp when I talk about this because oftentimes these papers don't make it clear that you need something like a timestamp; otherwise you can't really make it work properly. Okay, so now let's move on for a moment to scalability. What if I want to use more nodes, and I want to serve more requests in parallel from the nodes where a value is stored? The master can replicate popular values on more nodes, and then we can try to scale the directory somehow by replicating it or partitioning it or whatever, but this very rapidly gets tricky, and how do we want to do it, okay? Furthermore, the directory is keeping track of the storage that's available at each node, and preferentially inserting new values on nodes with more storage is a way to try to balance the load; but when a new node is added, then what? You can't just insert only new values on the new node, because then that new node isn't really taking over load from the other nodes. So suppose you've got 10 nodes and you add an 11th: you'd like there to be some rebalancing, so that the new node is worthwhile in the short term as well as the long term. And that rebalancing is something we would ideally have some algorithm to figure out for us. And then, what happens when a node fails? Back when we were looking at this quorum scheme, if we had three copies and nodes start failing, then not only do we need to avoid the nodes that have failed, but we also have to somehow make sure that the values they had get re-replicated from other copies, so that our total number of copies converges back toward N, which was our choice of three, for instance. Otherwise, nodes can fail and we start losing data, because after too many failures nobody has it anymore. So the challenge is that the directory contains a number of entries equal to the number of key-value pairs, which can be tens or hundreds of billions of entries, okay? I feel like the cosmologist Carl Sagan, who used to talk about billions and billions; billions and billions of stars was his thing. But the solution here is something called consistent hashing, and it basically provides a mechanism to divide the key-value pairs among the currently active set of nodes, which could be hundreds or thousands or millions of nodes, and to do so in a way that automatically adapts as nodes fail and come back. So that's the consistent hashing algorithm. It's a little different from the hashing you used when you learned about hash tables in, say, 61B; it's a little more dynamic, okay? And the first time you see this, it's a little tricky to get, but let's see if we can get you through it. So we're gonna assume for the moment, like I said, that the key space is m bits, and 256 is a good number. And if you think about all of the values that m bits can represent, it's a lot, right? It's two to the 256.
And so what we're gonna do is associate a unique ID with every point on a ring of the space zero to two to the m minus one; that's all possible m-bit keys. And we're gonna partition that space across N machines. So far I haven't said anything too exciting, other than that there's a ring that represents all the keys and we're gonna split those keys up across our N machines; that isn't anything more than the hash picking a location. But it's gonna get more interesting as we start drawing this out. We're gonna assume that the node IDs live in the same space as the keys, and each key-value pair is stored at the node with the smallest ID bigger than the key. Okay, so that sounds messy, but it's better with a picture. Here I'm gonna use a much smaller key space: I'm gonna assume m is six, so we have 64 total keys. And I've got these nodes around the system, and they also have a place on this ring. So every spot on this ring, and there are only 64 spots on this small ring, represents a key in the two-to-the-six space, and it also represents a potential position for a node. What we do is take a node, maybe its IP address and some other unique characteristics about it, and run the hash function on it. That gives us another key, and we think of that as placing the node on the ring. So we've got nodes here currently at four, eight, 15, 20, 32, 35, 44, and 58; those are all the nodes. And how did I put them on the ring? I took some unique thing about each node and hashed it with the same algorithm I've been using for my keys, and that puts them on the ring. Okay, everybody with me so far? Now, moving forward a little bit: if we imagine a key-value pair that we want to put into the "system", and I'm gonna put system in quotes, what we do is we put the key on the node with the next-highest ID above the key. So for instance, if we were gonna put key five into this, we find that its successor is node eight, and so node eight is gonna store key five. And we can look down here and say, well, if we've got key 24, that's gonna get stored by node 32, okay? And here's another example: key 14 with value 14 is gonna get stored on the server with ID 15, okay? And we can do this because we've taken the unique information about the nodes and hashed it with the same hash function that we got the keys from, and therefore for every key there's always some node that's the next node clockwise on the ring, okay? And in practice, since m is 256 bits or more, this is a really, really, really large ring, okay? So there's a lot of space between each node on the ring and each key on the ring, but the process is still the same: we store a given key-value pair on the server that is the immediate successor of the key. Now I'm gonna pause there long enough for this to sink in, since it's probably confusing, and wait for a couple of questions. Everybody with me so far? This mapping here is consistent hashing, by the way. Sorry, could you just say that last sentence one more time? Well, what I've described to you here is consistent hashing, okay?
And what it means is that it's a way of assigning keys to servers, where the keys and the server names are randomly distributed in the same key space, okay? Now we have a couple of questions; this is great. The first question is, how do you know that one node isn't covering a much larger portion of the space than the others? And the answer is, you don't know for sure, but probabilistically it's extremely unlikely to happen, okay? And as you add nodes, the chance of one node covering a ridiculous amount of the space goes down very rapidly. If you look at the resources page, I put up a paper on Chord which talks about the consistent hashing properties and shows you experimentally why this doesn't tend to happen very much. Now, another question: is it possible for the number of nodes in the network to change? Yes. And the reason I talk about this algorithm this way is, let's suppose that node 15 fails. If node 15 fails, then which node is now responsible for key 14? Well, it's now node 20. So the nice thing about this consistent hashing algorithm is that it automatically tells you who's responsible for which keys as nodes come and go, all right? And on the question about the directory: there isn't one right now; the structure is the directory. The directory is discovered by starting at some known node and going clockwise until you find the spot where your key fits in, and at that point you know that the next successive node is the one that's got your item, okay? Now, we have lots of questions about who's tracking all this and so on, and you're gonna have to give me a few slides to get to that. There was also a question about the meaning of m again. The meaning of m is the number of bits in the key space, so typically that's something like 256, okay? All right. Now, what's happening to node 14 in this example? There is no node 14; there's only key 14. So the idea is, especially if m equals 256, you imagine two to the 256, okay? Your brain just kind of blows up because it's really big, right? And imagine a reasonable number of nodes: a thousand, a million, a billion, whatever, it doesn't matter. You put them randomly on that ring, and it's gonna be extremely sparse relative to the possible keys out there. And what's happening in this picture is that there is no node 14, but there is a node 15, and so key-value pair 14, V14, is stored on server 15. That's all this is showing you, okay? So, on the distribution of nodes: notice that there's no centralized anything in this picture, all right? All that happened was, when we have a new node added to the system, we come up with our way of hashing its metadata, its name or key or whatever you want to call it; we insert it into the system via some mystery mechanism I haven't told you about yet; and it immediately starts taking over the keys that it's now responsible for. And how do we know it's responsible? Because it is the immediate successor of every key between its predecessor and itself, and that's all that consistent hashing is gonna tell us. Okay, and when a new node comes in, yes, what happens is a redistribution of keys: if 20 has just come in, then the keys between 15 and 20, inclusive, that were being stored on 32 get redistributed to 20, because they are now the responsibility of 20, and they get copied over by the algorithm, okay? All right, are we ready to move forward here?
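Here's a small sketch of that placement rule. It's a toy: the sorted list of node positions is a centralized stand-in for information that, in the real system, no single node holds, which is exactly what the routing protocol coming up replaces:

```python
import hashlib
from bisect import bisect_right

M = 2 ** 256                              # identifier space, m = 256 bits

def h(name: str) -> int:
    """Hash node names and keys onto the same identifier ring."""
    return int(hashlib.sha256(name.encode()).hexdigest(), 16) % M

class Ring:
    def __init__(self, node_names):
        # Each node's position comes from hashing something unique about it.
        self.nodes = sorted((h(n), n) for n in node_names)
        self.ids = [nid for nid, _ in self.nodes]

    def successor(self, key: str) -> str:
        """The node responsible for key: the first node clockwise after its hash."""
        i = bisect_right(self.ids, h(key)) % len(self.nodes)   # wrap around the ring
        return self.nodes[i][1]

ring = Ring(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"])
print(ring.successor("K14"))    # whichever node's ID immediately follows hash("K14")
```

Notice that if one node is removed from (or added to) that list, only the keys between it and its predecessor change owners, which is exactly the property being described above.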
So there's no centralized anything in this, and I haven't shown you anything yet that would lead you to believe I can actually build it. All you need to believe so far is that if I can somehow plop a node onto the ring based on something to do with its name, randomly, and then transfer keys to this new node, that's consistent hashing. And certainly those of you who are thinking this through can note that if I only have one copy of, say, key 14 on server 15, and server 15 just dies, I've just lost key 14. So clearly there's gotta be some replication in here too; we'll get to that, hopefully before the lecture's over. All right, so I'm gonna move on now, and we'll see whether things keep going. So Chord is a distributed lookup service based on the consistent hashing idea. It was originally designed at MIT; Ion Stoica, who's here at Berkeley, was one of the original designers. I present Chord because it's the simplest of all of these peer-to-peer algorithms, it's the cleanest one, and if you understand it, it gives you a great leg up in understanding the other ones. That's why I like Chord. And, go Bears, yes. You can take a look at the Chord paper that I have up on the resources page for you. So an important aspect of the design space here is that we're gonna decouple correctness from efficiency. In trying to turn this consistent hashing idea I just gave you into an algorithm, I need to figure out what I need to implement for correctness and what I need to implement for efficiency, and I've gotta separate those two. Okay, and Chord is particularly clean about this. All right, now the other thing about this idea is that it's a combined directory and storage space. I know I'm gonna be able to route to the location of an item, and the item I've been routing toward is gonna be stored on the server I eventually arrive at. So I no longer have separate directory and storage nodes; I now have them combined together. Every one of these things you see here that look like computers is a combined chunk of the directory space and a chunk of the storage space together in one box. Okay, so the properties for correctness: every node in the system needs to know about its neighbors on the ring, that is, one predecessor and one successor. And if we do that properly and the ring itself is fully connected, then we've got correctness, because as long as I always know who my successor and predecessor are, then if somebody's looking up key 14 and they're on node four, they know for a fact that wherever key 14 is, it's this way, clockwise, on the ring, so they just go to their successor. Okay, and then eight goes to its successor, and eventually 15 is gonna be the one. And yes, this is one ring to rule them all; that is correct. And unfortunately, if there were instead a bunch of smaller rings, then the thing is inconsistent and not working properly. So our correctness property is basically that every node fits into the one ring cleanly. Performance, on the other hand, has gotta be something better than that, because in the worst case, suppose I pop into this system at 20 and I'm looking for 14: I've gotta go all the way around before I hit 15. So clearly the ring gives me correctness, because I can always fall back on it, but I've gotta do something better, okay? And there are many other of these structured peer-to-peer services.
There's one also from Berkeley called CAN; there's Tapestry, which is one I designed way back when with my students; Pastry was designed at Microsoft; Bamboo was also designed with my students here; and Kademlia was designed sort of at NYU slash Stanford. Several of them were designed at Berkeley. So this is all part of the peer-to-peer space of the early 2000s, okay. So, Chord's lookup mechanism is one of routing. What we do is: every node maintains a pointer to its successor, and we route a packet toward the node responsible for the ID we're looking for. So suppose that node four is trying to find out who's responsible for 37, okay. Node four starts a query; it doesn't know anything about 37, so it asks its successor, which asks its successor, which asks its successor, and so on around the ring. Eventually node 35 knows that its successor is 44, and so at that point it knows that 44 is gonna hold 37, and it can respond to the query and say, well, node four, if you're interested, you wanna talk to node 44. And by the way, since 35 has a pointer to 44, like an IP address, it can also tell node four what IP address to use to talk to node 44. Okay, so this is routing. And by the way, from the earlier part of the lecture before my mouse went haywire, you can see that this is a recursive query process, right? We could do this iteratively, where four asks eight, eight returns something, then four asks 15, 15 returns something, et cetera. And probabilistically, the iterative one is gonna be about a factor of two longer than the recursive one. So the worst-case lookup here is order N, where I have to go all the way around the ring. That's not great, so we're gonna need a dynamic performance optimization to make this work somehow, all right? And hopefully you guys will give me a little time after the normal close of class here so we can get through this, because we lost about 10 minutes in my mouse snafu. So what does this really mean, okay? Keep in mind that the nodes on the ring got their positions by hashing their IP address, and maybe some metadata. That means that four and eight are absolutely not close to each other in the network; in fact, they're anti-close, if that makes any sense, right? Because four's IP address and eight's IP address are hashed, if they happen to be in the same physical network, it's more likely that you'll get four out of one and 32 out of the other than that you'll get eight, okay? And so what this really means is, if I actually look at, say, the US and where all these nodes are, what I see is: here's node four, node eight, node 15, et cetera, and my ring is really mapped in a pretty random-looking pattern across wherever those nodes are; here I'm showing you North America. This is potentially good. It obviously doesn't give you speed from a locality standpoint, but what it does give you is that if I replicate key four on a couple of successive nodes, then I'll get geographic separation, and it'll be less likely that a local disaster will take out my data. So this anti-locality property of Chord is actually a feature, okay? And so here's an example: here's key 14, and each of the clients that might be using my network knows about one node, typically, and that one node is the one they enter through before they run the protocol, okay?
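Here's a tiny sketch of that successor-only routing, with the ring expressed as a table of successor pointers using the node IDs from the figure. It's a toy: in the real system, each node holds only its own pointer, and the query is forwarded between machines as messages:

```python
# Successor pointers for the example ring: 4 -> 8 -> 15 -> 20 -> 32 -> 35 -> 44 -> 58 -> 4
successor = {4: 8, 8: 15, 15: 20, 20: 32, 32: 35, 35: 44, 44: 58, 58: 4}

def find_responsible(start_node: int, key: int) -> int:
    """Walk clockwise, one successor at a time, until we pass the key (O(N) hops)."""
    node = start_node
    while True:
        succ = successor[node]
        # If key lies between node and succ (handling wrap-around), succ owns it.
        if node < key <= succ or (succ < node and (key > node or key <= succ)):
            return succ
        node = succ

print(find_responsible(4, 37))   # node 35 discovers that node 44 is responsible for 37
```

This is the order-N correctness path; the finger table coming up at the end of the lecture is what makes it fast.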
So now the question is how do we make the ring work? And the answer is there's a stabilization protocol that's very simple. I haven't given you fast yet; right now we're just after correctness. The idea is that every node periodically runs stabilize: it asks the node it believes is its successor who that node thinks its predecessor is. And if successor.predecessor (this is really a message exchange) isn't me, then that node doesn't know about me, and I've got to tie myself into the ring and make sure it thinks of me as its predecessor, okay? And if I've changed something, I also go to my successor and notify it: hey, I'm here, remember me. The successor, on getting that notification, says: if my predecessor is nil, meaning I don't know of anybody, or if this new node that just talked to me actually falls between where I thought my predecessor was and me, then I'll call it my predecessor. And this algorithm, once you see what's going on, is constantly converging, so that as new nodes come and go, it keeps the ring connected. So here's a good example. A node with ID 50 is sitting over here unconnected to anybody; notice its successor is nil and its predecessor is nil. And assume that it only knows about node 15 as an entry point. The question is, how does it join? What it does first is send a join request to the node it knows about. And what's interesting about this protocol is that the join request is exactly like a lookup: we're asking node 15 to tell us where key 50 would live in the network, and you can see that key 50 would live on server 58. So as we go through this routing process, we find out that node 58 would actually store key 50. Now, how does node 50 know where node 15 is? It doesn't have to know much; you just have to know one node in the system, and it doesn't matter which one. Maybe you get it through DNS, maybe you get it through some name service that tells you how to get into the ring, but you need to know at least one. And we're going to get to failures in a second; hold on. So basically node 50 consulted its how-do-I-get-hold-of-the-ring name service, was told about node 15, and then asked node 15 where key 50 would get stored. What comes back is: on node 58. At that point we set 58 as our successor, because we know that if we were key 50 and we would be stored on node 58, then our successor ought to be 58. And now we go through the stabilization process, where we ask our new successor who its predecessor is, and it says, well, my predecessor is 44. At that point we know for a fact that 58's predecessor really ought to be 50, not 44. So node 50 says, hey, node 58, you ought to use me. Okay, and so it goes through the algorithm, and now 58's predecessor is set. At this point we have a partially connected ring. And if everybody continues to run stabilize, our local node runs it and the other nodes run it too, notice what happens when 44 runs stabilize: it asks its successor who its predecessor is, and that successor says, well, my predecessor is 50.
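Here's a rough sketch of what join, stabilize, and notify might look like for a toy ring like this, under the same assumptions as the earlier sketches; the real pseudocode in the Chord paper has more detail (successor lists, failure checks), so treat this as an outline of the idea, not the actual protocol code.

```python
def in_range(x, start, end):
    """True if x lies in the clockwise interval (start, end] on the ring."""
    if start < end:
        return start < x <= end
    return x > start or x <= end   # the interval wraps past zero

class ChordNode:
    """Toy sketch of Chord's join/stabilize/notify."""
    def __init__(self, ident):
        self.ident = ident
        self.successor = self        # a lone node is its own successor
        self.predecessor = None

    def find_successor(self, key_id):
        """Linear, successor-only lookup; correct but slow without fingers."""
        if in_range(key_id, self.ident, self.successor.ident):
            return self.successor
        return self.successor.find_successor(key_id)

    def join(self, entry_node):
        """Join via any known node: ask it which node would own our own ID."""
        self.predecessor = None
        self.successor = entry_node.find_successor(self.ident)

    def stabilize(self):
        """Periodic: check whether someone has slipped in between us and our successor."""
        x = self.successor.predecessor
        if x is not None and in_range(x.ident, self.ident, self.successor.ident):
            self.successor = x
        self.successor.notify(self)          # "hey, I'm here, remember me"

    def notify(self, candidate):
        """Adopt candidate as predecessor if we have none, or if it sits between."""
        if self.predecessor is None or in_range(
                candidate.ident, self.predecessor.ident, self.ident):
            self.predecessor = candidate

# Mirroring the lecture example, on a ring 4 -> 8 -> ... -> 44 -> 58 -> 4 built elsewhere:
#   n50.join(n15)        # find_successor(50) returns 58, so node 50's successor becomes 58
#   n50.stabilize()      # node 58 is notified and adopts 50 as its predecessor
#   n44.stabilize()      # 44 sees 58's predecessor is now 50 and adopts 50 as its successor
# Running stabilize everywhere, over and over, keeps pulling the ring back together.
```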
And now 44 knows something funny is going on, because it doesn't know about 50, so at that point it connects itself in. And if you follow this through, you'll see that it converges to a connected ring, okay? That's what's cool about this algorithm: if you just keep running it over and over again, it will converge. Now, two interesting things here, one of which was a question that was just asked: is it possible for a consistency problem to occur if some query arrives partway through this joining process while nodes are joining? The answer is that when you get a response to a query, you always know whether it was successful or not. So what happens when new nodes are joining and your query happens to cross through them is that you retry, and it works the second time through. Okay, so 44 only learns about 50 when it calls stabilize; that's correct. Basically, if everybody is running the stabilize procedure and notifying their successors, then the ring will join up and will keep joining up over time, okay? Now, there was another related question. Yes: 44 learns about 50 when it calls stabilize, and at that point it notifies 50 that it's choosing it as its successor. I challenge everybody to take a look through the slides afterwards; it's pretty cool how simple this algorithm is. Now, the other question up here that I wanted to answer is what happens if a node fails. If a node fails, then as long as only one node fails, this particular algorithm will rejoin the ring. If more than one node fails, and those nodes are close to each other and fail simultaneously, then the ring can get disconnected. So what that requires is what's called a leaf set: rather than one successor, we keep track of a few nodes forward and backward, and then it becomes probabilistically very hard for the ring to come apart, okay? Why does node 50 join and then stabilize, but not node 44? That's another question. Node 44 was already part of the ring in that example I showed you, so it's only node 50 that's doing the joining process, and node 44 thinks it's happily connected to 58. What happens is that 50 gets itself partially into the ring, and at that point the stabilization reconnects the ring, almost like a rubber band snapping back. Okay, so how do we make this whole thing fast? The way I make this fast is the following, and I'm not going to go through this in great detail because we're already a little past our original time; I'm hoping you guys give me a little extra. The idea is that we have what's called a finger table. Each node doesn't just keep track of its successor and predecessor; it also finds, probabilistically, a node that is halfway across the ring, a quarter of the way, an eighth of the way, and so on, and those are called fingers. So for instance, the point halfway across the ring from my node is node number 80 plus 2 to the 6th, mod 2 to the 7th, which works out to 16, and so the node that this finger points to is node 20 in that case. And there's an algorithm, similar to the stabilization one I just showed you, that figures out each of these fingers. Now suppose a query comes into a node, say somebody asks node 80 to find something like key 31.
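As a small illustration of where those fingers aim, here is the arithmetic from the node-80 example as code; the 7-bit ring is an assumption taken from the numbers used on the slide, and the comment about node 20 assumes the same example ring.

```python
RING_BITS = 7
RING_SIZE = 1 << RING_BITS   # 128-position ring, as in the node-80 example

def finger_targets(node_id):
    """Ring positions each finger aims at: node_id + 2^i for i = 0 .. m-1."""
    return [(node_id + (1 << i)) % RING_SIZE for i in range(RING_BITS)]

print(finger_targets(80))
# -> [81, 82, 84, 88, 96, 112, 16]; the largest finger aims halfway across the
#    ring. Each finger then points at the live successor of its target position,
#    so the finger aimed at 16 points to node 20 in the lecture's example.
```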
What node 80 does, rather than going serially all the way around the ring, is first take the longest finger it can toward the key, which gets it to node 20, and node 20 knows exactly where to go from there. So with good fingers you can get to key 31 in two hops. These fingers are continuously refreshed by a dynamic algorithm. If some of them are stale because the node they pointed to went away and they haven't been refreshed yet, that's okay, because I can always fall back on walking the ring linearly. So this is a performance issue, not a consistency issue. The consistency issue is keeping the ring always connected, which we get through the joining operation; the performance or efficiency comes from keeping the finger tables mostly working, which lets me go roughly halfway across the ring, then a quarter, and so on, converging on my node in a logarithmic number of steps rather than a linear number. Now, how do we get fault tolerance for the lookup service? Well, let me explain this a little bit again, since there was a question about it in the chat. First of all, hopefully everybody agrees with me that once I have my successors and predecessors figured out, I can always resolve a query by going around the ring. So that's correct, but it's not fast. The way I make it fast is with this finger table, which holds, rather than just successors and predecessors, a pointer roughly halfway across the ring, one a quarter of the way, one an eighth of the way, and so on. And it always goes forward. I'm not sure what the question about traversing the finger table linearly backwards means, but the answer is that I'm always trying to go clockwise: if I'm headed somewhere past the halfway point, I start by taking the finger that goes halfway across the ring, then node 20's finger table takes me a quarter of the way, and so on, so I'm always going forward. The reason I've got the finger table is that it gives me order log M routing, because I can think of this as bit correction: if I look at where I am and the key I'm trying to reach, I first correct the first bit, then the second bit, then the third, and so on. All right. Now, how do we get fault tolerance? For one thing, we need more predecessors and successors than just one, and that lets us keep the ring connected. If K is on the order of log M, lookups work with high probability even if half the nodes fail, where K is the number of successors and predecessors I keep track of; the paper I put up there works through this. But replication is also necessary. So for instance, rather than keeping key 14 stored only on server 15, I also store it on the nodes in my replica set going forward around the ring. Here I have my three copies, and if, say, node 20 fails, node 15's leaf set is now 15, 32, and 35; both 15 and 32 still have a copy of key 14, so I can re-replicate it onto 35, okay. And there's also a question here: how does a host's finger table get updated when a new host joins?
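Putting the fingers to work, here is a hedged sketch of the logarithmic lookup: at each step a node forwards the query along the longest finger that doesn't overshoot the key, which is the bit-correction idea. The node IDs are reused from the figures; the rest of the setup is hypothetical.

```python
RING_BITS = 7
RING_SIZE = 1 << RING_BITS      # 128-position identifier space, as in the node-80 example

def between(x, start, end):
    """True if x lies in the clockwise-open interval (start, end) on the ring."""
    if start < end:
        return start < x < end
    return x > start or x < end  # the interval wraps past zero

class FingerNode:
    """Toy Chord-style node: fingers[i] points at the live successor of id + 2^i."""
    def __init__(self, ident):
        self.ident = ident
        self.successor = None
        self.fingers = []

    def closest_preceding(self, key_id):
        """Longest finger that lands strictly between us and the key."""
        for f in reversed(self.fingers):
            if between(f.ident, self.ident, key_id):
                return f
        return self.successor

    def find_successor(self, key_id, hops=0):
        """Logarithmic lookup: jump by the biggest useful finger at each step."""
        if key_id == self.successor.ident or between(key_id, self.ident, self.successor.ident):
            return self.successor, hops + 1
        return self.closest_preceding(key_id).find_successor(key_id, hops + 1)

# Build a small ring (hypothetical members, including node 80 from the example).
ids = [4, 8, 15, 20, 32, 35, 44, 58, 80]
nodes = {i: FingerNode(i) for i in ids}

def live_successor(pos):
    """Node at or clockwise after ring position pos."""
    for i in ids:
        if i >= pos:
            return nodes[i]
    return nodes[ids[0]]   # wrap around to the smallest ID

for i in ids:
    nodes[i].successor = live_successor((i + 1) % RING_SIZE)
    nodes[i].fingers = [live_successor((i + (1 << k)) % RING_SIZE) for k in range(RING_BITS)]

owner, hops = nodes[80].find_successor(31)
print(owner.ident, hops)   # -> 32 2: node 80 jumps by its longest useful finger to 20, then done
```

Contrast this with the successor-only sketch earlier: the same query from node 4 there took five hops; here node 80 resolves key 31 in two.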
And the answer is fairly simple. What it really means is that periodically I just go through all my fingers, and for each one I take my own ID, add 2 to the i to it (up to 2 to the m minus 1 for the halfway finger), and ask the ring to route me to that position. That gives me back the current live node at, say, the halfway point, and over time I work through and refresh all the fingers this way. The drawback is that there's constant traffic to keep the ring alive, okay, but it does route you in log M steps, which is kind of nice. Okay, I'm running out of time here. So, on fault tolerance: the idea is that if this is my leaf set, say K equals 3 going forward, then if a node fails, my leaf set still has two other nodes and they can replicate forward. That's what the Chord algorithm does: when a node fails, it has enough replicas to keep moving forward, okay. And as I just said, if node 15 fails, I replicate forward, et cetera. All right, replication in physical space: I mentioned this earlier. Nodes 15, 20, and 32 actually have geographic spread, and if this ring is spread around the planet, those copies are all over the place, okay. Now, finally, I wanted to say something about Amazon here (please turn off the annotation if you could). Dynamo is basically a version of the Chord ring that Amazon runs internally; it's Amazon's storage for its internal processes. If you look here, you see we have actual Chord rings buried behind several routing boxes inside of Amazon. That's how they store things like shopping carts, and all of the information you query gets stored in Dynamo. And the application can deliver functionality in bounded time. What's interesting about the Dynamo paper, which I also posted, relative to the Chord paper, is that Dynamo figures out how to offer a certain number of nines of service-level agreement, saying that you will get your data returned no later than some bound X, whereas Chord, which was originally defined probabilistically, doesn't quite have that property, okay. So they can, for instance, offer service guarantees that provide a response within 300 milliseconds for 99.9% of requests, okay. All right, I think that's all I wanted to say for today. So in summary and conclusion: we talked about distributed file systems. We started with what I'd call traditional ones like NFS and AFS, which try to give you transparent access to files that are stored all over the place on remote disks. There's some caching for performance, and that causes consistency issues. We talked about VFS, the virtual file system layer, mostly last time, which basically allows us to plug these systems in and use them as if they were local. And then we talked about cache consistency. We then talked about how key-value stores are somewhat simpler and easier to scale, and about the challenges in terms of fault tolerance, scalability, and consistency. And I presented the Chord algorithm, which is a highly distributed lookup protocol. If you actually internalize Chord, and I strongly suggest you do, it will give you a leg up in understanding all these other protocols, right? And the last thing I wanted to mention, which I didn't say when I was showing the replication slide, is that the quorum consistency idea, where I have N replicas and wait for R reads or W writes to come back, applies to Chord as well, and that's exactly the kind of replication Dynamo has.
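Since the quorum idea only got a sentence here, a one-line check makes it concrete: reads and writes stay consistent when the read and write quorums are forced to overlap. The N=3, R=2, W=2 numbers below are a commonly quoted Dynamo-style configuration, an assumption for illustration rather than something stated in the lecture.

```python
def quorum_is_consistent(n_replicas: int, r: int, w: int) -> bool:
    """Read and write quorums overlap (so reads see the latest write) iff R + W > N."""
    return r + w > n_replicas

print(quorum_is_consistent(3, 2, 2))   # True: every read quorum intersects every write quorum
print(quorum_is_consistent(3, 1, 1))   # False: a read may land entirely on stale replicas
```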
All right, well, thank you everybody. Thanks for bearing with me through the little disaster I had earlier with my mouse; I'm still not sure what happened. And I'm going to miss you guys. It's been a little weird; I haven't actually gotten to see you in a month and a half or two months. There was a question here about elaborating a little on what material will be on the midterm. The things we talked about today may show up on the midterm, but they're not going to be as highly weighted as earlier material. So I hope you all have a great couple of days. Good luck on the midterm; we'll be giving you details about how to take it in a bit. And don't forget to fill out your HKN forms. That's a very important part of the process; now you can say what you liked and didn't like about the class. And the final is midterm three; there's no separate final. By the way, just as a heads up, hopefully you can all access Google Forms, because the midterm is going to be on Google Forms. So for those of you who might be in foreign countries, you might want to make sure you have a VPN set up or something. All right, ciao. Everybody give the TAs a virtual clap as well, because they've been great. Ciao, and look me up if you're still around next fall and we're actually physically present. Come say hi, because it's been a little lonely without you guys. All right, bye.