All right, everyone, let's get started. We have a couple of topics today. We're going to talk mostly about key-value storage, which is an emerging technique for storing very large amounts of data. We'll talk about the API for key-value stores, explain some of the implementation details, and some of the challenges in getting really large-scale key-value stores to perform. And we're going to start the discussion of networking protocols; today we'll just get into layering and the overall architecture of communication protocols on the network. These slides are heavily derived from earlier slides by Ion Stoica, Vern Paxson, and Scott Shenker. So a key-value store is conceptually very simple. You have a put operation that's indexed by a key, with a value that's being inserted into a table. And the get operation is the inverse: you supply a key, and if it's a key that's been stored in the table before, the value gets returned. Intuitively, it's just a two-column table, and in fact the point of the idea is to simplify the interface so that the implementation can be super efficient. These stores are increasingly used underneath very large-scale databases. They typically relax some of the consistency and other constraints of traditional databases. They're sometimes used for content-addressable storage: you can use keys which are quite complex and actually index into the content of a database. They've been scaled up to petabytes of data, and in fact the most common really large data stores are almost always some flavor of key-value store, distributed over hundreds or thousands of machines, probably hundreds of thousands at the high end. And they're designed to be faster, with lower overhead. There are different types of overhead in traditional relational databases; one of them is just the indexes, which are often an order of magnitude bigger than the data itself. We'll see some of the other types of overhead that key-value stores try to minimize. So in databases, you typically have four requirements: the ACID requirements. Atomicity means that every update should be in the form of a transaction: either the whole change happens, with all of the affected indices getting updated, or none of it does. It's either no operation or a complete operation. Consistency means that the different tables that might link to each other have to be consistently set and remain consistent through every transaction. Isolation implies that even when things are happening concurrently, there's a simple sequential interpretation: the concurrent updates are equivalent to some serialization of those updates. And finally, durability: the results of a transaction should be protected against various types of failure. It has to be the case that if, for instance, the data is heavily cached somewhere and some machines go offline, commits that were made are still persisted somehow. Okay, so these properties guarantee that data in traditional databases is highly protected and safe, even after a variety of failures and basically arbitrary operations from clients. On the other hand, in distributed systems it's been recognized that getting all of the desirable properties you'd like, especially in a system that's highly responsive, is challenging and in fact impossible.
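Before getting into the distributed issues, here's the basic put/get interface from the start of the discussion as a minimal sketch, using a plain in-memory Python dict as the backing table. The class and key names are just illustrative, not any particular system's API:

```python
# Minimal sketch of the key-value interface, backed by an in-memory dict.
# Real stores spread this one table across many machines.
class KeyValueStore:
    def __init__(self):
        self._table = {}          # the conceptual two-column table

    def put(self, key, value):
        self._table[key] = value  # insert or overwrite the value for this key

    def get(self, key):
        return self._table.get(key)  # value if present, else None

store = KeyValueStore()
store.put("K14", "V14")
print(store.get("K14"))  # -> "V14"
```

Everything that follows is about what changes when this single table has to be distributed across thousands of machines.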
So there's a related theorem called the CAP theorem that applies to three similar concepts. Consistency is similar to consistency on the previous slide: when you have a distributed system, you want all of the nodes to have a consistent view of the data. Availability is somewhat of a new idea, but it's critical for systems that are being used to serve web content; the idea is that when clients make requests, they should receive some kind of response, even if it's a response saying, "I can't complete the transaction right now." And partition tolerance says that the system should continue to operate even if the network is partitioned and messages between nodes are being lost. The conflict between these three was originally formulated by Eric Brewer around 2000, and a couple of years later a theorem was proved by Gilbert and Lynch at MIT showing that, under somewhat particular assumptions, it's true: you can't achieve all three of those at once. So Eric's original conjecture does hold under certain assumptions. That means many of the systems people actually build strive instead for two of these, or they weaken one of the notions significantly; they might strive for eventual consistency instead of some stronger notion of consistency at every time step. Most key-value stores work under relaxed assumptions that are a subset of these; there are CP systems that provide consistency and partition tolerance, et cetera. And just to clarify, you can map standard relational tables onto key-value stores. In fact that's done, and there are some advantages to doing things that way. Take a traditional table with some columns: say states in the US, IDs which uniquely identify them, some data like population and area in square miles, and one of the senators. In a traditional database, there's an index that indexes rows of the table, and the data is typically stored in rows, so you retrieve an entire row. There are actually two ways to approach this with a key-value store. One of them is to break the table into columns, with an index, which is the ID. The first column, which was the key of the original table, becomes the key of a key-value table, and the IDs thereafter become keys into the other sub-tables, the columns of the data. This way you have simpler keys than if you used a complex key in every table, and it's generally quicker to use a simpler key. We've done two things here: we've broken the original table into simpler tables, and we've also made access to the data easier if we only want certain fields. In addition, if we want other indices, we can just add them as separate key-value stores; for instance, if you want to index by senator, you can add that key-value store and you'll have basically a table with two indices. Performance-wise, this gives you a column-based store, which contrasts with traditional databases that normally store data row-based. The advantage is that a lot of databases, especially internet databases, often have hundreds of columns, and a lot of queries index just a couple of those. Using a columnar format allows you to be a lot more selective and a lot more efficient in accessing the data.
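As a concrete sketch of that mapping, here is the states example broken into per-column key-value tables, plus a second index by senator. All of the data and names are illustrative:

```python
# Sketch: the states table broken into per-column key-value stores.
# The primary table maps state name -> ID; each column becomes its own
# ID-keyed table. All figures here are illustrative.
state_id   = {"California": 1, "Texas": 2}
population = {1: 39_000_000, 2: 30_000_000}   # ID -> population
area_sqmi  = {1: 163_696, 2: 268_596}         # ID -> area (square miles)
senator    = {1: "Padilla", 2: "Cruz"}        # ID -> one senator

# A query touching only one column reads only that sub-table:
sid = state_id["California"]
print(population[sid])

# A secondary index is just another key-value store:
senator_index = {name: sid for sid, name in senator.items()}
print(area_sqmi[senator_index["Cruz"]])
```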
So a lot of newer systems do use column-based storage, and under the hood they'll have, perhaps, a highly efficient key-value store implementing the column store. That's one approach to representing large relational tables: break them into column stores. The other approach is to keep the contents of each row together and form a complex value out of it; that approach is used as well. Here are some examples, some canonical users of large tables: all the internet companies. Amazon has very complex customer records, indexed by a customer ID, where the value comprises a long record of all of the different customer information. Facebook and Twitter do a similar thing, indexed by simple unique identifiers. You can also access things like iTunes storage by song name. And in distributed file systems, key-value stores are used in a few different ways. HDFS really has two different levels of key-value store: one maps file names to the nodes that host copies, or replicas, of the file, and then on each data node there's an additional mapping from the IDs of pieces of the file to specific addresses on disk. That might be confusing right now; we'll get more into that implementation later on. A few more: the Google File System, and the Hadoop file system, which is modeled after it, have these two different layers of key-value store, as I mentioned. Amazon uses a variety of key-value stores. Bigtable and HBase are somewhat customized key-value stores; again, on a key-value format you can layer richer structures to represent relational data. Facebook has their own system, Cassandra. And memcached is a popular in-memory key-value store that's used for a variety of purposes, acting as a cache in conjunction with larger, more complex data stores. There are also peer-to-peer key-value stores: eDonkey, eMule, and BitTorrent are examples. Any questions so far? Yep. So BitTorrent basically has two levels. It has a search engine, which you first have to query. Yeah, BitTorrent has a distributed file directory; I think it starts with a global lookup, where you have to contact a peer that has a piece of the distributed table. So there are basically multiple key-value stores which form parts of the directory that it uses. It's actually one level more complex than what we've been describing, because you first have to find a directory that has the object you're looking for, then look up the object, and then try to find the files, which are typically replicated in a few different places. All right, so a key-value store like this is also called a distributed hash table. The idea is to store the set of key-value pairs across many machines. For right now the directory will be on one machine; later on we'll figure out how to distribute the directory as well. Conceptually we have a single key-value table, and the actual pieces of it are going to be distributed across machines like this. We're also going to use replication so that we have robustness against failures: we want to make sure that failures don't cause data to be lost and don't have too big an impact on running time.
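Going back to the two-level mapping just mentioned for HDFS-style systems, here's a toy sketch of those two layers; the file names, block IDs, node names, and paths are all made up for illustration:

```python
# Sketch of the two key-value layers described for HDFS/GFS-style systems.
# Level 1 (on the name node): file name -> block IDs plus replica locations.
# Level 2 (on each data node): block ID -> local disk path.
name_node = {
    "/logs/day1": [("blk_001", ["node3", "node7"]),   # block, replica hosts
                   ("blk_002", ["node2", "node3"])],
}
data_node_3 = {
    "blk_001": "/disk1/current/blk_001",  # where the bytes live locally
    "blk_002": "/disk2/current/blk_002",
}

# To read a file: look up its blocks, then ask a hosting node for each block.
for block_id, hosts in name_node["/logs/day1"]:
    print(block_id, "is replicated on", hosts)
```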
Realistically, at the scales we're operating at, with thousands of machines, the probability of failure is quite high. Failures will be happening every few weeks, and sometimes there will be catastrophic failures of entire racks of machines. So it's a given that failures will happen, and we have to make sure we have a response that will hopefully keep the system fully functional through those problems. We also want to maintain consistency, which means making sure that replicas are consistent with each other, and that the messages being passed around the system are consistent with each other as much as possible. Across the different systems we're considering, say from memcached through to really large peer-to-peer systems, the latencies range from quite small, for data in fast local storage, up to hundreds of milliseconds or seconds to access data across the network. Bandwidth similarly varies, from tens of kilobits up to several gigabits if we're talking about the aggregate bandwidth across a large distributed store. Yeah, question? I think I understand; your question is whether one version of the data is a reference copy, versus the replicas. There's normally no distinction between the replicas, and normally the master or directory node decides where to put things: it receives one copy and then decides where to put the replicas. So there's a symmetry between the replicas. There's normally no asymmetry when things are retrieved, either; normally requests are sent to all the machines at once, and whichever one comes back first is the one that's used. Okay, so we have these simple operations we want to execute, but we have a lot of degrees of freedom because it's a big distributed system, and at the same time as keeping the simple interface, we want a lot of robustness built into the system. The simplest way to do this is to maintain a master directory that keeps track of where the different values are stored. Nodes can then directly query that master node; they only need the address of the master. They can send a request to the master to save something. Say the master already has some information about what's in the distributed store: it's got a couple of elements, and you can see that it's storing each key along with a machine number, while the value itself is on that machine, indexed by the key. So a new request comes in to put a value V14. The master at that time determines a mapping to a machine that's going to hold that value. That mapping is quite complex and dynamic; the master can decide, based on which machines have the most available storage, which ones are loaded, and so on, where it's going to put that piece of data. It then pushes the value to some node, and the value appears in the key-value store on that particular node. Then when another machine requests that data, the master directory, which has at least the machine addresses, can forward the request in this setup. The data comes back to the master in that situation and goes back to the requesting node. So in this version of the key-value store, the client never actually finds out where the data was stored; everything is forwarded through the master. That's called a recursive query, and there are obviously some scalability difficulties with that.
If we have thousands of queries per second, they're all going through the master directory. Keys are often small, but values are generally large, so that's a lot of bandwidth going through that node. All the master actually has to send is the machine number, so it can instead send that back directly to the requesting machine. Now the requester has the location of the data, or at least the machine address, and it can send the request directly to the machine it's looking for, which does the save. We've completed the transaction with a lot less communication through the master. Now the data is there, so if another node requests it, that node also receives the machine name and can go directly to that machine to retrieve the data. It's a more scalable approach to updating and querying. Let's look at some of the trade-offs; here are the two scenarios again. The recursive algorithm always goes through the master directory node, which forwards the request, receives the result, and sends it back. In the iterative version, the directory only sends back a machine name, well, an IP address; the client node then sends the request directly to that machine and gets back the result. The recursive version is typically faster in terms of latency, because the key-value store is normally implemented in a data center where the master node is physically close to all of the data nodes, so the latency of that internal communication is very short. In both schemes you've got the same two messages, the get with the key and the value being returned; but in the recursive case the extra hop is across the short data-center network rather than through the internet. So generally you'll get better latency with the recursive scheme, but it has the big disadvantage of a scalability problem, because all the values are going through the directory. Another advantage of the recursive scheme is consistency, which is much easier to check if requests all go through the master: the order of all of the interactions is visible at the master, and if you choose, you can force serialization of updates, basically force them to happen in a particular order, because everybody has to go through this master node. So the master node really holds the reference picture of what the state of the world is. The iterative query, though, is more scalable, because we're moving the heavy traffic, the value traffic, directly between data nodes and clients. The disadvantage is that it's generally going to be a bit slower, because those clients are further from the data nodes in the network topology. And once we give an address to a client, we've really lost track of the order in which steps might be happening between clients and data nodes, so it's potentially much harder to figure out what the state of the distributed data store is. All right, since it is really important to have fault tolerance, the approach is always to replicate the data on several nodes. In fact, for the best protection, the safest thing is to replicate on different racks in the data center, so that a power failure or network switch failure in one rack only affects one replica.
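Before moving on to the replication example, here's a small sketch of the recursive versus iterative query trade-off we just walked through, with comments marking which machine performs each step. The `directory` and `nodes` dicts are toy stand-ins for the master's table and the data nodes:

```python
# Sketch of recursive vs. iterative queries. In this toy model, comments mark
# which machine would perform each step; `directory` lives on the master node.
directory = {"K14": "N3"}                 # master: key -> data node
nodes = {"N3": {"K14": "V14"}}            # data nodes' local stores

def get_recursive(key):
    # Client -> master; the master does everything and returns the value.
    node = directory[key]                 # runs on the master
    value = nodes[node][key]              # master fetches from the data node
    return value                          # value flows back through the master

def get_iterative(key):
    # Client -> master for the address only, then client -> data node.
    node = directory[key]                 # runs on the master (address only)
    return nodes[node][key]               # this hop is made by the client

print(get_recursive("K14"), get_iterative("K14"))
```

The code paths look alike in one process; the difference is that in the iterative case the large value never crosses the master, which is exactly the scalability point.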
So here we have the same setup as before: a couple of values already in the distributed store. Now we're going to insert a new value, and this time, because we've asked for replication of this value, the directory node is going to return two node addresses back to the client. Then it happens in one of two ways. In this situation, the data node itself has forwarded the put request to a different location; actually, this arrow really should be labeled N3, because this node needs to know which second node to forward to. Anyway, here we've made the data node responsible for making the second replica. So this is a recursive version of the replicated update. It might be a little confusing, because it starts with an iterative query of the master directory, which is the default; most systems do that to save bandwidth on the server. Once we've done that, though, it's recursive in the sense that one request is sent to the data node, and the data node recursively forwards that request somewhere else. That's the analog of the recursive query we saw before. There's also an iterative version, in which the client interacts directly with both of the replicas. It starts out the same: the directory node returns the two machines the client should interact with, and this time the client sends to the two replicas more or less simultaneously. At query time, there's really no difference; if we do a get, the query will normally go to both nodes, and whichever one returns first is the answer that's used. Okay, so we're using replication. It does require somewhat more nodes, but the benefit is a system that's both robust against failures and highly available. With replication, the system stays up even through significant failures, and the throughput can be very high as well; we can actually benefit from having these multiple copies. First of all, in the examples so far, we're assuming values come back atomically, so we can have a race between replicas to see who returns first. Another thing we can do is break up large values, such as files or any kind of large record, into smaller blocks and distribute those across machines. Here we're using the distributed store something like a RAID, effectively striping a really large value across different nodes. HDFS, the Hadoop file system, does this: large files are broken up into blocks. They're pretty big, maybe 100 megabytes or so, but those blocks are then distributed across machines. If you want a copy of that file very quickly, all of the nodes that have a piece can send it to you at the same time. So potentially you can get really big throughput benefits from partitioning, distributing, and replicating.
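Here's a toy sketch of the iterative replicated put described above, plus a get that races the replicas and takes whichever answers first. The placement policy and the simulated latencies are purely illustrative:

```python
# Sketch of an iterative replicated put: the master hands back node
# addresses and the client writes to each; a get races the replicas and
# takes whichever answers first (simulated here with random delays).
import random

nodes = {"N1": {}, "N3": {}}

def choose_replicas(key, k=2):
    return list(nodes)[:k]                 # master's placement policy (toy)

def put(key, value):
    for n in choose_replicas(key):         # client sends to each replica
        nodes[n][key] = value

def get(key):
    replies = []
    for n in choose_replicas(key):         # request all replicas in parallel
        latency = random.random()          # simulated network delay
        replies.append((latency, nodes[n].get(key)))
    return min(replies)[1]                 # first (fastest) reply wins

put("K14", "V14")
print(get("K14"))
```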
Yeah, that's a really good question. You have to provide for the case where the entry the client holds is out of date. I think you would have to provide some kind of default timeout to make sure it doesn't use that handle when the data is no longer there. These systems are set up to move data, at least to add replicas and generally move things around, so at some point there probably are stale references. Whether that requires the handle to be voided with some sort of message to the client, or a timeout, or the client simply uses a bad handle and finds out the data's not there, I think is up to the algorithm. There are several different ways it can be handled, and I think it's up to the system, but it's a good question; there's definitely an issue there. Normally, though, these references are not held for really long periods of time; they're normally used in the context of a transaction, so they should have a short lifetime. All right, so far we have a hotspot in the master directory: even if we don't rely on it to directly forward values, we do rely on it to provide the directory into all of the storage. If the master goes offline, we've lost the whole data set. So it's very important to be able to replicate the master as well; that's the simplest solution. It adds more complexity to keeping the system consistent, though, because it's even more important for directory replicas to stay consistent. Another scheme, which again has a throughput advantage, is to partition the directory so that different keys are stored on different servers. That way, even without replication, you maintain fairly high availability of the storage. A lot of search engines use a scheme that partitions their indices without necessarily having replication, at least the early ones did. They partitioned the indices so that, with high probability, you'd still retrieve a given entry, a given search hit. In fact, for search hits, because the long list of hits for the same key is stored across different machines, if a machine went down all that would happen is a few of the hits would be missing, and people generally didn't notice. For the last project, you'll actually be looking at peer-to-peer distributed hash tables, which partition the directory structure so that you get a high-availability index with high throughput and relatively good robustness. All right, a few more issues with this large-scale data store. Load balancing is important: you want reads and writes to be roughly evenly distributed across the data nodes. One way to deal with that is that the master directory node, since it sees all of the queries, can see the distribution of queries across machines, and assuming it's also aware of the storage available on each machine, it can allocate new values to nodes that have available storage. Now, when we add a new node, the simplest solution would be to start putting all new values on the new node, but why would that be bad? Right, yeah, there's a very high probability of overloading it. Assuming client programs have temporal and spatial locality, when you get one hit on a certain datum or a set of new values, you're likely to get a lot more hits on those new values. That's good. And another point is that we already have heavy load on particular nodes in the network, so a new node is an opportunity to decrease the load and improve the performance of existing nodes by moving things over, yeah.
Okay, so I didn't entirely follow the question. There might be scenarios where it would help, but usually it doesn't, because usually you're limited by network bandwidth more than by, say, disk bandwidth; especially if the data nodes are serving multiple queries at the same time, you typically get a pretty small slice of each node's bandwidth. So by having pieces of the data striped, similar to a RAID, if you have little pieces of a large dataset distributed, the speed at which it comes at you is the aggregate. I think what you're thinking of is latency: the time to access the first record from each machine is roughly fixed, say 10 milliseconds. If you only wanted a really small amount of data, what you're saying is true: it's wasteful to stripe small amounts of data across machines, because you'll pay that same latency for each machine, plus the overhead of thrashing around retrieving small pieces. So yes, if it's a small block of data, it's best kept on one machine; small values should at least be blocked together on one machine, because they'll probably be accessed close together. But for larger loads it's better to distribute, and that's how most systems block; HDFS, for instance, uses blocks of about 100 megabytes, large enough that transfer time dominates the disk and network latency. So in most cases it's useful to distribute the larger-sized blocks to get maximum throughput. All right, so when a single node fails, the replicas of the things that were on that node should be on different nodes, and probably on different racks. We need to take those replicas and replicate them again, to reproduce the redundancy that the failed node provided. In the simplest case, the master directory is responsible for that: it has a list of all the things that were on the dead node, it distributes those, again based on load, to other nodes, and then it causes the data nodes to actually do the copies. All right, so there are some specific replication challenges we should talk about. We want to make sure that when we replicate, we get a valid copy on each node. So when data is pushed to a node, the server needs to know it was actually received, say via an acknowledgment, and we have to worry about the case where the node fails during the update. What we can do is try again, repeat the replication; because the directory is doing this dynamically, one of the things it can factor in is non-response from a data node, so it can simply pick another node and try again. It's more complicated if the node is just slow. That implies that the put operation, which is supposed to create enough replicas, has to continue until it's gotten acknowledgments from a certain critical number of nodes. These are the choices you have to make as a designer: you can either extend the timeout and wait, or pick another node and possibly end up with extra replicas. So in general, because you have to wait for the replicas to be made and acknowledged, the put operation is slower; it's effectively a max over the latencies of the individual operations. Gets, on the other hand, are really a min over the nodes you can retrieve from: as soon as you have the directory information about where the data is, you can send requests to all those data nodes in parallel, and whichever one arrives first gives you the answer. So a get is in general a minimizer of delay, and a put is a maximizer.
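To see the max-versus-min point concretely, here's a small thread-based simulation with made-up replica latencies: a put waits for every acknowledgment, while a get takes the first reply:

```python
# Sketch: put waits for all replica acknowledgments (a max over latencies),
# while get returns as soon as the first replica answers (a min).
# Latencies are simulated; no real network is involved.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def replica_op(latency, value):
    time.sleep(latency)                    # pretend network + disk delay
    return value

latencies = [0.05, 0.20, 0.12]             # three replicas, simulated delays

with ThreadPoolExecutor() as pool:
    # put: wait for every ack -> completes after max(latencies)
    t0 = time.time()
    futures = [pool.submit(replica_op, l, "ack") for l in latencies]
    [f.result() for f in futures]
    print("put took ~%.2fs" % (time.time() - t0))

    # get: take the first response -> completes after min(latencies)
    t0 = time.time()
    futures = [pool.submit(replica_op, l, "V14") for l in latencies]
    print(next(as_completed(futures)).result(),
          "get took ~%.2fs" % (time.time() - t0))
```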
Yeah. Well, typically you won't discover that until you make a request. That's a difficult case, right, a node going dark. You're probably not going to ping for specific copies of data, but you will often ping nodes just to check that the node is up. So most of the time, if a node goes dark, you'll find that out and then create replica copies. It's also possible there was a disk error; that's a nasty case, and you wouldn't normally discover it until you try to retrieve the data. Okay, so let's look at consistency now. How closely can a distributed system emulate a single machine in terms of read and write semantics? Suppose we do two insertions with the same key, K14, with values V14' and V14''. What should happen? Well, if we've built the system so that put is indeed atomic, we should end up with either V14' or V14''. Another case: if we do a put, even to the same data node, and then a get, we would expect to get back the result we most recently put in. But that isn't what will always happen; if you trace through some real systems, it's actually quite hard to achieve. So let's look at some examples of what can go wrong (this slide always sticks). Let's look at what happens when some of these updates arrive at about the same time. Here's our directory and the state it's in now: it has a couple of replicas of K14, so there's an old value V14 on these two nodes. We're going to update that value. The first request comes in, and then another one comes in almost at the same time, so the first update hasn't completed before we get the second request. In a recursive system, the requests might be forwarded in a different order; most likely they'd be forwarded in the same order, but the network itself may add latency. Assuming V14' came in first, it would be the first message to arrive at one machine, but with some extra network latency it might not arrive first at the other machine. So now you've got two machines that received the two updates in a different order. On each node you'll have the last value received, but the nodes are not consistent. So unfortunately, with this particular setup, we've defined a really simple system that has an undefined response. Clearly we have to do something else, probably some extra kinds of acknowledgments, to make sure this situation can't happen. In other words, we've seen that with a simple setup with replication, we're not guaranteed to get back the last write. Our master directory, by using parallelism to try to speed things up, actually created an ordering problem. If it had operated truly serially, that would not have happened: it would have had to send out each request and wait for an acknowledgment, and that would have avoided the problem.
You can see a similar thing in this case: here's a get request coming in for that key, trying to retrieve the most recent value, but this time the get request actually arrives before the put request. Again, most likely they went out in the right order, but the network introduced an additional delay on the put, so in fact we've retrieved the old data before the new data was ever inserted; we get back the old value. So you can see that, because of delays, a lot of problems can happen, and you probably get a taste of why the easiest way to make this all work is to enforce sequential consistency: force each operation to happen and complete atomically before trying the next one. The trouble is that if you did that, the system would become incredibly slow; that data node is going to be receiving requests from probably thousands of clients. So we have to find ways of allowing concurrency while at the same time protecting consistency, which is a real challenge. Atomic consistency, which is what we want, is easy to achieve, but the big sacrifice is that we have to do the operations one at a time. So in practice we usually do something less safe that's fast. Later in the class we'll talk in detail about transactions, which are mechanisms that basically guarantee consistency and completeness through all of the updates we do. What we saw in this example was that we ended up in an inconsistent state: two of the nodes with different values stored. So one trade-off you can make, instead of trying to serialize things and making everything really slow, is to strive for eventual consistency. You do allow replicas that are potentially different; the master directory discovers that when it does a get request, and since it sees the inconsistency, it can resolve it, though it takes a little bit longer. You have to make a policy decision about which value to keep, and then propagate that back to the other replicas. That's a common strategy: backing off from instantaneous consistency, allowing some inconsistencies into the system, and resolving them later. It gives you good performance most of the time and hopefully not too many conflicts. There are a variety of other strategies that provide stronger or weaker types of consistency. Strong consistency we described already: it's basically the idea of serializing all of the updates. It has serious performance consequences, namely losing your parallelism, but it does guarantee everything is consistent, because all of the copies are updated before you can do another transaction. Yeah, it should be on the order of seconds, because that's really enough time for the master directory to do some resolution and for the changes to propagate through the system; not a huge amount of time. And yeah, the client would presumably have to notify the master that there was an inconsistency, and then the master would be responsible for resolving it.
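One common way to resolve diverged replicas under eventual consistency is last-write-wins, using per-write timestamps. This toy sketch shows the idea; it's one possible policy, not the method of any specific system, and it inherits the usual caveat that it trusts the writers' clocks:

```python
# Sketch of one reconciliation policy for eventual consistency:
# tag each write with a timestamp and, on conflict, keep the latest
# (last-write-wins). The timestamps here are arbitrary example values.
replica_a = {"K14": (1700000001.0, "V14'")}    # key -> (timestamp, value)
replica_b = {"K14": (1700000002.5, "V14''")}   # the replicas have diverged

def resolve(key, *replicas):
    ts, value = max(r[key] for r in replicas)  # newest timestamp wins
    for r in replicas:                          # propagate winner back
        r[key] = (ts, value)
    return value

print(resolve("K14", replica_a, replica_b))     # -> "V14''"
```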
So an interesting approach to getting consistency relatively efficiently, both fast and reliable, is called quorum consensus, and it balances the costs of the put and the get. Assuming we have a replica set of size N, the put operation pushes the value to all N data nodes in the replica set, but it only waits for acknowledgments from a subset of size W. W is generally less than N, maybe about half of N. And the get waits for responses from some number R of replicas, also maybe about half of N. Now, what happens if W plus R is greater than N? What's guaranteed if the number of acknowledging nodes plus the number of responding replicas exceeds N? Yeah, it's a pigeonhole argument: at least one node both received and acknowledged the write request and also responded to the read request. So it returns the most up-to-date copy, and if there's a way to recognize that it is the most up-to-date copy, that's the one that gets used. So we're guaranteed by this arithmetic that at least one node actually did receive the update and also responded to the read. That's the simple model. Here's an example: three-way replication with a write threshold of two and a read threshold of two, replicating on nodes N1, N3, and N4. Let's assume one of the write operations fails, so the write to N3 doesn't work; the other two nodes acknowledge that the operation succeeded. Now we issue a get. The data is missing from the middle node, but it is on the other two. Let's assume the request went to all three but just two of them respond. Even if the node that didn't do the update responds, because one node is in both sets (this node was both a receiver of the update and a responder to the read), we're guaranteed that it gives the correct response back to the client or master. So things worked out: by the arithmetic, as long as there are two responses, one of them will be good.
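Here's a minimal sketch of that quorum scheme with N = 3, W = 2, R = 2, using version numbers so the reader can pick the most up-to-date of the R replies. The liveness flags are just for simulating the failure case from the example:

```python
# Sketch of quorum reads/writes with N=3 replicas, W=2, R=2 (so W+R > N).
# Each stored value carries a version; the reader keeps the highest version
# among the R replies, which must include at least one fully written copy.
N, W, R = 3, 2, 2
replicas = [dict() for _ in range(N)]

def quorum_put(key, value, version, alive):
    acks = 0
    for i, rep in enumerate(replicas):
        if alive[i]:                       # a down node never acknowledges
            rep[key] = (version, value)
            acks += 1
    return acks >= W                       # succeed only with W acknowledgments

def quorum_get(key, alive):
    replies = [rep[key] for i, rep in enumerate(replicas)
               if alive[i] and key in rep][:R]
    return max(replies)[1] if replies else None   # highest version wins

# The write succeeds even though the middle replica is down:
assert quorum_put("K14", "V14", version=1, alive=[True, False, True])
print(quorum_get("K14", alive=[True, True, True]))   # -> "V14"
```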
Okay, so you're asking, if the network was pathological enough, what if all the get requests went out before the put requests were acknowledged? In that case, if we have a quorum of read requests, we still have consistency in the sense that you get a copy of the data from before the updates happened. Remember, we're trying for consistency, not necessarily for maximum freshness, right? So if the updates simply happen later for normal reasons, at least the value that goes back is consistent. That's all we can do in that situation, yeah. Well, of course, if we're running in the sequential mode, then the master node should not be serving reads it receives while a put is still in flight; as described, this really only works with the master node in the loop. If the master receives those read requests later, it should block, waiting for at least W of the update acknowledgments. The trickier problem is if reads are happening directly against the data nodes; then I think the master has to be in the loop, tracking these read and write counts, because otherwise there's no way to determine whether the constraint is met. Actually, let me think. Yeah, I think it's sensible to wait, assuming the same node is doing the updates. The tricky question is if a different node is trying to read. It's pretty clear that if there isn't somebody checking that the constraint of the minimum number of write acknowledgments has been met, then this can't work. So let me get back to you on that; I'd better figure out what the right solution is. All right, so anyway, we have one simplified version, which we just presented, of achieving approximate consistency by looking for a consensus among the data nodes. But we saw before that there are a lot of challenges to getting consistency if we want replication and availability at the same time. This is one model; there are a lot of other consistency models. Anyway, let's look at what we've learned so far; I'll do the quiz right after the break. Some quick reminders: the project two code is due on Thursday, Thursday midnight, so please keep that in mind. Submissions are coming in pretty slowly right now, and there's not a lot of time. It should be a pretty easy assignment, but please try not to waste slip days, yeah. So, if you have received back enough copies, then the quorum algorithm guarantees that they're at least consistent. You may have received an older copy as well, so partly there are timestamps you can use to decide which one is the most current value. But other than that, no, I think that's the best you can get out of that approach. Anyway, sorry: project two code is due on Thursday, and the group evaluations are due on Friday. Please try not to use your slip days for this one; there are a couple of challenging projects coming up later in the semester that are going to take a lot of time. There are only four slip days, remember, and after that we start deducting 10 points for each day late. So please try to get them in soon, yeah. Say that a bit louder. I'm sorry, if you submit your what? I don't think we're taking any more of those for the group evaluations now. Okay, so let's take a five-minute break. Okay, let's continue. Let's review some of the ideas about key-value stores and about consistency generally. First: distributed key-value stores, should they always be consistent, available, and partition tolerant? Yes or no? Should? All right, should but can't; fair enough. Yeah, unfortunately we can't do that, and that's the CAP theorem. On a single node, a key-value store can be implemented by a hash table: true; basically it's a distributed hash table, so on one node a hash table works. The master can be a bottleneck: yeah, true. Iterative puts achieve lower throughput than recursive puts: it's the opposite, good. And with quorum consensus, we can improve read performance at the expense of write performance: well, we are allowing the system to respond with a limited number of reads, so there is a read benefit, but there is also an additional expense in writing. All right, so we're going to change topics now and look at network protocols for a while. The next few lectures are going to be about networking and IP protocols. At a high level, let's review what a protocol is. A protocol is an agreement on how to communicate.
It includes a syntax, which covers how the communication is specified: what specific fields have to be in the communication, how records are organized, and the order in which things are sent and received. And then there's a semantics, which has to do with what should happen during the protocol, and also the meaning of certain fields, like encodings, byte orders, and so on. Those are the two elements. We have protocols in most things that we do; whenever different agents are trying to coordinate their activities, they often use a protocol. On the telephone: you pick up and listen for a dial tone to make sure you have service, dial a number, wait for the ring tone; somebody says hello, and you say hi, it's me, or hi, it's John. Then you dive into the conversation, always pausing to wait for the other person to respond; they talk and pause; and eventually you finish the call, and everyone says bye, which is an acknowledgment that you've finished the conversation. In the classroom, when you want to ask a question, you normally start by raising your hand; if the instructor is in the middle of talking about something, you wait for that transaction to finish and then get called on. If it's not a structured situation, just a regular meeting, you'll often just wait for a pause in the conversation, interject what you want to say, and continue. Okay, so in internet communication, you have a variety of different clients and services: PCs, now mobile phones and smartphones, and then larger computers that are providing services; these machines are known as hosts. A client program is something running on a user's host that requests a service, say a GET request to an HTTP server asking for a specific document. The server program, also running on an IP host, provides the service and implements, say, a web server, which involves a substantial protocol for responding to queries, including responses like "the site's not available, come back later." Client programs are assumed to be only sometimes available; the server is expected to be highly available. Clients normally initiate requests to a server for some operation they're interested in; servers are expected to be running all the time and serving requests continuously. Clients have a lot of difficulty communicating with each other directly; when they do, it's typically through a server. On the other side, servers don't normally initiate communication, with a few exceptions; they wait for requests from clients. In order to use a server, a client somehow has to have an address for that server: it can be an IP address or a website URL received through the web or a search engine. And for that to work, there must be a stable mapping from a symbolic name to the address of the server. There are also peer-to-peer protocols, where essentially clients communicate with each other. These require some minimal amount of directory information for each peer to find another peer; once that communication is initiated, the connection is set up, but it has to be dynamically managed as clients change addresses. So there's a directory structure, often distributed and dynamic, that allows clients to move around.
Every time you reconnect, even if you change addresses, say because you're moving around with a phone, you'll maintain the connection because the directories update. An example of this is BitTorrent: any host can request files, send files, or run a query, and can also warehouse some fraction of the data. This kind of scale is provided by millions of peers, and there's no boundary in this sort of system between clients and servers; each node hosts some of the content and provides some of the services, but also typically retrieves content for its own user. Okay, so we have a lot of these different types of application that people would like to run on the internet, and a lot of different physical networking layers as well: wireless, a couple of different wired protocols, and now optical networking is starting to resurface. It's always been an important part of data centers, but it's becoming important for end-user networking as network speeds creep up beyond a gigabit. So we have all these different application layers and network layers, and it creates a potential N-squared problem: every time we add a new kind of high-level service, we need to make it interoperate with all of the low-level services we have, all of the physical media for transmission. And similarly, if we add a new network transport layer, like a different type of radio, that has to be linked back up to all of the high-level services people might want to run. It's a quadratic blow-up, and we really don't want to have to do that. So how does the internet address this? We rely heavily on a layering approach, where communication happens through a multi-layered set of protocols. The high-level applications only have to communicate with the top layer of that stack; the physical media similarly only communicate with the bottom layer; and there's additional partitioning between the layers to prevent this kind of quadratic complexity from happening inside the protocol stack. In this setup, when we add a new high-level protocol, we only have to make it interoperate with the next layer down in the stack, and if we add a new physical transport, we only have to make it interoperate with the bottom of the stack. Now we've made everything linear, and a lot simpler and more scalable. Just as we partition complex software systems into modules, define their interfaces, and separate interface from implementation, internet protocols do the same thing. The interfaces give you flexibility in how you actually implement the protocols underneath and allow you to easily add modules on top; the modules only need to know about a limited set of functionality, of methods effectively, to call in order to do what they want to do. Libraries do a very similar thing: they present specific high-level APIs to a programmer, and they can change their implementation and become more efficient without adding any extra burden, in fact while continuing to run with existing client code. Similarly, on the internet, these interfaces, basically protocol APIs, allow people to implement their applications just once. There can be a lot of evolution in the network; there are new lower-level transport protocols being used for some of the faster optical networking layers, but nobody sees that at the application layer.
Generally, this is a good thing: hiding things behind these layers of abstraction is a good thing. But there is a performance penalty, because it means the low-level transport layers are not aware of what kinds of bits are flowing over them. There's a big difference between, say, text content being streamed from a website, basically HTML content, and real-time video or some other kind of video stream, which isn't as easily partitioned and where latency is extremely important for the experience of watching; the internet protocols mostly hide those differences. So there are great advantages in terms of simplicity, but often performance penalties, in having a simple, universal protocol stack. In making these design decisions there are two main principles. One of them is breaking this complex set of protocols into different layers, so that we can go progressively from a very high level of abstraction, say HTTP, which just specifies the syntax and semantics of web requests and pages, all the way down to the encoding of Ethernet packets at the network layer. To ease that transition, we have a layered approach. The other important principle for the simplicity of the network is called the end-to-end principle, and it says that you shouldn't put any state or parameters into the network unless you really have to; in particular, if you can move some functionality, say error-recovery functionality, into the end points, you should do that instead of making the protocol itself more complex. The end-to-end principle is complementary to layering: layering says keep things simple by having simple APIs and a progression of abstraction; end-to-end says keep the network itself simple and put functionality at the application level where you can. You'll see a bit more of that next time; it's the subject of the next lecture. All right, so layering. It's a simple idea, simple to explain anyway. The idea is that every layer should rely solely on services from the layer below; so above the Ethernet layer you have layers that implement particular styles of packets, either UDP or TCP/IP packets, and each level should only export services to the layer above. You're only allowed to drop down one level or move up one level in the layering. So there's a well-defined interface between the layers that helps hide the implementation details, and it allows people to optimize each layer without disturbing the interfaces of the other layers.
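Here's a sketch of layering as header encapsulation: each function stands in for one layer, only calls the layer directly below it, and wraps the payload in its own header. The header strings are made up for illustration, not real packet formats:

```python
# Toy model of layering: each layer adds its own header and only ever calls
# the layer directly below it. Header formats are illustrative only.
def link_send(packet):
    frame = b"ETH|" + packet               # link layer adds its framing
    return frame                           # handed to the physical medium

def network_send(segment):
    packet = b"IP|" + segment              # network layer adds addressing
    return link_send(packet)               # only talks to the layer below

def transport_send(payload):
    segment = b"TCP|" + payload            # transport layer adds its header
    return network_send(segment)           # only talks to the layer below

# The application only ever sees the top of the stack:
print(transport_send(b"GET /index.html"))  # b'ETH|IP|TCP|GET /index.html'
```

Swapping out `link_send` for a different physical transport changes nothing above it, which is exactly the linear-instead-of-quadratic point.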
All right, so protocol standardization is a way of ensuring that all of the hosts speak the same protocol, and it allows different layers to be implemented in different ways. If you don't do that, you're relying on a few expert programmers to do all of the work; that does happen sometimes, but mostly the internet is driven by standards that are evolved by a large number of actors. It's a slow process because of the number of actors involved; it usually goes through an extended request-for-comments (RFC) process, where tentative standards are proposed and all of the companies that have a stake in what these protocols might do can weigh in and provide their input, until the standard is eventually codified and becomes an internet standard. The organization that manages this is the IETF, and you can actually see the history of the RFCs at this location here, so you can really get an idea of how much complexity goes into designing these protocols. On the other hand, there are some de facto standards: Skype and BitTorrent are basically systems that people wrote, and other systems interoperate with them essentially by trial and error, or by agreement with the developers of the original code. All right, so quickly, there are two important, different types of transport over the internet. The first is datagram transport: the idea is to make a best effort at delivering packets. Each packet contains some amount of data and a header that specifies the source, where the packet should be going, and something about the content. It provides only the weakest kind of service: the packet mostly arrives quickly, but there's no guarantee that it arrives at all. A lot of guarantees are not provided: packets might be lost, they might be corrupted, or they might be delivered out of order. By contrast, TCP is a robust layer which provides an ordered, reliable byte stream with simultaneous transmission in both directions. It relies on retransmission: TCP provides an abstraction of reliable transmission by relying on acknowledgments for packets that are sent, and retransmitting in order to produce the right output. At the receiving end, because there may be retransmits that turn out to be unnecessary, it discards duplicate packets, and it takes packets that were received out of order and puts them back in the right order. And there are two high-level control mechanisms layered on top of that acknowledgment system. The first is flow control, which the sender implements by looking at acknowledgments from the receiver, typically noticing that those are starting to lag, suggesting the receiver is not ready to receive more information, and then slowing down its transmission, or at least waiting for the acknowledgment before sending more. The second is congestion control, which involves the sender figuring out somehow that the network is congested; that's going to involve sensing that packets are being lost, either the acknowledgments or the original packets, which implies a heavy network load. Both of those adjustments are made slowly, by observing the behavior of the network over a period of time, but almost all implementations of TCP include these mechanisms.
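Here's a small sketch of the two transport styles using standard Python sockets. The addresses and ports are placeholders with nothing listening on them, so this shows the shape of the API rather than a real exchange:

```python
# Sketch contrasting the two transport styles with standard sockets:
# UDP gives best-effort datagrams, TCP a reliable, ordered byte stream.
import socket

# Datagram transport: each sendto() is an independent, best-effort packet.
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"hello", ("127.0.0.1", 9999))   # may be lost, duplicated, reordered
udp.close()

# Stream transport: connect() sets up a bidirectional, ordered byte stream;
# the kernel handles acknowledgments, retransmission, and reordering.
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    tcp.connect(("127.0.0.1", 8080))        # fails unless something is listening
    tcp.sendall(b"GET / HTTP/1.0\r\n\r\n")
except OSError:
    pass                                     # no server on this port in the sketch
finally:
    tcp.close()
```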
So, very quickly before we go: protocols specify syntax and semantics, true or false? True, okay. Protocols specify implementation: false, okay. Layering helps to improve application performance: false, all right, awesome. Best-effort packet delivery ensures packets are delivered in order: false. In P2P systems, nodes act as both client and server: true. And TCP ensures each packet is delivered within a predefined amount of time: false, all right, excellent. Okay, good job; see you next time.