Okay, welcome back everybody to the last official lecture of the term. We're gonna have one more on Wednesday, which will be a special-topics lecture, but I'd like to continue where we left off. We were talking about distributed storage, and if you remember, before we got into that topic, we were talking about the remote procedure call idea. The idea behind a remote procedure call is really that a client can link with a library that includes a bunch of stubs, which allow it to make function calls that go all the way across the network to a server machine, with the return coming back, and the client can deal with them just as it would a local function. Okay, so that's a remote procedure call: we're making a procedure call remotely. Some of the key ideas we talked about were that the arguments to these procedures have to be packaged up by the client stub. They're packaged up in a network-independent way and serialized as a set of bytes, and then they're sent across the network where they're unpacked, and the server stub calls a server function with the deserialized versions of those arguments. Then the return value gets serialized again, sent across the network, received, and handed back to the client as the return from a function call. And so the client can use this regardless of the fact that it's remote. A couple of other things we talked about were, among others, how these stubs get generated: there's a special interface definition language (IDL) that you use to describe the procedure calls, and a compiler that generates the stubs for the client and server sides. And you can have the server be remote, of course, or local, and the client doesn't have to know the difference other than a difference in performance. Okay, now today we're gonna show you a pretty common example of the use of RPC, which is to make an actual remote file system work. Okay, before I pass on from this, are there any questions?

All right, the other thing we talked about was the CAP theorem: the consistency, availability, partition-tolerance theorem, which really started as more of a conjecture by Eric Brewer back in the early 2000s, but it has since been proved in various ways. The basic idea is that you can have two out of these three, but you can't have all three. Consistency, availability, partition tolerance: pick any two. And the thing to keep in mind here is that consistency means that when you change the file system on one side, everybody sees those changes consistently; availability means that you always have the ability to access the file system; and partition tolerance says that the system can survive the network being cut in half.

Okay, we have a late question here about RPC, which is fine. The question is: if the client sends pointers as arguments, does the client stub have to load all of that in? So, pointers basically don't mean anything cross-machine. Part of that serialization has to be taking any data structures that consist of pointers and serializing them into a complete set of bytes to send across. There are sometimes specialized ways of packaging up opaque pointer references and sending them off to a server, but the server doesn't know what to do with them; they would only be for returning back later to the client.
So I think the short answer to the question is: yes, if you have any structures made out of pointers, they have to be serialized into bytes before they're sent across, otherwise they don't mean anything. Okay, and this CAP theorem, by the way, just to finish that thought, will have an impact on pretty much any remote storage we might have to deal with. It certainly comes into play when we start talking about cache consistency of the file system, okay? All right, are there any questions on the CAP theorem?

All right, so let's talk about distributed file systems then. The idea behind this figure is that the storage is gonna be in the network somewhere (today we call it the cloud, I guess) and you can use that storage no matter where you are. You could be here at Berkeley on the left coast, you could be in Boston on the right coast, or in Beijing, whatever, and you could still use the data. And in some file systems you can even use the data while you're driving from one coast to the other. That all works because the storage is in the middle, but once you start having things remote in the middle, you start running into the CAP theorem. So what is a distributed file system? Well, it's pretty simple, and you've all used one many times: we have a laptop here and a server that actually has the data. Instead of the local file systems we've been talking about for the last several weeks, in this case you're actually sending your read requests over the network and the responses come back over the network, and the server is a separate node somewhere else from the client that's using it, okay?

Now, a question in the chat: couldn't a solution be to make the network itself resistant to partitions, so the system doesn't have to be partition tolerant? The problem with that is we run up against the end-to-end argument, which really asks how much work you want to put in the middle of the network to make it so that it never partitions. In practice, you can add a lot of redundancy to the network; you can have many alternate paths that can be taken, but ultimately it gets very hard to guarantee there is never a partition. Still, redundancy helps: if you can't take the path straight from Berkeley to Boston, maybe you go by way of Alaska and down. If you have enough alternative paths, you can sometimes make the probability of partitions very low.

So what we really want with a distributed file system is transparent access to files on the remote disk, so that the client doesn't have to know it's remote from the standpoint of how you interact with it. The only way you notice is that things are slower, okay? One of the concepts here is the notion of mounting a remote file system onto the local file system. Here's an instance where we have the local root (that's the little slash up here) and /users. At /users/jane, we've mounted a partition called jane from a server called kubi. And then inside that jane file system there's another directory called prog, where we've mounted a different partition from kubi, okay?
So what happens there is that the laptop user says /users/jane/prog/foo.c, and in reality, because of the way we've mounted this, that's the file foo.c in the /prog partition on the kubi server. So by mounting, we can essentially get transparency about the fact that these files are actually remote, and the local user doesn't have to know the difference. That's a form of transparency we get with the mount system call, okay?

Now, of course, that raises all sorts of questions which we don't have a lot of time left in the term to answer, but one naming choice, which you can see pretty clearly in this figure on the right, is that every file is in principle a tuple of a host name and a local name on that host, and everywhere below in the operating system we talk about files that way. This is the simplest thing to do, and it's what we often do, for instance, in the department. It's fine, except that it doesn't give you much opportunity to move files around to load balance or to deal with failures. It does let you do DNS remapping: if kubi the file server went down, I could remap its name to a different server's address and I'd still be up and working. Another alternative, though, which is much more interesting in the grand scheme of things, is a global namespace, where every file name is somehow unique in the world; there have been several instances of that over time. Today we'll talk a little bit about one which can be based on hashes over the name.

Okay, so let's talk about what's involved in making a remote file system work. We've talked a lot over several lectures about how to make local file systems work, but what about remote ones? Somehow the "device driver" (I'm gonna put that in air quotes) talking to the disk has got the network involved. That's a little strange already, right? Because we think of device drivers as going from the operating system down into a controller and then to the local disk. But instead we're going from the system call interface into the network, over to a different server, and only then into a device driver. So we need some abstractions to let us do that, and one of those abstractions is called VFS, okay? It was originally the Virtual File System, and then in Linux it became the Virtual File System Switch; I'll show you why "switch" maybe makes more sense. If you take a look at what I've circled here in our kernel, the file system calls actually go through a layer for handling files and directories, which is the VFS right there, and below VFS are potentially many file system types, some of which work over the network, okay? Some of these file system types might interact with the network subsystem, go out, come back through a different network subsystem on the other side, and then go back into a file system and down to the block devices, okay? So this VFS is gonna be an enabling abstraction that allows us to mount file systems, first of all of many types, and second of all, including ones that are across the network, okay?

So what exactly are we talking about here? If you remember our layers of I/O, you do a read, which takes you into the libc version of read, which does a system call, which takes us into the kernel's system-call processing.
And then if you look down here (this is, by the way, a slide from lecture 10 or thereabouts), we actually have something called vfs_read, which gets called from the higher layers and ultimately from the user, VFS being the virtual file system, okay? That call interacts with the VFS layer, and you can think of that layer this way: you've got the client process at the top, it comes through the VFS layer, and depending on which part of the directory tree we happen to go to (remember the mounting), we could be going into an ext2 or ext3 file system, kind of like BSD, or we could go to an MS-DOS FAT file system, and either of those can be used in the same way by the client, okay? So in this example of opening /floppy/test and then writing to /tmp/test, what I'm actually doing in this loop is reading from an MS-DOS file system and writing to a UNIX file system, and it all works because of the abstraction of VFS, okay? So that's pretty good, right?

So how does that work? The VFS layer is like a local file system without any of the disks involved. It's really just a set of hooks that let you plug in the functionality needed for the client to interact with a file system, okay? It's compatible with all sorts of local and remote file systems, and it allows the same system call interface above, regardless of the file system. Now, we won't go into this in great detail, but you could look up VFS in Linux, and it would tell you that this is an interface with four primary objects: a superblock object, an inode object, a directory entry (dentry) object, and a file object, representing all the things we talked about when we covered UNIX file systems. What's interesting, though, is that depending on what file system you plug in, it may not even have an inode object. Think about the FAT file system, right? There's no inode there. So really what happens is this VFS layer gives the underlying connector the ability to fake something that looks like a UNIX file system: you can make it look like directories are made out of files even if they're not, you can make it look like there are inodes and superblocks and so on, regardless of whether those things really exist in the underlying file system. That layer, which we sometimes call a shim layer, basically allows you to plug in things that the VFS layer can then make look like file systems. Okay.

So I'm going to talk to you about NFS, which is in some sense the first user of VFS, back when it was just the Virtual File System. And this has persisted to this day; it's persisted for the last 20, 25 years. So let's talk about a simple distributed file system in a little more detail. First of all, we talked about RPC, so we're going to be making procedure calls. When the client needs to do a read, the read goes into the VFS layer, and the VFS layer could just make a remote procedure call to the server, which talks to the disk, gets the blocks back, and returns the data. So we could have a whole bunch of these round trips, and because this is RPC, we don't even have to care about the endian-ness of the client versus the server, because the client can call a procedure on the server and it just works, okay? And this is kind of the first way that people built remote file systems.
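Just to make that concrete, here's a minimal sketch of what such a client-side read stub might look like. Everything about it is illustrative: the wire format (three 32-bit big-endian fields for a file ID, offset, and count, answered by a length and the data) is made up for this example, and a real system would use a proper RPC layer like the XDR-based one we'll see with NFS.

```c
/* A minimal sketch (not a real protocol) of a client-side RPC stub
 * for "read".  Assumes an already-connected TCP socket and a made-up
 * wire format: three 32-bit big-endian fields (file id, offset, count),
 * answered by a 32-bit byte count followed by the data itself. */
#include <stdint.h>
#include <unistd.h>
#include <arpa/inet.h>   /* htonl / ntohl */

ssize_t rpc_read(int sock, uint32_t fileid, uint32_t offset,
                 void *buf, uint32_t count)
{
    /* Marshal the arguments in a machine-independent (big-endian)
     * order and ship them to the server. */
    uint32_t req[3] = { htonl(fileid), htonl(offset), htonl(count) };
    if (write(sock, req, sizeof req) != sizeof req)
        return -1;

    /* Unmarshal the reply: length first, then the bytes themselves. */
    uint32_t netlen;
    if (read(sock, &netlen, sizeof netlen) != sizeof netlen)
        return -1;
    uint32_t len = ntohl(netlen);

    uint32_t got = 0;
    while (got < len) {
        ssize_t n = read(sock, (char *)buf + got, len - got);
        if (n <= 0) return -1;
        got += n;
    }
    return (ssize_t)len;   /* caller sees an ordinary read()-style result */
}
```

The point is just that the caller sees an ordinary read-like function; all the marshalling into a machine-independent byte order happens inside the stub.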
We're using remote procedure calls to translate every operation, but there's no caching at the client, just at the server. The advantage of this is that the server provides a consistent view of the file system, just like it would if you were running processes directly on the server. So that's good. The downside is that it's really not performant, okay? It's expensive to go across the network, even when you're on the local network, where a round trip costs you a millisecond, and much worse if you have to go across the metropolitan area or globally, where you're talking 10 milliseconds, 100 milliseconds. That adds up really quickly for every block read, okay? So this is fine as an abstraction, and we could throw it together really quickly, but it's really not gonna work well, okay? There are actually ways of mounting a remote server with SSH, for instance, that kind of act like this, where you just open a tunnel and essentially get a mounted file system. It's really not gonna perform very well, okay? But you can do it.

So obviously the thing to do is caching, right? We've talked lots about caching. Remember, everything in an operating system is a cache; you can quote Kubi on that. If there's nothing else you get out of this class, you can quote me on that. So what we're gonna do is put caches in the system at the client side in addition to the server side. The server-side cache is kind of easy because that's the buffer cache. But we would also like to use the buffer cache on the client side, and how does that work? Okay, the advantage is that if you can somehow do the open-read-write-close portion locally, because maybe you've cached credentials and information about some remote file, this gets really fast, right? The very first read to some file reaches out with an RPC across the network, pulls the result off the disk, puts it in the server cache, returns across the network, lands in the local cache, and returns the result. So that first read was slow, but boy, the subsequent ones are very fast, right? They just return the value that's in the cache. So that sounds good.

But what are some problems with this? One of them is failure. Consider this idea: we have a writer on some different client, they write some data into their cache, and poof, that machine crashes. Notice what just happened: we just lost data. That's because the data was cached on the client and never made it to the server, and it's now gone to /dev/null. So clearly, the moment we start putting caches into the system, we've got some data reliability issues to worry about. Of course, we could force ourselves to do the RPC first, wait for an acknowledgment, and only then return to the client, so the client never gets back an OK until it knows the data has been placed on the server. That seems like a simple fix, because now if we crash, we haven't actually lost the data, right?

What are some other problems? Well, something else rears its ugly head, which you can probably see here: one client's cache has the first value and another client's cache has the second value. So if client one reads, they get V1, and if client two reads, they get V2, and we have a serious cache consistency problem, okay? Now, the question in the chat is frankly the obvious one, which is: how the heck do you deal with this, right?
So yes, this is a problem. Now, we're gonna talk about some solutions, but you can start imagining some of them. Like: whenever you write, you have to first invalidate all the other caches, and then you get to write, so that when the others go to read again, they get the new value back, right? Or you could say, well, a little bit of inconsistency is okay as long as I eventually poll and get consistent data back, all right? Or you could push changes out. There are many options here. As they say, the first step is recognizing that you've got a problem, okay? And the other thing I'll point out, by the way, is that if for every write you're always broadcasting the results, you're potentially using a lot of network bandwidth, and that may or may not be the right thing to do, okay? What's good about the questions you're all asking is that you've got the right point of view: this is clearly an issue, okay? So we'll talk a little bit about what you can do.

But before we get there, let's talk a little more about dealing with failures. We kind of said that if you acknowledge all the writes, you can save your data. But what if, in general, the server crashes, okay? In that instance it's not a client crashing, it's the server. You might say: can the client just wait until the server comes back and keep going? In many cases the client can't wait that long, because who knows how long the server will take to reboot. And maybe changes that are in the server's cache but not on disk get lost, okay? When we talked about the buffer cache holding uncommitted results, there was that window of time where the server might crash and the data isn't there. So clearly we want to start with good journaling on the server side, and that's something we already know how to do. So we'll assume the server is doing its best with some sort of journaling, RAID, whatever it takes, okay?

But something a little more subtle might be: what if there's shared state? Think about this for a moment. The client does an open; by now, you're experts in using the open system call. Then it does a seek: it says, start me out at byte number 5003, and then it's gonna do a read, okay? Now, on the local file system, that sequence just works, right? You seek to byte 5003 and your next read starts there. But in the case of a remote file system, if you do that seek across the network, and then the server crashes and comes back up, and now you do your read, probably the wrong thing is gonna happen, okay? This idea of shared state, where in this case we're sharing the current file position between the client and the server, lends itself to some really weird failure modes, okay? A similar problem might be this: what if the client removes a file, but the server crashes before acknowledging it? Maybe the client doesn't know whether the file was removed or not, okay? And if that removal was part of a cleanup process, or maybe part of a temporary build, all sorts of weird things could happen, because that file is actually still around even though the client thought it was deleted.
So one thing we can do is change our thinking a little and try to make sure that all of our interactions with the server are stateless. A stateless protocol is one where all the information needed to service a request is included with the request, okay? You're all very familiar with this idea from HTTP: when you go to a website, the state of your session is typically kept in cookies on the client side, and the important cookies get sent with every request. As a result, the server doesn't have to hold on to any information, okay? So in the case of a file system, maybe instead of setting the file position by seeking, what we do under the covers is send, with every read, the idea of "I'm at byte 5003, please give me some bytes back," okay? That would be a stateless protocol.

Now, a question here might be: would an alternative be to make a bunch of requests and then wait for a bunch of ACKs? Yes, we could do that, but once again we're starting to get into that weird General's Paradox kind of position. If we can just do everything statelessly, it's much simpler; we don't have to worry about what the server knows and what it doesn't. An even better adjunct to a stateless protocol is idempotent operations, which means that not only is the protocol stateless, but if I do the same operation multiple times, it's okay: it has the same effect as doing it once. That's the difference between a write that appends to a file, where I have to be sure I append exactly once, and a write to a particular block in the file or on disk, which I can do as many times as I want. The data will get written, and if it's already been written and I write it again, nothing changes. So that's an idempotent operation, okay? If a timeout expires without a reply, then with an idempotent operation you can just retry, okay? So the idea of statelessness is very appealing for reasons like this, and again, HTTP is a good example of a stateless protocol. So the question might be: can we make use of that? And we will, right? I'll tell you about NFS, which is a stateless protocol by design.

Okay, I wanna do a couple of administrative things before we get there. Our last midterm (and again, there's no final in this class, keep that in mind) is this Thursday, 5 to 7 p.m. All material up to today is included, although we'll be focusing on the last third of the class; we're assuming you don't necessarily forget everything from the beginning. We're gonna assume that cameras and Zoom screen sharing are in place, and there's no excuse not to have them turned on, so you can lose points for not having the camera and screen sharing on when the TAs come talk to you. We might remind you at the beginning, but it's really on you to make sure that's all working, okay? And we're gonna once again distribute links like we did last time, because I think that worked pretty well. There's gonna be a review session tomorrow from 7 to 9. I didn't look tonight, but I know there's already a Zoom link for it, and it should be published on Piazza; just watch for that. Lecture 26, which is Wednesday, won't be on the exam, but it's gonna be a fun lecture on topics of your choosing, should you send them to me.
So feel free to send me a couple of queries and I'll do what I can to get them into the lecture. Barring that, I have some other things I'll talk about: a little bit about data capsules, which I'm working on, et cetera, okay? You can let me know just by sending me email; that seems simplest right now. Oh, the other thing: as with last term, HKN is virtual. I was gonna post a video on how to make sure you get a chance to comment on the class; we'll post that on Piazza, but don't forget to do your HKN evaluations. That's always useful. And I think that's all the administrivia I had, unless anybody has other questions. I'm sure you'll all do very well on the midterm; I'm offering you good wishes in advance. And I'm gonna miss having our little time here every night, Pacific time. I don't know what time it is for you guys; some of you are in vastly different time zones, I know.

Okay, so let's talk about the Network File System from Sun Microsystems. This particular file system came out in the 80s and was in pretty widespread use; it still is. There are three layers to this, and you're already aware of the first two. There's the UNIX file system layer, which is the system call layer: open, read, write, close, file descriptors pointing at file descriptions. You're all very aware of that. The VFS layer is the layer I just introduced you to, which distinguishes local from remote files purely by plugging in a table full of functions that get called as a result of the system calls. And then there's the NFS service layer at the bottom. That's the part that handles the NFS protocol: it does the RPC and serializes everything into a network-independent format. XDR is the serialization format for that RPC; it was one of the first ones out there. In fact, NFS may have been one of the very first systems to have a network-independent RPC layer. Operations for reading and searching directories, manipulating links, accessing file attributes, et cetera, are all part of that protocol, and that's what goes across the network in the NFS service layer.

The other thing it has (certainly the first versions of NFS, 1.0 and 2.0, had this very visibly shown to the user) is write-through caching: modified data is committed to the server's disk before results are returned to the client. That got relaxed a little over the years, where the client might return while the write is still going through from the client's buffer cache, but by and large it's a write-through approach, where transactions aren't done until they're committed on the server side. This can slow things down quite a bit under various circumstances, but you have the advantage of knowing that your data made it.

Servers are stateless, so the protocol is a stateless protocol, as we were discussing. Reads include all the information needed for the operation. So you say "read at this i-number and this position," not "read this open file," okay? That way, all the information, for instance the current position I wanna read at, is included in the request, okay? And there's really no need to do an open or close on the file across the network, because the local client has enough information and every operation is satisfied on its own, okay?
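You can actually see this stateful-versus-stateless distinction in the ordinary POSIX interface. Here's a small sketch contrasting lseek() plus read(), which relies on a remembered file position, with pread(), which carries the offset in every call, the same flavor as the NFS read request:

```c
/* Stateful vs. stateless reads, sketched with ordinary POSIX calls.
 * lseek()+read() depends on a current position remembered between
 * calls; pread() carries the offset in the request itself, which is
 * the flavor of interface a stateless protocol like NFS wants. */
#include <fcntl.h>
#include <unistd.h>

void example(void)
{
    char buf[512];
    int fd = open("/users/jane/prog/foo.c", O_RDONLY);

    /* Stateful: the kernel (or, for a remote FS, the server) must
     * remember that this descriptor now sits at byte 5003. */
    lseek(fd, 5003, SEEK_SET);
    read(fd, buf, sizeof buf);

    /* Stateless: the offset travels with the request, so it can be
     * retried after a crash or timeout with no shared state. */
    pread(fd, buf, sizeof buf, 5003);

    close(fd);
}
```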
All of the operations are idempotent, as I mentioned, so you can perform a request multiple times and it gives you the same effect. For example: the server crashes between a disk I/O and the message send; the client just resends, the server does the operation again, and that's fine. You read or write file blocks; you just reread or rewrite, and there are no other side effects. The interesting one is remove. If you ask to remove a file from a directory, NFS may do the operation twice if an acknowledgment got lost for some reason. The second time, there's just an advisory error returned from the server saying, well, that file wasn't really there. So these are the kinds of adaptations to the protocol that keep it stateless and idempotent, okay?

The failure model for NFS is an interesting one as well. It's transparent to the client system, in general. The original idea was that when a server fails, the client just freezes until the server comes back up, and then it just works, okay? That was called a hard mount. The problem is that servers would go down while they had all these processes reading and writing files on an NFS partition, and those processes would all get stuck in the device driver. If you did a ps aux to see what was going on, you'd see all these processes blocked with a little D, meaning they were hard-blocked in the NFS driver, waiting for the server to come back up. What's worse, that was an unkillable state, so you couldn't even kill them off; they were just really jammed up. So that's transparent, but you might argue about whether it's a good thing, okay? There was actually a different type of NFS mount, which is what pretty much everybody uses today, called a soft mount; if you read the man pages on the NFS clients or do some Googling, you'll see soft mounts. The idea in a soft mount is that when the server goes down, you just get an error back, and the read or write operation you were trying to do fails. Now, of course, that failure is kind of weird, because the client wasn't expecting a failure caused by a server crash (you're using the same interface you would with a local file system), but at least nothing is locked up in a way that can't be killed off.

Okay, so here's a picture of the architecture, as I mentioned. On the client side we have the system call interface, which takes you through VFS. VFS has a whole bunch of different possible file systems plugged in. How do you know which one to go to? Well, depending on what you mounted (we showed you mount earlier), the part of the file system you happen to be in tells you which of these branches, which actual file system, you're gonna use. Okay, and if it happens to be a local one, you'll use the local file system. If it happens to be a remote-mounted NFS file system, you'll come off of VFS into the NFS client software, which takes you down into the RPC/XDR layer, across the network, and up into the NFS server layer, which then uses VFS on the server to access its local file system. And then the results come back the other direction. Okay, questions? So notice that on the remote side, with NFS at least, you're actually just using an ordinary file system on the other side.
A positive thing about that is that if the server is disconnected from clients, you can go through and check the consistency of the file system and so on with all the normal tools, because to the server it's just a local file system. Then, once it's operating as an NFS server, which you get by starting up the NFS daemons, remote clients can access that file system through the server. Okay, so that's pretty cool, right? Works pretty well.

But let's talk a little bit about consistency of the caches. The NFS protocol is weakly consistent by nature. The client polls the server periodically to check for changes: if the data hasn't been checked in the last 3 to 30 seconds (the interval is settable to some extent), the client polls and asks the server, what's the state of this particular block? When a file is changed on one client, the server is notified, but that isn't reflected back to other clients that happen to be caching it. It's up to them to poll and pull the changes. Okay, so in the scenario we had earlier, where the second client writes and gets an acknowledgment back, we can actually be in a situation where these two clients are, at least over the short term, inconsistent with each other. But because of the way this polling works, eventually the first client will get the new data. That's why we call it a weakly consistent protocol: the client converges to the right contents of the cache. So, for instance: is F1 still okay? No, here's the new value. And at that point the client is good to go with the latest data.

There have been various changes over the years that have made it less likely you'll notice this inconsistency. Clearly you don't wanna poll so frequently that you use up a bunch of network bandwidth. And in fact, even regular, not-too-frequent polling puts a hard limit on the number of clients that can be connected to a server, because every poll that comes in from a client uses up bandwidth on the server. So only a limited number of NFS clients can be connected to a given server. And if multiple clients write, there are these windows where things are a little bit inconsistent. It is interesting: when I first started using NFS many years ago, I noticed that I would edit on one machine and compile on another, and occasionally I'd save out some changes to a file and be so quick at switching to a window on the other machine to compile that I'd get these really weird phantom errors. They happened because part of the .c file I'd just saved was intermixed with old versions of it, due to the NFS consistency.

Now, this question about why Google can't handle hundreds of users simultaneously is not quite the same issue, because there it's pushing of data rather than polling that goes on. In that case, if you change too many things and there are too many clients, you use up bandwidth going the other direction. The problem with NFS is that even if nobody's changing anything, you're polling all the time, and that uses up bandwidth even while you're idle. In the Google case, at least, you only use bandwidth when there are actual changes going on, okay?
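To make that polling concrete, here's the flavor of the freshness check an NFS-style client might do before using cached data. This is illustrative, not the actual client code: the structure, the rpc_getattr_mtime stub, and the ac_timeout knob (the 3-to-30-second interval) are all made up for the sketch.

```c
/* Flavor of an NFS-style attribute-cache freshness check (illustrative,
 * not real client code).  Cached data is trusted as long as it was
 * re-validated within the last ac_timeout seconds; otherwise we ask the
 * server (a GETATTR-like RPC, stubbed out here) whether it changed. */
#include <stdbool.h>
#include <time.h>

struct cached_block {
    time_t last_checked;    /* when we last validated with the server */
    time_t server_mtime;    /* file modification time as of that check */
    /* ... cached bytes ... */
};

extern time_t rpc_getattr_mtime(int fileid);   /* hypothetical RPC stub */

bool cache_is_usable(struct cached_block *b, int fileid, time_t ac_timeout)
{
    time_t now = time(NULL);

    if (now - b->last_checked < ac_timeout)
        return true;               /* fresh enough: no network traffic */

    /* Stale: poll the server.  This is the traffic that happens even
     * when nothing is changing, and it's what limits scaling. */
    time_t mtime = rpc_getattr_mtime(fileid);
    b->last_checked = now;

    if (mtime == b->server_mtime)
        return true;               /* unchanged: keep the cached copy */

    b->server_mtime = mtime;
    return false;                  /* changed: refetch from the server */
}
```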
Now, let's explore this weak consistency for a bit, because what sort of cache coherence might you expect from a system if you didn't know it was weakly consistent? Suppose we have three clients, and the file contents start out as A. Time runs left to right here. Client one starts reading and gets A, and client two starts reading and gets A for part of the time. But then if client one writes B, at some point client two might start seeing B, so there might be some intermixing of B and A. And then client two might write C, and you can get situations where, transiently at least, you're seeing parts of each version, okay?

So what would you actually want? Well, one thing you might want is the same behavior you'd get on a local file system. If you wanted that (so now we have three processes instead of three clients), you might say: if a read finishes before a write starts, you always get the old copy; if a read starts after the write finishes, you always get the new copy; and otherwise, you get either copy. It turns out this NFS polling protocol doesn't quite give you that semantic. It gives you this somewhat less clean intermixing, okay? All right, now I'm seeing some good combinations in the chat, thinking through different options between polling and pushing. I'm gonna give you the pushing option in a second, and we can take questions after that. For NFS, rather than that cleaner view we might expect from a local file system, we really have this other rule: if a read starts more than 30 seconds (or pick your polling interval) after a write, you get the new copy; otherwise you could get a partial update. That's more bandwidth-efficient than trying to make sure every update is propagated to every client all the time, but it does have that slightly weird semantic, okay? So the pros of NFS: it's simple, relatively speaking, and it's highly portable; they were one of the first to have RPC with a serialization protocol, XDR. The cons: it's sometimes inconsistent in ways you can see, and it doesn't scale very well to a large number of clients, because even in the idle case everybody's polling. Okay?

So let me tell you about another file system in this space. This one came later than NFS, but not too much later. I remember working with the Andrew File System, AFS, in the late 80s; it actually became the DFS system, IBM bought it at one point, and it was a commercial product. It had a callback mechanism instead of polling. The idea is that this is no longer stateless, by the way (we're giving up statelessness), but the server keeps track of every machine that has a copy of a file, and whenever there's a change, the server tells everybody with an old copy to invalidate it. As a result, there's no polling bandwidth, okay? There's just invalidation bandwidth. Notice the decision that was made here: not to push the changes out to everybody who's using the file, but rather to invalidate. And there's another interesting semantic choice here, which AFS made, which is basically what I call write-through on close. Think about that for a second. AFS was really designed to work in a much more global environment than NFS.
In fact, you could mount file systems that were served in other parts of the country. You could actually mount them and use them locally, and the performance was pretty good. The reason for that is this write-through-on-close consistency, which means that when I open a file and start modifying it, none of my changes are propagated to anybody until I actually close the file. Even though I'm doing writes, nothing goes out until I close. At that point, my consistent version becomes available for viewing by everybody else who's sharing the file, and that's also the point at which the notification goes out that there's a new version, okay?

Now, to make this work, there are two things to worry about. One is that if I have a file open and somebody else changes it, I don't want it pulled out from under me. So what happens is that when I open, I see the version of the file from the moment I opened it, no matter what anybody else is doing, okay? I open the file, they can be changing it like crazy, but I will continue to see a snapshot of the file from the point I opened it, until I close it and reopen it, okay? The upside is a very consistent view: I always have a consistent snapshot of the file as of the moment I opened it, and when I write, everybody eventually sees a consistent view of the written product. I know there's somebody worrying about race conditions in the chat; we'll get to that in a second. But if you think about this relative to NFS, AFS gives you a much better set of semantics, because you never see an inconsistent set of bytes in a file; it's always a fully consistent set of bytes, okay? That's an extremely positive thing. And when we notify others that the file has changed, they either keep working with their open snapshot, or, if they've closed it by now, they get notified to throw their copy out and fetch the new one, and then they see a completely consistent new version of what I wrote, okay? Now, one caveat: if you have lots of people writing, they may not actually see each other's writes. So out of band, you need a locking scheme or a notification scheme to say, hey, I'm working on this file right now, why don't you wait for me? Okay, so that's one thing to worry about.

The second interesting thing about the Andrew File System is that rather than caching in memory, which is what NFS does in the buffer cache, AFS caches on disk. The local disk becomes a cache for the file system. Whole files are stored: whenever I open a file, the whole file can be brought over from the server and put on my local disk, and now I can access it as fast as if it were local, because it really is local, okay? So the potential here is for much better caching, because I'm using the local disk as the cache, and you can have many, many more clients talking to a given server, because the server isn't servicing every read and write. What it's doing is helping with consistency management, okay?

All right, there are many questions here. I think the way to handle them is to think through what we've got, right? When you open a file, you get a snapshot of the file at the time you opened it, and you hold onto that snapshot until you close, okay?
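Here's a cartoon of that open/close path under these session semantics. All the function names are hypothetical, just to show where the whole-file fetch, the callback registration, and the write-on-close happen:

```c
/* Cartoon of AFS-style session semantics.  Every function name here is
 * made up for illustration; this is not the real AFS client code. */
#include <stdbool.h>

extern bool have_valid_local_copy(const char *path); /* callback unbroken? */
extern void fetch_whole_file(const char *path);      /* server -> local disk */
extern void register_callback(const char *path);     /* server promises to
                                                        tell us if it changes */
extern bool locally_dirty(const char *path);
extern void ship_file_to_server(const char *path);   /* triggers invalidation
                                                        callbacks to others */

void afs_open(const char *path)
{
    /* On open: if our on-disk cached copy is still covered by a server
     * callback, use it with zero network traffic.  Otherwise fetch the
     * whole file.  Either way, we now work on a private snapshot. */
    if (!have_valid_local_copy(path)) {
        fetch_whole_file(path);
        register_callback(path);
    }
    /* ... reads and writes now go only to the local disk copy ... */
}

void afs_close(const char *path)
{
    /* On close: only now do our writes become visible.  The server
     * stores the new version and breaks callbacks, so everyone else
     * learns that their cached copy is stale. */
    if (locally_dirty(path))
        ship_file_to_server(path);
}
```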
If you wanna do the equivalent of checking whether anything has changed, you can close the file and reopen it, and you'll find out, okay? And if things never change, or they don't change for a long time because they're mostly read-only, then as you use files they migrate to your local file system, and now you get really fast action: you open, read, do a bunch of stuff, close, and all of that is done purely locally, because the file server is responsible for making sure your locally cached copies go away if they're no longer consistent, okay?

Now, a good question: what if you only wanna read a small part of a file, and it's a 20-gigabyte file, okay? I think that's what's being asked, and it's a really good one. In the original version of AFS, you actually had to cache the whole file on the local file system. Later versions started caching in 64K chunks or so, and those modifications let you hold just part of a file if all you want to do is read a little bit of it. That took care of the performance problem you're worried about, okay?

Now, to recap a couple of points: data is cached on the local disk as well as in memory, and on a write followed by a close, you send a copy to the server, which tells all the clients with copies to invalidate their local versions; they'll need to fetch the new version from the server at that point. Now, if the server crashes, unlike with NFS, we can't even conceive of a client-transparent recovery, because the server is supposed to hold all this callback state keeping track of who's got copies of what. So when the server crashes and comes back up, it actually has to request information from all of the connected clients about what copies of what files they've got. So rebooting the server is more expensive, okay?

The pros, relative to NFS: much less server load. The disk is a cache, so technically the cache is much larger. The callbacks mean the server doesn't have to be involved at all if a file is read-only, okay? So if you have mostly or totally read-only partitions, you can have a small server handle a huge set of clients, because really all it's doing is helping the clients get copies of the data into their local caches, okay? Now, for both AFS and NFS (although AFS is less problematic here), the central server becomes a bottleneck: all the writes ultimately go through the server. There's also a question of availability, because the server becomes a single point of failure, and the server has to be more powerful than the clients, so it's typically a higher cost than a simple workstation, okay? Now, a good question is brought up here: couldn't the server store the callback state on disk? The answer is yes; in fact, as I recall, it keeps a record of what it last knew the callback state to be. But who knows what happened while it was crashed, so at minimum it has to validate the current state of the client caches. All right, good.

One thing that's fun about the Andrew File System, which I didn't write down, is that it had the notion of global names, okay?
If you were to look at a client machine, you would see that there was an /afs partition, and under it you could mount pretty much anything from anywhere in the world under a location-independent name. As a result, in principle, every file in AFS was globally available if you had the right permissions. This is a little different from NFS, where things are named by that tuple I mentioned earlier: a particular machine plus a local file name. Here, in principle at least, there were global file names, or at least things were starting to go in that direction. We would be at MIT, and we would mount files that were down at CMU and ones that were over at Berkeley and so on. We could mount files that were on servers across the country, and it actually worked pretty well, because most of the performance was handled by the local disk. So this is an example of something where you really were starting to mount things very distantly, okay? Of course, you're all used to that now with the cloud, but this was quite the innovation when it first came out, all right?

But let's move even further away and ask: what's this obsession we have with files? What about sharing data instead of files? One thing that's become very popular over the last decade (actually, I'd say the last 15 years) is the notion of a key-value store, where the world is like a big hash table that lets us look up keys and get values back, okay? Back in the early 2000s, when I started working on peer-to-peer storage systems, key-value stores were kind of in their early days, so really this idea has been around for more than 20 years; it's just become very prevalent over the last decade. And it's native in pretty much any programming language: you've got associative arrays in Perl, dictionaries in Python, maps in Go; pick your language, there's a hash table. The key-value store we're gonna talk about is like a hash table that spans the globe, or spans the network. Everything you can imagine using a hash table for in these languages, you can use a key-value store for, globally, okay?

And in terms of sharing information, what about a collaborative key-value store rather than message passing or file sharing? Rather than mounting a file system on two clients and sharing through files, maybe we have a key-value store, we just happen to know what keys we're using, and we share that way. That seems like another option, and maybe it gives us more choices for how to make things consistent and how to make them durable. So we might ask ourselves: could we make it scalable? Can we handle billions or trillions of keys? Can we make it reliable? Even though things are failing and the network is partitioning and so on, can we always get at our data? Now, I'll tell you up front that we're not gonna violate the CAP theorem, but perhaps we can get to where the CAP theorem doesn't bother us quite as much: we get an old value of the key that's pretty close to recent, maybe not the most recent one, and maybe that's okay.

Okay, so the basic idea behind a key-value store is a very simple interface: there's put and get. Put takes a key and a value, and it inserts that value at that key in the key-value store; get takes a key and returns the value stored there.
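Pinned down as an interface, the whole thing is roughly this. It's a sketch; real systems vary the types and add things like versioning, but this is the core:

```c
/* The entire key-value store interface, more or less (a sketch; real
 * systems vary the types and add versioning, but this is the core). */
#include <stddef.h>

typedef struct {
    const void *data;   /* arbitrary bytes: a profile, a song, a video */
    size_t      len;
} value_t;

int     put(const char *key, value_t value);  /* store value under key     */
value_t get(const char *key);                 /* fetch latest stored value */
```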
Now, whatever "inserting into the key-value store" means: the put goes off into cyberspace somehow, and get takes the key and returns the value from cyberspace somehow. So the interface is almost boringly simple. The question is: can we do something interesting with this that is scalable, fault tolerant, reliable, durable (put all of your favorite "-ables" in there)? Can we make that happen out of this simple interface? And the answer is yes, and it actually becomes much simpler to do precisely because the interface is so simple, okay?

So why a key-value store? I've already said this, but it's easy to scale to huge volumes of data: petabytes, exabytes, pick your big number. Uniform items can be distributed easily and roughly evenly across many machines. So with 10 machines versus a hundred versus a thousand, I can scale up the number of key-value pairs and the number of clients I can handle just by adding more machines to the system, okay? That's kind of appealing. If you think about a big NFS file server or a big AFS file server or whatever your favorite thing is, the way you typically scale something like that up is you go buy a huge piece of hardware, and that really big thing is fast because it's got a lot of really fast processors in a single box, and it's really expensive. The way you scale up a key-value store, on the other hand, is you just add more and more machines, and that incremental scalability gives you more power. So that's gonna be another appeal of this idea, okay?

The properties are pretty simple from a consistency standpoint, because all we want (well, we can talk about what types of consistency we might want) is one simple thing: to know the latest value associated with a key. There are many cases these days where this is a simpler but more scalable version of a database, and you can think of it as a building block for a more capable database if you want better semantics than "what's the latest value." But oftentimes a key associated with a value is enough, and you can call that a database, okay? There are many good examples. Amazon: the key might be a customer ID, the value might be a profile. Facebook or Twitter: the key might be a user ID, the value might be the user profile. iCloud or iTunes: the key might be a movie or song name, the value might be the movie or song itself. So there are many examples of keys and values that you use every day without thinking about it. And by the way, all of the big cloud companies have really good key-value stores that scale really well and that people use all the time.

Now, a good question in the chat: are keys kind of the same as global file names in AFS? Yes, roughly speaking, okay? Keys are global names that you can get at anywhere in the system, and if you had a key-value system that spanned the globe and everybody was using it, the keys would be a global naming scheme. Now, the thing that's a little tricky is that if you just use a key that's, say, a person's name, that type of key is very clustered, right? There are many people with the first name John, so part of the key space would be really overused and lots of places would be underused. So what we really do with keys, when we wanna make this scalable, is take the names that humans use and hash them into a uniform set of bits, like 256 bits.
That hash is the global name these systems typically use; it's a hash over the human-readable stuff, okay? But if you want the simple version of the question about whether keys are the same as global file names, the simple answer is yes.

Now, in real life, like I said, Amazon has DynamoDB, which is the key-value store used to power the shopping cart on amazon.com. There's the Simple Storage Service, S3, which is key-value storage used for some of the big cloud storage services people use. Google has BigTable, and there are HBase and Hypertable: several distributed, scalable data storage systems that ultimately come out as key-value stores. Cassandra was developed at Facebook, and it's a key-value store used in a lot of cloud processing. There's memcached, an in-memory key-value store, and Redis, an example of something like memcached that can span multiple sites. eDonkey and eMule: there are lots of peer-to-peer storage systems. And before any of these things, we did research in peer-to-peer back in the 2000s. Chord, which I'll tell you a little bit about toward the end of the lecture, was an MIT/Berkeley version, and Tapestry was one that we worked on, along with a number of others. These are all key-value systems that work particularly well across the globe. The reason I bring this up is that I just want you to know that some of the ideas we're gonna talk about in the last 20 minutes are used quite widely today.

Now, a question here: do files in a key-value store have a smaller size requirement than AFS? I'm not entirely sure what the question is, but there's nothing that sets a particular size for a value. The key is the thing that might be limited, being, say, a 256-bit hash of something. The value can often be anything from a small number of bytes to a gigabyte video or whatever. Usually the limit that matters is the maximum size, not the minimum. I don't know if that answered your question or not. Okay.

So let's look at the basic idea behind key-value stores. These are often called distributed hash tables, too; it's like a hash table, but distributed, right? The main idea is that we're gonna simplify the storage interface. We're gonna get rid of all that open, close, read, write complication that we taught you at the beginning of the term. Forget all that, except for, by the way, the midterm on Thursday. What we're gonna do instead is put and get, and we're gonna partition the set of keys and values across many machines. This yellow key-value table here is the abstract space of all keys and values, and what we're gonna do is partition it across the set of available machines out there, so that each machine handles a range of the space. Okay, now, I haven't told you how to do that, but the idea in principle is that if you think of the space of all possible keys and values (actually, the space of all possible keys is really what we're talking about), you could easily say: well, here are all the machines that are gonna participate; let's just distribute the keys over them and make it work somehow, okay? So that's gonna be the simple idea.
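Here's a minimal sketch of the simplest possible answer to "where does a key live": hash the human-readable name to a big number and take it mod the number of machines. The FNV-1a hash below just stands in for whatever real hash (SHA-256, say) a production system would use. And notice the weakness it sets up: if the number of machines changes, almost every key maps somewhere new.

```c
/* Minimal sketch of the simplest "where does this key live?" scheme:
 * hash the human-readable name to a number, then mod by the number of
 * nodes.  FNV-1a stands in here for a real cryptographic hash like
 * SHA-256.  Caveat (and it matters for what comes next): if num_nodes
 * changes, nearly every key suddenly maps to a different node. */
#include <stdint.h>
#include <stdio.h>

static uint64_t fnv1a(const char *s)
{
    uint64_t h = 14695981039346656037ULL;     /* FNV offset basis */
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 1099511628211ULL;                /* FNV prime */
    }
    return h;
}

static int node_for_key(const char *key, int num_nodes)
{
    return (int)(fnv1a(key) % (uint64_t)num_nodes);
}

int main(void)
{
    const char *keys[] = { "john", "jane", "foo.c", "movie:inception" };
    for (int i = 0; i < 4; i++)
        printf("%-16s -> node %d of 100\n",
               keys[i], node_for_key(keys[i], 100));
    return 0;
}
```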
So how do we make that work? Well, there are some challenges, right? One of them is that whatever scheme we come up with for this mapping from the abstract table to physical locations, we wanna scale to thousands or millions of machines, so we need to make that index work somehow, right? That's gonna be a challenge. Furthermore, as I told you a moment ago, we want this idea of what's often called incremental scalability: the ability to add more machines as we need more power. Whatever scheme we come up with ought to scale in a way that grows automatically, or at least easily.

The other thing is that there needs to be some fault tolerance here. Machines will fail, and when they do, we don't wanna lose any data. One thing we haven't talked much about with failure (because we haven't had a lot of time this term for this topic) is that if a machine fails once a year on average, and you put 365 of them together, you're now gonna have a machine failure on average every day, because the time between failures scales inversely with the number of machines. Typical warehouses that Google and Facebook and so on have, with thousands or tens of thousands of machines, have many failures per day, where machines hit some failure mode or their disks just plain die or what have you. So whatever scheme you come up with needs to handle failure very well, because at this scale, failure is not an uncommon thing.

And then, of course, consistency is gonna be important; remember the CAP theorem. Consistency says that when many clients are writing, the readers get to see those values in some consistent way, or at least an eventually consistent way, where we all agree on what the latest version is, eventually. Okay, and that consistency needs to work even though failures are happening. The CAP theorem says we maybe can't stay consistent, available, and partition tolerant all the time, but it would be nice that when the thing that failed comes back, we eventually converge to something. So that's consistency, all right?

And heterogeneity is one that many of you probably wouldn't think about if I hadn't put it on the slide. The issue is that as I add machines over time, they're from different purchases, different lots, different years, different models. So there's this huge heterogeneous mess of machines and network bandwidth and latency and all of those things, and somehow we'd like the system to mostly work well despite that wide-ranging set of components. Okay, so this is a large set of requirements, and nothing's gonna be perfect, but we want some way of building our distributed hash table so that we handle at least some of these things reasonably well, okay? That's gonna be our goal.

So some questions: for instance, if we do put(key, value), where do we store it? That's gonna be complicated, because we gotta start by knowing what machines are available, and if we keep adding machines and machines keep failing, the "where" might be more complicated than you think. And then, of course, when we go to get, there's the question of where we get it from.
Especially if machines are failing, maybe that key has moved around a bit since I put it in there originally. So whatever scheme we come up with has gotta handle "where" very well, right? And we gotta do the above while still keeping our scalability and fault tolerance and consistency and all those other things we talked about earlier. So how do we solve where? Well, one way is we can hash the key space to locations, all right? If we knew these hundred nodes are definitely gonna be used, we could build a partitioning that maps from the key itself to one of those hundred places, and that might do the trick for us, as long as everybody knows the hash function. But what if you don't know all the nodes that are participating, or maybe they come and go? Or what's worse, as I mentioned earlier, if some keys are really popular, then even in a partitioning that splits the key space equally, some machines will fill up while other ones sit empty, okay? So with "where" we have to be careful about keeping load balanced, in addition to all these other things. And then lookup: if we build this thing by having a huge table on one machine that knows where everything is, that's gonna be a bottleneck and a single point of failure. So I hope you guys can see that, on the face of it, we are certainly not going to take this thing I've shown you here as a big table, put it on some huge database server, and use that to look things up, okay? We call that the directory approach, and I'm gonna show you abstractly what it means in a moment, but that would clearly not be scalable or fault tolerant, okay? Now, before I go a little further, I wanna pause for a second and see if we have any questions. Okay, so let's look at a recursive directory architecture for put. Let's assume for a moment that this directory is a thing, it's on a machine somewhere, and we'll fix that in a few slides. The way this works is, if we wanna put a new value for key 14, we go to the directory. The directory would say, oh, I'm gonna assign key 14 to node three. It would go to node three and do the put, and we might get an acknowledgment back. What happens here is the put gets redirected through the directory to the storage server. We're gonna call this recursive because the put goes to the directory, which goes to the file server; it's recursively going from one point to another. The alternative is what we might call iterative, and I'll show you that in a moment. But what does the recursive get look like? Well, the get goes to the master directory, which knows where to go; it gets the value, which comes back, and the directory forwards it on to me. So again, the get goes to the directory, which goes to the node, and the node goes back to the directory, which goes back to the client. Another way to think of the recursive structure here is that it's like routing; we're kind of routing through the directory. The alternative is often called iterative, and in the iterative case, the client says, I'd like to put key 14, and the directory says, oh, I'm gonna put that on node three, use node three. And then the client says, oh, okay, node three, please do this put for me. So notice that this is iterative: the first thing I do is find the location, and then I go and do the storage.
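As a rough sketch of those two styles, here's some toy Python, with a hypothetical Directory object that just remembers assignments (round-robin, purely for illustration): recursive means the directory forwards the request itself, and iterative means the client asks "where?" and then talks to the storage node directly.

```python
class Node:
    """Hypothetical storage node: just a dictionary in this sketch."""
    def __init__(self):
        self.store = {}

class Directory:
    """Toy central directory: remembers which node was assigned each key."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.assignment = {}

    def locate(self, key):
        # Assign unseen keys to a node (round-robin here; a real directory
        # would consider load, free space, and so on).
        if key not in self.assignment:
            self.assignment[key] = self.nodes[len(self.assignment) % len(self.nodes)]
        return self.assignment[key]

    def recursive_put(self, key, value):
        # Recursive style: the directory itself forwards the request on.
        self.locate(key).store[key] = value

def iterative_put(directory, key, value):
    node = directory.locate(key)   # step 1: ask the directory where the key goes
    node.store[key] = value        # step 2: the client talks to that node itself

directory = Directory([Node() for _ in range(3)])
directory.recursive_put("k14", "v14")    # routed through the directory
iterative_put(directory, "k14", "v14")   # client does the second hop itself
```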
So I'm iteratively working through a set of locations in the network, okay? And the get works iteratively too: first I get back where the location is, and then I can go to that server and talk to it, okay? So just putting them both on a slide here, we have iterative versus recursive; or recursive versus iterative, I should really change that title. The recursive case is potentially faster, because we're routing through the directory server and back, and it's a lot easier for consistency, because we can make sure we know everybody who's trying to change a given location at any time. On the iterative side, everybody's kind of doing their own thing, talking to the storage servers independently of one another. The downside of recursive is that the directory is definitely a performance bottleneck. The downside of iterative is that it's much harder to enforce consistency. So they have pros and cons. So is it easy to make the system bigger? Well, we can add more nodes, and now maybe we can handle more requests, since we can serve requests from all the nodes that have a value in parallel. The master could replicate popular items onto more nodes, okay? Except the master itself is gonna be really hard to make scalable. We could try making many copies of it, but then we gotta keep them all consistent with each other. We could try to partition it, so different keys are served by different directories, but how do we do that? So the version I've shown you so far, where the directory is a single thing, seems like it's definitely gonna be an issue from a performance standpoint, and as pointed out in the chat here, it's definitely a single point of failure, okay? And it's really a single point of failure for both the recursive and the iterative versions, because the iterative one has to start by asking the directory where things are, okay? Remember, we're allowing things to move around as failures happen. So let's talk about fault tolerance in a couple of ways. One, we could replicate the key on many nodes, okay? So the put copies the value to several places. Now we never lose data if a node fails, because we have another copy. If the master directory fails, then we lose availability, but in principle we could scan through all of our nodes and reconstruct the directory from the actual data. So from a fault tolerance standpoint, this particular scheme doesn't lose any data, okay? But we still have the directory as a single point of failure for availability, at minimum, okay? But let's also talk about consistency. We wanna make sure the value is replicated correctly. So how do we know the value's been replicated on every node, and what happens if a node fails, and what happens if a node is slow? If we wanna replicate 12 times and we wanna make sure there are 12 copies, then all of a sudden the put becomes slow, because it has to wait for all 12 copies to come back with acks before it can go forward, okay? So in general, if you have a lot of replicas, slow puts are gonna be par for the course, but potentially fast gets, because I can get from any of the copies, okay? So let's look at a consistency issue here. If you look down here, K14 is stored on node one and node three, and notice that V14 is our current version. But now some new client tries to put V14 prime, and another client tries to put V14 double prime.
And now, depending on network ordering, we could get a situation where the writes between the directory and node one and node three get reordered, such that node one thinks V14 prime is the most recent and node three thinks V14 double prime is the most recent. Really, if these two puts were simultaneous, there's not necessarily any right answer as to which is the most recent, but we wanna make sure the system has picked one, okay? So the problem here is that get is kind of undefined, okay? Now, there's a large variety of consistency models out there for what to do when you have simultaneous writes going on. There's linearizability, where reads and writes go to replicas but appear as if there were a single underlying replica; that's kind of like transactions, there's an ordering. There's this eventual consistency, which I've been talking about, where replicas may temporarily differ, but some anti-entropy process eventually makes sure that everybody agrees on the most recent copy. And there are many others, okay? That's really material for a different class; I often talk about this when I teach 252, the graduate architecture class, though I haven't done that in a few years. There you start talking about causal consistency and sequential consistency and strong consistency and so on, which is really about what happens when multiple people are writing multiple different key values at the same time, and how you order all of that, okay? The simple one I wanna talk about today is called quorum consensus, and we're gonna improve put and get performance in the presence of replication by doing the following. We're gonna say there are N replicas, okay? A put is gonna wait for acknowledgments from at least W of them before it goes forward, and we're gonna assume that writes are timestamped in some way to make this work; the timestamp basically lets a replica that sees two versions coming in replace the older one with the newer one. Then a get is gonna wait for at least R replicas to say, here's a value. And as long as W plus R is greater than N, what we know is that if the put happened before the get, then the get will always see the most recent value, because W plus R greater than N means the set of replicas you read from always overlaps the set that was written in at least one replica, and that replica has the most recent copy, okay? So there's at least one node that always has the update. This quorum consensus is something that's used pretty commonly now, in Cassandra and a lot of these other systems, by Facebook and other cloud service providers. It's typically up to the client to pick W, R, and N, but a typical configuration is N equals three: you write to two of them and you read from two of them, and as a result you'll always get the most recent copy, okay? Now I'll let this simmer in your brain a little bit while you're thinking it through, but here's the interesting thing: you could have R equal to three, which really says I go and ask for three copies and wait till I get all three of them back before I decide. Or I could ask all three of them, and when two of them come back, I go forward; that's really what's going on.
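Here's a little sketch of that quorum scheme, again just toy code under some assumptions: the replicas are in-memory objects, wall-clock timestamps stand in for whatever real timestamping scheme the system uses, and the sequential loop stands in for requests a real system would send in parallel.

```python
import time

class Replica:
    """Hypothetical replica: holds a (timestamp, value) pair per key."""
    def __init__(self):
        self.store = {}

def quorum_put(replicas, key, value, w):
    # Tag the write with a timestamp so everybody can order competing
    # versions, then wait for acknowledgments from at least W replicas.
    ts = time.time()
    acks = 0
    for rep in replicas:                    # sent in parallel in a real system
        old = rep.store.get(key)
        if old is None or old[0] < ts:      # newer timestamp wins
            rep.store[key] = (ts, value)
        acks += 1
        if acks >= w:                       # W acks are enough to return;
            break                           # stragglers can catch up later

def quorum_get(replicas, key, r_count):
    # Collect answers from R replicas and return the value with the
    # newest timestamp among them.
    answers = [rep.store[key] for rep in replicas if key in rep.store]
    answers = answers[:r_count]
    return max(answers)[1] if answers else None

# N = 3, W = 2, R = 2, so W + R > N: reads always overlap the last write.
replicas = [Replica() for _ in range(3)]
quorum_put(replicas, "k14", "v14", w=2)
print(quorum_get(replicas, "k14", r_count=2))   # -> v14
```

With N equals three, W equals two, and R equals two, the two replicas you read from must overlap the two that acknowledged the write, which is the whole trick.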
So if I say R equal to two, I'm really potentially asking all three and taking the first two that come back, and that lets me tolerate a slow server. And W plus R greater than N actually lets me tolerate failures in that group of N. So quorum consensus has not just consistency positives to it; it also handles failures that happen between the writes and the reads, and it handles slow machines as well. Quorum consensus is a remarkably simple idea with a lot of positive benefits, okay? And the way you know which copy is the updated one is the timestamp. That's what I've got in red on this very slide: when you write, you typically put a timestamp into all the copies that go out there, and the clients and servers sort by timestamp, okay? So the responses that come back potentially aren't all the same value, but you'll get R of them, and then you pick the most recent of those R. It's a fairly simple scheme, but it's fairly powerful, okay? Now, you might use W plus R greater than N plus one for any number of reasons, including making really sure that you've written three copies, et cetera; there are fault tolerance and performance reasons for possibly having W plus R greater than N plus one. So here's an example. Here's the initial put, where we try to write to all three of the copies, but we only get acks back from two, which is okay because W equals two, all right? Then later, when we go to read, we ask the replicas and hear back from two of them. Even though one hasn't responded, we always get the most recent value back, because of the overlap between the two we've written and the two we're reading: we know there's always at least one overlapping replica that will give us our value. And again, "most recent" is based on that timestamp, okay? That's why I've got this in red here; see the red, everybody who's wondering about most recent. Okay, all right. Now, if you guys will hold on for just a moment, I'd like to get a couple more things done here, since this is our last lecture. So storage scalability: the way we get scalability is we use more nodes. We can serve requests in parallel from all the nodes that have the values stored, so that's potentially good. The master can replicate a popular value onto more nodes, so we can get more performance with more replicas in a scheme like this. To give us master directory scalability, we could replicate it, or we could partition it so different keys are stored in different master directories; but how do we partition it? If I were you guys, and this were the first time I'd heard this lecture, I'd probably think that Professor Kubiatowicz hasn't really told me how to make the master work. Okay, because I would have this uncomfortable feeling that, yeah, this sounds great, but that master seems like a problem. Okay, so let's see if we can do something. On load balancing: the directory keeps track of the storage available at each node and preferentially inserts new values on nodes with more storage available, okay? So I can see that might work. And when you add a new node, what you'd like to do is rebalance everything somehow, so that new node really starts taking its fraction of the load, okay?
And so that sounds like there's some rebalancing process that I haven't told you about here. And then when a node fails: suppose N was three and we were banking on having three copies of things. If one of those three copies fails, we would like to make sure that some other node gets a copy, so we keep our basic redundancy of three, okay? I haven't told you how to do that either. So, as kind of our last topic before we really cut out here: how do we scale up the directory? The challenge is that the directory has a number of entries equal to the number of key-value tuples in the system, which could be billions or trillions; pick your favorite large number. So that directory thing is big, and really we wanna distribute it in the same way that we're distributing the actual data. The solution here is something called consistent hashing, which I think you may have heard of in other classes, but it's gonna give us a mechanism to divide the key-value pairs amongst a large set of machines, and to do so in a fully distributed way, without ever going through a single directory machine, okay? And bear with me; the idea is simple, but it takes a moment to catch. The idea is we're gonna give each node a unique ID in a ring of all the possible values from zero to two to the M minus one. Typically M is gonna be big; it's gonna be the 256 bits of our hash. We're gonna call that the ring, and then we're gonna partition that space of possible keys across the N machines, with each key-value pair stored on the node with the smallest ID larger than (or equal to) the key, okay? Rather than trying to catch all that in words, let me show you a picture, okay? So here's an example of the ring, and what I've done here is I've made M six, okay? So this is a six-bit hash space, really not interesting in the grand scheme other than for class. But notice, if M is six, then the set of all possible hash values is from zero to 63, okay? That means the ID space is from zero to 63, and each node is gonna have a unique spot in that space. So this node eight I'm gonna put on its spot on the ring, and likewise node 15 and node 20. How does a node know what its name is? Well, it's gonna take things like its IP address, and maybe the name of who owns it and all that stuff, put it together, and hash it to find its position on the ring, just like the keys are hashes over the names of the data we want, okay? And the way we're gonna handle this is, for any given set of nodes, which are hopefully spread throughout the ring, each node is gonna handle every key from just above the previous node up through its own ID. So in this example, node 15 here is gonna map everything from nine to 15, node eight is gonna map everything from five to eight, et cetera, and each node stores those keys, okay? And if a node goes away, then the next node up is gonna store all of its keys, okay? So this is a very simple scheme for consistently partitioning hash values around the ring, okay? So for instance, key 14 is gonna be stored on node 15, because node 15 is the first node clockwise around the ring from the key I'm looking for, okay? Questions? By the way, this thing I'm talking about with consistent hashing is not gonna be on the exam. We've talked about key-value stores, but I just wanted you guys to see a real implementation here, okay?
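If you want to see that successor rule in a few lines of code, here's a minimal sketch matching the example above, with M equal to six and the node IDs 8, 15, and 20 given directly rather than derived from hashed IP addresses.

```python
import bisect
import hashlib

M = 6  # 6-bit ring, IDs 0..63, to match the example; real systems use 160+ bits

def ring_id(name, m=M):
    """Hash an arbitrary name (node address, key, ...) onto the ring."""
    return int(hashlib.sha256(name.encode()).hexdigest(), 16) % (2 ** m)

class Ring:
    """Minimal consistent-hashing sketch: a key lives on the first node
    clockwise from it, i.e., the smallest node ID >= the key's ID."""
    def __init__(self, node_ids):
        self.ids = sorted(node_ids)

    def successor(self, key_id):
        i = bisect.bisect_left(self.ids, key_id)
        return self.ids[i % len(self.ids)]  # wrap around past the top of the ring

ring = Ring([8, 15, 20])
print(ring.successor(14))                   # -> 15: key 14 lives on node 15
print(ring.successor(33))                   # -> 8: wraps around the ring
print(ring.successor(ring_id("some key")))  # arbitrary keys hash onto the ring
```

The successor call is doing exactly the clockwise walk from the slide; notice that adding or removing a node ID only moves the keys in one segment of the ring.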
Now, we have a mixture of different types of machines spread throughout the ring, okay? The key thing to make this work, and that's no pun intended, is that the nodes be distributed throughout the ring, and the way we get that is by having a good hashing function. And the nice part, to the question that was just asked, is that you don't have to be aware of all of the machines involved. That's the part that's cool about this, which I'll have to continue on Wednesday; you guys may have to come back for it. The Chord algorithm is one which adapts to nodes coming and going, where the only thing any node knows about is a small, local set of the nodes, okay? And in practice, M is really 256 or more, okay? Now, Chord is a distributed lookup service that does this, and an important aspect of the design is that it decouples correctness from efficiency, okay? The correctness part, which goes along with the question that was just asked, is that every node only needs to know about its neighbors on the ring, and that's it. So if you go back here, the only thing that node 15 needs to know about is node eight and node 20, and the rest of the Chord algorithm basically takes care of the efficiency, okay? We're gonna talk about that on Wednesday, since we've gone way past our time, and Chord is not in scope for the midterm, so that's fine. But I just wanted to leave you guys with this interesting idea: we're gonna show you how to build a distributed system, and we'll do that on Wednesday, such that we only need local information, which is about a few neighboring nodes plus a logarithmic number of other nodes spread across the ring. As long as we know only that local information, we can do highly efficient lookup, deal with failures as nodes come and go, and do replication in a way that keeps everything safe. So that's gonna be Chord, and we'll talk more about it on Wednesday. I hope you guys all come, because Chord is one of my favorite simple distributed directory storage systems. So please come; we'll talk about that, and we'll talk about some other things. But now I'm gonna bid adieu to everybody. I hope you have a great evening, and good luck studying for the exam. Please come on Wednesday, because we'll finish talking about Chord then, and we'll cover a few other topics if people show up. And if I don't see you on Wednesday: you've all been great, and I'm gonna miss these little lectures. Have a good evening, and good luck on the exam.