So today I want to do two things: finish the discussion of ZooKeeper, and then talk about CRAQ. The particular things I'm most interested in about ZooKeeper are, first, the design of its API, which allows ZooKeeper to be a general-purpose service that bites off significant tasks that distributed systems need — why is that a good API design? — and second, a more specific topic, mini-transactions, which turn out to be a worthwhile idea to know. So we've got the API and mini-transactions. Just to recall, ZooKeeper is built on a Raft-like replication protocol, so we can think of it as being — and indeed it is — fault tolerant, and it does the right thing with respect to partitions. It has a performance enhancement by which reads can be processed at any replica, and therefore reads can be stale; we have to keep that in mind as we analyze various uses of the ZooKeeper interface. On the other hand, ZooKeeper does guarantee that every replica processes the stream of writes in order, one at a time, with all replicas executing the writes in the same order, so the replicas' states all advance in exactly the same way. And all of the operations — reads and writes — generated by a single client are processed by the system in order: in the order the client issued them, with each successive operation from a given client seeing the same state as, or a later point in the write stream than, the previous operation from that client. OK, so before I dive into what the API looks like and why it's useful, it's worth thinking about what kinds of problems ZooKeeper is aiming to solve, or could be expected to solve. For me, a totally central motivating example of why you would want ZooKeeper is as an implementation of the test-and-set service that VMware FT required in order for either server to take over when the other one failed.
It was a bit of a mystery in the VMware paper: what is this test-and-set service? How is it made? Is it fault tolerant? Does it itself tolerate partitions? ZooKeeper actually gives us the tools to write a fault-tolerant test-and-set service of exactly the kind VMware FT needed — one that is fault tolerant and does do the right thing under partitions. That's a central kind of thing ZooKeeper is doing. There are also a bunch of other ways people use it; ZooKeeper is very successful, and people use it for a lot of stuff. One use is simply to publish configuration information for other servers to read — for example, the IP address of the current master for some set of workers. Another classic use of ZooKeeper is to elect a master: when the old master fails, we need everyone to agree on who the new master is, and to elect only one master even if there are partitions, and you can do that with ZooKeeper primitives. And, for small amounts of state anyway, if whatever master you elect needs to keep some state and keep it up to date — information such as who the primary is for a given chunk of data, as you'd want in GFS — the master can store its state in ZooKeeper, knowing ZooKeeper is not going to lose it. If the master crashes and we elect a new master to replace it, the new master can read the old master's state right out of ZooKeeper and rely on it actually being there. Other things you might imagine: in a MapReduce-like system, workers could register themselves by creating little files in ZooKeeper. And, again with systems like MapReduce, you can imagine the master telling the workers what to do by writing lists of work items into ZooKeeper, with the workers taking those work items out one by one and deleting them as they complete them.
People use ZooKeeper for all these things. Yeah, exactly — so the question is, how do people deploy ZooKeeper? In general, if you're running some big data center and you run all kinds of stuff in it — web servers, storage systems, MapReduce, who knows what — you might fire up one ZooKeeper cluster, because it's general purpose and can be used for lots of things: five or seven ZooKeeper replicas. Then as you deploy various services, you design them to store some of their critical state in your one ZooKeeper cluster. All right, the API of ZooKeeper looks like a file system, at some level. It's got a directory hierarchy: there's a root directory, and then maybe each application has its own subdirectory — application one keeps its files in this directory, app two keeps its files in that directory — and those directories have files and directories underneath them. One reason for this is that ZooKeeper, as I just mentioned, is designed to be shared between many possibly unrelated activities, so we need a naming scheme to keep the information from those activities distinct, so they don't get confused and read each other's data by mistake. And within each application, it turns out that a lot of convenient ways of using ZooKeeper involve creating multiple files; we'll see a couple of examples like this in a few minutes. OK, so it looks like a file system, but that's not very deep: you can't actually use it like a file system in the sense of mounting it and running ls and cat and all those things. It's just that internally it names objects with path names. So here x, y, and z are three different files; when you send an RPC to ZooKeeper saying "please read this data," you name the data you want, maybe /app2/x. It's just a hierarchical naming scheme. These files and directories are called z-nodes.
And it turns out there are three types of z-node you need to know about, which help ZooKeeper solve various problems for us. There are regular z-nodes: if you create one, it's permanent until you delete it. There are ephemeral z-nodes: if a client creates an ephemeral z-node, ZooKeeper will delete that z-node if it believes the client has died. Ephemeral z-nodes are actually tied to client sessions, so clients have to send a little heartbeat into ZooKeeper every once in a while to say "I'm still alive," so that ZooKeeper won't delete their ephemeral files. And the last property a file may have is sequential: when you ask to create a sequential file with a given name, what you actually end up creating is a file with that name but with a number appended to it. ZooKeeper guarantees never to repeat a number, even if multiple clients try to create sequential files at the same time, and always to use increasing numbers for the sequence numbers it appends to file names. We'll see all of these come up in examples. At one level, the RPC interface ZooKeeper exposes is what you might expect for files. There's a create RPC, where you give it a name — really a full path name — some initial data, and some combination of these flags. An interesting semantic of create is that it's exclusive: when I send a create into ZooKeeper asking it to create a file, ZooKeeper responds with a yes or a no. If the file didn't exist and I'm the first client to create it, ZooKeeper says yes and creates the file; if the file already exists, ZooKeeper says no, returning an error. So create is exclusive, and clients know whether they were the one who actually managed to create the file — which matters when multiple clients are trying to create the same file, as we'll see in the locking examples. There's also delete.
One thing I didn't mention is that every z-node has a current version number that advances as it's modified, and delete, along with some other update operations, lets you send in a version number, meaning: only do this operation if the file's current version number is the version I specified. That turns out to be helpful in situations where multiple clients might be trying to do the same operation at the same time — you can pass a version saying, only delete if the version still matches. There's an exists call: does this path name, this z-node, exist? An interesting extra argument is that you can ask to watch for changes to whatever path name you specified. You say, does this path name exist? — and whether or not it exists now, if you pass true for the watch flag, ZooKeeper guarantees to notify the client if anything changes about that path name: if it's created, deleted, or modified. Furthermore, the check for whether the file exists and the installation of the watch inside ZooKeeper are atomic: nothing can happen between the point in the write stream at which ZooKeeper looks to see whether the path exists and the point in the write stream at which ZooKeeper inserts the watch into its table. That's very important for correctness. We also have getData: you give a path and, again, the watch flag, and now the watch applies to the contents of that file. There's setData, which takes a path, the new data, and the conditional version: if you pass in a version, ZooKeeper only actually does the write if the current version number of the file equals the number you passed in. OK, so let's see how we use this. Maybe the first, very simple example: suppose we have a file in ZooKeeper, we want to store a number in that file, and we want to be able to increment that number.
So we're keeping maybe a statistics count, and whenever a client gets a request from a web user or something, it's going to increment that count in ZooKeeper. And, critically, more than one client may want to increment the count. One thing to get out of the way first is whether we actually need some specialized interface to support client coordination, as opposed to just data. This looks like a file system — could we just provide the ordinary read/write file system operations that typical storage systems provide? For example, some of you have started — and you'll all start soon — Lab 3, in which you build a key/value store whose only operations are put(key, value) and get(key), which yields the current value. So one question is: can we do the things we might want to do with ZooKeeper using just Lab 3's key/value put/get interface? Suppose I want to implement this counter with Lab 3's interface. You might increment the count by saying x = get(key) and then put(key, x + 1). So why is that a bad answer? Yeah — it's not atomic. That is absolutely the root of the problem here. One way of looking at it: suppose two clients both want to increment the counter at the same time. They're both going to use get to read the old value and get, say, 10. They're both going to add 1 and get 11. And they're both going to call put with 11. Now we've increased the counter by 1, but two clients were incrementing, so surely we should have increased it by 2. That's why Lab 3's interface can't be used for even this simple example. Furthermore, in the ZooKeeper world, gets can return stale data — unlike Lab 3, where gets are not allowed to return stale data.
And so if you read a stale version of the current counter and add 1 to it, you're writing the wrong value: if the up-to-date value is 11 but your get returns a stale 10, you add 1 to that and put 11, and that's a mistake, because we really should have been putting 12. So ZooKeeper has this additional problem to worry about: gets don't necessarily return the latest data. OK, so how would you do this in ZooKeeper? Here's how I would. It turns out you need to wrap this code sequence in a loop, because it's not guaranteed to succeed the first time. So: while true. We call getData to get the current value of the counter and its current version — x, v = getData(f), for some file name f; I don't care what the file name is. Now we have a value and a version number: possibly stale, possibly fresh. Then we use a conditional setData: setData(f, x + 1, v). If the setData returns true, meaning it actually set the value, we break; otherwise we go back to the top of the loop. What's going on here is that we read some value and some version number — maybe stale, maybe fresh — out of a replica. The setData actually goes to the ZooKeeper leader, because all writes go to the leader, and it means: only set the value to x + 1 if the real, latest version is still v. So if we read fresh data and nothing else is going on in the system — no other clients trying to increment this — then we read the latest value and the latest version, add one to the latest value, and specify the latest version; our setData is accepted by the leader, we get back a positive reply to our request after it's committed, and we break, because we're done.
If we got stale data here — or this was fresh data, but by the time our setData got to the leader some other client's setData got there before us — then our version number will no longer be current. In either of those cases the setData fails, we get an error response back, we don't break out of the loop, and we go back and try again, hopefully succeeding this time. Yes — so the question is: as a while loop, are we guaranteed it's ever going to finish? And no, we're not really guaranteed it's ever going to finish. For example, if the replica we're reading from is cut off from the leader and permanently gives us stale data, then maybe this is never going to work out. But in real life the leader is pushing all the replicas toward having data identical to its own, so if we just got stale data here, then probably — maybe we should sleep for 10 milliseconds or something at this point — when we go back around, eventually we're going to see the latest data. The situation under which this might genuinely be pretty bad news is a very high continuous load of increments from clients: if we have a thousand clients all trying to increment, the risk is that maybe none of them will succeed. Actually, I think one of them will succeed: the first one whose setData reaches the leader wins, and the rest all fail because their version numbers are now too low. Then the remaining 999 send in getData again, one of them succeeds, and so on. So it has a sort of n-squared complexity to get through all of the clients, which is very damaging, though it will finish eventually. So if you expected a lot of clients, you would use a different strategy; this one is good for low-load situations. Yes — if the data fits in memory, it's no problem.
If it doesn't fit in memory, it's a disaster. So yeah, when you're using ZooKeeper, you have to keep in mind that it's great for 100 megabytes of stuff and probably terrible for 100 gigabytes of stuff. That's why people think of it as storing configuration information rather than the real data of your big website. You mean adding a watch to this sequence? Yeah, that could be. So if we wanted to fix this to work under high load, you would certainly want to sleep at this point — the way I would fix it, my instinct, would be to insert a randomized sleep whose span of randomness doubles each time we fail. That's a tried-and-true strategy: exponential backoff, actually similar to Raft leader election. It's a reasonable strategy for adapting to an unknown number of concurrent clients. Okay, tell me what to write. So: we do the getData with watch set to true, and if somebody else modifies the data before we call setData, maybe we get a watch notification. The problem is that the timing is not working in our favor. The amount of time between when I receive the data here and when I send off the message to the leader with the new setData is roughly zero. And if some other client sends in an increment at about this time, it actually takes quite a long time between when that client sends in the increment and when it works its way through the leader, is sent out to the followers, is actually executed at the followers, and the followers look it up in their watch tables and send me the notification. Though I think — if you're going to read at a point that's after the modification that should fire the watch, you'll get the watch notification before you get the read response.
But in any case, I think nothing like this could save us, because all thousand clients are going to do the same thing, whatever it is. They're all going to do a get and set a watch — and none of them gets a notification, because none of them has done the setData yet, right? So the worst case is that all the clients start at the same point. They all do a get, they all see version one, they all set a watch, they get no notification because no changes have occurred, and then all thousand of them send a setData RPC to the leader. The first one changes the data, and now the other 999 get a notification — when it's too late, because they've already sent their setData. So it's possible a watch could help us here, but not the straightforward use of watch. We'll talk about this in a few minutes, but the non-herd scheme — the second locking example from the paper — absolutely solves this kind of problem: we could adapt it to cause the increments to happen one at a time when there's a huge number of clients that want to do it. Other questions about this example? Okay, this is an example of what many people call a mini-transaction. It's transactional in the sense that, although there's a lot of funny stuff happening here — the read, the modify, and the write are not atomic as individual steps — the effect, once the loop succeeds, is that we have achieved an atomic read-modify-write of the counter. On the pass through the loop on which we succeed, we managed to read, increment, and write without anything else intervening: we managed to do those steps atomically.
And this isn't a full database transaction. Real databases allow fully general transactions, where you can say "begin transaction," then read or write anything you like — maybe thousands of different data items, who knows what — then say "end transaction," and the database cleverly commits the whole thing as an atomic transaction. So real transactions can be very complicated. ZooKeeper supports an extremely simplified version — atomic operations on one piece of data — but it's enough to build increment and some other things. For that reason — they're not general, but they do provide atomicity — these are often called mini-transactions. And it turns out this pattern can be made to work for various other things too. If we wanted the test-and-set that VMware FT requires, it can be implemented with very much this setup: we read the old value; if it's zero, we try to set it to one, passing the version number. If nobody else intervened — the version number hadn't changed by the time the leader got our request — then we were the one who actually managed to set it to one, and we win. If somebody else changed it to one after we read it, the leader tells us we lost. So you can do test-and-set with this pattern also, and you should remember the strategy. Okay. The next example I want to talk about is locks. I'm talking about this because it's in the paper, not because I strongly believe this kind of lock is useful. The paper's acquire has a couple of steps. Step one: there's a lock file, and we try to create it — create of some file, with ephemeral set to true. If that succeeds, then we're done; we've acquired the lock. Step two is for when the create doesn't succeed: then we want to wait for whoever did acquire the lock.
If the create didn't succeed, that means the lock file already exists — somebody else has acquired the lock — and we want to wait for them to release it. They're going to release the lock by deleting the file. So we want to watch: step two is to call exists with watch = true. Now, if the file still exists — which we expect, because after all, if it didn't exist, presumably the create would have succeeded — then step three: we wait for the watch notification. Step four: go to one. So the usual deal is: we call create, and maybe we win. If it fails, we wait for whoever owns the lock to release it; we get the watch notification when the file's deleted; at that point the wait finishes, and we go back to step one and try to create the file again, hopefully getting it this time. Okay, so we should ask ourselves questions about possible interleavings of other clients' activities with our four steps. One we know of already: if another client calls create at the same time, the ZooKeeper leader is going to process those two create RPCs one at a time, in some order. Either my create is executed first, or the other client's is. If mine is executed first, I get back true and acquire the lock, and the other client is guaranteed to get false. If theirs is processed first, they get the true return and I'm guaranteed to get false. In either case the file gets created, so we're okay with simultaneous executions of step one. Another question: if create doesn't succeed for me and I'm about to call exists, what happens if the lock is actually released between the create and the exists?
This is the reason I wrap an "if" around the exists: the lock actually might be released before I call exists, because it could have been acquired quite a long time ago by some other client. If the file doesn't exist at this point, the exists check fails and I go directly back to step one and try again. Similarly — and actually more interesting — what happens if whoever holds the lock releases it just as I call exists, or while the replica I'm talking to is in the middle of processing my exists request? The answer is that at whatever replica I'm talking to, the log guarantees that writes occur in some definite order, and my exists call — a read-only request — is guaranteed to be executed between two log entries in that write stream. The worry is that somebody's delete request is being processed at about this time, so somewhere in the log is, or is going to be, the delete request from the other client. My exists RPC is either completely processed before that delete, in which case the replica sees the file still exists, inserts the watch information into its watch table at that point, and only then executes the delete — so when the delete comes in, my watch request is guaranteed to be in the replica's watch table, and it will send me a notification. Or my exists request is executed at a point after the delete happened: the file doesn't exist, so the call returns false — well, actually a watch table entry is still entered, but we don't care. So it's quite important that the writes are sequenced, and that reads happen at definite points between writes.
Well, okay: in that case — the exists is executed after the delete, so the file doesn't exist at that point — exists returns false, we don't wait, we go to step one, we create the file, and we return. We did install a watch, and that watch will be triggered by our own create; it doesn't really matter, because we're not waiting for it. Alternatively: the file doesn't exist, we go to step one, but somebody else has created the file in the meantime; our create fails, we install another watch, and it's this watch that we're now waiting for. So waking from this wait doesn't by itself mean the lock is ours — though it doesn't really matter, and it's not harmful to break out of this loop early; it's just wasteful. Anyway, this code leaves watches lying around in the system, and I don't actually know whether my new watch on the same file supersedes my old watch — presumably. Okay. Finally, this example, like the previous one, suffers from the herd effect. We talked about the herd effect when we were worrying about a thousand clients all trying to increment at the same time: that has n-squared complexity in how long it takes to get through all thousand clients. This lock scheme also suffers from it: if there are a thousand clients trying to get the lock, the amount of time required to grant the lock to each of the thousand clients is proportional to a thousand squared, because after every release, all of the remaining clients are triggered by the watch, all of them go back to step one and send in a create, and so the total number of create RPCs generated is basically a thousand squared. The whole herd of waiting clients is beating on ZooKeeper. Another name for this is a non-scalable lock.
Okay, and this is a real problem — we'll see it more in other systems, and it's a serious data center problem — and the paper actually talks about how to solve it using ZooKeeper. The interesting thing is that ZooKeeper is expressive enough to build a more complex lock scheme that doesn't suffer from the herd effect: even if a thousand clients are waiting, the cost of one client giving up the lock and another acquiring it is O(1) instead of O(n). Because it's a little bit complex, there's pseudocode in the paper, in section 2.4 — it's on page six if you want to follow along. So this time there is not a single lock file — oh, I'm sorry, yes: the lock name is just a name that allows us all to talk about the same lock. So it's just a name. Once I've acquired the lock, I can do whatever the lock was protecting. Maybe only one of us at a time should be allowed to give a lecture in this lecture hall; if you want to give a lecture here, you first have to acquire the lock called 34-100. Yes, it turns out it's a z-node in ZooKeeper, but nobody cares about its contents; we just need to be able to agree on a name for the lock. That's the sense in which ZooKeeper looks like a file system but is really a naming system. So step one: we create a sequential file. We give it a prefix name, but what actually gets created — if this is the 27th sequential file created with prefix f — is maybe f27 or something. And in the sequence of writes that ZooKeeper works through, successive creates get ascending — guaranteed ascending, never descending — sequence numbers when you create a sequential file. There was an operation I left off my list earlier: you can get a list of the files underneath a z-node that's actually a directory with files in it.
You can get a list of all the files currently in that directory. So step two: we're going to list the files that start with f — list f*. We get some list back. We created a file, and the system allocated us a number; we look at that number, and if there's no lower-numbered file in the list, we win and we have the lock. So if our sequential file is the lowest-numbered file with that name prefix, we've acquired the lock and we can return. If there is a lower-numbered one, then what's going on is that these sequentially numbered files set up the order in which the lock is going to be granted to the different clients. If we're not the winner, what we need to do is wait for the client who created the previous-numbered file to acquire and then release the lock. The convention for releasing the lock in this system is to remove your sequential file. So we want to wait for the previous-numbered sequential file to be deleted, and then it's our turn and we get the lock. So step four: we call exists — mostly to set a watch point — on the next lower-numbered file, with watch = true. If that file still exists, step five: we wait. And then finally, step six: we go back — not to creating the file, because our file already exists — but to listing the files, step two. So that's acquire; release is just delete: if I've acquired the lock, I delete the file I created, complete with my number. Why do you need to list the files again? That's a good question. We got the list of files; we know the next lower-numbered file; and the guarantee of sequential file creation is that once file 27 is created, no file with a lower number will ever subsequently be created.
So we know nothing else could sneak in below us. Why do we need to list again — why don't we just go back to waiting on that same lower-numbered file? Anybody guess the answer? The way this code works, the answer is: the client with the next lower number might have either acquired and released the lock before we noticed, or died. And these are ephemeral files. Even if we're 27th in line, number 26 may have died before getting the lock; and if number 26 dies, the system automatically deletes its ephemeral file. If that happened, we now need to wait for number 25 — the next lower-numbered file that still exists — because the file we were waiting on has gone away without its creator ever holding the lock. That's why we have to go back and re-list the files: our predecessor in the list of waiting clients may turn out to have died. Yes — if there's no lower-numbered file, then you have acquired the lock, absolutely. How does this not suffer from the herd effect? Suppose there are a thousand clients waiting, we've made it through the first 500, and client 500 holds the lock. Every client is sitting there waiting for an event, but only the client that created file 501 is waiting for the deletion of file 500. Everybody's waiting on the next lower number — 501 is waiting for 500, 500 was waiting for 499 — or better: everybody's waiting for just one file. When I release the lock, there's only one other client — the next higher-numbered client — watching my file. So when I release the lock, one client gets a notification, one client goes back and lists the files, and one client now has the lock.
So no matter how many clients there are, the expense of each release and acquire is a constant number of RPCs. Whereas the expense of a release and acquire in the naive scheme is that every single waiting client is notified, and every single one of them sends a read request and then a create request into ZooKeeper. Oh — you're free to go get a cup of coffee. Yeah, I mean, what the programming interface looks like is not really our business, but there's two options for what this actually means as far as what the program looks like. One is there's some thread that, in a synchronous way, has made a function call saying please acquire this lock, and the function call doesn't return until the lock's finally acquired or the notification comes back. A much more sophisticated interface would be one in which you fire off requests to ZooKeeper and don't wait, and then separately there's some way of seeing whether ZooKeeper has said anything recently — or you have some goroutine whose job it is to just wait for the next whatever-it-is from ZooKeeper, in the same sense that you might read the apply channel and all kinds of interesting stuff comes up on the apply channel. So that's a more likely way to structure this. But yeah, either through threading or some sort of event-driven thing, you can do something else while you're waiting. Yes, yes, that's the process. Or the person before me has neither died nor released: if the file before me exists, that means either that client is still alive and still waiting for the lock, or still alive and holds the lock — we don't really know. We wait as long as that client, say 500, is still alive. If this exists call comes back false, that means one of two things: either my predecessor held the lock and has released it and deleted their file, or my predecessor never held the lock — they exited, and ZooKeeper deleted their file because it was an ephemeral file.
So there's two reasons to come out of this wait, or for the exists to return false. And that's why we have to recheck everything — we really don't know what the situation is after the exists completes. That might — yeah, maybe that could be made to work. That sounds reasonable. And it preserves the scalable nature of this, in that each acquire and release only involves a few clients — two clients. All right. I actually first saw this pattern in a totally different context: scalable locks in threading systems. For most of the world this is called a scalable lock, and I find it one of the most interesting constructions I've ever seen. So I'm impressed that ZooKeeper is able to express it, and it's a valuable construct. Having said that, I'm a little bit at sea about why the paper talks about locks at all. Because these locks are not like threading locks in Go, because in threading there's no notion of threads failing — at least if you don't want there to be, there's no notion of threads just sort of randomly dying in Go. And so really the only thing you're getting out of a mutex in Go is that, if everybody uses mutexes correctly, you are getting atomicity for the sequence of operations inside the mutex. That is, if you take out a lock in Go and you do 47 different reads and writes of a lot of variables and then release the lock, if everybody follows that locking strategy, nobody's ever gonna see some sort of weird intermediate version of the data as of halfway through your updating it. It just makes things atomic — no argument. These locks aren't really like that, because if the client that holds the lock fails, ZooKeeper just releases the lock and somebody else can pick it up. So it does not guarantee atomicity, because you can get partial failures in distributed systems, where you don't really get partial failures in ordinary threaded code.
So if the current lock holder had the lock and needed to update a whole bunch of things that were protected by that lock before releasing, and only got halfway through updating that stuff and then crashed, then the lock will get released, you'll get the lock — and yet when you go to look at the data, it's garbage, because it's just whatever random state it was in in the middle of being updated. So these locks don't by themselves provide the same atomicity guarantee that threading locks do. And so we're sort of left to imagine for ourselves why you would want to use them, or why this is one of the main examples in the paper. So I think if you use locks like this in a distributed system, you have two general options. One is that everybody who acquires the lock has to be prepared to clean up from some previous disaster. So you acquire this lock, you look at the data, and you try to figure out: gosh, how can I decide whether the previous owner of the lock crashed, and what do I do to fix up the data? And you can play that game — especially if the convention is that you always update in a particular sequence, you may be able to detect where in that sequence the previous holder crashed, assuming they crashed. But it's a tricky game that requires thought of a kind you don't need for thread locking. The other reason these locks might make sense is if they're sort of soft locks protecting something that doesn't really matter. So for example, if you're running MapReduce jobs — map tasks, reduce tasks — you could use this kind of lock to make sure only one worker executed each task. A worker is gonna run task 37: it gets the lock for task 37, executes it, marks it as executed, and releases the lock. Well, the way MapReduce works, it's actually proof against crashed workers anyway.
So if you grab a lock and you crash halfway through your map or reduce task, so what? Your lock will be released when you crash, and the next person who gets it will see you didn't finish the task and just re-execute it. It's just not a problem, because of the way MapReduce is defined. So you could use these locks for some kind of soft-lock thing. And maybe the other thing we should be thinking about is that some version of this could be used to do things like elect a master. If what we're really doing here is electing a master, we could use code much like this, and that would probably be a reasonable approach, yeah. Oh yeah — do you remember the text in the paper where it says it's gonna delete the ready file, then do a bunch of updates to files, and then recreate the ready file? That is a fantastic way of detecting and coping with the possibility that the previous lock holder — the previous master, whoever it is — crashed halfway through, because, gosh, the ready file was never recreated. In a Go program? Yeah, sadly that is possible. So the question is — nothing about ZooKeeper — if you're writing threaded code in Go, and a thread acquires a lock, could it crash while holding the lock, halfway through whatever stuff it's supposed to be doing? And the answer is yes, actually. There are ways for an individual thread to crash in Go — I don't remember exactly what they all are, maybe divide by zero, certain panics. Anyway, you can do it. And my advice about how to think about that is that the program's now broken and you've gotta kill it, because in threaded code, the way to think about locks is that while the lock is held, the invariants in the data don't hold. So there's no way to proceed if the lock holder crashes.
There's no safe way to proceed, because all you know is that whatever invariants the lock was protecting no longer hold. And if you do wanna proceed, you have to leave the lock marked as held so that no one else will ever be able to acquire it. Unless you have some clever idea, that's pretty much the way you have to think about it in a threaded program, because that's the style in which people write threaded lock programs. If you're super clever, you could play the same kinds of tricks, like this ready-flag trick. But it's super hard in Go, because the memory model says there is nothing you can count on except where there's a happens-before relationship. So if you play this game of changing some variables and then setting a done flag, that doesn't mean anything unless you release a lock and somebody else acquires the lock — only then can anything be said about the order in which, or even whether, the updates happened. So this is very, very hard. It would be very hard in Go to recover from a crash of a thread that holds a lock. Here, this may be a little more plausible. Okay, so that's all I wanna talk about with ZooKeeper. That's just two take-home points. One is these clever ideas for high performance by reading from any replica, though they sacrifice a bit of consistency. The other interesting take-home is that they worked out an API that really does let ZooKeeper be a general-purpose coordination service, in a way that simpler schemes like put/get interfaces just can't. They worked out a set of functions that allows you to do things like write mini-transactions and build your own locks, and it all works out, although it requires care. Okay, now I wanna turn to today's paper, which is CRAQ. There's a couple of reasons why we're reading the CRAQ paper.
One is that it does replication for fault tolerance, and as we'll see, the properties you get out of CRAQ, or its predecessor chain replication, are very different in interesting ways from the properties you get out of a system like Raft. CRAQ is sort of an optimization to an older scheme called chain replication. Chain replication is actually fairly frequently used in the real world — there's a bunch of systems that use it. CRAQ is an optimization to it that does a trick similar to ZooKeeper's, trying to increase read throughput by allowing reads to go to any replica, so you get a number-of-replicas factor of increase in read performance. The interesting thing about CRAQ is that it does that while preserving linearizability. Unlike ZooKeeper, which seemingly had to sacrifice freshness in order to read from any replica, and therefore isn't linearizable, CRAQ actually manages to do these reads from any replica while preserving strong consistency, which is pretty interesting. Okay, so first I wanna talk about the older system, chain replication. Chain replication is just a scheme where you have multiple copies and you wanna make sure they all see the same sequence of writes. So it's a very familiar basic idea, but it's a different topology than Raft. The idea is that there's a chain of servers. The first one's called the head, the last one's called the tail. When a client wants to write something, all writes get sent to the head. The head updates — replaces — its current copy of the data that the client's writing. You can imagine it being a key-value store. So if everybody started out with version A of the data, then under chain replication, when the head processes the write — and maybe we're writing value B — the head just replaces its A with a B and passes the write down the chain.
As each node sees the write, it overwrites its copy of the data with the new data. When the write gets to the tail, the tail sends the reply back to the client saying we completed your write. That's how writes work. Reads: if a client wants to do a read, it sends the read request to the tail, and the tail just answers out of its current state. So if we asked for whatever this object was, the tail would just say, oh, the current value is B. So reads are a good deal simpler. Okay, so we should think for a moment about why — and this is not CRAQ, just to be clear, this is chain replication — chain replication is linearizable. In the absence of failures, what's going on is that, for the purposes of thinking about consistency, we can essentially view it as just this one server: the tail. The tail sees all the writes and it sees all the reads. It processes them one at a time, a read will just see the latest value that was written, and that's pretty much all there is to it, from the point of view of what the consistency is like if there's no crashes. Pretty simple. As for failure recovery, a lot of the rationale behind chain replication is that the set of states you can see after a failure is relatively constrained, because of this very regular pattern in how the writes get propagated. At a high level, what's going on is that no write is committed — that is, acknowledged to the writing client, or exposed in a read — unless that write reached the tail. And in order for it to reach the tail, it had to pass through, and be processed by, every single node in the chain. So we know that if we ever acknowledged a write or ever exposed it to a read, every single node in the chain must know about that write.
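Here's a toy in-memory version of that write and read path — my own sketch, not code from the paper: writes enter at the head and flow node by node to the tail, which is what commits them; reads go only to the tail.

```python
# Toy chain replication: each node holds a full copy of a key/value store.
# A write enters at the head and is applied at every node in order; the
# tail applying it is what "commits" the write and answers the client.

class Node:
    def __init__(self):
        self.store = {}

class Chain:
    def __init__(self, n):
        # nodes[0] is the head, nodes[-1] is the tail
        self.nodes = [Node() for _ in range(n)]

    def write(self, key, value):
        for node in self.nodes:          # head -> ... -> tail, in order
            node.store[key] = value
        return "ok"                      # ack is sent by the tail

    def read(self, key):
        return self.nodes[-1].store.get(key)   # reads served by the tail

chain = Chain(3)
chain.write("x", "A")
chain.write("x", "B")
print(chain.read("x"))                   # -> B
```

Because an acknowledged write has, by construction, been applied at every node, a read at the tail always sees every committed write — which is why the no-failure case is linearizable.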
We don't get those situations like figure 7 or figure 8 in the Raft paper, where you can have just hair-raising complexity in how the different replicas differ if there's a crash. Here, either the write is committed, or before the crash it reached some point in the chain and got nowhere after that point, because the progress of writes is always linear. So committed writes are always known everywhere. And if a write isn't committed, that means that before whatever crash disturbed the system, the write got to a certain point: it's known everywhere before that point and nowhere after it. Those are really the only two setups. And at a high level, failure recovery is relatively simple also. If the head fails, then to a first approximation the next node can simply take over as head, and nothing else needs to get done, because any write that made it as far as the second node will keep on going and will commit — it was only the head that failed. If there's a write that made it to the head before it crashed but that the head didn't forward, well, that's definitely not committed: nobody knows about it, and we definitely didn't send an acknowledgement to the writing client, because the write never got down the chain. So we're not obliged to do anything about a write that only reached a crashed head. That may be too bad for the client, who will have to resend, but it's not our problem. If the tail fails, it's actually very similar: the next-to-last node can directly take over as tail, because everything the tail knew, the node just before it also knows — the tail only hears things from the node just before it.
It's a little more complex if an intermediate node fails, but basically what needs to be done is we drop it from the chain, and now there may be writes that it had received that the next node hasn't received yet, so if we drop a node out of the chain, the predecessor may need to resend recent writes to its new successor. That's the recovery in a nutshell. As for why this construction instead of something else — like why this versus Raft, for example — the performance reason is that in Raft, if you recall, we have a leader and some number of replicas, and it's not a chain: the replicas are all directly fed by the leader. So if a client write comes in — or a client read, for that matter — the leader has to send it itself to each of the replicas, whereas in chain replication the head only has to do one send. These sends on the network are actually reasonably expensive, so the load on a Raft leader is gonna be higher than the load on a chain replication head. And that means that as the number of client requests per second goes up, a Raft leader will hit a limit and stop being able to go faster sooner than a chain replication head, because it's doing more work than a chain replication head.
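That load argument is simple arithmetic; here's a sketch, under the simplifying assumption that each message sent costs the sending server one unit of work:

```python
# Messages the "first" server must send per client write:
# a Raft leader fans out to every follower, while a chain head
# forwards to exactly one successor, regardless of chain length.

def raft_leader_sends(n_replicas):
    return n_replicas - 1   # AppendEntries to each follower

def chain_head_sends(n_replicas):
    return 1                # one forward to the head's successor

for n in (3, 5, 7):
    print(n, raft_leader_sends(n), chain_head_sends(n))
```

So with 5 replicas the Raft leader does 4 sends per write while the chain head does 1, which is why the head saturates later as client write load grows.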
Another interesting difference between chain replication and Raft is that reads in Raft are also all required to be processed by the leader, so the leader sees every single request from a client, whereas here everybody sees all the writes but only the tail sees the read requests, so there may be an extent to which the load is split between the head and the tail rather than concentrated in the leader. And as I mentioned before, the analysis required to think about the different failure scenarios is a good deal simpler in chain replication than it is in Raft, and that's a big motivation, because it's hard to get this stuff correct. Yes? Yeah, so if the tail fails, but its predecessor had seen a write that the tail hadn't seen, then the failure of the tail basically commits that write — it's now committed, because it's reached the new tail. And so the new tail could respond to the client, but it probably won't, because it wasn't the tail when it received the write, and so the client may resend the write, and that's too bad — we need duplicate suppression, probably at the head. Basically all the systems we're talking about require, in addition to everything else, suppression of duplicate client requests. Sorry, can you say that again? You wanna know who makes the decisions about how to — that's an outstanding question. So the question is, or I'll rephrase the question a bit: if there's a failure — suppose the second node stops being able to talk to the head — can the second node just take over? Can it decide for itself: gosh, the head seems to have gone away, I'm gonna take over as head and tell clients to talk to me instead of the old head? But what do you think — does that sound like a plan? With the usual assumptions we make about how the network behaves, that's a recipe for split brain, right?
If you do exactly what I said, then — because of course what really happened was that the network failed — the head is totally alive, and the head thinks its successor has died; the successor's actually alive, and it thinks the head has died. They both say: well, gosh, that other server seems to have died, I'm gonna take over. The head says, I'll just be a sole replica, I'll act as both head and tail, because the rest of the chain seems to have gone away; and the second node does the same thing. And now we have two independent, split-brain versions of the data, which will gradually get out of sync. So this construction is not proof against network partition and does not have a defense against split brain, and what that means in practice is that it cannot be used by itself. It's a helpful thing to have in our back pocket, but it's not a complete replication story. So it's very commonly used, but it's used in a stylized way in which there's always an external authority — not the chain itself — that makes the call on who's alive and who's dead, and makes sure everybody agrees on a single story about who constitutes the chain, so it's never the case that some people think the chain is one set of nodes and other people think it's a different set. What's that usually called? A configuration manager. Its job is just to monitor the liveness of all the servers, and every time the configuration manager thinks a server's dead, it sends out a new configuration in which the chain has a new definition — head, middle, tail. The server the configuration manager thinks is dead may or may not actually be dead, but we don't care, because everybody is required to follow the new configuration, and so there can't be any disagreement, because there's only one party making these decisions, and it's not gonna disagree with itself. Of course, how do you make a service that's fault tolerant and doesn't disagree with itself — doesn't suffer from split brain if there's
network partitions? The answer is that the configuration manager usually uses Raft or Paxos — or, in the case of CRAQ, ZooKeeper, which itself of course is built on a Raft-like scheme. So the usual complete setup in your data center is that you have a configuration manager that's based on Raft or Paxos or whatever, so it's fault tolerant and does not suffer from split brain, and then you split up your data over a bunch of chains. You have a room with a thousand servers in it, and the configuration manager decides what the chains should look like: chain A is made of server 1, server 2, server 3; chain B is server 4, server 5, server 6; whatever. And it tells everybody this whole list, so all the clients know, all the servers know, and the individual servers' opinions about whether other servers are alive or dead are totally neither here nor there. If some server in a chain really does die, the head is required to keep trying indefinitely until it gets a new configuration from the configuration manager — it's not allowed to make decisions about who's alive and who's dead. And if the configuration manager itself fails? Oh boy, you've got a serious problem — so that's why you replicate it using Raft, make sure the different replicas are on different power supplies, the whole works. But this construction I've set up here is extremely common, and it's how chain replication is intended to be used, and how CRAQ is intended to be used. The logic of it is that if the chains don't have to worry about partition and split brain, you can build very high-speed, efficient replication systems using chain replication. We're sharding the data over many chains, and individually these chains can be built to be just the most efficient scheme for the particular kind of thing you're replicating — maybe read-heavy, maybe write-heavy, whatever — and we don't have to worry too much about partitions, because all that worry is concentrated in a reliable, non-split-brain configuration manager. Okay, so your question is: why are we using chain replication here instead of Raft? Okay, that's a totally reasonable question. It doesn't really matter for this construction, because even if we're using Raft for the chains, we still need one party to make a decision — with which there can be no disagreement — about how the data is divided over our hundred different replication groups. In any kind of big system you're sharding, you're splitting up the data, and somebody needs to decide how the data is assigned to the different replication groups, which has to change over time as you get more or less hardware, or more data, or whatever. So if nothing else, the configuration manager is saying: well, look, keys starting with A or B go here, C or D go there — even if you use Paxos for the groups. Now, there's also this smaller question of, within each group, what should we use for replication: chain replication or Paxos or Raft or whatever? And people do different things. Some people actually do use Paxos-based replication — like Spanner, which I think we're gonna look at later in the semester, has this structure, but it actually uses Paxos to replicate the writes for the data. The reason you might not want to use Paxos or Raft is that it's arguably more efficient to use this chain construction, because it reduces the load on the leader, and that may or may not be a critical issue. The reason to favor Raft or Paxos is that they do not have to wait for a lagging replica. Chain replication has a performance problem: if one of the replicas is slow, even for a moment, then because every write has to go through every replica, a single slow replica slows down all write operations. That can be very damaging — out of a few thousand servers, probably at any given time seven of them are out to lunch, or unreliable, or slow because somebody's installing new software, who knows what — and so it's a bit damaging to have every request be sort of limited by
the slowest server. Whereas with Raft and Paxos — well, Raft, for example: if one of the followers is slow, it doesn't matter, because the leader only has to wait for a majority; it doesn't have to wait for all of them. Ultimately they all have to catch up, but Raft does much better at resisting transient slowdowns. And some Paxos-based systems — although not really Raft — are also good at dealing with the possibility that the replicas are in different data centers and may be far from each other: because you only need a majority, you don't necessarily have to wait for acknowledgments from a distant data center. So that can also lead people to use Paxos- or Raft-like majority schemes rather than chain replication. But this depends very much on your workload and what you're trying to achieve; this overall architecture, though — I don't know if it's universal, but it's extremely common. Yes? I think the question is: two nodes in the chain are in fact able to contact the configuration manager, and the reverse, but these two nodes cannot talk to each other — like a pathological topology? Okay. For a network that's not broken, the usual assumption is that all the computers can talk to each other through the network. For networks that are broken — because somebody stepped on a cable, or some router is misconfigured — any crazy thing can happen. So absolutely, due to misconfiguration you can get a situation where these two nodes can talk to the configuration manager, and the configuration manager thinks they're up, but they can't talk to each other. So yes, and that's a killer for this: the configuration manager thinks they're up, they can't talk to each other — boy, it's a disaster. And if you need your system to be resistant to that, then you need a more careful configuration manager: you need logic in the configuration manager that says, gosh, I'm only gonna form a chain out of these servers if not only can I talk to them, but they can talk to each other — and you'd sort of explicitly check. And I don't know
if that's common — I'm gonna guess not — but if you're super careful, you'd want to. Because even though we talk about network partition, that's an abstraction, and in reality you can get any combination of who can talk to whom, and some combinations are maybe very damaging. Okay, I'm gonna wrap up there — see you next week.