Cool. Oh, wow, such power. All right. My name is Paul. I'm a software nerd working in the Boston area. I'm really excited and honored to be able to share with you a hack project of mine, which is a ZooKeeper layer for FoundationDB. I think there are a couple of interesting techniques in here that will be pretty fun. We're going to blast through the ZooKeeper API, talk about what it's used for and what makes it special, and then we'll dig into how to map those ideas onto a stateless layer.

ZooKeeper itself is a distributed system. It's open source, run by Apache these days. Their tagline is that distributed systems are a zoo, and therefore you want a zookeeper. The idea is that it helps offload certain responsibilities of a larger system, and in my mind it maps onto two different types of access patterns and usages. The first thing I see ZooKeeper used a lot for is storing system-level configuration and service discovery data, where the total data set is pretty small and you're not updating it very often. You're not updating your configuration data thousands of times per second; it's more like a handful of times per minute. And despite all that, you can have a huge amount of read throughput onto a single piece of data. The other thing ZooKeeper is used a lot for is distributed synchronization. Out of the box it doesn't actually give you these things, but it gives you all the tools to write your own leader election protocols, mutexes, semaphores, and a lot of pretty neat stuff.

When you look at how ZooKeeper itself actually fares against the use cases people use it for, in my mind it works really, really well for the configuration case. It can scale out reads incredibly well; it can handle that narrow data distribution with lots of reads. But I've seen applications really struggle with the distributed synchronization side of things. You're writing an application, you hit some problem, and you think: you know what, I'm going to use a mutex for this, and since we're already connected to ZooKeeper, I'm going to take out a ZooKeeper lock. And ZooKeeper does not scale writes particularly well, so as the application grows, you don't really have a way to scale your ZooKeeper write throughput. In fact, adding an extra instance to your ZooKeeper ensemble is going to bring your total throughput down.

So that got me thinking about what this set of trade-offs would look like on top of FoundationDB. With FoundationDB's horizontal scalability and the way it's architected, it's going to be really phenomenal for the synchronization case, and actually somewhat weaker for the case of hot keys, which hopefully some of the stuff that was talked about earlier today, like consistent caching, would pick up. So why build a ZooKeeper layer? I think it would be cool to offer something to applications that have gotten themselves into a pickle with their ZooKeeper usage, or to be able to take existing libraries off the shelf that are already built on top of ZooKeeper and say: great, this is going to scale way better than before. I haven't really seen a layer that does similar stuff, so I was curious to see what's possible here.
So why do people use ZooKeeper for storing configuration data? I think there are a couple of key features it offers an application that make it good for this. It's got a really simple data model that looks like a file system. It gives you watches, which we're going to talk a lot more about, but they help you avoid polling. And it has really, really precise semantics about how operations are ordered. That last one is convenient for us because FoundationDB actually has stronger semantics than what ZooKeeper offers, so by virtue of building on it and not actively undermining that guarantee, we get this part for free.

Why do people use ZooKeeper for synchronization, or what does it offer for building synchronization primitives? Again, it has a simple data model, though the data model doesn't really matter here. It allows you to have really, really precisely ordered watches; we'll talk a lot more about that. It tracks client state, so it knows exactly who is currently actively talking to ZooKeeper. And it lets you tie pieces of data to that client state, ephemeral nodes, so if the client disconnects, their data is removed. That's really, really important for some synchronization primitives as well. And again, the sequential consistency is a huge factor here, and we don't have to worry about it. So we're left with three pieces that we're going to have to implement at the layer level.

Digging into the first one: the ZooKeeper data model is really quite simple. It looks like a distributed file system. You create things at a particular path, a path can have child paths, you can store a very small data blob there, and it tracks some metadata. That's pretty much it. So I was thinking: all right, how do you build a file system, something with directories, on top of FoundationDB? And I decided to be lazy and just recycled the directory layer, because it already does pretty much all of those things. You pass in a ZooKeeper node path (ZooKeeper calls these Z nodes), get a subspace back from the directory layer, serialize ZooKeeper's native objects into that subspace, and pretty much all of the work is done.

If ZooKeeper just offered a file system interface like this for very small amounts of data without a lot of write volume, that would be pretty boring. One of the things that really buoys this as a useful idea is that ZooKeeper offers sequential consistency: it has a total ordering of all write operations. You can imagine that every write request ZooKeeper has ever accepted could be put on a single timeline that says, this is the state of the system. When there are multiple instances in a ZooKeeper ensemble, they're all replicating, all moving along that timeline, and they agree on exactly the same ordering. An interesting thing to note is that you can get stale reads: if you're connected to a ZooKeeper instance that is not the leader, you can be seeing a stale view of the world. When we look at how that compares to FoundationDB's consistency level, FoundationDB has strict serializability: you don't have stale reads, and you still have a total ordering. So between this guarantee and the directory layer, this whole first piece is pretty much taken care of.
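To make that mapping concrete, here's a minimal sketch of what creating a Z node through the directory layer might look like with the FDB Java bindings. The class name, key names, and path-splitting helper here are illustrative assumptions, not the layer's actual schema.

```java
import com.apple.foundationdb.Database;
import com.apple.foundationdb.FDB;
import com.apple.foundationdb.directory.DirectoryLayer;
import com.apple.foundationdb.directory.DirectorySubspace;
import com.apple.foundationdb.tuple.Tuple;

import java.util.Arrays;
import java.util.List;

public class ZNodeStore {
    private final Database db;

    public ZNodeStore(Database db) {
        this.db = db;
    }

    // Split a ZooKeeper path like "/app/config" into directory-layer components.
    private static List<String> components(String zkPath) {
        return Arrays.asList(zkPath.substring(1).split("/"));
    }

    // Create a Z node: open (or create) a directory for the path, then serialize
    // the node's data and metadata into the subspace the directory layer hands back.
    public void createZNode(String zkPath, byte[] data) {
        db.run(tr -> {
            DirectorySubspace dir =
                    DirectoryLayer.getDefault().createOrOpen(tr, components(zkPath)).join();
            tr.set(dir.pack(Tuple.from("data")), data);
            tr.set(dir.pack(Tuple.from("ctime")), Tuple.from(System.currentTimeMillis()).pack());
            return null;
        });
    }

    public static void main(String[] args) {
        Database db = FDB.selectAPIVersion(620).open();
        new ZNodeStore(db).createZNode("/app/config", "hello".getBytes());
    }
}
```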
So let's move on to something much more challenging: how watches work in ZooKeeper. In ZooKeeper, there are four different ways to set a watch. A watch request is when you go to the server and say: hey, for a given Z node, a particular path, and a particular action, give me back a future that is going to complete when that action has been observed. ZooKeeper allows you to do this for a bunch of different things. If the node does not exist already, notify me when it's created; or if it does exist, notify me when it's deleted. And it backs these up with some ultra-precise guarantees around exactly how the ordering of those events must work; we'll jump into each of them. All of that ordering adds up to ZooKeeper being a useful system to build these primitives on top of.

In contrast, FoundationDB also has a feature called a watch, and the way it works is you say: for a given key, give me back a future that completes if the value has changed. We don't know if it was created, deleted, or updated. There are no ordering guarantees, and no guarantees about exactly when it fires relative to other things. It's even possible that it doesn't fire at all: if you were watching value A, it goes to B, and then immediately flips back to A. So we're going to have to do a lot of work at the layer level to recreate the exact semantics ZooKeeper is giving us. For that, we'll dig into each of these constraints.

The first one is that a client dispatches all of the events and callbacks in order. This is actually something that's done at the client level, so we don't have to worry about it; we can check that one off. The next one is super interesting. It says that the order of watch events corresponds to the order of updates that the ZooKeeper service observed. Going back to that picture of sequential consistency where we're putting all of the updates on a line: if two updates triggered watches, then the watches must be dispatched in that exact same order. For that, we are going to need a log of watch events.

So how do we build up a log that contains all of the watch events that have occurred, in the same order as the updates that triggered them? Imagine a ZooKeeper client has come in and said: I want to perform a write, I'm going to create a Z node at the path /app. That gets passed off to our layer, and the layer checks: are there any clients actively watching for this node and this particular action? If so, we append the event into an event log that is keyed per client, so we have an individual event log for each of them. And this is yet another place where we use versionstamps. Versionstamps substitute in that ordering number from the FDB servers at commit time, so these entries are going to be in the exact same order as the updates, because all of this runs in a single transaction. This part is super, super nice. So now we can go back to our question: are we keeping the watch events in the same order as the updates that triggered them? The answer is yes; we've got this part.

We've built up and persisted this list, but we haven't actually delivered it to the client yet. How do we do that? We have our event log and we need to deliver it, and the way we're going to do that is by piggybacking off of a FoundationDB watch.
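Here's a minimal sketch of that per-client event-log append; the subspace layout and the appendEvent helper are assumptions for illustration, not the layer's actual schema. The key is packed with an incomplete versionstamp, and the SET_VERSIONSTAMPED_KEY mutation fills in the commit version, so the log order matches the commit order.

```java
import com.apple.foundationdb.MutationType;
import com.apple.foundationdb.Transaction;
import com.apple.foundationdb.subspace.Subspace;
import com.apple.foundationdb.tuple.Tuple;
import com.apple.foundationdb.tuple.Versionstamp;

public class WatchEventLog {
    // Hypothetical layout: ("watch-events", sessionId, versionstamp) -> (eventType, path)
    private final Subspace eventLog = new Subspace(Tuple.from("watch-events"));

    // Called inside the same transaction that performs the write (e.g. creating /app),
    // once for every session that is watching this path and event type. The versionstamp
    // in the key is filled in at commit time, so entries sort in commit order.
    public void appendEvent(Transaction tr, long watchingSessionId, String path, String eventType) {
        byte[] key = eventLog.subspace(Tuple.from(watchingSessionId))
                .packWithVersionstamp(Tuple.from(Versionstamp.incomplete()));
        byte[] value = Tuple.from(eventType, path).pack();
        tr.mutate(MutationType.SET_VERSIONSTAMPED_KEY, key, value);
    }
}
```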
So when somebody comes in and watches for a ZooKeeper action, the layer creates a watch notification key for that particular client and puts a FoundationDB watch on it. When that watch fires, it's not saying that any particular watch event has happened; what it's telling the client to do is go read the event log that we have persisted for it. At that point it can find all of the pending watch events and play them back in the exact order that it needs to.

We can now look at the last constraint that ZooKeeper has for watches, which is that a client will see a watch event before it can read the corresponding data out of the underlying store. This is an interesting one to noodle on for a little bit, exactly why it's in there. Ultimately it means there's a race: we perform a write, the FoundationDB watch hasn't fired yet, and somebody goes in and reads the same data that triggered that watch. They could see the new data before the watch fires. So that means on a read request, we actually have to go and check our watch event log to make sure it has been satisfied first. And with that, that's how we're going to notify the client. If we go back to how we're building up this watch event log, we can see that we now need to trigger a notification for the watcher, so we also perform an atomic update to that notification key, and all of this is happening in a single transaction. I can't say enough good things about FoundationDB transactions.

That brings us to the last piece: ephemeral nodes and session tracking. ZooKeeper allows you to create nodes whose existence is tied to whether or not the client is still connected. You can imagine, for a leader election protocol, you say: everybody writes to this directory, and each node only exists as long as its client is there. Somebody is selected as leader, everybody is watching that directory to see if anybody has been added or removed, and if somebody is removed, you can re-elect a new leader. It all starts to build on itself, so this is a super important feature for ZooKeeper. The way it works is nothing particularly special. When a client connects to a ZooKeeper server, the server responds with a session ID: all right, that's who you are now. Every few seconds, the client sends a heartbeat request, and the ZooKeeper server keeps track in memory of all of the clients it's talking to and when they're going to expire. Then, about once every second, a background thread checks: are there any sessions that haven't checked in recently enough, so that they're expired? That's how we detect that somebody has disconnected. If they have, ZooKeeper goes in, deletes their ephemeral nodes, triggers the watches for anybody who was watching for that, and cleans up any other state associated with the client.

So how do we recreate this on top of FoundationDB? It's pretty simple. I think the secret to any stateless layer is that you take any state you previously had and push it down into FoundationDB instead. So we're just going to persist the heartbeats and all of the sessions. We have a subspace which is just all of the sessions, and we actually use versionstamps to generate the session IDs. Then we have a second subspace, which is an index of all of the session IDs, ordered by when they're going to expire.
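As a rough sketch of what those two subspaces might look like, here's a hypothetical heartbeat path: the session record holds its expiry time, and the index is keyed by (expiry, sessionId) so expired sessions sort to the front. The subspace names and layout here are assumptions, not the layer's actual schema.

```java
import com.apple.foundationdb.Database;
import com.apple.foundationdb.Transaction;
import com.apple.foundationdb.subspace.Subspace;
import com.apple.foundationdb.tuple.Tuple;

public class SessionHeartbeats {
    // Hypothetical layout:
    //   ("sessions", sessionId)                     -> expiry timestamp in ms
    //   ("session-expirations", expiry, sessionId)  -> ""
    private final Subspace sessions = new Subspace(Tuple.from("sessions"));
    private final Subspace expirations = new Subspace(Tuple.from("session-expirations"));
    private final Database db;
    private final long sessionTimeoutMs;

    public SessionHeartbeats(Database db, long sessionTimeoutMs) {
        this.db = db;
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    // Heartbeat: push the session's expiry forward and move its entry in the expiration index.
    public void heartbeat(long sessionId) {
        db.run((Transaction tr) -> {
            byte[] sessionKey = sessions.pack(Tuple.from(sessionId));
            byte[] oldValue = tr.get(sessionKey).join();
            if (oldValue != null) {
                long oldExpiry = Tuple.fromBytes(oldValue).getLong(0);
                tr.clear(expirations.pack(Tuple.from(oldExpiry, sessionId)));
            }
            long newExpiry = System.currentTimeMillis() + sessionTimeoutMs;
            tr.set(sessionKey, Tuple.from(newExpiry).pack());
            tr.set(expirations.pack(Tuple.from(newExpiry, sessionId)), new byte[0]);
            return null;
        });
    }
}
```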
So how do we find all of the sessions that might be expiring? It's a simple range read across that index subspace. For each expiring session we find, we go and delete all of its ephemeral nodes and all of the other data it has.

Staring at this for a little bit, we now have an interesting question, which is: who actually runs this code? When you had a lot of clients connecting to an actual ZooKeeper server, ZooKeeper was keeping track of all of this in memory: lots of clients connected to one server, and that server has a background thread cleaning these up. Now we have lots and lots of layer instances all pushing their state down into FoundationDB, which has no idea that something needs to be cleaned up at any regular interval. So how do we keep the layer stateless? Or do we have to introduce a second process that is only responsible for dealing with this?

I asked about this on the forums, and there was one idea there that I thought was so fun I had to go implement it, which is running a little mini-election every second. On a ZooKeeper server, there's a thread which runs every second; what we're going to do instead is elect one of the layer instances every second to be responsible for cleaning up all of the expired sessions. And the way we're going to do that is by using transaction conflicts. We create a new subspace which says: here's where the election lives, and the value is when the next election occurs. Every second, each layer instance starts up a new transaction and reads when the next election is. It waits while holding this transaction open, then comes in, writes when the subsequent election will occur, and commits. Because in theory everybody is doing this at once, when you go to commit, one of the clients is going to succeed in this write and everybody else fails. If you succeeded, then you are the one responsible for going in, reading all of the expired sessions, and cleaning up that data. I thought that was pretty neat. It allows you to just deploy one thing; it works equally well whether you have a single instance of the layer or lots of them.
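Here's a minimal sketch of that election round under the assumptions above (a single election key whose value is the next election time); the class and key names are hypothetical. Every instance reads the key, which registers a read conflict range, then tries to write it; FoundationDB's optimistic concurrency lets exactly one commit succeed, and the losers see a conflict at commit time.

```java
import com.apple.foundationdb.Database;
import com.apple.foundationdb.Transaction;
import com.apple.foundationdb.subspace.Subspace;
import com.apple.foundationdb.tuple.Tuple;

public class CleanupElection {
    // Hypothetical single key holding the time of the next election.
    private final Subspace electionKey = new Subspace(Tuple.from("cleanup-election"));
    private final Database db;

    public CleanupElection(Database db) {
        this.db = db;
    }

    // One election round, run by every layer instance about once a second.
    // Returns true if this instance won and should clean up expired sessions.
    public boolean tryWinElection() throws InterruptedException {
        try (Transaction tr = db.createTransaction()) {
            byte[] key = electionKey.pack();
            tr.get(key).join();          // the read registers a conflict range on the election key
            Thread.sleep(50);            // hold the transaction open briefly
            long now = System.currentTimeMillis();
            tr.set(key, Tuple.from(now + 1000).pack());  // schedule the next election
            try {
                tr.commit().join();
                return true;             // we won: scan the expiration index and clean up
            } catch (Exception e) {
                return false;            // a conflict means another instance won this round
            }
        }
    }
}
```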
And so now, the way we're fleshing out this whole session story is that we generate session IDs with versionstamps, and we push all of the session information down into FoundationDB. I didn't give this its own slide, but it's pretty simple: we keep track of ephemeral nodes by session ID. And then every second we nominate one of the layer instances to go and perform cleanup duty. With these three features complete, we have now really fleshed out the special stuff ZooKeeper does that allows you to build all of those interesting primitives it offers.

What's the state of things today? This is very much a proof of concept; I think it would be software engineering malpractice to go and use this in production right now, but it's coming along nicely. It runs everything that's in Apache Curator, which has a whole bunch of well-honed recipes for different synchronization primitives. So yeah, it's coming along nicely. There's a GitHub link if you're interested, and if you want to talk to me more about it, you can reach me there. We have time for questions if anybody has any thoughts.

Yeah, so the question here is: is there a guarantee that only one of the instances becomes the leader for that second? The answer is, I don't think so, but it doesn't matter so much. I think it's more that we don't want it designed such that every single instance is running this cleanup every second, which would be very redundant. It's more like one, or maybe slightly more than one, running it, and that should be fine. It's an idempotent operation: at this point, the session is gone, so we're just cleaning it up.

Other questions? I did not know that. I think I'm missing part of this question, sorry, it's a little far from here. Multi-region, that's a big question; I'd need to know more details about how it's set up. All right, cool, thank you everybody. I think we have a break now, so if you have more questions, just pull me aside. Thank you.