Hello, everybody. My name is Jeff Layton, and I'm a long-time developer of Linux NFS — I've been involved for more than a decade — and I've been involved with Ceph for about three years now. I've been working on a project to allow exporting CephFS over NFS, but from multiple server heads.

First of all, a lot of you were probably in some of the earlier talks, but I'll go ahead and go over the pieces here anyway. At the bottom layer is RADOS, which is a reliable, distributed object store. Built on top of that is CephFS, which is a POSIX-coherent file system. CephFS has its own metadata daemon (the MDS), and clients talk to that daemon to acquire what are known as capabilities, or caps. Caps are essentially pieces of delegated inode metadata: you can get caps that allow you to read or write a file, fetch certain attributes, xattrs, that kind of thing.

Above all that we have NFS-Ganesha, which is an NFS server that runs entirely in userspace. The reason we use it is that it has a plug-in based backend architecture, so it's well suited for exporting userland file systems — we don't need to go through the kernel NFS server or anything like that. It talks to Ceph through libcephfs, a userland library that lets you talk to CephFS without involving the kernel. On top of that, we can also store recovery databases and config files in RADOS, which makes it well suited for containerization.

The goal I've had with all this is to build a scale-out NFS server on top of Ceph — CephFS in particular. People have run NFS servers on top of Ceph for a long time, but being able to export from multiple heads at the same time, safely, is actually pretty difficult. People will do it, but if you have any sort of stateful objects — locks, opens, that sort of thing — recovery becomes problematic. So the idea is to eliminate bottlenecks and single points of failure. We're focusing on NFS v4.1 and later, because it has a lot of features that make it better for redundancy and performance, and the goal is to minimize coordination between the Ganesha nodes. You can build these sorts of servers if you're very careful about recording lots of stateful information in stable storage, but that becomes pretty expensive. So we want the nodes to be really loosely aggregated — able to do their own thing until they need to reach out and talk to the other nodes.

The other thing is that Ganesha, in this sort of configuration, turns out to be pretty amenable to containerization. We can stick it in a container and put most of the configuration and recovery info into RADOS. We don't need any real writable local storage or elevated privileges — it can run non-root, in a rootless container. And we don't need any third-party clustering like corosync to talk across nodes; we can just use RADOS to coordinate all that info.
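To make the Ganesha side of that stack concrete, here's roughly what a minimal CephFS export looks like in Ganesha's configuration language — a sketch only, with placeholder values:

    EXPORT {
        Export_Id = 100;
        Path = "/";           # path within CephFS to export
        Pseudo = "/cephfs";   # where it appears in the NFSv4 pseudo-fs
        Protocols = 4;
        FSAL {
            Name = CEPH;      # the libcephfs-backed plug-in backend
        }
    }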
Unlike a traditional active/passive cluster — the configuration a lot of people run, where you have more than one node and another node takes over when one fails — we don't really want to do that here. With a containerized setup, we can just reconstitute a failed node almost instantly.

The difficulty with all this is handling recovery. NFS is a little unique among network file systems in that most NFS servers don't track detailed information on stable storage about what state clients hold. Most SMB servers do, at least for the more recent SMB versions, and Ceph does in its MDS, which has a journal that it uses. But when an NFS server dies and comes back up, it has no idea what state it had handed out. All we track is what clients we have, and it's up to the clients to come back in and reclaim the state they held before: "Hey, I had this open. Hey, I had this lock."

We do that by establishing what's known as a grace period when an NFS server comes back up after crashing. During the grace period, no new state can be acquired: if you try to do an open or acquire a new lock, the server will tell you, nope, we're in the grace period. In addition, because of certain edge conditions that can happen during multiple reboots and network partitions, we also need to know which clients are allowed to reclaim. So we keep stable-storage records, so that after a reboot the server knows which clients may reclaim. Prior to a client's first open — whether a reclaim or a regular open — we set a stable-storage record for it, and after its last close we can remove that record. The key is that when we switch from the grace period into normal operation, we have to atomically replace the old client database with the new one. That atomicity is very important, because failure can happen at any time; if we leave everything in an indeterminate state, that's a problem after the next reboot.

A key conceptual way to think about this is to consider the server's lifetime as a series of epochs. The first epoch begins with a grace period when the server comes up and then transitions into normal operation. If the server crashes again, we go into another grace period, which starts a new epoch, eventually transition to normal operation again, crash again, grace period, and so on. If a crash happens during the grace period itself, we just stay in the same epoch. As clients perform their first open, reclaim or regular, we set a record for them in the current epoch's stable-storage database. During a grace period, we allow clients present in the previous epoch's database to reclaim; anybody that wasn't in there gets refused. And after we transition from grace to normal operation, we can delete any database associated with a previous epoch.
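As a rough illustration of that single-server bookkeeping, here's a compilable C sketch. All of the names and the in-memory storage are invented for illustration — a real server keeps these records on stable storage, and this is not Ganesha's actual code:

    #include <stdbool.h>
    #include <string.h>

    #define MAX_CLIENTS 64

    struct recovery_db {
        const char *clients[MAX_CLIENTS];
        int n;
    };

    static struct recovery_db epoch_db[2]; /* [0] = previous epoch, [1] = current */
    static bool in_grace = true;

    static bool db_contains(const struct recovery_db *db, const char *id)
    {
        for (int i = 0; i < db->n; i++)
            if (strcmp(db->clients[i], id) == 0)
                return true;
        return false;
    }

    /* Before a client's first open (reclaim or regular), record it in
     * the current epoch's database. */
    static void on_first_open(const char *clientid)
    {
        if (epoch_db[1].n < MAX_CLIENTS)
            epoch_db[1].clients[epoch_db[1].n++] = clientid;
    }

    /* A reclaim is honored only during grace, and only for clients that
     * have a record in the previous epoch's database. */
    static bool may_reclaim(const char *clientid)
    {
        return in_grace && db_contains(&epoch_db[0], clientid);
    }

    /* Grace -> normal: drop the previous epoch's database. On real
     * stable storage this transition must be atomic. */
    static void lift_grace(void)
    {
        in_grace = false;
        memset(&epoch_db[0], 0, sizeof(epoch_db[0]));
    }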
Where the difficulty comes in is: how do we do this across multiple servers? What does a grace period even mean when you have multiple servers? The danger is that when server one crashes and is coming back up, we eventually have to kill off the old state it held with the MDS, and during that window another server could race in and grab state that should be reclaimable by a client. We have to avoid that at all costs.

If we project this idea of a grace period across multiple servers, we have to consider the grace period a cluster-wide property, and lifting it becomes a two-step process: a server first has to indicate that its local grace period is over — that it has no more clients that need to reclaim — and then it has to indicate to the others when it has stopped enforcing the grace period.

One way to think about this is as a set of two flags per server. Each server in the cluster has a "need" flag, which says: does this node require a grace period in order to allow recovery? The first node to set its need flag declares a new grace period — the cluster transitions into a grace period at that point — and when the last node clears its need flag, that node can fully lift the grace period cluster-wide and allow normal operation everywhere. But just saying you need a grace period is not enough: we have to know that all the other servers in the cluster are enforcing it before we can allow recovery to occur. So we also need an "enforcing" flag to indicate that a particular node is still enforcing the grace period. If all the servers are enforcing the grace period, then we know that no conflicting state can be handed out and the grace period is fully in effect. A server uses this to determine when it's safe to clear its old state on the Ceph MDS.

The basic idea is very simple logic that lets individual hosts in the cluster make their own decisions about grace-period enforcement. People have built active/active servers like this before, but a lot of them require some sort of external coordinating daemon. I think IBM may have had a scale-out server that did something similar, but it needed a controlling node, and that represents a single point of failure and adds complexity: if the controlling node fails, you have to deal with recovering the controlling node on top of everything else. We want to avoid that here; we want the nodes themselves to be able to just look at the state of the cluster and decide what they should do.

So what we've decided to do is handle this within Ceph itself. We declare a single object in RADOS with just a little bit of info in it, and use that to coordinate the cluster-wide grace period and recovery. RADOS is well suited to this because the compound operations it sends for reads and writes are atomic. Currently, we do a read to slurp all the data off the object, modify it in memory, and then do a write operation if we need to change something; if the object has changed in the interim, we just go back and redo the whole thing. So it's basically an optimistic read/modify/write retry loop.
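Here's a sketch of that read/modify/write loop using the librados C API. The modify_grace_blob() helper is a hypothetical stand-in for the epoch/flag updates, error handling is trimmed, and this is not Ganesha's actual rados_grace code:

    #include <errno.h>
    #include <stdint.h>
    #include <rados/librados.h>

    #define GRACE_OID "grace"
    #define BUFLEN 64

    /* Hypothetical stand-in for updating the epochs and flags. */
    static void modify_grace_blob(char *buf, size_t len)
    {
        (void)buf;
        (void)len; /* real logic would bump the epochs and omap flags */
    }

    static int update_grace_object(rados_ioctx_t ioctx)
    {
        char buf[BUFLEN];
        int ret;

        for (;;) {
            /* Slurp the current contents of the grace object... */
            ret = rados_read(ioctx, GRACE_OID, buf, sizeof(buf), 0);
            if (ret < 0)
                return ret;

            /* ...remembering the object version we saw. */
            uint64_t ver = rados_get_last_version(ioctx);

            /* Modify the blob in memory. */
            modify_grace_blob(buf, sizeof(buf));

            /* Write it back only if the object hasn't changed since. */
            rados_write_op_t op = rados_create_write_op();
            rados_write_op_assert_version(op, ver);
            rados_write_op_write_full(op, buf, sizeof(buf));
            ret = rados_write_op_operate(op, ioctx, GRACE_OID, NULL, 0);
            rados_release_write_op(op);

            if (ret == -ERANGE || ret == -EOVERFLOW)
                continue; /* object changed under us; reread and retry */
            return ret;
        }
    }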
All we really need to track in that object is a very little bit of data: two uint64_t values — I used uint64_t because who wants to worry about wraparound when we only have to track two of these things? The current epoch tells us which database new client records should be stored in as clients come in. The recovery epoch tells us which epoch clients are allowed to recover from, and we declare that a recovery epoch of zero means the grace period is not in force. We store this little bit of info as little-endian values in the unstructured data of the shared grace object. Beyond that, we also use the object's OMAP, which is just a key/value store, to hold the need and enforcing flags: the node ID is the key, and the value is currently a single byte storing the two flags, with room for expansion later if we need it for something else.

In addition, we can't just use a single set of recovery databases; we need multiples — one for each node, and each node needs one for each epoch. We store these in RADOS objects too, since Ganesha has had the ability to do that for a long time: the object name is "rec-" followed by the uint64_t epoch value and then the node ID right after it. Inside each database, the records are basically what NFS sends — the long-form client ID string that uniquely identifies a client among all the others.

When a node goes from normal operation into grace without needing recovery itself — say, another node came up and declared that it needs a grace period — we don't want the surviving nodes to have to do any recovery, or to allow any recovery; we want them to mostly continue their operation. But they do have to recreate the database for the new epoch, copying the records from the old database to the new one.

To handle manipulating and querying this database, we also added a new command to NFS-Ganesha called ganesha-rados-grace. It's a pretty simple command-line tool: it lets you see the current state of the cluster — what epoch we're in, what the current recovery epoch is, what servers are known to be in the database, and what flags they have. You can add and remove nodes from the cluster, start a new grace period, or lift the grace period, so you can fully manipulate the database from the command line if you like. Under normal operation, you'll mostly just use add and remove, and maybe the dump command to view it.

To configure NFS-Ganesha for clustering, we added this as a new recovery backend. Anybody using Ganesha now in a single-node configuration probably has something like rados_kv, or another driver called rados_ng; for this, you want rados_cluster as the RecoveryBackend value. (By the way, this is just an abbreviated configuration file — you'd generally have a lot more parameters; I'm just showing the ones that are relevant here.) Then the RADOS_KV block tells Ganesha how to talk to the RADOS cluster: what user ID to use, what pool and namespace, and the node ID.
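Putting that together, the abbreviated config might look something like this — the block and option names are Ganesha's, but the values are placeholders:

    NFSv4 {
        RecoveryBackend = rados_cluster;
    }

    RADOS_KV {
        # how to reach the RADOS cluster
        UserId = "ganesha";
        pool = "ganesha-pool";
        namespace = "cluster-a";
        # must be unique on each Ganesha node in the cluster
        nodeid = "a";
    }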
The node ID in particular is important: it needs to be unique on each node in the cluster, and collisions there can be very problematic. Beyond this, you'd also have EXPORT blocks, probably exporting CephFS — though nothing in this is particularly tied to CephFS. You could use this RADOS recovery backend with any clustered file system, and it should basically still work. But we've plumbed in a lot of other mechanisms for Ganesha to better manipulate CephFS state, so this is the recommended way to do it.

Also, some requirements: this is a fairly new feature in Ganesha, so you need version 2.7 or later, and you'll also want a Nautilus or later Ceph cluster. You can do this on an earlier one, but you'll probably find that you end up stalling, and if a node has a particularly long outage, you could end up unable to recover. So we really recommend Nautilus or later for this.

So how do we configure NFS-Ganesha for clustering? Once the config file is set up, we also need to add each node to the grace database, using the ganesha-rados-grace command. In this case we're using "ganesha-pool" as the name of the pool and "cluster-a" as the namespace. [Q from Lance: Are you using librados?] Yes — in fact, that's what the RADOS_KV block does: it feeds parameters to librados to talk directly to the RADOS cluster and fetch the objects. Good question. So with ganesha-rados-grace we add each node to the database — in this case two nodes, node A and node B, and those IDs have to be unique.

Then we just start up each daemon, and if a daemon dies, we want to restart it pretty quickly. The reason is that all of this depends on the MDS preserving the caps a node held between it going down and coming back up. When a node dies, we're basically racing against the clock: we have to know that when the node comes back up, any caps it held before haven't been stolen and released. That's actually the worst-case scenario. Say you held a file lock and crashed, and that lock got released while you were down; someone else races in, grabs the lock, maybe changes some data, and releases it. When you come back up, you'll never know — and that's silent data corruption, which could be ugly.

In any case, here's the demonstration. Can everybody see this in the back? Okay, great. I'm running a cluster on my laptop in Minikube — Kubernetes in a box, basically, in a single VM — and I'm using Rook, which is an operator that allows Kubernetes to manage a Ceph cluster. In here I've got three NFS servers: nfs-a-a ("a" being the name of the cluster and "a" the name of the node), a-b, and a-c. Here I am on a-c, running the ganesha-rados-grace command under watch. The current epoch in this case is 23, and the recovery epoch is 0, so we're not in a grace period. Nobody has a flag set, but you can see a.a, a.b, and a.c are the three nodes in the cluster.
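For reference, the command-line side of that setup would look roughly like this; the pool and namespace names follow the talk, but the dump output below is illustrative rather than verbatim:

    # add each node to the grace database (node IDs must be unique)
    ganesha-rados-grace --pool ganesha-pool --ns cluster-a add a.a a.b a.c

    # view cluster state; the demo ran this under watch(1)
    ganesha-rados-grace --pool ganesha-pool --ns cluster-a dump
    # cur=23 rec=0
    # ======================================================
    # a.a    NONE
    # a.b    NONE
    # a.c    NONE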
On this shell, I'm just going to kick off fio — mostly a performance benchmarking tool, but I'm using it here just to generate a bunch of I/O — doing random reads and writes to a file on the CephFS share, through NFS. So it's generally running now... well, maybe running now. Well, I don't know. Hang on, let me see... sorry. Oh, boy. Let's see if Minikube will behave. No. Well, it was working a little while ago. Maybe I can stop Minikube and start it back up.

Basically, what I was going to show is that you could drive I/O against a single head, kill off that node, watch the I/O go to zero, and then see the node recover and the I/O pick right back up where it left off once the node comes back. Minikube's not behaving, so in this case I'll just stop. The other thing we've done recently is teach Ganesha to lift the grace period early if all the clients have come back and reclaimed their state. That keeps the grace period to a bare minimum — usually just a few seconds once the node is back up.

So, now that my demo has failed: caveats and future directions. Again, the caveat with all this is that we rely on the MDS holding the state of a client that has gone down. When a Ceph client (a Ganesha node) dies, the MDS will eventually time out the old client's session, and we need all the nodes to be in grace before that happens — so we have to declare a new grace period. On a clean shutdown, we generally try to declare a new grace period as quickly as possible. One of the reasons we recommend Nautilus here is that it has interfaces to kill off the old session, and also interfaces to increase the session timeout: we want the MDS to hold those caps as long as we need for the NFS server to come back up and reclaim them. Another future direction I didn't put on here: there are also some proposed interfaces to allow Ceph clients to reclaim caps themselves. That would let a node die and come back without putting everyone else into the grace period, and that could be a huge win. If we know a client went down, came back up, nobody has stolen its caps, and it can get to all of them, we don't need to enforce a grace period for everyone else.

Another problem is that these are loosely connected servers, and to an NFS client they all just look like individual servers. How do we distribute clients among the cluster? One idea is to use a load balancer with strong affinity — I have some real worries about that, so I'm not sure it will be possible. NFSv4.1 also has a mechanism, an attribute called fs_locations_info, that a client can query to learn where other locations for the same export may be.
We'd like to be able to fill out that info from the server side, so that if the node a client is connected to goes down, the client can seamlessly migrate to another node. We'd also like to allow failover to a different head if one fails. That's a little trickier, because right now this whole mechanism relies on a client getting back to the same node. That's why we like Kubernetes for this: we can pretty quickly reconstitute the node from scratch — even from a read-only container image — and get back up and running without having to migrate state back and forth between heads.

Eventually, we'd also like to do pNFS support: running small NFS servers in each OSD, possibly. They'd just need to support NFS v4 read and write, or maybe even NFS v3. Unfortunately, right now there is no pNFS layout type that allows for algorithmic placement, so we'd need some way to inform the client of the layout. One idea is to just do separate layout segments manually, but that gets pretty unwieldy for large files. So we really need some way to let clients determine algorithmically where they should send their data.

Any questions? Sure, Lance. [Q: This depends on the CephFS backend, right? It doesn't work with S3?] You wouldn't actually need it there. My understanding with RGW is that we don't really have any stateful objects; what we're mostly concerned with is locks and delegations. We have delegation support in Ganesha now, though there's a problem with running it across multiple heads — I can tell you about that offline — and POSIX-coherent locking does need all this, but I don't think RGW supports that sort of thing. So you don't necessarily need this for RGW: you can just stand up multiple heads independently over RGW without any real problem. Any other questions? Okay, well, thank you, guys. Sorry my demo didn't work.