Hello everybody, I'm Patrick Donnelly, the CephFS team lead at Red Hat. My colleague Jeff Layton couldn't make it this year, so he won't be joining me to give this talk; you'll just have me. Today I want to talk about the new technology we've been working on to deploy NFS servers with the Rook storage operator on Kubernetes, and have those NFS servers export the CephFS file system. To begin, since many of you may not be familiar with it: CephFS is a POSIX distributed file system originally developed around 2005 by Sage Weil to serve as a file system for HPC clusters at the national labs. CephFS is interesting because the clients and the MDSs (metadata servers) cooperatively maintain a distributed cache of the inodes and directories. That is a difference from many other distributed file systems, where clients talk directly to servers and the servers manage all the state. This was a decision made early on to allow clients direct access to the data without needing to go through any type of gateway. CephFS provides full failover management through, for example, standby metadata servers. It also provides horizontal scale-out by having multiple active metadata servers serving in a cluster. The metadata servers do not maintain any local state, which makes them very easy to containerize: they store all their state in RADOS, a distributed object store which is the main underlying component of Ceph. On top of that, the metadata servers represent the metadata hierarchy and present it to the clients for access. Clients talk directly to the object store when they're reading or writing file objects. To maintain consistency, the MDS hands out capabilities, which are similar to leases in the academic sense; it hands these out to clients, delegating metadata authority to them.
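To make that concrete before moving on, here is a toy model of lease-like capabilities. All names here are invented for illustration; this is not Ceph's actual implementation. A server grants clients the right to cache reads or buffer writes on an inode, and recalls any grant that would conflict:

```python
# Toy sketch of capability (lease) conflict handling, not Ceph's real code.
CACHE_READ = "cache-read"      # client may serve reads from its local cache
BUFFER_WRITE = "buffer-write"  # client may buffer writes before flushing

class CapTable:
    def __init__(self):
        # inode -> {client: set of caps held}
        self.caps = {}

    def grant(self, client, inode, wanted):
        holders = self.caps.setdefault(inode, {})
        revoked = []
        for other, held in list(holders.items()):
            if other == client:
                continue
            # Buffered writes conflict with everything; cached reads only
            # conflict with another client's buffered writes.  A real MDS
            # would recall the caps from the other client before granting.
            if BUFFER_WRITE in (wanted | held):
                holders.pop(other)
                revoked.append(other)
        holders.setdefault(client, set()).update(wanted)
        return revoked

table = CapTable()
table.grant("clientA", 1, {CACHE_READ})
# clientB wants to buffer writes to the same inode, so A's cached reads
# must be revoked -- otherwise A could serve stale data.
revoked = table.grant("clientB", 1, {BUFFER_WRITE})
```

Two clients that both only cache reads would keep their grants; it is the writer that forces a recall, which is the safety property described above.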
In particular, capabilities can allow clients to write data, buffer writes, or cache reads, and do so safely, in such a way that it doesn't conflict or become inconsistent with other clients. Moving on, the origins of this project actually began with OpenStack. There are two primary services within OpenStack that use CephFS: Manila, which is a file share service for virtual machines, and Cinder, which is a block device provisioner for VMs, used primarily to create disks for a virtual machine to boot from. We've done a lot of integration with CephFS in this way, and Ceph has become one of the more popular storage technologies to use with OpenStack. A community survey done a few years ago found that CephFS was the primary storage provider used with Manila. The reasons for this are, I think, fairly straightforward: Ceph is an open-source storage solution, it is easy to incorporate into your OpenStack clusters, and it is free. If I'm talking about OpenStack, then why am I here talking about Kubernetes? It turns out there's a big push right now, as we all know, to manage distributed services within the context of a container orchestrator, in this case Kubernetes. There are a lot of attractive reasons for doing this. Containers are lightweight and easy to deploy in response to changing application needs. Importantly, especially for this project, they allow for extensible service infrastructure: I can deploy services in response to the changing needs of the application. Containers are also lightweight enough to allow for easy parallelism: I can deploy a lot of containers in response to those changing needs. It's also much easier to do failover in a container orchestration environment: instead of having a dedicated machine standing by to take over when a service fails, I can just spin up a new container anywhere in my infrastructure to take over that role.
And finally, Kubernetes also allows us to do fast IP failover management, which will become important later in this discussion. The way we're using Kubernetes within Ceph is through the Rook storage operator. Within Ceph there's now a big push to deploy most Ceph clusters in the future using Kubernetes and Rook. Rook is one of the main storage operators for handling storage within the context of Kubernetes, and Ceph is one of the primary storage systems it supports. The way Rook works in this context is that I can define a Ceph cluster that I want to deploy within Kubernetes, and it's as simple as feeding some YAML files to kubectl. The Rook agents running on all the nodes will figure out where they can host object storage daemons (OSDs) for Ceph, attaching an OSD to each available disk drive or SSD on the nodes in your Kubernetes cluster, and then deploying the other Ceph daemons as needed to get your cluster ready. The great hallmark of this is that it allows you to deploy a Ceph cluster without knowing anything about Ceph, which has historically been a very difficult thing to do: Ceph, being a distributed system, had a lot of knobs to turn and a lot of upkeep to do on all of the different computers serving Ceph. With these storage operators it is becoming easier than ever to host a Ceph cluster without knowing anything about distributed storage. So, coming back to the subject of this talk: why do we want an NFS gateway in front of CephFS? You may remember from earlier that the primary use case for CephFS was an HPC environment, running a lot of high-performance computing jobs that need access to fast file systems, especially for scratch space. That was its target use case, and because of that it was operating in a mostly trusted environment, where the applications could be trusted to access the file system in a safe manner.
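As a sketch of what "feeding some YAML to kubectl" amounts to, here is the shape of a Rook CephCluster manifest, built as a Python dict. The field names follow my reading of Rook's v1 CephCluster CRD; treat the exact fields and values as illustrative rather than authoritative:

```python
import json

# Illustrative Rook CephCluster manifest (v1 CRD field names as I
# understand them -- double-check against the Rook documentation).
ceph_cluster = {
    "apiVersion": "ceph.rook.io/v1",
    "kind": "CephCluster",
    "metadata": {"name": "rook-ceph", "namespace": "rook-ceph"},
    "spec": {
        "cephVersion": {"image": "ceph/ceph:v14"},  # a Nautilus image
        "dataDirHostPath": "/var/lib/rook",
        "mon": {"count": 3},                        # three monitors for quorum
        "storage": {
            # Let the Rook agents find usable disks/SSDs on every node
            # and stand up an OSD on each of them.
            "useAllNodes": True,
            "useAllDevices": True,
        },
    },
}

# In practice this would be written out as YAML and applied with
# `kubectl apply -f cluster.yaml`; here we just serialize it.
manifest = json.dumps(ceph_cluster, indent=2)
```

The point is how little the operator needs to be told: which Ceph image, how many monitors, and where to look for disks; Rook derives the rest.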
In today's deployments, in OpenStack or Kubernetes, where you may want to serve your infrastructure to various tenants, that may not be an assumption you can safely make, and there are a few reasons for that. It's not necessarily easy to restrict which types of clients are accessing your file system. For example, if I'm running OpenStack and my customers bring up virtual machines running older kernels, those kernel clients may not work bug-free against my Ceph cluster, and I may lack control over what types of clients they use to talk to Ceph. Another issue is security: you may want to firewall off your storage cluster (in fact, you probably do) to prevent clients from having unrestricted access to your storage network. Additionally, you may want to introduce authentication and authorization mechanisms, for example Kerberos, which CephFS does not yet support, and force clients to authenticate through those mechanisms. So, moving on. For those reasons it's attractive to put an NFS daemon in front of CephFS, NFS being a very common file access protocol within the industry, to serve as a gateway between the clients and the Ceph cluster. NFS Ganesha is an open-source NFS server that runs completely in user space, without any dependencies on the kernel; it's licensed under the LGPL. What makes Ganesha attractive is that it provides a plug-in interface for multiple kinds of exports, with support for exporting different types of file systems. The most obvious and usual case is a local file system, with Ganesha exporting something from the local file system, but in our case we want to export the CephFS file system.
To do this, Ganesha has back ends called File System Abstraction Layers (FSALs), and the FSAL for Ceph uses libcephfs to talk to the Ceph cluster, so any incoming NFS RPC is translated into a corresponding call through libcephfs. Correspondingly, Ganesha also uses librados, a library that gives direct access to the object storage layer, to store the various state it needs to maintain in order to recover when a standby Ganesha takes over after a failover. It also uses RADOS to store its configuration files, instead of using the /etc namespace in a local file system. For these reasons Ganesha is very amenable to containerization: all of the file system state is of course stored in CephFS, all of the configuration and recovery information is stored in RADOS, and there's no need for a writable local file system. One of the things we wanted to have was scale-out: we wanted multiple Ganeshas to be able to export the same CephFS file system, which is not as easy as you might think, mostly for reasons of consistency. We wanted the services to be containerizable so that we could use them in the future work of deploying CephFS in a container orchestrator like Kubernetes. Additionally, we wanted to support the newer NFSv4.1 protocol, which notably allows NFS clients to hold state so that they can get better performance. And finally, we wanted the ability to talk to Ceph and RADOS for all of this coordination, with no need for any third-party clustering software such as CTDB, which has normally been used in the past for Ganesha failover management. Up to now, Ganesha has been deployed in an active-passive fashion, which has been available since Ceph Luminous back in August 2017.
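As an aside, the FSAL plumbing just described is driven by Ganesha's export configuration, which in these deployments lives in a RADOS object rather than under /etc. A minimal generator for a Ceph-backed export block might look like this; the field names follow Ganesha's config format as I recall it, so verify against the ganesha.conf documentation before relying on them:

```python
# Sketch of a FSAL_CEPH EXPORT block for ganesha.conf (field names from
# memory of Ganesha's config format -- treat as illustrative).
def ceph_export_block(export_id, path, pseudo, user_id="ganesha"):
    """Render a minimal Ceph-backed EXPORT block."""
    return (
        "EXPORT {\n"
        f"    Export_Id = {export_id};\n"
        f'    Path = "{path}";\n'       # path inside the CephFS tree
        f'    Pseudo = "{pseudo}";\n'   # where NFSv4 clients see it mounted
        "    Access_Type = RW;\n"
        "    FSAL {\n"
        "        Name = CEPH;\n"        # route operations through libcephfs
        f'        User_Id = "{user_id}";\n'
        "    }\n"
        "}\n"
    )

block = ceph_export_block(100, "/volumes/app1", "/app1")
```

Because the whole configuration is just text like this, generating it centrally and storing it in RADOS (as described above) is straightforward.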
You have one NFS server which all of your NFS clients talk to, and you use third-party tools, Pacemaker and Corosync, to handle failover between the active NFS Ganesha daemon and the standby daemon. The main drawback of this approach is that it scales poorly and requires idle resources: you need an idle NFS Ganesha server available to take over when the active one fails, and all of the NFS clients are talking to a single active NFS server for all of their data needs, which obviously will not scale very well. So one of our goals was to allow all of these NFS clients to talk to multiple NFS servers. The reason it is difficult to put multiple NFS servers in front of the Ceph file system in an active-active fashion is that NFS clients now have state. In particular, in NFSv4 the clients are given leases, similar to CephFS capabilities, which allow them to read or write files or hold locks, and the clients must contact the NFS server within a lease period, between 45 and 60 seconds. This had a few problems, mainly around client failover. NFSv4.1, which is the protocol version we're targeting, adds a sessions layer for the NFS clients talking to the NFS server, which provides exactly-once semantics: when you retry an operation, it doesn't happen multiple times. It also adds a RECLAIM_COMPLETE operation. When an NFS server goes down and comes back up, it begins a reclaim process during which all the NFS clients can re-establish whatever state they had with the prior NFS server instance. Previously, you had to wait out the entire lease period, which meant no client of the NFS server could make any IO progress in the meantime; the new RECLAIM_COMPLETE operation allows the NFS server to lift the grace period early. Moving on.
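The exactly-once behavior of that sessions layer can be illustrated with a toy slot table (invented API, not Ganesha's code): each session slot caches the reply for its last sequence number, so a retransmitted request replays the cached reply instead of executing the operation twice:

```python
# Toy model of NFSv4.1 session slots and the reply cache that gives
# exactly-once semantics.  Names are invented for illustration.
class SessionSlot:
    def __init__(self):
        self.last_seq = 0
        self.cached_reply = None

class Session:
    def __init__(self, nslots=4):
        self.slots = [SessionSlot() for _ in range(nslots)]

    def dispatch(self, slot_id, seq, op):
        slot = self.slots[slot_id]
        if seq == slot.last_seq:        # retransmission: replay the cache
            return slot.cached_reply
        if seq != slot.last_seq + 1:    # out-of-order sequence: protocol error
            raise ValueError("bad sequence id")
        slot.last_seq = seq
        slot.cached_reply = op()        # execute exactly once, cache the reply
        return slot.cached_reply

calls = []
def create_file():
    calls.append("CREATE")
    return "ok"

sess = Session()
r1 = sess.dispatch(0, 1, create_file)
r2 = sess.dispatch(0, 1, create_file)  # client retry after a dropped reply
```

The retry gets the same answer, but the side effect happens only once, which is exactly what a non-idempotent operation like a file create needs.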
So after an NFS server restarts, or a server takes over for a failed NFS server, it has no ephemeral state, because it doesn't store its state anywhere locally. So all the clients have to tell the NFS server exactly what state they had: which files they had open, how they had them open, and any kind of locks they held. And because you don't want one client to say "I want to write to this file" while another client that has not yet reconnected still has write access to that file and may be buffering writes, you need to wait a little while to give all the clients a chance to reconnect to the server. That's what this reclaim window is: two lease periods, the lease period corresponding to how long you wait for a client to say it's still holding a lease. During that grace period the NFS server can't issue any new state, and the clients may reclaim the state they had prior to the crash. In fact, this method of recovering all of the state you had open is very similar to what we do in CephFS. Logically, this can be organized into a series of epochs, where an epoch corresponds to the lifetime of a particular instance of an NFS daemon: it begins with a grace period, during which all the NFS clients can reconnect and try to re-establish whatever state they had, and after that completes, the NFS server transitions into normal operation, which should generally last a very long time; failovers are not that common. The observation is that we can take the same logical organization of grace periods and normal operation and apply it to a cluster of servers. The reason that's a difficult problem is that if we have multiple NFS servers exporting the same backing file system, you may have state issued to a client of one NFS server, and during a failover event a client of another NFS server may try to acquire that same state.
So now we need to form agreement on what state has been issued and how to reclaim it. The way NFS Ganesha handles that is by coordinating grace periods through a central database, kept in a RADOS object, which tracks the current grace-period epoch for all of the NFS servers so they can reach consensus on when a grace period should happen. All the NFS servers in the cluster enter the grace period together whenever there's a failover event, so that none of them accidentally issues state that a client reconnecting to a failed NFS server still needs to reclaim. There are some details on how that works that I won't get into due to time, but I'm happy to take questions on it later. The next challenge we had to address in this work is layering NFS over a clustered file system. Because both protocols, CephFS and NFSv4.1, are stateful and use a lease-based mechanism, you need some way to handle the case where an NFS server fails and another server comes in to acquire its state, a way to make that failover happen faster within CephFS. Namely, CephFS has its own timeouts that it maintains for its clients: whenever a client fails and comes back, that client has a certain amount of time to re-establish its session with CephFS. Likewise, all the NFS clients have a certain amount of time, corresponding to the lease period, to reconnect to the NFS daemon and re-establish their state. There's some overlap in those timers which can cause a very long delay between the NFS clients re-establishing their state so they can continue their normal operations, and the NFS server re-acquiring the state it needs in CephFS. So one of the new features in Nautilus is that we allow a client to reclaim the state of a prior session.
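The cluster-wide grace coordination described a moment ago can be modeled with a toy version of that shared record (names invented; the real database lives in a RADOS object): servers flag both that they need a grace period and that they are enforcing one, and grace lifts cluster-wide only once every server is done:

```python
# Toy model of a shared grace-period database for a Ganesha cluster.
# Invented names and simplified semantics -- the real record is a RADOS
# object updated by every NFS server in the cluster.
class GraceDB:
    def __init__(self):
        self.cur = 1            # current epoch: new state belongs here
        self.rec = 0            # recovery epoch: 0 means no grace in progress
        self.need_grace = set() # nodes whose clients still must reclaim
        self.enforcing = set()  # nodes currently refusing to issue new state

    def request_grace(self, node):
        """A restarted node asks the whole cluster to enter grace."""
        if self.rec == 0:       # start a new grace period
            self.rec = self.cur
            self.cur += 1
        self.need_grace.add(node)
        self.enforcing.add(node)

    def enforce(self, node):
        """A surviving node sees the grace period and stops issuing state."""
        self.enforcing.add(node)

    def reclaim_done(self, node):
        """All of this node's clients have sent RECLAIM_COMPLETE."""
        self.need_grace.discard(node)

    def try_lift(self, node):
        """Grace lifts only once no node still needs it."""
        if not self.need_grace:
            self.enforcing.discard(node)
            if not self.enforcing:
                self.rec = 0    # grace period over, cluster-wide
        return self.rec == 0

db = GraceDB()
db.request_grace("ganesha-a")  # ganesha-a restarted after a failure
db.enforce("ganesha-b")        # ganesha-b must not hand out conflicting state
db.reclaim_done("ganesha-a")
db.try_lift("ganesha-a")
lifted = db.try_lift("ganesha-b")
```

The key property is visible in the flow: ganesha-b cannot issue new state, and grace cannot lift, until ganesha-a's clients have finished reclaiming.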
When a CephFS client dies abruptly, the MDS will keep its session around for several minutes, allowing that client to come back and say "I'm still here, and I want to reclaim the capabilities and file locks that I had," with the presumption that the client has not actually failed and come back as a brand-new client; there was simply a network partition separating it from the MDS. For a while this was good enough: a kernel client that restarts won't try to reclaim its old session with the MDS, because all the applications that were running on it have also been rebooted, so it just starts with fresh state. It's the same with ceph-fuse, the alternative mounting mechanism for CephFS. With NFS Ganesha we have a new problem: the NFS daemon has state that it has issued to its clients, and if the daemon fails and comes back, it needs to re-acquire all the state it had in its prior session. So now, in Nautilus, CephFS allows a client to come back and reclaim all the state it held in a prior instance of itself, without knowing up front what all that state was. During the NFS server's grace period, the clients come in and declare what state they had with the prior instance of the NFS daemon, what files they had open, and correspondingly the Ganesha server talks to CephFS and re-establishes the state it had from the prior session. When all of that is complete, Ganesha finalizes the reclaim, similar to the RECLAIM_COMPLETE operation that the NFS clients send to the NFS server. Our next challenge is the deployment and management of NFS clusters. We now have all the basic mechanisms in place to build a cluster of NFS servers in an active-active configuration.
What we need next is a simple way to deploy those NFS Ganesha servers in an active-active configuration, dynamically, according to the load on the particular export of the CephFS file system that the NFS clients are accessing. We also want to scale in another direction, where we have an NFS cluster for each export in the CephFS file system. Keep in mind that CephFS can be a very large file system, with billions of inodes and many different users with different use cases and applications, so we want to be able to have an NFS cluster fronting each subtree within CephFS according to application need. By doing that, you achieve a few things. You improve performance, because you are isolating certain file system behaviors to one group of NFS cluster daemons. You improve the caching of the NFS daemons, because now the applications accessing a given set of NFS servers are related. And you can also improve security, because I can isolate a cluster of NFS daemons based on which application should be using it, and furthermore restrict which parts of the CephFS file system tree those NFS daemons have access to, based on which applications will be using that NFS cluster. Finally, another challenge we need to address is IP migration and failover. When an NFS server inevitably dies for some reason, we need a way to spin up a replacement very quickly and cheaply, and we need a mechanism to migrate the IP address from the prior instance of the NFS server to the new instance. These have historically been difficult problems, but not anymore, because with Kubernetes it's fairly simple to spin up new containers in response to failover, and you can migrate the IP address to the new pod.
So this is where Rook and Kubernetes come in. Rook now supports, in the 1.0 release that just came out, the ability to specify a CephNFS resource type to launch NFS Ganesha servers in response to kubectl creating those objects. Rook and Ceph now have a deep integration with each other, as of the Ceph Nautilus release about two months ago and Rook 1.0, so Ceph is also extensible through Rook: it is able to deploy services in response to changing application needs, for example creating a new file system by talking to Rook and saying "I need another metadata server, please spin up another container." This has all been abstracted away very nicely, and we are using it to have the manager daemon (ceph-mgr) deploy NFS clusters according to how the user specifies the NFS cluster for their file system. Furthermore, we can scale the NFS cluster up or down as the application needs change, all centrally managed from the manager. In the future we also want to use this for a sort of holy-grail concept within CephFS: we've always thought it would be very nice to deploy more active metadata servers in response to the changing load on the file system, and we plan to apply the same idea elsewhere in Ceph for scaling out other daemons as well. In summary, in this work we've built several mechanisms, available today, to deploy these NFS clusters in front of CephFS. We now have the ability to create volumes and subvolumes. Volumes correspond to CephFS file systems (multiple Ceph file systems are not yet supported within Ceph; that's something we're planning for a future release), while the more common thing to do is to use the subvolume concept, where you dedicate a directory tree as a volume which may be exported to a particular virtual machine or a set of containers for sharing.
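Coming back to the CephNFS resource type mentioned above: declaring one of these Ganesha clusters through Rook looks roughly like the following manifest, again built as a Python dict. The field names follow my reading of the Rook 1.0 CephNFS CRD, so double-check them against the Rook documentation:

```python
# Illustrative Rook CephNFS manifest (field names as I recall the v1 CRD).
ceph_nfs = {
    "apiVersion": "ceph.rook.io/v1",
    "kind": "CephNFS",
    "metadata": {"name": "my-nfs", "namespace": "rook-ceph"},
    "spec": {
        # Pool and namespace where the Ganesha daemons keep their shared
        # recovery and configuration state in RADOS.
        "rados": {"pool": "myfs-data0", "namespace": "nfs-ns"},
        # Active-active scale-out: edit this count to add or remove
        # Ganesha heads, and Rook reconciles the pods.
        "server": {"active": 3},
    },
}
```

The `server.active` count is the whole scale-up/scale-down interface from the user's point of view; the grace-period coordination behind it is what makes changing it safe.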
This means OpenStack, the CSI interface for Kubernetes, and all of your pods can use the same interface to set up and configure volumes, and then also configure the exports of those volumes. The other mechanism we've finished and built is the Rook-Ceph integration, so Ceph can now launch these NFS clusters dynamically and configure them with a set of exports and configs corresponding to these subvolumes. And finally, we've updated NFS Ganesha to be cluster-aware and to have faster and correct recovery with the CephFS file system. As for the vision for the future: again, we have all these mechanisms built, and the next stage of our work is to make this very turnkey in terms of configuring the volume and then setting up the NFS Ganesha cluster. In version 14.2.2, the next point release of Ceph Nautilus, you'll be able to create these subvolumes through a centralized command, and in a subsequent release we're planning to make it very simple to deploy an NFS cluster in an active-active fashion in front of CephFS with Rook and Kubernetes, with just a simple command: enabling NFS and setting various configuration options you might want, for example which network namespace to attach the pods to. If, say, I have OpenStack virtual machines set up and I want to attach my NFS Ganesha clusters to the same network namespace those tenants are in, I can configure that through this command, and then finally turn it on by just setting shareNFS to true. Furthermore, we also want to add support for NFSv4 client migration: we want to be able not just to expand the NFS cluster but also to shrink it, and that involves shifting all the clients off the NFS server you want to remove and moving them onto one of the remaining NFS servers in the cluster. And finally, optimizing the grace periods for subvolumes, because
right now all the NFS clusters are within the same logical cluster in terms of grace periods, and that may not be necessary: subvolumes within CephFS may not share any inodes, so there is no shared state between subvolumes, and it would be better for those NFS server clusters to be isolated from each other and not enter grace periods together, since there would be no conflicting state between them. And finally, it would also be nice to use this for SMB. That's all of my time; thank you for your attention and for coming to my talk. I'll take any questions.

Q: The IO from the NFS clients, and the NFS server's communication with Ceph — do they share the same network?
A: The NFS clients may be on a different network namespace than the storage network we're using for the NFS server.
Q: So the NFS server has two networks?
A: Yes, it would have two networks, which, as I understand it, requires newer features in Kubernetes to enable attaching multiple network namespaces to a pod; it has not historically been possible. Any other questions? Okay, thank you for your time.