Hello, hi. So now John and Christian will talk about CephFS, so the stage is yours.

All right, thank you very much. Good afternoon everybody, and welcome to our talk on distributed file storage in multi-tenant clouds using CephFS. My name is Christian Schwede, I'm working as a software engineer at Red Hat, obviously, and with me is...

Hello, my name is John Spray. I'm also an engineer at Red Hat. I'm the former maintainer of CephFS and the current maintainer of the Ceph manager daemon.

All right. In this presentation we want to give you a brief overview of the key components that we are using to expose shared file systems within an OpenStack cloud. We're going to introduce OpenStack Manila, and then John will continue with the CephFS native driver implementation. Next will be some information about the NFS-Ganesha driver, which was added recently and brings some improvements over using the native CephFS driver directly. At the end of the talk we are going to give you an outlook on the future work that is happening right now for the upcoming OpenStack Queens release and beyond.

Before we talk about the components that we are using, let's quickly recap the challenge we are actually trying to solve here. What we want to do is enable tenants, or administrators of virtual machines, to use a shared file system across their virtual machines. At the same time we want this shared file system to be tenant-aware, meaning that we don't have a single global shared file system for all the machines running there, but one that is exposed only to a subset of the machines; machines from a different tenant are of course not able to access data from tenant A, for example. The whole deployment should be self-managed, meaning that the tenant administrator of tenant A, for example, can create and delete shares for their users and their virtual machines, and the operator of the underlying OpenStack cloud and the storage backends doesn't need to interact directly or set up the different file systems for the administrators. And as an operator of this cloud, I want to be able to reuse my existing storage backend. For example, if I already have a Ceph cluster running that exposes block storage, I don't want to add another software-defined storage system; I want to reuse the existing storage for shared file systems.

So how do we solve that? We are using three different projects to solve this challenge. First, we have OpenStack Manila, the OpenStack shared file system service. We're going to use Ceph as the software-defined storage solution backing the OpenStack cloud, and we're going to use NFS-Ganesha, which is a freely available NFS server implementation.

So let's have a look at OpenStack Manila. Manila is the OpenStack shared file system service.
It exposes an API that users and administrators of virtual machines can use to request file system shares: create shares, delete shares, and so on. As a tenant administrator who wants to use a shared file system, you send a request to the Manila API. In the background there's a second service, the Manila share service, which talks directly to your storage cluster, and the storage cluster then exposes, for example, an IP address where the share can be mounted. At the end of this, the tenant administrator gets back a backend address, an IP address for example, that can be used in the guest VM to actually mount the created share. Finally, the guest VM talks directly to the storage backend.

There is support within Manila for different backend drivers: quite a few proprietary drivers that are well supported by their vendors, and some generic drivers for open-source implementations. But until the addition of CephFS there was no production-ready open-source driver available, meaning, for example, one backed by highly available services. CephFS added these features, and with CephFS it's now possible to run real production workloads using OpenStack Manila on a completely open-source OpenStack deployment.

CephFS is a distributed POSIX file system built on top of librados and RADOS, the technology that underpins Ceph. Ceph itself is probably already well known, for example for its block storage service using RADOS Block Devices. RADOS Block Devices are often used together with OpenStack Cinder to expose volumes for the virtual machines running on OpenStack, and Ceph also exposes object services using the RADOS Gateway (RGW), which provides S3- and Swift-compatible APIs.

The OpenStack Foundation does user surveys twice a year, typically before the OpenStack Summits, and it turns out that most people, more than 50% of the operators, actually want to use CephFS as the backend for their Manila service. There's a good reason for this: as I mentioned, most of the time you already have a Ceph cluster running for Cinder, and you want to reuse it, so that you run only a single software-defined storage system within your cloud. This drives the demand to use CephFS within Manila.

Before John takes over and gives you a more detailed view of CephFS itself, I want to explain a few key terms that we'll use during this presentation. When we deploy OpenStack clouds within Red Hat or for our customers, we use a tool called TripleO; TripleO means "OpenStack on OpenStack". We have a so-called undercloud, which is basically a management node, and this management node deploys a so-called overcloud, shown here. We have a bunch of controller nodes that, for example, expose the API services, not only Manila but Keystone, the Nova API, and so on. The controller nodes also run a distributed MariaDB cluster and other services. We have some compute nodes, where your virtual machines will run later on, and we have some storage nodes. When deploying an overcloud we also set up a few different networks.
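To make the share workflow described a moment ago concrete, here is a minimal sketch of the tenant-side commands using the Manila CLI (not shown in the talk). The share type name, share name, and cephx user are illustrative and assume the operator has already configured a CephFS-backed share type:

    # Create a 10 GB share using a CephFS-backed share type (names are assumptions)
    manila create CEPHFS 10 --name mydata --share-type cephfstype

    # Allow a cephx identity to access the share
    manila access-allow mydata cephx alice

    # Retrieve the export location that the guest VM will mount
    manila share-export-location-list mydata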
We have public networks that, for example, expose the API services to the users; this is also called the control plane. And we have some private networks that we use, for example, for the storage traffic in the background; these are typically called data plane networks. They are private and normally not seen by the guest VMs, and they shouldn't be.

All right, that's that. I think John is going to tell you a little bit more about CephFS.

So the native driver for Manila is the simplest thing that we have at the moment. It's been in OpenStack for a while now, and it acts as the basis for what's coming in future releases. When we say "native" in this context, we mean very good performance, because all of the guest VMs can talk to all of the OSDs in the Ceph cluster. That gives them a scale-out level of bandwidth, which means they're only limited by the bandwidth of their own network interface. So if you want to get 10 gigabits or 40 gigabits or whatever of storage into a virtual machine in an OpenStack cloud, this is a way you can do it. It's also fairly simple to deploy, from the point of view of the person deploying the cloud itself, because you're not deploying any extra protocol gateways or anything like that. If you're running OpenStack on a fully open-source platform, the chances are you already have a Ceph cluster that you're using for RBD images via Cinder, so you can use the same underlying Ceph cluster with a single large CephFS file system and expose that as Manila volumes.

The way we implement the concept of a Manila volume in the CephFS native driver is simply as a directory. CephFS goes a little bit further than what's required by POSIX in terms of what it provides you: it gives you per-subdirectory quotas and per-subdirectory usage data, which lets you give the impression of a particular directory being its own file system. So when you type df on one of these subtree mounts, you get that per-subtree usage information, and the quota is enforced at that level. So when somebody asks for a 10 gigabyte Manila volume, what we're really giving them is a directory with a 10 gigabyte quota. There's a link there to a presentation from CERN, who are using this driver at some scale already.
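As a rough illustration of the directory-with-quota model just described (a sketch, not from the talk; the path layout, size, monitor address, and cephx user are placeholders), the pieces fit together roughly like this:

    # Backend view: a Manila "volume" is just a CephFS directory with a quota
    mkdir -p /mnt/cephfs/volumes/share-0001
    setfattr -n ceph.quota.max_bytes -v 10737418240 /mnt/cephfs/volumes/share-0001   # 10 GiB

    # Guest view: the VM mounts only that subtree, talking directly to the Ceph
    # monitors/OSDs with the kernel CephFS client and the cephx key Manila handed out
    mount -t ceph 192.0.2.10:6789:/volumes/share-0001 /mnt/share \
          -o name=alice,secretfile=/etc/ceph/alice.secret

    # df on the subtree mount reports the 10 GiB quota as the filesystem size
    df -h /mnt/share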
So here's the network diagram. The important part of this is the red section at the top, where we have the Ceph storage network. That's connected to all of the backing storage servers, the OSDs, and it would typically be separate from the networks that you expose to your client workloads. The reason for that is that although Ceph has some security built in, it does have its own crypto and authentication and so on, it's not designed for use in public. It's not the kind of thing you would expose directly to the internet, and it's not the kind of thing you would usually expose to untrusted client workloads that somebody brought to you and asked to run on your cloud. Tenant A and tenant B here are both connected to the same network, so they can potentially interfere with each other somewhat, and they certainly run the risk of interfering with the underlying Ceph cluster. The upshot of that is that you would only use the CephFS native driver if you trust your workloads. That is the case for quite a lot of OpenStack users, as it turns out. To begin with I expected this driver to have very little uptake because it has this drawback, but we've had a bunch of people using it, because it turns out a lot of people using OpenStack are just using it as a way to automate their infrastructure for their own workloads, and for those people the security caveat is acceptable.

So to recap: the pros here are the performance, it has shown itself to be popular with users, and the implementation is very simple; it's just a directory that somebody is mounting. I didn't mention HA, but that's because there's nothing to it: the Ceph cluster already provides high availability in itself, so there's no extra effort needed to make the solution HA when it's used by Manila clients in an OpenStack environment. The downside, to emphasize again, is that we are giving user VMs direct access to the storage network, which is certainly not a best practice and is only acceptable if you can justify it based on how much you trust your applications. The other slight downside is that the clients running inside the Nova VMs need the Ceph software installed to mount the CephFS file system. The Ceph client is in the Linux kernel, so this isn't massively onerous, but it means that when a new version of Ceph comes out, you need to worry about things happening inside your Nova VMs. So again, this only works if you have a very good relationship with whoever is running your applications and whoever is building those VMs.

So, having talked about what's wrong with the native driver, I'll pass back to Christian for the current work.

All right, thanks John. Because of these drawbacks, the engineers were looking into a different way to expose the shared file systems to the users, and what has been implemented is an NFS-backed service that uses the CephFS driver underneath and exposes the shared file systems to the guest VMs using NFS. This will be completely integrated with the upcoming OpenStack Queens release, which is coming out in a few weeks, and I'm going to give you a little bit more detail on how this works. First of all, we're using NFS-Ganesha.
Ganesha is a user-space NFS server supporting multiple NFS versions. It is built on a very modular architecture using pluggable file system abstraction layers, and these abstraction layers already include different open-source file systems, for example GlusterFS, GPFS, Lustre, and CephFS. Ganesha makes it possible to dynamically add exports and update the configuration without restarting the service itself. So when you add a new share to the Ganesha server, you don't need to restart it; you just send a signal using the D-Bus protocol, and it will export the new share. It also has a metadata cache included, so when you're using NFS to access a shared file system, not all your requests need to go to the underlying storage backend; some of them are already served from the cache.

So how does it look when you want to use the CephFS NFS implementation? Your tenant uses, for example, the Horizon web interface or the Manila command line interface and sends a request to the Manila API service. The Manila share service then talks directly and natively to the storage cluster running CephFS and gets back a directory path. Next, Manila sends this information to the NFS-Ganesha gateway, which itself mounts this directory path locally and exposes it using the NFS protocol. So at the end of the day, the tenant administrator has an IP address with an NFS service that can be mounted on the guest VMs.

Now, this is all happening on the control plane. Let's have a look at the data plane. On the data plane it now looks a little bit different, because your clients are no longer talking directly to the CephFS daemons or the Ceph daemons, but only to the Ganesha NFS gateway. This has some benefits, of course, because we can better isolate traffic, and the drawbacks that John mentioned when using CephFS directly no longer come into play here. But it adds a kind of man-in-the-middle to the data path, so you need to run the Ganesha service in a highly available mode; otherwise it would become a single point of failure. So what we're doing here, when we set up an OpenStack deployment using TripleO, is running Pacemaker to run Ganesha in an active-passive mode.

If we look again at the network level, we now create a new network just for the NFS traffic itself. The tenants are no longer connecting directly to the Ceph network but to this NFS network, and Ganesha is also connected to this NFS network. Now you might wonder: if both tenants use the same NFS network to access the data services, might they see each other's traffic? Well, Manila actually creates OpenStack Neutron security rules to isolate the traffic of the different tenants from each other.
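On the guest side, the data-path difference is simply that the VM now mounts a plain NFS export from the Ganesha gateway on this NFS network instead of talking to Ceph directly. A minimal sketch with placeholder names, addresses, and paths (not from the talk); note that with the NFS driver, access is granted per client IP rather than per cephx identity:

    # Allow the guest's IP to access the share (address is a placeholder)
    manila access-allow mydata ip 203.0.113.5

    # Look up the export location: the Ganesha gateway's virtual IP on the NFS
    # network plus the share's path
    manila share-export-location-list mydata

    # Mount it in the guest with the stock NFS client; no Ceph packages needed
    mount -t nfs 198.51.100.20:/volumes/share-0001 /mnt/share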
As I mentioned, we're running Pacemaker on the controller nodes within our TripleO-deployed environment. There's a concept in TripleO called composable roles, and composable roles make it possible to define, for example, a set of nodes that run specific services. It would be a nice idea to run a few standalone nodes just for the Ganesha services, with their own Pacemaker cluster. However, the Pacemaker cluster at the moment is tied to the controller nodes, because it's also controlling other services, so we can't easily move the Pacemaker service to a different node. Pacemaker needs to run on the controller nodes, and because of that, Ganesha itself needs to run on the controller nodes; on a failover, Pacemaker needs to restart the Ganesha container (Ganesha runs containerized in newer releases with TripleO) on a different node.

So to summarize, what are the benefits of using the CephFS NFS driver implementation? We have better security, because we can actually isolate the user virtual machines from the Ceph network and its daemons. We have familiar NFS semantics, and NFS clients are included in most guest operating systems, which is pretty important for some operators, because you can't necessarily add Ceph drivers to, let's say, a proprietary guest system, whereas NFS is widely supported out of the box. We have path separation in the backend storage system, and by using OpenStack Neutron network policies we can enforce that the backend traffic is isolated from tenant to tenant. That gives us the possibility to run this in a public cloud environment, compared to, for example, a private cloud deployment where you can trust your virtual machines.

But there are also some drawbacks here. As I mentioned, Ganesha is a man-in-the-middle in the data path now, and your guests no longer talk directly to the HA-ready Ceph cluster, but only to the Ganesha gateway service. So what we would like to do, ideally, is run Ganesha in an active-active mode, or run Ganesha in a way that we spawn a separate Ganesha instance for every tenant. Instead of setting up a single Ganesha service when we deploy the OpenStack cloud, we could start a Ganesha service per tenant on demand, and by doing so we would increase the performance that we get out of the overall service, and we would not be exposing a single point of failure in the form of one Ganesha service.

And with that said, I'm handing over to John again, who is going to give you an outlook on what comes next.

So it's fairly clear that we've got the native driver, which gives you great performance and simplicity but has a big security problem, and the NFS driver, which is much easier to use and much more generally applicable but has some innate limitations in terms of performance. Naturally, we want to build a scale-out NFS solution. The way we want to do that is to have a tighter integration between Ceph and Ganesha. Pacemaker and Corosync are very useful tools for HA up to a certain scale, but we envisage that in the future we're going to have NFS daemons that need to scale out potentially as far as Ceph scales out, so potentially up to hundreds. And we'll also want to have Ganesha daemons that are created on a per-tenant or per-share basis, so we'll have a much larger number of these than Pacemaker and Corosync will be able to handle. Ceph fortunately already has lots of general-purpose machinery for keeping its own services running, so we intend to put a wrapper around NFS-Ganesha and run it in a way that it's slaved to the Ceph cluster. That should also make it simpler to deploy, because there's no extra configuration to do for Corosync and Pacemaker.
It's also easier to use for users of Ceph in general: whether you're running with Manila or in any other environment, we can potentially give you a single command that both creates the file system and configures the scale-out NFS on top of it.

Kubernetes comes in here as well. There are a number of efforts ongoing at the moment to bring Ceph into Kubernetes environments, not only to provide the storage to applications running in those environments, but also to take advantage of what Kubernetes can do to run our services, to keep our services up, scale them, and so on, and that will apply to NFS-Ganesha. So when Ceph wants to create a new Ganesha daemon, it will ask Kubernetes to do that for it, whether your application is based on Kubernetes or not. So even if you're on a traditional OpenStack cloud, you would still probably be using Kubernetes behind the scenes if you wanted a Ceph cluster that could provide this kind of capability for you.

So here's a diagram of what that solution looks like. Manila is still there in the top left, and it's calling down to the MGR; that's the Ceph manager daemon, one of the daemons in a Ceph cluster. Manila asks the manager to create and delete shares for it, and on Manila's behalf the manager calls out to Kubernetes at the bottom and asks it to create pods that contain a new service we intend to create, called NFS-GW, which is a sort of analog to RGW, the RADOS Gateway, but doing NFS. Those pods would be the data path for the user applications. You see Kuryr mentioned in the bottom left of that diagram as well; that's about the network configuration that goes with this. As well as creating NFS daemons per tenant, we will also need to be doing network configuration per tenant to give them the right level of isolation. That's very much an emerging area in Kubernetes: traditionally, Kubernetes pods only had one IP and that was all you got, but things are moving fairly quickly in Kubernetes, and we imagine that things like these storage containers will be a sort of early adopter of the new, improved networking that's going to be out there in Kubernetes.

So this is the sort of third revision of that same basic diagram, but this time you can see tenant A and tenant B are both talking to their own Ganesha. They're not sharing anything at all anymore; they're fully isolated, which is kind of what we wanted to begin with.

So this isn't trivial, and it's not all just plumbing; making a scale-out Ganesha cluster on top of CephFS has some challenges. Running it to begin with is very easy: you can download Ganesha and do this today. It's a shared file system, you can put multiple front ends in front of it, and that's all fine. But when a failover happens, we need a little bit more to happen here.
So NFS clients have some state with their NFS server, but the NFS server here is itself a client of Ceph in turn. That creates a fair bit of complication in terms of what order you restart services in when a failover happens, but also at what points you have to put the NFS services into what they call grace. When you put an NFS daemon into grace, it means that it's not going to let any clients take any new locks or access new files until the failed daemon that we're currently replacing has come back, and that's to prevent locks from getting stolen and to provide the right semantics for the user application. It's quite an obscure aspect of implementing NFS that most people aren't familiar with, but it becomes very important once you start doing active-active NFS and scaling it out, especially when you're putting it on top of another system that is doing its own HA underneath, in this case Ceph.

Even further out: the MDSs in Ceph clusters, the metadata servers, are themselves a scale-out cluster, so we'll want to scale out the MDS daemons at the same time we scale out the NFS daemons, to let you do things like small-file workloads efficiently. NFS delegations will speed up these workloads a lot for certain applications that don't want to keep doing lots of round trips all the way down to the Ceph cluster: a delegation allows the Ceph cluster to say to the NFS daemons, "OK, you manage this file for the moment; you don't have to ask me for further accesses to it." And pNFS, parallel NFS: everybody's been waiting years for pNFS, and it might still happen one day. In the context of Ceph, what that might look like is us putting a little lightweight daemon in front of each of the OSDs that allows pNFS traffic to go through it, and what that would mean is that an individual NFS client, even when going via NFS rather than Ceph's native protocol, would be able to get the full bandwidth of the cluster.

And finally, there's the use of this outside of OpenStack. Manila is a fairly separable thing from OpenStack, so it will be interesting to see to what extent Manila gets reused in other environments, but also to what extent the components that we're building here to support Manila get reused elsewhere. Much of the same stuff that somebody wants in Manila, they'll also want in Kubernetes or any other container-orchestrated environment, so we anticipate the stuff that we're building into Ceph and on top of Ceph for Manila showing up in other frameworks as well.

That's all I've got. Thanks very much for your attention. Any questions? There's one over here.

[Inaudible audience question]

You're asking about Kerberos, which I didn't mention, and that's because that's a general-purpose NFS aspect and it doesn't change if you're using Ceph. Whatever Kerberos support exists in your NFS daemon will still exist when you're running on Ceph; I'm not an expert on that aspect.

[Inaudible audience question]

So I think the question is: when I'm running NFS on top of CephFS and both of them are doing sub-directory mounts, how are we tracking the metadata beneath that? The metadata inside Ceph is completely unchanged. You can mount the overall file system that's storing these volumes and you will see all of the volumes as subdirectories, and you can even read and write to them safely at the same time. Because our whole concept of a volume is this sort of sham, where really it's just a directory, we don't have to make any changes at all to how the metadata is stored on the backend. Yes, it's still on CephFS.
Yes, sorry, I forgot to repeat the question there. It was: will the NFS daemons still run on CephFS, or will they run directly on Ceph, on the underlying RADOS? And the answer is that it's still CephFS, just with a more nicely packaged NFS solution on top of it.

[Audience question]

So the question is why we are doing file systems and not block devices. It's for workloads that require a shared file system. If you have a workload that requires storage which is only accessed by one VM at once, then a block device works well. But if you want a file system that will be accessed at the same time by many different nodes, then Manila is what you want. Cinder is still great for single-node access, and Manila is primarily for people who want shared file systems accessed by many nodes at once.

One last question. I'm sorry, I can't hear you... The question was: when deploying the Queens version of all of this with the NFS solution, does it require any special dependencies or configuration, especially on the network side? Within the Queens release it will be fully integrated on the TripleO side, so when you deploy an overcloud using Red Hat's TripleO, it will already be fully integrated; you just need to enable it at deploy time. If you're familiar with TripleO, you have a bunch of environment files that, for example, set various networking items and things like that, and with the Queens release it will be as simple as adding an additional environment file to the deployment. That will deploy the Ganesha services and the other required services, so you can basically use it out of the box. That's the idea.

OK, I think that's it. Thanks everyone.
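For reference (not shown in the talk), enabling the CephFS-via-NFS backend at deploy time is expected to look roughly like the sketch below; the exact environment file names are assumptions based on the upstream tripleo-heat-templates layout and may differ per release:

    # Include one extra environment file in the usual overcloud deploy command
    openstack overcloud deploy --templates \
      -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
      -e /usr/share/openstack-tripleo-heat-templates/environments/manila-cephfsganesha-config.yaml \
      -e ~/my-network-environment.yaml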