All right, can everyone hear me okay? Yes. All right, I guess it's time to start. My name is Sage Weil. I'm the Ceph Principal Architect at Red Hat, and today I'm gonna talk about Ceph, Manila, and a little bit about containers in OpenStack. So just a brief outline of my talk. I'm gonna give a bit of background on Ceph and CephFS. I'm gonna give an update on the current state of CephFS. I'm gonna talk about what the current landscape in Manila looks like and the CephFS native driver we're working on, and then sort of segue into our plan for better file system plumbing into virtual machines in OpenStack, how that affects Manila versus Nova, how the responsibilities might break down, and what that means for containers. So let's start with Ceph and CephFS, or actually, why should we use files in the cloud instead of, say, object storage? The biggest reason is that file-based applications aren't going away; POSIX is the lingua franca of the computing world. It's important to be able to interoperate with other storage systems and data sets. As more people start using containers, and container volumes are really just directories in a file system, we probably wanna be able to support that effectively. And it turns out that permissions and directories are actually useful concepts, so we shouldn't just throw them out because we're using something else. One might ask, why not just take a block device and put a local file system on top? The problem there is that block devices really aren't shared: when you put a local file system on one, it expects exclusive access to that block device, so it's not useful for sharing data between virtual machines. And block devices aren't very elastic; it's difficult to expand and contract the file systems there, whereas shared distributed file systems usually do this quite naturally. Why should we look at Ceph? Ceph is designed from the ground up to scale horizontally, which makes it a very good fit for the cloud. It has no single point of failure. It's hardware agnostic, designed to run on commodity hardware. And because it's designed to run at scale, it's self-managing whenever possible. Importantly, it's also completely free and open source, which makes it a good fit for OpenStack. Also, as we're building Ceph, our goal has always been to move beyond legacy approaches. That means we try to adopt a client-cluster paradigm, where the clients that are talking to Ceph understand that there's actually a whole mess of servers storing their data that might be failing and migrating things around, and they're fully capable and prepared to deal with that situation and talk to the right server in order to find the data that they need. So, no Ceph talk would be complete without this diagram, the high-level Ceph architecture. It's all based on RADOS, this red bit at the bottom. It's a distributed object store that manages the distribution and replication of all data in the system. If nodes fail, it migrates data away from them, re-replicates it, makes sure that your data is separated across failure domains, and handles cluster expansion and contraction, all that good stuff. And then on top of RADOS, we build a number of different high-level services. There's the RADOS gateway, which gives you S3- and Swift-compatible object storage. There's RBD, which gives you block storage, which many of you I'm sure are familiar with. And there's also CephFS, which gives you a fully distributed POSIX file system that sits on top of RADOS.
And it's this last bit that I'm gonna talk about today. So CephFS stores all of its files as objects in RADOS that the clients access directly, which means you have very scalable and high-performance access to your data; it scales just as RADOS does. It also has scalable metadata access, because the metadata in the system, the files and directories and permissions and so forth, is distributed across a whole set of metadata servers that are dynamically managing that file system namespace. And it provides you a POSIX interface, so it's a drop-in replacement for any local or network file system. There are multiple clients implemented for CephFS. There's one in the Linux kernel that's been there for many years now. There's a user space implementation that you can access via FUSE, called ceph-fuse. And you can also access it via libcephfs.so, which is a shared library that can be linked into Samba to re-export via CIFS, or into Ganesha to re-export via NFS. It can also be linked into Hadoop to run big-data-type analytics workloads. The thing that truly makes CephFS unique is its use of dynamic subtree partitioning. The idea here is that you have a whole hierarchy of files in your file system, and CephFS will sort of on the fly take pieces of that hierarchy, carve it up, and distribute it among the different metadata servers in the system. It can do this pretty arbitrarily. It can even take a single directory that has lots of files or is very busy and shard that into lots of little pieces and then distribute those fragments across different metadata servers. This makes the CephFS metadata server cluster highly scalable, because we can arbitrarily partition the hierarchy into little pieces, supporting tens or hundreds of metadata servers. It's also adaptive, because this partition is based on the current workload. So as different jobs start up and users start using the file system, the metadata servers will migrate load from the busy servers to the less busy servers and adjust that partition over time to fully distribute the work and take advantage of all the servers in the cluster. It can even take hot metadata and replicate it across multiple nodes so that you get the best performance that you can. There are a lot of other cool things in CephFS. It has strongly consistent and coherent client caches, something that NFS has never had, which means that it's gonna behave the same way a local file system will. It has a recursive accounting feature that's, I think, completely unique, in that it keeps track hierarchically of everything that's stored in the system. So if you look at the size of a directory, instead of getting a sort of bogus 4K number, it'll tell you the number of bytes that are stored in that entire subtree of the hierarchy, the same thing that you would get out of a du, for free, which is pretty nice. You can take snapshots on any directory in the file system. You have directory quotas, where you can limit the size of a subtree based on the number of bytes or files stored. It supports xattrs, and ACLs on the kernel client. There's also support for client-side persistent caching using the kernel's FS-Cache subsystem. So, lots of good stuff, and we've been putting a lot of work into it. CephFS is actually where Ceph began some 10 years ago, and sadly it's also sort of the last thing to become production ready, but we're changing that, as we've focused over the last year primarily on resilience.
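As a quick illustration of the recursive accounting, snapshot, and quota features just mentioned, here's a minimal sketch of what they look like from a client that already has CephFS mounted. The mount point and directory names are assumptions; only the ceph.* virtual xattrs and the .snap convention come from CephFS itself.

```python
import os

# Assumed mount point and share directory on a client with CephFS mounted.
share = "/mnt/cephfs/projects/build"

# Recursive accounting: the MDS tracks per-subtree totals and exposes them as
# virtual xattrs, so this is effectively a free, always-up-to-date `du`.
rbytes = int(os.getxattr(share, "ceph.dir.rbytes"))
rfiles = int(os.getxattr(share, "ceph.dir.rfiles"))
print("subtree holds {} bytes in {} files".format(rbytes, rfiles))

# Snapshots: a mkdir inside the magic .snap directory snapshots the subtree.
os.mkdir(os.path.join(share, ".snap", "before-upgrade"))

# Quotas: limit the subtree by bytes (here roughly 10 GiB) via another xattr.
os.setxattr(share, "ceph.quota.max_bytes", b"10737418240")
```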
So our focus has been on handling errors gracefully, detecting and reporting issues that we see in the system, and providing recovery tools in case things go horribly wrong. And our goal is to really achieve this first in a single-metadata-server configuration, so that we have the highest confidence that things are gonna work and you're not gonna lose any data. Part of this has been dogfooding this system in our own environment. The Ceph QA infrastructure is relatively large and storing lots of data and so forth, and we've been running all of that on CephFS for more than a year now. And we found and fixed several sort of hard-to-reproduce bugs, mostly in the kernel client but also a few in the MDS, things that you would only really see running in a production environment. But we're happy to say that it's been quite stable for some time now, at least on the CephFS side; we've had other issues because it's running on really, really old, terrible hardware, but those things aside we're quite happy with it. Which makes me happy to say that we plan to call CephFS production ready in the next stable release, Jewel, coming out in quarter one of this coming year. So that's very exciting and it's been a long time coming. Yay. Yeah. Yeah. So as I said, there's a lot of work that's been going into CephFS. That includes improved health checks for diagnosing problems; you can identify which clients are misbehaving and which OSDs are misbehaving. There are lots of diagnostic tools, so you can debug the system and figure out what requests are passing through the MDS if they're hung or something like that. There's better full-space handling. There's client management for evicting those misbehaving clients. And also work on continuous verification, so that you can scrub an online system and make sure that there are no errors without taking it offline and doing some sort of consistency check. But our real focus, again, has been on fsck and repair, because although there are no real known issues with CephFS that we've been trying to fix, it's not that stability has been bad, it's really that we wanna make sure that we have the tools available to recover data. If we do have people in the wild running it in earnest that run into problems, we wanna have some confidence that we'll be able to fix their problem without having to write custom tools to go do that. So we want repair tools that handle the loss of data objects, in case RADOS totally fails and loses some of your data. And then also, more importantly, tools that recover from the loss of metadata objects, so that in case the metadata for the file system gets corrupted, we'll be able to rebuild it. So there's a cephfs-journal-tool that we put together, which is sort of a disaster recovery tool for the MDS journal structure that can pull recently written metadata out and put it back into the metadata pool. There's a cephfs-table-tool that adjusts some of the other internal metadata structures. And the last bit, which is probably the most important, is the cephfs-data-scan tool, which can be used if there's a complete loss of all the metadata in the system: you can actually rebuild the file system hierarchy from the metadata that's attached to the file objects in the separate RADOS data pool. So in case things go horribly wrong, we'll be able to get your data back. And all that is gonna be ready for Jewel, so that's very exciting. One of the other new things is around access control.
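Before getting to access control, here's a rough sketch of how those repair tools fit together in a worst-case metadata rebuild. This is an illustration only: the data pool name is an assumption, and the exact subcommands and arguments may differ between releases, so check the documentation for your version before running anything like this.

```python
import subprocess

def run(*cmd):
    # Print the command before running it, so the sequence is easy to follow.
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

DATA_POOL = "cephfs_data"   # assumed name of the CephFS data pool

# Save what we can from the MDS journal before touching anything.
run("cephfs-journal-tool", "journal", "export", "journal-backup.bin")

# Pull recently written metadata out of the journal back into the metadata pool.
run("cephfs-journal-tool", "event", "recover_dentries", "summary")

# Reset internal tables (the session table here) that may reference lost state.
run("cephfs-table-tool", "all", "reset", "session")

# Worst case: rebuild the hierarchy from the backtrace metadata that is stored
# alongside the file objects in the data pool.
run("cephfs-data-scan", "scan_extents", DATA_POOL)
run("cephfs-data-scan", "scan_inodes", DATA_POOL)
```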
So we had a couple of interns over the summer. The first was a Google Summer of Code student who was working on path-based authentication. The idea here is just to take a client and restrict them to a sub-directory mount within the system. That's now implemented in the MDS. There's one missing piece that will also allow them to be locked inside a RADOS namespace, and that will be in the Jewel release. So thanks to Josh from Google Summer of Code for that. We're also working on a user-based authentication system, which is slightly different. The idea here is that you have a client that's mounting as a specific UID and GID, and then we enforce UNIX permissions on the server. Eventually we'd like to also wire that into external auth frameworks like Kerberos or Active Directory, to be used in sort of multi-user IT environments. And that was an Outreachy intern's project. So that's what's new in CephFS. Lots of excitement there, which brings me to OpenStack. I'm gonna start by talking about the current landscape in Manila, and then we'll see how CephFS sort of fits into that. So Manila, as you all know, manages file volumes or file shares. You can create and delete shares. It manages some of the file server network connectivity to the tenants. It has some stuff around snapshot management for those volumes. But there are also some awkward bits with the current sort of user experience for Manila. Manila only manages part of the connectivity problem. It creates these Neutron share networks that expose the file shares to a Neutron network, but it's sort of the user's responsibility to then attach the tenant to that network and actually go and do the mount on the system. So this sort of last mile is up to the user; Manila's only sort of managing half of that problem. It's also sort of assuming that the way the files are being attached to the tenant is network-based. So it assumes that you're using something like NFS or CIFS or CephFS, that is, a network-based protocol, which isn't always necessarily what you want when you start thinking about containers. More about that a little bit later. So most of the Manila drivers are appliance drivers, sort of in the OpenStack tradition. The idea here is basically that you tell an appliance to create an NFS share and export it to a particular IP, and then create the Neutron network to actually do that network plumbing. There are lots of these drivers from the usual suspects. The only real thing worth pointing out is the security in this model: because the tenant has access to the file server over that Neutron network, the security is sort of punted to the appliance. It's assumed that the appliance is effective and secure at actually enforcing whatever security restrictions you have, which is generally true. There's also a generic share driver that is not proprietary; it's fully open source. The idea here is to take existing OpenStack components and build a file service out of them. So you start with a Cinder volume and you attach it to a service VM that you create, put a local file system on top like XFS, run a Ganesha NFS server there, and then share that over Neutron to your tenant. The nice thing is that we're building out of existing components, it's all fully open and usable, and it gives you that tenant isolation, so the tenant can't talk to the storage network where the actual storage is coming from. So it gives you that security.
On the flip side, it's not the most efficient thing. You have this extra network hop; you have to go through the service VM in order to reach your storage, so it's not gonna perform as well as it could. You have this extra VM that you have to run, which consumes additional resources, and that VM is also a single point of failure: if that node goes down, then you lose your file service, even if your backend storage system is highly available and so forth. But this is a reference driver, it's in Manila, you can use it today. So that's all good. There's a very similar set of drivers that are built around Ganesha and service VMs as well. Currently this is used for GlusterFS, but you could do the same thing with CephFS. So the idea here is that you take this Ganesha driver toolkit that they built, you start up a service VM that's again running Ganesha, but instead of using a Cinder volume, you use Ganesha's FSAL, sort of its backend abstraction, and have that talk directly to the backend distributed file system. And then you export NFS to the tenant. It's nice that it's sort of a simple, existing model that we've already used with the reference driver, and it has that good security where you're isolating the tenant from the network where your distributed file system backend is running. But again, the problem is that you have this extra hop, the service VM is a single point of failure, and it consumes additional resources. So this Manila Ganesha toolkit is written sort of modularly; it's used by the current GlusterFS share driver in Manila. It's not yet something that we've implemented for CephFS, for reasons that I will talk about right now. So in our view, the problem with the service VM approach is that the architecture is somewhat limited. It's always going to be slower, because you have this extra hop through the service VM, and it's always going to be expensive, because you're running this extra VM doing all this extra work. The other problem is that the current implementation isn't highly available. So you need some service monitoring to make sure that VM doesn't go down, and if it does, you have to restart it. It's a single VM; there's no sort of HA or load distribution or anything like that. And, sort of an implementation detail, the way the current Manila code is written, it's sort of assuming that there's a single service endpoint that's doing this. A lot of that's fixable. There's sort of a big to-do list to make this something that's fully robust and usable. But before we invest all that effort, I think the question that we need to ask is: is this really the right endpoint? Is this really how we want to plumb tenant access to Ceph, in our case? Is that really what we want to do? And this led us to, well, I'll get to that in a moment. So the starting point here is that we first built sort of a simpler approach: avoid the service VM and just give the tenants the ability to mount CephFS directly. So we made a CephFS native driver as a reference. The idea here is that you have the tenant, you just give it access to the storage network, so you ignore the security implications for now, and you mount CephFS directly from the tenant VM. The nice thing here is that you get really good performance. You have direct network access; the client talks directly to Ceph. You have access to the full CephFS feature set. You use the native kernel driver, and it's nice and simple. But there are a few drawbacks.
The guest has to run a modern Linux distribution that actually has an up-to-date Ceph client. It exposes the tenant to the Ceph cluster network, which might be a security concern in your environment. The networking is currently up to the user, so it's up to the user to make sure that the tenant has access to the storage network. And, sort of a detail, the tenant client needs to be delivered a secret that it uses to mount the file system. Although, I guess in the Manila session today they sorted that out, so that's coming soon. One of the things that made this driver simpler to implement was to first implement a library that we use to manage the CephFS volumes or shares that we're providing to Manila. So this is a CephFS volume manager that wraps, sits on top of, the libcephfs Python bindings, which then talk directly to Ceph. That's gonna be packaged as part of the Ceph Python RPM and DEB packages. And the basic idea here is that Manila shares or volumes are just directories in CephFS. So you might have /manila, then the consistency group, and then the volume ID. And this library just captures all of the basic volume management tasks that you would expect. So you can create volumes, you can delete them, you can create snapshots on a volume or on an entire consistency group, which is really trivial in CephFS: you just have to do a mkdir inside the sort of magic .snap directory. And then one of the other API calls in Manila is the ability to take a snapshot and promote it to a volume that you can then mount; I think that's the only way you can actually access a Manila share snapshot. So if you want it read-write, then you actually have to make a copy, because CephFS doesn't support writable snapshots, and probably never will. Or if you want a read-only one, you can just create a symlink. But the nice thing is that because the volume manager sort of wraps up all this complexity and is packaged as part of Ceph, the Manila driver is really simple. It's only 250 lines of code. Which is nice. Excuse me. So the challenge here is really the security. The core issue is that the tenant now has direct access to the storage network, so it can talk directly to CephFS, which means that Ceph is the one that's responsible for enforcing whatever security restrictions you have. Now, Ceph has actually had proper authentication for years, since like 2009. Our authentication model is modeled after Kerberos, which provides the usual mutual authentication of client and server: the client knows it's talking to the right server, and the server knows that the client has the secret and is the right one. What's really new here is that we've added this additional authorization that allows you to say that this particular client, if they authenticate, is only allowed to access a particular directory in CephFS. That's the Google Summer of Code project I mentioned, and it's now upstream. And there's this one missing piece around RADOS namespaces that, again, will be present in the Jewel release next year. So the real question then is: is that enough? Is that a sufficient security model to give you confidence that you're gonna give tenants access to your storage network? It means that Ceph security is the only barrier. Ceph hasn't been hardened against denial of service the same way that many other systems have. That may or may not be a problem in your environment. I think it really depends on whether this is sort of a public cloud where your users are totally untrusted, or whether it's a more trusted environment like a private cloud.
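To make the volume-manager idea concrete, here's a minimal sketch of what such a library might look like on top of the libcephfs Python bindings. This is not the actual Manila code: the /manila prefix, class and method names, pool name, and the authorize helper (which just illustrates roughly what a path-restricted cephx capability looks like) are all assumptions.

```python
import subprocess
import cephfs

MANILA_ROOT = "/manila"  # illustrative prefix; the real layout may differ

class CephFSVolumeManager(object):
    def __init__(self, conf="/etc/ceph/ceph.conf"):
        # Talk to the cluster through libcephfs; no local mount needed.
        self.fs = cephfs.LibCephFS()
        self.fs.conf_read_file(conf)
        self.fs.mount()

    def create_volume(self, group_id, volume_id, mode=0o755):
        # A share is just a directory: /manila/<consistency group>/<volume id>.
        path = "{}/{}/{}".format(MANILA_ROOT, group_id, volume_id)
        self.fs.mkdirs(path, mode)
        return path

    def snapshot_volume(self, group_id, volume_id, snap_name):
        # Snapshots are just a mkdir inside the magic .snap directory.
        path = "{}/{}/{}/.snap/{}".format(MANILA_ROOT, group_id,
                                          volume_id, snap_name)
        self.fs.mkdir(path, 0o755)

    def delete_volume(self, group_id, volume_id):
        self.fs.rmdir("{}/{}/{}".format(MANILA_ROOT, group_id, volume_id))

    def authorize(self, client_id, group_id, volume_id):
        # Rough illustration of the path-restricted cephx cap discussed above;
        # the RADOS namespace restriction would be added once it lands in Jewel.
        path = "{}/{}/{}".format(MANILA_ROOT, group_id, volume_id)
        subprocess.check_call([
            "ceph", "auth", "get-or-create", "client." + client_id,
            "mon", "allow r",
            "mds", "allow rw path=" + path,
            "osd", "allow rw pool=cephfs_data",
        ])
```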
And it's something that you as an administrator have to answer in your context. But the security is, well, it's a bit of a problem. So what we're really looking for is a better way to plumb file access to virtual machines, so that we have something we can use in a totally untrusted environment like a public cloud. Which brings me to this discussion. So we want a couple of things. We want better security. We want the same sort of protection and isolation that we have with block storage. We want simplicity of configuration and deployment, just as we have with QEMU and librbd: the host just attaches the block device and it's there for the guest, and you don't really have to do anything special. And we'd also like good performance, ideally performance that's similar to what we would get with native access to CephFS. So there are a couple of different models that we've looked at to try to achieve this. The first one that we looked at, and that initially we were very excited about, was the use of 9P in KVM. 9P is a file sharing protocol analogous to NFS, and VirtFS is an embedded 9P server that's part of QEMU. So the idea is that QEMU itself has a file server that exports just to the guest, and then the guest kernel mounts over 9P using a virtio transport to make it nice and fast and efficient. So the idea here is to link libcephfs into QEMU the same way that librbd is, and it sort of links directly into this VirtFS embedded 9P server. And then the guest OS, which has to be Linux in this case, uses the special version of the 9P protocol that IBM added like five or seven years ago to mount VirtFS, which is actually re-exporting libcephfs. So it's a very similar model to the way that librbd is used today; it's sort of easy to deploy. The nice thing is that this gives you good security: the tenant is isolated from the storage network and locked inside a particular directory. And it's really easy to deploy; it's sort of frictionless in the same way that librbd is. But on the other hand, it requires modern Linux guests that have support for 9P. More importantly, 9P isn't supported in most Linux distributions today, at least not the Red Hat ones. I found out after talking to all the developers that there are actually two different reasons for this. One is that the kernel developers don't really like the 9P kernel client and don't really want to support it. And on the QEMU side, the QEMU developers also don't really like the VirtFS server; they have concerns about the code quality. So on both fronts, it's not the most attractive option. Now, to be fair, code quality is something that can be fixed. It's open source code, that's the whole point. So that could improve over time. But the other issue is that 9P just isn't the best file sharing protocol, at least in my opinion. It's second to NFS or something like native CephFS. But it's an option. There's in fact a prototype put together by some developers at UnitedStack. That includes the QEMU patches to plumb libcephfs in there, the Manila driver that orchestrates CephFS, and some changes in Nova to actually do the attachment through libvirt and so on. But we'd really like something better. So why don't we look at a different file system protocol that we all know and love? We can just use NFS. So the idea here is that you would mount CephFS on the host, or maybe use Ganesha, and then re-export NFS to the guest OS over a private network that just the host and the guest share.
So again, you have the same isolation of the tenant from the storage network, which is nice. NFS is supported by everyone, which is good. It's reliable in the sense that the NFS server gateway is in the same fault domain as the guest, so if that host goes down, everything fails together and it doesn't really matter. And it works for any file system, not just CephFS. On the flip side, NFS has kind of weak cache consistency. It's a better file sharing protocol, but it's not the best one. And the protocol translation will slow us down a little bit. But the main issue here, I mean, those are probably not that big of an issue, but the main issue is the sort of awkward and insecure networking needed to have this simple private network attaching the host and the guest, for a couple of different reasons. It means that you have to add a dedicated network to the virtual machine. You have to configure a local subnet on the host and assign IPs to the host and the guest. You have to configure NFS on the hypervisor and you have to then mount that in the virtual machine. The community guys tell me that it's awkward to have these special-purpose network interfaces that you attach to the KVM instance that are sort of external to what the user is asking for; I don't know enough to actually know why that is. It also means that you're sort of dependent on the networking configuration inside the guest. So things like firewalld running inside the virtual machine might break this. If the user just restarts networking, their file service might go away, which could be problematic. And also they'll just see that there's this weird network interface and network configured that they might not realize is actually important for getting access to their file share, which can be problematic. But I think the thing that concerns me the most is really the security implication, where other services on the host might inadvertently be exposed to the guest. So for example, you might have a daemon like SSH that's binding to all IPs on the host system that the guest can then access over this private network. So hopefully the admin has a firewall set up to prevent that, but who knows what's really gonna happen. So we'd like a better alternative. And the one we've found is called vsock. vsock is actually based on VMware VSockets. It's a new address family that's designed specifically for communication between virtual machines and hosts. Excuse me. It's stream-based or connectionless, just like IP. The addressing is dead simple: it's just an integer to identify which virtual machine you're talking to, and one is the host. And it's been supported in Linux since version 3.9, so for over two years now. And the key attraction here is the zero configuration. The hypervisor always has address one, and the VMs are just assigned addresses above that, but nothing has to be configured inside the virtual machine. Once you start up QEMU and tell it which vsock address the VM has, it's all there. So the approach then is to do this NFS export from the host to the guest over vsock instead. There are some details here. We have to use NFS v4.1 only. That's because older NFS versions have all this annoying protocol stuff that involves extra addresses. Older versions of NFS use lockd, for example, and that address gets shared around, and it's a whole mess, so you can't really tunnel it over vsock. Whereas v4.1 consolidates everything over a single connection, so that's good.
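Just to show how little configuration the vsock addressing involves, here's a toy sketch (not from the talk) of a guest talking to a service on the host. It assumes a Linux guest with AF_VSOCK support and Python 3.7 or newer; the port number is arbitrary.

```python
import socket

PORT = 12345  # arbitrary service port for this illustration

# Inside the guest: no IPs, no interfaces to configure; the host is reachable
# at a well-known context ID, and that integer plus a port is the whole address.
with socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM) as s:
    s.connect((socket.VMADDR_CID_HOST, PORT))
    s.sendall(b"hello from the guest")

# On the host, a server would simply bind to (socket.VMADDR_CID_ANY, PORT) and
# accept connections from whichever guest CIDs the hypervisor handed out.
```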
It's also much easier to support, because it's easy to add support for vsock to existing services. It's mostly boilerplate code to support a new address type, and then adding parsing for the address format. So there are patches in flight for the NFS client and NFS server from Stefan, who's a QEMU developer. There are patches for Ganesha to add vsock support from Matt Benjamin, and there are also patches to nfs-utils so that the mount command-line utility will work, to make this all work. So this is all sort of shiny and brand new, but we have prototypes working and running, and it seems very attractive. So the overall picture then is that you would mount CephFS on the host and re-export it via knfsd to a particular vsock address for the virtual machine, or you can use Ganesha to do the same thing if you don't want to use the kernel client. And then you would export it just to that virtual machine's vsock address. So the nice thing is that NFS is well supported, even NFS 4.1. The security is better, it's simpler configuration, it's more reliable. The only real drawback to this approach, from our perspective, is just that vsock support is brand new in QEMU and NFS, so it's gonna take a while for all those patches to land and make it into distributions. So we really like this model for a couple of sort of key reasons. One is security: the tenant remains isolated from the storage network, and there's no shared IP network between the host and the guests, so you don't have to worry about accidentally exposing host services to the guest network. It's very simple. There's no network configuration aside from the host just assigning vsock addresses to each of the virtual machines. It goes back to a model where you treat the virtual machine as a black box. There's no configuration that has to happen inside it; it'll either use that vsock address or it won't. And there's no software-defined networking; you don't have to worry about Neutron or anything like that being involved. We like the reliability model, because the gateway that's sitting on the host is in the same hardware fault domain, so if that node fails, the guest fails too, and you don't really have to worry about them failing independently of each other. There are fewer network traversals, so the I/O stays on the host and then goes to the distributed file system; you don't have to hop to a service VM that might be running on another machine in your cloud. Which means that the performance is gonna be much better. So it's a clear win over having a service VM where you have multiple network hops. It might also end up being a performance win over using TCP, although currently it's not optimized for that; the vsock driver in QEMU uses virtio for I/O, but it's not really optimized for performance. This is really about configuration and simplicity and security. But there are some challenges. vsock is the new hotness, it's brand new, and we have to get all this code upstream, so it's sort of a work in progress. There is some host configuration that has to happen: the host has to decide what the addresses are for the VMs. And someone actually needs to go and do all this configuration, so they need to mount Ceph on the host and set up the NFS export to the guest to actually make this happen. And so the big question for us is: what is the user experience gonna look like?
So does the consumer of the Manila API, are they expected to just know that this is how everything's gonna work and that they're gonna have to mount this magical vsock address in order to attach to their share? Or are there gonna be new APIs that do some of this work for them and then tell them what the mount mechanism is gonna be? So that's the big question. Which brings me to one of my final topics, which is really how the responsibility for this is gonna break down between what Manila's doing on the file share side and what Nova is doing with all these virtual machines. So Manila, obviously, is managing the shares and volumes, and Nova's managing the virtual machines. I think that the cleanest analogy to look at is the Cinder/Nova breakdown. Cinder also just manages the volumes; Nova manages the VMs, but Nova has this additional set of API calls that let you attach a Cinder volume to a virtual machine, because that's driver dependent. It knows how to do it with KVM, but maybe for other hypervisors it doesn't. So we think that the same model should work well for Manila. What we'd like to see, ideally, is a new API in Nova that allows you to attach a file system, a Manila file share, to a particular guest. So in this case, the hypervisor is the one that's sort of mediating access to that file share. It might do some network configuration, like attaching to the Neutron share network that Manila's created. It might be assigning the vsock address and setting up this NFS re-export. It would be totally driver dependent in that case. It's also gonna be important to figure out how this is gonna work in a container environment where it's not KVM at all but LXC or something like that. We also think we need to have a second API call that lets you fetch the access metadata, which tells the user how they're gonna mount the file system. So which vsock address they need to mount, or whether they should mount over the network, or whether it's a bind mount from something inside a container. And because Nova now understands which file systems are gonna be attached to the tenant, it can manage things like reattaching these file shares after you reboot the system, or if the Nova VM gets migrated to another host, it can also do whatever configuration is necessary on the new host in order to make that work. So we've made a chart of what all the different Nova drivers would actually have to do in this case. I'll just look at a couple of these. In sort of the traditional Manila model where we're exporting NFS over the network, the attach might just take the virtual machine that you already have and connect it to the Neutron network that Manila's managing. In the case of doing this vsock thing with Ganesha, the attach API would actually set up the Ganesha server on the host, assign a vsock address to the VM, and do that NFS export configuration. And in the case of a completely different environment where we're using one of the container Nova drivers like LXC or LXD, it might do something completely different: it might mount the Ceph file system on the host and then just do a bind mount into the container namespace so that you can access that file share (a rough sketch of what that might look like follows below). And it's because the way the guest accesses the file system is completely dependent on what the Nova driver is that Nova really needs to be involved in this whole process and do that arbitration. Which brings me to containers, which is sort of the last little bit here.
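For that LXC/LXD row in the chart, here's a rough sketch of what the attach step might boil down to on the host. The paths, monitor address, and key name are all assumptions, and a real Nova driver would do this through its own plumbing rather than shelling out like this.

```python
import subprocess

HOST_MNT = "/mnt/cephfs"                            # host-wide CephFS mount
SHARE = "manila/tenant-a/share-1"                   # share directory inside CephFS
ROOTFS = "/var/lib/lxc/guest1/rootfs"               # assumed container rootfs
TARGET = ROOTFS + "/dev/manila-share-1"             # where the guest will see it

# 1. Mount CephFS once on the host with the kernel client (monitor address and
#    key name are illustrative).
subprocess.check_call(
    ["mount", "-t", "ceph", "mon1:6789:/", HOST_MNT,
     "-o", "name=manila,secretfile=/etc/ceph/manila.secret"])

# 2. Bind the share's subdirectory into the container's filesystem tree; the
#    tenant then does the final mount/bind to wherever they want it.
subprocess.check_call(["mkdir", "-p", TARGET])
subprocess.check_call(["mount", "--bind", HOST_MNT + "/" + SHARE, TARGET])
```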
So our vision here is that when you're using LXC or LXD in a Nova environment, the ideal model for plumbing that file access to the guest is something very different from what you'd use for KVM. You don't necessarily want a container to have a new NFS mount to the file server; there's a much more effective way to do that. The idea is that you would mount Ceph, or whatever other file system, on the host and then just do a mount --bind somewhere into the container namespace. Our thought is that you would actually mount this to something in /dev. And then the user would do the final mount to whatever their final location is within their file system, so that they can attach that file share where they want to see it. The pro here is that you get the best performance, using sort of the native driver on the host. Containers are all about cutting out the virtualization overhead, so that's all good. And you also get the full set of CephFS semantics, with snapshots and so forth. The only real drawbacks here are that you're relying on the container for security, which is sort of obvious since you're using containers in the first place. But in order to make this model work, you really need that Nova attach/detach API, because doing this plumbing is entirely dependent on the Nova driver that you're using. So, in summary, the CephFS native driver for Manila is something that we've created; it should land soon, we hope. And the next version of Ceph, Jewel, which comes out in quarter one of next year, is gonna have a stable, production-ready CephFS. So we hope to have sort of a complete solution where we can use Ceph and Manila in real clouds soon. The current Manila drivers are sort of centered around appliances, and they're also assuming sort of a virtual-machine, network-file-system access model; they don't really contemplate containers. We think that NFS over vsock is gonna be a better way to plumb file access to VMs. We think it's very promising for simplicity reasons, reliability, security, and performance. And there's a lot of choice there: you can use the NFS kernel server, Ganesha, whatever stack we find to be most reliable or attractive. But in order to make this happen, we really need to sort out this interaction between Nova and Manila and where the responsibility breaks down in terms of connecting these file shares to the guests. And we think that those Nova APIs are gonna help us handle the non-KVM cases, particularly containers. They might even help in the Ironic case. But most importantly, it'll allow sort of the right party, in this case Nova, to be responsible for setting up that gateway that re-exports NFS to the guest. So that's it. Thank you very much. I'm happy to take any questions. Sorry, I'll just repeat the question. When you connect, or you export something over vsock, does the guest get a notification? And the answer there, I'm pretty sure, is no. In fact, it's analogous to a server exposing a service on the other end of the network: you don't know until you try to connect to it. Yes, so the last step, of the tenant actually mounting the file system, is still sort of left up to the user. And I think in general there's a lot of reluctance to have any of the OpenStack services reaching inside the container and doing any work. There are some agents that do some things like that, but that always, I think, has to be opt-in behavior.
So I think the approach that makes the most sense is to make sure that there is a user-facing API that gives the user all the information they need to know how to do the mount. And then they can do that final step. With all the Jewel goodness on the server side, are there any caveats to running the rest of your cluster on Hammer or something? So can you have Jewel servers with everything else on Hammer? You definitely need the Jewel metadata server. I'm trying to remember if there are any dependencies on the OSDs. I think currently there aren't, but by the time we get to Jewel there will be; we changed the way that object enumeration works so that the recovery tools work properly. And in general you're gonna be better off keeping everything on the same version anyway. But Jewel is gonna be the next LTS release, so it's probably gonna be the version that'll be in Ubuntu 16.04. It'll be in the next Red Hat Ceph Storage release that comes out next spring. So by that time you're probably gonna wanna upgrade to Jewel anyway. Back here. What was that? Sorry? Sorry? OpenStack Magnum is the question. So I don't know a whole lot about Magnum, but my understanding is that Magnum is about wrapping Kubernetes, and then Kubernetes would be orchestrating containers. My understanding is that those containers are actually running inside Nova VMs or on top of Ironic; it's not actually orchestrating containers that Nova itself is managing. No, that's not entirely correct. So with Magnum the containers are not managed by Nova. You did talk about integrating with Nova, so the question is whether you guys have any plans about integrating with OpenStack Magnum, which does not involve Nova. On the storage side we have no plans. I don't know what the larger strategy around container orchestration is; it's not really my area. I think if you are doing it using Nova to manage those containers, then all of this stuff is just gonna work. If you don't, if those containers exist inside of existing Nova instances, then again we would use whatever methodology makes the most sense to get the file access into the Nova tenant, and then you would have to do some additional work to get it inside the Magnum container. Yes? Do you have any plans to export HDFS through Manila and integrate with the Sahara project? We don't have specific plans around that on the Ceph side. My understanding is that there's already an HDFS Manila driver; in fact, there was a talk, I think yesterday, about doing just that. So I think you can already do that. If you wanna do all that on top of Ceph, run HDFS on top of Ceph, I don't know, I don't think you wanna do that. So no, I guess, is the answer. Yes. The question is about a CephFS Windows client. One does exist. It uses a framework called Dokan, which is sort of like a user space file system framework for Windows. It's out of tree right now; it's a separate project, although it's under the Ceph organization on GitHub. There's a lot of work that needs to be done to sort of bring that fork in line with the mainline code. So, I think it works, but I've never tried it, so your mileage may vary. Yes, back here. Hi there, is this gonna cause any problems? Is Manila gonna cause any problems for the RDMA work with Ceph? I don't think it should make any difference. Well, no, if you're using the Ceph native driver where the tenant VM wants to mount CephFS directly, RDMA is not gonna work in that context, I assume.
If you are gonna do something like the vsock model that I'm talking about, that we're looking at, then it's just a matter of whether the host kernel supports RDMA, because it's the one that's actually mounting CephFS. So in that case, it would work. And I think that would be true whether you're using the kernel client and the in-kernel NFS server, or whether you're using Ganesha with libcephfs. In both cases, it's really a matter of what the host operating system and kernel support, and the vsock part is sort of separate. I think that as folks grow, and now that CephFS is becoming stable, it would make sense for Nova to use CephFS to enable live migration. Yes, and in fact, that's one of the reasons why we think Nova should be the one that's managing these attachments, because you want to be able to attach a file system and then live migrate to another host and have it still keep working. And if plumbing the file access to the VM involves this sort of complicated going through a gateway that's on the host, and Manila's trying to push configuration to the Nova hypervisor to do that, it's going to break. For the ephemeral drive? Like the boot volume? I'm not sure I understand the question. Okay. And then you'd use qcow2. You could do that. That's probably not what I would recommend, although I know some people do do that. So yeah, any other questions? Yes? Which replication? Oh, geo-replication. Not initially, no. That's gonna be a ways off before that's supported. I think eventually we'll want to look at that, but we're focusing initially on doing geo-replication in the RADOS gateway, for federating all these object gateways, and on the block device for replicating block devices, which will be in the Jewel release. But for CephFS, we're not looking at geo-replication yet. Yes? I assume so, but honestly, I don't know. I'm not the OpenStack expert around here, it turns out. Yes? CephFS for Windows guests? We're talking about file access for Windows clients. As I mentioned before, there is a CephFS client for Windows based on the Dokan framework, but it's out of tree and we haven't tested it directly, so I'm not sure how well it works, but I believe it does work to some degree. Over vsock? That I'm not sure about, but I'm guessing it's possible, because this vsock address type is actually many years old. I think VMware added it to ESX like five years ago or more, and it got upstream into Linux maybe two years ago, and my guess is that the whole point was to do stuff like NFS over vsock to the server, but I'm not certain. So it's really a question of whether the Windows NFS client supports vsock, and that I don't know. Okay, thank you. Thank you. Anything else? All right, I'm gonna post these slides on SlideShare, and on the last slide I have a whole bunch of links if anybody wants to go look at the code and patches that I'm referencing, so thank you very much.