Okay. Thank you. All right. Let's get started. I'll be talking about virtio-fs today. virtio-fs is a shared file system. It's something that I've been involved in for the past year and a half or so, amongst other projects.

The basic thing that virtio-fs does is it takes a directory on the host and lets the guest access those files directly, without a disk image. You don't need to create a disk image with a file system inside it. Instead, a directory that you already have mounted on the host is made available to the guest, and any I/O is done directly to those files. There are a bunch of use cases for this, including containers; Kata Containers can use it. I'm not going to go into the use cases in this talk. If you're interested in a general overview of why virtio-fs is being developed, what it can do, and what the use cases are, please check out the KVM Forum talk that was given last year. In this presentation I'm going to be a bit more technical and give you a tour of how it works.

In case that summary of what it does didn't make a lot of sense, let's take an example. Say you have /var/www on the host, a directory that has your HTML files, your website, and you want to run your web server inside a virtual machine, not on the host, maybe in order to get extra isolation. So you've got the files on the host, but you want to run the web server inside the virtual machine. We can do this with virtio-fs. Here's the libvirt XML that passes /var/www through to the guest (reconstructed below), and we can name the virtio-fs device; here we name it "website". Inside the virtual machine, all you have to do to set it up is mount the virtio-fs file system called "website" onto the directory where you want it. Now this virtual machine has that directory from the host mounted inside the guest, and applications run as normal. There's really nothing special after that; you just have access to the files. Obviously, the point of virtio-fs is that you only have access to that directory subtree on the host. You're isolated to it; you cannot break out and access all the other files. That's one of the key security requirements.

I'll give you a brief benchmark that I ran recently. Keep in mind that I'm comparing virtio-fs against virtio-9p, a previous shared file system that's in QEMU and has Linux support, and also against virtio-blk. Please bear in mind that these things have lots of configuration settings, and to be honest, if I wanted to, I could make this graph do lots of different things. They're completely different code paths and stacks, so it's hard to do a real apples-to-apples comparison. What you're seeing here is just the default setup of each of them, and this is how they compare. The main message to take away is that if you're already using virtio-9p, take a look at virtio-fs, because it should perform at least as well. virtio-9p has been available in QEMU for some time; virtio-fs is going to be available starting from QEMU 5.0, so the upcoming QEMU release will have it. The only other thing I'll say about these performance results is that there's obviously still a gap in some areas where virtio-blk (the yellow bars) can perform better. Also, I didn't enable the DAX feature, which I'm going to explain later in this presentation, because it's not upstream in Linux yet.
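Before moving on, here's roughly what that libvirt XML looks like. This is a sketch from memory, assuming a libvirt version with virtiofs support; check the libvirt documentation for the exact element names and for the shared-memory setup the device needs:

```xml
<!-- Share /var/www with the guest under the tag "website". -->
<filesystem type='mount' accessmode='passthrough'>
  <driver type='virtiofs'/>
  <source dir='/var/www'/>
  <target dir='website'/>
</filesystem>
```

And inside the guest, the mount step is a one-liner along the lines of:

```
mount -t virtiofs website /var/www
```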
Okay, so now let's start talking about how it all works.

The basic idea of a remote file system is that it has two main components: a transport and a protocol. The transport is the communications link. It allows a client to mount a file system from a server, so the files and directories live on the server and the client can access them. The protocol is the vocabulary the client has for expressing the things it wants to do with the files.

You can think of lots of examples of transports: TCP/IP, or maybe USB, or RDMA. They have different properties; some of them are message passing, some of them have shared memory. For the protocols, NFS is probably the most well known, and CIFS is also very widely used for Windows file sharing. Then you have other families of protocols, like the old FTP, or MTP, which you use to transfer files on your phone. The reason I bring them up is that, remember, the protocol is the vocabulary of what your file system can do, and FTP and MTP are transfer protocols, not file system protocols. They focus on getting and putting entire files, kind of like object storage, kind of like Amazon S3. NFS, on the other hand, implements POSIX file system semantics, where you can open a file and then do reads and lots of other operations on a more fine-grained basis. What your protocol allows you to express is critical for application compatibility and also for performance, so we wanted to get that right; we wanted to choose the right kind of protocol.

Okay, so now that we have this framework, how does virtio-fs fit in? How does it work? The protocol in virtio-fs is based on Linux FUSE. But it is not just an existing FUSE file system run over virtio: there are some changes to the architecture, and we've also extended the protocol, adding some features to FUSE to make it work for virtio-fs. The transport is virtio. The only interesting thing there is that we also extended virtio: we added shared memory resources to it, and I'm going to go into how virtio-fs can take advantage of shared memory to do things that network file systems can't do. So here's the picture; you get the basic idea that virtio is the transport, and it allows the client and the server to communicate. In the case of virtio-fs, the server is called virtiofsd. It's a process that runs on the host in order to emulate the virtio-fs device.

Okay, so to get into this a bit more, I'll give you a short overview of FUSE, the Linux file system protocol that we started from. The guest drivers also reuse that code, so there's a lot of shared code between virtio-fs and Linux FUSE. FUSE is a user space file system interface; that's its purpose. Basically, when a user space application accesses a file, say we want to open the file "foo", that application makes a system call. The FUSE kernel module handles that system call, because it's responsible for that mounted file system. But it doesn't know what to do with it by itself. Instead, it sends a message, a FUSE_OPEN message, to the file system server, which is running as a user space process. That's basically the FUSE kernel module's role: it forwards what are essentially system calls to a user space process that implements the file system.
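To make that concrete, here's a minimal sketch of how that exchange is framed on the wire, using struct names from the kernel's FUSE header (more on that header in a moment). It's illustrative, not a complete message definition:

```c
/* Sketch of the FUSE_OPEN exchange, using types from <linux/fuse.h>. */
#include <linux/fuse.h>

/* Kernel -> server: every request is a common header followed by an
 * opcode-specific body. */
struct open_request {
    struct fuse_in_header hdr;  /* hdr.opcode = FUSE_OPEN,
                                   hdr.nodeid = inode to open,
                                   hdr.unique = request ID */
    struct fuse_open_in body;   /* the open flags (O_RDONLY, ...) */
};

/* Server -> kernel: every reply is a common header followed by an
 * opcode-specific body. */
struct open_reply {
    struct fuse_out_header hdr; /* hdr.unique echoes the request ID */
    struct fuse_open_out body;  /* body.fh is the file handle used by
                                   later reads and writes */
};
```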
FUSE was merged in 2005, so it's very mature and widely available. Even if you haven't used it directly, you've probably still used it somehow. For example, when I connect my phone, the MTP protocol implementation that gets used is a FUSE file system. It's a user space file system, not a kernel driver.

One of the nice things about FUSE, and one of the reasons why we chose it for virtio-fs, is that it's very closely aligned with what native Linux file systems can do, because it's a Linux kernel module. So it's not just a POSIX file system: it has that entire vocabulary we need in order to be compatible with existing applications. It also has the Linux extensions; new features like copy offloading and things like that can be done through FUSE, which is great. And it's extensible: it's a protocol that you can add new features to.

Let's look at the protocol a little bit. The header file is <linux/fuse.h>, the kernel header, so if you have it installed on your machine, you can take a look. That file has all the constants and the structs for the FUSE protocol. One thing about this header file is that it's undocumented. You basically just have the constants and the structs, which is not great if you want to write a virtio-fs implementation or some kind of low-level FUSE application; you do have to look at the FUSE source code to understand it. However, this file is user space ABI, a kernel ABI that's exported to user space, which means it's stable. The kernel is not allowed to change the definitions of these structs and this protocol, and if you upgrade software versions, they will remain compatible. That's really, really important. It means that if you have a virtual machine running virtio-fs and you upgrade your host, or vice versa, say a new virtual machine on an old host, they will still be able to communicate, because this protocol is stable and has feature negotiation and all these things. And by the way, if you're wondering why people even use FUSE if it's undocumented: most people use FUSE bindings, like the Python bindings or the C library for FUSE, and those do have documentation. So even if the low-level stuff doesn't have documentation, you can look there as well.

Okay, so let me tell you a little bit about traditional FUSE, which virtio-fs builds on, and then we can look at how we mapped it all to virtio. As I mentioned before, the file system runs in a user space process. The way this works is that the FUSE kernel module provides a /dev/fuse character device. The file system server opens that character device, reads the protocol messages it needs to process from it, and writes the responses back to it. So that's usually what a FUSE server is doing: it's reading requests from /dev/fuse and writing the responses back. The only exception is server-initiated messages, called notifications. They're rarely used, but there are some features in FUSE where the server can actually send a request to the client and get a response back; it just inverts that relationship.
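Here's a minimal sketch of that server loop, assuming the file system is already mounted and the server holds an open /dev/fuse file descriptor. Error handling and real request dispatch are omitted:

```c
/* Minimal sketch of a traditional FUSE server's main loop. */
#include <linux/fuse.h>
#include <errno.h>
#include <unistd.h>

static void serve(int fuse_fd)          /* an open /dev/fuse fd */
{
    union {
        struct fuse_in_header hdr;
        char bytes[FUSE_MIN_READ_BUFFER]; /* constant from <linux/fuse.h> */
    } buf;

    for (;;) {
        /* Each read() returns exactly one request. */
        ssize_t n = read(fuse_fd, &buf, sizeof(buf));
        if (n < 0) {
            break;
        }

        /* buf.hdr.opcode says what to do: FUSE_LOOKUP, FUSE_OPEN,
         * FUSE_READ, ... A real server dispatches on it; this
         * sketch just replies "not implemented". */
        struct fuse_out_header out = {
            .len = sizeof(out),         /* header plus any payload */
            .error = -ENOSYS,
            .unique = buf.hdr.unique,   /* ties the reply to its request */
        };
        write(fuse_fd, &out, sizeof(out));
    }
}
```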
Okay, so what's the design of the virtio device, the device model that the virtual machine sees in order to access the file system? The main concept in the virtio device model is the virtqueue: a message queue that allows a driver inside the guest to send messages to a device that's implemented on the host.

So the main thing that we really need is a requests virtqueue where you can place FUSE messages. FUSE requests are sent by the guest driver, placed onto the virtqueue, and then the device, the virtiofsd process, sees those requests, processes them, and sends back responses. There's one quirky thing about FUSE: it supports request priorities, because you can cancel requests. If you have, say, 50 requests queued up and ready to do some I/O on your file system, but then you kill an application, you might want to cancel the requests that application has in the queue, because that application is being terminated anyway. For that there's a FUSE_INTERRUPT message. But one problem when mapping this to virtio is that virtqueues are append-only: you can't go back and modify the vring in memory once you've put things in there. So we had this problem: how do you send that FUSE_INTERRUPT message to the device when there's a bunch of traffic queued up in front of it? We just added a high-priority queue; the FUSE_INTERRUPT message, and anything else that needs to be prioritized, goes there. The final virtqueue that has been added is a notifications queue, which handles the server-initiated communication I mentioned before, because that goes in the opposite direction.

The only other thing to know about the virtio-fs device is that in the configuration space, which is an area of data that the device exposes to the guest driver, there's a tag. That's the mount identifier, or the file system label, or whatever you want to call it; we call it a tag, and it's how you name your device. You can have ten virtio-fs devices attached to one virtual machine, so how do you know which one is which file system? You have to give them names. That's how this works.

Okay, now let me give you a quick overview of how the protocol and the communication work. Say you want to read a file and the driver hasn't started up yet. The first thing the driver does is send a FUSE_INIT message to create a new FUSE session. This negotiates a few parameters and prepares you to send FUSE commands. Once the FUSE session has been started, you can issue normal FUSE commands.

One thing that might be a bit unusual: if you're used to user space programming, you're used to using paths to open files. You get back a file descriptor, and then you use other syscalls to do things with the file. But POSIX file systems actually have another concept, another layer of indirection, because when you look up a path, a single file, say /etc/passwd, can actually have multiple file names pointing to that same one file. That's what hard links are: if you create hard links, you're giving one file multiple names. And that's why POSIX file systems distinguish between paths and inodes. The inode is really what the file is, and you can have multiple directory entries pointing to the same inode. The even more fun thing is that you can have inodes with no path name at all, because you can open a file, delete it from the file system, and yet still access that file. Some applications make use of this. It basically means that you need to separate the concept of an inode from a path. In user space programming you're often doing things with paths, but the way FUSE works is that first we need to look up the path and resolve it to a node ID. Once we have the node ID, we can do other things, like opening the file and so on. So that's what you're seeing here: you send a FUSE_LOOKUP, you look up that path, and what you get back is the node ID, the inode. Then you can do things to it. You can open it, which gives you a file handle (something like a file descriptor), and then you can read. That's the basic flow. So if you wanted to implement, say, a bootloader that can go into a virtio-fs file system and grab a kernel, these four messages are basically all you would need to implement.
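Here's that four-message flow as a sketch. The opcodes, constants, and structs are the real ones from <linux/fuse.h>; the send_request() helper, which submits one request on the requests virtqueue and waits for the reply, is hypothetical, as is the error handling I've left out:

```c
#include <linux/fuse.h>
#include <stdint.h>

/* Hypothetical helper: submit one FUSE request on the requests
 * virtqueue and wait for virtiofsd's reply. */
void send_request(uint32_t opcode, uint64_t nodeid,
                  const void *in, void *out);

void read_file_flow(void)
{
    struct fuse_init_out init_out;
    struct fuse_entry_out entry_out;
    struct fuse_open_out open_out;
    char data[4096];

    /* 1. Start a session and negotiate parameters. */
    send_request(FUSE_INIT, 0, NULL, &init_out);

    /* 2. Resolve the path to a node ID (FUSE_ROOT_ID is the root
     *    directory's well-known node ID). */
    send_request(FUSE_LOOKUP, FUSE_ROOT_ID, "foo", &entry_out);

    /* 3. Open the inode; the reply carries a file handle (open_out.fh). */
    send_request(FUSE_OPEN, entry_out.nodeid, NULL, &open_out);

    /* 4. Read using the file handle. */
    send_request(FUSE_READ, entry_out.nodeid, &open_out.fh, data);
}
```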
Okay, one thing about the flow I've shown you here, though, is that every single bit of I/O requires communication: the guest needs to talk to virtiofsd in order to read any piece of a file. Isn't there a way to just give the contents of the file to the guest and let it access them directly, without always having to communicate with virtiofsd? That would make things faster. And even more than that, a FUSE_READ or FUSE_WRITE command involves guest memory: you're copying data from the host into guest RAM, or, when writing, from guest RAM to the host. Is there a way to avoid these data copies? These copies are bad because, say you have a read-only directory that you want to share with 10 guests, or 100 guests. Do they all need to take their own copies of the data into guest RAM in order to process it? Wouldn't it be good if there were a way to share it?

So virtio-fs has an experimental feature called DAX. It's not yet upstream, but it's being worked on; it already runs, but it needs some additional performance optimization, more code review, and so on. What it does is this: instead of doing I/O where you copy data, and instead of sending commands for every single I/O and constantly communicating, it allows the guest driver to set up a mapping, mapping a region of the file into the memory space of the guest. The guest can then access the host's pages directly. The way this works is that the device has a memory region called the DAX window; it's a shared memory region. That's why we needed that virtio extension to add shared memory to virtio: this is what it enables.

So let me show you the flow; it's pretty similar. Say you want to read a file. Now, instead of doing FUSE_READ, you just set up a mapping once. Once you've set up that memory mapping, you can just do loads and stores, CPU instructions. You no longer have to communicate with the virtiofsd process at all, because you have those memory pages and you can access them directly. That's great if you want to do frequent I/O. And it also reduces the memory footprint when many VMs are sharing the same files, because they can share the same host page cache pages.
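From the guest driver's point of view, the DAX flow looks roughly like this. The FUSE_SETUPMAPPING message and its struct come from the DAX patches, which weren't merged at the time of this talk, so treat the details as provisional; send_request() is the same hypothetical helper as before:

```c
#include <linux/fuse.h>
#include <stdint.h>

/* Hypothetical helper from the earlier sketch. */
void send_request(uint32_t opcode, uint64_t nodeid,
                  const void *in, void *out);

/* dax_window: the guest mapping of the device's shared memory region.
 * nodeid and fh come from FUSE_LOOKUP and FUSE_OPEN as before. */
char read_first_byte(const char *dax_window, uint64_t nodeid, uint64_t fh)
{
    /* 1. Ask virtiofsd once to map part of the file into the DAX
     *    window. (fuse_setupmapping_in as proposed in the patches.) */
    struct fuse_setupmapping_in map = {
        .fh      = fh,
        .foffset = 0,               /* offset within the file */
        .len     = 2 * 1024 * 1024, /* length to map */
        .flags   = 0,               /* a read-only mapping */
        .moffset = 0,               /* offset within the DAX window */
    };
    send_request(FUSE_SETUPMAPPING, nodeid, &map, NULL);

    /* 2. From now on, reads are plain CPU loads from the window;
     *    no further messages to virtiofsd are needed. */
    return dax_window[map.moffset];
}
```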
Okay, and I think the final thing I want to say is that the virtiofsd implementation that's available on the host today takes a subdirectory and shares it with the guest. However, virtio-fs is a standard device specification, so you can implement your own virtiofsd. Maybe you have a distributed storage system that you want to integrate, or maybe you want to export a synthetic file system that has some data that's relevant to your guests and that you would like to make available to them. You can do that with virtio-fs, similar to how you can write FUSE file systems to build your own custom file system. Either you can use the virtiofsd code base as your starting point if you want, or, if you want to go low-level or write it in a different language, just look at the virtio-fs device specification, which covers how the device works, and you can implement your own thing. You're welcome to come chat; I'd love to talk about this if you have use cases that you think would be good for it. That's it. Any questions?

Yeah, so the question was: can your entire guest file system be virtio-fs? Yes, it can. You can boot directly from virtio-fs, and that's really nice, because it's kind of like setting up a chroot environment or a container environment: if you have a path with your full root file system in there, you can just boot from it. The only thing is that right now the syntax for doing it is a little bit awkward, because Linux does not have a nice syntax for it yet. I have sent a patch, and it needs more polishing, to add a nice syntax (see the sketch below), but it can be done, and it's convenient; it's nice.

Okay, yeah, so the question is about using this shared memory feature. By the way, it's optional: you can run virtio-fs without DAX, and then you don't use the shared memory at all. The virtio device model itself doesn't assume shared memory, so this is kind of a special thing. And virtio has several different transports. There's a PCI transport, which is used mostly on x86; there's an MMIO transport; and for the s390 mainframe there's another one. Whether shared memory is available depends on your transport, so you might not be able to use this on all transports. Were you asking about whether the data layout in that shared memory region is standardized? Okay, yes.

Okay, so the question was: this shared memory thing that virtio-fs is using, is it specific to virtio-fs, or is it a general virtio feature that other virtio devices could use? Yes, it's generic. The way it works is that in the virtio spec, any device can have a number of shared memory regions that perform different functions, and how they're mapped, whether that's done over PCI and all those kinds of things, is defined per transport. Any device can use it, and others will; I think virtio-gpu already has a use case for it. One more?

With permissions on the files, it was quite a mess to map permissions between the host and the guest. How is it working with virtio-fs? Is it different, or do we still have a mess? Okay, so yeah, the question is how permissions work, maybe how UIDs and GIDs are mapped. Because you have your guest kernel, and in there you have root and various users, and the host is a different system that can have a different set of users. How do you map UIDs and GIDs? Well, at the moment, what virtiofsd does is that you give it a subdirectory and you allow the guest to set the UIDs and the GIDs within that subdirectory, so that normal applications can run, so that a full system can run with different UIDs and so on. The UIDs just pass through one-to-one. If you want to do some kind of mapping, you would have to do something else, like some kind of UID shifting; virtiofsd itself does not implement that, but there are other ways, like using kernel modules and so on, to stack it on top and map the UIDs.
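Coming back to the root file system question for a moment: the idea of that patch is to let you point the kernel at a virtio-fs tag directly on the kernel command line, along these lines. This is a sketch of the intended syntax, and it may change before the patch is merged:

```
root=website rootfstype=virtiofs rw
```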
Yeah, so with SELinux: the question was, does virtio-fs support SELinux, and how does that map through? I can't give you an exact answer. I think if you enable extended attributes, that's the starting point; the feature is there, but I'm not sure how well the pass-through works. I have not tested it. But I'm sure that's a very important feature, so I'm sure it will come; I'm sure it will be solved. One more?

Yeah, that's right. So, the question was: what happens when a file is opened and then unlinked, and that file descriptor continues to be used? As I mentioned earlier, the FUSE protocol supports these POSIX semantics. So what happens is that you can unlink the directory entries and you still have the inodes, and you still have the file handles, and they continue to work (a short demonstration of this follows below). Yes. Thank you very much.
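For reference, that open-then-unlink behavior in a few lines of C:

```c
/* An inode can outlive its directory entry: the file stays usable
 * through the open file descriptor even after unlink(). */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("scratch.tmp", O_RDWR | O_CREAT, 0600);
    unlink("scratch.tmp");       /* no path points at the file anymore */
    write(fd, "still here", 10); /* ...but the inode lives on */
    return close(fd);            /* inode is freed once the last fd closes */
}
```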