Hello. As a native French speaker I don't practice English very often, so I hope you will be able to understand what I say; if you don't, you can try to use the slides to follow me. So the idea is that today we can keep using NFS, which is old, or we can look for something better. And what could something better do that NFS doesn't do today? For example, it could offer high availability, which means that if a server goes down, the clients can still access the data, and you want this without a single point of failure. You want storage clustering, where you can add storage servers to increase your storage volume. You want elasticity, which means doing this without any downtime. You want, of course, POSIX semantics. You want file locking, you want performance, and you want secure communication, which means cryptography.

And what do we have to do this? Well, for a long time, file systems used to be implemented in the kernel, which means they were not portable at all, and development cycles were quite long, because developing in the kernel is not easy. More recently we have seen file systems in userspace appear, where the file system is a single process and the kernel is a client of that process. We've got message exchanges between the kernel and the file system, which is a bit like a microkernel architecture.

So what file systems do we actually have? Well, we've got Lustre, we've got Hadoop's HDFS, we've got GlusterFS, we've got XtreemFS; we've got lots of file systems. And if we want to move to one of these file systems, we need either a kernel module for it or a FUSE file system. Some file systems have both a kernel module and a FUSE interface, but more and more we see FUSE-only file systems. After looking at a few file systems, it seemed that GlusterFS was an interesting target, and it requires a FUSE interface.
So the goal of my project was to run GlusterFS, and obviously it was to be through the FUSE interface. So what is FUSE? Initially it was a Linux-only thing, and it became a de facto standard. Now you have FUSE on FreeBSD, you've got FUSE on Mac OS X; I heard that you even have FUSE on Windows, which is a bit strange. In a FUSE file system, you've got the kernel, which is the client of the FUSE process: it sends requests to the FUSE file system and the FUSE file system replies. You've got a message passing interface over a character device called /dev/fuse. You've got a user library, libfuse, which is there to handle the common work so that the file system is not too complicated to write. And you've got three APIs for a FUSE file system, which seems a bit insane, but it's the way it is.

In this diagram, you see how the thing works. Everything starts with a process, for example here ls, which issues a system call, for example read; it does this through the libc. Once the system call is executing inside the kernel, it reaches a layer called the VFS; I'll talk about it a bit more in the next slide. Basically, it's a switch that lets the kernel use different file systems: it can be the ext3 file system, or it can be a FUSE file system. If it's a FUSE file system, we've got a FUSE stub inside the kernel, which will send the request through /dev/fuse to userland. There you've got three options: you can have a file system that is directly connected to /dev/fuse, using the kernel API, or you can have a file system connected through libfuse, using either the low-level API or the high-level API. So you've got three APIs.

The VFS was introduced by Sun a long time ago. At that time, the idea was to have the ability to use both UFS and NFS. So you can think about it as a standard interface inside the kernel to reach any file system, and below that interface, each file system implements some methods.
So in the VFS, you've got some objects, like mounts and vnodes; vnodes represent files or directories. And you've got methods: lookup to find a vnode, open to open it, read and write, which are self-explanatory.

What about NetBSD? NetBSD didn't have FUSE when I started to look at GlusterFS; it had PUFFS. PUFFS, which is very hard to pronounce, is similar to FUSE, but not compatible. It started when FUSE was not an obvious standard: it was not obvious that everyone would want to use FUSE. And it seems that PUFFS is there to stay today, because it still has merit as a native interface: a native interface will always fit your own VFS better than a foreign one, and if you want to add a new feature, you don't have to talk to many people; you can do it within the project.

How does it work? Exactly like FUSE. You've got your process that does a system call; it goes to the VFS, which sees that we are talking to a PUFFS file system, and the request goes to userland through /dev/puffs. And here you've got only one API, which seems a bit more reasonable. So it looks exactly like FUSE: the name of the device is /dev/puffs instead of /dev/fuse, the name of the library is libpuffs instead of libfuse. But despite the similarity, it is still very different: the message passing protocol is different and the API is different. At least for PUFFS, we've got one API, only one.

So PUFFS is there to stay because it has merits, but FUSE is highly desirable, because today people write file systems using FUSE, so we need FUSE on NetBSD. A first attempt was done, called ReFUSE, which is a nice pun: it is an implementation of FUSE on top of PUFFS. How does it work? On top of libpuffs, you have librefuse, and on top of librefuse, you have a FUSE file system using the high-level FUSE API. But it's only the high-level FUSE API, which means that we lack two APIs.
So we don't support file systems that use the low-level API, and we don't support file systems that use the kernel API. And GlusterFS, which was my own target, does exactly that: it opens /dev/fuse and talks to the kernel directly, without using libfuse. So ReFUSE is not enough and we need something better.

The new project is called PERFUSE. The idea is to implement the FUSE kernel API; once we have it, we can just use the stock libfuse on top of it, and we support everything. How does it work? Here I omitted the first part, where you've got ls doing a read system call; I think you understand the concept by now. The request that used to go from the kernel to userland through /dev/puffs and libpuffs now goes through libperfuse. libperfuse is used by a daemon called perfused, whose job is to translate the PUFFS requests into FUSE requests and send them to a /dev/fuse socket, so that the FUSE file system can use it directly through the kernel API, or through libfuse using the high-level or low-level API.

But there is one small problem: I said it was a socket, and a socket doesn't have exactly the same semantics as a character device. For example, you can open(2) and mount a character device, which is impossible for a socket: you have to use the socket, bind and connect system calls. So we need to cheat a bit and use some defines to wrap the open: a perfuse_open function does the socket/bind/connect stuff, and it is defined as open in the PERFUSE header, so if you include the PERFUSE header, things just work. libfuse was modified to support NetBSD and the changes went upstream, so today we can just use the standard libfuse and it works. Later I realized that after all I didn't really need the /dev/fuse socket, and it was replaced by an anonymous socket created with socketpair(2).

Now I'll talk a bit more about the VFS, because I need it in the next slides, for anyone that hasn't heard about it in depth.
So the idea is that we take the VFS operations and translate them into requests for PERFUSE. The VFS operations start by mounting the file system; the root vnode is obtained at mount time. Then we use the lookup method to find other vnodes: you give lookup the name of a node and you get a reference on a new vnode. You've got getattr and setattr methods to get and set the metadata for a file, for example the mode and the owner, and open, read and write to operate on a file or directory.

Here is some pseudo-code that shows how it works. First you mount the file system and you get the root vnode. Then you use the lookup method on the root vnode to look up something called foo, and you get ENOENT because it doesn't exist. Then you create it, look it up again to get a new vnode, and now you can open it and read from it.

Sometimes VFS operations are not really obvious. For example, in NetBSD we've got the release operation, which is called at close time; we've got the inactive operation, which is called when the last reference on a file is dropped; and we've got the reclaim operation, which the kernel sends when it will not use the vnode anymore and its memory should be freed. Linux has only release and forget. It's a bit tricky, because as you can see in this example, after the release method is invoked you can still have some read calls that happen: you close your file and there is still something to read. So release, which is called at close time, doesn't mean that you will not see reads afterwards; it means we've got to wait for inactive to close files.

Now I will make a nice list of the funny things that I encountered during this project, bugs and traps. Here is the list; let's go through them one by one.
So first, another problem about socket semantics. A socket can be of type stream, which is non-atomic and reliable, or of type datagram, which is atomic but unreliable; and my problem is that a character device provides message semantics: messages are sent atomically and reliably. So I couldn't use a stream socket (I tried), because the kernel can split a request into several packets. I couldn't use the datagram socket type because it's not reliable. I'm not sure it has to be unreliable for local communication, because the sender could just sleep waiting for the receiver to get the data, but changing that would have been quite intrusive, with the risk of breaking things. So I preferred to implement the SOCK_SEQPACKET socket type, which is an atomic and reliable socket type: exactly what we want. And fortunately someone had already done it; it had been sitting in a bug report for, I think, five years without being committed, so I just had to make the code build again and we had something that worked.
Another problem, which is specific to GlusterFS, is that it makes heavy use of extended attributes on the server side, and NetBSD had no support for extended attributes. Well, it had some support, which came from FreeBSD's UFS1 code, but it was very broken and didn't build, so I had to make it work again. This code uses sparse files to store attributes, with one backing file for each attribute, so if you want to use a new attribute you have to create a backing file; I had to add code so that the backing file can be auto-created. I added support for copying extended attributes and preserving them in the cp and mv commands.

Then I had a question: the code coming from FreeBSD used the FreeBSD API, but there is also a Linux API for extended attributes, and we had to choose between the two. Finally I didn't choose, and we implemented both, which turned out to be a good idea, because porting software is much easier now that we support both.

There is still work to be done for extended attributes. We need support in a few of our utilities, for example backup utilities, because for now NetBSD is not able to back up extended attributes, which is a big problem: extended attribute support impacts dump and restore. It would be nice to be able to copy extended attributes over the network using scp and rsync. rsync has support for it using the Linux API, so I can do that easily; it's just a build option to set up. scp is a bit more of a problem; I know that Mac OS X implemented it, so someone will have to work on getting it upstream. And there are some commands that could benefit from extended attributes, but I'm not sure it's wise to do so: tar and cpio, because if you add extended attributes to the format, you somewhat break the standard, which doesn't expect them, and I'm not sure what will happen with the binaries that will extract those archives. Some other improvements have to be done: adding extended attributes to UFS2, where someone has to steal the code from FreeBSD, and we also need better storage for extended attributes.
Instead of sparse files, we could store extended attributes directly in the file system, like we did for quotas; here again, someone has to do the work.

Another funny bug, a race this time, on the GETATTR method. As you can see in this diagram, you send a GETATTR, then you write some data at the end of the file. You end up in a situation where the kernel has an idea of the file length (the length is three), and then the GETATTR reply comes with another value, which dates from before the write. The kernel will truncate the file, because it discovers the file is shorter than it expected; and when the write reply comes, the file size is extended to three again, but we've just zeroed everything between 0 and 3: we've lost some data. The solution to this race is simple: we need a mutex on the size.

Another problem, which is probably well known, is dirname(3) thread safety, or I should say thread unsafety. The problem is that the standard doesn't tell us whether dirname should alter the buffer it is given, or whether it should return static memory; both behaviors are standard compliant, so software will assume one behavior or the other. GlusterFS assumes the Linux behavior, which is to modify the buffer, while NetBSD uses a static buffer. Unfortunately there is no consensus on a thread-safe dirname_r, so the solution was to just add the dirname code from GNU inside the contrib directory of GlusterFS: they want the Linux version, just give them the Linux version; don't try to be compatible in NetBSD.

Another funny problem: what happens if you call the link system call, which is used to make a hard link, on a symlink? Will you link to the symlink, or to the symlink's target? Here again it's not specified; the standard will not tell you what the good behavior is. Unfortunately, Linux will link to the target, while all the BSDs will link to the symlink itself. GlusterFS of course relies on the Linux behavior, and it's a very important feature for
migrating data from one storage brick to another, when you want to replace a server, so we really needed it. The solution is to use linkat, which is part of POSIX Extended API Set 2, a big family of new system calls that improve a lot of things. This new linkat is just like link, but with a flag, and the flag tells whether we've got to follow the symlink or not. In the NetBSD 6 branch we only implemented linkat, just to support GlusterFS, and finally in NetBSD 7 we've got the whole Extended API Set 2, except fexecve, which was not implemented because some people argued it was unreasonable security-wise.

Another problem, and it's not the last but the second to last, is with the pagedaemon in action. The pagedaemon is a kernel thread that is responsible for freeing memory when memory is scarce, and one of its jobs is to take vnodes that are using memory and flush them to disk, freeing memory. The problem is that when you want to do this with a PUFFS file system, you need to allocate the message: you need to allocate memory when memory is scarce. The memory allocator can either sleep, waiting for memory to become available, or not sleep, which means it will fail if there is no memory. And we had situations where the pagedaemon was allocating a PUFFS message and sleeping for memory: we end up in a situation where the kernel thread that is responsible for freeing memory is itself sleeping, waiting for memory, and the system locks up. Obviously, the pagedaemon must never sleep. So the solution, inside the PUFFS kernel code, is to check which thread is calling: if it's the pagedaemon, then we prefer to fail the memory allocation and return an error to the calling code instead of sleeping. What happens then? The pagedaemon will try to flush another vnode, which hopefully will not be a PUFFS vnode, and at some point it will have freed enough memory to flush a PUFFS vnode.

And now the last funny problem. This was not a complete list, I've got a lot more funny bugs, but the idea was to give you an insight into the various directions where we went.
GlusterFS uses swapcontext, a system call which lets you change the CPU registers, together with POSIX threads. It's a bit odd, because usually you use swapcontext to implement threads, but here they needed a different concept: tasks. The idea is that when GlusterFS sends a request to the network, it has a context which has to wait for the network reply, and it's a bad idea to send the thread to sleep waiting for the reply, because you would need as many threads as you have requests, and you've got a lot of sleeping threads. So what they do, using swapcontext, is to reload the context later, restoring the situation they were in when the request was sent.

And the big question is this: threads use a CPU register to store the address of the private data for the thread, the thread-local storage (TLS); should swapcontext touch this register or not? On Linux, the TLS register is preserved: swapcontext is thread safe. On NetBSD it was not very clear; in fact we had a machine-dependent mess where each port was doing a different thing. So what happened? You've got two threads, one thread that prepares a context and a second thread that swaps to it, and then you discover that both threads suddenly return the same value for their thread-local data, which means you are going to crash, because mutex locks will not work anymore. We had to fix that.

So what should we do with the TLS register? Obviously the behavior should differ depending on whether the program is linked with libpthread or not: if it's not linked with libpthread, we do not preserve it, we change it in swapcontext; and if we are linked with libpthread, then we preserve it. Some ports had an option in makecontext to decide whether the TLS register was to be restored or not, so what I did was add this option to all ports, so that the various NetBSD ports now share the same behavior.

That's all for my list. I could present bugs for a long time, but I've got to finish my presentation.
So, the to-do list. As I told you, we need to add support for extended attributes in dump and restore. We need better storage for extended attributes, because the current situation is just scary: we don't have any tool to check the integrity of the extended attribute storage if the file is corrupted, and we cannot save them in backups, so it's a real problem. There are a few FUSE features that should be implemented: FUSE negative caching, which lets the kernel know that a file doesn't exist so it doesn't ask again every second, and the notification mechanism, the idea being that the file system will be able to tell the kernel that a given file has been removed, for example. And perhaps there is CUSE, which is character devices in userland. I'm not sure about this one: it could be interesting if there were a lot of CUSE drivers to use, but I've not seen many, so I'm not sure it's worth working on yet.

Question: It might be a naive question, but I don't quite understand why you had to go through userland again. To me it would seem simpler to say: OK, I've got the PUFFS implementation, I just clone it and turn it into a FUSE implementation, considering there is some code in FreeBSD or something like that. What could be simpler about going through userland again, implementing a new socket type, making sure it works, and so on?

Well, the answer is that it's simply faster to develop in userland than in the kernel. In the kernel you crash and reboot, crash and reboot; it's very long. And in fact there are not many drawbacks to doing it in userland, because we are working with a network file system, so performance is mostly about the network, not about going from userland to kernel to userland to kernel. I have not measured it, but when you have to wait for the network, one more system call is not really a problem.

Question: Roughly how much code did you have to write for the PERFUSE implementation? How big is the libperfuse library, how
much code? I'd have to run wc on it; I don't know exactly. I'd say that it's a few thousand lines of C, but it's very simple code: you get a PUFFS request, you create a FUSE request, you translate the arguments. So it's not complicated; well, it's time consuming when you have to spot a bug. In fact the last bug on the list is not the trickiest, but it's the one that took me the most time to understand: the thing with the pagedaemon sleeping while waiting for memory was very difficult to find, because from the outside you just see that the system locks up.

Question: I have a question myself: have you measured the overhead of going back and forth to userland? Not yet, but I know that it's nothing compared to going back and forth over the network. Any more questions? By the way, there has been a heated debate between the people that did GlusterFS and Linus Torvalds. Linus Torvalds told them that the thing was a toy, because file systems should be in the kernel, and the GlusterFS people had just this kind of argument: if you have to wait for the network, it's not a problem to do a few more system calls.

Question: There were a lot of questions about the differences you found between GNU, BSD, and other Linux-specific behavior. Have you talked to upstream about it? Because GlusterFS upstream are themselves working on porting it to FreeBSD, so they might be interested in this feedback, to be able to fix things and make it more portable. I'm not sure I understood the question: you asked me if I sent my changes to GlusterFS, or just talked with them about those differences? I do talk with them, and I've contributed a lot of portability fixes to GlusterFS, and today NetBSD just works out of the box for GlusterFS: you just have to configure it and it just works. Now I'm working on passing all the regression tests for GlusterFS. Another example of communication with the GlusterFS team is that today there is a NetBSD smoke test on any commit in GlusterFS, so we know that it will not break the
NetBSD port, and soon we will know that the regression tests pass. So yes, I'm talking with them a lot; in fact I'm a bit of a GlusterFS contributor. Thank you for your attention. No more questions? OK, thank you all; the next talk is in 25 minutes.