So that is traditional loop mounting on Linux: you take a plain file, you associate it with a loop device, and you can mount it, including any partitions inside it. That isn't exactly what we're going to be talking about today, although conceptually it's very similar.

Now, traditional loop mounting is fine if it's a plain file, but it falls over very, very quickly. What happens if you've got a compressed disk image? This isn't a disk image that I've compressed: this is how cloud images are distributed by the Fedora community, and this is one that I've just downloaded from the Fedora website in exactly this format. I didn't modify it in any way. It's xz compressed. Of course, you could turn that into a loop device, and what would happen is that you'd end up with a loop device containing the xz-compressed data. It's my contention today that that is not what you meant by loop mounting this. What you'd want is for it to kind of transparently uncompress, so you can see the data inside.

We can do that using a command called nbdkit, which I'm going to talk about in a minute. That creates a running process. Now I use a command to associate that process with a loop device. It's a different command, nbd-client, but conceptually similar. You can see there, look at the size: 4096 megabytes, that's four gigabytes, so that's not the compressed size. That's the uncompressed size. The decompression is happening transparently: the kernel just sees an ordinary block device containing the uncompressed data, and I can go and look at what's inside it, partitions and all.

How does this differ from loop mounting? Well, in both cases we've got a kernel module. On the left-hand side, loop.ko is a kernel module. It's configured using a command line utility, losetup, and you use that to create Linux kernel block devices, like /dev/loop0. On the right-hand side, I've got a kernel module, nbd.ko. It's configured using a command line client, nbd-client, and it creates loop devices, or rather Linux kernel block devices: /dev/nbd0, et cetera. But the back end, as you can see, is a little bit different. On the left-hand side, the back end is talking over internal Linux kernel APIs, like the VFS, to the file which is associated with the loop device. On the right-hand side, we've got a user process running. And this is critical: we've got a user process. In this case it's called nbdkit, but I should say that other NBD servers are available, and other very, very good NBD servers are available. And the kernel is talking to that process over a TCP port or a Unix domain socket, as you require.

Now, I'm going to demonstrate in this talk nbdkit, which is an NBD server that I wrote with a guy called Eric Blake, who's a brilliant free software hacker. nbdkit is slightly different from other NBD servers in that we have this plug-in API. It's a stable API, which means that you can write a plug-in now (or indeed you could have written a plug-in back in 2013, when we started the project) and it would still compile with nbdkit now, and it will still compile in the future as we go on. We're not going to break plug-ins at the source level. It also has an ABI guarantee.
That means you can compile your plug-in and distribute it separately from nbdkit as a binary, and load it into nbdkit at some point later. We're not going to break that: even as we evolve nbdkit, and in fact as we evolve the API, we don't break source or binary compatibility.

If you don't want to write a plug-in (and I'm going to show you in a minute how you can write one, and you'll see it's very simple), many other plug-ins are available, and I've listed the ones which are in nbdkit 1.10 here. Some of these plug-ins aren't quite like the others. These are plug-ins like Perl and Python, and they're basically gateways to writing plug-ins in non-C languages. You can write plug-ins in scripting languages, even in shell script, if you're not very happy writing C plug-ins.

Now, the other concept that nbdkit has is filters. You can think of a plug-in, if you like, as a kind of data source, a source of disk images, whereas filters apply modifications, changes, to that data source. As an example here, the partition filter: if your source is a whole disk image, a partitioned disk image, but you only want to serve one of the partitions over NBD, you can apply the partition filter, which selects a partition. Each running nbdkit instance must have exactly one plug-in running in it, but it can have zero or any number of filters.

In this case, I've selected the file plug-in, so my source is a local file, but as it's a compressed file, I'm going to use the xz filter on top to transparently uncompress it, and then I'm going to apply the partition filter to select a partition, and then I'm going to apply the cow filter, because I want to make a writable overlay, which I can save out to a qcow2 file later. This is how you would express that on the nbdkit command line. You put nbdkit, the name of the program, then the list of the filters. Now you might think of these filters, if you like, as being in reverse order of the distance they are from the plug-in, if you see what I mean. Or another way to think about it is that when an NBD request comes into the server, it travels through the filters in this order. Then at the bottom here I've got the name of the plug-in, and then any parameters that the plug-in needs: obviously the file plug-in needs to know which file you want to serve, so I give it the disk name, the file name. Filters may also require parameters. In this case, the partition filter wants to know which partition you want to serve, so you have to give that as a parameter.

Now, I wanted to demonstrate actually writing a plug-in, and I obviously want to do it very quickly so that I don't bore you. I was trying to think, what could I do to demonstrate writing a plug-in? I thought I'd write a test device, so I'm going to write a Linux kernel block device to test the badblocks command. Now, it's quite a young audience here, and we haven't needed the badblocks command for a really long time, perhaps not since we've had IDE disks in the early 90s, but before then, old grey people will remember RLL and MFM disks. Everyone's looking a bit confused. Floppy disks, remember those? No? On those systems, when there was an error on the surface of the disk, that would appear up at the file system layer, so you had to run the badblocks command first to find these bad sectors. It would produce a list of blocks which were bad, and you would pass it over to mkfs, and mkfs could then work around them.
Anyway, so that's the badblocks command. This is the device I'm going to write to test that. It's going to be a big virtual device, and it's going to have a bad sector somewhere in it. The idea is that whenever the kernel reads from that bad sector, so whenever a request contains the bad sector, it's going to return an error, but any other place in the disk that it tries to read is going to return some data. So it's a nice and simple demo. Let's write that now. What's a good language for writing Linux kernel block devices in? Bash. Yep, bash.

The first thing nbdkit is going to do is send me a request for the size of the disk. So I'm just going to return any size; it doesn't matter, 64 megabytes is fine. And then nbdkit will send me a request each time there's a read. The request is called pread, and the parameters for that are: $1, which is the literal string pread; $2, a handle, which we're not using here; $3, the size in bytes; and $4, the offset in bytes of the request.

Right, the error case. The error case is if my request contains the bad sector, or bad byte. I'm going to put the bad byte at 100,000. So if my offset is less than the bad byte, and the offset plus the size is bigger than the bad byte, that means the bad byte is in the request. Agreed? You've done pair programming? Well, this is pair programming with a hundred people looking over your shoulder. Okay, so my offset, which is $4, less than the bad byte, which was at 100,000, and the offset plus the size, so $4 plus $3, greater than the bad byte. I've got the right number of zeros there. So this is my error case. I just have to echo the error name that I want, EIO, and then something that goes in syslog, and I have to send that to stderr, and I have to exit with an error code. Right, so that's the error case.

The other case is where I'm just reading somewhere else on the disk, and I have to return a block of size bytes back to nbdkit. I'm going to return just zeros; it doesn't matter what I return. So I use dd from /dev/zero, and I want to return exactly $3 bytes, so if I set the block size to $3 and the count to 1, that should return that number of zeros, right? So yeah, that's it. That's my complete bash script Linux block device (the full script is sketched below, after this demo).

To run that, I run nbdkit, the name of the plug-in, which is sh, and the sh plug-in needs the name of the script which I've just written. So, moment of truth here. If I associate that... Right, a good thing there is the size, 64 megabytes. Remember we set the size to 64 megabytes? So that's good. And now if I run badblocks... that worked. Now you might say, why has it printed out four numbers when there was only one bad block? The reason is that badblocks reads the disk in 4K chunks, but its output has to be in 1K blocks, so when it hits a bad 4K chunk it just says, well, there must be four bad blocks. It doesn't go any deeper and try to work out which of the blocks is actually bad; that's just the way badblocks works. So it's good: we've proven here that badblocks doesn't have any bugs, even though nobody's used it since 1992. All right. Thank you very much.

Okay, so you don't have to write plug-ins. You can use existing plug-ins. We've got loads there. I don't know what to demonstrate.
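For reference, the complete plug-in from that demo looks roughly like the following. This is a sketch reconstructed from the description above, following the conventions of nbdkit's sh plugin; the script file name is made up, and the 64M size and the bad byte at offset 100,000 are the values used in the talk.

    #!/usr/bin/env bash
    # nbdkit sh plugin script. nbdkit invokes it as: script <method> <handle> [args...]
    bad=100000                      # offset of the single bad byte

    case "$1" in
        get_size)
            echo 64M                # size of the virtual disk
            ;;
        pread)
            count=$3                # $3 = request size in bytes
            offset=$4               # $4 = request offset in bytes
            if [ "$offset" -le "$bad" ] && [ $((offset + count)) -gt "$bad" ]; then
                # Error case: this request covers the bad byte.
                echo 'EIO bad block on virtual disk' >&2
                exit 1
            fi
            # Normal case: return exactly $count bytes of zeroes on stdout.
            dd if=/dev/zero bs="$count" count=1 2>/dev/null
            ;;
        *)
            exit 2                  # any other method: not supported
            ;;
    esac

Running it would look something like this; the exact nbd-client invocation depends on your setup, and this assumes nbdkit listening on the default port on localhost and a free /dev/nbd0:

    chmod +x badblocks-test.sh
    nbdkit sh ./badblocks-test.sh
    sudo nbd-client localhost /dev/nbd0
    sudo badblocks -v /dev/nbd0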
So I'm just going to demonstrate, pretty much at random, the floppy plug-in and the memory plug-in, which is a RAM disk. The floppy plug-in first, real simple. We use nbdkit, the name of the plug-in, which is floppy, and any directory; this happens to be the directory of the source code of nbdkit. And the same old nbd-client command to associate that with a loop device. And it should pop up in a second. There it goes. So that popped up. That is a floppy disk image which contains the source of nbdkit. What exactly happened there? Well, what I did was I took a file system from my host, I turned it into a floppy disk, a FAT-formatted, MBR-partitioned disk image, and I loop-mounted it on my host again.

Why is that useful? Well, one thing you can genuinely use this for is to export disk images really, really easily to virtual machines, or to containers. Some container systems let you just export a disk image and it gets loop-mounted inside the container. Where this is actually really, really useful is not the loop-mounting case, though; it's when you're running a PXE server and your PXE client needs to be given a root file system. The traditional way you do that is you create a massive initramfs that you TFTP over to the client at boot time, which is slow; TFTP is unreliable, it's not encrypted, et cetera, et cetera. NBD is encrypted and authenticated, it's a super efficient protocol, and it's just a much better protocol, because it only fetches the bits that it actually needs to read, so you can have much bigger root file systems. So it's a kind of useful thing to do.

Okay, my next demonstration is a RAM disk. Linux, of course, has a RAM disk driver inside it. It is, however, much more convenient to be able to write RAM disks in user space. In this case, we've written a simple RAM disk called the memory plug-in. It's implemented using a sparse array, so it's not limited by the size of RAM, only by virtual size. You can actually create really, really, really massive disks. And in this case, I'm going to create the most massive disk that you can create: the biggest that Linux supports, essentially, until we eventually move to 128-bit block sizes. This is 2 to the power 63 minus 1 bytes, the largest signed 64-bit integer you can have.

How big is 2 to the power 63 minus 1 in terms of disks? Well, I went on to Amazon to try and work out how much it would cost you to buy that many disks, and it turns out that's 300 million euros. I was very disappointed that Amazon doesn't actually let you create an order for 300 million euros that I could have screenshotted here, because the field just isn't big enough. It doesn't let you do that. But anyway, it's 300 million euros on Amazon. We can just create it here much more cheaply. Associate it with a loop device. You can see the size there is just massive. And I'm going to use GPT for partitioning, because MBR is limited to 2 terabytes. So just all defaults. Yeah, that's all partitioned. It looks like 8 exabytes; it's actually 8 exabytes minus 1 byte.

And I'm going to use Btrfs. Now, what are my other choices? I could have used ext4, couldn't I? What's the limit of file systems in ext4? Nobody knows. It's 1 exabyte. So we'd have 7 exabytes wasted. XFS is possible, but XFS has quite a high metadata overhead. Actually, that's unfair on XFS: XFS has a really nice low metadata overhead, but it's about 1%.
1% of 8 exabytes is too big for my laptop, unfortunately. So I'm going to use Btrfs. And you can just see there, Btrfs, absolute champ: it totally just creates an 8 exabyte file system. And I can mount it, and, you know, I've got 8 exabytes. I can just go in there and show you a bit; I mean, I played around with this and...

Sorry, I missed that question. Okay, so the question was, how many bugs, in anything actually, do you hit when you try to use the very last block, which is only 511 bytes long? The answer is you definitely hit bugs in qemu. qemu can't handle that case. But this is fine. So, yeah, you can create, you know, Btrfs subvolumes, btrfs subvolume create or whatever, and btrfs filesystem df, I think, et cetera. And, you know, it just works. It's great. And of course the next thing is that when I click to the next slide, that's gone. This software I'm using will kill nbdkit, everything's destroyed and it goes away. So it's great for testing.

And there are other things that are useful for testing. There are some plug-ins which are very useful for testing, and some filters, which I'm going to talk about now, which are super useful if you're testing file systems or the limits of file systems.

The first filter I'm going to talk about, which is useful for testing, is the delay filter. You can inject delays into NBD requests, and you can specify the number of seconds or number of milliseconds. This is useful if you are testing, say, a distributed file system. You want to test it all on one machine, but you want to simulate the effects of having a really, really remote node or something like that that has a long delay. You just inject delays into that device to simulate it. So it's a very simple filter.

This filter is also a lot of fun: it's the error filter, and it injects errors. So there's an obvious use for testing here. There are two ways to use this. The first way here is we say, you know, we want a particular error and we want a generalised error rate of 10%, which means that a random 10% of requests are going to fail. However, I think the second way of doing this is more useful for most people. Here we're saying, okay, the error rate is 100%, so 100% of requests are going to fail, reliably, all the time. However, it's gated on the error file. What that means is that if that error file doesn't exist, or you delete it, no errors are injected and the error filter is turned off. When you create that file, the error filter is turned on. This happens while nbdkit is running; it's just checking for the error file all the time. And that's super useful for testing, because obviously you can inject errors when you want them to be injected, and then turn off error injection and see if your file system recovers, or whatever you want it to do.

The third filter, which is a very simple filter but also useful, is the log filter. You just give it the name of a log file and it writes all the requests out to that log file. In the next demonstration I'm going to show you, we're going to have some graphical visualisation of what happens inside file systems when you do things like creating file systems. It's important to note that nbdkit is not a graphical tool; nbdkit has nothing about graphics or anything like that. What's actually happening here is that we're using the log filter.
We're writing a log file, and we've got a second graphical program, just a program that I wrote for this talk, which is tailing that log file and then creating the visualisations that you'll see. So nbdkit is not a graphical program; it's just a command line tool slash server.

So let's have a look at what it looks like to create a file system. There's a slightly long nbdkit command line here, but hopefully you should be able to understand what's going on. We're using the memory plug-in, so we're creating a RAM disk, 64 megabytes; we're using the log filter to create the log file, which we're going to tail with a second process; and we're inserting delays. Now the delays, to be honest, are just there to slow it down a little bit and make it a little bit easier to see, otherwise everything goes past too quickly. So I run nbdkit, and this is my second program, which is going to visualise things, and the same old command to associate the nbdkit instance with a loop device.

Now hopefully you can see that: little black flashes going on. Those are reads. What's happening there is that because we've created a Linux kernel block device, udev is looking at it, asking: is there an LVM PV there? Is there a file system there? Is there a partition there I should know about? It's a RAM disk, so it's empty, but it has to check.

Now let's partition it. I'm going to use GPT, all defaults. GPT works by creating a partition table at the beginning of the disk and a secondary or backup partition table at the end of the disk, and those are represented in red; those are writes. You probably also saw little black flashes there: again, we've created another Linux kernel device, and again udev has to check it.

So let's create a file system in there. The big thing that happens there is this lump of blue at the beginning. Blue in this diagram represents discards. A modern mkfs always issues a big discard, or trim, over the entire partition; the reason is that it just makes SSDs work much more efficiently if you do that. Other notable features: the red bar here is some kind of metadata. I'm in a storage devroom full of file system experts, so hopefully you know better than I do what's going on here, but that's probably an inode table or something like that. The big lump of red here could be the journal, maybe. The little red dots, I think those are backup superblocks. If you notice, there are four red dots, and there are also four backup superblocks.

Let's mount that. I'm not touching the laptop here, but something funny happens in a second. There it goes. See, it's writing. That's interesting, isn't it? We've just mounted the disk, but it's writing to it. This is lazy block group initialisation. It's another very common feature of modern file systems. Disks are really, really, really big these days, and writing to them, relative to the size of the disk, is really, really, really slow, so you wouldn't want your mkfs to sit there for hours on end writing all of the block group metadata. In any case, why would you do that? You can't use all of those block groups for writing straight away anyway, because writing is so slow compared to the size of the disks. So it makes much more sense for file systems to defer all this stuff to the kernel.
So when the disk is mounted, the kernel sees that there are uninitialised, uncreated block groups and block group metadata, and it creates them in the background. It doesn't matter anyway, because you can't write to those new block groups faster than they're being created, so it's fine.

So let's mount this. We've mounted it. I'm going to chown it just to make it more convenient for me to put some files on there. So let's, again, copy in the nbdkit source code. You see that nothing is actually written until I sync. We know this, right? When you write to a disk, the writes don't hit the disk immediately; they get stored in memory for a little bit and then they get written a few seconds later, unless you do a sync, which forces the write. And of course, when I delete that directory, even when I sync, actually, it's not going to change that. And you know why this is: when you delete files on disk, it doesn't really delete them. It simply marks them in the block group as being unused, and later on those blocks get reused for other files that you create, but the data is not deleted. There is, of course, a command, for modern file systems anyway, that we can use to actually tell it, please go and discard them, and that's the fstrim command. So that issues a discard request to the file system.

So that was nice. This is my final demo. That was a nice demo showing a single file system, but I think it's more interesting when you actually run multiple copies of nbdkit as multiple devices. This is the longest nbdkit command line that you'll probably ever see, actually (a rough sketch of these commands appears below), but there are only two important changes here. The first one is that previously I was only running one copy of nbdkit, so I could just have it listening on TCP port 10809, which is the default port for NBD. However, I'm going to run five copies of nbdkit this time, and they can't all be listening on the same port, so I'm going to use a Unix domain socket, and that's the purpose of the -U option here. The second change is that I'm using the error filter, and I'm using it in the way that we described before, where you set the error rate to 100% but gate it on the presence or absence of an error file. So the error filter is turned off, because the error file doesn't exist, but it gets turned on later on.

So I'm going to start five copies of nbdkit. I'll just show you what's going on on the file system here. We've got five log files, as you'd expect; those are going to be tailed by the graphical viewer. And we've got five sockets: there are five copies of nbdkit hiding behind those Unix domain sockets. So let me run the graphical viewer, five devices this time, hopefully not too small. And now I'm going to associate the five nbdkits with the five NBD devices.

And now I'm going to create a RAID array, a RAID 5 array. I'm going to use the first four disks as data disks and the last disk as a hot spare. Let's get that going. You can see what's happening here is that it's reading the first three disks and creating a parity disk on the fourth disk. People who know about RAID will be thinking, why is that parity not being striped over all of the data disks? The reason for that is simply that these disks are so small, they're like 64 megabytes, and the stripe size is actually smaller than the entire disk. There is one parity disk and there are three data disks.
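A rough sketch of the commands behind this part of the demo follows. The socket paths, log and error file names, delay values and the 64M size are illustrative (the talk's exact command lines aren't reproduced here), and the rdelay, wdelay, logfile, error-rate and error-file parameters are the ones documented for nbdkit's delay, log and error filters:

    # Five nbdkit instances, one per "disk", each behind its own Unix domain socket.
    for i in 0 1 2 3 4; do
        nbdkit -U /tmp/nbd$i.sock \
               --filter=error --filter=delay --filter=log \
               memory size=64M \
               logfile=/tmp/nbd$i.log \
               rdelay=20ms wdelay=20ms \
               error-rate=100% error-file=/tmp/error$i
    done

    # Associate each socket with a kernel NBD block device.
    for i in 0 1 2 3 4; do
        sudo nbd-client -unix /tmp/nbd$i.sock /dev/nbd$i
    done

    # Create the RAID 5 array: four data disks plus one hot spare.
    sudo mdadm --create /dev/md0 --level=5 \
               --raid-devices=4 --spare-devices=1 \
               /dev/nbd0 /dev/nbd1 /dev/nbd2 /dev/nbd3 /dev/nbd4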
Let's have a look at the kernel messages, which will be interesting in a minute. So let's just partition that as before, all defaults. We can create a file system on that as well. This looks a lot like it did last time, except that there's no trim. The md driver in the kernel doesn't believe that you can send discard requests to the underlying devices, I guess because they've been burned in the past. There is a way to do this: you can set a kernel command line flag, which is something weird like raid456.devices_handle_discard_safely. However, as this is quite literally my work laptop, I don't happen to have that on my kernel command line, so it's just not issuing discards to the underlying devices. I can mount this. Let's just chown it so I can get in there and so on, and create some files in there.

Now, of course, the interesting thing is what happens when I inject an error into this. You can see what happened there, quite dramatically, actually. It detected, first of all, that the error occurred on the second disk. Unfortunately, the second disk is called nbd1 here, because I'm starting from zero for the disk numbering. You can see also that it started to do a recovery, so it started to read from the remaining good disks and rebuild onto the hot spare. It took a little bit of time; we're injecting delays here, so it's a bit slower than normal. You can imagine how it would be if this wasn't a 64 megabyte disk but 6.4 terabytes or something: recovery on RAID 5 takes a really long time. Unfortunately, the way that RAID 5 works is that if you then get another disk failing at certain points during the recovery, you can actually lose all your data, and that's kind of the reason why we don't use RAID 5 in production, certainly on larger systems, these days. However, it's still a good demo.

I should just very quickly note that when I clicked the error button there, the graphical tool didn't start injecting errors. All that happened was that the graphical tool created a file called error2, and then nbdkit notices that the file exists and starts to inject errors on that disk. And although all that dramatic stuff happened in the background, the actual file system is fine. There are no errors at the file system or RAID array level; the dramatic stuff happens below that.

Of course, I can inject more errors on a second disk, and now we're running in degraded mode. This is the minimum that this RAID array can support without actually failing: although there was another error, I'm still just about okay. If there was another error, if I clicked another button, two things would happen. You'd see errors actually appearing at the file system level, and the second thing that would happen is that I'd have to reboot my laptop, because you cannot then unmount a RAID array that's in that state (and I could not work out how to do it). It just is absolutely impossible; I don't know why. There's tons of stuff about this on Stack Overflow. However, I don't want to reboot my laptop in the middle of the talk, so I'm not going to do that, but trust me, this is a thing, and it's probably a kernel bug or something. I don't know, but there we go. Instead, what I'm going to do is just unmount that and stop the RAID array; the commands involved are sketched below.
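For completeness, the error injection and the teardown amount to something like this. It's a sketch: the error file path matches the error-file parameter used in the sketch above (so the second disk, /dev/nbd1, is gated on /tmp/error1), and the mount point is illustrative.

    # Start injecting errors on the second disk: the error filter simply
    # checks whether this file exists, so creating it turns errors on.
    touch /tmp/error1

    # Watch the array kick the failed disk and rebuild onto the hot spare.
    cat /proc/mdstat
    sudo mdadm --detail /dev/md0

    # Removing the file turns error injection off again.
    rm /tmp/error1

    # Teardown at the end of the demo.
    sudo umount /mnt/raid
    sudo mdadm --stop /dev/md0
    for i in 0 1 2 3 4; do sudo nbd-client -d /dev/nbd$i; done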
So, in summary: loop mounting is popular, great, and somewhat limited. It's limited to plain, uncompressed files on your local hard disk, which gets you a long way, obviously, because people have been loop mounting for years. Once you have a user process acting as an NBD server (nbdkit is what I've shown here, but other NBD servers are available and very good), you have a user process, and within that user process you can do all kinds of quite great stuff. nbdkit has lots of plug-ins, but we also have this stable API and ABI guarantee, which means you can write your own, secure in the knowledge that you won't have to rewrite it down the line. So I think this would be a good way to do file system development and testing.

Interestingly, this is not, in fact, how we use nbdkit at Red Hat. We actually use it for something which is related but a little bit different: we use it as a way to expose proprietary storage and proprietary hypervisor data (and by proprietary, of course, I mean VMware), and NASes as well, and expose that data to free software, to Linux and BSD.

Although loop mounting is super convenient, it's also dangerous. If you don't trust the data on your disks, if you've just downloaded a disk from somewhere or, even worse, you allow users to upload their cloud images and you just loop mount them for some reason, that's a terribly bad thing to do, because it exposes your host kernel to bugs in the file systems within that disk. I mean, it's like worse than a root exploit. If you're in that situation, you should use libguestfs, which is our sister project. libguestfs does interoperate with nbdkit, and it works by creating a virtual machine which protects your host from the possible ill effects of the contents of the disk.

If you want to reproduce what I have shown you in this talk, then you will need at least that version of nbdkit, 1.8.3. It's not really necessary to have that version: earlier versions are fine, and as I say, the API and the ABI have been stable since 2013, which is back when we started with 0.1. However, if you want to reproduce exactly what I've done and all the things I've shown you here, you will need that particular version. It's available in all major distros, so it's in Fedora; Debian testing, so Buster, has 1.8.3, I believe, and sid has 1.10, which is the version that we released a couple of weeks ago. I would also say that if you want to do loop mounting, you should have nbd-client 3.18 or 3.19, which was released just a few days ago. The reason for that is that 3.17 had kind of an annoying bug which just affects loop mounting; it's just kind of annoying, but 3.18 fixes it. There were also a few bugs in kernels before 4.18. I don't think they would genuinely affect you, but if you do find any problems, you're probably better off looking at kernel 4.18 or above, because there were some things that were tidied up a bit.

We also have support for FreeBSD, OpenBSD and Haiku. Haiku is a great operating system; I was actually running it for a little while, and it just works really well, it's brilliant. However, those don't have kernel NBD loop clients, so you can't do any loop mounting with them, but they're useful for acting as an NBD server to be consumed by other NBD clients on other machines. If you want to find nbdkit, go to your favorite search engine and type in nbdkit, and I'm sure it will be at the top of the listing. This talk may be downloaded; this link is on the FOSDEM page as well.
I wouldn't recommend running this talk's demos as they are, because they do loads of weird sudo stuff which is very applicable to my laptop but probably not applicable to your systems. However, certainly download it and have a look at how we've done it and how we implement this. With that, I'd like to say: are there any questions?

Hi, great project. Currently, if you create an empty qcow2 and then, say, fire up a qemu and install some stuff into it, it doesn't compress on the fly; you can only compress after the fact. Does this allow you to get around that?

So, the question was about qcow2 compression. Basically, the way that qcow2 compression works is that you can only compress when you're initially creating the qcow2, and then when you later write to that same compressed qcow2, you're actually writing uncompressed qcow2 clusters. This does not solve that, I'm afraid, because that is an inherent issue with the qcow2 format that isn't going to be solved any time soon, but you should ask Kevin Wolf and I'm sure he will consider your request. I'm afraid that this does not solve that. The other thing that this does not solve is that we do not allow writing to compressed xz files. We could do that if somebody wanted to add it as a feature; there's nothing in our software that prevents it, but it's quite difficult, because of the way the LZMA format is arranged. You have these compressed blocks, and you can certainly seek to a block and read it, which is how the filter works, but if you want to write to it, you'd have to actually expand the file by shovelling all the data aside to fit the larger block in. There's nothing in nbdkit that would prevent us doing this; it's complicated and slow, so nobody's done it. What you can do with the xz filter, however, is put a cow filter on top, and then you can save your changes out to a qcow2 file after the fact. So basically you'd end up with an xz file that stays the same, and a qcow2 file containing the differences that you've written. I hope that's somewhat clear. The qcow2 wouldn't be compressed, I'm afraid.

I know there's no official Gentoo package for this, so I'll get one added. Looking at the version requirements, it seems rather new software. Are you confident enough to call NBD and nbdkit stable?

Well, I'm not giving this talk, but I have maintained NBD since 2001, and it actually existed before that. There were a few annoying bugs recently, and those have been fixed; that's why we recommend that version. But it's been around a very long time, and, this is something very few people know, it's actually older than iSCSI. Wow. It's been in the kernel for 20 years, since 1997.

I'm just saying 4.18 because there were some bugs, but they probably wouldn't have affected you. It's not like it was buggy before 4.18 at all; it was working fine. It's just that if you did hit a bug in the kernel driver, you probably want to upgrade to 4.18, because there are a bunch of fixes there. Is that fair? Yeah, I think so. There actually is a FreeBSD client too; it's a GEOM gate thing. Somebody wrote it for Scaleway, but it's fairly minor. Got any more questions?

So I'm well aware that nbdkit offers a lot of flexibility, but I'm wondering, as an alternative approach, just for having a basic block device without error injection and so on: you could expose a disk image via a FUSE file system.
Are you aware of any performance comparison between this approach and nbdkit? Yeah, so this is actually interesting. There have been, for example, FUSE drivers that expose partitioned block devices, to provide a sort of route to that. I would say that I would be surprised if the nbdkit, or the NBD, approach wasn't way faster than that; I mean, much faster. We have really concentrated on performance as our number one thing in the 1.10 cycle, and I would imagine that it would blow that way out of the water. But I haven't tested it, so I don't want to claim that it's definitely true. But if it isn't, it should be. So it's good that you mentioned FUSE, because I like to think of this as being FUSE for block devices. FUSE for file systems is also brilliant, but it's for file systems, and this is like FUSE for block devices.

So I very much like the overall direction this is taking, which is making it possible to write stuff in user space that historically was only possible to do in the kernel, and especially making it possible to do things in user space unprivileged. Now, the only problem with that that I see is NBD. I mean, NBD requires root basically just to drop a block device into /dev. So the question there was about why this requires sudo, why we require root privileges for this. I'd like to say that libguestfs does not require sudo.

Looping back to FUSE, one possible approach: first of all, the loopback device is something that really ought to go away. There's no inherent reason we couldn't, in the kernel, make it so that you can mount any file directly, not just a block device, without having to have this weird intermediary where you create a loopback device; that would help with the privilege issue. I believe that the real issue is with udev, and whether udev is hardened against somebody just plugging in, basically... Well, it's a global namespace: /dev is a global namespace. We don't want unprivileged users mucking around with the global namespace. An unprivileged user should be able to mount any given file in their home directory without having to create something in /dev, and getting rid of loopback would help with that issue. The other thing that we ought to do is with FUSE: writing a FUSE driver involves a fairly large API, but what would be really useful, I think, for this is a simplified way of exposing just a single file over FUSE. If we had that...

There are a number of problems with doing this unprivileged. I think actually you're right about global namespaces, and I think the udev issue is the one nearest to us. udev sees those devices being created and it runs a whole bunch of code against them. It's my belief that udev is not really hardened to doing that against any old stuff that you just throw in there, that any user could throw at it. I think we've run out of time, but thanks, everybody, for attending, and I hope you have a good FOSDEM.