So, without any further ado, I'd like to introduce Wouter Verhelst from NBD. Thank you. So, that's me. Welcome to Belgium. Welcome to FOSDEM. Welcome to my talk. Just a short question: what's NBD? Anyone in the room who hasn't heard of the abbreviation yet? I don't think so. Well, you could think of NBD as, if you talk to an HP guy, they will say, well, that's a support thing where they come out next business day, which is not what I'm talking about here. If you talk to a Twitterer, they'll go, NBD, that's no big deal, which is true as well, but also not what I'm talking about here. The NBD that I'm talking about is the network block device. That's the downside of three-letter abbreviations, of course: you get multiple expansions. A network block device: we say network because it goes over TCP, and we say block device because it only does block devices, not character devices, in contrast to some other network protocols that do similar things to NBD. It's actually a fairly easy protocol to understand and to implement, and that's what makes it so interesting, in my opinion. I would actually say that support for NBD by NBD is NBD, if you just combine the three expansions. Oh, I did something wrong. No, I didn't.

A little bit of history: where does NBD come from? Originally, it was written by Pavel Machek and submitted to the kernel for Linux 2.1 (I think 2.1.109, but I'm not 100% sure). It always felt to me like he did it as a sort of overnight hack: can we do this, is this possible, can I do swap over TCP? Oh yeah, I can. Wow. Now I've got this thing that works; what do I do with it? Let's submit it to Linux. So he got it into the kernel, and here we are, 20 years down the road, and it's still there, and it still works. His original idea, however, didn't actually work very well until a few years ago. But, well, in the late 90s, VMware did their VMware desktop product for Linux, and they shipped it along with... they shipped something, I'm not sure what it was called, but it was something that did pretty much the same as qemu-nbd, allowing you to mount virtual hard disks on your host system and just copy things from there.

After a few years, Pavel got tired of maintaining some overnight hack from five years ago, and he passed maintenance on to two different people. For the kernel bits, he talked to Paul Clements, who'd been working on it a bit, and I took over the userland utilities. For the next few years, I just fixed minor bugs and tried to make it a bit more reliable, et cetera. At some point, I managed to convince the LTSP guys that NBD might be something useful for them, because they didn't use local disks either, and they were swapping over the network, originally over NFS, and NBD was written for exactly that, so it was interesting. We also added a disconnect command, because before that, all we had was write and read, and that was it; you couldn't do anything more than that. A few years later, we added named exports. Before named exports, you could connect to a port, and if you wanted to export a second device, you had to open a second port, which was not very useful. In 2011, I finally sat down and wrote down the entire protocol spec: this is how things work. Before that, people had to look at the code. Many people did that; it wasn't that difficult, but you still had to look at the code. Writing it down suddenly got people to understand it better and add more features to it.
Paolo Bonzini suddenly came up with a patch to add trim, which marks specific blocks on the device as no longer in use, so you can just discard them from the disk. We added flush and FUA to act as a point in time where we're sure that particular writes have been completed before you move on, which is very useful if you're trying to implement a file system on top of NBD, et cetera, et cetera. In 2014, some of the QEMU guys came to me and said, look, we want to do TLS with NBD, and originally they wanted to do an HTTPS-style version, where when you connect to the port, either you connect without TLS or you connect with TLS. But I preferred the STARTTLS approach, for reasons I'll come back to later. Was that a question? Oh, no, you're just waving at me. No worries. The implementation took a bit longer. The QEMU guys implemented STARTTLS back in 2015, and this was actually the first time that a particular feature was implemented in another NBD implementation rather than mine first, so that was quite an interesting situation. I implemented STARTTLS only recently, in late December 2016. In 2015, we also added write zeroes. We added structured replies to fix a problem that I'll come back to later, and Markus Pargmann took over the kernel side in early 2015. In 2016 and early this year, we added some more options, OPT_INFO and OPT_GO, which allow you to query information about the device. We added the ability to use multiple connections to improve performance, and Josef Bacik of Facebook has taken over the kernel NBD implementation, because apparently Facebook is now considering using it for some of their internal things. I don't know the details there.

Let's just go back a bit, to 2003. At this point, iSCSI was fairly new, and NBD is actually older than iSCSI, which not many people know, but it was not very mature at that time. Like I said, we only had read and write requests. There was a Chinese guy who asked on an IP storage mailing list somewhere how NBD compares to iSCSI, and he got a response from a guy named Andre Hedrick. I don't know him; maybe he's in the room, maybe he's not. If you are, say hi to me. This is just a short quote, but the mail is much longer; if you have the time you could just check the URL here. What it says is: NBD totally lacks and will never have the ability to be easily managed in this market space. NBD is not iSCSI. It is not enterprise SAN. It is not a serious solution for SAN. It will not be adopted as SAN. It is useful in its free and open source environment. It's a great undergraduate project for SAN. I'm hoping to be able to show you that while this may have been true at the time, it's not so much true anymore today.

How does NBD compare to iSCSI? iSCSI really is a transport for SCSI. With iSCSI, you just talk SCSI over TCP; essentially, the iSCSI standard describes how you embed SCSI commands into TCP. That means you have an HBA driver on the iSCSI client side, which just throws everything onto the network, and at the other end you have an iSCSI server which translates it back for the local hard disk through the block layer. NBD does not have such a general HBA driver. The NBD driver itself sits one layer higher. What that means is that NBD has a higher level of abstraction than iSCSI. The type of commands you send over NBD are much simpler, in general, than the iSCSI ones.
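To give an idea of just how small those NBD commands are on the wire, here is a rough Python sketch of packing a read request and parsing a simple reply. The magic numbers and field layout are taken from my reading of the protocol document, so treat them as illustrative rather than authoritative.

```python
import struct

# Wire layout of classic (non-structured-reply) NBD messages, as I read it
# from the protocol document; the constants below are illustrative, check
# the spec before relying on them.
NBD_REQUEST_MAGIC = 0x25609513       # starts every request
NBD_SIMPLE_REPLY_MAGIC = 0x67446698  # starts every simple reply
NBD_CMD_READ = 0

def pack_read_request(handle, offset, length, flags=0):
    """Request header: magic, command flags, command type, handle, offset, length."""
    return struct.pack(">IHHQQI", NBD_REQUEST_MAGIC, flags, NBD_CMD_READ,
                       handle, offset, length)

def unpack_simple_reply(header):
    """Simple reply header: magic, error, handle; the data (if any) follows."""
    magic, error, handle = struct.unpack(">IIQ", header)
    assert magic == NBD_SIMPLE_REPLY_MAGIC
    return error, handle

# Example: ask for 4 KiB at offset 0 of the export negotiated earlier.
print(len(pack_read_request(handle=1, offset=0, length=4096)))  # 28 bytes
```

The request header is five fields and the reply header three, which is also the point made later in the Q&A about latency and overhead.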
It does mean that implementing a hardware iSCSI server, where you just attach disks and it allows you to access those disks over the network, is easier than implementing a hardware NBD server, but implementing an NBD server is not all that hard, actually.

A little more in-depth comparison between NBD and iSCSI. The protocol document that describes how NBD works is fairly short; I can show you in a moment. It's a self-contained spec. What I mean by that is, if you've read the entire protocol document from top to bottom, you know everything there is to know about the NBD protocol. There is an RFC that describes how iSCSI works: RFC 7143, the consolidated RFC for iSCSI. It is 295 pages. I don't know if there's anyone in the room who's ever implemented an iSCSI server; I don't think there is. I tried to look at the spec. It's just huge and nearly impossible to read. It's extremely long. And once you've read it, you know how to embed SCSI commands in TCP, on the network. What you don't know is which SCSI commands to use, because that's a different document; it's not described in that one. That shows you how NBD is much simpler. With NBD, we have an abstract concept of a block device. Essentially, for NBD, a block device is a device that allows you to read and write blocks of a given length at a given offset, and that's it. We don't do much more than that. With iSCSI, the block device is just a layer in the SCSI system. Any device that speaks SCSI could theoretically be used in an iSCSI environment. You could theoretically scan a document over iSCSI. I don't think anyone would do that, but it's theoretically possible. The addressing in NBD originally was just a host and a port; these days, we have a server host name and an export name, and that's your address. In iSCSI, there is a very complicated addressing scheme. I'm not going to go into details, because this is not a talk about iSCSI and I don't have that much time, but it's fairly complicated. NBD doesn't have any hardware implementations that I know of; there might exist some, but I don't think so. iSCSI does have them, and that is one of the advantages iSCSI has: it is widely implemented, whereas NBD is mostly popular within Linux and free software environments, which is not the case for iSCSI. We do have encryption with TLS these days. With iSCSI, you don't, although most implementations do support IPsec, and you can authenticate with TLS certificates for NBD or with CHAP for iSCSI. Another protocol in this space is ATA over Ethernet, which I'm not going to go into in too much detail here, because ATA over Ethernet isn't something that QEMU supports, whereas iSCSI is. ATA over Ethernet is fairly similar to iSCSI in theory: it uses the same abstraction, the same system of sending commands over a network. It's much less popular, though. There's essentially just been one company with a lot of different devices, but they've gone bust now, so it's not really being developed anymore. It is much simpler than iSCSI, though, so if you're looking for simplicity, that might be a better option. But it doesn't have encryption, it doesn't have authentication, the protocol isn't being developed anymore, and you cannot route it, which you can do with iSCSI and NBD, so there are quite a few downsides there. If you compare things in a nice-looking table, it looks something like this.
I also added Fibre Channel over Ethernet, which essentially takes some of the bad bits of iSCSI and all the bad bits of ATA over Ethernet, combines them, and adds a few of its own, so it's not a very good-looking part of the table. I think you'll agree that the two leftmost columns are the most interesting ones. NBD has a very simple protocol, as I've told you but haven't shown you. Let me just fix that, sorry. This is the protocol spec of NBD. It's a markdown file in the GitHub repository. It is a good read, but that's it; it's not that long. If you compare that to iSCSI, like I said, it's 295 pages of ASCII text, and then you've only just read how to embed the commands. So the NBD protocol is much simpler and much easier to understand. It's a self-contained standard, which is not true for any of the others. It's a routable protocol. It has authentication with TLS certificates, though not if you don't do TLS, of course. It has encryption. It doesn't have hardware implementations. I'm going to get a bit of water here. It is still actively developed, which is also the case for iSCSI, and I don't know whether that's the case for Fibre Channel over Ethernet, but it's not the case for any of the other protocols. QEMU has an iSCSI client, so if you want to boot a QEMU virtual machine from an iSCSI device, that's perfectly possible. But if you want to export a device from QEMU, you can only use NBD for that. There is a QEMU client as well as a QEMU server, and this is what makes it so interesting for virtualization. A modern QEMU allows you to export a device while your VM is running, and we've been adding a few features that make it possible to synchronize two devices, one NBD device on a remote QEMU machine and one device locally, so you can live-migrate your backend storage while the VM is running. For that we've had to add a few features, and I'll come back to that in a minute. There are a few implementations of NBD that are free and open source, and I've listed them here. Well, I've listed most of them here; there are still a few more, but I can't list them all. The topmost one is mine: nbd, the reference implementation, which implements both client and server. Then there's the Linux kernel itself, which only implements a client, for obvious reasons. There is an alternative implementation by a Japanese developer, whom I've never met, and I forget the name again; it's called xNBD. He does his own thing. He hasn't implemented most of the recent features, but he did implement a few features of his own, and it is actually being used by a particular cloud provider called Scaleway. If you get a machine on their system, they're using xNBD. There's also nbdkit, which was written by Richard Jones of Red Hat. It's part of the libguestfs project, and it allows you, with a very simple API, to implement an NBD server in like half an hour or something. So if you're using your own storage system, one that is not just a raw file and not a qcow2 image, then nbdkit might be just what you need to boot a VM from your device. And finally, there's also Bitrig, a BSD variant, I think a fork of OpenBSD or something like that, and they implemented the NBD client side with the same API as Linux, so you can work with that as well.
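To illustrate how small an nbdkit plugin (just mentioned above) can be, here is a rough sketch of a read-only Python plugin serving a disk from memory. The callback names (open, get_size, pread) follow the nbdkit Python plugin interface as I understand it, but the exact signatures differ between nbdkit versions, so treat this as a sketch rather than something copy-paste-ready.

```python
# Sketch of an nbdkit Python plugin: a 100 MiB read-only disk served from
# memory. Typically run with something along the lines of:
#     nbdkit python ./ramdisk.py
# (callback names per the nbdkit Python plugin documentation; exact
#  signatures depend on the nbdkit API version you have installed)

import errno

DISK_SIZE = 100 * 1024 * 1024
disk = bytearray(DISK_SIZE)   # the "backend" is just a chunk of memory


def open(readonly):
    # Return a per-connection handle; we don't need any state here.
    return 1


def get_size(h):
    # Tell nbdkit how big the exported device is.
    return DISK_SIZE


def pread(h, count, offset):
    # Return `count` bytes starting at `offset`.
    if offset + count > DISK_SIZE:
        raise IOError(errno.EINVAL, "read beyond end of device")
    return disk[offset:offset + count]
```

Because the plugin only implements reads, nbdkit would expose this as a read-only export; adding a pwrite callback along the same lines would make it writable.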
In the comparison that comes up here, I'm going to describe the most important implementations, which are the top five, though Bitrig is fairly limited at this point in time, so actually the top four. There are a few features that we have added fairly recently, and a few features that depend on the design of the implementation. The first one is: can a server export multiple devices at the same time without requiring multiple instances of the same server? nbdkit can't do that right now. I understand that Richard is working on making that possible, but right now it's not. It is possible with QEMU and with nbd-server, though. Next, client connections: currently, if you use just a single connection to your server, you can get a bottleneck on that one connection, and Josef Bacik of Facebook discovered that by using multiple connections and blk-mq in the Linux kernel, you can get quite a performance gain. Linux 4.10 is the first kernel that implements that feature, so if you use NBD on Linux 4.10 with nbd-client 3.15 or higher, you can use multiple client connections and you will get quite a speedup there. QEMU doesn't implement that yet, at least not in version 2.8. nbdkit is only a server, so that part doesn't apply to it. There is a write zeroes command that we added recently at the request of the QEMU developers. The point of that is that if you want to sync from one device to another, it is useful to first be able to say: there are lots of zeroes here, and I don't want to actually send them all over the wire. So write zeroes is just an optimization where you send a command saying, there are two gigabytes of zeroes here, just write them. TLS support is fairly recent; we implemented that in nbd-server and QEMU. We don't have kernel-side TLS, and nbd-client implements TLS by proxying the connection through a little userspace proxy, which decrypts it and passes it on to a local socket. So you can actually connect your Linux kernel to a TLS-enabled server, but you cannot do that if you want to swap over it without deadlocking your system. For that reason, if you try to use the -swap and TLS options of nbd-client together, it will just barf right there and say, no, that's not possible. And then the one feature that nbdkit has over the other implementations is that it's extremely easy to add your own server backend. nbdkit has plugin APIs for Python, OCaml, C, and a few others; I think Ruby too. So whatever your favourite language is, I'm sure nbdkit can use it, and you can just implement reading and writing blocks in that language, which makes it easy to export an image from whatever format you have, or maybe even generate one on the fly, if that's what you need. The protocol is evolving on a constant basis. How do we do that? Originally, somebody would come with an idea, I would implement it, and it would be done; this is about five or ten years ago. These days, though, before we implement a new protocol feature, there's extensive discussion on the mailing list. Then we write a spec, and then somebody writes an implementation, either me or somebody else. Then the experiences from that implementation are merged back into the spec if needed, and then the spec is formalized to become part of the formal... Was that a question? No? Okay, sure, no problem. Then the spec is formalized and the changes are merged into the main spec.
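Going back to the write zeroes command mentioned a moment ago: the nice thing about it is that the request carries only an offset and a length, with no data payload at all, so "two gigabytes of zeroes" costs a single header on the wire. A rough Python sketch, with the command code taken from my reading of the current protocol document (so verify it against the spec before using it):

```python
import struct

NBD_REQUEST_MAGIC = 0x25609513
NBD_CMD_WRITE_ZEROES = 6   # command code as I recall it from the spec; verify

def pack_write_zeroes(handle, offset, length, flags=0):
    """Tell the server: write `length` bytes of zeroes starting at `offset`.
    Unlike a normal write, no data follows this 28-byte header."""
    return struct.pack(">IHHQQI", NBD_REQUEST_MAGIC, flags,
                       NBD_CMD_WRITE_ZEROES, handle, offset, length)

# "There are two gigabytes of zeroes here, just write them."
# The length field is 32 bits wide, and servers may impose smaller limits.
req = pack_write_zeroes(handle=2, offset=0, length=2 * 1024 * 1024 * 1024)
print(len(req))  # 28 bytes on the wire instead of two gigabytes of zero bytes
```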
There are currently a few outstanding protocol changes that have been discussed, and for which a spec is complete, but that aren't implemented yet. The first one is structured replies. The reason we need structured replies is that currently the NBD protocol does have the ability to signal an error to a client, but the field in the response where that error goes is in the header, not in a footer. So that means if you send a reply to a client, you have two options. Either you send the header before reading your data, saying, well, all is fine, all will be fine; and if you then do encounter a read error, you have to drop the connection, because you already said it would be fine. Or you have to allocate whatever amount of memory the client asked for, which can be up to 32 gigabytes (or is it 4? way too much memory either way), read it all, and only then reply to the client, which actually allows the client to DoS your server. So neither is a really good option, and structured replies try to fix that by allowing the server to split up the response into multiple chunks, and by having the ability to send errors at some later point in the reply. That way the client can recover from a read error, in the worst case by signaling the read error to user space, but at least it can recover from it, and that's a good thing. Based on that structured reply, we've also added a block status option, which is something really cool. This was added at the request of Virtuozzo and a few backup companies, and it allows us to send metadata. The spec is very careful not to define what this metadata is. It just allows you to define a metadata context in the negotiation phase, and then during the transmission phase this metadata context can be queried, and it could mean anything. It could mean things like: this part of the block device has been backed up; or: this part has been written to again, for an incremental backup; or: this part of the block device has not been allocated yet, so if you try to write to it we might give you an ENOSPC; or it could mean anything else, essentially. The spec allows for extensibility there, and by using this it would allow you to make an incremental backup of a VM that is currently running, based on a snapshot, et cetera, et cetera, and that's actually the main reason for it. There's also a second one, which is the INFO and GO thing, to fix another issue we have in the negotiation phase: currently, if you select a named export but the export name doesn't actually exist, there's no way for the server to say so other than to drop the connection, and INFO and GO try to fix that by making that part of the negotiation a bit more generic. We've also got an extra option that we're currently discussing, so the spec isn't finished there yet, and the feature is to allow an active resize from the client, so the client can say: please make this block device this big now, rather than whatever size it is right now. The spec still needs some finishing up, but it looks like it's going to happen.
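To make the structured-reply idea a bit more concrete, here is a rough server-side sketch in Python of streaming a read as chunks, where an error is only sent if and when a chunk actually fails. The chunk header layout and the magic and type constants are my recollection of the proposed extension and should not be relied on; the authoritative definitions are in the protocol document.

```python
import struct

# Structured-reply chunk header, as I recall it from the extension draft:
# magic, flags, chunk type, handle, length of the payload that follows.
# The constant values below are from memory; verify against the spec.
NBD_STRUCTURED_REPLY_MAGIC = 0x668e33ef
NBD_REPLY_FLAG_DONE = 1 << 0          # set on the final chunk of a reply
NBD_REPLY_TYPE_OFFSET_DATA = 1        # payload: 8-byte offset, then data
NBD_REPLY_TYPE_ERROR = (1 << 15) | 1  # payload: 4-byte error, 2-byte message length

def chunk(flags, ctype, handle, payload):
    return struct.pack(">IHHQI", NBD_STRUCTURED_REPLY_MAGIC, flags,
                       ctype, handle, len(payload)) + payload

def serve_read(sock, backing, handle, offset, length, piece=1 << 20):
    """Stream a read in 1 MiB pieces; on failure, send an error chunk instead
    of either dropping the connection or buffering the whole read up front."""
    done = 0
    while done < length:
        n = min(piece, length - done)
        try:
            backing.seek(offset + done)
            data = backing.read(n)
        except OSError as e:
            payload = struct.pack(">IH", e.errno or 0, 0)  # error code, no message
            sock.sendall(chunk(NBD_REPLY_FLAG_DONE, NBD_REPLY_TYPE_ERROR,
                               handle, payload))
            return
        done += n
        flags = NBD_REPLY_FLAG_DONE if done == length else 0
        sock.sendall(chunk(flags, NBD_REPLY_TYPE_OFFSET_DATA, handle,
                           struct.pack(">Q", offset + done - n) + data))
```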
Right, demo time. What time do we have right now? Essentially I'm going to do... is that readable? Yeah, okay, awesome. Essentially, I'm going to run... this shows me the exports that my server has. This is my laptop, so I'm just exporting a few test devices that I do development with. The first one is called test, the second one is called test2, and the third one is called mac image; I don't even remember what that was supposed to be, but the first one actually contains an installation of a system. With this URL format I can tell QEMU that I want to connect to an NBD device at that location. Sorry, of course I have to use the right command. So now we have a VM booting. I don't know if you can see it, but... yeah, it takes a moment... yeah, it's there. It's just too big for the screen right now; it doesn't fit on this small screen, but there we are. It looks like a system booting, and this is running over NBD. Currently it's just using nbd-server with a simple raw file as backend, but since the server runs in user space, it could essentially be anything. You could have something that just uses a tar file and exports from a tar file, and there's magic there; you can do anything you want, really. And that's the end of my talk. Are there any questions?

This is maybe a very stupid question, but how stable is it? How stable is this? Could I run this in a production environment? I know of at least... well, I did say this: the LTSP guys, they use it as part of their system. LTSP got used by Debian Edu, which got used by the Extremadura folks, and they ran it on 80,000 desktops without many issues, so I think you can say it's quite stable. This was 2008 or 2009. I also know the Scaleway guys use it in production on their entire infrastructure, and it seems to work fine for them. Please do. Yeah, there's one over here as well.

I was wondering, regarding the overhead of this: performance-wise, how effective is NBD in actual performance? Yeah, performance, it's a very good question, and I actually did run a few tests. I ran iSCSI, ATA over Ethernet, and two implementations of NBD, once on my home server, which runs ZFS on five devices and has a one-gigabit network, and a second time at a customer site where they run ZFS on 14 disks and have a 10-gigabit link between the two nodes. And I could not see any difference between the four protocols, nothing statistically significant really. There were some differences, but I don't think it has performance issues compared to iSCSI; it looks fine to me. I tried to make a graph of that and add it to my talk, but I couldn't show it in a reasonable way, so I couldn't do that, but I can give you the numbers if you're really interested. All right, anyone else? You had a question earlier too. Can we have one there?

I was actually going to ask about latency, not just bandwidth. Well, it runs on top of TCP, so if you have a TCP issue, then you have an NBD issue, but essentially the overhead is very small. It is faster than using a loop mount on an NFS share, because you've got way less overhead. I didn't add this because it would get too technical, but the request packet is, I think, five fields, and the response packet is... it's actually three fields, plus the data. So it's fairly small and fairly low overhead, so latency is fairly low, but it depends on the size of the request, of course.

I wonder about scalability. How do you do this with one TCP endpoint? Do you have something about load balancing that, referring to a cluster, can you do something like that? So the protocol spec actually defines what happens if multiple clients connect to the same device through various
means; we added that because of the multiple TCP connections to the same server, but there's nothing in that spec that assumes it's a single server. My implementation doesn't implement that, and the kernel doesn't check it, but essentially, if you wanted to, if you had several servers which had the same backend, which synchronized somehow, and which synchronized on the same synchronization points that are defined in the NBD protocol, then you could perfectly well load balance over several servers. I'm not aware of an implementation that does that, but it could be done if you wanted it.

Any experience with running a RAID on top of NBD? Have people done that successfully? Yes, RAID on top of NBD: you can use software RAID on top of NBD, and actually the MD RAID maintainer in Debian has done that at some point. I don't think for a very long time, but it's possible. You can tell the MD RAID layer in the kernel to only use a device if the other half fails, so if you do that for the NBD device and not for the local device, then you can use it as a sort of weird way to back things up. But it works.

I think I ran into a problem that the client doesn't get some kind of serial number, like a normal disk has. Do you know what I mean? So on the client side you've got the driver with the block device itself, and then there are means to get the serial number of a real disk, but I think the protocol doesn't support this, in how the server should generate one. There is indeed no way to export a disk serial number through NBD, because NBD doesn't do disks, essentially; it exports a block device with a size, and that's it. If you have a real need for that in some way, I'm sure we can come up with some protocol extension to do it, but currently we don't have that. What I do know, though, is that there are protocols like NFS 4.1 where you do have a serial number on the block device, and that should work perfectly well for that, but NBD itself doesn't add a serial number. We could easily add it if we wanted to; it's not a problem.

If you resize the source image file or block device, does that propagate through to the client's kernel, or do you need to do something special? So, you resize the device on the server side: currently it doesn't propagate, but with the active resize extension the intent is that it would; it would have the ability to signal that too, but currently it doesn't. Any more questions? Okay, many questions; either my talk should have been better, or people are very interested. That's cool.

Can it be used with multipath? Oh, multipath. I was talking about a single block device. I haven't tried that, but I don't see why not; if you can tell multipath to talk to one and the same device over two NBD devices, it should be possible. Second question? Sorry. I have read about trim support, so I believe tools like virt-sparsify could be used. Is that working, to reclaim the space? Oh, right. So we did add the block status thing, and that has the ability to say which parts of the file are allocated and which are not, if the server supports it and if the backend supports it. The server doesn't necessarily know a lot about the backend, because the server might not control it, so we don't always know that the backend information is correct, but there is a means of signaling, let's put it that way, that there is no backing data for this part. Go ahead, any more questions? Yeah, over there.

Hi, you have not specified the meta information, so how do you expect that different clients work with this meta information? So let me just go to the actual spec, maybe, if I have it... now I don't...
Here we are. Anyway, so essentially the idea is that you negotiate a metadata context during the negotiation phase, and the metadata context is a namespaced string. So you say the namespace, which can be qemu, can be base, whatever, then a colon, and then something in that namespace. The namespace should have its own spec somewhere, and that spec would define how it works. Then during negotiation you can say: I want to see this particular metadata context, I want to use it. And then during the transmission phase, where you want to request the information, you can add a metadata context ID that refers to whatever you negotiated at first. So it allows people to add as many contexts as they want, so it should be fairly flexible. There are two contexts... sorry, there's one context in the spec, which is for sparseness; that is defined in the base spec itself. But you can add as many as you want, and QEMU is probably going to add a few, as I understand it.

In enterprises, you usually have the assumption that the storage is always up. Is there some support for NBD clustering or something like that? Not at the NBD level. There has been an option to allow a client, when the connection drops, to maybe recover, but that hasn't been very reliable in the kernel. You could do that, but it is an area that I do think we need to improve on; it's not quite there yet. But as long as your server remains up, all should be fine. With the multi-connection stuff, I just saw a patch pass by yesterday, or was it this morning, where if you've got like four connections and one gets dropped, we move on with just three. And if you use multipath on top of NBD and one of them drops, then you still have another connection. So there are a few options there, but essentially it's not been well defined what happens if the connection drops unexpectedly. The answer is: don't drop it. But yeah, more questions? That should be... that should be perfectly fine. No more questions? We're almost out of time too, so I'm going to leave it at that then. Thank you for your attention, and thank you for your interesting questions.