I'm part of the cloud engineering group focusing on integrating lower-level virtualization components with OpenStack, but today I'm going to talk a bit about the different kinds of live block device operations that QEMU has to offer to solve different kinds of problems. First we'll set a bit of context, and then we'll see what these operations are, where they're used, and how they can be combined to solve certain practical problems. I assume most people in this room are familiar with QEMU, but we'll do a bit of background as well. Just a show of hands: how many have heard of QEMU in this room? Okay, pretty much everybody.

First we'll set a bit of context for a KVM-based virtualization stack, for the benefit of those who aren't familiar; I assume most of you are. Then we'll see a small primer on how to operate a QEMU instance, followed, super quickly, by how block devices in general are configured in QEMU. And finally we'll see the different block device operations, potentially long-running ones, that do various kinds of neat things, and the practical use cases for them.

All right, so that's a typical KVM virtualization stack. At the bottom you see the Linux kernel module, KVM, exposed as a character device to QEMU, which operates it via ioctls. Sitting on top is QEMU, which runs your guest as a complete process, just like any other process on your host. So QEMU exists alongside those processes, and you can use regular tools like top, ps, and so forth to monitor the various details. But launching QEMU by hand with long command lines can get super tedious. If you've seen a libvirt-generated command line, it goes on for pages and pages; well, not really pages, but it's sufficiently long. So you don't really want to launch QEMU by hand, although there are some cases where people do that. This is where libvirt comes in, as a hypervisor-agnostic library that provides neat abstractions and other things for the life cycle of a virtual machine; that's the libvirt daemon. And even higher up are projects like OpenStack, which drives instances and instance-launching processes via its virt driver interface, connecting to libvirt, which in turn talks to QEMU, and the complete back and forth goes on from there. Even more recent entrants like KubeVirt also rely on libvirt; it's not the exact same architecture, but some of the core fundamentals remain. On the far left you see the libguestfs project, a neat set of tools that provides various utilities to do different kinds of things with your guest images: safely access them, resize them, launch templates; it does a whole bunch of things. It also uses QEMU and libvirt in several of its tools to give you different functionality.

So in this talk we'll zoom into QEMU's block layer and see what functionality it has to offer and what practical use cases it solves. Most of the time these features are already exposed by higher-level libraries like libvirt, or by even higher-level management software like oVirt, OpenStack, and so forth. But sometimes it can be instructive to look under the hood to see what's happening; it especially helps you during debugging. QEMU's block layer offers, to begin with, guest-visible emulated storage devices, like your SCSI or IDE, or paravirtualized drivers, where the guest knows that it is being virtualized, like virtio-scsi or virtio-blk.
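A quick sketch of the kind of command that slide shows for listing the storage devices QEMU can emulate; the output below is trimmed and purely illustrative, since the exact device list varies by QEMU version and build:

    $ qemu-system-x86_64 -device help
    ...
    Storage devices:
    name "ide-hd", bus IDE, desc "virtual IDE disk"
    name "scsi-hd", bus SCSI, desc "virtual SCSI disk"
    name "virtio-blk-pci", bus PCI
    name "virtio-scsi-pci", bus PCI
    ...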
If you run the command above and look at the 'Storage devices' section, you see the different kinds of devices that QEMU has to offer; that shows you a nice picture of what's available. QEMU in general is written in layers, and there's a lot of code reuse between components. Its block layer offers different types of block drivers. One is the protocol driver, which is how QEMU accesses data on the host, and the other is the format driver, which is how that data is interpreted. Among the format drivers, one of the popular names is qcow2, QEMU's native image format; there's VMDK and other formats as well, but the native format most people recognize is qcow2. Among the protocol drivers you see things like Ceph, network block devices, and so forth.

QEMU also offers different operations that you can do on block devices, live as well as offline. The most popular offline tool is qemu-img, which you can use to create images, create backing chains, convert from one format to another, and a whole bunch of other things. The live operations are where most of the interesting stuff happens; equally interesting stuff happens offline too, but in production people care about the stuff that runs live.

QEMU's copy-on-write overlays, as most of you would know, offer a way to refer to a base image, say one that has a Fedora operating system, and then create an overlay based on that base image. The overlay refers to the base image, and you can do all kinds of destructive things in the overlay; then, if you screw something up, you can discard it and go back to the base image. In this kind of setup the base image can be of any format, but the overlays are always qcow2. There's some terminology I'll use throughout the talk: base images and overlays. When I say overlay, it is always an image created on top of another image in a disk image backing chain. So when you have a simple backing chain like this and QEMU tries to read a cluster, it reads from the overlay if the cluster is allocated there, or else it traverses the backing chain and reads from the base image; QEMU does some caching and so forth. But all the writes happen in the overlay image.

There are several use cases for this. The one I just mentioned is thin provisioning: the Fedora base image is there, and you can create multiple overlays based on it, so you can give out different thinly provisioned disk images to users. And there are snapshots and backups; we'll get to those. So how do you create this minimal backing chain? You see the command lines below, and I especially want to highlight the second one. The first command creates the base image; the second one creates the overlay. Why do I highlight the backing file format along with the backing file? If you don't specify the format of a raw backing file, QEMU will probe it, and that is potentially a CVE waiting to happen. Why? Because a malicious guest can write a header into its disk that looks like qcow2, and the image can end up misinterpreted and corrupted. You don't want that to happen, so it is recommended to explicitly specify the backing file format, and of course the backing file, with those options. Another thing to keep in mind: when QEMU is already using a disk image, you don't want external processes to access it, potentially corrupting it.
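A minimal sketch of those two commands, with hypothetical file names; note the explicit backing format on the second one, so qemu-img never has to probe the raw base image (newer qemu-img versions also accept -F raw as a shorthand):

    # Create the base image (any format; raw here):
    $ qemu-img create -f raw base.raw 8G
    # Create the qcow2 overlay, naming both the backing file and its format:
    $ qemu-img create -f qcow2 -b base.raw -o backing_fmt=raw overlay1.qcow2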
To mitigate that, QEMU already offers runtime commands, QMP monitor commands, that you can use instead. For example, if you want to resize a guest disk live, QMP offers a command for that, and there are even commands to change backing files and various other functionality. So you don't want to poke at an in-use image from outside. Alternatively, the libguestfs project I mentioned offers a tool called guestfish that will let you safely access a disk image in a read-only manner and examine its contents. For example, if you're using a Fedora-based operating system and you mess up the SELinux labeling, you can use this tool to fire up the disk image and fix that stuff, and again make the image bootable. Feel free to ask questions if there are any.

In more recent versions, QEMU offers disk image locking. This essentially prevents you from shooting yourself in the foot by not letting two processes write to the same image. To that end, it uses a fairly recent Linux locking mechanism, open file description (OFD) locks. As a quick example, and we won't go into the details here because I only wanted to mention it for completeness: this is a new feature where users may have to take explicit action. To query an in-use disk image with recent QEMU versions, 2.10 or later, you have to specify the force-share option. For example, if a disk image, foo.qcow2, is already in use and you want to query it for read-only information, you supply that force-share flag, and then read-only access is allowed while writes are happening. Now, there are some cases where you can get stale information when you're doing read-only access of that image. Why? One example: if QEMU is actively resizing the disk image and you try to query it for information, you may see inconsistent results. So insofar as you're aware that the information can sometimes be stale, you can use this. And there are valid cases; qemu-img info falls into an area where you can safely query an in-use image, so I just wanted to highlight that. OpenStack, for example, uses this knowing there can potentially be stale data, because it runs the command in a loop: even if it misses some information, it will catch up on the next run. And on recent QEMU, the same locking mechanism can be controlled when you configure block devices; there's a locking option. It defaults to open file description locks if your Linux kernel supports them; otherwise it falls back to POSIX locking. I'm not quite sure, the block layer people here can confirm, but from a quick look at the code, that's what it looks like. So there's that.

In this part, we'll see how you operate a QEMU instance: how tools like libvirt send various commands to modify an image, send query commands, or modify the state of the VM, how those kinds of things are done. QEMU offers the QEMU Machine Protocol (QMP), a JSON-RPC interface through which libvirt interacts and sends commands back and forth to do various things. You can send query commands to ask, for example, the status of a migration, or, if there's a long backup operation going on, 'hey, can you tell me how much more data is to be copied?', or which block device jobs are in progress, various things like that.
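To make the force-share flag from a moment ago concrete, here's a sketch with a hypothetical foo.qcow2 that another QEMU process already has open and locked (output abridged):

    # Without -U (--force-share), qemu-img would refuse to open an in-use image:
    $ qemu-img info -U foo.qcow2
    image: foo.qcow2
    file format: qcow2
    ...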
To get a sense of where this QMP socket is configured: if you look at a libvirt-configured QEMU instance, you see a long command line, snipped here, but focus on the chardev setup, where it sets up a Unix stream socket for back-and-forth communication with QEMU. That's where it happens; libvirt talks through that socket. And if you're doing test and development, you can use a shorthand, which is just a sugared form of that long, lengthy command line: the -qmp option, where you provide a Unix socket and a couple of other options. So that's handy.

Now that you've launched QEMU with a socket, you can connect to that socket. There are different ways to interact with the QMP monitor. One, which is a bit tedious but can show what is actually going on, is the socat utility. I invoke socat to connect to the Unix socket we just launched QEMU with, and supply the readline option so that it keeps a history of the commands typed. The first thing it gives you is a greeting message saying, hey, the connection was successfully established; that's a good indication. After that, you have to explicitly issue a command called qmp_capabilities. That is for negotiating capabilities; so far there haven't been many use cases for it, but there is apparently a new feature on the mailing list that uses it. For our purposes, what's important is that issuing this capabilities command is mandatory. And then you're ready to send arbitrary commands and receive responses, regular QMP commands. But you don't want to type JSON commands manually unless... I don't want to finish that sentence, but yeah. Thankfully, that's where libvirt comes to the rescue, to automate all of this and provide a convenient wrapper.

There are other ways to interact with the QMP monitor; I mention them for completeness' sake, and some of the examples I show will use the first one, the qmp-shell that lives in the QEMU source tree. It takes key=value pairs so that you don't have to type out long, lengthy JSON commands, which is convenient. It lives in the tree; you just supply a verbose parameter and a pretty-print one so that it also prints the raw JSON if you want, and supply the path to the socket that you launched QEMU with. I'll be using this to show some examples on the next slides. And libvirt also offers the 'virsh qemu-monitor-command' interface, through which you can type JSON. But again, that's only to query. The caveat here is that you don't want to modify the virtual machine state behind libvirt's back, because then you get to keep the pieces if it breaks. So it's useful to query, but not to modify, the virtual machine state.

In this part we'll quickly get a glimpse at what block devices are and how they are configured in QEMU. A block device in QEMU has the notion of a backend and a frontend. The backend is where you configure the block device itself and set up various things: specify cache options, specify formats, change backing files, and various other things; it provides quite fine-grained control. There are two different ways to configure it: one is legacy, which libvirt still uses, and the more recent way is the -blockdev command-line option.
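Roughly what that socat session looks like, assuming QEMU was started with something along the lines of '-qmp unix:/tmp/qmp.sock,server,nowait' (socket path hypothetical, responses abridged):

    $ socat readline unix-connect:/tmp/qmp.sock
    {"QMP": {"version": {...}, "capabilities": []}}
    { "execute": "qmp_capabilities" }
    {"return": {}}
    { "execute": "query-status" }
    {"return": {"status": "running", "singlestep": false, "running": true}}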
But we won't get into that; if you want more details, there have been several talks at previous KVM Forums that you can refer to. The other notion is the guest-visible frontend device, like a SCSI device or something like that, and that can be configured at runtime using device_add, or on the command line via the -device option. So that's the notion of a block device.

Now we'll quickly see what it looks like to configure a block device on the command line. It's fairly boring, and usually projects like libvirt automate this away, but you can look at it to get a sense of the complexity. Here you're just attaching a qcow2 disk to a virtio-blk guest device: the -blockdev option configures the so-called backend, and -device configures the guest-visible frontend. Again, the details are in previous KVM Forum talks, so we won't get into them; I just wanted to mention it for completeness. And lastly in this part there's blockdev-add, the runtime QMP counterpart. Again, fairly uninteresting, just a simple JSON-structured command to add a qcow2 block device. What is nice here, with the recent changes in QEMU's block layer, is that the command-line syntax is a one-to-one mapping of the raw QMP: if you kind of squint your eyes and look at the options on the command line, it's just a direct, tree-like, one-to-one mapping of the JSON; there's a sketch of both just after this part. So that's a convenient way to construct a QEMU invocation by looking at some existing JSON configuration. Again, you don't want to do this by hand; most of this comes up when you're doing testing and development and running from git and so forth. More details in previous KVM Forum talks.

Here we get to the part that I mainly wanted to talk about, the live block operations, which enable various use cases. There are several of them, but to get to any of them, first we need to quickly see how you configure a backing chain. To that end, QEMU offers a couple of commands, and blockdev-snapshot-sync is one that creates an external snapshot. This is the older version; there's also a newer variant called blockdev-snapshot, but that's more about configuring it a bit more flexibly with the recent -blockdev changes. What happens when you issue it is essentially the same as in the offline case. Say the guest is running with a single disk: the existing disk becomes the backing image, and a new qcow2 overlay is created to track the new writes from that point onwards. So the existing disk becomes a read-only backing file, all the new writes go to the overlay, and QEMU points its writes at the new overlay, all live. Now, the overlays have to be qcow2 format; base images can be LVM, raw, block devices, or any of the other formats. And of course the appealing aspect here is that there's no guest downtime; the snapshot is instantaneous.

Some of you here may have used QEMU's internal snapshots, which are more convenient in that the snapshot and the delta are both located in a single file, and that's nice to copy around and so forth. But that's not the most well-tested code path, and upstream concentrates more on these external snapshots. Internal snapshots are sometimes convenient, but there's a bit of a guest pause involved there as well; not too long, but for some it may not be acceptable. So upstream recommends the external format.
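Going back to the backend and frontend configuration for a moment, here's a sketch of that one-to-one mapping, with hypothetical node names and file names: first the -blockdev/-device pair attaching a qcow2 disk as a virtio-blk device, then the equivalent runtime blockdev-add over QMP:

    $ qemu-system-x86_64 ... \
        -blockdev driver=qcow2,node-name=disk0,file.driver=file,file.filename=overlay1.qcow2 \
        -device virtio-blk-pci,drive=disk0,id=virtio-disk0

    { "execute": "blockdev-add",
      "arguments": { "driver": "qcow2", "node-name": "disk0",
                     "file": { "driver": "file",
                               "filename": "overlay1.qcow2" } } }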
So to see a quick example: you start with that base image that the live QEMU is writing to, and you issue the blockdev-snapshot-sync command with a bunch of parameters. It will create the overlay; again, nothing super exciting. And there's a libvirt equivalent of it, so you don't have to go via QMP; libvirt provides nice virtual machine lifecycle management, and it's a fairly simple command, so I wanted to highlight that. And the result is what you see there: an overlay pointing to the base. Earlier, the live QEMU was pointing at the base image; now it's pointing at the just-created qcow2 overlay.

However, managing these long backing chains can get tedious. And not only tedious, there can be an I/O penalty involved. For example, when you have a long backing chain, there are multiple files to track; I've even heard of cases of somebody using a 100-image backing chain, though I don't know what their use case is. So it can get really cumbersome. And there can be an I/O penalty: if a cluster that QEMU wants to read is not in the immediate overlay but somewhere in between, it has to traverse all the images, pick it up, and then cache it. But there are solutions for that.

First is live disk commit, a chain merge: you have a long chain, and the problem is to shorten it. The simplest solution is to merge all of it into the base image, which is fairly simple, right? And how do you do that? You first run the snapshot command to create the three overlays. Then you run the commit command, saying: please commit all the content from the three overlays back into the base. And then you finally complete the commit job and pivot QEMU to the base. That's it, and there's a libvirt equivalent. Now, what is the result? A single consolidated base image, with the content from all three images merged into it. All the intermediate images are then invalid, so you can throw them away; you're left with a single image. So that's a nice way. People also use this to make efficient disk snapshots; there are some wiki pages for that, which we'll point to later.

Stream is the second one in this series of four commands that I want to talk about. It's similar to commit; I don't want to go into details, I just want to mention it for completeness' sake. But it's safer in the sense that the intermediate overlays remain valid: in stream you're copying content from the base images into the top image, so the intermediate images can remain valid. There are more details on this; it can be overwhelming to grasp all of it in one go, but I just wanted to mention it.

And the other interesting one is the mirror command, where you can synchronize a disk backing chain to a target: either parts of the chain, or the full chain, or only the top image. In the QMP commands I'm showing, you're saying: please copy all the content of the whole chain into a target image, call it 'copy', then complete the job and optionally pivot the live QEMU to that copy.

Now, this can be combined with another concept, the NBD server. QEMU has a built-in Network Block Device server, through which you can export disk images that are in use; it has a bunch of commands for that.
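A rough sketch of that commit flow in the qmp-shell syntax from earlier, with hypothetical device and file names. This is the 'active commit' variant, where the topmost, in-use overlay is committed too, so it needs an explicit completion step to pivot:

    (QEMU) blockdev-snapshot-sync device=virtio0 snapshot-file=overlay1.qcow2 format=qcow2
    (QEMU) blockdev-snapshot-sync device=virtio0 snapshot-file=overlay2.qcow2 format=qcow2
    (QEMU) blockdev-snapshot-sync device=virtio0 snapshot-file=overlay3.qcow2 format=qcow2
    (QEMU) block-commit device=virtio0 base=base.raw
    ... wait for the BLOCK_JOB_READY event, then pivot back to the base:
    (QEMU) block-job-complete device=virtio0

The libvirt equivalent is along the lines of 'virsh blockcommit vm1 vda --active --pivot'.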
And you can combine the just-mentioned mirror command with NBD to solve an interesting use case: live virtual machine migration without having to set up shared storage between the two hosts. Fundamentally, that's what is happening under the hood: QEMU runs the mirror command (a variant of it; there are two variants, but we won't get into that, it's conceptually the same mirror command) and copies content from source to destination. The destination QEMU sets up the network block device server and exports a pre-created disk image, and all the content gets copied over to it. More details are in the link at the bottom. And on this slide you see the libvirt automation equivalent of everything I showed in raw QEMU; it's just one command, shown for completeness' sake, and if you want you can try it out later. Higher-level management tools like OpenStack use the equivalent Python API of libvirt to do the same things.

This is the last primitive: backup. Again, it's a bit similar to the mirror command. For mirror, the point in time of the copy is when you end the synchronization of disks on the source, by issuing the block-job-complete command. For backup, the primitive that you see at the top, the point in time is when you start the operation. Again, these can seem subtly the same, and upstream is at least thinking of ways to merge most of these into a couple of common commands; but yes, at the moment they seem a bit similar. The interesting thing here is that backup has an extra synchronization mode: along with copying the entire chain and so forth, it also offers an 'incremental' mode that is useful for incremental backups. There was a talk about that at FOSDEM last year, so you can refer to that one; it has more details. And just like we combined the mirror command with NBD, you can also combine backup with NBD for other use cases. One thing I hear about is examining guest I/O patterns without buying an external tool: when you're examining guest I/O patterns, you want to see the current writes, so you can say, hey, please sync only the new guest writes to a copy on a destination, and you can examine them there. Apparently some virus scanners and other such use cases exist for that. So yeah, there's that.

In summary, there are four of them: commit, stream, mirror, backup. Two of them are a bit similar, and there are a lot more details in the links that I showed below, which have examples and other things that you can refer to. And yeah, this is the reference slide that I mentioned, with links to other KVM Forum talks where you can look at the gory details of QEMU if you're interested. So, any questions? If there are any. Okay, so I totally confused everyone, then. Thank you.
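As a closing reference sketch: roughly what the backup primitive above looks like in qmp-shell syntax, again with hypothetical device, bitmap, and file names. 'full' copies the disk as of the moment the job starts; 'incremental' copies only the clusters a previously created dirty bitmap has tracked:

    (QEMU) drive-backup device=virtio0 sync=full format=qcow2 target=backup-full.qcow2
    ... for incremental mode, a dirty bitmap must exist beforehand:
    (QEMU) block-dirty-bitmap-add node=virtio0 name=bitmap0
    (QEMU) drive-backup device=virtio0 sync=incremental bitmap=bitmap0 format=qcow2 target=backup-inc.qcow2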