Hello, my name is Alexander Mikhalitsyn. I'm a technical engineer at Virtuozzo, and today I'm going to talk about dm-qcow2, a block device driver which provides the ability to use QCOW2 images as block devices for container needs, and not only for container needs. My co-author is Denis Lunev, the engineering lead of our core engineering team at Virtuozzo. Our team is focused on the Linux kernel and QEMU.

The first question here is: why do we need a block device at all? From the system containers perspective, a block device is really useful because we want to achieve better isolation between system containers. We may also want to tweak file system parameters, and we may want to have different file system types in different containers. For example, we may have a combined setup where older containers use the ext4 file system and newer containers use XFS. Another point, and I think the most important one, is that a block device allows an easier implementation of snapshots for a container's file system, because we snapshot the block device as a whole. And of course, with an underlying block device we get online resizes and size limits for the container file system tree essentially by default.

Consider the following picture, where we have three containers, each with a file system tree inside, and three loop devices as the backing storage for these containers. This picture is fine, but the problem here is the loop device. Loop uses a plain (raw) image format, which causes problems: it has no dynamic cluster allocation mechanism, so we waste a lot of space, and it doesn't provide the ability to make snapshots. So we need something new here. And we had ploop: a technology, an internal block device driver at Parallels. Ploop stands for "Parallels loop" and uses the Parallels image format.
This image format and this block device driver allow us to have write tracking support and to make snapshots, and it has been a key technology for our OpenVZ system containers for many years now. Nowadays we decided to step onto another technology and change the image format to a widely used and well-supported one: QCOW2. Kirill Tkhai, the main developer who wrote the initial implementation of this driver, also decided to build on the device mapper framework as a basis. This gives us much less code in comparison to the ploop device, which is of course good for us. Another point here is that device mapper provides a stacking architecture, which lets us implement really complex things like stacked backups.

Let's talk a little bit about the QCOW2 image format and the features it provides by default. These are snapshots and backing files, so you can use several images for different layers of your block device, and of course dynamic cluster allocation, which is the default for this format. In this picture we can see the general structure of a QCOW2 file. In the first block we can see the QCOW2 header, which contains general information such as the virtual disk size, the offset of the L1 table in the image, and the offset of the refcount table. And of course we can see here the L1 and L2 tables and the data clusters. Data clusters are marked green as a sign that these clusters are available to the guest. We say "guest" because we're talking not only about VMs but also about containers. It's worth mentioning that refcounts are kept for L2 tables (sorry, not L1 tables) and also for data clusters, for snapshot needs. Here we can see the general scheme of mapping between a cluster's virtual address at the block device level and its offset inside the image.
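To make the header layout just described more concrete, here is a minimal Python sketch that parses the fixed part of a QCOW2 header. The field offsets follow the qcow2 specification shipped with QEMU; the synthetic header constructed at the end is purely illustrative, not taken from a real image.

```python
import struct

# Fixed part of the QCOW2 header (big-endian), per the qcow2
# specification in the QEMU source tree. Field order matches the spec:
# magic, version, backing_file_offset, backing_file_size, cluster_bits,
# size, crypt_method, l1_size, l1_table_offset, refcount_table_offset,
# refcount_table_clusters, nb_snapshots, snapshots_offset.
QCOW2_HEADER = struct.Struct(">4sIQIIQIIQQIIQ")

def parse_qcow2_header(buf: bytes) -> dict:
    """Parse the first 72 bytes of a QCOW2 image into a dict."""
    (magic, version, backing_file_offset, backing_file_size,
     cluster_bits, size, crypt_method, l1_size, l1_table_offset,
     refcount_table_offset, refcount_table_clusters,
     nb_snapshots, snapshots_offset) = QCOW2_HEADER.unpack_from(buf)
    if magic != b"QFI\xfb":
        raise ValueError("not a QCOW2 image")
    return {
        "version": version,
        "cluster_size": 1 << cluster_bits,     # 64 KiB when cluster_bits == 16
        "size": size,                          # virtual disk size in bytes
        "l1_size": l1_size,                    # number of L1 table entries
        "l1_table_offset": l1_table_offset,
        "refcount_table_offset": refcount_table_offset,
        "nb_snapshots": nb_snapshots,
    }

# Synthetic example: a v3 header for a 1 GiB disk with 64 KiB clusters;
# the table offsets here are made up for illustration.
hdr = QCOW2_HEADER.pack(b"QFI\xfb", 3, 0, 0, 16, 1 << 30,
                        0, 16, 0x30000, 0x10000, 1, 0, 0)
info = parse_qcow2_header(hdr)
```

Running `qemu-img info` on a real image reports the same size and cluster-size values that this header carries.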
The L1 and L2 tables are just tables: we index cells in these tables, and those indexes are simply parts of the virtual address of a cluster. We take a cluster's virtual address, split it into parts, use the first part as the index into the L1 table and the second part as the index into the L2 table. Using this, we can navigate through the scheme and find the particular cluster in the image.

That's all about the format itself. We can also pay some attention to the features it provides: snapshots; subclusters, which is a quite new feature, introduced maybe a few years ago, but already supported by our driver; cluster compression, which may be really useful if you're using backing files for storing the data; allocation of L2 tables, which unfortunately we don't support yet, but we possibly have a plan for it; and, yes, backing files.

Let's say a few words about the device mapper framework. Device mapper is a great thing in the Linux kernel which provides an extensive API for implementing virtual block devices. It allows us to choose between two approaches for the block device: bio-based and request-based (there are also hybrid target drivers, but in our case we're speaking about request-based). The difference between these approaches is in the API functions used and in the efficiency. With the bio-based approach, our device mapper target driver receives each bio separately, and maps and processes them separately. With the request-based approach, we work after the request queue: the request structure contains a list of sorted bios, which means the I/O scheduler has already processed our bios and possibly merged them into the request, and this may be considerably more optimal in comparison to bio-based.
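The address splitting described above can be sketched like this in Python. The geometry uses the QCOW2 defaults (64 KiB clusters, 8-byte table entries, so 8192 entries per L2 table); `resolve` is a hypothetical toy table walk for illustration, not the driver's actual code.

```python
CLUSTER_BITS = 16                      # 64 KiB clusters (the QCOW2 default)
CLUSTER_SIZE = 1 << CLUSTER_BITS
L2_BITS = CLUSTER_BITS - 3             # 8-byte entries -> 8192 per L2 table

def split_guest_offset(offset: int):
    """Split a guest (virtual) byte offset into
    (L1 index, L2 index, offset within the data cluster)."""
    in_cluster = offset & (CLUSTER_SIZE - 1)
    l2_index = (offset >> CLUSTER_BITS) & ((1 << L2_BITS) - 1)
    l1_index = offset >> (CLUSTER_BITS + L2_BITS)
    return l1_index, l2_index, in_cluster

def resolve(offset, l1_table, l2_tables):
    """Toy two-level table walk: L1 entry selects an L2 table,
    L2 entry gives the host cluster offset inside the image."""
    l1_i, l2_i, in_cluster = split_guest_offset(offset)
    l2_table = l2_tables[l1_table[l1_i]]
    host_cluster = l2_table[l2_i]
    return host_cluster + in_cluster
```

For example, with 64 KiB clusters a guest offset of `(1 << 29) + (5 << 16) + 7` splits into L1 index 1, L2 index 5, and byte 7 within the cluster.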
That's why we chose this way. Device mapper also provides an I/O suspend/resume infrastructure. For example, in our driver, to implement suspend we simply, in our implemented callbacks, force all pending I/Os to complete and write all QCOW2 top-layer metadata to the image; then, when we need to resume the device, we just reload all the metadata from disk. In between, we can safely modify the QCOW2 image from user space, for example using the qemu-img tool. You can see this approach on this slide.

Snapshot creation in QCOW2 is as simple as creating a copy of the L1 table; the refcounts of the L2 tables and data clusters should be increased by one in this case, and a snapshot switch is effectively just an L1 table switch. Usage of the snapshot engine is simple, as shown on this slide: we suspend the device, all the actions described earlier happen here, then we create a snapshot using the qemu-img tool, and then we resume the I/O and resume the device.

Another important thing is the so-called backward merge and forward merge. A backward merge copies all changed clusters from the top-layer image to the previous layer: if you have, for example, an image and a backing file, all changed clusters get applied to the backing file, and the top-layer image becomes clean from this point of view. We have backward merge support in the kernel, and we have forward merge support from user space; it's done by switching between QCOW2 layers.

We also have a driver called dm-push-backup. This driver allows controlling the writes to the device from the container, and it's a stacked device: it sits on top of the dm-qcow2 layer, so I/O from the file system comes to this driver first and is then passed down to dm-qcow2. It provides blocking notifications for writes: user space acknowledges each write, and after that the pending I/O is submitted to the dm-qcow2 layer.
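As a rough illustration of "snapshot = L1 table copy + refcount bump", here is a toy Python model. `ToyImage` and its methods are hypothetical names invented for this sketch; they model the bookkeeping idea only, not dm-qcow2's real data structures or on-disk layout.

```python
class ToyImage:
    """Toy model of QCOW2-style internal snapshots."""

    def __init__(self):
        self.l1 = {}           # l1_index -> L2 table (dict: l2_index -> cluster id)
        self.refcount = {}     # cluster id -> reference count
        self.snapshots = {}    # snapshot name -> frozen copy of the L1/L2 tables

    def write_cluster(self, l1_i, l2_i, cluster_id):
        # Remap a guest cluster to a (new) host cluster: drop the reference
        # to the old cluster, take a reference on the new one.
        l2 = self.l1.setdefault(l1_i, {})
        old = l2.get(l2_i)
        if old is not None:
            self.refcount[old] -= 1
        l2[l2_i] = cluster_id
        self.refcount[cluster_id] = self.refcount.get(cluster_id, 0) + 1

    def snapshot(self, name):
        # "Creating the L1 table copy": freeze the current tables and
        # increase by one the refcount of every cluster they reference.
        self.snapshots[name] = {i: dict(l2) for i, l2 in self.l1.items()}
        for l2 in self.l1.values():
            for cluster in l2.values():
                self.refcount[cluster] += 1

    def switch(self, name):
        # A snapshot switch is effectively just an L1 table switch.
        self.l1 = {i: dict(l2) for i, l2 in self.snapshots[name].items()}
```

A cluster whose refcount is greater than one is shared with a snapshot, which is exactly why a later write to it must go copy-on-write instead of overwriting in place.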
Here we can see a fairly simple picture. There may be some errors in it, because it shows dm-ploop, and yes, we have dm-ploop too, but today we're speaking about dm-qcow2; it's not so important. The general scheme is still useful: we have a VM or container, and I/Os get submitted to these devices. We also have write notifications on this side while bios are being processed and submitted, and the backup utility, a user-space utility, acknowledges the writes for the clusters through a kernel interface, after which the I/O gets submitted to the particular block device, dm-qcow2, dm-ploop, or something similar. And yes, here we have a qemu-nbd server, but again, it's not so important.

What about qemu-nbd, and why aren't we trying to use it as the driver? As we know, the kernel provides an in-kernel driver for network block devices (NBD), and qemu-nbd is just a user-space tool which utilizes this driver and exposes QCOW2 images as block devices. Of course we could run containers on this basis, but there is a problem: qemu-nbd is too slow. On our customers' production workloads we have something like 500 or 1000 system containers, and these containers may run a lot of different software, Docker containers or even Kubernetes nodes. That's a problem, because all of them submit a lot of bios, and qemu-nbd becomes a bottleneck in this case. Another risk for us comes from the fact that qemu-nbd is a user-space driver. It means that if there is a problem like memory overcommit, and we have memory overcommit in a lot of cases, this may lead to the qemu-nbd server blocking, which makes the whole system unresponsive and brings all the containers down.
Another interesting application for this is LXC, because the LXC project also aims to provide system containers, for infrastructure needs and things like that. The LXC team provides a lot of storage options: loop device, directory, btrfs, LVM, and so on. And it looks like QCOW2 may be a good alternative to the loop device driver here, because it provides a higher isolation level and it also allows us to have snapshots and so on.

Yet another application for this is the vhost-blk driver, which was developed by Andrey Zhadchenko recently. This driver allows increasing density and improves performance for QEMU/KVM virtual machines, because we're trying to get rid of a lot of switches between user space and the kernel: as we know, virtio-blk requests for a QEMU VM are normally handled in user space. On this picture we can see the request structure and how this works, and the vhost-blk driver allows us to have a scheme like this, where we don't need to send requests up to user space at all; we can do everything in the kernel. Another useful feature here is that vhost-blk allows multiple threads for I/O, which is also valuable, of course. Here we can see some numbers, some results for this driver; as we can see, they're quite good.

Okay, I think we are ready to answer your questions. Thank you for listening.