Hello. Thanks for listening. My name is Alberto Garcia. I work for Igalia on the QEMU project. In this presentation I'm going to talk about the work that I have been doing lately on the qcow2 file format. As you know, this is the native file format used by QEMU, and it supports many features such as encryption, compression, backing files, etc. But the question that I'm going to try to answer today is: why is it that sometimes this format is slower than a raw file? There are many reasons for that. It can be because it hasn't been configured correctly and we're not using the right options for our setup. It can also be that the driver can be improved; there's still room for improvement there. Three years ago, at KVM Forum 2017, I talked about those things, so you can check that talk; it's available on YouTube. Today I want to focus on the problems that are a result of the very design of the qcow2 file format. Let's start with the format itself and how it works. The basic idea of a qcow2 file is that it is divided into clusters of the same size, 64 KB by default, but this can be changed when the image is created, from 512 bytes up to 2 MB. There are different cluster types; I'm not going to go into the details now, but let's focus on the data clusters, which contain the data that the guest can see. Every time the guest needs to read data, QEMU goes to the qcow2 file. If the cluster has been allocated there, it just reads the data from the cluster. But if the cluster has not been allocated, then it reads zeros, or, if there is a backing file, it goes to the backing file and checks whether the data is there. The cluster is the smallest unit of allocation. So every time we allocate a new cluster, if the write request is smaller than the cluster size, we need to fill the rest of the cluster with data, which means we either go to the backing file and get it from there, or fill it with zeros.
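The copy-on-write arithmetic described above can be sketched in a few lines. This is my own simplified illustration, not code from the qcow2 driver: given a write into unallocated clusters, it computes the head and tail regions that must be filled from the backing file (or with zeros).

```python
# Sketch of qcow2 copy-on-write: when a write does not cover a whole
# cluster, the rest of the cluster must be filled with data from the
# backing file (or with zeros).
CLUSTER_SIZE = 64 * 1024  # qcow2 default

def cow_regions(offset, length, cluster_size=CLUSTER_SIZE):
    """Return the (offset, length) regions that need copy-on-write when
    writing [offset, offset + length) into unallocated clusters."""
    start = offset - offset % cluster_size                       # aligned start
    end = -(-(offset + length) // cluster_size) * cluster_size   # aligned end
    head = (start, offset - start)                     # before the write
    tail = (offset + length, end - (offset + length))  # after the write
    return [r for r in (head, tail) if r[1] > 0]

# A 4 KB write in the middle of a 64 KB cluster forces ~60 KB of extra I/O:
print(cow_regions(8192, 4096))   # [(0, 8192), (12288, 53248)]
# A full-cluster write needs no copy-on-write at all:
print(cow_regions(0, 65536))     # []
```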
So in the case of the image that you see here, if we write to the area in pink, we will need to go to the backing file, read the areas in dark blue, and copy them to the active file. Or, if there's no backing file, we just fill them with zeros, but we still need to fill them. The problem is that QEMU needs to perform additional I/O to copy the rest of the data, so copy-on-write can be an expensive operation. As you can imagine, when we increase the cluster size we have to do more copy-on-write, so the performance goes down. You can see that in the table. I have to mention, though, that if you don't have a backing file then, as I said, we have to fill the cluster with zeros, but QEMU nowadays uses fallocate() to try to do this in a more efficient way. If that works, that is, if the file system and the operating system support it, this is very fast, and then the cluster size doesn't have much effect. But if that doesn't work, QEMU falls back to the slow path of writing actual zeros, and then you get the numbers that I'm showing in the table. If there is a backing file, however, there's no alternative: we need to go to the backing file and get the data from there, so that's where the cluster size has an effect. The other consequence of this is that the larger the cluster size, the more I/O and copy-on-write we do, and the bigger the image gets. You write the same data, but you get a bigger image as a result, because you are duplicating data from the backing file. How much bigger? This depends a lot on the use case. There are reports of images 30% or 40% larger, but it varies. I did a couple of tests for this presentation, and you can see that if we take an empty image and write 100 megabytes' worth of random 4 KB requests, the impact of having a larger cluster size is very big. With the default 64 KB clusters, we get a 1.6 gigabyte image, which is more than 10 times what we were trying to write, which is a lot.
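The growth figures above are easy to sanity-check: in the worst case, every random 4 KB write lands in a not-yet-allocated cluster and allocates the whole cluster. A back-of-the-envelope sketch (my own illustration):

```python
# Worst-case image growth for random small writes on an empty qcow2
# image: each write that hits a new cluster allocates the whole cluster.
KB, MB, GB = 1024, 1024**2, 1024**3

total_written = 100 * MB
request_size = 4 * KB
requests = total_written // request_size          # 25600 requests

for cluster_size in (4 * KB, 64 * KB, 2 * MB):
    # Upper bound: every request lands in a different cluster.
    worst_case = requests * cluster_size
    print(f"{cluster_size // KB:5d} KB clusters -> up to {worst_case / GB:.1f} GB")
```

The 64 KB line gives the 1.6 GB observed in the test; with 2 MB clusters the bound is 50 GB, while the 29 GB actually measured is lower because some requests happen to land in already-allocated clusters.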
But if we go to the maximum cluster size, we get 29 gigabytes, which is almost 300 times the amount of data that we wanted to write, which is huge. Of course, this is an extreme case. Normally, in a real-world scenario, we don't just issue random write requests. But it gives an idea of what the problem is. Then I did a second test. I took an empty one-terabyte image and created a file system on it. You can see that the file system itself, the metadata used by the file system, is just 1.1 gigabytes. But if you increase the cluster size and take it to the maximum, then you use one more gigabyte, just for creating an empty file system with nothing else in it. So, in summary: if we increase the cluster size, we get less performance, because there's additional I/O that needs to be done, and we also get larger images with duplicated data. So if things are that clear, why don't we just reduce the cluster size? The problem is that it's not so easy, because smaller clusters mean more clusters, and also more metadata. What does this mean? Apart from the guest data itself, qcow2 images also need to store metadata about the clusters. The important pieces are the cluster mappings, which map guest addresses to host addresses, and the reference counts. All clusters in qcow2 have reference counts, as we're going to see later. So if we're going to have more clusters, we're going to have more of both, which means more metadata. The mapping from guest offsets to host offsets is done using a structure that we call the L1 and L2 tables. This is a simple two-level structure that maps virtual offsets to host offsets; you can see an example in this graphic. There is just one L1 table per image (one per snapshot, actually, because qcow2 supports snapshots, but we're not going to go into that now). The table itself is very small: for a one-terabyte image it's just 16 KB, so it's nothing. It's stored contiguously in the image file.
And QEMU always keeps it in memory, because it's very small, so there's no problem with that. Basically, the L1 table just contains pointers to the L2 tables. Of the L2 tables there can be many; initially there are none, but they are allocated on demand as the image grows. An L2 table is always exactly one cluster in size, never more, never less. And they basically contain pointers to the data clusters, plus additional information that we're going to see later. The thing is that, of course, if we reduce the cluster size, then we need more entries. Graphically: if we have two clusters and we make the clusters half as big, then we're going to have four clusters and twice as many L2 entries. Half the cluster size, twice the metadata; that's the basic idea. In this table you can see the maximum amount of metadata for a one-terabyte image. As you see, if you halve the cluster size, you double the metadata, which makes a very big difference, of course. So choosing the right cluster size has a very big impact on the amount of metadata in the image. What does this mean? Every time the guest does an I/O request, QEMU needs to go to the L2 table and get the host offset; it needs to translate the guest offset into the host offset. That's one additional I/O operation per request, and it has a very big impact on performance. So what QEMU does in order to minimize it is keep the L2 tables in memory: there's a cache in the qcow2 driver, the L2 cache, for that purpose. I talked about it in more detail in the previous presentation that I mentioned earlier. And it has a very big impact: if we increase the cache size, we get much more performance. In the example that you see here, the maximum cache needed is 5 megabytes, and with it we get 40,000 operations per second.
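The two-level lookup and the "half the cluster size, twice the metadata" rule can be illustrated with a small sketch of the index arithmetic, following the layout described above (standard L2 entries are 8 bytes, and an L2 table occupies exactly one cluster). This is a simplified illustration, not the driver code:

```python
# Simplified qcow2 address translation and metadata sizing.
def l1_l2_indexes(guest_offset, cluster_size=64 * 1024):
    l2_entries = cluster_size // 8                 # entries per L2 table
    cluster_nr = guest_offset // cluster_size
    return (cluster_nr // l2_entries,              # L1 index (which L2 table)
            cluster_nr % l2_entries,               # L2 index (which entry)
            guest_offset % cluster_size)           # offset inside the cluster

def l2_metadata_bytes(image_size, cluster_size=64 * 1024):
    # 8 bytes of L2 entry for every data cluster in the image
    return (image_size // cluster_size) * 8

def l1_size_bytes(image_size, cluster_size=64 * 1024):
    l2_entries = cluster_size // 8
    l2_tables = -(-image_size // (l2_entries * cluster_size))  # ceil division
    return l2_tables * 8                           # 8 bytes per L1 entry

TB = 1024**4
print(l1_size_bytes(TB) // 1024)                   # 16: the tiny 16 KB L1 table
print(l2_metadata_bytes(TB) // 1024**2)            # 128 MB of L2 tables
print(l2_metadata_bytes(TB, 32 * 1024) // 1024**2) # 256: half the cluster size,
                                                   # twice the metadata
```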
But if we reduce the cache size, the performance goes down very quickly, because it means that we need to go to disk to get the L2 metadata more often. So reducing the cluster size means we have much more metadata, and we need much more RAM to keep that metadata in memory. Then there are the reference counts. Every cluster in a qcow2 image has a reference count, all types of clusters, not just data clusters. These are used, for example, for snapshots, because you need to know who is using each one of the clusters. They are stored in a two-level structure very similar to the L1/L2 tables that we just described. Of course, allocating new clusters has an additional overhead, because you need to update the reference counts, and with more clusters we need to allocate more of them. In general, with a lot of small clusters we need to allocate more data clusters, more L2 tables, and more refcount blocks. All that together means that, although reducing the cluster size normally increases performance because there's less copy-on-write involved, once we go below a certain limit, in this example 16 KB, the performance drops very quickly. As you can see, the performance with 4 KB clusters is horrible, even though with 4 KB clusters there is no copy-on-write at all: we have to allocate so many clusters, so many L2 tables, and so many refcount blocks that the performance is very bad. So the situation so far is that we cannot have very big clusters, because they waste too much space and there's additional I/O needed for copy-on-write. And we cannot have very small clusters, because they increase the amount of metadata, and if we decrease the size too much the performance also becomes very bad. And this is a consequence of the format itself; it's not something that you can fix in the driver. So what can we do about it?
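The refcount side scales the same way as the L2 tables. A small sizing sketch, assuming the qcow2 defaults (16-bit refcount entries; refcount blocks, like L2 tables, are one cluster in size):

```python
# Refcount metadata sizing sketch for a qcow2 image: one refcount
# entry (16 bits by default) per cluster of any type.
def refcount_metadata_bytes(image_size, cluster_size=64 * 1024,
                            refcount_bits=16):
    clusters = image_size // cluster_size
    return clusters * refcount_bits // 8    # bytes of refcount blocks

TB = 1024**4
for cs in (4 * 1024, 16 * 1024, 64 * 1024):
    total = refcount_metadata_bytes(TB, cs)
    print(f"{cs // 1024:3d} KB clusters -> {total // 1024**2} MB of refcounts")
```

For a one-terabyte image this goes from 32 MB with 64 KB clusters to 512 MB with 4 KB clusters, which is one more reason why very small clusters perform so badly.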
The solution that I'm describing in this presentation is called subcluster allocation. The basic idea is that we have big clusters in order to reduce the amount of metadata, but each one of them is divided into subclusters that can be allocated separately. So we get faster allocations and less disk usage. Graphically, a normal L2 table looks like this: we have two data clusters, as you can see. With subclusters, each one of the data clusters is divided into 32 subclusters of the same size, which are allocated separately. In this case, only the areas in blue are actually allocated and use space on disk. Internally, an L2 table contains, as I said earlier, pointers to the data clusters. An entry looks basically like this: the cluster offset plus a few more bits that indicate whether the cluster is allocated or not, whether it's compressed or not, or whether it contains zeros. "All zeros" is a feature of qcow2 that means that the cluster doesn't contain any data other than zeros, so there's no need to go to the data cluster and read from there; we just know that it's zeros, and we can return zeros without doing the I/O. If we have subclusters, we need to store additional information for them, and there's no space left here. So we added extended L2 entries, which are basically very similar to the ones that we had before, but contain an additional bitmap indicating the status of each subcluster. With this, each of the individual subclusters can be allocated, unallocated, or all zeros. Compressed clusters don't have this, however: compressed clusters cannot be divided into subclusters, and anyway it doesn't really make much sense to use compression together with extended L2 entries, because they target different use cases. So there are two use cases that I see for subcluster allocation.
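The extra bitmap in an extended L2 entry can be decoded as in the sketch below. I'm following the qcow2 specification here, where the additional 64-bit word holds two 32-bit bitmaps: bits 0-31 mark subclusters as allocated, bits 32-63 mark them as "all zeros" (a simplified illustration, not the driver code):

```python
# Decode the subcluster status bitmap of a qcow2 extended L2 entry:
# bits 0-31 = "allocated", bits 32-63 = "reads as all zeros".
def subcluster_status(bitmap, index):
    allocated = (bitmap >> index) & 1
    all_zeros = (bitmap >> (32 + index)) & 1
    if all_zeros:
        return "zeros"         # read returns zeros, no data I/O needed
    if allocated:
        return "allocated"     # data is in this image
    return "unallocated"       # fall back to the backing file

# Example: subcluster 0 allocated, subcluster 1 marked as all zeros.
bitmap = (1 << 0) | (1 << (32 + 1))
print(subcluster_status(bitmap, 0))   # allocated
print(subcluster_status(bitmap, 1))   # zeros
print(subcluster_status(bitmap, 2))   # unallocated
```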
One of them is having very large clusters, because we want to minimize the amount of metadata and the amount of memory that we need, but we still want to have good I/O performance and smaller images. And the other use case is that we want to maximize performance, so we want to keep the allocation unit as close as possible to the guest block size; we want to minimize the amount of copy-on-write and get the maximum performance. What does this mean? If we make the subcluster size equal to the request size, which here means the file system block size, then there's no copy-on-write at all and we get the maximum performance. We can see here that, compared to the default setup without subclusters, in some cases we get 10 times more I/O operations per second. And in the cases where the subcluster is 4 KB or less, which is the size of the request in this example, we get the maximum performance, which is 12-13K I/O operations per second. Without a backing file, the relative differences are the same; of course, it is faster overall because we don't need to go to the backing file to read the data. And again, I want to mention that if the file system supports zeroing the cluster efficiently with fallocate(), then that is already very fast, and using subclusters doesn't really make a difference there; it's not going to be faster than that. So you have to take that into account. About space: of course, now that we have smaller allocation units, the images grow much less. We repeated the random-write test that I mentioned before, writing 100 megabytes in 4 KB write requests, and the end result is much, much smaller, as you can see in the example. So this improves disk usage a lot. And although extended L2 entries are twice as large as normal L2 entries, so in principle they would use more metadata, each of them now points to 32 subclusters, so the end result is that we have 16 times less metadata for the same unit of allocation.
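That 16x figure follows directly from the entry sizes, and it can be checked with a small sketch (my own arithmetic: standard entries are 8 bytes and cover one cluster; extended entries are 16 bytes and cover 32 subclusters):

```python
# L2 cache needed to map a whole image, for the same allocation-unit
# size: standard 8-byte entries vs 16-byte extended entries.
def l2_cache_bytes(image_size, alloc_unit, extended):
    if extended:
        # A 16-byte entry covers one cluster = 32 subclusters of `alloc_unit`.
        return (image_size // (32 * alloc_unit)) * 16
    return (image_size // alloc_unit) * 8     # 8 bytes per cluster

TB, MB = 1024**4, 1024**2
unit = 64 * 1024                              # 64 KB allocation unit
print(l2_cache_bytes(TB, unit, False) // MB)  # 128 (64 KB clusters)
print(l2_cache_bytes(TB, unit, True) // MB)   # 8 (2 MB clusters with 64 KB
                                              #    subclusters): 16x less
```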
So if we compare units of allocation, clusters with traditional L2 entries versus subclusters with extended L2 entries, we see that for a 64 KB cluster size we would need 128 megabytes of cache for a one-terabyte image, but with extended L2 entries and 64 KB subclusters we only need 8 megabytes, which is much less. So we can have much larger clusters and keep good performance without needing so much memory for the cache. Now, some things need to be taken into account. All this looks good, but it is not magic. This feature is useful during allocation; once a cluster is allocated, qcow2 works just fine with or without subclusters and you get good performance, so on a fully allocated image this is not going to help much anymore. Again, with compressed images it doesn't make sense; that's a completely different use case, so there you're going to have twice as much metadata and no benefit. And if your image doesn't have a backing file, maybe you won't see any speedup: as I said, QEMU first tries to use fallocate() to allocate clusters efficiently, so try that first to see if it already helps in your scenario. If you're using backing files, though, fallocate() is not going to help in any case. And then, of course, images created with extended L2 entries are not backward compatible: you're not going to be able to read them with older versions of QEMU. I don't expect that this feature can be backported easily, so you will need the latest version of QEMU. So, how do you try this? This is not available in a QEMU release yet; it will probably be available in QEMU 5.2, but the feature is complete and it's already in the repository, so you can test it already. You just download the latest version from Git, compile it, and create an image with the extended_l2 option enabled. That's all. You will also probably want to use a larger cluster size.
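Putting those steps together, a minimal session could look like the following. The repository URL, image name, and sizes are illustrative; `extended_l2` and `cluster_size` are the qemu-img creation options the talk refers to:

```shell
# Build the latest QEMU from Git (feature not yet in a release)
git clone https://gitlab.com/qemu-project/qemu.git
cd qemu && ./configure && make

# Create an image with extended L2 entries; with subclusters it makes
# sense to use a larger cluster size (2 MB clusters -> 64 KB subclusters)
./qemu-img create -f qcow2 -o extended_l2=on,cluster_size=2M disk.qcow2 1T
```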
The default cluster size is 64 KB, but with this feature it makes sense to use larger clusters, so be sure to try those. That's all that you need to do. Again, feedback, bug reports, etc. are very much appreciated. This feature is complete, but it's new, so any testing that you do, any suggestions, etc., we will be happy to hear about. You can write to the mailing list or you can contact me directly. And that's basically it. I would also like to take the opportunity to thank Outscale, the company that is sponsoring all my work on QEMU and on this feature in particular. I hope that you enjoyed the presentation, and I'm open to any questions. Thank you.