Hello, everyone. I'm Nir, I work on oVirt storage, and today I'm going to talk about teaching VDSM new tricks. First we'll talk about why oVirt needs support for 4K, and what challenges we faced when we tried to add this feature to a mature system. We'll focus on the interesting issue of detecting the block size, then see how we use block size detection in oVirt, and finish with a short demo.

So, a bit about what 4K is. Older disks used to support a block size of 512 bytes: you can read and write in units of this size, and you cannot write just one byte to the disk. Newer storage supports both 4K and 512; many disks emulate a 512-byte block size, and some newer disks support only 4K and may be cheaper and faster.

The main reason we need to support 4K is RHHI, Red Hat Hyperconverged Infrastructure. It is a hyperconverged solution that puts hosted engine, GlusterFS and VDO together. Why do we want this setup? You can take several cheap servers with some disks in each server, combine them, and create a small data center without any separate storage. The servers are used both as compute nodes and as storage, and people like this combination. The setup is pretty complex, but to create simplicity we need complex software behind it.

So what is VDO? VDO is the new data deduplication and compression layer in Linux. It can give you 10 times more storage with the same hardware, so it is very useful, and RHHI wants to use it. VDO really wants to use a 4K sector size; it is designed for this. It can emulate the older block size, but that is not efficient. And of course there is another reason: supporting users who want to use new disks. Maybe they bought new disks and currently cannot use them.

So with 4K support in oVirt, users can use local FS storage with 4K block size, GlusterFS storage with 4K block size, or VDO on top of any storage. So what are the
challenges? When VDSM and oVirt were created, more than 10 years ago, 4K support was not important, and trying to modify an old system is a pretty bumpy ride.

The first issue, a trivial one, is that the oVirt storage format assumes that the block size is 512. We assume that we can access volume metadata from any host and read and write 512 bytes to some area in storage. This does not really work on 4K storage, where you can write only complete blocks, unless you have complicated locking. This was fixed by introducing a new storage format, storage format V5. In oVirt every storage format has a version, and we usually introduce a new version when we add new features. Storage format V5 is available since 4.3.0, so everyone running oVirt 4.3 has it. It supports any storage block size, and it uses the same layout for any block size. So with this format we can use 4K storage.

But of course a storage format that supports everything is not enough, because we need to use sanlock, and sanlock cannot detect the block size on file storage. There is no way to get the block size on file storage; you can do some magic to detect it, but there is no official way. And even if we added a way, sanlock is not the place to detect the storage block size, because VDSM, the agent that runs on every host, also needs to know the block size and must be synchronized with sanlock. So what we did was change the sanlock API to support 4K: VDSM detects the block size and tells sanlock what it is. We can see in this example that when you create a sanlock lockspace, we also specify the sector size and the alignment; the alignment is also related to the sector size. With this, VDSM can use sanlock with any block size, and we are ready for 4K, right?
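As a rough illustration of handing the detected block size to sanlock, here is a minimal sketch. It assumes the sanlock Python binding exposes `write_lockspace()` with `align` and `sector` keyword arguments. The helpers `lockspace_args` and `create_lockspace`, the table of valid combinations, and the example name and path are hypothetical and are not taken from VDSM.

```python
MiB = 1024 ** 2

# Assumption for this sketch: alignments accepted for each sector size.
VALID_ALIGNMENTS = {
    512: (1 * MiB,),
    4096: (1 * MiB, 2 * MiB, 4 * MiB, 8 * MiB),
}


def lockspace_args(name, path, sector, align, max_hosts):
    """Build keyword arguments for creating a lockspace (hypothetical helper)."""
    if sector not in VALID_ALIGNMENTS:
        raise ValueError(f"unsupported sector size: {sector}")
    if align not in VALID_ALIGNMENTS[sector]:
        raise ValueError(f"unsupported alignment {align} for sector {sector}")
    return {
        "lockspace": name,
        "path": path,
        "max_hosts": max_hosts,
        "align": align,
        "sector": sector,
    }


def create_lockspace(**kwargs):
    """Create the lockspace on storage; requires sanlock to be installed."""
    # Imported lazily so the argument handling can be tried without sanlock.
    import sanlock  # assumption: binding accepts align= and sector=
    sanlock.write_lockspace(**kwargs)


# Example values only; the name and path are placeholders.
args = lockspace_args(b"example-lockspace", "/path/to/ids",
                      sector=4096, align=1 * MiB, max_hosts=250)
print(args["sector"], args["align"])
```

The point of the design is that the caller, not sanlock, decides the sector size, so both sides always agree on it.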
Well, no, because VDSM used a hard-coded block size everywhere. In many places we hard-coded the value to 512 bytes, probably because making it configurable was not important in the past. We solved it by moving to bytes: now all VDSM APIs and the metadata that VDSM writes use bytes, and we don't store or process sector sizes anywhere. For example, this internal API used to get the size in sectors, hard-coded sectors, and used to write the same hard-coded value to storage. Now we get the capacity in bytes and save the capacity to the metadata with a new key, and this is part of storage format V5.

With all these changes we are still not ready, because we need to detect the block size somehow in order to use it, and there is no way to do it. The way to solve this is to look in the QEMU code; you can find a lot of interesting stuff in the QEMU code. We found that QEMU solves this issue by accessing storage, and this is what VDSM is doing now: detecting the block size by accessing storage. We'll talk about the details later, but with this VDSM can detect the block size, and we can continue. Maybe we are ready?

OK, with all these changes, how do we know that we did not break VDSM? We have a big Python application with very few tests, because the people who wrote this code thought that tests were not very important. And when you have few tests, it's like this room: maybe I change something here, and then the ceiling falls down at the other end of the room. So we really need good tests. What we did was improve test coverage in the storage area. When we started, storage had the lowest test coverage in VDSM compared to other components, and now we have the best coverage. We added a lot of testing infrastructure. For example, we have this tmp_repo that knows how to create a storage domain for you. This is a real storage domain; it's not some fake
and not using any mocks; it creates real storage on your laptop or on Travis. We run it with this user_mount. The user_mount we see here is another feature that gives you a mount point with several types of file systems and several block sizes, so test code using this infrastructure runs multiple times with all the combinations automatically. We have even easier infrastructure, like this user_domain, which creates a domain for you. Then you can create volumes, modify them, and do storage operations: delete the volumes, query the metadata, and everything. And of course this test runs several times with all the configured block sizes and file systems, so we can tell, for example, that on XFS this code fails.

So this was VDSM before we changed it; we took a picture before we did it. You can see all these hard-coded block sizes and all the rust, not the good kind of Rust that we talked about the other day. And this is now, much better. I think we made a lot of progress. Maybe the picture is not very accurate, but we did a lot of work.

So with VDSM ready, can we use 4K storage? We can create a storage domain, and then we try to provision a VM on the storage, and we found that when you install the VM, the installation fails. After spending a lot of time with strace on QEMU, we found that QEMU was reading and writing unaligned sizes. It turns out that QEMU did not work on certain combinations of Gluster and XFS, or XFS with local storage. The fix was to send some patches to QEMU, and with the help of Kevin and Max we got them merged quickly. QEMU on RHEL 7.7 supports all these edge cases, and everyone running oVirt 4.3 has this fixed QEMU.

So now we are really ready to use 4K. And then we found that yes, we can provision a VM with 4K and we can start it, but the VM thinks that the block size is 512 and not 4K. We found out how to configure it using libvirt, and then we found that the VM does not boot. So it turns out
that BIOS does not support 4K. We can solve it using OVMF, which not a lot of people know about. It's not supported in 4.3 yet; maybe it's supported in 4.4, I'm not sure. So currently we keep the current way, where the VM thinks that this is old 512-byte storage. Basically what we can provide now is this setup: the guest thinks that it's running on a logical block size of 512, QEMU knows the real value, and if the guest writes something unaligned, QEMU can fix the write. Of course this can introduce performance issues, but we don't have any results yet, so we don't know if this is worse or better than the old way. But with this you can use VDO.

So how do we detect the block size? Let's see how QEMU does it. This is the new way, which works on Gluster and XFS. First we read one byte with direct I/O. This is expected to fail on any storage, unless we cannot detect the block size. So if it succeeds, QEMU knows that we cannot detect the block size, maybe because there is no block size requirement, and QEMU falls back to a safe value that works in any case. It's not the optimal value, but it's safe. If the one-byte read fails, we try the next value, and if it succeeds we know that this is the block size. If not, we try the next value, and if that succeeds, this is the value. And of course if nothing works, QEMU will fail to use direct I/O with this image.
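The probing order just described can be sketched in a few lines of Python. This is only an illustration of the decision logic, not QEMU's code (QEMU is written in C): the candidate sizes, the safe fallback value, and the names `probe_block_size` and `fake_storage` are assumptions made for the example, and the direct read is replaced by a callable so the logic can be run anywhere.

```python
# Assumptions for this sketch: the sizes that are tried, and the safe value
# used when storage does not enforce any block size.
CANDIDATES = (1, 512, 4096)
SAFE_BLOCK_SIZE = 4096


def probe_block_size(try_read, candidates=CANDIDATES, safe=SAFE_BLOCK_SIZE):
    """
    Return the block size to use, given try_read(size) -> bool, which reports
    whether a direct read of `size` bytes succeeded.
    """
    for size in candidates:
        if not try_read(size):
            continue  # Unaligned for this storage, try the next size.
        if size == 1:
            # A one byte direct read worked, so the storage enforces nothing
            # and the block size cannot be detected. Use the safe value.
            return safe
        return size
    raise OSError("direct I/O cannot be used with this image")


def fake_storage(block_size):
    """Simulate storage accepting only reads that are a multiple of block_size."""
    return lambda size: size % block_size == 0


for real in (512, 4096, 1):
    print(real, "->", probe_block_size(fake_storage(real)))
```

Passing the read as a callable keeps the ordering and fallback rules separate from the actual I/O, which makes the rules easy to test with simulated storage.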
Now there are a few issues with this. One of them is that you cannot detect the block size on an unallocated block in XFS, usually when it's a remote XFS over Gluster. This was fixed by changing qemu-img create to always allocate the first block, so when you create a new image with qemu-img create on oVirt 4.3 you get this fix, and we are not affected. And of course there is the issue of NFS. NFS does not really enforce any block size or alignment, because it does not actually do any direct I/O on the server side, so in this case QEMU falls back to a safe value.

How do we do the same in VDSM, or more precisely in ioprocess, the helper process used by VDSM? We create a temporary file and then do the same flow: writing one byte, then writing more bytes, until we find the value. With this we don't have any issues; we can detect any value, even with Gluster on XFS, so there are no issues with Gluster. Of course with NFS we detect the same value of one byte, and we fall back to a safe value.

So we know how we detect the block size and how we made VDSM ready. How do we use all this? First, oVirt 4.3.8, which was released recently, supports 4K by default, so you don't need to do anything. If you have an older version, you can use the configuration file to enable it, like this here, or you can use it to disable the feature if it causes trouble. But generally you don't have to do anything.

When VDSM reports its capabilities, it also reports the block sizes supported by all the storage types. Every storage type in VDSM has this supported block sizes list. In this case we can see that the Gluster domain supports 512 and 4K, and also this magic auto value, and other storage types support different sizes, so we can introduce this feature gradually. What is this auto value? It's zero; you see zero in the logs, and it means that VDSM will detect the size for you. This is the way that we use currently.
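The write-based detection on a temporary file can be sketched like this. It is a minimal Linux-only illustration, not the ioprocess implementation: the function name `detect_block_size`, the candidate sizes and the temporary file prefix are assumptions. A result of 1 means the storage enforces no block size (as described for NFS), and the caller would then fall back to a safe value.

```python
import errno
import mmap
import os
import tempfile

# Assumption for this sketch: the sizes that are tried, smallest first.
CANDIDATES = (1, 512, 4096)


def detect_block_size(dirpath, candidates=CANDIDATES):
    """
    Return the smallest size in candidates that can be written with direct
    I/O to a temporary file created in dirpath.
    """
    fd, path = tempfile.mkstemp(dir=dirpath, prefix=".probe-")
    os.close(fd)
    try:
        # Reopen for direct I/O; this may fail if it is not supported.
        fd = os.open(path, os.O_WRONLY | os.O_DIRECT)
        try:
            # Anonymous mmap memory is page aligned, as direct I/O requires.
            with mmap.mmap(-1, max(candidates)) as buf:
                for size in candidates:
                    view = memoryview(buf)[:size]
                    try:
                        os.pwrite(fd, view, 0)
                    except OSError as e:
                        # EINVAL means this size is not aligned to the
                        # storage block size; anything else is a real error.
                        if e.errno != errno.EINVAL:
                            raise
                    else:
                        return size
                    finally:
                        # Release the view so the mmap can be closed.
                        view.release()
        finally:
            os.close(fd)
    finally:
        os.unlink(path)
    raise OSError(errno.EINVAL, "cannot detect block size")
```

Calling `detect_block_size()` with the mount point of the storage as `dirpath` probes that storage; the temporary file is always removed, whether detection succeeds or fails.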
We support setting a specific block size, but engine uses zero when it creates a storage domain. If you request a specific block size, we validate it against the actual block size and make sure that they match, because otherwise storage operations will fail later. If validation fails, we have this new exception about storage domain block size mismatch.

Another thing that we have to compute is the alignment. In the past the alignment was always one megabyte; it's the alignment used by sanlock. Now it depends on the block size, and we compute it from the number of hosts requested by the user. The default is 250; if you use older storage with the smaller block size, we use the sanlock default value of 2000. Generally this is the default value used by engine, so you don't have to consider anything, but it's possible to create bigger setups with more hosts.

Finally, when we have all the information, we keep it in the storage domain metadata. This, for example, is the storage metadata for file storage; in block storage we keep it in the VG tags.

So just to recap, this is the flow when we create a storage domain: we detect the block size, we validate it against the requested value in case the user asked for a specific value, we compute the alignment, we create the metadata, we create the directory structure, and we initialize sanlock with all the details.

With this you can create a 4K storage domain, but we are not done yet. How do you manage several hosts when some hosts support 4K and some do not? Engine has to check the host capabilities that we saw before. When we activate a host, we check its capabilities and store the values in the database. When we create a storage domain, engine checks that all the hosts support automatic block size detection, and it uses automatic detection only if all the hosts support it.
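The validation and alignment steps of this flow can be sketched as follows. This is an illustration, not VDSM code: the names `validate_block_size`, `alignment` and `BlockSizeMismatch` are hypothetical, and the table mapping the number of hosts to an alignment on 4K storage (250 hosts per megabyte of alignment, up to 8 MiB) is an assumption based on sanlock's documented limits.

```python
MiB = 1024 ** 2
BLOCK_SIZE_AUTO = 0  # The "auto" value: detect the block size.


class BlockSizeMismatch(Exception):
    """Requested block size does not match the storage block size."""


def validate_block_size(requested, detected):
    """Return the block size to use, or raise if the request cannot be met."""
    if requested not in (BLOCK_SIZE_AUTO, detected):
        raise BlockSizeMismatch(
            f"requested block size {requested}, storage uses {detected}")
    return detected


# Assumption: (maximum hosts, alignment) pairs for 4K storage.
ALIGNMENT_4K = (
    (250, 1 * MiB),
    (500, 2 * MiB),
    (1000, 4 * MiB),
    (2000, 8 * MiB),
)


def alignment(block_size, max_hosts):
    """Compute the sanlock alignment for a block size and number of hosts."""
    if block_size == 512 and max_hosts <= 2000:
        return 1 * MiB
    if block_size == 4096:
        for limit, align in ALIGNMENT_4K:
            if max_hosts <= limit:
                return align
    raise ValueError(
        f"unsupported block_size={block_size} max_hosts={max_hosts}")


block_size = validate_block_size(BLOCK_SIZE_AUTO, detected=4096)
print(block_size, alignment(block_size, max_hosts=250) // MiB, "MiB")
```

Validating before anything is written means a mismatch is reported when the domain is created, instead of surfacing later as failed storage operations.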
So engine creates a domain with automatic block size detection, then gets the domain information, learns what VDSM created, and stores it in the database. Next time, when you add a host to the same setup, engine knows whether this host can work with this storage domain. If the host cannot work with this storage domain, you cannot activate it. Basically all this is not needed if you run a 4.3.8 system, but we introduced the feature gradually, so it's possible that you will see these issues if you have an older version. And of course if any host does not support the value, engine falls back to the older way, and if you have 4K storage it will not work. This means that you need to upgrade your hosts, or maybe enable the configuration that we saw before.

So let's see a short demo of creating a storage domain. Basically all this complexity is hidden from the user.

OK, so we can see how to check the host's capabilities. We see that the host returns supported block sizes for several domain types. We can see that Gluster is supported, and this is the domain type we use in this demo. Now the user just creates a new domain in the normal way. There is no configuration and you don't need to know anything about it; it just works. Of course you need oVirt 4.3.8, or an older version that is enabled for 4K support.

So now we created the domain, and we can check what VDSM did when it created it. We can check that engine asked for automatic block size detection: we see block size equals zero, which means please detect the value. And we can see that VDSM detected the right value in the next section. This is available only in the debug log, but we can also get the domain information and look at it. We can see that VDSM tried one byte, then larger sizes, and so on, and finally detected the value. So we have a Gluster domain. We can use it, we can create VMs, and we can start them.
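The engine-side decisions described here can be sketched in Python for illustration (engine itself is not written in Python). Everything in this example is hypothetical: the function names, the domain type strings, the shape of the capabilities data, and the assumption that a host reporting nothing for a domain type supports only 512.

```python
BLOCK_SIZE_AUTO = 0      # Ask VDSM to detect the block size.
LEGACY_BLOCK_SIZE = 512  # The older way, used when auto is not available.


def block_size_for_new_domain(hosts, domain_type):
    """
    Choose the block size to request when creating a domain.

    hosts maps a host name to its capabilities: a dict mapping a domain
    type to the list of block sizes the host supports.
    """
    if hosts and all(
        BLOCK_SIZE_AUTO in caps.get(domain_type, ())
        for caps in hosts.values()
    ):
        return BLOCK_SIZE_AUTO
    return LEGACY_BLOCK_SIZE


def host_can_use_domain(caps, domain_type, domain_block_size):
    """Check whether a host supports the block size recorded for a domain."""
    supported = caps.get(domain_type, (LEGACY_BLOCK_SIZE,))
    return domain_block_size in supported


# Example capabilities; an older host that supports only 512 on Gluster.
new_host = {"GLUSTERFS": [0, 512, 4096], "NFS": [512]}
old_host = {"GLUSTERFS": [512], "NFS": [512]}

print(block_size_for_new_domain({"h1": new_host, "h2": new_host}, "GLUSTERFS"))
print(block_size_for_new_domain({"h1": new_host, "h2": old_host}, "GLUSTERFS"))
print(host_can_use_domain(old_host, "GLUSTERFS", 4096))
```

Recording the detected block size with the domain is what lets the second check work later, when a new host is added to an existing setup.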
We can also show how to check what engine knows about the host. We see that the host supports these values, and then we can check the storage domain to see whether engine recorded the value properly.

So this is basically all. If you want to learn more, you can check these links. There is an interesting RFE with many, many patches attached to it, because it was a huge piece of work during the 4.3 z-stream development. There is an example of a VDSM test using all the infrastructure that I talked about earlier, and a link to the userstorage project, which is the helper for testing with many kinds of storage and many block sizes. Please check it. And of course, ovirt.org.

So, any questions?

The question is whether we can mix different kinds of disks in the same storage domain. No, we don't support it; it will fail. Anything else?