So I'm Nir Soffer, a Red Hat engineer working on oVirt storage. Hi, I'm Vojta, and I'm a developer, also at Red Hat, also working on oVirt storage. Today we'll talk about supporting 4K drives in oVirt. First we'll understand why we need to support this, and then we'll dive into the challenges of trying to support 4K in a legacy project that was started more than 10 years ago. We'll focus on the interesting problem of detecting the block size of the storage, how VDSM uses it to create a storage domain, and how Engine uses what VDSM reports to manage hosts that have different capabilities. Finally we'll look at some troubleshooting tips, and hopefully we'll see a live demo using 4K storage.

So first, what do I mean by 4K storage? Drives have a minimal block size that you can write to them. You cannot write 3 bytes to a drive; you can write only one block. This used to be 512 bytes, but in modern drives it can be 4K, and you can also have software-defined storage that wants to use 4K.

The main reason we need 4K is RHHI, Red Hat Hyperconverged Infrastructure, which is a hyperconverged solution with oVirt hosted engine, with Gluster, and with VDO. It is kind of complex, but to create a simple solution you need complex software behind it. So why do we need this? It means that you can take a few cheap servers with some cheap disks and create a data center, using the servers both as compute nodes and as storage. You don't need to pay a lot of money to a storage vendor; everything is built in, and people really like this idea.

Now, what is VDO, and how is it related to 4K? VDO is a new deduplication and compression solution for Linux. It can give you 10 times the storage space with the same hardware. So it's really nice that we can use this, and the Gluster solution wants to use VDO. VDO is designed to use 4K blocks; everything in it is designed for 4K. It can emulate storage with a 512-byte sector size, but that makes it slower, so the Gluster guys want to use VDO the way it was designed to be used. And of course some users bought newer drives. New drives can be cheaper and faster if they are 4K drives, and we want to support these users. So if you have 4K drives, you can currently use them with file storage. We don't support all types of storage for 4K; for example, we don't support block storage yet. But we made all the infrastructure work, so we can support it soon.

So what are the challenges of adding 4K support in the system? When it was created, more than 10 years ago, nobody thought that this was an important detail. One small issue: the storage format assumes that the block size is 512. We actually assume that we can write 512 bytes to a certain location on storage from different hosts at the same time, and this cannot work with 4K storage, because there the minimal write is 4K. To mitigate this, we introduced a new storage format, V5. The main advantage of this format, or actually the main reason it exists, is supporting 4K. The difference is that we changed the storage layout so that it can work with any kind of drive, using the same layout for 4K and 512. So supporting it should be much easier: you don't care about the type of storage, because the layout is always the same.

So the next issue: we have a storage format supporting 4K, which was, by the way, available in 4.3, so everyone running 4.3 is using V5. But sanlock cannot detect the block size of file storage. With block storage, of course, sanlock can detect the block size very easily, because the kernel exposes it. But for file storage there is no API to detect the block size.
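To show what the easy case looks like: for a local block device, the kernel publishes the logical and physical block size in sysfs, where any program can read it. Here is a minimal sketch in Python (our own illustration, not code from sanlock or VDSM):

```python
def read_block_sizes(device="sda"):
    """Return (logical, physical) block size in bytes that the kernel
    reports for a local block device such as /dev/sda."""
    queue = "/sys/block/%s/queue" % device
    with open(queue + "/logical_block_size") as f:
        logical = int(f.read())
    with open(queue + "/physical_block_size") as f:
        physical = int(f.read())
    return logical, physical
```

Nothing like this exists for a plain file on NFS or Gluster, and that is exactly the problem described next.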
So sanlock could not detect it, and it was falling back to the old block size, which cannot work with 4K. And even if sanlock could detect the block size, it would not be enough, because VDSM and sanlock must be synchronized. When you create a storage domain, VDSM creates the storage domain before sanlock is used. So really, VDSM should detect the block size and tell sanlock which block size to use.

The solution was to add a new API to sanlock. Sanlock can now be instructed to use a certain block size, or a certain alignment, which is also related to the block size. For example, when we initialize a lockspace, we can tell sanlock that we want to use one-megabyte alignment and 4K sectors. So now we can support any combination, and VDSM and sanlock are synchronized. This was also available in 4.3, so anyone running 4.3 has a new sanlock with these capabilities.

So the next issue: sanlock is ready, the storage format is ready, but we had tons of code with a hardcoded block size; the block size was hardcoded to 512. There are two issues here. The first issue is that VDSM should not care about the block size most of the time. Most of the code does not care about it; it cares only about the size of volumes and the size of metadata files. The only place that should care about the block size is the code writing to and reading from storage, and that code should use the real block size and cannot use a hardcoded value. So the solution was very simple: moving to bytes. Or, in more detail, sending hundreds of patches changing all the internal APIs in VDSM to use bytes. Now VDSM is sector-free: there is no code using the sector size except the code writing to or reading from storage, which is only a few places.

Here we can see an example of an internal VDSM API. It was called setSize in the past; it needed the size in sectors, hardcoded to 512, and wrote it to the storage metadata. Now it is called setCapacity; it accepts the capacity in bytes and writes a new key to the metadata. So now, everywhere in VDSM, you will not find "size", you will find "capacity", which matches other places in VDSM that use this value; this name is also used by libvirt. So the system is more coherent.

The next issue: how do you detect the block size when there is no API? Well, we found that we can do this. We read the QEMU code, and we found that QEMU knows how to detect the block size by accessing the storage. We will talk more about this later; it's a bit complicated, but we solved this issue, and we can detect the block size.

But of course, we didn't have good tests in VDSM, because testing was not popular when this project was started. And a big project in Python with poor tests is not something that you want to change, because when you change something here, something there breaks. This is life. So we improved the coverage. Currently, the storage code has the best coverage in VDSM; it used to be almost zero, and now it's the most tested part of VDSM. We are also testing real storage domains and real volumes in VDSM, so we don't use fakes and mocks. We create a real storage domain on local FS and on block storage. We have infrastructure like the tmp repo fixture, test infrastructure that you can easily use in a test to create a local FS domain, and we have similar infrastructure to create a block domain using real LVs on loop devices, so you don't need a server.
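To give a feel for the loop-device trick, here is a simplified sketch of the idea (our own illustration, not VDSM's actual test fixtures). Note that a recent losetup can even expose the device with a 4K sector size, which is exactly what you want for 4K tests:

```python
import subprocess
import tempfile

def create_loop_device(size_gb=10, sector_size=512):
    """Create a sparse backing file and attach it to a free loop device,
    returning the device path, e.g. "/dev/loop0". Needs root. A real
    fixture would also detach the device and delete the file on teardown."""
    backing = tempfile.NamedTemporaryFile(suffix=".img", delete=False)
    backing.truncate(size_gb * 1024**3)  # sparse, allocates no real space
    backing.close()
    result = subprocess.run(
        ["losetup", "--find", "--show",
         # util-linux 2.30+ can emulate a 4K drive: sector_size=4096
         "--sector-size", str(sector_size),
         backing.name],
        check=True, capture_output=True, text=True)
    return result.stdout.strip()
```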
And in a few seconds you can run many tests using real domains: creating volumes, creating domains, making changes, and making sure that everything works. And there is another piece of infrastructure: we have a user domain fixture that creates everything for you. You just use this object, it will create a storage domain for you, and then you can start creating volumes and making changes to these volumes. Before we made these changes, we took a picture of VDSM. So this is VDSM before, and this is VDSM now. Much better.

So VDSM was ready. Are we done? No, because it turns out that QEMU was not really ready for 4K. It works for most cases, but not for the case we care about, namely Gluster with XFS. So the solution was to fix it. We went to the QEMU source, found the issue, and sent some patches. Now QEMU can detect the block size in a better way, and in the cases where it cannot detect the block size, it falls back to a safe value, which works.

And the last issue: after everything was ready, we found that we cannot boot a VM with a 4K block size. First, we found that if you boot a VM on a 4K domain, the VM thinks that this is regular storage with a 512-byte sector size. We learned that we can configure libvirt to use the correct block size, and then we found that the VM will not boot. It turns out that the BIOS does not support booting from 4K. So the solution was not to use it. What we have now is 4K storage where QEMU knows the correct block size, but the VM thinks that this is an old drive with 512-byte sectors. This is basically the solution that we can provide now. The guest thinks that this is a 512-byte logical block size; QEMU knows the real size and can do efficient I/O to the storage; and if the guest tries to write something smaller than a complete block, QEMU will do emulation, which is not very fast. Actually, we don't have any data yet on whether it's better than what we had before or not. But this is what we can do now.

So how do we detect the block size? This is the interesting part. First, let's see how QEMU does it, after we fixed it; this was actually the fix. We start by reading one byte with direct I/O. This will obviously fail, because direct I/O cannot read one byte. Now, if it does not fail, it means that we cannot detect the block size, which is the case with, for example, an empty file on XFS. XFS is probably doing some tricks to avoid I/O if you try to read from an empty file; why go to storage? In this case, QEMU falls back to 4K, which is safe. And if the read failed, we know that we can try the next value. The next value is 512: if it succeeds, we know the correct block size, and if not, we try the next value. Either we find the value or we fail, and in that case QEMU fails the operation, because you cannot use direct I/O on this storage.

The issue with this is that QEMU cannot detect the block size on an empty image. So qemu-img create was changed to always allocate the first sector, and this issue should be solved; this is also available in the QEMU version that we require. The next issue is, of course, NFS: we cannot detect the block size on NFS. In this case, QEMU falls back to 4K, which works, and even seems to work faster, although it can cause some unwanted alignment changes. But it seems to work better.

So how do we do it in VDSM? We do it in a similar way, but a better way that QEMU cannot use, because QEMU is checking the image file itself. In VDSM, we create a temporary file, so we can use a better method: writing instead of reading. We use the same logic: we try to write one byte, then 512 bytes, then 4K. But in this case we always detect the block size, so there was no issue with Gluster or XFS.
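In Python, the probing logic just described looks roughly like this (a simplified sketch, not the actual VDSM code; the probe file name and the None return for undetectable storage are our own choices):

```python
import mmap
import os

def detect_block_size(dirpath):
    """Detect the storage block size by writing to a temporary file with
    O_DIRECT. A direct write smaller than one block fails with EINVAL,
    so the first size that succeeds is the logical block size."""
    path = os.path.join(dirpath, ".block-size-probe")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o600)
    try:
        # O_DIRECT requires an aligned buffer; mmap memory is page aligned.
        buf = mmap.mmap(-1, 4096)
        for size in (1, 512, 4096):
            try:
                os.write(fd, memoryview(buf)[:size])
            except OSError:
                continue  # too small for this storage, try the next size
            if size == 1:
                return None  # storage ignores O_DIRECT semantics (e.g. NFS)
            return size
        raise RuntimeError("Cannot detect block size in %s" % dirpath)
    finally:
        os.close(fd)
        os.unlink(path)
```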
And of course, with NFS we cannot detect the block size, because NFS does not have any alignment requirements; NFS does not really do direct I/O even when you have a file open for direct I/O. There were also some issues with Gluster. I did not add them to the presentation, but we had a lot of issues with Gluster; Gluster is very creative storage. You have to ask for direct I/O, and then there is another setting that says "I really want direct I/O". If you use both, you will get direct I/O and everything works. If not, Gluster can avoid doing direct I/O in some cases, detection may fail, and some operations may fail later. But if Gluster is configured correctly, and this is the default configuration that you should get with oVirt, then everything should work. So Vojta will continue and explain how to use all of this.

OK, thanks, Nir. I would like to show you how we use all this stuff in VDSM. Before trying to use 4K storage in VDSM, you first have to enable it. Luckily, in recent versions this is enabled by default, but if you run older versions, you have to enable it manually in vdsm.conf.d/gluster.conf by setting enable_4k_storage to true. As I said, in recent versions it's enabled by default, but it's useful to know this option: if you have any issues, or for any reason don't want 4K support, you can easily disable it by turning this option off.

The first basic piece of functionality which VDSM provides is host capabilities. It reports, for each storage domain type, which block sizes a given host is able to support. As I will show, this information is later used by Engine. If you check the logs, you can see the values, but the code is more readable, so I show it here. Of course we support 512 and 4K, but there is also "block size auto"; in the logs you will see it as block size equal to 0. It means what the name suggests: if you request block size 0, you are requesting VDSM to detect the block size automatically. Why we need it, I will also show later. For now, I will just mention that we do the block size detection in any case, because we are good guys and we expose the block size in our API, so we validate the user input. If you request any specific block size, VDSM will detect the actual block size and validate that the requested block size is valid, and if not, it will throw an exception. I will come back to this exception later, because in some setups it can often be an issue.

But there's more. If you read the VDSM code before sleeping, as many of you probably do, you may have noticed that there is one more constant, block size none, which is equal to 1. This is just an internal VDSM constant, used for cases when we cannot detect the block size. As Nir said, in the case of NFS, for example, we are not able to detect the block size of the underlying storage. What we do in such cases: if there is a specific requested block size, we just use it, without any validation; and if the request is to detect the block size automatically, we fall back to 512 bytes, just to keep backward compatibility.

In previous versions the alignment and block size were fixed, hardcoded constants in the code, so there was no question about what the alignment is. But now that we can change the block size, we need to compute the alignment. The alignment is determined, besides the block size, by the maximum number of hosts, which is a configurable parameter. For storage with 512 bytes we still use the old constant of 2,000 hosts; but to have an alignment of 1 megabyte also for 4K storage, we now use a default of 250 hosts there. So if you are going to use 4K storage and you are going to use more than 250 hosts, you need to configure the maximum number of hosts in your setup.
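The arithmetic behind these defaults is simple: sanlock keeps one lease slot per host, one block per slot, and the lockspace is rounded up to whole megabytes. A sketch of the computation (our own reconstruction from the numbers in the talk, not the exact VDSM code):

```python
MB = 1024 ** 2

def calc_alignment(block_size, max_hosts):
    # One lease slot per host, one block per slot,
    # rounded up to a whole number of megabytes.
    return (block_size * max_hosts + MB - 1) // MB * MB

# 512 B blocks with the old default of 2,000 hosts: 1 MB alignment.
assert calc_alignment(512, 2000) == MB
# 4K blocks with the new default of 250 hosts: still 1 MB alignment.
assert calc_alignment(4096, 250) == MB
# 4K with more hosts grows the alignment, e.g. 2,000 hosts need 8 MB.
assert calc_alignment(4096, 2000) == 8 * MB
```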
And once we compute the alignment, we store it in the metadata. As Nir already mentioned, this is why we needed the new metadata format: we need to store the block size and the alignment. Here's an example of the metadata; we basically store the alignment and block size in the storage domain metadata.

So let me briefly recap the VDSM flow when you create a storage domain. This is roughly what you will see in the VDSM logs if you investigate what's happening there. First, we detect the block size of the underlying storage. Then we validate the requested block size. If everything is OK, we compute the alignment. Now we are ready to store everything in the metadata. Then we create the directory structure as usual, and finally we pass the alignment and block size to sanlock and initialize the sanlock lockspace.
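That last step uses sanlock's new API. It looks roughly like this (a sketch based on the sanlock Python bindings; treat the exact signature, the constants, and the UUID and path values as assumptions to verify against your sanlock version):

```python
import sanlock

# The alignment and sector size passed here must match what was
# stored in the storage domain metadata.
sanlock.write_lockspace(
    b"f81cd1e9-...",                  # lockspace name: the domain UUID
    "/rhev/data-center/mnt/.../ids",  # the domain's ids file
    align=sanlock.ALIGN1M,
    sector=sanlock.SECTOR4K)
```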
So let's take a look at how this works from the Engine point of view, since Engine manages all the flows and all the hosts. The first thing Engine does is host activation. During host activation it asks VDSM for the host capabilities and stores in the database which block sizes the given host supports for each storage domain type. This is needed because, when you create a storage domain, Engine checks whether all the hosts support auto detection of the block size. And the reason is that right now we don't allow users to specify the block size of a storage domain: Engine always asks VDSM to detect the block size. So it asks VDSM to create the storage domain with block size equal to 0, which stands for auto detection, and calls VDSM to create the storage domain. Once it's created, Engine asks VDSM again for information about the newly created storage domain, finds out from it the actual block size of the storage domain, and stores it in the DB.

OK, what happens if any of the hosts doesn't support auto detection? In such a case we fall back to the old behavior: we skip the auto detection and directly request creating the storage domain with 512 bytes. So if it happens that you have a couple of hosts with a new VDSM, but there is some old host with an old VDSM which doesn't support auto detection, and you try to create a storage domain on 4K storage, it will fail: Engine will request creating the storage domain with 512 bytes, VDSM will detect that the actual storage is 4K, and you will end up with the following exception. Try to remember this exception, because at least in the past it was the main source of confusion when people tried to use this feature. It basically means that one of your hosts doesn't support auto detection, and in the best case you should upgrade it to a recent VDSM version; if that is not possible, you have to at least enable 4K support manually.

So this leads me to the troubleshooting section, where I'd like to give you some hints about what to do if something goes wrong when you are creating a storage domain. Here are a couple of questions you can ask yourself during the process.

First, I would really suggest checking that all hosts support auto detection. You can check this easily by calling vdsm-client Host getCapabilities, and you should see that the host supports auto detection, and also 4K in the case of Gluster. Then you should check that the storage domain metadata is correct. In the case of Gluster, which is file-based, you can just cat the metadata file and check that the metadata version is 5 and the block size is what you requested. If this is OK, you can check whether Engine really asked VDSM to detect the block size: grep the VDSM log for the createStorageDomain command and check that the block size is equal to zero. If this is true, you can check what VDSM actually detected; for this you need to enable debug level logging, and you should see that VDSM really detected a block size of 4K. If everything is OK so far, it's time to check the Engine part. First, check that Engine recorded the host capabilities correctly: the same information you saw in the VDSM getCapabilities output should be recorded on the Engine side in the database, in the vds table; you should probably select only the host you are interested in. And then you should check what Engine stored for the newly created storage domain. This is again in the database, in the storage_domain_static table, and you should see the expected block size there.
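If you prefer doing these database checks from a script rather than psql, it is just a query against the Engine database. A minimal sketch (the connection parameters and the exact column names are assumptions; check your schema first):

```python
import psycopg2  # the Engine database is PostgreSQL

# Illustrative credentials; use the ones from your engine setup.
conn = psycopg2.connect(dbname="engine", user="engine", host="localhost")
cur = conn.cursor()
# The block_size column name is our assumption based on the talk.
cur.execute(
    "SELECT storage_name, block_size FROM storage_domain_static"
    " WHERE storage_name = %s",
    ("my-4k-domain",))
print(cur.fetchone())  # expect block_size 4096 for a 4K domain
```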
So that is roughly all for troubleshooting, and now the funny part. I'm really sorry: originally we intended to do the presentation from my laptop, as I have the demo prepared here, but I have some issues with graphics. Thanks to the chairman, I was at least able to boot into graphical mode. So we will try to connect my laptop now, and if we succeed, I can show you a short demo. OK, before the talk I did a recording of the demo, so we can show you the recording instead of the live demo. Anyway, we are slowly running out of time, so it will be easier to run the recording. Just a minute, I will copy it to the USB.

So here you can see it goes roughly through the same commands as in the troubleshooting section. Here we call host capabilities, and you can see that, for example, Gluster supports 4K and auto detection, the same for local FS, while NFS doesn't support it. Now we try to create a new domain. From the user's point of view nothing has changed: you just create a new domain as usual, and there is no change in the UI; this just shows that it all works on 4K storage too. Now I grep the VDSM log to check that Engine really asked VDSM to detect the block size. Yeah, it's here: the block size equals zero. So Engine asked VDSM to detect the block size, and now let's see what VDSM actually detected. You need to enable debug logging for this, and here you can see what Nir spoke about: we first try to write one byte, then 512 bytes, and then 4K, and finally find out that the block size of the storage is 4K. In the meantime the new storage domain was successfully created, and we can check what the Engine database stores. Here, as you can see, is output similar to the VDSM capabilities: for Gluster it supports 4K and auto detection. And we also check that the newly created storage domain has the correct block size, and it's 4K. So in this case everything worked fine, and we are happy and have a new storage domain with a 4K block size. So I think that's all. Thanks for your attention. Are there any questions?

So the question is whether we can use 4K without Gluster. Yes. You can use it with local FS storage; it was available, I think, in 4.3.5 or 4.3.6, something like this. On NFS we cannot use it; we always fall back to 512, because there are no alignment requirements there. And on block storage it's not enabled yet. We have all the infrastructure: the storage domain format and all the code are ready. We just need to fix the few parts that access block storage to use the actual block size and, of course, get the block size from the storage. In this case there is no need to detect anything; we just read some values from sysfs and we have everything. Any other question?

The question is whether the environment has to be homogeneous, or whether we can mix 4K with 512. You can mix: you can have different kinds of storage in one environment, 4K and 512, it doesn't matter. But if you have a 4K domain, all the hosts must run a new enough version to use that storage, because in oVirt all hosts are connected to all storage domains. So in one DC all the hosts should have the capability, and then you can create a 4K domain on one storage and a 512 domain on another. And yes, a single domain must be homogeneous: you cannot mix block sizes within a domain. Any more questions? If not, thanks for your attention again.