So I am Damien Le Moal. I work at Western Digital Research, in the systems and software group, and I'm going to talk about zoned storage support in Fedora: the current status and where it is going. As an introduction, I'll give some background on zoned storage devices and the kernel support history for these devices. Then I will dive right into the current status of the support in Fedora: the kernel, system utilities, applications, and also the things that are currently missing and the things we are thinking of adding and working on that can benefit Fedora in the future.

Okay, so zoned storage. Johannes, yesterday, when he gave his presentation on btrfs, gave a similar background, so let me repeat it for those who did not attend. Zoned storage devices come in the form of SMR (shingled magnetic recording) hard disks and NVMe Zoned Namespace (ZNS) SSDs. For hard disks, the interface is standardized by ZBC for SCSI devices and ZAC for ATA drives. The goal with shingled magnetic recording is really to increase capacity without increasing the device cost: you take a regular drive, you write the tracks very tightly together so that they overlap (shingling), and you get higher capacity. That's the principle of SMR. For NVMe Zoned Namespace, what you get out of a zoned interface for an SSD is better command latency behavior, because you essentially remove the need for any device-internal garbage collection. You can also reduce the device cost by reducing, for example, the amount of DRAM needed on the controller, because you do not need an FTL anymore, or by reducing the amount of flash over-provisioning on the device.

For both NVMe ZNS drives and SMR hard disks the principle is the same: the LBA range exposed by the drive, or by a namespace in the case of NVMe, is divided into zones. Zones can have different types. For SMR we also have conventional zones, which are zones that accept random writes, so the LBAs in these zones behave like a regular disk. But most of the zones are sequential write required zones, and these zones have to be written sequentially, starting at a position called the write pointer, as indicated by the disk. And you cannot rewrite LBAs that you already wrote: you must reset an entire zone, erasing all the data in the zone, to be able to rewrite it. That sequential write constraint is exposed to the user: any write command sent to the drive that is not sequential within its zone will be failed by the disk. (A small sketch of what these zones look like from user space follows at the end of this background.)

Support for these drives in Linux started way back in 2014 with kernel 3.18, which only added the device type definitions; that gave some access to the drives through SCSI generic passthrough. The real work started around version 4 of the kernel, with the final patch set of the initial series landing in 4.10. That gave real support for exposing SMR drives through a regular block device file, and also support for the F2FS file system. With kernel 4.13, zoned device mapper support was added: dm-linear and the new dm-zoned device mapper target. blk-mq and scsi-mq support, to avoid write command reordering, landed in 4.16. Many improvements followed, and then we reached 5.9 and the more recent kernels from the beginning of this year, which added support for NVMe ZNS drives, and with 5.12 we landed the initial support for SMR hard disks in btrfs. There are other milestones: along this timeline there are lots of different patches in the kernel improving performance and improving the code. We also got the zonefs file system with kernel 5.6, zone append write command emulation in SCSI landing in 5.8, and dm-crypt support in 5.9. And we're still maintaining and improving all this code across kernel releases.
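To make that zone model concrete, here is a minimal sketch of how a program can inspect zones from user space with the kernel's zoned block device ioctl interface from `<linux/blkzoned.h>`. The device path and the number of zones reported are only examples, not anything from the slides.

```c
/* Minimal sketch: report the first few zones of a zoned block device
 * using the BLKREPORTZONE ioctl. The device path is an example; run it
 * against an SMR disk or a ZNS namespace block device.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

#define NR_ZONES 8  /* report only the first few zones for brevity */

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/sdb"; /* example device */
    int fd = open(dev, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* BLKREPORTZONE takes a blk_zone_report header followed by an
     * array of blk_zone entries that the kernel fills in. */
    struct blk_zone_report *rep =
        calloc(1, sizeof(*rep) + NR_ZONES * sizeof(struct blk_zone));
    rep->sector = 0;          /* start reporting from sector 0 */
    rep->nr_zones = NR_ZONES; /* maximum number of zones to report */

    if (ioctl(fd, BLKREPORTZONE, rep) < 0) {
        perror("BLKREPORTZONE");
        return 1;
    }

    for (unsigned int i = 0; i < rep->nr_zones; i++) {
        struct blk_zone *z = &rep->zones[i];
        /* start, len and wp are in 512-byte sectors. Writes to a
         * sequential zone must land exactly at wp; anything else is
         * failed by the drive. */
        printf("zone %u: start %llu, len %llu, wp %llu, %s\n",
               i, (unsigned long long)z->start,
               (unsigned long long)z->len,
               (unsigned long long)z->wp,
               z->type == BLK_ZONE_TYPE_CONVENTIONAL ?
                   "conventional (random writes)" :
                   "sequential write required");
    }

    free(rep);
    close(fd);
    return 0;
}
```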
So, in Fedora, where are we? Well, first of course, let's start with the kernel. As of this week we have 5.13.6, which puts us basically right at the end of this timeline, with everything already supported, including btrfs. The Fedora kernel is compiled by default with CONFIG_BLK_DEV_ZONED enabled, which is the key configuration option for enabling zoned block device support. This enables the entire block layer and I/O scheduler support, and also the SCSI and ATA parts. It also drives device mapper support for these devices; there are no separate options for that, so you get dm-linear and dm-crypt support. dm-zoned is also enabled by default as a target in the Fedora kernel. And F2FS and btrfs native zoned support is enabled through this configuration as well. Zonefs was also recently enabled by default, so you get that in Fedora 34 and going forward in 35. btrfs, of course, is enabled, and as long as zoned block device support is enabled you also get support for zoned btrfs, but be aware that this is still in the stabilization phase: it is functional, but there are still some corner cases that trigger problems with the file system, so we are still working on that.

So what does all of this mean for the user? Basically this: the user has many different options for accessing zoned block devices, either SMR hard disks or NVMe ZNS SSDs. On the rightmost paths here we have direct device access: passthrough through SCSI generic, or ioctls direct to the NVMe driver. But we also get access directly through the block device file of the device. I call that "zoned block access" here because that I/O path does not hide the sequential write constraint from the user, so the user still has to write zones sequentially. And there is the zonefs file system, which is not a POSIX file system; it simply exposes the zones of the device as files, but the files have to be written sequentially, so the sequential zones are exposed as append-only files (see the small zonefs sketch below). All of these paths allow users to implement applications as long as they are zoned-block-device compliant, meaning they have to write sequentially, and we do have some libraries, more on that later, to facilitate implementing such applications.

More interesting are the POSIX interfaces that come on top of these devices, because these interfaces allow any application to work. The file systems, F2FS and btrfs, give random write access to any file; the device constraint is completely hidden by the file system. You can also use legacy file systems, so ext4, XFS, or any file system that does not have zoned block device support, by using the dm-zoned device mapper target, which can also be used directly as a regular block device. So all of these I/O paths enable any application, but they may not be the most optimized for these devices: depending on the use case, writing an application that really takes advantage of the sequential write constraint can be beneficial to performance. So this is how applications can use these devices today.
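As an illustration of the zonefs path, here is a minimal sketch of appending one block to a sequential zone file. The mount point and file name are only examples of zonefs's default layout (conventional zones under cnv/, sequential zones under seq/, one file per zone named by zone number), and the sketch assumes, as the zonefs documentation describes, that sequential zone files must be written with direct I/O at the current end of the file, hence O_DIRECT and the aligned buffer.

```c
/* Minimal sketch: append one block to a zonefs sequential zone file.
 * Assumes a zonefs file system mounted at /mnt/zonefs (example path).
 * Sequential zone files must be written with direct I/O, starting
 * exactly at the current file size (the zone write pointer).
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

#define BLK_SIZE 4096  /* example write granularity, LBA aligned */

int main(void)
{
    const char *path = "/mnt/zonefs/seq/0";  /* first sequential zone file */
    int fd = open(path, O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* The current file size is the zone write pointer position:
     * the next write must start exactly there. */
    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* O_DIRECT requires an aligned buffer. */
    void *buf;
    if (posix_memalign(&buf, BLK_SIZE, BLK_SIZE)) return 1;
    memset(buf, 'A', BLK_SIZE);

    if (pwrite(fd, buf, BLK_SIZE, st.st_size) != BLK_SIZE) {
        perror("pwrite");  /* a non-sequential write would be refused */
        return 1;
    }

    free(buf);
    close(fd);
    return 0;
}
```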
So let's look at the applications, starting with the system utilities we have today in Fedora. util-linux: version 2.36.2 is shipping right now, which gives you zoned block device support in utilities like lsblk or the blkzone utility. blkzone actually also has ZNS support, but that is part of 2.37, so not yet in Fedora. libblkid also has zonefs support. The btrfs support for libblkid is upstream, queued, but not yet part of a tagged version, so we are still waiting for that so that a future update in Fedora brings the btrfs support. btrfs-progs got zoned device support for things like mkfs in version 5.12 of the package, and 5.13.1 is shipping, so all good there too. And recently, with a lot of help from Neal, dm-zoned-tools and zonefs-tools landed as new packages in Fedora. These two are for formatting and checking zonefs file systems and dm-zoned device mapper targets, and since those kernel modules are also enabled in the kernel, you get the full set here: userland tools for these kernel components.

For applications, that's where the support is a bit lacking right now. fio does support zoned block devices with the --zonemode=zbd option, so you can at least benchmark and check that the devices are working correctly. This was added in version 3.9 of fio, and 3.26 is shipping, so no problem there. Beyond that, I personally do not know of any other application shipping in Fedora today as an RPM package that has zoned block device support. I'll talk about what is out there on GitHub and elsewhere that does support it, but not yet as packages. However, I want to point out that, thanks to what I have shown with the kernel and the good level of support in the system utilities, Fedora really provides a great environment for developing and testing applications for zoned storage. And that is what we use in our lab: we have racks of servers running the Fedora Server distro, and Fedora is also what we use for everyday development.

Still missing, as I said: libblkid needs the next util-linux version, 2.38, which is not tagged yet; that is for zoned btrfs. nvme-cli also has ZNS support upstream, but 1.11 is currently shipping while that support was added in 1.13, so an update is needed to get ZNS device support with nvme-cli. For dm-crypt, again, that support is enabled by default in the Fedora kernel, but one thing to be aware of is that the cryptsetup utility does not write the superblock information sequentially for the LUKS format, meaning that LUKS formatting is not yet supported on zoned drives. That is something we are working on. There is also no LVM integration for zoned device mapper targets such as dm-zoned: if you have a drive that is formatted with dm-zoned and a file system on top of it, LVM will not detect that and prepare the drive on boot for you. That is something you will have to do manually.
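To make the raw device path more concrete, here is a minimal sketch of what a zone reset, like the one the blkzone utility mentioned above can issue, boils down to: the BLKRESETZONE ioctl. The device path and the zone size used here are only examples; the real zone size should be read from the device.

```c
/* Minimal sketch: reset one zone of a zoned block device with the
 * BLKRESETZONE ioctl (roughly what `blkzone reset` does). Resetting a
 * zone rewinds its write pointer and discards the data it contained.
 * The device path and zone size are examples only; read the real zone
 * size from /sys/block/<dev>/queue/chunk_sectors or a zone report, and
 * note that conventional zones (often the first zones of an SMR disk)
 * cannot be reset.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/sdb"; /* example device */
    int fd = open(dev, O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Range covering exactly one zone, in 512-byte sectors.
     * 524288 sectors = 256 MiB, a common SMR zone size. */
    struct blk_zone_range range = {
        .sector = 0,          /* start of the zone to reset */
        .nr_sectors = 524288, /* length of exactly one zone */
    };

    if (ioctl(fd, BLKRESETZONE, &range) < 0) {
        perror("BLKRESETZONE");
        return 1;
    }

    printf("zone at sector %llu reset\n", (unsigned long long)range.sector);
    close(fd);
    return 0;
}
```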
Okay, so going forward, what are we going to do? Right now, on the kernel side, our main activity is to stabilize btrfs. Going back to that picture of all the I/O paths the user has access to: if we have a very solid I/O path that gives the user a POSIX interface, meaning random reads and writes to any file on the device just work, then any application can work. So having a solid btrfs that supports ZNS and SMR is, we think, very important, and that is our main activity. Again, it is functional and you can try it out today, but there are some corner cases that create problems. One area where we have identified problems, understand them, and are just trying to find the best way to fix them, is rebalancing. That is something btrfs does on regular drives too, but it is more important on zoned btrfs because rebalancing reclaims zones that contain a lot of garbage blocks that cannot be used anymore. So rebalancing is also used to reclaim space on the zoned drive, which makes it a very important part of the zoned support. There are also other small bugs and issues; we have an issue tracker on GitHub to track everything, so if you are interested in contributing, these are areas where you can start right away and send patches and contributions.

For btrfs going forward, we have also planned some improvements, the main one being declustered parity. The reason we want to add that is that we regularly get a lot of requests from users asking: how do I use my SMR drives in a RAID environment, basically with protection? And the answer today is that you cannot; there is nothing out there that does that. Adding RAID support to zoned btrfs would give users that level of protection. However, a regular RAID scheme with fixed stripes, where the position of a block in a block group depends simply on an offset, is not possible, because you cannot overwrite in place on a zoned drive. So we have to decluster everything and add a stripe tree to support this on zoned drives. We have other projects going on too, exploratory projects; since we are a research entity, we get to play with different things. Related, again, to RAID and declustered parity, we are also thinking of implementing a device mapper target that does the same thing for file systems other than btrfs. So there is still a lot of work going on in the kernel today.

For libraries and applications, we have many things coming very soon. Again, fully functional btrfs, but that will depend on the kernel and on packages being updated. As mentioned before, we do have a couple of libraries that can help users implement applications that access the drives directly (a small sketch of what such a zone-compliant write looks like follows below). One is libzbc, the other is libzbd. libzbc is a passthrough library for SMR drives; libzbd is a regular library that basically wraps the kernel ioctl API into a set of easier-to-use functions, and it works for both SMR and NVMe ZNS. Both are actually used out in the field, either by upstream projects or deployed in commercial services on the net. I am working on packaging those, so if you see a new package request coming, probably from me, it will be these two.

For RocksDB, we also have a new plugin called ZenFS, which supports ZNS drives. With this plugin you can basically run RocksDB directly on top of a ZNS drive without any file system in between. That essentially achieves a write amplification of one, meaning the device-side write amplification is completely removed. You still, of course, have the write amplification from the LSM tree implementation of RocksDB itself, but no write amplification at all on the device side, considering only the I/O going to the drive. This depends on RocksDB 6.19.3, which added external plugin support; however, Fedora today is shipping 6.15, so again an update would be needed here so that we can submit ZenFS as a package.
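Coming back to the point above about applications needing to be zoned-block-device compliant, here is a minimal sketch, using only the raw kernel interface, of what such a write looks like: find the write pointer of a sequential zone and write exactly there. This is the kind of boilerplate that libraries like libzbd wrap into friendlier calls; the device path and block size are only examples.

```c
/* Minimal sketch of a zone-compliant write on a raw zoned block device:
 * look up the write pointer of one sequential zone with BLKREPORTZONE,
 * then write a single aligned block exactly at that position. A write
 * anywhere else in a sequential zone would be failed by the drive.
 * Device path and block size are examples only.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

#define BLK_SIZE 4096

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/nvme0n2"; /* example ZNS namespace */
    int fd = open(dev, O_RDWR | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* Report a batch of zones and pick the first sequential one. */
    unsigned int nr = 16;
    struct blk_zone_report *rep =
        calloc(1, sizeof(*rep) + nr * sizeof(struct blk_zone));
    rep->sector = 0;
    rep->nr_zones = nr;
    if (ioctl(fd, BLKREPORTZONE, rep) < 0) { perror("BLKREPORTZONE"); return 1; }

    struct blk_zone *z = NULL;
    for (unsigned int i = 0; i < rep->nr_zones; i++) {
        if (rep->zones[i].type != BLK_ZONE_TYPE_CONVENTIONAL) {
            z = &rep->zones[i];
            break;
        }
    }
    if (!z) { fprintf(stderr, "no sequential zone found\n"); return 1; }

    /* The write pointer is in 512-byte sectors; the write must start there. */
    off_t wp_bytes = (off_t)z->wp << 9;

    void *buf;
    if (posix_memalign(&buf, BLK_SIZE, BLK_SIZE)) return 1;
    memset(buf, 'Z', BLK_SIZE);

    if (pwrite(fd, buf, BLK_SIZE, wp_bytes) != BLK_SIZE) {
        perror("pwrite");
        return 1;
    }
    printf("wrote %d bytes at zone write pointer (sector %llu)\n",
           BLK_SIZE, (unsigned long long)z->wp);

    free(buf);
    free(rep);
    close(fd);
    return 0;
}
```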
There is also work going on with Ceph: the Ceph team is helping Abutalib add support for SMR drives to the BlueStore engine. It is native support: the BlueStore engine directly accesses the SMR drive through the block device file, with no file system in the way. Again, less code between the data and the drive means better performance and reduced overhead. I am not aware of the timeline yet; I know there is a lot of discussion, I am getting emails almost every week, so it is ongoing, but I am not aware of the actual release schedule for this.

And that is all I have. As a conclusion, I really want to state again that Fedora is an awesome environment for zoned storage. Basically, the drives simply work out of the box: just plug them in, boot, and you will be able to use them. That is unlike many other distros out there, and that is why we actually recommend Fedora even to our customers for testing and developing anything related to zoned storage. As mentioned, there are still many things missing from the zoned storage ecosystem, so help is welcome, and we would be happy to mentor beginners and students. We also take interns, so if you know people who are interested in this area of work, that is something we can do too. Everything related to Linux and zoned storage, everything we know of or work on, is documented on the website zonedstorage.io. The website content itself is open source on GitHub, so you can contribute: if you know of other aspects of zoned storage, or applications that work on zoned storage that you want to advertise, you can send a pull request for this website. It is written with MkDocs, so everything is in Markdown format and it is very easy to add new content. There is also a Linux distributions page where I list different distros and their level of support for zoned storage.

And that is all I have for today, so I am open for questions. Well, okay, as you can see, I have reached the conclusion slide. That was really the worst presentation I have ever given, I am sorry about that: I went to presentation mode and could not see the chat. Anyway, I hope I spoke clearly enough and you got the idea of the message I wanted to convey. If you have any question or need any clarification, since you did not see the slides, I can give more detail showing the slides now.

Okay, first question: with zoned block devices, how does that affect the performance of things like volume snapshots and other similar btrfs features? That would be more a question for Johannes and Naohiro. I do not think it really affects anything, because these are features implemented at the btrfs metadata level, and I do not think there is any change in that area specific to zoned btrfs. But again, Naohiro and Johannes would be better suited to answer that one; I am more busy with the block layer, SCSI, ATA and NVMe than with btrfs.

A more general ZNS question: is there any effort planned to improve VM disk I/O performance on ZNS? Well, the performance of ZNS is essentially the same as a regular SSD: the interface, I mean the way commands are exchanged with the device, does not change. It is still the same queue pairs and command descriptors, so all of that does not change.
What does change in terms of performance is the need to write sequentially, which, at least for regular writes, very often requires synchronization, and that can slow down the write path on ZNS. But as far as the device interface goes, and in the context of a VM itself, ZNS and regular NVMe SSDs do not differ at all. There was a lot of work, I think two or three years back, to try to avoid VM entries and exits every time the doorbells of the queue pairs were hit when sending commands, but I am not sure where that went.

For SMR drives in VMs, yes, that has been on my to-do list for a while now. We basically need a zoned interface for virtio: for virtio block drives, the command interface, the virtio specification, has nothing for zones, so you do not get the equivalent of zone reset, report zones, et cetera. That is something you cannot pass through virtio. So you have to go the hard way and either directly attach the SMR drive or use the vhost SCSI interface, which is slower. But it is an HDD, so while the software path is slower, that does not make accessing the device noticeably slower at all. It is bad for the VM or the host because you get more CPU overhead, but there is barely any overhead on the device side.

Okay, I think it is time. So again, everybody, I am really sorry about the slides. I think that is because I did not share my entire screen, I shared only a window, and going to presentation mode did not work. Very sorry about that. If you need the slides as a PDF to put together with the recording, I will be happy to provide that. Thank you everybody for attending this session.