Good morning, everyone. This presentation is on using zoned block devices in Fedora. I'm Bryan Gurney, a senior software engineer at Red Hat, working on file systems and storage.

For this talk I first want to give a special thanks to Damien Le Moal at Western Digital, who helped me a lot with zoned block devices and also helped me get my test system out of a bricked state after some chaos happened. He helped me with the first parts of getting zoned block devices working. There's a very informative website, zonedstorage.io, that he maintains to help explain what zoned block devices are. So if you have any questions beyond this talk, you can take a look at zonedstorage.io to see what zoned block devices are and where the Linux kernel support is now. Damien is primarily maintaining the zoned part of the block layer.

So first, what is a block storage device? A block storage device is a device that can read and write data in blocks of a given size, typically 512 bytes or some multiple of 512 bytes, preferably a power of two: 4 kilobytes, and so on. Some examples of block devices are the hard disk drive, which records magnetically on a platter with a mechanical arm; the solid state drive, which is some form of flash memory, either based on NAND flash or 3D XPoint or some other solid state medium with no mechanically moving parts; and also the tape drive, which is a reel of magnetic tape run past a head. Obviously you can only access a tape drive sequentially, but hard disk drives and solid state drives can be accessed sequentially or randomly.

So zoned block devices operate within zones of the device, which define where writes must be sequential. Usually a zoned block device will have a good portion, about 99% of the medium, as sequential-write-required zones. Conventional zones work just like regular block devices. Sequential-write-required zones have a write pointer, which indicates the next writable sector in the zone. I have a diagram with three example zones and the write pointers at their current locations; the drive's firmware reports the current condition and write pointer of each zone. This specific example has 256 megabyte zones. The write pointer of the first zone in this example is at the end, so the zone is full. In order to write to it, you would have to reset the write pointer to the beginning and then write into that zone again. The other examples are a partially written open zone and an empty zone. In the text there's the output of a command called blkzone report, which reports that the first two zones here are full, the next one is open (implicit), and the next two are empty. So you could start writing at the write pointer of the implicitly open zone.

So, the types of zoned block devices. The older type are called drive-managed zoned block devices: the firmware controls the zone management but exposes a standard block device interface to the host. The advantage here is that you can use existing file systems with these devices. The disadvantage is that these existing file systems will end up seeing slow performance, because the device abstracts away the management of the zones, and performance degrades as the drive has to shuffle data between the conventional and the sequential-write-only zones. The newer drives are host-managed zoned block devices, where the firmware presents a zoned block device interface: it tells the host what the zones are and where the write pointers are, as an extension to the SCSI interface.
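To make that zone report concrete, here is a minimal sketch of how you might inspect such a drive from a Fedora shell. The device name /dev/sdb is just a placeholder, and the exact output format depends on your util-linux version.

    # Zone model as reported by the kernel: "host-managed",
    # "host-aware", or "none" for a regular block device.
    cat /sys/block/sdb/queue/zoned

    # Zone size in 512-byte sectors (524288 sectors = 256 MiB zones).
    cat /sys/block/sdb/queue/chunk_sectors

    # List each zone's start, length, write pointer, and condition
    # (for example "em" = empty, "oi" = implicitly open, "fu" = full).
    blkzone report /dev/sdb | head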
The advantage here is that the host can have direct control of I/O and zone management. The disadvantage is that almost nothing in Linux right now understands how to do this, and that's more than just the file systems; it's also the I/O schedulers. Schedulers reorder I/O for optimal performance, but if they reorder writes on these drives, that may result in errors. So, as it says here, I/O schedulers, file systems, and volume managers need to know how to handle the write pointers: where they can write, when they can write, and when they need to reset.

The main type of zoned block device that exists now is what's called the shingled magnetic recording (SMR) hard disk drive. SMR hard disk drives record data in an overlapping fashion. The diagram on the left (these diagrams are from zonedstorage.io) is a conventional hard disk drive track layout, where the tracks are individually laid down with a lot of space in between them, which is used for error correction, head location, and so on. SMR hard disk drives have a head with a narrow read element and a wide write element, so they can achieve higher track density and still read all of the magnetic fields in the tracks, even though the tracks partially overlap. However, to write in those zones, they need to rewrite the entire zone.

So the ideal applications for these drives are applications that do not commonly need to write in the middle of existing data. That's very conducive to large object storage and archival-style applications, which have frequent batches of long sequential writes, not that many parallel writes, and infrequent reads that are usually also sequential: back up every single day, then restore hopefully never, or maybe once every two or four months or so. I can hear the backup veterans laughing; you've had to do a restore. Ideally there would be many drives distributed across many nodes, achieving capacity, throughput, and redundancy, so you don't really mind that a hard disk drive has a maximum sequential throughput of, say, 180 megabytes per second; that's perfectly fine in your application, and this would work very well. A large average object size is also key for many of these applications. I previously worked on a backup application that had a large average size of 90 megabytes per file, so a 256 megabyte zone is perfectly normal. And long-term retention would be more important than short-term performance.

So what doesn't work with zoned block devices? The things that are less conducive to this are file systems, drivers, and schedulers that don't know about the sequential write restrictions. Then there are partition tables: some partition tables, particularly GPT, place data at the end of the drive, which ends up in sequential-write-required zones. Classically, SMR hard drives have conventional, randomly writable zones at the beginning for roughly the first 1%, and the remaining 99% is sequential-write-required zones. Then there's RAID: the current MD RAID implementation does not understand the zones, and your average RAID chunk size is usually 64 kilobytes to 1 or 2 megabytes or so, which does not fit a 256 megabyte zone. Right now you can potentially try to create a RAID array with these drives.
Please don't; in the future the driver will prevent it, but right now it won't, and if you were to use a very large chunk size in RAID, it would take over a second to write each chunk, and that latency may be too high for the current implementation. And lastly, booting off of a zoned block device, even if it were possible, wouldn't really be practical, considering that an OS drive usually has a very large number of very small files that it may have to read and write randomly. That's not really a practical use for zoned block devices. Usually you'll have a small OS drive, probably a small flash drive, and then the storage volumes will be non-system drives that are very large and easily replaceable.

So how can you use zoned block devices in Linux right now, with say the 5.3 kernel, the 5.4 kernel, and so on? The turning point for zoned block devices was around the 4.17/4.18 kernels, when the scheduler changes went in, and that set up everything else to be able to use them. You should use mq-deadline for the scheduler; that's the only scheduler right now that understands the write restrictions for the zones. One of my first issues was when Damien asked me, "Are you using the deadline scheduler?", and I said, "No, why does that matter?" And then I found out: because that's the only scheduler that understands the limitations.

To create a file system, you can use F2FS, the Flash-Friendly File System, with the -m option, which enables its support for host-managed zoned block devices. However, the maximum size for that file system is 16 terabytes. Right now the largest drives you can find are 14 terabytes, but 20 terabyte drives have been announced, and those would be too big for one F2FS file system.

You can also create LVM logical volumes to make smaller zoned block devices, so you can achieve something like partitioning. However, you have to use pvcreate with the --dataalignment switch set to the zone size. That way, instead of the LVM header offsetting the first usable extent, the first extent will be at the beginning of a zone, so you won't have unaligned writes. The logical volume will still present a host-managed zoned block device, which you will then have to use with F2FS.

With the dm-zoned device-mapper target, you can actually use conventional file systems, which is very nice. And coming soon is zonefs, which will expose every zone as a file: you can write directly to the file, you can truncate the file to reset the write pointer, and so on.
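As a rough sketch of that first path, assuming a host-managed drive at /dev/sdb (a placeholder) and an existing mount point: switch the scheduler to mq-deadline, then make the file system with zoned support enabled.

    # mq-deadline is currently the only scheduler that respects
    # the sequential-write requirement of the zones.
    echo mq-deadline > /sys/block/sdb/queue/scheduler

    # -m tells mkfs.f2fs to enable its zoned block device support.
    mkfs.f2fs -m /dev/sdb
    mount /dev/sdb /mnt/smr

It's worth checking the scheduler first; the symptom of forgetting it is the flood of unaligned write errors described later.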
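And a sketch of the LVM path, assuming 256 MiB zones; the volume group and logical volume names here are made up, and the important part is that the --dataalignment value matches the zone size.

    # Align the start of the LVM data area to the 256 MiB zone size,
    # so the first extent begins exactly on a zone boundary.
    pvcreate --dataalignment 256m /dev/sdb
    vgcreate smr_vg /dev/sdb
    lvcreate -L 7t -n smr_lv smr_vg

    # The logical volume still behaves as a host-managed zoned device,
    # so it still needs a zone-aware file system on top.
    mkfs.f2fs -m /dev/smr_vg/smr_lv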
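For the dm-zoned path, here is a sketch loosely following the kernel's dm-zoned documentation, assuming the dmzadm tool from dm-zoned-tools is installed; the exact dmsetup table line can vary between versions.

    # Write the dm-zoned metadata onto the zoned drive.
    dmzadm --format /dev/sdb

    # Create the "zoned" device-mapper target; /dev/mapper/dmz-sdb then
    # accepts random writes like a conventional block device.
    echo "0 $(blockdev --getsz /dev/sdb) zoned /dev/sdb" | \
        dmsetup create dmz-sdb

    # A conventional file system can now be used on top.
    mkfs.xfs /dev/mapper/dmz-sdb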
Now, the basic rules of the road for zoned block devices. Reads are performed in the same way as on a normal block device: you can read sequentially or randomly, though you may want to do so sparingly to keep the queue depth low so that other I/O can proceed on the drive. Writes must land at the write pointer, so the driver must keep track of the write pointer in each zone in order to know where to write. If you don't write at the correct location, you'll see nasty errors like "Illegal request, unaligned write command." You may see hundreds or thousands of those if you use a conventional file system with these drives, and at first it will be hard to debug. In my case, I ended up seeing hundreds of thousands of those errors the first time, until I switched the scheduler, because the scheduler was doing the wrong thing.

So the file system and the scheduler need to know how to use the write pointer. In this example I have a partial zone write: the driver runs the equivalent of blkzone report on the drive to find out where the write pointer is for a specific zone, and then it performs a write of data smaller than the zone size. I'm using dd in this example, writing some odd size, and afterward the write pointer reads 0x2468, an obvious number, so we can see what happened. Now say the application wants to reuse that zone: it needs to reset the write pointer, and that's done with an explicit blkzone reset command, which maps to a SCSI command (and an equivalent ATA command) that says "reset the write pointer for this zone." The report-zones command can be run for one specific zone or for the entire drive; for the whole drive it takes a little time, but it's not too bad, and it's a good thing to do at first when creating a new file system. The driver will do this to find out what to do.
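Here is a hedged command-line sketch of that write-and-reset cycle; the device, the zone start of 0x80000 (in 512-byte sectors, i.e. the second 256 MiB zone), and the write size are all made-up illustration values.

    # Check the target zone's write pointer ("wptr" in the output).
    blkzone report --offset 0x80000 --count 1 /dev/sdb

    # Write at the write pointer of the (empty) zone using direct I/O;
    # buffered writes could be reordered and trigger
    # "unaligned write command" errors.
    dd if=/dev/zero of=/dev/sdb bs=4k count=100 \
       seek=$((0x80000 / 8)) oflag=direct

    # Rewind the zone's write pointer so the zone can be written again.
    blkzone reset --offset 0x80000 --count 1 /dev/sdb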
So, in summary, the Linux kernel support for zoned block devices continues to evolve. It's in a state where you can use it, at the very least with F2FS. However, there's not really a solution right now for same-host device redundancy: MD RAID doesn't really know how to use these drives yet, and with these very large drives, where SMR throughput is going to be lower, you may not want to do an entire disk rebuild, because 14 terabytes or so may take a day or two or longer to rebuild onto a single drive. That calls into question whether there could be something better suited than classical RAID for this. Also in the Linux kernel, future support via other file systems has been announced: Btrfs and bcachefs are working on host-managed zone support. And there are other ideas for alternate APIs, like zonefs, where users won't have to code their own zone handling into a file system; they can just do direct writes to individual files, and zonefs will provide the interface for that.

So, any questions? Yes.

[Audience question about whether there is still a major price difference between this kind of archive storage and classical HDD-type storage.]

So the question is: what is the price difference between zoned and classical hard drives? It's not that much different. There is a little bit of a capacity increase, and when you have multiple drives that scales; it amplifies with hundreds or thousands of drives in a cluster. I think a 14 terabyte drive was about $500; I'm overestimating intentionally because I don't remember the exact price. And the price per terabyte is about seven, eight, nine times less than what it would be for flash. There are also ideas in the industry for zoned flash drives, which would present a more direct interface than the flash translation layer in a drive, and that may be interesting for high-capacity drives, tighter control of wear leveling, and so on. Yes?

There were two questions there; I'll take the last one first. Is there an RFE posted for LVM for better handling of zone-aligned volume creation? Not yet; I think I may post one, because remembering to offset by the zone size is a little complex to do. It might be nice to have a front end where LVM queries the zone size and then places the first extent right at the beginning of the next zone, to make it easier.

And the other question is: what is the data endurance and lifetime of these drives? I'm not speaking for the manufacturers, but I believe it's about the same as the endurance of regular magnetic hard drives. They're able to get more data space, more track space used, with fewer guard tracks in between, by having these overlapping tracks. The hard part about endurance is that you usually have to wait that long to find out how long the drives will last; accelerated testing may not produce the same results as actual experience. Some people are seeing higher failure rates with hard disk drives than they are with flash, and others may see different results.

[Audience question about drives that manage the zones automatically but also expose the zoned interface.]

There are host-managed drives, and then there are host-aware drives. Host-aware, yes: host-aware drives are able to present the zoned interface, but can also do the same thing as a drive-managed drive, so you can use either interface.

[Audience] Are there any SMR drives on the consumer market at all? I think there is one; that Seagate drive is drive-managed.

Yes. Admittedly, it's a little difficult to find some of these SMR drives on the market, because they're usually sold to large OEMs buying hundreds or thousands of drives for a specific application. It's not going to be as visible on the consumer market as an individual drive SKU if you're purchasing one drive; you have to look for a specific part number.

[Heinz, from the audience] I put a bunch of these Seagate 8 terabyte drive-managed ones into a filer, running a workload that's not as well suited as what we briefly elaborated on, but they've survived since then.

Yes, so Heinz was mentioning that he has drive-managed SMR drives in a filer and they've been running for a while. That's the key, though: the application has to be conducive to those sequential writes.

[Audience question about the status of zoned device support in other operating systems.]

Other operating systems? I know Windows supports these with ReFS, the Resilient File System; I don't know the exact restrictions on that. Admittedly, I've been working in Linux for a while, but I used to work with Windows as a system administrator in a large storage environment. In the rest of the industry, I know that Dropbox is using these drives, but they wrote their own software and their own drivers, closed source, unfortunately, so we don't get to see that work. That sort of struck me: a closed-source software company wrote the software to use these drives, but now that the drives exist, those of us in the open source environment have to figure out how to use them, because they do exist. It's a bit like the classic 1990s situation where the hardware existed and the vendors did not provide the drivers, so we had to figure it out ourselves.
However, in this case, Damien at Western Digital, and I believe a few other contributors at Seagate, are providing Linux kernel patches to use the drives. We're just at the point where the file systems and the other applications have to get their support working with these drives, and that takes a little while.

So the question was: have any of the other large storage application companies, like Dropbox or a Backblaze-style company, shown interest in contributing? Not that I've seen so much in Linux file system development. Btrfs is one where Facebook has been contributing a little bit, because they're working on Btrfs anyway, but I haven't seen anything from the application developers in that regard.

All right, thank you.