Hi, all. Welcome to my talk, "Getting in the Zone: Btrfs on SMR HDDs and NVMe ZNS SSDs". My name is Johannes Thumshirn, and I work as a Linux kernel developer for Western Digital Research, and with me is Damien Le Moal, who is my team lead and also a kernel developer at Western Digital. Hi, everybody.

As a short outline for today's talk, I'll start with a high-level overview of Btrfs and then a bit on zoned block devices. Then I'll talk about using Btrfs on top of zoned block devices, and last but not least, some further enhancements we have in the pipeline.

First of all, what is Btrfs? Btrfs is a general-purpose copy-on-write file system for Linux. It's built around the concept of copy-on-write B-trees, which allows it to easily offer things like snapshots and subvolumes. It also has some advanced features, like transparent data compression with LZO, zlib, and zstd, and checksums for both data and metadata using the CRC32C, xxHash, SHA-256, and BLAKE2b algorithms. It also offers built-in RAID support with the RAID levels 0, 1, and 10. Technically there are also RAID 5 and 6, but those are broken, so their use is discouraged. And it has its own backup utility with btrfs send and receive, which can send a stream of changes between two subvolume snapshots.

Then, what are zoned block devices? Zoned block devices are most commonly found today in the form of SMR hard drives; that's Shingled Magnetic Recording. The zoned block device interface is defined by the SCSI ZBC and ATA ZAC standards. On a zoned block device, the LBA range exposed by the device is divided up into zones. These zones can either be conventional zones, which accept random writes (basically, a conventional zone behaves like the hard drives you're used to), or sequential-write-required zones, which have a write pointer, and you always have to write at the write pointer. Once the data has been written, the write pointer advances, and the next data has to be written at the write pointer again. To write data before the write pointer, you have to reset the complete zone, erasing all the data to rewind the write pointer to the beginning of the zone, and start writing again. That's a constraint that users of zoned block devices must be aware of, and the sequential write rule cannot be broken, because the device will simply fail the write.

To simplify this operation a little bit, when NVMe ZNS SSDs were standardized, the concept of the zone append operation came around. A zone append is basically a nameless write: you write your data to a specific zone, the device places it at the write pointer, and it tells you afterwards where the write pointer was. Unfortunately, the zone append operation is not defined in SCSI and ATA, so we had to write an emulation for the SCSI driver, which came in kernel 5.8, and just recently Damien wrote a similar emulation for the device mapper stack, so we have a unified driver stack in all of Linux. Using zone append writes basically means you can do writes in any order, with some minor constraints, and the device will never fail these writes, but you have to be able to handle out-of-order completions.
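You can poke at all of this from userspace with the blkzone utility from util-linux and the block layer's sysfs attributes. A minimal sketch, assuming the zoned device is /dev/sda:

```sh
# Is the device zoned? Prints "host-managed" or "host-aware" for
# zoned devices, "none" otherwise.
cat /sys/block/sda/queue/zoned

# Report the first zones: start, length, write pointer position, and
# zone type (conventional vs. sequential-write-required).
blkzone report --count 8 /dev/sda

# Rewind the write pointer of the first zone by resetting it;
# this erases all data in that zone.
blkzone reset --offset 0 --count 1 /dev/sda

# Maximum number of bytes in a single zone append operation; non-zero
# means zone append is supported (natively or via the emulation).
cat /sys/block/sda/queue/zone_append_max_bytes
```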
So what had to be done for Btrfs to support zoned block devices? One thing was the superblock: in Btrfs the superblock is the only fixed-location data structure that does not get updated via copy-on-write, so it had to be changed to a log-structured scheme. We use two zones as a ring buffer and always write the new superblock at the write pointer. Once both zones are full, we restart from the first zone, so we are always crash-consistent and always have the newest superblock at the device's write pointer. Next, we had to align the block groups to the actual zones, and write a zoned extent allocator that does append-only allocation, to avoid random writes within a zone.

Starting with kernel 5.12, we are fully functional, which means we use zone append for all data writes and we are able to mount the file system. What's not yet working: we don't have any support for RAID allocation yet, but there are plans to do this, and there is no support for the nodatacow mount option, because that simply won't work with zoned devices. Currently, we are in a stabilization phase. There is automatic reclaim of dirty zones, which was merged in kernel 5.13, and we are working on bug fixes for all the corner cases we found in testing.

Next, here is an overview of the data write path of Btrfs. As you can see, there's lots of parallelism: there are normal writes, there are compressed writes, and then there's delayed allocation. All of this introduces a lot of asynchronicity, and there's a lot of potential for unintentional reordering, which would then fail the writes to the device. That's why we had to use zone append for the data I/O submission. The new allocator only needs to reserve blocks; it doesn't really care where a block gets written on disk. It just needs to reserve it in the more or less right spot in the correct zone, and the device will handle most of the magic by itself. Once the bio submission has completed, the file system gets notified by the block layer and the underlying SCSI or NVMe layer, and we know where the device has written the data, so we can update our metadata, which in Btrfs always gets written after the data has been written. So there's no problem with the metadata. The metadata writes need to be serialized, though, so there's a metadata lock, which means we can currently only have one process writing metadata at a time; that maybe could be changed in the future.

So now to something practical: using Btrfs on zoned devices. First of all, why should you use Btrfs on a zoned device, or why should you use a zoned device anyway? There's an increase in capacity with SMR HDDs, and you get better, more predictable latencies, because there's no background device-level garbage collection in NVMe ZNS SSDs. There's also better isolation of workloads on a zoned device. With the ability to use Btrfs natively on SMR drives, there's also no need to use the device mapper dm-zoned target, which was used as a translation layer to serialize writes, which means a lower overhead all in all. And there is the chance to natively use upcoming ZNS drives, which is currently under development in Btrfs.

The prerequisites for using Btrfs on zoned devices are kernel version 5.12 (Fedora 34 currently ships 5.13 as an update as of this week) and btrfs-progs 5.12, which is also available in Fedora, which ships 5.13. And then blkid from the util-linux package has to be patched to detect the new log-structured superblock. These patches are upstream in the main branch, but there is no official release tagged with a version containing them. The patches are, however, relatively easy to backport in a local RPM.
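A quick sketch of checking those prerequisites on a running system; the device path /dev/sda is an assumption:

```sh
# Kernel 5.12 or newer is needed; 5.13 adds automatic zone reclaim.
uname -r

# btrfs-progs 5.12 or newer is needed to create a zoned file system.
mkfs.btrfs --version

# With a patched util-linux, blkid can probe the new log-structured
# superblock; older releases won't recognize a zoned Btrfs.
blkid -p /dev/sda
```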
So, for really using Btrfs on a zoned block device: you can first check whether your device really is zoned and look at the zone properties with the blkzone utility, for example blkzone report /dev/sda, or whatever your device is. But that's a totally optional step and could be left aside.

The next thing is to format the device with mkfs.btrfs. Currently, you have to pass the -m single -d single options to mkfs.btrfs, because only the single profile is supported, and upstream didn't like it to be the default profile on a zoned device, because defaults are hard to change in the future. So you have to explicitly specify the single profile for data and metadata. mkfs.btrfs will then perform a zone reset on all device zones and write your initial file system data. Afterwards, you mount the file system as usual, but currently you have to use the -t btrfs option to select the file system type, until the latest blkid code has reached all distributions and the mount utility is compiled against a libblkid that can automatically detect the file system. (These commands are recapped in the sketch below.)

And there are some upcoming improvements we are working on. First of all is NVMe ZNS support. NVMe ZNS SSDs are zoned-namespace NVMe SSDs that give you more predictable command latencies, because there's no device-level garbage collection in the background. You gain better isolation of workloads, because there's no FTL, flash translation layer, in the firmware that decides to do things while you are writing. ZNS SSDs offer a zoned device interface similar to SMR HDDs, but with some changes. For example, there are no conventional zones, meaning you always have to write sequentially to all zones of the device. And not all LBAs are, or may actually be, usable: a ZNS SSD may have a zone capacity smaller than the actual zone size. This means changes are needed in the extent allocator of the file system: it has to track the number of active and open zones so as not to violate the device limits, and we have to differentiate between the zone size and the zone capacity, so we don't allocate beyond a zone's capacity.

And finally, my personal pet project, declustered parity RAID, is something I have promised for over a year now to start working on, but haven't had the time yet. With declustered parity RAID, all the data you write in a stripe gets uniformly spread across the devices in the RAID volume, which helps decrease the rebuild stress in large arrays. And with zoned devices typically being relatively big, in the tens-of-terabytes region, the stress on the individual devices when you have to rebuild an array is pretty high. We then also have the chance of adding new encoding schemes like Reed-Solomon or other MDS codes, which increase the overall fault tolerance for multi-disk failures. All of this requires introducing a new B-tree in Btrfs; there's only 13 of them, so let's add a new one. That change, the raid stripe tree, will allow us to do zone append writes for RAID on a zoned file system, so we don't have any implicit calculations anymore, but have real metadata on disk telling us where our data lies. That also benefits non-zoned Btrfs RAID setups like the traditional RAID 5 and 6, which are discouraged because they suffer from the RAID write hole, and data loss can occur with these RAID levels on Btrfs. So with the introduction of the stripe tree and declustered parity RAID, this problem can also be fixed.
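As a recap of the practical steps above, a minimal sketch, assuming the zoned device is /dev/sda and the mount point /mnt exists:

```sh
# Optional: verify the device is zoned and look at its zones.
blkzone report /dev/sda | head

# Format with the single profile for metadata (-m) and data (-d);
# this is currently mandatory on zoned devices. mkfs.btrfs resets
# all zones and writes the initial file system.
mkfs.btrfs -m single -d single /dev/sda

# Mount; -t btrfs is needed until mount is linked against a libblkid
# that detects the log-structured superblock.
mount -t btrfs /dev/sda /mnt
```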
Yeah, this concludes my presentation. We still have some time left. Are there any questions?

There's the question of whether we see any performance problems with Btrfs. Do you mean Btrfs overall, or zoned Btrfs versus non-zoned Btrfs? Yeah, Johannes, go to the Q&A tab. Yeah, I see the Q&A tab, OK.

To answer Neal, why aren't we running Fedora: because corporate laptop, corporate IT. And I am running Fedora, it's here on my right. It's hidden from the camera, but it's patching and building and doing lots of things. I'm actually running Fedora on some test systems and openSUSE on some other systems.

To answer the question of how bad RAID 5 and 6 are and when they will be good: well, I would not use the RAID 5/6 code in production. Surely some people can run it with a UPS and won't ever have any problem; other people will suffer problems. But there are some architectural errors in the code which we simply don't want to fix; we'd rather switch to the new scheme with the raid stripe tree, which adds a new translation layer between logical addresses and device physical addresses in the file system, and which is journaled to be failure safe. Well, not really a journal, but an ordered write.

When will the raid stripe tree get added to Btrfs? As soon as I have time for it. Hopefully it will be this year; I've promised it for several years now. I hope to start with the actual coding at the end of the month.

The next question: any improvement on Btrfs encryption? I have heard rumors about it. I don't know how public these rumors actually are, but there are people working on it.

Will we add support for Btrfs on zoned block devices to Anaconda? I'd say it should work out of the box if it's not a boot drive. If it's a boot drive, we will also have to patch GRUB so that GRUB can find the superblock. If it's not a boot drive, we'll have to check the Anaconda code to see whether it's using the correct default block group profiles for data and metadata, but that should be all that needs to change, if I'm not completely mistaken. Once we have the libblkid support in place, I don't see any difficulty in that. And eventually we can even patch GRUB to be able to boot from a ZNS SSD running Btrfs, or even an SMR drive, if you really want to boot from one. I think Fedora doesn't include /boot in the encrypted volume because GRUB doesn't support the encryption, most likely. I've never looked into it, but that would be my guess.

The question is whether there is still no way to include /boot in the root volume at install time. Again, my answer is, probably because... oh, Debian does that. OK, well, I don't know. I don't think you can do that. We are mostly working on kernel stuff, so on distro work we are kind of, well, not Johannes, but me, I am a beginner at distro work. Well, I haven't done a lot of distro work either. I did work at a distribution, but never did any distro work, only kernel work. So, yeah.

OK. So, for information, the LUKS format is not yet supported on SMR and ZNS. The short answer is that Debian uses LUKS1, Fedora uses LUKS2, and GRUB just got LUKS2 support. And related to zoned storage: the LUKS format is not supported on zoned storage. You can run dm-crypt on SMR and ZNS drives, but not with the LUKS format, and the reason is that cryptsetup, the utility, doesn't write the LUKS header sequentially. That, of course, creates a problem if you don't have a conventional zone at the beginning of your drive; it simply cannot format. I was working on fixing that, but got sidetracked and haven't finished. But yeah, that's something that can easily be fixed. dm-crypt itself does support zoned devices since kernel 5.9.
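To make that concrete: plain-mode dm-crypt keeps no on-disk header, so it can sit directly on a zoned device, while LUKS formatting currently fails there. A sketch, assuming the zoned drive is /dev/sda:

```sh
# LUKS needs random writes for its header at the start of the device,
# which a sequential-write-required zone rejects. Plain dm-crypt has
# no header, so it works on zoned devices with kernel 5.9 or newer:
cryptsetup open --type plain /dev/sda zonedcrypt

# The encrypted device then shows up as /dev/mapper/zonedcrypt.
```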
Then there's the question about performance after four to five years of usage, Btrfs versus XFS and ext4, that Btrfs feels slower. Yes, indeed, Btrfs is slower than XFS, at least. For architectural reasons: we do copy-on-write, yada yada, and the bookend extents are slower. Those are some problems we can't address. As for the performance of Btrfs on a zoned device versus a non-zoned device: with all the performance testing we have done, we are relatively on par. There are things that aren't completely on par on a zoned device versus a non-zoned device, but it's relatively on par. I would also add that after four or five years of running an HDD, I would recommend changing the HDD.

Then: will Btrfs get support for layered RAID? What do you mean by layered RAID? Oh, like a RAID 0 on top of RAID 1, or the reverse? Yeah, like what RAID 10 basically is? No, actually, the most common setup I've seen is a RAID 0 of multiple RAID 5 volumes. Is that what you're asking, Neal? No answer from Neal. Well, while we are redesigning some parts of the RAID code, we could at least look into it. I'm not sure everyone will be happy with that. Yeah, I'm personally not a big fan of layering RAIDs like this, because things can break: if one volume dies, then everything can die, so that kind of defeats the purpose of RAID. It has to be done properly, but then I would say that a proper erasure-coded volume can achieve the same level of performance and protection that a layered RAID would, with less complexity and probably better performance.

Yeah, I think we are over time, aren't we? Yes, we are four minutes over time. OK, thank you very much, everybody. Thanks, bye. Thanks, Johannes. Bye.