So now moving on to the Zoned Storage BoF. The agenda: this is not going to cover zoned storage in detail. I'm going to give a really fast intro, so please excuse the fact that I'm jumping through a lot of material; for a proper introduction to zoned storage, please read the documentation I'll be providing in the references. I'll also be talking about problems, solutions, and discussion. Some news first: the Zoned Storage Microconference has been accepted to Linux Plumbers, so, great. I hope you can get folks to submit talks there, and then we'll see you at Plumbers to follow up on this and expose zoned storage concepts to the user world rather than just the kernel world. If you're not familiar with zoned storage at all, just go to this page; it has really great documentation, and it is the home for information about zoned storage.

So zoned storage, again, fast recap, is a class of storage device that enables hosts and storage devices to cooperate to achieve higher storage capacities, increased throughput, and lower latencies. There are different form factors that you can use zoned storage solutions on; the latest and trendiest one, obviously, is on the NVMe front, and we refer to those as ZNS. The gains are the ones listed here. Zoned storage essentially divides the device into zones. Writes must be sequential: each zone keeps a write pointer that tracks the position of the next write, and you can't overwrite in place; if you do want to do that, you essentially need to do a zone reset first. On SMR, conventional zones are optional. NVMe optionally supports conventional namespaces for conventional I/O access, so there is support for conventional namespaces as well, but drives can exist with only sequential zones. That essentially means you can end up with a world where you have drives on which you can only do sequential writes, which is very important, and file system developers need to keep that in mind. (A minimal sketch of this zone model, using the kernel's zoned block ioctls, appears after the scheduler discussion below.)

Also a recap on the file system lay of the land. Maybe I missed something here, I'm not sure, but I think this covers what we support right now: Btrfs, F2FS, device mapper, and zonefs. Anyone else? I don't know if I missed something. I'll elaborate on zonefs, and I hope to ask Damien to explain a bit more about it in a moment, if that's okay. But let's get into some of the current problems that have come up. Oh, actually, on the ecosystem side, I should point out that to replace a zoned storage device, contrary to the typical world of storage, you do want to ensure the replacement has the same zone size if you're trying to do parity matching of some sort. ZNS also requires manually setting the I/O scheduler to mq-deadline, so that's something I hope we can fix somehow; I'm not sure we have a solution yet.

Yes. Only partially true, I think: mq-deadline is only needed if you have a file system on top of it. So if...

Can you repeat that again?

mq-deadline is only needed if you're doing regular writes. If you're just doing zone appends to your ZNS device, you don't need mq-deadline at all. If you use F2FS or Btrfs, of course, then you need mq-deadline.

Okay, yeah, that's good. Sorry, I should have been more specific, thanks. So, I'm not sure if we have a solution to this yet, but it would be nice if we did, so that users don't have to do this themselves. I'm not sure if we can do this with udev rules. Matthias? Isn't the Btrfs work relevant for this?
I mean, with the RAID and doing chunks and so on? I mean, this is a real issue: isn't Btrfs, with its RAID capability, able to handle different zone sizes across different devices?

Btrfs RAID isn't supported on zoned devices yet.

No, I... well, I mean, you have patches.

I have patches, but... it could be made to work, yes, but not in its current state, and it's a long, long-term project, so I have that one really, really far down the task list.

Yeah, yeah, that's one point. Luis, one more point about mq-deadline, just to clarify for everybody. The reason zoned storage requires mq-deadline is simply to guarantee write ordering. The application or the user has to issue writes sequentially, and whatever supports zoned storage does that, but the block stack doesn't guarantee any particular ordering for the execution of requests, and we have to guarantee that for writes. In the current implementation, those guarantees are provided by mq-deadline using a zone write-locking mechanism, which actually limits writes to at most one per zone, and that is really bad for NVMe ZNS, as it limits performance. So the issue with mq-deadline is not really mq-deadline itself; it's how we guarantee write ordering. The current implementation relies on mq-deadline; we could think of something else. That's the larger picture of the problem with mq-deadline.

Thank you so much, Damien. Damien, I think you've done some work in this area; could you share it?

So, initially the work started when we were not multiqueue yet; that was the 4.10 kernel, with SMR support, so that was not blk-mq at the time, and we were doing this write-ordering guarantee with a similar zone write-locking mechanism down in the SCSI disk driver. That was, however, really bad for performance, because in many cases it would stall the drive queue at queue depth 1 for every write, essentially, even if you had reads behind it that you could be sending. And the switch to blk-mq added more problems, so we had to redesign it. So the zone write locking went up from the SCSI driver to the block layer, and the scheduler was the easiest place to put it. So yes, it is part of the scheduler; I still think it shouldn't be, but right now it's really, really hard to put it somewhere else.

I'd like to say something about the locking per zone, right? I think there's some misunderstanding here about limiting queue depth to the drive. It's queue depth 1 per zone, but you can submit I/Os to multiple zones at once, so you can easily build high queue depths to the device even though you have zone locking, right? Just to make that clear for the whole community. (There's a sketch of this multi-zone pattern below.)

Yes. That's a separate issue.

Well, yes and no. So, yes, you're right, and for an HDD, actually, that is...

I'm talking about ZNS; this is zoned storage.

I'm coming to that. The simple fact that you have an I/O scheduler in the path for an SSD hurts.

Sorry, I missed that. What was that?

The mere fact that you have an I/O scheduler when using an SSD hurts performance.

Yes. Really significantly hurts.

But isn't this, in some ways, a user-space problem, right? If you write your application to guarantee that you have one outstanding I/O per zone, then you can build high queue depths to the device. I do it all the time.

Of course; it's not the problem of high queue depths here.
And that's why for ZNS mq-deadline has to be set manually; we don't default to it, because, yeah, we kind of want people to do that. The problem, though, is that it's hard to guarantee that a write from an application is not going to be split for whatever reason.

Yeah, so that's a key thing here, right? Like, what are those boundaries?

I think it's well-defined when it gets split, depending on the size of it. Is this something you can clearly get at? No?

It's not so much the split, or that you can write to several zones at once. The thing here is streaming writes. If you want to do streaming writes, it would be logical to submit all those writes in one go and basically make use of the queuing; that is what the spec was designed for. As it turns out, it's really hard to guarantee that the I/Os you send down to the drive in order will also arrive at the drive in order and be processed by the drive in order, because there really is literally nothing enforcing that.

I think the problem is separate zones, not the same zone.

No, it is the same zone. The problem is the same zone, of course.

My point is, from my perspective, I write applications that just use multiple zones, right?

Sure, I mean, if you have...

There's too much of this talk about how you can't get queue depth, you can't do this. You can; you just need to write to multiple zones at once.

This becomes a philosophical question, right? So Google is using something called hybrid SMR; I think the official standards name is zone domains, where zones go online and offline, and you have an LBA space which is conventional zones and an LBA space which is SMR zones. SMR zones have all these magic properties, like a write can only be in one zone, which means, for a variety of reasons, we couldn't use the deadline scheduler; we were using a CFQ-derived scheduler. So we manually hacked the CFQ scheduler to never merge I/O requests, so that if you have two 256 MB write requests to adjacent zones, the I/O scheduler will not helpfully merge them for you. That is a hack, a hack that is an out-of-tree patch in the Google kernel. We have been thinking about how to actually work with knowledge of zone-domain storage in the upstream kernels, and one of the philosophical divides, for example, is whether or not the kernel should be tracking the write pointer. And if the kernel tracks the write pointer, then you can guarantee that things don't go out of order.

No, you can't.

No, you can't.

Yeah, there are some things. But the point is this: in my understanding, we're not using the in-kernel zone support at all; we are using a pure user-space solution, because we didn't want the kernel in the business of tracking write pointers. That was overhead we considered unnecessary; we would much rather user space do it.

The problem is that a lot of people think that way.

Yeah, and we've made it work; we actually have it working in production. It also uses out-of-tree patches, which is not sustainable, and we are thinking about how to solve that problem in an upstream kernel. But part of it, and it's sort of a blessing, is that the upstream kernel doesn't understand these zone-domain drives, which is good, because we actually have some fundamental disagreements with the direction upstream has been taking. So we're doing it purely in user space, even though we know that's not viable in the long run.

I think this would be a great topic for Plumbers, if you're going.
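To make the zone model from the recap concrete: here's a minimal sketch, assuming a hypothetical zoned device at /dev/nvme0n1, that reads back the first zone's write pointer and resets the zone using the kernel's zoned block ioctls from linux/blkzoned.h. This is an illustration of the model discussed above, not code from any of the projects mentioned.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/blkzoned.h>

int main(void)
{
    /* Hypothetical zoned block device; any ZNS or SMR drive works. */
    int fd = open("/dev/nvme0n1", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* BLKREPORTZONE fills a blk_zone_report header followed by an
     * array of blk_zone entries; ask for just the first zone. */
    struct blk_zone_report *rep =
        calloc(1, sizeof(*rep) + sizeof(struct blk_zone));
    rep->sector = 0;
    rep->nr_zones = 1;
    if (ioctl(fd, BLKREPORTZONE, rep) < 0) { perror("report"); return 1; }

    struct blk_zone *z = &rep->zones[0];
    printf("zone start %llu, len %llu, write pointer %llu (512B sectors)\n",
           (unsigned long long)z->start,
           (unsigned long long)z->len,
           (unsigned long long)z->wp);

    /* Sequential zones only accept writes at z->wp; to overwrite,
     * the whole zone has to be reset first. */
    struct blk_zone_range range = { .sector = z->start, .nr_sectors = z->len };
    if (ioctl(fd, BLKRESETZONE, &range) < 0) { perror("reset"); return 1; }

    free(rep);
    close(fd);
    return 0;
}
```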
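And here's a sketch of the multi-zone pattern raised in the discussion: queue depth 1 per zone, but a high aggregate queue depth to the device, by keeping one outstanding write per zone across many zones. It assumes libaio, freshly reset zones (so each write pointer sits at the zone start), and a hypothetical fixed zone size; real code would query the zone geometry first.

```c
#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <libaio.h>          /* link with -laio */

#define NR_ZONES  16
#define ZONE_SIZE (256ULL << 20)  /* hypothetical 256 MiB zones */
#define IO_SIZE   (1UL << 20)

int main(void)
{
    int fd = open("/dev/nvme0n1", O_WRONLY | O_DIRECT);
    if (fd < 0) return 1;

    io_context_t ctx = 0;
    if (io_setup(NR_ZONES, &ctx) < 0) return 1;

    void *buf;
    posix_memalign(&buf, 4096, IO_SIZE);
    memset(buf, 0, IO_SIZE);

    /* One outstanding write per zone: each I/O targets a different
     * zone's write pointer (here the zone start, assuming freshly
     * reset zones). QD is 1 per zone but NR_ZONES to the device. */
    struct iocb iocbs[NR_ZONES], *list[NR_ZONES];
    for (int i = 0; i < NR_ZONES; i++) {
        io_prep_pwrite(&iocbs[i], fd, buf, IO_SIZE,
                       (long long)i * ZONE_SIZE);
        list[i] = &iocbs[i];
    }
    if (io_submit(ctx, NR_ZONES, list) < 0) return 1;

    struct io_event events[NR_ZONES];
    io_getevents(ctx, NR_ZONES, NR_ZONES, events, NULL);

    io_destroy(ctx);
    free(buf);
    close(fd);
    return 0;
}
```

The per-zone offsets here are what mq-deadline's zone write locking would otherwise police: with only one I/O in flight per zone, nothing can be reordered within a zone, while the device still sees a deep queue.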
So let me just move on a bit here then, sorry. I want to acknowledge what Ted just said: we, hard drives, can get away with a lot because we have all the time in the world, but we are seeing a real issue with having an I/O scheduler associated with SSDs and zoned drives, and we do see a real impact on our ZNS drives when we use the mq-deadline scheduler. We are capable of disabling it, and in fio we can do it, and all that, so it is possible with the application and so on. But it is something that we are working on, and will continue to work on, and we'll have to work with everyone on getting to the right way to do it; it's not clear so far.

I'm not even done explaining some of the existing problems, but it is an exciting time right now for all these concepts, right? So collaboration is highly welcomed, and I'm glad that we have the right people here in the room. So, current things: the non-power-of-two zone size patches have now been posted, and version two of the series is coming soon. I'd just like to reiterate here that btrfs check does not work for any zoned storage image dumps at the moment; I suspect we may need a bit more zone information on disk to support that properly. The superblock, just to recap, is dealt with a bit differently on zoned storage devices. Here's an explanation of how it works for power-of-two and then non-power-of-two zone sizes, but you should just try to wrap your heads around how this is handled on zoned devices, given that you have to keep copies of the superblock. You basically have two zones in use at the same time, and you keep appending to them; this is an example of the superblock copies in zone zero and zone one. Once you run out of space in zone zero, you jump to zone one and then reset zone zero, and you go back and forth like that. That's pretty much what's being done. (There's a toy sketch of this scheme after the testing discussion below.) So, just one question, for instance: do we write the superblock on every single write today? I just want to clarify, because this is not clear to me.

What's that?

Yeah, it's, I think, every transaction commit. So on a normal system, probably every 30 seconds, and then if you're doing something like an RPM install, more like every millisecond. If you have an application doing fsyncs, that's going to happen too.

So this is kind of where we're at today, and I'd now like to pass it over to Damien: he can elaborate a bit more on the world of zonefs, because I'm not really too familiar with its uses. I'd just like to ask whether I captured what was discussed and the ideas from the mailing list; I tried to capture at least one use case here. And relating back to the other conversation on killing ioctls, for instance, this is just a silly example, but it's a real-world problem, right?

Yeah, so for zonefs, the long-term goal is to try to avoid having to do O_DIRECT writes. The current requirement is that writes must be O_DIRECT, and the reason for that is that if we go through the page cache with buffered writes, we have no guarantee that the page cache is going to write out the dirty pages in order for the writes into the zones. So O_DIRECT is the only solution currently. Willy, Matthew, are you in the room? He's not here, but he posted some patches that could allow bypassing that: essentially, O_SYNC behaving like write-through instead of write-back, which would keep the ordering.
So that will allow buffered writes, but write-through: direct to the zone, but you also preload your page cache, which is nice to have, to reduce device accesses if you have to do reads after writing. So yeah, that's kind of the long-term goal, but that goes beyond zonefs; there are a lot of things to think about there besides zonefs itself.

Great, great, thanks. I'd also like to invite Kent, if you can expand a bit; perhaps people are not aware of the bells and whistles that may be possible with bcachefs for zoned storage devices. At least I'm a bit excited about that.

bcachefs is going to be getting full native zoned device support, because, dating back to bcache, allocation has always been designed as bucket-based. Buckets map very nicely onto zones, and we've already got copying garbage collection, so zoned devices are going to work just as well for us as normal block devices.

I'd also like to open the floor for any open topics folks would like to discuss. We've got about 10 minutes.

Yeah, just a clarification on the statement you've shown now, the "does not have O_DIRECT" one: Java did not have O_DIRECT, so that was a statement about the past. It didn't have it, but now it does, so using the latest Java API we can do that.

Great, thank you. Okay, I didn't know. It's still a pain to have to use O_DIRECT, because you bypass your cache, essentially, so any read after a write will go to the device. (There's a minimal sketch of such a zonefs write below, after the testing notes.)

With regards to testing, I guess I should mention, and it should have been obvious, I didn't state it before: obviously I'm using kdevops to test the hell out of zoned storage devices, with both blktests and fstests. The only caveat there was that I found out late, and this is why I put in a bullet here, that you do need mq-deadline. So I figured I'd try both baselines: one with mq-deadline not set, and one with it set. I think it's good to see both, to have baselines for that.

Yes. If the drive you're testing is a SATA drive connected to AHCI, it's going to take about five seconds to get out-of-order writes. It's almost instant.

Yeah, this bit me, because I didn't know. I was setting up my own ZNS test setup and it just fell over, and I was like, oh God, they gave me bad drives. No, it's the mq-deadline thing.

No, it's AHCI. AHCI just doesn't actually respect the order in which you issue commands; the adapter itself reorders commands.

And there are obviously two ways to test all this. If you want a virtualized environment, you can do that with QEMU.

Can you do that with null_blk as well?

Null_blk as well, yes; there are some caveats there, though. One of them was an RCU splat that came up a little while ago. So there are two different kinds of setups one can use to test. So yeah, any other topics folks want to bring up for zoned storage?

One thing that Christoph actually talked about yesterday: QEMU has an NVMe ZNS emulation, which is very convenient to use for testing. However, the state of the drive is not persistent across restarts of QEMU, which is really annoying sometimes. So it would be good to finish that work and make it persistent in QEMU.

Can you repeat that, Damien? Can you repeat that, as a recap?

The ZNS emulation in QEMU is not persistent across restarts of QEMU.

Oh, okay, yeah. So Klaus has been working on that. He was not agreeing with Dmitry, they were fighting about it, so we just need to...

Yeah, I think that's a good idea too, yeah.
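Coming back to the superblock discussion above: here's a toy sketch of that two-zone log scheme. The helper names (zone_is_full, zone_wp, zone_reset, write_at) are hypothetical stand-ins, not the actual Btrfs functions; the point is just the alternation described earlier.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helpers standing in for the real Btrfs zoned code. */
extern bool     zone_is_full(int zone);
extern uint64_t zone_wp(int zone);                 /* next free offset */
extern void     zone_reset(int zone);
extern void     write_at(int zone, uint64_t off, const void *sb, size_t len);

/* Append a superblock copy to the two-zone log: when the active
 * zone runs out of space, jump to the other zone, then reset the
 * one that filled up, alternating back and forth. On mount, the
 * valid superblock is the last copy written, i.e. the one just
 * below the active zone's write pointer. */
void sb_log_write(const void *sb, size_t len, int *active)
{
    if (zone_is_full(*active)) {
        int prev = *active;
        *active = 1 - prev;   /* jump to the other zone... */
        zone_reset(prev);     /* ...then reset the full one */
    }
    write_at(*active, zone_wp(*active), sb, len);
}
```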
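And on the zonefs O_DIRECT point: here's a minimal sketch of what a write to a zonefs sequential zone file looks like today, assuming zonefs is mounted at a hypothetical /mnt/zonefs. Sequential zone files live under seq/, their size reflects the zone's write pointer, and writes must be direct and land exactly at the end of the file.

```c
#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* Sequential zone files live under the seq/ directory of the
     * zonefs mount point (hypothetical path here). */
    int fd = open("/mnt/zonefs/seq/0", O_WRONLY | O_DIRECT);
    if (fd < 0) return 1;

    /* The file size tracks the zone's write pointer, so the only
     * valid write offset is the current end of the file. */
    struct stat st;
    if (fstat(fd, &st) < 0) return 1;

    void *buf;
    posix_memalign(&buf, 4096, 4096);  /* O_DIRECT needs alignment */
    memset(buf, 0, 4096);

    if (pwrite(fd, buf, 4096, st.st_size) < 0) return 1;

    free(buf);
    close(fd);
    return 0;
}
```

This is exactly the pain point mentioned above: because the write bypasses the page cache, a read right after it goes back to the device.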
I have an additional topic. Today the Linux kernel block layer supports the subset of zoned storage that is common between NVMe and SCSI. Should we keep an eye on the standards committees to prevent divergence, in case the standards go in different directions? I noticed that in ZBC-2 a whole bunch of new concepts are being added, for example zone domains and realms, and I haven't seen any equivalent in the NVMe specs.

I'm sorry, I didn't really get where you want to go with that. What exactly do you want?

Today, block-layer zoned storage works well, and we can add support in file systems for zoned storage, because what is supported in the Linux kernel is functionality that's shared between NVMe and SCSI. One day zone domains and realms devices may appear, and we would want to support those with a file system on Linux; we could end up with something that's specific to SCSI and isn't appropriate for NVMe. Today, if you design for SCSI, you may run into trouble with NVMe.

That's correct, yeah, and it cuts the other way too: ZNS has some features that don't exist in SMR, on SCSI and ATA, like the active zone concept. That's actually the work we're doing right now for Btrfs: fixing active zone management, which, again, is not a concept that exists in SMR. It's all unified under the same interface.

It did work; the unification was actually very clean.

It works, but yes, there are parameters that need to be looked at that differ between device types.

Yes, correct. The challenge we're facing with Android is that today we are developing, or looking at, zoned storage on SCSI. One day we will switch to zoned storage on NVMe, and our hope is that we can keep the file system the same and won't have to rewrite it.

That's always been the goal from the start for file systems: whatever works with one device type has to work with the other two. Again, the block layer abstracts the device type; that's the role of the block layer, and so the file system, to some extent, should not have to care what device it's talking to. There is a set of parameters that describes the device, and the file system has to work with them. So, what we did for Btrfs: since the active zone concept, for example, doesn't exist on SMR, we didn't care about it in Btrfs at first. Once ZNS came in, we had to deal with it, so that support is being added; it's in already, still a bit buggy, there are some problems, but it works the same way, and what Btrfs sees in an SMR drive is simply a drive that has no active zone limit, whereas a ZNS drive will have a limit.

Another difference between ZNS and SMR is zone append. For us as a file system, zone append really is awesome for data placement: we don't have to care where we write our data, we just pick a zone and write to it. For SCSI, we wrote an emulation, basically hooking zone append and translating it into a regular write CDB, with write-pointer tracking. In the kernel we have to have a base layer that the file systems understand, so we don't need to change too much in the file systems, and the file system simply shouldn't care whether it's a ZNS drive or an SMR drive. (There's a conceptual sketch of that emulation below.)
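To make the zone-append emulation just described a bit more concrete, here's a conceptual sketch, not the actual kernel code, of the idea: track a write pointer per zone on the host, translate each append into a regular write at that pointer, and report back where the data landed, the way a native ZNS zone append completion does. The zone_state struct and submit_write() are hypothetical.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical per-zone state kept by the host-side emulation. */
struct zone_state {
    pthread_mutex_t lock;
    uint64_t start;   /* first sector of the zone */
    uint64_t wp;      /* tracked write pointer, in sectors */
};

/* Hypothetical stand-in for issuing a regular write CDB. */
extern int submit_write(uint64_t sector, const void *buf, uint32_t sectors);

/* Emulated zone append: serialize appends to the zone, issue a
 * plain write at the tracked write pointer, advance the pointer,
 * and return the sector the data landed at, which is what a native
 * ZNS zone append completion reports to the caller. */
uint64_t emulated_zone_append(struct zone_state *z, const void *buf,
                              uint32_t sectors)
{
    pthread_mutex_lock(&z->lock);
    uint64_t sector = z->wp;
    submit_write(sector, buf, sectors);
    z->wp += sectors;
    pthread_mutex_unlock(&z->lock);
    return sector;
}
```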
So I think, to Bart's point, we need to keep in mind that there might be other standards coming along, and we need to make sure we support those, because I think the worst thing that could happen is that, if those standards are not in line with what is happening in the kernel today, we end up having emulation layers for very fast devices because they don't support the spec. So that's something to watch: even though we want to keep zoned storage as a common API underneath different drivers, at some point maybe we need to acknowledge that the properties of the underlying media are different, and at some point we might have to deviate. The scheduler is one example, but there might be others.

Yeah, and that's not the responsibility of the file system. Like, I cannot describe how little I care, right? And I think the file system needs to not care. If you want to add a bunch of fancy bells and whistles, awesome; user space can use those. The file system does not need to have that intelligence built in.

I fully agree with you. It's not a comment about the file system; it's a comment about making sure the specs align with how things work, so that we don't need to take care of that.

Yeah, and I think we have that, right? Like, we know, okay, these zones are special, we have special rules about writing to them. Anything beyond that... I think the active zones make sense, right? You can only have so many open at once. Okay, great, we have them in the file system too, but I'm sort of reaching my limit of what I think is acceptable to be stuffing into Btrfs.

100% agree.

But, you know, coming back to the scheduler thing, I mean, you do experience that pain.

Oh yeah, I know.

The scheduler thing is just an implementation issue. We could reimplement it; we just need to find a better way to guarantee the write ordering. It's just an example.

Right, and the scheduler thing is a really good example of a thing that I shouldn't have to think about, that the file system shouldn't have to think about; it should just be done automatically. I am willing to accept it because this stuff is still in transition. Cool, things are going to happen, I switch to mq-deadline, whatever; I'm the upstream developer, I can eat that. Long term, that's not sustainable, and long term, more and more fancy things are not going to be sustainable. You want to build fancy stuff? Great, find user-space applications for it. I think the file system is kind of nearing the limit of what we're willing to accept.

It seems to me the block layer is the right place for that sort of abstraction, right? Historically, for things like discard, things like FUA, file systems use the block layer's defined interface, and we don't have to worry about whether it's a SATA drive or a SAS drive. We just do a write-zeroes and the block layer will magically transform that into whatever the right interface happens to be, whether it's a UFS device or whatever. And I think that's just the long-term direction.

Couldn't agree more. That's why it is very important that the block layer supports all zoned devices; that's very important. I wholeheartedly agree. And my hope is that there are no more bells and whistles, because we want to make ZNS, or SMR, as easy as possible to use, and we are trying to simplify, simplify, simplify, even if it means doing more logic on the drive, because we do know CPUs aren't getting appreciably faster.
So, just from that point: I think we've found really good ground now, and we are really not looking to push the boundaries further than we have. I think the scheduler thing has been hanging over us for a long time; I don't have a good answer, but it's a thing that has come up in every project we've done. If anyone doesn't have drives and wants to work on this, there's zonedstorage.io, so you can get ZNS drives through there. For SMR drives, there are vendors out there, and I know every one of them has samples and so on. Because we know these drives are not available in the retail channel, but they are available, SMR with nearly no strings attached, ZNS with very few strings, such that even individuals can get them. And we are working with Samsung on having a public cluster and such for testing. And already, with Ceph, we have drives, and all of this; we've got to have it as part of our CI pipelines, and you want to catch the errors before they land in your inboxes too. So we're building up the infrastructure, because, yeah.

Yeah, and I'm not criticizing any of the work that's been done up to this point. I think everything's been fantastic, right? It's going to be rough for these things; whatever, I completely understand, don't care. What I'm talking about is all the future talk of the many fancy different things. Like, we're good; what we've got now is pretty good. And the more things you do to differentiate yourselves from each other, I want to tell you now: the file system is not going to be able to take advantage of it. We are not going to be smart enough to be the interface from user applications to the fancy things. So let's keep that in mind, because there's only so much the file system knows and can do, only so much I can keep in my head, and only so much other developers can keep in their heads, right?

And I think the answer is that, as there are new bells and whistles, there needs to be a conversation with the file system developers. There might be some new feature, for example, a way for us to send hints that this is a journal, whether it's a database journal or a file system journal, that would significantly improve performance on a particular device. Let's talk, right? Maybe we can come up with an interface that's not too bad. But the point is that the benefit has to be pretty large, and the interface has to be fairly stable and simple, or it's going to be really, really hard to get the information plumbed, in some cases from user space through the file system down to the block layer.

Yeah, so that's more my concern. There are definitely things that Btrfs itself can take advantage of, bells and whistles that Btrfs can use for its metadata, and that ext4 and XFS can use for their metadata, because there they are the application; they control it. But as far as translating those bells and whistles up through user space and keeping track of all of that, the plumbing... Just off the top of my head: say we have a special zone that goes super fast that we want users to take advantage of. Btrfs is not going to be able to get that right at all. We're never going to be able to set up a system in which we can take advantage of that; it needs to be driven solely by a user-space application that has full use of the disk.
Yeah, I'm just saying, I can imagine, for example, because I know the hint that this is a journal, or this is hot data, is one where it might be worthwhile for MySQL and Postgres, since there are only a handful of databases, that it would probably be an ioctl, right? One that you pass to the file system, which we could then pass through. But, you know, it would have to be a special case, and the benefits would have to be pretty large.

Very much agreed. I can't speak for everyone on this, but there was Open-Channel, which was too complex, and we had a hard time getting developers and users for it before, because it was hard. And it was great, you could do everything, but it was hard. With ZNS, the way I see it, the idea is to simplify, simplify, and I'm sorry that we have these active resources that have to be managed, but that is unfortunately what we had to do with NAND. I'm very sorry for all the pain it's causing Johannes and everyone else implementing support, but yes, in my vision we do not want to push it further, because the more we do, the fewer general users we're going to have and the less applicable it will be.

Well, we're definitely past time, but there's nothing scheduled until 10:30, so unless anyone has anything else, I guess we'll wrap it up.