Hello, all of you. This one was supposed to be a talk, or rather a discussion, around dispersed namespaces and how we can handle dispersed namespaces on Linux. As a quick background: dispersed namespaces are essentially a feature whose primary use case is synchronized copies, such that you can copy namespaces from one physical subsystem to another and essentially do remote replication and things like that. This has been supported on SCSI since time immemorial, because SCSI simply doesn't care. SCSI relies on multipathing, which just looks at the identification of the device, and if two devices have the same identification, they are the same, full stop, no questions asked, irrespective of who provides that device. For NVMe it's slightly different, as NVMe has a far stricter device model, which means, or which implies, that each namespace has to be provided by a subsystem. Note the article: one subsystem. You cannot have the same device appearing, or a namespace providing the same identification, on another subsystem. Sadly, this is precisely what you need if you want to support this use case. So arguably the use case breaks the NVMe model in its current form, but the official way of doing things also has issues which are not easily solvable. The official way would be that you just extend the subsystem across several physical systems, but in doing so you would need to coordinate the properties of that subsystem between all the systems you have, which can't really be done dynamically, because then you will always have a synchronization problem. So you would somehow need to statically partition up the subsystem such that each node can act independently. But if you partition up your subsystem, you will inevitably face scalability issues, because every partition you introduce reduces the scalability you have. And that's the good case; there are other things to be contemplated as well.

All right. So I have been thinking about how we could do things from the Linux side, because the problem I personally have is that this use case is fully supported when running DM multipath, especially when running DM multipath on NVMe. Sadly, this is the one use case we tried to do away with on Linux, and now we are in a bit of a weak arguing position, where we want to deprecate something which does work in favour of something which does not. Which I'm not that happy with. So the one idea I had was to make it completely virtualized on the Linux side, such that we always have a one-to-one mapping between subsystems and namespaces. Essentially, each namespace creates its own logical subsystem, which then doesn't have anything to do with the actual subsystem as seen by the target. This is arguably a hack, but it has the benefit that all of the contentious issues are sidestepped, because it will just work irrespective of how the scaling is. And incidentally, it is the exact model which DM multipath also uses. The alternative would be to somehow update the spec so that the spec becomes more in line with what Linux expects. But then this whole discussion is slightly moot, because Christoph says, well, we wouldn't even want to contemplate it. So, right, I'm not really in a position to make any proposals, because those proposals will be rejected and we will be stuck with not supporting it. And that is the current topic and the current discussion we had.
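To make the contrast between the three models concrete, here is a rough sketch, not kernel code, with all type and function names invented for illustration. It only shows how the path-grouping key differs: SCSI-style multipath keys purely on the device identity, native NVMe multipath keys on the identity within a single subsystem, and the proposed virtualized model would synthesize one logical subsystem per namespace so that the grouping effectively degenerates back to the SCSI behaviour.

```python
# Illustrative sketch only; none of these names come from the kernel.
from dataclasses import dataclass

@dataclass(frozen=True)
class Path:
    subsys_nqn: str   # NQN of the subsystem this path was discovered through
    nsid: int         # namespace ID within that subsystem
    nguid: str        # namespace identity (the "NID")

def scsi_style_key(p: Path):
    # SCSI/dm-multipath style: identity alone decides, no questions asked
    return p.nguid

def nvme_native_key(p: Path):
    # native NVMe multipath: a namespace exists only within one subsystem,
    # so paths to "the same" data on another subsystem never merge
    return (p.subsys_nqn, p.nguid)

def virtual_subsystem_key(p: Path):
    # proposed hack: fabricate one logical subsystem per namespace identity
    return ("nqn.virtual." + p.nguid, p.nguid)

paths = [
    Path("nqn.2024-01.org.site-a:array1", 1, "0xAAAA"),
    Path("nqn.2024-01.org.site-b:array2", 7, "0xAAAA"),  # same data, other box
]
for key_fn in (scsi_style_key, nvme_native_key, virtual_subsystem_key):
    groups = {}
    for p in paths:
        groups.setdefault(key_fn(p), []).append(p)
    print(key_fn.__name__, "->", len(groups), "device(s)")
```

With the two paths above, the SCSI-style and virtualized keys collapse them into one device, while the native NVMe key keeps them as two separate devices, which is exactly the point of contention.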
So I'm not sure whether there's anything we really can do at this point, because, well, here I am and I'm not sure how to proceed. Right, okay, I'm not sure whether it's useful to continue the discussion at this point, because it might just lead to further aggravation on both sides. I think we should first come to terms, as a Linux community, with how we want to go ahead here, and this is not going to be something I can do on my own. John had some comments over Zoom; can we bring them back in?

Hey, so I think you raised the question that I've brought up on more than one occasion, which is: would this actually be a use case for supporting some type of NVMe multipathing with DM multipath, right? Basically, all multipathing within the subsystem is handled by NVMe core native multipathing. But I was thinking we could simply add another type of identifier, so as not to confuse things. The problem is that we seized upon the NID in the specification; even though historically in its implementation it identified namespaces uniquely within the subsystem, we used that same identifier in the spec and said, oh no, we're going to semantically redefine what this NID is used for, and we're now going to start asking the host to do multipathing across subsystems, right? So if we could somehow convince people that this might be a valid use case for DM multipathing, then I think it really becomes a very simple thing of adding some type of new NID, right? Which is only for dispersed namespaces. And then the problem really becomes something that has to be solved outside of the kernel.

At this point it is not a technical discussion. There are several technical solutions to it which could easily be implemented on Linux. This is not the problem. The problem is whether they will be implemented, and whether patches trying to implement them will go upstream. That is the problem, not the technology. So as I said, it is not a technical discussion at this point.

Well, what I'm saying, Hannes, is that in our conversations with Christoph and with all parties, there are conversations which need to be had. Right. But these are not technical conversations, and as such, this is possibly not the right venue for them. I personally find it really sad, because that is precisely why we do LSF, to solve these kinds of things. And now it turns out that we can't do it at LSF, which I personally find really sad, but that's where we are. All right. Well, you'll have to talk to me offline about why that's true, but. So, anyway.

So, just for the sake of completeness, there's a TPAR that was proposed for dispersed namespaces. Yes. Okay. And the status of that TPAR is not ratified, right? Yes, it was ratified, probably four or more months ago. Yeah, unanimous approval by the NVMe committee. Okay. So if there's something that's been ratified and has made it into the spec, why would we have an issue with Linux support? There are a lot of ratified TPARs that don't have Linux support. Just because it's ratified doesn't mean Linux is going to implement it. Yeah, so it is a bit like that. So yes, it is ratified in the same sense that, whatever, TP 8010 has been ratified. There are lots of ratified things which it doesn't really make sense to support on Linux, or for which the actual use case is somewhat questionable. But.
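To illustrate the earlier suggestion of a separate identifier type just for dispersed namespaces, here is a minimal sketch of walking an Identify Namespace Identification Descriptor list (CNS 03h). The descriptor layout and NIDT values 1 through 4 (EUI-64, NGUID, UUID, CSI) are from the NVMe spec; the NIDT_DISPERSED value is purely hypothetical, used only to show where such a new descriptor type would slot in, and the host-side policy at the bottom is an assumption, not anything that has been proposed in code.

```python
# NIDT values 1-4 are defined by the NVMe spec; NIDT_DISPERSED is invented
# here solely to illustrate the idea of a new, opt-in identifier type.
NIDT_EUI64, NIDT_NGUID, NIDT_UUID, NIDT_CSI = 1, 2, 3, 4
NIDT_DISPERSED = 5  # hypothetical: identity intended to be valid across subsystems

def parse_nid_descriptors(buf: bytes):
    """Yield (nidt, nid_bytes) pairs from a CNS 03h data buffer.

    Each descriptor is: NIDT (1 byte), NIDL (1 byte), 2 reserved bytes,
    then NIDL bytes of identifier; the unused tail of the buffer is zeroed.
    """
    off = 0
    while off + 4 <= len(buf):
        nidt, nidl = buf[off], buf[off + 1]
        if nidt == 0 or nidl == 0:
            break
        yield nidt, buf[off + 4:off + 4 + nidl]
        off += 4 + nidl

def multipath_identity(buf: bytes):
    # Hypothetical host policy: only if the new descriptor is present does the
    # host match paths across subsystems; otherwise it keeps today's
    # per-subsystem NGUID/UUID matching.
    descs = dict(parse_nid_descriptors(buf))
    if NIDT_DISPERSED in descs:
        return ("dispersed", descs[NIDT_DISPERSED])
    return ("subsystem-scoped", descs.get(NIDT_NGUID) or descs.get(NIDT_UUID))
```

The attraction of that shape, as raised above, is that hosts which never opt in to the new descriptor see no behavioural change at all.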
So I guess the question is, conceptually, is this another thing like conglomerate LUNs, which was highly desired by a storage vendor but which nobody was particularly interested in implementing? As I said, this is not a technical discussion. Well, conglomerate LUNs were supported by some non-Linux hosts. Quite honestly, that was a big pain for some of us storage vendors, but we had a particular host that was very interested in it, so we worked together on it. And in this case, yeah. We have dispersed LUNs today under SCSI. We have solutions that we've been shipping for a decade or more, and it would be wonderful if we could migrate them to NVMe, but right now that's mired in this debate. Well, it's a political engineering problem, right? But those were every bit real problems. And I agree that it comes down to which use cases Linux wants to support, right? There's plenty of features and functionality in SCSI that Linux never supported, you know? I think we're going through the same evolution with NVMe, right? Yeah, so I mean, that's the story.

Enumerate the use cases, because for most of us this is an esoteric discussion about standards. What would a customer actually do? The one use case is data migration, online data migration. That could be copy offload, right? Could be what? Yes, it could be copy offload if the host is willing to do all of that work. It could be copy offload, but that would result in having a different device on the other side. The goal with this is to do just what you do with SCSI. You go to your loading dock, you bring in your new system, drop it in place, connect it to the old system, wait a while, issue a few admin commands on the storage side of it, and then you wheel the old one out and take it down to the landfill, and the host is none the wiser. We'd like to be able to do that with NVMe. That's one of the use cases for this.

Another use case is geographically dispersed systems, where you have multiple hosts at each end in different geographic locations. You've got one system with its storage in Singapore and one with its storage in Kansas City, you've got them connected, and you want them to be able to share that information; both hosts can access both sites because the storage looks the same, and when you have a disaster somewhere, one of them takes over completely transparently. And that requires things to appear as if they're the same thing. If we require different identities on each end, then we have to have manual intervention involved, we have to have different kinds of failover mechanisms, and it becomes a much larger engineering effort to deal with that kind of failure scenario. So those are the two primary use cases.

So duplication and disaster recovery. Yeah, duplication is the primary one, whether it be for replacement systems or not. Data migration and disaster recovery are the two main use cases. And Fred, I'll just agree a little bit with what you said: it's really kind of a hot-potato question, because what we did, as we were developing the TPAR in NVMe, was try to reduce the set of identifiers that needed to be replicated and managed by the subsystem to the smallest possible number, and we seized upon the NID to do that, right?
And so I kind of see this, I understand how you say it's not a technical argument, but there is a technical argument here, because the question is: do we need to replicate more than one type of identifier, namely the subsystem identifier, right? The NQN, or the host ID. If you have to start replicating those across your disaster-tolerance failure domains, well, that's going to take a lot more work for the storage providers, right?

The only issue I have with the current approach, the only thing which really is questionable for me, is the granularity. If the use cases really are disaster recovery and essentially system replacement, do we really need to have this at namespace-level granularity? That is, does it make sense to copy over or to identify individual namespaces, or wouldn't it be far more useful, and actually conceptually more sensible, to duplicate subsystems? I would agree with that. Because the way it's currently specified means that some namespaces will be copied over and some won't. Well, also you could. The point is that we have two subsystems providing different namespaces but partially containing the same information. Both subsystems are accessed from the very same host; otherwise it wouldn't make sense. So why should the host see different state from the two sides? Wouldn't it be far more sensible to make the granularity the subsystem and copy over the entire scope of the subsystem, seeing that we can modify the contents of the subsystem more or less at will?

You could do that by limiting your use cases. Right now, in the SCSI environment, we don't control how customers group their data, how they scale their data, how they place their data on different targets, which applications use which data from which targets, or how much parallel I/O they want to get by using multiple targets versus funnelling all their I/O to a single target. All of those are different configurations which customers make use of, and to limit things in the way that you've described, or how I've understood what you're describing, would be a pretty severe limitation on the use cases that customers rely on today.

Well, I'm going to disagree, Fred. I think, as was said, it's just a question of which identifiers you want to use. Right, and my feeling is that in the case where all else fails and you can't come to a consensus about that, you create a brand new identifier. Well, the committee decided on the same identifier that was used in SCSI, the identifier through which you identify the dataset, the namespace. And we said, well, it worked for SCSI, so let's just pick the same one for NVMe. And if we've got to pick something different, if we've got to invent a new one, then yeah, we can always talk about it. That's the whole point of the discussion.

Right, but that kind of just turns NVMe into SCSI, right? And this is one of the tensions going on in this discussion, right? So, you know. If you replace SCSI with NVMe, then the NVMe system has to meet the same use cases that the SCSI system met. How it does it can certainly be different, but otherwise we can't replace the technology. If you're going to turn the NVMe subsystem into a SCSI target, which is what some people want to do architecturally, right, then what you're saying makes sense, right?
But there's a whole other interpretation of how NVMe has historically been architected, in that NVMe wasn't really designed to scale namespaces within a subsystem; it was made to scale subsystems, right? And once we start putting thousands and thousands of namespaces into a subsystem, it starts to look more and more like a SCSI target with a bunch of little I_T nexuses, but that's not the way the queue mechanics were designed in NVMe, right?

Given that from day one the architecture supported four billion namespaces with 64,000 queues per controller, I would disagree with your argument about SCSI. But you're talking about an architectural limit; I'm talking about realities in the implementation, right? I mean, if I implement my namespace map in the target device, in the controller, as an array, that's going to be extremely fast and extremely cache-efficient, but I'm not going to be able to scale to thousands and thousands of namespaces. So there are implementation concerns. And again, SCSI also has large address spaces that have never been realized, right? That's part of what we dealt with with conglomerate LUNs: oh look, we can address all kinds of things here. So I don't find it a convincing argument to say that because architecturally you've got all this address space, it's the right thing for Linux to implement support for it.

Well, I go back to: if we want NVMe to replace SCSI, if some people want that to happen, then we do have to match the use cases. What the implementation looks like, what the protocol looks like, those are a completely different set of discussions from the general use cases, the way customers get their work done. They still have to do the exact same things they used to do, and if they're doing it with SCSI, they'd like to find a way to do it with NVMe. That's what we're trying to set out to do.

I'm sorry, I missed the beginning, maybe you mentioned this: what was proposed was that you use virtual subsystems to abstract it, so the host has no idea this is even happening. So was that not a technical solution to the problem that someone's against? The problem with virtual subsystems is all of the state. If you look at NVMe, there are massive amounts of subsystem state. There aren't just the namespace IDs; there are also endurance group IDs, NVM set IDs, ANA group IDs. There are all of these things which are state specific to the subsystem, and in geographically dispersed environments you have to be able to manage that state independently at each end, even when the communication channel between them is broken. That state becomes very hard to manage unless you start subdividing it, and if you start subdividing it, then you break all of those scaling properties that are part of the architecture, and it becomes a big pain to manage. So as for the idea of a single virtual subsystem: I can't name the vendors, but all of the large storage array vendors that were in the committee meeting where this was talked about basically said, we can't do that. It's simply not possible, over the communication links we have, to share that amount of information and then re-coordinate and merge it all back together again when that communication link breaks and gets re-established. It's just not possible.
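Just to make the preceding point tangible, here is a rough sketch enumerating the subsystem-scoped state the speaker lists; the structure and field names are invented for this illustration and are not taken from any implementation or from the spec's data structures.

```python
# Invented illustration of subsystem-scoped state that a single subsystem
# stretched across two sites would have to keep coherent at both ends,
# even while the inter-site link is down.
from dataclasses import dataclass, field

@dataclass
class SubsystemState:
    allocated_nsids: set[int] = field(default_factory=set)        # namespace IDs
    ana_group_ids: set[int] = field(default_factory=set)          # ANA groups
    nvm_set_ids: set[int] = field(default_factory=set)            # NVM sets
    endurance_group_ids: set[int] = field(default_factory=set)    # endurance groups

# One way out is to statically slice each ID space between the sites
# (e.g. odd values here, even values there) so each end can allocate
# independently, but every such slice caps how far that site can scale,
# which is exactly the trade-off described above.
```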
It is possible, and you and I are both aware of implementations that do that. This is part of what's done with SCSI, right? It's a lot of work, right? And like I said, this comes down to a hot-potato issue, where it's like, hey, if we can just change this one thing, then it's going to be a lot easier for us, as there's less global state that has to be managed by the subsystem and by the controllers. But there are implementations that have the same issue, and this is what they did with SCSI: they took the SCSI target object and they stretched it across two failure domains in two different geos. And yes, there are a lot of technical issues there, but SCSI implementations today do it. So I'm saying I can see both sides of the issue, right? I'm not saying what's right or wrong. I just think that what is under purview here at LSF/MM is: what is it that Linux wants to support? What is it that Linux wants to do? And I think it's perfectly reasonable for Linux to push back a little and say, hey, if we can make some technical changes to the protocol that would maybe split the difference here between the storage subsystems and the host stack, I think that's a reasonable compromise.

The other question that I have, which is a more general one: I think we're seeing from some of the storage vendors this tendency to try to achieve feature parity, or basically to take everything they've had in their product that was supported through SCSI and somehow support it on NVMe, right? And so I think we may have a certain responsibility to prevent everything including the kitchen sink from being thrown in just because it used to be there. I think it's important that we actually have some justification for why this has to be incorporated, when essentially what we're trying to do with NVMe is provide a cleaner sheet of paper, where we don't go and turn it into a giant bucket of everything everyone ever thought of. So the question is: is this really important enough that it's worth pursuing?

I'm actually already trying to do that, by taking part in the discussions in the NVM Express committee and seeing how whatever is proposed there would reflect back on Linux, whether it makes sense, what we would need to do to support that particular thing on Linux, or whether it can be formulated in such a way that it doesn't really affect us. Case in point is this ominous TP 8010. There's a TPAR for essentially modelling zoning on TCP, basically modelling Fibre Channel-style zoning on top of TCP, which is something, yes, you can do, but it doesn't really add any benefit for us on the Linux side. If you want to do it, by all means do; we don't really care. But there again, the spec needs to be formalized in such a way that the whole thing is optional, so that yes, you can do it if you really want, but others who don't want to do it can still live with the standard and don't have to implement it. So yes, this is something I'm already trying to do. But then, one thing is that this was prior to that, and also, as I said, it's not really a technical issue. To implement this in Linux is, well, not exactly trivial, but it's about five patches, none of them very intrusive. So it's not hard, and so it's really not a technical issue at this point. So I fully agree with you.
If it were a technical issue, if, say, we would have to redesign our entire storage model for NVMe for this particular use case, I definitely would agree with you that one would have to strike a balance here over who needs to do the work. Is it us, is it the storage vendors? Of course. But in this case, from our side, it's not really work we need to do. So forcing the storage vendors to invest a massive amount of time in upgrading their firmware for something which we could do relatively easily on our own is, well, not to say unfair, but really isn't sensible from my standpoint. But again, I think we should end the discussion here, because again, it is not a technical one. Sorry. It's going to be a compromise somehow; we need to compromise eventually. It's a bit, you know, you keep trying. At some point you have to talk; whether you like it or not, you have to. Yeah, I just don't think we have all the right people in the room to make those compromises, and that's what it comes down to. Good. Sub-sector, sub-sector reads.