about NVMe error logging, and the reason I wanted to talk about NVMe error logging is that now that we have more of our customers starting to use this technology, particularly NVMe over fabrics, it has become apparent that it can be difficult to troubleshoot things like connection issues with the logging we currently have in place in the kernel. In particular, if you hook up a storage device and it doesn't work, it can be very difficult to figure out why.

So there's a couple of things. One is that there are several differences between the kinds of errors we log for SCSI (at least for the transport I'm most familiar with, which is Fibre Channel) versus, for example, NVMe over Fibre Channel. But it's also a somewhat more generic issue. We sometimes see customer reports of errors with NVMe PCI devices, where a customer tries to use an NVMe PCI device in a system, it doesn't operate as expected, and it's difficult to figure out why.

SCSI error logging in general has undergone a bunch of changes over the past several years. In some cases we've slimmed down the error logging and trimmed out a bunch of nuisance messages. But we've also added a little more detail for things like errors on commands, where we actually show the CDB, show how long the command was outstanding, show the error, things like that. So there have been some proposals. I haven't actually submitted any patches for this yet, but I have some patches I've had in development for a while that I'm still toying with. And I really wanted this to be a more open, general discussion about what we want this to look like. I don't think we necessarily want to go to a very verbose model for NVMe, but we do want to provide enough detail so that when something's wrong we can pinpoint the cause. So, did anybody have any questions about that before we go into a little more detail?

All right, so with NVMe over fabrics there are effectively three phases. There's the discovery phase, where you're finding out about the storage endpoints that you're going to make connections to. There's the connection phase, where you're establishing the connections. And then there's when you're actually doing I/O, and the question of what types of errors you see there.

With the discovery phase, one of the problems we have seen is that, the way NVMe over fabrics operates, when we get the discovery log page we typically see all the entries the device is reporting, even if you don't have connectivity to those ports on the device, or even if it's not the same transport type. There are some devices, for example, that will return NQN information for Fibre Channel as well as for NVMe over TCP, and your system may simply not have the adapters or the network connectivity to connect to those. But you'll still get errors about it with the current approach. And the second thing is that when you try to connect to a storage device and you can't, we typically retry a bunch of times and keep displaying errors that the connection isn't working, but we won't really display any information about why that would be. It could be because the device isn't allowing you access. It could be because there's some kind of physical connection problem between the two. It could be a number of things.
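As a concrete illustration of the discovery and connection phases being described (a sketch; the transport, address, and port are invented for the example):

```sh
# Query a discovery controller for its discovery log page
# (8009 is the conventional NVMe/TCP discovery port):
nvme discover -t tcp -a 192.168.10.20 -s 8009

# "connect-all" then attempts a connection for every entry returned,
# including entries for transports this host has no adapter or
# connectivity for; the kernel keeps retrying those and logs
# connection errors with little indication of the underlying cause.
nvme connect-all -t tcp -a 192.168.10.20 -s 8009
```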
And then when we get errors on the actual I/O, they can be errors from the device actually reporting a failure attempting to execute the command, or, more typically, errors on the transport when there's a connectivity interruption. You'll typically get error messages of the form "nvme ... I/O error, -6", and we log this, and it isn't very useful for a customer. So what I wanted to have a conversation about was: what do we want this to look like? Do we want the same level of detail that we've had with other transports in the past? Or are we really trying, with NVMe, to make this as seamless as possible?

I think the current approach of having, well, the bare minimum of logging is not a bad one. One of the things which constantly bugged us on the SCSI side is the sheer flow of information coming from the SCSI device, which really caused the whole system to slow down due to the design of printk. Really the only good way of getting rid of that is simply by not printing anything, or only printing what is absolutely necessary. So what I would love to have is to delegate error messaging over to either dynamic debugging or even to tracing. The dev_dbg mechanism, which uses dynamic debugging, really helps here, because it means you have to actively enable the error messages before you will see them; in the normal case you simply won't have error messages, so you won't be flooded with all that information. That is the direction I would love to take, and also to rely more on tracing, because that allows us to really figure out what a specific I/O did while it was in the stack.

Sadly, there is an issue with that, because multipathing kind of breaks the tracing approach. The tracing approach requires tracing to be enabled on each device, but for the paths we don't really have a device on which we can enable it. We can only enable tracing at the upper level, essentially on the struct nvme_ns_head device, but not on the path devices. So we will never see the actual error in the tracing, because we can't enable it there. That's an issue which would probably need to be addressed, but I really have no good answer for how, because it would mean that blktrace would need to enable tracing on more than one device, which it currently simply doesn't do.

Now, one of the things that I hear from our support people about enabling tracing to provide information (and this is similar to what's been done with at least one Fibre Channel driver) is that if there was some kind of intermittent failure, often the only information we'll get is a log file with the system log messages. If you didn't enable tracing, none of the information is in there, and they may not be able to replicate the problem easily. So then you tell them, okay, go turn on the tracing, and the next time we'll go and collect the log messages. It would be preferable if there was some way, using debugfs or something like that, where you could pull the information about recent activity out after the fact. The problem with doing that is that you end up buffering a whole lot of information in the kernel, and this tends not to be a particularly good idea.
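For reference, this is roughly how the mechanisms described above are driven today; a sketch assuming CONFIG_DYNAMIC_DEBUG, a mounted debugfs, and an example device name:

```sh
# Dynamic debug: turn on all debug messages in the NVMe modules.
echo 'module nvme_core +p' > /sys/kernel/debug/dynamic_debug/control
echo 'module nvme_fabrics +p' > /sys/kernel/debug/dynamic_debug/control

# The kernel's nvme trace events follow individual commands through
# setup and completion:
echo 1 > /sys/kernel/debug/tracing/events/nvme/enable
cat /sys/kernel/debug/tracing/trace_pipe

# blktrace attaches to a single block device; with native multipath
# that can only be the ns_head device (e.g. nvme0n1), not the hidden
# per-path devices, which is the limitation being described.
blktrace -d /dev/nvme0n1 -o - | blkparse -i -
```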
So when you're saying debugfs, are you saying it should be on the target side or on the host side?

No, this is on the host side, right? It has to do with the fact that your system might run along for months and be perfectly happy, until there is some upset, some disturbance in your switch fabric or whatever you were using to connect to your storage. Then you get some kind of outage and you don't know what happened, and the only information you're ever going to get is whatever you happened to capture in a log file at the time. So the trend had been to put more and more information into the logs, so that if something ever happened you'd have enough to at least start an analysis. But then what happens is that if you have an outage with a large number of devices, you can suddenly get flooded, like Hannes said, with a huge amount of information that's not terribly useful. So when I talk about what we should do for NVMe relative to what we've done in the past for SCSI, I don't necessarily want to take what we had for SCSI and provide the equivalent in NVMe, because there are some good things and some bad things about what we currently have. So I guess it's: if we could do it all over again, what would we want?

I agree with what Ewan said about mechanisms that have to be enabled dynamically: usually, enabling these mechanisms after something happened is too late. That's why printk is interesting. I was aware that printk could slow things down significantly, especially if the output was sent to a serial console, because at the time printk was still synchronous. If I'm not mistaken, printk has since been made asynchronous for exactly that kind of purpose. So I think we should take a closer look at printk and see whether it can be used as a mechanism for logging that kind of information.

Okay. Another thing that comes up, since Thomas mentioned NVMe native multipath, is that the concept behind NVMe native multipath was that it would handle multipathing more transparently than DM has done. Right now, when you get errors while running with NVMe native multipath, if you know exactly which messages get logged under which conditions, you can make inferences, for example, about whether you have no access to your storage because you don't have any good paths, or because none of the paths you have are in an ANA state that allows access. The two messages that are logged are very similar but slightly different. So one of the questions is: should we make this more explicit? Something like, "I can't do this I/O because none of your paths are in the right access state." Right now, yes, there are messages about a lot of things in NVMe, and if you dig into the kernel code you can find where a message got logged and infer exactly what conditions would have triggered it. But if you're a customer who doesn't want to go digging into the kernel source code to figure this out, the messages aren't particularly meaningful. Yes, they are distinct, but they're not necessarily meaningful.

Yes, I mean, clearly we should be updating the error messages there so that it's not just printing, well, the error number, but really something meaningful. Even the error numbers are just a mapping of the actual error reported by the driver into something which the upper layers can handle. So it's really pointless to just display "error -7"; it doesn't tell you anything. Or the QLogic driver printing "LLDD failure 7", or 6, or something like that.
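As an aside on the multipath point above, the two cases can currently be disambiguated by hand through sysfs. A sketch; the device names are examples and the attribute paths are from memory, so treat them as assumptions:

```sh
# For each hidden path device behind the nvme0n1 ns_head, print its
# ANA state and the state of the controller it sits on, to tell
# "no paths at all" apart from "no path in a usable ANA state":
for path in /sys/block/nvme0c*n1; do
    printf '%s: ana_state=%s ctrl_state=%s\n' "${path##*/}" \
        "$(cat "$path/ana_state")" \
        "$(cat "$path/device/state")"
done
```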
It might be an idea to formalize that the error is actually logged at the NVMe layer in some meaningful state, and then require the driver to provide the correct return value so that we can log it correctly, rather than leaving it up to the drivers as it is now.

So on the NVMe side there is a log page for this; even in the target we support the error log page. That is much more...

That is not applicable here. Because yes, the target might be able to tell us what has happened, but the problem is we can't talk to the target. So an error log page doesn't really help, because we can't access the log page.

So, it's interesting. If there is an error log page right now, I'm not sure if we can currently read that out through nvme-cli, correct? You can read it out, right? So to the extent that the target can tell you something about it, you could go and send those commands to the target, but that's an explicit thing that right now you do from user space, right? As a distribution, we have utilities that go scrape the whole system for these kinds of things in the event that we need to capture the information, so that is something we could do.

And recently, from Oracle, we added the verbose error logging feature, so you can tune it and get verbose errors on request completion.

Yeah, there's another facility that Broadcom added to their Fibre Channel driver that essentially has a trigger that avoids logging things until something interesting happens, and then it displays the most recent entries, I forget how many, from a buffer.

Did you try the recent addition of the verbose logging with the target?

I'm sorry, say again?

Did you try the recent addition of the NVMe error logging that we added, with the verbose option configurable?

No, we haven't looked at that yet. What I was primarily concerned with, as Hannes was saying, is the kernel: the NVMe core code, the transport drivers, and the multipath code, and the type of logging the kernel performs when it detects errors on discovery, connection, or I/O. But this is another thing that could provide more information, particularly if there was some intermittent event. And it's really about making this mechanism a little more comprehensible to an end user: things like eliminating numeric values in error codes, things like that. So, like I said, I've had some work on this that hopefully I'll have more time to devote to in the future, to try to get some patches posted. If and when I do, in the near future, review comments from people would be appreciated. That is all I had. Anybody else have anything else?

Yeah, one thing which continues to bug me is the error messages on connect. Because the problem there is that the connect command we send is not the Connect command which NVMe uses in the NVMe protocol.

Heads up, we have some people that want to ask questions on Zoom. Hey, John.

Hey. So I want to talk about two things that you brought up, Ewan. One is this idea of multiple transports being reported in the discovery log pages, which leads to different errors, because maybe you've got a TCP transport and a Fibre Channel transport on the same subsystem. And as you've noted, it depends upon the discovery service; the ones that I've worked with, that we've seen, are just going to report all discovery log page entries for everything.
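For illustration, the kind of user-space filtering John goes on to suggest could be done against nvme-cli's JSON output. A sketch; the address is invented, and the field names are my recollection of that output rather than anything authoritative:

```sh
# Keep only the discovery records whose transport type matches the
# transport the discovery request itself was sent over:
nvme discover -t tcp -a 192.168.10.20 -s 8009 -o json \
    | jq '.records[] | select(.trtype == "tcp")'
```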
So I think that's something that could be fixed with a patch. And a question we need to ask is: do we actually want to support NVMe multipath access over different transports? Right. If you've got a TCP transport, which is really slow, and you've got an FC-NVMe transport, do we really want to be trying to support multipath access over both of those transports to the same namespace?

I would say no. I mean, traditionally we discourage this for things like arrays that support, for example, both Fibre Channel SCSI and iSCSI: people tend not to export the same volumes as LUNs on two different transport types. There's nothing that prevents it in any of the software; it's just not necessarily a good practice. What we have seen, though, is people wanting to use the same storage array over multiple transports with disjoint sets of logical units, or in the case of NVMe, with disjoint namespaces. And so the question is, number one, can the storage array even tell that it shouldn't report? Conceivably it could determine that it shouldn't report discovery log page entries for a different transport type than the one it received the discovery request on.

Yeah, I agree about that. Offhand, I don't even know if that's required by the standard.

No, it's not. It's not. There are use cases in customer environments where they have a high-speed primary network and a backup secondary network. Maybe they want to have their 100-gig Ethernet and an 8-gig Fibre Channel, or a 128-gig Fibre Channel and a 10-gig Ethernet; that might be the choice they make in their environment. As for SCSI and NVMe, there aren't any common identifiers, so that would have to be set up manually. There's no way to automatically know that those two data sets happen to be the same, because there just isn't a common identifier.

No, the issue right now is that with the discovery udev rules for Fibre Channel, if you go and do the discovery, you'll get back the log page that reports transport types to which you will never be able to connect. And we will try to do connection requests to those, and we will always get errors. But the storage doesn't know that either.

Well, no, the storage could know that. But I agree we can't really require the storage to do it one way or the other. We can certainly filter things, though, even in nvme-cli: just say, hey, we're only going to report discovery log page entries to the host that match the transport that the application asked for, right? It's just a matter of paying attention to what the address family is. If you send your nvme-cli request over an NVMe/TCP transport, then you could literally just filter that out in the results that are reported back from nvme-cli.

I guess, circling back to the higher level: I think what we need to get to for the error handling with NVMe is something that's more comprehensible than what we currently have in the kernel. As kernel developers, yes, of course we can look at it and probably figure it out, but we aren't going to be the consumers of this type of stuff in an enterprise environment. So I think we need to be a little more gentle on our users.
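Returning to the error log page raised earlier: when the target is reachable, it can already be read from user space, for example (the controller name is an example):

```sh
# Dump the controller's error log page; each entry carries the status,
# the parameter error location, and the command ID it refers to:
nvme error-log /dev/nvme0

# Or limit the dump to the most recent entries:
nvme error-log /dev/nvme0 --log-entries=8
```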
So I agree, Ewan, but I still think there are two fundamental things that you brought up. One is this idea of multiple transports, which in some situations is going to lead to configurations that, I mean, they're possible, right? But I don't know; the question is, do we want to support those? Do our partners want to support those? I've asked this question specifically of some of our partners. The second is this idea of connecting to devices with no physical link. I see that as a configuration issue. Filtering out log page entries based upon whether or not there's a physical link is something you're never going to be able to get right. I consider the discovery log page entry to describe what we would call a configuration space, and whether or not there are actual wires between the host and the subsystem at each of those subsystem ports is really entirely up to the configuration. I don't think we should try to solve that problem; that's a configuration problem.

Obviously it depends on the transport. With Fibre Channel and traditional SCSI, we don't do discovery that way: we typically find out about the remote ports from the name service in the fabric, and we only try to connect to ports where we can actually see the rport. Whereas with the NVMe discovery mechanism, the array is telling you about all of these ports it might have somewhere, ports you may have no connectivity to. The question is whether you can know that you don't have any connectivity.

But whether or not you have connectivity is something you're never going to get right, right? Even if you try to modify things, it's perfectly possible to have a link that just goes down. So with Fibre Channel, if the connection changes, the name server is going to be updated and you're going to get an RSCN, right? And today, that doesn't mean we actually have to involve udev; it just means that one of the NVMe controllers gets removed. You just have one less path. And when that port is plugged back in, you get another RSCN and the path is restored. With NVMe/TCP it's not quite so simple, because we don't really have any kind of sophisticated RSCN-like mechanism there. So I just see these all as problems which we've been dancing around and struggling with. And I think we need to ask some fundamental questions, like what is it that we want to support, and are we going to try to solve these problems? Some of which I don't think need to be solved. We certainly want to streamline the error logging, because we don't want to be creating errors for all these different things. But basically, you're not going to lose connection to your namespace until every single path is gone, right?

Well, so I think John wanted to have another topic later on.

Yeah, this week, specifically on the different TPs on NVMe discovery. So, all right, I'll go back on mute.

Okay. Do we have anybody else on?

Sorry, there's one thing which I still want to raise. That's again the errors on connect. We do just submit a string of arguments.
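For context on the point that follows: a fabrics connect is performed, roughly, by writing a comma-separated option string to /dev/nvme-fabrics. A sketch, with all values invented:

```sh
# The kernel parses this string and creates the controller. On
# success, reading the fd back gives something like
# "instance=2,cntlid=1"; on failure the write just returns an errno
# such as EINVAL, with no indication of which option was rejected.
echo "transport=tcp,traddr=192.168.10.20,trsvcid=4420,nqn=nqn.2014-08.org.example:subsys1,hostnqn=$(cat /etc/nvme/hostnqn)" \
    > /dev/nvme-fabrics
```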
And then the driver has to figure out which is which and actually do the connect underneath. But the return we get is just an error code. So if the arguments are wrong for whatever reason, or aren't accepted by either the controller or the NVMe stack or something, we have no way of figuring out which of the arguments caused the failure. Occasionally you get some information by looking in the message log, but again, that is not really a nice way of doing error handling. Ideally we would have a way to return a pointer, or some indication of the point in the connect string at which the parsing failed.

This is actually something that was pointed out by our QE engineer who works with me on this stuff. And it's actually worse than that, because when you do an nvme-cli command to do a connect, for example, it's not a synchronous thing, right? What it does is instantiate a kernel object to do the connection, which will then go off and asynchronously try to connect, and reconnect. If you can't connect, you cannot return an error on the syscall, on the write, for the NVMe connect. So the command comes back and says "sure, I did it", the kernel is off trying to connect to something that is never going to work, you just get a bunch of error messages, and there's never any feedback to any kind of script in user space that it didn't work.

The other problem you have, Ewan, is that the error reporting is asynchronous. When you send an NVMe command, it will come back and tell you that it was either successful or it failed. If you want to know exactly what failed, you have to go look at the error log entry in the log page, and then you have to match up a log page entry with the command that you sent to find the details, which point to the field within the command that tells you what's wrong. But they're completely asynchronous: two sets of commands with no connection except some random command identifier, and that only works as long as you didn't overflow the log with too many errors or send too many commands. So there is a mechanism there to do it, but quite honestly it doesn't scale very well.

This sort of begs the question, too, whether you really want to implement that kind of state tracking in the kernel, or whether you want some user-space thing that can collect this information and tell you...

You can't, because as Fred said, the error log is tied to the command ID, and the command ID is completely ephemeral. User space doesn't know anything about the command ID, so it would have no notion of how to track it. Plus, by the time user space gets around to actually asking for the information, the command ID might have been reused. So you have no track whatsoever of what really happened. It's essentially a bit like the old REQUEST SENSE command in SCSI; it's actually the very same thing. It's modeled after the way ATA did it, rather than SCSI autosense. And it's not that we can't handle it; we do it in ATA, where we actually retrieve the error log page in the error handler, figure out the information, and set the correct bits when returning the command. Yes, you can do it. It's slow as hell and basically kills whatever performance you might ever have had. But yeah, it's doable. The question is whether we need to do it, because as of now we don't, and surprise, surprise, this is not an area customers have been complaining about.
What the customers have been complaining about are errors leading up to the point where you could retrieve the error log page, or send any command whatsoever, because that's really where the issues are. What do we do if there's a failure somewhere in the chain before the command is sent to, or processed by, the controller? That is not being handled consistently, or occasionally not at all.

All right, we're at the end of the day. Next up we have James Smart to talk about commonization of NVMe transports. James here?