I'm not sure everyone caught what I said before, so I'm going to make a problem statement first so that folks in here can decide whether they're interested in staying for this topic; we recognize it might not be interesting for everybody, and I won't mind if five minutes in some of you decide you need to go somewhere else. The issue is that for network file systems, and for network storage in general, clients need to have a globally unique identifier, and it needs to be durable across reboots. It turns out this is a problem not only for NFSv4; it's also an issue for NVMe, where the initiator needs a global identifier that targets can recognize. All well and good: you can use something like the machine ID if you're talking about a physical host. But when virtualization enters the picture, things get a little foggier. When you decide you want to create a container, where do you put the identifier? Who generates the identifier when you create a container? When it reboots, where does it go and look so it can be sure it has the same identifier it had during its last boot instance? If you've got a VM guest that you clone from a golden master, who's responsible for making sure that identifier is changed when you clone the guest? Hannes actually has a solution that he has put together for NVMe. We're still looking for one for NFS, and we're looking for ideas and thoughts from other folks. We certainly want to reach out to the container community to understand whether there are similar use cases and how they might have solved the problem. Hannes, go ahead.

Yeah. We on the NVMe side have the big advantage that we have a fixed place where the identification is stored. Once generated, it's stored there; it's basically a file in a defined location, and that is the identifier which is used going forward. So for us it's just a matter of what should be in there and who should modify it under which circumstances. Once it's there, it's pretty obvious that's the one the whole system will be using. So we are slightly off the hook, because we have a fixed location where we know that's the one we need to use.
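To make the NVMe pattern concrete: nvme-cli keeps the host NQN in a file, conventionally /etc/nvme/hostnqn, and tools use it if present, generating one otherwise. Here is a minimal Python sketch of that load-or-generate pattern; the path and the uuid-based NQN format follow nvme-cli's conventions as I understand them, so treat them as assumptions rather than a definitive implementation.

```python
import uuid
from pathlib import Path

# Assumed fixed location, following the nvme-cli convention
# (an assumption in this sketch).
HOSTNQN_PATH = Path("/etc/nvme/hostnqn")

def load_or_generate_hostnqn() -> str:
    """Return the durable host identifier, creating it on first use."""
    if HOSTNQN_PATH.exists():
        # The fixed location exists: everyone agrees this is THE identifier.
        return HOSTNQN_PATH.read_text().strip()
    # First boot: generate once, persist, never regenerate after that.
    nqn = f"nqn.2014-08.org.nvmexpress:uuid:{uuid.uuid4()}"
    HOSTNQN_PATH.parent.mkdir(parents=True, exist_ok=True)
    HOSTNQN_PATH.write_text(nqn + "\n")
    return nqn

if __name__ == "__main__":
    print(load_or_generate_hostnqn())
```

The design point the speaker is making is the second branch: generation happens exactly once, and every consumer afterwards only ever reads the file.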
If I understand correctly, this is slightly different for NFS: you don't have a location. You don't even have a file. Anything.

Yeah. How do we pick the location? That's basically the problem. Where do you store it? The data there might well be randomly generated, or it might be literally anything; that's not for us to decide. For containers, say, it would actually make sense to just have a random identifier which is deleted when the container is removed and can never be regenerated.

That's the idea, yeah. But if you do that, then it needs to be stored somewhere in the container, which means there needs to be a location in that container where you can store it, which also means you have to modify your tools to look for it in that location.

When you say location, does that include the machine it's on, or are you starting from the machine? Do you have to include the server name or IP address?

It turns out a hardware-specific value is a good method for seeding the data, for getting the initial identifier, but it's not really that good for use as the identifier itself. So what I'm saying is you actually need a location where you can store the data, which is different from the location you got the data from.

Oh, so is this about a location within a machine, or a location across a whole load of machines on the network?

Well, that depends on your definition of what a machine is. It's basically the problem we have with iSCSI or any of these things: there is an identifier for the host, but what exactly is the host here, in a context where you might have several interfaces, several whatever? A machine might have different interfaces or different connections, or you may have several machines sharing an address. So what exactly is a machine, and what is the context? Should I give each interface its own ID, pretending each interface is actually a machine? Or should I combine them and have a shared one, and if shared, shared across what? But that's a different discussion to be had.

The problem here is that, first of all, NFS doesn't have a fixed location where it could store the identifier. Meaning you can't use /dev/random to generate the identifier and then tell NFS to use it, because there's nothing where NFS...

There's nowhere to publish it.

Exactly. And that is one of the issues with NFS.

Maybe you need to take a leaf out of NVMe's book and have a file.

No, the problem for NFS here is that it's a catch-22. Yes, you might need to store it somewhere, but that somewhere will be a file system, and what does NFS do? Exactly that: provide a file system.

Well, AFS has a volume location server; you ask the volume location server and it gives you a handle.

Yeah, but that just delays the problem. Tell me who I am? Tell me who I am based on which credentials? You're just pushing it back to someone who needs to generate the credentials so that you can get the correct ID. So again, back to square one: we need to create credentials.

We can even put a finer point on this in the NFS case. One can imagine an NFS server implemented as a virtual machine that mounts a Google Persistent Disk or an Amazon Elastic Block Store volume. At some point, due to maintenance, we need to kill the VM. We start a new VM, but it mounts the pre-existing virtual block device and serves the same NFS contents; logically, it's in fact the same NFS server. You could even have the case where you have a physical, bare-metal NFS server and you replace the motherboard because it died, but the hard drives stay the same. Technically that's the same server, even though the CPU serial number has changed, right? So it's very, very fact-dependent when it's the same server and when it isn't.

Well, the server side is easier, because the server side has persistent storage. Clients don't necessarily have persistent storage.

Yeah, so you could put an /etc file on the NFS server.

Exactly: the NFS server has a file system that is guaranteed not to go away. The clients don't have that. So, some solutions we've thought of. We can specify a module parameter with a UUID in it; that's how we do it today. We could put that on the kernel command line, for example if we're doing a PXE boot of the client. We could put it in an /etc file. Or we could take a hash of the machine ID; we take a hash because we're not supposed to put the machine ID in the clear on the network.
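For reference, the module parameter mentioned here is the Linux NFS client's nfs4_unique_id (settable as nfs.nfs4_unique_id= on the kernel command line). The machine-ID option follows the machine-id(5) guidance: never expose /etc/machine-id directly, but derive an application-specific value with a keyed hash. A minimal sketch of that derivation; the application key is my own placeholder:

```python
import hashlib
import hmac
import uuid
from pathlib import Path

# Per machine-id(5), the raw machine ID should not leave the host;
# derive an application-specific ID with a keyed hash instead.
# This application key is a placeholder invented for the sketch.
APP_KEY = b"org.example.nfs-client-uniquifier"

def nfs_uniquifier_from_machine_id() -> str:
    machine_id = Path("/etc/machine-id").read_text().strip()
    digest = hmac.new(machine_id.encode(), APP_KEY, hashlib.sha256).digest()
    # Truncate the keyed hash to 128 bits and present it as a UUID,
    # a convenient format for the nfs4_unique_id parameter.
    return str(uuid.UUID(bytes=digest[:16]))

if __name__ == "__main__":
    print(nfs_uniquifier_from_machine_id())
```

This loosely mirrors systemd's sd_id128_get_machine_app_specific() idea: the derived value is stable per machine and per application, but the raw machine ID never appears on the wire.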
But that's great for the physical host. What do we do when we create a container with an NFS client instance in it? It's got a different IP address; basically, it's a different individual. So how do we specify the client identifier for that? Where do we put it?

Just so I understand the problem: you have both server-side and client-side identifiers. Server-side is pretty much solved because you have stable storage, so it's just the client side you're worried about. But why does it matter for a container, which is an ephemeral entity, if you just randomly choose a UUID each time you bring the container up?

It's got to be the same one when the container reboots.

What do you mean, when the container reboots? You're positing an ephemeral container; when it reboots, it's a different instance.

Well, Docker containers deliberately have this property. If you have a different type of container, the ones that truly live from generation to generation often have persistent storage underneath them, so you could store the identifier there.

Ultimately it's the container manager who should have that information, right? And the container manager should provide it to the actual process running in the container. That just seems like something that needs to be standardized.

Right, that's our feeling as well. What we've done so far is write some documentation that explains the problem, and I've got a patch pending for the NFS client to put that under the Documentation directory, so that the people who write the orchestration software can read it and go: okay, I've got to find a place in the file system to put this, and these are the characteristics it has to have.

This is also what I would suggest: separate the ID being used from the tools that generate the ID in the first place. Just have a defined location in the file system where the ID is stored, and then modify your tools to look for that ID and use it if present. Then it's up to the admin to decide whether to change the client ID or not, and the tools just pick up whatever is there. You don't have to worry, on the NFS tool side, about which information to use or what to do; you delegate that to the previous step, the admin step, of actually providing the number there.

Where in the file system do you put the ID? Where does NVMe put it?

It's in /etc somewhere.

In /etc somewhere where?

That's up to you; you get to define it.

So you don't have a standard file where you put it?

Well, we do have a standard file, but that's an NVMe file, because we said that will be the standard location where it will be found. We defined it that way.

Okay, and when you create a container, what happens?

The good thing here is that it's pretty clear that is the location where the identification is stored, whatever you do, whether you call it from within the container or from within the VM.
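Following that suggestion of a defined location populated by whoever creates the container, an orchestrator hook might look like the sketch below. The path etc/nfs/identity and the hook shape are hypothetical placeholders; the point is only that the creator of the container seeds the file once, and the tools inside just read it.

```python
import uuid
from pathlib import Path

# Hypothetical well-known location inside the container image;
# the real path would be whatever the documentation standardizes.
IDENTITY_REL_PATH = "etc/nfs/identity"

def seed_container_identity(rootfs: Path) -> str:
    """Container-create hook: write a fresh uniquifier into the rootfs.

    Called once when the container is created, not on every start, so
    the identifier survives restarts of the same container but dies
    with it when the container is removed.
    """
    identity_file = rootfs / IDENTITY_REL_PATH
    if identity_file.exists():
        return identity_file.read_text().strip()  # already provisioned
    identity = str(uuid.uuid4())
    identity_file.parent.mkdir(parents=True, exist_ok=True)
    identity_file.write_text(identity + "\n")
    return identity

# e.g. seed_container_identity(Path("/var/lib/containers/foo/rootfs"))
```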
What properties do you want of this identifier? Suppose my container is going to be elastically scaled across the data center, so I've got multiple copies of it. Does each copy have the same ID, or do you want each to have a different ID?

That really depends; it's an administrative choice.

Well, that's what I'm asking you: what are the properties of this identifier? Because the answer we give you depends on what properties you want.

I think they need to be distinct for each running container. So essentially it will be a UUID, where the first U, "universally", is somewhat debatable: it doesn't necessarily need to be unique, but it can be made unique.

So if I've elastically scaled up the container and then I scale it back again: let's say I scale it up by ten, destroy nine of those identifiers, and then scale it up by ten again. Do you want the old ones to come back, or am I okay to create new ones? What are the properties?

That depends on what you want to achieve: whether you really want each container to connect to the very same storage, or whether you want each container to have different storage. Scaling is an administrative choice.

So let me assume the client is connecting to the same storage, because it has to be; all it's really doing is elastically scaling for compute. It's connecting to the same storage, doing a computation that it suddenly has to blow up for, and serving different clients off the back end that you don't need to know about.

So let me try to address the question of whether we want to preserve these across instances of containers. I think not. The uniquifier is used for the purpose of recovering lock and open state. If the container is destroyed, it necessarily has no more open or locked files, so the uniquifier can go away or be recycled if you want.

So in the elastic scaling use case, because I'm killing the container to scale it down, I can get rid of the identifier, and because it's only tied to locking state that I should have destroyed, I'm free to create a different UUID each time.

Yes. Okay, well, then it sounds like it's random.

Just to be clear about the RFC on the wire: is this the client ID?

It's the client owner on the wire.

I got a little confused looking at the NFSv4.1 RFC. There's a client ID and a client owner, right?

Yes.

So you're talking about specifying a consistent client ID, or client owner, for recovery of locks and recovery of open state when you have a container that's moved in flight, sort of. Obviously if the container shut down, there'd be no need for this, right?

No, it's not a container shutdown, it's a container restart. What do I need to do to achieve a safe restart of a container that was suspended with files open and locks held? When you restart it, you have to make sure that state can be recovered after the suspend. And we have a similar issue with SMB, right? We have persistent, resilient, and durable state as well. But these are important, right? That's what you have to specify.
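For orientation in NFSv4.1 terms: the stable string is the client owner (co_ownerid) that the client sends in EXCHANGE_ID, while the client ID is the value the server hands back. Below is a sketch of how a client might compose a co_ownerid from a uniquifier plus the hostname; the "Linux NFSv4.1" prefix and the field order are assumptions for illustration, not a quote of the kernel's exact format.

```python
import socket
import uuid

def compose_co_ownerid(uniquifier: str | None) -> str:
    """Build an EXCHANGE_ID client owner string (illustrative format).

    The co_ownerid must be stable across restarts of the same client
    so the server can let it reclaim open/lock state; the composition
    below is an assumption, not the Linux client's literal format.
    """
    hostname = socket.gethostname()
    if uniquifier:
        # Persistent identity provided by the admin or orchestrator.
        return f"Linux NFSv4.1 {uniquifier}/{hostname}"
    # Ephemeral client: random is fine, since no open/lock state is
    # expected to survive this instance.
    return f"Linux NFSv4.1 {uuid.uuid4()}/{hostname}"
```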
So you have these two fields, right? Client ID and client owner. And what puzzles me about this is: wouldn't this just be an /etc/nfs file? You stick whatever that number is in /etc/nfs and you're done, I mean.

Yeah, presumably. Pretty much. Well, we would like it to be that simple; we're not sure that containers can do that.

Why not? Containers provide a file system.

Well, a container may provide a file system; it doesn't necessarily. And if you elastically scale a container, every copy will have the same file in /etc, whatever it is.

Yeah, right, exactly. But if you want to have that level of guarantees, you have to provide files; that's the way it is.

I mean, it seems like the obvious answer here is: if that magic /etc file doesn't exist, it becomes a random client identifier, and the presumption is you don't care about that client if it gets destroyed and recreated, because there's not going to be any locking state to be preserved. If you do care about locking state being preserved across a container reboot, then you need to establish that /etc file. Maybe it's as simple as that, at least for the NFS case.

The problem we're in with the Linux NFS client at the moment is that when we talk about containers, we're really talking about network namespaces. So when a process is put into a new network namespace, it needs to have a different identifier than the host.

No, it doesn't necessarily need to. There are cases where it should, but there are other cases where it need not have one.

Wait a minute. A network namespace is usually a pod-based thing, not a container thing. If I have two containers in the same pod sharing the same network namespace, do they always have to have the same identifier?

For the Linux NFS client, yes: if they're in the same network namespace, we'll treat the server the same, and the page cache and the state are also shared.

Okay, this is a client identifier. I actually would get away from the notion of a network namespace, because you might have several NFS connections, and for each it's a question whether they should have the same client identifier. They might have different ones.

If the namespace has its own unique IP address, then it needs to have its own client identifier.

Yes. As far as I know, we keep going back and forth on this, on cases like: the container doesn't have to have this, the container doesn't have to have that.

Working for a company that does exclusively containers: we configure a lot of that stuff, and a lot of it is not brought in from the file system; it's brought in from some service-managing thing. So yes, putting it in /etc/nfs or whatever doesn't work for the case where you just fire up a bunch of network-namespace things. But if that's what you're doing, you just need to provide a way for it to supply its own container ID through some interface, right? So, okay, I guess what you're saying is you want NFS to be able to read this information, but it sounds to me like you need to provide a generic interface to say: this is my client ID. And then we just let user space figure it out, right? Or am I misunderstanding?

Yeah, as far as I understood, I think you're saying the same thing we are. This is a user-space-provided identifier, a user-space-provided concept. We don't want to implement policy in the kernel.

Yeah, okay. So the question I have for you, as someone who deploys thousands and thousands of containers this way: is it enough for us to provide you with documentation so you can implement it yourself, or would you like tools?

Documentation, all right. If you just tell us what we need to do to do the thing, we'll do the thing.
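That generic interface now exists in some form: recent kernels expose a per-network-namespace identifier attribute for the NFS client under sysfs. If I remember the path correctly it is /sys/fs/nfs/net/nfs_client/identifier, but treat that as an assumption to verify. A user-space helper implementing the policy discussed here (use the admin-provided file if present, fall back to a random, non-persistent value otherwise) might look like:

```python
import uuid
from pathlib import Path

# Hypothetical admin-provided location, per the earlier discussion.
IDENTITY_FILE = Path("/etc/nfs/identity")
# Per-netns sysfs attribute; path is from memory, verify before use.
SYSFS_IDENTIFIER = Path("/sys/fs/nfs/net/nfs_client/identifier")

def set_nfs_client_identifier() -> str:
    """Push a client uniquifier into the kernel before the first mount."""
    if IDENTITY_FILE.exists():
        # Persistent identity: open/lock state can be reclaimed on restart.
        identifier = IDENTITY_FILE.read_text().strip()
    else:
        # Ephemeral fallback: random per instance, nothing to reclaim.
        identifier = str(uuid.uuid4())
    SYSFS_IDENTIFIER.write_text(identifier)
    return identifier
```

This would need to run inside the container's namespaces, before the first NFS mount, a constraint that comes up again below.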
Okay, yeah. The squidgy part about that for me is: it's a file system. It needs to work out of the box, out of the shrink wrap; it needs to behave reliably. And it doesn't currently, because you need to do this special setup.

I would argue that containers are specialized enough that you can just give us the documentation. Clearly I don't run Docker or any of that other stuff, but those people can do the same thing, right? If they have the documentation, they can go and do the thing for themselves. And so can Facebook.

But they have to know; they have to read that documentation.

Right, they have to know. But you have the fallback. The fallback is a random identifier you create when the container is created. If the container's orchestration system doesn't give it an ID, the ID is not expected to be persistent across crashes and reboots, and therefore it can be randomly generated each time. If the orchestration system wants it to be persistent, it will provide the ID. That seems to be good enough.

Yeah, but we don't have the random default behavior at this point.

The point is, if you don't get anything passed in, you know that, and you just generate one randomly.

Today, the namespace uses the same identifier as the host.

That's probably wrong.

Yeah, that sounds bad. It's also hard to just start creating random identifiers, because we can break existing setups.

But the expectation is: if the existing setup knows what it's doing, it'll pass the ID in, and that will sort it all out for you. The fallback case is only where you didn't get an ID passed in, and you know you didn't get an ID passed in; what do you do? Generate it randomly.

I think the problem we're running into here is that the concept of a container is a user-space thing, which means the container orchestration system needs to do this default thing. We don't want to randomly create the ID when you create, say, a cgroup, because the cgroup is the low-level thing the kernel knows about. Ultimately this all has to be in the container orchestration system, because containers are a user-space concept. If you have clear expectations about how you think NFS should behave, then document them, because that's the way standards are made for containers. It's the sad truth, I guess.

Yeah, as I said, I have a patch pending that adds that documentation. It might not be complete or utterly clear right now, but we've got time to fix it. That documentation is in the pipeline, yes.

Again, I can only speak for my guys, but my guys pay attention to what the kernel community does with containers, and they take full advantage any time anybody does anything. So it'll be consumed inside of Meta pretty quickly if you just tell us what to do.

Okay, maybe I could CC you on the patch I have, and you can say thumbs up, this looks good, or no, this needs improvement.

Yeah, I can pass it back to the people. Thank you.

You're also running most of your containers with systemd-nspawn, right?

We are moving in that direction, yes. A lot of the stuff existed before nspawn, so nspawn is the "oh, thank God, we don't have to do this ourselves anymore" option, but right now I think it's 50-50.

Sorry, I didn't want to spoil it. There is also a documentation and extension process for this specific runtime. You could even go so far as to add it to systemd-nspawn, and that covers a lot of use cases. Obviously it doesn't cover other container people, but they're on their own.
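One concrete way an orchestrator in the systemd-nspawn family could pass the ID in is systemd's credentials mechanism (--set-credential / --load-credential on the host side, with $CREDENTIALS_DIRECTORY exposed to the consumer), as I understand that mechanism. The credential name nfs.client-id below is my own placeholder, and the whole sketch is just one possible shape for such an interface:

```python
import os
import uuid
from pathlib import Path

# Hypothetical credential name; it could be passed in with e.g.
#   systemd-nspawn --set-credential=nfs.client-id:<uuid> ...
CREDENTIAL_NAME = "nfs.client-id"

def client_id_from_credentials() -> str:
    """Prefer an orchestrator-supplied ID, fall back to a random one."""
    cred_dir = os.environ.get("CREDENTIALS_DIRECTORY")
    if cred_dir:
        cred = Path(cred_dir) / CREDENTIAL_NAME
        if cred.exists():
            # Persistent identity handed down by the orchestrator.
            return cred.read_text().strip()
    # No ID passed in: ephemeral instance, random identity is acceptable.
    return str(uuid.uuid4())
```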
So we've considered systemd-nspawn, and the issue we run into is that the uniquifier has to be set before the first NFS mount runs, and we see some variation in that guarantee. Anyway, that's a different story, and I'm out of time. Thank you very much for your feedback.

So, I'll say something. Oh, there you are. Do you want to say something? No? Okay. We don't have anything for FS or IO for this next half hour, so people on the call can drop off if you feel like it. At five o'clock we will be back in this room for lightning talks. If you have a lightning talk, we don't have a way to indicate that, so just find one of the PC members. I'm going to talk about maintainership and responsibilities and that sort of thing, but that's going to run a little long, so if anybody has anything they want to talk about, let me know and you can go first. Or let me know if it's IO; any PC person can handle it. And for the people on Zoom, it's a new Zoom link for the lightning talks. That's in half an hour.