Hello. So, I'm Chris Leech at Red Hat. I don't have any slides; I just wanted to talk a bit and then throw some questions out to the group, regarding some questions that came to me, the investigation I've done into them, and where I'm at with that so far.

I work primarily on network transport storage drivers, so iSCSI, FCoE, and now getting into NVMe/TCP. A while back it was brought to my attention that people were attempting to run the control plane for iSCSI, running the user space packages in containers, and it didn't work, and they wanted to know why. As I started looking into it, the answer mostly turned out to be the interaction between those transport drivers and network namespaces. So in a lot of cases, just telling the people trying to do that that there was an issue, and that if you're creating a container you have to set it up privileged with host networking, which leaves you in the initial namespace, resolved the issue enough that no one's really beating down my door over it. But it's not the best situation, right? So the question is: how important is it to have the ability to have a separate network namespace and attach to storage resources in it?

As I investigated this for our iSCSI drivers, it was a variety of problems around the transport objects and the interface between the control tools in user space and the kernel. The iSCSI control interface is pretty complicated; we've really split things out quite a bit there. The very first thing that was failing was the netlink control interface: it just didn't work outside of the initial namespace. That was easy enough; I went in, patched things up, made it listen per namespace, and you could get things working. But then, if you're going to isolate based on networks, you potentially want to connect to storage from multiple networks, and trying to run multiple instances of the control process, iscsid, then completely blew up, because they could see the other transport objects that were outside of their network isolation, and didn't like that at all. There's a lot of conflict between having competing processes.

So, continuing down that path: okay, allow per-network control interfaces, then start isolating all the transport objects into each of those network namespaces, and then that turns into sysfs filtering on visibility of all the transport objects. So I kind of drew a line between the transport objects and the actual storage objects in the data plane. That works; you can have separate control and connections to network-connected storage, and then all of the storage itself just shows up on the system and is not filtered or isolated, because we don't do block namespaces.

More recently, I've gone back and revisited that, and tried to compare and see how we do with other network transport storage. I tried it out with NVMe/TCP, where we have a much simpler interface to the control processes: there's just a character device that gets used. It works, because when you establish a connection, it just uses whatever namespace you're running nvme-cli in. So you can establish connections from different network namespaces. Once they're established, you can list them, control them, and destroy them from any namespace, as long as you have the control character device. So it's not really filtered at all.
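To make that NVMe/TCP behavior concrete: the transport connects from whatever network namespace the calling process is in at that moment. A minimal sketch of the mechanism, using only standard namespace syscalls (this is illustrative, not nvme-cli's actual code):

```c
/* Sketch: a socket is created in whatever network namespace the
 * calling process currently occupies.  A TCP storage transport
 * (iSCSI, NVMe/TCP) connecting from here would therefore use this
 * namespace's interfaces and routes, not the initial namespace's.
 * Requires CAP_SYS_ADMIN for unshare(CLONE_NEWNET). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
	/* Move this process into a fresh network namespace. */
	if (unshare(CLONE_NEWNET) < 0) {
		perror("unshare(CLONE_NEWNET)");
		return 1;
	}

	/* Any socket opened from now on belongs to the new namespace. */
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	printf("socket created inside the new network namespace\n");
	return 0;
}
```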
I went and took a look at the code for Fibre Channel over Ethernet; the software driver there explicitly checks for the initial namespace, so it won't work in isolated namespaces at all. Then there are some other things that I don't work with, but I poked at the code just to take a look: the ATA over Ethernet and distributed block device drivers all have explicit references to the initial network namespace. So there doesn't seem to be a network-connected storage transport that will let you containerize its control and connect through different network namespaces.

So I guess the questions that I had for discussion were: how much of a use case is this? Because it feels like something that could be worked on a bit further. And then, the stopping point of where I was at with the patches: the problems that I ran into were mostly edge cases around namespace lifetimes, which I am not entirely sure how to approach, because the namespaces tend to be attached to processes, and when we have storage transport connections maintained in the kernel that aren't attached to a process lifetime, how do you maintain the namespaces if there are no more running processes in them?

I can tell you how to do that. You can actually bind-mount the nsfs namespace file into the regular file system, and that allows you to exit the container while keeping its namespace alive, and then you can re-enter that namespace. So if you tie storage to the namespace, you can use that mechanism to preserve it. And probably, if we're starting on something in the kernel, we should actually do that automatically.

Okay. Yeah, I'd seen how some of the networking tools do that too, to keep a reference to the namespace in the file system. But I guess I was worried about leaving persistent things in the kernel, if that ends up somewhere it's a problem: something that isn't deletable and doesn't go away.

Okay, so you have the lifetime of something that you have to manage, right? What would happen if you restart that container? Would it find the old namespace, or would it create a new namespace, so it doesn't lead to some sort of leak? It creates a new namespace on restart; the namespace is deleted when the container is shut down.

The question that I have is: why isn't it taking a reference on the relevant namespaces when the connection is established? Why is there a lifetime issue in the first place? I'm really just trying to understand.

Because no one was aware that there is a namespace. As you just said, none of the network storage implementations have any truck with namespaces; they literally try to avoid it. All the others have an explicit check whether we are in the initial namespace or not; iSCSI is the only one which does not have the check, but that is not because it was a deliberate choice, it's just because no one did it. And consequently, as they are not aware of namespaces, they wouldn't be taking a reference on the namespace, as they never had to. I think the problem is not "why is it broken"; the question is how do we make it work, because nobody's ever tried to make it work.

Yeah, but my answer was then basically that this should probably just take a reference to the relevant namespaces. This is the first time, for example, that I hear about these issues, but if you loop us in on the patches and so on, we can certainly help with that, and review and comment on it, and help you with this. I'm happy to.
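The reference-taking being suggested there would presumably follow the kernel's usual get_net()/put_net() pattern. A hedged sketch (get_net(), put_net(), and sock_net() are the real netns helpers; the example_conn structure and functions are invented for illustration):

```c
/* Sketch: pin the network namespace a transport connection was
 * created in, so the namespace cannot be freed underneath it. */
#include <net/net_namespace.h>
#include <net/sock.h>

struct example_conn {
	struct net *net;	/* namespace pinned for this connection */
	/* ... transport state ... */
};

static int example_conn_bind(struct example_conn *conn, struct socket *sock)
{
	/* Take a long-lived reference on the socket's namespace. */
	conn->net = get_net(sock_net(sock->sk));
	return 0;
}

static void example_conn_destroy(struct example_conn *conn)
{
	/* Drop the reference when the connection goes away; only then
	 * can the namespace itself be torn down. */
	put_net(conn->net);
}
```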
Can you remind me what exactly would happen? So if we have this setup where a container creates a namespace, and an iSCSI block device then takes a reference on that namespace, and then I shut down that container and restart it, I will get a new namespace. Will I still be able to access the iSCSI device in that old namespace?

Well, that's more of an issue for the user space concept of a container, right? If we're going to isolate by namespace, and you're talking about restarting and returning, then you have to be returning to the same namespace, or you're getting confused about your control plane.

If the block device takes a reference on the namespace, that namespace cannot go away until the block device reference is dropped. Yes, I know that. But you don't have to do it like that: you can also manually spawn something that forces the block device creation, and that would tie the lifetime of this thing to the namespace, so when you kill the namespace it would kill the device. There are different ways you could do it. The question is, what do you want to do?

Indeed, that's the question: what is it we want to do? And I guess it's not something we can decide, or for which there is a golden rule, because it really depends on what you want to do with it. What is your use case, and what do you want to achieve with that container? It's perfectly feasible that you have a container which just wants to create an iSCSI device, does something with it, and deletes the iSCSI device on shutdown. It's also feasible that there is a container which provides an iSCSI device, which should be either accessible to other containers or should stay around even across restarts of the container. How would we know?

Well, if we look at this from a networking perspective, from networking devices: if you, for example, create a network device, a veth device, and you move it into the container, you're switching its network namespace when you do this. And at that moment, when the network namespace is switched, the lifetime of this veth virtual device is tied to the lifetime of the container. Sorry, not the container; a container is a set of namespaces, and I'm talking about the separate network namespace. It's tied to the lifetime of the network namespace, which is tied to the lifetime of pid 1, or whatever it is, in your container. And when the container shuts down, or the network namespace is destroyed, then the veth device is automatically cleaned up. So that's one model you could implement for these virtual iSCSI devices, I guess: you tie the lifetime of the iSCSI device to the lifetime of the network namespace, or the container in general.

Yeah, and that does work, if you're not worried about the connection going away.

No, I'm not sure whether it's a feasible thing, because really iSCSI only looks at the IP address; it couldn't care less which device that IP address is on. So if we were to go with a veth approach, we would need to provide an IP address to that one, and I'm not sure whether we have an additional IP address, or whether in general it's a good idea.
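Setting the IP-address question aside, that veth-style lifetime model maps onto the kernel's per-namespace hooks. A minimal sketch, assuming a transport driver registered its own pernet_operations (register_pernet_subsys() is the real API; the example_* names are hypothetical):

```c
/* Sketch: tie per-namespace transport state to the namespace's
 * lifetime, the way veth devices are cleaned up on netns teardown. */
#include <linux/init.h>
#include <linux/module.h>
#include <net/net_namespace.h>

static int __net_init example_net_init(struct net *net)
{
	/* Allocate per-namespace state here, e.g. a session list. */
	return 0;
}

static void __net_exit example_net_exit(struct net *net)
{
	/* The namespace is going away: destroy every session or
	 * virtual device that was created inside it. */
}

static struct pernet_operations example_net_ops = {
	.init = example_net_init,
	.exit = example_net_exit,
};

static int __init example_init(void)
{
	return register_pernet_subsys(&example_net_ops);
}
module_init(example_init);
```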
Yeah, clearly: tie it to a virtual device which is tied to the namespace, and if the namespace goes away, the interface goes away.

Exactly. The point that I wanted to make with this, somewhat to James's point: you can have a model where you tie device lifetime to a namespace. But it's also not impossible, not unheard of, and doable to go the other way, and essentially let the device that you're interested in determine the lifetime of the namespace. So when you destroy the device, you drop a reference on the namespace as well, and you destroy the namespace. In this manner you could, for example, persist a namespace implicitly through the device. It wouldn't be reachable from user space anymore once all of the processes attached to that namespace have died, but that's also a model.

Well, but how would you tear down such a thing?

Yeah, exactly; then you need an explicit destroy operation. And that destroy operation needs to be able to access the namespace, because you need to talk netlink to that namespace, and if you can't, you can't destroy it.

Well, pretty much every lifetime tends to be tied to the lifetime of the namespace: when the namespace dies, we want everything to go. That's the easiest model. Or you need a separate operation from user space where you, for example, say "destroy this device".

Can I just poke at one thing you said? You said you didn't have multiple IP addresses? Usually you'll be assigned... you'll have enough. Ah, forget it. Yeah, okay. Sorry.

Can I say something on this? We have a use case. We implement the CSI specification for Kubernetes, where there's a container running on each node, and this container creates NFS or DRBD devices; I guess iSCSI could be a use case too. And we want the device to persist, to live across container restarts. The container might get killed for various reasons, but we want the device to persist, because it's consumed by other pods, by other containers. And yes, tearing down the device after you lose the namespace is a problem, and for NFS we have worked around this by using the initial namespace.

Yeah, that's the only thing you can do, because you need to be able to reach that namespace from other containers, and every container restart, to all intents and purposes, is another container. And as Christian just said, once you tear down the container that created the namespace in the first place, there is no way you could reach that namespace ever again, even though it's still around.

Yeah, if we were able to reconnect, if we were able to specify "please reconnect to that namespace", then things would be different, but I don't think we can.

I mean, yes, that works, but this is a user space problem, a user space concept. For example, let's say you shut down your container, you create a container, and then you bind mount the file descriptor for the network namespace... sorry, maybe I should go into a little more detail.
What I mean by this is the nsfs file system that he was talking about. If you create a new namespace, all namespaces are reachable from /proc/<pid>/ns/ for the process: there's a list there of all of the namespaces that the process belongs to, so you can also compare namespaces and so on. So when you create a container, take the pid of the container's init process, for example, and this path, /proc/<pid>/ns/net: opening that, taking an fd out, and bind-mounting it somewhere will persist that namespace. Meaning, if you now shut down the container, all of the processes go away, all of the non-persisted namespaces go away, but since you still have an open fd to the network namespace of the container, the network namespace is still alive. Now, if you wanted to, you could restart the container, created with a bunch of namespaces in the correct order (it's a bit complicated, but it works), and then do a setns for the init process of the container, and move the init process of the container into the same network namespace that the previous container had. So that is possible; it's just a programming problem, but it is in principle possible (there's a sketch of this after this discussion).

Okay, yeah.

But can we poke at the use case? Because you said you were exporting a service from this container. So it sounds like the service isn't confined to the container; you want other containers to consume the service that this is exporting.

Yes. We are actually doing the NFS mount in the container, and then we bind mount this to other containers.

Right, so effectively this is a containerized control plane, because other containers need to be able to reach it. So you can't confine the mount namespace for this, because it has to be globally visible. So this isn't really a properly containerized thing; you're using a container artificially to get an IP address for this thing, but the daemon itself has the root mount namespace visible, which makes everything else visible too. So if you have a path from other containers, you can get to the storage.

Yes. I think that's the model of the architecture of the Container Storage Interface: you do the mount, and then you bind mount this to other containers so they have access. We're using bidirectional mounts to make this work.

So we could give you a more containerized model, which is where you actually do spawn a mount namespace from this container, and then everything is confined to that mount namespace. Any container that wants to use that mount point would have to have its own mount namespace coming off this one. So you could then do the bind mount downwards, but it wouldn't be visible upwards. That would be the more containerized case, because that's properly contained to a namespace hierarchy.

Yes, that would be better. And in general, for example, DRBD doesn't support this at all; it uses the init namespace. So it would be nice if it were possible for network devices to be really containerized.

Well, then we get into the problem that NFS is easy, because it's just a mount point, so the mount namespace confines it. iSCSI is a block device, which is not confined by the mount namespace. Which is the next topic: let's talk about /dev and procfs. Why would we want to talk about that? In a sense, because all of... yeah, okay.
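A minimal sketch of that bind-mount-and-setns persistence trick, using only standard syscalls (the /run/netns-backup path and the helper names are illustrative, and the mount target file must already exist):

```c
/* Sketch: persist a container's network namespace past the death of
 * its processes, then re-enter it from a restarted container. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/types.h>

/* Bind-mount /proc/<pid>/ns/net so the nsfs inode, and with it the
 * namespace, stays alive after every process in it has exited. */
int persist_netns(pid_t container_init)
{
	char src[64];

	snprintf(src, sizeof(src), "/proc/%d/ns/net", (int)container_init);
	return mount(src, "/run/netns-backup", NULL, MS_BIND, NULL);
}

/* Later: move the calling process (e.g. the new container's init)
 * back into the preserved namespace. */
int reenter_netns(void)
{
	int fd = open("/run/netns-backup", O_RDONLY);

	if (fd < 0)
		return -1;
	return setns(fd, CLONE_NEWNET);
}
```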
I mean, this is then essentially the next topic, which we actually wanted to bring up, namely block namespaces. Because the main reason why most of, well, nearly all of the network storage drivers touch upon the initial namespace is that it's not possible to restrict the view of user space to which block devices are there.

Yes, it is: use the device cgroup for that, if you want.

It doesn't operate the same way as a namespace does. It can certainly remove visibility of devices, and add them to a container, but that's more of an access restriction, I think.

What we are really talking about is: if you create a new block device, where will it show up in /dev? Yes, on the host; all of them will show up on the host, loop devices and so on. And that has been a problem we have been struggling with for a while. It's really annoying.

Yeah, and what's more, the devtmpfs entries are really just a symptom, because we have a single list of major:minor numbers which is accessible to each and every process. So if you have the major:minor number, you can open the device, full stop; whether it shows up in /dev, or whether we restrict the visibility of /dev with a device cgroup or whatnot, doesn't come into it (there's a short illustration of this at the end).

So, do you just want to come up here and move on to the next topic then? Yeah, sure. Yep, I think we're done with this; that's okay for now, and we can talk to other people about this to continue. Yeah, thank you.
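To make that major:minor point concrete, a hedged illustration (the path and the 8:0 device numbers are just examples): knowing a device's major:minor is enough to open it, with no /dev entry needed at all.

```c
/* Sketch: create a private device node from major:minor numbers
 * alone and open it.  Requires CAP_MKNOD, and works regardless of
 * whether the device is visible in /dev (a device cgroup policy
 * could still deny the open, but that is access control, not
 * namespacing). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

int main(void)
{
	/* 8:0 is conventionally the first SCSI disk, used here purely
	 * as an example. */
	if (mknod("/tmp/mydisk", S_IFBLK | 0600, makedev(8, 0)) < 0) {
		perror("mknod");
		return 1;
	}

	int fd = open("/tmp/mydisk", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	printf("opened the block device via a private node\n");
	return 0;
}
```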