Hi. My name is Lee Duncan, and Chris and I are here to talk about namespaces in open-iscsi. The problem we were trying to solve is that iscsid did not work in containers, so people could have only one daemon running on their system no matter how many containers they had. Chris had the idea two or three years ago to modify the kernel code so that it was namespace aware; I took up those patches recently and forward-ported them, and we've both been working on it for a couple of months on and off, and it looks like it's working. We have a patch set that he has submitted and that we both worked on, though he's done most of the work. He has some slides he'd like to show you, and I'm just here for the glory.

Sure. The whole idea here is that with iSCSI we have the control plane split out into user space with the daemon, which is just the way open-iscsi has always been developed, and when people see a daemon they try to throw it into a container. The main place I've gotten complaints about this is from people building cloud runtime environments who want to containerize the components of that environment. If they're going to use iSCSI storage within a container or a virtualization runtime, they want to take the iSCSI parts and throw them into a container for their own deployment, and as soon as they try that it doesn't work, and then they come and complain to the open-iscsi people about the user-space tools not working. If you go and look at it, it's actually not a problem in the user-space code: it's that the kernel side of all the control APIs just doesn't work in a network namespace. So people have been living with the workaround of putting it in a container but always using the host network, and they're restricted to that.

The first thing that fails is that the netlink iSCSI commands just weren't listening on anything other than the initial namespace. That was a really easy fix, and I thought it was going to solve the whole problem. But at that point you have a new problem: you can now run multiple user-space daemons, because we had essentially been enforcing a singleton by reserving a UNIX socket name for IPC. Once that moves into different namespaces you can have multiple daemons running, and they all start trying to take over all of the iSCSI sessions on the system; they all want to recover all the sessions, things start getting reset, and it's a giant mess. The solution was to start filtering all of the transport objects in sysfs, because that is what user space uses to find everything on the kernel side. The iSCSI transport has a host; an iface, which is a representation of a network configuration on a host; the session and a connection, even though we only do one connection per session; an endpoint, which is a representation of whatever the driver uses instead of a socket if it's not just software TCP; and then the flash node sysfs interfaces that qla4xxx has. Those had to change to do the filtering, because they were implemented as bus devices, which can't do namespace filtering, so I switched them over to class devices, and that is all I want to say about qla4xxx.
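To make that netlink fix a little more concrete: the usual way to stop a netlink service listening only in the initial namespace is to create the kernel socket once per network namespace through the pernet infrastructure. The following is only a minimal sketch of that mechanism, not the code from the posted patches; the names iscsi_net_id, struct iscsi_net and iscsi_nl_rcv are placeholders.

/* Sketch only: a per-netns NETLINK_ISCSI socket via register_pernet_subsys().
 * The identifiers below are illustrative, not the ones in the patch set. */
#include <linux/module.h>
#include <linux/netlink.h>
#include <net/net_namespace.h>
#include <net/netns/generic.h>

static unsigned int iscsi_net_id;	/* key for net_generic() lookups */

struct iscsi_net {
	struct sock *nls;		/* this namespace's NETLINK_ISCSI socket */
};

static void iscsi_nl_rcv(struct sk_buff *skb)
{
	/* dispatch iSCSI control messages as before, now per namespace */
}

static int __net_init iscsi_nl_net_init(struct net *net)
{
	struct iscsi_net *in = net_generic(net, iscsi_net_id);
	struct netlink_kernel_cfg cfg = {
		.input	= iscsi_nl_rcv,
		.groups	= 1,
	};

	in->nls = netlink_kernel_create(net, NETLINK_ISCSI, &cfg);
	return in->nls ? 0 : -ENOMEM;
}

static void __net_exit iscsi_nl_net_exit(struct net *net)
{
	struct iscsi_net *in = net_generic(net, iscsi_net_id);

	netlink_kernel_release(in->nls);
}

static struct pernet_operations iscsi_nl_net_ops = {
	.init	= iscsi_nl_net_init,
	.exit	= iscsi_nl_net_exit,
	.id	= &iscsi_net_id,
	.size	= sizeof(struct iscsi_net),
};

/* module init would then call: register_pernet_subsys(&iscsi_nl_net_ops); */

On its own this lets an iscsid inside a container reach the kernel, but as described above it also exposes the multiple-daemon problem, which is what the sysfs filtering is then for.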
Can't you just use the interface? The way namespaces work, every network interface can be in one and only one namespace, so if you pick by network interface for the outbound connection you automatically get the namespace as well; that might be a simpler way of doing it. The kernel just has a list of network interfaces, and as long as you can get that list unfiltered you can select the correct one. You're looking at it the other way around, saying "give me the network namespace, then what interfaces can I see"; if you picked by interface, you'd automatically be in the right namespace.

There are a couple of reasons. One, we don't always bind a session directly to a specific network interface; sometimes we just use the default and it gets routed to wherever. And a lot of these iSCSI transport objects aren't ours anyway: what we care about supporting here is just the software TCP initiator driver, and a lot of these objects are created by offload devices for their management, so basically we just want to hide them so that we don't start interfering with them.

Yeah, normally we just have an IP address, so in order to get to the interface we would have to follow the entire routing table.

Sure, if you do an outbound wildcard binding you have to know the network namespace you're binding to; I was just thinking of an alternative where you pick the interface first, and then you've got it.

Yes, if you're able to do so, but occasionally you don't really want to, because the only thing you're interested in is the IP address, and you couldn't care less across which interface that particular IP address is reachable. Which has other issues, especially when it comes to multipathing. Yes, I know, but that's still the default we're doing. Okay, just suggestions.

Okay. Moving on real quick, we'll see that it's actually not a tremendous amount of change to do this, and this approach ended up fixing all of the communication issues on the kernel side. I was pleasantly surprised that we didn't have to change user space at all, so we remain backwards compatible with the existing open-iscsi tools. This is the entire patch set we have out currently, and I think it addresses all of the known issues we've seen. There are very minor changes to all of the iSCSI drivers; I've got a slide that shows which interfaces had to change. Almost all of the changes are in scsi_transport_iscsi, and almost all of that is in ensuring that namespace filtering of the sysfs objects, based on network namespace, is turned on. We ensure that all the sysfs objects are rooted to a host, with one exception for iSER and its endpoints. The iSCSI host device lives in a network namespace now, since these control-plane objects are considered part of networking, and everything else is attached to the host and follows along with that. So this is really the set of interface changes for TCP and iSER. iSER we mostly broke by accident; I had to fix it, and once I fixed it I realized it was not going to be very much work to make it namespace aware as well, so I went ahead and did that.
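The namespace filtering being referred to uses the same sysfs mechanism as the networking "net" class: a class provides an ns_type and a namespace() callback, and sysfs then only exposes each device inside the network namespace the callback returns; bus devices have no equivalent hook, which is why the qla4xxx flash-node objects had to become class devices. The following is a rough sketch of that mechanism rather than the actual patch: where the owning netns is stored (drvdata here) is a placeholder assumption, and the exact struct class callback signatures differ a little between kernel versions.

/* Sketch only: tagging a class's devices with a network namespace so that
 * sysfs filters them per netns, using the same hooks the "net" class uses. */
#include <linux/device.h>
#include <net/net_namespace.h>

static const void *iscsi_host_namespace(const struct device *dev)
{
	/* placeholder: assume the netns of the netlink request that created
	 * this host was recorded in drvdata at creation time */
	return dev_get_drvdata(dev);
}

static struct class iscsi_host_class = {
	.name		= "iscsi_host",
	.ns_type	= &net_ns_type_operations,	/* filter by netns */
	.namespace	= iscsi_host_namespace,		/* owning netns tag */
};
/* registered as usual with class_register() */

That filtering is also what the demo later shows: from the host you can still see the SCSI device and the iSCSI host, but the transport-level objects only appear in sysfs inside the container's namespace.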
The get-namespace callback is a little weird, but it has to do with the way the transport class objects get instantiated; there's not really a way for the driver to tie into that cleanly. So there's this one callback into the specific driver to try to match up a host with a network namespace, and in the two converted drivers it takes the namespace to use from wherever the netlink command came from in the first place. The session-create-net and ep-connect-net calls deal specifically with the way iscsi_tcp works, where it creates sessions and then creates a virtual host to go with them, so you don't have a host ahead of time to attach to and you have to specify the network namespace when the session is being created. iSER is very similar, but with endpoints: the endpoint object is the first thing, and it creates the RDMA associations before it creates virtual hosts and sessions. The thing that caused the roughly three lines of change in every other iSCSI driver is that endpoints previously had just been virtual devices that weren't parented to anything in sysfs; we changed that to attach them to a host where we could, or otherwise, for iSER, to create them directly in a network namespace, and to look everything up within a specific namespace.

Could you say a little bit about the net-exit stuff that we had to do to clean up?

Right. One of the things we had to figure out was namespace lifetimes and what to do at exit, and we ended up not trying to hold a reference to the namespace for an iSCSI session, because you just end up with dangling sessions off in a namespace that no longer has a process attached to it, and you can't really recover them. So when the namespace exits we have cleanup code that just shuts everything down. There's a bit of weirdness with TCP, and I can show it to you: if you're running a containerized iscsid and then you throw the container away, there are still open kernel sockets belonging to the session. I'm using podman containers, so when the container goes away all your routes go away and it stops working, but it's still alive in the kernel. If you're running with keepalives enabled, that will time out, error recovery will close the sockets, the namespace will finally go away once it's no longer held open by the active sockets, and then we do all the cleanup on the iSCSI side. But Lee discovered that if you're running without any sort of keepalive and you didn't have any traffic that got cut off, you're left with a nice idle TCP session sitting in the kernel, with an iSCSI session seemingly attached to it, doing nothing.

We'd probably need to ask Christian this, but can't we close the session when the namespace is detached rather than destroyed? Because, as you just said, namespaces are torn down lazily, so there's some delay, whatever the keepalive time is.

Right; the problem is that we do all the cleanup fine once the namespace is destroyed, but there's a live socket that's keeping the namespace open.

What about the netlink event that says the device has gone away? What should happen when a namespace is shutting down is that, if the routing has been pulled away, the network device has probably been taken down as well; if you can hook into that, you can use it as the trigger.
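In kernel terms, hooking that trigger would presumably mean a netdevice notifier (the same events that netlink reports to user space). A rough sketch of that idea, which is not part of the current patch set, might look like the following; the iscsi_drop_sessions_on_dev() helper is hypothetical.

/* Rough sketch (not in the posted patches): tear sessions down as soon as
 * their network device goes away, instead of waiting for keepalive timeouts
 * to release the namespace. */
#include <linux/netdevice.h>
#include <linux/notifier.h>

static void iscsi_drop_sessions_on_dev(struct net_device *dev)
{
	/* hypothetical: find sessions using this device (or any session in
	 * dev_net(dev)) and start error recovery / teardown immediately */
}

static int iscsi_netdev_event(struct notifier_block *nb,
			      unsigned long event, void *ptr)
{
	struct net_device *dev = netdev_notifier_info_to_dev(ptr);

	switch (event) {
	case NETDEV_UNREGISTER:		/* e.g. netns teardown removing the device */
	case NETDEV_DOWN:		/* interface administratively downed */
		iscsi_drop_sessions_on_dev(dev);
		break;
	}
	return NOTIFY_DONE;
}

static struct notifier_block iscsi_netdev_nb = {
	.notifier_call = iscsi_netdev_event,
};

/* module init would call: register_netdevice_notifier(&iscsi_netdev_nb); */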
Yes: the device has gone down, now you can close all the sockets, and everything should just go away without needing the timeouts. Netlink should send an event that says that. That might be a possibility, yeah; that event would show up when the device goes away, so we should be able to hook it.

So this is actually just two VMs running on my laptop, one of which is just a very simple target configuration over on the left; I'm going to get rid of that. On the other one I've created a couple of podman containers. It's just a minimal, small distribution base image; I add the open-iscsi tools onto it, inject an initiator name so we have some configuration to start with that matches the target, and then run iscsid as PID 1 for the container. I just start up the container and then use exec commands to trigger login to the target after it's running. You can do that, and it starts up, and from the outside we can still see all the SCSI-level stuff: we can see the device, we can see that there is an iSCSI host, but none of the transport-level objects will show up in sysfs here; those are only visible inside the container. And just for fun I have a second image that's the exact same thing but built with Tumbleweed instead of Fedora, so it's a different build of the same tools; I can start that up and we've got two different things running in here.

I want to mention, too, that we've tried to get some input from container folks and it's been silence, so I don't really know if I'm not reaching the right people or if they don't care. I just want to know: will this work for them, and do they want other features besides this? One thing I also want to mention, since I have a moment, is something Hannes brought up earlier at the conference: we'd kind of like to have device namespaces too, so that you could choose whether you wanted your storage to show up in all namespaces.

I just shut down the container there, a forced shutdown, so there was no clean logout first, and you saw that five-second delay until the keepalive triggered and then the cleanup took everything away. So that's the current state of the patch set.

What I would like from this conference, besides input and comments from anybody who has any, is to get the patch set moving forward. I think you put out the second or third version just a couple of days ago. Right; like Lee said, I had some older patches, and Lee was nice enough to dig those up and get them applying to the current development tree again, because they had bit-rotted a bit, and then I've spent the past couple of weeks addressing a lot of feedback that Hannes was nice enough to give, plus some bugs I ran into myself doing more testing. But as far as we know, we're in a pretty good state with this right now.

Some people might realize, too, that it wasn't too surprising to me that it looks like some other subsystems might want to do a similar thing, like NVMe. Yeah, I would say that's in planning, or at least something I have in mind. And the block device namespace is something I spoke to Christian about, and he was also very much in favor of it, or rather, to be precise, not a block device namespace but a device namespace, i.e.
to attach a namespace to the struct device itself, which, yeah, would be cool. That has some implications which you really need to sort out, though, because we can't just blank out any device: some devices are actually quite useful, and you need them to make your system do anything, say /dev/null or /dev/tty; you really want to have those. So there is a problem. If you look at network namespaces, they're what you could call label namespaces: every network device can be in one and only one namespace. Is that really what you want for device namespaces?

That is precisely the thing that is as yet unresolved, which is also why I haven't followed up on it. I would need to talk to the maintainer about how I could have a general device that shows up in all namespaces.

There is another thing called a device cgroup that you could use. It's not a namespace; what it tries to do is filter device visibility by a set of rules for a particular group. I take it you've looked at this? Yes, I did, and I quickly realized it's even worse than namespaces, given the filters it uses.

Yeah. The warning about label-based namespaces is that the model is peculiarly suitable to networking, because you tend either to push a device into the namespace or to create a veth pair, one end that goes into the namespace and the other that attaches to a bridge in a different namespace, and you just build your routing topologies that way. What you'll find with a device namespace is: say you've got this huge 50-terabyte device and you push it all through to one thing, and then you think, well, but I wanted to use device mapper to split this up and then serve it out to other containers. Either you have to do something like veth for device mapper, so you can have a one-to-one device with one endpoint in one namespace and another endpoint in another, or you have to accept that devices can't just be exclusively in one namespace, that they need to be in a set of namespaces, which means device-to-namespace mappings are no longer one-to-one, and then it becomes phenomenally complicated. So the semantics will be tricky.

I get you, I'll grant you that, and yes, you're right, eventually we might require something like, well, a veth link for devices, DM-linear on steroids essentially, where one end is in one namespace and the other end in the other; I guess we really would need to have that. But having device namespaces would solve quite a few issues which are as of now unresolved. NPIV is one of these: the Fibre Channel driver has the ability to basically virtualize itself; you can just give it a WWPN and then you suddenly have another host. The idea is that this virtual host you've created is actually what should be passed to the guest, but we can't do that, because it's really a virtual construct, essentially just a struct device; it's a host with no hardware attached to it, so we have literally nothing we could pass to the VM. So it will always show up in the host, while at the same time everything that shows up underneath it really should only be visible to the guest, and so we always have to have tooling around that to ensure that any device showing up on this particular host really belongs to that VM.

The other thing to consider about device namespaces is how they are going to interact with mount namespaces. For
instance, if I mount a file system from a device that's in one device namespace and push it through to a mount namespace in a different device namespace, should that operation be legal, so that I can still see the file system, or should it fail? Does it matter?

Or rather, hang on. To my understanding, namespaces themselves are primarily a user-space thing, and that's how I look at device namespaces: they're primarily about what we can access from user space. Once we are within the kernel, yes, we might be looking at the namespace, but then again, if we don't, we don't. So I'm not sure whether there really is an issue: once we're inside the kernel, it would be blind to the device namespace and able to do the mounting into any mount namespace it wanted to.

Yeah, I suppose that works, because in the end, once you've mounted your thing, you don't really care about the underlying block device. Well, that's the point; that's why there is no device namespace currently, because containers are pretty much file-based. Exactly, so they wouldn't really care.