Right, so the next topic is block namespaces, which is pretty similar to what Chris already said. Whenever you create a block device, in whichever shape or form, that device is globally visible to everything running on the system. It is visible in several places. One is, as I said, the major/minor numbers: you can always create a device node and access the device via that node. It will show up in devtmpfs, if enabled, and it will also show up in sysfs under various lists. So while it's possible to restrict the visibility somewhat with the device cgroup, that doesn't really resolve the underlying issue, which is that the device itself is globally accessible to everything. That is why we came up with the idea of block namespaces: whether it wouldn't be possible, similar to network namespaces, to essentially have an internal mapping table from the major/minor numbers visible to the process to the global major/minor numbers, roughly.

Are those really the semantics you want? Because the network namespace is exclusive. It's a label namespace, which means that if you create a network namespace inside a network namespace, you sort of lose the interface. In this use case it sounds like you want the block device to propagate through to children. It wants to be slightly more of a parent-child based namespace than a label namespace.

Yeah, well, with block namespace support I'm actually coming from the other side. What I want is precisely to restrict it. So if you have a container which opens an iSCSI device, that iSCSI device really should only be visible within that container, to no one else.

Right. So now you need an admin, because certain operations on devices are only allowed for the admin. So you've got to have a user namespace to get this to work as well.
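The mapping-table idea raised at the start of this topic can be sketched roughly as follows. This is only a user-space model in Python, not kernel code: the class `BlockNamespace` and its methods `attach` and `resolve` are hypothetical names invented for illustration. The only real APIs used are `os.makedev`, `os.major` and `os.minor`.

```python
import os
from typing import Dict, Optional


class BlockNamespace:
    """Hypothetical per-namespace table: virtual dev_t -> global dev_t."""

    def __init__(self) -> None:
        self._virt_to_global: Dict[int, int] = {}

    def attach(self, virt_major: int, virt_minor: int,
               global_major: int, global_minor: int) -> None:
        # Record that this namespace sees a global device under its own numbers.
        virt = os.makedev(virt_major, virt_minor)
        self._virt_to_global[virt] = os.makedev(global_major, global_minor)

    def resolve(self, virt_major: int, virt_minor: int) -> Optional[int]:
        # A device that was never attached has no entry, so from inside
        # this namespace it simply does not exist.
        return self._virt_to_global.get(os.makedev(virt_major, virt_minor))


ns = BlockNamespace()
ns.attach(8, 0, 8, 32)          # the namespace's 8:0 is the host's 8:32
target = ns.resolve(8, 0)
print(os.major(target), os.minor(target))
print(ns.resolve(8, 16))        # never attached, so not resolvable
```

In this model, opening a device node inside the namespace would go through `resolve` first, so a node created with numbers that were never attached would not reach any global device.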
Yeah, I mean, yes, probably. Taking one step back from the really hard problem of namespacing the actual device numbers: there was a patch a while back, done by a colleague of mine at a previous company I worked for, that namespaced devtmpfs. It made it possible, first of all, to mount devtmpfs in a container, meaning, sorry, in an unprivileged container. So the namespacing of devtmpfs was done based on the user namespace, essentially. Then you could mount devtmpfs in a container and create devices. He only did it for loop devices; this is where we did the original loop device work. When you created a new loop device via the /dev/loop-control interface from within that container, the loop device would only show up in the devtmpfs mount of the container, not on the host.

That story seems in general pretty clean. The problem really is sysfs. sysfs is the worst. sysfs is more problematic than proc in a sense, because all devices will always show up in all namespaces, with all of the information attached to them. This is a problem that we've been briefly exchanging mails on for loopfs. It's essentially a hack, to be honest, the loopfs hack that I've worked on to provide namespaced loop devices to containers, because currently this is still not possible. And that's really difficult. The reason why we don't have a block device namespace is not that people thought this is a bad idea. People thought this is a good idea.
The reason we don't have it is that Greg was vehemently opposed to namespacing sysfs. And, I mean, this is just my view: if we were to do this properly, then we would at least need a new mount option for sysfs, probably.

Yeah.

So that we retain the old behavior in case someone doesn't want to regress how sysfs works.

Yeah, and then give it a new mount option, probably on the host, where you say: give me the sysfs mount, but only show me the devices that actually belong to my namespace.

I mean, sort of like what is being done for securityfs.

Yes, that's what I suggested for securityfs. Yeah, it's a massive amount of work.

Sure, it is. Clearly. Eventually this is something that we want to head for, or at least what I would like to head for: that for sysfs we indeed filter the view so that we only present the devices which are actually there. Which I guess we need to do anyway if we were to go about virtualizing the major/minor numbers.

Yeah, because the major/minor numbers also show up in sysfs. So if we have virtualized them, then each process will have a different view of what shows up in sysfs under block or char.

Exactly, so we will have to have that anyway.

Part of the other problem is that this is a block device that you actually want to mount, one that has a file system on it. So then we have the problem that only a certain number of file systems are allowed to be mounted.

Yeah, but that's what I consider a container manager problem. You can solve this. It's not pretty, but it's at least possible. You can set up images in the container, and then (quick excursion, and then back to the original topic) we use syscall interception to actually mount it, so that really works.

Block device namespaces. Okay, so, I forgot what I wanted to say.
I'm sorry, I'm pretty jet-lagged.

No, but the key thing is, well, let me challenge what you're probably about to say, which is: if we did this and it was effectively namespacing sysfs, it would be for all devices. It would work for block, character, everything.

Yeah, I wanted to say one thing in favor of actually giving it a shot again, even though Greg is vehemently opposed to it. We would probably need to talk to him about it first. But one of the reasons is this: the original story always was that it is unacceptable, that devices belong to the initial namespace and nowhere else.

But why?

Yeah, it's not just the why; the story has no longer been true for a couple of years now. Think about it. If you think about just network devices, you could argue that network devices don't have character device representations, so network devices are special, and that is why they are namespaced, because network devices are properly namespaced even in sysfs. But nowadays think about InfiniBand devices, so RDMA-capable devices. Take an InfiniBand device that you slice into various SR-IOV virtual functions, so that you have 20 or however many IB devices. These IB devices are all associated with a separate set of character devices in /dev/uverbs or whatever it is, and people delegate InfiniBand devices to containers. What you need to do right now is move that InfiniBand device, the actual virtual function, into the container, and then you also need to pick out all of the character devices that belong to the IB device by parsing through sysfs, and put all of those character devices into the container manually, meaning the container manager needs to do it. The correct model for this is obviously: you slice it up into different virtual functions, you move that virtual function into the container, and all of the devices associated with this virtual function show up in devtmpfs. That would be the model that we want, because that's how device delegation should work in this case.

Yep, and we have the same problem on the SCSI side with NPIV, where we can create virtual SCSI hosts which at the end of the day really do belong to a container or a virtual machine, but they all show up in the host. And we actually have to explicitly stop udev from running on these devices, which are exported to the virtual machine.

There are massive problems associated with this. I have a long list, because I've been struggling with this for such a long time. Take systemd-udevd, the udev manager. Lennart always argues it shouldn't run in a container, because sysfs isn't namespaced and then udev gets confused: udev enumerates devices inside of the container and has a database, and then udev on the host has a database as well, and so on. Then they might start renaming things. So there is a lot of confusion, because all of the devices show up inside and outside of the container. And that is a massive problem, because we have users that want to run systemd-udevd in their containers. You know why? Because you need it for network devices, and network devices are properly namespaced. They're also namespaced in sysfs: if you move a network device from the initial network namespace into a container's network namespace, then the device gets yanked from sysfs on the host and moved into the container's sysfs instance, if the container mounted a new sysfs instance, and that sysfs is tagged by the network namespace of the container.

Yeah, and that's what I mean. So it's not unheard of. It's not that sysfs in principle isn't capable of filtering out devices. It already does for network namespaces.
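The manual step described above, where a container manager walks sysfs to pick out the character devices that belong to one IB device, might look roughly like this. It is a sketch under assumptions: the layout `/sys/class/infiniband_verbs/<node>/ibdev` and `<node>/dev` is what I understand the verbs stack to expose and is not taken from the discussion, the function name `uverbs_nodes_for` is hypothetical, and the device name and numbers used in the demonstration are a made-up tree, not real hardware.

```python
import os
import tempfile
from typing import List, Tuple


def uverbs_nodes_for(ibdev: str, sysfs_root: str = "/sys") -> List[Tuple[str, int, int]]:
    """Return (node name, major, minor) for each verbs node owned by ibdev."""
    class_dir = os.path.join(sysfs_root, "class", "infiniband_verbs")
    found: List[Tuple[str, int, int]] = []
    try:
        names = sorted(os.listdir(class_dir))
    except OSError:
        return found  # no such class directory on this system
    for name in names:
        base = os.path.join(class_dir, name)
        try:
            with open(os.path.join(base, "ibdev")) as f:
                owner = f.read().strip()
            with open(os.path.join(base, "dev")) as f:
                major, minor = (int(p) for p in f.read().strip().split(":"))
        except (OSError, ValueError):
            continue  # not a device directory, or unexpected contents
        if owner == ibdev:
            found.append((name, major, minor))
    return found


# Demonstration against a throwaway directory tree instead of the real /sys.
with tempfile.TemporaryDirectory() as root:
    node = os.path.join(root, "class", "infiniband_verbs", "uverbs0")
    os.makedirs(node)
    with open(os.path.join(node, "ibdev"), "w") as f:
        f.write("example_ib0\n")
    with open(os.path.join(node, "dev"), "w") as f:
        f.write("231:192\n")
    print(uverbs_nodes_for("example_ib0", sysfs_root=root))
```

A container manager would then still have to create matching device nodes inside the container for every tuple returned, which is exactly the bookkeeping the speaker would like devtmpfs to do by itself.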
It's just that, well, it doesn't do it for block devices or character devices, to speak of.

Wait a minute. It's still not actually properly namespaced. If you want to go down the PCI bus, you can still find the network card. If I put a physical network card into the namespace, you can still find it from the root doing that.

I think so. There was a more annoying issue. When you created a new network device inside of a container's network namespace and user namespace, all of the sysfs entries showed up correctly, only inside of the container's sysfs instance, the ownership was correct, and it didn't show up on the host. When you created a veth device on the host and moved it into the container, the ownership wasn't changed, so you had the wrong ownership on the host and inside of the container as well, and some paths weren't correctly changed either. I think I sent a patch series for this to the networking tree in 2019 or something like that. You can double check, but I'm pretty sure there were a bunch of issues. It's not properly namespaced, because the namespacing knowledge hasn't propagated to all of the subsystems, which is totally understandable. It's not as if we have done a great job so far of explaining how namespaces work.

There has not been a lot of looking at the SR-IOV use case for network devices, just pushing them into individual containers.
And so there are still a lot of holes, even in the relatively well-namespaced networking side.

Yeah, but in general I agree: we should at least try to get something like block device namespaces. I do wonder whether it would be possible to do it in a well-staged approach, where we initially leave sysfs as it is, if we can just ensure that we properly block access to the devices via the device nodes, i.e. by having a virtualized major/minor table. You start with that as an initial step to separate things out, and once that's in, we can look at how we could properly filter out entries in sysfs, or make sysfs entries namespaced. I think we should be able to separate those two things. I think that's crucial even for the general upstreaming story, because if you come with a patch set that completely reworks sysfs and everything... But the general question is: do we want to go in that direction, or do we want to state that a device is a general thing and really should belong only to the initial namespace and not to any container namespace? That's actually the more fundamental question.

This is your admin problem. The device requires an administrator to actually plug it in correctly. So if you take the view that the container orchestration system just does all this for you, it's the one that has to do the mount, and then it can pass the mounted file system down into a container. If that's your only use case, you're not quite containerized, but effectively the orchestration system is coping for you.

Yeah, but I'm not talking about passing it down. I'm talking about the use case where you have a container which creates an iSCSI device and deletes it when the container shuts down. Should this iSCSI device be visible in the host or not?
That's what I think too. We will do it either way. The question is what you want to do with it, or who controls what you do with the iSCSI device.

The control is completely within the container. No one else.

Yeah, I think that's how I understood it. It essentially works like a network device: most operations can be done from inside of the container. You need to mount it inside the container, and we have all the trouble with the file systems and everything else.

So what I'm fishing for is how much interaction with the orchestration system, which we presume exists, you need to do this.

For storage devices we don't usually have SR-IOV. I mean, there are some, if you do custom builds. So we don't have the luxury of network cards with virtual or physical functions. I know it exists. But we are interested. This is something we would like for zoned storage: we would like to have it on the host but also within the namespace. You have this device to manage; it's there, but inside of the container. We have these limited resources for zoned drives, and we might want to have that in a namespace. So maybe the zoned drive supports a hundred open zones, and we might want that container to be able to use 20. So we are looking into this, and I'm happy that you also think it's a good idea to have this block device namespace. We know that it is outside of how you usually do containers, but we do see it for the high-performance database use cases; they usually work with raw storage.

Oh. We need at least one more beer before I can even contemplate this. I mean, I haven't adapted to the time zone yet, right?
Oh blimey. Possibly, yes. Of course we can look into it, but what you raise really is more of a QoS issue, and not really a visibility one, because I wouldn't know how to map it at all at this time. You would have to have concurrent access to that device, and you would actually need to do some sort of QoS on the concurrent access. So that's not so much about visibility; it's more about QoS on device access, which you possibly can't do.

It kind of works. It can work, but if you want to do it in the containerized way, with how everything is set up, that's what cgroups are for. That was the point of cgroups, wasn't it?

Anyway, so the question is: shall we try, or shall I try, to get the block namespace going, or is it a lost cause?

It can't hurt. The situation is also vastly different: the number of arguments for it has increased quite significantly over time. Originally, when that was proposed, containers weren't as widely used, and only a handful of people were interested in delegating devices to containers in the first place, so it wasn't really a big use case. So it was easy to just say: don't delegate devices to containers.

Do you really need a new block namespace? Can you say that block devices created by network namespaces only show up in the network namespace?
Yeah, that is in principle possible, but you might end up with the same amount of work, because the device creation logic is different for network devices and for character and block devices. What I mean by this is that character and block devices are based on the kobject framework, and network devices, if I remember this correctly (I haven't looked at this in about a year), are not based on kobjects at all. So it's a totally different namespacing story. Networking is namespaced completely independently of character devices and block devices. As far as I can tell, you can't take the character and block device creation out of the kobject path. You need to create a set of primitives and functions that allow you to create namespaced character and block devices anyway, and you need to do this in a generic way. The loopfs patch set that I wrote has these functions in it.

Yeah, and the other thing is that if we were doing this, we would be completely losing out on InfiniBand, because that's not really network as such.

Oh. Isn't the InfiniBand stuff also governed by the network namespace, if you're not using IP over IB?

Yes, but also the underlying InfiniBand objects?

So InfiniBand can have two appearances. One is as an IP thing, in which case it's basically just an IP transport protocol. The other is technically as a bus protocol, where it appears a bit like a PCI bus. That second one wouldn't be namespaced.

Exactly, the second one wouldn't be, which means that for anything over RDMA we would be losing out, because it's just using IP over IB for the initial discovery, and everything else goes via real InfiniBand.
I know that a few years ago a new controller was introduced specifically for RDMA interfaces.

Okay, I'm not really that into it, so yeah, you might be right. Anyway, I would still prefer to have a separate namespace for block devices. Tying it to the network namespace doesn't really feel right. And again, it wouldn't work for loop devices, because those devices have nothing to do with networking at all. Why should they be attached to a network namespace?

And in principle the kobject logic, and this is one of its advantages, has the ability to create sysfs entries based on namespace tags. The kernfs infrastructure, which is the underlying infrastructure that makes sysfs possible, has namespace tags as well. So when you mount sysfs inside, for example, a new network namespace, sysfs will take a reference on the network namespace tag and attach it to the superblock information structure. This is file-system-specific detail, but essentially it just means that sysfs is now aware: I'm in a new network namespace, therefore I should only show the network devices that belong to my network namespace and exclude all of the others. And I did similar logic for block and character devices, based on the user namespace in this case, I think.

Okay, good. Thank you.
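The tag-based filtering described in this last exchange can be modelled in a few lines. This is a toy model in Python, not how kernfs is implemented: `Entry`, `SysfsMount` and the tag strings are hypothetical names chosen for illustration. It only shows the rule that a mount remembers the namespace tag it was created in and hides entries carrying a different tag, while untagged entries stay visible everywhere.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    name: str
    ns_tag: Optional[str]  # None models an untagged entry


class SysfsMount:
    """Toy model of a mount that captured a namespace tag at mount time."""

    def __init__(self, entries: List[Entry], ns_tag: str) -> None:
        self._entries = entries   # the same entry list backs every mount
        self._ns_tag = ns_tag

    def listdir(self) -> List[str]:
        # Tagged entries are shown only to mounts holding the same tag;
        # untagged entries are shown to everyone.
        return [e.name for e in self._entries
                if e.ns_tag is None or e.ns_tag == self._ns_tag]


entries = [
    Entry("eth0", "init_net"),    # network device in the initial namespace
    Entry("veth1", "ct_net"),     # network device moved into a container
    Entry("loop0", None),         # block device: untagged today
]
host = SysfsMount(entries, "init_net")
container = SysfsMount(entries, "ct_net")
print(host.listdir())
print(container.listdir())
```

The untagged `loop0` appears in both listings, which is the behavior the discussion is about: tagging block and character entries in the same way, for example by user namespace as suggested above, would let the same filter hide them.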