Welcome, everyone. In this talk I'm going to rant a bit about ioctls, or IO controls, or "IO cuddles", however it is that you guys want to pronounce that. For a long time I've been seeing that we are still using them in the block layer and file systems. Some background: I used to work on networking, and I saw the change in networking. So let me first elaborate on some rants, and then I'm going to hand it over to James.
Before you do that, you could have spelled it right. There's an F missing.
Thank you, thank you. All right, some history first on the introduction of ioctl into the UNIX world. It wasn't originally designed for what we think it was. It's essentially a hack. The history describes it as a closet full of skeletons, and it was somehow exempt from the ethos of simplicity that kept the lid on new system calls. Keep this in mind, because everyone asks, "Hey, what's the problem with IO controls?" Just keep in mind that it was a hack. And this is from the history books as well: documented willy-nilly, and sometimes only in source. Linux, too, didn't really have IO controls when it was first released. They were introduced slightly later, about a year after the first release. And then in '93 we have just one small patch to modify IO controls as well. As a generic overview of the compat stuff, in case that's confusing, this is essentially what it does: the compat stuff addresses 32-bit system calls issued on 64-bit systems.
So, to recap some of the issues, given the brief history I just explained: it's not originally designed for what we're using it for today. It was a hack. Let's just admit that, right? Let's not defend that position in any way, shape, or form. The concept that everything is a file is really useful, right?
So we can obviously use IO control for pretty much anything in the kernel, right? You can argue that it does allow a lot of flexibility. I'd argue that's true, but it also allows for lazy architecture design. Also, some user space APIs don't even have support for it. Java is one example, right? It doesn't promote documentation. And introspection is also a problem. Now, I will admit introspection is not a problem that I have actually dealt with, so I was wondering if James might be able to elaborate on some of that. I don't know if that's a problem space that you're familiar with.
I can. It comes from the container world, and it's not just introspection for IO controls. It's introspection for things like system calls. If you want to secure a weird Docker container, you block loads of system calls using seccomp. But if there's an alternative way to get around the system call you just blocked by using an IO control, which you don't introspect and don't see happening, your security is worthless. So there's a lot of security concern about non-introspectable interfaces, because they can't be policed properly by the tools we usually use for containers, like seccomp or even eBPF. The specific IO control problem is that it's just a dense binary packet. There's no structure that you can deduce from the form of the packet about what it contains, and that's basically why IO controls are listed as non-introspectable. Now, in theory, we could make the kernel speak XML or JSON, but in fact all you get is a load of random tags. You could theoretically introspect slightly better, but only if you knew the schema, which usually isn't transmitted anyway. So you just transform a dense binary problem into a slightly less dense ASCII problem. This introspection thing is always going to be with us, almost regardless of the interface we choose.
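To make the "dense binary packet" point concrete, here is a minimal sketch in Python. The 16-byte payload and both struct layouts are invented for illustration and do not correspond to any real ioctl. The only point is that the same bytes decode cleanly under two unrelated layouts, so a filter that sees the bytes without the schema cannot tell which meaning is intended.

```python
import struct

# Hypothetical 16-byte ioctl argument, as a filter would see it: opaque bytes.
payload = struct.pack("<QII", 4096, 7, 1)


def decode_as_range(buf):
    """Assumed layout A (hypothetical): u64 offset, u32 length, u32 flags."""
    offset, length, flags = struct.unpack("<QII", buf)
    return {"offset": offset, "length": length, "flags": flags}


def decode_as_ids(buf):
    """Assumed layout B (hypothetical): four u32 identifiers."""
    uid, gid, projid, mode = struct.unpack("<IIII", buf)
    return {"uid": uid, "gid": gid, "projid": projid, "mode": mode}


# Both decoders succeed on the same bytes; nothing in the packet says which
# one is right. Only out-of-band knowledge (documentation, headers) does.
print(decode_as_range(payload))
print(decode_as_ids(payload))
```

Neither decoder raises an error, which is the problem: the packet itself carries no hint of its own layout.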
Thank you, thank you. So I asked Arnd for an opinion, and he really told me exactly how he felt, and there it is. There are some historic issues with architecture support on IO control as well. Here's an itemized list of things that he could think of. So the world is not peachy for architecture support either. Then this is dispatch it, right? Like, what the hell, how is this possible? From a design perspective, this just makes no sense. Granted, of course, if you have root you can do anything you want, but this is just stupid, right? So yeah, even though it shouldn't happen, we allow for it, right? It's just silly. Go ahead and try it. Really, try that, see what happens. So is the grass greener? Well, one can get spoiled, right? I just want to provide some perspective. I come from the wireless world, right? I don't know if you guys remember wireless extensions. Anyone remember that? The iwconfig world? No? Holy crap. Well, please, please, please, if you want to get some context for where I'm coming from, for the love of deity, whatever, just look at the Linux UAPI wireless.h and compare that to nl80211. Now granted, this is a complete change to generic netlink. And we did briefly have the discussion on the mailing lists about the fact that it's not really designed to be generic enough, and perhaps we don't want to use something like generic netlink. Even though that is the case, it doesn't mean that we can't come up with something better and generic for file systems and the block layer. So I'd like to hand this off now to James.
Okay, fine, but I'm not really going to defend it. configfd was a thing I came up with a long time ago when we were doing the shifting bind mount, which has now been replaced by the thing Christian was talking about yesterday, whose name I've forgotten.
But what configfd did is it was based on fsconfig, which we now use for the mount subsystem, where you basically open a file descriptor using a system call, you send lots of configuration down to the file descriptor, and then it atomically does everything you wanted to do. The main difference configfd had from fsconfig is that it was bidirectional: you could pull information back out of the configuration file descriptor as well. One of the things I did when I tried to introduce configfd is I rewrote everything in terms of configfd, to demonstrate that it was actually a superset of fsconfig.
Can I just interject something? fsconfig did have a bit that allowed you to get stuff back out, but Al removed it before it got upstream.
Yeah, I know. So I'm not really here to defend this. I wouldn't really like to spend all of my time talking about configfd. What I'd like to talk about is the necessity of IO controls, because effectively what an IO control is, is an exception to the normal semantic order of things. And however regulated we try to make the semantic order of things, we always get these exceptions. So there's always going to be a requirement for a way that two parties can communicate using data that's not structured by the existing semantic, and that is an IO control. Whether you're sending JSON, XML, or binary data, we're always going to have a need for them. The problem isn't getting rid of IO controls. The problem, which is to your point, what you were complaining about, is that if the operation could be done using the semantic we already had in place, it should be. So people who program an IO control where we could have got the semantic to work are the ones who are doing it wrong. That's what causes the IO control explosion. But I believe that if we are very careful, and regulate IO controls and document them, they have a place in our ecosystem.
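The stage-then-commit pattern described above for fsconfig, together with the bidirectional read-back that configfd added, can be modelled roughly as follows. This is a toy Python model and not the kernel interface: the class `ConfigContext`, its methods, and the `apply_mount` callback are hypothetical names chosen purely for illustration.

```python
class ConfigContext:
    """Toy model of a configuration descriptor: stage settings, then commit."""

    def __init__(self, apply):
        self._apply = apply        # callback that validates and performs the action
        self._staged = {}          # settings sent down so far, not yet in effect
        self._committed = None     # result of a successful commit, if any

    def set(self, key, value):
        """Stage one key/value pair; nothing takes effect yet."""
        if self._committed is not None:
            raise RuntimeError("context already committed")
        self._staged[key] = value

    def get(self, key):
        """Read a value back out (the bidirectional part)."""
        source = self._staged if self._committed is None else self._committed
        return source[key]

    def commit(self):
        """Apply all staged settings in one step; on error nothing is applied."""
        result = self._apply(dict(self._staged))
        self._committed = result
        return result


def apply_mount(settings):
    """Hypothetical backend: reject unknown keys before doing anything."""
    unknown = set(settings) - {"source", "target", "ro"}
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    return {**settings, "mounted": True}


ctx = ConfigContext(apply_mount)
ctx.set("source", "/dev/sda1")
ctx.set("ro", True)
print(ctx.commit())
```

Because validation happens inside `commit`, a bad key fails the whole operation rather than leaving a half-configured object, which is the property the atomic commit step is meant to give.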
And the bigger problem is how do we actually introspect them? Because containers hate IO controls just because you're bunging binary data down and we have no idea whether it's going to do something with the attack surface, give you root control, allow a containment breakout, whatever. It's about trying to find a way of introspecting all of this correctly, which goes down to documentation; that is really what we're probably looking for in the container world.
I think that's a generic problem for containers, especially coming from the seccomp side, which is still the default thing that most people use. Kees and I gave a presentation about this at Kernel Summit a few years back, right before the pandemic, I think. We need a solution for this in seccomp as well. Not necessarily as we originally thought, that you are able to introspect, so actually look into structs and parse out arguments from structs. The problem here is that if you make a system call that passes down a pointer, seccomp can't do anything with it. You can't really filter on, for example, members in structs with seccomp.
But let's just clarify for the room. What Christian is talking about is that it's not just IO controls. We have a lot of system calls that are multiplexers that you really do...
Yeah, not just multiplexers. Seccomp is becoming a burden in a way. I don't want to steal your time too much. Seccomp is becoming a burden because for a long time it limited how we could design system calls, especially now that it becomes more and more important with containers, because pointer arguments were discouraged. So, for example, when passing down structs in new system calls, somebody would always reply: what about seccomp?
Seccomp can't filter based on the struct, and the problem with this is that we do have use cases for structs in system calls, where we want the system call to be extensible, for example, or it just has a large number of arguments. So this is a generic problem that sticks with us, and seccomp needs to be taught how to at least, for example, do some checksumming so that you can verify whether the arguments have been rewritten while you perform the system call, and so on. That's a problem slightly to the side of this whole problem, but it needs to be solved as well.
So if I can lift a larger point from that: if we just banned all IO controls and said you had to use system calls, all that would happen is the problem would migrate from IO controls to system calls.
Yeah. Just to extend on the Star Trek references in your slide deck, from the original Star Wars: the more you tighten your grip, Tarkin, the more star systems will slip through your fingers. Part of the problem is that we do want to have very, very good architectural control over system calls. So in order to get a system call through, it goes through a massive bikeshedding activity. It has to be documented, people ask all these questions, the seccomp people come out of the room, and here's one of the things: if you're a device driver author or even maintainer, you may not care about containers. Maybe they should; maybe supporting containers is a tax that we should impose on the entire community. However, what happens is that the tighter and more perfect we try to make system calls, the more there is a very extreme incentive to move things into IO controls, where we can dodge the bikeshedding. Because sometimes the bikeshedding is important. Other times it is a huge burden, right?
And I won't say who, but I once heard someone say, "Well, you know, in order to avoid the bikeshed, I'll do the ioctl in my file system, and then we'll later on see if we uplift it to other file systems," because that avoids the fsdevel bikeshed party, right? And there are good reasons, and we certainly miss things in terms of architectural review if we dodge some of these processes, but I think people do need to remember, the more perfect we try to make things, that most of us do not have infinite resources.
I just want to point out, but you should first thank Princess Leia for her comment: I'm not sure it's the right argument to make that the only alternative here is system calls.
I don't think we're making that argument. What we're saying is that the perfect is the enemy of the good, effectively, if I can quote someone who's not Star Wars related, at least I don't think he is. Right, so we need to accept in our development process that what we do isn't perfect and there will always be exceptions. We just need to make sure that we don't make the exceptions too often.
There are also additional constraints, if that's the right term. There are at least two separate classes that ioctls fall into. Some of them are for setting stuff up, and configfd would work quite well for those because the timing doesn't matter, but some of them have to be fast. You can't go create a file descriptor, do something...
Yeah, but configfd had that, because the creation was just the setup, and then the thing that atomically set it going could do that instantaneously.
If we're going to be talking about configfd, could we see the proposal?
It was a patch sent to the list years ago. I really don't think you want to see the proposal. Mike, look it up if you guys really want to see that. That's something that's...
So I guess what I should say is, what we should do is focus on what we want to do instead, because you can pry ioctls from my cold dead hands unless you give me something else. Btrfs does a lot of things, and we've wasted a lot of time in these architectural bikesheds that just ended up being like, "Oh well, yeah, you're right, let's just put it in an ioctl." Like, the community said, "You're right, you know what, let's just put it in an ioctl," and that was a year wasted, Omar? Like, anyway. And the ioctls we have aren't going away, because there are too many things using them, or some of the ioctls aren't going away.
Right, and I think I'm standing up here to argue that, used judiciously, IO controls are probably the best way of doing stuff, possibly with some sort of introspection data that allows us to solve some of the seccomp problems we have with them.
Right, and so, you know, this is the "the code is the documentation" thing. Okay, that's not great, but these structs are put in UAPI, they have names. Why don't they last? No, I'm looking for config.
I think ioctls are getting vilified a lot here, and I don't think all of it really makes sense. I'm defending, I'm defending. Luis, config FD. Thank you. Can I continue? All an ioctl really is, is a driver-specific syscall. That's all it is. And there's a real need for that. Part of that bikeshedding process you were talking about, Ted, for something to become a real system call: ioctls can and should be a part of that bikeshedding process. Before something becomes a system call, oftentimes we should try it out in an area where it's not so permanent, where it's more private to a file system or so on. And not everything needs to get promoted to the system call level anyway. I think we should be defining the scope of the problem here. ioctls do have some very real problems, but we don't necessarily need to redesign everything from the ground up.
We should define the scope of the problem here. One of the problems that we were talking about this morning.
I just want to add on to Princess Leia's point. If I was in that situation, you said, I'm gonna pull, I don't have ioctls, so I'm good. But if I did, and you said get rid of those policy heels and I can't hear you being evil, I'd say fine, and I'd just make another device and do reads and writes and get the same thing done. So I'm there for you. If you get too vicious, we'll just invent our way out of the corner. And not necessarily in a way that helps, right?
Right, and this is kind of my point, right? Like, what are we trying to solve? Is it introspection? I think, if we want to make it easier to identify what the interface is and what it does, you know, I am a giant fanboy and I will scream netlink from the rooftops.
That's me. So I'm the one who says ioctl is fine as a user interface. Our main problem with it is introspection. That's sort of my position.
Yeah, I think what confused me a little bit is that configfd in general might be a good idea for some stuff, but I don't think it is necessarily a good replacement for ioctls.
Right, the real thing here is that it doesn't seem like we have thought of a generic way to abstract all these things that we do need, and to express them in a way that's perhaps not just an ioctl or system calls. I don't think we've even thought of that yet, right? This seems to be a generic issue, right? Once you have something in networking, or in file systems, maybe you work on something in that area, but something generic doesn't seem to exist.
But that's the point I was trying to make. What we try to do in computing is set up fairly universal semantics that work for us, and what we always find is that there are exceptions to the semantics we've set up, and we need some mechanism for coping with those exceptions.
There is no perfect world where everything fits into one semantic and we just haven't found the one true semantic yet. There will always be edge cases where we have to route around the system, and we need a means of doing that, and IO controls are as good as anything else.
Well, for instance, Dave, maybe you can comment on the history of how fsconfig came about and all that. Just as an example.
Luis, I just wanted to add a perspective.
Yeah.
We all talk about whether we should have ioctls. I think most of us agree that we need ioctls to try new things. But something we haven't talked about is how to graduate from a new thing, like a staging API, to a properly documented API. Ted mentioned this yesterday. There was an ancient ext2 ioctl for lsattr and chattr, and then other file systems also implemented that, and XFS had another one, and both of them were merged. And now recently it was standardized as a VFS API. So it's the same ioctl from ages ago. It's from ext2. But this demonstrates the problem. If you're reinventing the same thing over and over again in your...
No, it's not reinventing, it was adopting. It was adopting the technology.
The point is that XFS did something and put this...
No, no, no. There were two technologies invented in parallel.
Right. And then they were standardized and hoisted up into the VFS. And the whole thing is now pretty much standardized. But it still lives in the same ioctl space. All that is missing, basically, is documenting it as an API. And all the tools, I mean, strace already knows how to parse those things, but it's not standardized.
So maybe a better... well, another instructive example to look at might be the new mount API, where we defined an entirely new system call. It is a little bit like configfd in that you can send down...
No, no, it's exactly like it, because configfd was a rewrite of fsconfig initially.
But it ends up being very purpose specific, right?
Yeah, and the reason it's purpose specific, and the reason I wrote configfd, is that it requires a mount and a superblock. If you don't have that, you can't use fsconfig on it. That was the problem.
To Ted's point, what I like about the new mount API, and what scared me a little bit about configfd, is that at least it's a specific FD type. configfd made it so that it would always be a configfd, no matter what. For example, if you used configfd to set up something for a new file system API, or configfd to, I don't know, do process management in the kernel, it would always be the same FD underlying, a non-inode FD type. And that always bothered me. Separate APIs, for example an API that deals with kernel process management and an API that deals with file system management, should be separate FD types if they use some sort of FD-based management. That doesn't mean you couldn't use the same infrastructure underneath. So maybe you have a pidconfig and an fsconfig, and then you use the...
Well, go to the configfd stuff on top of it. This patch, four of six. Because often it's just a set key; you're setting key-value pairs. So this patch basically used configfd as an infrastructure for fsconfig. You built the more specific on top of the less specific. And configfd could live like that and you'd never really see it in the wild. You just do things like this on top of it.
Maybe it's worthwhile to separate mechanism and policy. I don't think you want to look at what you've seen the patch is. There are multiple mechanisms. I don't know who said it, but it is absolutely true: ioctls and system calls are basically the same thing, right? It's essentially a function call into the kernel.
And you have a multiplexing thing, whether it's the syscall number or the ioctl number, and then a random set of arguments. That's one way. You can do something like /proc or /sys, where you're echoing ASCII strings into magic files. You can do something like configfd or the new mount API, where you're basically sending attributes and then you send a commit, right? These are all mechanisms. The policy is: how painful is it to add a new system call? How painful is it to add a new ioctl? How careful do we need to be before we extend the mount API? That's where the devil is in the details.
But I don't really think it's fair to just say that the alternative is system calls, considering the implications of review and design for system calls. I don't think that's...
But from a technical perspective, there is no difference. The only difference is how much pain you have to go through before you can actually get a new system call in. If you look at it from a 10,000-foot level, there is no difference, right? You have a system call number, you have an ioctl number. There are some details that do matter. In terms of C structures, the ioctl takes a number of C structures. The only difference is how it is documented so that people can actually do the introspection, right? If every single ioctl had to go through the exact same process that we have for system calls, then introspection would not be a problem, because it would all be documented; there would be required man pages for it all. It'd be extremely painful to add a new ioctl, so it would be really stable, which makes it a whole lot easier to understand. And then people would find an escape hatch, because it would be too painful, right? This is the matter of the order.
Just to give you some counter to this, right?
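As a rough illustration of the "ioctl number" half of that multiplexer, the sketch below packs and unpacks a command number in Python using the asm-generic bit layout (8-bit number, 8-bit type, 14-bit size, 2-bit direction). Some architectures use a different split, so treat the constants as an assumption; the helper names `ioc` and `decode` are made up for this example. The worked value is `FS_IOC_GETFLAGS`, which the kernel headers define as `_IOR('f', 1, long)`.

```python
# Bit positions from the asm-generic layout (assumed; some architectures differ).
NR_SHIFT, TYPE_SHIFT, SIZE_SHIFT, DIR_SHIFT = 0, 8, 16, 30
IOC_NONE, IOC_WRITE, IOC_READ = 0, 1, 2


def ioc(direction, type_char, nr, size):
    """Pack direction, type ("magic") character, number and size into a command."""
    return (
        (direction << DIR_SHIFT)
        | (size << SIZE_SHIFT)
        | (ord(type_char) << TYPE_SHIFT)
        | (nr << NR_SHIFT)
    )


def decode(cmd):
    """Split a command number back into its four fields."""
    return {
        "dir": (cmd >> DIR_SHIFT) & 0x3,
        "size": (cmd >> SIZE_SHIFT) & 0x3FFF,
        "type": chr((cmd >> TYPE_SHIFT) & 0xFF),
        "nr": (cmd >> NR_SHIFT) & 0xFF,
    }


# _IOR('f', 1, long) with an 8-byte long, as on a typical 64-bit system.
fs_ioc_getflags = ioc(IOC_READ, "f", 1, 8)
print(hex(fs_ioc_getflags))
print(decode(fs_ioc_getflags))
```

The number records a direction and a payload size, but nothing about what the payload means, so decoding it gets a tool only part of the way towards introspection.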
When you extend the wireless world with a new command or new technologies or whatever, we're not adding new system calls for new features or anything like that. We're addressing problems from a domain-specific place. So that's my point here: we can reduce the scope of how you present what it is that you need to modify or deal with.
I'm just having a hard time seeing what we could replace ioctls with, honestly. I know you had the netlink proposal...
Yeah, I mean, that's just an example, right?
I understand. You can't use netlink because, if the network subsystem is not compiled in, you wouldn't be able to do ioctls.
Not just that. I also think netlink is, in terms of API usage, much more complicated than ioctls for user space. It might actually be more convenient to replace netlink with configfd...
Well, good luck selling that to Dave Miller.
Are we about to reinvent inter-process communication? There are many inter-process communication protocols that support introspection, that support documentation in a machine-readable way...
Well, let's back up. I would hope everything supports documentation. The problem is we don't create it, right? That's Ted's point about the scrutiny we undergo or don't undergo to add these things. IPC can be exactly the same. It can be a multiplexer as well. And if you didn't document it, nobody knows what you've done and you can't introspect it properly.
I did a good amount of work on making ioctls better or easier to use. But yeah, there certainly are problems. What could be helpful, for example, is if you have ioctls in a specific subsystem; let's stick with the file systems. One thing that bothered me when I did work in the VFS that had to touch a bunch of file systems is that there are certain APIs that circumvent the permission checking of the VFS. For example, creating new objects within the file system through an ioctl.
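For comparison with a flat C struct, netlink carries its payload as a sequence of type-length-value attributes. The Python sketch below packs and walks such attributes, assuming the usual `struct nlattr` layout (a 16-bit length that includes the 4-byte header, a 16-bit type, then the payload padded to a 4-byte boundary). The attribute type numbers 1 and 2 and their meanings are invented for illustration, and the message headers that real netlink traffic carries are omitted.

```python
import struct

NLA_HDRLEN = 4  # two 16-bit fields: length and type


def pack_attr(attr_type, payload):
    """Encode one attribute: header, payload, then zero padding to 4 bytes."""
    length = NLA_HDRLEN + len(payload)
    padding = (-length) % 4
    return struct.pack("=HH", length, attr_type) + payload + b"\x00" * padding


def walk_attrs(buf):
    """Return (type, payload) pairs without knowing what any type means."""
    attrs = []
    offset = 0
    while offset + NLA_HDRLEN <= len(buf):
        length, attr_type = struct.unpack_from("=HH", buf, offset)
        if length < NLA_HDRLEN or offset + length > len(buf):
            raise ValueError("malformed or truncated attribute")
        attrs.append((attr_type, buf[offset + NLA_HDRLEN:offset + length]))
        offset += (length + 3) & ~3  # advance to the next aligned attribute
    return attrs


# Hypothetical attributes: type 1 = interface name, type 2 = frequency in MHz.
message = pack_attr(1, b"wlan0\x00") + pack_attr(2, struct.pack("=I", 2412))
for attr_type, value in walk_attrs(message):
    print(attr_type, value)
```

A generic tool can split the message into attributes and skip the ones it does not recognise, but, as noted earlier in the discussion, it still needs the schema to know what each type number means. The extra framing is also part of why this is more work for user space than filling in a struct.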
And I know where this comes from. I totally know that it makes a lot of sense in the beginning, but the problem is that permission checks get forgotten, because they need to be duplicated from the VFS and so on. I would, for example, be in favor of having some documentation that states: don't duplicate permission checking in your ioctls. Try to avoid going behind the VFS's back when you create new things.
Yeah, I wouldn't disagree with that. But it's back to the same thing. There was a semantic and an exception, and when you create an exception, you forget all of the pieces of the semantic you should be obeying.
The real thing is whether ioctls are a band-aid, and really should be treated as that, or whether ioctls are part of the API.
No, we already established that. They're part of the API, because there is no perfect semantic that could replace all IO controls. I think, unless anybody disagrees with that position.
Yes, but if they're part of the API, then there is no need for IO controls to be replaced. IO controls as part of the API is just a question of whether we document them or not.
I mean, I provided the example of the wireless world, right? Where we were inventing and adding a new IO control for every single little new feature. Then we ran out, then we started doing sub-ioctls, and now we have generic netlink, and you look at the documentation that's forced upon new commands and all those features. We're not inventing system calls. And it's very specific and tied to the wireless objects. So yes, the grass can be greener, but I'm just saying I don't have the answer to that. Just look at what we have to do, and I think it's a fucking mess.
I think part of the mess is that we have reinvented IO controls over and over again with things like netlink and configfd and what have you. Even configfs, I suppose. The question might be why; we could examine why we keep doing this.
Because if you look at what I said, it's a semantic and an exception. And theoretically IO controls should cover all of that use case. The fact that we invent other systems for doing this instead indicates that they don't.
Yeah, I think you guys keep over-engineering it when we need something simple. An ioctl is just a driver-specific syscall. So why don't we just do a better driver-specific syscall, a driver-private syscall? One concrete problem that we've got with ioctls is that there's no real namespacing. You're picking a small integer that may or may not collide with an ioctl some other driver registered. And where this really causes problems is, as Ted mentioned yesterday, when ioctls get promoted from a driver to, say, the VFS layer: that can happen just by changing a #define, which means it happens without any real review. And we paint ourselves into some silly messes. I've always liked how OpenGL extensions work, where there's actual namespacing. Vendors create extensions that start with NVIDIA or ATI, and anyone can create extensions that don't conflict with other extensions, because they're not just integers. And then when something gets promoted to the standard, that's done by defining a new extension with a different name. If we had that, it would make it much more likely that review happened at the appropriate time.
Well, at this point, unfortunately, we're out of time. So I think we should probably take this up in a later session or something like that. Thank you, though.
Yep, thank you very much, guys.