OK, then let's begin. I only got 30 minutes, so I'll try to be quick. I'm going to talk about one specific facet of systemd today, which is basically containers without a container manager. There have been quite a few container talks at this conference already, so I'm basically trying to take the stuff that containers do and adapt it to normal system services directly. So it's kind of doing containers without actually being a container manager. What are containers again? Lots of people have lots of different definitions. I think the three most relevant parts of what a container is are: resource bundling — you have this one tarball or squashfs image or whatever, and it contains all your dependencies, so you get rid of the dependency problem by simply bundling everything together. There's always sandboxing involved, like namespacing and security, like seccomp and these kinds of things. And there's an important component which is delivery, how you actually distribute it on your cluster. For this talk, I'm just going to focus on the first two: resource bundling and sandboxing. And I'm going to talk a little bit about how you can do these two things without involving a container manager at all, just by using systemd's own service management functionality. So let's jump right in. The first thing: resource bundling. Since systemd's inception, we have had the setting RootDirectory=. It's a one-to-one wrapper around chroot. chroot is the prototypical pseudo-containerization feature that Unix always had. It used to be semi-useful, and nowadays it's actually pretty useful — we'll come to that in more detail. What it ultimately does is invoke something with a chroot environment set up, so that what shows up as / there is not what the host sees. Something that is much younger in systemd and pretty closely related to this is RootImage=. 
Where with RootDirectory= you specify a directory that shall be the root for that one specific service, with RootImage= you specify a disk image — a binary blob that contains a file system of some form. RootImage= is actually pretty, pretty useful. The images you can specify there can be completely regular disk images that you could also pass to QEMU or something like that. They either need to be discoverable GPT — GPT being the partition table format, and by discoverable I mean that the partition types are properly tagged with what they're actually used for, so that you can recognize simply by looking at the partition table which one is the root partition and which one is the home partition. It also supports unambiguous GPT or MBR, which is not discoverable. By unambiguous, I just mean that if you have a partition table that contains only one partition, then it's pretty obvious that that's probably the root partition, right? So you can throw lots of different things at it: either you avoid all the ambiguities, or you make it discoverable so that it's clear what everything is. Or you can just point it at a raw file system — no partition table at all. You just generate something with mksquashfs, or you create a loop device and put a file system on it. That's fine, too. One tool to create these images is of course mkosi, but you can actually use whatever you like — debootstrap or yum, whatever you prefer. mkosi is a tool that I have been working on in the past months. It's ultimately supposed to be a wrapper around debootstrap and dnf, but it has a couple of bells and whistles that make it a little nicer to use. For example, it can do cryptography for you, which is actually pretty interesting. The RootImage= setting in systemd unit files can handle cryptography as well. 
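To make this concrete, here's a minimal sketch of what such a unit might look like. The service name, paths, and binary are made up for illustration; only the settings themselves are from the talk:

```ini
# /etc/systemd/system/myservice.service (hypothetical example)
[Unit]
Description=Bundled service running from its own root

[Service]
# Either point at a directory tree that becomes the service's / ...
#RootDirectory=/srv/myservice-rootfs
# ...or at a disk image: discoverable GPT, unambiguous GPT/MBR,
# or a raw file system image (e.g. made with mksquashfs):
RootImage=/srv/myservice.raw
ExecStart=/usr/bin/myservice-daemon
```

Only one of RootDirectory= and RootImage= would be used at a time; everything the service sees below / then comes from that tree or image rather than from the host.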
You can actually encrypt the images that you want to run there, and systemd will handle that properly. I think encryption is not that interesting for service management, but something closely related to it actually is, I think: dm-verity. For those who don't know, dm-verity is a system that protects file systems from modification. It was created originally for the Chrome OS project because they wanted to make sure that offline modification of Chromebooks is not possible — meaning you can leave your laptop in some unsupervised area, and people cannot just take out the hard disk, modify it, and put it back in without you noticing. Instead, every single read access is cryptographically verified, so that you detect changes. How does that apply to service management? Basically, if you use dm-verity-protected disk images, you can deploy your services on your systems and be sure that when they run, they run in the exact version you prepared, and that nobody has interfered with them offline — for example, during download, or while the system was already running. This is not useful for everybody, but it's certainly useful for a lot of people. As I already mentioned, RootImage= and RootDirectory= are just a fancy chroot. In fact, RootDirectory= is ultimately implemented with the chroot system call, at least under normal conditions — not always, but usually. chroots are highly problematic in many ways. I mean, you can make them work if you know what you're doing, but they come with lots of problems. One of them is, of course, that you first have to mount the API file systems into them — /proc, /dev and so on — otherwise programs will not actually run in this environment, because that stuff isn't there. That's something we handle in systemd with MountAPIVFS=. It's a Boolean. 
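As a sketch of how the dm-verity part ties into a unit file: systemd has a RootHash= setting for passing the verity top-level hash alongside RootImage=. The image path and hash value below are made up; the idea is only that the hash pins the exact image contents:

```ini
[Service]
RootImage=/srv/myservice.raw
# If the image carries a dm-verity hash partition, systemd can verify
# every read against this top-level hash; a wrong or tampered image
# then fails to activate rather than running silently modified code.
# (Hash value here is a hypothetical placeholder.)
RootHash=4a5b6c7d8e9f...
ExecStart=/usr/bin/myservice-daemon
```

If RootHash= is omitted, systemd can also pick the hash up from a file stored next to the image; either way the point is that the deployed service runs in exactly the version that was prepared.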
If you set it, it makes sure that after chrooting into this environment, you also get the /proc and /dev file systems there, so that everything just works. So that's one thing. The other thing is how to share data. On Unix, there are bind mounts for that. Bind mounts are excellent. Traditionally, if you used plain chroot the way people usually did, you would establish these bind mounts on the host, so they would always show up in the mount table of the host. In systemd unit files, you can use BindPaths= and BindReadOnlyPaths=, which basically allow you to map anything from the host to anywhere inside the chroot environment. And it will only show up in the mount table of the service itself, so it will not pollute your host. It's actually very easy to use, so I'm not going to go into much detail: you just specify either a single path, or a pair of paths specifying a mapping — what from the host should show up where inside the chroot environment. Pretty closely related to this is a relatively new set of features in systemd: RuntimeDirectory=, StateDirectory=, CacheDirectory=, LogsDirectory=, and ConfigurationDirectory=. Because usually, if you want to ship your service as a bundle — ideally a bundle that contains only the actual operating system executables — it's still interesting to have the mutable data reside on the host system. Specifically, you want something like runtime data, which is like a Unix socket or something like that. You want a state directory where your service can put stuff and it stays around. You want a cache directory where your service can put non-essential data, so that if it's flushed out, it's not bad — if it's there, it just speeds things up. You might want a logs directory, where your service can put logs, and a configuration directory, where its configuration lives. 
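The bind-mount settings just described can be sketched like this — the paths are invented for the example, the two syntaxes (bare path vs. source:destination pair) are the point:

```ini
[Service]
RootDirectory=/srv/myservice-rootfs
# Mount the API file systems (/proc, /sys, /dev) into the chroot:
MountAPIVFS=yes
# A single path: the host's /etc/resolv.conf appears at the same
# place inside the chroot, read-only:
BindReadOnlyPaths=/etc/resolv.conf
# A pair: host path on the left, destination inside the chroot on
# the right, writable:
BindPaths=/var/lib/myservice-data:/data
```

These mounts live in the service's own mount namespace, so the host's mount table stays clean.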
If you use these settings in unit files together with RootDirectory= or RootImage=, then this will work a little bit like the bind paths do: it's going to be mounted from the host into the chroot environment. However, it comes with a couple of bells and whistles. These directories are automatically created, and their lifetime is managed: because systemd knows about them, it can lifecycle them together with the service itself. For example, RuntimeDirectory=: if you use that, it will automatically create a directory for you in /run, which is where all the runtime stuff belongs, and that directory is automatically lifecycled together with the service itself. So let's say you're nginx. You are packaged as one of these bundles, and you want your /run/nginx directory — you can just specify it with RuntimeDirectory=. That basically means the /run/nginx directory is created the instant the service is started, and goes away automatically when the service is shut down. The other ones are similar to this, actually. It's basically a way to bundle everything, in a resource bundling sense, in a very nice way, while still sharing specific things and having them reside on the host. Which is nice for updates, for example, because if the data is on the host, it will be unaffected by updates — or not as directly affected — and you can update the bundles independently of it. These things are also pretty nice because they keep bundles self-contained. Because, traditionally, if you install a Unix service on some system, it will ship things like tmpfiles snippets or something like that, which create additional directories all over the file system hierarchy. 
But if we are actually interested in the bundling concept, then we don't need to do that — at least if all we want is a runtime directory, state directory, cache directory, logs directory, or configuration directory. By the way, the runtime directory is a subdirectory of /run that you configure that way. The state directory is a subdirectory of /var/lib that you configure this way. The cache directory is a subdirectory of /var/cache. The logs directory — you guessed it, probably — is under /var/log, and the configuration directory is a subdirectory of /etc that you configure this way. So yeah, that's that. That's how you can share data between a bundled service like this and the host or other stuff. But the bigger problem with chroots classically is how to share the user table, right? Because on Unix, the user table is usually maintained in /etc/passwd, and if you have a chroot environment, the /etc/passwd of the chroot environment is different from the one on the host. Because you still live in the same world, this can become a bit of a problem: the host's idea of who user lennart is might be quite different from the idea that the chroot sees. This is only a problem if you use chroots without user namespaces and PID namespaces. I mean, it's actually a problem that things like Docker have as well, except that the Docker people usually don't tell you about it. It's just not as visible, because if you disconnect the PID namespaces from each other — if you can't see the processes of the other users — it's not as visible that they still run as the same users. But the general solution is to actually use user namespaces, which we'll talk about here, which aren't that widely adopted on Linux yet, I figure, because they are hard to use. And if you ask me, they're kind of incomplete. 
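Summarizing the directory settings just listed as a unit fragment — the subdirectory name "myservice" is made up; each setting takes a name relative to its fixed parent directory:

```ini
[Service]
# Each setting takes a subdirectory name, not an absolute path:
RuntimeDirectory=myservice          # -> /run/myservice, lifecycle-bound
StateDirectory=myservice            # -> /var/lib/myservice
CacheDirectory=myservice            # -> /var/cache/myservice, flushable
LogsDirectory=myservice             # -> /var/log/myservice
ConfigurationDirectory=myservice    # -> /etc/myservice
```

Combined with RootDirectory= or RootImage=, these host directories are created, mounted into the service's root, and (for the runtime directory) removed again when the service stops.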
But yeah, so the question is, again: what do you do if you have your bundled service and you want to use RootImage= in a systemd service file — what do you do about the user database? My suggestion is to not share it at all. Instead, there's this Boolean option for services called PrivateUsers=. If you turn it on, it basically disconnects the user table that the service sees from the one on the host. This is ultimately implemented with userns. But instead of pretending that userns was a solution for everything and exposing the full functionality, it exposes it in one very, very specific way. What it does is install a mapping so that the root user of the host shows up as the root user that the service sees, the nobody user of the host shows up as the nobody user that the service sees, the user of the service itself is also mapped through like this, and everything else is mapped to the nobody user. This basically means it doesn't really matter what the bundled thing actually has in /etc/passwd, because we don't really care. We only care about the root user, the nobody user, and the service user itself. And root and nobody are actually the only users where all the distributions tend to agree on which user ID they have, right? root always has UID 0, and nobody has UID 65534. You get the concept. So with PrivateUsers= you can disconnect that. All the other users — the regular users that you might see in ps or something — don't actually matter anymore. And then there's an NSS module called nss-systemd which synthesizes user entries for root and nobody. 
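A minimal sketch of the PrivateUsers= setup just described; the user and image names are invented:

```ini
[Service]
User=myservice
RootImage=/srv/myservice.raw
# Disconnect the user table: the user namespace mapping passes
# through only root (UID 0), nobody (UID 65534) and the service's
# own user; every other host user appears as nobody inside:
PrivateUsers=yes
ExecStart=/usr/bin/myservice-daemon
```

The effect is that whatever /etc/passwd the image ships becomes largely irrelevant — only the three mapped identities matter.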
Which basically means: if you have nss-systemd enabled, which the distributions increasingly do, then you don't actually need /etc/passwd at all, because these users that everybody agrees on will exist anyway, regardless of whether /etc/passwd exists or not — nss-systemd is a module loaded into the user management of Linux that makes sure they always show up. There's one piece missing in this: if you have a bundled service, one that uses RootDirectory= or RootImage=, how do you make sure that from inside this environment, you actually see that the user ID you're running as has a specific name? I have some ideas about this, but it's going to be very technical, so I'm going to skip over that bit. So much for the bundling. The essence of everything I told you really is: use RootDirectory= and RootImage=. If you want the bundling with normal services, it should just work, and you can use standard images. And with the PrivateUsers= thing you can deal with the user database mismatch. But the other part of containers, besides the bundling, is of course sandboxing, and sandboxing is something we recently added a lot of features for in systemd. Basically all my remaining slides are about specific sandboxing features; we'll go quickly through them. The first one — I blogged about this one, it's actually one of the more interesting ones. You know how classic Unix services used to be sandboxed, right? It's all about user IDs. It's how we have been doing it since the '90s or even before. Like the Apache user, or httpd or something — Apache runs as that, and because it's not running as root, it cannot access whatever else is happening on the system. And traditionally this is how we put together our Unix systems: every system service had its own user ID it ran as, and was thus isolated in some way from everything else. 
It is, if you will, the quintessential sandboxing technology that Unix always had. It's widely adopted, but it's also kind of frozen in time, right? It has this problem that it's very expensive to allocate a user, because for the system users these are, most distributions define that you can have at most 1,000 of them. So if you install 1,000 services or so, you have a problem. This basically means that you cannot just allocate users on the fly, use them for something, and then release them, because there are simply too few of them to do this. And even if you did, there's a general problem on Unix: there is no scheme to release the ownership of a system user ID again. The ownership of a file, directory, IPC object, or whatever else Linux maintains is bound to a numeric user ID. At the time you create the object, it becomes owned by that user ID. Now if you want to reuse the user ID for a different purpose — because you only have 1,000 of them — you would first have to make sure you release the original resources: the files, directories, IPC objects and so on. But that's incredibly hard, because you would have to scan the entire file system for them. And what do you do if a user owned a file on some file system that's currently not mounted? So you cannot really properly solve that. Hence most distributions just declare, for safety reasons, that they will never release user IDs again. So if you install a package and then remove it, most of the files are removed, but the system users that were allocated are not. So you leave a major artifact in the system, and given that there are only 1,000 of them, that's pretty nasty. 
In systemd 235, the most recently released version, we have the DynamicUser= concept, which basically uses a couple of tricks to make this all more bearable, right? It's a Boolean option. If you turn it on for a service, it basically means that the instant the service starts, a new system user is allocated, and the instant the service shuts down, it's released again. How do we deal with the problems I mentioned — the fact that user ID ownership is sticky on Unix? There are two strategies. One of them is that we forbid creating objects in most places. This means, for example, that we use a couple of other sandboxing options that I'll talk about later that basically ensure the service has very few directories it can actually write to. And if it can't write anywhere, it of course cannot leave objects around owned by its user. The other strategy is to define some specific areas where the service can write after all, but then destroy these areas the instant the service goes down. Specifically, that's PrivateTmp=, for example. It's a simple Boolean, and it's actually implied if you set DynamicUser= for the service. It basically means that the service, as long as it runs, has a private directory below /tmp that appears as its own /tmp, and that goes away automatically when the service goes down. So, two strategies: forbid writing, and where we do allow writing, make sure it is removed again afterwards. That's our strategy there. DynamicUser= is also pretty nice because it keeps bundles self-contained, right? Traditionally, to install a system user you drop in a sysusers.d snippet, or you invoke adduser or something from your RPM. But that basically means you need to distribute stuff across the whole file system. 
And this way you don't have to do that, because the service file contains all the information about the user that needs to be allocated. It's nicely self-contained, and it leaves no artifacts in the system. So yeah, the focus is really on leaving no artifacts. One other sandboxing concept, pretty closely related to this actually, is RemoveIPC=. It basically just says that System V and POSIX IPC objects created by the service get automatically removed when the service goes down. These IPC systems are usually not that visible to administrators, but they're how processes communicate on Linux. It's a Boolean; it's also implied by DynamicUser=, but you can use it for everything else as well. If you set it, then when the service goes down, we iterate through the list of currently allocated IPC objects and remove every single one that matches the user ID your service ran as. PrivateTmp= I already mentioned — it gives you this private /tmp whose lifecycle is bound to the service itself. The result of this is, again, no artifacts left, right? You start the service, you shut it down, and all your temporary files and all your IPC objects go away with the service. There's another option, which is PrivateDevices=. I mean, you know, all these options are much more fine-grained than what you traditionally can do with containers, right? Containers are by default locked down very much and very much disconnected from the host. You don't see the process table, you don't see the user table — or at least you think you don't, but you actually do — you don't get access to devices, and things like that. In systemd, because we're coming from the other direction — traditionally services run with most privileges, because that's how System V init works — we go the other way around: we lock things down bit by bit. 
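Pulling the no-artifacts options together into one fragment — the daemon path is invented, the settings and their implication relationships are as described above:

```ini
[Service]
# Allocate a transient system user when the service starts and
# release it again when it stops. This implies PrivateTmp=yes and
# RemoveIPC=yes, so nothing the user owned survives shutdown:
DynamicUser=yes
# The implied settings can also be used on their own, on ordinary
# services with a static user:
#PrivateTmp=yes    # private /tmp, destroyed on stop
#RemoveIPC=yes     # SysV/POSIX IPC objects removed on stop
ExecStart=/usr/bin/myservice-daemon
```

With this, installing and removing the service leaves neither a user entry nor stray files or IPC objects behind.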
I wish it weren't so, of course, because security is always better if you start from the locked-down version and open things up bit by bit. But we can't, due to the System V heritage. But still — if you take these individual bits and use them one by one, you can build a very nice sandbox; you just have to turn them all on individually. Yeah, PrivateDevices= basically gives you a private instance of /dev that doesn't contain any real devices, right? What it does provide in /dev are the pseudo-devices like /dev/null, /dev/zero, /dev/random, /dev/urandom, which aren't real devices — I mean, there's no physical PCI card or something behind them, it's just a way Linux likes to expose its APIs. That's in contrast to, let's say, /dev/sda, which is an actual physical device, your hard disk, or your sound card devices. So with PrivateDevices=, you get disconnected from those: you still get the API character and block devices, but nothing else. Unless your service needs actual physical hardware access — and almost no service does — it's a great Boolean to set. There's PrivateNetwork=, which uses network namespacing to disconnect you from the host network. For every service that doesn't need networking, it's a great thing to do. Very recently we added something more fine-grained, which is a little bit like a firewall: you can basically configure, per service, which IP addresses it shall be able to access. You specify that simply by IP address and netmask, and it just works. There are a couple more things like that. For example, ProtectKernelTunables= takes away access to the kernel tunables in /proc and /sys for the service. ProtectKernelModules= takes away the ability to load kernel modules. These are all Booleans, by the way. 
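The options from this passage, sketched as one fragment. The two networking approaches are alternatives — the example address range is made up:

```ini
[Service]
PrivateDevices=yes          # /dev with only the API pseudo-devices
                            # (/dev/null, /dev/zero, /dev/urandom, ...)
PrivateNetwork=yes          # own network namespace, no host network
ProtectKernelTunables=yes   # kernel tunables in /proc and /sys
                            # made inaccessible/read-only
ProtectKernelModules=yes    # no kernel module loading

# Alternatively, instead of PrivateNetwork=, keep host networking but
# restrict reachable addresses with the per-unit IP firewall
# (address/netmask below is a hypothetical example):
#IPAddressDeny=any
#IPAddressAllow=10.0.0.0/8
```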
ProtectControlGroups= takes away the right to make changes to the control group file system. Yeah, then there is SystemCallFilter=, which allows you to apply a specific system call filter to a service, so you can lock it down such that dangerous system calls — for example, setting the system clock or rebooting the system — are not available. Traditionally it's pretty hard to use, because who actually knows all the system calls you would want to list there? It's a lot simpler now, because we have system call groups, which are basically named groups that make it easier to enable and disable specific facets. I've got about five minutes left now. There are quite a few more options, and I think it's not bad at all that we can't talk about all of them. Just quickly: with one of them, you can restrict address families, as in socket address families. With another, you can restrict the system call architectures, and so on. The message you should get from all of these is: we have all these sandboxing options these days, and you can use them. Much of this — not all of it, but much — is applied by container managers as well, to the containers they run. But the message you really should take home is that if that's what you're in for, if that's what you're looking for, then you can just do that for normal services as well. Just turn these Booleans on bit by bit, and you can run your stuff in a very locked-down fashion. So, yeah, this is not supposed to be a replacement for a container manager, not at all, right? But the reason I'm doing this talk is mostly that I work for Red Hat, right? And I come in contact with lots of people who use containers for various different things, right? Because containers are the big word, everybody tries to fit their specific problem into the container world. Like, for example, I met with storage people who want to ship their storage management stuff as a container. 
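The filter and restriction options just mentioned can be sketched like this; the particular group names chosen here (@clock, @reboot, @module) are examples picked for illustration, matching the dangerous operations named above:

```ini
[Service]
ProtectControlGroups=yes
# Deny-list filtering via named system call groups ("~" inverts the
# list), instead of enumerating individual system calls — here the
# clock-setting, reboot and module-loading calls:
SystemCallFilter=~@clock @reboot @module
# Restrict socket address families to IPv4/IPv6 and Unix sockets:
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
# Refuse system calls made for non-native architectures
# (e.g. 32-bit calls on a 64-bit system):
SystemCallArchitectures=native
```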
And that's certainly a great thing to do — until you notice that if you want to manage storage, you actually need hardware access, like block device access. And as soon as you need block device access, things become really, really hard with Docker, because it's not designed that way — it's actually supposed to take those rights away from you. And there are lots of stories like that, where people see containers as the solution for everything and then try to fit their problems into them. Most of the time it turns out that they're actually interested in the sandboxing, or interested in the bundling, but not actually so much in the rest of it. The message I really want to get across is that it's a fluid thing, right? Maybe containers are actually not the solution for you. Maybe you can use just plain service management and turn on the sandboxing, and there you go. Or maybe you can use plain service management and turn on the resource bundling, and there you go, and it solves your problems as well. Now, I think I've got about four minutes left, so maybe we should do questions. If anybody has a question — there's a question. Based on the current state of systemd, how far along do you think you are on the path to portable services? Portable system services, you mean? Well, my last slide here was about the outlook for that. Portable services are something I've been working towards in the long run: something where you really can just drop in a service bundle, like an image file, and then systemd will deal with the rest of it. Basically, it's a way that everything I presented on my slides is pulled together in one tool and made nice. At this point, we're basically there, right? All the individual building blocks that I want for the portable services exist. 
It's just a matter of writing the generator that looks at the image files you drop in, pulls out the relevant service files, makes them available on the host system, and points them back — so that RootDirectory= or RootImage= refers to the original image — so that they appear as native services. So it's mostly there. It's just about writing this generator to make it all fit together. It's one of the things on my to-do list, basically. It was a long way to get there, because adding all the sandboxing features, adding all the image handling fixes, figuring out what we actually want to do with the user database in chroot environments and things like that — that was a lot of work. But nowadays it's pretty much just a matter of actually writing the generator. Oh, by the way, something that's also really important to mention: because these bits are so fine-grained and you can pick exactly what you want, you can also use this to hook random other stuff up to systemd and make it run as a native systemd unit. For example, you could probably write a generator that takes an OCI image or something like that and dynamically converts it into a unit file. For those who don't know what generators are: generators are a systemd concept for dynamically converting foreign stuff that wants to run as a service into systemd unit files. We originally created them to convert System V init scripts dynamically into systemd units, but they're actually way more powerful than that — you can use them for all kinds of other stuff. So what I basically wanted to say here is that while I think portable services are a great way forward, none of this technology is specific to that, right? You can stick it together in completely different ways — write a generator from OCI to this, and it will work too. Any other questions? 
Nobody has a question — ah, they have a question. You said that user namespaces were kind of incomplete; could you elaborate a bit on why you think that is? Why it is, or how it is? How it is. I don't know — user namespaces have been around for a while, but you can only make them work for specific use cases. For example, Flatpak is probably one of the more sensible places where they're being used. But in general, we are lacking a shift file system — a UID-shifting file system. So whenever you actually want to use user namespaces the way they were originally intended, you have to shift around all the UIDs in your image, because otherwise everything will be owned by user nobody, and that's usually not how systems work. And the fact that you have to shift things around, that you have to do a recursive chown, is just awful; that's not solved to the end. Other than that, it's probably been the major source of security vulnerabilities in the kernel in the last months or years, right? I don't think it's solved to the end. I mean, it's super complex. I think it's very hard to use. I think it's way over-designed, because it allows arbitrary mappings from any user table to any other user table. I also think it's a problem that suddenly systems have much smaller user tables, right? Because you always have to slice up the 32-bit UID range you have into smaller pieces. And I don't know — I mean, we make use of it, like PrivateUsers=, the Boolean, uses it — but I don't think it's solved to the end, and I don't think there are many real deployments, right? Maybe Docker has code for it right now, but I'm not entirely sure people actually run it in the full mode, the way it's intended to be used. 
It appears very much to be something that is still in progress, and has been in progress for the last five years or so, and probably will continue to be in progress for the next five years or so, until we get a shiftfs or something in the kernel — which doesn't look very likely at this moment, as far as I know, at least. Any other questions? Okay, I think the time's over. Thank you very much for your time. Chris will probably do more announcements now, so please stay. Yeah, okay, thank you. Thanks, Lennart.