So, one second. Hello, everyone. You should hopefully see my slides. Just before we start: the slides may have taken over your whole screen, but you can resize them so you can see the video at the same time, and you can also type questions while watching the talk. So make sure you make use of that.

I'm Christian. I'm a kernel engineer working mostly on the upstream kernel, and I maintain various bits and pieces in the core kernel. I also maintain and develop LXC, LXD, and LXCFS. LXD is a container manager, essentially, which can scale up to 10,000 containers, and we develop both the userspace side and the kernel side. So often we either think about new features that we want or need, that we think would make life easier for our users, and push them upstream, or we run into bugs that we need to fix or that give us ideas for new features. Then we implement them in the upstream kernel, and we usually make use of them downstream.

What I've done over the last two, or I guess three, years now is that at least once a year, when I go to conferences, I give a talk with a short overview of things that happened in the upstream kernel that I think are relevant for containers and for userspace. It doesn't necessarily have to be a feature that is only relevant for containers, but I usually try to focus on those. And I'm doing the same this year again. Interestingly, you would think that development stops at some point, but apparently it doesn't; people always come up with new ideas. These are just some of the highlights I picked out. I could make an endless array of slides about various kernel features, but these are just a few I'm going to mention here. So let's dive right in.

One of the things that got done very recently is reworking procfs. As everyone remembers, procfs on Linux is a dumping ground for features and information about the whole system. And it really is a dumping ground: everybody keeps adding files to proc. Procfs has been around for a long time and it's incredibly useful, but it also has a bunch of problems. It exposes information about specific processes through the /proc/PID directories: under there you can read a process's memory, look at the file descriptors the process has open, look at networking information, which is a whole other subdirectory within that directory, and so on. It also exposes system-wide information like /proc/cpuinfo, /proc/meminfo, and /proc/stat. And it exposes what is known as sysctls, so you can tweak the system: you can set, for example, the maximum number of PIDs you want to allow on the system, the maximum number of threads, and so on. So it's pretty powerful, actually.

But it has a bunch of issues, as I've mentioned. One of the gravest issues we had for a long time is that you essentially only had a single proc mount per PID namespace, PID namespaces being relevant for containers, obviously. There was an internal proc mount that was created when a new PID namespace was created, and when you mounted proc with a specific set of options, those options became the de facto mount options for all instances of proc inside that PID namespace.
Well, that meant that if you mounted proc again in the same PID namespace but changed a mount option, that mount option would just silently be ignored and not applied. So if you had a procfs instance in your PID namespace that was mounted showing all processes on the system, and you wanted to mount a second procfs instance and specified hidepid=2, which means hide all processes but my own, then this mount option would be ignored and you would see all processes. That could obviously lead to security issues, and it's also just not very nice in general; it's incorrect behavior. But there were actually reasons why, for a long time at least, this had to be this way, or at least nobody really bothered to fix it.

Another issue was the so-called overmount protection that procfs has. As I said, procfs exposes a bunch of information about the whole system, and often you don't necessarily want to expose this information to something like a container. This can have two reasons. One, as I said, you can write to some procfs files, and in a privileged container you may want to block that. So one of the ways you can protect certain files or directories, if you want to hide information or just make it inaccessible, and this is what userspace usually does, is overmounting: for example, mounting /dev/null over a file, or mounting an empty directory over a directory, at which point all the files underneath that directory disappear. But now the kernel has sort of a problem: if it allowed the same filesystem, in this case procfs or sysfs, to be mounted again, all the information you've hidden under those mounts would be exposed. So the kernel says: no, you can't mount a second procfs instance, because there is a procfs instance mounted that is not fully visible, meaning it has hidden mounts on top of it. So if you wanted to mount another procfs, say to run nested containers, while hiding information from proc by mounting things over certain files or directories, you needed a fresh, fully visible copy of procfs mounted somewhere for procfs to be mountable again. That was kind of a problem as well.

Luckily, someone went through the effort of reworking procfs and removed the internal procfs mount that was responsible for mount options not being applied; it was basically an implementation detail, and it got removed. So now you can have multiple procfs instances within the same PID namespace with different mount options applied, which is great, obviously. You can have one instance with hidepid=2 and one procfs instance without it. And a new mount option actually got added, a new security option: hidepid=ptraceable, which means only those processes will be shown in proc that you can actually ptrace, which is pretty good.

As for the overmount protection: it's not that the protection got removed, obviously, but you can at least now mount procfs without exposing all the information you might not want to expose to the container. This is a more invasive solution than just overmounting individual files or directories, but you can specify the mount option subset, which takes a sub-argument, in this case pid: subset=pid means mount proc, but only show the /proc/PID directories and nothing else. A minimal sketch of both new options follows below.
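To make the new options concrete, here is a minimal sketch in C of mounting two procfs instances with different options in the same PID namespace, which is exactly what the rework makes possible. The mount points are hypothetical and must already exist; this assumes a kernel with the rework and the privileges needed to mount proc.

```c
/* Minimal sketch: two procfs instances with different mount options
 * in the same PID namespace. Before the rework, the second mount's
 * options would have been silently ignored in favor of the first
 * mount's. Mount points are hypothetical. Build: cc -o p p.c */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* One instance that only shows processes the caller can ptrace. */
    if (mount("proc", "/tmp/proc-restricted", "proc", 0,
              "hidepid=ptraceable") < 0)
        perror("mount hidepid=ptraceable");

    /* A second instance, same PID namespace, different options:
     * only the /proc/<pid> directories, nothing else. */
    if (mount("proc", "/tmp/proc-pid-only", "proc", 0, "subset=pid") < 0)
        perror("mount subset=pid");

    return 0;
}
```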
That's obviously great. There was actually an implementation of this idea a long time back, called pidfs, and this is essentially that implementation, just as a mount option on proc. So you can have multiple procfs instances, and you can also mount a restricted view of procfs in containers, which is obviously great.

But that's hopefully not all we can do with procfs in the future. This is not part of the actual patchset right now, if I remember correctly, but it's something that was at least brought up on the list and something I would like to explore further. Obviously, if you run a container, you often want to expose container-specific information, not system-specific information. Containers will usually make use of cgroups, and you will limit the number of CPUs a container can access and the amount of memory it can access. You might also want to show a dedicated uptime based on the init process the container is running, and a container-specific load average, which you can, with some complications, calculate from cgroups. The solution we've been using for a long, long time, I don't know for how many years, is a tiny FUSE filesystem implemented in userspace, which I mentioned right at the beginning of my talk: LXCFS. What LXCFS does is virtualize various aspects of proc. For example, if you start a container and restrict it to execute on four of your eight CPUs, LXCFS will be smart enough to only show those four CPUs in the /proc/cpuinfo file, and it will also make sure that only four CPUs are listed in your /proc/stat file. Similarly, if you restrict memory for that container, LXCFS will make sure to only show the amount of memory you actually have available. But it's a userspace solution, and while we've made it as performant as possible, partly via some kernel patches, actually, it's obviously not an ideal solution. It would be nicer if the kernel had a built-in feature for virtualizing certain aspects of procfs. So this is something I definitely want to explore in the future, and hopefully something we can discuss at the Linux Plumbers Conference, which will take place in August, virtually as well. The idea is that certain procfs files, and at this point it's unclear which files, would be virtualized by the kernel relative to a PID namespace, and probably relative to cgroup information. Obvious candidates are cpuinfo, stat, and meminfo; there might be a bunch more, uptime would be cool, and load average would be cool as well. If at some point we could end up with a partially virtualized procfs, that would actually be pretty great.

Right, another big feature that landed just recently, which has also been discussed and worked on for a long, long time. The procfs example I mentioned before is not something new; that procfs needs changes to accommodate containers has been discussed for a long time. Similarly, time namespaces go back at least to 2006, when somebody first had a use case for them. But now they've actually become a reality.
Most people, I would assume, know that we have had seven namespaces in Linux so far: the PID namespace, cgroup namespace, user namespace, IPC namespace, UTS namespace, mount namespace, and network namespace. They all have their uses when building containers; some of them you might consider essential for containers, and some of them you might not. But we've now added an eighth one: the time namespace. It virtualizes CLOCK_BOOTTIME and CLOCK_MONOTONIC.

You might think: well, what's the use case? One of the biggest use cases, and it's actually the people behind this project who implemented the feature, is CRIU, Checkpoint/Restore In Userspace, which is used to checkpoint processes and restore them. By extension, it's used by a lot of container runtimes to checkpoint a container and, for example, move it from one physical host to another, sending all the information that CRIU stores on disk over the network to the other host and then restoring the process there. But that can lead to problems: you can end up in scenarios where, if you move the process to another host, CLOCK_MONOTONIC suddenly appears to be decreasing rather than increasing when the process is restored, which is obviously problematic, plus a bunch of other problems you can run into. What the time namespace allows you to do is specify offsets for CLOCK_BOOTTIME and CLOCK_MONOTONIC, to make sure a container sees the correct time, or rather the time it expects.

Currently it can't be specified at creation time the way all the other namespaces can: you can't request it with the clone syscall, because we've run out of flags in the legacy clone syscall, and we haven't settled on the interface for the new clone3 syscall, which we implemented last year. That's something we still need to do. So what you do right now is unshare a time namespace, write the offsets, and then setns into the namespace, and then the namespace you've changed into has the correct offsets set. There's a small sketch of this dance below.

As I said, it's useful for container migration. One thing I probably should mention, because people always keep forgetting this: it currently can't be used to sync time via ntpd inside a container. That's what people immediately think: oh, a time namespace, now I can run ntpd in the container. No, that's not the case. It was briefly considered, but it would have made the implementation of time namespaces way more complex than what we have right now. If somebody has a good use case for it and is willing to do the work, by all means try, and we'll see if that gets upstreamed, if you really want to have ntpd inside of a container. But yeah, we have time namespace support, and I'm sure other container runtimes will pick up time namespaces pretty soon as well.
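Here is a minimal sketch of the unshare-then-setns dance just described, assuming a kernel with time namespace support. The offsets (ten days on CLOCK_MONOTONIC, five on CLOCK_BOOTTIME) are arbitrary illustration values; writing the offsets file needs privilege (or a user namespace) and must happen before anything enters the namespace.

```c
/* Minimal sketch of the current time namespace dance: unshare a time
 * namespace, write offsets to /proc/self/timens_offsets, then setns()
 * into it. Run as root or inside a user namespace. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080 /* fallback for older headers */
#endif

int main(void)
{
    /* Create a new time namespace; children (or setns) will enter it. */
    if (unshare(CLONE_NEWTIME) < 0) {
        perror("unshare(CLONE_NEWTIME)");
        return 1;
    }

    /* Offset CLOCK_MONOTONIC by +10 days, CLOCK_BOOTTIME by +5 days.
     * Only allowed while the namespace is still empty. */
    int fd = open("/proc/self/timens_offsets", O_WRONLY);
    if (fd < 0) {
        perror("open timens_offsets");
        return 1;
    }
    const char *off = "monotonic 864000 0\nboottime 432000 0\n";
    if (write(fd, off, strlen(off)) < 0)
        perror("write timens_offsets");
    close(fd);

    /* setns() into our own "time namespace for children" so the
     * offsets take effect for us. */
    int nsfd = open("/proc/self/ns/time_for_children", O_RDONLY);
    if (nsfd < 0 || setns(nsfd, CLONE_NEWTIME) < 0) {
        perror("setns(CLONE_NEWTIME)");
        return 1;
    }

    /* /proc/uptime is boottime-based, so it now shows the offset. */
    execlp("cat", "cat", "/proc/uptime", NULL);
    perror("execlp");
    return 1;
}
```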
Okay, so the next two features I'm going to talk about are pretty interesting work; I liked both of them. Both are related to seccomp syscall interception, which I talked about yesterday as part of another talk. You all know seccomp: it's a way to restrict syscalls for a container. You can write complex filters in seccomp using cBPF, classic BPF, so not the BPF most people think I'm talking about; it's not eBPF, extended BPF, it's the predecessor, classic BPF. So you can write filters, and you can filter syscalls based on their arguments as long as they're not pointers, because for seccomp, all pointers are opaque.

One of the features we implemented last year, and when I say "we", by the way, I mean container people in general; we've done a good chunk of the work I'm talking about here, but I don't want to give the impression that we're the only ones working on this, though I do maintain or help maintain a bunch of this stuff upstream in the kernel. So we implemented something that extends seccomp's syscall interception. To some extent, seccomp always intercepts syscalls: you make an entry into the kernel for a given syscall, and before the kernel looks up the syscall in the syscall table and performs it, seccomp gets a say. It gets access to the syscall arguments and so on, applies the seccomp filter you have written, and then, based on whether you have an allow or deny list, the syscall might be permitted, might be forbidden, might be skipped, and so on. The problem we had is that this isn't dynamic: once you've loaded the filter, that filter will always do the same thing for a given syscall. You cannot easily change the filter while the process is running. So we wanted to make seccomp more flexible, so that you can outsource decisions about whether a given syscall succeeds to userspace, and this is what we implemented with what is called the seccomp notifier.

The seccomp notifier is a type of file descriptor, a file descriptor for a seccomp filter. When you start a container and it loads a seccomp filter, you get a file descriptor for that filter, and you can hand that file descriptor off to your container manager process, which can put it into an epoll loop or any other general way of getting notifications on a file descriptor from Linux; there are multiple ways to do this. Then, when the container performs a syscall the filter triggers on, a notification is generated, the seccomp file descriptor becomes ready for reading, your container manager gets notified, and it can use an ioctl to read the syscall information, well, actually a bunch more information, from that file descriptor. It can parse the arguments the syscall was given. If the syscall was given pointer arguments, it can even go into the /proc/PID/mem directory, race-free, and I'm not going to go into the details, and read the memory the pointer is pointing to. Then it can decide whether the syscall is supposed to succeed. It can even emulate the syscall in userspace: for example, you inspect the arguments and see, oh, it's performing a mount syscall for something I'm fine with this container mounting, so I mount the filesystem on the container's behalf and then have the kernel report back to the process that its mount syscall succeeded, even though on its own it would have failed. This is a very powerful mechanism for emulating syscalls; a condensed sketch follows below.
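As a hedged, condensed sketch of the notifier flow, assuming Linux 5.0+ headers: one thread installs a filter that forwards mount(2) to userspace and then calls mount(); a second thread in the same process (so no fd passing is needed; a real container manager would receive the notifier fd over a socket) services the notification and fakes success. A production filter would also check seccomp_data->arch; that's omitted for brevity.

```c
/* Condensed sketch of the seccomp notifier. Build: cc -pthread n.c */
#define _GNU_SOURCE
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mount.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int notify_fd = -1;

static void *supervisor(void *arg)
{
    struct seccomp_notif req;
    struct seccomp_notif_resp resp;

    memset(&req, 0, sizeof(req));
    /* Blocks until the filtered thread triggers the filter. */
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0) {
        perror("NOTIF_RECV");
        return NULL;
    }
    printf("intercepted syscall %d from pid %d\n", req.data.nr, req.pid);

    /* A real manager would inspect req.data.args[], maybe read pointer
     * args via /proc/<req.pid>/mem after re-checking the id with
     * SECCOMP_IOCTL_NOTIF_ID_VALID, and act on the task's behalf.
     * Here we just pretend the syscall succeeded. */
    memset(&resp, 0, sizeof(resp));
    resp.id = req.id;  /* must echo the request id */
    resp.error = 0;
    resp.val = 0;      /* "your mount() returned 0" */
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp) < 0)
        perror("NOTIF_SEND");
    return NULL;
}

int main(void)
{
    struct sock_filter insns[] = {
        /* Load the syscall number (no arch check, for brevity). */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        /* mount(2)? Forward to userspace. Everything else: allow. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_mount, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(insns) / sizeof(insns[0]),
        .filter = insns,
    };
    pthread_t t;

    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    /* Load the filter and get the notifier fd back. */
    notify_fd = syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                        SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
    if (notify_fd < 0) {
        perror("seccomp");
        return 1;
    }
    /* The supervisor thread inherits the filter but never calls
     * mount(), so it is never blocked by it. */
    pthread_create(&t, NULL, supervisor, NULL);

    /* Would normally fail for an unprivileged process; the supervisor
     * answers it with "success" instead. */
    int ret = mount("none", "/tmp", "tmpfs", 0, NULL);
    printf("mount() returned %d\n", ret);

    pthread_join(t, NULL);
    return 0;
}
```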
Again, if you're interested in hearing more about this, I gave a talk about it yesterday, so you can probably rewatch that video. But there are a bunch of syscalls you can't yet intercept the way you might want to. This involves a whole range of syscalls that either take or return file descriptors. Say the container manager, which holds the seccomp file descriptor for another process, gets a notification on that file descriptor that a certain syscall has been performed. Think about the connect syscall: connect takes a socket file descriptor and an address, and then you can connect that socket to somewhere. It would be really cool if we could somehow bridge that socket connection, right? The container manager sees that the container wants to connect to a certain address, but the container manager thinks: well, that might be the address you think you want, but actually I'm going to connect you to something else. But since these are two distinct processes, and usually your container manager and your container don't share a file descriptor table (that would be kind of weird), you can't really interact with the container's file descriptor, and you also can't reopen sockets through proc, so that's not an option either.

But what is possible: we came up with a method to retrieve a file descriptor from another process, which wasn't easily possible so far, at least not for all types of file descriptors. We added a new syscall called pidfd_getfd. It relates to pidfds, which we'll briefly touch upon a little later; for now it suffices to say a pidfd is a file descriptor for a process, a stable reference to a process. So you specify the pidfd, and then you can retrieve a file descriptor from another process, in this case, for example, from your container. The container will be blocked anyway, because it's waiting for the kernel, for seccomp, to allow it to continue the syscall. So when you intercept the connect syscall, you can retrieve the socket file descriptor and connect it to wherever you want that container to be connected to. So you can rewrite a container's connections, which is a pretty cool idea. You can see there are a lot of possibilities this opens up; it's something I'm pretty excited about. And this might be useful outside of seccomp, which is why, in this case, we made it a separate syscall; a bunch of people have registered interest in this syscall independent of seccomp, just for getting file descriptors out of another process. So pidfd_getfd allows you to do this, as long as you can ptrace the target process. A small sketch follows below.
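A minimal sketch of pidfd_getfd(), assuming Linux 5.6+; the target PID and fd number come from the command line purely for illustration, and the syscall numbers are defined as fallbacks for older headers.

```c
/* Minimal sketch of pidfd_getfd(): grab a copy of another process's
 * file descriptor via its pidfd. Gated by the same ptrace access
 * you'd need to attach to the target. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_pidfd_getfd
#define __NR_pidfd_getfd 438
#endif

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <remote-fd>\n", argv[0]);
        return 1;
    }
    pid_t pid = atoi(argv[1]);
    int remote_fd = atoi(argv[2]);

    /* A pidfd is a stable reference: even if the PID is recycled,
     * this fd keeps referring to the original process. */
    int pidfd = syscall(__NR_pidfd_open, pid, 0);
    if (pidfd < 0) {
        perror("pidfd_open");
        return 1;
    }

    /* Duplicate the target's fd into our own fd table. For an
     * intercepted connect(2), this would be the socket fd, which we
     * could then connect(2) wherever we see fit. */
    int fd = syscall(__NR_pidfd_getfd, pidfd, remote_fd, 0);
    if (fd < 0) {
        perror("pidfd_getfd");
        return 1;
    }
    printf("got local fd %d for fd %d of pid %d\n", fd, remote_fd, pid);
    return 0;
}
```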
Related to syscall interception again is the reverse direction: injecting file descriptors into a remote process, into another process. I've talked about connect, where you want to get a file descriptor connected and be done with it. But sometimes you also have syscalls that return file descriptors: open, for example, accept, and a bunch of others. So imagine you're supervising a process and you don't want it to do its open syscalls itself, either because it's so locked down that it couldn't open the files, or because you don't want it to have any direct filesystem access at all. This is, I guess, more of a use case in the browser area of things, where you have a broker process that does all the open calls for you. So you want to inject file descriptors into a given process.

And originally we wanted to make this a separate syscall as well, because we thought it might be useful in general, but that would have been fundamentally very difficult, which we figured out in the discussion, because Linux has a bunch of assumptions about how file descriptor tables work, and one of the fundamental assumptions is that you can only mess with your own file descriptor table. You can only install files into your own file descriptor table; you can only close files in your own file descriptor table. This makes things a whole lot easier. So injecting file descriptors via a syscall into another task is actually a bit more involved than you would think it should be. So this is a feature that is tied to seccomp itself. Sargun has worked on this, or is working on this, and I think it's sitting in linux-next; I could have verified this before the talk, but I think it's in linux-next now.

So what you can do is inject file descriptors into another task via seccomp, and you can do a bunch of cool things. You can not just add a file descriptor to a task; you can also replace a file descriptor. Replacing essentially means: the task thinks file descriptor 4 refers to /dev/console, and you could technically, although this is probably a very bad idea, replace that file descriptor 4 and make it refer to some random file on your system. So this is actually a pretty powerful mechanism. And why we did it with seccomp, as I touched upon: the way it works is that the task installs the file into its own file descriptor table; this is just how seccomp works here, without going into too much detail. So you don't have the whole problem of needing to inject a file into, or replace a file in, another task's table from outside. Seccomp was actually a natural place to do this.

Injecting file descriptors into another process and retrieving file descriptors from another process make the syscall interception feature a whole lot more powerful than it initially was, and I'm excited to see what people are going to do with this. We're already using it, and I know there's interest from various browsers to replace their current implementation of something similar, where a broker moderates, for example, open syscalls for the various sub-processes or sub-threads they maintain. So this is going to be exciting work in the future, and hopefully we'll see more ideas and more users around this. A rough sketch of the proposed interface follows below.
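Since this interface was still settling in linux-next at the time of this talk, treat the following as a rough sketch of the ADDFD ioctl as proposed (it later landed as SECCOMP_IOCTL_NOTIF_ADDFD in Linux 5.9). It's a helper meant to plug into a notification handler like the notifier sketch above: notify_fd and req come from there, and the path being opened is made up.

```c
/* Rough sketch of fd injection from the supervisor side. The struct
 * and ioctl are defined as fallbacks for pre-5.9 headers, matching
 * the UAPI as proposed. */
#include <fcntl.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>

#ifndef SECCOMP_IOCTL_NOTIF_ADDFD
struct seccomp_notif_addfd {
    __u64 id;          /* notification id we're answering */
    __u32 flags;
    __u32 srcfd;       /* our fd to copy into the target */
    __u32 newfd;       /* target fd number, with SETFD */
    __u32 newfd_flags; /* e.g. O_CLOEXEC in the target */
};
#define SECCOMP_ADDFD_FLAG_SETFD (1UL << 0)
#define SECCOMP_IOCTL_NOTIF_ADDFD SECCOMP_IOW(3, struct seccomp_notif_addfd)
#endif

static int inject_fd(int notify_fd, struct seccomp_notif *req)
{
    /* The file we open on the task's behalf, e.g. while emulating an
     * intercepted open(2). Path made up for illustration. */
    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0)
        return -1;

    struct seccomp_notif_addfd addfd = {
        .id = req->id,
        .srcfd = fd,
        .newfd = 0,  /* let the kernel pick the lowest free fd;
                      * SECCOMP_ADDFD_FLAG_SETFD would instead
                      * replace the fd given here */
        .flags = 0,
        .newfd_flags = O_CLOEXEC,
    };

    /* Returns the fd number as seen inside the target task; that's
     * what we'd hand back as the emulated syscall's return value
     * via SECCOMP_IOCTL_NOTIF_SEND. */
    return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
}
```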
Next up, we have something that userspace has been doing for a long, long time in a really hacky way, because the kernel didn't provide a method for it, which is probably true of a lot of features: closing multiple file descriptors at once. I mean, you know that right now you can essentially close only one file descriptor per syscall, via close. But often, for example when you exec in your process, you want to make sure that all file descriptors apart from a few, usually 0, 1, and 2, are closed. Or, as systemd sometimes does, you reorder the file descriptors the process is supposed to inherit so that they are 0, 1, 2, 3, 4, 5, and so on; those are the file descriptors the process is supposed to inherit when it execs, and everything else is supposed to be closed.

The way most of userspace implements this, as far as I can tell, is via one of two solutions. Either you parse /proc/self/fd, extract all the file descriptor numbers, and call close on them; so you have the cost of parsing through proc and then calling close on all the individual file descriptors, and you need to repeat this loop in case somebody is injecting or opening new file descriptors, if your file descriptor table is shared. Or you do the hardcore variant: you just say everything from 3 up to some high integer number, 32,000 or so, or UINT_MAX probably, is supposed to be closed, and then you just call close on each of them and then you exec. That's obviously not ideal; it's pretty costly. And as we all know, hardware bugs have made syscalls significantly faster... obviously not: syscalls have become more costly because of the mitigations. So closing multiple file descriptors at the same time would be pretty cool.

So this is why we're currently in the process of adding a new syscall, sitting in linux-next right now, called close_range. By the way, applications that close all file descriptors and have a need for this: Python, for example; I've seen it in Rust; it used to be in libc, but they removed one of the loops where they needed it. So there are a bunch of applications that would use this. close_range allows you to close a whole range of file descriptors at the same time, and the kernel does it in one go, which is obviously way more performant.

It also takes a flag, which was also discussed a short while ago: CLOSE_RANGE_UNSHARE. What usually happens, if you want to make sure that nobody can inject file descriptors while you are closing a bunch of them, is that you call unshare(CLONE_FILES), which means that if you had a shared file descriptor table, you now get your own private file descriptor table. Then you can be sure that nobody is injecting file descriptors into your table, then you close all the file descriptors, and then you execve. CLOSE_RANGE_UNSHARE is basically doing the same thing: it moves this unshare logic, unsharing the file descriptor table, into the kernel itself, right before it closes all the file descriptors. The neat thing is that it also potentially has performance benefits: normally, unsharing a file descriptor table means copying all of your file descriptors, but if the kernel sees that you're closing everything above a certain boundary, it only needs to unshare the table up to that lowest boundary, and it doesn't need to copy or close any of the upper file descriptors at all. So yeah, I'm pretty excited about this, because we have this loop as well, making sure that we don't inherit file descriptors, for example when we spawn helper processes and so on. So we make use of one of the hacky userspace solutions right now, and that would be gone with the close_range syscall, which is pretty cool. A small sketch follows below.
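A minimal sketch of close_range() as it sat in linux-next (it later landed in Linux 5.9 with syscall number 436); the fallback defines cover older headers. The classic close-everything-above-stderr-then-exec pattern becomes a single syscall:

```c
/* Minimal sketch of close_range() with CLOSE_RANGE_UNSHARE. */
#define _GNU_SOURCE
#include <limits.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_close_range
#define __NR_close_range 436
#endif
#ifndef CLOSE_RANGE_UNSHARE
#define CLOSE_RANGE_UNSHARE (1U << 1)
#endif

int main(void)
{
    /* Close every fd from 3 up to the maximum in one go. With
     * CLOSE_RANGE_UNSHARE the kernel first gives us a private fd
     * table, so nothing can be injected while we close. */
    if (syscall(__NR_close_range, 3, UINT_MAX, CLOSE_RANGE_UNSHARE) < 0) {
        perror("close_range");
        return 1;
    }

    /* Only 0, 1, and 2 survive into the new program. */
    execlp("ls", "ls", "-l", "/proc/self/fd", NULL);
    perror("execlp");
    return 1;
}
```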
I promised I was going to mention this briefly. Last year, we introduced a concept that is not specific to Linux; it has implementations on other platforms as well, under other names, and the Linux-specific name is pidfds. The problem is PID recycling: if you have a low limit on the number of PIDs on your system, and the default limit used to be around 32,000 for a long time, and you have a system where a lot of processes are created, you could easily end up in a scenario where the process you think you're interacting with has been recycled, especially if it's not directly a child process. You also couldn't easily observe when another process exited; there were ways around this, but they were really hacky as well. So we introduced the concept of pidfds, which are file descriptors for processes. You kind of get a stable, private reference to a process, and this allows you to avoid various race conditions, because even if that process's PID is recycled, the pidfd will keep pointing to the original process; you just get ESRCH, kernel speak for "that process doesn't exist anymore", if you try to use it. And you can send signals through it, and so on.

One of the things we always thought would be useful is using these pidfds for namespace management as well, meaning you should be able to pass them to the setns syscall. The setns syscall is quite important because it's used every time you want to interact with the namespaces the container is using. You call open on /proc/<container-pid>/ns/user, for example, and you get a file descriptor to the user namespace of the container; then you call setns on it, and it switches you into the user namespace of that container. Now think about it: you need that open call and the setns call, and, including the time namespace, we're at eight namespaces now. If you don't stash these namespace file descriptors away somewhere, you're looking at eight open syscalls and eight setns syscalls, so 16 syscalls to change into all the namespaces of the container at the same time. That's obviously not ideal, especially when you consider that there are a few namespaces, not a lot, but a few, that you can fail to attach to. So you could end up in a state where you're half attached: you're already attached to a bunch of namespaces, but you fail to attach to the others, and now you're in this weird half state.

So we thought: if we could use a pidfd, we could also specify multiple namespaces at the same time, and this is what you can actually do now. Assuming you have created your container with clone, requested that a pidfd be returned to you, and the kernel supports it, you can take that pidfd, pass it to the setns syscall, and specify all the namespace flags you want, CLONE_NEWUSER, CLONE_NEWNS, CLONE_NEWPID, and so on, and you get moved into all those namespaces at the same time, atomically, which is pretty cool. If you specify all eight namespaces the container was spawned with, the way we've implemented this is that the kernel will make sure that either you succeed in attaching to all of those namespaces, in which case it commits them and you're attached to all of them, or, if you fail to attach to any of them, the kernel will not have altered a single thing. So if you fail, you're not in some half-switched state; you're still in your original namespaces, and nothing has changed for you. So this is pretty cool. And obviously the great advantage, as I said before, is that we're down from 16 syscalls to one. A sketch follows below.
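A minimal sketch of attaching to several of a container's namespaces atomically through its pidfd; support for passing a pidfd to setns() landed around Linux 5.8, and the PID here comes from the command line for illustration. Either all requested namespaces are joined or none are:

```c
/* Minimal sketch: one setns() call instead of an open()+setns() pair
 * per namespace, using the container init's pidfd. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <container-init-pid>\n", argv[0]);
        return 1;
    }

    int pidfd = syscall(__NR_pidfd_open, atoi(argv[1]), 0);
    if (pidfd < 0) {
        perror("pidfd_open");
        return 1;
    }

    /* Atomic: if any of these can't be joined (e.g. CLONE_NEWUSER
     * fails when we're already in the target user namespace), none
     * of them are, and our namespaces are left untouched. */
    if (setns(pidfd, CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWPID |
                     CLONE_NEWNET | CLONE_NEWIPC | CLONE_NEWUTS) < 0) {
        perror("setns");
        return 1;
    }

    /* Note: CLONE_NEWPID only takes effect for children we create
     * from now on, not for this process itself. */
    execlp("sh", "sh", NULL);
    perror("execlp");
    return 1;
}
```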
What we still need, and what I hopefully will implement soonish, is a new flag meaning "all namespaces which are different from my own". Why is this relevant? If I setns into the user namespace of a given container, I have a problem if that user namespace is the same as my own, because the kernel will not allow me to attach to my own user namespace; otherwise I could regain privileges, tricking the kernel into giving me back capabilities, essentially. So right now, if you specify a bunch of flags and CLONE_NEWUSER is among them, you might fail to attach simply because you're already in the same user namespace. So you need to verify whether you're in the same user namespace before attaching. We should simply add a single flag that expresses: move me into all of these namespaces that are different from my own. This is future work. And the great thing is that with this, pidfds become essentially the only token you need to interact with a container, which hopefully makes things a lot easier: no more opening proc and so on. You can also setns into containers if you don't have proc mounted. So this is hopefully useful.

See, this is where we get into the phase where I have a lot more stuff to talk about but need to cut it short. Another feature we added for 5.7, which, like a lot of the stuff here, also has uses outside of containers, is spawning containers directly into cgroups. Containers often use cgroups for resource limitation and resource distribution, right? And what you usually do when you start a container is create the cgroup for the container, then fork off your process, and at some point, when you're done, you move the process into the target cgroup. People don't realize this involves costly locking in the kernel, because you need to take the write side of the cgroup semaphore, which is expensive. And, well, people do realize that it can cause accounting jitter: there is a window where the process lives in the same cgroup as the parent process and uses resources from that cgroup rather than from its target cgroup. That's usually not a big deal, but if you're into accounting, it's kind of annoying. So we thought it would be cool if there was a new flag that you could set with the new clone3 syscall, together with the target cgroup you want the process to be created in. And this is what we added in 5.7: you can set the CLONE_INTO_CGROUP flag with clone3 and specify the target cgroup the process should be created in. The caveat is that it only works with the unified cgroup hierarchy. Why? Because with unified cgroups there is only a single hierarchy, in contrast to legacy cgroups, where the convention was that each controller, like memory, cpuset, and cpu, was mounted into a separate hierarchy. And unified cgroups are the future anyway, so we thought this is the way to go. You take a file descriptor for your target unified cgroup, and the container gets spawned directly into that target cgroup. It won't live in the parent's cgroup for any amount of time; it gets created right in the target cgroup, and you don't need to take this costly write lock on the cgroup semaphore. So if you spawn a lot, a lot of containers, this is actually cheaper. A minimal sketch follows below.
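A minimal sketch of clone3() with CLONE_INTO_CGROUP, assuming a 5.7+ kernel and cgroup v2; the cgroup path is made up and must exist with limits already configured. The struct mirrors the kernel's UAPI struct clone_args so the sketch doesn't depend on brand-new installed headers:

```c
/* Minimal sketch: spawn a child directly into a target cgroup. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/types.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef __NR_clone3
#define __NR_clone3 435
#endif
#ifndef CLONE_INTO_CGROUP
#define CLONE_INTO_CGROUP 0x200000000ULL
#endif

/* Mirrors struct clone_args (CLONE_ARGS_SIZE_VER2, 88 bytes). */
struct my_clone_args {
    __u64 flags, pidfd, child_tid, parent_tid, exit_signal;
    __u64 stack, stack_size, tls, set_tid, set_tid_size, cgroup;
};

int main(void)
{
    /* Hypothetical pre-created cgroup with limits already set up. */
    int cgroup_fd = open("/sys/fs/cgroup/mycontainer",
                         O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (cgroup_fd < 0) {
        perror("open cgroup");
        return 1;
    }

    struct my_clone_args args;
    memset(&args, 0, sizeof(args));
    args.flags = CLONE_INTO_CGROUP;
    args.exit_signal = SIGCHLD;
    args.cgroup = cgroup_fd;

    pid_t pid = syscall(__NR_clone3, &args, sizeof(args));
    if (pid < 0) {
        perror("clone3");
        return 1;
    }
    if (pid == 0) {
        /* Child: born directly in the target cgroup; there is no
         * window where it runs in the parent's cgroup. */
        execlp("cat", "cat", "/proc/self/cgroup", NULL);
        _exit(1);
    }
    waitpid(pid, NULL, 0);
    return 0;
}
```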
Another use case we recently thought this might be useful for: right now there's a patchset under discussion upstream called core scheduling, which was originally meant to address some of the new exploits, the MDS exploits we have seen, L1TF, and so on, by making sure that only mutually trusting processes get scheduled on the same core, so they can't attack each other. It's kind of funny, because some people have probably read the CROSSTalk paper, which kind of proved that even with these scheduling strategies you will still have issues, but it's still kind of a nice idea. And if it's going to be cgroup-based, which is at least one of the approaches being discussed, you could create processes right in your target scheduling cgroup, which is pretty nice for this use case as well.

There's a bunch of other stuff I just want to briefly mention before I close and open up for questions. We want to do shiftfs at some point; this is still up for discussion, and we have multiple patchsets that we need to converge into an upstream solution, so it's not forgotten, it just takes long. We briefly had a discussion about loopfs, which would make loop devices available in containers; we'll see if that goes anywhere or if we end up with a different solution. And we'll probably also come up with a new syscall in the new mount API to recursively change mount attributes on a whole tree of mount points. But for now, this is a bunch of features I thought were worth mentioning; there's a ton more, and the only thing that helps in figuring out which ones are interesting for your use cases is to go look at the kernel. Otherwise, I hope I could give you an update on what is happening in container kernel land. So, hit me with questions.

It's probably fine if I read them out loud. "What protections are in place to limit access to pidfd_getfd?" Well, first of all, it's gated by a ptrace-may-access check, so you can only get a file descriptor from another process if you can ptrace that process. Because if you have ptrace access, you can already do this in hacky and convoluted ways: CRIU is essentially doing, or at least used to do, something similar with what they call parasite code injection; that's not my phrasing, it's CRIU's. So you can already do this with ptrace, but it's very nasty and very hacky. The check right now is: if you can ptrace the target task, you can retrieve a file descriptor from it. If you're worried about the syscall in general, block it with seccomp. (I'm going to stop the screen sharing.)

"Can setns manipulate namespaces as an array?" I'm not quite sure what this means, but I'm going to guess you mean whether you can attach to multiple namespaces at once, and then I'm going to say yes; otherwise I'm not sure I understand the question.

"What does the API to inject an fd into another process look like?" So it's going to be seccomp-based. The current proposal we have is that it's going to be a new ioctl on
the seccomp notifier file descriptor that I briefly mentioned, which has a bunch of ioctls on it. One of the ioctls lets you read the notification struct from that file descriptor, and another one lets you write the response back to the kernel. And ADDFD would basically be another ioctl where you specify a struct that contains a file descriptor you have opened, and then you say: I want this file descriptor to be injected into the target task, or to replace a descriptor there. As I said, it's under discussion upstream. And as for the permissions, it's guarded by the security_file_receive LSM hook, I believe, so it should also be guarded by ptrace, I think; I'd need to verify this again before telling you nonsense, I haven't looked at that patch in a while.

"What more complexity would be required to run ntpd in LXC/LXD?" Well, you would need to extend the time namespace to support way more than what time namespaces do right now, so actual kernel work would be involved. Well, that depends: you can run ntpd in a container right now, you just can't sync time inside the container, because that aspect is not virtualized. So it's possible to run ntpd; it's just not possible to use certain features of it. I'm no expert on this.

"Will the cgroup have the same access to resources once the process is passed into the target?" Let me briefly reiterate how this works. You specify the file descriptor for the target cgroup. Presumably, and this is what I would advise, you create that cgroup and set up the limits you want, then spawn the process in there; although the order is up to you, I think, with the new unified cgroups. And then the process is restricted right from the start. That's the idea of how this works: you don't have a scenario where the process is unrestricted for some really tiny amount of time before you move it somewhere. If it's spawned into a cgroup that has some sort of limit set up, that process is restricted right away. As for the permissions, if that's the other part of the question: the permissions are the same as for moving a task into another cgroup. The same permission checks that would take place if you were to write a PID into a target cgroup apply when you spawn a process directly into a cgroup. If you don't have permission to spawn a process into that cgroup, your process creation fails.

Yeah, and with that, I think I'm out of time. I hope you found this useful. I'll be around in this Slack channel thingy, so you can ask a bunch more questions right there if you want. Otherwise, if you have more questions, you can always write me a mail. I don't promise to respond, but I try to. So thanks. Bye.