Hey everyone, just going to start up the screen share. I hope everyone can see this slide. A quick note before we start: you can resize the video and the screen share, because the screen share has probably taken over most of the space right now, so you can watch the presentation while asking questions in the Q&A box. Right, so my name is Christian. I work mainly on the upstream kernel, and I'm one of the maintainers of LXC and LXD, which together are sort of a whole container tool suite. We've been involved in containers, in kernel development and in user space, for quite a long time, and we've been working on making unprivileged containers more usable. Basically, having unprivileged containers cover use cases that traditionally could only be solved with privileged containers. I want to talk about a specific part of the work we've done in this area, which represents about one to two years of work, I guess. And it's not just my work; it involves a bunch of people. It builds on stuff we first implemented upstream in the kernel, which takes a while, obviously, and then we also had to make the necessary user space changes. That's why I usually say that for a feature to land and be fully usable takes about one to two years. This one is specifically concerned with something called syscall interception. I've talked briefly about this before, including a little today in another session I did with Stéphane, but this is a more in-depth view. So, here's the outline I thought might be useful. First, we're going to quickly recap what unprivileged containers are, because it's always a good idea to be clear on what we're talking about. Then I need to do a brief recap of syscalls, but not in any depth; I'm not going to go into architecture differences and so on, I'm just going to briefly describe how all of this works.
Then we'll look at how seccomp ties into that picture, what syscall interception is, how we can use seccomp for this, and how we can do syscall emulation. And then hopefully we have some time for demos, but I'm prone to running over, so... I promise to try my best. So, quick recap. I'm not going to go into a lot of detail, because I guess most people by now know what containers are. What are containers? The famous dictum that a bunch of people use is that containers are a user space fiction: they're a user space concept, and your opinion on what a container is is actually very much opinion-based. It's not something you can clearly define, because on Linux we don't really have a native container concept. It's more of a mishmash of various kernel interfaces, and it kind of depends on which interfaces you think are necessary. The ones most people associate with containers are obviously all the different types of namespaces that we have. But what most people could potentially agree on is that containers have to do with some sort of isolation mechanism, right? They provide you with some isolation from the rest of the system. You can't just create an arbitrary process and call it a container. Well, technically you could, but there wouldn't be much point. And the most important aspect of this isolation from the rest of the system, I guess, is the distinction between privileged containers and unprivileged containers. So what does this mean? Well, for privileged containers it means that if you're root inside of the container, you're also root on the host, in the sense that if you escape the container you will have root privileges on the host. So UID 0 inside of the container and UID 0 outside of the container have the exact same meaning, which, as I said, is going to be a problem if you escape a container. And we have been strong advocates for a long time:
if you can avoid using privileged containers for your workload, then avoid them, because in some sense they're a huge security liability. Unprivileged containers, on the other hand, are containers where UID 0 inside of the container and UID 0 outside of the container are different. So it looks to you like you're UID 0 inside of your container: if you check your UID inside of the container it tells you it's 0, but on the host the container is running as a completely different, completely unprivileged UID. These containers are driven, or enabled, by a specific type of namespace in the kernel, which I've talked about quite a bit before and also in the other talk today: the user namespace. This is what drives unprivileged containers. So just a quick recap. The user namespace in general is concerned with isolating the privilege concepts on a standard Linux system. That involves, first of all, obviously UIDs and GIDs, but it also involves, for example, capabilities. Here we're specifically talking about UIDs and GIDs. As I said, if you look at your container from the inside, you will have the impression that you're running as UID 0 and GID 0, so any program will usually behave fine. But if you look at it from the outside, you'll see that the same process, if you inspect it, appears to run as UID 100000 or UID 100001. How does this even work? Well, you can basically specify a mapping. You take a given range and say: here I have a range of UIDs and GIDs that a container needs in order to run a whole Linux system, and I'm going to carve out a chunk of IDs on the host, starting from, for example, ID 100000. Then I say that up to 200000 this range is dedicated to the container. And then I take the UIDs and GIDs that the container is running with, so 0, 1, 2, 3, 4, 5, 6, 7, and map them to 100000 on the host. So 0 maps to 100000, 1 maps to 100001, and so on. And it's a very flexible concept.
You can have holes in your maps, meaning you can say that a container should only map UID 0 up to UID 50, then leave a hole from UID 51 to UID 999, and then map everything from 1000 up to whatever, 100000 or so. So this is a pretty powerful security mechanism, especially when you also consider that the capabilities we have on Linux are also per user namespace. I touched on this briefly in the other talk as well. Meaning: if you're asking the question "do I have a given capability?", you're actually asking "do I have that capability in the relevant user namespace?". You're not asking globally whether you have that capability. Although I should mention a caveat: for a bunch of capabilities, or for operations that the kernel deems would affect the security of the whole system, the kernel will always check whether you have that capability in the initial user namespace. If you do, then you can perform that operation, and if you don't, then you cannot. So capabilities are in this weird half-namespaced state, let's put it like this. They're checked against your relevant user namespace, but sometimes the initial user namespace is the relevant one. But since user namespaces are a big security mechanism, they need to come with a lot of limitations. As I said before, any operation that could potentially affect the security of the whole system needs to ask for permissions and/or credentials in the initial user namespace. So what are the limitations? Well, there are quite a lot of them, and initially I thought I should do a talk about how we can make unprivileged containers more useful and then show all of the restrictions that we lifted or worked around over time, but that would just take too long. The major restriction that we care about for this talk, essentially, is that we can't mount block devices.
So even if you, let's say, buy an external hard disk, format it, decide that if this disk goes to hell you don't care, and hand it into the container, the container still wouldn't be able to mount it. The reason is the way file system parsing works on Linux: not all file systems are able to guarantee that they behave safely in the face of a malicious image. A file system parser could, for example, cause the kernel to loop. So mounting block devices is not something that is possible. That's not to say no file systems can be mounted in containers; usually it's the uninteresting file systems, though. So tmpfs. Well, proc is interesting, but proc, sysfs, tmpfs and a few others like binderfs can be mounted, whereas ext4, for example, or XFS and NFS and so on, those you can't mount inside of containers. That might change at some point in the far, far away future, but maybe not. You also cannot create device nodes. The example I bring up most of the time is: if you can create arbitrary device nodes, you can take over the whole system. You could create /dev/mem or /dev/kmem and then just write bytes into random kernel memory, or think about any other attack you could do by creating device nodes, say a device node for a new loop device or something. So in principle, and I mentioned this before, you cannot perform any operation that requires privilege on the host. But obviously, sometimes we know that an operation that requires privilege would be safe. Take the block device example: if the system administrator decides it's fine for a user to mount a device node that was exposed to the container, then there's not necessarily a reason why you should block this.
Or take something inside of the container creating harmless device nodes, such as /dev/zero, /dev/null or /dev/console. These are things we kind of work around anyway: a standard container system needs /dev/zero, /dev/null, /dev/random, /dev/tty and /dev/console in order to function, and right now we provide those device nodes for unprivileged containers by bind-mounting them in from the host. But if somebody from inside the container were to call mknod for /dev/console, there's not really any reason why this shouldn't succeed. For example, the init system starts up and tries to create a bunch of harmless device nodes; that should all be fine. So the question is, can we somewhat... well, elegantly is a strong word, but can we somehow get around these restrictions? And yes, we can, but we realized a while back that this would involve a bunch of kernel changes. We actually needed a new sort of concept in a specific part of the kernel for this to work in a general manner, such that we can potentially enable even more operations than just mounting block devices or creating device nodes. This is where we need to briefly talk about syscalls, because, right, how do you mount something? Well, you call the mount syscall. How do you create a device node? Well, you call the mknod or mknodat syscall. So, syscalls really quickly. Think of the kernel as essentially a request handler, and syscalls are the main requests it recognizes. There are other ways to request something from the kernel, but what usually happens is that user space wants an operation performed that only the kernel can perform, so it makes a request to the kernel, and the kernel looks at this request and decides: okay, this is safe to do, you have sufficient privilege for this operation. Or it says no.
I promise I won't go into too much detail, because obviously syscalls seem like a very easy concept, and then you remember that there are different architectures and per-architecture syscall conventions, and then you look at libc code to see how it's all abstracted away, and then you look into the entry code of the kernel, and then you go home and cry. Joking. But here's very briefly what this looks like. You see this upper line, the black line right here: the part above is supposed to be user space and the part below is supposed to be kernel space, and I know I cannot do graphics. What usually happens is you have a process, and that process wants the kernel to do something for it, so it does a system call. Then a magic transition into kernel space happens, and that's part of what I refer to as the syscall convention of that specific architecture. All architectures have a so-called syscall table, which is really just a giant table, an array, that contains all of the requests, or syscalls, that this architecture actually understands. Part of most syscall conventions, all syscall conventions really, is that you place a syscall number in a specific register. This syscall number is used as an index into the syscall table, and if that request is understood, so it's a valid syscall, the kernel will perform the syscall, and if it's invalid then the kernel will return ENOSYS, which is kernel speak for "I don't know what you want from me". Performing the syscall obviously involves another part of the syscall convention: how architectures return error codes and success codes and so on. It's really weird on some architectures; SPARC is an example. And then there's returning whatever value you're interested in, if the syscall returns a value, pointers, whatever. And that's the general, boring case, right?
But actually, a lot more happens, obviously, as I said before. One of the things that happens in the syscall path is that at some point something is hit which is called seccomp, which I've mentioned in previous talks too and which some people will know about. So, you make a request to the kernel, and before the syscall number is looked up in the syscall table, and that is the same for all architectures that support seccomp, seccomp is hit. Think about this: before the kernel even answers the question of whether this is a valid request that it understands, you run through something called seccomp, and seccomp has access to all of the details of the syscall, so the syscall number and the syscall arguments and so on. Seccomp can then make some sort of decision. For example, if you look at this green path, it can decide to skip the syscall and return an error code or success or whatever. Or it can say okay, continue, and then you enter the normal syscall path. The position where it sits is actually pretty interesting for what we're trying to achieve; maybe I'll have enough time to touch on this briefly. So, what is seccomp? Seccomp is short for secure computing, and it allows you to restrict the syscalls that a task is allowed to make. So it allows for the implementation of deny lists and allow lists. You can say, for example, that a task is only allowed to make a certain subset of syscalls, or, the other way around, that only a certain set of syscalls is blocked. This is obviously interesting for cases where your process doesn't really need access to all of the syscalls that are available, or where you have a process that needs to run with privilege but you still need to restrict it. You can use it for that.
It's used in all container runtimes, and it has first-class security mechanism status, in contrast to, for example, LSMs, which are part of the container world but not as widely used as seccomp. I don't even know if you can still turn seccomp off with a kernel config option anymore; seccomp is basically available on nearly all architectures and it has good user space support. There's a library for it and so on. So it's used everywhere from browsers to data centers, as I've written here. Also, you can write more complex seccomp programs using BPF, and this is actually not the BPF that most people nowadays think about. When people hear BPF nowadays, they think eBPF, extended BPF. What I'm talking about is cBPF, classic BPF, a term that was coined after eBPF got introduced; cBPF used to be just BPF. Think of it as a subset of eBPF: it's not as expressive and it doesn't allow you to do many of the fancy things that eBPF allows you to do. But you can write seccomp filters in it, and the cool thing is that you can then filter on a specific syscall and on the values of its arguments, with one restriction, namely that for seccomp any pointer argument is opaque. That's a limitation of cBPF: it doesn't allow you to dereference pointers. So you can only filter on arguments that are passed completely in registers, flags arguments for example, usually integers or unsigned longs. So you could filter on the flags argument of the mount syscall, or on the flags argument of the open syscall. You can say, for example: I don't want to intercept all mount syscalls, that doesn't make any sense. I'm not interested in remounts, because if somebody has apparently already mounted something, why do I care about MS_REMOUNT? So you filter out all mount syscalls that have the MS_REMOUNT flag passed.
Usually, when a seccomp filter triggers, it causes the syscall to be skipped and an error code to be reported back to user space. So in the case of restricting the syscall interface of a given task, you would usually report back EPERM or ENOSYS for an operation that you deem unsafe or unnecessary for this container to have. But the thing is, you start up a container, it loads a seccomp filter, and that seccomp filter is to some extent static. You load that filter once, and then if the filter triggers on a syscall with a given set of arguments, it will always trigger on that syscall. You cannot make a dynamic, case-by-case decision about whether or not you want to allow a given syscall. You cannot outsource the decision to user space. But it would be kind of neat if we could. So, looking at the point where seccomp is hit before entering the syscall table: what if we could intercept a system call and then somehow involve another process that could make the decision, instead of making the decision inside of the kernel? This is obviously interesting for the case I mentioned before: a container manager will often know better than the kernel, because it knows the policy of the administrator, and because it's supervising the container and knows its privilege level and so on, whether A, a given operation would succeed, and B, whether that operation is safe. The kernel needs to have a policy for every user of that interface, right? And it can only say: okay, this is safe for everyone, or this is unsafe for some, and therefore it must be unsafe for everyone. And this is where we came up with intercepting system calls. Well, I should probably qualify "came up with": ideas to do something like this have probably been circling around for quite a while, and seccomp itself is not something that other operating systems have never thought about.
Some version of this exists in some form in other operating systems as well. There is a related, although somewhat different, concept in OpenBSD, for example: pledge, where you restrict a process by classes of system calls. You could essentially emulate parts of that with seccomp. So yeah, intercepting system calls: this is kind of what seccomp already does. If we go two slides back, and I hope there's not too much lag here, this is labeled "interception" in this diagram. The syscall is sort of intercepted by seccomp before it actually proceeds along its normal path. What we wanted to do is outsource the decision about whether a system call is allowed to a user space process, in this example specifically a container manager. And what better to use than file descriptors? Joking, but this is actually what was used to implement this feature. A task needs to load its own seccomp filter; you cannot load a seccomp filter for another task. But since we are the container manager, and when we spawn containers we are in control of the child process before it execs, the child process in my example being the container: you fork, you synchronize between parent and child, you set up everything that you want to set up, and at the end you exec your init binary. At some point the child loads its seccomp filter. Since we're in control, and we need to cooperate anyway when we start the container, the child process, the future container, can retrieve a file descriptor for its seccomp filter, which is dubbed the seccomp notifier file descriptor, the seccomp notifier. That's a file descriptor referring to the seccomp filter. What does this mean? Well, if you've encountered this feature, or if you think about it, it's pretty obvious: it behaves like a standard file descriptor. For example, it can be polled. The polling concept is usually familiar from sockets, where you get notified about a new connection and then call accept on it and so on.
You get notified when a file descriptor becomes readable or writable and so on, and this is similar. You get notified as well, so you can put it into an epoll loop or into any notification mechanism that Linux provides. And then, when the task of interest performs a syscall that the filter registered for, you get a notification on the file descriptor, so you wake up. That indicates to the container manager, in this example: okay, someone made a syscall that I'm supposed to do something about, or that I'm interested in. You can then use an ioctl to read information about the syscall from that seccomp notifier fd. I'm glossing over details because it would take way too long, but it obviously includes the syscall number, the architecture, and the syscall arguments. The container manager can then go on and inspect the syscall arguments and make a decision based on what they are. It can even chase pointers, because proc exposes the memory of a process in /proc/<pid>/mem. So you can take the pointer-based syscall arguments as offsets into /proc/<pid>/mem, into the process's memory, and dereference the pointer arguments that way. There is a race condition involved, but in this case, and I will briefly touch on this, it's actually not that bad. So the supervising task can even read the memory behind a syscall's arguments. And once it has inspected the arguments and determined, okay, this is a harmless syscall, I don't really care if the container wants to do this, but I know it would fail, then it can do something about it and instruct the kernel to report back whether or not the operation succeeded. And this is where emulating syscalls comes into play. So now we have the pieces of this interface. The task loads a seccomp filter.
That seccomp filter is loaded with a specific flag which says: give me back a file descriptor for that filter. The container does that and hands the file descriptor off to the container manager. The container manager listens for syscall events on that file descriptor, inspects the syscall arguments, and then, given a policy that the user decided is fine, it can start to emulate the syscall in user space. Now, this is usually not the most performant thing to do, and one must also make very sure that it's done as securely as possible, obviously. But in general it's better than nothing: instead of having to fail a mount operation, for example, or a mknod operation, we can actually emulate that operation in user space. There are a bunch of problems with this interface which I'm briefly going to touch upon. I mentioned that seccomp doesn't let you chase, or dereference, pointers. So think about this: if you write a seccomp filter and that filter triggers on more system calls than you want it to, catching system calls that would succeed anyway, then it means you also need to emulate those, because originally there was no mechanism to tell the kernel from the seccomp notifier to simply continue a syscall. If you mount a tmpfs inside of the container, that would succeed anyway, so why emulate it? But given that you can't dereference pointers, it's pretty difficult to special-case tmpfs mounts in the filter, so I'd need to intercept those mounts as well and potentially emulate them, which is obviously terrible. The same goes for open, where you would want to filter on the path argument. Any syscall that is accidentally intercepted needs to be emulated as well. Well, we've since gotten around these restrictions by making it possible to continue syscalls from the seccomp notifier, but that needs to come with a big caveat. You need to be very careful. You cannot use this to implement a security mechanism, because there's a race condition.
A task performs a syscall, you inspect the syscall arguments, and in between you inspecting those arguments and telling the kernel to proceed, an attacker could rewrite the syscall arguments, and then you could be tricked into letting the kernel perform a potentially unsafe operation. What this means is that no security policies can be implemented with the seccomp notifier, especially not with continuing syscalls, and you need to be very sure that any operation you could be tricked into letting pass is one the kernel would already block anyway. That's actually the case for unprivileged containers, right? We're using this to elevate privileges where we think it's safe. We inspect the arguments once, we verify that it's still the same task, and then we make an informed decision about whether we're fine with performing that operation. If somebody rewrites the syscall arguments after this, we don't care, because we don't perform the original syscall anyway, we emulate it. And if we inspect the syscall arguments and decide, ah yeah, I'll just continue this, let it pass to the kernel, and somebody rewrites the arguments to something dangerous, the kernel will block all mknod syscalls from the container anyway, so no harm done. Work that is currently ongoing is making it possible to retrieve file descriptors from a different task; there is a specific syscall targeted at this. That allows you to bridge socket connections: you could retrieve a socket file descriptor the container created, and so rewrite socket connections and so on. And there's injecting file descriptors, so that, for example, when the container calls open, you can inject a file descriptor as the result of the open syscall. This is important, for example, for Chrome, which wants to replace its current way of doing sandboxing with the seccomp notifier.
So there's a bunch of stuff here, and, oops, I've sped up a little, so I can actually still do the demo. I should have a little bit of time left, so I'm going to briefly stop the screen share and share a different screen. Hopefully you can all see this. Let me make it a little bigger. So this is LXD running here; you might have heard about it. This is our container manager, which we've been working on for a long time now. We're just going to launch a standard container, fairly quickly, and go into this container. Nothing fancy in here. Now let's say I wanted to create a device node. Let's look at /dev/console. /dev/console, as you can see right here, is 5, 1. That's the major and minor number. So if I wanted to create /dev/console, I'd need to do mknod my-console c 5 1, a character device, c, major 5, minor 1, I hope that's the right order. And then, huh, it fails. Well actually, why would it matter if we created this? We have /dev/console in /dev anyway, as you can see right here. So you could actually allow this. How do we allow it? Well, we use syscall interception. We actually have a config key for this: security.syscalls.intercept.mknod set to true. Then we need to restart the container, because this is a change to the seccomp filter, and you cannot live-update seccomp filters. So, okay, let me see if I lied. Same exercise: mknod my-console. That worked. Cool. This actually created a character device node for us. So, let's look at this a little. Not sure if I have time to show mount. But when I call the mknod syscall, look at the upper half of this demo: "handling mknod syscall". That means the container manager, in this case LXD, has been notified that the container made a syscall with syscall number 133. Here is the architecture type. There's the seccomp notify ID, which is a unique token that makes sure that the task we received the syscall from is still alive and still the same task.
And then there are the syscall arguments, which in this case are the device number, the PID of the task that performed the syscall, and the path that was used. And obviously this doesn't unblock everything: you still can't create other device nodes, right? You can't suddenly do mknod my-block b 4 1 or whatever. That all won't work; you get "operation not permitted". In a similar way, we intercept the mount syscall. I have a couple of minutes for questions and I will go answer them, but I think the mount interception stuff is kind of cool. So let's set security.syscalls.intercept.mount to true. And I'm also saying that I want to allow btrfs file systems. Now I'm adding a btrfs device: lxc config device add f4 ... unix-block source=... path=/dev/my-block. I've just added a device node to the container; we allow doing this dynamically. And then... oops, I called it my-block, right? So right now I shouldn't be able to mount this file system, because I haven't restarted the container yet, so the new seccomp profile doesn't apply, but I just want to show that this doesn't work right now. Permission denied, great. That's how it should be: you shouldn't be able to mount random block devices. So let's restart f4, and force it so it goes a bit quicker. And then you can already see up here: "handling mount syscall". Those are, by the way, all of the mount syscalls that we accidentally intercept, because we can't filter on pointer arguments, but we just let them continue, so nothing changes for the functionality of the container. This is a btrfs file system, so: mount my-block, as I said. Let's see. Ah, well, it worked. The syscall got intercepted, you see it right here, and we mounted the btrfs file system. Actually, I did something stupid, because I mounted LXD's own btrfs storage back end.
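The demo steps above boil down to a handful of LXD config commands. A sketch of the sequence, where the container name `c1` and the `source=/dev/sdb` device path are placeholders for whatever you actually have:

```shell
# Allow mknod interception for the container
lxc config set c1 security.syscalls.intercept.mknod true

# Allow mount interception, restricted to btrfs
lxc config set c1 security.syscalls.intercept.mount true
lxc config set c1 security.syscalls.intercept.mount.allowed btrfs

# Pass a host block device into the container as a device node
lxc config device add c1 my-block unix-block source=/dev/sdb path=/dev/my-block

# Seccomp filters cannot be updated live, so restart to apply them
lxc restart c1 --force
```

After the restart, `mknod` for the allowed device nodes and `mount` of the allow-listed file system succeed inside the otherwise unprivileged container.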
It stores all of the containers, and I mounted it into the container, but it served as an illustration. So yeah, we do this for the mount syscall and for the mknod syscall, and we do it for one specific instance of the setxattr syscall. But this is a mechanism that would potentially allow you to do this for a whole range of syscalls; you just need to make sure that you're erring on the side of safety. And yeah, I guess that's it from me so far, and I'm ready to take a bunch of questions. So let's start with the oldest one. I see right here: "Does the syscall interception feature require LXD? Does it work only with LXD as a container manager?" As far as I know, yes. We've implemented this in the kernel and in our user space tools, and I don't know of anyone else who's currently using it. Chrome, as I said, wants to switch to this in the future, but for that we need file descriptor injection to be available, which is something that Sargun is working on right now. There are patches for this upstream, maybe even already in linux-next; I don't quite know right now. Next: "Do you have a GitHub repo to download the code for your work?" Well, just go to the LXC and LXD repositories. Well, this is a bit tricky. It requires cooperation between the container manager and the container, right? So we've implemented a protocol, based on Unix sockets essentially, that abstracts away the intricacies of the kernel implementation, to keep this extensible. I'm trying to say in a very complicated manner that you need to read the source code in two repositories: you need to look at both LXC and LXD. I hope that helped. Okay, I think those are the only two questions I see so far, but I can stick around. Oh: "Can you expand on the security aspects?" So many. One thing, for example: I spoke about emulating the syscall, right? Think about it, to some extent glibc is in the same sort of spot.
Sometimes it emulates functionality that user space expects if the kernel doesn't provide it. And this is always kind of a risky business, because you need to emulate all of the security restrictions that the kernel would take into account when performing that operation, right? So think about the mknod syscall. Are you root? Do you have the necessary capabilities in your user namespace? Do you have a specific devices cgroup policy that prevents you from doing stuff, and so on? What is your current cgroup state? What are your current UID and GID? What are your current capabilities? What is the task's seccomp status? What is the task's LSM status? And so on. So if you want to emulate this one-to-one, because you're performing a syscall on behalf of another task, that's potentially a long list of things that you need to take into account. Now, it isn't that bad, because... so if you can administer the LXD daemon, we assume you have root privileges anyway. So if your administrator said it's fine for these syscalls to be performed, then in the general case it's okay. For example, we have a hard-coded policy for mknod that only allows the devices that we would be fine with bind-mounting into the container at startup anyway. So that's fine. Other security aspects concern, of course, as I said, continuing syscalls. For LXD this is fine; for unprivileged containers this is fine, because the kernel is your ultimate security boundary when it comes to syscalls. The kernel blocks anything that it deems unsafe anyway. If you continue a syscall and somebody writes bogus arguments in there, then the kernel will just look at the syscall and go: no. So the kernel is your ultimate safety net, and rewriting syscall arguments is not really a big deal. But think about when you guard a privileged process, right?
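To give a feel for how long that checklist gets, here is a small Python sketch that pulls a few of the attributes mentioned above (UID/GID, effective capabilities, seccomp mode) for a task out of /proc. A supervisor emulating a syscall would have to take all of these, and more, into account; this only scratches the surface (cgroup state, LSM labels, and so on are not covered):

```python
# Read a few of the per-task attributes a supervisor would have to
# consider before emulating a syscall on a task's behalf. Linux-only:
# parses /proc/<pid>/status.

def task_security_context(pid=None) -> dict:
    path = f"/proc/{pid}/status" if pid is not None else "/proc/self/status"
    wanted = {"Uid", "Gid", "CapEff", "Seccomp"}
    ctx = {}
    with open(path) as f:
        for line in f:
            key, _, value = line.partition(":")
            if key in wanted:
                ctx[key] = value.strip()
    return ctx


if __name__ == "__main__":
    # Inspect our own task as a stand-in for the intercepted target.
    print(task_security_context())
```

In the real implementation the supervisor inspects the target task, not itself, and has to do so race-free, which is part of why emulating syscalls is such a risky business.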
You implement a security policy for a privileged process and you decide whether or not it's safe to perform a given operation. Now you have a problem, because the kernel isn't your safety net anymore. You're the safety net; your decision is the safety net. And you can be tricked. You can lose a race, as I said before, where somebody rewrites the syscall arguments after you inspected them and made a decision. You let the syscall continue, the kernel is like, fine, I don't care, and performs the operation, and then the process is like, okay, time to take over the system. So these are things to keep in mind. And for anyone who wants to use this, especially the continue part: I've placed a long, long comment in the kernel header about how the notifier can be problematic in this respect. And it stresses every time that this cannot be used to implement a security policy. I should probably make everyone chant this, but we're running out of time. So yeah, the security implications of this are definitely interesting. Oh, so briefly: Chrome. I think I heard you mention Chrome wanting to use the seccomp features as well. I'm not very familiar with containers; could you give a concrete example of what they're hoping to do with that? So, I'm not a browser developer, so what do I know about browsers? But they restrict helper processes quite heavily, especially the ones that do encoding and decoding and so on. And for example, when a task wants to perform an open syscall, I think by default Chrome doesn't let it do the open itself. Instead, the sandbox gets notified of the open syscall with a specific seccomp feature where you get a SIGSYS signal when a task performs a syscall. And then it does some weird trickery to perform the open for the task, essentially.
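The time-of-check-to-time-of-use race described here can be sketched without any kernel involvement at all. In the toy model below, a supervisor validates a snapshot of the syscall arguments, but the (hypothetical) target rewrites the shared arguments before the "kernel" acts on them, so the check is bypassed exactly as described; names and structure are illustrative, not kernel API:

```python
# Toy model of the continue-the-syscall race. The supervisor inspects
# the arguments, but the kernel later re-reads them from the target's
# memory, so a malicious target can swap them in between.

target_memory = {"path": "/tmp/harmless"}  # argument in target memory


def supervisor_check() -> bool:
    # Supervisor reads a snapshot of the argument and decides.
    snapshot = target_memory["path"]
    return snapshot.startswith("/tmp/")


def kernel_execute() -> str:
    # The kernel re-reads the argument at execution time; it never
    # sees the supervisor's snapshot.
    return f"open({target_memory['path']})"


decision = supervisor_check()          # True: "/tmp/harmless" looks fine
target_memory["path"] = "/etc/shadow"  # target rewrites the argument
result = kernel_execute() if decision else "blocked"
print(result)  # open(/etc/shadow), the check was bypassed
```

For unprivileged containers this race is harmless, because the kernel still refuses anything the container wasn't allowed to do anyway; it only becomes dangerous when the supervisor's decision is the last line of defense for a privileged process.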
But this becomes a problem now, because glibc wants to block all signals, if I remember correctly, when it starts new processes or creates new threads, and you can obviously see the problem here. When you rely on receiving a SIGSYS signal, when you rely on receiving a specific signal, but then glibc, for its own security reasons, blocks these signals, then you obviously won't receive them anymore, which means the sandboxing solution that you used before won't work anymore. And so this is why they want to switch to the notifier mechanism. It's also, in my opinion, the much more obvious way to do this. Put another way, this is actually what they wanted all along, but never had, or never knew that they wanted, or maybe they knew that they wanted it but there wasn't a way to do this in the kernel. So they want to use it for sandboxing, to answer what they're hoping to do with that part. Good. I think if there are no more questions, then thank you very much.