Okay, so let's continue. Our next talk is going to be from IBM: Mike Rapoport is going to be talking about address spaces for namespaces. Something like that.

I have been working at IBM Research for a couple of years now, and our current research field is how to make containers more secure. Since address space isolation has been the ultimate protection method since the invention of virtual memory, we are trying to see if we can make use of the MMU and address space isolation inside the kernel mappings to make the Linux kernel environment more secure, particularly for container engines and container users. One of our assumptions is that vulnerabilities are inevitable: there will always be a hole somewhere, some system call that doesn't take proper care of its edge cases, and these can be exploited. So if we can create restricted address spaces for the execution of some kernel functionality, then a vulnerability that exploits something in such a restricted address space still leaves the attacker a harder life, because kernel code and data that is not mapped in the restricted address space stays out of reach.

The first example of a restricted address space in the Linux kernel is PTI, of course. It's more restricted for users than for the kernel, but still, the page tables of user space processes ceased to contain most kernel mappings at some point, because of some problems you probably know of. There is also ongoing work in a similar direction for KVM: Oracle developers are working on creating a restricted context for the execution of VM entry and VM exit, trying to avoid mapping sensitive data in the context of VM guest mode. And AWS proposed creating areas that are mapped only in the context of a certain process and are not mapped in the rest of the kernel page tables.

Our group is targeting container security, where the entire system call interface is the attack surface and the major isolation primitive is Linux namespaces. So we've been thinking about how we can use the MMU, page tables and restricted address spaces to make namespaces safer, and the idea we've been working on for the last couple of months is to assign a page table per namespace.

Before that, we tried something we call system call isolation, which was an attempt to run system call execution in a very restricted address space. It was a kind of continuation of the PTI approach: we took the PTI page table, added to it the minimal mappings necessary to enter a system call, and then the actual system call ran without any of the kernel code or data mapped. It faulted a lot, and it was not so fast, as Peter said. It kind of worked, to the extent that I could measure it and run a micro-benchmark, but after I did it I realized it's not a solution that is going to fly. What we intended to do was use these page faults to verify that access to kernel code is safe, and in this way prevent possible ROP attacks. Whenever a system call tries to execute new functionality, it faults. This causes a page fault, and in the page fault handler we can run verifiers that ensure the call is performed to a known symbol and not into the middle of some function. When the access is considered safe, we map the required page into the restricted mapping; if the access is not safe, we kill the process.
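Conceptually, the verifier looked something like the sketch below. This is a reconstruction, not the actual patch: sci_fault() and sci_map_page() are hypothetical names, and the kallsyms check stands in for whatever symbol validation the real prototype did.

    #include <linux/types.h>
    #include <linux/kallsyms.h>
    #include <linux/sched.h>
    #include <linux/sched/signal.h>

    /* Hypothetical helper: maps the page containing addr into the
     * task's restricted page table. */
    static void sci_map_page(struct task_struct *tsk, unsigned long addr);

    /* Conceptual sketch only, not the actual patch. */
    static bool sci_verify_code_access(unsigned long addr)
    {
            unsigned long size, offset;

            /* The target must be a known kernel symbol... */
            if (!kallsyms_lookup_size_offset(addr, &size, &offset))
                    return false;

            /* ...and must be the start of a function, not its middle,
             * which is what breaks ROP gadgets. The real verifier has
             * to allow more cases, e.g. resuming inside a function. */
            return offset == 0;
    }

    /* Called from the page fault handler for instruction fetch faults
     * taken inside the restricted system call address space. */
    static void sci_fault(unsigned long addr)
    {
            if (!sci_verify_code_access(addr)) {
                    force_sig(SIGKILL);  /* unsafe access: kill the task */
                    return;
            }

            /* Access considered safe: map the page into the restricted
             * page table and let the faulting instruction retry. */
            sci_map_page(current, addr);
    }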
Again, it didn't really fly; it was just our first attempt at looking into using page tables for improving container security.

The other thing we tried was a simple patch, about 200 lines, to create special mappings that we called MAP_EXCLUSIVE. A memory region in a process is considered exclusively owned by that process and is isolated from the rest of the system. For instance, it can be used to store secrets in memory: one can map such a region with that flag and then read data into it, and this data won't be visible to the kernel or to any other process in the system. The mailing list thread this started is already at 40 emails and will probably grow more. One of the suggestions that appeared in that thread is that, instead of using mmap, madvise, mprotect and such, it may be worth using fd-backed memory and creating a /dev/exclusive-memory, as somebody suggested, or a /dev/secure-memory, as somebody else suggested. It would be a character device that behaves similarly to memfd, and in this way we can reduce the complexity of such exclusive mappings, because the mmap method of the char device implementation would take care of removing the right memory pages from the direct map and making them visible only to processes that own the file descriptor.

Another huge problem with creating regions of memory whose properties differ from what the normal direct map uses is fragmentation of the direct map. The direct map is usually mapped with large pages (1GB or 2MB on x86), and whenever we extract some small memory region from it, we split those huge page mappings, and there is nothing that can merge them back: there is THP and compaction for user space pages, but there is nothing like that for kernel mappings.

The last one is not yet upstream, and probably won't be any time soon, but we are still working on it. In the container world, namespaces create isolation by virtualizing some of the kernel objects and making sure that every namespace owns the part of the kernel objects it is using. So we are trying to create a page table per namespace, and then the objects that are anyway private to that namespace will be visible only in the page tables of the processes that run in that namespace. For instance, if there are buffers, devices, whatever kernel structures that belong to a particular namespace, they will be mapped only in that namespace's page tables and will not be seen by the rest of the kernel.

The network namespace was our first guinea pig: a network namespace creates an independent network stack for each instance of the namespace, all the objects in that stack are private anyway, and most of them never cross namespace boundaries, except SKBs. So we created a prototype that maps the objects of a particular network namespace in its own page table and makes these objects invisible in the other page tables of the kernel. It was also pretty simple, a diff of about 200 lines. It's not that it's all working already, but we've added a PGD to the network namespace structure, and we force that PGD to be used by every process that joins the network namespace in clone(), unshare() and setns(). We also impose a restriction that every thread of the same process must live in the same network namespace. And we did a proof of concept with several objects.
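As a rough sketch of the shape of that change, under assumptions: the struct and helper names here are made up, and the real diff hangs the PGD off struct net and has to deal with TLB flushing and the process's existing threads.

    #include <linux/mm.h>
    #include <linux/sched.h>

    /*
     * Hypothetical sketch: the prototype attaches a PGD to the network
     * namespace (struct net in the real kernel). It starts out as a
     * copy of the shared kernel mappings; namespace-private objects
     * then get mapped only into it.
     */
    struct netns_pgtable {
            pgd_t *pgd;             /* per-namespace kernel page table */
    };

    /*
     * Hooked from clone(), unshare() and setns() when a task joins the
     * namespace. Since every thread of a process must live in the same
     * network namespace, switching the whole mm over is well defined.
     */
    static void netns_adopt_pgd(struct task_struct *tsk,
                                struct netns_pgtable *npt)
    {
            tsk->mm->pgd = npt->pgd;
            /* Real code must also flush TLBs, and reload CR3 if tsk is
             * the current task; omitted here. */
    }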
The next step was to create extensions to the page allocator and the slab caches, to allow allocations that are also private to a certain namespace. The pages or objects allocated with kmalloc and friends are then visible only inside the page table of the process on whose behalf they were allocated, something like that. We also took some special care for objects traversing namespace boundaries, like SKBs: whenever an SKB crosses a namespace boundary, it gets unmapped from the previous namespace's page table and mapped into the next one's.

These are more or less the memory management implementation details we are using internally for now. We added GFP_EXCLUSIVE for pages, and SLAB_EXCLUSIVE to mark an entire slab cache as exclusive, which implies GFP_EXCLUSIVE for every allocation of new pages for that particular slab. We mark pages allocated in this manner with a new page type, which most probably will be frowned upon, but for the prototype it works just fine, so for now we stick with that. Whenever a page is allocated with GFP_EXCLUSIVE, we call set_memory_np() in the page allocator, and that makes the page disappear from the direct map, so it won't be visible in the kernel mappings of other processes. Whenever it is freed, we return it to the direct map, and here we come back to the same problem of direct map fragmentation, which needs to be resolved. We also restrict such allocations to happen only when we have an actual mm; this won't happen for kernel threads or in interrupt context.
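A condensed sketch of that allocation path: GFP_EXCLUSIVE is modeled here as a placeholder flag, set_memory_np() and set_direct_map_default_noflush() are real kernel helpers, and everything else is reconstructed from the description rather than taken from the prototype.

    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/sched.h>
    #include <linux/hardirq.h>
    #include <linux/set_memory.h>

    /* Hypothetical placeholder: the prototype wires a real gfp bit
     * into the page allocator; this wrapper just illustrates it. */
    #define GFP_EXCLUSIVE 0

    static struct page *alloc_exclusive_page(gfp_t gfp)
    {
            struct page *page;

            /* Exclusive memory needs an owning mm, so it is not
             * available to kernel threads or in interrupt context. */
            if (!current->mm || in_interrupt())
                    return NULL;

            page = alloc_page(gfp | GFP_EXCLUSIVE);
            if (!page)
                    return NULL;

            /* Make the page vanish from the direct map; it is then
             * mapped only into the owner namespace's page table
             * (not shown). */
            set_memory_np((unsigned long)page_address(page), 1);
            return page;
    }

    static void free_exclusive_page(struct page *page)
    {
            /* Return the page to the direct map before freeing. The
             * huge-page split this caused is never merged back: the
             * direct map fragmentation problem mentioned above. */
            set_direct_map_default_noflush(page);
            __free_page(page);
    }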
With kmalloc, most of this is already implemented for our POC. It's so scary I'm frightened to look at it myself. We were able to get as far as allocating struct sock, struct tcp_sock and struct sk_buff with the modified version of kmalloc, and I was still able to have TCP traffic inside a namespace that used all these bells and whistles and protection methods.

Now, it appears this is not an easy task. It will probably take more than a couple of months, maybe even more than a year. Touching mm code is really difficult, because you get bug reports you wouldn't anticipate, and it actually affects every Linux user. There are open questions about how this feature could be used beyond network namespaces: it could be the user namespace, the UTS namespace, maybe others. And it's really unclear at this point whether the security benefit will outweigh the complexity and the performance penalties. So, like many of us, I'm running out of slides. Thanks a lot.

Thank you. Let's thank the speaker. Questions?

I have many, but I'm going to ask afterwards. With the network namespace isolation, what happens if I create a new network namespace, open a socket in that namespace, pass that socket by file descriptor passing to someone outside the namespace, and then that process acts on that socket? Does that work? Does it suddenly have to switch the page tables in the outer process, or do you just prevent passing sockets around?

We don't support it for now. Well, I thought that we should remap it in the new namespace.

Can you talk a bit about what the overarching goal here is? Are you trying to just mitigate arbitrary reads, but not prevent arbitrary writes?

We are not trying to mitigate any particular issue. We presume that there will be vulnerabilities, and we are trying to restrict kernel mappings as much as possible, so that everything that is not strictly needed will be unmapped. Then it makes the attacker's life harder, at least that's what we think.

Any more questions?

Hi. Have you considered using memory tagging instead of namespaces? It should achieve the same goal.

Namespaces are pretty much a given for containers, so what we are trying to do is protect namespaces, rather than use different memory management techniques to protect processes. The object of isolation is the namespace.

I'm thinking about the overhead you will incur when you switch between namespaces, while tagging would be managed at a lower level.

The context switch is there anyway: whenever you switch from process to process, you do a switch of the context. Then the kernel part of the page tables of processes in different namespaces will be different, but you have that context switch anyway.

Mark?

At least on arm64, we only have a single PGD for all kernel threads, because we have two base registers for page tables: one for the low half of the address space, the user space, which actually gets context switched, and one for the high half that has the kernel in it. We only use one page table there, and through that we avoid lots of things, like lazily faulting in the vmalloc area. So this sounds like it would have quite a significant overhead in our case, which we don't have today.

So we won't support arm64, initially at least.

One more question? Thanks for the talk. Some people who know the Windows kernel told me that the Windows kernel doesn't have a physmap at all, and, as I understood it, every kernel process has its own special, separate mapping. Is that really a better design than having a physmap for the whole kernel?

Linux is better, so probably it is. Linux is faster, so probably it is better. I don't know. Giving each and every process its own view of physical memory is probably not really a good idea. But I don't know the details about how Windows works, so I can't really compare apples to apples, oranges to oranges.

But what would the design be if we didn't have a physmap at all?

No idea. A lot of things rely on being able to go from page to physical and physical to page, which is the direct map: DMA, device drivers, other things. I don't know. Thanks.

No questions? I have a question. I'm not sure I understood correctly, but how would that work with device drivers having some kind of shared state, when you have a packet, for instance, going somewhere, and they have to manage hardware state, and then you switch namespaces? How would that be handled? Would it still be available?

If I understand correctly, you have, for instance, a network device driver that creates an SKB, which probably happens in atomic context. Then that SKB will need to go to one of the namespaces.

That's the question. The SKB, I can understand how it would go to a namespace, but what if there are other memory allocations?

Other memory that's mapped by a physical device? For example?

A common state, a common state between the part of a physical device in a namespace and the part of the physical device outside the namespace.

I don't think that's something that's possible. In container environments, mostly, the physical devices live in the initial namespace and the virtual devices live inside the namespace, and they just communicate with each other via a bridge or some other sort of SDN implementation.

Thank you. No more questions? If not, let's thank the speaker.