Hello everybody. I'm Mike, I work for IBM Research. This is James here, he'll help me if I get stuck somewhere. And we're going to talk about using the MMU and restricted address spaces for container security.

As you know, container images are now the standard way to deploy applications in the cloud and in on-prem data centers, practically everywhere. A Docker image is the most convenient form of distribution for complex applications. And yet container runtimes do not usually run natively on bare metal machines. People run container runtimes inside virtual machines, because virtual machines are perceived as more secure than containers, and people consider virtual machines to provide better isolation than containers can. We'd like to mitigate this problem and get to the point where containers are as secure as virtual machines, or even more secure, so we are working on providing hardware isolation for containers just like virtual machines have. Unlike virtual machines, containers do not have dedicated hardware support for their execution, like VMX and so on. So we are using the MMU to implement address space isolation for containers, because the MMU has been one of the best isolation mechanisms since, like, forever. We use different page tables to control the visibility of different areas of memory, so that those areas are isolated from each other.

One of the major factors in container vulnerability is the address space shared with the rest of the system. Essentially, any vulnerability in the kernel may lead to data exposure, or to privilege escalation that gives access to other people's containers, and a malicious container can gain control of the entire system. Since the major isolation mechanism for containers, in the logical sense, is Linux namespaces, what we are trying to do is assign a restricted address space to certain Linux namespaces, so that processes running in a namespace are isolated from the rest of the system. It so happens that most objects in a namespace are already private and are mostly used by kernel code that executes on behalf of processes running in that namespace. So the idea is to provide a per-namespace address space to improve the isolation of containers. The private objects will be mapped exclusively in that address space, and if a malicious container gets control of the host, it still won't be able to extract data from other containers. To prevent multiple additional context switches, processes that run in the same namespace will share the same page tables, the same address space, so the context switch becomes more or less implicit: when a process is scheduled on a CPU, the page tables common to every process in the namespace become visible on that CPU.

The first namespace we tackled, and the one we are actively working on now, after the conference, is the network namespace. A network namespace by definition creates an isolated network stack, independent from the host network stack and from the network stacks of other namespaces.
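Just to illustrate what that isolation means in practice, here is a small userspace sketch; this is stock Linux behavior, nothing from our work. A process that unshares its network namespace ends up with a fresh network stack containing only a loopback device:

```c
/* Illustration of network namespace isolation: after
 * unshare(CLONE_NEWNET) the process sees a fresh network stack
 * containing only a (down) loopback device. Run as root. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <net/if.h>

static void list_interfaces(const char *when)
{
	struct if_nameindex *ifs = if_nameindex();

	printf("%s:\n", when);
	if (!ifs)
		return;
	for (struct if_nameindex *i = ifs; i->if_index; i++)
		printf("  %u: %s\n", i->if_index, i->if_name);
	if_freenameindex(ifs);
}

int main(void)
{
	list_interfaces("host namespace");	/* lo, eth0, ... */
	if (unshare(CLONE_NEWNET)) {
		perror("unshare");
		return 1;
	}
	list_interfaces("new network namespace");	/* only lo */
	return 0;
}
```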
The objects in a network namespace are private to that namespace: TCP caches, iptables and so on are not needed outside the container. If we, for instance, create a container and move a hardware interface into the network namespace of that container, we essentially create an isolated networking stack, pretty much the same way device assignment does for virtual machines. Now, how would it work? The idea is to have the memory that is mapped inside an isolated network namespace unmapped from the kernel direct map, so that neither the host kernel nor the other containers have a mapping for the objects residing in that area, and they won't be able to access them.

There is still some gap between what we have now and what we need. First, we need more mechanisms for managing kernel page tables. We need a way for kernel services running inside an isolated namespace to allocate memory on behalf of that namespace and make that memory isolated from the rest of the system. And since we are touching the direct map, we need to preserve its integrity so that system performance does not degrade because of our manipulations.

There are several related use cases that need similar functionality in one way or another. First, there is the memfd_secret() system call that was added to the Linux kernel in 5.14. It also fragments the direct map, and we need to address that fragmentation to make memfd_secret() more broadly usable than it is now. There is an ongoing project for protecting page tables with PKS, which would also need some mechanism to reduce direct map fragmentation. And AMD's Secure Nested Paging (SEV-SNP) and Intel's TDX do similar things: when there is memory that is private to a guest, they also need to fragment the direct map, and something should be done to keep that fragmentation from affecting system performance.

As for kernel page tables: since forever, there has been only one kernel page table. We've put a lot of effort into managing user page tables; there is a lot of code doing various optimizations in how user page tables are handled, but that code is not really reusable when one wants to do things with kernel page tables. Kernel page tables are mostly accessed through simple accessors; there are no mechanisms for TLB gathering and so on. Another thing is that the APIs that allow modification of kernel mappings and the direct map were initially designed for debugging; they are not as robust as the APIs that deal with user page tables. To provide the facilities required for restricted kernel address spaces, we need new APIs that can create and tear down kernel page tables efficiently, and APIs for populating non-default kernel page tables, so that, for instance, ranges excluded from one kernel page table can be visible in another.

As I've mentioned several times now, there is the problem of direct map fragmentation. The direct map is essentially the part of the kernel page table that creates a one-to-one mapping of the entire physical memory, on 64-bit systems at least. There can be an offset if kernel address space layout randomization is enabled, but for simplicity we can assume that every physical address has a counterpart virtual address at a fixed offset.
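To make that offset relationship concrete, here is a rough model of the translation; the real kernel helpers on x86-64 are __va() and __pa(), and the base constant below is the default with KASLR disabled:

```c
/* Rough model of the x86-64 direct map translation. With KASLR
 * disabled the direct map base is 0xffff888000000000; the real
 * kernel helpers are __va() and __pa(). */
#define DIRECT_MAP_BASE 0xffff888000000000UL

static inline void *model_va(unsigned long phys)
{
	/* physical -> direct map virtual */
	return (void *)(phys + DIRECT_MAP_BASE);
}

static inline unsigned long model_pa(const void *virt)
{
	/* direct map virtual -> physical */
	return (unsigned long)virt - DIRECT_MAP_BASE;
}
```

So, for instance, a page at physical address 0x200000000 shows up at 0xffff888200000000 in the direct map.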
Here's an example of an x86 system with two DIMMs of 8 gigabytes. One DIMM is at physical address 0 and the other memory bank is at physical address 0x200000000. The direct map for these physical memory banks will look something like this: it starts at 0xffff888000000000, there is a hole that corresponds to the hole in the physical memory, and the mapping of the second bank starts at offset 0x200000000 from the beginning of the direct map. So for any virtual address in the direct map, it is enough to subtract the direct map base to get the physical address. Most kernel memory allocations return addresses in the direct map: kmalloc() and alloc_pages() do, and these are used in the vast majority of cases.

Now, on x86 systems the direct map is created using large pages, to reduce memory overhead and, more importantly, to reduce TLB pressure. So on a system with several gigabytes of RAM, some part of the direct map will be laid out with 4 KB and 2 MB pages, and wherever there is enough contiguous space, 1 GB pages will cover the mappings. Whenever we try to exclude an address from the direct map, we cause direct map fragmentation, because that 1 GB page needs to be split into 2 MB pages, and a 2 MB page needs to be split into 4 KB pages. Essentially, instead of one PUD-level entry that maps an entire gigabyte, we end up with a bunch of 2 MB mappings, a bunch of 4 KB mappings, and again a bunch of 2 MB mappings, to keep the page table entries aligned.

This causes a certain degradation in performance. To measure it, we ran a couple of benchmarks that are more sensitive to page table layout. We tested both on SSD and on tmpfs, and another parameter we varied was whether mitigations for hardware vulnerabilities such as Spectre and Meltdown were enabled. That gives the four cases in the results. As we can see, there is indeed performance degradation in most cases for 4 KB pages, while 2 MB and 1 GB pages on average perform very similarly, at least on this system. And while there is degradation, it's not the end of the world: it's always below 10%, and in many cases it's a single-digit percentage. So while we should do something about direct map fragmentation, it still doesn't justify a hugely complex implementation.

There were a couple of suggestions for how to address this. More or less, there is a consensus that we need a cache of 2 MB pages that can be used to provide 4 KB pages to users that need memory excluded from the direct map. Whenever there is an allocation request that requires unmapped memory, let's call it that, we allocate a 2 MB page, split it into 4 KB chunks, and subsequent 4 KB allocation requests are served from that cache. And if the memory in the page allocator is too fragmented to provide contiguous 2 MB chunks, we just fall back to normal 4 KB allocations and live with the direct map being fragmented in that area.
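Here is a minimal sketch of that caching scheme, under a few assumptions: struct unmapped_cache and unmapped_alloc() are made-up names for illustration; alloc_pages(), split_page() and set_direct_map_invalid_noflush() are existing kernel APIs (the latter is what secretmem uses to drop a page from the direct map); TLB flushing and the freeing path are omitted.

```c
#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/mutex.h>
#include <linux/set_memory.h>

/* 2 MB worth of 4 KB pages on x86-64 */
#define UNMAPPED_ORDER	(PMD_SHIFT - PAGE_SHIFT)

struct unmapped_cache {
	struct list_head free_4k;	/* 4K pages carved from 2 MB blocks */
	struct mutex lock;		/* allocations below may sleep */
};

static struct page *unmapped_alloc(struct unmapped_cache *cache)
{
	struct page *page = NULL;

	mutex_lock(&cache->lock);
	if (list_empty(&cache->free_4k)) {
		/* Take a whole 2 MB block so that only one large mapping
		 * in the direct map has to be split. */
		struct page *block = alloc_pages(GFP_KERNEL, UNMAPPED_ORDER);

		if (block) {
			int i;

			split_page(block, UNMAPPED_ORDER);
			for (i = 0; i < (1 << UNMAPPED_ORDER); i++) {
				set_direct_map_invalid_noflush(&block[i]);
				list_add(&block[i].lru, &cache->free_4k);
			}
			/* TLB flush omitted for brevity */
		}
	}
	if (!list_empty(&cache->free_4k)) {
		page = list_first_entry(&cache->free_4k, struct page, lru);
		list_del(&page->lru);
	}
	mutex_unlock(&cache->lock);

	if (!page) {
		/* Fall back to a single 4K page, accepting the extra
		 * direct map fragmentation in that area. */
		page = alloc_page(GFP_KERNEL);
		if (page)
			set_direct_map_invalid_noflush(page);
	}
	return page;
}
```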
There were two proposed ways to implement such a cache. The first is to implement a cache for each user of unmapped memory; the other suggestion was to implement the cache as an extension of the page allocator. Each approach has its pros and cons. Per-user caches are probably simpler to implement. They have better access control, because the user actually knows what the memory will be used for, and they can do various optimizations, for instance compacting the cache by moving all the used pages into a single 2 MB chunk and then releasing a freed 2 MB chunk back to the rest of the system. On the other hand, they would have a larger memory overhead than an extension to the page allocator, at least the initial experiments we've done show that, and there would be overall higher memory fragmentation in the system.

The pros and cons of the other approach are pretty much the inverse of the first. The changes are more intrusive. The cache is a black box, so it would be hard to move pages around inside it, because the page allocator would not know what the memory is destined for and how its access rights can be changed on the fly. But it would be more memory efficient, with overall lower memory fragmentation. And integration with the core mm provides an easier path for freeing memory, which, as I found out the hard way, is much more difficult than allocation: when you allocate memory, you know what context you are in, and it's pretty streamlined, but when you free memory, in many cases you need to figure out which particular cache it should be freed to.

So our idea is to add awareness of the direct map to the page allocator. We created a new get-free-page flag; for now we call it GFP exclusive, so any allocation that adds GFP_EXCLUSIVE to its GFP flags receives a page that is excluded from the direct map. And since this is an extension of the page allocator, free_page() and its companion functions will know how to put the page into the appropriate cache and how to release it so that it won't land on the global free lists and won't cause additional fragmentation of the direct map. We are planning to create a shrinker to free unused parts of the caches when there is memory pressure, and there is also the possibility of cache defragmentation, probably with callbacks to the users to allow changing the permissions of particular pages.

The next thing, moving up the allocation stack, is extending the slab allocators. Again, we call it SLAB exclusive: whenever somebody creates a cache with kmem_cache_create(), they can pass SLAB_EXCLUSIVE, and the cache will be entirely excluded from the direct map; the objects in that cache won't be visible in the default page tables. The idea is to reuse the mechanism cgroups used until recently: whenever there is a request for such a private cache, we create a child cache of the original kmem_cache, and this child cache serves as a pool of memory for the context that requires private memory allocations. And whenever we free memory in such a cache, we add metadata to the cache, or to the struct page that holds some of the slab cache metadata, so that kfree() can look up that metadata, understand what context uses the memory, and free it into the appropriate cache.

As I said, freeing memory that is restricted to a particular context is more difficult than allocating it. Things like RCU callbacks and softirq processing usually run in entirely different contexts than the ones that allocated the memory, so we need metadata that lets us detect, from the freed address, what the context is and which cache the memory should be freed to.
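From a caller's point of view, usage of the two proposed extensions would look roughly like this. To be clear, neither __GFP_EXCLUSIVE nor SLAB_EXCLUSIVE exists upstream; the spellings are just how we refer to the proposed flags, and struct conn_private is a placeholder object type:

```c
/* Hypothetical usage of the proposed flags; __GFP_EXCLUSIVE and
 * SLAB_EXCLUSIVE do not exist upstream. alloc_page(),
 * kmem_cache_create() and friends are the stock kernel APIs. */
#include <linux/gfp.h>
#include <linux/slab.h>

struct conn_private {
	unsigned long key;	/* some namespace-private state */
};

static int exclusive_alloc_example(void)
{
	/* A page excluded from the direct map as it is handed out: */
	struct page *page = alloc_page(GFP_KERNEL | __GFP_EXCLUSIVE);

	/* A slab cache whose backing pages never appear in the
	 * default kernel page tables: */
	struct kmem_cache *cache = kmem_cache_create("netns_private",
					sizeof(struct conn_private),
					0, SLAB_EXCLUSIVE, NULL);
	struct conn_private *obj;

	if (!page || !cache)
		return -ENOMEM;

	obj = kmem_cache_alloc(cache, GFP_KERNEL);
	if (obj) {
		/* ... use the object for namespace-private data ... */
		kmem_cache_free(cache, obj);	/* must route to the
						 * private child cache */
	}
	__free_page(page);	/* returns to the unmapped cache, not
				 * the global free lists */
	return 0;
}
```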
After we implement these two extensions to the memory allocator, we need to adjust the networking stack to use them to hide the networking objects. So we are adding a page table, or even an entire mm_struct, to the network namespace representation, which is struct net. Then we can start switching kmem_cache users inside the networking stack to SLAB_EXCLUSIVE, or to GFP_EXCLUSIVE, depending on the context. And some of the networking stack functionality requires minor adjustments, because things like timers and softirq processing always run in the default context, so we need the ability to map things back and forth, to switch contexts whenever it is needed.

And here is the kind of vision of where we would like to get once all of this works. As of today, there is VMX isolation for virtual machines, and you can assign a virtual function to a VM, which creates an isolated networking stack for that virtual machine. What we are trying to achieve with MMU isolation and restricted kernel address spaces is very similar network isolation for containers, so that a virtual function can be assigned to a container and the entire networking stack of that container runs in an address space different from the default address space of the system. The MMU will then guarantee that no other container can access the data that passes through that networking stack.

We actually did a proof-of-concept implementation of the whole thing, and it was stable enough to run some benchmarks, between reboots, of course. We tested memcached, Apache, nginx and iperf to see if there is significant performance degradation caused by the isolation we've created. There were three variants we checked: the baseline, which is the normal networking stack of the Linux kernel; our implementation, the isolated network namespace; and the same configuration on a virtual machine with an assigned network interface using PCI passthrough. As you can see, we are better than the VM, right? There is slight performance degradation relative to the native Linux kernel stack, but the VM with VF assignment was less efficient in all the use cases.

I think I talked too fast, or I had too few slides, anyway. To conclude: restricted address spaces for the network namespace create isolation of the networking stack similar to what virtual machines have, and with better performance. Caching of large pages removed from the direct map is essential for practically any security feature that uses memory management for security; it's important to have it, otherwise the performance of the entire system will degrade. And another point: reworking kernel address space management is a major challenge, probably not as major as real-time Linux, and we hope it will take less than 20 years to achieve. Thank you. Questions?

The question was what testing we have done to verify that the network namespace address space actually isolates anything. I did a brute force test: I created a dedicated kernel module with ioctls that allowed me to try to access different memory inside the machine, and I simply saw that I cannot access that memory. I didn't do penetration tests and so on, but I verified that the data is indeed excluded, and that an attacker would need to rebuild the page tables to access these areas.
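The brute force check was along these lines; copy_from_kernel_nofault() is the stock kernel helper for fault-safe reads, while the ioctl plumbing (struct probe_req, the command handling) is made up for illustration:

```c
/* Sketch of a brute force isolation check: an ioctl handler that
 * tries to read an arbitrary kernel virtual address without oopsing.
 * The ioctl plumbing is made up; copy_from_kernel_nofault() is the
 * stock fault-safe read helper. */
#include <linux/fs.h>
#include <linux/uaccess.h>

struct probe_req {
	unsigned long addr;	/* kernel virtual address to probe */
};

static long probe_ioctl(struct file *file, unsigned int cmd,
			unsigned long arg)
{
	struct probe_req req;
	unsigned long val;

	if (copy_from_user(&req, (void __user *)arg, sizeof(req)))
		return -EFAULT;

	/* Fails with a negative error when the address is not mapped
	 * in the current kernel page tables, i.e. the isolation holds. */
	if (copy_from_kernel_nofault(&val, (void *)req.addr, sizeof(val)))
		return -EFAULT;

	return 0;	/* the address was readable: NOT isolated */
}
```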
Yes, we haven't finished with the network namespace; it's not upstream. Sorry, the question was: what's the next namespace to tackle? Probably mount, or user, or any of the others. It'll take a while until we finish with the network namespace, so that will have to wait.

Really hard to tell. Sorry, the question was: we haven't submitted this upstream yet, so what are the plans for going upstream? If we get back to this slide, these are kind of prerequisites for actually starting to do things with the network namespace memory management. So before we have at least the direct map and page allocator parts done, we cannot really start changing how the network namespace allocates and frees memory. And there was an RFC a while ago from the people who work on PKS about using per-user caches for their use case, and I've sent an RFC that creates a page allocator extension to address the direct map fragmentation problem. So this part is being worked on these days, and it is going to go upstream first, I presume.

Any more questions? Let's go back to the last slide. Thank you very much.