Welcome everybody to the next talk. This talk is from Mike and James, who work at IBM, and they will give a talk about address space isolation. When the talk finishes and the Q&A starts, please remain seated; it's not going to take long, we can hear the questions, and afterwards we can all leave without disturbing the Q&A. Please give them a warm applause. Thanks, I'm Mike. I work on memory management in the Linux kernel; I happen to maintain the boot time memory manager called memblock, and I'm an employee of IBM Research. We are going to talk about our research on how to use memory management techniques to make containers even more secure than they are today. Okay, and I'm James. My job in all of this was really just to persuade Mike that it was worth doing, and since I gave a talk this morning, my voice isn't doing too well. So he's going to be doing all the talking, telling you what it's about; I'm going to be the demo monkey, because we have a demo right in the middle explaining how this all works in practice. And with that I'll hand over to Mike to do the slides. Thank you. So it took a couple of decades to get from chroot to cloud native, and containers, as you probably know, can be described as a kind of chroot on steroids. Thanks to technologies like Docker and Kubernetes, containers are now everywhere; they are probably the most popular form of application deployment. You may find container deployments, in one form or another, both in private data centers and in public clouds. If you have used container services in public clouds, you may have noticed that they all run their Kubernetes clusters on top of virtual machines, which creates a kind of unnecessary extra level of virtualization that obviously costs additional money, cycles, performance and so on. One of the claims used to justify running container installations on virtual machines is that containers are less secure than virtual machines.
And proponents of this claim usually say: guys, come on, virtual machines have hardware that ensures their security. But as we all know, with Meltdown, L1TF and everything, hardware is probably not that good at ensuring security for anything; particularly with L1TF, VMs are much more vulnerable than containers or simple processes. Nevertheless, as researchers we were looking at interesting problems, and we said, okay, we also can use some hardware to ensure the isolation of containers, and what we have is the MMU. So we will try to use the MMU and protect Linux containers with page tables. Our goal is to make containers less vulnerable; besides, we can presume that every system will be vulnerable in one way or another. So once an attacker has gained some control of the system, we are trying to make sure it will be harder for him or her to penetrate the containers of other tenants sharing the same system. For that, we are proposing to use restricted address spaces to allow better isolation of the privileged contexts of different tenants in the system. The containers' attack surface is the entire system call interface of the Linux kernel, which is about 400-plus system calls. So the first question we asked ourselves was: what can we do to make a system call less vulnerable, or at least less exposing of the rest of the system to an attacker? The other thing we have been thinking of is that in Linux, containers are isolated mostly using Linux namespaces. So what we are trying to do is to give namespaces their own means of hardware isolation; in other words, we are trying to extend namespaces with their own page tables. We'll explain in a bit more detail what we are trying to achieve. There is similar work, some of it already there: as you all probably know, as a result of the Meltdown vulnerability the Linux kernel started to use restricted address spaces for the first time, in the form of page table isolation (PTI).
There is ongoing work in the KVM area to protect virtual machines from the host and from each other that also tries to implement address space isolation in KVM, and another mechanism that was dubbed process-local memory, to ensure that VM secrets are visible only to that VM and are not visible in the host or in the other virtual machines. So what we tried first was to create a restricted address space for the execution of a system call. It builds on the PTI technology introduced into the Linux kernel, where the kernel mappings are very much restricted for the user space part of the application, and the only kernel things the user space page tables contain are the bits of code necessary to jump into a system call or an interrupt handler. We thought that it would probably make sense to extend this a bit and make system call execution inside the Linux kernel also use some very minimalistic page table, and then map required pages on demand. It would look something like this: here is the page table of the kernel part of a process, and here is the page table of the user part of the process. The privileged code and data are not mapped in the user space page table, except for the small part required for entering the kernel. We introduced yet another page table that maps only the code and data needed to dispatch a particular system call; then, as the system call continues its execution, we try to demand-page whatever code and data is necessary. The idea was that whenever we enter a system call we switch the address space; we then remain in a restricted address space, where every access to an unmapped area causes a page fault, and the page fault handler can decide if the access is safe or not. If the access is not safe, we kill the offender; if the access is considered safe, we map the page and continue the execution. We actually implemented this thing.
The patches are here. We found out this is really slow, many times slower than normal system call execution, and that the context switches are really costly. It also has some security weaknesses: we couldn't validate return targets to actually prevent ROP attacks properly. And we were competing with the upcoming CET technology that will probably eventually be available sometime. Do you know anything about CET? Intel has been promising it for several years; it's basically in their next chip, but nobody's seen it. Intel CET is going to do the same thing but in hardware, so if the chips become available there is no sense in implementing our approach. But we also thought about another possibility, though we didn't try to implement any of it. One can use ftrace to create a shadow stack of the execution, and then upon return from any routine there is a possibility to insert return thunks using GCC or LLVM. That's what retpoline does, for instance, with calls; it's possible to do the same thing with returns, and in that thunk there is a way to check if the return address actually matches the shadow stack created with ftrace. This should be faster than using a page fault for that. We don't know yet if it will fly at all. The next thing we tried came from an idea some of the KVM developers proposed a while ago on the mailing list; they call it process-local memory. What we're suggesting is to hide a piece of a user process from the kernel, and obviously it won't be mapped in other processes either. This memory can be used to store secrets, for instance keys, and maybe some other sensitive information. Another possible use case is to hide virtual machine memory from the entire kernel and from the entire host. For storing a secret, this may be used in the way described here: we create a mapping with particular flags, we open a secret file, and then we read its contents into that mapping. The patches are here if anybody is interested.
There was a long discussion about this approach of using a memory mapping with a special flag, and the outcome is more or less summarized here. The pros were that it was relatively simple (our submission was about 200 lines of diff), it can be easily plugged into existing user space allocators, and it can be easily plugged into existing applications alongside madvise, mprotect and such. On the downside, the implementation has to touch various places in the memory management code to special-case this kind of mapping: to decide, for instance, whether it is possible to do madvise on such an area, whether it is possible to splice into it, and so on. And the most significant disadvantage was the necessity to fragment the direct map the kernel uses to map physical memory. Whenever we create such a special mapping, in order to make it invisible to the kernel and to privileged code we drop this memory from the direct map, and that requires splitting the large and huge pages that usually constitute the direct map. Okay. One of the pieces of feedback we got on the MAP_EXCLUSIVE mmap suggestion was that it's probably better to use a file descriptor or a character device to create such secret mappings, so we came up with another version of the patch that actually extends the memfd_create system call. To create a secret area, one creates a file descriptor, a secret memfd, and then calls an ioctl to specify exactly how the kernel should treat this memory; it could be exclusive, uncached, maybe some other properties. Then you continue with mmap and use the memory in a secure way. This has the advantage of fewer modifications to the core memory management: we don't mark the allocated area with anything except the special VMA flags, so we don't need to insert as many "if"s into the core mm code. It is also possible to pre-allocate memory at boot and then use it as the backing memory for such file descriptor based secret memory.
We still would need to audit all the memory management code to make sure that nothing would try to access the secret memory and that safety is preserved. And since this is file-based memory management, it would use the page cache mechanism in one way or another, so the final implementation may introduce some complexity into page cache management. And we still haven't addressed the huge problem of the fragmentation of the direct map, which may also cause some pain in the future. I've recorded a demo just in case, but now it's all yours. My job is to be the demo monkey; can I actually get this demo up and running? Let's just stop the presentation and mirror the screen so I can see what I'm doing. Let's bring up a GNOME terminal. How big does it need to be for you all? So everybody can see that. Yeah, of course, every tab you start also has to be resized. Great. So Mike actually sent the patch that does this to the mailing list, what, three days ago? I have built a 5.5 kernel with this patch integrated, and that's actually what I'm going to demo to you. Ignore the fact that this is a UEFI secure boot; this is where I've got all my demo kernels. I'm just going to boot up this kernel in KVM, and once it's booted up I will log into it, and we're going to try twice to poke at the memory of what should be a container. But in order to convince you that this works generally, I'm just going to use a normal, ordinary process and prove that we can actually extract its secrets. So if I just log into this system, I have a very simple program that uses OpenSSL. One of the great things about using OpenSSL is that, for reasons best known to the OpenSSL developers, they insisted on rewriting a buddy allocator on top of the glibc memory interface because they claimed it would make programs more secure. So OPENSSL_malloc, if you use it in an OpenSSL program, and OpenSSL does use it for all of the private keys, should actually give you more security, according to OpenSSL.
Realistically, as we'll demonstrate, it doesn't. But one of the great things is that I can just use a preload library to override OPENSSL_malloc and insert our special allocator underneath their buddy allocator, and then OPENSSL_malloc is really allocating all of your private keys in secure memory. The purpose of this demonstration is a very, very simple program that actually does that. So this is it: I allocate a secure pointer using OPENSSL_malloc, I allocate an insecure pointer using the standard malloc, I copy two strings into each of these pointers, and I print them out again. This is obviously highly insecure if you have access to the console, but it serves as a demonstration; the reason for printing them out again is to prove that the process itself can still get access to memory we've designated secure. Oh, and usually when you're dealing with secrets, the trick is to get the secret in, use it and shred it as fast as possible; in order to demonstrate the actual program working, I put a pause in here that allows me to go in and try to extract the secret. So if I run this program, it's going to print out the two pointers. I haven't done an LD_PRELOAD override yet, so this is only OpenSSL's protection. And what I'm going to do is use the biggest hammer I possibly can, which is to log into the virtual machine as root and see if I can extract the secret. All I have to do is find the process; there it is. Then I can just ask GDB to attach to it, which it does with no problem; it's stuck at the pause. Sorry. So I go up in the stack frame, and now I can actually print out the pointers. Here's the insecure pointer: I'm running as root, so I can easily grub about in anybody's memory. And here's the secure pointer. So as root I can just extract all secrets from the system; if an attacker managed to compromise the system sufficiently, they could also do the secret extraction.
So now what we're going to do is leave this; I'll come out here, kill this, and now I'm just going to add a preload that overrides OPENSSL_malloc. Let me actually show you roughly what the preload looks like. It's basically the same program Mike showed you: all it's doing is getting a secret memory descriptor, mapping a single page, and putting that page in a secure pool; when you call OPENSSL_malloc, it just returns the page. This obviously is not a buddy allocator, but for the purpose of a demonstration where we only have a single allocation, it demonstrates all of the principles. Obviously, if I were going to do this in practice, I would have actually written a buddy allocator, but I did this on the fly last night and couldn't be bothered. So let's apply the preload, and again the program runs. If you looked, I actually put a debugging print in the OPENSSL_malloc override, so we know that the secure pointer is now actually in Mike's secure memory. So what we're going to do is find it again and attach to it with GDB. Okay, go up. Obviously the insecure pointer is only in ordinary malloc memory, so I should still be able to get access to it. But now let's see what happens if I actually try to grub about in here and get access to the secure pointer: my program is killed. And if I have a look at what happened in the kernel, that is a page fault in the direct map. So even root on this machine cannot get access to the secrets that a process deposited into its memory. This affords us a lot of useful secrecy for things like the OpenSSL keys HTTPS uses for establishing secure channels in containers in the cloud. Yeah, sure. As we can easily see, the page is not present here. Back to the presentation. Okay, let me just... yeah, I know, you want it bigger. Now, another thing we are trying to do is to protect address spaces with namespaces, or rather, namespaces with address spaces.
Namespaces in Linux create their objects in a way that is isolated from the rest of the system anyway, so there is no actual need for kernel code running in one namespace to access objects in another namespace. That's why we think it would be possible to give each and every namespace its own page table, and then just take care of some rare cases of objects transitioning between namespaces. We've started with the network namespace. A network namespace creates its own copy of the entire network stack: TCP caches, UDP caches, sockets, everything. They're all private to that network namespace; there is no need for any other namespace to touch the data in the caches of the network stack of another namespace. The only things that behave differently are the sk_buffs, which represent packets that usually traverse several namespaces on their way to other services, out of the machine or into the machine. We started working on this more or less at the same time the KVM developers submitted their work on what they call KVM ASI. One of the comments on their submission, from Thomas Gleixner, one of the x86 maintainers, was that there are actually four points to the creation of restricted address spaces: there needs to be a way to create a restricted mapping, there needs to be a way to switch into it and to switch back out of it, and there should be some machinery to track the state, so the kernel can understand which address space it is actually using at any moment. Together with the KVM guys, we started to work on some generic APIs that would allow usage of restricted address spaces in the kernel. First of all, an API for the creation of page tables. We thought that we need a first-class abstraction for a kernel page table, which is nonexistent as of today: the kernel presumes that every address space has an mm_struct, which is actually used to represent user process memory and carries a lot of information that is not necessary to represent a kernel page table.
So what we're trying to do is to extract the page table information proper from mm_struct and create a first-class abstraction; we call it pg_table for now, and it may evolve. Then we need an API to populate this page table, and an API to tear it down. The context switching varies between the different use cases: the KVM guys have it explicit, on VM-enter and VM-exit, while with the network namespace, as with the address spaces of processes, it becomes implicit, just part of the context, because the page table of the particular context is already there inside the namespace. And freeing a restricted page table is a kind of pain in the whatever, because the code in the kernel that actually frees page tables is currently very tightly bound to mm_struct and to the assumption that kernel page tables are never freed. A lot of care must be taken when freeing page tables to properly play with the TLBs and to avoid TLB shootdowns as much as possible, and there is a lot of work we are going to do in that area. What did I do? Uh-huh. You should warn me. So on top of these page table management primitives that we are going to implement some day, we are trying to implement private memory allocations, such that pgalloc or kmalloc will receive a particular flag that says: okay, I want this page visible only in my page table, I want it absent from all the other page tables, and I want it dropped out of the direct map. The idea was to add some page flags to struct page and some flags to the slab structures, but as we got a fair amount of pushback on using new page flags in our first submission of MAP_EXCLUSIVE, we'll need to think about some different way, probably using page extensions for this mechanism. We can then use the existing interfaces for tweaking the direct map, like set_memory_np and set_memory_p, which make pages present or not present in the direct map.
And again, despite the availability of this interface, it is really not good to use it, because it will fragment the direct map; there is some groundwork required to properly implement direct map manipulations, maybe something like THP for the direct map as well. For the private allocations using kmalloc and its family, kmem_cache_alloc, we are proposing a mechanism similar to what the memory cgroup is currently doing: we create another level in the hierarchy of the kmalloc caches, and for every context there will be new caches that belong to that particular context. Just as cgroups create their own, we will create ours for address space one, address space two, et cetera. If we look again at address spaces for the network namespace, we add the page table to struct net, which is the kernel representation of the network namespace. Then, whenever a process joins the network namespace using clone, setns or something like that, its page table gets overwritten with the page table that is common to all the processes present in that network namespace, and every allocation of memory inside the kernel makes sure that the pages are allocated privately to that namespace and are not visible in the direct map of other namespaces. We had a mostly working proof of concept in which socket objects and sk_buffs were allocated using GFP_EXCLUSIVE and used the exclusive memory. I actually planned another demo, but I couldn't get Wi-Fi on my laptop, so sorry about that. This is our current vision of how it's going to happen; it may evolve over time. We are going to implement a page table management API for managing kernel page tables; the page allocator and the slab allocator will use these APIs, and then they will be available to namespace isolation as well as to KVM isolation, for what the KVM people are doing.
For the exclusive memory mappings, we are currently looking at extending the page cache functionality and using that to implement them. So, to conclude: well, first of all, it would be nice to make all this work; it will take a couple of years or so, I presume. We can presume that using restricted address spaces does reduce the attack surface, but we have yet to evaluate the security benefits versus the added complexity and probable performance degradation; to evaluate the pros and cons we need to implement it, so it will take some time. As we've seen, SCI, the system call isolation we tried to do, was a no-go: it was too slow. We hope that exclusive memory will be fast enough to be useful in production, and we hope that address spaces for namespaces will also be fast enough. Reworking address space management in the kernel is really difficult, because we have to break a lot of assumptions that went into kernel memory management; the major assumption is that there is a single kernel page table and we don't need anything to actually manage it, because it's always there. That's all I have to say. Do you want to add something? And if you have any questions... Thank you. Please raise your hand if you have a question. We finished quite early, so we have plenty of time to move to the other room after questions. Hello, thank you for the talk. I'm afraid terminating processes this way will break the workflow of tracers and debuggers; I mean, they usually expect some kind of error when they try to access memory. That's the idea, right? Yeah, but in this demo there was just a killed process. Right, but the premise of the demo is that somebody has already broken into your machine and tried something nefarious. You don't get killed in normal operation, and because it's a kernel page fault, we can actually choose the signal we use to terminate the process, and that signal can be picked up by the container orchestration software.
In addition, the kernel log contains a fairly verbose trace of what went on. That can also be passed up, through Kafka in the container world, and analyzed to show that you have a problem. If I actually saw a log message like this on one of my containers, what it tells me is that somebody is trying to break into the system and they already have enough privilege to be trying to steal secrets, so this machine needs to be shut down as fast as possible. All right, next question over there. Thank you, great talk. In the memfd case, how does it play with SCM_RIGHTS? Could you theoretically adapt this to be passed between processes? How is it going to play with cgroups? No, no, SCM_RIGHTS, like passing the memfd to another process. Could you theoretically use something like SCM_RIGHTS to then securely donate that memory to another process? Yes. It's a normal file descriptor, so you can do SCM_RIGHTS, you can do LSM on it, and everything else you can do with a file descriptor. It's pretty much like memfd today, the usual memfd; we just thought that instead of introducing a new system call we'd extend the existing one. So you would then update the other process's page tables. Very cool. Okay, next question. Would it be possible to use this work to further lock down file system implementations in the Linux kernel, to isolate file system implementations? Well, in theory it's possible. If you looked at the patches we are planning for network isolation, we're planning to give the network namespace its own allocations of socket buffers, and if you terminate it at the end with a virtual function, that means we have an isolated network stack. It is not impossible to isolate a file system in the kernel in the same way using the mount namespace; we have a namespace to do this. It's just that neither of us has looked at what the complexity of actually adding private allocations to the mount namespace would be. So I can answer theoretically yes, but I have no idea, because we've not tried it.
Okay, I see the next question over there. Thank you all for being so quiet, it's really special. Have you explored using this with keyrings? Being able to allocate secure memory for keyrings seems like an obvious choice. So there is a specific problem with the kernel keyring mechanism: it has no cgroup or namespace isolation currently, so right at the moment keyrings are shared amongst all processes. It is theoretically possible that we could use secret memory for, say, the user part of the keyring, but the protection it affords may not be as great as you think. I mean, in the current model, only the children of the process can get access to the memory; in fact, we did the memfd with close-on-exec, which means even the children don't get access. So this is really restricted, while the keyring has a much more general use case within the kernel and is much longer lived. So I think we won't get keyrings in secret memory until we get the keyring namespace, which is actually necessary in order to consume keyrings in containers anyway: without the keyring namespace, a key you put into a keyring, even inside a container, is shared by all of the containers. Thank you. Any other questions? No? One more question. This architecture, especially adding a context to kmalloc, looks much like a first step towards microkernels; do you think we are heading in that direction? Okay, so this work does have contact points with microkernels. Although Mike and I are standing up here, we had a third guy, Joel Nider, at IBM Haifa, who also worked on this with us. He was a microkernel guy, so his job was to bring microkernel techniques to what we were doing. But in a standard microkernel, it's actually all of the internal servers of the microkernel that run in their own address spaces.
And the problem is that if a tenant can exploit one of those servers, you can limit the exploit to that server, but that tenant can still compromise any other tenant also using that server. So it doesn't provide a lot of protection in the microkernel against exploits that are exercised by tenants. Whereas if you look at what we're doing, we're actually trying to bring up an entire address space that belongs to the tenant alone, so any other tenant running in the kernel can't get access to this address space. So instead of trying to isolate the servers within the kernel, we're trying to isolate the access from the tenant, from the top. That is conceptually very different from the way microkernels operate. So it's true to say that we definitely got our ideas from looking at microkernel work, because Joel was very, very fanatical about it, but the ultimate implementation we have is very dissimilar from what a microkernel would do. Thank you. We have two more questions, if time allows. So with your demo, you stopped the attacker from accessing the secret memory by tracing, but what stops them from, say, injecting code into the context of the process and basically executing it? So the question is really about the mechanisms we can use for protection. Thanks to the no-execute bit in modern processors, it's actually very difficult to inject code into processes and force them to execute it. It is definitely not impossible, and a root attacker has many other ways of compromising a process other than by trying to pull secrets straight out of its memory, especially if they know we've deployed this protection. I mean, security is basically a turtles game: we've gone down a couple of layers of turtles, but in order to get perfect security you'd have to go down infinite layers of turtles.
But what we're hoping is that this is a building block for providing enhanced security: coupled with a few other security techniques that containers will use, like no-execute memory, enhanced protections for the namespace and various other things, we might be able to block most of the standard attack channels. And obviously when you do this, the black hats just tend to come up with new attack channels, which we look forward to seeing, and we end up in an arms race to see whether we can also block those with the same technology or whether we need new technology. So this is definitely not an endpoint for security in containers; it's just one step. Okay, thank you. We have another question there. Hey, have you discussed your design with the potential consumers of those patches, say the container orchestration community, or the TLS-providing libraries, or something like that? And if so, what was their reaction, and how did they adopt the features that you just presented? Okay, so as you probably know, there is a bit of a bifurcation between the container orchestration community, Docker and co., and the actual mechanisms in Linux that implement containers. Most Docker people can get their heads around namespaces and cgroups, but if you look at what Docker does, it still can't take advantage of a lot of the security mechanisms we have, the user namespace being the classic example. And so the kernel developers' view of the Docker community is that in the rare case they can actually formulate the question correctly, they usually don't understand the answer. So I would agree that what we need to be doing is evangelizing our features, but the complexity of what we've done in the kernel is almost incomprehensible to people who are managing orchestration, so it's hard to have a sensible conversation about how they would make use of it. I think the business end of the conversation comes down to that demo that I showed you.
This is a way of using a preloaded library in a container, which is very easy to do, to get security: just put this LD_PRELOAD in, attach it, and your container is more secure, and the Docker community will be perfectly happy with that. Trying to explain to them the mechanics of an address-space separation mechanism that pushes the page out of the direct map will cause their eyes to roll back in their heads. Let's leave it at these questions. Thank you, and please thank the speakers.