So, we will be talking about how to improve the implementation of secret memory, particularly the memfd_secret system call we added to the Linux kernel in the last release, and how to extend the different techniques that utilize memory management for better security of containers. Back to you, James.

Yeah. So, actually, we'll start with this diagram. The basic thing that we've all been discussing, at least for the last couple of days, has been encrypted virtual machines, which is where you get confidentiality in the system by putting a huge encrypted envelope around a virtual machine. The problem with this is that the cloud service provider can't see into this encrypted blob and can't do anything with it. So you end up orchestrating these fat virtual machines, which is somewhat of an anathema to the lean, mean, modern container environment. My group does actually do a lot of work on confidential virtual machines, and I've been working with both AMD and Intel on this for a while, but the research we're going to present is a different idea for trying to preserve confidentiality in a more container-like setting. And for a more container-like setting, you basically want the diagram on the right, MMU isolation for containers, and you only get the leanness and meanness of container orchestration if you have a lot of sharing. Sharing means that a lot of stuff inside your container is going to be public. So what we're trying to do with secret memory is to carve out elements of the container that are very hard to get to, and we use the memory management unit for that protection. The way it works, and Mike actually has a slide on it, is that for some of this we use namespaces for isolation. We make the objects that run through those namespaces completely private to that container. The technique we use to make them private is that we actually push them out of the direct map of the Linux kernel.
Because this is a container system, you're already sharing most of the kernel. You have to trust the kernel on the machine you're sharing. So this is not a situation where the cloud service provider is untrusted. The cloud service provider should be trusted. They should be trusted to run what you want, which is a Linux kernel, and there are all sorts of things like measured boot and TXT that allow you to actually verify that trust, so you don't have to trust blindly. In this scenario, we also assume the cloud service provider is reasonably compliant: the machine is locked away, no physical access, they've trained their admins, and the admins are not going to try and log in as root and bomb you. So your main fear is actually an attacker breaking out of one container and trying to extract information from another. That's the threat model we're mostly concentrating on.

That's because you're supposed to be watching the Excel platform. We promised we'd project it on the Excel platform, so that white rectangle in BBB will always stay white. If Guy gives us presenter rights, we'll upload the presentation there as well. Yeah, it's not a bug, it's a feature. We're supposed to be using the Linux Foundation's platform, so we're using it for the slides and the chat, so don't use the BBB chat either. We will just use BBB to project your presence into the conference room.

And I suppose with that, do we have anything more to say, or should we actually get on to the debate about what we're trying to do with the security properties of this? We have normal slides, that's for sure. Look, that's Steve Rostedt. Isn't he looking pretty? Just wave to us, please, Steve.

Okay, so what I hope from the thousand-foot view is that you've got a sense of the type of security we're trying to do. And the security is not as foolproof as the encrypted envelopes.
The question for the room and the remote audience, and if you want to ask questions, please join in and try, is: is it worth... well, we think it's worth pursuing this type of security, so we're actually going to do it anyway. But what we want to hear is, what are your concerns about the path we're taking you down? Because if you look at the confidential virtual machines, they want a clear boundary around everything. When you're running a confidential workload, certain bits of that workload have to be protected a lot: you know, private keys, computation over secret algorithms. But the other pieces it runs on, say it's running on a vanilla Ubuntu operating system, it wouldn't really matter if anybody saw those. They don't have to be protected, but they are. And this is why it's very hard to orchestrate confidential virtual machines in a lean way like containers.

With our system, let's say we're running a web server that has a private key for signing just the HTTPS requests and nothing else. All we would do is put that private key into the secret memory area. All of the rest would be open to the host kernel to see, to share, and to do all of the other Linux things with. So we could bring up many containers running this type of hosting, because the only private thing in all of those containers is just the memory area that contains the key. It'll be one page per container in that situation. Obviously, if you have more keys and require more private memory, we can give you that. We can also give you network stack isolation all the way from the container down to the card.
So if you are running in the standard mechanism where the container is speaking HTTP and the card is actually adding the encryption, which is increasingly what people want to do with containers nowadays, we can protect the unencrypted traffic over this private network as well, by using secret memory for all of the packets that go between you and the card. So effectively we can give you a hardware endpoint. The thing about that setup is it looks very much like a virtual machine setup using virtual functions. The only difference is we've terminated the virtual function in the container, and we're using the privacy of the network namespace to keep all of the packets secret. If somebody actually attaches a tracer to it, the tracer will crash, because it can't see the content of the packets.

The problem in the container use case is that virtual functions are very expensive and not very prevalent. It's easy to get a 128-virtual-function network card, but on a container system we tend to bring up thousands to tens of thousands of these things. Getting 10,000 virtual functions on a network card is currently within the realm of possibility, but not many manufacturers produce them.

This is the type of world we're taking you to. In this scenario, it becomes quite a lot easier to try and trick the container out of its secrets, because so much of the container is actually open to an attacker and open to inspection. So let's throw it open to the room and to the remote audience and ask: what concerns do you actually have about this approach, and how could we do it better? And of course, any aspects of this approach you don't understand, we'll be happy to explain. I mean, we've done tons of technical presentations that you may not have been to, which you can find on the web, so if there's anything that you've missed, we'll be perfectly happy to explain it here as well.
And anyone in the remote audience, if you want to ask a question, just unmute your video. Please make sure, though: I see lots of you have joined in listen-only mode, and we won't be able to hear you if you join that way. If you click on the little telephone button at the bottom... actually, we're not showing that screen. You can do it here. Yeah, they won't really see me when I'm pointing to this. But the telephone button in the middle allows you to switch audio. You need to leave audio and rejoin with a microphone, and then you'll actually be able to talk to us. So let's start.

We can't hear you online.

You can't? Can you hear us now?

Yes, I can.

Okay, so it looks like Zach, you're the first person to ask a question. I'll unmute your video.

Okay, let me make sure... can I share the webcam?

Yeah, we'll see you in the room. Everybody will wave to you, won't we, everybody?

Okay. Hopefully this is... there we go. Okay, cool. Hey, so my question is: the idea is that you would encrypt both the network traffic coming into the container and the read-write file system layer?

Actually, no, the idea is that we would encrypt nothing. This is purely isolation technology, not encryption technology. The virtual machine sort of confidential computing is encryption technology, but we're using pure isolation techniques rooted in the memory management unit. The idea is that we bring up the network namespace in such a way that all of the network packets become invisible to any aspect of the kernel that is not in the network namespace we just created. So if you try and extract the packets from outside the container, everything within that network traffic becomes invisible to you, and invisible actually means you'll take a page fault if you try and touch it.

Yeah. Okay, so it's just namespaces, you know, the usual namespaces. Okay.
Yeah. And so, is the idea here that you're still trying to protect the containers from the host, or is it just to protect the containers from each other?

So the idea is that we are trying to protect them from the host, but we're leveraging the properties of the host to do that protection. The host kernel has to be trusted by both the container and by the hosting provider. The hosting provider provides the kernel, and we partly accept their guarantee that the kernel they booted is what we think it is; open source allows us to do this. Ideally, you'd probably have done a measured boot here, so that we have some confidence that what the cloud service provider tells us they booted is what they've actually booted. But we're assuming that some mechanism lets us agree with the cloud service provider on what the host kernel actually running is. And this is essential for any containers, because effectively we are not shimming any of the system calls that containers use. The containers themselves, to get the maximum orchestration ability, are using the entirety of the host kernel for most of their system call interface, so they have to be able to trust it.

So the assumption here is not that the cloud service provider is completely distrusted. The assumption is that the cloud service provider is trusted to a reasonable extent, but we're also using verification techniques like measured boot, which means that if the cloud service provider did something nefarious outside of that trust boundary, eventually we'd find out about it. Not immediately, but eventually. And we're assuming that the reputational requirements of the cloud service provider mean they never want to get caught this way, and therefore we can trust them reasonably well. So it's a trust-but-verify approach.

Give them a mic. Here, why don't you talk them through it? Here. So Steve wants to ask you to lower your volume. Yeah, if you could talk...
Yeah, just lower the volume and try talking again. I just want to see, because it comes out very high. I can hear you fine, but apparently some people in the audience are having trouble.

Well, here I'm fine. It's just that the gain is really high and it's blaring.

That's the problem: the camera is over there and the person I'm talking to is over there.

Yeah, you sound okay now that you're on the mic. You were a little bit rough for a bit there. I think you actually are sounding better now too. Did you change your volume?

Maybe it's from sitting where I was sitting, I don't know. But I'm usually pretty quiet, so sorry. I'm about to stop talking, so I won't change that mic setting too quickly. But, I mean, that covers my question. The other thing that you mentioned was about container breakout...

We lost you a bit. Can you repeat the question? You faded out while you were asking it.

So, is container breakout the other impetus for this?

Yes. So, assuming we've got two containers belonging to different tenants on the host: the host kernel is trusted, but we can't guarantee that it's free of vulnerabilities. So there may be an attacker in a container who's trying to break out and actually get your secrets. The main design goal of this is to ensure that an attacker who executes a privilege escalation on the platform can't actually get access to the secrets in the other container. So root on the system is also unable to extract data from this container. And the way we achieve that is we do things like disable ptrace, and the memory, as soon as it's accessed from outside this namespace, triggers a page fault for anybody who tries it. So you can't use any of the ptrace interfaces to extract the information. And entry to the container is locked down to the fork tree only.
So you can't actually enter the container. The only other attack you could do is to try and inject code through ptrace, which we will block. We haven't actually got that block in place yet; we need to block ptrace on the system. The real attack that you can mount is to recreate page table mappings. And this is really not easy. It's possible, so this is not foolproof in terms of security. This is raising the bar.

Yeah. I mean, that sounds good to me from a container point of view. Okay, thank you.

Anybody else have any other questions, now that we've proven it can work? Here as well, you're in the room. Give him a mic. Can we give Mike a mic? You can't; you'd get audio feedback in the room.

So, do we really think that a single system and a single kernel can be adequately secure for container isolation, even with the additional raising of the bar with these kinds of protected regions that you're pushing out of the direct map and whatnot?

Well, so there's the debate about what is adequate. We can't give you completely foolproof security, but I think all of the attacks on confidential computing prove that foolproof security is very hard to achieve. What we're really trying to do is enhance the isolation to a point at which container security is as reasonable as virtual machine security. I've done many talks where I've argued that we actually hit that point ages ago, but at least adding techniques like this allows us to explicitly protect the areas that you're concerned about. And this has its roots in confidential computing as well: if you think about what SGX did, it was trying to carve out little enclaves in processes too. All we've really done is a software isolation version of SGX, with no encryption and no attestation, because you trust the cloud service provider.
So I think it's hard for me to see how punching secret regions inside of containers makes it as secure as the virtualization-based security that's utilized in the very large cloud providers today.

Well, so the assumption is that it's tenant-based attacks we're worried about. The security of a cloud service provider running VMs today is the difficulty of breaking out of the VM envelope, right? For the container case, what we're actually trying to do is raise the bar on breaking out of the container envelope itself. Even more, we're trying to raise the bar for those who have already broken out of a container. So it's two things: we're trying to use techniques to raise the bar for breaking out; and then, if we assume you've broken out, how difficult is it to get back in and extract the secrets? It's that second one that the secret memory is going after.

So the second one, pushing secret memory out of the direct map and making it less accessible even to the kernel of the host system, I think that's generally a good thing to do. And it's a technique that, in fact, if you look at the Nitro hypervisor used by AWS...

That's an advert from Matt.

Yeah, none of the memory of guest VMs is in the kernel direct map. It's all excluded. But that's for all of the guest memory in normal operation.

Right, but that's because your normal operation has you orchestrating at the fat virtual machine level. And the specific design of this is to try and bring the orchestration back to the lean container level, which means we have to share certain aspects of the containers, the non-secret aspects, in order to get the leanness of the orchestration. This is the trade-off we're explicitly making.

And it's always in the sharing where the side channels come in and whatnot, so...

Yes, this is a rich picking ground for side channels, I do admit that.
But then I believe, if you look at most of the academic papers about virtual machines, there are rich pickings in side channels there too.

It all depends on whether one is employing kind of off-the-shelf, commodity-available virtualization technology that does things like same-page merging and things of that nature. And this is not how most cloud providers are operating.

Well, as a cloud provider, if we want to sell ultimate security to somebody, we sell them a bare-metal server. That's, you know, no other tenants on the server, have at it. That's the way we do it. And so we think that people who are running virtually, whether it's in containers or virtual machines, on the understanding that they have other tenants on the node, accept that there are leakage security risks from side channels and other things. And if we want to prevent that, we know how to: you run on bare metal and we charge you a lot more. I mean, security is something people pay for, right?

That's true. David?

Yeah, so I think I kind of get the idea as a defense-in-depth thing, but I'm struggling a little bit with the statement that I think Mike made about how, if you install a new page table entry, you could still get at the memory. Why is that hard? Why is that any harder than trying to do any of these other attacks? If you assume that you've already broken into the kernel, which I think is the assumption that you said, isn't that just an arbitrary write primitive?

Well, no. I mean, most attackers that break into the kernel don't get full control. What they usually get now is the ability to string together ROP gadgets from user space, or they can break out of the container containment, exploit some privilege escalation flaw in the kernel, and get root access to the host. So those are the two specific threat models that we're looking at. Root on the host can't just install page tables, because this memory is missing from the direct map.
So root on the host is actually pretty powerless in this scenario. It's the ROP gadget problem: if you can string together enough ROP gadgets, eventually you might be able to put together a chain that reinstalls the page table entry that is missing, the one that's causing the fault, and then you could read the data in the secret memory area. That's the attack that we're worried about.

But if you had an arbitrary write primitive, like if that was just the bug that you were able to exploit, that's sufficient, right?

It's sufficient, but it turns out that finding ROP gadgets that install page tables, because that's done in so few areas of the kernel, is actually quite a difficult thing to do. It's not impossible, but it does involve an awful lot of stringing together of gadgets. There's no one area of the kernel you can pull a ROP gadget out of that just installs a page table entry; you have to build it yourself.

If you're root on the host, you can install a module.

So the assumption would be that we've got a lockdown kernel and modules have to be signed. The cloud service provider might be able to install that module, and we would discover it later through the runtime audit trail of the system. So we're assuming they won't, and in theory the signing should prevent an attacker from installing it, unless the cloud service provider is very lax with their keys.

They can't hear you unless you speak into a microphone. Sorry, this is Steve Rostedt; he's being very bad.

Yeah, yeah. I thought I was loud enough that it'd go through your mic. I was just saying, once you get to a point where you can modify page tables, basically all bets are off, no matter what you do.

Yeah, once you've strung together this ROP gadget chain, all bets are off; you've compromised the kernel. We'd actually need hardware assist to protect it at that point.
And it's not impossible that we can enhance this with hardware assist going forward, because there are lots of isolation technologies beyond the MMU. We're only using the MMU, which is, what, forty-year-old isolation technology. We could use some of the newer techniques to actually help with this. And I think Kees was going to ask a question. He's unmuted his video, so he's ready to ask it. Can you guys hear?

Yeah, we hear you.

Oh, we couldn't hear you. Can you hear me now?

Yes.

Okay. There's going to be an interesting blog post, hopefully, from Jann Horn that really gets into how straightforward it is to do page table manipulations.

Okay, I'll look forward to that.

Yeah, it's rather terrifying, of course, because it's Jann.

Well, okay, so I'll prepare myself for the fact that he's just going to blow my ROP gadget argument entirely out of the water, and then I'm going to fall back onto novel hardware, which is what we always do in the security space, okay?

Yeah, I agree. So did you have any other questions or comments on this? Is that a...?

Stormtrooper, yes. A Stormtrooper behind me. Haven't you seen the Plumbers setup? No, I haven't. And I think the rest of it is behind the glass; that keeps it working. Oh yes, there it is. Good grief. A little sci-fi bar for you.

But since you're the resident security expert in the kernel, I mean, I know this is not perfect, but are you happy that it's raising the bar and going in the right direction, I suppose, is mostly what I should ask.

Yeah, I mean, I really like the fact that it has forced changes to the internal plumbing of the page table and GUP stuff, so that we can kick stuff out of the view of the kernel. The kernel doesn't just have access to everything by default now.

And now he says this! Where were you fourteen patch revisions ago, when we were doing the Sisyphean thing of pushing this up the hill and asking, does anybody support this?
Well, because my interests are much, much further down the road and not of immediate use. But that's what I've said any time anyone asked me about it. So like, sure, you can bypass it with ptrace, but the next version will...

Hey guys, you're using the chat in BBB and you're not supposed to be; you're supposed to be using the Excel chat. You're naughty. I would slap all your wrists if you were in the room, because I promised the Linux Foundation we'd mostly use their platform, so stop it.

Okay, that's it for me.

That's cool. I love reading this chat. If you've asked a question in the chat, you're really going to have to unmute your video and actually come on screen here in the big venue to ask it. Okay, how are we doing on time?

One minute.

Oh. One minute. So, any last questions or comments from anybody, either remotely or in the room? Okay, Juan, over here. I'll give you a couple.

Thank you. Okay, yeah. So I have to admit I'm seeing this for the first time, and I don't immediately see any objections. So I guess I'm going to give a sort of free one to you, which is: what's the most common objection you've either heard or anticipate, and your counter to it?

We've already talked about that a little bit. The most common objection is that it's not as secure as virtual machines, because virtual machines fully isolate all of the components, and the comeback is that we specifically excluded that case, because we need the sharing to get the container orchestration. And most VM people don't really understand side channels well enough yet to make the side channel argument that was made here, but we fully accept that containers are going to have more side channels than virtual machines, just because the interfaces are fatter and there's more leeway for deducing what's going on inside the execution.

If I may: if you're very, very seriously concerned about side channels, CPUs are cheap. It's much better just to have multiple computers, even regular servers. Yeah.
Run them on different computers, thank you very much. And so with that, we will draw this to a close and just say thank you very much, because you've been guinea pigs for the way we might choose to run a hybrid Plumbers next year. I think this format vaguely worked, and we will try and refine it by the time the Plumbers conference comes along. So thank you very much, everybody, and we will see you next time. Thank you.