Just a little introduction on why we love containerization. Containerization is a very nice, lightweight way to run applications in an isolated environment. It allows us to manage our applications better at large scale, and as users who run tasks it lets us control the environment our workloads run in, since containerization abstracts away the system where our tasks are running. This helps both the operators and the users of the containerized application. We can isolate physical resources like CPU, disk and network, as well as visibility of the system: we are containing away the whole outside system. That clearly defines an application surface in the end, which makes it relevant to the enterprise, since in that context we are interested in managing that surface.

But of course all this is not as easy as it sounds. Containers are just very lightly wrapped processes, so there is potentially some crosstalk between the container and the host system, since they both share the same kernel. Jay will explain to us how seccomp can be used to minimize that crosstalk, or at least control it. We might also require processes to run with privileged access inside the container, while we would not like the user starting that container to run with a privileged account on the host system. Srini will tell us about that when he talks about user namespaces later. Or we might have a container which we don't want to isolate that much from the host system, since we would like to have privileged access to host facilities. I will talk about capabilities to explain that a little better.

The goal of these three new features of the Mesos containerizer is, in the end, to improve isolation, to reduce the surface area for attack, and to allow processes to run with as few privileges as possible. So let me briefly talk about capabilities.
Capabilities are a mechanism, specified under POSIX and extended under Linux, to divide the privileges that the root user has into more fine-grained units, capabilities as they call them. Examples might be that a process binds to a privileged port, that we send signals like kill to a process which we didn't start, that we read files for which we don't have permissions, and many more. These are all things root can do. But of course, ideally you would not want your container to run as root. So the purpose of introducing capabilities is to better control privileged access to the system, since running with superuser privileges is just too much, and also to shield systems and users from making errors which could have effects beyond the containerized environment. None of that really fits our expectation of containerization very well.

As a motivating example, here are two phone applications which provide a flashlight. This is an advertisement, which is why it says "the competitors" and "ours". The one on the left-hand side requires basically the maximum set of permissions. The one on the right-hand side only requires access to take pictures and videos, which basically means controlling the camera where the flash sits. At least I would be much more comfortable running the application on the right than the one on the left, which wants to make phone calls, for example. I don't know why my flashlight needs to make phone calls. This kind of maps to what we want to expose in Mesos.

This is the list of capabilities currently supported in Mesos. There is a large list of capabilities in Linux that keeps growing with each new kernel version, and this is the list we currently expose. There is, for example, NET_BIND_SERVICE here, which allows a process to bind to a privileged port.
There's KILL up there, which is the capability to send arbitrary signals to other processes, or DAC_OVERRIDE and DAC_READ_SEARCH, which are related to file permission checks. And there are more, for example for setting nice levels or setting the system time. One big capability is SYS_ADMIN, which is like a grab bag that keeps being reshaped, getting smaller and more defined. There's also NET_RAW, which we'll use later in an example, and which is related to having raw access to a socket.

This is the protobuf message that we added to Mesos. We also added an isolator, which we call the Linux capabilities isolator. The idea is that the operator sets up an agent with a set of allowed capabilities, the capabilities the operator is comfortable giving to users. The user then requests, in a message, the capabilities required for a task. In the end, the agent capabilities are always a limiting set for what capabilities a task can request, and the processes running inside that task are further restricted by the task. That gives us that nice containerization and a defined surface area.

As for future extensions: currently non-root tasks, the tasks not running as root, can effectively only use file-based or libcap-based capabilities. They, for example, lose all their capabilities if they fork and exec. Linux 4.3 and later introduce ambient capabilities to address this issue, and we are thinking about adding support for this kind of feature to Mesos. For example, we could expose ambient capabilities, or we could use the user namespaces that Srini will tell us about.

So what do we have now? We have new abstractions to actually talk about capabilities on the agent and task level. We have interfaces for operators to grant capabilities to tasks and for users to request capabilities. So we have a new tool to work with privileged tasks.
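The relationship between the agent's allowed set and a task's requested set can be sketched in a few lines of Python. This is only a toy model of the rule described above, not Mesos code: the function name `grant_capabilities` and the use of `PermissionError` are made up for illustration.

```python
def grant_capabilities(allowed, requested):
    """Toy model: a task may only get capabilities the agent allows.

    `allowed` and `requested` are iterables of capability names such
    as "NET_RAW". Hypothetical helper, not part of Mesos.
    """
    allowed, requested = set(allowed), set(requested)
    denied = requested - allowed
    if denied:
        # The launch is refused when anything outside the agent's set is asked for.
        raise PermissionError(
            "requested %s but only %s allowed"
            % (sorted(denied), sorted(allowed) or "nothing"))
    return requested


if __name__ == "__main__":
    print(grant_capabilities({"NET_RAW", "KILL"}, {"NET_RAW"}))
    try:
        grant_capabilities(set(), {"NET_RAW"})
    except PermissionError as err:
        print("launch refused:", err)
```

With an empty allowed set, any request is refused, which is the situation the capabilities demo later in the talk sets up on purpose.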
And we'll hear from the next speaker.

Okay, thanks. First of all, I wonder how many of you have ever heard about seccomp? Please raise your hand. Great, I see quite a few. So I'm going to give a brief introduction to seccomp itself.

So what is seccomp? The full name is secure computing. It was introduced into the kernel in 2.6.12, and it puts your process or application into a really restricted way of executing. The very first version of seccomp supported only four syscalls: read, write, sigreturn and exit. So you need to open the necessary files before you put the process into seccomp mode; then you enable seccomp, load that into the kernel, and start execution. This is a very limited way to secure your app or your process. Basically, the mechanism restricts the syscalls a process can make to the kernel.

Since kernel 3.5, seccomp version two has been available, which is basically called filter mode. You can install a custom filter to limit your syscalls, so it is not only those four syscalls anymore.

I want to emphasize that it's a one-way transition. You can put your process into seccomp mode, but you can never withdraw it without exiting the process. The reason is that there is no way you want to allow a malicious hacker to withdraw the profile or the filter and then attack your kernel. So it's always a one-way transition: once you're in, you're never out.

Next: why do we need seccomp? As you see, it's a syscall filter you put into the kernel, so of course it's there to reduce the attack surface of the kernel. As Benjamin mentioned before, your container shares the kernel with the host and, of course, with the other containers running on the same host. So you really want to limit what a container can do; you want to limit the attack surface of the kernel. It's not a virtual machine.
If your virtual machine is compromised, you've still got the hypervisor, and only when the hypervisor is compromised too do you reach the host kernel. With containerization that layer is never there. With seccomp enabled, you can execute customers' code with more confidence, because you know there are certain syscalls it is never allowed to make. For example, say there is a zero-day attack and you know a certain syscall is compromised and a hacker can use it to exploit your kernel. The simplest way to deal with that, without patching your kernel, is to disable that syscall.

So how does seccomp work? I probably already mentioned it a little. It's essentially a Berkeley Packet Filter (BPF) program that you load into the kernel. From that point on, every syscall you make goes through the filter first, through all the rules in that filter, until it hits a clear red light or green light. For example, if you filter nanosleep and set its action to kill, then when you just type sleep in bash your process gets killed immediately.

There are a couple of actions available: kill, allow, error, trap and trace. The most common ones are kill, allow and error. If you put kill on a syscall, your process gets killed. Allow is of course the green light. And error means: yes, you can make the syscall and your process won't be killed, but it will always return an error.

I need to mention that there's a performance penalty introduced by the syscall filter, because every syscall needs to go through the filter. So if it's a huge, complex filter, you want to put your most common syscalls as early as possible, so that they hit a rule as quickly as possible.

So who's using it? Seccomp was introduced into the kernel to restrict your process, and we have OpenSSH using it.
We have vsftpd, and LXD, which as some of you might know is another container runtime, probably one of the very early ones. Chrome is using it to sandbox, for example, the Flash Player. And Docker supports it, by default starting from 1.10. So whenever you docker run a container, a seccomp filter is actually already installed. I think the default filter is a very modest protection of your kernel: some of the syscalls are disabled, for example unshare. So if you do docker run busybox unshare --pid, it won't work; the call will be denied.

We encourage you to do that with Mesos as well: put a default protection on your agent or your host. We are still hesitating to make that the default, because it might cause backward compatibility issues and break your existing applications. But we encourage you to do that.

When it comes to Mesos, we add an isolator, linux/seccomp, and you can use --seccomp_profile to enforce your own filter. Of course you could grab the one from Docker, and if OCI ever introduces a spec for seccomp, we're going to support it. You can also come up with your own seccomp profile, either stricter or more customized.

To sum that up, we have two levels of protection. One is to protect your agent against the users. The other is to protect your executor against the tasks. The former is enforced by the operator of the cluster, and the latter is enforced by the framework, so by the users. Why? Probably for a very similar reason to the one Benjamin mentioned before for capabilities: we want to have this extra protection enforced by the user.

And yes, we are using libseccomp. So if you want to use this feature you need to have libseccomp installed. It's open source under the LGPL license.
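Going back to how the filter decides the fate of a syscall: the behaviour described above can be pictured with a small Python model. This is a simplification under stated assumptions. A real seccomp filter is a BPF program inside the kernel, and the `Filter` class, its rule list and the action strings below are made up for illustration; they are not the libseccomp API.

```python
ALLOW, ERRNO, KILL = "allow", "errno", "kill"


class Filter:
    """Toy syscall filter: ordered rules plus a default action."""

    def __init__(self, default_action, rules):
        self.default_action = default_action
        self.rules = list(rules)  # [(syscall_name, action), ...]

    def check(self, syscall):
        """Return (action, number of rules examined)."""
        for examined, (name, action) in enumerate(self.rules, start=1):
            if name == syscall:
                return action, examined
        # No rule matched: fall back to the default action.
        return self.default_action, len(self.rules)


# Blacklist style: allow everything, kill on nanosleep.
blacklist = Filter(ALLOW, [("nanosleep", KILL)])

# Whitelist style: return an error unless the syscall is listed.
# Frequent syscalls go first so they match after few comparisons.
whitelist = Filter(ERRNO, [("read", ALLOW), ("write", ALLOW), ("openat", ALLOW)])

if __name__ == "__main__":
    print(blacklist.check("nanosleep"))
    print(blacklist.check("read"))
    print(whitelist.check("unshare"))
```

The second element of the returned tuple is only there to make the ordering point visible: a syscall listed early is decided after fewer comparisons than one listed late or not at all.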
So you can install that beforehand and compile Mesos with --enable-seccomp.

Now let's talk about user namespaces. Later on we'll give a demo of all the features.

This namespace was added fairly recently, as recently as 2013, and it is gaining some traction. It was first released in kernel version 3.8. The need for user namespaces is that, like any other namespace, it provides isolation, in this case of users. So far we have been running containers in privileged mode, as a root process, which is not good: it can do damage to your host system. User namespaces allow you to virtualize the users, so that the users inside the namespace are isolated. As an example, take the PID namespace: the process tree inside the PID namespace is different from the process tree outside it, and processes have no clue about the PIDs outside the namespace. In the same way, the users running inside the user namespace have no idea about the users outside the namespace. It is useful to run processes with different users outside and inside the namespace; we will see how.

Like I said before, user namespaces are useful for running containers in privileged mode. That means the container is running as root inside, but it is not root from outside. This is done by mapping an unprivileged user outside of the user namespace, which is probably your host, to root inside. Inside the container, inside the user namespace, the process has UID 0 and GID 0, which lets you do things that a normal root user can do, like installing software. At the same time it protects your global resources: the containers running inside the user namespace will not have access to certain things outside, even though they have root privileges inside. When a container is launched it obtains its capabilities from its parent. It will have all the capabilities inside the user namespace, but it will not have any capabilities in the parent.
That means you will not have global capabilities such as CAP_SYS_TIME: you cannot change the system time, add devices or load kernel modules. That's the kind of protection you get from user namespaces.

User namespaces are useful when you map the users from inside the namespace. For example, the user inside the container could be root in two different containers. Container A has UID 0, which could be mapped to UID 500 on the outside, on the host. Similarly, container B, which is also running as root, may be mapped to UID 800 on the host. The processes running inside the user namespace have no idea what their privileges are outside. This is accomplished by using UID and GID maps. These map files, uid_map and gid_map, exist under /proc for your container process.

There is also the concept of subuid and subgid files. This was introduced along with user namespaces in kernel 3.9. With this functionality, when you add a user on your Linux host, a range of IDs is created for that user and written into the subuid (user ID) and subgid (group ID) files. These ranges can be used by these users to map into their containers. Docker today uses the subuid ranges. What Docker does, if you enable user namespaces today, is create a repo with that user ID and write the image layers into that repo. So that means you're running a separate repo for Docker.

As I explained, the user mapping is written into the uid_map and gid_map files. As you can see in the picture above, root, UID 0, in the first container is mapped to the ubuntu user, UID 1000, on the host, and the second container running as root is mapped to an unprivileged user.

What does this mean for Mesos? Mesos currently runs tasks in two modes: a task runs either as a privileged task or as an unprivileged task. A privileged task runs as root, so root inside the container is nothing but root outside the container.
For unprivileged tasks, you launch the task as a non-privileged, non-root user. By default in Mesos there is a flag, switch_user, which is set to true, and the task runs as that non-root user. It does not have many capabilities when the task is running; it cannot do a lot of the things that a privileged container can do. So user namespaces are important, and we need to enable them.

There are a few things that need to be done to enable user namespaces. One is to add a new agent flag and add an isolator. Most isolation work in Mesos is done through the isolator concept. There are isolators for pretty much everything, like cgroups, or seccomp as you heard. Similarly, there will be an isolator for user namespaces. We also have to manage the user mappings for each of the users the containers run as.

The agent flag: if you are running the agent as a privileged user, you add namespaces/user to the agent's isolation flag. That tells the agent that when a task is launched as a non-root user, the task will enter a user namespace and run inside it. The task is normally started by an unprivileged user, as shown in the last line here. If you're running as the root user, there is no point in mapping root inside the container to root outside the container: you would still be running a privileged container with full root access.

So the isolator for user namespaces is just a marker which tells the agent that this process is going to run inside a user namespace. Most of the work is done by the executor. When the executor knows that it was launched as a non-root user, it makes the task enter the user namespace by joining it. Once the process has joined the user namespace, the executor goes ahead and writes the uid_map and gid_map files that map the user to root inside the container.
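The effect of such a map can be pictured as follows. Each line of a uid_map file has three fields: the first ID inside the namespace, the first ID outside it, and the length of the range. The sketch below is a plain-Python model of that lookup, not Mesos code; the helper names are made up, and the example entries reuse the container A and container B numbers mentioned earlier.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            entries.append((inside, outside, length))
    return entries


def to_host_id(entries, inside_id):
    """Translate an ID seen inside the namespace to the host ID.

    Returns None when the ID is not covered by any entry.
    """
    for inside, outside, length in entries:
        if inside <= inside_id < inside + length:
            return outside + (inside_id - inside)
    return None


# Hypothetical single-entry maps: root inside, unprivileged outside.
container_a = parse_id_map("0 500 1\n")
container_b = parse_id_map("0 800 1\n")

if __name__ == "__main__":
    print(to_host_id(container_a, 0))
    print(to_host_id(container_b, 0))
    print(to_host_id(container_a, 1))
```

Both containers see themselves as UID 0, yet they resolve to different unprivileged IDs on the host. Where this model returns None, the kernel would report an unmapped ID as its overflow ID instead.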
Suppose you are running as the ubuntu user: the ubuntu user on the host gets mapped to root inside the container. This is all done by the executor for you.

There are problems with user namespaces. That is the reason why they haven't gained much traction yet, even in Docker. The way Docker does it is, like I explained, to have a separate repo for each of the users. You can launch the Docker daemon with one user remapping to enable the user namespace, and all the images that you download will be chowned to that user. This is one of the problems: it's hard to share the root file system between different users, because the file systems do not have this capability.

The second problem we have with user namespaces is that the mount system call is denied by default inside the user namespace. Only file systems that are marked FS_USERNS_MOUNT are allowed to be mounted in the user namespace, and there is no real way to relate the user inside the user namespace to the user outside it. But there is some work being done in this area. The FUSE file system right now allows you to manage mounts for the user running inside the user namespace. And there is a patch currently in progress on the kernel side which will allow you to shift the UIDs and GIDs from inside the user namespace to outside it. That's a virtual file system (VFS) patch that will hopefully be merged at some point so that we can leverage it. Then we will no longer have that restriction on mounting file systems inside the user namespace.

So that's pretty much it about user namespaces. But I want to give you a brief update on where we are with all three features that we talked about today. After that, we will have brief demos of all three. First of all, user namespaces: the patches are there and preliminary functionality is tested. We think the review will happen soon, and then we still have problems to solve around the file systems.
The capabilities code is already in mainline, so please use it and enjoy. For seccomp, likewise, several patches are up for review right now. All in all, seccomp, capabilities and user namespaces working together will improve your container security. By adding all three together, you restrict what your container can do and minimize the attack surface of your container. It's demo time, and I'll hand it over to Ben.

So I'll show you a quick demo on capabilities. Just to orient you (I will increase the font size): up here is the master, running as non-root. Down here I have an agent node running as root, and over here I will start some tasks as non-root. Let me first restart the agent here. Okay, so this is a released version of Mesos. I'm using sudo to start the agent, talking to a master I have running, with a work directory given here because I have not installed it. Here I have the flag for allowed capabilities and behind that the JSON. You see some empty brackets here, so I allow nothing. And I specify isolation with the Linux capabilities isolator. So let's start that, and now it's talking back to the master, which is nice.

The task I want to run here is ping. You see here that ping uses libcap, which is a library to work with capabilities. It uses libcap to request the NET_RAW capability, which I can request as a user, and that is then used to send packets to some host. The command I want to run is ping with one packet to localhost. And you see that works if I'm not in a container.

Now let's use mesos-execute to run that. I'm using mesos-execute, talking to the master. You see my command here. I give it a name, my task, my framework, and I request the NET_RAW capability. Just remember that the agent's allowed capabilities were empty.
So if I run that, I expect it to fail. And it does fail. I get told down here: failed to launch container, capabilities requested NET_RAW but only nothing allowed. So this is good.

What happens then if a user does not bother to declare that he wants to use NET_RAW? What do I expect here? This executable will still use libcap to get the NET_RAW capability, so I also expect this to fail. And if I execute that, well, it fails, so I'm happy. Let me show you the log. Let's just look at standard error, since it failed. It says that ping couldn't open that socket since the operation is not permitted. That makes me happy, because mesos-execute can only do what the operator allowed. Okay, next we'll see something from Jay on seccomp.

So first of all, I want to show you the default seccomp profile from Docker. It looks like they are taking a whitelist approach. You see the default action is error, so you allow these syscalls, and anything other than that is disabled and returns an error.

I'm starting a master here, very normally, as you would start it. But I want to show you Docker first, actually. This is probably a good moment to show you the Docker version: it's 1.11, which is very old. If I run this, docker run busybox unshare --pid: unshare is not allowed in the default profile, and I'm not doing anything special here. So you'll see it says operation not permitted. That's how Docker enforces a default profile.

If you want to do the same thing in Mesos, you first create a default JSON file specifying your rules. I happen to have this one at hand. It's very simple: the default action is allow, and I'm going to kill any process that calls nanosleep. That's basically the sleep command you've been using. So, what I'm expecting: I'm starting the agent with isolation set to linux/seccomp.
And I'm using the profile I just created to disable the sleep call. So I'm starting the agent, and I'm going to execute this command. The command is sleep for one second. Normally it will of course work, but with seccomp enabled, you get task failed. And if you look at the log, you will see bad system call. That's what happens when you use seccomp. So, for example, if some day the nanosleep call suddenly turns out to be a compromised syscall that people can use to exploit the kernel, you can simply disable it using seccomp. Okay, I'm going to hand over to Srini.

Three demos in a row; hopefully mine works. Do you guys see the screen? At this point I'm going to start the agent. As you can see here, my isolation is set to namespaces/user, so that means I have user namespaces enabled. I will start the master and slave over here.

Let's first launch this as the root user. We know we are running as root, and there's actually no point in using namespaces. If you go and look, I just have a sleep command there, and the sleep is running as root. So basically nothing much happens here. In this particular command I'm doing a few things. I'm printing the ID and sleeping for some time, so that I can see what user I'm running as. I will also look at the map file that the running process has written. And I also touch a couple of files, so that I know that I am doing something with the file system. Bear with me for a second. I touched two files as root, so they are written as root. Let me clear up those two files; bear with me for a second. The other thing I wanted to show you is in the logs, but it did not write the logs.

Let's go and start as an unprivileged user; that's more important. As an unprivileged user, my ID is 1002. So that means I am launching the task as this 1002.
And if I try to touch a file in /etc: permission denied. As this user I don't have access to that /etc area. So now I'm launching the same task through Mesos. I want to print the ID, sleep for 10 seconds, look at the UID map, and then touch those two files and see what happens. As you can see, it is sleeping there. And if I look at the sleep, it is running as the unprivileged user. That means the task is running as the unprivileged user. The only thing I need to know is whether it is in the user namespace or not.

You see this task has exited, and if you go and look at the logs of this task, you can see the ID I printed. It printed UID 0, GID 0. This task ran as root inside the container, but it is not root outside the container. You can also see the map file here. The mapping has been done this way for now: one-to-one mapping for the first 1001 users. My UID is 1002, so that is mapped to root inside the container. So in one entry, root, 0, is mapped to 1002, which is my unprivileged user ID, and the rest of the IDs are mapped one-to-one thereafter. I randomly chose 64K.

The other important thing to notice here is the touched files. It touched one file where it is allowed, the UNS touch test file, and when it touched that file, it was created on the host as the unprivileged user. So on the host it is owned by our unprivileged user. The task itself failed because it does not have permission to /etc, where it tried touching the other file. So that means it worked as expected.

That's pretty much it for my demo. If you have any questions, we would like to hear them. Thank you very much for your time.