Anyway, so I will be talking about containers today, but half of my presentation is not about containers. First, who here has played with Linux containers? Okay, there are some people. I hope that by the end of the presentation some of you will decide to join the container camp, writing Perl code. So who am I? I'm the Chief System Architect of SiteGround. I'm also CEO and CTO of two companies, I organize a few conferences, some of you may know me from YAPC::Europe 2014 in Bulgaria, I also teach some stuff, and I'm the maintainer of Linux::Unshare, Linux::Setns and, very soon (next week), Linux::Cgroup. In 2015 there were two talks about containers in Perl. Unfortunately, I didn't find everything that is available for Perl in those talks, which is not great, since Linux::Unshare is, I think, a five or six year old module, Linux::Setns is from around 2013, and these were not included. The other thing is why you need containers. I need containers because I offer containers to our clients. But in order to start containers there, I use LXC or LXD, which are tools outside Perl, while all of my management code is written in Perl. And I was like: okay, what is a container, and what do I need to do to make it run entirely in Perl? Because I really don't want to shell out with system(); that's something I really don't like. And currently the usual ways people create containers are LXC, LXD, runc or Docker. These are your options, and all of them require system() calls. There is also another use for containers. I'm an infrastructure guy, but maybe you want to run your application sandboxed: you have an API, it receives some code, some data that cannot really be sanitized, or you're not sure it can be sanitized, so you want to create a sandbox. The first part of my presentation will be how you would approach this problem, and I'll finish with containers, which are the maximum security you can get for your application.
I have some example code, so I hope that makes it a little bit easier to understand. Isolation and sandboxing: I tried to make this the most complete guide to securing applications on Linux. I found about 50 different presentations on the Internet that don't mention some of this stuff. chroot is well known. When you're using it, try to chroot into a directory that is not on an overlay filesystem, because most of the security issues with chroot were caused by a faulty filesystem (I mean the filesystem code had a bug), or, more usually, by a problem with the configuration of the overlay filesystem. And overlays are used in Docker, runc and most of these applications. Also, try not to share the whole filesystem with your application: bind mount what you need inside the chroot and nothing else. You don't need more than what your application actually requires. Dropping privileges, you'll see, is not actually trivial in Perl; you can shoot yourself in the foot a few times, like five times at least, before you do it properly. Most developers simply don't remember that Linux has capabilities, and they don't use this feature of Linux. You also need to set limits. Most of you understand ulimits very well, but I don't think any one of you is using cgroups for your applications. This is why I started writing the cgroups module: cgroups are a very good set of limits that you can apply to multiple processes, not only to a single process. Then you can add Linux namespaces, with which you can completely separate this application from the other applications on your machine, which is what containerizing it actually means. And the last thing you can do with your application is use seccomp. seccomp is a filter for the syscalls your application is allowed to make. So let's look at these things. chroot we have out of the box in Perl; that's nice.
The one thing you have to remember is that after chrooting you have to do a chdir, because if you don't, you're still outside of the chroot directory and you have access to files outside the chroot. Then, setting ulimits is easy; you have these in the POSIX module, and I'm not going to talk more about those. Dropping privileges, now this is the funny part. We have special variables that we use to change the user ID and the group ID. And unfortunately, some people mistake the effective user ID ($>) for the real user ID ($<), because they are so close on the keyboard. The problem is that your real user ID is not what the kernel checks for permissions, so if you don't change the effective user ID, you haven't dropped any privileges. I'll continue with that on the next slide. The other thing is the same problem with the groups. For most processes, when you start a program on Linux, you don't have a single group; you have at least five groups that your user account is part of. So when you change your group ID, you're changing your main group ID, but the other five stay with you, and your application still has group access to files owned by those other groups. In C, what you would usually do is call the initgroups() function, and it will clear your groups. Unfortunately, we don't have this in Perl. But I found out, and this is not in the documentation, I don't know why, that if you assign to the group variable $) the same group twice, it will actually call setgroups() with a single group, and you get the same functionality as initgroups(), which is quite nice. For capabilities, you can use Linux::Prctl; I'll show you in a bit. Then, for creating a new namespace, you can use Linux::Unshare: if you want to do the fork inside your application, you would use Linux::Unshare.
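The two-GID assignment trick can be sketched like this. It's a minimal sketch, not production code: the subroutine name drop_groups is mine, and actually changing groups needs root, so the script only performs the drop when running as root.

```perl
#!/usr/bin/perl
# Sketch: dropping supplementary groups in Perl.
# Assigning the SAME gid twice to $) makes perl call setgroups() with a
# single group, mimicking what initgroups()/setgroups() cleanup does in C.
use strict;
use warnings;
use English qw( -no_match_vars );

sub drop_groups {
    my ($gid) = @_;
    # $( is the real GID; $) reads back as "egid gid1 gid2 ..." (effective
    # GID plus supplementary groups). Writing "gid gid" triggers setgroups().
    $REAL_GROUP_ID      = $gid;          # $(
    $EFFECTIVE_GROUP_ID = "$gid $gid";   # $)  -- the two-gid trick
    die "setgid failed: $!" unless $) + 0 == $gid && $( == $gid;
}

# Only root may call setgroups(); otherwise just show the current groups.
if ($EFFECTIVE_USER_ID == 0) {
    drop_groups(99);
    print "groups now: $)\n";
} else {
    print "not root, current groups: $)\n";
}
```

With a single GID in the assignment, perl only calls setgid() and the supplementary groups stay; the duplicated GID is what forces the setgroups() call.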
If you want to create a new namespace, but inside a new process that is not related to yours, you can use Linux::Clone: clone() creates the new process. If you only want to enter a namespace, say you already have 10 containers on your machine and you want to enter one of them, like only its network namespace or its IPC namespace, and execute some command, like connecting to the MySQL socket inside of it, you can use Linux::Setns, and that is very, very easy. If you want to create a control group, you would use Linux::Cgroup. Currently you would probably create the directory and initialize the control group and a lot of other stuff by hand in order to actually use that cgroup. For seccomp rules, you can use the Linux::Seccomp module, so we have this now. The proper way to actually drop your privileges would be something like this. You would first create the control group. A lot of the stuff has to be done before you actually drop your privileges, so some of these steps can be reordered. But, for example, why are we creating the control group before chrooting? Because the control group is usually mounted at /sys/fs/cgroup, which is fine, but after you chroot, you don't have /sys/fs/cgroup. If you bind mount it inside your chroot, you're exposing a lot of the kernel to the user, which is something you don't want to do. This is why you first create the control group outside of the chroot. If you want to create other directories for your application, you have to do that before you drop privileges. Then you set your limits. When you do this, if, for example, one of the limits is the number of processes you can have on the machine, and you have already reached that limit when you set it, this will stop your application. This is why I set the limits second; maybe you can even swap them and set the limits first, so you can exit right away without touching the filesystem.
Then you chroot. Then you may want to restrict the syscalls you allow the process to make, and you can drop some or all of the Linux capabilities. I have an example of dropping capabilities in the presentation, I think, so you'll see why. Then you create the namespaces. After that, you're in a completely different set of Linux namespaces. This means that if you have created all of the new namespaces with Linux::Unshare, you don't have IPC connectivity to the host machine, you don't have networking to the host machine, you don't have the filesystems of the host machine. Your process ID is different from the one seen on the host machine, your user ID may be zero even though you're not root, and you can do a lot of other stuff there. When you do this, your application sees a completely different system. And then you should drop your groups, which, as we saw, means assigning to $). Last, you drop the user ID, because if you drop the user ID anywhere before that, you lose root privileges and cannot do any of the previous steps. So, the problem with setting the user ID. Usually what you would do is set the real user ID, the effective user ID, the group IDs. And what I have done here is run the id command via system() simply to see what's happening, what user ID I am now. What I see on my laptop: here are my first id output and my second id output. The first shows that I'm root, my group is root, and these are the other groups I'm part of. After I switch to 1001, my wife's user ID, I have changed the user ID, but now I have even more groups, because I added Toni's groups on top. As you can see, I changed the user ID and changed the group ID, but I'm still in the old groups.
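Putting that ordering together: this is only a sketch of the sequence, with the real calls shown as comments (the module APIs are hedged, Linux::Cgroup is the speaker's module, and every one of these steps needs root to actually run).

```perl
#!/usr/bin/perl
# Sketch of the privilege-drop ordering described above. The comments
# name the calls involved; actually performing them requires root.
use strict;
use warnings;

my @steps = (
    'create the control group',          # outside the chroot, under /sys/fs/cgroup
    'set the rlimits',                   # setrlimit / ulimits
    'chroot and chdir',                  # chroot($dir); chdir('/');
    'install seccomp rules, drop capabilities',  # Linux::Seccomp / Linux::Prctl
    'unshare the namespaces',            # Linux::Unshare::unshare(CLONE_NEW...)
    'drop the supplementary groups',     # $( = $gid; $) = "$gid $gid";
    'drop the user ID last',             # POSIX::setuid($uid), clears the saved UID
);

printf "%d. %s\n", $_ + 1, $steps[$_] for 0 .. $#steps;
```

The cgroup comes first because it lives outside the chroot, and the UID drop comes last because every earlier step still needs root.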
So now I change the group ID properly, and here is what you see. Usually, if you don't give it two group IDs, perl doesn't call setgroups() and simply does setgid() and nothing else, and setgid() doesn't change the other groups. Giving the same GID twice triggers a call to setgroups(), and now I'm left only with Toni's groups. This is easier for me, and this is the proper way to do it. The other thing: I have a longer example. This is my ping, a normal ping; it should show me whether the host is alive and reachable. I didn't show the whole code, but that's enough. So what I did is: okay, I changed the groups properly, changed the user ID, and then called system("id"), right? What's happening here is that I see that I cannot access the socket. Net::Ping currently has a check on the effective user ID; I had to remove it just to show you this example, so normally you wouldn't even reach this part. But after removing that check, you see that you cannot create the raw socket. This is because you're now user ID 99, a normal user, and it doesn't have the capability CAP_NET_RAW. When you have this capability, you can create this socket without having root access. So if we add this line here, after the system() call and before the actual ping, using Linux::Prctl to set the effective capability CAP_NET_RAW to one, then we can actually ping. Why? This is a big issue for me, because you shouldn't be able to do this, and I'll tell you why. Here you're user ID 99, and user ID 99 shouldn't be able to set capabilities. But it actually did, with user ID 99, and we did the proper thing with the groups, so it's not the groups. The problem is here: when we are setting the user ID, perl actually calls seteuid(), the library function for setting the effective user ID. And this function changes your user ID but saves your previous user ID in the saved user ID.
Unfortunately, your application can now go back to root without a problem. And if your application can do it, somebody attacking your application can do it too. So that's the case here: if you don't actually clear the saved user ID, you're screwed. In order to do it properly, you need to use the setuid function from the POSIX module, because it's the only function that calls setuid() directly, and setuid() also resets the saved user ID. After you do this, you can no longer change the capabilities, and it's all fine, it's all good; it's what we were expecting. One other thing about capabilities, for those of you who are not used to them and don't know how they work. You have the effective capabilities: this is a bitmask that combines all the capabilities you currently have. Root usually has all the capabilities, so the bitmask is all ones. Then you have the permitted capabilities: these are capabilities that you are allowed to receive, for example if you start an application that has file capabilities set on it. And then you have the inheritable capabilities, another mask, which is used when you're forking or exec'ing another application. With this mask you can say: okay, I want to leave only CAP_NET_RAW. You set only CAP_NET_RAW there, and when you create a new process, it will have only that capability. But inside your application, setting only the inheritable mask wouldn't change the outcome of the ping, because it applies to the next process; this process has to drop its own capabilities first. So you have to set the effective capabilities to zero for everything except CAP_NET_RAW, okay? I'm skipping the example here; the slide should say seccomp, sorry. And we are continuing now with the containers part: creating namespaces with Perl and control groups with Perl. Usually what you have is a parent and a child process.
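The saved-UID trap can be sketched as follows. This is a minimal sketch under the assumptions above: the subroutine names are mine, it needs root to show the real effect, and when run unprivileged it just reports the current IDs.

```perl
#!/usr/bin/perl
# Sketch: why assigning $< and $> is not enough to drop root.
# Assigning $> goes through seteuid(), which leaves UID 0 in the *saved*
# UID, so the process (or an attacker inside it) can switch back.
# POSIX::setuid() calls setuid(2), which, for root, resets the real,
# effective AND saved UID at once.
use strict;
use warnings;
use POSIX qw(setuid);

sub drop_uid_badly {
    my ($uid) = @_;
    $< = $> = $uid;    # seteuid()-style: the saved UID still holds root
}

sub drop_uid_properly {
    my ($uid) = @_;
    defined setuid($uid) or die "setuid: $!";   # clears the saved UID too
}

if ($> == 0) {
    drop_uid_properly(99);
    print "dropped to uid $<, euid $>\n";
} else {
    print "not root; uid=$<, euid=$>\n";
}
```

After drop_uid_badly a seteuid(0) would still succeed; after drop_uid_properly it fails, which is the behavior you actually want.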
And when you create a namespace, this becomes a little bit of a problem. For example, if you create a new mount namespace in the child process, the child cannot see any file that the parent can see: unless you have bind-mounted some of the filesystem beforehand into the new root that the child process sees, from the child you cannot see any files at all. The same happens for every namespace. If you create a Unix socket in the parent and expect to access it from the child, you can't, because in the new IPC namespace that socket doesn't exist at all. These numbers here are internal kernel identifiers for the namespaces you are currently in; if they are not the same for both of your processes, you don't have access to that shared information. Networking: if you create a new network namespace, keep in mind that you won't have any interfaces there. You have to create the loopback, you have to create eth0, you have to set up routing inside of it. Currently in Perl we don't have libraries for this, which is something I really hate and am working on, because in order to configure the network interface inside this new network namespace I have to run ip route commands. And that is system(), system(), system(), system(), and Perl is not system(), right? That's not very good. PID namespaces and user namespaces: for the PID namespace, for example, if these are the PIDs on the host machine, inside the new namespace the child's process ID can be completely different; it can be 1, it can be any other PID number. So all of this is a communication problem between the parent and the child. And when you create a container with Docker or LXC or LXD, the problem is even bigger, because that parent and child are not actually parent and child.
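Those internal kernel numbers can be inspected without any modules: on Linux, each entry in /proc/self/ns is a symlink whose target names the namespace type and its inode, and two processes share a namespace exactly when those targets match. A small sketch:

```perl
#!/usr/bin/perl
# List the namespaces of the current process. Two processes are in the
# same namespace iff the readlink targets (e.g. "net:[4026531992]") match.
use strict;
use warnings;

my %ns;
opendir my $dh, '/proc/self/ns' or die "no /proc/self/ns: $!";
for my $entry (grep { !/^\./ } readdir $dh) {
    $ns{$entry} = readlink("/proc/self/ns/$entry") // 'unreadable';
}
closedir $dh;

print "$_ -> $ns{$_}\n" for sort keys %ns;
```

Comparing /proc/PARENT/ns/ipc with /proc/CHILD/ns/ipc this way is a quick check of whether the two can still see each other's sockets and shared memory.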
You have to separate your application in two: a daemon on the host machine and a daemon inside the container, and you have to think about some TCP or UDP connection between them simply to make them talk. So this is a very big problem. Also, when you change the PID namespace, you cannot even send signals to the process, which is very, very bad. Okay, and if you don't unshare all of the namespaces, the process can still access parts of your operating system. For example, if you haven't created a new network namespace, it still has access to the network of the host. So if you want to run your application on the same port in different containers and you haven't created a new network namespace, after the first application has bound to, say, port 80, no other application will be able to. The same goes for sockets, shared memory and things like that. Now, what we had missing in Perl. First, a list of supported namespaces. I'm adding this in the next release of Linux::Unshare (I'm not sure if I should put it there or create a new module), listing all the namespaces you currently support: your kernel can be configured to have, for example, only the user namespace or the network namespace, but not all five, I think maybe six, of the namespaces that exist. So if you want to see what you support, you have to cat this file; it's obviously easy to open the file in Perl. Also, if you want to see what control groups you support, there are cpuset, cpu, memory, blkio, net_cls and a lot more, and to see what the current kernel supports you have to open this file. In the Linux::Cgroup module, you simply call cgroup_list and it tells you what control groups you currently have. Also, when you're creating a new cgroup, you actually have to create a directory inside the mount point of the cgroup hierarchy. There are different ways of mounting the cgroups.
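The controller list can also be pulled straight from /proc/cgroups without any module. This is a sketch of what a cgroup_list-style helper could do on a kernel with the v1 /proc/cgroups interface; the parsing is mine, not the module's actual code:

```perl
#!/usr/bin/perl
# List the cgroup controllers known to the running kernel by parsing
# /proc/cgroups (columns: subsys_name, hierarchy, num_cgroups, enabled).
use strict;
use warnings;

sub controllers {
    open my $fh, '<', '/proc/cgroups' or die "no /proc/cgroups: $!";
    my @ctl;
    while (my $line = <$fh>) {
        next if $line =~ /^#/;                     # skip the header line
        my ($name, undef, undef, $enabled) = split ' ', $line;
        push @ctl, $name if $enabled;
    }
    close $fh;
    return @ctl;
}

print "$_\n" for controllers();
```

Typical output includes cpuset, cpu, memory and blkio, depending on the kernel configuration.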
One is to mount all of the cgroup controllers in a single directory and create sub-directories for the different cgroups. Or you can mount only, for example, cpuset in one directory, memory in another directory, blkio in a third directory. Then, when you create a directory under, for example, cpuset, you have only the cpuset controller set up for it, and you have to go to the memory directory for memory and the other directory for blkio. Unfortunately, systemd does this split mounting, so I had to support it. It's easier when all of your files are in a single directory like this. So, creating a new control group is nice, you're only creating a directory, but if you haven't set cgroup.clone_children to one, you have to initialize the control group yourself, because you cannot put a process inside it: it doesn't contain any CPUs or memory, so a process put there could not be scheduled anywhere, and the kernel doesn't allow you to put one there. So if clone_children is not set to one, you would normally cat this file from the parent cgroup and write its contents into the new cgroup's cpuset.cpus. The same goes for cpuset.mems; these are the actual memory nodes you have on your machine. And once you have copied all of them, or some of them, into the new control group, you can actually push a process there. Putting a process there is quite easy: you simply echo its PID into the tasks file. Now I'm going to show you in a bit how you can do this in Perl. What we usually do now is: outside processes use system() to create the container, and inside the container we start our process. The problem is that even if your application is the only thing that lives in this container, you would use lxc-create, LXD or Docker, which is a daemon that creates your containers, plus a command that connects to this daemon and tells it: okay, I'm starting a new container, and you have to monitor it.
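The mkdir-copy-echo dance just described, sketched in Perl. Assumptions: cgroup v1 layout with the cpuset controller mounted at /sys/fs/cgroup/cpuset, root privileges, and helper names (slurp, spew, create_cpuset) that are mine; when those conditions don't hold, the script just skips the work.

```perl
#!/usr/bin/perl
# Sketch: create a cpuset cgroup "demo1", copy cpus/mems from the root
# cgroup (what clone_children=1 would have done for us), then move a
# PID into it. cgroup v1 layout; needs root.
use strict;
use warnings;
use Errno qw(EEXIST);

my $root = '/sys/fs/cgroup/cpuset';

sub slurp { open my $fh, '<', $_[0] or die "$_[0]: $!"; local $/; <$fh> }
sub spew  { open my $fh, '>', $_[0] or die "$_[0]: $!"; print {$fh} $_[1] }

sub create_cpuset {
    my ($name, $pid) = @_;
    my $dir = "$root/$name";
    mkdir $dir or $! == EEXIST or die "mkdir $dir: $!";
    # An empty cpuset has no CPUs or memory nodes, so the kernel would
    # refuse to schedule anything there; copy both from the parent.
    spew("$dir/cpuset.cpus", slurp("$root/cpuset.cpus"));
    spew("$dir/cpuset.mems", slurp("$root/cpuset.mems"));
    spew("$dir/tasks", "$pid\n");    # the "echo PID > tasks" step
}

if ($> == 0 && -d $root) {
    create_cpuset('demo1', $$);
    print "moved $$ into demo1\n";
} else {
    print "skipping: needs root and a v1 cpuset mount\n";
}
```

On a cgroup-v2-only system the paths and file names differ (cgroup.procs instead of tasks, for one), so treat this strictly as the v1 picture the talk describes.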
Monitoring its children is something your application would want to do itself, but you can't, because you're essentially connecting to another virtual machine now, and this breaks the flow of your code. So let's look at something very simple: an application that forks, with a bidirectional pipe between the two processes. This is directly from the example in the Programming Perl book, so it should be fine. Then, in the child here, the first line after the else is from the book, and then we initialize a new control group, which actually does the mkdir inside /sys/fs/cgroup and creates the demo1 directory. It automatically checks whether you have clone_children enabled, and if you don't, it automatically copies all CPUs and memory nodes from the topmost cgroup and initializes your demo1 cgroup. Then you simply print your process ID into the tasks file. I'm using a move_pid function, because this way you can obviously move a process from one cgroup to another, and you don't need a separate function just to write your own PID into a file. Then I want to unshare all of my namespaces, so I have created a very simple constant, CLONE_CONTAINER, which is just the binary OR of all the flags: CLONE_NEWNS, CLONE_NEWUTS, CLONE_NEWIPC and so on. So, to keep it short: we have now created a new container with a new control group.
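A runnable skeleton of that parent/child setup: the bidirectional pipe is a socketpair, in the spirit of the Camel book example, and the container steps are left as comments in the child because they need root and the speaker's modules (CLONE_CONTAINER is his own constant, not a standard flag).

```perl
#!/usr/bin/perl
# Parent/child with a bidirectional pipe; the child is where the
# container setup (cgroup + unshare) would happen, marked below.
use strict;
use warnings;
use Socket;
use IO::Handle;

socketpair(my $parent_end, my $child_end, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair: $!";
$parent_end->autoflush(1);
$child_end->autoflush(1);

my $msg = '';
my $pid = fork() // die "fork: $!";
if ($pid) {                                # parent keeps $parent_end
    close $child_end;
    chomp($msg = readline($parent_end) // '');
    print "parent got: $msg\n";
    waitpid($pid, 0);
} else {                                   # child keeps $child_end
    close $parent_end;
    # As root, the talk's child would do here (speaker's module API):
    #   init_cgroup('demo1'); move_pid('demo1', $$);
    #   unshare(CLONE_CONTAINER);  # OR of CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|...
    print {$child_end} "hello from pid $$\n";
    exit 0;
}
```

The important property is the one the talk points out: even after the child has unshared its namespaces, the socketpair created beforehand keeps working, so parent and child can still exchange data.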
Okay, we didn't chroot and we didn't setuid here, but this is a completely different container now, and the two processes, the parent and the child, can still talk via the pipe, which is quite nice. Because now you can build your application, put the code you want sandboxed here, prepare everything in the parent, and keep all the data structures of your application without actually using Storable, JSON, BSON and so on to transfer your data over the network, because this container running your insecure code is isolated from the rest of your box. And that's it. If you have questions... really, no questions?

Audience: I wanted to do a similar thing, and one of the things I found is this: the problem is, if you have a server starting as something which should not be root, and you then need to become root in order to do all these nice things again, you need a setuid script in the middle. Setuid scripts don't work on Linux, so you need a setuid wrapper. The first thing you should do on the Perl side, then, is to reduce yourself to whatever safe ground, and bail out if you don't manage. One of the implications I found (it's a nice model, it worked) is that you break the signal chain, because from the parent you want to signal the grandchild, and you can't.

Yeah, you can't.

Audience: Have you found a solution for that?

No. You can use the bidirectional pipe, and that's it: you can talk to your application, but you cannot send signals. One thing I have done is that when the child is created, it sends its PID back to the parent, and from that PID the parent can see which process is the child inside the container. That's it.

Audience: The only solution I found, if the process was hung, I mean completely screwed up, was to send in a second process afterwards.

Yeah, but how can it hang at the very beginning? It shouldn't be able to; then it wouldn't start at all.
Audience: But if you have a long-running task...

Yeah, if you have a long-running task. But if you already know the process ID from the parent, which is still root, you can actually do whatever you want with the child. You have to leave the parent as root, that's it. Or you can simply leave a few capabilities in the parent. In the example I showed, we don't drop capabilities at all, even in the child, but we should; and in the parent, when we are dropping capabilities, we can drop only the capabilities we don't need. Although if you need signalling, it's bad, because you would need CAP_SYS_ADMIN, which is like half of the kernel. We still have a few minutes, if someone wants to ask a question. Oh, I forgot to show you setns. setns is similar to unshare; the difference is that you actually enter an existing container. So you have to have a container that is already running somewhere, with your application running inside it. This is how you can actually do signalling inside a container: you change the namespace, the PID namespace, to the one of your application, and then you can safely, with a normal user ID like 99, send the signal to your application. So again, this is another way to approach the problem, and it's a lot easier with setns, because you don't need to remember the process ID in the parent and you're actually in the same namespace, so it's not a problem. You have similar issues with shared memory: for example, if you want to change certain things in the shared memory of another namespace, you have to enter it. So you would usually run a Docker command or lxc-attach and then run a Perl script that you have copied inside the container. With setns, you can simply say: okay, setns into the network namespace of this process, and that's it. Anything else? Okay, thank you. Thank you.
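The setns approach, sketched. Assumptions up front: this needs root and the Linux::Setns module, and I'm hedging the exact function and constant names (setns taking a /proc path and a CLONE_NEWNET-style flag), so check the module's documentation before relying on them; enter_net_ns is my name for the wrapper.

```perl
#!/usr/bin/perl
# Sketch: enter the network namespace of another process, then act
# inside it (e.g. connect to a socket that only exists there).
# Needs root and Linux::Setns; the exact setns()/constant names may
# differ between module versions -- check its documentation.
use strict;
use warnings;

sub enter_net_ns {
    my ($pid) = @_;
    require Linux::Setns;    # loaded lazily so this sketch compiles anywhere
    my $path = "/proc/$pid/ns/net";
    -e $path or die "no such namespace file: $path";
    Linux::Setns::setns($path, Linux::Setns::CLONE_NEWNET())
        or die "setns failed: $!";
}

if ($> == 0 && @ARGV) {
    enter_net_ns($ARGV[0]);
    print "now inside the network namespace of pid $ARGV[0]\n";
} else {
    print "usage (as root): $0 <pid>\n";
}
```

The same pattern applies to the PID or IPC namespace files under /proc/PID/ns, which is how the signalling-from-outside trick in the answer above works.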