So next up we have a pretty cool talk, "If You Give a Container a Capability: A Tale of Container Exploitation," by Vikas Kumar and Rob Glew. Please give them a warm welcome.

All right. Hello, everyone. I'm Rob Glew, and this is Vikas Kumar. This is our talk, "If You Give a Container a Capability." We're going to go over what makes a container a container, what security boundaries are put in place to keep things separated, and where that tends to break down in actual container deployments. Just a quick blurb about us: we are both security consultants at NCC Group, and we do hacker things. But let's get right into it.

This is a high-level overview of what we'll be going over in this talk. We're going to start with a process that is running completely uncontained, just sudo process one, sudo process two, and then add one layer at a time and explain what protections get put in place at each step.
At each layer, we're going to give a high-level summary of how these protections are actually implemented in real container solutions and where that deviates from the ideal. Then we're going to outline how you check for these things if you're a consultant, or someone who just gets dropped in a container, and you want to figure out how you might be able to get out or mess with the host and other contained processes. And finally, we are going to release a tool to help check for all the things that we go over.

So, a quick intro to what a container is, in case you're not already aware. Containers are lightweight alternatives to virtual machines. The idea is that you have a container running on a kernel shared with the host and the other containers on the machine, so that you can run different flavors of Linux on the same host. This is done instead of using a full virtual machine, because for a full VM you need to run an entire second kernel on top of your host kernel, which obviously has a good chunk of overhead. Now, sometimes containers are used as a security boundary. For example, if you're a cloud provider and you want to have your customers run arbitrary code, you might say, "Let's just run it in a container," because it'll be a lot faster. And finally, containers allow your processes to run as root, or what appears to them as root, but still be heavily restricted in what they can do on the host.

Another quick summary of the protections that we're going to be going over: we have pivot_root to make these processes think they have their own dedicated file system.
We have capabilities to limit what syscalls a process can make, and seccomp to limit that even further. Then we'll add namespaces, which restrict what resources a process can actually access. And then there are other protections that are not as significant but still don't hurt.

For our theoretical situation here, we are looking at this from the perspective of a cloud provider who wants to run customer code. These customers need to have all the access possible. They need root access. They need to touch all the system file systems. They need to be the only process running on the host (well, from their perspective). They need to perform some syscalls that may or may not require root privileges. And our theoretical cloud provider is going to say, "We're going to use containers to keep these things separate." Obviously some of the customers might be malicious, so we want to make sure that there are enough protections to keep one customer from going in and stealing all the data belonging to other customers.

This is a visual representation of what we're working with. We have two of our tenants, and they both have access to system resources: the kernel, CPU, memory, network, file system, etc. What we're going to start out with is our naive approach, where we just run sudo tenant one, sudo tenant two. Obviously this has tons of problems. The two tenants have access to each other's data. They can kill each other's processes. They can ptrace each other. They can access the full host file system.
They can just generally wreak havoc. So this is the exact opposite of what you want, because everything we worry about is a problem here.

So the first step we're going to take is to make these processes think they have their own file system. We use pivot_root for this, which is very similar to chroot but provides more separation. The intent is to create a custom root directory that we'll put our contained process in. In the example on the right here, we have an example /etc/ssh/ssh_config setup. On our host we create our /newroot, and we put a fake etc/ssh/ssh_config into it. Then we call pivot_root, and the contained process just sees that /etc/ssh/ssh_config at root.

What this protection does is let us convince our contained processes that they have their own dedicated file system. When we create this new file system, we set up all the system files and provide everything the process needs for it to look like a full Linux system.

When it comes to real containers, they do just this: they use pivot_root and create a new mount namespace. They use pivot_root over chroot because chroot doesn't have as many security guarantees. This combo basically disassociates the process from the host file system. And while it sets everything up, the container solution will copy relevant files into this new file system.

So if you are a hacker just dropped into a container, you want to look around for files that were explicitly included by the container solution that should not have been. These are things like access tokens and SSH keys. You also want to take a look to see if there are any devices mounted from the host. If they mount a whole file system that contains some sensitive files, you might be able to modify things on the host in bad ways. And
another thing that you want to check for, if you're checking out a Docker solution, is whether the Docker socket is available anywhere. The Docker socket is a Unix domain socket that's used to manage the Docker daemon on the host. Occasionally someone who's creating a container and running their own software wants to have that software communicate with the Docker daemon, and obviously the most straightforward way is to include this socket within the container. But since this is the domain socket that's meant to be used to manage Docker itself, there are a lot of very dangerous things you can do with it. It's essentially free root on the host, so if a developer provides the Docker socket in your container, you can use it to get a free escape.

So while this is a good start, and our different processes can actually run on the same machine and not stomp on each other's files, it really doesn't stop malicious processes in any way. Our contained process can still make any syscall it wants, so there are numerous ways for it to get direct access back to the host file system. That is obviously bad, because if it gets access to the host file system it can modify files and escape again. So while things run, there's still a lot that our contained process can do. The next protection that we're going to go over limits what syscalls a contained process can make, and can be used to mitigate those effects.

In Linux, the root user's permissions have been split off into finer-grained capabilities. These include the ability to bind services to privileged ports, or to perform mounts, for example. This allows you to run processes that need some level of root access but not full root access. For example, you could run tcpdump as non-root by giving it the capabilities CAP_NET_RAW and CAP_NET_ADMIN, which allow it to sniff packets, etc. So if there's a
process that supposedly needs root, you should find out what specific capabilities it needs and drop everything else. That lets you have a root user or root process that can't actually do everything root can do.

So how does this work in practice? Real container solutions drop many capabilities during the initialization of the container, and they drop them in a way where they cannot be regained through normal means. Specifically, there's this thing called the capability bounding set, which we'll get to in more detail later. Basically, it says these are the maximum capabilities that this process could ever have. Let's say you run a setuid root binary: you won't be able to get more capabilities than what is in your bounding set, even though it's a setuid root binary.

Sometimes in container solutions dangerous capabilities are granted, such as CAP_DAC_READ_SEARCH, CAP_SYS_ADMIN, and the others listed. These are essentially free escapes, in that they let you call dangerous syscalls that let you affect the host and gain code execution or file access. There are other dangerous capabilities, like CAP_NET_RAW and CAP_NET_ADMIN, which don't directly give you a container escape.
They do let you perform network attacks, though. And lastly, there's this notion of privileged containers, such as in Docker, which essentially gives a container all the capabilities. These are trivial to escape, and you shouldn't do this.

So let's walk through an example of how we can use one of these capabilities to escape a container. Let's say we're given the CAP_DAC_READ_SEARCH capability. CAP_DAC_READ_SEARCH is a capability that lets you run the syscall open_by_handle_at, which lets you access files or directories by inode number instead of by path. Because we're in a pivot_root, we can't actually see the host's root directory, but we know its inode number is 2, so we can use the open_by_handle_at syscall to get a file descriptor to the root directory on the host.

The open_by_handle_at syscall basically takes three arguments. The first is a file descriptor to something on the target file system. The reason you need this is that if you're referencing something by inode number, you don't necessarily know which file system it's referring to, because if you have multiple file systems, many things could have inode number 2. So you pass a file descriptor to something you know is on that file system. The second argument is basically a struct that's a wrapper around the inode number. And the third argument is flags, which is how you deal with weird cases, such as trying to get a handle to something that's a symbolic link.

Once you call open_by_handle_at, you have a file descriptor to the root directory of the host, and you can just chroot into it, and now you have full file system access.

So let's say you're dropped in a mystery container. How do you find what capabilities you have?
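
One way to answer that question is to decode the capability masks the kernel exposes in /proc/self/status. The following is a minimal sketch, not code from the talk or from the speakers' tool. It assumes Python is available inside the container; the names decode_caps, read_cap_sets, and DANGEROUS are made up for illustration; and only the first 22 capability names from linux/capability.h are spelled out, with higher bits reported by number. The mask in the comment is the commonly seen Docker default and is only an example input.

```python
import os

# Capability names indexed by bit position (linux/capability.h).
# Only bits 0-21 are named here to keep the sketch short.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN",
]

# The capabilities the talk calls out by name as dangerous.
DANGEROUS = {"CAP_DAC_READ_SEARCH", "CAP_SYS_ADMIN",
             "CAP_NET_RAW", "CAP_NET_ADMIN"}


def decode_caps(hex_mask):
    """Turn a hex mask such as '00000000a80425fb' into capability names."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            if bit < len(CAP_NAMES):
                names.append(CAP_NAMES[bit])
            else:
                names.append("bit_%d" % bit)  # not named in this sketch
        bit += 1
    return names


def read_cap_sets(status_text):
    """Extract the CapEff and CapBnd lines from the text of a status file."""
    sets = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("CapEff", "CapBnd"):
            sets[key] = decode_caps(value.strip())
    return sets


def main():
    path = "/proc/self/status"
    if not os.path.exists(path):
        print("no procfs status file found")
        return
    with open(path) as f:
        sets = read_cap_sets(f.read())
    for name, caps in sets.items():
        flagged = sorted(DANGEROUS.intersection(caps))
        print(name, "flagged:", flagged if flagged else "none")


if __name__ == "__main__":
    main()
```

Keeping the parsing in pure functions means the same code can be pointed at a status file copied out of a container, not just the live one.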
You can check what capabilities you have with the status file, in /proc/1/status or /proc/self/status. This will have a numerical representation of a number of different capability sets, and the two you really care about are CapEff, which is the effective capabilities, and CapBnd, which is the bounding set of capabilities. The effective capabilities are the capabilities your process has right now; it's what you can currently do. The bounding set is the maximum capabilities the process could have through normal privilege escalation means, such as running a setuid binary or a setcap binary.

So we've successfully limited what the root user can do by dropping a number of dangerous capabilities. But let's say one of our tenants needs something that CAP_SYS_ADMIN can do: they need to be able to mount something specific. We don't want to give them all of CAP_SYS_ADMIN, because CAP_SYS_ADMIN is a capability that gates a large number of syscalls. It grants too much power. So we can use seccomp to limit the syscalls even further. This allows you to block syscalls on a per-syscall basis, and you can also block syscalls based on the arguments passed to that syscall. This can both let you grant dangerous capabilities while limiting what they can do, and act as a second line of defense if someone somehow escalates privileges and gains additional capabilities. So how does this work?
At a low level this uses the prctl syscall, which has two relevant modes. One is the strict mode, which creates a seccomp sandbox where the thread can only call read, write, and exit. This is a very good sandbox, but it's not what's normally used. What's normally used is seccomp filter mode, which allows you to create a Berkeley Packet Filter to restrict syscalls and the arguments to those syscalls.

So what's a Berkeley Packet Filter? A Berkeley Packet Filter is basically a VM inside the Linux kernel that runs this weird bytecode. It has a number of uses, and that could basically be its own talk, but in short, one of the things it enables is seccomp-BPF: you can create bytecode that filters syscalls based on the syscall or the arguments of the syscall. All of this is wrapped by libseccomp, which allows you to create seccomp policies in a more human-readable, human-configurable format.

So how does this work in practice? A lot of container solutions will have a default seccomp profile that may or may not actually be useful. For example, Docker's default seccomp policy automatically grants exceptions if a capability is granted to the process. If you gain the CAP_DAC_READ_SEARCH capability, the seccomp profile will no longer block the open_by_handle_at syscall, which kind of defeats the purpose of being defense in depth. The logic behind it is that the Docker developers were like, "Oh, if they're granted the capability, we want it to just work by default. We don't want users or developers to struggle with creating custom seccomp profiles." But sometimes seccomp will block what you're trying to do, so developers will write a bad custom profile or just turn off seccomp.

So how do you check your seccomp profile if you're dropped black-box into a container?
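
A cheap first step, which is an addition here and not something described in the talk, is to read the Seccomp field that Linux reports in /proc/self/status. It only tells you which of the modes mentioned above is active (0 for disabled, 1 for strict, 2 for filter), not what the filter's rules are. This is a minimal sketch; the function name seccomp_mode is made up for illustration, and the field is assumed to be present, which it is not on very old kernels.

```python
import os

# Values of the "Seccomp:" field in /proc/<pid>/status.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text):
    """Return the seccomp mode named in a status file, or None if absent."""
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        # Exact match, so a "Seccomp_filters:" line is not picked up.
        if key == "Seccomp":
            return SECCOMP_MODES.get(int(value.strip()), "unknown")
    return None


if __name__ == "__main__":
    path = "/proc/self/status"
    if os.path.exists(path):
        with open(path) as f:
            print("seccomp mode:", seccomp_mode(f.read()))
```

A result of "filter" says a profile is loaded but nothing about how strict it is.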
This is difficult. One thing you can do is enumerate all your syscalls, just call them, and see what gets blocked. This will tell you whether a syscall is explicitly blocked or not, but you won't necessarily be able to find rules that block syscalls based on their arguments, because brute-forcing all the potential arguments that are allowed or not is difficult. So ideally, if you're pentesting a container solution, you just ask your client really nicely to give you the seccomp profile.

Another way you can add defense in depth and limit what the root user can do is mandatory access controls. This includes SELinux and AppArmor. These are both Linux security modules, and they restrict resource access by applying security contexts to applications, processes, or programs. They can limit very narrowly what files, even on the container file system, a given process can access. In addition, AppArmor can block certain or all mounts, it can deny network access, and it can even redundantly restrict capabilities.
So even if a capability is provided to a container, an AppArmor profile could be the backup and block the capability.

In practice, a lot of the time SELinux or AppArmor are not enabled. It depends on the kernel version and your container runtime version, and there are a lot of inconsistencies which will cause SELinux or AppArmor to just not be supported. You can check your AppArmor profile from within your container by looking at the attr/current file within the proc file system. You can also check your SELinux profile dynamically with ls -Z, or, outside the container, you could look at the SELinux rule set at the path shown.

So we've greatly limited what the root user can do. We've limited their capabilities and what syscalls they can perform, but there are still a few issues here. One is that they have unrestricted access to consume CPU and memory resources, potentially performing denial of service against other processes on the same host. Also, because proc and other system files are mounted or put into our pivot_root fake file system, it's possible that we could still influence, see, or kill other tenants' processes. In addition, there's not really a restriction on network access, both to the host network interface and to other container services.

So the next protection we're going to put in place will limit what resources our contained process can access. This will keep our contained process from interacting with the host's processes or processes in other containers. This is done through namespaces, a feature exposed by the Linux kernel that allows you to put processes in various groups that will only let them access resources that exist in that group. For example, if your process is put in a process namespace, it can only see other processes that exist within the same namespace. This is what allows our contained process to think it's the only thing running on the system when
it's actually not. There are other namespaces that allow this sort of segmentation. For example, network namespaces will only allow your process to access network interfaces that are in its own namespace, and this is what allows you to keep a contained process from interacting with the host machine's network interface. You just create a custom interface for your containers and only allow processes in that network namespace to access that one interface, and then you can have pretty good control over the traffic that goes through it. There are also mount namespaces, which only allow you to see devices that are mounted in that namespace. This is what's combined with our pivot_root to provide a separation from the host's file system. There are also user namespaces, which are a bit more complex, that we'll go over on the next slide. But it's worth noting that for each of the namespaces explained here, if you are granted the right capabilities, you can just get around everything. Process namespaces, network namespaces, and mount namespaces don't mean anything if you grant the contained process too much privilege.

User namespaces, however, are more complicated than the other namespaces we went over. Not only do they only allow users to interact with other users in their namespace, they also drop a large number of capabilities and heavily restrict what syscalls you can make and how you can make them. The goal is that once you add a process to a user namespace, it can only exercise capabilities on resources that are explicitly in one of its resource namespaces. For example, if you are in a network namespace, you are only allowed to exercise your CAP_NET_RAW capability on your own network namespace. It also drops any capabilities which are used with resources that are not associated with any namespace. So, for example, CAP_SYS_RAWIO, which
lets you directly mess with the memory of the machine, is just blocked altogether, because there's no namespace which manages the resources that that capability grants access to. And then finally, user namespaces do what the name suggests, in that they map users: a process running in the user namespace thinks that it's running as UID 0 and is the only user on the machine, while outside of that user namespace the host machine is aware that it's running as a higher user ID, and it will perform the appropriate security checks on that.

So when it comes to real containers, the best approach would be to just turn on all the namespaces. However, when it comes to how things are actually done, the container solution may or may not turn on all the namespaces. For example, Docker by default doesn't have user namespaces enabled. We spent all that time talking about the protections that user namespaces provide: they still protect you if you grant too many capabilities.
They don't let you touch anything outside of your namespaces. Docker just doesn't have them enabled by default, and they can be kind of a pain to set up if you are a developer, so as a result developers may or may not actually do this. So when you are dropped in a container, you want to poke around and see what namespaces you have.

An alternative to using a user namespace is that a developer might just run the process in the container as an unprivileged user. But if that user gets UID 0 somehow, either through running a setuid binary or something along those lines, they would be running as the actual root user and have all the capabilities that were granted to that contained process. But yeah, if they do escalate to UID 0, they'd still be restricted by capabilities, seccomp, and mandatory access controls.

Now is a good time to look at what you can access through the network if you're a contained process. Docker by default has all of the containers on a shared network, so the containers can all talk to each other, and if they have sufficient capabilities,
they can perform network attacks on each other. For example, Docker grants CAP_NET_RAW by default, so if you're on a default container configuration you can perform some network attacks on the other containers that are running on the host. If there's no namespacing, you can just attack the host directly if you have those capabilities. But you should also consider what your container can actually access on the local network. If there are other internal resources that are accessible from the server running the containers, you can still dial out and hit all of those.

When it comes to checking for all of these namespaces, you can generally do a quick heuristic check and guess whether you're in a namespace or not. For process namespacing, you can just look in the proc directory. If you see a whole bunch of processes that look like they're running on the host, you know process namespacing is not enabled, and you can collect data about those processes, like what command-line arguments were passed to them and that sort of thing. The scientific way to check this is to run lstat on /proc/1/ns/pid; if you get a device number greater than four, it means it's namespaced.

For user namespaces, you can open up a file which lists all the mappings that are used for your user namespace. In the example here, it maps user ID 0 to user ID 0, which means there's no namespacing enabled. If it is enabled, you would see, for example, that user ID 0 is mapped to some high-numbered UID, and that would be the UID of the process as viewed from outside of the user namespace.

For network namespacing, you can similarly do a heuristic where you run ifconfig or look at /sys/class/net and see if any of the interfaces look like they're probably on the host. You can also lstat /proc/1/ns/net, and if the device number is greater than four, you're in a network namespace. For mount namespaces, you can heuristically
look at /proc/self/mounts to see what devices are mounted from the perspective of your process, and then you can also do the lstat trick on that as well.

So now we have all these namespaces. We have our pivot_root. We dropped capabilities. We implemented an SELinux profile or AppArmor profile. We're almost there: we more or less have our contained process, it can't really hit much outside of its container, and it's fairly limited in how it can attack the host.

The next thing we're going to go over is how you limit denial-of-service attacks, because we don't want one process to chew up all of the resources and choke out the other processes. This is done through cgroups. It's exposed as a system file system, so you can just go in and look at how many resources you're allocated. One thing that is counterintuitive is that cgroups also limit what character and block devices you can access. You set up a cgroups device whitelist so that any process in the cgroup would only be able to access, say, /dev/urandom and /dev/null. It's a second line of defense if somehow they get a handle to /dev/sda.

When it comes to checking this out if you're dropped into a container, you can read files that will tell you whether you have your CPU shares limited. You can also read a file that'll tell you the maximum memory you're allowed to use, and if it's some absurd number like 64 gigabytes, you can probably safely assume that you're not limited at all and can just choke out anything else that's running on the machine. When it comes to the device whitelist, you can go through the devices.list file. It's a big list of entries, each with a device type, a major number, and a minor number, and you can basically use a reference table to figure out what device each entry points to. So, for example, /dev/urandom has a
specific type (it's a character device), a specific major number, and a specific minor number, so if you see that in the whitelist file, you know you're allowed to use /dev/urandom. /dev/null has a different major/minor number, etc. So when you're dropped in a container, you want to look at this list file and see first whether everything is allowed, and if it's not, you can go through item by item and see what devices you're explicitly allowed to access.

And so this is pretty much there. We have a very controlled process that thinks it's running as root but can't actually do much. It can't access a whole lot outside of its own namespaces. Our tenants can't directly interact with each other's processes; they're separated. We've limited what network attacks can be performed on the host. But there are still a handful of attacks that you should consider when you set up your container.

One thing to keep in mind is that the core basis of containers is that all the containers on a host share the same kernel as the host. So if you have a kernel exploit, you can freely escape the container, mess with the host, mess with other containers, etc. The attack surface of the kernel is greatly limited by mandatory access controls, by seccomp limiting what syscalls you can call, and by capabilities, again limiting what syscalls you can call. Cgroups also act as a kind of defense-in-depth mechanism. For example, if there's a potentially vulnerable device driver behind some character device with a vulnerable ioctl handler, cgroups might stop you from even talking to that device driver and exploiting it. But it's important to keep in mind that older kernels may not fully or properly implement all the security measures we talk about, so it's important to keep your kernel up to date.

So let's say you find a privilege escalation in the kernel version that's running. You have some way to go from a low-privileged user to
root. How do you turn that into a container breakout? A normal-ish payload in a kernel exploit will be to call prepare_kernel_cred and then commit_creds. What this does is replace the cred struct associated with your user process with one that gives you UID 0 and all the capabilities. It disables Linux security modules, which turns off SELinux and AppArmor. It also disables user namespaces. But it doesn't actually give you full file system access immediately; for that, you need to change the fs struct associated with your process. However, we have the previous CAP_DAC_READ_SEARCH exploit, and because the normal exploit will give us the CAP_DAC_READ_SEARCH capability, we can now just use that exploit: get the handle to the host file system's root directory, chroot into it, and now we have full file system access to the host.

Another thing you should check in a container solution is whether you're on Kubernetes. Kubernetes runs multiple different API servers to manage everything. There's etcd, which has an API and stores the cluster configuration information. There's the kubelet API, which is used to manage hosts and pods. And there's the Kubernetes API, which is what a Kubernetes administrator will talk to in order to actually configure their Kubernetes cluster and environment. Each pod will have a service token which can be used to authenticate to the Kubernetes API with a service account, and sometimes this token gives high privileges in the Kubernetes API. You can use this for essentially a free container escape: you can just tell the Kubernetes API, "Hey, spin up a privileged pod," which will have all the capabilities, with the whole host file system mounted read-write. That's just an easy escape. Another thing is that the kubelet API can have unauthenticated remote code execution. This essentially gives you code execution within a pod, and you can chain this with the previous thing: gain code execution within
another pod, steal its service token, and try to escalate privileges that way. In addition, the etcd API can sometimes be unauthenticated, and you can use that to directly modify Kubernetes state and configuration.

Kind of related to this, you should also check what cloud environment you're in. If you're on AWS, can you hit the AWS metadata service? If the host that you're on has an IAM instance role assigned to it, there'll be IAM credentials present in the AWS metadata service, and you could steal those if network access to the AWS metadata service is not limited. Also, sometimes things like Chef or Puppet will put their scripts, certificates, or secrets in the user data portion of, for example, the AWS metadata service, and a container on a host could just steal that user data and the relevant secrets.

So we wrote a tool called ConMachi, which is basically a Go tool that statically compiles into a binary. You just drop it in a container, and it collects a bunch of information for you. It checks capabilities and flags dangerous ones. It looks for mounted volumes. It looks at your cgroup policy. It looks at all the namespace configuration. It sniffs the network to find the IP addresses of other containers on the host so you could attack those. It's still in development, and we're continually adding features like Kubernetes scanning. We're open to feature suggestions.
It's to be released Very Soon(TM), so keep an eye out on the NCC Group GitHub for it. We're just wrapping a few things up and going through our internal tool release process.

So now for the demo. Here we're launching a Singularity container. Singularity is another container runtime, like Docker. We'll run our tool here, and it's enumerated a bunch of information, like the Linux kernel version, which you can use to try to find exploits. It detected that the runtime is Singularity. It enumerated the different capability sets. It found that there might be some potential hardware attacks on the host: it could be vulnerable to Meltdown and Spectre. It's noted that seccomp is disabled in this container and that there's not an AppArmor profile. In addition, it's noticed that user namespacing is not enabled. It's listed all these different things that are mounted from the host that could be potentially dangerous, that we'd have to investigate. It also notes specifically which capabilities in the bounding set could be dangerous if you get UID 0. And then it notes that the container's not using process namespaces.

So, just to wrap things up from the perspective of a developer: it's worth noting that containers are much better than nothing. If your choice is to run your stuff in a container versus run it straight on the machine, you will want to use a container, because it brings security measures that you wouldn't normally have. Then you want to go in and make sure that your solution enables all the namespaces. If you're using Docker, make sure you go in and explicitly enable user namespaces, because that's the big one that provides the most for you. Don't grant random capabilities to your container. Don't mount random things in the container; copy stuff instead. Don't run a privileged container.
Don't give your Kubernetes pods service access. Also consider what your container can access on your network. If you're running on a cloud provider, at least consider that your Docker container may be able to hit the metadata service, and think about what sort of information exists in there. If you're running in a data center or something, consider what other servers the container might be able to hit. If you're using Docker, drop the CAP_NET_RAW capability, which is enabled by default, and if you're using something else, consider what capabilities it grants to your process and just drop anything that's dangerous. Ideally, drop everything if you don't need any sort of root permissions. Basically, just be aware of all the security controls and make sure they're all flipped on.

But if you're a pentester, this is a short cheat sheet of what you should be checking for. Look at what your UID is: are you UID 0? What are your capabilities? Are any of them dangerous? What namespaces are enabled? Be very aware of whether user namespacing is enabled, because that shuts down a lot of your attack surface even if you are explicitly granted capabilities. Check for anything that's mounted from the host, and see if any of it could be system files or other files that might be read by processes outside of your container. That includes looking for the Docker socket. Look at your cgroup policy.
See if you can just chew up resources if you feel like it. See what devices you can access. Look at your kernel version, and if you have something super out of date, see if there is some easy pwn where you just drop it in, run it, and get root. If you're running on an old kernel, there might be some old exploit for you to get root and just escape from there, although you might have to modify your payload a little bit in order to turn it into a container escape rather than just a privilege escalation. Look at what you can hit on the network: scan things, sniff traffic, that sort of thing. And then again, just poke at your metadata services.

I think that's it. Are there any questions? All right. Thank you very much.
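
Two of the cheat-sheet checks from the talk, whether user namespacing is enabled and which devices the cgroup whitelist allows, can be sketched as small parsers. This is a minimal sketch under stated assumptions, not the speakers' tool. The talk does not name the mapping file; /proc/self/uid_map is assumed here. The devices.list path in the comment is the usual cgroup v1 location and may differ or be absent (cgroup v2 has no such file). All function names and the KNOWN_DEVICES table are made up for illustration.

```python
# Typical locations (assumptions, adjust for the system under test):
#   /proc/self/uid_map
#   /sys/fs/cgroup/devices/devices.list   (cgroup v1 only)

# Tiny reference table: (type, major, minor) -> device node.
KNOWN_DEVICES = {
    ("c", "1", "3"): "/dev/null",
    ("c", "1", "9"): "/dev/urandom",
}


def host_uid_for_root(uid_map_text):
    """Return the outside UID that in-namespace UID 0 maps to, or None.

    Each uid_map line is: <inside start> <outside start> <length>.
    A result of 0 means root inside is root outside.
    """
    for line in uid_map_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        inside, outside, length = (int(x) for x in fields)
        if inside == 0 and length > 0:
            return outside
    return None


def parse_devices_list(text):
    """Parse lines such as 'c 1:9 rwm' into (type, major, minor, access)."""
    rules = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        dev_type, numbers, access = fields
        major, _, minor = numbers.partition(":")
        rules.append((dev_type, major, minor, access))
    return rules


def allows_everything(rules):
    """True if the whitelist contains the catch-all 'a *:*' entry."""
    return any(rule[:3] == ("a", "*", "*") for rule in rules)


def describe(rule):
    """Name a rule's device using the small reference table, if known."""
    return KNOWN_DEVICES.get(rule[:3], "unknown device")
```

For example, host_uid_for_root applied to a line mapping 0 to 100000 returns 100000, which matches the "high-numbered UID" case described in the talk, while a whitelist for which allows_everything is true means the device restriction is effectively off.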