All right. Thank you, everyone, for joining us today and bearing with us through a couple of microphone issues, but we are ready to roll. Welcome to today's CNCF-wide webinar, "Windows came second." I'm Libby Schultz and I'll be moderating today's webinar. I'm going to read our code of conduct and then hand it over to Daniel Prizmant, Senior Security Researcher at Unit 42 at Palo Alto Networks. A few housekeeping items before we get started. During the webinar, you're not able to speak as an attendee, but there's a Q&A box, which we have been using already, at the right side of your screen. Please feel free to drop questions there and Daniel will get to as many as he can at the end. This is an official webinar of the CNCF and as such is subject to the CNCF code of conduct. Please do not add anything to the chat or questions that would be in violation of that code of conduct, and please be respectful of all of your fellow participants and presenters. Please also note that the recording and slides will be posted later today on the CNCF online programs page at community.cncf.io under online programs. They're also available via your registration link, and the recording will be available on our online programs YouTube playlist. With that, I will hand it over to Daniel to kick things off, so take it away. Okay, so hello everyone and welcome. Thank you for joining. You're about to hear a story about how a single missing condition in an if statement enabled one of the easiest container escapes of recent years, an escape that affected almost all of the cloud providers in numerous products. Okay, so let's begin. Today I want to talk about the container escape I found over a year ago and how it could affect cloud solutions. We will start with motivation, why we should care about it, and then explain some of the fundamentals of containers in general and Windows containers in particular, before moving on to the escape itself.
After that, we'll talk a little more about what we could do about it while it wasn't patched, and then about the patch itself and how it fixes this particular problem. Okay, so let's start with an example of why what I'm going to talk about today is important. Siloscape was malware I discovered a few months ago, which specifically targeted Windows containers in Kubernetes. It abused the issue I'm going to talk about today to break out of the container barrier and escape to the host, which is less protected. From there, it tried to use the Kubernetes config file, which is only accessible from the host, to spread to the rest of the cluster, escaping the Windows machine itself; if the cluster had Linux containers as well, it could use that to spread to those too. As part of my research, I discovered it had active victims, each one being a Kubernetes cluster with a possibly huge amount of processing power. Okay, now that we have the proper motivation, let's begin. So what are containers? I know this is a cloud event, so most people here probably know what a container is, but I will go over it quickly just in case. A container is basically an operating-system-based virtual machine, meaning it runs inside the operating system and shares the host's kernel. It uses operating system features to isolate the virtual machine from the rest of the system, unlike regular virtual machines, which are completely separate operating systems. A container can run almost anything, but the operating system of the desired container must match the operating system version of the host. So, for example, you won't be able to run Windows containers on a Linux machine. The most important feature of containers is that they pack all the necessary files to run the application. So, for instance, if you have a special application with special dependencies, you can pack it all in a lightweight container and send it to the end user.
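To make that packaging idea concrete, here is a hypothetical Dockerfile for a Windows container; the base-image tag and application paths are illustrative assumptions, not something from the talk:

```dockerfile
# Hypothetical example: pack an application and its special
# dependencies into a Windows Server Core base image.
FROM mcr.microsoft.com/windows/servercore:ltsc2022

# Copy the application and everything it needs into the image.
COPY app/ C:/app/

# Run the packaged application when the container starts.
ENTRYPOINT ["C:\\app\\myapp.exe"]
```

The resulting image carries the application and its dependencies as one shippable unit, which is exactly the portability property described above.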
So what is the difference between containers and virtual machines? The main difference is that containers rely on the operating system for their isolation, while virtual machines rely on the hardware. Virtual machines virtualize everything, including the kernel, while containers run on the same kernel as the host. Because of that, containers are much more portable and efficient: a container image can be as small as a few kilobytes, while a virtual machine image will usually be at least a few gigabytes in size. All of that comes with a price, of course — containers are much less secure than virtual machines. On the left side, you can see a virtual machine infrastructure with a hypervisor managing all the machines; each machine has a separate operating system. And on the right, there is the infrastructure of a Docker machine hosting a few different applications. Each application is inside a separate container, but they are all running on the same host operating system with the same kernel. Okay, diving deeper into the internals of containers — here we are only talking about Linux containers for now. What needs to be contained? Well, obviously you would want to limit the container's access to resources such as CPU, RAM, network bandwidth, disk bandwidth and so on. This is done using a feature called cgroups (control groups), which lets us limit resource usage for a group of processes. You would want to limit the visibility the containers have, too: if we only limited resources, nothing would stop a malicious container from just changing its own resource limitations. For that, we also want to limit the container's visibility of some of the host objects, such as processes, network interfaces, users, mounts and so on. And this is done using a feature called namespaces. Okay, but this talk is about Windows containers, so let's move forward to Windows.
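As a small illustration of the namespace idea (a sketch that assumes a Linux host, since the feature is Linux-specific): each process can see which namespaces it belongs to under /proc/self/ns, and two processes in different containers will report different namespace IDs.

```python
# Sketch (assumes Linux): list the namespaces the current
# process belongs to. Each symlink target contains an inode
# number identifying the namespace instance; processes in
# different containers see different inode numbers for the
# same namespace type.
import os

def namespace_ids():
    ns_dir = "/proc/self/ns"
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}

if __name__ == "__main__":
    for name, ident in namespace_ids().items():
        print(name, ident)  # e.g. "pid pid:[4026531836]"
```

Running this inside and outside a container and comparing the inode numbers is a quick way to see which namespaces the container runtime actually unshared.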
In order to create a good container solution for Windows, the same requirements I talked about in the last slide need to be implemented here. Luckily, Windows has had a solution for resource limitation for years. It's called job objects, and they do pretty much the same thing as cgroups in Linux. There is nothing too interesting about them, but it's important to know that the feature has existed in Windows for years; it's not new. Job objects are the Windows version of cgroups. But what about visibility isolation? Until recently, Windows didn't actually have a solution for this, and that's why Windows containers didn't exist until only a few years ago. Luckily, a few years ago, Microsoft came up with a feature called server silos, which directly provides the missing pieces that were necessary in order to create a container solution. Server silos provide everything that namespaces provide in Linux: they isolate the object manager, the registry, networking, devices, and basically any named object that a process can access. There are two types of Windows containers: Hyper-V containers and server silo containers. The latter are not officially called server silo containers, but they use the server silo feature as their isolation mechanism, so I refer to them as server silo containers. On the right, you can see what a Hyper-V container architecture looks like — not much different from a regular virtual machine, and Microsoft indeed calls these containers lightweight virtual machines. So they are not actually containers; each one has its own lightweight kernel. Basically, it is the Microsoft version of a VM, and it is not what we are here to talk about today. We are here to talk about the left side: as you can see, the more traditional meaning of containers — one kernel with a few containers for different applications. And this is done using server silos.
And as you can see, all the containers are using the same system processes and the same kernel. Okay. In order to fully understand how server silos isolate containers, one must understand the basics of the root directory object, which is a key feature of the Windows operating system. Without getting too deep into this mechanism, suffice it to say that all application-visible named objects — such as files, registry keys, events, mutexes, RPC ports and things like that — are hosted in a root namespace, which allows applications to create, locate and share these objects among themselves. And the key here is named objects: any object that you can access from your code using a name is a named object. Usually it's files and things like that, but there are others. Okay, so take a look at this screenshot from the WinObj application; it shows the root directory object perfectly. As you can see on the left, there are many directories under the root directory object, and under the GLOBAL?? directory there are tons of symbolic links, including the drive letter symbolic links. So when you're accessing your C: drive on your Windows machine, it is in fact a symbolic link to the actual device. And as you can see, there are many more symbolic links there, but the most relevant to us is indeed the C: link, which points to the actual file system device — in this case, HarddiskVolume3. In this screenshot, you can see the root directory object of the host and not of a container. The host could be a virtual machine, but it has its own kernel. We will soon see what the root directory object of a container looks like. Okay, so in this next screenshot, you can see what a container looks like in the root directory object. Notice on the left there is a Silos directory. Every server silo container will have its own subdirectory under this directory, with the name being the server silo ID.
So in this particular case, I had a single container with a server silo ID of 804. And as you can see, the silo directory is almost identical to the host directory. This is because Windows tries to virtualize a mini operating system as accurately as possible, so most of the symbolic links that the host has, the container has too — with different destinations, of course. We'll discuss the server silo root directory object further later in the talk. Okay, let's move over to an actual example of how the root directory object is used in a simple CreateFile call. The CreateFile API receives a file path and returns a handle to the file. The process can read, write, or perform other actions with that handle, depending on the permissions the process asked for when calling CreateFile. In this example, we are calling CreateFile on a file named secret.txt on the C: drive. Note that the C: part is just a symbolic link, as you can see in the screenshot below. C: is first converted to something the kernel can actually query: the root-directory-object form of the path. This is done in user mode, before the call arrives at the kernel. After that, the kernel queries that \GLOBAL??\C: path in the root directory object that I showed earlier, and receives the destination of the symbolic link. It then queries the destination received from the symbolic link, still under the root directory object. This time, the parsing ends with an actual device and not a symbolic link, so the parsing is over. At this point, having an actual device, the kernel forwards the request to the device driver — in this case, the file system driver. From here, the file system driver takes over execution, and that part is less important for us. But remember the part where the kernel receives the symbolic link target and queries it under the root directory object until it finds the actual device; it will be important for us later.
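The resolution walk just described can be sketched with a toy model of the object-manager namespace. This is a pure-Python simulation, not a real Windows API; the device name follows the screenshot, and the "->" prefix marking link targets is an invented convention:

```python
# Toy model of object-manager symbolic-link resolution.
# Entries live in one dictionary; a value starting with "->"
# is a symbolic link target, anything else is a terminal device.
NAMESPACE = {
    r"\GLOBAL??\C:": r"->\Device\HarddiskVolume3",
    r"\Device\HarddiskVolume3": "file system device",
}

def resolve(path):
    """Follow symbolic links until a real device is reached."""
    while NAMESPACE.get(path, "").startswith("->"):
        path = NAMESPACE[path][2:]  # strip the "->" marker
    return path

# "C:\secret.txt" is first rewritten in user mode to the
# object-manager form under \GLOBAL??; the kernel then parses
# the symbolic link down to the device:
print(resolve(r"\GLOBAL??\C:"))  # \Device\HarddiskVolume3
```

The loop is the key behavior to keep in mind: the kernel keeps querying the root directory object until the parse ends at something that is not a symbolic link.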
Also remember that this query was an example of a query from the host and not from inside a container. It is slightly different for calls that come from inside a container, and we will talk about that scenario in a minute. Okay, so before giving an example of a file access from inside a container, let's discuss how the system decides that a call comes from a container. As you can see on the left, there are plenty of functions that can decide whether a call comes from a server silo or not, and the kernel uses different functions in different scenarios. But in our specific case, if the kernel decides a call comes from a container, it queries the path we saw before — the \GLOBAL??\C: path — relative to the server silo's subdirectory inside the root directory object. And this will happen on every access of a named object such as a file. So as you can see on the right, it will try to query and parse the symbolic link relative to the \Silos\804 subdirectory instead of from the root directory object. Okay, in this slide you can see the branch in the kernel where it decides whether a path will be queried relative to the actual root directory object or to the server silo's subdirectory in the root directory object. This is done using the PsGetPermanentSiloContext kernel function, which is one of the many functions I showed earlier with which the kernel decides whether a call comes from a server silo or not. Okay, so let's go over an example of accessing a file from a container. As before, we are accessing a file named secret.txt under the C: drive, but this time from inside a container. I'm skipping the user-mode part, where the API adds the \GLOBAL?? part before the C: letter, because we covered it already, and jumping straight to the kernel part. The kernel calls PsGetPermanentSiloContext and retrieves a silo context, and not NULL.
So it takes the server silo branch in the kernel code — 804 in our case — and it queries the relevant directory under the silo's subdirectory in the root directory object. It does that until it finds the C: symbolic link, exactly as it did from the host, but this time relative to the \Silos\804 subdirectory in the root directory object. This time, the C: symbolic link points to a virtual hard disk device under the silo's subdirectory in the root directory object. Before, it was \Device\HarddiskVolume3, for those who remember, and this time it's a virtual hard disk with some more numbers after it. This virtual hard disk eventually points to a path in the host file system, but it has its own device in the file system driver to do that. Okay, but take a look at the screenshot, and this is the interesting part: that virtual hard disk device isn't a device at all. It's a symbolic link which points to itself. So we end up in an infinite loop, because we query a symbolic link whose target is the same symbolic link, so we would query it again and get the symbolic link again. And this was the point in my research when I realized there was something I was missing, because the request for the file was successful: I was able to read and write the file in my container. So it was obviously working, but on the other hand, it looked like we should have been stuck in an infinite loop. So what's going on here? Okay, let's go over what the requirements are. Every container, whether in Linux or in Windows, needs to be able to communicate with some of the host devices — it could be a screen to show output, or a network device, or the file system. So given the way Windows works, some of the symbolic links must point to devices in the host root directory object for the container to work at all; otherwise the container won't have access to anything.
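The self-referencing link puzzle from a moment ago can be reproduced in the same toy-model style. A naive resolver really would loop forever, so this sketch adds a hop limit; the VhdHardDisk name and the limit of 64 are illustrative assumptions, not the kernel's actual values:

```python
# Toy model: inside the silo, C: points to a virtual hard disk
# entry that is itself a symbolic link pointing back to itself,
# so a naive resolver would never terminate.
SILO_NAMESPACE = {
    r"\Silos\804\GLOBAL??\C:": r"->\Silos\804\Device\VhdHardDisk1",
    r"\Silos\804\Device\VhdHardDisk1": r"->\Silos\804\Device\VhdHardDisk1",
}

def resolve(path, max_hops=64):
    """Follow links, bailing out if we appear to be in a loop."""
    hops = 0
    while SILO_NAMESPACE.get(path, "").startswith("->"):
        hops += 1
        if hops > max_hops:
            raise RuntimeError("symbolic link loop at " + path)
        path = SILO_NAMESPACE[path][2:]
    return path

try:
    resolve(r"\Silos\804\GLOBAL??\C:")
except RuntimeError as err:
    print(err)  # the loop is detected, resolution fails
```

Yet in the real container the file access succeeds, which is exactly the contradiction the talk resolves next: something other than this naive silo-relative lookup must be happening.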
So for example, the container's virtual file system is eventually just a path in the host file system, and the container must have access to that path. So it must have some access to the host file system device. And we're getting closer to the actual escape, I promise. The way this is done in Windows is that a process with the right permissions can set a symbolic link as global. A global symbolic link will always be parsed relative to the host root directory object, regardless of whether the process that is trying to parse it is inside a container. So when a container is created, Docker sets some of the container's symbolic links as global, and that's why the container can maintain some sort of communication with the host. Non-global symbolic links are parsed relative to the silo's root directory object, as we discussed earlier. Okay. In this screenshot, you can clearly see the condition: EAX holds one of the symbolic link's parameters, the one that says whether the symbolic link is global or not. If it is, execution takes the right branch and retrieves the root directory object from a global variable in the kernel. If not, it takes the left branch and has to get the silo context first — as you can see, the left branch calls PsGetPermanentSiloContext. So when we are trying to parse a symbolic link, eventually there is a condition: if the symbolic link is global, the link will be parsed relative to the root directory object, and if it's not global, it will be parsed relative to the server silo's subdirectory — in our case, the 804 subdirectory. So if a symbolic link is not global, it will be parsed from the silo's subdirectory — 944 in this case — and not from the root. Just a clarification: it's not 804 anymore because it took me some time to create this presentation, and I wasn't able to get the exact same container ID again when taking this screenshot. So, okay, let's move on to the how.
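The branch just described — and why it matters — can be sketched in the same toy-model style. A global link is parsed from the host root directory object, a non-global one from the \Silos\<id> subdirectory; all names and the flag representation here are illustrative assumptions:

```python
# Toy model of the kernel branch: global symbolic links are
# looked up relative to the host root directory object,
# non-global ones relative to the caller's \Silos\<id>
# subdirectory.
NAMESPACE = {
    r"\GLOBAL??\C:": r"\Device\HarddiskVolume3",            # host view
    r"\Silos\804\GLOBAL??\C:": r"\Silos\804\Device\VhdHardDisk1",
}
GLOBAL_LINKS = set()  # object paths flagged as global

def parse(link_path, target, silo_id):
    # The condition from the disassembly: a global link takes the
    # host-root branch, everything else the silo-root branch.
    root = "" if link_path in GLOBAL_LINKS else "\\Silos\\%d" % silo_id
    return NAMESPACE[root + target]

# Normal container access: C: resolves inside the silo.
print(parse(r"\Silos\804\GLOBAL??\C:", r"\GLOBAL??\C:", 804))
# -> \Silos\804\Device\VhdHardDisk1

# If a link created inside the silo is flagged global, its
# target is parsed from the host root instead, reaching the
# host file system device -- the essence of the escape.
GLOBAL_LINKS.add(r"\Silos\804\GLOBAL??\X:")
print(parse(r"\Silos\804\GLOBAL??\X:", r"\GLOBAL??\C:", 804))
# -> \Device\HarddiskVolume3
```

The single `if` on the global flag is the "missing condition" of the title: nothing in it checks who is allowed to set that flag.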
How can we exploit that global symbolic link feature to break out of a container? The function in charge of making symbolic links global is the undocumented NtSetInformationSymbolicLink function. After reverse engineering it in order to understand what parameters the function expects, I discovered that it requires the SeTcbPrivilege privilege in order to make links global — and sadly, the regular container user doesn't have that privilege. Luckily for us, there is another process in the container's scope — meaning it is visible to the container's user — called CExecSvc.exe. This process does in fact have SeTcbPrivilege, among other privileges, and lucky for us again, by default the normal container user is an administrator. Okay, so take a look at this screenshot from IDA. As you can see, there is a check for SeTcbPrivilege before moving forward with making a symbolic link global. Okay, so let's go over the escape plan. First, we impersonate CExecSvc to gain its TCB privilege. There are numerous ways to do that, such as thread impersonation or DLL injection. After that, we create a regular symbolic link, which at this point is not global yet, and it points to our local containerized C: drive — on its own, that doesn't help us. We then call NtSetInformationSymbolicLink on our newly created symbolic link to make it global. At this point, we have full access to the host C: drive through our local X: drive, and from there, the possibilities are endless. Okay, so I'll explain it again a bit slower. We create a symbolic link; it's not global yet. We name it X: and we make it point to C:. So we basically have a redundant symbolic link X: that points to the symbolic link C: inside the container. After that, because we impersonated a process that has the permissions, we call NtSetInformationSymbolicLink and give it X: as the parameter.
And it makes it global. So now we have an X: symbolic link inside the container that points to the C: drive outside the container. We broke the container's barrier. In this graph, you can see how Siloscape operated — the malware from the beginning of the talk. After finding a vulnerable cluster using services like Shodan, it used known 1-days to get its payload into the Windows container. It then used the container escape I described to gain access to the host, and after that, it used the Kubernetes config file to take control of the rest of the cluster. It specifically targeted Kubernetes clusters using Windows containers. After breaking out of the container and gaining access to the host, it issued a Kubernetes command to check its permissions in the cluster, and if it didn't have enough permissions to create other deployments, it just quit and didn't even use that cluster. So, as I see it, it was simply trying to gain free processing power, and if it can't deploy new containers, the cluster doesn't help it much. So how are cloud providers affected? Well, the trivial case is any Kubernetes cluster with a Windows node: an attacker who has gained access to a container can just break out to the host and possibly spread from there, as Siloscape did. Depending on how the cluster is configured, that attacker will at least be able to control every container that the compromised node is hosting. So even if the cluster is configured properly, someone who broke out of a container to the host will be able to control the containers that this particular host is hosting. And as Siloscape showed, an attacker could possibly spread to the rest of the cluster as well if the cluster is not well configured. Another possibility is a whole container-as-a-service offering. Imagine a cloud provider offering Windows containers as a service, not as part of an entire cluster.
An attacker could host a malicious container, break out of his own container, and gain access to other customers' private containers. And I'm not saying I found that something like that happened or is happening; I just think it would be possible. Okay, let's go over the timeline. As you can see in the timeline, Windows containers were vulnerable to this issue for quite some time — almost five years since release. But the more important issue here is that Windows containers remained vulnerable to this escape for over a year and a half after it was made public, and during this time, anything that used Windows containers was vulnerable as a byproduct. As we saw, there were players in the community, like Siloscape, that used this publicly known vulnerability to gain free processing power. Okay, so let's talk about what we, as either cloud providers or users, could do about it while it was unpatched. So we have a vendor, in this case Microsoft, which takes some time to fix the container escape. What could we do about it? Well, there are a few things. First of all, in order for an attacker to even be in a position to use this issue, he must gain access to your container. Most of the time this happens through an outdated application in your container or a misconfigured container manager. So keep your applications up to date, even if they are inside a container. This is relevant to both Linux and Windows, but it isn't something cloud providers can do for you, because cloud providers let the user control the cluster, so the user is accountable for updating his own applications. Next — and this one is relevant to both users and cloud providers, but only for Windows — I advise you to run your containers as ContainerUser instead of administrator. It would be much harder to pull off something like this without administrator permissions. Kubernetes supports that as well.
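For reference, the Kubernetes-side setting being referred to is the `runAsUserName` field of the Windows security context; a rough pod spec fragment might look like this (the pod and image names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sample-windows-pod          # placeholder name
spec:
  securityContext:
    windowsOptions:
      # Run as the unprivileged ContainerUser instead of the
      # default ContainerAdministrator.
      runAsUserName: "ContainerUser"
  containers:
  - name: app
    image: example.registry/windows-app:latest   # placeholder image
  nodeSelector:
    kubernetes.io/os: windows
```

The same `windowsOptions` block can also be set per container if only some containers in the pod should be downgraded.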
So there is a special variable you can set in your YAML to run your applications as ContainerUser instead of administrator. This was a possible mitigation for cloud providers while waiting for a fix from Microsoft: simply changing the default user inside the container. So if I'm a cloud provider right now and I'm letting my users create Windows containers inside the cluster, but I don't want them to be vulnerable, I could change the default user of the Windows container to ContainerUser until Microsoft fixes the issue. Third, update your Windows hosts. This is mostly relevant for cloud providers, or for users who are running their own cloud environment on their own machines. So keep your Windows hosts updated. And last but not least — again, this is relevant for both users and cloud providers — configure your Kubernetes properly. For example, in the case of Siloscape, it managed to break out of the container to the host and spread across the entire cluster, and as I see it, there is no reason for a specific host to be able to create deployments on other hosts. That's what happened with Siloscape. Now let's talk about the Microsoft patch. A few weeks ago, Microsoft actually patched the NtSetInformationSymbolicLink function, and the patch is easy to understand and straightforward. Any call to NtSetInformationSymbolicLink from a thread inside a container will be blocked with the STATUS_PRIVILEGE_NOT_HELD error code. This is done using the PsIsCurrentThreadInServerSilo function, which is one of the many functions that let the kernel decide whether a process or thread comes from a server silo. As its name suggests, it checks whether the current thread is associated with a process inside a server silo. On the right, you can see how the function looks after the patch, and on the left, how it looked before the patch. So they simply added a condition to check.
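The patched behavior can be sketched as a toy model as well. This is a simulation, not kernel code; the NTSTATUS value 0xC0000061 is the real STATUS_PRIVILEGE_NOT_HELD code, but everything else here is illustrative:

```python
# Toy model of the patch: NtSetInformationSymbolicLink now
# refuses any caller whose thread runs inside a server silo.
STATUS_SUCCESS = 0x00000000
STATUS_PRIVILEGE_NOT_HELD = 0xC0000061

GLOBAL_LINKS = set()  # links that have been flagged global

def ps_is_current_thread_in_server_silo(thread):
    # Stand-in for the kernel check of the same name.
    return thread.get("silo_id") is not None

def nt_set_information_symbolic_link(thread, link_name):
    # The added condition: block the call outright for silo threads.
    if ps_is_current_thread_in_server_silo(thread):
        return STATUS_PRIVILEGE_NOT_HELD
    GLOBAL_LINKS.add(link_name)
    return STATUS_SUCCESS

host_thread = {"silo_id": None}
container_thread = {"silo_id": 804}
print(hex(nt_set_information_symbolic_link(host_thread, "X:")))       # 0x0
print(hex(nt_set_information_symbolic_link(container_thread, "Y:")))  # 0xc0000061
```

The fix never even reaches the privilege check for silo callers, which is why impersonating CExecSvc no longer helps after the patch.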
It makes sense to check whether the process calling NtSetInformationSymbolicLink comes from a container or not, and if it does, the call is simply blocked. Okay, so here are some of my articles about this subject. I covered everything I talked about here in detail, if anyone is interested. And that's it. Any questions? Nice job — that was very thorough. Are there any questions for Daniel before we wrap up? All right, well, Daniel, thank you so much for a wonderful presentation. If that's it, then thank you everyone for joining us. The webinar recording and slides will be online later today. We'll see you at another webinar soon.