Here's Aviv, let's go on. Thank you.
So thank you, everyone, for coming to this session. I'm really excited to be here at DEF CON 30, this time in person. In this session, we're going to talk about my journey from an isolated container to cluster admin in Service Fabric. Basically, we're going to talk about a vulnerability that I found in Service Fabric, which is similar to Kubernetes in the sense that it is an orchestrator that runs workloads on distributed systems. The vulnerability allows us to break out of a container in Service Fabric, specifically a Docker container, and later to compromise the whole Service Fabric cluster.
So first and foremost, let's talk about me. My name is Aviv Sasson. I'm a proud dog owner, and over here on the right you can see Lizzie. She was the assistant in the research. I'm a research team lead in Prisma Cloud at Palo Alto Networks, and basically my job is to look for zero-day vulnerabilities in the cloud. It could be in CNCF projects, in Kubernetes, Docker, runc, any part of the infrastructure, or in the cloud vendors themselves. Our job is to keep everyone safe by finding vulnerabilities and disclosing them to the vendors.
For the latest research, I chose to work on Service Fabric. The first reason is that Service Fabric hosts more than 1 million applications and runs on millions of cores. Now, this is according to Microsoft. So finding a vulnerability in such software could have a really big impact. The second reason is that there was almost no prior research on Service Fabric. I found only one CVE for Service Fabric, unlike Kubernetes, which had more than 40 CVEs. This is interesting, because if I audit Service Fabric, then perhaps I could find something that has some impact. The third reason is that Microsoft released the source code of Service Fabric version 6.4 a few years ago.
And this is only one version. But by looking at the code of that older version, I could understand a lot about how Service Fabric works.
So let's give a quick overview of Service Fabric and what it is. As I said earlier, it is very similar to Kubernetes. It is a platform for deploying and managing applications on distributed systems. Basically, you can take a bunch of machines, form them into a Service Fabric cluster, and then deploy your applications in that cluster and manage them super easily. Now, Service Fabric is developed by Microsoft and is therefore widely used by Microsoft. Microsoft discloses a partial list of offerings and products that are powered by Service Fabric. For example, we have Azure Service Fabric, Azure SQL, Cosmos DB, and many, many more, and some products like Cortana, Bing, or Skype. So Service Fabric is everywhere. For example, if you as a customer deploy an Azure SQL database, there's a good chance that you're running on a managed Service Fabric cluster on Azure.
Just to demonstrate how common it is, this screenshot was taken from a presentation by Microsoft in 2019. As you can see, just one example is that Azure SQL Database had over 1.6 million databases in 2019, all powered by Service Fabric. So I guess that in 2022 it could be even 3 million databases, and that is a lot.
Now, in order to understand the vulnerability, we need to talk a little bit about how Service Fabric works and what the architecture of a node is. First of all, Service Fabric can run on both Linux and Windows, so you could have a Windows Service Fabric cluster or a Linux Service Fabric cluster. In this talk, we're going to talk about Linux Service Fabric clusters. On every node in the cluster, you have some Fabric components, which are actually the core of Service Fabric. So we have the Fabric gateway, the file store service, and on and on.
Those processes communicate with each other and with other nodes to form the cluster. Now, when you are deploying an application in Service Fabric (just a second, I'll put it on mute). All right, so when you're deploying an application on Service Fabric, you can deploy it either as code or as a container. Basically, there is a Docker daemon on each node, and the applications will be deployed as containers. So if you're deploying NGINX into your cluster, it will be an NGINX container.
So I started to look at the components on each node, and one component was really interesting: the Data Collection Agent (DCA). This agent does a bunch of things, and among them it collects logs from the host machine. So it runs as root on the host and collects the host logs. In addition, it also collects logs from containers on the same machine. This is really, really interesting, because this process kind of breaks the container isolation. On the one hand, it is not bound to the container: it runs as root on the host, on every node in the cluster. On the other hand, it handles files within containers. So I thought that perhaps I could interact with this process by modifying the logs inside the container, and that this could somehow translate into an interaction with DCA that leads to exploitation. Perhaps by doing so we could manipulate DCA, exploit it, and break out of the container.
So I started to look at the code, and I found a very interesting function called GetIndex. Now, this function does three things. Just for the sake of the explanation, we're going to talk about paths, and we're going to refer to the relevant path as X. It could be any path, depending on the implementation. So let's say that GetIndex is called for path X. First, LoadFromFile will open X, read it, and store it in memory.
Afterwards, GetIndex will verify some things inside the content of X. It will do some minor modifications and add some data to the content in memory, and after that, SaveToFile will overwrite X with the modified data. This scenario can result in a classic symlink race: I could take X and modify it to hold my malicious content, and then, after the verification, I could replace X with a symlink that points to any place that I want on the file system. When SaveToFile is called, it will try to save the data, the malicious payload, into X, which now points to another location. By that I get a write primitive and can write to any place that I want on the file system. And if I can do it from within the container, then I can write files into the host file system from within the container. So this write primitive could later be translated into a container escape.
So let's talk about the requirements for exploiting this symlink race and escaping the container. First, we need to find a way to trigger GetIndex from within the container, because this function contains the vulnerable functionality. Second, we need GetIndex to be called on a path that we have permission to modify, because if we cannot modify it and replace it with a symlink, then it's not good for anything. The third requirement is that we need to beat the race condition for the symlink race. Between LoadFromFile and SaveToFile we have only something like three or four milliseconds, and it's really hard to overwrite the file during that really short period. So let's see how we can fulfill those requirements.
But first, I just want to mention that until now we talked about only the theoretical part of the vulnerability. In order to test for real exploitation, I needed to decide on a target and try to exploit it.
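The load, verify, save sequence described above is a time-of-check to time-of-use pattern. A minimal sketch of it in Python follows, assuming a POSIX system and confined to a throwaway temporary directory. The names `save_unsafe`, `save_no_follow`, and `demo` are hypothetical and are not Service Fabric code, and the `O_NOFOLLOW` variant is only one generic mitigation, not necessarily the fix Microsoft shipped.

```python
import os
import tempfile


def save_unsafe(path: str, data: str) -> None:
    """Write data the naive way: open() follows a symlink at `path`."""
    with open(path, "w") as handle:
        handle.write(data)


def save_no_follow(path: str, data: str) -> None:
    """Write data, but refuse to open `path` if it is a symlink."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_NOFOLLOW
    fd = os.open(path, flags, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(data)


def demo() -> dict:
    """Replay load -> swap -> save inside a throwaway directory."""
    with tempfile.TemporaryDirectory() as workdir:
        x = os.path.join(workdir, "x.dat")
        other = os.path.join(workdir, "other.txt")
        with open(x, "w") as handle:
            handle.write("original")
        with open(other, "w") as handle:
            handle.write("untouched")

        # Step 1: the reader loads X into memory.
        with open(x) as handle:
            content = handle.read()

        # Step 2: before the save, X is replaced by a symlink.
        os.remove(x)
        os.symlink(other, x)

        # Step 3a: the naive save follows the link and rewrites `other`.
        save_unsafe(x, content + " (modified)")
        with open(other) as handle:
            after_unsafe = handle.read()

        # Step 3b: restore `other`, then try the no-follow save.
        with open(other, "w") as handle:
            handle.write("untouched")
        try:
            save_no_follow(x, content)
            refused = False
        except OSError:
            refused = True
        with open(other) as handle:
            after_safe = handle.read()

    return {
        "after_unsafe": after_unsafe,
        "refused": refused,
        "after_safe": after_safe,
    }
```

Calling `demo()` shows the difference: the naive save ends up rewriting the file the link points to, while the no-follow save raises an error and leaves that file alone. Note that `O_NOFOLLOW` only guards the final path component, so a real fix would also need to consider symlinked parent directories.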
There are many products and services that are powered by Service Fabric, and I chose to work on Azure Service Fabric, which is an offering by Azure that allows you to deploy your own Service Fabric cluster and manage it. So Microsoft does not manage it; you are the manager of the cluster. By managing the cluster, you have all of the permissions to the containers, to the nodes, to everything, so it is really easy to debug such an environment for the exploitation. So I chose this one. For the operating system, I chose Ubuntu, because I like Linux. And apparently this vulnerability cannot be exploited on Windows, which we will discuss later, so Ubuntu was a good choice.
First, let's talk about how we can trigger GetIndex, the vulnerable function. After diving into the code, I found out that DCA, the Data Collection Agent, monitors the creation of marker files inside each container. Other components on the machine can create those files to signal to the agent that it should execute some functionality. This is just a way of communicating between processes in Service Fabric. One of those files is ProcessContainerLog.txt. So I tried to create this one, and I created many files and tried to see what happened. Apparently, when I created it, DCA executed GetIndex several times. So we got the first requirement: we were able to trigger GetIndex from within the container.
Now we needed to make sure that there are files inside the container that we can modify and that GetIndex uses. So I used inotifywait, which is a CLI tool for interacting with the inotify kernel feature. This feature is really cool. It allows you to monitor access to files, and once a file that you configured is accessed, you get a notification about it immediately. So I set up the monitoring and created the marker file, and apparently many paths inside the container were accessed.
So GetIndex uses them and I could exploit them. For the sake of this exploitation, I chose work sub folder map .dat.
The third part was to beat the race condition. First, we needed to know when the file was being accessed, and that was easy because we had inotify for that. But there was a really, really short window of opportunity between LoadFromFile and SaveToFile, and when I tried to exploit it, I was not able to do so. But I figured there might be some other things we could do. Actually, Lizzie thought of this idea. You can see her on the bike here. I took the content of work sub folder map .dat, which is the X that we talked about, and I made it really, really, really long. That created a situation where LoadFromFile takes much more time for the reading and parsing, and during that time I was able to replace the file with a symlink. When I did it, it actually worked every time, and I beat the race condition consistently.
So now we have the exploitation, and it's great, it works. I was able to overwrite any path on the host machine. Now, there are many techniques for gaining remote code execution from a write primitive on the host. For example, I could add malicious SSH keys to the file system, or I could modify the file system to create a malicious user and connect over SSH as that user. Or, if I just want to go ballistic, I can overwrite a benign file with a backdoor. Just for example, I could overwrite /bin/ls. It is presumable that at some point /bin/ls would be executed on the host, the exploit would work, and my code would be executed. But none of those techniques is applicable to this exploit, because there are two very big limitations. The first one is that when I overwrite a file on the host, the file does not have any execute permissions.
So I cannot count on someone else to execute it, because it cannot be executed. The second limitation, which was harder, was that the file has to be in a specific format. I said before that between LoadFromFile and SaveToFile there is a verification of the content of that file, and this content is what gets written to the host. According to the verification, the content has to be in this specific format. It has a version and ints, and the only things that I could control are the malicious strings and malicious ints. Now, I could add as many rows as I want, but still, this format is really weird. Even more, in order to beat the race condition, I first had to add a lot of data that looks like this to that file. So basically, I had a weak write primitive to the host, and the content had to look like this.
Now, Lizzie and I really banged our heads against the wall on this one, trying to figure out how to exploit it to gain remote code execution, because this does not look like a bash script, a PHP script, Python, binary code, or anything else. It's just text in a weird format. But after a while, we found out that perhaps we could modify the file so that it looks like a file that contains environment variables. We tried to feed some processes with this file, and we found out that they do digest it as environment variables. So we potentially have a way to inject environment variables into processes on the host that run as root.
For this exploitation, we chose /etc/environment. This file contains the default environment variables for new logins. But this was not really helpful, because I didn't want to count on someone logging in to the machine in order for me to get remote code execution. Besides that, we're talking about a production environment.
Service Fabric is for production environments, and I cannot count on someone logging in to that machine, because it's kind of rare. So I tried to see what else I could do. By investigating, I found that cron, which is the Linux task scheduler, actually imports /etc/environment and uses it for the default environment variables of every new job that it executes. This is really interesting. So I tried to see which jobs are actually running on Azure Service Fabric, and I found out there is one cron job that runs every minute as root on the host. I could use it with my vulnerability so that every minute my environment variables would be injected into this process.
But that wasn't really enough for me, because there might be other places besides Azure Service Fabric that are vulnerable and do not have this minutely cron job. So I started to look even deeper, and I found out that even if you don't have any cron jobs in your cron scheduler, cron actually runs internal hourly jobs, and one of them runs as root. So it doesn't matter whether you have anything there. The exploit can work on that internal job every hour and inject environment variables into it.
So now that we can inject environment variables into processes on the host, let's talk about how we can escalate and use that to gain remote code execution. For this one, I used LD_PRELOAD. This environment variable is used by the Linux dynamic linker. Basically, when a process is initialized, the linker looks for this environment variable, and if it finds it, then the first thing it does is load the shared object that the variable points to. That means that if I upload a shared object to the container and set LD_PRELOAD to point to it, then I can inject shared objects into processes on the host, which is super awesome.
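The idea that a scheduler reads an environment file and hands those variables to every job it starts can be modeled roughly as follows. This is a simplified sketch and not how cron or PAM actually parse `/etc/environment`; the helper names `parse_env_lines` and `run_job_with_env` and the variable `DEMO_GREETING` are hypothetical. It deliberately uses a harmless variable to show only the inheritance step: lines that do not look like `KEY=VALUE` are skipped, and the ones that do reach the child process.

```python
import os
import subprocess
import sys


def parse_env_lines(text: str) -> dict:
    """Collect KEY=VALUE pairs, skipping blank, comment, and malformed lines."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Keep only keys that look like ordinary variable names.
        if key.isidentifier():
            env[key] = value.strip().strip('"')
    return env


def run_job_with_env(env_text: str, variable: str) -> str:
    """Start a child process with the parsed variables and report one of them."""
    child_env = dict(os.environ)
    child_env.update(parse_env_lines(env_text))
    child_code = (
        "import os, sys; "
        "sys.stdout.write(os.environ.get(sys.argv[1], ''))"
    )
    completed = subprocess.run(
        [sys.executable, "-c", child_code, variable],
        env=child_env,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

For example, `run_job_with_env("not a pair\nDEMO_GREETING=hello\n", "DEMO_GREETING")` returns what the child saw for that variable. The point of the sketch is that whoever controls the file's content controls part of the environment of every job started from it.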
So I compiled the shared object with the constructor attribute. The constructor attribute helped me execute code as soon as the shared object was loaded, because I didn't want to wait for some function in the shared object to be called before my code ran. I wanted my code to be executed once the shared object was loaded, so it was really helpful. I used a reverse shell for this one, uploaded it into the container, did everything, and it worked. I completed the full chain of the container escape and got root privileges on the host starting from within a container. Thank you.
I know I talked a lot, so I want to make a quick recap of all of the stages so that we understand the workflow of the vulnerability at a higher level. First, let's talk about the attack vector: where does the attacker begin? To begin the attack, the attacker needs to be in a container in Service Fabric and compromise it. This can be done in many ways. For example, if the container runs old software, let's say an old NGINX server with a zero-day, a one-day, or a misconfiguration, an attacker could easily exploit it, get into the container, and compromise it. After the attacker gets into the container, we can start to exploit the vulnerability. By exploiting it, we can overwrite /etc/environment on the host. Afterwards, new cron jobs on the host will import this malicious /etc/environment file and use it. In this file you will find LD_PRELOAD, and LD_PRELOAD will point to a malicious shared object that the attacker uploaded to the container beforehand. When this object is loaded by the cron job, it will initiate a reverse shell as root on the host back to the attacker. The attacker can then do anything he wants on the host and finish the container escape. So that was the vulnerability.
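From the defender's side, the chain recapped above leaves a visible trace: a dynamic-loader variable appearing in an environment file. A minimal sketch of a check for that follows. The function name `find_loader_overrides` and the choice of variables to flag are assumptions made for illustration, and the parsing is a simplification that may not match how a given system reads the file.

```python
# Variables honored by the glibc dynamic linker that change which code a
# process loads. Flagging exactly these three is an illustrative choice.
LOADER_VARIABLES = {"LD_PRELOAD", "LD_LIBRARY_PATH", "LD_AUDIT"}


def find_loader_overrides(env_text: str) -> list:
    """Return (line number, variable, value) for each loader variable set."""
    findings = []
    for lineno, raw in enumerate(env_text.splitlines(), start=1):
        line = raw.strip()
        # Tolerate an optional shell-style "export " prefix.
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, separator, value = line.partition("=")
        if separator and key.strip() in LOADER_VARIABLES:
            findings.append((lineno, key.strip(), value.strip().strip('"')))
    return findings
```

A monitoring job could read the content of `/etc/environment` on each node, pass it to `find_loader_overrides`, and alert on any non-empty result, since such entries are unusual in that file on most systems.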
Up until now, we talked about the vulnerability. From now on, I want to talk about exploring the options for escalating further from that position. Just a quick disclaimer: Microsoft does not consider nodes a security boundary in Service Fabric.
So we have this great tool called sfctl. It is a CLI tool for managing Service Fabric clusters. For those of you who manage Kubernetes, it's kind of like kubectl. You can provide a private certificate to sfctl and use it to manage your cluster. What happens under the hood is that sfctl uses this certificate to authenticate to the management endpoint of your Service Fabric cluster, and then the endpoint executes whatever you send it. So you can deploy applications, manage your applications, or do anything that you want on the cluster using that management endpoint and the certificate. So obtaining this certificate could help us compromise the whole cluster, and I tried to see if I could find it anywhere.
After I exploited the vulnerability and got onto the host, I found the directory /var/lib/waagent, which was only accessible to root on the host. Luckily, because of the exploit, I had those privileges and could access it. Looking in the directory, I found so many interesting things, and among them I found the desired certificate that could allow us to compromise the whole cluster. By using this certificate, I was able to do so many things. I could manage the cluster in a few different ways. I could use sfctl to gain full control over the cluster: manage applications, deploy applications, delete applications, whatever I want. I could send raw requests to the management endpoint using the certificate. Or, if I just want to go ballistic, I could open my browser, install the certificate, and go to that endpoint.
If I do so, I get Service Fabric Explorer, which is a web-based GUI for managing the cluster. Over there, I could manage everything super easily and do whatever I want. Over here you can see what this GUI looks like.
So I was able to exploit this in Azure Service Fabric, and a lot of companies and governments from all sectors use Azure Service Fabric. But other than Azure Service Fabric, there are other offerings and products that are powered by Service Fabric, and some of them are even multi-tenant. For example, let's say that one morning I wake up and decide that I want an Azure SQL database for my production environment. I deploy it, and it is actually deployed in a Service Fabric cluster. On the other hand, an attacker can wake up and think to himself, "All right, I want to make a mess today. Let's see what I can do." He could also open an SQL database on Azure, and by exploiting this vulnerability in that cluster, he could escalate to control the whole cluster and compromise all of the other tenants in that cluster.
I was really scared about this option, so I tried to figure out whether I could build a POC for it. I chose three targets for the test, and needless to say, I first needed initial access in order to execute my exploit. For that, I used Azure Functions, which allows you to run code; Azure Container Instances, which also allows you to run code; and Azure PostgreSQL. A few months ago, you were able to deploy an old Azure PostgreSQL server with some one-days and gain initial access on that container. I tried to exploit the vulnerability on those platforms, but I discovered that they were actually secured, and I couldn't exploit it. So I started to investigate why I couldn't exploit it there, and I found out that there are two requirements for the exploitation to work.
The first one is that your container, your Service Fabric application, needs to have runtime access enabled. This option allows you to read some data about the runtime from within the container. It is enabled by default for applications in Service Fabric, so it is commonly in use. But Microsoft did configure their services, the ones that I tested, to have runtime access disabled, so I couldn't exploit the vulnerability. Other than that, the vulnerability only works on Linux clusters. That means it needs to be a Linux host and a Linux container. So, just for example, LCOW, which is Linux Containers on Windows and is common on Azure, is not relevant to this vulnerability. It's only Linux on Linux.
So the real impact that I found here was a full container escape in Linux Service Fabric clusters by default. After that, I could get a full cluster compromise, and I was able to test this on Azure Service Fabric successfully and beat the race condition every single time.
So let's talk about the disclosure process. We reported this issue to Microsoft through the Azure bug bounty program, of course including a full working exploit and a full explanation. MSRC acknowledged the issue and classified it as remote code execution. They also awarded us $30,000 for the bug bounty, so I could buy a lot of great food for Lizzie with that money. And they reserved a CVE for us with a great number, which was so close to it, but I guess next time. After collaborating and talking about how to fix this issue, Microsoft released the fix in the update on June 14, 2022. Now, if you're using Azure Service Fabric with automatic updates, you don't have any reason to worry, because Microsoft updated your software automatically.
And if you use any of the products or Azure offerings that I mentioned, or just any random stuff, and you're scared, then you should know that MSRC contacted all of the internal partners at Microsoft so that they would update their production environments. So this vulnerability will be fixed whether or not it is exploitable in the specific production environment. So that's how everything turned out.
There are two takeaways that I want you to take from my story. The first one is that we all have the cloud, we all have containers, but container isolation is a weak security boundary, because there have been so many vulnerabilities in runc, in Kubernetes, in Docker, and in just about every cloud vendor. For example, a year ago we had Azurescape, which is a vulnerability that allowed you to break the container isolation on Azure Container Instances. And you know what? Even if your code is great and up to date and you're really happy, there could always be a kernel vulnerability on your machine that allows an attacker to gain a full container escape. So let's take into consideration that it is weak, and assume that it will probably be broken at some point in time.
In order to mitigate this issue, the best approach is to use multiple layers of security, which is the castle approach. For example, we have this castle, and if an intruder wants to get in, he has to go through a river, a bridge, a wall, a door, a guard, Lizzie, another guard, another door, and just everything. Even if Lizzie falls asleep and the intruder bypasses her, there are still many other layers that will block the exploitation. So I guess that's the best way to do it. If you're interested, I gave a talk two months ago at Open Source Summit about container security layers: how not to break them and how to configure them right. Other than that, you can use many other layers in order to improve your security and avoid takeovers.
So that was all for today. I want to thank you for coming. If you want, you can reach me at this email, and if you want to read some more details, you have our Unit 42 blog post at Palo Alto Networks, or you can read the MSRC blog post by Microsoft. And if you have any questions, we can take them right now. Thank you.